Usage: datasets

Here, we review the datasets downloaded and cleaned with the DataLoader class. Methods of this class produce instances of the following classes.

  1. JHUData: the number of confirmed/infected/fatal/recovered cases

  2. PopulationData: population values

  3. OxCGRTData: indicators of government responses (OxCGRT)

  4. PCRData: the number of tests

  5. VaccineData: the number of vaccinations, people vaccinated

  6. LinelistData: linelist of case reports

  7. PyramidData: population pyramid

  8. JapanData: Japan-specific dataset

If you want to use a new dataset for your analysis, please kindly inform us with GitHub Issues: Request new method of DataLoader class.

In this notebook, we review the cleaned datasets one by one and visualize them.

Preparation

Import the packages.

[1]:
# !pip install covsirphy --upgrade
from pprint import pprint
import covsirphy as cs
cs.__version__
[1]:
'2.19.1-iota-fu1'

Data cleaning classes are produced with methods of the DataLoader class. Please specify the directory for saving CSV files when creating a DataLoader instance. The default directory is “input”, and we will set “../input” here.

Note:
When the directory already has a CSV file with the same name, DataLoader loads it without downloading the dataset from the server. When the CSV file was created or updated more than 12 hours ago, it will be updated automatically. “12 hours” is the default value, and we can change it with the update_interval argument when creating a DataLoader instance.
[2]:
# Create DataLoader instance
data_loader = cs.DataLoader("../input", update_interval=12)
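
The update rule described in the note can be pictured with a small sketch. This is not the code of DataLoader: needs_update is a hypothetical helper that only compares the modification time of a file with an interval in hours, and treating a missing file as "needs download" is an assumption of this sketch.

```python
import os
import tempfile
import time


def needs_update(path, update_interval=12):
    """Return True when the file is missing or older than update_interval hours.

    Hypothetical helper for illustration; not part of covsirphy.
    """
    if not os.path.exists(path):
        return True
    age_hours = (time.time() - os.path.getmtime(path)) / 3600
    return age_hours > update_interval


# Demonstrate with a temporary CSV file
with tempfile.TemporaryDirectory() as dirname:
    csv_path = os.path.join(dirname, "example.csv")
    with open(csv_path, "w") as fh:
        fh.write("Date,Confirmed\n")
    fresh = needs_update(csv_path)  # the file was just created
    # Pretend the file was last modified 13 hours ago
    old = time.time() - 13 * 3600
    os.utime(csv_path, (old, old))
    stale = needs_update(csv_path)
    missing = needs_update(os.path.join(dirname, "no_such_file.csv"))

print(fresh, stale, missing)
```

With a larger update_interval, the same 13-hour-old file would be reused instead of refreshed.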

Usage of methods will be explained in the following sections. If you want to download all datasets with copy & paste, please refer to Dataset preparation.

The number of cases (JHU style)

The main data for analysis is the number of cases. The JHUData class, created with the DataLoader.jhu() method, holds the number of confirmed/fatal/recovered cases. The number of infected cases is calculated as “Confirmed - Recovered - Fatal” during data cleaning.
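
As a minimal illustration of the formula “Confirmed - Recovered - Fatal”, the following sketch applies it with pandas to two rows of Japan that appear in the cleaned dataset of this notebook. It only reproduces the arithmetic and is not the cleaning code of JHUData.

```python
import pandas as pd

# Two rows of Japan taken from the cleaned dataset shown in this notebook
df = pd.DataFrame(
    {
        "Date": pd.to_datetime(["2021-05-03", "2021-05-04"]),
        "Confirmed": [602862, 607626],
        "Fatal": [10361, 10420],
        "Recovered": [531069, 534836],
    }
)
# Infected = Confirmed - Recovered - Fatal
df["Infected"] = df["Confirmed"] - df["Recovered"] - df["Fatal"]
print(df)
```

The resulting values agree with the Infected column of the cleaned dataset for the same dates.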

If you want to create this instance with your local CSV file, please refer to Dataset preparation: 3. Use a local CSV file which has the number of cases.

[3]:
# Create instance
jhu_data = data_loader.jhu()
Retrieving datasets from COVID-19 Data Hub https://covid19datahub.io/
        Please set verbose=2 to see the detailed citation list.
Retrieving COVID-19 dataset in Japan from https://github.com/lisphilar/covid19-sir/data/japan
[4]:
# Check type
type(jhu_data)
[4]:
covsirphy.cleaning.jhu_data.JHUData

JHUData.citation property shows the description of this dataset.

[5]:
print(jhu_data.citation)
(Secondary source) Guidotti, E., Ardia, D., (2020), "COVID-19 Data Hub", Journal of Open Source Software 5(51):2376, doi: 10.21105/joss.02376.
Lisphilar (2020), COVID-19 dataset in Japan, GitHub repository, https://github.com/lisphilar/covid19-sir/data/japan

The detailed citation list is saved in the DataLoader.covid19dh_citation property. This is not a property of JHUData. Because many links are included, they will not be shown in this tutorial.

[6]:
# Detailed citations (string)
# data_loader.covid19dh_citation

We can check the raw data with JHUData.raw property.

[7]:
jhu_data.raw.tail()
[7]:
id Date vaccines Tests Confirmed Recovered Fatal hosp vent icu ... Province administrative_area_level_3 latitude longitude key key_apple_mobility key_google_mobility Country key_numeric key_alpha_2
436094 v9324r34 2021-05-02 0.0 6387.0 1552.0 1520.0 24.0 0.0 0.0 0.0 ... Vichada NaN 4.4234 -69.2878 0 0 Vichada Colombia 99.0 0
436095 v9324r34 2021-05-03 0.0 6387.0 1552.0 1520.0 24.0 0.0 0.0 0.0 ... Vichada NaN 4.4234 -69.2878 0 0 Vichada Colombia 99.0 0
436096 v9324r34 2021-05-04 0.0 6387.0 1554.0 1522.0 24.0 0.0 0.0 0.0 ... Vichada NaN 4.4234 -69.2878 0 0 Vichada Colombia 99.0 0
436097 v9324r34 2021-05-05 0.0 6433.0 1566.0 1525.0 24.0 0.0 0.0 0.0 ... Vichada NaN 4.4234 -69.2878 0 0 Vichada Colombia 99.0 0
436098 v9324r34 2021-05-06 0.0 6450.0 1580.0 1539.0 24.0 0.0 0.0 0.0 ... Vichada NaN 4.4234 -69.2878 0 0 Vichada Colombia 99.0 0

5 rows × 38 columns

The cleaned dataset is here.

[8]:
jhu_data.cleaned().tail()
[8]:
Date Country Province Confirmed Infected Fatal Recovered
20414 2021-05-03 Japan - 602862 61432 10361 531069
20415 2021-05-04 Japan - 607626 62370 10420 534836
20416 2021-05-05 Japan - 612360 62944 10470 538946
20417 2021-05-06 Japan - 616123 63037 10517 542569
20418 2021-05-07 Japan - 620994 64363 10589 546042

As you may have noticed, the records are returned as a pandas DataFrame. Because the last rows hold the latest values, pandas.DataFrame.tail() was used to review them.

Check the data types and memory usage as follows.

[9]:
jhu_data.cleaned().info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 433513 entries, 0 to 20418
Data columns (total 7 columns):
 #   Column     Non-Null Count   Dtype
---  ------     --------------   -----
 0   Date       433513 non-null  datetime64[ns]
 1   Country    433513 non-null  category
 2   Province   433513 non-null  category
 3   Confirmed  433513 non-null  int64
 4   Infected   433513 non-null  int64
 5   Fatal      433513 non-null  int64
 6   Recovered  433513 non-null  int64
dtypes: category(2), datetime64[ns](1), int64(4)
memory usage: 21.5 MB

Note that dates are datetime64[ns], area names are of the pandas category dtype, and the numbers of cases are int64.

Total number of cases in all countries

JHUData.total() returns the total number of cases in all countries. Fatality and recovery rates are added.

[10]:
total_df = jhu_data.total()
# Show the oldest data
display(total_df.loc[total_df["Confirmed"] > 0].head())
# Show the latest data
display(total_df.tail())
Confirmed Infected Fatal Recovered Fatal per Confirmed Recovered per Confirmed Fatal per (Fatal or Recovered)
Date
2020-01-07 1 0 0 1 0.0 1.0 0.0
2020-01-12 1 1 0 0 0.0 0.0 NaN
2020-01-13 1 1 0 0 0.0 0.0 NaN
2020-01-14 1 1 0 0 0.0 0.0 NaN
2020-01-15 1 1 0 0 0.0 0.0 NaN
Confirmed Infected Fatal Recovered Fatal per Confirmed Recovered per Confirmed Fatal per (Fatal or Recovered)
Date
2021-05-03 153649844 53841980 3216407 96591457 0.020933 0.628647 0.032226
2021-05-04 154441840 53978661 3229915 97233264 0.020913 0.629579 0.032150
2021-05-05 155286115 54133233 3243997 97908885 0.020890 0.630506 0.032070
2021-05-06 156135032 54321980 3257167 98555885 0.020861 0.631222 0.031992
2021-05-07 3246158 2172048 46087 1028023 0.014197 0.316689 0.042907

The first case (registered in the dataset) was on 07Jan2020. The COVID-19 outbreak is still ongoing.
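
The three rate columns can be reproduced from the case counts by simple division. The sketch below does this with pandas for two rows taken from the output above. It is only an illustration of the arithmetic, not the code of JHUData.total().

```python
import pandas as pd

# Two rows taken from the output of JHUData.total() shown above
total = pd.DataFrame(
    {
        "Confirmed": [1, 153649844],
        "Fatal": [0, 3216407],
        "Recovered": [0, 96591457],
    },
    index=pd.to_datetime(["2020-01-12", "2021-05-03"]),
)
total["Fatal per Confirmed"] = total["Fatal"] / total["Confirmed"]
total["Recovered per Confirmed"] = total["Recovered"] / total["Confirmed"]
# Closed cases are the cases with an outcome (fatal or recovered)
closed = total["Fatal"] + total["Recovered"]
# 0 / 0 gives NaN, which explains the NaN values in the early records
total["Fatal per (Fatal or Recovered)"] = total["Fatal"] / closed
print(total.round(6))
```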

We can create line plots with covsirphy.line_plot() function.

[11]:
cs.line_plot(total_df[["Infected", "Fatal", "Recovered"]], "Total number of cases over time")
_images/usage_dataset_26_0.png

Statistics of fatality and recovery rate are here.

[12]:
total_df.loc[:, total_df.columns.str.contains("per")].describe().T
[12]:
count mean std min 25% 50% 75% max
Fatal per Confirmed 483.0 0.034251 0.016757 0.0 0.022134 0.028212 0.041899 0.074190
Recovered per Confirmed 483.0 0.526095 0.181491 0.0 0.446332 0.616958 0.644101 1.000000
Fatal per (Fatal or Recovered) 477.0 0.083137 0.085011 0.0 0.034496 0.044274 0.101679 0.546667

Subset for area

JHUData.subset() creates a subset for a specific area, and JHUData.records() does the same with optional complement of the records. We can select a country name and a province name. In this tutorial, “Japan” and “Tokyo in Japan” will be used. Please replace them with your country/province name.

Subset for a country:
We can use either country names or ISO3 codes.
[13]:
# Specify country name
df, complement = jhu_data.records("Japan")
# Or, specify ISO3 code
# df, complement = jhu_data.records("JPN")
# Show records
display(df.tail())
# Show details of complement
print(complement)
Date Confirmed Infected Fatal Recovered
452 2021-05-03 602862 61432 10361 531069
453 2021-05-04 607626 62370 10420 534836
454 2021-05-05 612360 62944 10470 538946
455 2021-05-06 616123 63037 10517 542569
456 2021-05-07 620994 64363 10589 546042
partially complemented recovered data

Complement of records was performed. The second returned value is the description of the complement. Details will be explained later, and we can skip the complement with the auto_complement=False argument. Alternatively, just use the JHUData.subset() method when the second returned value (False when no complement was performed) is unnecessary.

[14]:
# Skip complement
df, complement = jhu_data.records("Japan", auto_complement=False)
# Or,
# df = jhu_data.subset("Japan")
display(df.tail())
# Show complement (False because not complemented)
print(complement)
Date Confirmed Infected Fatal Recovered
452 2021-05-03 602862 61432 10361 531069
453 2021-05-04 607626 62370 10420 534836
454 2021-05-05 612360 62944 10470 538946
455 2021-05-06 616123 63037 10517 542569
456 2021-05-07 620994 64363 10589 546042
False

Subset for a province (called “prefecture” in Japan):

[15]:
df, _ = jhu_data.records("Japan", province="Tokyo")
df.tail()
[15]:
Date Confirmed Infected Fatal Recovered
410 2021-05-02 141005 7009 1898 132098
411 2021-05-03 141713 7023 1898 132792
412 2021-05-04 142322 6935 1899 133488
413 2021-05-05 142943 6911 1899 134133
414 2021-05-06 143534 6970 1903 134661

The list of countries can be checked with JHUData.countries() as follows.

[16]:
pprint(jhu_data.countries(), compact=True)
['Afghanistan', 'Albania', 'Algeria', 'Andorra', 'Angola',
 'Antigua and Barbuda', 'Argentina', 'Armenia', 'Australia', 'Austria',
 'Azerbaijan', 'Bahamas', 'Bahrain', 'Bangladesh', 'Barbados', 'Belarus',
 'Belgium', 'Belize', 'Benin', 'Bermuda', 'Bhutan', 'Bolivia',
 'Bosnia and Herzegovina', 'Botswana', 'Brazil', 'Brunei', 'Bulgaria',
 'Burkina Faso', 'Burundi', 'Cambodia', 'Cameroon', 'Canada', 'Cape Verde',
 'Central African Republic', 'Chad', 'Chile', 'China', 'Colombia', 'Comoros',
 'Costa Rica', "Cote d'Ivoire", 'Croatia', 'Cuba', 'Cyprus', 'Czech Republic',
 'Democratic Republic of the Congo', 'Denmark', 'Djibouti', 'Dominica',
 'Dominican Republic', 'Ecuador', 'Egypt', 'El Salvador', 'Equatorial Guinea',
 'Eritrea', 'Estonia', 'Ethiopia', 'Fiji', 'Finland', 'France', 'Gabon',
 'Gambia', 'Georgia', 'Germany', 'Ghana', 'Greece', 'Grenada', 'Guam',
 'Guatemala', 'Guinea', 'Guinea-Bissau', 'Guyana', 'Haiti', 'Holy See',
 'Honduras', 'Hungary', 'Iceland', 'India', 'Indonesia', 'Iran', 'Iraq',
 'Ireland', 'Israel', 'Italy', 'Jamaica', 'Japan', 'Jordan', 'Kazakhstan',
 'Kenya', 'Kosovo', 'Kuwait', 'Kyrgyzstan', 'Laos', 'Latvia', 'Lebanon',
 'Lesotho', 'Liberia', 'Libya', 'Liechtenstein', 'Lithuania', 'Luxembourg',
 'Madagascar', 'Malawi', 'Malaysia', 'Maldives', 'Mali', 'Malta',
 'Marshall Islands', 'Mauritania', 'Mauritius', 'Mexico', 'Moldova', 'Monaco',
 'Mongolia', 'Montenegro', 'Morocco', 'Mozambique', 'Myanmar', 'Namibia',
 'Nepal', 'Netherlands', 'New Zealand', 'Nicaragua', 'Niger', 'Nigeria',
 'North Macedonia', 'Northern Mariana Islands', 'Norway', 'Oman', 'Pakistan',
 'Palestine', 'Panama', 'Papua New Guinea', 'Paraguay', 'Peru', 'Philippines',
 'Poland', 'Portugal', 'Puerto Rico', 'Qatar', 'Republic of the Congo',
 'Romania', 'Russia', 'Rwanda', 'Saint Kitts and Nevis', 'Saint Lucia',
 'Saint Vincent and the Grenadines', 'Samoa', 'San Marino',
 'Sao Tome and Principe', 'Saudi Arabia', 'Senegal', 'Serbia', 'Seychelles',
 'Sierra Leone', 'Singapore', 'Slovakia', 'Slovenia', 'Solomon Islands',
 'Somalia', 'South Africa', 'South Korea', 'South Sudan', 'Spain', 'Sri Lanka',
 'Sudan', 'Suriname', 'Swaziland', 'Sweden', 'Switzerland', 'Syria', 'Taiwan',
 'Tajikistan', 'Tanzania', 'Thailand', 'Timor-Leste', 'Togo',
 'Trinidad and Tobago', 'Tunisia', 'Turkey', 'Uganda', 'Ukraine',
 'United Arab Emirates', 'United Kingdom', 'United States', 'Uruguay',
 'Uzbekistan', 'Vanuatu', 'Venezuela', 'Vietnam', 'Virgin Islands, U.S.',
 'Yemen', 'Zambia', 'Zimbabwe']

Complement

JHUData.records() automatically complements the records when necessary and auto_complement=True (default). Each area can have none, one, or multiple complements, depending on the records and their preprocessing analysis.

We can show the specific kind of complements that were applied to the records of each country with JHUData.show_complement() method. The possible kinds of complement for each country are the following:

  1. “Monotonic_confirmed/fatal/recovered” (monotonic increasing complement) Force the variable to be monotonically increasing.

  2. “Full_recovered” (full complement of recovered data) Estimate the number of recovered cases using the value of estimated average recovery period.

  3. “Partial_recovered” (partial complement of recovered data) When recovered values are not updated for some days, extrapolate the values.

Note:
“Recovery period” will be discussed in the next subsection.
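
To illustrate what “monotonic increasing complement” means, here is a minimal sketch that uses the running maximum of a cumulative series. The data are hypothetical, and the running maximum is just one simple way to remove dips. The algorithm used by JHUData may differ.

```python
import pandas as pd

# Hypothetical cumulative series with one reporting dip (130 -> 120)
recovered = pd.Series(
    [100, 130, 120, 150, 170],
    index=pd.date_range("2021-05-01", periods=5),
    name="Recovered",
)
# Check whether the raw series is already monotonically increasing
is_monotonic = recovered.is_monotonic_increasing
# Running maximum: each value is replaced by the largest value seen so far
complemented = recovered.cummax()
print(is_monotonic)
print(complemented.tolist())
```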

For JHUData.show_complement(), we can specify country names and province names.

[17]:
# Specify country name
jhu_data.show_complement(country="Japan")
# Or, specify country and province name
# jhu_data.show_complement(country="Japan", province="Tokyo")
[17]:
Country Province Monotonic_confirmed Monotonic_fatal Monotonic_recovered Full_recovered Partial_recovered
0 Japan - False False True False True

When a list is applied as the country argument, all specified countries will be shown. If None, all registered countries will be used.

[18]:
# Specify country names
jhu_data.show_complement(country=["Greece", "Japan"])
# Or, apply None
# jhu_data.show_complement(country=None)
[18]:
Country Province Monotonic_confirmed Monotonic_fatal Monotonic_recovered Full_recovered Partial_recovered
0 Greece - False False False False True
1 Japan - False False True False True

If complement was performed incorrectly or you need new algorithms, kindly let us know via issue page.

Recovery period

We define “recovery period” as the time period between case confirmation and recovery (as it is subjectively defined per country). With the global case records, we estimate the average recovery period using JHUData.calculate_recovery_period().

[19]:
recovery_period = jhu_data.calculate_recovery_period()
print(f"Average recovery period: {recovery_period} [days]")
Average recovery period: 16 [days]

What we currently do is to calculate the difference between confirmed cases and fatal cases and try to match it to some recovered cases value in the future. We apply this method for every country that has valid recovery data and average the partial recovery periods in order to obtain a single (average) recovery period. During the calculations, we ignore time intervals that lead to very short (<7 days) or very long (>90 days) partial recovery periods, if these exist with high frequency (>50%) in the records. We have to assume temporarily invariable compartments for this analysis to extract an approximation of the average recovery period.
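
The matching idea described above can be sketched for a single area as follows. This is a simplified illustration under stated assumptions, not the implementation of JHUData.calculate_recovery_period(): estimate_recovery_period is a hypothetical function, the records are synthetic, one row per day and a non-decreasing Recovered column are assumed, and periods outside 7 to 90 days are always dropped rather than only when they are frequent.

```python
import numpy as np
import pandas as pd


def estimate_recovery_period(df, lower=7, upper=90):
    """Estimate the recovery period [days] from cumulative daily records.

    Simplified sketch, not the covsirphy implementation. Assumes one row
    per day and a non-decreasing "Recovered" column.
    """
    target = (df["Confirmed"] - df["Fatal"]).to_numpy()
    recovered = df["Recovered"].to_numpy()
    # First row index at which Recovered reaches Confirmed - Fatal of each day
    first_idx = np.searchsorted(recovered, target, side="left")
    days = np.arange(len(df))
    # Keep only the days whose target is reached within the records
    reached = first_idx < len(df)
    periods = first_idx[reached] - days[reached]
    # Drop very short or very long periods (always, unlike the rule in the text)
    periods = periods[(periods >= lower) & (periods <= upper)]
    return int(round(periods.mean())) if len(periods) else None


# Synthetic records: everybody recovers exactly 10 days after confirmation
n_days = 60
confirmed = np.arange(1, n_days + 1) * 100
recovered = np.concatenate([np.zeros(10, dtype=int), confirmed[:-10]])
records = pd.DataFrame(
    {"Confirmed": confirmed, "Fatal": 0, "Recovered": recovered},
    index=pd.date_range("2020-03-01", periods=n_days),
)
period = estimate_recovery_period(records)
print(f"Estimated recovery period: {period} [days]")
```

Averaging such per-area estimates over all countries with valid recovery data would give a single value, as described in the text.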

Alternatively, we tried to use the linelist of case reports to get a precise value of the recovery period (the average of recovery date minus confirmation date over cases), but the number of records was too small.
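
For reference, the linelist-based calculation mentioned above can be sketched as the mean of “outcome date minus confirmation date” over recovered cases. The case reports below are hypothetical. The column names follow those of the cleaned linelist in this notebook, and treating Outcome_date as the recovery date of recovered cases is an assumption of this sketch.

```python
import pandas as pd

# Hypothetical case reports; column names follow the cleaned linelist
cases = pd.DataFrame(
    {
        "Confirmation_date": pd.to_datetime(
            ["2020-03-01", "2020-03-05", "2020-03-10", "2020-03-12"]
        ),
        "Outcome_date": pd.to_datetime(
            ["2020-03-15", "2020-03-21", None, "2020-03-30"]
        ),
        "Recovered": [True, True, False, False],
    }
)
# Use recovered cases that have both dates
recovered_cases = cases.loc[cases["Recovered"]].dropna(
    subset=["Confirmation_date", "Outcome_date"]
)
days = (
    recovered_cases["Outcome_date"] - recovered_cases["Confirmation_date"]
).dt.days
mean_days = days.mean()
print(f"{len(days)} cases, mean recovery period: {mean_days} [days]")
```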

Visualize the number of cases at a timepoint

We can visualize the number of cases with JHUData.map() method. When country is None, global map will be shown.

Global map with country level data:

[20]:
# Global map with country level data
jhu_data.map(country=None, variable="Infected")
# To include/exclude some countries
# jhu_data.map(country=None, variable="Infected", included=["Japan"])
# jhu_data.map(country=None, variable="Infected", excluded=["Japan"])
# To change the date
# jhu_data.map(country=None, variable="Infected", date="01Oct2021")
_images/usage_dataset_50_0.png

Values can be retrieved with .layer() method.

[21]:
jhu_data.layer(country=None).tail()
[21]:
Date ISO3 Country Confirmed Infected Fatal Recovered
91809 2021-05-03 JPN Japan 602862 61432 10361 531069
91810 2021-05-04 JPN Japan 607626 62370 10420 534836
91811 2021-05-05 JPN Japan 612360 62944 10470 538946
91812 2021-05-06 JPN Japan 616123 63037 10517 542569
91813 2021-05-07 JPN Japan 620994 64363 10589 546042

Country map with province level data:

[22]:
# Country map with province level data
jhu_data.map(country="Japan", variable="Infected")
# To include/exclude some provinces
# jhu_data.map(country="Japan", variable="Infected", included=["Tokyo"])
# jhu_data.map(country="Japan", variable="Infected", excluded=["Tokyo"])
# To change the date
# jhu_data.map(country="Japan", variable="Infected", date="01Oct2021")
_images/usage_dataset_54_0.png

Values are here.

[23]:
jhu_data.layer(country="Japan").tail()
[23]:
Date ISO3 Country Province Confirmed Infected Fatal Recovered
19957 2021-05-03 JPN Japan Entering 2743 141 3 2599
19958 2021-05-04 JPN Japan Entering 2754 133 4 2617
19959 2021-05-05 JPN Japan Entering 2757 128 4 2625
19960 2021-05-06 JPN Japan Entering 2768 139 4 2625
19961 2021-05-07 JPN Japan Entering 2797 161 4 2632
Note for Japan:
Province “Entering” means the number of cases who were confirmed when entering Japan.

Population values

Population values are necessary to calculate the number of susceptible people. Susceptible is a variable of SIR-derived models. PopulationData class will be created with DataLoader.population() method.
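
As a rough illustration of why population values are needed, the sketch below computes a susceptible count as “Population - Confirmed”. This relationship is an assumption of the sketch (a closed population in which Confirmed = Infected + Recovered + Fatal). The two input values are those of Japan shown elsewhere in this notebook.

```python
# Japan, from PopulationData.value("Japan") in this notebook
population = 126_529_100
# Japan on 2021-05-07, from the cleaned JHU-style records in this notebook
confirmed = 620_994
# Assumption of this sketch: Susceptible = Population - Confirmed
susceptible = population - confirmed
print(susceptible)
```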

[24]:
population_data = data_loader.population()
[25]:
type(population_data)
[25]:
covsirphy.cleaning.population.PopulationData

The description is here. It comes from the same source as JHUData, and the raw data is also the same.

[26]:
# Description
print(population_data.citation)
# Raw
# population_data.raw.tail()
(Secondary source) Guidotti, E., Ardia, D., (2020), "COVID-19 Data Hub", Journal of Open Source Software 5(51):2376, doi: 10.21105/joss.02376.

The cleaned dataset is here.

[27]:
population_data.cleaned().tail()
[27]:
Date ISO3 Country Province Population
436094 2021-05-02 COL Colombia Vichada 107808
436095 2021-05-03 COL Colombia Vichada 107808
436096 2021-05-04 COL Colombia Vichada 107808
436097 2021-05-05 COL Colombia Vichada 107808
436098 2021-05-06 COL Colombia Vichada 107808

Show population

We will get the population values with PopulationData.value().

[28]:
# In a country
population_data.value("Japan", province=None)
# In a country with ISO3 code
# population_data.value("JPN", province=None)
# In a province (prefecture)
# population_data.value("Japan", province="Tokyo")
[28]:
126529100

Update population

We can update the population values.

[29]:
# Before
population_before = population_data.value("Japan", province="Tokyo")
print(f"Before: {population_before}")
# Register population value of Tokyo in Japan
# https://www.metro.tokyo.lg.jp/tosei/hodohappyo/press/2020/06/11/07.html
population_data.update(14_002_973, "Japan", province="Tokyo")
population_after = population_data.value("Japan", province="Tokyo")
print(f" After: {population_after}")
Before: 13942856
 After: 14002973

Visualize population

We can visualize population values with .map() method. When country is None, global map will be shown. Arguments are the same as JHUData.map(), but variable name cannot be specified.

Country level data:

[30]:
population_data.map(country=None)
_images/usage_dataset_71_0.png

Values are here.

[31]:
population_data.layer(country=None).tail()
[31]:
Date ISO3 Country Population
91844 2021-05-02 ZWE Zimbabwe 14439018
91845 2021-05-03 ZWE Zimbabwe 14439018
91846 2021-05-04 ZWE Zimbabwe 14439018
91847 2021-05-05 ZWE Zimbabwe 14439018
91848 2021-05-06 ZWE Zimbabwe 14439018

Province level data:

[32]:
population_data.map(country="Japan")
_images/usage_dataset_75_0.png

Values are here.

[33]:
population_data.layer(country="Japan").tail()
[33]:
Date ISO3 Country Province Population
22508 2021-05-03 JPN Japan Kagawa 956069
22509 2021-05-04 JPN Japan Kagawa 956069
22510 2021-05-05 JPN Japan Kagawa 956069
22511 2021-05-06 JPN Japan Kagawa 956069
22512 2021-05-07 JPN Japan Kagawa 956069

OxCGRT indicators

Government responses are tracked with the Oxford Covid-19 Government Response Tracker (OxCGRT). Because government responses and the activities of people change the parameter values of SIR-derived models, this dataset is significant when we try to forecast the number of cases. The OxCGRTData class is created with the DataLoader.oxcgrt() method.

[34]:
oxcgrt_data = data_loader.oxcgrt()
[35]:
type(oxcgrt_data)
[35]:
covsirphy.cleaning.oxcgrt.OxCGRTData

Because the records are retrieved via “COVID-19 Data Hub”, as with JHUData, the data description and raw data are the same.

[36]:
# Description
print(oxcgrt_data.citation)
# Raw
# oxcgrt_data.raw.tail()
(Secondary source) Guidotti, E., Ardia, D., (2020), "COVID-19 Data Hub", Journal of Open Source Software 5(51):2376, doi: 10.21105/joss.02376.

The cleaned dataset is here.

[37]:
oxcgrt_data.cleaned().tail()
[37]:
Date ISO3 Country School_closing Workplace_closing Cancel_events Gatherings_restrictions Transport_closing Stay_home_restrictions Internal_movement_restrictions International_movement_restrictions Information_campaigns Testing_policy Contact_tracing Stringency_index
437507 2021-05-02 GRL Greenland 2.0 2.0 1.0 4.0 1.0 1.0 1.0 3.0 2.0 3.0 1.0 56.48
437508 2021-05-03 GRL Greenland 2.0 2.0 1.0 4.0 1.0 1.0 1.0 3.0 2.0 3.0 1.0 56.48
437509 2021-05-04 GRL Greenland 2.0 2.0 1.0 4.0 1.0 1.0 1.0 3.0 2.0 3.0 1.0 56.48
437510 2021-05-05 GRL Greenland 2.0 2.0 1.0 4.0 1.0 1.0 1.0 3.0 2.0 3.0 1.0 56.48
437511 2021-05-06 GRL Greenland 2.0 2.0 1.0 4.0 1.0 1.0 1.0 3.0 2.0 3.0 1.0 56.48

Subset for area

OxCGRTData.subset() creates a subset for a specific area. Only a country name can be selected because province-level data is not registered in OxCGRTData.

Subset for a country:
We can use either country names or ISO3 codes.
[38]:
oxcgrt_data.subset("Japan").tail()
# Or, with ISO3 code
# oxcgrt_data.subset("JPN").tail()
[38]:
Date School_closing Workplace_closing Cancel_events Gatherings_restrictions Transport_closing Stay_home_restrictions Internal_movement_restrictions International_movement_restrictions Information_campaigns Testing_policy Contact_tracing Stringency_index
474 2021-05-03 1.0 1.0 1.0 1.0 1.0 1.0 1.0 3.0 2.0 1.0 2.0 38.43
475 2021-05-04 1.0 1.0 1.0 1.0 1.0 1.0 1.0 3.0 2.0 1.0 2.0 38.43
476 2021-05-05 1.0 1.0 1.0 1.0 1.0 1.0 1.0 3.0 2.0 1.0 2.0 38.43
477 2021-05-06 1.0 1.0 1.0 1.0 1.0 1.0 1.0 3.0 2.0 1.0 2.0 38.43
478 2021-05-07 1.0 1.0 1.0 1.0 1.0 1.0 1.0 3.0 2.0 1.0 2.0 38.43
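
Accepting either a country name or an ISO3 code can be pictured as a simple filter on two columns. In the sketch below, subset_by_country is a hypothetical helper, not the implementation of OxCGRTData.subset(), and the rows are taken from the outputs in this section with only the Stringency_index column kept.

```python
import pandas as pd

# A few rows in the layout of the cleaned OxCGRT dataset (values from this section)
cleaned = pd.DataFrame(
    {
        "Date": pd.to_datetime(["2021-05-06", "2021-05-06", "2021-05-07"]),
        "ISO3": ["GRL", "JPN", "JPN"],
        "Country": ["Greenland", "Japan", "Japan"],
        "Stringency_index": [56.48, 38.43, 38.43],
    }
)


def subset_by_country(df, country):
    """Select rows whose country name or ISO3 code equals `country`.

    Hypothetical helper for illustration; not the covsirphy implementation.
    """
    mask = (df["Country"] == country) | (df["ISO3"] == country)
    if not mask.any():
        raise KeyError(f"{country} is not registered.")
    # Area columns are constant in the subset, so they are dropped
    return df.loc[mask].drop(columns=["ISO3", "Country"]).reset_index(drop=True)


by_name = subset_by_country(cleaned, "Japan")
by_code = subset_by_country(cleaned, "JPN")
print(by_name)
```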

Visualize indicator values

We can visualize indicator values with .map() method. Arguments are the same as JHUData.map(), but country name cannot be specified.

[39]:
oxcgrt_data.map(variable="Stringency_index")
_images/usage_dataset_89_0.png

Values are here.

[40]:
oxcgrt_data.layer().tail()
[40]:
Date ISO3 Country School_closing Workplace_closing Cancel_events Gatherings_restrictions Transport_closing Stay_home_restrictions Internal_movement_restrictions International_movement_restrictions Information_campaigns Testing_policy Contact_tracing Stringency_index
437507 2021-05-02 GRL Greenland 2.0 2.0 1.0 4.0 1.0 1.0 1.0 3.0 2.0 3.0 1.0 56.48
437508 2021-05-03 GRL Greenland 2.0 2.0 1.0 4.0 1.0 1.0 1.0 3.0 2.0 3.0 1.0 56.48
437509 2021-05-04 GRL Greenland 2.0 2.0 1.0 4.0 1.0 1.0 1.0 3.0 2.0 3.0 1.0 56.48
437510 2021-05-05 GRL Greenland 2.0 2.0 1.0 4.0 1.0 1.0 1.0 3.0 2.0 3.0 1.0 56.48
437511 2021-05-06 GRL Greenland 2.0 2.0 1.0 4.0 1.0 1.0 1.0 3.0 2.0 3.0 1.0 56.48

The number of tests

The number of tests is also key information to understand the situation. PCRData class will be created with DataLoader.pcr() method.

[41]:
pcr_data = data_loader.pcr()
[42]:
type(pcr_data)
[42]:
covsirphy.cleaning.pcr_data.PCRData

Because the records are retrieved via “COVID-19 Data Hub”, as with JHUData, the data description and raw data are the same.

[43]:
# Description
print(pcr_data.citation)
# Raw
# pcr_data.raw.tail()
(Secondary source) Guidotti, E., Ardia, D., (2020), "COVID-19 Data Hub", Journal of Open Source Software 5(51):2376, doi: 10.21105/joss.02376.
Hasell, J., Mathieu, E., Beltekian, D. et al. A cross-country database of COVID-19 testing. Sci Data 7, 345 (2020). https://doi.org/10.1038/s41597-020-00688-8
Lisphilar (2020), COVID-19 dataset in Japan, GitHub repository, https://github.com/lisphilar/covid19-sir/data/japan

The cleaned dataset is here.

[44]:
pcr_data.cleaned().tail()
[44]:
Date Country Province Tests Confirmed
20414 2021-05-03 Japan - 12007007 602862
20415 2021-05-04 Japan - 12053709 607626
20416 2021-05-05 Japan - 12082418 612360
20417 2021-05-06 Japan - 12150722 616123
20418 2021-05-07 Japan - 12241999 620994

Subset for area

PCRData.subset() creates a subset for a specific area. We can select country name and province name.

Subset for a country:
We can use either country names or ISO3 codes.
[45]:
pcr_data.subset("Japan").tail()
# Or, with ISO3 code
# pcr_data.subset("JPN").tail()
# Note: from version 2.17.0-alpha (next stable 2.18.0), "Tests_diff" is included
[45]:
Date Tests Tests_diff Confirmed
452 2021-05-03 12007007 38558 602862
453 2021-05-04 12053709 46702 607626
454 2021-05-05 12082418 28709 612360
455 2021-05-06 12150722 68304 616123
456 2021-05-07 12241999 91277 620994

Positive rate

Under the assumption that all tests were PCR tests, we can calculate the positive rate of PCR tests as “the number of confirmed cases per the number of tests” with the PCRData.positive_rate() method.

[46]:
pcr_data.positive_rate("Japan").tail()
_images/usage_dataset_103_0.png
[46]:
Date ISO3 Country Province Tests Confirmed Tests_diff Confirmed_diff Test_positive_rate
451 2021-05-03 JPN Japan - 12007007 602862 77552.714286 5142.714286 6.631250
452 2021-05-04 JPN Japan - 12053709 607626 71873.571429 5226.571429 7.271896
453 2021-05-05 JPN Japan - 12082418 612360 64243.714286 5256.714286 8.182457
454 2021-05-06 JPN Japan - 12150722 616123 64076.571429 5019.285714 7.833262
455 2021-05-07 JPN Japan - 12241999 620994 63722.142857 4887.428571 7.669906
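
The relationship between the columns of this output can be sketched as follows: daily increases of the cumulative values are smoothed, and the positive rate is their ratio in percent (for the last row above, 4887.43 / 63722.14 × 100 ≈ 7.67). This is a simplified illustration with synthetic data, not the implementation of PCRData.positive_rate(). The 7-day rolling mean is an assumption of this sketch.

```python
import pandas as pd


def positive_rate(df, window=7):
    """Positive rate [%] from cumulative "Tests" and "Confirmed" columns.

    Simplified sketch, not the covsirphy implementation; the 7-day window
    is an assumption of this example.
    """
    out = df.copy()
    # Daily increases, smoothed with a rolling mean
    out["Tests_diff"] = out["Tests"].diff().rolling(window).mean()
    out["Confirmed_diff"] = out["Confirmed"].diff().rolling(window).mean()
    # Ratio of smoothed daily confirmed cases to smoothed daily tests, in percent
    out["Test_positive_rate"] = out["Confirmed_diff"] / out["Tests_diff"] * 100
    # Rows without a full window have no smoothed values
    return out.dropna()


# Synthetic records: 1,000 tests and 50 confirmed cases every day
dates = pd.date_range("2021-05-01", periods=10)
records = pd.DataFrame(
    {
        "Tests": [1000 * (i + 1) for i in range(10)],
        "Confirmed": [50 * (i + 1) for i in range(10)],
    },
    index=dates,
)
rate_df = positive_rate(records)
print(rate_df.tail(3))
```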

Visualize the number of tests

We can visualize the number of tests with .map() method. When country is None, global map will be shown. Arguments are the same as JHUData, but variable name cannot be specified.

Country level data:

[47]:
pcr_data.map(country=None)
_images/usage_dataset_106_0.png

Values are here.

[48]:
pcr_data.layer(country=None).tail()
[48]:
Date ISO3 Country Tests Confirmed
91829 2021-05-03 JPN Japan 12007007 602862
91830 2021-05-04 JPN Japan 12053709 607626
91831 2021-05-05 JPN Japan 12082418 612360
91832 2021-05-06 JPN Japan 12150722 616123
91833 2021-05-07 JPN Japan 12241999 620994

Province level data:

[49]:
pcr_data.map(country="Japan")
_images/usage_dataset_110_0.png

Values are here.

[50]:
pcr_data.layer(country="Japan").tail()
[50]:
Date ISO3 Country Province Tests Confirmed
19957 2021-05-03 JPN Japan Entering 647347 2743
19958 2021-05-04 JPN Japan Entering 649863 2754
19959 2021-05-05 JPN Japan Entering 650826 2757
19960 2021-05-06 JPN Japan Entering 652045 2768
19961 2021-05-07 JPN Japan Entering 653736 2797

Vaccinations

Vaccination is a key factor in ending the outbreak as soon as possible. The VaccineData class is created with the DataLoader.vaccine() method.

[51]:
vaccine_data = data_loader.vaccine()
Retrieving COVID-19 vaccination dataset from https://github.com/owid/covid-19-data/
[52]:
type(vaccine_data)
[52]:
covsirphy.cleaning.vaccine_data.VaccineData

Description is here.

[53]:
print(vaccine_data.citation)
Hasell, J., Mathieu, E., Beltekian, D. et al. A cross-country database of COVID-19 testing. Sci Data 7, 345 (2020). https://doi.org/10.1038/s41597-020-00688-8

Raw data is here.

[54]:
vaccine_data.raw.tail()
[54]:
Country ISO3 Date Vaccinations Vaccinated_once Vaccinated_full Product
16699 Zimbabwe ZWE 2021-05-02 524199 430068 94131 Sinopharm/Beijing
16700 Zimbabwe ZWE 2021-05-03 537516 437751 99765 Sinopharm/Beijing
16701 Zimbabwe ZWE 2021-05-04 559777 452191 107586 Sinopharm/Beijing
16702 Zimbabwe ZWE 2021-05-05 576233 461023 115210 Sinopharm/Beijing
16703 Zimbabwe ZWE 2021-05-06 607355 478174 129181 Sinopharm/Beijing

The next is the cleaned dataset.

[55]:
vaccine_data.cleaned().tail()
[55]:
Date Country ISO3 Product Vaccinations Vaccinated_once Vaccinated_full
18209 2021-05-03 Zimbabwe ZWE Sinopharm/Beijing 537516 437751 99765
18210 2021-05-04 Zimbabwe ZWE Sinopharm/Beijing 559777 452191 107586
18211 2021-05-05 Zimbabwe ZWE Sinopharm/Beijing 576233 461023 115210
18212 2021-05-06 Zimbabwe ZWE Sinopharm/Beijing 607355 478174 129181
18213 2021-05-07 Zimbabwe ZWE Sinopharm/Beijing 607355 478174 129181

Note for variables

Definitions of the variables are as follows.

  • Vaccinations: cumulative number of vaccinations

  • Vaccinated_once: cumulative number of people who received at least one vaccine dose

  • Vaccinated_full: cumulative number of people who received all doses prescribed by the protocol
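
The definitions above imply simple relationships between the three variables. The sketch below checks them on two rows of Japan taken from the VaccineData.subset("Japan") output in this notebook. The column names Only_once and Doses_two_dose_protocol are introduced here for illustration only. The equality with Vaccinations is an observation on these rows under a two-dose protocol and would not hold in general (for example, with single-dose products or additional doses).

```python
import pandas as pd

# Rows of Japan taken from the VaccineData.subset("Japan") output in this notebook
df = pd.DataFrame(
    {
        "Date": pd.to_datetime(["2021-05-05", "2021-05-06"]),
        "Vaccinations": [3836845, 4197463],
        "Vaccinated_once": [2793847, 3091529],
        "Vaccinated_full": [1042998, 1105934],
    }
)
# People who have received a first dose but are not yet fully vaccinated
df["Only_once"] = df["Vaccinated_once"] - df["Vaccinated_full"]
# Doses under a two-dose protocol: first doses plus second doses
df["Doses_two_dose_protocol"] = df["Vaccinated_once"] + df["Vaccinated_full"]
print(df)
```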

Registered countries can be checked with VaccineData.countries() method.

[56]:
pprint(vaccine_data.countries(), compact=True)
['Afghanistan', 'Africa', 'Albania', 'Algeria', 'Andorra', 'Angola', 'Anguilla',
 'Antigua and Barbuda', 'Argentina', 'Armenia', 'Aruba', 'Asia', 'Australia',
 'Austria', 'Azerbaijan', 'Bahamas', 'Bahrain', 'Bangladesh', 'Barbados',
 'Belarus', 'Belgium', 'Belize', 'Bermuda', 'Bhutan', 'Bolivia',
 'Bosnia and Herzegovina', 'Botswana', 'Brazil', 'Brunei', 'Bulgaria',
 'Cambodia', 'Cameroon', 'Canada', 'Cape Verde', 'Cayman Islands', 'Chile',
 'China', 'Colombia', 'Comoros', 'Congo', 'Costa Rica', "Cote d'Ivoire",
 'Croatia', 'Curacao', 'Cyprus', 'Czechia', 'Democratic Republic of Congo',
 'Denmark', 'Djibouti', 'Dominica', 'Dominican Republic', 'Ecuador', 'Egypt',
 'El Salvador', 'England', 'Equatorial Guinea', 'Estonia', 'Eswatini',
 'Ethiopia', 'Europe', 'European Union', 'Faeroe Islands', 'Falkland Islands',
 'Fiji', 'Finland', 'France', 'Gabon', 'Gambia', 'Georgia', 'Germany', 'Ghana',
 'Gibraltar', 'Greece', 'Greenland', 'Grenada', 'Guatemala', 'Guernsey',
 'Guinea', 'Guyana', 'Honduras', 'Hong Kong', 'Hungary', 'Iceland', 'India',
 'Indonesia', 'Iran', 'Iraq', 'Ireland', 'Isle of Man', 'Israel', 'Italy',
 'Jamaica', 'Japan', 'Jersey', 'Jordan', 'Kazakhstan', 'Kenya', 'Kosovo',
 'Kuwait', 'Kyrgyzstan', 'Laos', 'Latvia', 'Lebanon', 'Lesotho', 'Libya',
 'Liechtenstein', 'Lithuania', 'Luxembourg', 'Macao', 'Malawi', 'Malaysia',
 'Maldives', 'Mali', 'Malta', 'Mauritania', 'Mauritius', 'Mexico', 'Moldova',
 'Monaco', 'Mongolia', 'Montenegro', 'Montserrat', 'Morocco', 'Mozambique',
 'Myanmar', 'Namibia', 'Nauru', 'Nepal', 'Netherlands', 'New Zealand',
 'Nicaragua', 'Niger', 'Nigeria', 'North America', 'North Macedonia',
 'Northern Cyprus', 'Northern Ireland', 'Norway', 'Oceania', 'Oman', 'Pakistan',
 'Palestine', 'Panama', 'Papua New Guinea', 'Paraguay', 'Peru', 'Philippines',
 'Poland', 'Portugal', 'Qatar', 'Romania', 'Russia', 'Rwanda', 'Saint Helena',
 'Saint Kitts and Nevis', 'Saint Lucia', 'Saint Vincent and the Grenadines',
 'Samoa', 'San Marino', 'Sao Tome and Principe', 'Saudi Arabia', 'Scotland',
 'Senegal', 'Serbia', 'Seychelles', 'Sierra Leone', 'Singapore', 'Slovakia',
 'Slovenia', 'Solomon Islands', 'Somalia', 'South Africa', 'South America',
 'South Korea', 'South Sudan', 'Spain', 'Sri Lanka', 'Sudan', 'Suriname',
 'Sweden', 'Switzerland', 'Syria', 'Taiwan', 'Thailand', 'Timor', 'Togo',
 'Tonga', 'Trinidad and Tobago', 'Tunisia', 'Turkey',
 'Turks and Caicos Islands', 'Uganda', 'Ukraine', 'United Arab Emirates',
 'United Kingdom', 'United States', 'Uruguay', 'Uzbekistan', 'Venezuela',
 'Vietnam', 'Wales', 'World', 'Zambia', 'Zimbabwe']

Subset for area

VaccineData.subset() creates a subset for a specific area. We can select only country name. Note that province level data is not registered.

Subset for a country:
We can use either country names or ISO3 codes.
[57]:
vaccine_data.subset("Japan").tail()
# Or, with ISO3 code
# vaccine_data.subset("JPN").tail()
[57]:
Date Vaccinations Vaccinated_once Vaccinated_full
75 2021-05-03 3489719 2493961 995758
76 2021-05-04 3489719 2493961 995758
77 2021-05-05 3836845 2793847 1042998
78 2021-05-06 4197463 3091529 1105934
79 2021-05-07 4197463 3091529 1105934

Visualize the number of vaccinations

We can visualize the number of vaccinations and the other variables with .map() method. Arguments are the same as JHUData, but country name cannot be specified.

[58]:
vaccine_data.map()
_images/usage_dataset_129_0.png

The values are shown below.

[59]:
vaccine_data.layer().tail()
[59]:
Date Country ISO3 Product Vaccinations Vaccinated_once Vaccinated_full
18209 2021-05-03 Zimbabwe ZWE Sinopharm/Beijing 537516 437751 99765
18210 2021-05-04 Zimbabwe ZWE Sinopharm/Beijing 559777 452191 107586
18211 2021-05-05 Zimbabwe ZWE Sinopharm/Beijing 576233 461023 115210
18212 2021-05-06 Zimbabwe ZWE Sinopharm/Beijing 607355 478174 129181
18213 2021-05-07 Zimbabwe ZWE Sinopharm/Beijing 607355 478174 129181

Linelist of case reports

The number of cases is important, but a linelist of case reports helps us understand the situation in more depth. A LinelistData instance is created with the DataLoader.linelist() method.

[60]:
linelist = data_loader.linelist()
Retrieving linelist from Open COVID-19 Data Working Group repository: https://github.com/beoutbreakprepared/nCoV2019
[61]:
type(linelist)
[61]:
covsirphy.cleaning.linelist.LinelistData

The description is shown below.

[62]:
print(linelist.citation)
Xu, B., Gutierrez, B., Mekaru, S. et al. Epidemiological data from the COVID-19 outbreak, real-time case information. Sci Data 7, 106 (2020). https://doi.org/10.1038/s41597-020-0448-0

The raw data is shown below.

[63]:
linelist.raw.tail()
[63]:
age sex province country date_admission_hospital date_confirmation symptoms chronic_disease outcome date_death_or_discharge
2676307 52 female Lima Peru NaN 17.05.2020 NaN NaN NaN NaN
2676308 52 female Lima Peru NaN 17.05.2020 NaN NaN NaN NaN
2676309 52 male Callao Peru NaN 17.05.2020 NaN NaN NaN NaN
2676310 52 male Lima Peru NaN 17.05.2020 NaN NaN NaN NaN
2676311 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

Next is the cleaned dataset.

[64]:
linelist.cleaned().tail()
[64]:
         Country  Province          Hospitalized_date  Confirmation_date  Outcome_date  Confirmed  Infected  Recovered  Fatal  Symptoms  Chronic_disease  Age   Sex
2676306  Peru     Coronel Portillo  NaT                2020-05-17         NaT           True       False     False      False  NaN       NaN              52.0  female
2676307  Peru     Lima              NaT                2020-05-17         NaT           True       False     False      False  NaN       NaN              52.0  female
2676308  Peru     Lima              NaT                2020-05-17         NaT           True       False     False      False  NaN       NaN              52.0  female
2676309  Peru     Callao            NaT                2020-05-17         NaT           True       False     False      False  NaN       NaN              52.0  male
2676310  Peru     Lima              NaT                2020-05-17         NaT           True       False     False      False  NaN       NaN              52.0  male

Subset for area

LinelistData.subset() creates a subset for a specific area. We can specify a country name and a province name.

Subset for a country:
We can use either country names or ISO3 codes.
[65]:
linelist.subset("Japan").tail()
# Or, with ISO3 code
# linelist.subset("JPN").tail()
[65]:
      Hospitalized_date  Confirmation_date  Outcome_date  Confirmed  Infected  Recovered  Fatal  Symptoms               Chronic_disease  Age  Sex
1809  NaT                2020-02-15         NaT           True       False     False      False  fever:headache         NaN              NaN  female
1810  NaT                2020-02-16         NaT           True       False     False      False  fever                  NaN              NaN  male
1811  NaT                2020-02-17         NaT           True       False     False      False  fever:full body slump  NaN              NaN  male
1812  NaT                2020-02-18         NaT           True       False     False      False  cough:fever            NaN              NaN  male
1813  2020-08-02         2020-02-10         NaT           True       False     False      False  cough:fever            NaN              NaN  male

Subset of outcome

LinelistData.closed() returns a subset for the specified outcome, either “Recovered” or “Fatal”.

[66]:
linelist.closed(outcome="Recovered").tail()
[66]:
Country Province Hospitalized_date Confirmation_date Recovered_date Symptoms Chronic_disease Age Sex
272 Singapore - 2020-02-02 2020-02-06 2020-02-17 NaN NaN 39.0 female
273 Malaysia Johor NaT 2020-01-25 2020-02-08 cough:fever NaN 40.0 male
274 China Gansu 2020-07-02 2020-02-08 2020-02-17 diarrhea NaN 1.0 female
275 Canada Ontario NaT 2020-01-25 2020-01-31 NaN hypertension NaN male
276 Canada Ontario NaT 2020-01-31 2020-02-19 NaN NaN NaN female

The recovery period can be calculated as the median number of days from confirmation to recovery.

[67]:
print(f"Recovery period calculated with linelist: {linelist.recovery_period()} [days]")
Recovery period calculated with linelist: 12 [days]
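
The median calculation can be illustrated with plain pandas. The sketch below uses only the five rows printed by linelist.closed(outcome="Recovered").tail() above, rebuilt by hand, and takes the median of the day differences between Confirmation_date and Recovered_date. It is an illustration of the idea, not the implementation of LinelistData.recovery_period(), which may handle missing or inconsistent dates differently.

```python
import pandas as pd

# Rows copied from linelist.closed(outcome="Recovered").tail() above
df = pd.DataFrame(
    {
        "Confirmation_date": pd.to_datetime(
            ["2020-02-06", "2020-01-25", "2020-02-08", "2020-01-25", "2020-01-31"]
        ),
        "Recovered_date": pd.to_datetime(
            ["2020-02-17", "2020-02-08", "2020-02-17", "2020-01-31", "2020-02-19"]
        ),
    }
)

# Number of days from confirmation to recovery for each case
days = (df["Recovered_date"] - df["Confirmation_date"]).dt.days

# Median over the cases
median_days = int(days.median())
print(median_days)
```

These five rows alone give 11 days, while the value calculated from all recovered records above is 12 days.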

Note that only a small number of records is available to calculate the recovery period. The number of records is shown below.

[68]:
len(linelist.closed(outcome="Recovered"))
[68]:
277

Population pyramid

With a population pyramid, we can divide the population into sub-groups. This is useful when we analyse the meaning of parameters. For example, the number of days people go out differs between the sub-groups. A PyramidData instance is created with the DataLoader.pyramid() method.

[69]:
pyramid_data = data_loader.pyramid()
[70]:
type(pyramid_data)
[70]:
covsirphy.cleaning.pyramid.PopulationPyramidData

The description is shown below.

[71]:
print(pyramid_data.citation)
World Bank Group (2020), World Bank Open Data, https://data.worldbank.org/

The raw dataset is not registered. A subset is retrieved when PyramidData.subset() is called.

[72]:
pyramid_data.subset("Japan").tail()
Retrieving population pyramid dataset (Japan) from https://data.worldbank.org/
[72]:
Age Population Per_total
113 118 255035 0.002174
114 119 255035 0.002174
115 120 255035 0.002174
116 121 255035 0.002174
117 122 255035 0.002174

“Per_total” is the proportion of the age group in the total population.
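
A proportion of this kind can be sketched with pandas as each group's population divided by the total. The numbers below are hypothetical and chosen only for illustration; they are not taken from the World Bank dataset, and the library's own calculation may differ in detail.

```python
import pandas as pd

# Hypothetical population counts by age (not real data)
df = pd.DataFrame({"Age": [0, 1, 2, 3], "Population": [100, 300, 400, 200]})

# Proportion of each age group in the total population
df["Per_total"] = df["Population"] / df["Population"].sum()
print(df)
```

By construction, the values of "Per_total" sum to 1 over all age groups.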

Japan-specific dataset

This dataset includes the number of confirmed/infected/fatal/recovered/tests/moderate/severe cases at country/prefecture level and the metadata of each prefecture (province). A JapanData instance is created with the DataLoader.japan() method.

[73]:
japan_data = data_loader.japan()
[74]:
type(japan_data)
[74]:
covsirphy.cleaning.japan_data.JapanData

The description is shown below.

[75]:
print(japan_data.citation)
Lisphilar (2020), COVID-19 dataset in Japan, GitHub repository, https://github.com/lisphilar/covid19-sir/data/japan

Next is the cleaned dataset.

[76]:
japan_data.cleaned().tail()
[76]:
Date Country Province Confirmed Infected Fatal Recovered Tests Moderate Severe Vaccinations Vaccinated_once Vaccinated_full
20414 2021-05-03 Japan - 602862 61432 10361 531069 12007007 59213 1084 3533975 2538217 995758
20415 2021-05-04 Japan - 607626 62370 10420 534836 12053709 60342 1083 3542813 2547055 995758
20416 2021-05-05 Japan - 612360 62944 10470 538946 12082418 60729 1114 3552021 2556263 995758
20417 2021-05-06 Japan - 616123 63037 10517 542569 12150722 60757 1098 3900551 2841857 1058694
20418 2021-05-07 Japan - 620994 64363 10589 546042 12241999 61483 1131 3900551 2841857 1058694
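
As stated earlier in this notebook, the number of infected cases is calculated as “Confirmed - Recovered - Fatal” during data cleaning. A minimal pandas sketch, using only the five country-level rows printed above and rebuilt by hand, checks that relationship. This is a consistency check on the displayed values, not the library's cleaning code.

```python
import pandas as pd

# Rows copied from japan_data.cleaned().tail() above
df = pd.DataFrame(
    {
        "Confirmed": [602862, 607626, 612360, 616123, 620994],
        "Infected": [61432, 62370, 62944, 63037, 64363],
        "Fatal": [10361, 10420, 10470, 10517, 10589],
        "Recovered": [531069, 534836, 538946, 542569, 546042],
    }
)

# Infected = Confirmed - Recovered - Fatal
expected = df["Confirmed"] - df["Recovered"] - df["Fatal"]

# True when every displayed row satisfies the relationship
consistent = bool((expected == df["Infected"]).all())
print(consistent)
```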

Visualize values

We can visualize the values with the .map() method. The arguments are the same as for JHUData.

[77]:
japan_data.map(variable="Severe")
_images/usage_dataset_166_0.png

The values are shown below.

[78]:
japan_data.layer(country="Japan").tail()
[78]:
Date Country Province Confirmed Infected Fatal Recovered Tests Moderate Severe Vaccinations Vaccinated_once Vaccinated_full
19957 2021-05-03 Japan Entering 2743 141 3 2599 647347 141 0 0 0 0
19958 2021-05-04 Japan Entering 2754 133 4 2617 649863 133 0 0 0 0
19959 2021-05-05 Japan Entering 2757 128 4 2625 650826 128 0 0 0 0
19960 2021-05-06 Japan Entering 2768 139 4 2625 652045 139 0 0 0 0
19961 2021-05-07 Japan Entering 2797 161 4 2632 653736 161 0 0 0 0

A map of country-level data is not available, but country-level data can be retrieved.

[79]:
japan_data.layer(country=None).tail()
[79]:
Date Country Confirmed Infected Fatal Recovered Tests Moderate Severe Vaccinations Vaccinated_once Vaccinated_full
452 2021-05-03 Japan 602862 61432 10361 531069 12007007 59213 1084 3533975 2538217 995758
453 2021-05-04 Japan 607626 62370 10420 534836 12053709 60342 1083 3542813 2547055 995758
454 2021-05-05 Japan 612360 62944 10470 538946 12082418 60729 1114 3552021 2556263 995758
455 2021-05-06 Japan 616123 63037 10517 542569 12150722 60757 1098 3900551 2841857 1058694
456 2021-05-07 Japan 620994 64363 10589 546042 12241999 61483 1131 3900551 2841857 1058694

Metadata

Additionally, JapanData.meta() retrieves metadata for the prefectures of Japan.

[80]:
japan_data.meta().tail()
Retrieving Metadata of Japan dataset from https://github.com/lisphilar/covid19-sir/data/japan
[80]:
Prefecture Admin_Capital Admin_Region Admin_Num Area_Habitable Area_Total Clinic_bed_Care Clinic_bed_Total Hospital_bed_Care Hospital_bed_Specific Hospital_bed_Total Hospital_bed_Tuberculosis Hospital_bed_Type-I Hospital_bed_Type-II Population_Female Population_Male Population_Total Location_Latitude Location_Longitude
42 Kumamoto Kumamoto Kyushu 43 2796 7409 497 4628 8340 0 33710 95 2 46 933 833 1765 32.790513 130.742388
43 Oita Oita Kyushu 44 1799 6341 269 3561 2618 0 19834 50 2 38 607 546 1152 33.238391 131.612658
44 Miyazaki Miyazaki Kyushu 45 1850 7735 206 2357 3682 0 18769 33 1 30 577 512 1089 31.911188 131.423873
45 Kagoshima Kagoshima Kyushu 46 3313 9187 652 4827 7750 0 32651 98 1 44 863 763 1626 31.560052 130.557745
46 Okinawa Naha Okinawa 47 1169 2281 83 914 3804 0 18710 47 4 20 734 709 1443 26.211761 127.681119