Usage: datasets

Here, we will review the raw and cleaned datasets. The Scenario class cleans the data internally using JHUData and related classes, but it is important to review the features and data types before analysing them.

Preparation

Prepare the packages.

[1]:
from pprint import pprint
import covsirphy as cs
cs.__version__
[1]:
'2.17.0-alpha'

Dataset preparation

Download the datasets to the “input” directory and load them.

If the “input” directory already has the datasets, the DataLoader instance will load the local files. If the datasets have been updated on the remote servers, DataLoader will automatically download them again to the “../input” directory, update the local files and load them. We can change the directory when creating the instance.

[2]:
# Create DataLoader instance
data_loader = cs.DataLoader("../input")
[3]:
# The number of cases (JHU style)
jhu_data = data_loader.jhu(verbose=True)
# Population in each country
population_data = data_loader.population()
# Government Response Tracker (OxCGRT)
oxcgrt_data = data_loader.oxcgrt()
[4]:
# Linelist of case reports
linelist = data_loader.linelist()
# The number of tests
pcr_data = data_loader.pcr()
# The number of vaccinations
vaccine_data = data_loader.vaccine()
# Population pyramid
pyramid_data = data_loader.pyramid()
# Japan-specific dataset
japan_data = data_loader.japan()

The number of cases (JHU style)

The main dataset is the number of cases, saved as jhu_data, an instance of the JHUData class. It includes “Confirmed”, “Infected”, “Recovered” and “Fatal”. “Infected” is calculated as “Confirmed - Recovered - Fatal”.
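As a quick check of this definition, the arithmetic can be reproduced by hand. The sketch below is plain Python and not part of CovsirPhy; the function name `infected` is made up for illustration, and the figures are the values for Japan on 2021-03-03 as they appear in the cleaned records on this page.

```python
def infected(confirmed: int, recovered: int, fatal: int) -> int:
    """Currently infected cases: confirmed cases that are neither recovered nor fatal."""
    return confirmed - recovered - fatal


# Japan, 2021-03-03 (values copied from the cleaned records)
japan_infected = infected(confirmed=434356, recovered=413334, fatal=7984)
print(japan_infected)  # 13038
```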

[5]:
type(jhu_data)
[5]:
covsirphy.cleaning.jhu_data.JHUData

The dataset is retrieved from COVID-19 Data Hub and the data folder of the CovsirPhy project. The citations for these projects are shown below.

[6]:
# Description/citation
print(jhu_data.citation)
(Secondary source) Guidotti, E., Ardia, D., (2020), "COVID-19 Data Hub", Journal of Open Source Software 5(51):2376, doi: 10.21105/joss.02376.
Lisphilar (2020), COVID-19 dataset in Japan, GitHub repository, https://github.com/lisphilar/covid19-sir/data/japan
[7]:
# Detailed citation list of COVID-19 Data Hub
# print(data_loader.covid19dh_citation)
[8]:
# Raw data
jhu_data.raw.tail()
[8]:
ObservationDate Tests Confirmed Recovered Deaths Population ISO3 Province/State Country/Region school_closing ... cancel_events gatherings_restrictions transport_closing stay_home_restrictions internal_movement_restrictions international_movement_restrictions information_campaigns testing_policy contact_tracing stringency_index
443831 2021-02-27 5761.0 1364.0 1331 22 107808.0 COL Vichada Colombia 3 ... 1 3 1 1 1 4 2 2 2 81.02
443832 2021-02-28 5798.0 1366.0 1332 22 107808.0 COL Vichada Colombia 3 ... 1 3 1 1 1 4 2 2 2 81.02
443833 2021-03-01 5817.0 1369.0 1332 22 107808.0 COL Vichada Colombia 3 ... 1 3 1 1 1 4 2 2 2 81.02
443834 2021-03-02 5817.0 1377.0 1336 22 107808.0 COL Vichada Colombia 3 ... 1 3 1 1 1 4 2 2 2 81.02
443835 2021-03-03 5817.0 1377.0 1336 22 107808.0 COL Vichada Colombia 3 ... 1 3 1 1 1 4 2 2 2 81.02

5 rows × 21 columns

[9]:
# Cleaned data
jhu_data.cleaned().tail()
[9]:
Date Country Province Confirmed Infected Fatal Recovered
17229 2021-02-27 Japan - 430539 14712 7807 408020
17230 2021-02-28 Japan - 431740 14561 7860 409319
17231 2021-03-01 Japan - 432773 14282 7887 410604
17232 2021-03-02 Japan - 433504 13456 7933 412115
17233 2021-03-03 Japan - 434356 13038 7984 413334
[10]:
jhu_data.cleaned().info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 440526 entries, 0 to 17233
Data columns (total 7 columns):
 #   Column     Non-Null Count   Dtype
---  ------     --------------   -----
 0   Date       440526 non-null  datetime64[ns]
 1   Country    440526 non-null  category
 2   Province   440526 non-null  category
 3   Confirmed  440526 non-null  int64
 4   Infected   440526 non-null  int64
 5   Fatal      440526 non-null  int64
 6   Recovered  440526 non-null  int64
dtypes: category(2), datetime64[ns](1), int64(4)
memory usage: 21.9 MB

We can calculate the total number of cases in all countries with the JHUData.total() method.

[11]:
# Calculate total values
total_df = jhu_data.total()
total_df.tail()
[11]:
Confirmed Infected Fatal Recovered Fatal per Confirmed Recovered per Confirmed Fatal per (Fatal or Recovered)
Date
2021-02-27 113736464 40655486 2525478 70555500 0.022205 0.620342 0.034557
2021-02-28 114020351 40759369 2531020 70729962 0.022198 0.620328 0.034548
2021-03-01 114331878 40856146 2537491 70938241 0.022194 0.620459 0.034535
2021-03-02 114594562 40864230 2546460 71183872 0.022221 0.621180 0.034537
2021-03-03 114596369 40864688 2546527 71185154 0.022222 0.621182 0.034538
[12]:
# Plot the total values
cs.line_plot(total_df[["Infected", "Fatal", "Recovered"]], "Total number of cases over time")
_images/usage_dataset_17_0.png
[13]:
# Statistics of rate values in all countries
total_df.loc[:, total_df.columns.str.contains("per")].describe().T
[13]:
count mean std min 25% 50% 75% max
Fatal per Confirmed 422.0 0.035840 0.017479 0.000000 0.022496 0.031028 0.045110 0.074286
Recovered per Confirmed 422.0 0.527000 0.187639 0.018591 0.418480 0.617343 0.650029 1.000000
Fatal per (Fatal or Recovered) 422.0 0.088591 0.088323 0.000000 0.034768 0.052062 0.112994 0.539474
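
The names of the rate columns suggest that they are simple ratios of the cumulative totals. The following sketch (plain Python, independent of CovsirPhy; the function name `rates` is hypothetical) recomputes them from the 2021-03-03 row of the total values shown above, under that assumption.

```python
def rates(confirmed: int, fatal: int, recovered: int) -> dict:
    """Return the three ratio columns, assuming they are plain quotients of cumulative totals."""
    return {
        "Fatal per Confirmed": fatal / confirmed,
        "Recovered per Confirmed": recovered / confirmed,
        "Fatal per (Fatal or Recovered)": fatal / (fatal + recovered),
    }


# Global totals on 2021-03-03 (values copied from the table of total values)
latest = rates(confirmed=114596369, fatal=2546527, recovered=71185154)
for name, value in latest.items():
    print(f"{name}: {value:.6f}")
```

Rounded to six decimal places, the results agree with the last row of the table (0.022222, 0.621182 and 0.034538).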

We can create a subset for a country using the JHUData.records() method.

[14]:
# Subset for a country
df, _ = jhu_data.records("Japan")
df.tail()
# We can use ISO3 code etc.
# df, _ = jhu_data.records("JPN")
# df.tail()
[14]:
Date Confirmed Infected Fatal Recovered
387 2021-02-27 430539 14712 7807 408020
388 2021-02-28 431740 14561 7860 409319
389 2021-03-01 432773 14282 7887 410604
390 2021-03-02 433504 13456 7933 412115
391 2021-03-03 434356 13038 7984 413334

Province (“prefecture” for Japan) name can be specified.

[15]:
df, _ = jhu_data.records("Japan", province="Tokyo")
df.tail()
[15]:
Date Confirmed Infected Fatal Recovered
345 2021-02-26 111010 3344 1355 106311
346 2021-02-27 111347 3342 1370 106635
347 2021-02-28 111676 3374 1376 106926
348 2021-03-01 111797 3091 1395 107311
349 2021-03-02 112029 3083 1400 107546
[16]:
# Countries we can select
pprint(jhu_data.countries(), compact=True)
['Afghanistan', 'Albania', 'Algeria', 'Andorra', 'Angola',
 'Antigua and Barbuda', 'Argentina', 'Armenia', 'Australia', 'Austria',
 'Azerbaijan', 'Bahamas', 'Bahrain', 'Bangladesh', 'Barbados', 'Belarus',
 'Belgium', 'Belize', 'Benin', 'Bermuda', 'Bhutan', 'Bolivia',
 'Bosnia and Herzegovina', 'Botswana', 'Brazil', 'Brunei', 'Bulgaria',
 'Burkina Faso', 'Burundi', 'Cambodia', 'Cameroon', 'Canada', 'Cape Verde',
 'Central African Republic', 'Chad', 'Chile', 'China', 'Colombia', 'Comoros',
 'Costa Rica', "Cote d'Ivoire", 'Croatia', 'Cuba', 'Cyprus', 'Czech Republic',
 'Democratic Republic of the Congo', 'Denmark', 'Djibouti', 'Dominica',
 'Dominican Republic', 'Ecuador', 'Egypt', 'El Salvador', 'Equatorial Guinea',
 'Eritrea', 'Estonia', 'Ethiopia', 'Fiji', 'Finland', 'France', 'Gabon',
 'Gambia', 'Georgia', 'Germany', 'Ghana', 'Greece', 'Grenada', 'Guam',
 'Guatemala', 'Guinea', 'Guinea-Bissau', 'Guyana', 'Haiti', 'Holy See',
 'Honduras', 'Hungary', 'Iceland', 'India', 'Indonesia', 'Iran', 'Iraq',
 'Ireland', 'Israel', 'Italy', 'Jamaica', 'Japan', 'Jordan', 'Kazakhstan',
 'Kenya', 'Kosovo', 'Kuwait', 'Kyrgyzstan', 'Laos', 'Latvia', 'Lebanon',
 'Lesotho', 'Liberia', 'Libya', 'Liechtenstein', 'Lithuania', 'Luxembourg',
 'Madagascar', 'Malawi', 'Malaysia', 'Maldives', 'Mali', 'Malta',
 'Marshall Islands', 'Mauritania', 'Mauritius', 'Mexico', 'Moldova', 'Monaco',
 'Mongolia', 'Montenegro', 'Morocco', 'Mozambique', 'Myanmar', 'Namibia',
 'Nepal', 'Netherlands', 'New Zealand', 'Nicaragua', 'Niger', 'Nigeria',
 'North Macedonia', 'Northern Mariana Islands', 'Norway', 'Oman', 'Pakistan',
 'Palestine', 'Panama', 'Papua New Guinea', 'Paraguay', 'Peru', 'Philippines',
 'Poland', 'Portugal', 'Puerto Rico', 'Qatar', 'Republic of the Congo',
 'Romania', 'Russia', 'Rwanda', 'Saint Kitts and Nevis', 'Saint Lucia',
 'Saint Vincent and the Grenadines', 'Samoa', 'San Marino',
 'Sao Tome and Principe', 'Saudi Arabia', 'Senegal', 'Serbia', 'Seychelles',
 'Sierra Leone', 'Singapore', 'Slovakia', 'Slovenia', 'Solomon Islands',
 'Somalia', 'South Africa', 'South Korea', 'South Sudan', 'Spain', 'Sri Lanka',
 'Sudan', 'Suriname', 'Swaziland', 'Sweden', 'Switzerland', 'Syria', 'Taiwan',
 'Tajikistan', 'Tanzania', 'Thailand', 'Timor-Leste', 'Togo',
 'Trinidad and Tobago', 'Tunisia', 'Turkey', 'Uganda', 'Ukraine',
 'United Arab Emirates', 'United Kingdom', 'United States', 'Uruguay',
 'Uzbekistan', 'Vanuatu', 'Venezuela', 'Vietnam', 'Virgin Islands, U.S.',
 'Yemen', 'Zambia', 'Zimbabwe']

JHUData.records() automatically complements the records if necessary and auto_complement=True (default). Each country can have no, one or multiple complements, depending on the records and their preprocessing analysis.

We can show the specific kind of complements that were applied to the records of each country with JHUData.show_complement() method. The possible kinds of complement for each country are the following:

  1. “Monotonic_confirmed/fatal/recovered” (monotonic increasing complement)
    Force the variable to be monotonically increasing.
  2. “Full_recovered” (full complement of recovered data)
    Estimate the number of recovered cases using the value of estimated average recovery period.
  3. “Partial_recovered” (partial complement of recovered data)
    When recovered values are not updated for some days, extrapolate the values.
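
To illustrate what a monotonic increasing complement means, here is a minimal sketch that replaces every value with the running maximum of the series. This is only one simple way to enforce monotonicity and is an assumption for illustration; the rule actually used by JHUData may differ. The function name `force_monotonic` and the sample values are hypothetical.

```python
from itertools import accumulate


def force_monotonic(values):
    """Return a non-decreasing copy of a cumulative series using the running maximum."""
    return list(accumulate(values, max))


# A cumulative series with a reporting glitch on the fourth day (made-up values)
reported = [100, 120, 150, 140, 160]
print(force_monotonic(reported))  # [100, 120, 150, 150, 160]
```
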
[17]:
# For selected country
jhu_data.show_complement(country="Japan")
[17]:
Country Province Monotonic_confirmed Monotonic_fatal Monotonic_recovered Full_recovered Partial_recovered
0 Japan - False False True False True
[18]:
# Show the details of complement for all countries
# jhu_data.show_complement().tail()
# For selected province
# jhu_data.show_complement(country="Japan", province="Tokyo")
# For selected countries
# jhu_data.show_complement(country=["Greece", "Japan"])
Note for recovery period:
With the global cases records, we estimate the average recovery period using JHUData.calculate_recovery_period().

What we currently do is to calculate the difference between confirmed cases and fatal cases and try to match it to some recovered cases value in the future. We apply this method for every country that has valid recovery data and average the partial recovery periods in order to obtain a single (average) recovery period. During the calculations, we ignore time intervals that lead to very short (<7 days) or very long (>90 days) partial recovery periods, if these exist with high frequency (>50%) in the records. We have to assume temporarily invariable compartments for this analysis to extract an approximation of the average recovery period.
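
The matching idea described above can be sketched for a single country as follows. This is a simplified illustration under stated assumptions, not the code of JHUData.calculate_recovery_period(): the function name, the toy series and the use of a plain mean are all hypothetical, and the filtering of very short or very long periods is omitted.

```python
from statistics import mean


def partial_recovery_periods(confirmed, fatal, recovered):
    """For each day t, count the days until recovered cases reach Confirmed[t] - Fatal[t]."""
    periods = []
    for t, (c, f) in enumerate(zip(confirmed, fatal)):
        target = c - f
        # Earliest day s >= t on which the recovered count has caught up with the target.
        # Days whose target is never reached within the series are dropped.
        for s in range(t, len(recovered)):
            if recovered[s] >= target:
                periods.append(s - t)
                break
    return periods


# Toy cumulative series in which recoveries lag confirmations by two days
confirmed = [10, 20, 30, 40, 50, 60]
fatal = [0, 0, 0, 0, 0, 0]
recovered = [0, 0, 10, 20, 30, 40]

periods = partial_recovery_periods(confirmed, fatal, recovered)
print(periods)  # [2, 2, 2, 2]
average_period = mean(periods)
```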

Alternatively, we tried to use the linelist data to get a precise value of the recovery period (the average of recovery date minus confirmation date over the cases), but the number of records was too small.

[19]:
recovery_period = jhu_data.calculate_recovery_period()
print(f"Average recovery period: {recovery_period} [days]")
Average recovery period: 16 [days]

We can visualize the number of cases with .map() method. When country is None, global map will be shown.

Global map with country level data:

[20]:
# Global map with country level data
jhu_data.map(country=None, variable="Infected")
# To include/exclude some countries
# jhu_data.map(country=None, variable="Infected", included=["Japan"])
# jhu_data.map(country=None, variable="Infected", excluded=["Japan"])
# To change the date
# jhu_data.map(country=None, variable="Infected", date="01Oct2021")
_images/usage_dataset_32_0.png
[21]:
# Country level data
jhu_data.layer(country=None).tail()
[21]:
ISO3 Date Country Confirmed Infected Fatal Recovered
83419 JPN 2021-02-27 Japan 430539 14712 7807 408020
83420 JPN 2021-02-28 Japan 431740 14561 7860 409319
83421 JPN 2021-03-01 Japan 432773 14282 7887 410604
83422 JPN 2021-03-02 Japan 433504 13456 7933 412115
83423 JPN 2021-03-03 Japan 434356 13038 7984 413334

Country map with province level data:

[22]:
# Country map with province level data
jhu_data.map(country="Japan", variable="Infected")
# To include/exclude some provinces
# jhu_data.map(country="Japan", variable="Infected", included=["Tokyo"])
# jhu_data.map(country="Japan", variable="Infected", excluded=["Tokyo"])
# To change the date
# jhu_data.map(country="Japan", variable="Infected", date="01Oct2021")
_images/usage_dataset_35_0.png
[23]:
# Province level data
jhu_data.layer(country="Japan").tail()
[23]:
ISO3 Date Country Province Confirmed Infected Fatal Recovered
16837 JPN 2021-02-27 Japan Entering 2229 34 2 2193
16838 JPN 2021-02-28 Japan Entering 2235 38 2 2195
16839 JPN 2021-03-01 Japan Entering 2240 41 2 2197
16840 JPN 2021-03-02 Japan Entering 2254 54 2 2198
16841 JPN 2021-03-03 Japan Entering 2255 51 2 2202

Linelist of case reports

The number of cases is important, but a linelist of case reports will be helpful to understand the situation deeply. Linelist data was saved as linelist, an instance of the LinelistData class. This dataset is from the Open COVID-19 Data Working Group.

[24]:
type(linelist)
[24]:
covsirphy.cleaning.linelist.LinelistData
[25]:
# Citation
print(linelist.citation)
Xu, B., Gutierrez, B., Mekaru, S. et al. Epidemiological data from the COVID-19 outbreak, real-time case information. Sci Data 7, 106 (2020). https://doi.org/10.1038/s41597-020-0448-0
[26]:
# Raw dataset
linelist.raw.tail()
[26]:
age sex province country date_admission_hospital date_confirmation symptoms chronic_disease outcome date_death_or_discharge
2676307 52 female Lima Peru NaN 17.05.2020 NaN NaN NaN NaN
2676308 52 female Lima Peru NaN 17.05.2020 NaN NaN NaN NaN
2676309 52 male Callao Peru NaN 17.05.2020 NaN NaN NaN NaN
2676310 52 male Lima Peru NaN 17.05.2020 NaN NaN NaN NaN
2676311 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
[27]:
# Cleaned dataset
linelist.cleaned().tail()
[27]:
Country Province Hospitalized_date Confirmation_date Outcome_date Confirmed Infected Recovered Fatal Symptoms Chronic_disease Age Sex
2676306 Peru Coronel Portillo NaT 2020-05-17 NaT True False False False NaN NaN 52.0 female
2676307 Peru Lima NaT 2020-05-17 NaT True False False False NaN NaN 52.0 female
2676308 Peru Lima NaT 2020-05-17 NaT True False False False NaN NaN 52.0 female
2676309 Peru Callao NaT 2020-05-17 NaT True False False False NaN NaN 52.0 male
2676310 Peru Lima NaT 2020-05-17 NaT True False False False NaN NaN 52.0 male
[28]:
# Subset for specified area
linelist.subset("Japan", province="Tokyo").tail()
[28]:
Hospitalized_date Confirmation_date Outcome_date Confirmed Infected Recovered Fatal Symptoms Chronic_disease Age Sex
107 NaT 2020-01-30 NaT True False False False NaN NaN NaN female
108 NaT 2020-01-24 NaT True False True False fever:pneumonia:sore throat NaN 40.0 male
109 2020-10-01 2020-01-15 2020-01-15 True False True False cough:fever:sore throat NaN 30.0 male
110 NaT 2020-01-25 NaT True False False False cough:fever NaN NaN female
111 NaT 2020-01-26 NaT True False False False fever:joint pain:pneumonia NaN 40.0 male
[29]:
# Subset for outcome ("Recovered" or "Fatal")
linelist.closed(outcome="Recovered").tail()
[29]:
Country Province Hospitalized_date Confirmation_date Recovered_date Symptoms Chronic_disease Age Sex
272 Singapore - 2020-02-02 2020-02-06 2020-02-17 NaN NaN 39.0 female
273 Malaysia Johor NaT 2020-01-25 2020-02-08 cough:fever NaN 40.0 male
274 China Gansu 2020-07-02 2020-02-08 2020-02-17 diarrhea NaN 1.0 female
275 Canada Ontario NaT 2020-01-25 2020-01-31 NaN hypertension NaN male
276 Canada Ontario NaT 2020-01-31 2020-02-19 NaN NaN NaN female

We can calculate the recovery period as the median value of the period from confirmation to recovery.

[30]:
# Recovery period (integer) [days]
linelist.recovery_period()
[30]:
12
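
The median-based calculation can be sketched with the standard library alone. The example below uses only the five closed cases listed in the output of linelist.closed(outcome="Recovered").tail() above, so its result differs from the value obtained from the full dataset; it is an illustration of the definition, not the implementation of LinelistData.recovery_period().

```python
from datetime import date
from statistics import median

# (confirmation date, recovery date) pairs copied from the five rows shown above
cases = [
    (date(2020, 2, 6), date(2020, 2, 17)),
    (date(2020, 1, 25), date(2020, 2, 8)),
    (date(2020, 2, 8), date(2020, 2, 17)),
    (date(2020, 1, 25), date(2020, 1, 31)),
    (date(2020, 1, 31), date(2020, 2, 19)),
]

# Days from confirmation to recovery for each case
days = [(recovered - confirmed).days for confirmed, recovered in cases]
print(days)          # [11, 14, 9, 6, 19]
print(median(days))  # 11
```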

Population in each country

Population values are necessary to calculate the number of susceptible people. Susceptible is a variable of SIR-derived models. This dataset was saved as population_data, an instance of PopulationData class.
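
As a sketch of how the population value enters the calculation, the snippet below assumes the common convention that everyone who has not yet been confirmed is still susceptible, i.e. Susceptible = Population - Confirmed. That convention and the function name `susceptible` are assumptions for illustration. The figures are the values for Japan that appear on this page (population 126529100, confirmed cases 434356 on 2021-03-03).

```python
def susceptible(population: int, confirmed: int) -> int:
    """Number of susceptible people, assuming Susceptible = Population - Confirmed."""
    if confirmed > population:
        raise ValueError("confirmed cases cannot exceed the population")
    return population - confirmed


# Japan, 2021-03-03 (values copied from this page)
japan_susceptible = susceptible(population=126529100, confirmed=434356)
print(japan_susceptible)  # 126094744
```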

[31]:
type(population_data)
[31]:
covsirphy.cleaning.population.PopulationData
[32]:
# Description/citation
print(population_data.citation)
(Secondary source) Guidotti, E., Ardia, D., (2020), "COVID-19 Data Hub", Journal of Open Source Software 5(51):2376, doi: 10.21105/joss.02376.
[33]:
# Raw data (the same as jhu_data)
# population_data.raw.tail()
[34]:
# Cleaned data
population_data.cleaned().tail()
[34]:
ISO3 Country Province Date Population
384339 COL Colombia Vichada 2021-02-27 107808
384340 COL Colombia Vichada 2021-02-28 107808
384341 COL Colombia Vichada 2021-03-01 107808
384342 COL Colombia Vichada 2021-03-02 107808
384343 COL Colombia Vichada 2021-03-03 107808

We will get the population values with PopulationData.value().

[35]:
# In a country
population_data.value("Japan", province=None)
# In a country with ISO3 code
# population_data.value("JPN", province=None)
# In a province (prefecture)
# population_data.value("Japan", province="Tokyo")
[35]:
126529100

We can update the population values.

[36]:
# Before
population_before = population_data.value("Japan", province="Tokyo")
print(f"Before: {population_before}")
# Register population value of Tokyo in Japan
# https://www.metro.tokyo.lg.jp/tosei/hodohappyo/press/2020/06/11/07.html
population_data.update(14_002_973, "Japan", province="Tokyo")
population_after = population_data.value("Japan", province="Tokyo")
print(f" After: {population_after}")
Before: 13942856
 After: 14002973

We can visualize the population values with the .map() method. When country is None, a global map will be shown. Arguments are the same as for JHUData, but a variable name cannot be specified.

[37]:
# Global map with country level data
population_data.map(country=None)
_images/usage_dataset_56_0.png
[38]:
# Country level data
population_data.layer(country=None).tail()
[38]:
ISO3 Country Date Population
83455 ZWE Zimbabwe 2021-02-27 14439018
83456 ZWE Zimbabwe 2021-02-28 14439018
83457 ZWE Zimbabwe 2021-03-01 14439018
83458 ZWE Zimbabwe 2021-03-02 14439018
83459 ZWE Zimbabwe 2021-03-03 14439018
[39]:
# Country map with province level data
population_data.map(country="Japan")
_images/usage_dataset_58_0.png
[40]:
# Province level data
population_data.layer(country="Japan").tail()
[40]:
ISO3 Country Province Date Population
20112 JPN Japan Kagawa 2021-02-28 956069
20113 JPN Japan Kagawa 2021-03-01 956069
20114 JPN Japan Kagawa 2021-03-02 956069
20115 JPN Japan Kagawa 2021-03-03 956069
20116 - Japan Tokyo 2021-03-04 14002973

Government Response Tracker (OxCGRT)

Government responses are tracked with the Oxford Covid-19 Government Response Tracker (OxCGRT). Because government responses and people's activities change the parameter values of SIR-derived models, this dataset is significant when we try to forecast the number of cases.
With DataLoader class, the dataset was retrieved via COVID-19 Data Hub and saved as oxcgrt_data, an instance of OxCGRTData class.
[41]:
type(oxcgrt_data)
[41]:
covsirphy.cleaning.oxcgrt.OxCGRTData
[42]:
# Description/citation
print(oxcgrt_data.citation)
(Secondary source) Guidotti, E., Ardia, D., (2020), "COVID-19 Data Hub", Journal of Open Source Software 5(51):2376, doi: 10.21105/joss.02376.
[43]:
# Raw data (the same as jhu_data)
# oxcgrt_data.raw.tail()
[44]:
# Cleaned data
oxcgrt_data.cleaned().tail()
[44]:
Date Country ISO3 School_closing Workplace_closing Cancel_events Gatherings_restrictions Transport_closing Stay_home_restrictions Internal_movement_restrictions International_movement_restrictions Information_campaigns Testing_policy Contact_tracing Stringency_index
443831 2021-02-27 Colombia COL 3 1 1 3 1 1 1 4 2 2 2 81.02
443832 2021-02-28 Colombia COL 3 1 1 3 1 1 1 4 2 2 2 81.02
443833 2021-03-01 Colombia COL 3 1 1 3 1 1 1 4 2 2 2 81.02
443834 2021-03-02 Colombia COL 3 1 1 3 1 1 1 4 2 2 2 81.02
443835 2021-03-03 Colombia COL 3 1 1 3 1 1 1 4 2 2 2 81.02
[45]:
# Subset for a country
oxcgrt_data.subset("Japan").tail()
# We can use ISO3 codes
# oxcgrt_data.subset("JPN").tail()
[45]:
Date School_closing Workplace_closing Cancel_events Gatherings_restrictions Transport_closing Stay_home_restrictions Internal_movement_restrictions International_movement_restrictions Information_campaigns Testing_policy Contact_tracing Stringency_index
423 2021-02-27 1 1 1 0 1 1 1 4 2 2 1 51.39
424 2021-02-28 1 1 1 0 1 1 1 4 2 2 1 51.39
425 2021-03-01 1 1 1 0 1 1 1 4 2 2 1 51.39
426 2021-03-02 1 1 1 0 1 1 1 4 2 2 1 51.39
427 2021-03-03 1 1 1 0 1 1 1 4 2 2 1 51.39

We can visualize the indicators with the .map() method. Arguments are the same as for JHUData, but a country name cannot be specified.

[46]:
# Global map with country level data
oxcgrt_data.map(variable="Stringency_index")
_images/usage_dataset_67_0.png
[47]:
# Country level data
oxcgrt_data.layer().tail()
[47]:
Date Country ISO3 School_closing Workplace_closing Cancel_events Gatherings_restrictions Transport_closing Stay_home_restrictions Internal_movement_restrictions International_movement_restrictions Information_campaigns Testing_policy Contact_tracing Stringency_index
443831 2021-02-27 Colombia COL 3 1 1 3 1 1 1 4 2 2 2 81.02
443832 2021-02-28 Colombia COL 3 1 1 3 1 1 1 4 2 2 2 81.02
443833 2021-03-01 Colombia COL 3 1 1 3 1 1 1 4 2 2 2 81.02
443834 2021-03-02 Colombia COL 3 1 1 3 1 1 1 4 2 2 2 81.02
443835 2021-03-03 Colombia COL 3 1 1 3 1 1 1 4 2 2 2 81.02

The number of tests

The number of tests is also key information to understand the situation. This dataset was saved as pcr_data, an instance of PCRData class.

[48]:
type(pcr_data)
[48]:
covsirphy.cleaning.pcr_data.PCRData
[49]:
# Description/citation
print(pcr_data.citation)
(Secondary source) Guidotti, E., Ardia, D., (2020), "COVID-19 Data Hub", Journal of Open Source Software 5(51):2376, doi: 10.21105/joss.02376.
Hasell, J., Mathieu, E., Beltekian, D. et al. A cross-country database of COVID-19 testing. Sci Data 7, 345 (2020). https://doi.org/10.1038/s41597-020-00688-8
Lisphilar (2020), COVID-19 dataset in Japan, GitHub repository, https://github.com/lisphilar/covid19-sir/data/japan
[50]:
# Raw data (the same as jhu_data)
# pcr_data.raw.tail()
[51]:
# Cleaned data
pcr_data.cleaned().tail()
[51]:
Date Country Province Tests Confirmed
17229 2021-02-27 Japan - 8203285 430539
17230 2021-02-28 Japan - 8234982 431740
17231 2021-03-01 Japan - 8256602 432773
17232 2021-03-02 Japan - 8319692 433504
17233 2021-03-03 Japan - 8403376 434356
[52]:
# Subset for a country
pcr_data.subset("Japan").tail()
# We can use ISO3 codes
# pcr_data.subset("JPN").tail()
# Note: from version 2.17.0-alpha (stable: 2.18.0), "Tests_diff" will be added.
[52]:
Date Tests Tests_diff Confirmed
387 2021-02-27 8203285 59361 430539
388 2021-02-28 8234982 31697 431740
389 2021-03-01 8256602 21620 432773
390 2021-03-02 8319692 63090 433504
391 2021-03-03 8403376 83684 434356

Under the assumption that all tests were PCR tests, we can calculate the positive rate of PCR tests as “the number of confirmed cases per the number of tests”.

[53]:
# Positive rate in Japan
_ = pcr_data.positive_rate("Japan")
_images/usage_dataset_76_0.png
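
The definition above can be worked through by hand. The sketch below computes a daily positive rate from day-to-day differences of the cumulative counts, using the Japan records listed on this page. Whether PCRData.positive_rate() uses daily differences, a rolling window or another smoothing step is not stated here, so treat the function `daily_positive_rate` as a hypothetical illustration.

```python
def daily_positive_rate(tests, confirmed):
    """Return the daily positive rate [%] from cumulative tests and confirmed cases."""
    rates = []
    for i in range(1, len(tests)):
        tests_diff = tests[i] - tests[i - 1]
        confirmed_diff = confirmed[i] - confirmed[i - 1]
        # Skip days without newly reported tests to avoid division by zero
        if tests_diff <= 0:
            continue
        rates.append(100 * confirmed_diff / tests_diff)
    return rates


# Japan, 2021-02-27 to 2021-03-03 (cumulative values copied from the records)
tests = [8203285, 8234982, 8256602, 8319692, 8403376]
confirmed = [430539, 431740, 432773, 433504, 434356]

positive_rates = daily_positive_rate(tests, confirmed)
for rate in positive_rates:
    print(f"{rate:.2f}%")
```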

We can visualize the number of tests with the .map() method. When country is None, a global map will be shown. Arguments are the same as for JHUData, but a variable name cannot be specified.

[54]:
# Global map with country level data
pcr_data.map(country=None)
_images/usage_dataset_78_0.png
[55]:
# Country level data
pcr_data.layer(country=None).tail()
[55]:
ISO3 Date Country Tests Confirmed
84703 JPN 2021-02-27 Japan 8203285 430539
84704 JPN 2021-02-28 Japan 8234982 431740
84705 JPN 2021-03-01 Japan 8256602 432773
84706 JPN 2021-03-02 Japan 8319692 433504
84707 JPN 2021-03-03 Japan 8403376 434356
[56]:
# Country map with province level data
pcr_data.map(country="Japan")
_images/usage_dataset_80_0.png
[57]:
# Province level data
pcr_data.layer(country="Japan").tail()
[57]:
ISO3 Date Country Province Tests Confirmed
16837 JPN 2021-02-27 Japan Entering 517631 2229
16838 JPN 2021-02-28 Japan Entering 520132 2235
16839 JPN 2021-03-01 Japan Entering 521977 2240
16840 JPN 2021-03-02 Japan Entering 524977 2254
16841 JPN 2021-03-03 Japan Entering 526337 2255

The number of vaccinations

Vaccination is a key factor in ending the outbreak as soon as possible. This dataset was saved as vaccine_data, an instance of the VaccineData class.

[58]:
# The number of vaccinations
type(vaccine_data)
[58]:
covsirphy.cleaning.vaccine_data.VaccineData
[59]:
# Description/citation
print(vaccine_data.citation)
Hasell, J., Mathieu, E., Beltekian, D. et al. A cross-country database of COVID-19 testing. Sci Data 7, 345 (2020). https://doi.org/10.1038/s41597-020-00688-8
[60]:
# Raw data
# vaccine_data.raw.tail()
[61]:
# Cleaned data
vaccine_data.cleaned().tail()
[61]:
Date Country ISO3 Product Vaccinations Vaccinated_once Vaccinated_full
5073 2021-02-26 Zimbabwe ZWE Sinopharm/Beijing 12579 12579 0
5074 2021-02-27 Zimbabwe ZWE Sinopharm/Beijing 15705 15705 0
5075 2021-02-28 Zimbabwe ZWE Sinopharm/Beijing 18843 18843 0
5076 2021-03-01 Zimbabwe ZWE Sinopharm/Beijing 21456 21456 0
5077 2021-03-02 Zimbabwe ZWE Sinopharm/Beijing 25077 25077 0
[62]:
# Registered countries
pprint(vaccine_data.countries(), compact=True)
['Albania', 'Algeria', 'Andorra', 'Anguilla', 'Argentina', 'Australia',
 'Austria', 'Azerbaijan', 'Bahrain', 'Bangladesh', 'Barbados', 'Belarus',
 'Belgium', 'Bermuda', 'Bolivia', 'Brazil', 'Bulgaria', 'Cambodia', 'Canada',
 'Cayman Islands', 'Chile', 'China', 'Colombia', 'Costa Rica', 'Croatia',
 'Cyprus', 'Czechia', 'Denmark', 'Dominican Republic', 'Ecuador', 'Egypt',
 'El Salvador', 'England', 'Estonia', 'European Union', 'Faeroe Islands',
 'Falkland Islands', 'Finland', 'France', 'Germany', 'Gibraltar', 'Greece',
 'Greenland', 'Guatemala', 'Guernsey', 'Guyana', 'Honduras', 'Hong Kong',
 'Hungary', 'Iceland', 'India', 'Indonesia', 'Iran', 'Ireland', 'Isle of Man',
 'Israel', 'Italy', 'Japan', 'Jersey', 'Jordan', 'Kazakhstan', 'Kuwait',
 'Latvia', 'Lebanon', 'Liechtenstein', 'Lithuania', 'Luxembourg', 'Macao',
 'Malaysia', 'Maldives', 'Malta', 'Mauritius', 'Mexico', 'Monaco', 'Mongolia',
 'Montenegro', 'Montserrat', 'Morocco', 'Myanmar', 'Nepal', 'Netherlands',
 'New Zealand', 'Northern Cyprus', 'Northern Ireland', 'Norway', 'Oman',
 'Pakistan', 'Panama', 'Paraguay', 'Peru', 'Poland', 'Portugal', 'Qatar',
 'Romania', 'Russia', 'Saint Helena', 'San Marino', 'Saudi Arabia', 'Scotland',
 'Senegal', 'Serbia', 'Seychelles', 'Singapore', 'Slovakia', 'Slovenia',
 'South Africa', 'South Korea', 'Spain', 'Sri Lanka', 'Sweden', 'Switzerland',
 'Thailand', 'Trinidad and Tobago', 'Turkey', 'Turks and Caicos Islands',
 'Ukraine', 'United Arab Emirates', 'United Kingdom', 'United States',
 'Uruguay', 'Venezuela', 'Wales', 'World', 'Zimbabwe']
[63]:
# Subset for a country
vaccine_data.subset("United Kingdom").tail()
# We can use ISO3 codes
# vaccine_data.subset("GBR").tail()
[63]:
Date Vaccinations Vaccinated_once Vaccinated_full
74 2021-02-25 19913592 19177555 736037
75 2021-02-26 20450858 19682048 768810
76 2021-02-27 20885683 20089551 796132
77 2021-02-28 21091267 20275451 815816
78 2021-03-01 21322717 20478619 844098

We can visualize the number of vaccinations with the .map() method. Arguments are the same as for JHUData, but a country name cannot be specified.

[64]:
# Global map with country level data
vaccine_data.map()
_images/usage_dataset_90_0.png
[65]:
# Country level data
vaccine_data.layer().tail()
[65]:
Date Country ISO3 Product Vaccinations Vaccinated_once Vaccinated_full
5073 2021-02-26 Zimbabwe ZWE Sinopharm/Beijing 12579 12579 0
5074 2021-02-27 Zimbabwe ZWE Sinopharm/Beijing 15705 15705 0
5075 2021-02-28 Zimbabwe ZWE Sinopharm/Beijing 18843 18843 0
5076 2021-03-01 Zimbabwe ZWE Sinopharm/Beijing 21456 21456 0
5077 2021-03-02 Zimbabwe ZWE Sinopharm/Beijing 25077 25077 0

Population pyramid

With a population pyramid, we can divide the population into sub-groups. This will be useful when we analyse the meaning of parameters. For example, the number of days on which people go out differs between the sub-groups. This dataset was saved as pyramid_data, an instance of the PopulationPyramidData class.

[66]:
# Population pyramid
type(pyramid_data)
[66]:
covsirphy.cleaning.pyramid.PopulationPyramidData
[67]:
# Description/citation
print(pyramid_data.citation)
World Bank Group (2020), World Bank Open Data, https://data.worldbank.org/
[68]:
# Subset will be retrieved from the server when set
pyramid_data.subset("Japan").tail()
[68]:
Age Population Per_total
113 118 255035 0.002174
114 119 255035 0.002174
115 120 255035 0.002174
116 121 255035 0.002174
117 122 255035 0.002174

Japan-specific dataset

This includes the number of confirmed/infected/fatal/recovered/tests/moderate/severe cases at country/prefecture level and metadata of each prefecture. This dataset was saved as japan_data, an instance of JapanData class.

[69]:
# Japan-specific dataset
type(japan_data)
[69]:
covsirphy.cleaning.japan_data.JapanData
[70]:
# Description/citation
print(japan_data.citation)
Lisphilar (2020), COVID-19 dataset in Japan, GitHub repository, https://github.com/lisphilar/covid19-sir/data/japan
[71]:
# Cleaned dataset
japan_data.cleaned().tail()
[71]:
Date Country Province Confirmed Infected Fatal Recovered Tests Moderate Severe Vaccinations Vaccinated_once Vaccinated_full
17229 2021-02-27 Japan - 430539 14712 7807 408020 8203285 14057 440 28530 28530 0
17230 2021-02-28 Japan - 431740 14561 7860 409319 8234982 13929 434 28530 28530 0
17231 2021-03-01 Japan - 432773 14282 7887 410604 8256602 13618 436 31785 31785 0
17232 2021-03-02 Japan - 433504 13456 7933 412115 8319692 12775 413 34772 34772 0
17233 2021-03-03 Japan - 434356 13038 7984 413334 8403376 12382 407 37303 37303 0
[72]:
# Metadata
japan_data.meta().tail()
[72]:
Prefecture Admin_Capital Admin_Region Admin_Num Area_Habitable Area_Total Clinic_bed_Care Clinic_bed_Total Hospital_bed_Care Hospital_bed_Specific Hospital_bed_Total Hospital_bed_Tuberculosis Hospital_bed_Type-I Hospital_bed_Type-II Population_Female Population_Male Population_Total Location_Latitude Location_Longitude
42 Kumamoto Kumamoto Kyushu 43 2796 7409 497 4628 8340 0 33710 95 2 46 933 833 1765 32.790513 130.742388
43 Oita Oita Kyushu 44 1799 6341 269 3561 2618 0 19834 50 2 38 607 546 1152 33.238391 131.612658
44 Miyazaki Miyazaki Kyushu 45 1850 7735 206 2357 3682 0 18769 33 1 30 577 512 1089 31.911188 131.423873
45 Kagoshima Kagoshima Kyushu 46 3313 9187 652 4827 7750 0 32651 98 1 44 863 763 1626 31.560052 130.557745
46 Okinawa Naha Okinawa 47 1169 2281 83 914 3804 0 18710 47 4 20 734 709 1443 26.211761 127.681119

We can visualize the number of cases with .map() method. Arguments are the same as JHUData, but country name cannot be specified.

[73]:
# Country map with province level data
japan_data.map(variable="Severe")
_images/usage_dataset_102_0.png
[74]:
# Province level data
japan_data.layer(country="Japan").tail()
[74]:
Date Country Province Confirmed Infected Fatal Recovered Tests Moderate Severe Vaccinations Vaccinated_once Vaccinated_full
16837 2021-02-27 Japan Entering 2229 34 2 2193 517631 34 0 0 0 0
16838 2021-02-28 Japan Entering 2235 38 2 2195 520132 38 0 0 0 0
16839 2021-03-01 Japan Entering 2240 41 2 2197 521977 41 0 0 0 0
16840 2021-03-02 Japan Entering 2254 54 2 2198 524977 54 0 0 0 0
16841 2021-03-03 Japan Entering 2255 51 2 2202 526337 51 0 0 0 0

Map with country level data is not prepared, but country level data can be retrieved.

[75]:
# Country level data
japan_data.layer(country=None).tail()
[75]:
Date Country Confirmed Infected Fatal Recovered Tests Moderate Severe Vaccinations Vaccinated_once Vaccinated_full
387 2021-02-27 Japan 430539 14712 7807 408020 8203285 14057 440 28530 28530 0
388 2021-02-28 Japan 431740 14561 7860 409319 8234982 13929 434 28530 28530 0
389 2021-03-01 Japan 432773 14282 7887 410604 8256602 13618 436 31785 31785 0
390 2021-03-02 Japan 433504 13456 7933 412115 8319692 12775 413 34772 34772 0
391 2021-03-03 Japan 434356 13038 7984 413334 8403376 12382 407 37303 37303 0