Usage: datasets¶
Here, we review the raw and cleaned datasets. The Scenario
class cleans the data internally using the JHUData
class and others, but it is important to review the features and data types before analysing them.
Preparation¶
Prepare the packages.
[1]:
from pprint import pprint
import covsirphy as cs
cs.__version__
[1]:
'2.17.0-alpha'
Dataset preparation¶
Download the datasets to the “../input” directory and load them.
If the “../input” directory already has the datasets, the DataLoader
instance will load the local files. If the datasets were updated on the remote servers, DataLoader
will update the local files automatically. We can change the directory when creating the instance.
[2]:
# Create DataLoader instance
data_loader = cs.DataLoader("../input")
[3]:
# The number of cases (JHU style)
jhu_data = data_loader.jhu(verbose=True)
# Population in each country
population_data = data_loader.population()
# Government Response Tracker (OxCGRT)
oxcgrt_data = data_loader.oxcgrt()
[4]:
# Linelist of case reports
linelist = data_loader.linelist()
# The number of tests
pcr_data = data_loader.pcr()
# The number of vaccinations
vaccine_data = data_loader.vaccine()
# Population pyramid
pyramid_data = data_loader.pyramid()
# Japan-specific dataset
japan_data = data_loader.japan()
The number of cases (JHU style)¶
The main dataset is the number of cases, saved as jhu_data, an instance of the JHUData class. It includes “Confirmed”, “Infected”, “Recovered” and “Fatal”. “Infected” is calculated as “Confirmed - Recovered - Fatal”.
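As a quick check of the definition above, the following plain-Python sketch recomputes “Infected” from the other three variables. The function name `infected` is only illustrative and is not part of CovsirPhy; the input values are copied from the Japan row for 2021-03-03 in the cleaned table shown later in this section.

```python
def infected(confirmed, recovered, fatal):
    """Currently infected cases: Confirmed - Recovered - Fatal."""
    return confirmed - recovered - fatal


# Values copied from the Japan row for 2021-03-03 (cleaned dataset)
japan_infected = infected(confirmed=434356, recovered=413334, fatal=7984)
print(japan_infected)
```

The result agrees with the “Infected” column of that row (13038).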
[5]:
type(jhu_data)
[5]:
covsirphy.cleaning.jhu_data.JHUData
The dataset is retrieved from COVID-19 Data Hub and the data folder of the CovsirPhy project. The citations of these projects are shown below.
[6]:
# Description/citation
print(jhu_data.citation)
(Secondary source) Guidotti, E., Ardia, D., (2020), "COVID-19 Data Hub", Journal of Open Source Software 5(51):2376, doi: 10.21105/joss.02376.
Lisphilar (2020), COVID-19 dataset in Japan, GitHub repository, https://github.com/lisphilar/covid19-sir/data/japan
[7]:
# Detailed citation list of COVID-19 Data Hub
# print(data_loader.covid19dh_citation)
[8]:
# Raw data
jhu_data.raw.tail()
[8]:
ObservationDate | Tests | Confirmed | Recovered | Deaths | Population | ISO3 | Province/State | Country/Region | school_closing | ... | cancel_events | gatherings_restrictions | transport_closing | stay_home_restrictions | internal_movement_restrictions | international_movement_restrictions | information_campaigns | testing_policy | contact_tracing | stringency_index | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
443831 | 2021-02-27 | 5761.0 | 1364.0 | 1331 | 22 | 107808.0 | COL | Vichada | Colombia | 3 | ... | 1 | 3 | 1 | 1 | 1 | 4 | 2 | 2 | 2 | 81.02 |
443832 | 2021-02-28 | 5798.0 | 1366.0 | 1332 | 22 | 107808.0 | COL | Vichada | Colombia | 3 | ... | 1 | 3 | 1 | 1 | 1 | 4 | 2 | 2 | 2 | 81.02 |
443833 | 2021-03-01 | 5817.0 | 1369.0 | 1332 | 22 | 107808.0 | COL | Vichada | Colombia | 3 | ... | 1 | 3 | 1 | 1 | 1 | 4 | 2 | 2 | 2 | 81.02 |
443834 | 2021-03-02 | 5817.0 | 1377.0 | 1336 | 22 | 107808.0 | COL | Vichada | Colombia | 3 | ... | 1 | 3 | 1 | 1 | 1 | 4 | 2 | 2 | 2 | 81.02 |
443835 | 2021-03-03 | 5817.0 | 1377.0 | 1336 | 22 | 107808.0 | COL | Vichada | Colombia | 3 | ... | 1 | 3 | 1 | 1 | 1 | 4 | 2 | 2 | 2 | 81.02 |
5 rows × 21 columns
[9]:
# Cleaned data
jhu_data.cleaned().tail()
[9]:
Date | Country | Province | Confirmed | Infected | Fatal | Recovered | |
---|---|---|---|---|---|---|---|
17229 | 2021-02-27 | Japan | - | 430539 | 14712 | 7807 | 408020 |
17230 | 2021-02-28 | Japan | - | 431740 | 14561 | 7860 | 409319 |
17231 | 2021-03-01 | Japan | - | 432773 | 14282 | 7887 | 410604 |
17232 | 2021-03-02 | Japan | - | 433504 | 13456 | 7933 | 412115 |
17233 | 2021-03-03 | Japan | - | 434356 | 13038 | 7984 | 413334 |
[10]:
jhu_data.cleaned().info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 440526 entries, 0 to 17233
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 440526 non-null datetime64[ns]
1 Country 440526 non-null category
2 Province 440526 non-null category
3 Confirmed 440526 non-null int64
4 Infected 440526 non-null int64
5 Fatal 440526 non-null int64
6 Recovered 440526 non-null int64
dtypes: category(2), datetime64[ns](1), int64(4)
memory usage: 21.9 MB
We can calculate the total number of cases in all countries with the JHUData.total() method.
[11]:
# Calculate total values
total_df = jhu_data.total()
total_df.tail()
[11]:
Date | Confirmed | Infected | Fatal | Recovered | Fatal per Confirmed | Recovered per Confirmed | Fatal per (Fatal or Recovered) |
---|---|---|---|---|---|---|---|
2021-02-27 | 113736464 | 40655486 | 2525478 | 70555500 | 0.022205 | 0.620342 | 0.034557 |
2021-02-28 | 114020351 | 40759369 | 2531020 | 70729962 | 0.022198 | 0.620328 | 0.034548 |
2021-03-01 | 114331878 | 40856146 | 2537491 | 70938241 | 0.022194 | 0.620459 | 0.034535 |
2021-03-02 | 114594562 | 40864230 | 2546460 | 71183872 | 0.022221 | 0.621180 | 0.034537 |
2021-03-03 | 114596369 | 40864688 | 2546527 | 71185154 | 0.022222 | 0.621182 | 0.034538 |
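The three rate columns in the table above can be reproduced from the counts. The sketch below assumes that each rate is a plain ratio as suggested by its column name; this is inferred from the names and values shown here, not taken from the library source, and the function name `rate_columns` is hypothetical.

```python
def rate_columns(confirmed, fatal, recovered):
    """Ratios inferred from the column names of the total table."""
    closed = fatal + recovered  # cases with a known outcome
    return {
        "Fatal per Confirmed": fatal / confirmed,
        "Recovered per Confirmed": recovered / confirmed,
        "Fatal per (Fatal or Recovered)": fatal / closed,
    }


# Counts copied from the 2021-03-03 row of the table above
rates = rate_columns(confirmed=114596369, fatal=2546527, recovered=71185154)
for name, value in rates.items():
    print(f"{name}: {value:.6f}")
```

Rounded to six decimals, the values agree with the 2021-03-03 row (0.022222, 0.621182 and 0.034538).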
[12]:
# Plot the total values
cs.line_plot(total_df[["Infected", "Fatal", "Recovered"]], "Total number of cases over time")

[13]:
# Statistics of rate values in all countries
total_df.loc[:, total_df.columns.str.contains("per")].describe().T
[13]:
count | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|
Fatal per Confirmed | 422.0 | 0.035840 | 0.017479 | 0.000000 | 0.022496 | 0.031028 | 0.045110 | 0.074286 |
Recovered per Confirmed | 422.0 | 0.527000 | 0.187639 | 0.018591 | 0.418480 | 0.617343 | 0.650029 | 1.000000 |
Fatal per (Fatal or Recovered) | 422.0 | 0.088591 | 0.088323 | 0.000000 | 0.034768 | 0.052062 | 0.112994 | 0.539474 |
We can create a subset for a country using the JHUData.records() method.
[14]:
# Subset for a country
df, _ = jhu_data.records("Japan")
df.tail()
# We can use ISO3 code etc.
# df, _ = jhu_data.records("JPN")
# df.tail()
[14]:
Date | Confirmed | Infected | Fatal | Recovered | |
---|---|---|---|---|---|
387 | 2021-02-27 | 430539 | 14712 | 7807 | 408020 |
388 | 2021-02-28 | 431740 | 14561 | 7860 | 409319 |
389 | 2021-03-01 | 432773 | 14282 | 7887 | 410604 |
390 | 2021-03-02 | 433504 | 13456 | 7933 | 412115 |
391 | 2021-03-03 | 434356 | 13038 | 7984 | 413334 |
Province (“prefecture” for Japan) name can be specified.
[15]:
df, _ = jhu_data.records("Japan", province="Tokyo")
df.tail()
[15]:
Date | Confirmed | Infected | Fatal | Recovered | |
---|---|---|---|---|---|
345 | 2021-02-26 | 111010 | 3344 | 1355 | 106311 |
346 | 2021-02-27 | 111347 | 3342 | 1370 | 106635 |
347 | 2021-02-28 | 111676 | 3374 | 1376 | 106926 |
348 | 2021-03-01 | 111797 | 3091 | 1395 | 107311 |
349 | 2021-03-02 | 112029 | 3083 | 1400 | 107546 |
[16]:
# Countries we can select
pprint(jhu_data.countries(), compact=True)
['Afghanistan', 'Albania', 'Algeria', 'Andorra', 'Angola',
'Antigua and Barbuda', 'Argentina', 'Armenia', 'Australia', 'Austria',
'Azerbaijan', 'Bahamas', 'Bahrain', 'Bangladesh', 'Barbados', 'Belarus',
'Belgium', 'Belize', 'Benin', 'Bermuda', 'Bhutan', 'Bolivia',
'Bosnia and Herzegovina', 'Botswana', 'Brazil', 'Brunei', 'Bulgaria',
'Burkina Faso', 'Burundi', 'Cambodia', 'Cameroon', 'Canada', 'Cape Verde',
'Central African Republic', 'Chad', 'Chile', 'China', 'Colombia', 'Comoros',
'Costa Rica', "Cote d'Ivoire", 'Croatia', 'Cuba', 'Cyprus', 'Czech Republic',
'Democratic Republic of the Congo', 'Denmark', 'Djibouti', 'Dominica',
'Dominican Republic', 'Ecuador', 'Egypt', 'El Salvador', 'Equatorial Guinea',
'Eritrea', 'Estonia', 'Ethiopia', 'Fiji', 'Finland', 'France', 'Gabon',
'Gambia', 'Georgia', 'Germany', 'Ghana', 'Greece', 'Grenada', 'Guam',
'Guatemala', 'Guinea', 'Guinea-Bissau', 'Guyana', 'Haiti', 'Holy See',
'Honduras', 'Hungary', 'Iceland', 'India', 'Indonesia', 'Iran', 'Iraq',
'Ireland', 'Israel', 'Italy', 'Jamaica', 'Japan', 'Jordan', 'Kazakhstan',
'Kenya', 'Kosovo', 'Kuwait', 'Kyrgyzstan', 'Laos', 'Latvia', 'Lebanon',
'Lesotho', 'Liberia', 'Libya', 'Liechtenstein', 'Lithuania', 'Luxembourg',
'Madagascar', 'Malawi', 'Malaysia', 'Maldives', 'Mali', 'Malta',
'Marshall Islands', 'Mauritania', 'Mauritius', 'Mexico', 'Moldova', 'Monaco',
'Mongolia', 'Montenegro', 'Morocco', 'Mozambique', 'Myanmar', 'Namibia',
'Nepal', 'Netherlands', 'New Zealand', 'Nicaragua', 'Niger', 'Nigeria',
'North Macedonia', 'Northern Mariana Islands', 'Norway', 'Oman', 'Pakistan',
'Palestine', 'Panama', 'Papua New Guinea', 'Paraguay', 'Peru', 'Philippines',
'Poland', 'Portugal', 'Puerto Rico', 'Qatar', 'Republic of the Congo',
'Romania', 'Russia', 'Rwanda', 'Saint Kitts and Nevis', 'Saint Lucia',
'Saint Vincent and the Grenadines', 'Samoa', 'San Marino',
'Sao Tome and Principe', 'Saudi Arabia', 'Senegal', 'Serbia', 'Seychelles',
'Sierra Leone', 'Singapore', 'Slovakia', 'Slovenia', 'Solomon Islands',
'Somalia', 'South Africa', 'South Korea', 'South Sudan', 'Spain', 'Sri Lanka',
'Sudan', 'Suriname', 'Swaziland', 'Sweden', 'Switzerland', 'Syria', 'Taiwan',
'Tajikistan', 'Tanzania', 'Thailand', 'Timor-Leste', 'Togo',
'Trinidad and Tobago', 'Tunisia', 'Turkey', 'Uganda', 'Ukraine',
'United Arab Emirates', 'United Kingdom', 'United States', 'Uruguay',
'Uzbekistan', 'Vanuatu', 'Venezuela', 'Vietnam', 'Virgin Islands, U.S.',
'Yemen', 'Zambia', 'Zimbabwe']
JHUData.records() automatically complements the records if necessary and auto_complement=True (default). Each country can have none, one or multiple complements, depending on the records and their preprocessing analysis.
We can show the specific kinds of complement that were applied to the records of each country with the JHUData.show_complement() method. The possible kinds of complement for each country are the following:
- “Monotonic_confirmed/fatal/recovered” (monotonic increasing complement): Force the variable to be monotonically increasing.
- “Full_recovered” (full complement of recovered data): Estimate the number of recovered cases using the estimated average recovery period.
- “Partial_recovered” (partial complement of recovered data): When recovered values are not updated for some days, extrapolate the values.
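To illustrate what a monotonic increasing complement means for a cumulative series, here is a minimal plain-Python sketch. It uses a running maximum, which is one simple way to remove dips; this is an assumption for illustration and may differ from what CovsirPhy does internally. The function name `force_monotonic` and the sample values are hypothetical.

```python
from itertools import accumulate


def force_monotonic(values):
    """Replace each value with the running maximum so the series never decreases."""
    return list(accumulate(values, max))


# Hypothetical cumulative counts with two dips (115 and 128)
cumulative = [100, 120, 115, 130, 128, 140]
fixed = force_monotonic(cumulative)
print(fixed)
```

Each dip is replaced by the largest value seen so far, so the corrected series can be used as a cumulative count.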
[17]:
# For selected country
jhu_data.show_complement(country="Japan")
[17]:
Country | Province | Monotonic_confirmed | Monotonic_fatal | Monotonic_recovered | Full_recovered | Partial_recovered | |
---|---|---|---|---|---|---|---|
0 | Japan | - | False | False | True | False | True |
[18]:
# Show the details of complement for all countries
# jhu_data.show_complement().tail()
# For selected province
# jhu_data.show_complement(country="Japan", province="Tokyo")
# For selected countries
# jhu_data.show_complement(country=["Greece", "Japan"])
The average recovery period is estimated with JHUData.calculate_recovery_period(). What we currently do is calculate the difference between confirmed cases and fatal cases and try to match it to some recovered cases value in the future. We apply this method to every country that has valid recovery data and average the partial recovery periods to obtain a single (average) recovery period. During the calculations, we ignore time intervals that lead to very short (<7 days) or very long (>90 days) partial recovery periods, if these occur with high frequency (>50%) in the records. For this analysis, we have to assume temporarily invariable compartments to extract an approximation of the average recovery period.
Alternatively, we tried to use linelist data to get a precise value of the recovery period (the average of recovery date minus confirmation date over cases), but the number of records was too small.
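The matching idea described above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the library's implementation: for each day it looks for the first later day on which cumulative recoveries reach “Confirmed - Fatal”, then averages the delays. The function name `estimate_recovery_period` and the toy series are hypothetical, and the filtering of very short or very long partial periods is omitted.

```python
from statistics import mean


def estimate_recovery_period(confirmed, fatal, recovered):
    """Average delay (days) until cumulative recoveries reach Confirmed - Fatal."""
    periods = []
    for day, (c, f) in enumerate(zip(confirmed, fatal)):
        target = c - f
        # First day (same or later) on which recoveries have caught up
        for later in range(day, len(recovered)):
            if recovered[later] >= target:
                periods.append(later - day)
                break
    # Days that are never matched within the series are ignored
    return round(mean(periods)) if periods else None


# Toy cumulative series in which recoveries lag confirmations by two days
confirmed = [10, 20, 30, 40, 50, 60]
fatal = [0, 0, 0, 0, 0, 0]
recovered = [0, 0, 10, 20, 30, 40]
period = estimate_recovery_period(confirmed, fatal, recovered)
print(f"Estimated recovery period: {period} [days]")
```

In the toy data the recovered series is the confirmed series shifted by two days, so the sketch recovers a period of 2 days.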
[19]:
recovery_period = jhu_data.calculate_recovery_period()
print(f"Average recovery period: {recovery_period} [days]")
Average recovery period: 16 [days]
We can visualize the number of cases with the .map() method. When country is None, a global map will be shown.
Global map with country level data:
[20]:
# Global map with country level data
jhu_data.map(country=None, variable="Infected")
# To include/exclude some countries
# jhu_data.map(country=None, variable="Infected", included=["Japan"])
# jhu_data.map(country=None, variable="Infected", excluded=["Japan"])
# To change the date
# jhu_data.map(country=None, variable="Infected", date="01Oct2021")

[21]:
# Country level data
jhu_data.layer(country=None).tail()
[21]:
ISO3 | Date | Country | Confirmed | Infected | Fatal | Recovered | |
---|---|---|---|---|---|---|---|
83419 | JPN | 2021-02-27 | Japan | 430539 | 14712 | 7807 | 408020 |
83420 | JPN | 2021-02-28 | Japan | 431740 | 14561 | 7860 | 409319 |
83421 | JPN | 2021-03-01 | Japan | 432773 | 14282 | 7887 | 410604 |
83422 | JPN | 2021-03-02 | Japan | 433504 | 13456 | 7933 | 412115 |
83423 | JPN | 2021-03-03 | Japan | 434356 | 13038 | 7984 | 413334 |
Country map with province level data:
[22]:
# Country map with province level data
jhu_data.map(country="Japan", variable="Infected")
# To include/exclude some provinces
# jhu_data.map(country="Japan", variable="Infected", included=["Tokyo"])
# jhu_data.map(country="Japan", variable="Infected", excluded=["Tokyo"])
# To change the date
# jhu_data.map(country="Japan", variable="Infected", date="01Oct2021")

[23]:
# Province level data
jhu_data.layer(country="Japan").tail()
[23]:
ISO3 | Date | Country | Province | Confirmed | Infected | Fatal | Recovered | |
---|---|---|---|---|---|---|---|---|
16837 | JPN | 2021-02-27 | Japan | Entering | 2229 | 34 | 2 | 2193 |
16838 | JPN | 2021-02-28 | Japan | Entering | 2235 | 38 | 2 | 2195 |
16839 | JPN | 2021-03-01 | Japan | Entering | 2240 | 41 | 2 | 2197 |
16840 | JPN | 2021-03-02 | Japan | Entering | 2254 | 54 | 2 | 2198 |
16841 | JPN | 2021-03-03 | Japan | Entering | 2255 | 51 | 2 | 2202 |
Linelist of case reports¶
The number of cases is important, but a linelist of case reports will help us understand the situation more deeply. The linelist data was saved as linelist, an instance of the LinelistData class. This dataset is from the Open COVID-19 Data Working Group.
[24]:
type(linelist)
[24]:
covsirphy.cleaning.linelist.LinelistData
[25]:
# Citation
print(linelist.citation)
Xu, B., Gutierrez, B., Mekaru, S. et al. Epidemiological data from the COVID-19 outbreak, real-time case information. Sci Data 7, 106 (2020). https://doi.org/10.1038/s41597-020-0448-0
[26]:
# Raw dataset
linelist.raw.tail()
[26]:
age | sex | province | country | date_admission_hospital | date_confirmation | symptoms | chronic_disease | outcome | date_death_or_discharge | |
---|---|---|---|---|---|---|---|---|---|---|
2676307 | 52 | female | Lima | Peru | NaN | 17.05.2020 | NaN | NaN | NaN | NaN |
2676308 | 52 | female | Lima | Peru | NaN | 17.05.2020 | NaN | NaN | NaN | NaN |
2676309 | 52 | male | Callao | Peru | NaN | 17.05.2020 | NaN | NaN | NaN | NaN |
2676310 | 52 | male | Lima | Peru | NaN | 17.05.2020 | NaN | NaN | NaN | NaN |
2676311 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
[27]:
# Cleaned dataset
linelist.cleaned().tail()
[27]:
Country | Province | Hospitalized_date | Confirmation_date | Outcome_date | Confirmed | Infected | Recovered | Fatal | Symptoms | Chronic_disease | Age | Sex | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2676306 | Peru | Coronel Portillo | NaT | 2020-05-17 | NaT | True | False | False | False | NaN | NaN | 52.0 | female |
2676307 | Peru | Lima | NaT | 2020-05-17 | NaT | True | False | False | False | NaN | NaN | 52.0 | female |
2676308 | Peru | Lima | NaT | 2020-05-17 | NaT | True | False | False | False | NaN | NaN | 52.0 | female |
2676309 | Peru | Callao | NaT | 2020-05-17 | NaT | True | False | False | False | NaN | NaN | 52.0 | male |
2676310 | Peru | Lima | NaT | 2020-05-17 | NaT | True | False | False | False | NaN | NaN | 52.0 | male |
[28]:
# Subset for specified area
linelist.subset("Japan", province="Tokyo").tail()
[28]:
Hospitalized_date | Confirmation_date | Outcome_date | Confirmed | Infected | Recovered | Fatal | Symptoms | Chronic_disease | Age | Sex | |
---|---|---|---|---|---|---|---|---|---|---|---|
107 | NaT | 2020-01-30 | NaT | True | False | False | False | NaN | NaN | NaN | female |
108 | NaT | 2020-01-24 | NaT | True | False | True | False | fever:pneumonia:sore throat | NaN | 40.0 | male |
109 | 2020-10-01 | 2020-01-15 | 2020-01-15 | True | False | True | False | cough:fever:sore throat | NaN | 30.0 | male |
110 | NaT | 2020-01-25 | NaT | True | False | False | False | cough:fever | NaN | NaN | female |
111 | NaT | 2020-01-26 | NaT | True | False | False | False | fever:joint pain:pneumonia | NaN | 40.0 | male |
[29]:
# Subset for outcome ("Recovered" or "Fatal")
linelist.closed(outcome="Recovered").tail()
[29]:
Country | Province | Hospitalized_date | Confirmation_date | Recovered_date | Symptoms | Chronic_disease | Age | Sex | |
---|---|---|---|---|---|---|---|---|---|
272 | Singapore | - | 2020-02-02 | 2020-02-06 | 2020-02-17 | NaN | NaN | 39.0 | female |
273 | Malaysia | Johor | NaT | 2020-01-25 | 2020-02-08 | cough:fever | NaN | 40.0 | male |
274 | China | Gansu | 2020-07-02 | 2020-02-08 | 2020-02-17 | diarrhea | NaN | 1.0 | female |
275 | Canada | Ontario | NaT | 2020-01-25 | 2020-01-31 | NaN | hypertension | NaN | male |
276 | Canada | Ontario | NaT | 2020-01-31 | 2020-02-19 | NaN | NaN | NaN | female |
We can calculate the recovery period as the median of the period from confirmation to recovery.
[30]:
# Recovery period (integer) [days]
linelist.recovery_period()
[30]:
12
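The median calculation can be illustrated with the five recovered cases shown in the table above. This is a plain-Python sketch using only those rows, so its result differs from the value computed by the library on the full dataset; the variable names are illustrative.

```python
from datetime import date
from statistics import median

# (Confirmation_date, Recovered_date) pairs copied from the five rows shown above
cases = [
    (date(2020, 2, 6), date(2020, 2, 17)),   # Singapore
    (date(2020, 1, 25), date(2020, 2, 8)),   # Malaysia, Johor
    (date(2020, 2, 8), date(2020, 2, 17)),   # China, Gansu
    (date(2020, 1, 25), date(2020, 1, 31)),  # Canada, Ontario
    (date(2020, 1, 31), date(2020, 2, 19)),  # Canada, Ontario
]

# Days from confirmation to recovery for each case
periods = [(recovered - confirmed).days for confirmed, recovered in cases]
recovery_period = int(median(periods))
print(periods, recovery_period)
```

For these five rows the median is 11 days, close to the 12 days obtained from all closed cases.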
Population in each country¶
Population values are necessary to calculate the number of susceptible people. Susceptible is a variable of SIR-derived models. This dataset was saved as population_data
, an instance of PopulationData
class.
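As a rough illustration of why population values matter, the sketch below derives the number of susceptible people under the simplifying assumption that everyone who has not been confirmed is still susceptible. This assumption and the function name `susceptible` are illustrative only; the page itself does not state the exact formula. The inputs are the population of Japan and the confirmed cases on 2021-03-03 reported elsewhere in this section.

```python
def susceptible(population, confirmed):
    """Susceptible people, assuming every non-confirmed person is susceptible."""
    return population - confirmed


# Population of Japan and confirmed cases on 2021-03-03, as shown in this section
japan_susceptible = susceptible(population=126529100, confirmed=434356)
print(japan_susceptible)
```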
[31]:
type(population_data)
[31]:
covsirphy.cleaning.population.PopulationData
[32]:
# Description/citation
print(population_data.citation)
(Secondary source) Guidotti, E., Ardia, D., (2020), "COVID-19 Data Hub", Journal of Open Source Software 5(51):2376, doi: 10.21105/joss.02376.
[33]:
# Raw data (the same as jhu_data)
# population_data.raw.tail()
[34]:
# Cleaned data
population_data.cleaned().tail()
[34]:
ISO3 | Country | Province | Date | Population | |
---|---|---|---|---|---|
384339 | COL | Colombia | Vichada | 2021-02-27 | 107808 |
384340 | COL | Colombia | Vichada | 2021-02-28 | 107808 |
384341 | COL | Colombia | Vichada | 2021-03-01 | 107808 |
384342 | COL | Colombia | Vichada | 2021-03-02 | 107808 |
384343 | COL | Colombia | Vichada | 2021-03-03 | 107808 |
We can get the population values with PopulationData.value().
[35]:
# In a country
population_data.value("Japan", province=None)
# In a country with ISO3 code
# population_data.value("JPN", province=None)
# In a province (prefecture)
# population_data.value("Japan", province="Tokyo")
[35]:
126529100
We can update the population values.
[36]:
# Before
population_before = population_data.value("Japan", province="Tokyo")
print(f"Before: {population_before}")
# Register population value of Tokyo in Japan
# https://www.metro.tokyo.lg.jp/tosei/hodohappyo/press/2020/06/11/07.html
population_data.update(14_002_973, "Japan", province="Tokyo")
population_after = population_data.value("Japan", province="Tokyo")
print(f" After: {population_after}")
Before: 13942856
After: 14002973
We can visualize the population values with the .map() method. When country is None, a global map will be shown. Arguments are the same as for JHUData, but the variable name cannot be specified.
[37]:
# Global map with country level data
population_data.map(country=None)

[38]:
# Country level data
population_data.layer(country=None).tail()
[38]:
ISO3 | Country | Date | Population | |
---|---|---|---|---|
83455 | ZWE | Zimbabwe | 2021-02-27 | 14439018 |
83456 | ZWE | Zimbabwe | 2021-02-28 | 14439018 |
83457 | ZWE | Zimbabwe | 2021-03-01 | 14439018 |
83458 | ZWE | Zimbabwe | 2021-03-02 | 14439018 |
83459 | ZWE | Zimbabwe | 2021-03-03 | 14439018 |
[39]:
# Country map with province level data
population_data.map(country="Japan")

[40]:
# Province level data
population_data.layer(country="Japan").tail()
[40]:
ISO3 | Country | Province | Date | Population | |
---|---|---|---|---|---|
20112 | JPN | Japan | Kagawa | 2021-02-28 | 956069 |
20113 | JPN | Japan | Kagawa | 2021-03-01 | 956069 |
20114 | JPN | Japan | Kagawa | 2021-03-02 | 956069 |
20115 | JPN | Japan | Kagawa | 2021-03-03 | 956069 |
20116 | - | Japan | Tokyo | 2021-03-04 | 14002973 |
Government Response Tracker (OxCGRT)¶
Using the DataLoader class, the dataset was retrieved via COVID-19 Data Hub and saved as oxcgrt_data, an instance of the OxCGRTData class.
[41]:
type(oxcgrt_data)
[41]:
covsirphy.cleaning.oxcgrt.OxCGRTData
[42]:
# Description/citation
print(oxcgrt_data.citation)
(Secondary source) Guidotti, E., Ardia, D., (2020), "COVID-19 Data Hub", Journal of Open Source Software 5(51):2376, doi: 10.21105/joss.02376.
[43]:
# Raw data (the same as jhu_data)
# oxcgrt_data.raw.tail()
[44]:
# Cleaned data
oxcgrt_data.cleaned().tail()
[44]:
Date | Country | ISO3 | School_closing | Workplace_closing | Cancel_events | Gatherings_restrictions | Transport_closing | Stay_home_restrictions | Internal_movement_restrictions | International_movement_restrictions | Information_campaigns | Testing_policy | Contact_tracing | Stringency_index | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
443831 | 2021-02-27 | Colombia | COL | 3 | 1 | 1 | 3 | 1 | 1 | 1 | 4 | 2 | 2 | 2 | 81.02 |
443832 | 2021-02-28 | Colombia | COL | 3 | 1 | 1 | 3 | 1 | 1 | 1 | 4 | 2 | 2 | 2 | 81.02 |
443833 | 2021-03-01 | Colombia | COL | 3 | 1 | 1 | 3 | 1 | 1 | 1 | 4 | 2 | 2 | 2 | 81.02 |
443834 | 2021-03-02 | Colombia | COL | 3 | 1 | 1 | 3 | 1 | 1 | 1 | 4 | 2 | 2 | 2 | 81.02 |
443835 | 2021-03-03 | Colombia | COL | 3 | 1 | 1 | 3 | 1 | 1 | 1 | 4 | 2 | 2 | 2 | 81.02 |
[45]:
# Subset for a country
oxcgrt_data.subset("Japan").tail()
# We can use ISO3 codes
# oxcgrt_data.subset("JPN").tail()
[45]:
Date | School_closing | Workplace_closing | Cancel_events | Gatherings_restrictions | Transport_closing | Stay_home_restrictions | Internal_movement_restrictions | International_movement_restrictions | Information_campaigns | Testing_policy | Contact_tracing | Stringency_index | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
423 | 2021-02-27 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 4 | 2 | 2 | 1 | 51.39 |
424 | 2021-02-28 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 4 | 2 | 2 | 1 | 51.39 |
425 | 2021-03-01 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 4 | 2 | 2 | 1 | 51.39 |
426 | 2021-03-02 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 4 | 2 | 2 | 1 | 51.39 |
427 | 2021-03-03 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 4 | 2 | 2 | 1 | 51.39 |
We can visualize the indicator values with the .map() method. Arguments are the same as for JHUData, but the country name cannot be specified.
[46]:
# Global map with country level data
oxcgrt_data.map(variable="Stringency_index")

[47]:
# Country level data
oxcgrt_data.layer().tail()
[47]:
Date | Country | ISO3 | School_closing | Workplace_closing | Cancel_events | Gatherings_restrictions | Transport_closing | Stay_home_restrictions | Internal_movement_restrictions | International_movement_restrictions | Information_campaigns | Testing_policy | Contact_tracing | Stringency_index | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
443831 | 2021-02-27 | Colombia | COL | 3 | 1 | 1 | 3 | 1 | 1 | 1 | 4 | 2 | 2 | 2 | 81.02 |
443832 | 2021-02-28 | Colombia | COL | 3 | 1 | 1 | 3 | 1 | 1 | 1 | 4 | 2 | 2 | 2 | 81.02 |
443833 | 2021-03-01 | Colombia | COL | 3 | 1 | 1 | 3 | 1 | 1 | 1 | 4 | 2 | 2 | 2 | 81.02 |
443834 | 2021-03-02 | Colombia | COL | 3 | 1 | 1 | 3 | 1 | 1 | 1 | 4 | 2 | 2 | 2 | 81.02 |
443835 | 2021-03-03 | Colombia | COL | 3 | 1 | 1 | 3 | 1 | 1 | 1 | 4 | 2 | 2 | 2 | 81.02 |
The number of tests¶
The number of tests is also key information to understand the situation. This dataset was saved as pcr_data
, an instance of PCRData
class.
[48]:
type(pcr_data)
[48]:
covsirphy.cleaning.pcr_data.PCRData
[49]:
# Description/citation
print(pcr_data.citation)
(Secondary source) Guidotti, E., Ardia, D., (2020), "COVID-19 Data Hub", Journal of Open Source Software 5(51):2376, doi: 10.21105/joss.02376.
Hasell, J., Mathieu, E., Beltekian, D. et al. A cross-country database of COVID-19 testing. Sci Data 7, 345 (2020). https://doi.org/10.1038/s41597-020-00688-8
Lisphilar (2020), COVID-19 dataset in Japan, GitHub repository, https://github.com/lisphilar/covid19-sir/data/japan
[50]:
# Raw data (the same as jhu_data)
# pcr_data.raw.tail()
[51]:
# Cleaned data
pcr_data.cleaned().tail()
[51]:
Date | Country | Province | Tests | Confirmed | |
---|---|---|---|---|---|
17229 | 2021-02-27 | Japan | - | 8203285 | 430539 |
17230 | 2021-02-28 | Japan | - | 8234982 | 431740 |
17231 | 2021-03-01 | Japan | - | 8256602 | 432773 |
17232 | 2021-03-02 | Japan | - | 8319692 | 433504 |
17233 | 2021-03-03 | Japan | - | 8403376 | 434356 |
[52]:
# Subset for a country
pcr_data.subset("Japan").tail()
# We can use ISO3 codes
# pcr_data.subset("JPN").tail()
# Note: from version 2.17.0-alpha (stable: 2.18.0), "Tests_diff" will be added.
[52]:
Date | Tests | Tests_diff | Confirmed | |
---|---|---|---|---|
387 | 2021-02-27 | 8203285 | 59361 | 430539 |
388 | 2021-02-28 | 8234982 | 31697 | 431740 |
389 | 2021-03-01 | 8256602 | 21620 | 432773 |
390 | 2021-03-02 | 8319692 | 63090 | 433504 |
391 | 2021-03-03 | 8403376 | 83684 | 434356 |
Under the assumption that all tests were PCR tests, we can calculate the positive rate of PCR tests as “the number of confirmed cases per the number of tests”.
[53]:
# Positive rate in Japan
_ = pcr_data.positive_rate("Japan")
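The ratio described above can be sketched in plain Python on a day-by-day basis. This is an illustration under assumptions: it divides daily new confirmed cases by daily new tests, and the library may apply additional processing that is not shown on this page. The function name `daily_positive_rate` is hypothetical; the inputs are the cumulative values for Japan from the tables above.

```python
def daily_positive_rate(tests, confirmed):
    """New confirmed cases divided by new tests; None for days without new tests."""
    rates = []
    for i in range(1, len(tests)):
        new_tests = tests[i] - tests[i - 1]
        new_confirmed = confirmed[i] - confirmed[i - 1]
        rates.append(new_confirmed / new_tests if new_tests > 0 else None)
    return rates


# Cumulative values for Japan, 2021-02-27 to 2021-03-03
tests = [8203285, 8234982, 8256602, 8319692, 8403376]
confirmed = [430539, 431740, 432773, 433504, 434356]
rates = daily_positive_rate(tests, confirmed)
print([f"{rate:.2%}" for rate in rates])
```

The differences of the test counts match the “Tests_diff” column shown for Japan, which is why the first entry of each series is consumed without producing a rate.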

We can visualize the number of tests with the .map() method. When country is None, a global map will be shown. Arguments are the same as for JHUData, but the variable name cannot be specified.
[54]:
# Global map with country level data
pcr_data.map(country=None)

[55]:
# Country level data
pcr_data.layer(country=None).tail()
[55]:
ISO3 | Date | Country | Tests | Confirmed | |
---|---|---|---|---|---|
84703 | JPN | 2021-02-27 | Japan | 8203285 | 430539 |
84704 | JPN | 2021-02-28 | Japan | 8234982 | 431740 |
84705 | JPN | 2021-03-01 | Japan | 8256602 | 432773 |
84706 | JPN | 2021-03-02 | Japan | 8319692 | 433504 |
84707 | JPN | 2021-03-03 | Japan | 8403376 | 434356 |
[56]:
# Country map with province level data
pcr_data.map(country="Japan")

[57]:
# Province level data
pcr_data.layer(country="Japan").tail()
[57]:
ISO3 | Date | Country | Province | Tests | Confirmed | |
---|---|---|---|---|---|---|
16837 | JPN | 2021-02-27 | Japan | Entering | 517631 | 2229 |
16838 | JPN | 2021-02-28 | Japan | Entering | 520132 | 2235 |
16839 | JPN | 2021-03-01 | Japan | Entering | 521977 | 2240 |
16840 | JPN | 2021-03-02 | Japan | Entering | 524977 | 2254 |
16841 | JPN | 2021-03-03 | Japan | Entering | 526337 | 2255 |
The number of vaccinations¶
Vaccination is a key factor in ending the outbreak as soon as possible. This dataset was saved as vaccine_data, an instance of the VaccineData class.
[58]:
# The number of vaccinations
type(vaccine_data)
[58]:
covsirphy.cleaning.vaccine_data.VaccineData
[59]:
# Description/citation
print(vaccine_data.citation)
Hasell, J., Mathieu, E., Beltekian, D. et al. A cross-country database of COVID-19 testing. Sci Data 7, 345 (2020). https://doi.org/10.1038/s41597-020-00688-8
[60]:
# Raw data
# vaccine_data.raw.tail()
[61]:
# Cleaned data
vaccine_data.cleaned().tail()
[61]:
Date | Country | ISO3 | Product | Vaccinations | Vaccinated_once | Vaccinated_full | |
---|---|---|---|---|---|---|---|
5073 | 2021-02-26 | Zimbabwe | ZWE | Sinopharm/Beijing | 12579 | 12579 | 0 |
5074 | 2021-02-27 | Zimbabwe | ZWE | Sinopharm/Beijing | 15705 | 15705 | 0 |
5075 | 2021-02-28 | Zimbabwe | ZWE | Sinopharm/Beijing | 18843 | 18843 | 0 |
5076 | 2021-03-01 | Zimbabwe | ZWE | Sinopharm/Beijing | 21456 | 21456 | 0 |
5077 | 2021-03-02 | Zimbabwe | ZWE | Sinopharm/Beijing | 25077 | 25077 | 0 |
[62]:
# Registered countries
pprint(vaccine_data.countries(), compact=True)
['Albania', 'Algeria', 'Andorra', 'Anguilla', 'Argentina', 'Australia',
'Austria', 'Azerbaijan', 'Bahrain', 'Bangladesh', 'Barbados', 'Belarus',
'Belgium', 'Bermuda', 'Bolivia', 'Brazil', 'Bulgaria', 'Cambodia', 'Canada',
'Cayman Islands', 'Chile', 'China', 'Colombia', 'Costa Rica', 'Croatia',
'Cyprus', 'Czechia', 'Denmark', 'Dominican Republic', 'Ecuador', 'Egypt',
'El Salvador', 'England', 'Estonia', 'European Union', 'Faeroe Islands',
'Falkland Islands', 'Finland', 'France', 'Germany', 'Gibraltar', 'Greece',
'Greenland', 'Guatemala', 'Guernsey', 'Guyana', 'Honduras', 'Hong Kong',
'Hungary', 'Iceland', 'India', 'Indonesia', 'Iran', 'Ireland', 'Isle of Man',
'Israel', 'Italy', 'Japan', 'Jersey', 'Jordan', 'Kazakhstan', 'Kuwait',
'Latvia', 'Lebanon', 'Liechtenstein', 'Lithuania', 'Luxembourg', 'Macao',
'Malaysia', 'Maldives', 'Malta', 'Mauritius', 'Mexico', 'Monaco', 'Mongolia',
'Montenegro', 'Montserrat', 'Morocco', 'Myanmar', 'Nepal', 'Netherlands',
'New Zealand', 'Northern Cyprus', 'Northern Ireland', 'Norway', 'Oman',
'Pakistan', 'Panama', 'Paraguay', 'Peru', 'Poland', 'Portugal', 'Qatar',
'Romania', 'Russia', 'Saint Helena', 'San Marino', 'Saudi Arabia', 'Scotland',
'Senegal', 'Serbia', 'Seychelles', 'Singapore', 'Slovakia', 'Slovenia',
'South Africa', 'South Korea', 'Spain', 'Sri Lanka', 'Sweden', 'Switzerland',
'Thailand', 'Trinidad and Tobago', 'Turkey', 'Turks and Caicos Islands',
'Ukraine', 'United Arab Emirates', 'United Kingdom', 'United States',
'Uruguay', 'Venezuela', 'Wales', 'World', 'Zimbabwe']
[63]:
# Subset for a country
vaccine_data.subset("United Kingdom").tail()
# We can use ISO3 codes
# vaccine_data.subset("GBR").tail()
[63]:
Date | Vaccinations | Vaccinated_once | Vaccinated_full | |
---|---|---|---|---|
74 | 2021-02-25 | 19913592 | 19177555 | 736037 |
75 | 2021-02-26 | 20450858 | 19682048 | 768810 |
76 | 2021-02-27 | 20885683 | 20089551 | 796132 |
77 | 2021-02-28 | 21091267 | 20275451 | 815816 |
78 | 2021-03-01 | 21322717 | 20478619 | 844098 |
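In the rows shown above, “Vaccinations” equals the sum of “Vaccinated_once” and “Vaccinated_full”. The plain-Python sketch below checks this relationship for those five rows; it is an observation about this output, not a guarantee documented for the dataset, and the variable names are illustrative.

```python
# (Date, Vaccinations, Vaccinated_once, Vaccinated_full) copied from the rows above
rows = [
    ("2021-02-25", 19913592, 19177555, 736037),
    ("2021-02-26", 20450858, 19682048, 768810),
    ("2021-02-27", 20885683, 20089551, 796132),
    ("2021-02-28", 21091267, 20275451, 815816),
    ("2021-03-01", 21322717, 20478619, 844098),
]

# Dates on which the total does not equal once + full
mismatches = [day for day, total, once, full in rows if total != once + full]
print(mismatches or "All rows are consistent")
```

A check like this is a cheap way to spot inconsistent records before using the columns in an analysis.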
We can visualize the number of vaccinations with the .map() method. Arguments are the same as for JHUData, but the country name cannot be specified.
[64]:
# Global map with country level data
vaccine_data.map()

[65]:
# Country level data
vaccine_data.layer().tail()
[65]:
Date | Country | ISO3 | Product | Vaccinations | Vaccinated_once | Vaccinated_full | |
---|---|---|---|---|---|---|---|
5073 | 2021-02-26 | Zimbabwe | ZWE | Sinopharm/Beijing | 12579 | 12579 | 0 |
5074 | 2021-02-27 | Zimbabwe | ZWE | Sinopharm/Beijing | 15705 | 15705 | 0 |
5075 | 2021-02-28 | Zimbabwe | ZWE | Sinopharm/Beijing | 18843 | 18843 | 0 |
5076 | 2021-03-01 | Zimbabwe | ZWE | Sinopharm/Beijing | 21456 | 21456 | 0 |
5077 | 2021-03-02 | Zimbabwe | ZWE | Sinopharm/Beijing | 25077 | 25077 | 0 |
Population pyramid¶
With the population pyramid, we can divide the population into sub-groups. This will be useful when we analyse the meaning of parameters. For example, the number of days people go out differs between the sub-groups. This dataset was saved as pyramid_data, an instance of the PopulationPyramidData class.
[66]:
# Population pyramid
type(pyramid_data)
[66]:
covsirphy.cleaning.pyramid.PopulationPyramidData
[67]:
# Description/citation
print(pyramid_data.citation)
World Bank Group (2020), World Bank Open Data, https://data.worldbank.org/
[68]:
# Subset will be retrieved from the server when set
pyramid_data.subset("Japan").tail()
[68]:
Age | Population | Per_total | |
---|---|---|---|
113 | 118 | 255035 | 0.002174 |
114 | 119 | 255035 | 0.002174 |
115 | 120 | 255035 | 0.002174 |
116 | 121 | 255035 | 0.002174 |
117 | 122 | 255035 | 0.002174 |
Japan-specific dataset¶
This includes the number of confirmed/infected/fatal/recovered/tests/moderate/severe cases at country/prefecture level and metadata of each prefecture. This dataset was saved as japan_data
, an instance of JapanData
class.
[69]:
# Japan-specific dataset
type(japan_data)
[69]:
covsirphy.cleaning.japan_data.JapanData
[70]:
# Description/citation
print(japan_data.citation)
Lisphilar (2020), COVID-19 dataset in Japan, GitHub repository, https://github.com/lisphilar/covid19-sir/data/japan
[71]:
# Cleaned dataset
japan_data.cleaned().tail()
[71]:
Index | Date | Country | Province | Confirmed | Infected | Fatal | Recovered | Tests | Moderate | Severe | Vaccinations | Vaccinated_once | Vaccinated_full |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
17229 | 2021-02-27 | Japan | - | 430539 | 14712 | 7807 | 408020 | 8203285 | 14057 | 440 | 28530 | 28530 | 0 |
17230 | 2021-02-28 | Japan | - | 431740 | 14561 | 7860 | 409319 | 8234982 | 13929 | 434 | 28530 | 28530 | 0 |
17231 | 2021-03-01 | Japan | - | 432773 | 14282 | 7887 | 410604 | 8256602 | 13618 | 436 | 31785 | 31785 | 0 |
17232 | 2021-03-02 | Japan | - | 433504 | 13456 | 7933 | 412115 | 8319692 | 12775 | 413 | 34772 | 34772 | 0 |
17233 | 2021-03-03 | Japan | - | 434356 | 13038 | 7984 | 413334 | 8403376 | 12382 | 407 | 37303 | 37303 | 0 |
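The rows above appear consistent with the relation stated earlier for the JHU-style data, "Infected = Confirmed - Recovered - Fatal". The following sketch checks this in plain Python with the values copied from the table; it is only a sanity check on the displayed rows, not a description of how JapanData computes the column.

```python
# (Date, Confirmed, Infected, Fatal, Recovered) copied from the rows shown above
rows = [
    ("2021-02-27", 430539, 14712, 7807, 408020),
    ("2021-02-28", 431740, 14561, 7860, 409319),
    ("2021-03-01", 432773, 14282, 7887, 410604),
    ("2021-03-02", 433504, 13456, 7933, 412115),
    ("2021-03-03", 434356, 13038, 7984, 413334),
]

# Collect the dates where Infected differs from Confirmed - Recovered - Fatal
mismatches = [
    date
    for date, confirmed, infected, fatal, recovered in rows
    if infected != confirmed - recovered - fatal
]
print("Rows that break the relation:", mismatches)
```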
[72]:
# Metadata
japan_data.meta().tail()
[72]:
Index | Prefecture | Admin_Capital | Admin_Region | Admin_Num | Area_Habitable | Area_Total | Clinic_bed_Care | Clinic_bed_Total | Hospital_bed_Care | Hospital_bed_Specific | Hospital_bed_Total | Hospital_bed_Tuberculosis | Hospital_bed_Type-I | Hospital_bed_Type-II | Population_Female | Population_Male | Population_Total | Location_Latitude | Location_Longitude |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
42 | Kumamoto | Kumamoto | Kyushu | 43 | 2796 | 7409 | 497 | 4628 | 8340 | 0 | 33710 | 95 | 2 | 46 | 933 | 833 | 1765 | 32.790513 | 130.742388 |
43 | Oita | Oita | Kyushu | 44 | 1799 | 6341 | 269 | 3561 | 2618 | 0 | 19834 | 50 | 2 | 38 | 607 | 546 | 1152 | 33.238391 | 131.612658 |
44 | Miyazaki | Miyazaki | Kyushu | 45 | 1850 | 7735 | 206 | 2357 | 3682 | 0 | 18769 | 33 | 1 | 30 | 577 | 512 | 1089 | 31.911188 | 131.423873 |
45 | Kagoshima | Kagoshima | Kyushu | 46 | 3313 | 9187 | 652 | 4827 | 7750 | 0 | 32651 | 98 | 1 | 44 | 863 | 763 | 1626 | 31.560052 | 130.557745 |
46 | Okinawa | Naha | Okinawa | 47 | 1169 | 2281 | 83 | 914 | 3804 | 0 | 18710 | 47 | 4 | 20 | 734 | 709 | 1443 | 26.211761 | 127.681119 |
We can visualize the number of cases with the .map() method. Arguments are the same as for JHUData, except that a country name cannot be specified.
[73]:
# Country map with province level data
japan_data.map(variable="Severe")

[74]:
# Province level data
japan_data.layer(country="Japan").tail()
[74]:
Index | Date | Country | Province | Confirmed | Infected | Fatal | Recovered | Tests | Moderate | Severe | Vaccinations | Vaccinated_once | Vaccinated_full |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
16837 | 2021-02-27 | Japan | Entering | 2229 | 34 | 2 | 2193 | 517631 | 34 | 0 | 0 | 0 | 0 |
16838 | 2021-02-28 | Japan | Entering | 2235 | 38 | 2 | 2195 | 520132 | 38 | 0 | 0 | 0 | 0 |
16839 | 2021-03-01 | Japan | Entering | 2240 | 41 | 2 | 2197 | 521977 | 41 | 0 | 0 | 0 | 0 |
16840 | 2021-03-02 | Japan | Entering | 2254 | 54 | 2 | 2198 | 524977 | 54 | 0 | 0 | 0 | 0 |
16841 | 2021-03-03 | Japan | Entering | 2255 | 51 | 2 | 2202 | 526337 | 51 | 0 | 0 | 0 | 0 |
A map with country level data is not available, but the country level data itself can be retrieved.
[75]:
# Country level data
japan_data.layer(country=None).tail()
[75]:
Index | Date | Country | Confirmed | Infected | Fatal | Recovered | Tests | Moderate | Severe | Vaccinations | Vaccinated_once | Vaccinated_full |
---|---|---|---|---|---|---|---|---|---|---|---|---|
387 | 2021-02-27 | Japan | 430539 | 14712 | 7807 | 408020 | 8203285 | 14057 | 440 | 28530 | 28530 | 0 |
388 | 2021-02-28 | Japan | 431740 | 14561 | 7860 | 409319 | 8234982 | 13929 | 434 | 28530 | 28530 | 0 |
389 | 2021-03-01 | Japan | 432773 | 14282 | 7887 | 410604 | 8256602 | 13618 | 436 | 31785 | 31785 | 0 |
390 | 2021-03-02 | Japan | 433504 | 13456 | 7933 | 412115 | 8319692 | 12775 | 413 | 34772 | 34772 | 0 |
391 | 2021-03-03 | Japan | 434356 | 13038 | 7984 | 413334 | 8403376 | 12382 | 407 | 37303 | 37303 | 0 |