Data preparation

The first step of data science is data preparation. covsirphy has the following three functionality for that.

Downloading datasets from recommended data servers
Reading pandas.DataFrame
Generator of sample data with SIR-derived ODE model

[1]:

from datetime import date
from pprint import pprint
import numpy as np
import pandas as pd
import covsirphy as cs
cs.__version__

[1]:

'3.1.2'

From version 2.28.0, some classes and methods output the following information using loguru: library which aims to bring enjoyable logging in Python. We can change logging level with cs.config(level=2) (as default).

level=0: errors (exceptions)
level=1: errors, warnings
level=2: errors, warnings, info (start downloading/optimization)
level=3: errors, warnings, info, debug

[2]:

cs.config.logger(level=2)

1. Downloading datasets from recommended data

We will download datasets from the following recommended data servers.

COVID-19 Data Hub
- Guidotti, E., Ardia, D., (2020), “COVID-19 Data Hub”, Journal of Open Source Software 5(51):2376, doi: 10.21105/joss.02376.
- The number of cases (JHU style)
- Population values in each country/province
- Government Response Tracker (OxCGRT)
- The number of tests
Our World In Data
- Hasell, J., Mathieu, E., Beltekian, D. et al. A cross-country database of COVID-19 testing. Sci Data 7, 345 (2020). https://doi.org/10.1038/s41597-020-00688-8
- The number of tests
- The number of vaccinations
- The number of people who received vaccinations
COVID-19 Open Data by Google Cloud Platform
- 1. Wahltinez and others (2020), COVID-19 Open-Data: curating a fine-grained, global-scale data repository for SARS-CoV-2, Work in progress, https://goo.gle/covid-19-open-data
- percentage to baseline in visits
- Note: Please refer to Google Terms of Service in advance.
- This will be removed because not updated. Refer to https://github.com/lisphilar/covid19-sir/issues/1224
World Population Prospects 2022
- United Nations, Department of Economic and Social Affairs, Population Division (2022). World Population Prospects 2022, Online Edition.
- Total population in each country
Datasets for CovsirPhy
- Hirokazu Takaya (2020-2022), GitHub repository, COVID-19 dataset in Japan, https://github.com/lisphilar/covid19-sir/tree/master/data
- The number of cases in Japan (total/prefectures)
- Metadata regarding Japan prefectures

How to request new data loader:

If you want to use a new dataset for your analysis, please kindly inform us using GitHub Issues: Request new method of DataLoader class. Please read Guideline of contribution in advance.

1-1. With `DataEngineer` class

We can use DataEngineer().download() for data downloading from recommended data servers as the quickest way.

[3]:

eng = cs.DataEngineer()
eng.download(databases=["japan", "covid19dh", "owid"])

2024-10-03 at 13:13:07 | INFO | Retrieving COVID-19 dataset from https://github.com/lisphilar/covid19-sir/data/
2024-10-03 at 13:13:07 | INFO | Retrieving datasets from COVID-19 Data Hub https://covid19datahub.io/
2024-10-03 at 13:13:13 | INFO | Retrieving datasets from Our World In Data https://github.com/owid/covid-19-data/
2024-10-03 at 13:13:14 | INFO | Retrieving datasets from Our World In Data https://github.com/owid/covid-19-data/
2024-10-03 at 13:13:14 | INFO | Retrieving datasets from Our World In Data https://github.com/owid/covid-19-data/

[3]:

<covsirphy.engineering.engineer.DataEngineer at 0x7f07f18693d0>

We can get the all downloaded records as a pandas.DataFrame with DataEngineer().all() method.

[4]:

all_df = eng.all()
# Overview of the records
all_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 297996 entries, 0 to 297995
Data columns (total 27 columns):
 #   Column                               Non-Null Count   Dtype
---  ------                               --------------   -----
 0   ISO3                                 297996 non-null  category
 1   Province                             297996 non-null  category
 2   City                                 297996 non-null  category
 3   Date                                 297996 non-null  datetime64[ns]
 4   Cancel_events                        197839 non-null  Float64
 5   Confirmed                            239373 non-null  Float64
 6   Contact_tracing                      197865 non-null  Float64
 7   Country                              287329 non-null  string
 8   Fatal                                222176 non-null  Float64
 9   Gatherings_restrictions              197839 non-null  Float64
 10  Information_campaigns                197865 non-null  Float64
 11  Internal_movement_restrictions       197865 non-null  Float64
 12  International_movement_restrictions  197872 non-null  Float64
 13  Population                           286167 non-null  Float64
 14  Product                              176264 non-null  string
 15  Recovered                            74361 non-null   Float64
 16  School_closing                       197864 non-null  Float64
 17  Stay_home_restrictions               197833 non-null  Float64
 18  Stringency_index                     197828 non-null  Float64
 19  Testing_policy                       197865 non-null  Float64
 20  Tests                                91153 non-null   Float64
 21  Transport_closing                    197845 non-null  Float64
 22  Vaccinated_full                      59857 non-null   Float64
 23  Vaccinated_once                      62933 non-null   Float64
 24  Vaccinations                         66925 non-null   Float64
 25  Vaccinations_boosters                38050 non-null   Float64
 26  Workplace_closing                    197864 non-null  Float64
dtypes: Float64(21), category(3), datetime64[ns](1), string(2)
memory usage: 61.7 MB

DataEngineer.citations() shows citations of the datasets.

[5]:

print("\n".join(eng.citations()))

Hirokazu Takaya (2020-2024), COVID-19 dataset in Japan, GitHub repository, https://github.com/lisphilar/covid19-sir/data/japan
Guidotti, E., Ardia, D., (2020), "COVID-19 Data Hub", Journal of Open Source Software 5(51):2376, doi: 10.21105/joss.02376.
Hasell, J., Mathieu, E., Beltekian, D. et al. A cross-country database of COVID-19 testing. Sci Data 7, 345 (2020). https: //doi.org/10.1038/s41597-020-00688-8

Note that, as default, DataEngineer().download() collects country-level data and save the datasets as CSV files in “input” (=directory argument of DataEngineer().download()) folder of the current directory. If the last modification time of the saved CSV files is within the last 12 (=update_interval argument of DataEngineer().download()) hours, the saved CSV files will be used as the database.

For some countries (eg. Japan), province/state/prefecture level data is available and we can download it as follows.

[6]:

eng_jpn = cs.DataEngineer()
eng_jpn.download(country="Japan", databases=["japan", "covid19dh", "owid"])
eng_jpn.all().head()

2024-10-03 at 13:13:19 | INFO | Retrieving COVID-19 dataset from https://github.com/lisphilar/covid19-sir/data/
2024-10-03 at 13:13:20 | INFO | Retrieving datasets from COVID-19 Data Hub https://covid19datahub.io/

[6]:

	ISO3	Province	City	Date	Cancel_events	Confirmed	Contact_tracing	Country	Fatal	...	Stringency_index	Testing_policy	Tests	Vaccinated_full	Vaccinated_once	Vaccinations	Vaccinations_boosters	Workplace_closing
0	JPN	Aichi	-	2020-03-18	1.0	130.0	2.0	Japan	14.0	...	-40.74	1.0	847.0	None	None	None	None	1.0
1	JPN	Aichi	-	2020-03-19	1.0	134.0	2.0	Japan	14.0	...	-40.74	1.0	847.0	None	None	None	None	1.0
2	JPN	Aichi	-	2020-03-20	1.0	139.0	2.0	Japan	16.0	...	-40.74	1.0	847.0	None	None	None	None	1.0
3	JPN	Aichi	-	2020-03-21	1.0	141.0	2.0	Japan	16.0	...	-40.74	1.0	847.0	None	None	None	None	1.0
4	JPN	Aichi	-	2020-03-22	1.0	143.0	2.0	Japan	16.0	...	-40.74	1.0	847.0	None	None	None	None	1.0

5 rows × 26 columns

For some countries (eg. USA), city-level data is available and we can download it as follows.

[7]:

eng_alabama = cs.DataEngineer()
eng_alabama.download(country="USA", province="Alabama", databases=["japan", "covid19dh", "owid"])
eng_alabama.all().head()

2024-10-03 at 13:13:22 | INFO | Retrieving datasets from COVID-19 Data Hub https://covid19datahub.io/

[7]:

	ISO3	Province	City	Date	Cancel_events	Confirmed	Contact_tracing	Country	Gatherings_restrictions	...	International_movement_restrictions	Population	Recovered	School_closing	Stay_home_restrictions	Stringency_index	Testing_policy	Tests	Transport_closing	Workplace_closing
0	USA	Alabama	Autauga	2020-03-24	-2.0	1.0	1.0	United States	-4.0	...	3.0	55869.0	<NA>	3.0	-2.0	-68.98	1.0	<NA>	-1.0	-3.0
1	USA	Alabama	Autauga	2020-03-25	-2.0	4.0	1.0	United States	-4.0	...	3.0	55869.0	<NA>	3.0	-2.0	-68.98	1.0	<NA>	-1.0	-3.0
2	USA	Alabama	Autauga	2020-03-26	-2.0	6.0	1.0	United States	-4.0	...	3.0	55869.0	<NA>	3.0	-2.0	-68.98	1.0	<NA>	-1.0	-3.0
3	USA	Alabama	Autauga	2020-03-27	-2.0	6.0	1.0	United States	-4.0	...	3.0	55869.0	<NA>	3.0	-2.0	-68.98	1.0	<NA>	-1.0	-3.0
4	USA	Alabama	Autauga	2020-03-28	2.0	6.0	1.0	United States	-4.0	...	3.0	55869.0	<NA>	3.0	-2.0	-73.61	1.0	<NA>	-1.0	3.0

5 rows × 22 columns

Move forward to Tutorial: Data engineering.

1-2. With `DataDownloader` class

DataEngineer class is useful because it has data cleaning methods and so on (explained with Tutorial: Data engineering), but we can use DataDownloader class for data downloading.

[8]:

dl = cs.DataDownloader()
dl_df = dl.layer(country=None, province=None, databases=["japan", "covid19dh", "owid"])

[9]:

# Overview of the records
dl_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 297996 entries, 0 to 297995
Data columns (total 27 columns):
 #   Column                               Non-Null Count   Dtype
---  ------                               --------------   -----
 0   ISO3                                 297996 non-null  string
 1   Province                             297996 non-null  string
 2   City                                 297996 non-null  string
 3   Date                                 297996 non-null  datetime64[ns]
 4   Cancel_events                        197839 non-null  Int64
 5   Confirmed                            239373 non-null  Float64
 6   Contact_tracing                      197865 non-null  Int64
 7   Country                              287329 non-null  string
 8   Fatal                                222176 non-null  Float64
 9   Gatherings_restrictions              197839 non-null  Int64
 10  Information_campaigns                197865 non-null  Int64
 11  Internal_movement_restrictions       197865 non-null  Int64
 12  International_movement_restrictions  197872 non-null  Int64
 13  Population                           286167 non-null  Int64
 14  Product                              176264 non-null  string
 15  Recovered                            74361 non-null   Float64
 16  School_closing                       197864 non-null  Int64
 17  Stay_home_restrictions               197833 non-null  Int64
 18  Stringency_index                     197828 non-null  Float64
 19  Testing_policy                       197865 non-null  Int64
 20  Tests                                91153 non-null   Float64
 21  Transport_closing                    197845 non-null  Int64
 22  Vaccinated_full                      59857 non-null   Float64
 23  Vaccinated_once                      62933 non-null   Float64
 24  Vaccinations                         66925 non-null   Float64
 25  Vaccinations_boosters                38050 non-null   Float64
 26  Workplace_closing                    197864 non-null  Int64
dtypes: Float64(9), Int64(12), datetime64[ns](1), string(5)
memory usage: 67.4 MB

Note that ISO3/Province/City columns have string data instead of categorical data.

[10]:

# Citations
print("\n".join(dl.citations()))

Hirokazu Takaya (2020-2024), COVID-19 dataset in Japan, GitHub repository, https://github.com/lisphilar/covid19-sir/data/japan
Guidotti, E., Ardia, D., (2020), "COVID-19 Data Hub", Journal of Open Source Software 5(51):2376, doi: 10.21105/joss.02376.
Hasell, J., Mathieu, E., Beltekian, D. et al. A cross-country database of COVID-19 testing. Sci Data 7, 345 (2020). https: //doi.org/10.1038/s41597-020-00688-8

Acknowledgement

In Feb2020, CovsirPhy project started in Kaggle platform with COVID-19 data with SIR model notebook by Hirokazu Takaya helped by Kagglers using the following datasets.

The number of cases (JHU) and linelist: Novel Corona Virus 2019 Dataset by SRK
Population in each country: covid19 global forecasting: locations population by Dmitry A. Grechka
The number of cases in Japan: COVID-19 dataset in Japan by Lisphilar

The current version of covsirphy does not have interfaces to use the datasets in Kaggle because they are not updated at this time. However, we could not have done CovsirPhy project without their supports. Thank you!!

2. Reading `pandas.DataFrame`

We may need to use our own datasets for analysis because the dataset is not included in the recommended data servers. DataEngineer().register() registers new datasets of pandas.DataFrame format.

2-1. Retrieve Monkeypox line list

At first, we will prepare the new dataset as pandas.DataFrame. We will use Global.health Monkeypox under CC BY 4.0 license, line list data regarding Monkeypox 2022, just for demonstration.

[11]:

today = date.today()
mp_cite = f"Global.health Monkeypox (accessed on {today.strftime('%Y-%m-%d')}):\n" \
    "Kraemer, Tegally, Pigott, Dasgupta, Sheldon, Wilkinson, Schultheiss, et al. " \
    "Tracking the 2022 Monkeypox Outbreak with Epidemiological Data in Real-Time. " \
    "The Lancet Infectious Diseases. https://doi.org/10.1016/S1473-3099(22)00359-0.\n" \
    "European Centre for Disease Prevention and Control/WHO Regional Office for Europe." \
    f" Monkeypox, Joint Epidemiological overview, {today.day} {today.month}, 2022"
print(mp_cite)

Global.health Monkeypox (accessed on 2024-10-03):
Kraemer, Tegally, Pigott, Dasgupta, Sheldon, Wilkinson, Schultheiss, et al. Tracking the 2022 Monkeypox Outbreak with Epidemiological Data in Real-Time. The Lancet Infectious Diseases. https://doi.org/10.1016/S1473-3099(22)00359-0.
European Centre for Disease Prevention and Control/WHO Regional Office for Europe. Monkeypox, Joint Epidemiological overview, 3 10, 2022

Retrieve CSV file with pandas.read_csv(), using Pyarrow as the engine.

[12]:

# Deprecated: Final line list from Global.health, as of 2022-09-22
raw_url = "https://raw.githubusercontent.com/globaldothealth/monkeypox/9035dfb1303c05c380f86a3c1e00f81a1cc4046b/latest_deprecated.csv"
raw = pd.read_csv(raw_url, engine="pyarrow")
raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 69595 entries, 0 to 69594
Data columns (total 36 columns):
 #   Column                   Non-Null Count  Dtype
---  ------                   --------------  -----
 0   ID                       69595 non-null  object
 1   Status                   69595 non-null  object
 2   Location                 52387 non-null  object
 3   City                     1383 non-null   object
 4   Country                  69595 non-null  object
 5   Country_ISO3             69595 non-null  object
 6   Age                      3021 non-null   object
 7   Gender                   2475 non-null   object
 8   Date_onset               77 non-null     object
 9   Date_confirmation        65546 non-null  object
 10  Symptoms                 220 non-null    object
 11  Hospitalised (Y/N/NA)    354 non-null    object
 12  Date_hospitalisation     35 non-null     object
 13  Isolated (Y/N/NA)        493 non-null    object
 14  Date_isolation           16 non-null     object
 15  Outcome                  101 non-null    object
 16  Contact_comment          91 non-null     object
 17  Contact_ID               27 non-null     float64
 18  Contact_location         6 non-null      object
 19  Travel_history (Y/N/NA)  365 non-null    object
 20  Travel_history_entry     42 non-null     object
 21  Travel_history_start     11 non-null     object
 22  Travel_history_location  114 non-null    object
 23  Travel_history_country   99 non-null     object
 24  Genomics_Metadata        24 non-null     object
 25  Confirmation_method      100 non-null    object
 26  Source                   69595 non-null  object
 27  Source_II                8240 non-null   object
 28  Source_III               893 non-null    object
 29  Source_IV                54 non-null     object
 30  Source_V                 0 non-null      float64
 31  Source_VI                0 non-null      float64
 32  Source_VII               0 non-null      float64
 33  Date_entry               69595 non-null  object
 34  Date_death               83 non-null     object
 35  Date_last_modified       69595 non-null  object
dtypes: float64(4), object(32)
memory usage: 19.1+ MB

Review the data.

[13]:

raw.head()

[13]:

	ID	Status	Location	City	Country	Country_ISO3	Age	Gender	Date_onset	Date_confirmation	...	Source	Source_II	Source_III	Source_IV	Source_V	Source_VI	Source_VII	Date_entry	Date_death	Date_last_modified
0	N1	confirmed	Guy's and St Thomas Hospital London	London	England	GBR	None	None	2022-04-29	2022-05-06	...	https://www.gov.uk/government/news/monkeypox-c...	https://www.who.int/emergencies/disease-outbre...	None	None	NaN	NaN	NaN	2022-05-18	None	2022-05-18
1	N2	confirmed	Guy's and St Thomas Hospital London	London	England	GBR	None	None	2022-05-05	2022-05-12	...	https://www.gov.uk/government/news/monkeypox-c...	None	None	None	NaN	NaN	NaN	2022-05-18	None	2022-05-18
2	N3	confirmed	London	London	England	GBR	None	None	2022-04-30	2022-05-13	...	https://www.gov.uk/government/news/monkeypox-c...	None	None	None	NaN	NaN	NaN	2022-05-18	None	2022-05-18
3	N4	confirmed	London	London	England	GBR	None	male	None	2022-05-15	...	https://www.gov.uk/government/news/monkeypox-c...	None	None	None	NaN	NaN	NaN	2022-05-18	None	2022-05-18
4	N5	confirmed	London	London	England	GBR	None	male	None	2022-05-15	...	https://www.gov.uk/government/news/monkeypox-c...	None	None	None	NaN	NaN	NaN	2022-05-18	None	2022-05-18

5 rows × 36 columns

[14]:

pprint(raw.Status.unique())
pprint(raw.Outcome.unique())

array(['confirmed', 'discarded', 'suspected', 'omit_error'], dtype=object)
array([None, 'Recovered', 'Death'], dtype=object)

2-2. Convert line list to the number of cases data

Prepare analyzable data, converting the line list to the number of cases. This step may be skipped because we have datasets with the number of cases.

Prepare PPT (per protocol set) data.

[15]:

date_cols = [
    "Date_onset", "Date_confirmation", "Date_hospitalisation",
    "Date_isolation", "Date_death", "Date_last_modified"
]
cols = ["ID", "Status", "City", "Country_ISO3", "Outcome", *date_cols]
df = raw.loc[:, cols].rename(columns={"Country_ISO3": "ISO3"})
df = df.loc[df["Status"].isin(["confirmed", "suspected"])]

for col in date_cols:
    df[col] = pd.to_datetime(df[col])

df["Date_min"] = df[date_cols].min(axis=1)
df["Date_recovered"] = df[["Outcome", "Date_last_modified"]].apply(
    lambda x: x[1] if x[0] == "Recovered" else pd.NaT, axis=1)
df["City"] = df["City"].replace('', pd.NA).fillna("Unknown")

ppt_df = df.copy()
ppt_df.head()

[15]:

	ID	Status	City	ISO3	Outcome	Date_onset	Date_confirmation	Date_hospitalisation	Date_isolation	Date_death	Date_last_modified	Date_min	Date_recovered
0	N1	confirmed	London	GBR	None	2022-04-29	2022-05-06	2022-05-04	2022-05-04	NaT	2022-05-18	2022-04-29	NaT
1	N2	confirmed	London	GBR	None	2022-05-05	2022-05-12	2022-05-06	2022-05-09	NaT	2022-05-18	2022-05-05	NaT
2	N3	confirmed	London	GBR	None	2022-04-30	2022-05-13	NaT	NaT	NaT	2022-05-18	2022-04-30	NaT
3	N4	confirmed	London	GBR	None	NaT	2022-05-15	NaT	NaT	NaT	2022-05-18	2022-05-15	NaT
4	N5	confirmed	London	GBR	None	NaT	2022-05-15	NaT	NaT	NaT	2022-05-18	2022-05-15	NaT

Calculate daily new confirmed cases.

[16]:

df = ppt_df.rename(columns={"Date_min": "Date"})
series = df.groupby(["ISO3", "City", "Date"])["ID"].count()
series.name = "Confirmed"
c_df = pd.DataFrame(series)
c_df.head()

[16]:

			Confirmed
ISO3	City	Date
ABW	Unknown	2022-08-22	1
		2022-08-29	1
		2022-09-13	1
AND	Unknown	2022-07-25	1
AND	Unknown	2022-07-26	2

Calculate daily new recovered cases.

[17]:

df = ppt_df.rename(columns={"Date_recovered": "Date"})
series = df.groupby(["ISO3", "City", "Date"])["ID"].count()
series.name = "Recovered"
r_df = pd.DataFrame(series)
r_df.head()

[17]:

			Recovered
ISO3	City	Date
AUS	Melbourne	2022-06-30	1
TUR	Unknown	2022-08-05	4

Calculate daily new fatal cases.

[18]:

df = ppt_df.rename(columns={"Date_death": "Date"})
series = df.groupby(["ISO3", "City", "Date"])["ID"].count()
series.name = "Fatal"
f_df = pd.DataFrame(series)
f_df.head()

[18]:

			Fatal
ISO3	City	Date
BEL	Unknown	2022-08-29	1
BRA	Unknown	2022-07-28	1
BRA	Unknown	2022-08-29	1
CAF	Unknown	2022-03-04	2
COD	Unknown	2022-01-16	18

Combine data (cumulative number).

[19]:

df = c_df.combine_first(f_df).combine_first(r_df)
df = df.unstack(level=["ISO3", "City"])
df = df.asfreq("D").fillna(0).cumsum()
df = df.stack(level=["ISO3", "City"]).reorder_levels(["ISO3", "City", "Date"])
df = df.sort_index().reset_index()
all_df_city = df.copy()
all_df_city.head()

[19]:

	ISO3	City	Date
0	ABW	Unknown	2022-01-16
1	ABW	Unknown	2022-01-17
2	ABW	Unknown	2022-01-18
3	ABW	Unknown	2022-01-19
4	ABW	Unknown	2022-01-20

At country level (City = “-”) and city level (City != “-“):

[20]:

df2 = all_df_city.groupby(["ISO3", "Date"], as_index=False).sum()
df2["City"] = "-"
df = pd.concat([df2, all_df_city], axis=0)
df = df.loc[df["City"] != "Unknown"]
all_df = df.convert_dtypes()
all_df

[20]:

	ISO3	Date	City	Confirmed	Fatal	Recovered
0	ABW	2022-01-16	-	0	0	0
1	ABW	2022-01-17	-	0	0	0
2	ABW	2022-01-18	-	0	0	0
3	ABW	2022-01-19	-	0	0	0
4	ABW	2022-01-20	-	0	0	0
...	...	...	...	...	...	...
69245	ZAF	2022-09-18	Johannesburg	1	0	0
69246	ZAF	2022-09-19	Johannesburg	1	0	0
69247	ZAF	2022-09-20	Johannesburg	1	0	0
69248	ZAF	2022-09-21	Johannesburg	1	0	0
69249	ZAF	2022-09-22	Johannesburg	1	0	0

70500 rows × 6 columns

Check data.

[21]:

gis = cs.GIS(layers=["ISO3", "City"], country="ISO3", date="Date")
gis.register(data=all_df, convert_iso3=False);

[22]:

variable = "Confirmed"
gis.choropleth(
    variable=variable, filename=None,
    title=f"Choropleth map (the number of {variable} cases)"
)

2024-10-03 at 13:14:15 | INFO | Retrieving GIS data from Natural Earth https://www.naturalearthdata.com/

[23]:

global_df = gis.subset(geo=None).set_index("Date").astype(np.int64)
global_df.tail()
cs.line_plot(global_df, title="The number of cases (Global)")

2-3. Retrieve total population data

So that we can analyze the data, total population values are necessary (we will confirm this with Tutorial: SIR-derived ODE models later).

Population data at country-level can be retrieved with DataDownloader().layer(databases=["wpp"]) via DataEngineer().register(databases=["wpp"]).

[24]:

# Set layers and specify layer name of country
# (which will be converted to ISO3 code for standardization)
eng = cs.DataEngineer(layers=["ISO3", "City"], country=["ISO3"])
# Download and automated registration of population data
eng.download(databases=["wpp"])
# Specify date range to reduce the memory
date_range = (all_df["Date"].min(), all_df["Date"].max())
eng.clean(kinds=["resample"], date_range=date_range)
# Show all data
display(eng.all())
# Show citations
pprint(eng.citations())

2024-10-03 at 13:14:17 | INFO |  [INFO] 'Province' layer was removed.

	ISO3	City	Date	Population
0	ABW	-	2022-07-01	106445.0
1	AFE	-	2022-07-01	720859132.0
2	AFG	-	2022-07-01	41128771.0
3	AFW	-	2022-07-01	490330870.0
4	AGO	-	2022-07-01	35588987.0
...	...	...	...	...
256	XKX	-	2022-07-01	1768086.0
257	YEM	-	2022-07-01	33696614.0
258	ZAF	-	2022-07-01	59893885.0
259	ZMB	-	2022-07-01	20017675.0
260	ZWE	-	2022-07-01	16320537.0

261 rows × 4 columns

['United Nations, Department of Economic and Social Affairs, Population '
 'Division (2022). World Population Prospects 2022, Online Edition.']

2-4. Register Monkeypox data

Register the Monkeypox data to DataEngineer() instance.

[25]:

eng.register(data=all_df, citations=[mp_cite])
# Show all data
display(eng.all())
# Show citations
pprint(eng.citations())

	ISO3	City	Date	Confirmed	Fatal	Population	Recovered
0	ABW	-	2022-01-16	0	0	<NA>	0
1	ABW	-	2022-01-17	0	0	<NA>	0
2	ABW	-	2022-01-18	0	0	<NA>	0
3	ABW	-	2022-01-19	0	0	<NA>	0
4	ABW	-	2022-01-20	0	0	<NA>	0
...	...	...	...	...	...	...	...
70644	ZMB	-	2022-09-19	1	0	<NA>	0
70645	ZMB	-	2022-09-20	1	0	<NA>	0
70646	ZMB	-	2022-09-21	1	0	<NA>	0
70647	ZMB	-	2022-09-22	1	0	<NA>	0
70648	ZWE	-	2022-07-01	<NA>	<NA>	16320537.0	<NA>

70649 rows × 7 columns

['United Nations, Department of Economic and Social Affairs, Population '
 'Division (2022). World Population Prospects 2022, Online Edition.',
 'Global.health Monkeypox (accessed on 2024-10-03):\n'
 'Kraemer, Tegally, Pigott, Dasgupta, Sheldon, Wilkinson, Schultheiss, et al. '
 'Tracking the 2022 Monkeypox Outbreak with Epidemiological Data in Real-Time. '
 'The Lancet Infectious Diseases. '
 'https://doi.org/10.1016/S1473-3099(22)00359-0.\n'
 'European Centre for Disease Prevention and Control/WHO Regional Office for '
 'Europe. Monkeypox, Joint Epidemiological overview, 3 10, 2022']

Move forward to Tutorial: Data engineering.

3. Generator of sample data with SIR-derived ODE model

CovsirPhy can generate sample data with subclasses of ODEModel and Dynamics class. Refer to the followings.

3.1 Sample data of one-phase ODE model

Regarding ODE models, please refer to Tutorial: SIR-derived ODE models. Here, we will create a sample data with one-phase SIR model and tau value 1440 min, the first date 01Jan2022, the last date 30Jun2022. ODE parameter values are preset.

[26]:

# Create solver with preset
model = cs.SIRModel.from_sample(date_range=("01Jan2022", "30Jun2022"), tau=1440)
# Show settings
pprint(model.settings())

{'date_range': ('01Jan2022', '30Jun2022'),
 'initial_dict': {'Fatal or Recovered': 0,
                  'Infected': 1000,
                  'Susceptible': 999000},
 'param_dict': {'rho': 0.2, 'sigma': 0.075},
 'tau': 1440}

Solve the ODE model with ODEModel().solve() method.

[27]:

one_df = model.solve()
display(one_df.head())
display(one_df.tail())

	Susceptible	Infected	Fatal or Recovered
Date
2022-01-01	999000	1000	0
2022-01-02	998787	1133	80
2022-01-03	998546	1283	170
2022-01-04	998273	1454	273
2022-01-05	997964	1647	389

	Susceptible	Infected	Fatal or Recovered
Date
2022-06-26	88354	750	910895
2022-06-27	88342	708	910950
2022-06-28	88329	669	911002
2022-06-29	88318	632	911050
2022-06-30	88307	596	911096

Plot the time-series data.

[28]:

cs.line_plot(one_df, title=f"Sample data of SIR model {model.settings()['param_dict']}")

3.2 Sample data of multi-phase ODE model

Regarding multi-phase ODE models, please refer to Phase-dependent SIR models. Here, we will create a sample data with two-phase SIR model and tau value 1440 min, the first date 01Jan2022, the last date 30Jun2022.

The 0th phase: 01Jan2022 - 28Feb2022, rho=0.2, sigma=0.075 (preset)

The 1st phase: 01Mar2022 - 30Jun2022, rho=0.4, sigma=0.075

We will use Dynamics class. At first, set the first/date of dynamics and set th 0th phase ODE parameters.

[29]:

dyn = cs.Dynamics.from_sample(model=cs.SIRModel, date_range=("01Jan2022", "30Jun2022"))
# Show summary
dyn.summary()

[29]:

	Start	End	Rt	rho	sigma	1/beta [day]	1/gamma [day]
Phase
0th	2022-01-01	2022-06-30	2.67	0.2	0.075	5	13

Add the 1st phase with Dynamics.register() method.

[30]:

setting_df = dyn.register()
setting_df.loc["01Mar2022": "30Jun2022", ["rho", "sigma"]] = [0.4, 0.075]
setting_df

[30]:

	Susceptible	Infected	Recovered	Fatal	rho	sigma
Date
2022-01-01	999000.0	1000.0	0.0	0.0	0.2	0.075
2022-01-02	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>
2022-01-03	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>
2022-01-04	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>
2022-01-05	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>
...	...	...	...	...	...	...
2022-06-26	<NA>	<NA>	<NA>	<NA>	0.4	0.075
2022-06-27	<NA>	<NA>	<NA>	<NA>	0.4	0.075
2022-06-28	<NA>	<NA>	<NA>	<NA>	0.4	0.075
2022-06-29	<NA>	<NA>	<NA>	<NA>	0.4	0.075
2022-06-30	<NA>	<NA>	<NA>	<NA>	0.4	0.075

181 rows × 6 columns

[31]:

dyn.register(data=setting_df)
# Show summary
dyn.summary()

[31]:

	Start	End	Rt	rho	sigma	1/beta [day]	1/gamma [day]
Phase
0th	2022-01-01	2022-02-28	2.67	0.2	0.075	5	13
1st	2022-03-01	2022-06-30	5.33	0.4	0.075	2	13

Solve the ODE model with Dynamics().simulate() method and plot the time-series data.

[32]:

two_df = dyn.simulate(model_specific=True)
cs.line_plot(two_df, title="Sample data of two-phase SIR model", v=["01Mar2022"])

When we need convert model-specific variables to model-free variables (Susceptible/Infected/Fatal/Recovered), we will set model_specific=False (default). Because R=”Fatal or Recovered” in SIR model, we assume that R=”Recovered” and F = 0.

[33]:

two_df = dyn.simulate(model_specific=False)
cs.line_plot(two_df, title="Sample data of two-phase SIR model with SIRF variables", v=["01Mar2022"])

Actually, observable variables are Population/Confirmed/Infected/Recovered. We can calculate Population and Confirmed as follows.

Confirmed = Infected + Fatal + Recovered
Population = Susceptible + Confirmed

[34]:

real_df = two_df.copy()
real_df["Confirmed"] = real_df[["Infected", "Fatal", "Recovered"]].sum(axis=1)
real_df["Population"] = real_df[["Susceptible", "Confirmed"]].sum(axis=1)
real_df = real_df.loc[:, ["Population", "Confirmed", "Recovered", "Fatal"]]
cs.line_plot(real_df, title="Sample data of two-phase SIR model with observable variables", v=["01Mar2022"])

Thank you!