Binder

Data preparation

The first step of data science is data preparation. covsirphy has the following three functionality for that.

  1. Downloading datasets from recommended data servers

  2. Reading pandas.DataFrame

  3. Generator of sample data with SIR-derived ODE model

[1]:
from datetime import date
from pprint import pprint
import numpy as np
import pandas as pd
import covsirphy as cs
cs.__version__
[1]:
'3.1.1'

From version 2.28.0, some classes and methods output the following information using loguru: library which aims to bring enjoyable logging in Python. We can change logging level with cs.config(level=2) (as default).

  • level=0: errors (exceptions)

  • level=1: errors, warnings

  • level=2: errors, warnings, info (start downloading/optimization)

  • level=3: errors, warnings, info, debug

[2]:
cs.config.logger(level=2)

2. Reading pandas.DataFrame

We may need to use our own datasets for analysis because the dataset is not included in the recommended data servers. DataEngineer().register() registers new datasets of pandas.DataFrame format.

2-1. Retrieve Monkeypox line list

At first, we will prepare the new dataset as pandas.DataFrame. We will use Global.health Monkeypox under CC BY 4.0 license, line list data regarding Monkeypox 2022, just for demonstration.

[11]:
today = date.today()
mp_cite = f"Global.health Monkeypox (accessed on {today.strftime('%Y-%m-%d')}):\n" \
    "Kraemer, Tegally, Pigott, Dasgupta, Sheldon, Wilkinson, Schultheiss, et al. " \
    "Tracking the 2022 Monkeypox Outbreak with Epidemiological Data in Real-Time. " \
    "The Lancet Infectious Diseases. https://doi.org/10.1016/S1473-3099(22)00359-0.\n" \
    "European Centre for Disease Prevention and Control/WHO Regional Office for Europe." \
    f" Monkeypox, Joint Epidemiological overview, {today.day} {today.month}, 2022"
print(mp_cite)
Global.health Monkeypox (accessed on 2024-04-27):
Kraemer, Tegally, Pigott, Dasgupta, Sheldon, Wilkinson, Schultheiss, et al. Tracking the 2022 Monkeypox Outbreak with Epidemiological Data in Real-Time. The Lancet Infectious Diseases. https://doi.org/10.1016/S1473-3099(22)00359-0.
European Centre for Disease Prevention and Control/WHO Regional Office for Europe. Monkeypox, Joint Epidemiological overview, 27 4, 2022

Retrieve CSV file with pandas.read_csv(), using Pyarrow as the engine.

[12]:
# Deprecated: Final line list from Global.health, as of 2022-09-22
raw_url = "https://raw.githubusercontent.com/globaldothealth/monkeypox/9035dfb1303c05c380f86a3c1e00f81a1cc4046b/latest_deprecated.csv"
raw = pd.read_csv(raw_url, engine="pyarrow")
raw.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 69595 entries, 0 to 69594
Data columns (total 36 columns):
 #   Column                   Non-Null Count  Dtype
---  ------                   --------------  -----
 0   ID                       69595 non-null  object
 1   Status                   69595 non-null  object
 2   Location                 52387 non-null  object
 3   City                     1383 non-null   object
 4   Country                  69595 non-null  object
 5   Country_ISO3             69595 non-null  object
 6   Age                      3021 non-null   object
 7   Gender                   2475 non-null   object
 8   Date_onset               77 non-null     object
 9   Date_confirmation        65546 non-null  object
 10  Symptoms                 220 non-null    object
 11  Hospitalised (Y/N/NA)    354 non-null    object
 12  Date_hospitalisation     35 non-null     object
 13  Isolated (Y/N/NA)        493 non-null    object
 14  Date_isolation           16 non-null     object
 15  Outcome                  101 non-null    object
 16  Contact_comment          91 non-null     object
 17  Contact_ID               27 non-null     float64
 18  Contact_location         6 non-null      object
 19  Travel_history (Y/N/NA)  365 non-null    object
 20  Travel_history_entry     42 non-null     object
 21  Travel_history_start     11 non-null     object
 22  Travel_history_location  114 non-null    object
 23  Travel_history_country   99 non-null     object
 24  Genomics_Metadata        24 non-null     object
 25  Confirmation_method      100 non-null    object
 26  Source                   69595 non-null  object
 27  Source_II                8240 non-null   object
 28  Source_III               893 non-null    object
 29  Source_IV                54 non-null     object
 30  Source_V                 0 non-null      float64
 31  Source_VI                0 non-null      float64
 32  Source_VII               0 non-null      float64
 33  Date_entry               69595 non-null  object
 34  Date_death               83 non-null     object
 35  Date_last_modified       69595 non-null  object
dtypes: float64(4), object(32)
memory usage: 19.1+ MB

Review the data.

[13]:
raw.head()
[13]:
ID Status Location City Country Country_ISO3 Age Gender Date_onset Date_confirmation ... Source Source_II Source_III Source_IV Source_V Source_VI Source_VII Date_entry Date_death Date_last_modified
0 N1 confirmed Guy's and St Thomas Hospital London London England GBR None None 2022-04-29 2022-05-06 ... https://www.gov.uk/government/news/monkeypox-c... https://www.who.int/emergencies/disease-outbre... None None NaN NaN NaN 2022-05-18 None 2022-05-18
1 N2 confirmed Guy's and St Thomas Hospital London London England GBR None None 2022-05-05 2022-05-12 ... https://www.gov.uk/government/news/monkeypox-c... None None None NaN NaN NaN 2022-05-18 None 2022-05-18
2 N3 confirmed London London England GBR None None 2022-04-30 2022-05-13 ... https://www.gov.uk/government/news/monkeypox-c... None None None NaN NaN NaN 2022-05-18 None 2022-05-18
3 N4 confirmed London London England GBR None male None 2022-05-15 ... https://www.gov.uk/government/news/monkeypox-c... None None None NaN NaN NaN 2022-05-18 None 2022-05-18
4 N5 confirmed London London England GBR None male None 2022-05-15 ... https://www.gov.uk/government/news/monkeypox-c... None None None NaN NaN NaN 2022-05-18 None 2022-05-18

5 rows × 36 columns

[14]:
pprint(raw.Status.unique())
pprint(raw.Outcome.unique())
array(['confirmed', 'discarded', 'suspected', 'omit_error'], dtype=object)
array([None, 'Recovered', 'Death'], dtype=object)

2-2. Convert line list to the number of cases data

Prepare analyzable data, converting the line list to the number of cases. This step may be skipped because we have datasets with the number of cases.

Prepare PPT (per protocol set) data.

[15]:
date_cols = [
    "Date_onset", "Date_confirmation", "Date_hospitalisation",
    "Date_isolation", "Date_death", "Date_last_modified"
]
cols = ["ID", "Status", "City", "Country_ISO3", "Outcome", *date_cols]
df = raw.loc[:, cols].rename(columns={"Country_ISO3": "ISO3"})
df = df.loc[df["Status"].isin(["confirmed", "suspected"])]

for col in date_cols:
    df[col] = pd.to_datetime(df[col])

df["Date_min"] = df[date_cols].min(axis=1)
df["Date_recovered"] = df[["Outcome", "Date_last_modified"]].apply(
    lambda x: x[1] if x[0] == "Recovered" else pd.NaT, axis=1)
df["City"] = df["City"].replace('', pd.NA).fillna("Unknown")

ppt_df = df.copy()
ppt_df.head()
[15]:
ID Status City ISO3 Outcome Date_onset Date_confirmation Date_hospitalisation Date_isolation Date_death Date_last_modified Date_min Date_recovered
0 N1 confirmed London GBR None 2022-04-29 2022-05-06 2022-05-04 2022-05-04 NaT 2022-05-18 2022-04-29 NaT
1 N2 confirmed London GBR None 2022-05-05 2022-05-12 2022-05-06 2022-05-09 NaT 2022-05-18 2022-05-05 NaT
2 N3 confirmed London GBR None 2022-04-30 2022-05-13 NaT NaT NaT 2022-05-18 2022-04-30 NaT
3 N4 confirmed London GBR None NaT 2022-05-15 NaT NaT NaT 2022-05-18 2022-05-15 NaT
4 N5 confirmed London GBR None NaT 2022-05-15 NaT NaT NaT 2022-05-18 2022-05-15 NaT

Calculate daily new confirmed cases.

[16]:
df = ppt_df.rename(columns={"Date_min": "Date"})
series = df.groupby(["ISO3", "City", "Date"])["ID"].count()
series.name = "Confirmed"
c_df = pd.DataFrame(series)
c_df.head()
[16]:
Confirmed
ISO3 City Date
ABW Unknown 2022-08-22 1
2022-08-29 1
2022-09-13 1
AND Unknown 2022-07-25 1
2022-07-26 2

Calculate daily new recovered cases.

[17]:
df = ppt_df.rename(columns={"Date_recovered": "Date"})
series = df.groupby(["ISO3", "City", "Date"])["ID"].count()
series.name = "Recovered"
r_df = pd.DataFrame(series)
r_df.head()
[17]:
Recovered
ISO3 City Date
AUS Melbourne 2022-06-30 1
TUR Unknown 2022-08-05 4

Calculate daily new fatal cases.

[18]:
df = ppt_df.rename(columns={"Date_death": "Date"})
series = df.groupby(["ISO3", "City", "Date"])["ID"].count()
series.name = "Fatal"
f_df = pd.DataFrame(series)
f_df.head()
[18]:
Fatal
ISO3 City Date
BEL Unknown 2022-08-29 1
BRA Unknown 2022-07-28 1
2022-08-29 1
CAF Unknown 2022-03-04 2
COD Unknown 2022-01-16 18

Combine data (cumulative number).

[19]:
df = c_df.combine_first(f_df).combine_first(r_df)
df = df.unstack(level=["ISO3", "City"])
df = df.asfreq("D").fillna(0).cumsum()
df = df.stack(level=["ISO3", "City"]).reorder_levels(["ISO3", "City", "Date"])
df = df.sort_index().reset_index()
all_df_city = df.copy()
all_df_city.head()
[19]:
ISO3 City Date Confirmed Fatal Recovered
0 ABW Unknown 2022-01-16 0.0 0.0 0.0
1 ABW Unknown 2022-01-17 0.0 0.0 0.0
2 ABW Unknown 2022-01-18 0.0 0.0 0.0
3 ABW Unknown 2022-01-19 0.0 0.0 0.0
4 ABW Unknown 2022-01-20 0.0 0.0 0.0

At country level (City = “-”) and city level (City != “-“):

[20]:
df2 = all_df_city.groupby(["ISO3", "Date"], as_index=False).sum()
df2["City"] = "-"
df = pd.concat([df2, all_df_city], axis=0)
df = df.loc[df["City"] != "Unknown"]
all_df = df.convert_dtypes()
all_df
[20]:
ISO3 Date City Confirmed Fatal Recovered
0 ABW 2022-01-16 - 0 0 0
1 ABW 2022-01-17 - 0 0 0
2 ABW 2022-01-18 - 0 0 0
3 ABW 2022-01-19 - 0 0 0
4 ABW 2022-01-20 - 0 0 0
... ... ... ... ... ... ...
69245 ZAF 2022-09-18 Johannesburg 1 0 0
69246 ZAF 2022-09-19 Johannesburg 1 0 0
69247 ZAF 2022-09-20 Johannesburg 1 0 0
69248 ZAF 2022-09-21 Johannesburg 1 0 0
69249 ZAF 2022-09-22 Johannesburg 1 0 0

70500 rows × 6 columns

Check data.

[21]:
gis = cs.GIS(layers=["ISO3", "City"], country="ISO3", date="Date")
gis.register(data=all_df, convert_iso3=False);
[22]:
variable = "Confirmed"
gis.choropleth(
    variable=variable, filename=None,
    title=f"Choropleth map (the number of {variable} cases)"
)
2024-04-27 at 02:43:38 | INFO | Retrieving GIS data from Natural Earth https://www.naturalearthdata.com/
_images/01_data_preparation_47_1.png
[23]:
global_df = gis.subset(geo=None).set_index("Date").astype(np.int64)
global_df.tail()
cs.line_plot(global_df, title="The number of cases (Global)")
_images/01_data_preparation_48_0.png

2-3. Retrieve total population data

So that we can analyze the data, total population values are necessary (we will confirm this with Tutorial: SIR-derived ODE models later).

Population data at country-level can be retrieved with DataDownloader().layer(databases=["wpp"]) via DataEngineer().register(databases=["wpp"]).

[24]:
# Set layers and specify layer name of country
# (which will be converted to ISO3 code for standardization)
eng = cs.DataEngineer(layers=["ISO3", "City"], country=["ISO3"])
# Download and automated registration of population data
eng.download(databases=["wpp"])
# Specify date range to reduce the memory
date_range = (all_df["Date"].min(), all_df["Date"].max())
eng.clean(kinds=["resample"], date_range=date_range)
# Show all data
display(eng.all())
# Show citations
pprint(eng.citations())
2024-04-27 at 02:43:41 | INFO | Retrieving datasets from World Population Prospects https://population.un.org/wpp/
2024-04-27 at 02:43:56 | INFO |  [INFO] 'Province' layer was removed.
ISO3 City Date Population
0 ABW - 2022-07-01 106445.0
1 AIA - 2022-07-01 15857.0
2 ASM - 2022-07-01 44273.0
3 AUS - 2022-07-01 26177413.0
4 AZE - 2022-07-01 10358074.0
... ... ... ... ...
57 USA - 2022-07-01 338289857.0
58 VGB - 2022-07-01 31305.0
59 VIR - 2022-07-01 99465.0
60 WLF - 2022-07-01 11572.0
61 XKX - 2022-07-01 1659714.0

62 rows × 4 columns

['United Nations, Department of Economic and Social Affairs, Population '
 'Division (2022). World Population Prospects 2022, Online Edition.']

2-4. Register Monkeypox data

Register the Monkeypox data to DataEngineer() instance.

[25]:
eng.register(data=all_df, citations=[mp_cite])
# Show all data
display(eng.all())
# Show citations
pprint(eng.citations())
ISO3 City Date Confirmed Fatal Population Recovered
0 ABW - 2022-01-16 0 0 <NA> 0
1 ABW - 2022-01-17 0 0 <NA> 0
2 ABW - 2022-01-18 0 0 <NA> 0
3 ABW - 2022-01-19 0 0 <NA> 0
4 ABW - 2022-01-20 0 0 <NA> 0
... ... ... ... ... ... ... ...
70529 ZMB - 2022-09-18 1 0 <NA> 0
70530 ZMB - 2022-09-19 1 0 <NA> 0
70531 ZMB - 2022-09-20 1 0 <NA> 0
70532 ZMB - 2022-09-21 1 0 <NA> 0
70533 ZMB - 2022-09-22 1 0 <NA> 0

70534 rows × 7 columns

['United Nations, Department of Economic and Social Affairs, Population '
 'Division (2022). World Population Prospects 2022, Online Edition.',
 'Global.health Monkeypox (accessed on 2024-04-27):\n'
 'Kraemer, Tegally, Pigott, Dasgupta, Sheldon, Wilkinson, Schultheiss, et al. '
 'Tracking the 2022 Monkeypox Outbreak with Epidemiological Data in Real-Time. '
 'The Lancet Infectious Diseases. '
 'https://doi.org/10.1016/S1473-3099(22)00359-0.\n'
 'European Centre for Disease Prevention and Control/WHO Regional Office for '
 'Europe. Monkeypox, Joint Epidemiological overview, 27 4, 2022']

Move forward to Tutorial: Data engineering.

3. Generator of sample data with SIR-derived ODE model

CovsirPhy can generate sample data with subclasses of ODEModel and Dynamics class. Refer to the followings.

3.1 Sample data of one-phase ODE model

Regarding ODE models, please refer to Tutorial: SIR-derived ODE models. Here, we will create a sample data with one-phase SIR model and tau value 1440 min, the first date 01Jan2022, the last date 30Jun2022. ODE parameter values are preset.

[26]:
# Create solver with preset
model = cs.SIRModel.from_sample(date_range=("01Jan2022", "30Jun2022"), tau=1440)
# Show settings
pprint(model.settings())
{'date_range': ('01Jan2022', '30Jun2022'),
 'initial_dict': {'Fatal or Recovered': 0,
                  'Infected': 1000,
                  'Susceptible': 999000},
 'param_dict': {'rho': 0.2, 'sigma': 0.075},
 'tau': 1440}

Solve the ODE model with ODEModel().solve() method.

[27]:
one_df = model.solve()
display(one_df.head())
display(one_df.tail())
Susceptible Infected Fatal or Recovered
Date
2022-01-01 999000 1000 0
2022-01-02 998787 1133 80
2022-01-03 998546 1283 170
2022-01-04 998273 1454 273
2022-01-05 997964 1647 389
Susceptible Infected Fatal or Recovered
Date
2022-06-26 88354 750 910895
2022-06-27 88342 708 910950
2022-06-28 88329 669 911002
2022-06-29 88318 632 911050
2022-06-30 88307 596 911096

Plot the time-series data.

[28]:
cs.line_plot(one_df, title=f"Sample data of SIR model {model.settings()['param_dict']}")
_images/01_data_preparation_60_0.png

3.2 Sample data of multi-phase ODE model

Regarding multi-phase ODE models, please refer to Phase-dependent SIR models. Here, we will create a sample data with two-phase SIR model and tau value 1440 min, the first date 01Jan2022, the last date 30Jun2022.

The 0th phase: 01Jan2022 - 28Feb2022, rho=0.2, sigma=0.075 (preset)
The 1st phase: 01Mar2022 - 30Jun2022, rho=0.4, sigma=0.075

We will use Dynamics class. At first, set the first/date of dynamics and set th 0th phase ODE parameters.

[29]:
dyn = cs.Dynamics.from_sample(model=cs.SIRModel, date_range=("01Jan2022", "30Jun2022"))
# Show summary
dyn.summary()
[29]:
Start End Rt rho sigma 1/beta [day] 1/gamma [day]
Phase
0th 2022-01-01 2022-06-30 2.67 0.2 0.075 5 13

Add the 1st phase with Dynamics.register() method.

[30]:
setting_df = dyn.register()
setting_df.loc["01Mar2022": "30Jun2022", ["rho", "sigma"]] = [0.4, 0.075]
setting_df
[30]:
Susceptible Infected Recovered Fatal rho sigma
Date
2022-01-01 999000.0 1000.0 0.0 0.0 0.2 0.075
2022-01-02 <NA> <NA> <NA> <NA> <NA> <NA>
2022-01-03 <NA> <NA> <NA> <NA> <NA> <NA>
2022-01-04 <NA> <NA> <NA> <NA> <NA> <NA>
2022-01-05 <NA> <NA> <NA> <NA> <NA> <NA>
... ... ... ... ... ... ...
2022-06-26 <NA> <NA> <NA> <NA> 0.4 0.075
2022-06-27 <NA> <NA> <NA> <NA> 0.4 0.075
2022-06-28 <NA> <NA> <NA> <NA> 0.4 0.075
2022-06-29 <NA> <NA> <NA> <NA> 0.4 0.075
2022-06-30 <NA> <NA> <NA> <NA> 0.4 0.075

181 rows × 6 columns

[31]:
dyn.register(data=setting_df)
# Show summary
dyn.summary()
[31]:
Start End Rt rho sigma 1/beta [day] 1/gamma [day]
Phase
0th 2022-01-01 2022-02-28 2.67 0.2 0.075 5 13
1st 2022-03-01 2022-06-30 5.33 0.4 0.075 2 13

Solve the ODE model with Dynamics().simulate() method and plot the time-series data.

[32]:
two_df = dyn.simulate(model_specific=True)
cs.line_plot(two_df, title="Sample data of two-phase SIR model", v=["01Mar2022"])
_images/01_data_preparation_67_0.png

When we need convert model-specific variables to model-free variables (Susceptible/Infected/Fatal/Recovered), we will set model_specific=False (default). Because R=”Fatal or Recovered” in SIR model, we assume that R=”Recovered” and F = 0.

[33]:
two_df = dyn.simulate(model_specific=False)
cs.line_plot(two_df, title="Sample data of two-phase SIR model with SIRF variables", v=["01Mar2022"])
_images/01_data_preparation_69_0.png

Actually, observable variables are Population/Confirmed/Infected/Recovered. We can calculate Population and Confirmed as follows.

  • Confirmed = Infected + Fatal + Recovered

  • Population = Susceptible + Confirmed

[34]:
real_df = two_df.copy()
real_df["Confirmed"] = real_df[["Infected", "Fatal", "Recovered"]].sum(axis=1)
real_df["Population"] = real_df[["Susceptible", "Confirmed"]].sum(axis=1)
real_df = real_df.loc[:, ["Population", "Confirmed", "Recovered", "Fatal"]]
cs.line_plot(real_df, title="Sample data of two-phase SIR model with observable variables", v=["01Mar2022"])
_images/01_data_preparation_71_0.png

Thank you!