covsirphy.engineering package

Submodules

covsirphy.engineering.engineer module

class DataEngineer(layers=None, country='ISO3', **kwargs)[source]

Bases: Term

Class for data engineering including loading, cleaning, transforming, complementing, and EDA (exploratory data analysis).

Parameters:

layers (list[str] | None) – list of layers of geographic information or None ([“ISO3”, “Province”, “City”])
country (str) – layer name of countries or None (countries are not included in the layers)
Raises – ValueError: @layers has duplicates
Note – Country level data specified with @country will be stored with ISO3 codes.
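The duplicate check on @layers can be pictured with a minimal plain-Python sketch. The helper name validate_layers and the constant DEFAULT_LAYERS are hypothetical and are not part of covsirphy; only the default layer names and the ValueError come from the description above.

```python
# Hypothetical stand-in for the documented default of @layers
DEFAULT_LAYERS = ["ISO3", "Province", "City"]


def validate_layers(layers=None):
    """Return the layer list, rejecting duplicated layer names."""
    # None falls back to the documented default layers
    resolved = list(DEFAULT_LAYERS) if layers is None else list(layers)
    if len(set(resolved)) != len(resolved):
        raise ValueError(f"@layers has duplicates: {resolved}")
    return resolved


default_layers = validate_layers()
try:
    validate_layers(["ISO3", "Province", "Province"])
    duplicate_rejected = False
except ValueError:
    duplicate_rejected = True
```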

add(columns, new=None, fill_value=0)[source]

Calculate element-wise addition with pandas.DataFrame.sum(axis=1), X1 + X2 + X3 +…

Parameters:

columns (list[str] | str) – columns (or alias) to add
new (str | None) – column name of addition or None (f”{X1}+{X2}+{X3}…”)
fill_value (float | int) – value to fill in NAs

Return type:

Self

Returns:

updated DataEngineer instance
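As an illustration of the arithmetic only, the following plain-Python sketch sums columns row by row. It assumes that NA values (represented here by None) are replaced with fill_value before summing; the function add_columns and the column names are hypothetical, and the library itself works on pandas objects.

```python
def add_columns(rows, columns, new=None, fill_value=0):
    """Return copies of rows with a new key holding the row-wise sum of columns."""
    # Default name follows the documented pattern f"{X1}+{X2}+{X3}..."
    name = new or "+".join(columns)
    result = []
    for row in rows:
        # Assumption: None stands for NA and is replaced with fill_value
        total = sum(fill_value if row.get(col) is None else row[col] for col in columns)
        result.append({**row, name: total})
    return result


# Hypothetical records with one missing value
rows = [
    {"Vaccinated_once": 10, "Vaccinated_full": 5},
    {"Vaccinated_once": None, "Vaccinated_full": 7},
]
summed = add_columns(rows, ["Vaccinated_once", "Vaccinated_full"])
```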

all(variables=None)[source]

Return all available data, converting dtypes with pandas.DataFrame.convert_dtypes().

Parameters:

variables (list[str] | str | None) – list of variables to collect or alias or None (all available variables)

Raises:

NotRegisteredError – No records have been registered yet

Return type:

DataFrame

Returns:

Index

reset index

Columns

columns defined by @layers of DataEngineer()
Date (pandas.Timestamp): observation dates defined by @date of DataEngineer()
the other columns

assign(**kwargs)[source]

Assign a new column with pandas.DataFrame.assign().

Parameters: **kwargs – dict of {str: callable or pandas.Series}
Return type: Self

Note

Refer to documentation of pandas.DataFrame.assign(), https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.assign.html

choropleth(geo, variable, on=None, title='Choropleth map', filename='choropleth.jpg', logscale=True, directory=None, natural_earth=None, **kwargs)[source]

Create choropleth map.

Parameters:

geo (tuple[list[str] | tuple[str] | str | None, ...]) – location names to specify the layer or None (the top level)
variable (str) – variable name to show
on (str | None) – the date, like 22Jan2020, or None (the last date of each location)
title (str) – title of the map
filename (str) – filename to save the figure or None (display)
logscale (bool) – whether to convert the values to log10 scale or not
directory (str | None) – directory to save GeoJSON file of “Natural Earth” GitHub repository or None (the directory of GIS class script)
natural_earth (str | None) – title of GeoJSON file(without extension) of “Natural Earth” GitHub repository or None (automatically determined)
**kwargs –
keyword arguments of the following classes and methods.
- matplotlib.pyplot.savefig(),
- matplotlib.pyplot.legend(), and
- pandas.DataFrame.plot()

Return type:

None

Note

Regarding @geo argument, please refer to covsirphy.GIS.layer().

Note

GeoJSON files are listed in

https://github.com/nvkelso/natural-earth-vector/tree/master/geojson
https://www.naturalearthdata.com/
https://github.com/nvkelso/natural-earth-vector
Natural Earth (Free vector and raster map data at naturalearthdata.com, Public Domain)

citations(variables=None)[source]

Return citation list of the secondary data sources.

Parameters: variables (list[str] | str | None) – list of variables to collect or alias or None (all available variables)
Return type: list[str]
Returns: citations

clean(kinds=None, **kwargs)[source]

Clean all registered data.

Parameters:

kinds (list[str] | None) –
kinds of data cleaning with order or None (all available kinds as follows)
- “convert_date”: Convert dtype of date column to pandas.Timestamp.
- “resample”: Resample records with dates.
- “fillna”: Fill NA values with ‘-’ (layers) and the previous values and 0.
**kwargs – keyword arguments of data cleaning (refer to the notes)

Return type:

Self

Returns:

updated DataEngineer instance

Note

When “convert_date” is included, keyword arguments of pandas.to_datetime() including “dayfirst (bool): whether date format is DD/MM or not” can be used.

Note

When “resample” is included, date_range=<tuple of (str or None, str or None) or None> can be applied as a keyword argument to set the range.
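One possible reading of the “fillna” rule is sketched below in plain Python: numeric values are carried forward from the previous record, leading gaps become 0, and missing layer names become ‘-’. The helper names fill_numeric and fill_layer are hypothetical, None stands in for NA, and the exact order of operations inside the library is an assumption.

```python
def fill_numeric(values):
    """Forward-fill None with the previous value; leading None become 0."""
    filled = []
    last = 0  # used until the first observed value appears
    for value in values:
        if value is not None:
            last = value
        filled.append(last)
    return filled


def fill_layer(name):
    """Replace a missing layer name with the placeholder '-'."""
    return "-" if name is None else name


filled_cases = fill_numeric([None, 3, None, None, 7])
filled_province = fill_layer(None)
```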

diff(column, suffix='_diff', freq='D')[source]

Calculate daily new cases with “f(x>0) = F(x) - F(x-1), f(0) = 0 when F is cumulative numbers”.

Parameters:

column (str) – column name of the cumulative numbers
suffix (str) – suffix of the column (new column name will be ‘{column}{suffix}’)
freq (str) – offset aliases of shifting dates

Return type:

Self

Returns:

updated DataEngineer instance

Note

Regarding @freq, refer to https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases
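The differencing rule can be illustrated with a short plain-Python sketch that assumes one record per day with no gaps, so date shifting with @freq is not modelled. The function name daily_new and the sample series are hypothetical.

```python
def daily_new(cumulative):
    """Convert cumulative counts F into daily new counts f."""
    if not cumulative:
        return []
    # The first value is set to 0; later values are F(x) - F(x-1)
    return [0] + [curr - prev for prev, curr in zip(cumulative, cumulative[1:])]


confirmed = [3, 5, 5, 9]  # hypothetical cumulative series, one value per day
confirmed_diff = daily_new(confirmed)
```

With the default suffix, these values would be stored in a column named ‘Confirmed_diff’.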

div(numerator, denominator, new=None, fill_value=0)[source]

Calculate element-wise floating division with pandas.Series.div(), numerator / denominator.

Parameters:

numerator (str) – numerator column
denominator (str) – denominator column
new (str | None) – column name of floating division or None (f”{numerator}_per_({denominator.replace(’ ‘, ‘_’)})”)
fill_value (float | int) – value to fill in NAs

Return type:

Self

Returns:

updated DataEngineer instance

Note

Positive rate could be calculated with Confirmed / Tested, .div(numerator=”Confirmed”, denominator=”Tested”, new=”Positive_rate”)
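A small plain-Python sketch of the default column name and of the positive-rate calculation follows. The helper default_div_name, the sample record, and the denominator name "Tested per day" are hypothetical; only the naming pattern is taken from the parameter description above.

```python
def default_div_name(numerator, denominator):
    """Build the documented default name for the quotient column."""
    return f"{numerator}_per_({denominator.replace(' ', '_')})"


# Hypothetical single record
row = {"Confirmed": 50, "Tested": 1000}
positive_rate = row["Confirmed"] / row["Tested"]  # floating division

# Spaces in the denominator name are replaced with underscores
name = default_div_name("Confirmed", "Tested per day")
```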

download(**kwargs)[source]

Download datasets from the recommended data servers using covsirphy.DataDownloader.

Parameters: **kwargs – keyword arguments of covsirphy.DataDownloader() and covsirphy.DataDownloader.layer()
Return type: Self
Returns: updated DataEngineer instance

inverse_transform()[source]

Perform inverse transformation, calculating total population and confirmed.

Return type: Self
Returns: updated DataEngineer instance

Note

Population = Susceptible + Confirmed
Confirmed = Infected + Fatal + Recovered
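The two relations in the note can be checked on a single record with a minimal plain-Python sketch. The function inverse_transform_record and the sample values are hypothetical; the library applies the same relations to all registered data.

```python
def inverse_transform_record(record):
    """Recover Confirmed and Population from S, I, F, R values."""
    # Confirmed = Infected + Fatal + Recovered
    confirmed = record["Infected"] + record["Fatal"] + record["Recovered"]
    # Population = Susceptible + Confirmed
    population = record["Susceptible"] + confirmed
    return {**record, "Confirmed": confirmed, "Population": population}


# Hypothetical record after transformation
sirf = {"Susceptible": 880, "Infected": 35, "Fatal": 5, "Recovered": 80}
restored = inverse_transform_record(sirf)
```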

layer(geo=None, start_date=None, end_date=None, variables=None)[source]

Return the data at the selected layer in the date range.

Parameters:

geo (tuple[list[str] | tuple[str] | str | None, ...]) – location names to specify the layer or None (the top level)
start_date (str | None) – start date, like 22Jan2020
end_date (str | None) – end date, like 01Feb2020
variables (list[str] | None) – list of variables to add or None (all available columns)

Raises:

TypeError – @geo has un-expected types
ValueError – the length of @geo is larger than the length of layers
NotRegisteredError – No records have been registered at the layer yet

Return type:

DataFrame

Returns:

Index

reset index

Columns

(str): columns defined by covsirphy.GIS(layers)
Date (pandas.Timestamp): observation dates
columns defined by @variables

Note

Note that records with NAs as country names will always be removed.

Note

Regarding @geo argument, please refer to covsirphy.GIS.layer().

mul(columns, new=None, fill_value=0)[source]

Calculate element-wise multiplication with pandas.DataFrame.product(axis=1), X1 * X2 * X3 *…

Parameters:

columns (list[str] | str) – columns (or alias) to multiply
new (str | None) – column name of multiplication or None (f”{X1}*{X2}*{X3}…”)
fill_value (float | int) – value to fill in NAs

Return type:

Self

Returns:

updated DataEngineer instance

classmethod recovery_period(data)[source]

Calculate mode value of recovery period of the data.

Parameters:

data (DataFrame) –

data for calculation

Index

Date (pandas.Timestamp): observation dates

Columns

Confirmed (int): the number of confirmed cases, optional
Fatal (int): the number of fatal cases, optional
Recovered (int): the number of recovered cases, optional
the other columns will be ignored

Return type:

int

Returns:

mode value of recovery period [days]

register(data, citations=None, **kwargs)[source]

Register new data.

Parameters:

data (DataFrame) –
new data

Index
reset index
Columns
- columns defined by covsirphy.DataEngineer(layers)
- Date (pandas.Timestamp): observation dates
- Population (int): total population, optional
- Tests (int): column of the number of tests, optional
- Confirmed (int): the number of confirmed cases, optional
- Fatal (int): the number of fatal cases, optional
- Recovered (int): the number of recovered cases, optional
- the other columns will be also registered
citations (list[str] | str | None) – citations of the dataset or None ([“my own dataset”])
**kwargs – keyword arguments of pandas.to_datetime() including “dayfirst (bool): whether date format is DD/MM or not”

Return type:

Self

Returns:

updated DataEngineer instance

sub(minuend, subtrahend, new=None, fill_value=0)[source]

Calculate element-wise subtraction with pandas.Series.sub(), minuend - subtrahend.

Parameters:

minuend (str) – minuend column
subtrahend (str) – subtrahend column
new (str | None) – column name of subtraction or None (f”{minuend}-{subtrahend}”)
fill_value (float | int) – value to fill in NAs

Return type:

Self

Returns:

updated DataEngineer instance

subset(geo=None, start_date=None, end_date=None, variables=None, complement=True, get_dummies=True, **kwargs)[source]

Return subset of the location and date range.

Parameters:

geo (tuple[list[str] | tuple[str] | str | None, ...]) – location names to filter or None (total at the top level)
start_date (str | None) – start date, like 22Jan2020
end_date (str | None) – end date, like 01Feb2020
variables (list[str] | str | None) – list of variables to add or None (all available columns)
complement (bool) – whether to perform data complement or not, True as default
get_dummies (bool) – whether to convert categorical variables into dummy variables or not, True as default
**kwargs –
keyword arguments for complement and default values
- recovery_period (int): expected value of recovery period[days], 17
- interval (int): expected update interval of the number of recovered cases[days], 2
- max_ignored (int): Max number of recovered cases to be ignored[cases], 100
- max_ending_unupdated (int): Max number of days to apply full complement, where max recovered cases are not updated[days], 14
- upper_limit_days (int): maximum number of valid partial recovery periods[days], 90
- lower_limit_days (int): minimum number of valid partial recovery periods[days], 7
- upper_percentage (float): fraction of partial recovery periods with value greater than upper_limit_days, 0.5
- lower_percentage (float): fraction of partial recovery periods with value less than lower_limit_days, 0.5

Return type:

tuple[DataFrame, str, dict[str, bool]]

Returns:

pandas.DataFrame

Index
Date (pandas.Timestamp): observation dates

Columns
- Population (int): total population
- Tests (int): column of the number of tests
- Confirmed (int): the number of confirmed cases
- Fatal (int): the number of fatal cases
- Recovered (int): the number of recovered cases
- the other columns registered
str: status code: will be selected from
- ’’ (not complemented)
- ’monotonic increasing complemented confirmed data’
- ’monotonic increasing complemented fatal data’
- ’monotonic increasing complemented recovered data’
- ’fully complemented recovered data’
- ’partially complemented recovered data’
dict[str, bool]: status for each complement type, keys are
- Monotonic_confirmed
- Monotonic_fatal
- Monotonic_recovered
- Full_recovered
- Partial_recovered

Note

Regarding @geo argument, please refer to covsirphy.GIS.subset().

Note

Re-calculation of Susceptible and Infected will be done automatically.

subset_alias(alias=None, update=False, **kwargs)[source]

Set/get/list-up alias name(s) of subset.

Parameters:

alias (str | None) – alias name or None (list-up alias names)
update (bool) – force updating the alias when @alias is not None
**kwargs – keyword arguments of covsirphy.DataEngineer().subset()

Return type:

tuple[DataFrame, str, dict[str, bool]] | dict[str, tuple[DataFrame, str, dict[str, bool]]]

Returns:

- tuple[pandas.DataFrame, str, dict]: when @alias is not None, the subset of the alias
- dict[str, tuple[pandas.DataFrame, str, dict]]: when @alias is None, dictionary of aliases and subsets

Note

When the alias name is a new one, the subset will be registered with covsirphy.DataEngineer.subset(**kwargs).

transform()[source]

Transform all registered data, calculating the number of susceptible and infected cases.

Return type: Self
Returns: updated DataEngineer instance

Note

Susceptible = Population - Confirmed
Infected = Confirmed - Fatal - Recovered
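For a single record, the two relations in the note amount to the following plain-Python sketch. The function transform_record and the sample values are hypothetical and only illustrate the arithmetic.

```python
def transform_record(record):
    """Compute Susceptible and Infected from the registered counts."""
    # Susceptible = Population - Confirmed
    susceptible = record["Population"] - record["Confirmed"]
    # Infected = Confirmed - Fatal - Recovered
    infected = record["Confirmed"] - record["Fatal"] - record["Recovered"]
    return {**record, "Susceptible": susceptible, "Infected": infected}


# Hypothetical registered record
registered = {"Population": 1000, "Confirmed": 120, "Fatal": 5, "Recovered": 80}
transformed = transform_record(registered)
```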

variables_alias(alias=None, variables=None)[source]

Set/get/list-up alias name(s) of variables.

Parameters:

alias (str | None) – alias name or None (list-up alias names)
variables (list[str] | None) – variables to register with the alias

Raises:

NotIncludedError – the alias is not None and unregistered

Return type:

list[str] | dict[str, list[str]]

Returns:

- list[str]: when @alias is not None, the variables of the alias
- dict[str, list[str]]: when @alias is None, dictionary of aliases and variables

Note

When @variables is not None, alias will be registered/updated.

Note

Some aliases are preset. We can check them with covsirphy.DataEngineer().variables_alias().
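The set/get/list-up behaviour described above can be modelled with a small dictionary-backed sketch. The class AliasRegistry, the local NotIncludedError stand-in, and the alias name "my_cases" are hypothetical; returning the variables right after registration is also an assumption rather than confirmed library behaviour.

```python
class NotIncludedError(KeyError):
    """Local stand-in for the library's error of the same name."""


class AliasRegistry:
    """Dictionary-backed model of set/get/list-up for variable aliases."""

    def __init__(self):
        self._aliases = {}

    def variables_alias(self, alias=None, variables=None):
        if alias is None:
            # List-up mode: return a copy of all aliases
            return dict(self._aliases)
        if variables is not None:
            # Set mode: register or update the alias
            self._aliases[alias] = list(variables)
        if alias not in self._aliases:
            raise NotIncludedError(alias)
        # Get mode: return the variables of the alias
        return self._aliases[alias]


registry = AliasRegistry()
registry.variables_alias(alias="my_cases", variables=["Confirmed", "Fatal"])
looked_up = registry.variables_alias(alias="my_cases")
listing = registry.variables_alias()
```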