covsirphy.engineering package

covsirphy.engineering.engineer module

class DataEngineer(layers=None, country='ISO3', **kwargs)[source]

Bases: Term

Class for data engineering, including loading, cleaning, transforming, complementing, and EDA (exploratory data analysis).

Parameters:
  • layers (list[str] | None) – list of layers of geographic information or None ([“ISO3”, “Province”, “City”])

  • country (str) – layer name of countries or None (countries are not included in the layers)

Raises:

ValueError – @layers has duplicates

Note

Country level data specified with @country will be stored with ISO3 codes.

add(columns, new=None, fill_value=0)[source]

Calculate element-wise addition with pandas.DataFrame.sum(axis=1), X1 + X2 + X3 +…

Parameters:
  • columns (list[str] | str) – columns (or alias) to add

  • new (str | None) – column name of addition or None (f”{X1}+{X2}+{X3}…”)

  • fill_value (float | int) – value to fill in NAs

Return type:

Self

Returns:

updated DataEngineer instance
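
As a rough illustration of the row-wise sum described for add(), the following pandas sketch adds two columns. The toy values are made up, and filling NAs before summing is an assumption about how fill_value is applied, not a statement about the covsirphy implementation.

```python
import pandas as pd

# Toy records; the values are made up for illustration only.
df = pd.DataFrame({"Fatal": [1, 2, None], "Recovered": [10, None, 30]})

columns = ["Fatal", "Recovered"]
fill_value = 0
# Default column name follows the documented pattern f"{X1}+{X2}+{X3}...".
new = "+".join(columns)

# Assumption: NAs are replaced with fill_value before the row-wise sum.
df[new] = df[columns].fillna(fill_value).sum(axis=1)

print(df[new].tolist())  # [11.0, 2.0, 30.0]
```

With a DataEngineer instance, the corresponding call would presumably be .add(columns=["Fatal", "Recovered"]).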

all(variables=None)[source]

Return all available data, converting dtypes with pandas.DataFrame.convert_dtypes().

Parameters:

variables (list[str] | str | None) – list of variables to collect or alias or None (all available variables)

Raises:

NotRegisteredError – No records have been registered yet

Return type:

DataFrame

Returns:

Index

reset index

Columns
  • columns defined by @layers of DataEngineer()

  • Date (pandas.Timestamp): observation dates defined by @date of DataEngineer()

  • the other columns

assign(**kwargs)[source]

Assign a new column with pandas.DataFrame.assign().

Parameters:

**kwargs – dict of {str: callable or pandas.Series}

Return type:

Self

Note

Refer to documentation of pandas.DataFrame.assign(), https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.assign.html
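
Because assign() forwards its keyword arguments to pandas.DataFrame.assign(), the behavior can be sketched on a plain DataFrame. The column name Fatal_ratio and the toy values are hypothetical.

```python
import pandas as pd

# Toy records; the values are made up for illustration only.
df = pd.DataFrame({"Confirmed": [100, 200], "Fatal": [1, 4]})

# A callable receives the DataFrame and returns the values of the new column.
out = df.assign(Fatal_ratio=lambda d: d["Fatal"] / d["Confirmed"])

print(out["Fatal_ratio"].tolist())  # [0.01, 0.02]
```

pandas.DataFrame.assign() returns a new object and leaves df unchanged. DataEngineer.assign() is documented as returning the instance itself, so it can presumably be chained with other methods.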

choropleth(geo, variable, on=None, title='Choropleth map', filename='choropleth.jpg', logscale=True, directory=None, natural_earth=None, **kwargs)[source]

Create choropleth map.

Parameters:
  • geo (tuple[list[str] | tuple[str] | str | None, ...]) – location names to specify the layer or None (the top level)

  • variable (str) – variable name to show

  • on (str | None) – the date, like 22Jan2020, or None (the last date of each location)

  • title (str) – title of the map

  • filename (str) – filename to save the figure or None (display)

  • logscale (bool) – whether to convert the values to log10 scale or not

  • directory (str | None) – directory to save GeoJSON file of “Natural Earth” GitHub repository or None (the directory of GIS class script)

  • natural_earth (str | None) – title of GeoJSON file (without extension) of “Natural Earth” GitHub repository or None (automatically determined)

  • **kwargs

    keyword arguments of the following classes and methods.

    • matplotlib.pyplot.savefig(),

    • matplotlib.pyplot.legend(), and

    • pandas.DataFrame.plot()

Return type:

None

Note

Regarding @geo argument, please refer to covsirphy.GIS.layer().

Note

GeoJSON files are listed in

citations(variables=None)[source]

Return citation list of the secondary data sources.

Parameters:

variables (list[str] | str | None) – list of variables to collect or alias or None (all available variables)

Return type:

list[str]

Returns:

citations

clean(kinds=None, **kwargs)[source]

Clean all registered data.

Parameters:
  • kinds (list[str] | None) –

    kinds of data cleaning with order or None (all available kinds as follows)

    • ”convert_date”: Convert dtype of date column to pandas.Timestamp.

    • ”resample”: Resample records with dates.

    • ”fillna”: Fill NA values with ‘-’ (layers), then with the previous values, and finally with 0.

  • **kwargs – keyword arguments of data cleaning; refer to the notes below

Return type:

Self

Returns:

updated DataEngineer instance

Note

When “convert_date” is included, keyword arguments of pandas.to_datetime(), including “dayfirst (bool): whether date format is DD/MM or not”, can be used.

Note

When “resample” is included, date_range=<tuple of (str or None, str or None) or None> can be applied as a keyword argument to set the range.
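
The three cleaning kinds can be pictured with plain pandas calls. This is only a sketch on made-up values. The daily resampling rule and the forward-fill step are assumptions, because the text here does not say which pandas calls covsirphy uses internally.

```python
import pandas as pd

# Toy records with DD/MM/YYYY strings and one missing day (02 Feb 2020).
raw = pd.DataFrame({"Date": ["01/02/2020", "03/02/2020"], "Confirmed": [5, 9]})

# "convert_date": parse the strings, reading the day first.
raw["Date"] = pd.to_datetime(raw["Date"], dayfirst=True)

# "resample": one record per day (assumed rule: last value of each day).
daily = raw.set_index("Date").resample("D").last()

# "fillna": previous values first, then 0 for anything still missing.
daily = daily.ffill().fillna(0)

print(daily["Confirmed"].tolist())  # [5.0, 5.0, 9.0]
```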

diff(column, suffix='_diff', freq='D')[source]

Calculate daily new cases with “f(x>0) = F(x) - F(x-1), f(0) = 0 when F is cumulative numbers”.

Parameters:
  • column (str) – column name of the cumulative numbers

  • suffix (str) – suffix of the column (new column name will be ‘{column}{suffix}’)

  • freq (str) – offset aliases of shifting dates

Return type:

Self

Returns:

updated DataEngineer instance
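
The difference formula for diff() can be sketched with pandas as follows. The sketch assumes consecutive daily records (freq='D') and uses made-up values. It is not necessarily how covsirphy computes the column.

```python
import pandas as pd

# Cumulative counts on four consecutive days (toy values).
dates = pd.date_range("2020-01-22", periods=4, freq="D")
df = pd.DataFrame({"Confirmed": [2, 5, 5, 9]}, index=dates)

column, suffix = "Confirmed", "_diff"
# F(x) - F(x-1) for later days; the first day has no predecessor and is set to 0.
df[f"{column}{suffix}"] = df[column].diff().fillna(0)

print(df["Confirmed_diff"].tolist())  # [0.0, 3.0, 0.0, 4.0]
```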

div(numerator, denominator, new=None, fill_value=0)[source]

Calculate element-wise floating division with pandas.Series.div(), numerator / denominator.

Parameters:
  • numerator (str) – numerator column

  • denominator (str) – denominator column

  • new (str | None) – column name of floating division or None (f"{numerator}_per_({denominator.replace(' ', '_')})")

  • fill_value (float | int) – value to fill in NAs

Return type:

Self

Returns:

updated DataEngineer instance

Note

Positive rate could be calculated with Confirmed / Tested, .div(numerator=”Confirmed”, denominator=”Tested”, new=”Positive_rate”)
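
A minimal pandas sketch of the positive-rate calculation in the note above. The values are made up, and replacing NAs in the quotient with fill_value is an assumption about where the filling happens.

```python
import pandas as pd

# Toy records; the last "Tested" value is missing.
df = pd.DataFrame({"Confirmed": [10, 30, 50], "Tested": [100, 200, None]})

fill_value = 0
# Element-wise floating division; assumed: NAs in the result become fill_value.
df["Positive_rate"] = df["Confirmed"].div(df["Tested"]).fillna(fill_value)

print(df["Positive_rate"].tolist())  # [0.1, 0.15, 0.0]
```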

download(**kwargs)[source]

Download datasets from the recommended data servers using covsirphy.DataDownloader.

Parameters:

**kwargs – keyword arguments of covsirphy.DataDownloader() and covsirphy.DataDownloader.layer()

Return type:

Self

Returns:

updated DataEngineer instance

inverse_transform()[source]

Perform inverse transformation, calculating total population and confirmed.

Return type:

Self

Returns:

updated DataEngineer instance

Note

  • Population = Susceptible + Confirmed

  • Confirmed = Infected + Fatal + Recovered
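
The two relations in the note above can be checked on toy numbers. This sketch applies them directly with pandas and does not call covsirphy.

```python
import pandas as pd

# Toy compartment values (made up for illustration).
df = pd.DataFrame(
    {"Susceptible": [990, 975], "Infected": [6, 15], "Fatal": [1, 2], "Recovered": [3, 8]}
)

# Confirmed = Infected + Fatal + Recovered
df["Confirmed"] = df["Infected"] + df["Fatal"] + df["Recovered"]
# Population = Susceptible + Confirmed
df["Population"] = df["Susceptible"] + df["Confirmed"]

print(df[["Population", "Confirmed"]].values.tolist())  # [[1000, 10], [1000, 25]]
```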

layer(geo=None, start_date=None, end_date=None, variables=None)[source]

Return the data at the selected layer in the date range.

Parameters:
  • geo (tuple[list[str] | tuple[str] | str | None, ...]) – location names to specify the layer or None (the top level)

  • start_date (str | None) – start date, like 22Jan2020

  • end_date (str | None) – end date, like 01Feb2020

  • variables (list[str] | None) – list of variables to add or None (all available columns)

Raises:
Return type:

DataFrame

Returns:

Index

reset index

Columns
  • (str): columns defined by covsirphy.GIS(layers)

  • Date (pandas.Timestamp): observation dates

  • columns defined by @variables

Note

Records with NAs as country names will always be removed.

Note

Regarding @geo argument, please refer to covsirphy.GIS.layer().

mul(columns, new=None, fill_value=0)[source]

Calculate element-wise multiplication with pandas.DataFrame.product(axis=1), X1 * X2 * X3 *…

Parameters:
  • columns (list[str] | str) – columns (or alias) to multiply

  • new (str | None) – column name of multiplication or None (f”{X1}*{X2}*{X3}…”)

  • fill_value (float | int) – value to fill in NAs

Return type:

Self

Returns:

updated DataEngineer instance

classmethod recovery_period(data)[source]

Calculate mode value of recovery period of the data.

Parameters:

data (DataFrame) –

data for calculation

Index

Date (pandas.Timestamp): observation dates

Columns
  • Confirmed (int): the number of confirmed cases, optional

  • Fatal (int): the number of fatal cases, optional

  • Recovered (int): the number of recovered cases, optional

  • the other columns will be ignored

Return type:

int

Returns:

mode value of recovery period [days]

register(data, citations=None, **kwargs)[source]

Register new data.

Parameters:
  • data (DataFrame) –

    new data

    Index

    reset index

    Columns
    • columns defined by covsirphy.DataEngineer(layers)

    • Date (pandas.Timestamp): observation dates

    • Population (int): total population, optional

    • Tests (int): column of the number of tests, optional

    • Confirmed (int): the number of confirmed cases, optional

    • Fatal (int): the number of fatal cases, optional

    • Recovered (int): the number of recovered cases, optional

    • the other columns will be also registered

  • citations (list[str] | str | None) – citations of the dataset or None ([“my own dataset”])

  • **kwargs – keyword arguments of pandas.to_datetime() including “dayfirst (bool): whether date format is DD/MM or not”

Return type:

Self

Returns:

updated DataEngineer instance
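
To make the expected layout concrete, here is a sketch of a DataFrame shaped for register() with the default layers. All values are placeholders rather than real statistics. The '-' filler for the City layer and the variable name engineer are assumptions, and the register() call is left as a comment so that the block runs without covsirphy installed.

```python
import pandas as pd

# Default layers of DataEngineer().
layers = ["ISO3", "Province", "City"]

# Placeholder records with a reset index; numbers are NOT real statistics.
data = pd.DataFrame(
    {
        "ISO3": ["JPN", "JPN"],
        "Province": ["Tokyo", "Tokyo"],
        "City": ["-", "-"],  # assumed filler for an unused layer
        "Date": ["22/01/2020", "23/01/2020"],  # DD/MM/YYYY strings
        "Population": [1000, 1000],
        "Confirmed": [1, 3],
    }
)

# Check the layout before registration: layer columns and Date must exist.
required = layers + ["Date"]
missing = [col for col in required if col not in data.columns]
print(missing)  # []

# With covsirphy installed, the call would presumably look like this:
# engineer.register(data=data, citations="my own dataset", dayfirst=True)
```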

sub(minuend, subtrahend, new=None, fill_value=0)[source]

Calculate element-wise subtraction with pandas.Series.sub(), minuend - subtrahend.

Parameters:
  • minuend (str) – minuend column

  • subtrahend (str) – subtrahend column

  • new (str | None) – column name of subtraction or None (f”{minuend}-{subtrahend}”)

  • fill_value (float | int) – value to fill in NAs

Return type:

Self

Returns:

updated DataEngineer instance

subset(geo=None, start_date=None, end_date=None, variables=None, complement=True, get_dummies=True, **kwargs)[source]

Return subset of the location and date range.

Parameters:
  • geo (tuple[list[str] | tuple[str] | str | None, ...]) – location names to filter or None (total at the top level)

  • start_date (str | None) – start date, like 22Jan2020

  • end_date (str | None) – end date, like 01Feb2020

  • variables (list[str] | str | None) – list of variables to add or None (all available columns)

  • complement (bool) – whether to perform data complement or not, True as default

  • get_dummies (bool) – whether to convert categorical variables into dummy variables or not, True as default

  • **kwargs

    keyword arguments for complement and default values

    • recovery_period (int): expected value of recovery period[days], 17

    • interval (int): expected update interval of the number of recovered cases[days], 2

    • max_ignored (int): Max number of recovered cases to be ignored[cases], 100

    • max_ending_unupdated (int): Max number of days to apply full complement, where max recovered cases are not updated[days], 14

    • upper_limit_days (int): maximum number of valid partial recovery periods[days], 90

    • lower_limit_days (int): minimum number of valid partial recovery periods[days], 7

    • upper_percentage (float): fraction of partial recovery periods with value greater than upper_limit_days, 0.5

    • lower_percentage (float): fraction of partial recovery periods with value less than lower_limit_days, 0.5

Return type:

tuple[DataFrame, str, dict[str, bool]]

Returns:

  • pandas.DataFrame
    Index

    Date (pandas.Timestamp): observation dates

    Columns

    • Population (int): total population

    • Tests (int): the number of tests

    • Confirmed (int): the number of confirmed cases

    • Fatal (int): the number of fatal cases

    • Recovered (int): the number of recovered cases

    • the other columns registered

  • str: status code: will be selected from

    • ’’ (not complemented)

    • ’monotonic increasing complemented confirmed data’

    • ’monotonic increasing complemented fatal data’

    • ’monotonic increasing complemented recovered data’

    • ’fully complemented recovered data’

    • ’partially complemented recovered data’

  • dict[str, bool]: status for each complement type, keys are

    • Monotonic_confirmed

    • Monotonic_fatal

    • Monotonic_recovered

    • Full_recovered

    • Partial_recovered

Note

Regarding @geo argument, please refer to covsirphy.GIS.subset().

Note

Re-calculation of Susceptible and Infected will be done automatically.

subset_alias(alias=None, update=False, **kwargs)[source]

Set/get/list-up alias name(s) of subset.

Parameters:
  • alias (str | None) – alias name or None (list-up alias names)

  • update (bool) – force updating the alias when @alias is not None

  • **kwargs – keyword arguments of covsirphy.DataEngineer().subset()

Return type:

tuple[DataFrame, str, dict[str, bool]] | dict[str, tuple[DataFrame, str, dict[str, bool]]]

Returns:

  • tuple[pandas.DataFrame, str, dict] – when @alias is not None, the subset of the alias

  • dict[str, tuple[pandas.DataFrame, str, dict]] – when @alias is None, dictionary of aliases and subsets

Note

When the alias name is new, the subset will be registered with covsirphy.DataEngineer.subset(**kwargs).

transform()[source]

Transform all registered data, calculating the number of susceptible and infected cases.

Return type:

Self

Returns:

updated DataEngineer instance

Note

  • Susceptible = Population - Confirmed

  • Infected = Confirmed - Fatal - Recovered
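
The two formulas in the note above can be applied to toy numbers as follows. This is a plain pandas sketch of the arithmetic, not the covsirphy implementation.

```python
import pandas as pd

# Toy cumulative values (made up for illustration).
df = pd.DataFrame(
    {"Population": [1000, 1000], "Confirmed": [10, 25], "Fatal": [1, 2], "Recovered": [3, 8]}
)

# Susceptible = Population - Confirmed
df["Susceptible"] = df["Population"] - df["Confirmed"]
# Infected = Confirmed - Fatal - Recovered
df["Infected"] = df["Confirmed"] - df["Fatal"] - df["Recovered"]

print(df[["Susceptible", "Infected"]].values.tolist())  # [[990, 6], [975, 15]]
```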

variables_alias(alias=None, variables=None)[source]

Set/get/list-up alias name(s) of variables.

Parameters:
  • alias (str | None) – alias name or None (list-up alias names)

  • variables (list[str] | None) – variables to register with the alias

Raises:

NotIncludedError – the alias is not None and unregistered

Return type:

list[str] | dict[str, list[str]]

Returns:

  • list[str] – when @alias is not None, the variables of the alias

  • dict[str, list[str]] – when @alias is None, dictionary of aliases and variables

Note

When @variables is not None, alias will be registered/updated.

Note

Some aliases are preset. We can check them with covsirphy.DataEngineer().variables_alias().