covsirphy.engineering package
Submodules
covsirphy.engineering.engineer module
- class DataEngineer(layers=None, country='ISO3', **kwargs)
Bases:
Term
Class for data engineering including loading, cleaning, transforming, complementing, EDA (exploratory data analysis).
- Parameters:
layers (list[str] | None) – list of layers of geographic information or None ([“ISO3”, “Province”, “City”])
country (str) – layer name of countries or None (countries are not included in the layers)
- Raises:
ValueError – @layers has duplicates
Note
Country level data specified with @country will be stored with ISO3 codes.
- add(columns, new=None, fill_value=0)
Calculate element-wise addition with pandas.DataFrame.sum(axis=1), X1 + X2 + X3 +…
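As a rough illustration of the row-wise sum named above, the following sketch applies pandas.DataFrame.sum(axis=1) to a toy table. The column names X1, X2, X3 and Total are hypothetical, and replacing NA values with @fill_value before summing is an assumption about how the argument is used, not a confirmed detail of DataEngineer.

```python
import pandas as pd

# Toy data; the column names and values are hypothetical.
df = pd.DataFrame(
    {"X1": [1, 2, None], "X2": [10, 20, 30], "X3": [100, None, 300]}
)

# Assumption: NA values are replaced with fill_value before the row-wise sum.
fill_value = 0
df["Total"] = df[["X1", "X2", "X3"]].fillna(fill_value).sum(axis=1)

print(df["Total"].tolist())
```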
- all(variables=None)
Return all available data, converting dtypes with pandas.DataFrame.convert_dtypes().
- Parameters:
variables (list[str] | str | None) – list of variables to collect or alias or None (all available variables)
- Raises:
NotRegisteredError – No records have been registered yet
- Return type:
- Returns:
- Index
reset index
- Columns
columns defined by @layers of DataEngineer()
Date (pandas.Timestamp): observation dates defined by @date of DataEngineer()
the other columns
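The dtype conversion mentioned for all() can be tried directly in pandas. This is a minimal sketch on made-up records; it shows only what pandas.DataFrame.convert_dtypes() does to a frame, not how DataEngineer stores its data.

```python
import pandas as pd

# Toy records; the values are made up for illustration.
df = pd.DataFrame(
    {
        "ISO3": ["JPN", "JPN"],
        "Date": pd.to_datetime(["2020-01-22", "2020-01-23"]),
        "Confirmed": [1.0, 2.0],
    }
)

# convert_dtypes() picks nullable dtypes; whole-number floats become integers.
converted = df.convert_dtypes()
print(converted.dtypes)
```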
- assign(**kwargs)
Assign a new column with pandas.DataFrame.assign().
- Parameters:
**kwargs – dict of {str: callable or pandas.Series}
- Return type:
Self
Note
Refer to documentation of pandas.DataFrame.assign(), https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.assign.html
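For reference, the two kinds of values accepted by pandas.DataFrame.assign() (a callable or a Series) can be sketched as follows. The column names Positive_rate and Doubled are hypothetical. Note that pandas returns a new DataFrame, whereas DataEngineer.assign() is documented to return the instance itself.

```python
import pandas as pd

# Toy data; the values are made up.
df = pd.DataFrame({"Confirmed": [10, 20], "Tests": [100, 400]})

result = df.assign(
    # A callable receives the DataFrame and returns the new column.
    Positive_rate=lambda x: x["Confirmed"] / x["Tests"],
    # A Series is aligned on the index.
    Doubled=df["Confirmed"] * 2,
)

print(result)
```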
- choropleth(geo, variable, on=None, title='Choropleth map', filename='choropleth.jpg', logscale=True, directory=None, natural_earth=None, **kwargs)
Create choropleth map.
- Parameters:
geo (tuple[list[str] | tuple[str] | str | None, ...]) – location names to specify the layer or None (the top level)
variable (str) – variable name to show
on (str | None) – the date, like 22Jan2020, or None (the last date of each location)
title (str) – title of the map
filename (str) – filename to save the figure or None (display)
logscale (bool) – whether to convert the values to log10 scale or not
directory (str | None) – directory to save GeoJSON file of “Natural Earth” GitHub repository or None (the directory of GIS class script)
natural_earth (str | None) – title of GeoJSON file (without extension) of “Natural Earth” GitHub repository or None (automatically determined)
**kwargs – keyword arguments of the following classes and methods:
matplotlib.pyplot.savefig(),
matplotlib.pyplot.legend(), and
pandas.DataFrame.plot()
- Return type:
Note
Regarding @geo argument, please refer to covsirphy.GIS.layer().
Note
GeoJSON files are listed in
https://github.com/nvkelso/natural-earth-vector/tree/master/geojson
Natural Earth (Free vector and raster map data at naturalearthdata.com, Public Domain)
- clean(kinds=None, **kwargs)
Clean all registered data.
- Parameters:
kinds – kinds of data cleaning with order or None (all available kinds as follows)
“convert_date”: Convert dtype of date column to pandas.Timestamp.
“resample”: Resample records with dates.
“fillna”: Fill NA values with ‘-’ (layers) and the previous values and 0.
**kwargs – keyword arguments of data cleaning; refer to the notes
- Return type:
Self
- Returns:
updated DataEngineer instance
Note
When “convert_date” is included, keyword arguments of pandas.to_datetime(), including “dayfirst (bool): whether date format is DD/MM or not”, can be used.
Note
When “resample” is included, date_range=<tuple of (str or None, str or None) or None> can be applied as a keyword argument to set the range.
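The three cleaning kinds can be pictured with plain pandas on a single location. This is only a sketch under stated assumptions: daily resampling, forward-fill followed by 0 for the value column, and no layer columns (so the ‘-’ filling is not shown). It is not the DataEngineer implementation.

```python
import pandas as pd

# Toy records with DD/MM/YYYY dates, a missing day and a missing value.
raw = pd.DataFrame(
    {
        "Date": ["01/02/2020", "02/02/2020", "04/02/2020"],
        "Confirmed": [None, 2, 4],
    }
)

# "convert_date": dayfirst=True reads 01/02/2020 as 1 February 2020.
dates = pd.to_datetime(raw["Date"], dayfirst=True)

# "resample": insert a row for each missing date (daily frequency assumed).
daily = raw.assign(Date=dates).set_index("Date").resample("D").first()

# "fillna": previous values first, then 0 where no previous value exists.
cleaned = daily.ffill().fillna(0)

print(cleaned)
```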
- diff(column, suffix='_diff', freq='D')
Calculate daily new cases with “f(x>0) = F(x) - F(x-1), f(0) = 0 when F is cumulative numbers”.
- Parameters:
- Return type:
Self
- Returns:
updated DataEngineer instance
Note
Regarding @freq, refer to https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases
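The difference formula can be checked on a small cumulative series. The sketch below uses pandas.Series.diff() and sets the first value to 0, following the initial condition in the formula. The series values are made up, and the name Confirmed_diff simply combines a column name with the default @suffix.

```python
import pandas as pd

# Cumulative counts F(x) on consecutive days (made-up values).
cumulative = pd.Series(
    [1, 3, 6, 10],
    index=pd.date_range("2020-01-22", periods=4, freq="D"),
    name="Confirmed",
)

# f(x) = F(x) - F(x-1); the first day has no predecessor and is set to 0.
daily_new = cumulative.diff().fillna(0).astype(int).rename("Confirmed_diff")

print(daily_new.tolist())
```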
- div(numerator, denominator, new=None, fill_value=0)
Calculate element-wise floating division with pandas.Series.div(), numerator / denominator.
- Parameters:
- Return type:
Self
- Returns:
updated DataEngineer instance
Note
Positive rate could be calculated with Confirmed / Tested: .div(numerator="Confirmed", denominator="Tested", new="Positive_rate")
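The same positive-rate calculation can be reproduced with pandas.Series.div() on a toy frame. The values are made up, and the handling of @fill_value inside DataEngineer is deliberately left out because the description above does not specify it.

```python
import pandas as pd

# Toy data; column names follow the note above, values are made up.
df = pd.DataFrame({"Confirmed": [5, 20], "Tested": [100, 400]})

# Element-wise floating division: numerator / denominator.
df["Positive_rate"] = df["Confirmed"].div(df["Tested"])

print(df["Positive_rate"].tolist())
```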
- download(**kwargs)
Download datasets from the recommended data servers using covsirphy.DataDownloader.
- Parameters:
**kwargs – keyword arguments of covsirphy.DataDownloader() and covsirphy.DataDownloader.layer()
- Return type:
Self
- Returns:
updated DataEngineer instance
- inverse_transform()
Perform inverse transformation, calculating total population and confirmed.
- Return type:
Self
- Returns:
updated DataEngineer instance
Note
Population = Susceptible + Confirmed
Confirmed = Infected + Fatal + Recovered
- layer(geo=None, start_date=None, end_date=None, variables=None)
Return the data at the selected layer in the date range.
- Parameters:
- Raises:
TypeError – @geo has un-expected types
ValueError – the length of @geo is larger than the length of layers
NotRegisteredError – No records have been registered at the layer yet
- Return type:
- Returns:
- Index
reset index
- Columns
(str): columns defined by covsirphy.GIS(layers)
Date (pandas.Timestamp): observation dates
columns defined by @variables
Note
Note that records with NAs as country names will always be removed.
Note
Regarding @geo argument, please refer to covsirphy.GIS.layer().
- mul(columns, new=None, fill_value=0)
Calculate element-wise multiplication with pandas.DataFrame.product(axis=1), X1 * X2 * X3 *…
- classmethod recovery_period(data)
Calculate mode value of recovery period of the data.
- Parameters:
data (DataFrame) – data for calculation
- Index
Date (pandas.Timestamp): observation dates
- Columns
Confirmed (int): the number of confirmed cases, optional
Fatal (int): the number of fatal cases, optional
Recovered (int): the number of recovered cases, optional
the other columns will be ignored
- Return type:
- Returns:
mode value of recovery period [days]
- register(data, citations=None, **kwargs)
Register new data.
- Parameters:
data (DataFrame) – new data
- Index
reset index
- Columns
columns defined by covsirphy.DataEngineer(layers)
Date (pandas.Timestamp): observation dates
Population (int): total population, optional
Tests (int): column of the number of tests, optional
Confirmed (int): the number of confirmed cases, optional
Fatal (int): the number of fatal cases, optional
Recovered (int): the number of recovered cases, optional
the other columns will be also registered
citations (list[str] | str | None) – citations of the dataset or None ([“my own dataset”])
**kwargs – keyword arguments of pandas.to_datetime() including “dayfirst (bool): whether date format is DD/MM or not”
- Return type:
Self
- Returns:
updated DataEngineer instance
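A frame in the layout described for @data might look like the following sketch. The records are hypothetical, the layer names are the defaults from the class description, and the small check for missing columns is an illustration only, not part of covsirphy.

```python
import pandas as pd

# Default layers named in the class description.
layers = ["ISO3", "Province", "City"]

# Hypothetical records: one row per location and date, with a reset index.
data = pd.DataFrame(
    {
        "ISO3": ["JPN", "JPN"],
        "Province": ["Tokyo", "Tokyo"],
        "City": ["Shinjuku", "Shinjuku"],
        "Date": ["2020-01-22", "2020-01-23"],
        "Confirmed": [1, 2],
        "Fatal": [0, 0],
        "Recovered": [0, 1],
    }
)

# Layer columns and Date are the columns not marked as optional.
missing = [col for col in [*layers, "Date"] if col not in data.columns]
print(missing)

# With covsirphy installed, the frame could then be passed as
# DataEngineer(layers=layers).register(data=data, citations="my own dataset").
```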
- sub(minuend, subtrahend, new=None, fill_value=0)
Calculate element-wise subtraction with pandas.Series.sub(), minuend - subtrahend.
- subset(geo=None, start_date=None, end_date=None, variables=None, complement=True, get_dummies=True, **kwargs)
Return subset of the location and date range.
- Parameters:
geo (tuple[list[str] | tuple[str] | str | None, ...]) – location names to filter or None (total at the top level)
variables (list[str] | str | None) – list of variables to add or None (all available columns)
complement (bool) – whether to perform data complement or not, True as default
get_dummies (bool) – whether to convert categorical variables into dummy variables or not, True as default
**kwargs – keyword arguments for complement and default values
recovery_period (int): expected value of recovery period[days], 17
interval (int): expected update interval of the number of recovered cases[days], 2
max_ignored (int): Max number of recovered cases to be ignored[cases], 100
max_ending_unupdated (int): Max number of days to apply full complement, where max recovered cases are not updated[days], 14
upper_limit_days (int): maximum number of valid partial recovery periods[days], 90
lower_limit_days (int): minimum number of valid partial recovery periods[days], 7
upper_percentage (float): fraction of partial recovery periods with value greater than upper_limit_days, 0.5
lower_percentage (float): fraction of partial recovery periods with value less than lower_limit_days, 0.5
- Return type:
- Returns:
- pandas.DataFrame
- Index
Date (pandas.Timestamp): observation dates
- Columns
Population (int): total population
Tests (int): column of the number of tests
Confirmed (int): the number of confirmed cases
Fatal (int): the number of fatal cases
Recovered (int): the number of recovered cases
the other columns registered
str: status code: will be selected from
'' (not complemented)
'monotonic increasing complemented confirmed data'
'monotonic increasing complemented fatal data'
'monotonic increasing complemented recovered data'
'fully complemented recovered data'
'partially complemented recovered data'
dict[str, bool]: status for each complement type, keys are
Monotonic_confirmed
Monotonic_fatal
Monotonic_recovered
Full_recovered
Partial_recovered
Note
Regarding @geo argument, please refer to covsirphy.GIS.subset().
Note
Re-calculation of Susceptible and Infected will be done automatically.
- subset_alias(alias=None, update=False, **kwargs)
Set/get/list-up alias name(s) of subset.
- Parameters:
- Return type:
tuple[DataFrame, str, dict[str, bool]] | dict[str, tuple[DataFrame, str, dict[str, bool]]]
- Returns:
- tuple[pandas.DataFrame, str, dict] – when @alias is not None, the subset of the alias
- dict[str, tuple[pandas.DataFrame, str, dict]] – when @alias is None, dictionary of aliases and subsets
Note
When the alias name is new, the subset will be registered with covsirphy.DataEngineer.subset(**kwargs).
- transform()
Transform all registered data, calculating the number of susceptible and infected cases.
- Return type:
Self
- Returns:
updated DataEngineer instance
Note
Susceptible = Population - Confirmed
Infected = Confirmed - Fatal - Recovered
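The two formulas in this note, together with those listed for inverse_transform(), can be verified on a toy frame. This is a sketch with made-up numbers. It shows only the arithmetic, not how DataEngineer applies it to registered data.

```python
import pandas as pd

# Toy data; the values are made up.
df = pd.DataFrame(
    {
        "Population": [1000, 1000],
        "Confirmed": [10, 30],
        "Fatal": [1, 2],
        "Recovered": [3, 10],
    }
)

# Transformation: Susceptible and Infected from the registered variables.
df["Susceptible"] = df["Population"] - df["Confirmed"]
df["Infected"] = df["Confirmed"] - df["Fatal"] - df["Recovered"]

# Inverse transformation: recover Confirmed and Population.
confirmed = df["Infected"] + df["Fatal"] + df["Recovered"]
population = df["Susceptible"] + confirmed

print(df[["Susceptible", "Infected"]])
```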
- variables_alias(alias=None, variables=None)
Set/get/list-up alias name(s) of variables.
- Parameters:
- Raises:
NotIncludedError – the alias is not None and unregistered
- Return type:
- Returns:
- list[str] – when @alias is not None, the variables of the alias
- dict[str, list[str]] – when @alias is None, dictionary of aliases and variables
Note
When @variables is not None, alias will be registered/updated.
Note
Some aliases are preset. We can check them with covsirphy.DataEngineer().variables_alias().