actableai.stats package

Module contents

class actableai.stats.Stats

Bases: object

Class handling calculation of correlation and decorrelation of features.

corr(df: pandas.core.frame.DataFrame, target_col: str, target_value: Optional[str] = None, p_value: float = 0.05, categorical_columns: Optional[List] = None, gen_categorical_columns: Optional[List] = None) → list

Calculate correlation between target and all other columns.

Parameters
  • df – DataFrame containing the target column and other columns.
  • target_col – Name of the target column.
  • target_value – Value of the target column if categorical. Defaults to None.
  • p_value – P-value threshold for significance. Defaults to 0.05.
Raises

ValueError – If target_col is not in df.columns or (target_col is categorical and target_value is not in df[target_col].unique()).

Returns

List containing the correlation between the target and all other columns.

Return type

list
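The two error conditions listed under Raises can be sketched in plain Python. This is a minimal illustration, not the library's implementation: check_target is a hypothetical helper, a dict of lists stands in for the DataFrame, and it assumes that a non-None target_value is what marks the target as categorical.

```python
from typing import Any, Mapping, Optional, Sequence


def check_target(
    columns: Mapping[str, Sequence[Any]],
    target_col: str,
    target_value: Optional[Any] = None,
) -> None:
    """Hypothetical helper mirroring the documented ValueError conditions.

    `columns` maps column names to their values and stands in for a DataFrame.
    """
    # Condition 1: the target column must exist.
    if target_col not in columns:
        raise ValueError(f"{target_col!r} is not a column of the input")

    # Condition 2 (assumption: a non-None target_value means the target is
    # categorical): the value must occur in the target column.
    if target_value is not None and target_value not in set(columns[target_col]):
        raise ValueError(
            f"{target_value!r} does not occur in column {target_col!r}"
        )


# Made-up data, for illustration only.
data = {"churn": ["yes", "no", "yes"], "age": [31, 45, 27]}
check_target(data, "churn", target_value="yes")  # passes silently
```

Running the same checks before calling corr lets a caller report a bad column name or category value without relying on the library's own error message.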

decorrelate(df, target_col, control_col, target_value=None, control_value=None, kde_steps='auto', corr_max=0.05, pval_max=0.05, kde_steps_=10) → list

Re-sample df to de-correlate target_col and control_col.

Parameters
  • df – Input DataFrame.
  • target_col – Name of the target column.
  • control_col – Name of the control column.
  • target_value – Value of the target column if categorical. Defaults to None.
  • control_value – Value of the control column if categorical. Defaults to None.
  • kde_steps – Used to compute the KDE bandwidth. A higher kde_steps gives better decorrelation but a smaller sample. Set kde_steps to “auto” to search for the smallest value at which the correlation is insignificant. Defaults to “auto”.
  • corr_max – Correlation threshold for significance. Defaults to 0.05.
  • pval_max – P-value threshold for significance. Defaults to 0.05.
  • kde_steps_ – Used to compute the KDE bandwidth. A higher kde_steps_ gives better decorrelation but a smaller sample. Defaults to 10.
Returns

Sampled indices.

Return type

list
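The “auto” setting of kde_steps is described as a search for the smallest value at which the correlation is insignificant. The sketch below shows one way such a search could look. Everything in it is an assumption made for illustration: smallest_insignificant_steps and evaluate are hypothetical names, the candidate values are arbitrary, and the rule that combines corr_max and pval_max is a guess rather than the library's documented behaviour.

```python
from typing import Callable, Iterable, Optional, Tuple


def smallest_insignificant_steps(
    evaluate: Callable[[int], Tuple[float, float]],
    candidates: Iterable[int],
    corr_max: float = 0.05,
    pval_max: float = 0.05,
) -> Optional[int]:
    """Return the smallest candidate whose correlation is insignificant.

    `evaluate(steps)` is a hypothetical callback that re-samples with the
    given number of steps and returns (correlation, p_value).
    """
    for steps in sorted(candidates):
        corr, pval = evaluate(steps)
        # Assumed rule: the correlation is insignificant when it is weak
        # enough OR its p-value is at or above the threshold.
        if abs(corr) <= corr_max or pval >= pval_max:
            return steps
    return None  # no candidate met the thresholds


# Made-up (correlation, p-value) pairs, for illustration only.
toy_results = {2: (0.40, 0.001), 5: (0.12, 0.03), 10: (0.03, 0.20)}
best = smallest_insignificant_steps(toy_results.__getitem__, toy_results)
print(best)
```

Scanning candidates in ascending order reflects the trade-off stated in the parameter description: larger values decorrelate better but leave fewer samples, so the first value that passes is the one that keeps the most data.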