CleanDataFrame Docstring

This class is used to clean and optimize a pandas dataframe for further analysis.

clean_df.CleanDataFrame.df

The dataframe to be cleaned and optimized, or the resulting dataframe once cleaning and optimization have been applied.

Type:: pandas.DataFrame

clean_df.CleanDataFrame.max_num_cat

The maximum number of unique values in a column for it to be considered categorical.

Type:: int

clean_df.CleanDataFrame.duplicate_inds

Array of the indices of duplicated rows.

Type:: numpy ndarray, readonly
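
As a rough illustration of what such an array contains, the sketch below finds the index labels of repeated rows with plain pandas. It does not show the class's internals; in particular, treating the first occurrence of a row as the original (the pandas default) is an assumption.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 1, 3, 2], "b": ["x", "y", "x", "z", "y"]})

# DataFrame.duplicated() flags every repeat of an earlier row.
# With the default keep="first", the first occurrence is not flagged.
duplicate_inds = df.index[df.duplicated()].to_numpy()

print(duplicate_inds)
```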

clean_df.CleanDataFrame.cols_to_optimize

A dictionary of all numerical columns whose memory usage can be reduced, in {column name: optimized data type} format.

Type:: dict, readonly
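
The source does not say how the optimized data type is chosen. One common approach, sketched here under that caveat, is to let pandas.to_numeric downcast each numeric column and keep only the columns whose dtype actually shrinks. The helper name find_cols_to_optimize is hypothetical.

```python
import pandas as pd


def find_cols_to_optimize(df):
    """Hypothetical helper: map each shrinkable numeric column to a smaller dtype."""
    result = {}
    for col in df.select_dtypes(include="number").columns:
        series = df[col]
        if series.isna().any():
            continue  # assumption: columns with missing values are left alone
        kind = "float" if pd.api.types.is_float_dtype(series) else "integer"
        smaller = pd.to_numeric(series, downcast=kind).dtype
        if smaller.itemsize < series.dtype.itemsize:
            result[col] = smaller
    return result


df = pd.DataFrame({
    "small_int": [1, 2, 3],        # fits in int8
    "big_int": [1, 2, 2**40],      # needs int64, so it is not reported
    "ratio": [0.5, 0.25, 0.125],   # exactly representable in float32
})
cols_to_optimize = find_cols_to_optimize(df)
print(cols_to_optimize)
```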

clean_df.CleanDataFrame.outliers

A dictionary of outlier details, in descending order, in {column name: outlier details} format. Each value is an array containing:

The number of lower outliers.

The number of upper outliers.

The total number of outliers.

The percentage of the total values that are outliers.

Type:: dict, readonly
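
The source does not state which outlier rule is used or which field determines the ordering. The sketch below assumes the common 1.5 × IQR fences and sorts by the total number of outliers; both choices, and the helper name outlier_details, are assumptions made for illustration.

```python
import pandas as pd


def outlier_details(series):
    """Hypothetical helper: [lower count, upper count, total count, percentage]."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower = int((series < q1 - 1.5 * iqr).sum())   # below the lower fence
    upper = int((series > q3 + 1.5 * iqr).sum())   # above the upper fence
    total = lower + upper
    return [lower, upper, total, 100 * total / len(series)]


df = pd.DataFrame({
    "x": [10, 11, 12, 13, 14, 15, 16, 17, 18, 100],
    "y": [-50, 1, 2, 3, 4, 5, 6, 7, 8, 60],
})

details = {col: outlier_details(df[col]) for col in df.columns}
# Keep only columns that have outliers, largest total first (assumed ordering).
outliers = dict(
    sorted(
        ((col, d) for col, d in details.items() if d[2] > 0),
        key=lambda item: item[1][2],
        reverse=True,
    )
)
print(outliers)
```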

clean_df.CleanDataFrame.missing_cols

A dictionary of missing-value details, in descending order, in {column name: missing details} format. Each value is an array containing:

The total number of missing values.

The percentage of the total values that are missing.

Type:: dict, readonly
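
A dictionary of this shape can be built with plain pandas, as in the minimal sketch below. Sorting by the count of missing values and omitting columns with no missing values are assumptions, since the source only says the entries are in descending order.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": ["x", None, "z", "w"],
    "c": [1, 2, 3, 4],
})

counts = df.isna().sum()
# Assumption: only columns with at least one missing value are listed.
counts = counts[counts > 0].sort_values(ascending=False)
missing_cols = {
    col: [int(n), 100 * int(n) / len(df)] for col, n in counts.items()
}
print(missing_cols)
```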

clean_df.CleanDataFrame.cat_cols

A dictionary of columns that can be converted to categorical type, in {column name: array of unique values} format.

Type:: dict, readonly
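
To make the role of max_num_cat concrete, the sketch below collects every column whose number of distinct non-missing values does not exceed the threshold. Whether the class applies this test to all columns or only to some dtypes is not stated in the source, so checking every column is an assumption, as is the helper name find_cat_cols.

```python
import pandas as pd


def find_cat_cols(df, max_num_cat=10):
    """Hypothetical helper: {column name: unique values} for low-cardinality columns."""
    cat_cols = {}
    for col in df.columns:
        uniques = df[col].dropna().unique()
        if len(uniques) <= max_num_cat:
            cat_cols[col] = uniques
    return cat_cols


df = pd.DataFrame({
    "city": ["Oslo", "Rome", "Oslo", "Rome", "Oslo"],
    "id": [1, 2, 3, 4, 5],
})
cat_cols = find_cat_cols(df, max_num_cat=3)
print(cat_cols)
```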

clean_df.CleanDataFrame.num_cols

Array of numerical columns.

Type:: numpy ndarray of str, readonly

clean_df.CleanDataFrame.__init__(self, df, max_num_cat=10) → None:

Constructor for CleanDataFrame.

Parameters:

df (pandas.DataFrame) – The dataframe to be cleaned and optimized.
max_num_cat (int, optional) – The maximum number of unique values in a column for it to be considered categorical, defaults to 10.

clean_df.CleanDataFrame.report(self, show_matrix=True, show_heat=True, matrix_kws={}, heat_kws={}) → None:

Generate a summary report of the dataset, including:

Duplicated rows report.
Columns’ Datatype to optimize memory report.
Columns to convert to categorical report.
Outliers report.
Missing values report.

Parameters:

show_matrix (bool, optional) – A flag to control whether to show the missing value matrix plot or not, defaults to True.
show_heat (bool, optional) – A flag to control whether to show the missing value heatmap plot or not, defaults to True.
matrix_kws (dict, optional) – Keyword arguments passed to the missing value matrix plot, defaults to {}.
heat_kws (dict, optional) – Keyword arguments passed to the missing value heatmap plot, defaults to {}.

Raises:

TypeError – If any parameter has the wrong type.

clean_df.CleanDataFrame.clean(self, min_missing_ratio=0.05, drop_nan=True, drop_kws={}, drop_duplicates_kws={}) → None:

Drops columns with a high ratio of missing values and drops duplicate rows.

Parameters:

min_missing_ratio (float, optional) – The minimum ratio of missing values a column must have to be dropped. Value should be between 0 and 1. Defaults to 0.05.
drop_nan (bool, optional) – A flag to decide whether to drop any rows that still contain missing values after the columns with a missing-value ratio above min_missing_ratio have been dropped. Defaults to True.
drop_kws (dict, optional) – Keyword arguments passed to the drop() function. Defaults to {}.
drop_duplicates_kws (dict, optional) – Keyword arguments passed to the drop_duplicates() function. Defaults to {}.

Raises:

TypeError – If drop_nan is not boolean, or if drop_kws or drop_duplicates_kws have the wrong types.
ValueError – If min_missing_ratio is not between 0 and 1, or if drop_kws or drop_duplicates_kws have a key inplace.
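
The documented behaviour can be approximated with plain pandas. The function below is a sketch, not the class's implementation: the name clean_sketch, the use of `>=` for the threshold, the order of the steps, and returning a new dataframe instead of updating df in place are all assumptions.

```python
import numpy as np
import pandas as pd


def clean_sketch(df, min_missing_ratio=0.05, drop_nan=True,
                 drop_kws=None, drop_duplicates_kws=None):
    """Hypothetical stand-in for clean(); returns a new dataframe."""
    drop_kws = {} if drop_kws is None else drop_kws
    drop_duplicates_kws = {} if drop_duplicates_kws is None else drop_duplicates_kws
    if not isinstance(drop_nan, bool):
        raise TypeError("drop_nan must be a bool")
    if not isinstance(drop_kws, dict) or not isinstance(drop_duplicates_kws, dict):
        raise TypeError("drop_kws and drop_duplicates_kws must be dicts")
    if not 0 <= min_missing_ratio <= 1:
        raise ValueError("min_missing_ratio must be between 0 and 1")
    if "inplace" in drop_kws or "inplace" in drop_duplicates_kws:
        raise ValueError("the 'inplace' key is not allowed")

    # Fraction of missing values per column.
    missing_ratio = df.isna().mean()
    to_drop = missing_ratio[missing_ratio >= min_missing_ratio].index
    out = df.drop(columns=to_drop, **drop_kws)
    if drop_nan:
        out = out.dropna()
    return out.drop_duplicates(**drop_duplicates_kws)


df = pd.DataFrame({
    "mostly_empty": [np.nan, np.nan, np.nan, 1.0, np.nan],
    "x": [1.0, 1.0, 2.0, np.nan, 3.0],
    "y": ["a", "a", "b", "c", "d"],
})
cleaned = clean_sketch(df, min_missing_ratio=0.5)
print(cleaned)
```

With min_missing_ratio=0.5, only the column that is 80% empty is removed; the row with a missing x and the repeated first row are then dropped.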

clean_df.CleanDataFrame.optimize(self) → None:

Optimizes the dataframe by converting numerical columns to their optimized data types and converting categorical columns to the ‘category’ data type. Note that numerical columns should not contain missing values.

Raises:: Warning – If any numerical column contains missing values.
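
The conversion step might look like the sketch below, which applies a {column name: data type} mapping and a collection of categorical column names with astype. Skipping a numerical column that contains missing values after issuing the warning is an assumption, since the source only says a warning is raised. The name optimize_sketch is hypothetical.

```python
import warnings

import numpy as np
import pandas as pd


def optimize_sketch(df, cols_to_optimize, cat_cols):
    """Hypothetical stand-in for optimize(); returns a converted copy."""
    out = df.copy()
    for col, dtype in cols_to_optimize.items():
        if out[col].isna().any():
            # Assumption: warn and leave the column unchanged.
            warnings.warn(f"column {col!r} contains missing values; skipping it")
            continue
        out[col] = out[col].astype(dtype)
    for col in cat_cols:
        out[col] = out[col].astype("category")
    return out


df = pd.DataFrame({
    "n": [1, 2, 3],
    "score": [1.0, np.nan, 3.0],
    "city": ["Oslo", "Rome", "Oslo"],
})
cols_to_optimize = {"n": "int8", "score": "float32"}
cat_cols = {"city": ["Oslo", "Rome"]}

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    optimized = optimize_sketch(df, cols_to_optimize, cat_cols)

print(optimized.dtypes)
```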