CleanDataFrame Docstring

This class is used to clean and optimize a pandas dataframe for further analysis.

clean_df.CleanDataFrame.df

The dataframe to be cleaned and optimized, or the resulting dataframe once cleaning and optimization have been applied.

Type:

pandas.DataFrame

clean_df.CleanDataFrame.max_num_cat

The maximum number of unique values in a column for it to be considered categorical.

Type:

int

clean_df.CleanDataFrame.duplicate_inds

Array of the indices of duplicated rows.

Type:

numpy ndarray, readonly
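To illustrate what such an array of duplicate indices could look like, here is a minimal pandas-only sketch. It does not use CleanDataFrame itself, and it assumes that the first occurrence of a row is kept and only later repeats count as duplicates. The documentation above does not confirm that choice.

```python
import pandas as pd

df = pd.DataFrame({
    "a": [1, 2, 1, 3, 2],
    "b": ["x", "y", "x", "z", "y"],
})

# duplicated() flags every repeat of an earlier row (keep="first" is the default).
# Assumption: the class treats duplicates the same way.
duplicate_mask = df.duplicated()

# Index labels of the flagged rows, as a NumPy array.
duplicate_inds = df.index[duplicate_mask].to_numpy()
print(duplicate_inds)
```

Rows 2 and 4 repeat rows 0 and 1, so those are the two index labels collected here.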

clean_df.CleanDataFrame.cols_to_optimize

A dictionary of all numerical columns whose memory usage can be reduced, in {column name: optimized data type} format.

Type:

dict, readonly
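One way to build a mapping of this shape with plain pandas is to downcast each numerical column and record the columns whose data type becomes smaller. This is only a sketch under that assumption. The class may select data types differently.

```python
import pandas as pd

df = pd.DataFrame({
    "small_int": [1, 2, 3],
    "big_int": [1, 2, 10_000_000_000],
    "ratio": [0.5, 1.5, 2.5],
    "label": ["a", "b", "c"],
})

cols_to_optimize = {}
for name in df.select_dtypes(include="number").columns:
    col = df[name]
    # Choose the downcast family that matches the column's current kind.
    kind = "integer" if pd.api.types.is_integer_dtype(col) else "float"
    downcast = pd.to_numeric(col, downcast=kind)
    # Record only the columns that actually end up with a different dtype.
    if downcast.dtype != col.dtype:
        cols_to_optimize[name] = downcast.dtype

print(cols_to_optimize)
```

Here `big_int` is left out because its largest value does not fit in a smaller integer type, and `label` is skipped because it is not numerical.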

clean_df.CleanDataFrame.outliers

A dictionary of outlier details, in descending order, in {column name: outlier details} format. Each value is a list containing:

  • The number of lower outliers.

  • The number of upper outliers.

  • The total number of outliers.

  • The percentage of the total values that are outliers.

Type:

dict, readonly
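The documentation does not say which rule is used to detect outliers or which field the ordering is based on. The sketch below assumes Tukey's 1.5 × IQR fences and sorts by the total number of outliers. Both are illustrative assumptions, not the class's confirmed behaviour.

```python
import pandas as pd

df = pd.DataFrame({
    "x": [1, 2, 3, 4, 5, 6, 7, 8, 9, 100],
    "y": [-50, 10, 11, 12, 13, 14, 15, 16, 17, 90],
})

outliers = {}
for name in df.columns:
    col = df[name]
    q1, q3 = col.quantile(0.25), col.quantile(0.75)
    iqr = q3 - q1
    # Assumed rule: values beyond 1.5 * IQR from the quartiles are outliers.
    lower = int((col < q1 - 1.5 * iqr).sum())
    upper = int((col > q3 + 1.5 * iqr).sum())
    total = lower + upper
    if total:
        percent = round(100 * total / len(col), 2)
        outliers[name] = [lower, upper, total, percent]

# Assumed ordering: descending by the total number of outliers.
outliers = dict(sorted(outliers.items(), key=lambda kv: kv[1][2], reverse=True))
print(outliers)
```

Each value follows the documented layout: lower count, upper count, total count, percentage.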

clean_df.CleanDataFrame.missing_cols

A dictionary of missing-value details, in descending order, in {column name: missing details} format. Each value is a list containing:

  • The total number of missing values.

  • The percentage of the total values that are missing.

Type:

dict, readonly
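A dictionary of this shape can be approximated with plain pandas, as sketched below. The sketch assumes that columns without missing values are omitted and that the ordering is by the number of missing values. Neither detail is stated in the documentation.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": ["x", None, "y", "z"],
    "c": [1, 2, 3, 4],
})

# Count missing values per column and keep only the columns that have any.
counts = df.isna().sum()
counts = counts[counts > 0].sort_values(ascending=False)

missing_cols = {}
for name, n in counts.items():
    n = int(n)
    # [total missing values, percentage of the column that is missing]
    missing_cols[name] = [n, round(100 * n / len(df), 2)]

print(missing_cols)
```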

clean_df.CleanDataFrame.cat_cols

A dictionary of columns that can be converted to categorical type, in {column name: array of unique values} format.

Type:

dict, readonly
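The following sketch shows how the max_num_cat threshold could select categorical candidates. It assumes the comparison is inclusive (a column with exactly max_num_cat unique values qualifies) and that every column is checked regardless of data type. The class may apply a different rule.

```python
import pandas as pd

max_num_cat = 3  # small threshold chosen for this example only

df = pd.DataFrame({
    "color": ["red", "blue", "red", "green", "blue"],
    "city": ["A", "B", "C", "D", "E"],
    "score": [0.1, 0.2, 0.3, 0.4, 0.5],
})

cat_cols = {}
for name in df.columns:
    # Assumption: a column qualifies when its unique count is at most max_num_cat.
    if df[name].nunique() <= max_num_cat:
        cat_cols[name] = df[name].unique()

print(cat_cols)
```

Only `color` qualifies here, because `city` and `score` each have five unique values.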

clean_df.CleanDataFrame.num_cols

Array of numerical columns.

Type:

numpy ndarray of str, readonly

clean_df.CleanDataFrame.__init__(self, df, max_num_cat=10) -> None:

Constructor for CleanDataFrame.

Parameters:
  • df (pandas.DataFrame) – The dataframe to be cleaned and optimized.

  • max_num_cat (int, optional) – The maximum number of unique values in a column for it to be considered categorical, defaults to 10.

clean_df.CleanDataFrame.report(self, show_matrix=True, show_heat=True, matrix_kws={}, heat_kws={}) -> None:

Generate a summary report of the dataset, including:
  1. Duplicated rows report.

  2. Memory optimization report (column data types).

  3. Columns to convert to categorical report.

  4. Outliers report.

  5. Missing values report.

Parameters:
  • show_matrix (bool, optional) – A flag to control whether to show the missing value matrix plot or not, defaults to True.

  • show_heat (bool, optional) – A flag to control whether to show the missing value heatmap plot or not, defaults to True.

  • matrix_kws (dict, optional) – Keyword arguments passed to the missing value matrix plot, defaults to {}.

  • heat_kws (dict, optional) – Keyword arguments passed to the missing value heatmap plot, defaults to {}.

Raises:

TypeError – If any parameter has the wrong type.

clean_df.CleanDataFrame.clean(self, min_missing_ratio=0.05, drop_nan=True, drop_kws={}, drop_duplicates_kws={}) -> None:

Drops columns with a high ratio of missing values and drops duplicate rows.

Parameters:
  • min_missing_ratio (float, optional) – The minimum ratio of missing values at which a column is dropped. Must be between 0 and 1. Defaults to 0.05.

  • drop_nan (bool, optional) – A flag to decide whether to drop any rows that still contain missing values after dropping the columns whose ratio of missing values is above min_missing_ratio. Defaults to True.

  • drop_kws (dict, optional) – Keyword arguments passed to the drop() function. Defaults to {}.

  • drop_duplicates_kws (dict, optional) – Keyword arguments passed to the drop_duplicates() function. Defaults to {}.

Raises:
  • TypeError – If drop_nan is not boolean, or if drop_kws or drop_duplicates_kws have the wrong types.

  • ValueError – If min_missing_ratio is not between 0 and 1, or if drop_kws or drop_duplicates_kws contains the key inplace.
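The steps described for clean() can be approximated with plain pandas as follows. The function `clean_sketch` is a hypothetical stand-in, not the library's implementation. It assumes an inclusive threshold (a column is dropped when its missing ratio is at least min_missing_ratio) and the order columns, then rows with missing values, then duplicates. It also leaves out the drop_kws and drop_duplicates_kws pass-through.

```python
import numpy as np
import pandas as pd


def clean_sketch(df, min_missing_ratio=0.05, drop_nan=True):
    """Hypothetical approximation of the documented clean() steps."""
    if not 0 <= min_missing_ratio <= 1:
        raise ValueError("min_missing_ratio must be between 0 and 1")

    # Fraction of missing values per column.
    missing_ratio = df.isna().mean()
    # Assumption: the threshold is inclusive.
    to_drop = missing_ratio[missing_ratio >= min_missing_ratio].index
    out = df.drop(columns=to_drop)

    if drop_nan:
        # Drop rows that still contain missing values.
        out = out.dropna()

    return out.drop_duplicates()


df = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0, np.nan],
    "a": [1, 1, 2, 3, 1],
    "b": ["x", "x", "y", None, "x"],
})

cleaned = clean_sketch(df, min_missing_ratio=0.5)
print(cleaned)
```

With a threshold of 0.5, only `mostly_missing` (80% missing) is dropped. The row with a missing `b` is then removed, and the repeated `(1, "x")` rows collapse to one.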

clean_df.CleanDataFrame.optimize(self) -> None:

Optimizes the dataframe by converting numerical columns to their optimized data types (see cols_to_optimize) and converting categorical columns (see cat_cols) to the ‘category’ data type. Note that numerical columns should not contain missing values.

Raises:

Warning – If any numerical column contains missing values.
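The conversion described above can be sketched with `DataFrame.astype`. The dictionaries below are hypothetical stand-ins for the cols_to_optimize and cat_cols attributes, filled in by hand for this example. How the class performs the conversion internally is not stated in the documentation.

```python
import pandas as pd

df = pd.DataFrame({
    "count": [1, 2, 3, 4],
    "color": ["red", "blue", "red", "blue"],
})

# Hypothetical stand-ins for the cols_to_optimize and cat_cols attributes.
cols_to_optimize = {"count": "int8"}
cat_cols = {"color": ["red", "blue"]}

# NumPy integer dtypes such as int8 cannot represent NaN,
# so check the numerical columns for missing values first.
has_missing = bool(df[list(cols_to_optimize)].isna().any().any())

if not has_missing:
    dtypes = dict(cols_to_optimize)
    dtypes.update({name: "category" for name in cat_cols})
    optimized = df.astype(dtypes)
    print(optimized.dtypes)
```

The missing-value check reflects the note above: a numerical column with NaN cannot be cast to a NumPy integer type.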