CleanDataFrame Docstring
This class is used to clean and optimize a pandas dataframe for further analysis.
- clean_df.CleanDataFrame.df
The dataframe to be cleaned and optimized. After cleaning and optimization, it holds the resulting dataframe.
- Type:
pandas.DataFrame
- clean_df.CleanDataFrame.max_num_cat
The maximum number of unique values in a column for it to be considered categorical.
- Type:
int
- clean_df.CleanDataFrame.duplicate_inds
Array of the indices of duplicated rows.
- Type:
numpy ndarray, readonly
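As a rough illustration of what such an array can look like, the sketch below collects the index labels of repeated rows with plain pandas. It assumes that a row counts as duplicated when it repeats an earlier row (the pandas default); the source does not state which occurrence is kept.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 1, 3, 2], "b": ["x", "y", "x", "z", "y"]})

# duplicated() flags every row that repeats an earlier row (keep="first" by default).
duplicate_inds = df.index[df.duplicated()].to_numpy()
print(duplicate_inds)
```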
- clean_df.CleanDataFrame.cols_to_optimize
A dictionary of all numerical columns that can be memory optimized, in {column name: optimized data type} format.
- Type:
dict, readonly
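One way such a mapping could be built is with `pandas.to_numeric` and its `downcast` option. This is only a sketch: `find_cols_to_optimize` is a hypothetical helper, and the class may choose target types differently.

```python
import pandas as pd

def find_cols_to_optimize(df):
    """Hypothetical helper: map numeric columns to a smaller dtype where one exists."""
    result = {}
    for col in df.select_dtypes(include="number").columns:
        series = df[col]
        kind = "integer" if pd.api.types.is_integer_dtype(series) else "float"
        smaller = pd.to_numeric(series, downcast=kind).dtype
        # Record the column only if the downcast dtype really uses fewer bytes.
        if smaller.itemsize < series.dtype.itemsize:
            result[col] = smaller
    return result

df = pd.DataFrame({
    "small_int": [1, 2, 3],        # fits in int8
    "big_int": [1, 2, 2**40],      # needs int64, so it is left out
    "price": [1.5, 2.5, 3.5],      # exactly representable in float32
})
cols_to_optimize = find_cols_to_optimize(df)
print(cols_to_optimize)
```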
- clean_df.CleanDataFrame.outliers
A dictionary of outlier details in descending order, in {column name: outlier details} format. Each outlier details array contains:
The number of lower outliers.
The number of upper outliers.
The total number of outliers.
The percentage of the total values that are outliers.
- Type:
dict, readonly
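The source does not say how outliers are detected or which value drives the descending order. The sketch below assumes the common 1.5 × IQR rule and sorts by the total number of outliers; `outlier_details` is a hypothetical helper, not part of the class.

```python
import pandas as pd

def outlier_details(series, k=1.5):
    """Hypothetical helper: count outliers with the k * IQR rule (an assumption)."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower = int((series < q1 - k * iqr).sum())
    upper = int((series > q3 + k * iqr).sum())
    total = lower + upper
    pct = round(100 * total / len(series), 2)
    return [lower, upper, total, pct]

df = pd.DataFrame({
    "x": [10, 11, 12, 13, 14, 15, 16, 17, 18, 100],
    "y": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
})
details = {col: outlier_details(df[col]) for col in df.columns}

# Keep only columns that have outliers, sorted by total count (assumed sort key).
outliers = dict(
    sorted(
        ((col, d) for col, d in details.items() if d[2] > 0),
        key=lambda item: item[1][2],
        reverse=True,
    )
)
print(outliers)
```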
- clean_df.CleanDataFrame.missing_cols
A dictionary of missing value details in descending order, in {column name: missing details} format. Each missing details array contains:
The total number of missing values.
The percentage of the total values that are missing.
- Type:
dict, readonly
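A dictionary of this shape could be produced with plain pandas as sketched below. Sorting by the missing count and leaving out columns with no missing values are assumptions; the source only says the entries are in descending order.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": ["x", None, "z", "w"],
    "c": [1, 2, 3, 4],
})

# Count missing values per column, keep columns that have any, largest first.
counts = df.isna().sum()
counts = counts[counts > 0].sort_values(ascending=False)

missing_cols = {
    col: [int(n), round(100 * int(n) / len(df), 2)]
    for col, n in counts.items()
}
print(missing_cols)
```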
- clean_df.CleanDataFrame.cat_cols
A dictionary of columns that can be converted to categorical type, in {column name: array of unique values} format.
- Type:
dict, readonly
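The sketch below shows how max_num_cat could act as the cutoff for this dictionary. It assumes that only non-numerical columns are candidates and that the limit is inclusive; neither detail is stated in the source.

```python
import pandas as pd

max_num_cat = 3
df = pd.DataFrame({
    "color": ["red", "blue", "red", "green"],  # 3 unique values: qualifies
    "name": ["a", "b", "c", "d"],              # 4 unique values: too many
    "size": [1, 2, 3, 4],                      # numerical: skipped (assumption)
})

cat_cols = {
    col: df[col].unique()
    for col in df.select_dtypes(exclude="number").columns
    if df[col].nunique() <= max_num_cat
}
print(cat_cols)
```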
- clean_df.CleanDataFrame.num_cols
Array of numerical columns.
- Type:
numpy ndarray of str, readonly
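For illustration, an array of numerical column names can be obtained with `select_dtypes`. This is a sketch of the idea, not necessarily how the class computes it.

```python
import pandas as pd

df = pd.DataFrame({"name": ["a", "b"], "size": [1, 2], "price": [1.5, 2.5]})

# Column labels of all numerical columns, as a NumPy array.
num_cols = df.select_dtypes(include="number").columns.to_numpy()
print(num_cols)
```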
- clean_df.CleanDataFrame.__init__(self, df, max_num_cat=10) None:
Constructor for CleanDataFrame.
- Parameters:
df (pandas.DataFrame) – The dataframe to be cleaned and optimized.
max_num_cat (int, optional) – The maximum number of unique values in a column for it to be considered categorical, defaults to 10.
- clean_df.CleanDataFrame.report(self, show_matrix=True, show_heat=True, matrix_kws={}, heat_kws={}) None:
- Generate a summary report of the dataset, including:
Duplicated rows report.
Column data types for memory optimization report.
Columns to convert to categorical report.
Outliers report.
Missing values report.
- Parameters:
show_matrix (bool, optional) – A flag to control whether to show the missing value matrix plot or not, defaults to True.
show_heat (bool, optional) – A flag to control whether to show the missing value heatmap plot or not, defaults to True.
matrix_kws (dict, optional) – Keyword arguments passed to the missing value matrix plot, defaults to {}.
heat_kws (dict, optional) – Keyword arguments passed to the missing value heatmap plot, defaults to {}.
- Raises:
TypeError – If any parameter has the wrong type.
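The kind of type checking described under Raises might look like the sketch below. `check_report_args` is a hypothetical stand-in that only mirrors the documented parameter names and types; it does not produce any report or plot.

```python
def check_report_args(show_matrix=True, show_heat=True, matrix_kws=None, heat_kws=None):
    """Hypothetical validator mirroring the documented TypeError behaviour."""
    # None stands in for the documented {} default to avoid a shared mutable default.
    matrix_kws = {} if matrix_kws is None else matrix_kws
    heat_kws = {} if heat_kws is None else heat_kws

    for name, value in (("show_matrix", show_matrix), ("show_heat", show_heat)):
        if not isinstance(value, bool):
            raise TypeError(f"{name} must be bool, got {type(value).__name__}")
    for name, value in (("matrix_kws", matrix_kws), ("heat_kws", heat_kws)):
        if not isinstance(value, dict):
            raise TypeError(f"{name} must be dict, got {type(value).__name__}")
    return matrix_kws, heat_kws

# Valid arguments pass through unchanged.
print(check_report_args(show_matrix=False, heat_kws={"fontsize": 8}))
```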
- clean_df.CleanDataFrame.clean(self, min_missing_ratio=0.05, drop_nan=True, drop_kws={}, drop_duplicates_kws={}) None:
Drops columns with a high ratio of missing values and duplicate rows.
- Parameters:
min_missing_ratio (float, optional) – The minimum ratio of missing values for columns to drop. Value should be between 0 and 1. Defaults to 0.05.
drop_nan (bool, optional) – A flag to decide whether to drop any rows that still contain missing values after the columns with a missing ratio above min_missing_ratio have been dropped. Defaults to True.
drop_kws (dict, optional) – Keyword arguments passed to the drop() function. Defaults to {}.
drop_duplicates_kws (dict, optional) – Keyword arguments passed to the drop_duplicates() function. Defaults to {}.
- Raises:
TypeError – If drop_nan is not boolean, or if drop_kws or drop_duplicates_kws have the wrong types.
ValueError – If min_missing_ratio is not between 0 and 1, or if drop_kws or drop_duplicates_kws have a key inplace.
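The documented behaviour can be approximated with plain pandas as follows. `clean_sketch` is a hypothetical function, not the class method: it returns a new dataframe instead of updating an attribute, and it treats the threshold as inclusive, which the source does not specify.

```python
import numpy as np
import pandas as pd

def clean_sketch(df, min_missing_ratio=0.05, drop_nan=True,
                 drop_kws=None, drop_duplicates_kws=None):
    """Hypothetical approximation of the documented clean() steps."""
    drop_kws = dict(drop_kws or {})
    drop_duplicates_kws = dict(drop_duplicates_kws or {})
    if not isinstance(drop_nan, bool):
        raise TypeError("drop_nan must be a bool")
    if not 0 <= min_missing_ratio <= 1:
        raise ValueError("min_missing_ratio must be between 0 and 1")
    if "inplace" in drop_kws or "inplace" in drop_duplicates_kws:
        raise ValueError("'inplace' is not an accepted key")

    # Fraction of missing values per column; inclusive comparison is an assumption.
    ratios = df.isna().mean()
    to_drop = ratios[ratios >= min_missing_ratio].index
    out = df.drop(columns=to_drop, **drop_kws)
    if drop_nan:
        out = out.dropna()
    return out.drop_duplicates(**drop_duplicates_kws)

df = pd.DataFrame({
    "sparse": [np.nan, np.nan, np.nan, 1.0, np.nan],  # 80% missing
    "a": [1.0, 1.0, 2.0, np.nan, 3.0],                # 20% missing
    "b": ["x", "x", "y", "z", "w"],
})
cleaned = clean_sketch(df, min_missing_ratio=0.5)
print(cleaned)
```

With a threshold of 0.5 only the `sparse` column is dropped; the row with a missing value in `a` and the repeated first row are then removed.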
- clean_df.CleanDataFrame.optimize(self) None:
Optimizes the dataframe by converting numerical columns to their memory-optimized data types and converting categorical columns to the ‘category’ data type. Note that numerical columns should not contain missing values.
- Raises:
Warning – If any numerical column contains missing values.
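A minimal sketch of the conversion step, assuming the two mappings have already been computed. `optimize_sketch` is hypothetical, and skipping a column after warning about missing values is an assumption; the source only says that a warning is raised.

```python
import warnings

import numpy as np
import pandas as pd

def optimize_sketch(df, cols_to_optimize, cat_cols):
    """Hypothetical approximation: apply dtype conversions to a copy of df."""
    out = df.copy()
    for col, dtype in cols_to_optimize.items():
        if out[col].isna().any():
            # Assumed behaviour: warn and leave the column unchanged.
            warnings.warn(f"column {col!r} contains missing values; skipped", UserWarning)
            continue
        out[col] = out[col].astype(dtype)
    for col in cat_cols:
        out[col] = out[col].astype("category")
    return out

df = pd.DataFrame({"n": [1, 2, 3], "color": ["r", "g", "r"]})
optimized = optimize_sketch(df, {"n": "int8"}, ["color"])
print(optimized.dtypes)
```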