
TL;DR: The klib package provides a number of easily applicable functions with sensible default values that can be used on virtually any DataFrame to assess data quality, gain insight, and perform cleaning operations and visualizations. The result is a much lighter Pandas DataFrame that is more convenient to work with.
While the previous article focused mainly on visualizations, this piece demonstrates the data cleaning capabilities of the latest klib release, which comes with a number of improvements targeted at facilitating data cleaning and preparation.
For those of you who want to follow along, make sure you have access to the Kaggle API to download the data. For that, create an API token in your Kaggle account settings and save it under ~/.kaggle/kaggle.json.
We download three datasets using the Kaggle API, unzip them, and read the resulting .csv files into pd.DataFrames. We then hand them to klib.data_cleaning() using the default settings and obtain the cleaned DataFrames.
Alternatively, I encourage you to follow along using your own data! In this case, just read the data in and pass it to the data_cleaning() function.
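For illustration, here is a minimal sketch of that setup. The Kaggle dataset slug and file name refer to the credit-card fraud dataset and are assumptions on my part; substitute your own data as needed.

```python
import klib
import pandas as pd

# Assumes the Kaggle API is configured and the dataset has been fetched, e.g.:
#   kaggle datasets download -d mlg-ulb/creditcardfraud --unzip
df = pd.read_csv("creditcard.csv")

# Default settings: clean column names, drop empty and single-valued columns,
# drop duplicate rows, and optimize the datatypes.
df_clean = klib.data_cleaning(df)
```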
Performance gains and memory savings
The following table highlights the key differences between the original and the cleaned datasets. We can see that even for mostly numerical data, as is the case for the "fraud" dataset, the size already drops by about 50%. For mixed datatypes, which we find in the US pollution dataset, the savings are significantly higher: from more than 1.3 GB down to about 200 MB!
| Dataset | Before | After |
| --- | --- | --- |
| Fraud | Shape: (284807, 31) | Shape: (283726, 31) |
| Pollution | Shape: (1746661, 29) | Shape: (1746661, 25) |
| Hotel | Shape: (515738, 17) | Shape: (515212, 17) |
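The shapes above are printed by klib itself; the memory footprints can be verified with pandas' own accounting. A minimal sketch, where df and df_clean stand in for any original/cleaned pair:

```python
import pandas as pd

def mem_mb(frame: pd.DataFrame) -> float:
    """Deep memory footprint of a DataFrame in megabytes."""
    return frame.memory_usage(deep=True).sum() / 1024 ** 2

# 'df' and 'df_clean' are placeholders for your own original/cleaned pair.
print(f"before: {mem_mb(df):,.1f} MB -> after: {mem_mb(df_clean):,.1f} MB")
```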

Aside from significant memory savings, these optimizations also reduce the computation time for anything that follows, for instance transformations on your DataFrame or queries. As we can see in the table above, these speed-ups are typically in a similar range as the memory savings.
A value that stands out is the slightly slower (17 ms) call to the nlargest() method for the hotel dataset. While this may be explained by other tasks running on my laptop or by CPU throttling, it could also be due to some numerical optimization working better on the original datatype.
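To separate real effects from one-off noise, such calls can be timed over many repetitions. A minimal sketch using Python's standard timeit module; df, df_clean, and the column name "adr" are placeholder assumptions:

```python
import timeit

# 'df', 'df_clean' and the column "adr" are placeholders for your own data.
for label, frame in [("before", df), ("after", df_clean)]:
    per_call = timeit.timeit(lambda: frame["adr"].nlargest(5), number=100) / 100
    print(f"{label}: {per_call * 1000:.2f} ms per call")
```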
Detailed functionality
klib.data_cleaning() performs a number of steps, among them:
- cleaning the column names: This unifies the column names by formatting them: splitting, among others, CamelCase into camel_case, removing special characters as well as leading and trailing white-spaces, and converting all column names to lowercase_and_underscore_separated. It also checks for and fixes duplicate column names, which you sometimes get when reading data from a file.
Some column name examples:
Yards.Gained --> yards_gained
PlayAttempted --> play_attempted
Challenge.Replay --> challenge_replay
- dropping empty and virtually empty columns: You can use the parameters drop_threshold_cols and drop_threshold_rows to adjust the dropping to your needs (see the sketch after this list). The default is to drop columns and rows with more than 90% of their values missing.
- dropping single-valued columns: As the name states, this removes columns in which every cell contains the same value. This comes in handy when columns such as "year" are included while you’re only looking at a single year. Other examples are "download_date" or indicator variables that are identical for all entries.
- dropping duplicate rows: This is a straightforward drop of entirely duplicated rows. If you are dealing with data where duplicates add value, consider setting drop_duplicates=False.
- optimizing the datatypes: Lastly, and oftentimes most importantly, especially for memory reduction and therefore for speeding up the subsequent steps in your workflow, klib.data_cleaning() also optimizes the datatypes, as we saw in the table above.
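All of these steps can be tuned through the function's optional parameters. A minimal sketch; the values shown are the documented defaults at the time of writing:

```python
import klib

# 'df' is any raw DataFrame; the keyword values are klib's defaults.
df_clean = klib.data_cleaning(
    df,
    drop_threshold_cols=0.9,  # drop columns with more than 90% missing values
    drop_threshold_rows=0.9,  # drop rows with more than 90% missing values
    drop_duplicates=True,     # drop entirely duplicated rows
    convert_dtypes=True,      # downcast datatypes where possible
)
```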
If you’d like to check for yourself or investigate further, take a look at the notebook I’ve used to create these benchmarks. Feel free to share the improvements you see on your own data in the comments below!
Conclusion
All in all, klib.data_cleaning() offers a very convenient way to clean your data. Simply plug your DataFrame into the function and customize the cleaning process using the wide variety of optional parameters.
My typical workflow comprises the following steps, sketched below:
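A minimal sketch of such a workflow, combining the cleaning function with the plotting functions from the previous article; the file name is a placeholder:

```python
import klib
import pandas as pd

df = pd.read_csv("data.csv")   # load the raw data (placeholder file name)

klib.missingval_plot(df)       # inspect missing values first
df = klib.data_cleaning(df)    # clean, deduplicate and downcast
klib.corr_plot(df)             # check feature correlations
klib.dist_plot(df)             # look at numerical distributions
```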
Please note that the datasets used in the examples are already pretty well formatted and barely contain any missing values. In messy real-world data, the savings are often considerably higher, since many NaNs are removed.
Bonus:
Another function that recently received a major update is klib.dist_plot(). In contrast to the previously introduced klib.cat_plot(), it is used to visualize the distributions of numerical features. Additionally, it includes the values of the first four moments (mean, standard deviation, skewness, and excess kurtosis), giving you a good idea about the need for transformations, such as log, root, or square, and also highlighting possible outliers.
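Usage is as simple as with data_cleaning(): pass in a DataFrame and klib picks out the numerical features itself. A brief sketch, reusing the cleaned DataFrame from above:

```python
import klib

# Plots the distribution of each numerical feature, annotated with mean,
# standard deviation, skewness and excess kurtosis.
klib.dist_plot(df_clean)
```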

Coming up next:
- mv_col_handling() → Follows a sophisticated three-step process, attempting to identify any remaining information in columns with many missing values, instead of simply dropping them right away.
- pool_duplicate_subsets() → Checks for duplicates in subsets of columns and pools them as efficiently as possible, balancing memory reduction against loss of information.
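Both functions already ship with klib, so the curious can try them right away. A brief, hedged sketch with default settings; the target column name is an assumption:

```python
import klib

# Try to salvage information from sparse columns instead of dropping them.
# 'target' is a hypothetical label column in df_clean.
df_mv = klib.mv_col_handling(df_clean, target="target")

# Pool subsets of columns that are largely duplicated across rows.
df_pooled = klib.pool_duplicate_subsets(df_clean)
```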