Long story short, this compilation of data cleaning codes is the updated version of my previous article – The Simple Yet Practical Data Cleaning Codes that unexpectedly went viral.
Having received a lot of messages from data scientists and data professionals about how the toolbox has helped them in their day-to-day data cleaning tasks, I hope to make it more complete by updating it with my latest compilation of data cleaning codes.
By the end of this article, I hope you'll find some of these data cleaning codes helpful for your own data cleaning tasks, and that you can put them to work in the shortest time possible.
Think of this toolbox as your arsenal as a data scientist – a collection of general, useful codes that you can execute with little to no modification for your common data cleaning tasks.
Let’s get started!
Why Do You Need This Toolbox?
The world is imperfect, and so is data.
It doesn’t matter whether you’re an aspiring data scientist, experienced data scientist, or data analyst.
You've probably found yourself stuck in data cleaning tasks for 40%–70% of your time before you could even start analyzing and making sense of the data you had.
And yes, the reality is that this process is often not avoidable (unless you have someone to help you clean data for you).
Personally, I've been through these time-consuming and tedious stretches myself, as data cleaning is simply too important to neglect.
Therefore, just a few months ago, I started building my personal toolbox for data cleaning tasks, as I noticed that a lot of data cleaning codes could actually be reused across many common data cleaning tasks.
In other words, these codes could be generalized to be used for other common scenarios of data cleaning where they have similar patterns – hence the compilation of all my previous data cleaning codes.
This is crucial, as my toolbox has saved me tons of time otherwise spent thinking and googling – I can just "copy and modify" from my codebase in seconds.
By sharing this toolbox with you, I hope to save your time and make your Data Science workflow much more efficient, so that you can focus on other important tasks.
My Updated Toolbox for Data Cleaning
The codes below are added on top of the previous data cleaning codebase. Feel free to check out my previous article – The Simple Yet Practical Data Cleaning Codes – for more data cleaning codes.
In the following code snippets, the codes are wrapped in functions so that they are self-explanatory. You can always use the codes directly, without putting them into functions, with a small change of parameters.
1. Rename column names
When faced with column names that contain capital letters or spaces, very often we need to rename the columns, replacing those characters with lowercase letters and underscores – to be cleaner and more Pythonic. Also, we want to make the column names as explicit as possible, so that your colleagues will roughly know what a column contains just by looking at its name.
Sounds trivial, but important.
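The renaming snippet itself isn't embedded in this text, so here is a minimal sketch of what such a function might look like (the function name and the example column names below are my own illustration):

```python
import pandas as pd

def rename_columns(df):
    # Lowercase every column name and replace spaces with underscores
    df = df.copy()
    df.columns = [col.strip().lower().replace(' ', '_') for col in df.columns]
    return df

df = pd.DataFrame({'Customer ID': [1, 2], 'Full Name': ['Ann', 'Bob']})
df = rename_columns(df)
print(df.columns.tolist())  # ['customer_id', 'full_name']
```

The copy() call keeps the original dataframe untouched, which makes the function safe to reuse across notebooks.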
2. List comprehension
You may think that list comprehension is so common and wonder how it could be considered part of data cleaning. Well… To me, list comprehension is so elegant and clean that you can abstract your logic into a single line of code without any for loops, and it often runs faster than the equivalent for loop.
Typically, I’d use list comprehension if I want to get a list of values based on certain conditions to append to the existing dataframe or use that for further analysis.
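As an illustration of that pattern (the dataframe and the spending threshold here are hypothetical), a list comprehension can derive a new column of values based on a condition in one line:

```python
import pandas as pd

df = pd.DataFrame({'name': ['Ann', 'Bob', 'Carl'], 'spend': [120, 45, 300]})

# One line, no explicit for loop: flag customers whose spend exceeds 100
df['high_spender'] = ['yes' if s > 100 else 'no' for s in df['spend']]
print(df['high_spender'].tolist())  # ['yes', 'no', 'yes']
```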
3. Output a dataframe without NaN values for a particular column
This is useful when you want to output a dataframe with all the available data that a column has.
For example, you have a dataframe with all customers' information, and you want to output an updated dataframe with all the available customer IDs, removing the rows with missing IDs. In this case, the code should look like this: df = df[df['id'].notnull()].

The same concept can be applied to time series data, where the column in question is a timestamp.
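A quick sketch of that one-liner in action (the customer data below is made up for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': [101, np.nan, 103],
                   'name': ['Ann', 'Bob', 'Carl']})

# Keep only the rows where the id column is not missing
df = df[df['id'].notnull()]
print(df['name'].tolist())  # ['Ann', 'Carl']
```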
4. Output a dataframe based on unique last strings in a column
Using the dataframe with all customers' information again, let's add a timestamp column for each id.

Now each id is no longer unique – it's repeated throughout its respective time period. For each id, you want to get the last row, because you only care about the final customer information for each id.

This is where you can use drop_duplicates to drop all duplicated ids and just keep the last row. This makes sure that we always get the final customer information for each unique id. Again, this can be done with just a single line of code!
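The original snippet isn't shown here, but a minimal sketch using pandas' drop_duplicates (with a made-up customer dataframe) might look like this:

```python
import pandas as pd

df = pd.DataFrame({
    'id': [101, 101, 102, 102],
    'timestamp': pd.to_datetime(['2021-01-01', '2021-01-05',
                                 '2021-01-02', '2021-01-06']),
    'status': ['new', 'active', 'new', 'churned'],
})

# Sort by timestamp so "last" means the most recent row,
# then keep only the final row for each id
df = df.sort_values('timestamp').drop_duplicates(subset='id', keep='last')
print(df['status'].tolist())  # ['active', 'churned']
```

Sorting first matters: keep='last' keeps the last occurrence in row order, which only corresponds to the latest record if the rows are sorted by timestamp.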
5. Get discrete intervals from numerical values
This is one of my favourite tools when it comes to converting the numerical values in a column into discrete intervals based on a specified range. I highly recommend pd.cut if you want to convert a continuous variable into a categorical one.

For instance, say we have a rating column that consists of numerical values from 1–10. What if we want to group these rating values according to a specified range? We can bin the values into discrete intervals and label them as bad, moderate, good, and strong using pd.cut.
Another common use case of pd.cut is converting ages into groups of age ranges, where you can categorize each age with a label.
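Here is a sketch of both use cases (the bin edges and labels below are my own assumptions, since the original snippet isn't embedded here):

```python
import pandas as pd

df = pd.DataFrame({'rating': [2, 5, 7, 10]})

# Bin 1-10 ratings into four labelled intervals;
# a value falls into a bin if edge_left < value <= edge_right
df['rating_group'] = pd.cut(df['rating'],
                            bins=[0, 3, 5, 8, 10],
                            labels=['bad', 'moderate', 'good', 'strong'])
print(df['rating_group'].tolist())  # ['bad', 'moderate', 'good', 'strong']

# Same idea for ages: map each age to an age-range label
ages = pd.Series([5, 25, 70])
age_group = pd.cut(ages, bins=[0, 12, 60, 120],
                   labels=['child', 'adult', 'senior'])
```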
Final Thoughts

Thank you for reading.
By the end of this article, I hope you'll find these little tools useful for your common data cleaning tasks, and that they make your work much more productive – and, of course, life much easier for someone who has to deal with messy data from time to time.
Again, the codes are by nature relatively simple to implement. I hope this updated toolbox of data cleaning codes – together with the one I shared in my previous article – gives you more convenience and confidence to perform data cleaning, as well as a better idea of what datasets typically look like, based on my experience.
As always, if you have any questions or comments feel free to leave your feedback below or you can always reach me on LinkedIn. Till then, see you in the next post! 😄
About the Author
Admond Lee is currently the Co-Founder/CTO of Staq – the #1 business banking API platform for Southeast Asia.
Want to get free weekly data science and startup insights?
Join Admond’s email newsletter – Hustle Hub, where every week he shares actionable data science career tips, mistakes & learnings from building his startup – Staq.
You can connect with him on LinkedIn, Medium, Twitter, and Facebook.