
7 Simple Python Functions to Clean Your Data

Automating tasks with Python is easy. Once you have a script that works, turn it into a function that will help you code more efficiently!

Whether you are a Python beginner or a pro, whether you use Python at work or at home, chances are you also have some specific tasks that you have to perform several times a day/week/month. Do you find yourself going back to old projects trying to find that awesome bit of code you developed for another project? "Where is it? I could swear I did this exact thing in another project!" – is this a familiar thought?

This "problem" might be especially relevant if, like me, you do a lot of extracting, transforming and loading (ETL) of data in your work.

There are several examples I can think of, and I am sure you can find hundreds of other articles with cool tricks and functions, some better and some worse. I’ve seen them too. But most of them will show you the same snippets all over again.

That's why I tried to step away from the typical examples and share something that I actually created and use regularly! It's not that these are the most amazing, well-written, mind-blowing functions ever, but at least I'm pretty sure you haven't seen them elsewhere, and maybe you can even tweak one or two to help you out in the future. In my next article, I will show you how to create your own Class in Python.


What kind of tasks are you talking about?

Do you have a folder where you upload csv files every day and then need to merge them from time to time into one big dataset? Or maybe you want to add one column to every file in a specific folder and save them under a different name? Maybe the data you extract from that database comes with weird column names, or one of the columns needs to be split in two.

You get the idea, right?!

I am going to share some snippets of code to address those issues and a few more! But enough teasing already. Below there is a list of the type of tasks we will be coding in this article. I hope some of them are as useful to you as they are to me almost daily!

  • Merging all files from a specific folder
  • Editing every file in the same folder and re-saving them
  • Cleaning the headers of your datasets
  • Splitting dataframe columns into two or more columns
  • Filtering specific dataframe columns based on their column names
  • Calculating the number of days between two dates
  • Calculating the number of weeks/months/years between two dates

Handling csv files

I remember I was quite proud when I got these two functions to work. Several minutes (or even hours) magically turned into a few seconds. "It’s the little things in life", I guess!

Having to deal with a lot of csv files throughout the day, I have daily extractions with daily data that needs to be compiled every week or so. If you have similar tasks, you’re probably going to like these!

Merge all files in the same folder

This one started as a simple function but then evolved a bit. The starting point was just pointing to a folder with the csv files I wanted to merge, and saving the merged dataset in a new file. Please note that I am using files with the exact same structure, but you can easily adapt this code to match your needs.

But csv files can only fit so many rows… So after a while, I added an option to merge the files in memory only. And I added an input variable, so I could just call the function and it would ask me for the path, instead of me copying the path when I was writing the command itself.

You just need to import glob and pandas for this one. The strftime is just a nice way to keep track of when the file was created, by adding the date to the filename.
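Here is a minimal sketch of what that function looks like on my side. I'm assuming the files share a structure, the folder path comes from an input() prompt, and the memory argument (covered below) decides whether anything is written to disk:

```python
import glob
from datetime import datetime

import pandas as pd

def folder_csv_merge(file_prefix, memory='no'):
    """Merge every csv file in a folder into one dataset.

    Prompts for the folder path, stacks all csv files found there
    (assumed to share the same columns) and either saves the result
    to a new, date-stamped file or returns it as a dataframe.
    """
    path = input('Please enter the folder path: ')
    files = glob.glob(path + '/*.csv')
    # Read every file and stack them into a single dataframe
    merged = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)

    if memory == 'yes':
        return merged  # keep the merged data in memory only

    # strftime adds the creation date to the filename, e.g. my_prefix_20240131.csv
    stamp = datetime.now().strftime('%Y%m%d')
    merged.to_csv(path + '/' + file_prefix + '_' + stamp + '.csv', index=False)
    print('File saved!')
```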

By default, you can just run it like this: folder_csv_merge('my_prefix'). It will then prompt you to paste the file path. The new file should appear in the same folder as the originals. This may not be the best practice, so make sure to edit the last line to match the location you want the final files to be saved in.

If you simply want to get all the data from the csv files into a dataframe in memory rather than a new file, just add the argument memory='yes', but then you need to assign the result to a variable, like this: merged_file = folder_csv_merge('my_prefix', memory='yes'). You will now have a variable with all the files merged inside.

Edit several files at once

On a recent occasion, I realized I had spent a few hours waiting to extract hundreds of files, but forgot to add an important column to each of them while extracting. Waiting another day to have everything again was not an option…

No problem! That’s what for loops are… for?!

The first part is very similar because I wanted the console to prompt me for the file path. After specifying the file path, it will now loop over every file, and do something with each of them.
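As a sketch, assuming the column I needed to fix was called "seniority" (the real name doesn't matter, and any per-file edit fits inside the loop body):

```python
import glob

import pandas as pd

def fix_files_in_folder(column):
    """Apply the same edit to every csv file in a folder and re-save it."""
    path = input('Please enter the folder path: ')
    for file in glob.glob(path + '/*.csv'):
        df = pd.read_csv(file)
        # The fix I forgot upstream: bump the column by one.
        # Add any other edits here: rename columns, drop them, etc.
        df[column] = df[column] + 1
        df.to_csv(file, index=False)  # overwrites the original file
```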

In the example above, I had a column in every file that displayed how many days the customer had been with our company. The problem is that I forgot to add 1 to this value upon the extraction and transforming (because "seniority = 0" looks strange to me!). That was the only change I needed to make, but you can add more stuff if you want. Rename columns, delete columns, create new ones, etc.

This one saved me a lot of time and, if you must know, I did pat myself on the back once the job was done! It was actually used when I was trying to calculate Customer Retention, and if you are interested in that, please check my article about it!


Transforming dataframes to suit your needs

The next functions are all about cleaning those datasets you get from the guys at the Sales department, or any other colleagues that just can’t be bothered to care about data consistency and readability! Even some databases built a long time ago can have weird column names that look like sentences, filled with strange characters.

Cleaning dataframe headers

This one is pretty easy, and quite useful. Pretty much self-explanatory! It is also important because you want to keep your column names consistent across your projects.
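A minimal version of that header cleaner might look like this (exactly which characters you strip will depend on your data; the regex below is just the rule I find myself wanting most often):

```python
import pandas as pd

def clean_header(df):
    """Normalise column names: lowercase, underscores, no odd characters."""
    df.columns = (
        df.columns.str.strip()                              # drop stray spaces
        .str.lower()                                        # one consistent case
        .str.replace(' ', '_')                              # spaces to underscores
        .str.replace(r'[^0-9a-z_]', '', regex=True)         # drop everything else
    )
    return df
```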

Split a column based on its contents

Next up, a simple function to help you split a column based on a specific character or string.

Let's say you have a column "tag" in a dataframe that actually consists of a category and a sub-category, separated by an underscore. You would rather have these two apart, which means we need a column named "category" and another named "subcategory". Easy:

split_columns(dataframe, 'tag', 'category', 'subcategory', '_')
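Behind that call is a thin wrapper around pandas' str.split. Here is a sketch, splitting only on the first occurrence of the separator so a sub-category can itself contain the character:

```python
import pandas as pd

def split_columns(df, column, new_col1, new_col2, separator):
    """Split one column into two new columns on the first separator found."""
    df[[new_col1, new_col2]] = df[column].str.split(separator, n=1, expand=True)
    return df
```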

Check your dataframe again, and now you have two new columns!

Filter specific columns based on their names

This is not something I use every day, but there is a specific file that made me create it. That file has several columns with the same names for different years. But I typically only need to work with one year at a time, so I have to filter the relevant columns.

I included a small example at the end of this snippet to make it clearer! Just run it and see for yourself how it works.
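A sketch of the function plus that small example (the year-suffixed column names below are made up for illustration; pandas' own filter method does the heavy lifting):

```python
import pandas as pd

def filter_columns(df, keyword):
    """Keep only the columns whose names contain the given keyword."""
    return df.filter(like=keyword)

# Small example: the same metrics repeated for different years
data = pd.DataFrame({'sales_2019': [10], 'costs_2019': [4],
                     'sales_2020': [12], 'costs_2020': [5]})
print(filter_columns(data, '2020'))   # only the 2020 columns survive
```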


Manipulating date columns

Back when I started learning Python, I remember one of the coolest features I found at the time was the built-in date/time functionality. We were all noobs at some point! These next functions assume you have already converted your date columns into datetime objects (for example, with pandas' to_datetime).

Calculate the number of days between dates

I'll be the first to admit that this next function is not the greatest breakthrough of all time… unless the columns you want to use are always the same and only the dataframe changes. If that's the case, you can hardcode the column names in the function and call it very easily with a single argument: the name of the dataframe.
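A sketch with the column names hardcoded. I'm using "signup_date" and "last_login" here, the two dates from my own files that come up again in the next section, so swap in your own names:

```python
import pandas as pd

def calc_days(df):
    """Add a 'CohortIndex' column: days between signup and last login.

    Assumes the dataframe has datetime columns 'signup_date' and
    'last_login'; hardcoding them keeps the call to one argument.
    """
    df['CohortIndex'] = (df['last_login'] - df['signup_date']).dt.days
    return df
```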

This function creates a new column with the difference between the two dates. More on what the "CohortIndex" is in the next one.

Calculate the number of weeks/months/years between dates

To wrap things up, I'll share these two for the price of one! I needed them for the Customer Retention article I mentioned previously, where they were used to calculate the cohorts.

Daily cohorts can be obtained with the function above, but it would not work for monthly cohorts, or yearly cohorts. If you want to understand why, you can check out the article to get a sense of what I needed this for.

The first function, get_date_int, is necessary so we can separate the year, month and week from each date. It returns an integer for each of those, and we will incorporate it in the calc_cohorts function.

Moving on to the second function, we need a dataframe with two date columns (in this case, it’s the signup date and the last login date, but you can change the names to suit your project).
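Here is a sketch of both functions. Two assumptions to flag: I add 1 so the first week/year counts as cohort 1 rather than 0 (drop it if you prefer zero-based cohorts), and I use ISO week numbers, which can be off by one for dates right around New Year:

```python
import pandas as pd

def get_date_int(df, column):
    """Return the year, month and week of each date as integer series."""
    year = df[column].dt.year
    month = df[column].dt.month
    week = df[column].dt.isocalendar().week.astype(int)
    return year, month, week

def calc_cohorts(df):
    """Add weekly and yearly cohort indexes between two date columns."""
    signup_year, _, signup_week = get_date_int(df, 'signup_date')
    last_year, _, last_week = get_date_int(df, 'last_login')
    years_diff = last_year - signup_year
    # Weekly cohort: whole weeks elapsed between the two dates, plus 1
    # so the signup week itself is cohort 1
    df['cohort_w'] = years_diff * 52 + (last_week - signup_week) + 1
    df['cohort_y'] = years_diff + 1
    return df
```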

Once you run it, you get two new columns in your dataframe, called cohort_w and cohort_y. Your cohorts are ready to be used to calculate retention rates!


Stay tuned for more!

That is it for now, but as I mentioned in the beginning, I am writing the next article about how to create your own Class in Python. A class will let you use these functions directly as if you were using a "normal" package like Pandas or Numpy. No need to define the functions every time. Simply import your own package and use them directly.

In the meantime, I want to thank you for reading/commenting and following me, if that’s the case. It has been really rewarding to be able to share this journey and connect with so many people!

And don’t forget to check out some of my other articles:

Increase your Instagram followers with a simple Python bot

I was looking for a house, so I built a web scraper in Python!

Turn your previous Python projects into awesome tools – with Tkinter


Thank you for reading! As always, I welcome feedback and constructive criticism. If you’d like to get in touch, you can contact me here or simply reply to the article below.

