
Increase Productivity: Data Cleaning using Python and Pandas

Data cleaning can be time-consuming, but understanding the different types of missing values, and how to deal with them, will…

Getting Started

Photo by Max Duzij on Unsplash

According to IBM Data Analytics, Data Scientists can spend up to 80% of their valuable time simply finding, organising and cleaning data [1]. So, it’s natural that you want to increase productivity on data cleaning to get back to what matters most – generating insights. You will need to quickly handle a common type of messy data: missing values, also known as Not a Number (NaN).

In practice, some datasets arrive with missing data. For example, people usually don’t like sharing their income in online forms and surveys, so they will either leave it blank or input something completely unexpected. If they leave it blank, then it’s a reasonably straightforward issue.

However, values such as ‘n/a’ or ‘ – ’ should be treated as NaN, but Pandas will not recognise them as such. These unusual types of missing values make the process of data cleaning longer than it should be, especially for beginners and entrepreneurs who are not Data Scientists themselves.

Why should you care?

In statistical terms, if the number of cases is less than 5% of the sample, then you can drop them [2]. However, messy datasets can have much more than 5%. In this case, you will need to investigate further and handle the issues yourself. If the missing values are not handled correctly, you might reach inaccurate conclusions, and potentially, waste more time by repeating your analysis.

To avoid wasting time and improve productivity, follow the data cleaning checklist for missing values below. You will benefit from one of the most important Python libraries: Pandas.

Data cleaning checklist

We will use the laptops.csv file as an example. This CSV file was adapted from the Laptop Prices dataset on Kaggle. Below is what the raw data looks like, and you will notice there are a lot of missing values.

The following work is available on my GitHub. But, before you start going through the checklist, it’s always good practice to get an overview of your dataset. Then, we will move on to our first check.
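To get that first overview, a minimal sketch looks like the one below. The column names and rows here are illustrative stand-ins for laptops.csv, not the actual dataset schema; for the real file you would simply call pd.read_csv('laptops.csv').

```python
import io
import pandas as pd

# Illustrative stand-in for laptops.csv (the real file is much larger;
# these column names are assumptions, not the actual schema)
raw = io.StringIO(
    "Brand,Screen Size,Price,In stock\n"
    "Acer,15.6,400,Yes\n"
    "Dell,,700,No\n"
    "HP,14.0,NA,Yes\n"
)
df = pd.read_csv(raw)  # for the real file: pd.read_csv('laptops.csv')

print(df.head())   # first five rows
print(df.shape)    # (rows, columns)
df.info()          # column names, dtypes, and non-null counts
```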

1) Standard Type

Now, let’s begin by checking standard missing values. By ‘standard’ I mean values that Pandas will immediately detect, such as empty cells and NA values. You can check for standard missing values by applying the .isnull() method. It will return a boolean: True for missing values and False otherwise.
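Here is a small sketch of that check, using an illustrative sample in place of laptops.csv. An empty cell and ‘NA’ are standard formats that Pandas recognises on its own:

```python
import io
import pandas as pd

# Illustrative sample: an empty cell and 'NA' are standard missing
# values that Pandas detects without any extra configuration
raw = io.StringIO(
    "Brand,Price\n"
    "Acer,400\n"
    "Dell,\n"
    "HP,NA\n"
)
df = pd.read_csv(raw)

# Boolean Series: True where the value is missing
print(df['Price'].isnull())
```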

2) Non-standard Types

Frequently, if users enter data manually (rather than using a dropdown menu or multiple-choice answers), you will find inputs other than the standard types mentioned above, ‘NA’ or blank. Some users might prefer typing ‘na’ instead of ‘NA’; it is simply a matter of choice. In those cases, missing values come in different formats, which we can call non-standard missing values. Examples such as ‘n/a’, ‘na’, or ‘ ’ are not detected by Pandas. Take a look at indexes 7 and 8.

You can easily handle non-standard formats by creating a list and importing the file again. Pandas will then detect the new missing values. Here is how you can do that:
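A minimal sketch of that approach, again with an inline sample standing in for laptops.csv: list every non-standard format you have found so far, then pass the list to read_csv via its na_values parameter, and check with .isnull() that the new formats are now detected.

```python
import io
import pandas as pd

# Illustrative sample with non-standard missing-value formats
raw = "Brand,In stock\nAcer,Yes\nDell,n/a\nHP,na\nAsus,--\n"

# Collect every non-standard format you have found so far
missing_values = ['n/a', 'na', '--']

# Re-import the file, telling Pandas to treat these as NaN
df = pd.read_csv(io.StringIO(raw), na_values=missing_values)
# for the real file: pd.read_csv('laptops.csv', na_values=missing_values)

# The non-standard entries now show up as True
print(df['In stock'].isnull())
```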

Once you have created a list of non-standard missing values, check what happens to indexes 7 and 8 after applying the .isnull() method.

Voilà! Pandas has recognised the different formats as missing values.

Some DataFrames can be massive, and you will not see all formats straight away. Don’t worry: as you keep working and find more types of missing values, just add them to the list. Try to detect and convert non-standard types before summarising or counting the number of missing values. Otherwise, you will produce unreliable results.

3) Random Types

As a Data Scientist, you create models to predict what can happen. However, somewhere in the process, there is human input. So, you can expect the unexpected. For example, you might find a numeric type instead of a string (‘Male’ or ‘Female’). Random types of missing values are more common than we would like. Take a look at the value at index 3. Pandas could not find it, but you can.

Random missing values are not like standard ones, so we need a different approach to find them. Python is a versatile and flexible programming language, which allows you to tackle the same problem in different ways. Here is one technique you can use:
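A sketch of the for-loop, using an inline sample in place of laptops.csv; the ‘In stock’ column and its rows are illustrative. The idea: any entry that converts cleanly to an integer is a number sitting where a ‘Yes’/‘No’ string belongs, so it gets replaced with np.nan.

```python
import io
import numpy as np
import pandas as pd

# Illustrative sample: index 2 holds a number where 'Yes'/'No' is expected
raw = io.StringIO("Brand,In stock\nAcer,Yes\nDell,No\nHP,12\n")
df = pd.read_csv(raw)

cnt = 0
for row in df['In stock']:
    try:
        int(row)                          # succeeds only for numeric entries
        df.loc[cnt, 'In stock'] = np.nan  # replace the out-of-place number
    except ValueError:
        pass                              # normal text entries are left alone
    cnt += 1

print(df['In stock'])
```

Note that the cnt counter assumes the DataFrame still has its default integer index; if you have filtered or re-sorted the rows, iterate over df.index instead.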

Let’s unpack the for-loop above:

First, import NumPy. The for-loop iterates over each row in the selected column, ‘In stock’. The goal is to try to convert each entry to an integer using int(row). If the conversion succeeds, the entry is a number where a string was expected, so it is replaced with a missing value (NumPy’s np.nan). If the conversion fails, Python will pass and keep going.

Second, you will need to use try and except ValueError. This is called handling exceptions, and you can read more about it by checking Python’s documentation [3]. Calling int() on a non-numeric string raises a ValueError; without the except clause, the code would stop at the first entry it could not convert. Handling the exception lets the loop skip over those entries and continue.

Finally, make sure to use the .loc method. This is Pandas’ preferred method for modifying entries in place [4].

4) Summarising

Now that you have cleaned the missing values, you are ready to summarise them. Suppose you want to check the total number of missing values for each column.
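A sketch of the summary step, with an inline sample standing in for the cleaned laptops.csv. Chaining .isnull() with .sum() counts missing values per column, and a second .sum() gives the grand total:

```python
import io
import pandas as pd

# Illustrative sample standing in for the cleaned dataset
raw = io.StringIO(
    "Brand,Screen Size,Price\n"
    "Acer,15.6,400\n"
    "Dell,,700\n"
    "HP,14.0,NA\n"
)
df = pd.read_csv(raw)

# Missing values per column
print(df.isnull().sum())

# Total missing values in the whole DataFrame
print(df.isnull().sum().sum())
```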

Alright, now that you have summarised the number of missing values, it is time to do some simple replacements.

5) Replacing

Sometimes it’s not possible to delete the rows with missing values. Whenever that happens, you will have to sort it out yourself. So, here are some ways to replace missing values:
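A few common replacement patterns, sketched on an illustrative sample (the columns and values are stand-ins, not the real laptops.csv): fill with a constant, fill with a summary statistic such as the column mean, or overwrite one specific entry with .loc.

```python
import io
import pandas as pd

# Illustrative sample standing in for laptops.csv
raw = io.StringIO(
    "Brand,Screen Size,Price\n"
    "Acer,15.6,400\n"
    "Dell,,700\n"
    "HP,14.0,NA\n"
)
df = pd.read_csv(raw)

# 1) Replace missing values with a constant
df['Price'] = df['Price'].fillna(0)

# 2) Replace missing values with a summary statistic (here, the mean)
mean_size = df['Screen Size'].mean()
df['Screen Size'] = df['Screen Size'].fillna(mean_size)

# 3) Replace a single entry directly with .loc
df.loc[2, 'Price'] = 750
```

Which option is right depends on the column: a constant suits flags and counts, the mean (or median) suits roughly symmetric numeric data, and direct .loc assignment suits the odd entry you can verify by hand.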

Conclusion

You want to increase productivity, but data cleaning can be time-consuming. As a Data Scientist, you will inevitably deal with messy datasets containing missing values. So, keep in mind the different types of missing values (Standard, Non-Standard and Random) to quickly move forward (Summarising and Replacing) in your analysis. You will spend less time on data cleaning, avoid inaccurate inferences and become more productive.

Thanks for reading. Here are other articles you might like:

Save Time Using the Command-Line: Glob Patterns and Wildcards

Trends in Data Science That Will Change Business Strategies

Best Cities to Work as a Data Scientist

How to Boost Your Coding Skills


References:

[1] IBM – https://www.ibm.com/cloud/blog/ibm-data-catalog-data-scientists-productivity

[2] Schafer, J. L. (1999). Multiple imputation: a primer. Statistical methods in medical research, 8(1), 3–15.

[3] Python – https://docs.python.org/3/tutorial/errors.html#handling-exceptions

[4] Pandas – https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html

