The world’s leading publication for data science, AI, and ML professionals.

How To Deal With Missing Values in Data Science

Three practical ways to deal with missing values in a DS project

Photo by Pierre Bamin on Unsplash

When dealing with real-world data, you may often find missing values in your data frame. This can happen for several reasons, for example:

  • some measurements may be missing
  • lack of information
  • transcription errors

So the question is: how should we deal with missing data? Can we accept 0 as a value for missing data? Can we drop the rows that contain it?

In this article, I’ll show you three ways to deal with missing data and I’ll answer these questions.

1. Ask questions

The first thing to do when you find missing values in your dataset is to ask questions, because asking questions helps you understand the problem. Understanding the problem is the most important task in a Data Science project: we cannot provide value if we do not understand the problem.

If someone provided you with data, ask them questions like:

  • where did you get the data?
  • why are there missing values?
  • what do these features mean? Can I accept 0 as a value for this missing data?

If you got the data on your own, ask yourself the same questions.

Also, Google a lot. Search for reference values for the data you are missing and try to understand which value you could assign to it (remember: even 0 is a value!).

Also, if you can, try to contact specialists in the field. For example, if you are dealing with a medical dataset, contact a doctor (maybe your own doctor!) and ask questions about the data you have, especially about the missing values.

2. Dropping rows/columns

There are some cases in which we can drop the columns or the rows that contain missing values or NaNs (NaN = Not a Number). Note that missing data can also hide behind placeholder strings like "not measured" or "missing".

As said before, we have to be reasonably sure that dropping these columns/rows is the right choice. For example, in this project I made, I found some missing values and decided to drop the rows.

Let’s analyze the case: the data describe food production over the years, for all the countries in the world. Calling our data frame "df", this is what I found:

df.isnull().sum()
>>>
   Area Abbreviation      0
   Area Code              0
   Area                   0
   Item Code              0
   Item                   0
                       ... 
   Y2009                104
   Y2010                104
   Y2011                104
   Y2012                  0
   Y2013                  0
   Length: 63, dtype: int64
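
As a self-contained sketch of this check, here is how isnull().sum() behaves on a toy data frame (the columns and values below are invented for illustration, not taken from the article's dataset):

```python
import numpy as np
import pandas as pd

# Toy data frame with a few missing values
toy = pd.DataFrame({
    "Area":  ["Italy", "France", "Spain"],
    "Y2009": [10.0, np.nan, 3.0],
    "Y2010": [np.nan, np.nan, 5.0],
})

# Count the missing values in each column
counts = toy.isnull().sum()
print(counts)  # Area: 0, Y2009: 1, Y2010: 2
```
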

We have 104 null values for the years 2009, 2010, and 2011 (and the columns listed here are not the only ones in the data frame). If we take a look at the data, we can also see that for some years, in some countries, the recorded food production is 0 tons. This can mean one of two simple things: either the data were not recorded, or that particular food was not produced in that particular country in that year (or has never been produced in that country at all!).

With that simple analysis, I decided to drop the rows in which production is 0 for every year, using the following code:

# keep only the rows with at least one non-zero value in the year columns
df = df.loc[(df.loc[:, 'Y1961':'Y2013'] != 0).any(axis=1)]
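
The same filter can be reproduced on a tiny hypothetical frame; the areas and year columns here are invented just to show the mechanics:

```python
import pandas as pd

# Hypothetical frame: row "A" has 0 production for every year
df = pd.DataFrame({
    "Area":  ["A", "B", "C"],
    "Y1961": [0, 5, 0],
    "Y2013": [0, 0, 2],
})

# Keep only the rows with at least one non-zero value in the year columns
df = df.loc[(df.loc[:, "Y1961":"Y2013"] != 0).any(axis=1)]
print(df["Area"].tolist())  # ['B', 'C'] -- row "A" is dropped
```
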

3. Substitute the missing values with the mean

Sometimes the value 0 is simply not acceptable and, maybe because we have little data, deleting rows/columns is not an option. So, what can we do? One possibility is to substitute the null values with the mean of the other values in the same row/column.

For example, in this project, I analyzed a dataset in which I used Machine Learning to predict diabetes. In such a case, it is easy to understand that we cannot accept 0 as a value for BMI (Body Mass Index) or for blood pressure. Since the data were few, I could not drop any rows. So, I filled the zeros with the mean of the other values in the same column, using the following code:

# filling zeros with the column mean
import numpy as np

non_zero = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
for column in non_zero:
    diab[column] = diab[column].replace(0, np.nan)     # mark zeros as missing
    mean = int(diab[column].mean(skipna=True))         # mean of the valid values
    diab[column] = diab[column].replace(np.nan, mean)  # impute the mean

This way, I substituted the zeros in the BMI column with the mean of the other BMI values, and so on for the other columns.
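
On a single toy column, the effect of that loop looks like this (the BMI values below are invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical BMI column where 0 marks a missing measurement
diab = pd.DataFrame({"BMI": [0.0, 30.0, 20.0, 0.0, 25.0]})

diab["BMI"] = diab["BMI"].replace(0, np.nan)     # mark zeros as missing
mean = int(diab["BMI"].mean(skipna=True))        # mean of 30, 20, 25 -> 25
diab["BMI"] = diab["BMI"].replace(np.nan, mean)  # fill with the mean
print(diab["BMI"].tolist())  # [25.0, 30.0, 20.0, 25.0, 25.0]
```
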

Conclusions

Dealing with missing values is always hard because we have to make decisions, and these decisions must be well thought out before writing the actual code.

So, first ask questions and Google a lot, and try to deeply understand the problem and the data you are dealing with. Then decide what to do with your missing data (e.g., accept a null value, drop columns/rows, or substitute the missing/null values with the mean).



