Photo by user6694312 on Freepik

Best Practices for Handling Missing Data

Missing Data Effects on the Correlation Between Ice Cream Sales and Temperature

Missing data is present in many real-world datasets. Here I share one of the best techniques to deal with that.

Every single time I come across a data set that has no missing data, I instantly become skeptical about its integrity and about the data cleaning that, supposedly, was done beforehand.

We should never close our eyes and avoid dealing with missing data. We need to investigate and understand why we do not have missing data, and/or why we do have missing data. In any case, it does not hurt to ask; moreover, it’s our responsibility to truly care about the data we work with.

Two rules of thumb: always question a data set that has no missing data! Likewise, always question a data set that has missing data!

While not having missing data might seem like a good thing (and ultimately it is), the reality is that it is very common to find datasets with missing data. This can happen for a myriad of different reasons, and understanding why is important in order to mitigate the problem. Our focus today will be on techniques we can use to better manage missing data, not so much on why it happens.

In particular, we will focus on missing data in numeric values, because that is usually the bottleneck when dealing with missing data, and where a LOT of people make mistakes.

That said, there are basically two main approaches to deal with such a common problem:

A- Dropping observations where missing data is present (rows or columns, for instance).

B- Utilizing imputation to fill the blank space with a value, which is based on some statistical metric, or other parameters.

Missing Data Unveiled

One of the first things you want to do when exploring any data set is hunt for missing values. Once you have detected the presence of NAs (missing values), you want to work on understanding why they are missing, and whether the effort to get the “original” data is worth it, or even possible. You have to decide on this trade-off as quickly as possible.

The truth is that, regardless of the technique you use to deal with missing data, none of them will be better than having the original data. So, if you have the means, it’s always worth making an effort to get a data set with no missing data (that sounds very naïve, huh?!).

You can follow these three basic steps when first assessing a data set that you’ve never seen before — assuming it has missing data:

I — Get a total count of the missing data and establish an initial threshold for dealing with it. This will depend on the size and nature of your data set — there is no “one size fits all” solution here. For instance, if the missing data only accounts for about ~1% of the total data, you may decide to drop those observations. Remember, every time you delete features and/or observations, you are potentially reducing the power of your analysis. (A short sketch of this first check appears right after this list.)

II — Explore the reasons why the data is missing. One way to do that is to perform “Little’s Missing Completely at Random (MCAR) Test”, which checks whether or not the missing data are missing completely at random.

III — Check for patterns in the missing data. The common patterns are arbitrary (random) and monotone (systematic). To check for patterns, you can plot a graph that helps you visualize the missing data. The arbitrary pattern, which tends to look random, is the one we find most often.
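As a quick illustration of steps I and III, here is a minimal pandas sketch, assuming you work in Python. The numbers below are made up purely for illustration (they are not the article’s table), and Little’s MCAR test from step II is not shown because it requires a dedicated package.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data: monthly temperature (F) and ice cream sales ($),
# with a few artificially introduced gaps (NaN). Not the article's actual table.
df = pd.DataFrame({
    "Temperature": [41, 45, np.nan, 62, 71, 80, np.nan, 86, 77, 63, 52, 44],
    "Monthly Sales": [215, np.nan, 260, 320, 410, np.nan, 610, 590, 470, 330, 260, 225],
})

# Step I: how much is missing, per column and overall.
print(df.isna().sum())                 # NAs per column
print(df.isna().sum().sum())           # total NAs
print(df.isna().any(axis=1).mean())    # fraction of rows with at least one NA

# Step III: a crude look at the missingness pattern (True = missing).
print(df.isna().to_string())
```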

Alright, now that we’ve scratched a little bit of the surface, let’s circle back to our two main strategies to deal with missing data.

A— Dropping observations where missing data is present

I have noticed that many research papers take this approach when facing missing data: in many cases, people simply decide to drop features and observations. I guess this seems like a “safe place” to be, but the reality is that by deleting data we reduce our chances of getting the most out of our analysis. Because some data sets are difficult to get, we want to avoid dropping observations and features as much as possible.

What one should do is test different techniques and look for the one that gives a better outcome. For that, consider the questions you are trying to answer and the problems you are trying to solve. In a perfect world we would have no missing data, but the truth is that real-world data will most likely have missing data, and there is no perfect way or technique to deal with this issue.

B — Utilizing imputation to fill the blank space with a value, which is based on some statistical metric, or other parameters.

Here is where things get interesting. There are different imputation techniques available that we can pick from. Imputation can be scary, and in some cases not ideal. It’s rough terrain, and maybe one of the reasons people prefer to just drop data.

In the next section, we will perform an experiment to see how three different ways to deal with missing data can influence, for example, the correlation between two variables: temperature and ice cream sales throughout a given year.

The Experiment

Photo by stolyevych_yuliya on Freepik

In this experiment, I present you with three practical ways to deal with missing data:

1- Dropping rows where missing data is present.

2- Using the mean/median to impute values where we have missing data.

3- Using Maximum Likelihood Estimation to impute data.

The idea is to show how these three techniques influence the correlation between ice cream sales and temperature, and to see which is the worst and which is the best to use. Keep in mind that every data set is unique, so you might get a different result if you try this on a different data set.

The goal is to open the discussion and make the point that we must experiment before blindly picking a technique.

In the table below, you will see the values registered for all months of the year. We have all values for Temperature (in Fahrenheit) and Monthly Sales ($). As you can see, there are no missing values.

That’s our starting point, and what we will use as a benchmark to analyze the impact of the techniques listed above when deciding what to do when facing missing data.

Table 1 — Ice Cream Monthly Sales and Temperatures by Month. The data source can be found here.

Below are some basic statistics of the original data set (with no NAs). You won’t see the standard deviation here because I want to keep this simple and straightforward. However, depending on the technique you choose to deal with missing data, you can expect an effect of that on the standard deviation (and variance) as well.

Table 2 — Basic Stats

Also, here is a plot, so we can better visualize the events over time:

Image 1 — Ice Cream Sales and Temperature Over Time.

Just by looking at the two lines, we can see a strong pattern: they mimic each other. In other words, if one goes up, the other goes up as well, and vice versa. I will not explore the distinction between causation and correlation here.

The correlation between Temperature and Sales is about 0.96. This is a very strong correlation; in this case, a high positive one. Check this link here to learn a bit more.
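For reference, computing that Pearson correlation takes one line in pandas. The values below are invented stand-ins for Table 1, so the result will not match the article’s 0.96 exactly.

```python
import pandas as pd

# Hypothetical complete data (no NAs), standing in for Table 1.
df = pd.DataFrame({
    "Temperature": [41, 45, 53, 62, 71, 80, 88, 86, 77, 63, 52, 44],
    "Monthly Sales": [215, 230, 260, 320, 410, 540, 610, 590, 470, 330, 260, 225],
})

# Pearson correlation between the two columns.
print(round(df["Temperature"].corr(df["Monthly Sales"]), 2))
```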

I won’t cover collinearity in this post, but it’s worth mentioning that, if we were working on a regression model, we would have to check for collinearity. In the next plot, we can find the correlation score and other insights, like the distribution of the two variables we are studying.

Image 2 — Pairwise

We can better visualize this strong correlation by plotting the following scatterplot, with a confidence interval. This will be an important plot to keep in mind when we start messing with this data set.

Image 3 — Correlation Plot

Okay, this was all very basic analysis. Many people would assume that if the Temperature increases, the Ice Cream Sales would increase as well. Nothing really special up to this point, right?!

The point of doing all of this was just to show an analysis performed on a “perfect” data set. By perfect, I just mean without missing data. Now we know what the real numbers look like.

The problem when we first get a data set with missing values is that we never know what things would look like IF we had all the data (no NAs). That is why I did all of this initial assessment: to show you the “perfect world”.

1- Dropping rows where missing data is present.

Continuing our experiment, we now introduce missing values into the data set. For example, looking at the following table, you’ll notice that we don’t know how much we sold in March (the “Monthly Sales” column has an NA for March).

Table 3 — Introducing Missing Values (NAs).

In this approach, we are going to drop any row — in this case, month — that has missing data in any of the columns. This method is also known as “Complete-Case Analysis”. The problem is that, in the process, we end up throwing away original data that we did have.
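In pandas, complete-case analysis is essentially a one-liner. Here is a minimal sketch, again with made-up stand-in values rather than the article’s table:

```python
import numpy as np
import pandas as pd

# Hypothetical table with gaps (NaN), standing in for Table 3.
df = pd.DataFrame({
    "Temperature": [41, 45, np.nan, 62, 71, 80, np.nan, 86, 77, 63, 52, 44],
    "Monthly Sales": [215, np.nan, 260, 320, 410, np.nan, 610, 590, 470, np.nan, 260, 225],
})

# Complete-case analysis: keep only rows with no missing values at all.
complete = df.dropna()
print(len(df), "rows before,", len(complete), "rows after dropping NAs")

# Correlation computed on the surviving rows only.
print(round(complete["Temperature"].corr(complete["Monthly Sales"]), 2))
```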

As you can see below, after dropping all the NAs, the entire row for March is gone. On top of that, we now have fewer observations: instead of 12, we have 7.

Table 4 — After Dropping NAs

It turns out that the correlation between Temperature and Sales is now 0.85. Although this new correlation is still strong, that is already a significant drop.

The thing is, when you drop data from a real data set, you never know how much you’ve lost. Here, we were able to measure a little bit of that and get a glimpse of how dropping data can affect your analysis.

Check out what the next plot looks like after dropping the NAs:

Image 4 — Pairwise After Dropping NAs

As a general rule of thumb, we should only decide to drop data if the amount of missing data is small, probably somewhere in the neighborhood of 1% or so. Obviously, this is not set in stone; you always have to experiment.

2- Using the mean/median to impute values where we have missing data.

I confess I’m guilty of this. I mean, who has never used the mean or median to impute data before?! It is one of the most intuitive ideas that comes to mind. But please, from now on, never use this method again.

“We mean it, but I promise we’re not mean” — Lorde

Using the same table we used to drop the NAs, we now fill the missing values with the mean, as follows:

Table 5 — Basic Stats with Missing Values

Keep in mind that the results are quite similar whether you use the mean or the median. For simplicity, I am only showing the example using the mean. Using the median is also bad, although usually less harmful.
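For reference, this is how mean (or median) imputation is usually done in pandas; the values are made-up stand-ins for the article’s table.

```python
import numpy as np
import pandas as pd

# Hypothetical table with gaps (NaN).
df = pd.DataFrame({
    "Temperature": [41, 45, np.nan, 62, 71, 80, np.nan, 86, 77, 63, 52, 44],
    "Monthly Sales": [215, np.nan, 260, 320, 410, np.nan, 610, 590, 470, np.nan, 260, 225],
})

# Mean imputation: every NA in a column is replaced by that column's mean.
mean_filled = df.fillna(df.mean())

# Median imputation works the same way and is usually a bit less harmful.
median_filled = df.fillna(df.median())

print(round(mean_filled["Temperature"].corr(mean_filled["Monthly Sales"]), 2))
print(round(median_filled["Temperature"].corr(median_filled["Monthly Sales"]), 2))
```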

Table 6 — Filling the Missing Values with the Mean

As shown above, we now kept the same number of observations — 12 months — but the effects of using the mean were devastating.

Our correlation went into free fall. We now have a correlation score of 0.36 between Temperature and Sales. See below:

Image 5 — Pairwise After Filling NAs with Mean

In short, we can no longer assume there is a strong correlation between those two variables. That is how badly using the mean hurt our data set in this particular experiment.

That completely distorted our analysis. Check this next graph out:

Image 6 — No Significant Correlation

Let’s say you are in the business of selling Ice Cream. If you know that the Temperature will increase in the next week or so, you would normally prepare for an increase in your sales. But with the weak correlation we got, you cannot go around and stock up on too much Ice Cream anymore!

That is because the confidence interval is too large. Although you can still expect to increase your sales, you cannot bet on it too heavily. This alone complicates the decision-making process.

In this scenario, people may find themselves “having” to gamble, and that is definitely what we don’t want to do!

I mean, never use the mean!

3- Using Maximum Likelihood Estimation to impute data.

Can we do better than that? We sure can!

One of the best methods to impute data is Maximum Likelihood Estimation (MLE). This method “Finds the maximum likelihood estimates of the mean vector and variance-covariance matrix for multivariate normal data with (potentially) missing values.” After getting the covariance estimates, we need to convert the covariance into a correlation. Check the documentation here.

As we see below, we got a correlation score of 0.93 between Temperature [2] and Sales [3].

Table 7 — Correlation After Using Maximum Likelihood Estimation

The downside of using this method is that there isn’t much documentation available. Nevertheless, experimenting was the whole point of writing this and doing all of these tests.
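If you want to try this approach in Python, one option is to run the EM algorithm for a multivariate normal with missing values yourself and read the correlation off the estimated covariance matrix. The quoted description above reads like the documentation of an R routine for this; the numpy sketch below only illustrates the same idea, under the assumption of roughly multivariate normal data, with made-up stand-in values. It is not necessarily the exact function used in the article.

```python
import numpy as np

def mvn_mle(X, n_iter=500, tol=1e-10):
    """EM algorithm for the ML estimates of the mean vector and covariance
    matrix of a multivariate normal when X (n x p) contains NaNs."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    mu = np.nanmean(X, axis=0)                        # start from column means
    Sigma = np.cov(X[~np.isnan(X).any(axis=1)], rowvar=False)  # start from complete rows

    for _ in range(n_iter):
        x_hat = np.empty_like(X)
        second = np.zeros((p, p))                     # running sum of E[x x^T]
        for i in range(n):
            miss = np.isnan(X[i])
            obs = ~miss
            row = X[i].copy()
            cond = np.zeros((p, p))
            if miss.any():
                if obs.any():
                    # Conditional mean/covariance of the missing part given the observed part.
                    reg = Sigma[np.ix_(miss, obs)] @ np.linalg.inv(Sigma[np.ix_(obs, obs)])
                    row[miss] = mu[miss] + reg @ (X[i, obs] - mu[obs])
                    cond[np.ix_(miss, miss)] = Sigma[np.ix_(miss, miss)] - reg @ Sigma[np.ix_(obs, miss)]
                else:                                 # fully missing row
                    row[:] = mu
                    cond = Sigma.copy()
            x_hat[i] = row
            second += np.outer(row, row) + cond
        mu_new = x_hat.mean(axis=0)
        Sigma_new = second / n - np.outer(mu_new, mu_new)
        converged = np.abs(Sigma_new - Sigma).max() < tol and np.abs(mu_new - mu).max() < tol
        mu, Sigma = mu_new, Sigma_new
        if converged:
            break

    corr = Sigma[0, 1] / np.sqrt(Sigma[0, 0] * Sigma[1, 1])
    return mu, Sigma, corr

# Hypothetical temperature/sales pairs with gaps (np.nan), not the article's table.
X = np.array([
    [41, 215], [45, np.nan], [np.nan, 260], [62, 320],
    [71, 410], [80, np.nan], [np.nan, 610], [86, 590],
    [77, 470], [63, 330], [np.nan, 260], [44, 225],
])
mu, Sigma, corr = mvn_mle(X)
print(round(corr, 2))
```

The key point is that the parameters are estimated from all the partially observed rows at once, instead of filling in single values first and computing statistics afterwards.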

There are other methods available that I didn’t have time to cover here, such as MICE (Multiple Imputation by Chained Equations). More here.
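As a hedged sketch of what a MICE-style approach could look like in Python: scikit-learn’s IterativeImputer is inspired by MICE, although by default it returns a single imputed data set rather than multiple ones. The data values are again made up.

```python
import numpy as np
import pandas as pd
# IterativeImputer is still flagged as experimental in scikit-learn,
# so this explicit enable import is required.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Hypothetical table with gaps, standing in for the article's data.
df = pd.DataFrame({
    "Temperature": [41, 45, np.nan, 62, 71, 80, np.nan, 86, 77, 63, 52, 44],
    "Monthly Sales": [215, np.nan, 260, 320, 410, np.nan, 610, 590, 470, np.nan, 260, 225],
})

# Each feature with missing values is modeled as a function of the others,
# in round-robin fashion.
imputed = pd.DataFrame(
    IterativeImputer(random_state=0).fit_transform(df),
    columns=df.columns,
)
print(round(imputed["Temperature"].corr(imputed["Monthly Sales"]), 2))
```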

In a nutshell, there is no magical formula for dealing with missing data. What we want is to mitigate the effect of having missing data in any given data set.

Conclusion

The biggest takeaway is that we must experiment with different approaches to dealing with missing data. What worked here might not work on a different data set (really?).

My goal was to bring attention to this subject because people overlook it quite often. They pick one approach and marry it. It is always good to try different things and see what gives you the best outcome possible.

We saw here that using the mean was the worst method, while taking advantage of MLE was the best one. Dropping the missing values was better than using the mean.

One limitation is that this data set is so tiny, and the correlation so strong, that it can be hard to come to a sound conclusion. For instance, even after dropping the data, the correlation was still strong. This might not be the case when trying these techniques on a different data set.

Next time you come across a data set containing NAs, think about the possibilities you have in your tool belt to better deal with it. Just don’t ever consider using the mean or median. Please.

BONUS: One Simple Best Practice

Warning! The data in this example has nothing to do with what we’ve been talking about so far.

Let me emphasize a good practice when handling missing data, particularly when introducing imputation.

Unfortunately, many people, when imputing data, regardless of the method, tend to fill the missing values in the original feature (row/column). The problem is that, if you pass that data to somebody else to work with, they will never know that there was missing data, and they will have no idea which method you used either. On top of that, you yourself can forget that you did the imputation on the data set. Following this simple trick can help you keep the process neatly documented.

If you still don’t get it, here is an illustration I came up with (this is all dumb/fictional data):

Table 8 — Missing Data Best Practice

Breaking this table down:

Picture this: let’s say you sell some electronics and, for whatever reason, you don’t have the Price for the Laptop item. You decide to apply some imputation (IMP) to fill that space, but instead of doing it directly in the Price column, you create a new column — Price_IMP — and fill the blank space with a value there. Next, you create an additional feature — Price_IMP_Y/N — to function as a binary indicator, so you can spot at a glance where the imputation was applied. Voila! You just created a simple way to keep track of imputations.

How you name your new features is up to you and your team. The point is that, whenever you use imputation, you must indicate it somehow and keep the original feature intact. Don’t forget to disclose what method you used and why.
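A minimal pandas sketch of this bookkeeping, using the fictional electronics example (the mean is used here purely as a placeholder imputation method):

```python
import numpy as np
import pandas as pd

# Dumb/fictional data, mirroring Table 8: an electronics price list with one gap.
df = pd.DataFrame({
    "Item": ["Laptop", "Monitor", "Keyboard", "Mouse"],
    "Price": [np.nan, 250.0, 45.0, 20.0],
})

# Keep the original column untouched; put the imputed values in a new column
# and add a binary flag so anyone can see exactly where imputation happened.
df["Price_IMP"] = df["Price"].fillna(df["Price"].mean())
df["Price_IMP_Y/N"] = df["Price"].isna().astype(int)

print(df)
```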
