By Huey Fern Tay
With Greg Page

Missing data appears for all sorts of reasons.
It happens when surveys are too long, or when they ask questions that are very personal (e.g. annual income, monthly expenses).
It occurs when machines break down.
Sometimes, it is simply the result of inconsistent record keeping.
Whatever the cause, one should think very carefully before filling in the blanks. Models built with the imputed data may deliver results that are strongly impacted by the newly altered values.
In a previous example involving Lobster Land, a fictional amusement park in Maine, I showed that it was reasonable to estimate the revenue of the indoor gaming centre from the number of day passes sold, because the two variables had a strong, linear relationship.
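One way to operationalize that kind of inference is a simple linear regression used to predict the missing values. The sketch below assumes a data frame called lobsterland with columns day_passes and gaming_revenue; those names are illustrative, not the actual column names from that dataset.

```r
# Regression-based imputation sketch (data frame and column names are assumptions)
fit <- lm(gaming_revenue ~ day_passes, data = lobsterland)  # rows with NA revenue are dropped automatically

# Fill in the missing revenue values with the model's predictions
missing <- is.na(lobsterland$gaming_revenue)
lobsterland$gaming_revenue[missing] <- predict(fit, newdata = lobsterland[missing, ])
```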
But what about the ‘precipitation’ variable, which also contained missing values? That kind of inference would not be applicable here, because rainfall does not depend on other factors in our dataset, such as the day of the week or sales revenue at SnackShack.

One reasonable way to treat the missing values would be to check the recorded days and see whether it rained most of the time. After all, if Maine was usually dry, it would not be right for us to assume rain occurred on the day our record keeping failed us.
To perform this check in R, we can use the mfv() function, which stands for ‘most frequent value’.
Since rain was infrequent in Maine that year, we could address the missing values by inserting zeros into those cells.
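A minimal sketch of both steps is below. The data frame and column names are placeholders rather than the ones from the original Lobster Land dataset, and mfv() here is loaded from the modeest package, one of the packages that provides it.

```r
library(modeest)

# Most frequent precipitation value among the days we do have records for
mfv(na.omit(lobsterland$precipitation))

# If that value is 0 (i.e. a mostly dry year), fill the missing cells with 0
lobsterland$precipitation[is.na(lobsterland$precipitation)] <- 0
```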
Other times, simply using zero as an automatic substitute for missing values can be inappropriate. For instance, suppose we were missing values for coal consumed in the United States for the months shown below:

Firstly, imputing with zero would imply that coal mysteriously disappeared from the country’s energy mix for four months! While the NAs in the dataset are certainly frustrating, we would create a far larger problem here by suggesting that such consumption simply did not happen.
Secondly, this erroneous substitution would weaken multiple regression models. Note the difference in model strength when we switch from imputation with zero to imputation with the mean of the known months: the adjusted R-squared immediately rises from .5135 to .8586.
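For reference, the two imputation strategies can be compared along these lines; the data frame, response, and predictor names below are placeholders, not the actual coal dataset and regressors behind the figures above.

```r
# Sketch only: 'coal' and its column names are assumptions, not the original data
# Zero imputation
coal_zero <- coal
coal_zero$coal_consumed[is.na(coal_zero$coal_consumed)] <- 0

# Mean imputation, using the mean of the known (non-missing) months
coal_mean <- coal
coal_mean$coal_consumed[is.na(coal_mean$coal_consumed)] <-
  mean(coal$coal_consumed, na.rm = TRUE)

# Fit the same multiple regression to each version and compare adjusted R-squared
model_zero <- lm(coal_consumed ~ natural_gas + heating_degree_days, data = coal_zero)
model_mean <- lm(coal_consumed ~ natural_gas + heating_degree_days, data = coal_mean)
summary(model_zero)$adj.r.squared
summary(model_mean)$adj.r.squared
```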


While not demonstrated here, a weak multiple regression model can drag down forecasting accuracy, and inaccurate plans made from those forecasts can have a negative ‘ripple effect’ all the way down the supply chain. Once again, we see why it is important to proceed with caution when faced with missing data.
At the end of the day, there is no ‘one-size-fits-all’ solution for NA imputation. Replacement with zeros does have its merits, but as with the other approaches, context matters. Ultimately, the modeler needs to know the dataset, understand the problem, and be able to make the best decision for a particular set of circumstances.