Talking About Missing Data

The vocabulary you’ll need for a technical interview

S. T. Lanier
Towards Data Science


Missing data. Image by Author.

If you’ve worked at all with real data, you’ve probably already had to handle cases of missing data. (I wonder what the probability of missing data is in any given real-world dataset. I suspect it’s about as close to certainty as you can get.)

We should be suspicious of any dataset (large or small) which appears perfect.

— David J. Hand

How did you handle it? Row-wise deletion? Column-wise deletion? Imputation? What did you impute? If continuous, did you use the mean, baseline, or a KNN-derived value? If categorical, did you make a new categorical value or assign it to a preexisting one?

But most importantly, why? Why did you choose the method you did?

I think early on in learning about missing data, the focus is so often on what tools are available for handling it (see that whole list of questions above) that the justification for each is left a bit up to intuition. I, at least, didn’t have a rigorous framework for defining missing data, and I suspect I’m not the only one in that boat; but twice now that very topic has come up in a technical interview setting. “How do you handle missing data?” Both times, the most important part of the question was classifying the data into one of three categories: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). We owe these definitions to Donald B. Rubin’s 1976 paper, “Inference and Missing Data.” (If you want to jump straight to the original paper, it’s linked in the sources at the end.)

If you’ve handled missing data thoughtfully in the past, you probably already have the intuition for each of these, but I can vouch for the importance of being able to connect the intuition to the statistical terms. You gotta talk the talk if you’re gonna walk the walk.

Gif by Kristen Wiig on giphy.com

Missing Completely at Random (MCAR)

If the probability of being missing is the same for all cases, then the data are said to be missing completely at random (MCAR). This effectively implies that causes of the missing data are unrelated to the data. We may consequently ignore many of the complexities that arise because data are missing, apart from the obvious loss of information…While convenient, MCAR is often unrealistic for the data at hand.

––Stef van Buuren, Flexible Imputation of Missing Data

Data that is MCAR is exactly what it sounds like: truly random missing data. There’s no pattern to it, and the fact that a value is missing tells you nothing about the value itself. Imagine printing out your perfect database/spreadsheet/DataFrame, taping it to the far wall, and shooting at it with a BB gun. Blindfolded. Now you have to do your data analysis off this hole-y data. Your data is MCAR. This is in many ways the ideal, but unrealistic, case for missing data.

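For handling this MCAR data, row-wise deletion is fair game as it produces unbiased estimates of means, variances, and regression weights, at the cost of throwing away information and widening your standard errors. Mean imputation is also fair game if all you need is the mean, with the gentle guidance that missing values should be few: it leaves the mean unbiased but shrinks the variance and distorts the relationships between variables. Regression and stochastic regression imputation are fair game too [1].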
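To make the MCAR claims concrete, here is a minimal simulation sketch using only Python’s standard library. The column, its distribution, and the 30% missingness rate are all made up for illustration; in practice you would more likely reach for pandas or scikit-learn.

```python
import random
import statistics

rng = random.Random(0)

# Hypothetical "complete" column: 20,000 draws from a normal distribution.
complete = [rng.gauss(50, 10) for _ in range(20_000)]

# MCAR: every value has the same 30% chance of being knocked out,
# regardless of its own value or anything else in the table.
observed = [x if rng.random() > 0.30 else None for x in complete]

# Row-wise deletion: keep only the values that survived.
kept = [x for x in observed if x is not None]
kept_mean = statistics.fmean(kept)

# Mean imputation: fill every hole with the mean of the survivors.
mean_imputed = [kept_mean if x is None else x for x in observed]

true_mean = statistics.fmean(complete)
true_sd = statistics.pstdev(complete)
deleted_sd = statistics.pstdev(kept)
imputed_sd = statistics.pstdev(mean_imputed)

print(f"mean: true={true_mean:.2f}  after deletion={kept_mean:.2f}")
print(f"sd:   true={true_sd:.2f}  after deletion={deleted_sd:.2f}  "
      f"after mean imputation={imputed_sd:.2f}")
```

Under these assumptions the mean after deletion should land close to the true mean, while the mean-imputed column keeps the same mean but shows a visibly smaller standard deviation, because a pile of identical values has been stacked in the middle of the distribution.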

Missing at Random (MAR)

If the probability of being missing is the same only within groups defined by the observed data, then the data are missing at random (MAR)… MAR is more general and more realistic than MCAR. Modern missing data methods generally start from the MAR assumption.

— Stef van Buuren, Flexible Imputation of Missing Data

By name, the distinction between MCAR and MAR feels a bit blurry (is there a difference between random and completely random? Isn’t “completely” just a superlative?), but by these definitions, there really are two types of randomly missing data. Back to the previous analogy: we taped the whole table to the wall and fired blindfolded, yes? Well, with MAR, our juvenile firing squad gets to peek at one column that stays intact, say age, and aims more often at the rows of younger respondents. The holes are no longer spread evenly over the table, but among rows with the same age, which cells get hit is still pure chance.

That is the “within groups defined by the observed data” part of the definition. The probability of being missing is allowed to depend on values you can still see, but once you account for those, it doesn’t depend on the value that went missing.

It makes possible this interesting statement. Really think about it, because it threatens to sound paradoxical. If you’re handed a row with a missing salary, you can say it’s more likely to belong to a younger respondent; and yet, among respondents of the same age, the salaries are missing randomly. Thus the distinction between missing completely at random and missing at random. For MCAR, the fact that a value is missing tells you nothing, not which rows it is likely to come from, not what value it might have held. For MAR, you can say missing data is more likely to come from certain groups, and the observed columns may help you predict the missing value, but within a group, the fact that it’s missing still tells you nothing more about what value it held.
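One way to see the difference between MCAR and MAR is to compare missingness rates across groups of a column that is fully observed. The sketch below is a toy under stated assumptions: the `age_group` column, the group labels, and the missingness rates are all hypothetical, and `missing_rate_by_group` is a helper written for this example, not a library function.

```python
import random

rng = random.Random(3)
n = 20_000

# Hypothetical fully observed column.
age_group = [rng.choice(["under_40", "40_plus"]) for _ in range(n)]

# MCAR: one flat 30% chance of a hole for every row.
mcar_missing = [rng.random() < 0.30 for _ in range(n)]

# MAR: the chance of a hole depends only on the observed age group.
mar_missing = [
    rng.random() < (0.50 if group == "under_40" else 0.10)
    for group in age_group
]


def missing_rate_by_group(groups, missing):
    """Return the share of missing rows within each group."""
    totals, holes = {}, {}
    for group, is_missing in zip(groups, missing):
        totals[group] = totals.get(group, 0) + 1
        holes[group] = holes.get(group, 0) + int(is_missing)
    return {group: holes[group] / totals[group] for group in totals}


mcar_rates = missing_rate_by_group(age_group, mcar_missing)
mar_rates = missing_rate_by_group(age_group, mar_missing)

print("MCAR missing rate by group:", mcar_rates)
print("MAR missing rate by group: ", mar_rates)
```

Roughly equal rates across groups are what MCAR looks like; clearly different rates mean the data are not MCAR. Note that a check like this can only use observed columns, so it cannot tell MAR apart from MNAR.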

For handling MAR data, deletion is not an option as it can severely bias estimates of means, regression coefficients, and correlations; regression imputation and stochastic regression imputation, however, are fair game [1].
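Here is a small sketch of why deletion hurts under MAR and how regression imputation helps. It assumes a hypothetical survey in which age is always observed, salary rises linearly with age, and younger respondents skip the salary question more often. The numbers are invented, and the least-squares fit is done by hand to stay within the standard library.

```python
import random
import statistics

rng = random.Random(1)
n = 20_000

# Hypothetical data: age is always observed, salary rises with age plus noise.
age = [rng.uniform(20, 60) for _ in range(n)]
salary = [1_000 * a + rng.gauss(0, 5_000) for a in age]

# MAR: the chance that salary is missing depends only on the observed age.
missing = [rng.random() < (0.6 if a < 40 else 0.1) for a in age]

obs_age = [a for a, m in zip(age, missing) if not m]
obs_salary = [y for y, m in zip(salary, missing) if not m]

# Least-squares fit of salary on age, using complete rows only.
mean_a = statistics.fmean(obs_age)
mean_y = statistics.fmean(obs_salary)
slope = (
    sum((a - mean_a) * (y - mean_y) for a, y in zip(obs_age, obs_salary))
    / sum((a - mean_a) ** 2 for a in obs_age)
)
intercept = mean_y - slope * mean_a

# Regression imputation: fill each hole with the fitted value.
regression_imputed = [
    (intercept + slope * a) if m else y
    for a, y, m in zip(age, salary, missing)
]

# Stochastic regression imputation: fitted value plus residual-sized noise.
residuals = [y - (intercept + slope * a) for a, y in zip(obs_age, obs_salary)]
resid_sd = statistics.pstdev(residuals)
stochastic_imputed = [
    (intercept + slope * a + rng.gauss(0, resid_sd)) if m else y
    for a, y, m in zip(age, salary, missing)
]

true_mean = statistics.fmean(salary)
deletion_mean = mean_y
regression_mean = statistics.fmean(regression_imputed)
stochastic_mean = statistics.fmean(stochastic_imputed)

print(f"true mean:                   {true_mean:,.0f}")
print(f"row-wise deletion:           {deletion_mean:,.0f}")
print(f"regression imputation:       {regression_mean:,.0f}")
print(f"stochastic regression imp.:  {stochastic_mean:,.0f}")
```

In this setup deletion overstates the mean salary because the surviving rows skew older, while both imputations land near the true mean. The stochastic version adds noise so that the imputed values don’t all sit exactly on the regression line, which keeps the spread of the column closer to reality.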

Missing Not at Random (MNAR)

If neither MCAR nor MAR holds, then we speak of missing not at random (MNAR)… MNAR means that the probability of being missing varies for reasons that are unknown to us… Strategies to handle MNAR are to find more data about the causes for the missingness, or to perform what-if analyses to see how sensitive the results are under various scenarios.

— Stef van Buuren, Flexible Imputation of Missing Data

Exactly what it sounds like. The probability of being missing depends on something you didn’t observe, often the missing value itself. People with lower salaries are less likely to report their salaries in a survey, so the missing data is meaningful, not random, and it skews the dataset’s average salary higher than it really is.
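The salary example can be sketched as a simulation, followed by the kind of what-if analysis van Buuren mentions. Everything here is an assumption made for illustration: the log-normal salary distribution, the response rates, and the `ratio` scenarios for how much lower the unreported salaries might be.

```python
import random
import statistics

rng = random.Random(2)

# Hypothetical salaries, right-skewed like real income data tends to be.
salary = [rng.lognormvariate(11, 0.5) for _ in range(20_000)]
median_salary = statistics.median(salary)

# MNAR: the chance of not reporting depends on the salary itself.
# Below-median earners skip the question 50% of the time, others 10%.
reported = [
    s for s in salary
    if rng.random() > (0.5 if s < median_salary else 0.1)
]

true_mean = statistics.fmean(salary)
reported_mean = statistics.fmean(reported)
print(f"true mean:     {true_mean:,.0f}")
print(f"reported mean: {reported_mean:,.0f}")

# What-if analysis: we cannot see the missing salaries, so we try several
# assumptions about their average relative to the reported average.
n_missing = len(salary) - len(reported)
what_if = {}
for ratio in (1.0, 0.8, 0.6):
    assumed_missing_mean = ratio * reported_mean
    what_if[ratio] = (
        len(reported) * reported_mean + n_missing * assumed_missing_mean
    ) / len(salary)
    print(f"if missing salaries average {ratio:.0%} of reported: "
          f"overall mean = {what_if[ratio]:,.0f}")
```

The reported mean comes out higher than the true mean, and no amount of cleverness with the reported values alone can reveal by how much. The what-if loop doesn’t fix that; it only shows how sensitive the estimate is to what you assume about the people who didn’t answer.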

Sources, Citations, Further Reading

Buuren, Stef van (2018). Flexible Imputation of Missing Data. Boca Raton: CRC Press. https://stefvanbuuren.name/fimd/

Rubin, D. B. (Dec. 1976). Inference and Missing Data. Biometrika, 63(3), 581–592. http://math.wsu.edu/faculty/xchen/stat115/lectureNotes3/Rubin%20Inference%20and%20Missing%20Data.pdf

For a cool example of KNN imputation in action, check out Kyaw Saw Htoot’s example article on the Titanic dataset.


Student of data science. Translator (日本語). Tutor. Bicyclist. Stoic. Tea pot. Seattle.