
Smart handling of missing data in R

Missing data are everywhere — learn how to summarise, visualise and impute them while keeping an eye on statistical uncertainty.

Hannah Roos
Towards Data Science
15 min read · Feb 2, 2021


When values should have been reported but were not available, we end up with missing values. In real-life data, missing values occur almost inevitably — like a shadow nobody can really get rid of. Think of nonresponse in surveys, technical issues during data collection or joining data from different sources — datasets consisting only of complete cases are rather scarce. But why should you care? And what can go wrong with simply ignoring missing data?

This article will show you why missing data require special treatment and why it is worth it.

  • Get to know visualization techniques to detect interesting patterns in missing data.
  • Learn why mean imputation or listwise deletion are not necessarily always the best choice.
  • Perform multiple imputation by chained equations (MICE) in R.
  • Assess the quality of imputation to account for statistical uncertainty and make your analysis more robust.

The problem with missing data

Some analyses (e.g. linear regression) require complete observations, yet common statistical software won’t throw an error when you feed it data containing missing values. Instead, incomplete cases are deleted automatically, and this usually happens silently. However, if you plan to test different models on the same dataset, a statistical comparison between them won’t be appropriate, since you cannot guarantee that the models were based on the same observations. Moreover, by dropping observations completely, we not only lose statistical power, but we may even get biased results — the dropped observations could provide crucial information about the problem of interest, so it would be a pity to simply ignore them.

For example, imagine you were consulted to assess the psychological working conditions in organization Z. For this purpose, you create an employee survey before you start to interview the stakeholders. Because frustrated employees usually skip unpleasant but crucial questions, missing data are almost inevitable. Since these values should definitely inform overall employee satisfaction, we should take care of them. In this case the data are not missing at random, or at least not missing completely at random, because missingness depends on employee satisfaction itself. A bit too complicated? Okay, let us take it more slowly: which types of missing data are out there, and how do they affect data analysis?

  • Missing not at random (MNAR): Locations of missing values in the dataset depend on the missing values themselves. In our example, employees tend to report positive or neutral responses and skip a question in case of disagreement. For example, individuals may systematically leave the following item blank because they feel uncomfortable leaving a bad rating for their leader: “I do not hesitate to report unpleasant issues to my leader.” Removing these observations would bias the results toward the positive, because it would evoke the impression that everyone is perfectly fine and the organization doesn’t need any intervention whatsoever. In this case, the presence of missings is interesting in itself, because it tells us something about the phenomenon under study — we should definitely not ignore them.
  • Missing at random (MAR): Locations of missing values in the dataset depend on other observed data. In another scenario of that employee survey, some respondents are not completely certain and may feel unable to give accurate responses about their thoughts and behaviour on a numeric scale. In the end, they rarely respond to the questions being asked. The missingness thus depends on another mechanism — here, the individual need for more precise self-reports — which shows up in the rest of the responses. Removing these cases would be a shame, because the missing values can be predicted from previous responses; dropping them may even introduce a bias, since the dataset would then only include responses from people who spontaneously respond to the items, and we do not know how this affects the results overall. Thus, ignoring missing data in your analysis is not recommended.
  • Missing completely at random (MCAR): Locations of missing values in the dataset are purely random; they do not depend on any other data. In a third scenario, all employees fill in the survey at the same time when suddenly there is a technical issue on the survey platform. Consequently, everyone has to stop prematurely, and missing values are therefore present for all participants towards the end of the survey. Pairwise deletion of the data will make it difficult for you to analyse the last variables in the dataset, but at least it probably won’t introduce any specific bias to the results. A rule of thumb says that when the data include less than 5% random missingness which does not depend on observed or unobserved values, complete case analysis may be an acceptable approach, too (Graham, 2009).

You do not know whether or not the values in your dataset are missing at random? There are statistical tests for whether data are missing at random, but given that you need some hypothesis about the missing values and where to expect them, endless testing seems a bit cumbersome. It probably makes more sense to explore the data visually and stay attentive to potential method-related biases in case you have no strong ideas right away. If you are interested in a real-life missing data problem, I highly recommend a paper by Köhler, Pohl and Carstensen (2017): the authors demonstrate how different treatments of nonresponse in large-scale educational student assessments affect important outcomes such as ability scores.

Explore missing data with naniar — get a bird’s-eye view

The data we will work with are survey data from the US National Health and Nutrition Examination Study (NHANES), a survey series conducted since the early 1960s — the sample contains 10,000 observations on health-related outcomes along with some demographic variables (age, income etc.). It is available in R through the NHANES package by Randall Pruim (2016).

Firstly, we load the dataset and reduce the sample size to 500 observations by randomly sampling from the original indices — you will probably work with smaller datasets anyway, and it makes plotting a bit easier. I assume that you have dplyr already installed on your computer.
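Here is a minimal sketch of this step. The exact code from the original article is not shown here, so the seed and the variable selection below are my own guesses, based on the variables discussed later in the text:

# install.packages(c("NHANES", "dplyr"))   # if not yet installed
library(NHANES)
library(dplyr)

data("NHANES")

set.seed(2021)                                    # keep the random sample reproducible
idx <- sample(seq_len(nrow(NHANES)), size = 500)  # sample from the original indices

nhanes_small <- NHANES[idx, ] %>%
  select(Age, AgeDecade, Gender, Work, Education, LittleInterest, Depressed,
         PhysActiveDays, BMI, Pulse, BPSysAve, BPDiaAve, SleepHrsNight, HealthGen)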

Now we are going to get a rough glimpse of the missingness situation with the pretty neat naniar package by Nicholas Tierney and colleagues (2019)…
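The following calls are standard naniar helpers; a sketch of what this overview may look like:

# install.packages("naniar")
library(naniar)

n_miss(nhanes_small)              # total number of missing values
pct_miss(nhanes_small)            # percentage of missing cells
miss_var_summary(nhanes_small)    # missings per variable, sorted descending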

The results show that there are indeed missing data in the dataset, accounting for about 18% of all values (n = 1165). Except for the “Age” variable, there is a substantial amount of missing values in each variable. Please note that since we have drawn a random sample, the results may vary a bit each time you run the script (unless you fix the seed).

Looking at the missingness summary per variable, we notice that the “PhysActiveDays” variable has the highest number of missings in the dataset. It represents the “number of days in a typical week that participant does moderate or vigorously intense activity.” Since physical activity makes a great difference to health, this seems to be an interesting lifestyle variable and a potential predictor for a variety of health outcomes. Keeping that in mind, it is noteworthy that its number of missing values exceeds the number of recorded values in this dataset.

We can transfer this summary to a visual representation like this:
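naniar’s gg_miss_var() produces such a plot (my guess at the original figure):

gg_miss_var(nhanes_small)   # number of missings per variable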

To get a better understanding of whether or not the data are missing at random, we are going to visualise the locations of missing values across all variables.
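naniar’s vis_miss() draws the whole data matrix, marking each missing cell:

vis_miss(nhanes_small)   # cell-level map of missing vs. recorded values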

We can see where the missing values are “clustered”, and the picture matches our previous overview of missing values per variable. The left part of the plot is particularly interesting: some participants skipped a whole block of mental health-related questions (for example “LittleInterest” and “Depression”) — these questions may have gone beyond their personal comfort zone (but this is just a hypothesis of mine).

Now, is the presence of missing values in one variable related to missings in other variables? Let’s find out.
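An UpSet plot shows how often missings co-occur across variables; naniar wraps this in gg_miss_upset() (it requires the UpSetR package):

gg_miss_upset(nhanes_small)   # co-occurrence patterns of missing values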

The results confirm that there is a substantial number of cases in which missings co-occur across certain variables (e.g. missing Work, Education, LittleInterest and Depression information along with the absence of recordings in PhysActiveDays). Thus, the data do not seem to be missing completely at random. We have learnt that if the data are MAR or MNAR, imputing missing values is advisable.

If you had a concrete hypothesis about how the presence of missing values in a certain variable relates to a target variable, you could test it like this:

Suppose that, for some reason, you expect the percentage of missing values in BMI to differ depending on the level of perceived interest in doing things (“LittleInterest”).
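The original test code is not shown, so here is one simple way to run such a check: build a missingness indicator for BMI and compare its distribution across the levels of LittleInterest with a chi-squared test.

# contingency table: LittleInterest level vs. BMI missing yes/no
miss_tab <- table(nhanes_small$LittleInterest, is.na(nhanes_small$BMI))
chisq.test(miss_tab)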

The output suggests that we cannot reject the null hypothesis, so we assume that there is no difference in BMI missingness per level of interest. Note that conceptually this particular test does not make much sense, but I did not want to withhold it from you. 😉

Okay — before starting with the imputation, let us check one thing first: reading the documentation of the NHANES dataset, we can see that some variables were not recorded for children under 9 or 12 years of age (for example self-reported health status, or the number of days within the last month on which the participant did not feel good physically). To find out how age affects the presence of missing values in our dataset, we can create a heatmap that represents the density of missings per variable, broken down by age.
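naniar’s gg_miss_fct() does exactly this, taking a factor to split by; the NHANES package conveniently provides AgeDecade:

gg_miss_fct(nhanes_small, fct = AgeDecade)   # % missing per variable and age group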

Indeed, there are a lot more missing values in many variables for individuals aged 0–9. In some cases, this also applies to the demographic and depression-related variables of teenagers (10–19), but we won’t touch this for now. It seems reasonable, however, to exclude children from our statistical analysis to reduce bias in our results — we will do so below, right before computing our target variable.

Case study — investigate cardiovascular health


Imagine that you are interested in cardiovascular health because you run an intervention program that promotes the prevention of cardiovascular diseases — without any further information about your patients’ physical condition, you would like to know if there are a few common parameters that are probably associated with cardiovascular health. You are pretty sure that the more active an individual’s lifestyle is, the less likely you are to observe an abnormally increased blood pressure (Whelton et al., 2002). Similarly, the body mass index (BMI) might also be related to cardiovascular health, since obese individuals often experience hypertension whereas skinnier people’s blood pressure tends to be low (e.g., Bogers, 2007; Hadaegh et al., 2012). You decide to test your hypothesis on this large dataset — however, you have to take care of the missing values to find out whether it is worth specifically targeting those individuals that are at risk of developing cardiovascular problems.

For ease of computation, you use the mean arterial pressure (MAP) as your target variable — a valid parameter (Kundu, Biswas & Das, 2017) that represents the average blood pressure prevailing in the vascular system, irrespective of systolic and diastolic fluctuations.

Here is how we can calculate the MAP for each individual based on systolic and diastolic blood pressure (Pschyrembel, 2004):

MAP = diastolic blood pressure + 1/3 × (systolic blood pressure − diastolic blood pressure)
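A sketch of this step, assuming the averaged NHANES readings BPSysAve and BPDiaAve and an adult cut-off at age 18 (the exact cut-off used in the original is not shown):

nhanes_adult <- nhanes_small %>%
  filter(Age >= 18) %>%                               # exclude children, see heatmap above
  mutate(MAP = BPDiaAve + (BPSysAve - BPDiaAve) / 3)  # mean arterial pressure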

Impute missing data — fill in the blanks

Before diving into my preferred imputation technique, let us acknowledge the large variety of alternatives, for example mean imputation, maximum-likelihood imputation, hot-deck imputation and k-nearest-neighbours imputation. Even though they can certainly be useful, they have one downside in common: they do not account for the statistical uncertainty that is naturally associated with imputing missing values. How can we know whether or not we have accurately predicted the values that could have been recorded? Moreover, how does the quality of our imputation affect our statistical model?

Understanding the good means understanding the bad — what happens to the data if we simply replace the missing values of all continuous variables by their respective means? Let us see.
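A sketch of such a mean imputation for two variables (BMI and Pulse are my choice of axes; the original pair is not shown), using naniar’s bind_shadow() to remember which values were imputed:

library(ggplot2)

nhanes_adult %>%
  bind_shadow() %>%                 # adds BMI_NA, Pulse_NA, ... indicator columns
  mutate(
    BMI     = ifelse(is.na(BMI),   mean(BMI,   na.rm = TRUE), BMI),
    Pulse   = ifelse(is.na(Pulse), mean(Pulse, na.rm = TRUE), Pulse),
    imputed = BMI_NA == "NA" | Pulse_NA == "NA"
  ) %>%
  ggplot(aes(x = BMI, y = Pulse, colour = imputed)) +
  geom_point(alpha = 0.7) +
  scale_colour_manual(values = c("FALSE" = "blue", "TRUE" = "red"))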

We can see the imputed values in red and the recorded values in blue — the imputed values form almost a kind of “cross”, which looks rather artificial. This is because, unlike the recorded values, mean-imputed values carry no natural variance. Therefore, these values are less “scattered” and would artificially shrink the standard error in our linear regression. We would perceive our estimates to be more accurate than they actually are.

So, we need a good alternative — multiple imputation by chained equations (MICE) is my favourite approach since it is very flexible and can handle different variable types at once (e.g. continuous, binary, ordinal etc.). For each variable containing missing values, we use the remaining information in the data to build a model that predicts what could have been recorded to fill in the blanks — when using statistical software, this happens silently in the background. To account for the statistical uncertainty in the imputations, the MICE procedure goes through several rounds, computing replacements for the missing values in each round. As the name suggests, we thus fill in the missing values multiple times, creating several complete datasets, before we pool the results to arrive at more realistic estimates. If you are interested in more details about multiple imputation by chained equations, I recommend reading the nicely written paper by Azur and colleagues (2011). Maybe this idea rings a bell if you already know the benefits of random forests over single decision trees?

Let us implement the MICE procedure in R by making use of the wonderful mice package by Stef van Buuren and Karin Groothuis-Oudshoorn (2011).

install.packages("mice")
library(mice)

To arrive at good predictions for each target variable containing missing values, we keep only those predictor variables that are at least somewhat correlated with it (r > 0.25).
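mice ships a helper for exactly this, quickpred(), which keeps for every incomplete variable only those predictors whose correlation exceeds mincor:

pred_matrix <- quickpred(nhanes_adult, mincor = 0.25)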

Then we run the actual imputation procedure 10 times: we set a seed, select a method and use the predictor matrix on our original dataset. Depending on how many rounds you have selected, the computation may take a while…
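A sketch of the call; I leave the method at mice’s defaults (predictive mean matching for numeric variables, logistic/polytomous regression for factors), since the original choice is not shown:

imp <- mice(nhanes_adult,
            m = 10,                        # 10 imputation rounds
            predictorMatrix = pred_matrix, # from quickpred() above
            seed = 2021,
            printFlag = FALSE)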

Now we run our regression on each of the 10 imputed datasets and pool the results in the end. To get an impression of the statistical uncertainty, we include 95% confidence intervals in the regression summary for the pooled results.
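with() fits the model on each completed dataset, and pool() combines the estimates according to Rubin’s rules:

fit    <- with(imp, lm(MAP ~ BMI + PhysActiveDays))
pooled <- pool(fit)
summary(pooled, conf.int = TRUE, conf.level = 0.95)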

As expected, we can see that BMI significantly predicts mean blood pressure in our NHANES subsample (p < .001). The regression estimate for BMI amounts to about 0.41, which means that for every additional BMI unit, we expect the mean arterial pressure to increase by 0.41 mm Hg. Taking the confidence intervals into account, the relationship between BMI and blood pressure is consistently positive when sampling regression coefficients from our different imputation rounds (95% CI [0.25, 0.56]). For the degree of physical activity, however, our confidence interval includes both positive and negative estimates (95% CI [-1.07, 0.44]), which should make us sceptical. It means that depending on the imputation quality of each round, we would get different results and thus would interpret the relationship between physical activity and blood pressure differently. Imagine you had only one round (single imputation) — then you would have no chance to evaluate the reliability of your coefficient estimates. Thus, we largely benefit from imputing the missing values multiple times and pooling the results!

Note that we have no information whether or not the relationship between blood pressure and BMI is causal, but it does not seem far-fetched to assume a slight association — even if it is perhaps moderated by a healthy lifestyle (e.g. frequent physical activity, appropriate nutrition etc.).

Additionally, we will create a strip plot to assess the quality of the imputation — do the red (imputed) points fit in with the reported values naturally?
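mice provides a lattice-based stripplot() method for imputed objects, with one panel position per imputation round and imputed values highlighted:

stripplot(imp, BMI ~ .imp, pch = 20, xlab = "Imputation round")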

It seems that more values were imputed at the lower end of the BMI range, caused by a higher density of missing values there (as you can guess from the mean-imputation scatterplot). Apart from this, the imputed values are nicely scattered within the data cloud and do not seem to differ substantially across imputation rounds.

We are done — now we can extract a completed version of our dataset so that no missings are left.
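complete() extracts a filled-in dataset, by default the first of the m = 10 versions:

nhanes_complete <- complete(imp)
anyNA(nhanes_complete)   # should now be FALSE for all imputed variables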

Bonus round — assess the model’s robustness with cross-validation

Now we have a simple regression model that somewhat predicts mean arterial blood pressure. A common scenario would be that we actually want to make use of our knowledge and predict unknown blood pressure in a fresh sample of participants. But is the model really accurate enough for this job already?

We start by splitting the data into training and test sets and fit the model on the training part only.
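A sketch of this step; the 80/20 split is my assumption, since the original proportions are not shown:

set.seed(2021)
train_idx  <- sample(seq_len(nrow(nhanes_complete)),
                     size = round(0.8 * nrow(nhanes_complete)))
train_data <- nhanes_complete[train_idx, ]
test_data  <- nhanes_complete[-train_idx, ]

model <- lm(MAP ~ BMI + PhysActiveDays, data = train_data)
summary(model)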

Consistent with our previous analysis, only BMI is significantly related to blood pressure. The R-squared value suggests that our model explains only about 5% of the variance in blood pressure.

Still, we try to use that model to actually predict blood pressure within a dataset the algorithm has never seen before — the test dataset. Finally, we will assess the model’s accuracy.
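Predicting on the held-out data and computing the RMSE; reading the article’s “error rate” as the RMSE relative to the mean observed MAP is my interpretation:

preds <- predict(model, newdata = test_data)
rmse  <- sqrt(mean((test_data$MAP - preds)^2))
rmse                          # ~12 mm Hg in the article's run
rmse / mean(test_data$MAP)    # relative error, ~14%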

The output gives us an RMSE of 11.83, which means that on average the predictions deviate by about 12 mm Hg from the actual values. Given that normal MAP values lie between 65 and 110 mm Hg, a deviation of about 12 mm Hg could shift values near the lower bound of normal towards the cut-off values for complications and heart failure. Scholars suggest that even 1 minute at a mean arterial pressure of 50 mm Hg increases the risk of mortality during surgery by 5% (Maheshwari et al., 2018). In this case, our poor estimation accuracy demonstrates that our model cannot replace real data (i.e. actually recorded blood pressure). Furthermore, the relative error amounts to 14%, which is still pretty high when compared to high-quality algorithms (Google, Facebook etc.) reaching more than 95% accuracy. To me, the model does not really hit the mark, but hey — it is still helpful enough to get rough estimates. When keeping these limitations in mind, it is not a bad start!

Know your puzzle pieces

The best thing to do with missing data is to not have any.

— Gertrude Mary Cox

… but still they are everywhere. So it is definitely worth having some know-how on how to deal with missingness. You have learnt how to summarise, visualise and impute missing data in a way that supports your subsequent analysis. This way, you not only know where your puzzle is lacking some pieces — you also have the technical skills to see the bigger picture.

References

[1] J. W. Graham, Missing data analysis: Making it work in the real world. (2009), Annual review of psychology, 60, 549–576

[2] C. Köhler, S. Pohl & C. H. Carstensen, Dealing with item nonresponse in large-scale cognitive assessments: The impact of missing data methods on estimated explanatory relationships (2017), Journal of Educational Measurement, 54(4), 397–419

[3] R. Pruim, NHANES: Data from the US National Health and Nutrition Examination Study (2016), R Package

[4] N. Tierney, D. Cook, M. McBain, C. Fay, M. O’Hara-Wild & J. Hester, Naniar: Data structures, summaries, and visualisations for missing data (2019), R Package

[5] S. P. Whelton, A. Chin, X. Xin & J. He, Effect of aerobic exercise on blood pressure: a meta-analysis of randomized, controlled trials (2002), Annals of internal medicine, 136(7), 493–503

[6] R. P. Bogers, W. J. Bemelmans, R. T. Hoogenveen, H. C. Boshuizen, M. Woodward, P. Knekt… & M. J. Shipley, Association of overweight with increased risk of coronary heart disease partly independent of blood pressure and cholesterol levels: a meta-analysis of 21 cohort studies including more than 300 000 persons (2007), Archives of internal medicine, 167(16), 1720–1728

[7] F. Hadaegh, G. Shafiee, M. Hatami & F. Azizi, Systolic and diastolic blood pressure, mean arterial pressure and pulse pressure for prediction of cardiovascular events and mortality in a Middle Eastern population (2012), Blood pressure, 21(1), 12–18

[8] R. N. Kundu, S. Biswas & M. Das, Mean arterial pressure classification: a better tool for statistical interpretation of blood pressure related risk covariates (2017), Cardiology and Angiology: An International Journal, 1–7

[9] W. Pschyrembel, Mittlerer arterieller Druck (2004), Pschyrembel Klinisches Wörterbuch, 260. Aufl., de Gruyter

[10] M. J. Azur, E. A. Stuart, C. Frangakis, & P. J. Leaf, Multiple imputation by chained equations: what is it and how does it work? (2011), International journal of methods in psychiatric research, 20(1), 40–49

[11] S. van Buuren & K. Groothuis-Oudshoorn, mice: Multivariate imputation by chained equations in R (2011), Journal of Statistical Software, 45(3), 1–67

[12] K. Maheshwari, S. Khanna, G. R. Bajracharya, N. Makarova, Q. Riter, S. Raza, … & D. I. Sessler, A randomized trial of continuous noninvasive blood pressure monitoring during noncardiac surgery (2018), Anesthesia and analgesia, 127(2), 424
