
Modern life moves fast. If you act slowly, someone else will do what you were supposed to do. As a result, we are sometimes rushed to produce results in data science problems, both in academia and in industry: we fit a model and try to improve the required performance metric by changing its parameters. Building an appropriate model and tuning hyperparameters are essential, but first we need to understand the data itself.
One of the most common data quality problems is missing values. Do we really know why these values are missing? How should we impute them? Should we impute them at all? Should we drop them or leave them as they are? To answer these questions, we need to know more about missing data mechanisms.
Let’s consider an example. You want to develop a model that predicts house prices from a dataset of home-for-sale advertisements. Then you notice that the building’s construction year (in other words, the building age) is missing in some observations. Was it by chance, or is there an intention behind it? Should we delete the observations whose construction year is missing, or should we use simple mean imputation and move on? If you take these steps without fully understanding the reason behind the missingness, you can introduce bias into your model. The advertiser may have left out the construction year deliberately, thinking that people are less willing to look at an old house, that they set their search filters accordingly, and that the house would therefore be harder to sell. If you dropped these observations, you would be discarding old houses.
There are three types of missingness mechanisms.
- Missing Completely at Random (MCAR): The missingness of a variable is completely independent of its own values and of the other variables. For example, whether the number of fireplaces is missing does not depend on the number of fireplaces itself, the kitchen area, or the exterior covering material of the house. As the name implies, the missingness is completely random.
- Missing Not at Random (MNAR): The missingness of a variable is related to the variable’s own value. If you are going to rent or buy a summer house, one important criterion may be proximity to the sea, because being able to walk to the sea is a great convenience. As in the building age example, if this information is not given in an advertisement, it may be because the house is far from the sea.
- Missing at Random (MAR): The missingness of a variable depends on another, observed variable. Let’s assume that a regulation introduced in 2013 made a seismic resistance certificate compulsory for new houses. For homes built before 2013, this information may appear as NA in your dataset, but that doesn’t mean that the house is not resistant to earthquakes. In other words, the missingness is explained by the construction year rather than by the missing value itself, so the mechanism is not MNAR. Or let’s think of variables that are more clearly related to each other: if a house does not have a garage, we cannot talk about garage capacity or quality, and these variables will naturally be missing.
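The three definitions above are often written more compactly in probability notation. The symbols below are introduced here only for illustration and are not used elsewhere in this post: R is the indicator of which values are missing, Y_obs is the observed part of the data, and Y_mis is the part that is missing.

```latex
\begin{aligned}
\text{MCAR:}\quad & P(R \mid Y_{\text{obs}}, Y_{\text{mis}}) = P(R) \\
\text{MAR:}\quad  & P(R \mid Y_{\text{obs}}, Y_{\text{mis}}) = P(R \mid Y_{\text{obs}}) \\
\text{MNAR:}\quad & P(R \mid Y_{\text{obs}}, Y_{\text{mis}}) \text{ still depends on } Y_{\text{mis}}
\end{aligned}
```

Read this way, MCAR is the special case of MAR in which the observed data do not matter either, and MNAR covers whatever remains once the observed data are taken into account.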
I have created a toy data set to illustrate the missingness mechanisms visually. It has five attributes: x1, x2, x3, x4, and x5.
- For MAR, x1 is missing if x2 is greater than 46.
- For MNAR, x3 is missing if x3 is greater than 67.
- For MCAR, some of the observations of x4 were randomly deleted.
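A data set following these three rules could be built roughly as in the sketch below. Only the thresholds 46 and 67 come from the rules above; the uniform value range, the sample size, the 20% deletion rate, the seed, and the function name `make_toy_data` are assumptions made for illustration, and the code uses plain Python so that it runs without extra libraries.

```python
import random

def make_toy_data(n=500, mcar_rate=0.2, seed=0):
    """Return a list of dict rows; None marks a missing value.

    Assumptions (illustrative only): values are uniform on [0, 100],
    and n, mcar_rate and seed are arbitrary defaults.
    """
    rng = random.Random(seed)
    rows = [{f"x{i}": rng.uniform(0, 100) for i in range(1, 6)} for _ in range(n)]
    for row in rows:
        # MAR: x1 is missing whenever the observed variable x2 exceeds 46.
        if row["x2"] > 46:
            row["x1"] = None
        # MNAR: x3 is missing whenever its own value exceeds 67.
        if row["x3"] > 67:
            row["x3"] = None
        # MCAR: x4 is deleted with a fixed probability, ignoring every value.
        if rng.random() < mcar_rate:
            row["x4"] = None
    return rows

rows = make_toy_data()
columns = ("x1", "x2", "x3", "x4", "x5")
missing_counts = {col: sum(r[col] is None for r in rows) for col in columns}
print(missing_counts)
```

In this sketch x2 and x5 are never touched, so they stay fully observed and can serve as comparison variables.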
Let’s draw a scatter plot of x1 and x2, colored by whether x1 is missing. Blue dots indicate that x1 is present, and red dots indicate that x1 is absent. Because I manipulated the data, the relationship between x1 and x2 is clearly visible: when x2 is greater than 46, x1 is missing.
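The same relationship can be checked numerically by comparing how often x1 is missing on each side of the x2 threshold. This is a minimal sketch on freshly simulated values; the uniform range, the sample size, and the helper name `missing_rate` are assumptions, and only the threshold of 46 comes from the rule above.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is reproducible
# Assumed toy columns: uniform values on [0, 100].
x2 = [rng.uniform(0, 100) for _ in range(500)]
# MAR rule: x1 is missing (None) whenever x2 is greater than 46.
x1 = [None if v2 > 46 else rng.uniform(0, 100) for v2 in x2]

def missing_rate(values):
    """Share of entries that are None."""
    return sum(v is None for v in values) / len(values)

x1_when_x2_high = [v1 for v1, v2 in zip(x1, x2) if v2 > 46]
x1_when_x2_low = [v1 for v1, v2 in zip(x1, x2) if v2 <= 46]

rate_high = missing_rate(x1_when_x2_high)
rate_low = missing_rate(x1_when_x2_low)
print(f"x1 missing rate when x2 > 46:  {rate_high:.2f}")
print(f"x1 missing rate when x2 <= 46: {rate_low:.2f}")
```

Because x2 is fully observed, this comparison is also possible on real data, which is what makes a MAR pattern detectable in practice.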

So, how does the missingness of x1 relate to the other variables, or to x1 itself? As you can see from the graph below, there is no specific pattern: the blue and red dots look randomly scattered. Therefore, we can say that the missingness of x1 is not related to x1 itself or to x4.

In the same manner, let’s check x3. The visualizations below show the relationship of x3 with x1, itself, and x5, respectively. x3 is missing when x3 is greater than 67.
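To see why this mechanism matters for imputation, the sketch below compares the mean of the full x3 values with the mean of the values that survive the rule. The full values are available only because the data are simulated; the uniform range and the sample size are assumptions.

```python
import random
from statistics import mean

rng = random.Random(0)  # fixed seed so the sketch is reproducible
# Assumed "true" x3 values, uniform on [0, 100]; known only in a simulation.
true_x3 = [rng.uniform(0, 100) for _ in range(500)]
# MNAR rule: every value above 67 is missing, so only the rest is observed.
observed_x3 = [v for v in true_x3 if v <= 67]

print(f"rows with x3 observed:  {len(observed_x3)} of {len(true_x3)}")
print(f"largest observed value: {max(observed_x3):.1f}")
print(f"mean of observed x3:    {mean(observed_x3):.1f}")
print(f"mean of true x3:        {mean(true_x3):.1f}")
```

The observed mean is always lower than the true mean here, because every missing value is larger than every observed one. Filling the gaps with the observed mean would therefore pull x3 systematically downward.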

Finally, if we look at x4, we cannot observe any pattern because I deleted observations randomly. So we can say that the missingness mechanism is MCAR.
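A rough numeric counterpart of this "no pattern" observation is to compare another variable between the rows where x4 is missing and the rows where it is present. The sketch below does this for x5; the uniform range, the sample size, and the 20% deletion rate are assumptions, and the comparison is an informal check rather than a statistical test.

```python
import random
from statistics import mean

rng = random.Random(0)  # fixed seed so the sketch is reproducible
n = 1000
# Assumed toy columns: uniform values on [0, 100].
x5 = [rng.uniform(0, 100) for _ in range(n)]
x4 = [rng.uniform(0, 100) for _ in range(n)]
# MCAR: delete x4 with probability 0.2 (illustrative), ignoring every value.
x4 = [None if rng.random() < 0.2 else v for v in x4]

x5_when_x4_missing = [b for a, b in zip(x4, x5) if a is None]
x5_when_x4_observed = [b for a, b in zip(x4, x5) if a is not None]

gap = mean(x5_when_x4_missing) - mean(x5_when_x4_observed)
print(f"mean of x5 when x4 is missing:  {mean(x5_when_x4_missing):.1f}")
print(f"mean of x5 when x4 is observed: {mean(x5_when_x4_observed):.1f}")
print(f"difference: {gap:.1f}")
```

A small difference is consistent with MCAR, but it does not prove it: the check only looks at an observed variable and cannot say whether the missingness depends on the unseen x4 values themselves.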

Please be aware that this is a toy example; in real life you don’t know the values that are missing, so you cannot visualize missingness in this way. If you are a domain expert, you may have an idea, but you cannot be 100% sure. In particular, detecting the MNAR mechanism is very difficult if you have no prior information about the data set and don’t know the domain’s context. Luckily, there are some statistical tests for checking the mechanisms, and I will give detailed examples with code in my next post. Please stay tuned.