The use of KNN for missing values

Update

Major update of the code to improve perfomance for computing the weighted hamming distance. The method was changed from a point by point distance computation to a matricial one resulting in multiple order of magnitudes of time saving.

Access the code:

Github link


Problem:

When working with real world data, you will often encounter missing values in your data-set. How you deal with them can be crucial for your analysis and the conclusion you will draw.

Missing values can be of three general types:

  1. Missing Completely At Random (MCAR): When missing data are MCAR, the presence/absence of data is completely independent of observable variables and parameters of interest. In this case, the analysis performed on the data are unbiased. In practice, it is highly unlikely.
  2. Missing At Random (MAR): When missing data is not random but can be totally related to a variable where there is complete information. An example is that males are less likely to fill in a depression survey but this has nothing to do with their level of depression, after accounting for maleness. This kind of missing data can induce a bias in your analysis especially if it unbalances your data because of many missing values in a certain category.
  3. Missing Not At Random (MNAR): When the missing values are neither MCAR nor MAR. In the previous example that would be the case if people tended not to answer the survey depending on their depression level.

In this tutorial we will use a non-parametric algorithm called k-nearest-neighbors (KNN) to replace missing values. This algorithm is applicable in any of the three previous situation, as long as there is a relationship between the variable with the missing value and the other variables.

Why using KNN ?

KNN is an algorithm that is useful for matching a point with its closest k neighbors in a multi-dimensional space. It can be used for data that are continuous, discrete, ordinal and categorical which makes it particularly useful for dealing with all kind of missing data.

The assumption behind using KNN for missing values is that a point value can be approximated by the values of the points that are closest to it, based on other variables.

Let’s keep the previous example and add another variable, the income of the person. Now we have three variables, the gender, the income and the level of depression that has missing values. We then assume that people of similar income and of same gender tend to have the same level of depression. For a given missing value, we will look at the gender of the person, its income, look for its k nearest neighbors and get their level of depression. We can then approximate the depression level of the person we wanted.

KNN parameters calibration

When using KNN, you have to take many parameters into consideration. You will find below what is allowed by the function provided with this article:

  • The number of neighbors to look for. Taking a low k will increase the influence of noise and the results are going to be less generalizable. On the other hand, taking a high k will tend to blur local effects which are exactly what we are looking for. It is also recommended to take an odd k for binary classes to avoid ties.
  • The aggregation method to use. Here we allow for arithmetic mean, median and mode for numeric variables and mode for categorical ones.
  • Normalizing the data is a method that allows to give every attribute the same influence in identifying neighbors when computing certain type of distances like the Euclidean one. You should normalize your data when the scales have no meaning and/or you have inconsistent scales like centimeters and meters. It implies prior knowledge on the data to know which one are more important. The algorithm automatically normalize the data when both numeric and categorical variable are provided.
  • Numeric attribute distances: among the various distance metrics available, we will focus on the main ones, Euclidean and Manhattan. Euclidean is a good distance measure to use if the input variables are similar in type (e.g. all measured widths and heights). Manhattan distance is a good measure to use if the input variables are not similar in type (such as age, height, etc…).
  • Categorical attribute distances: without prior transformation, applicable distances are related to frequency and similarity. Here we allow the use of two distances: Hamming distance and the Weighted Hamming distance. 
     
    - Hamming distance: take all the categorical attributes and for each, count one if the value is not the same between two points. The Hamming distance is then the number of attributes for which the value was different. 
     
    - Weighted Hamming distance: also return one if the value is different, but returns the frequency of the value in the attribute if they are matching, increasing the distance when the value is more frequent. When more than one attribute is categorical, the harmonic mean is applied. The result remain between zero and one but the mean value is shifted toward the lower values compared to the arithmetic mean.
  • Binary attribute distances: those attributes are generally obtained via categorical variables transformed into dummies. As already mentioned for the continuous variables, the Euclidean distance can also be applied here. However there is also another metric based on dissimilarity that can be used, the Jaccard distance.

In order to identify the best distance metric to use for your data, you can use the tips given above but above all, you will have to experiment to find what best improves your model.


Test on Titanic data-set

Objective: Predict which passenger survived the Titanic. You can download the data on Kaggle. To be able to do multiple test quickly, I also downloaded the complete data-set with the list of every passenger and whether they survived.

Procedure:

  • In order to build a simple model, I removed the columns ‘Name’, ‘Cabin’ and ‘Ticket’.
  • Then I identified two columns with missing values, ‘Age’ and ‘Embarked’. The first one has a lot of missing values while the second one has only a few. For those two columns I applied two methods: 
    1- use the global mean for numeric column and global mode for categorical ones.
    2- Apply the knn_impute function.
  • Build a simple random forest model

Simple approach results: 
sensitivity = 66 %; specificity = 79%; precision = 63%

KNN imputation results with the best model:
sensitivity = 69 %; specificity = 80%; precision = 66%

Code example:

The difference in results between the two methods is not that high for this data-set and yet on a Kaggle competition people can spend a lot of time to gain this few extra percentages.

If you test this function on other data-set, I am looking forward to hear your feedback on how helpful this function was in helping you improve your results!

Ressources: