Missing Data Imputation

Concepts and techniques for handling missing data imputation

Thiago Pereira
Towards Data Science

--

In the previous article, The Problem of Missing Data, I introduced the basic concepts of this problem. In this article, I'll explain some techniques for replacing missing values with other values.

The idea of imputation is both seductive and dangerous. It is seductive because it can lull the user into the pleasurable state of believing that the data are complete after all, and it is dangerous because it lumps together situations where the problem is sufficiently minor that it can be legitimately handled in this way and situations where standard estimators applied to the real and imputed data have substantial biases [Little and Rubin, 2019].

Simple Data Imputation

Essentially, simple data imputation is a method that imputes one value for each missing item. According to Little and Rubin [2019], simple data imputations can be defined as averages or extractions from a predictive distribution of missing values. They require a method of creating a predictive distribution for imputation based on the observed data, and they define two generic approaches for generating this distribution: explicit modeling and implicit modeling.

In explicit modeling, the predictive distribution is based on a formal statistical model (for example, the multivariate normal), so the assumptions are explicit. Examples of explicit modeling are mean imputation, regression imputation, and stochastic regression imputation.

In implicit modeling, the focus is on an algorithm, which implies an underlying model. The assumptions are implicit, but they still need to be carefully evaluated to ensure they are reasonable. Examples of implicit modeling are hot deck imputation, substitution (imputation by replacement), and cold deck imputation.

Explicit Modeling

1- Mean Imputation: the missing value is replaced with the mean of all observed values within a specific cell or class. This technique is rarely a good idea because the mean is sensitive to noise in the data, such as outliers. Tavares et al. [2018] compare mean imputation with other techniques and conclude that it performs poorly.
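A minimal sketch of mean imputation using scikit-learn's SimpleImputer; the toy array below is made up for illustration:

```python
# Mean imputation: each NaN is replaced with its column's mean.
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan]])

imputer = SimpleImputer(strategy="mean")  # column-wise mean
X_imputed = imputer.fit_transform(X)
print(X_imputed)  # NaNs replaced by the mean of each column
```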

2- Regression Imputation: the missing value is replaced with the value predicted by regressing the missing item on the items observed for that unit. Mean imputation can be regarded as a special case of regression imputation in which the predictor variables are dummy indicator variables for the cells within which the means are imputed [Little and Rubin, 2019].
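A minimal sketch of deterministic regression imputation for one column, assuming the other column is fully observed; the toy DataFrame is made up:

```python
# Regression imputation: fit a regression on complete rows,
# then fill each missing y with the value predicted from x.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({"x": [1, 2, 3, 4, 5],
                   "y": [2.1, np.nan, 6.2, np.nan, 10.3]})

observed = df["y"].notna()
model = LinearRegression().fit(df.loc[observed, ["x"]], df.loc[observed, "y"])

df.loc[~observed, "y"] = model.predict(df.loc[~observed, ["x"]])
print(df)
```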

3- Stochastic Regression Imputation: the missing value is replaced with the value predicted by the regression plus a residual that reflects uncertainty in the predicted value. Normal linear regression and logistic regression models are examples.
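A minimal sketch of the stochastic variant: the regression prediction plus a random residual, here drawn from a normal distribution whose spread matches the observed residuals (an illustrative assumption):

```python
# Stochastic regression imputation: prediction + random residual.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({"x": [1, 2, 3, 4, 5],
                   "y": [2.1, np.nan, 6.2, np.nan, 10.3]})
observed = df["y"].notna()

model = LinearRegression().fit(df.loc[observed, ["x"]], df.loc[observed, "y"])
residual_std = (df.loc[observed, "y"]
                - model.predict(df.loc[observed, ["x"]])).std()

rng = np.random.default_rng(0)
predictions = model.predict(df.loc[~observed, ["x"]])
df.loc[~observed, "y"] = predictions + rng.normal(0.0, residual_std,
                                                  size=len(predictions))
print(df)
```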

Implicit Modeling

1- Hot Deck Imputation: the idea here is to use some similarity criterion to group the data before performing the imputation, so that each missing value is filled from a similar observed record. This is one of the most widely used techniques.
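A minimal hot-deck sketch: each missing value is copied from a "donor" record that is most similar on the fully observed columns. The toy data and the use of nearest neighbors as the similarity criterion are illustrative choices, not a fixed recipe:

```python
# Hot deck: copy each missing value from the most similar donor record.
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors

df = pd.DataFrame({"age": [23, 25, 31, 40, 42],
                   "income": [30.0, np.nan, 52.0, np.nan, 80.0]})

donors = df[df["income"].notna()]
recipients = df[df["income"].isna()]

# For each recipient, find the closest donor on the observed column(s).
nn = NearestNeighbors(n_neighbors=1).fit(donors[["age"]])
_, idx = nn.kneighbors(recipients[["age"]])

df.loc[recipients.index, "income"] = donors["income"].to_numpy()[idx.ravel()]
print(df)
```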

2- Substitution: this technique is more convenient in a survey context and consists of replacing nonresponding units with alternative units not observed in the current sample.

3- Cold Deck Imputation: this technique consists of replacing the missing value with a constant from an external source, such as a value from a previous realization of the same survey. It is similar to substitution, but here a single constant value is used, whereas substitution may use different values to replace the missing ones.
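A minimal cold-deck sketch, where the constant comes from a hypothetical earlier realization of the same survey:

```python
# Cold deck: fill missing values with a constant from an external source.
import numpy as np
import pandas as pd

previous_survey_income = 45.0  # hypothetical value from an earlier survey

df = pd.DataFrame({"income": [30.0, np.nan, 52.0, np.nan]})
df["income"] = df["income"].fillna(previous_survey_income)
print(df)
```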

4- Composite Method (hybrid): this technique proposes combining other techniques to predict the plausible value, for example, combining hot deck and regression imputation (this method is slightly different from the composite data imputation approach described below).

Multiple Data Imputation

Single imputation replaces an unknown missing value with a single value and then treats it as if it were the true value [Rubin, 1988]. As a result, single imputation ignores uncertainty and almost always underestimates the variance. Multiple imputation overcomes this problem by taking into account both within-imputation uncertainty and between-imputation uncertainty.
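A small illustration, on synthetic data, of why single mean imputation underestimates variance: every missing entry receives the same value, which shrinks the spread of the completed variable:

```python
# Mean imputation shrinks the variance of the completed variable.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(50.0, 10.0, size=1_000)       # complete "true" data
x_missing = x.copy()
x_missing[rng.random(1_000) < 0.3] = np.nan  # 30% missing completely at random

imputed = np.where(np.isnan(x_missing), np.nanmean(x_missing), x_missing)
print(np.var(x), np.var(imputed))  # the imputed variance is noticeably smaller
```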

The multiple data imputation method produces n suggestions for each missing value. Each of these n values is assigned as a plausible value, and n new data sets are created as if a simple imputation had occurred in each one. In this way, a single column with missing values gives rise to n new data sets, which are analyzed case by case using specific methods. These analyses are then combined in a second step, generating consolidated results for that data set. Figure 1 illustrates these concepts, and the steps in the multiple imputation process are as follows:

1- For each attribute that has a missing value in a data set record, a set of n values to be imputed is generated;

2- A statistical analysis is performed on each data set generated using one of the n replacement suggestions from the previous step;

3- The results of the analyses performed are combined to produce a set of results.

Figure 1: Schematic representation of multiple imputation, where m is the number of imputations. Source: adapted from [Schafer and Graham, 2002]
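As a rough sketch of these three steps, scikit-learn's experimental IterativeImputer can draw different plausible values on each run when sample_posterior=True; the pooling below simply averages the per-data-set estimates (full Rubin's rules would also combine within- and between-imputation variances):

```python
# Multiple imputation sketch: generate n completed data sets, analyze
# each one, and pool the results.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([[1.0, 2.0], [2.0, np.nan], [3.0, 6.1],
              [np.nan, 8.0], [5.0, 10.2]])

n = 5
estimates = []
for i in range(n):
    imputer = IterativeImputer(sample_posterior=True, random_state=i)
    X_completed = imputer.fit_transform(X)       # one plausible data set
    estimates.append(X_completed[:, 1].mean())   # analysis step: column mean

print(np.mean(estimates))  # pooled estimate across the n imputations
```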

I did not find a consensus in the literature on how to choose n, and choosing a large n may not be practical for the process as a whole, given the large number of data sets that must be generated and analyzed for each new plausible value.

Composite Data Imputation

Proposed by Soares [2007], composite imputation represents a class of imputation techniques that combine the execution of one or more tasks used in the KDD (Knowledge Discovery in Databases) process before predicting a new value to be imputed; for example, running a clustering algorithm such as k-means and/or a feature selection algorithm such as PCA and then executing some machine learning algorithm to predict the new value. This technique can be used in the context of single or multiple imputation. Soares [2007] also introduces the missing data imputation committee concept, which consists of using a statistical method to select, among all the predictions, the most plausible value.
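A minimal sketch of the clustering-then-imputation idea, in the spirit of Soares [2007]: a clustering task (k-means) precedes the imputation task, and each missing value is filled with the mean of its cluster rather than the global mean. The toy data and k = 2 are illustrative assumptions:

```python
# Composite imputation sketch: KDD task 1 (clustering) -> task 2 (imputation).
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

df = pd.DataFrame({"age": [22, 24, 25, 58, 60, 62],
                   "income": [28.0, np.nan, 31.0, 75.0, np.nan, 80.0]})

# Task 1: group records using the fully observed column.
df["cluster"] = (KMeans(n_clusters=2, n_init=10, random_state=0)
                 .fit_predict(df[["age"]]))

# Task 2: fill each missing income with its own cluster's mean.
df["income"] = (df.groupby("cluster")["income"]
                .transform(lambda s: s.fillna(s.mean())))
print(df.drop(columns="cluster"))
```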

The composite imputation process is based on the definition of the following elements:

  • T: a task in the Knowledge Discovery in Databases (KDD) process.
    Examples: T₁ = feature selection, T₂ = clustering, T₃ = creation of association rules, T₄ = imputation, among others.
  • →: operator that defines an order of precedence for KDD tasks. The expression X → Y means that task X precedes task Y.
    Example: clustering → imputation means that the clustering task will precede the imputation task.
  • E(v, B): strategy used in the process of imputing an attribute v from a database B. E(v, B) is represented by T₁ → T₂ → … → Tₚ, where Tₚ is necessarily an imputation task.
  • Α: the algorithm used in the imputation process.
    Examples: Α₁ = mean, Α₂ = k-nearest neighbors algorithm.
  • ⇒: operator that defines an order of precedence for the application of algorithms. The expression Α₁ ⇒ Α₂ means that algorithm Α₁ is applied before algorithm Α₂.
  • P(v, B): imputation plan used in the process of imputing an attribute v from a database B. P(v, B) is represented by Α₁ ⇒ Α₂ ⇒ … ⇒ Αₚ, where Αₚ is necessarily an imputation algorithm.
    Example: Α₁ ⇒ Α₂ ⇒ Α₃ represents the sequenced application of the algorithms Α₁ = k-centroid algorithm, Α₂ = principal component analysis, and Α₃ = k-nearest neighbors algorithm.
  • ψᵢ: the instance of the application of an algorithm Αᵢ, according to parameters Θᵢ = {Θᵢ₁, Θᵢ₂, …, Θᵢₚ}; that is, ψᵢ = f(Αᵢ, Θᵢ).
  • I(v, B): the instance of an imputation plan for an attribute v of a database B, represented by an ordered sequence of instances of algorithm applications, ψ₁ ⇒ ψ₂ ⇒ … ⇒ ψₚ, where ψₚ is necessarily an instance of an imputation algorithm application.
  • ε(I(v, B)): a measure of the error of executing an instance of an imputation plan for the attribute v.

The set of values imputed by an imputation plan consists of the values of the instance of that plan with the lowest average error among all of its instances: ε(P(v, B)) = ε(Iⱼ(v, B)), where ε(Iⱼ(v, B)) ≤ ε(Iₖ(v, B)), ∀k.

In this way, we can define composite imputation as the application of one or more strategies in the process of completing missing data in an attribute v of a database B.
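A hypothetical sketch of choosing between plan instances by their error ε: known values are masked, each candidate imputes them, and the instance with the lowest mean error is selected. The two candidate imputers and the synthetic data are illustrative assumptions:

```python
# Select the imputation instance with the lowest error on masked values.
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

rng = np.random.default_rng(0)
X_true = rng.normal(size=(200, 3))
X = X_true.copy()
mask = rng.random(X.shape) < 0.2   # artificially hide 20% of known values
X[mask] = np.nan

candidates = {"mean": SimpleImputer(strategy="mean"),
              "knn": KNNImputer(n_neighbors=5)}

errors = {name: np.abs(imp.fit_transform(X)[mask] - X_true[mask]).mean()
          for name, imp in candidates.items()}
best = min(errors, key=errors.get)  # the instance with the lowest ε
print(errors, "->", best)
```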

Cascading Data Imputation

Proposed by Ferlin [2008], cascading imputation takes the previous data imputation into account when performing the next one: groups of data that have already been completed are reused for the imputation of later groups, in a cascade effect. This divide-and-conquer approach is intended to simplify the imputation process and improve the quality of the imputed data. In her experiment, Ferlin [2008] uses the approach proposed by Soares [2007] and executes one KDD task (clustering, in her case) before imputing the data. Figure 2 illustrates these concepts.

Figure 2: Cascading imputation concepts. Source: Adapted from [Ferlin, 2008]
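A minimal cascading sketch: columns are imputed one at a time, ordered by missingness, and each newly completed column is reused as a predictor for the next one. The ordering heuristic and toy data are assumptions for illustration; Ferlin [2008] additionally runs a clustering task first:

```python
# Cascading imputation: completed columns feed the next imputation.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 3)), columns=["a", "b", "c"])
df["b"] = df["a"] * 2 + rng.normal(0, 0.1, 100)
df["c"] = df["a"] - df["b"] + rng.normal(0, 0.1, 100)
df.loc[rng.random(100) < 0.2, "b"] = np.nan
df.loc[rng.random(100) < 0.4, "c"] = np.nan

completed = ["a"]                                   # fully observed column(s)
for col in sorted(["b", "c"], key=lambda c: df[c].isna().mean()):
    miss = df[col].isna()
    model = LinearRegression().fit(df.loc[~miss, completed],
                                   df.loc[~miss, col])
    df.loc[miss, col] = model.predict(df.loc[miss, completed])
    completed.append(col)                           # reuse it downstream
```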

In this article, I presented some techniques and concepts for handling missing data imputation. There is no single recipe for this type of problem, so each case must be studied to define the best treatment. In the next articles, I'll demonstrate some results of data imputation.

References

  1. Schafer, J. L. and Graham, J. W. (2002). Missing data: our view of the state of the art. Psychological Methods, 7(2):147.
  2. Little, R. J. and Rubin, D. B. (2019). Statistical Analysis with Missing Data, volume 793. Wiley.
  3. Tavares, R. d. S., Castaneda, R., Ferlin, C., Goldschmidt, R., Alfredo, L. V. C., and Soares, J. d. A. (2018). Apoiando o processo de imputação com técnicas de aprendizado de máquina. Celso Suckow da Fonseca CEFET/RJ, pages 1–6.
  4. Rubin, D. B. (1988). An overview of multiple imputation. In Proceedings of the Survey Research Methods Section of the American Statistical Association, pages 79–84. Citeseer.
  5. Soares, J. (2007). Pré-processamento em mineração de dados: um estudo comparativo em complementação. Doctoral thesis, Engenharia de Sistemas e Computação, UFRJ.
  6. Ferlin, C. (2008). Imputação Multivariada: Uma Abordagem em Cascata. Rio de Janeiro, RJ.
