Detecting And Treating Outliers In Python — Part 2

Hands-On Tutorial On Multivariate Outliers

Alicia Horsch
Towards Data Science


Photo by LegioSeven from Pexels

An Exploratory Data Analysis (EDA) is essential when working on data science projects. Understanding your underlying data, its nature, and structure can simplify decision making on features, algorithms, and hyperparameters. One crucial part of the EDA is the detection of outliers. Outliers are observations that are far away from the other data points in a random sample of a population.

In a previously posted article, I introduced commonly used statistical methods to detect univariate outliers. In this post, I want to discuss what multivariate outliers are and how they can be detected and visualized during an EDA. In a third and final article, I will explain how both types of outliers can be treated:

1. Detecting univariate outliers

2. Detecting multivariate outliers

3. Treatment of both types of outliers

There are many ways to detect outliers, including statistical methods, proximity-based methods, or supervised outlier detection. Again, I will focus solely on commonly used statistical methods. As in my previous post, I will use the Boston housing data set (sklearn library) for illustration and provide example code in Python so you can easily follow along.

Multivariate outliers (Recap)

A multivariate outlier is an unusual combination of values in an observation across several variables. For example, it could be a human with a height measurement of 2 meters (in the 95th percentile) and a weight measurement of 50 kg (in the 5th percentile).

Visualization

A common way to plot multivariate outliers is the scatter plot. Keep in mind that visualizing multivariate outliers across more than two variables is not feasible in a 2D space. Therefore, we will stick to outliers found across two variables for visualization — so-called bi-variate outliers.

The scatterplot visualizes the relationship between two (numerical) variables. In a scatterplot, every observation is plotted as a point with two coordinates (X,Y) that represent two variables. Here, for example, X represents the value for variable 1 and Y the value for variable 2.

Let’s create a scatterplot between two variables from the Boston housing data set. I choose the variables ‘CRIM’ and ‘LSTAT’ to understand the relationship between the crime rate per capita by town and the percentage of lower status in the population. You can play around and choose any set of numerical variables from the data set and see what changes. Other meaningful combinations worth looking at are, for example, ‘DIS’ & ‘INDUS’, ‘LSTAT’ & ‘PTRATIO’ or ‘INDUS’ & ‘ZN’.
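The following sketch shows one way to load the data and produce the plot. Note that load_boston was removed in scikit-learn 1.2, so this assumes an older version of the library (newer versions can fetch the same data from OpenML instead):

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_boston  # removed in scikit-learn >= 1.2

# Load the Boston housing data into a DataFrame
boston = load_boston()
df = pd.DataFrame(boston.data, columns=boston.feature_names)

# Scatterplot: % lower status of the population (LSTAT) vs. crime rate per capita (CRIM)
plt.scatter(df['LSTAT'], df['CRIM'], alpha=0.6)
plt.xlabel('LSTAT')
plt.ylabel('CRIM')
plt.title('CRIM against LSTAT')
plt.show()
```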

Scatterplot: crime rate per capita by town (CRIM) against the percentage of lower status in the population (LSTAT). Image by author.

Like box plots, scatter plots visualize outlying observations very well but do not identify or mark them for easy treatment. When dealing with multivariate outliers, distance metrics can be helpful for detection. A distance metric determines the distance between two vectors. These two vectors can be two different observations (rows) or an observation (row) compared to the mean vector (the row of means of all columns). Distance metrics can be calculated independently of the number of variables in the dataset (columns).

Mahalanobis Distance

A widely used distance metric for the detection of multivariate outliers is the Mahalanobis distance (MD). The MD is a measure that determines the distance between a data point x and a distribution D. It is a multivariate generalization of the internally studentized residuals (z-score) introduced in my last article. This means the MD defines how many standard deviations x is away from the mean of D.

It is defined as:

MD(x) = √((x − µ)ᵀ C⁻¹ (x − µ))

Here, x represents an observation vector and µ the arithmetic mean vector of the independent variables (columns) in the sample. C⁻¹ is the inverse of the covariance matrix of the independent variables in the sample. To better understand the mathematical intuition behind the MD, I recommend reading under point four in this blog post.

Like the z-score, the MD of each observation is compared to a cut-off point. Assuming a multivariate normal distribution of the data with K variables, the squared Mahalanobis distance follows a chi-squared distribution with K degrees of freedom. Using a reasonable significance level (e.g., 2.5%, 1%, 0.01%), the cut-off point is defined as:

c = √χ²(K, 1 − α), i.e. the square root of the chi-squared quantile with K degrees of freedom at the chosen significance level α.

Let’s see an example using a 0.01% significance level and searching for bi-variate outliers in the subset used before.
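The snippet below is a minimal sketch of this step, reusing the df DataFrame from the scatterplot sketch above; the helper mahalanobis_outliers and its variable names are illustrative, not part of any library. It compares the squared MD to the chi-squared quantile, which is equivalent to comparing the MD itself to the square root of the quantile:

```python
import numpy as np
from scipy.stats import chi2

def mahalanobis_outliers(data, alpha=0.0001):
    """Flag observations whose squared Mahalanobis distance exceeds
    the chi-squared cut-off at significance level alpha."""
    X = np.asarray(data, dtype=float)
    mean = X.mean(axis=0)                  # arithmetic mean vector (mu)
    cov = np.cov(X, rowvar=False)          # covariance matrix (C)
    inv_cov = np.linalg.inv(cov)           # C^-1
    diff = X - mean
    # Squared Mahalanobis distance of every observation to the mean vector
    md_sq = np.einsum('ij,jk,ik->i', diff, inv_cov, diff)
    # Cut-off: chi-squared quantile with K (= number of columns) degrees of freedom
    cutoff = chi2.ppf(1 - alpha, df=X.shape[1])
    return md_sq > cutoff

# Bi-variate example: CRIM and LSTAT at a 0.01% significance level
subset = df[['CRIM', 'LSTAT']]
mask = mahalanobis_outliers(subset, alpha=0.0001)
print(f'{mask.sum()} bi-variate outliers')
print(subset[mask])
```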

Using the Mahalanobis distance, we can see that 8 observations are marked as bi-variate outliers. When including all variables of the Boston dataset (13 degrees of freedom), we detect 17 multivariate outliers.

Take a closer look at observation 398. When looking at the subset of two variables, it appears to be an outlier. However, when considering the whole dataset, it is not. Therefore, when running your own data science project, make sure to include only those variables in your detection process that are relevant to your further analysis. Otherwise, some observations may be flagged as outliers because of an irrelevant variable.

A drawback of the MD is that it relies on the arithmetic mean and the sample covariance matrix, which are themselves highly sensitive to outliers in the data. Several methods exist that use robust estimates for µ and C instead. In the following section, I will explain the Minimum Covariance Determinant method, introduced by Rousseeuw, as an example.

Robust Mahalanobis Distance

The Minimum Covariance Determinant (MCD) method provides robust estimates for µ and C by using only a subset of the sample: it selects the subset of observations whose covariance matrix has the smallest possible determinant. The robust MD is defined like the classic MD, but with these robust estimates for mean and covariance:

MD_robust(x) = √((x − µ_MCD)ᵀ C_MCD⁻¹ (x − µ_MCD)), where µ_MCD and C_MCD are the MCD estimates of the mean vector and covariance matrix.

The Python library sklearn includes a function to fit the MCD to any dataset and obtain a robust covariance matrix and mean vector. Let’s go back to our example and see how the result changes:
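A possible sketch using sklearn.covariance.MinCovDet, again reusing df from above; note that its mahalanobis() method returns squared distances, so the cut-off is the chi-squared quantile itself:

```python
from scipy.stats import chi2
from sklearn.covariance import MinCovDet

X = df[['CRIM', 'LSTAT']].values

# Fit the Minimum Covariance Determinant estimator -> robust mean and covariance
mcd = MinCovDet(random_state=0).fit(X)

# mahalanobis() returns the SQUARED distances to the robust location/covariance
robust_md_sq = mcd.mahalanobis(X)

# Same chi-squared cut-off as for the classic MD (K = 2, alpha = 0.01%)
cutoff = chi2.ppf(1 - 0.0001, df=X.shape[1])
robust_mask = robust_md_sq > cutoff
print(f'{robust_mask.sum()} bi-variate outliers (robust MD)')
```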

We can see that the robust MD finds a few more outliers than the classic MD: 10 bi-variate outliers in the two-variable example and 21 multivariate outliers when considering all features of the Boston dataset.

The two plots below (based on the 2D dataset example) help visualize the difference between the classic (left) and robust (right) distance measurement:

Left: CRIM by LSTAT with outliers in orange using MD. Right: CRIM by LSTAT with outliers in orange using robust MD. Image by author.

Wrapping up

Next to univariate outliers, it is also important to examine an underlying data set for multivariate outliers. Both types of outliers can significantly impact the outcomes of a data analysis or machine learning project. In this post, we learned that multivariate outliers are an unusual combination of values in an observation and can be detected with distance metrics.

A commonly used distance metric is the Mahalanobis distance. Its classic definition relies on the mean and covariance of all variables of a dataset. It is, therefore, sensitive to outliers. One way to obtain more robust estimates for the mean and covariance is the Minimum Covariance Determinant (MCD) method.

Finally, it is important to note that there are several other ways of detecting univariate and multivariate outliers. Other popular methods are k-nearest neighbours, DBSCAN, or isolation forests, just to name a few. There is no right or wrong method, but one might be more appropriate than another for your data set. When deciding on the outlier detection method you would like to use, I recommend basing your decision on the data’s distribution, sample size, and the number of dimensions.

Now that we know how to detect univariate and multivariate outliers: what now? In an upcoming post, I want to discuss how outliers can be treated.
