
How To Detect Outliers in a Data Science Project

Three methods to detect outliers, with examples in Python

Photo by Will Myers on Unsplash

One important task at the beginning of a Data Science project is outlier detection. In fact, when we perform Exploratory Data Analysis, one of the things to do is to find outliers and decide how to treat them.

In this article, we will see three methods to detect outliers. But first of all…what is an outlier? Let’s quote Wikipedia:

In statistics, an outlier is a data point that differs significantly from other observations. An outlier may be due to variability in the measurement or it may indicate experimental error; the latter are sometimes excluded from the data set. An outlier can cause serious problems in statistical analyses.

So, an outlier is a data point whose value is much higher or much lower than the rest of the data we are analyzing. Of course, a dataset rarely contains a single outlier: there are usually several, and this is why we often exclude them from the data set; otherwise, they can cause statistical problems in our analysis.

But what criteria should we use to exclude outliers? We’ll see three methodologies.

1. The graphical approach

This is my favorite approach because it gives you the power to decide. As I said before, we often delete outliers from our dataset, and since this approach is graphical you are in charge of deciding which outliers to delete; you decide which data points count as outliers just by looking at the plot.

Let’s see a couple of examples derived from projects I’ve done.

First, let’s say we have a data frame called "food"; let’s create a scatterplot with seaborn:

import seaborn as sns
import matplotlib.pyplot as plt

#defining the size of the figure
sns.set(rc={'figure.figsize':(30,15)})
#showing grid for better visualization
sns.set_style("ticks", {'axes.grid': True})

#plotting the scatterplot
sns.relplot(data=food, x='mean', y='latitude')

#labeling
plt.title('MEAN PRODUCTION PER LATITUDE', fontsize=16) #plot title
plt.xlabel('Mean production [1000 tons]', fontsize=14) #x-axis label
plt.ylabel('Latitude', fontsize=14) #y-axis label

plt.show()
A scatterplot. Image by Author.

Do not worry about what "latitude" and "mean production" refer to; just look at the data: which points would you consider outliers? Here we clearly see that the outliers are just the "higher" values; you could decide that the outliers are the points whose values are greater than 75,000. Even 50,000 would do. As I said, you decide; but base the decision on the whole analysis (this plot alone is not sufficient).
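Once you have picked a cutoff, removing the outliers is a one-liner on the data frame. This is just a minimal sketch, assuming the data frame is the "food" one used above and the 75,000 cutoff we just chose:

#keeping only the rows whose 'mean' value is below the chosen cutoff
food_clean = food[food['mean'] <= 75_000]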

Anyway, this is one method to detect outliers. Another graphical method is to plot boxplots. Let’s see another example from one of my projects.

Let’s say we have a data frame called "phase"; we want to plot a boxplot:

import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches

#mean cycle time, used for the reference line
mean = phase['time'].mean()

#boxplot
sns.boxplot(data=phase, x='month', y='time')
#labeling
plt.title("BOXPLOT", fontsize=16) #plot title
plt.xlabel("MONTH", fontsize=14) #x-axis label
plt.ylabel("CYCLE TIME [m]", fontsize=14) #y-axis label
#adding a horizontal line at the mean cycle time
plt.axhline(mean, color="red")
red_line = mpatches.Patch(color="red",
                          label=f"mean value: {mean:.1f} [m]")
#handling the legend
plt.legend(handles=[red_line], prop={"size": 12})
plt.show()
A boxplot. Image by Author.

Here too, let’s just consider the plot and not what x and y mean. Again the outliers are only the points with "higher" values (the little dots above the whiskers). With boxplots you have a little less control, but outliers are detected based on statistics: in this case, they are the values greater than the maximum of the boxplot. As a reminder, in a boxplot the maximum is calculated as Q3 + 1.5*IQR, where IQR is the inter-quartile range, calculated as IQR = Q3 - Q1, with Q1 the first quartile and Q3 the third quartile.
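To make the rule concrete, here is a minimal sketch that computes the same upper fence by hand for the 'time' column of the "phase" data frame used above:

import numpy as np

#first and third quartile of the 'time' column
q1, q3 = np.percentile(phase['time'], [25, 75])
#inter-quartile range
iqr = q3 - q1
#boxplot "maximum" (upper fence): points above it are outliers
upper_fence = q3 + 1.5 * iqr
outliers = phase[phase['time'] > upper_fence]
print(f'upper fence: {upper_fence:.2f}, number of outliers: {len(outliers)}')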

With this graphical method you have a little less control over which points to consider outliers; or, put better: you have a statistical (yet graphical) methodology to define which values can be considered outliers. So it is not a case of "you decide it all": here statistics helps you, and I find it a very good method.

There is just one big disadvantage with these graphical methods: they can only be used with two-dimensional data. What I mean is that you can use these plots only if you have one column representing the input and one representing the output. For example, if you have 20 columns representing the input data and one representing the output, you have to plot 20 graphs; and what if you have 100 input columns? You can see that these methods are very powerful but can only be used in limited situations. We then need something else.
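Just to give an idea of how quickly this becomes unwieldy, here is a sketch that loops over the input columns of a hypothetical data frame df with a target column 'y', producing one plot per feature:

import seaborn as sns
import matplotlib.pyplot as plt

#one scatterplot per input column: fine for a handful of features,
#impractical for dozens or hundreds of them
for col in df.columns.drop('y'):
    sns.relplot(data=df, x=col, y='y')
    plt.title(f'{col} vs y')
    plt.show()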

2. The Z-score method

Z-score, also known as "standard score", is a statistical measure that tells you how many standard deviations a given observation is away from the mean.

For example, a Z-score of 1.2 means that the data point is 1.2 standard deviations away from the mean.

With this method, you have to define a threshold: when a data point has a Z-score greater than the threshold (often the absolute value of the Z-score is used, to catch both low and high outliers), then it is an outlier.

We calculate Z as follows:

Z = (X − μ) / σ

Where:

  • X is the value of the datapoint
  • μ is the mean
  • σ is the standard deviation

Let’s make an example. Suppose we have these numbers:

import numpy as np
#my random data
data = [1, 2, 2, 2, 3, 1, 1, 15, 2, 2, 2, 3, 1, 1, 2]
#calculating mean
mean = np.mean(data)
#calculating standard deviation
std = np.std(data)
#printing values
print(f'the mean is: {mean: .2f}')
print(f'the standard deviation is:{std: .2f}')
-----------------
>>>
the mean is:  2.67
the standard deviation is: 3.36

Now, we can set a threshold to identify the outliers like this:

#threshold
threshold = 3
#list of outliers
outlier = []
#outlier detection
for i in data:
    z = (i-mean)/std
    if z > threshold:
        outlier.append(i)

print(f'the outliers are: {outlier}')
-----------------
>>>
the outliers are: [15]

We can even use "stats.zscore" from the SciPy library, which can be applied to NumPy arrays and to data frames. Let’s make an example with a data frame:

import numpy as np
import pandas as pd
import scipy.stats as stats
#creating dataframe
data = pd.DataFrame(np.random.randint(0, 10, size=(5, 3)), columns=['A', 'B', 'C'])
#z-score
data.apply(stats.zscore)
----------------
>>>

     A          B           C
0 -0.392232  -0.707107   0.500000
1 -0.392232  -0.353553   -1.166667
2 1.568929    1.767767    1.333333
3 -1.372813  -1.060660   -1.166667
4 0.588348    0.353553    0.500000

And in this case, we have to define a threshold and decide what values to delete.
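As a minimal sketch of that final step, assuming a threshold of 3 on the absolute Z-score, we could keep only the rows where every column is within the threshold:

#absolute z-scores of every column
z = data.apply(stats.zscore).abs()
#keeping only the rows where all columns are within the threshold
data_clean = data[(z < 3).all(axis=1)]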

The Z-score method has several disadvantages:

  • it can only be used with one-dimensional data (single column of data frames, arrays, lists, etc…)
  • it must be used just with normally distributed data
  • we have to define a threshold, depending on the data

3. Isolation Forest

Let’s understand what Isolation Forest is, quoting from Wikipedia:

Isolation forest is an anomaly detection algorithm. It detects anomalies using isolation (how far a data point is to the rest of the data), rather than modelling the normal points

Going deeper, we can say that Isolation Forest is built based on decision trees, similar to Random Forest, and it is an unsupervised model, as there are no pre-defined labels. It is nothing but an ensemble of binary decision trees, where each tree is called an "Isolation Tree".

This means that in an Isolation Forest we have randomly sub-sampled data that are processed in a tree structure based on randomly selected features. The samples that travel deeper into the tree are less likely to be anomalies, as they require more cuts to isolate them.

Let’s take an example using the famous "diabetes dataset" provided by scikit-learn:

from sklearn.datasets import load_diabetes #importing data
from sklearn.ensemble import IsolationForest #importing IF
#importing dataset
diab = load_diabetes()
#defining feature and label
X = diab['data']
y = diab['target']
#creating dataframe
df = pd.DataFrame(X, columns=["age","sex","bmi","bp", "tc", "ldl", "hdl","tch", "ltg", "glu"])
#checking shape
df.shape
------------------
>>>
(442, 10)

So, this data frame has 442 rows and 10 columns. Let’s use the Isolation Forest now:

#identifying outliers
iso = IsolationForest()
y_outliers = iso.fit_predict(df)
#dropping the outlier rows (fit_predict returns -1 for outliers, +1 for inliers)
for i in range(len(y_outliers)):
    if y_outliers[i] == -1:
        df.drop(i, inplace = True)
#checking the new dataframe shape
df.shape
---------------------
>>>
(388, 10)

As we can see, the number of rows decreased because we dropped the rows with the outliers.
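A note on the dropping step: the loop above works because the index runs from 0 to 441; an equivalent, more idiomatic way to do the same thing (just a sketch) is to filter with a boolean mask:

#keeping only the rows that the Isolation Forest labeled as inliers (+1)
df = df[y_outliers != -1]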

We have to remember that we have used an ensemble, unsupervised model; this means that if we run all the code again, the final shape (the shape of the data frame after dropping the rows with the outliers) may be different from 388 (try it for yourself, as an exercise).
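If you need reproducible results, scikit-learn lets you fix the random seed and, optionally, declare the expected share of outliers via the contamination parameter; here is a sketch (the 5% value is just an assumption for illustration):

#fixing the seed makes the detected outliers reproducible across runs;
#contamination sets the expected proportion of outliers (assumed 5% here)
iso = IsolationForest(random_state=42, contamination=0.05)
y_outliers = iso.fit_predict(df)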

Isolation Forest, even if it does not leave you much control over the data (it is an unsupervised model), is the only one among the three methods we have seen in this article that gives us the possibility to treat (and drop) outliers in multi-dimensional data frames; in fact, we have worked on the whole dataset, without reducing it.

Conclusions

We have seen three methods to detect outliers. Isolation Forest is the only one (of the three) that lets us work with multi-dimensional data, so it is generally the one to use; unless we are studying a very simple set of data, in which case we can use the graphical method (I’m a big fan of graphs!) or the Z-score.


Let’s connect together!

MEDIUM (follow me)

LINKEDIN (send me a connection request)

If you want, you can subscribe to my mailing list so you can always stay updated!



