Introduction
Outliers are a frequently discussed subject in Data Science forums and blogs, probably because these data points can distort our analysis and hurt the modeling if the algorithm we’re using is not robust to those anomalies.
In most datasets, the majority of the observations fall within a certain range of values, following some pattern and staying not too far from the "group". These are the inliers. But there will also be observations that don’t fit anywhere, that sit far away from the rest of the data and don’t follow that pattern. Those are the anomalies, the outliers.
One algorithm that is heavily affected by outliers is good old Linear Regression. An observation that sits too far from the central values pulls the fitted line towards it, making the model perform worse.
Let’s learn two quick ways to find outliers using unsupervised learning algorithms: the Local Outlier Factor (which builds on k-nearest neighbors) and the Gaussian Mixture Model, both from Scikit-Learn.
Local Outlier Factor (LOF)
This algorithm is available in the sklearn.neighbors module, and you can import it using from sklearn.neighbors import [LocalOutlierFactor](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.LocalOutlierFactor.html). It relies on the k-nearest neighbors algorithm, and the interesting thing here is that LOF has the contamination hyperparameter to help us set the threshold for outliers. So, if you use contamination=0.1, you are telling the model to treat 10% of the data as outliers.
The amount of contamination of the data set, i.e. the proportion of outliers in the data set. (sklearn documentation)
The code applying this algorithm comes next. We will be using the _car_crashes_ dataset that ships with the seaborn package in Python.
# Imports and dataset
import pandas as pd
import seaborn as sns
df = sns.load_dataset('car_crashes')
Now I will add some weird data points that will be our outliers. Notice that I want them to be very different from the rest, so I can demonstrate the outlier detectors in action.
# Creating very odd observations = OUTLIERS
s1 = pd.DataFrame([30,30,30,30,30,30,30,'AB']).T
s2 = pd.DataFrame([40,40,40,40,40,40,40,'AB']).T
s3 = pd.DataFrame([30,30,30,30,50,50,50,'AB']).T
s4 = pd.DataFrame([99,99,99,39,99,59,59,'AB']).T
s5 = pd.DataFrame([99,99,90,9,99,9,99,'AB']).T
s1.columns = s2.columns = s3.columns = s4.columns = s5.columns = df.columns

# Adding them to the dataset
df = pd.concat([df, s1, s2, s3, s4, s5], axis=0)
# X
X = df.drop('abbrev', axis=1)
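One small detail worth noting (this tidy-up is not in the original post): because the fake rows were built as object-dtype DataFrames, the concatenated columns end up as object dtype. Scikit-learn will coerce them to floats internally, but you can make the conversion explicit:

# Convert the object-dtype columns back to numeric (optional)
X = X.apply(pd.to_numeric)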
Next, we can import Pipeline and StandardScaler and create this simple pipeline to scale the data, putting everything on the same range, and then run the LOF algorithm. Notice that we’re using a contamination rate of 9%: after appending the 5 fake rows, the dataset has 56 observations, and 5/56 ≈ 9%.
# Let's create a Pipeline to scale the data and flag outliers with LOF
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import LocalOutlierFactor

steps = [
    ('scale', StandardScaler()),
    ('LOF', LocalOutlierFactor(contamination=0.09))
]

# Fit and predict
outliers = Pipeline(steps).fit_predict(X)
This is what you would see if you print outliers: an array of labels where 1 marks an inlier and -1 marks an outlier.
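As a rough illustration (the exact array depends on the row order; since our fake rows were appended at the end, the five -1 labels show up last):

print(outliers)
# e.g. [ 1  1  1 ...  1 -1 -1 -1 -1 -1]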
Once this is done, we can add the labels to the original dataset and look at it.
# Add column
df['outliers'] = outliers
# Look at the top 8
df.sort_values(by='outliers').head(8)
Excellent. We were able to find the outliers previously created. If we use t-SNE to create a 2D plot of this dataset, here’s what we will see (the code for this plot can be found in the GitHub link at the end of the article).
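The exact plotting code lives in the GitHub repo linked at the end; here is a minimal sketch of one way to produce a similar picture (the TSNE settings below are my own assumptions, not necessarily the repo’s):

from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Project the scaled features into 2D (perplexity chosen arbitrarily for this sketch)
embedded = TSNE(n_components=2, perplexity=10, random_state=42).fit_transform(
    StandardScaler().fit_transform(X))

# Color each point by its LOF label: -1 = outlier, 1 = inlier
colors = ['red' if label == -1 else 'steelblue' for label in outliers]
plt.scatter(embedded[:, 0], embedded[:, 1], c=colors)
plt.title('t-SNE projection with LOF outliers highlighted')
plt.show()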
Going over what has just happened: LOF uses the k-nearest neighbors of each point to compare its local density with the density around its neighbors. Points sitting in much sparser regions than their neighbors get a high outlier factor. Naturally, as the rows we created are very different from the rest of the data, they are marked as anomalies.
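If you want to peek at the raw scores behind those labels, the fitted estimator exposes the negative_outlier_factor_ attribute. A small sketch, fitting LOF directly instead of through the pipeline above:

# Values far below -1 mean a point is much less dense than its neighbors
lof = LocalOutlierFactor(contamination=0.09)
labels = lof.fit_predict(StandardScaler().fit_transform(X))
print(lof.negative_outlier_factor_[labels == -1])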
Gaussian Mixture Models
Another algorithm we can use is the Gaussian Mixture Model (GMM). It looks at the data and divides it into n groups. To assign each observation to a group, the algorithm fits n Gaussian distributions and then checks which of those distributions each data point fits best (i.e., gives it the highest probability).
The GMM from sklearn can also score each observation according to the density of the region of the space where it sits. Points in high-density areas are unlikely to be outliers, while points in low-density areas are exactly where the outliers will be.
Let’s code. First, import the class we need.
from sklearn.mixture import GaussianMixture
Next, we can fit the GMM. We are using 3 components, which means the data is being modeled as a mixture of 3 Gaussian distributions. n_init=10 makes the GMM run 10 different initializations and keep the best fit.
gm = GaussianMixture(n_components=3, n_init=10)
gm.fit(X)
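If you are curious about the groups mentioned above, the fitted model can show which Gaussian each observation was assigned to, and the membership probabilities (just an illustrative check, not part of the original walkthrough):

# Hard assignment: index of the Gaussian each row belongs to
print(gm.predict(X)[:5])
# Soft assignment: probability of belonging to each of the 3 components
print(gm.predict_proba(X)[:5])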
Ok, now we must calculate the scores.
# Finding densities
density_scores = gm.score_samples(X)
score_samples returns the log-likelihood of each observation, so if we print it, we will see only negative numbers. I will just take the absolute value of them to make the numbers easier to work with when calculating the percentiles.
density_scores = abs(density_scores)
Now we can calculate the percentile we want to use as the threshold, i.e., the share of points we want to flag as outliers. Let’s work with the same 9%.
# Define threshold
import numpy as np
threshold = np.percentile(density_scores, 9)

# Finding outliers
X[density_scores < threshold]
X['densities'] = density_scores
The threshold value is 16.36102. We can see that this approach finds the same results: everything below that number corresponds to our fake outliers.
It worked as expected.
Before You Go
Finding outliers is a task that can be performed in many ways. There are other good algorithms out there, like Isolation Forest (sketched briefly below), as well as other methods like the Z-Score and the IQR… a number of options.
I wanted to show these two because I found them to be very easy to apply.
For LOF, just use:
LocalOutlierFactor(contamination=n)
For GMM, use:
scores = gm.score_samples(X)
threshold = np.percentile(scores, 9)
X[scores < threshold]
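And if you want to try the Isolation Forest mentioned earlier, it follows the same fit_predict pattern as LOF, returning -1 for outliers and 1 for inliers (a quick illustrative sketch, not covered in this article):

from sklearn.ensemble import IsolationForest

iso_labels = IsolationForest(contamination=0.09, random_state=42).fit_predict(X)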
Here’s the full code in GitHub.
If you liked this content, follow my blog.
Find me on LinkedIn. If you’re considering joining Medium, here’s a referral link.