Autoencoder

A classical obstacle you stumble upon in data science and machine learning is outliers. The concept of an outlier is intuitively clear to a human, yet there is no generally meaningful mathematical definition apart from simple hacks from Stats 101 that involve the standard deviation or the interquartile range.
I think the reason might be that it is pretty darn tricky to come up with one since an outlier can mean different things to different people. Take the following dataset as an example:

I claim that most people – including me – would consider the point in the top right corner an outlier. The point in the middle is more interesting: On the one hand, it is not on this imaginary circle that you draw in your head, so it should be an outlier. On the other hand, neither its x nor y coordinate is really crazy compared to the other points – the center point is literally as average as it can be.
I would still consider this point an outlier since it feels right that the actual data points form a circle. However, this is nothing that you can turn into an actual algorithm for finding outliers.
If this little two-dimensional example can spark some discussion already, just imagine what happens when we deal with higher dimensional data that you cannot visualize anymore to take a look at it.
In this article, you will learn how to build an outlier detector using autoencoders in TensorFlow in only a few lines of code.
Why do we Care?
By definition, outliers are not like other data points. Among other reasons, outliers can be:
- measurement errors or typos: somebody enters 777 kg instead of 77 kg as a weight of a person
- dummy values: something couldn’t be measured, and a data pipeline imputes 999 as a fixed value
- rare data points / natural outliers: if you have a dataset consisting of used cars, a Ferrari is not like the other cars
We care about outliers because they can heavily skew the results of analyses and of Machine Learning model training.

Feeding a model outliers that stem from errors, typos, and dummy values basically means that you feed the model garbage, which in turn makes the model learn and output garbage.
Feeding a model with natural outliers might result in overfitting on these rare data points, which is an important thing to keep in mind.
Note: There are a lot of ways to deal with outliers, but I will not go into detail here. The most frequent method is just to drop outliers. So you lose some data, but the overall quality of your analysis or model might improve.
Finding Outliers
There is a variety of ways to find outliers. From the awesome scikit-learn website:

However, we will focus on another approach that you don’t (yet?) find in scikit-learn.
Autoencoders
This approach is based on autoencoders, which I introduced in another article.
As a reminder: a good autoencoder should be able to compress (encode) data into fewer dimensions and then decompress (decode) it again without introducing many errors.
We can use this intuition to come up with the following logic to find outliers with an autoencoder:
If the autoencoder introduces a large error for some data point, this data point might be an outlier.
To give some more intuition: an autoencoder tries to learn good encodings for a given dataset. Since most data points in the dataset are not outliers, the autoencoder will be influenced most by the normal data points and should perform well on them.
An outlier is something that the autoencoder has not seen often during training, so it might have trouble finding a good encoding for it. Easy, right? Let’s try out this idea!
Simple Example With a TensorFlow Autoencoder
In the following, we will take a look at a simple example that will help you to understand and implement the above-mentioned logic.
Encode The Circle
Let us put our introductory example with the circle into code. First, let us create a dataset.
import tensorflow as tf
tf.random.set_seed(1234)
t = tf.expand_dims(tf.linspace(0., 2*3.14, 1000), -1)
noise = tf.random.normal((1000, 2), stddev=0.05)
points = tf.concat([tf.cos(t), tf.sin(t)], axis=1) + noise
Plotting this should look like this:

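In case you want to reproduce the plot yourself, here is a minimal matplotlib sketch (the data generation from above is repeated so that it runs on its own; the Agg backend just renders to a file and can be dropped for interactive use):

```python
import tensorflow as tf
import matplotlib
matplotlib.use("Agg")  # render off-screen to a file; remove for interactive plots
import matplotlib.pyplot as plt

# same noisy circle as above
tf.random.set_seed(1234)
t = tf.expand_dims(tf.linspace(0., 2 * 3.14, 1000), -1)
noise = tf.random.normal((1000, 2), stddev=0.05)
points = tf.concat([tf.cos(t), tf.sin(t)], axis=1) + noise

plt.figure(figsize=(5, 5))
plt.scatter(points[:, 0].numpy(), points[:, 1].numpy(), s=5)
plt.gca().set_aspect("equal")  # otherwise the circle looks like an ellipse
plt.savefig("circle.png")
```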
Let’s add some outliers now.
points_with_outliers = tf.concat(
    [
        points,
        tf.constant([[0., 0.], [2., 2.]])  # the outliers
    ],
    axis=0
)
The result:

Now we are ready to define and train a simple autoencoder that compresses the two-dimensional data down into a single dimension.
shuffled_points = tf.random.shuffle(points)

encoder = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1)  # one-dimensional output
])
decoder = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(2)  # decode to two dimensions again
])
autoencoder = tf.keras.Sequential([
    encoder,
    decoder
])

autoencoder.compile(loss="mse")
autoencoder.fit(
    x=shuffled_points,     # goal is that the output is
    y=shuffled_points,     # close to the same input
    validation_split=0.2,  # to check if the model is generalizing
    epochs=500
)
# Output:
# [...]
# Epoch 500/500
# 25/25 [==============================] - 0s 2ms/step - loss: 0.0037 - val_loss: 0.0096
We can now put our points into the autoencoder via
reconstructed_points = autoencoder(points_with_outliers)
and see what the reconstruction looks like:

We can see that our autoencoder did a good job of recognizing the underlying pattern. It’s not a perfect circle, but still good enough, I would argue.
As a side note, here we can see that our autoencoder basically learned to unroll the circle into a one-dimensional line and roll it back again:

Furthermore, looking at the colors we can see that the reconstructed points are quite close to their originals.
Getting The Outliers
But what happens with the outliers? Let’s throw all of our points into the model now and highlight our two added outliers.

The autoencoder puts both outliers on its learned circle approximation as well, meaning that both of them moved quite a lot because both of them were far away from the circle before. Let us print some numbers:
import pandas as pd

reconstruction_errors = tf.reduce_sum(
    (autoencoder(points_with_outliers) - points_with_outliers)**2,
    axis=1
)  # squared reconstruction error per point

pd.DataFrame({
    "x": points_with_outliers[:, 0],
    "y": points_with_outliers[:, 1],
    "reconstruction_error": reconstruction_errors
})

Even without much analysis, we can already see that the normal data points have a tiny reconstruction error of about 0.002, while our artificially added outliers have errors of about 1 and 4. Nice!
But wait, we just looked at a table again to find the outliers. Can’t we automate that?
Good question! Yes, we can. There are several ways to proceed from here. For example, we can declare every point with a reconstruction error greater than µ + 3σ an outlier, where µ is the mean of all reconstruction errors and σ their standard deviation.
Another option is to say that the data points with the highest x% of reconstruction errors are outliers. In this case, we assume that x% of the dataset consists of outliers; this x is also known as the contamination factor that you can find in several other outlier detection algorithms, such as isolation forests.
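Both rules are easy to express in code. Here is a NumPy sketch with made-up reconstruction errors: twenty "normal" errors of about 0.002 plus the two large ones we saw above:

```python
import numpy as np

# made-up reconstruction errors: 20 normal points plus two outliers
reconstruction_errors = np.concatenate([np.full(20, 0.002), [1.0, 4.0]])

# Rule 1: everything above mean + 3 standard deviations is an outlier
threshold_sigma = reconstruction_errors.mean() + 3 * reconstruction_errors.std()
outliers_sigma = reconstruction_errors > threshold_sigma

# Rule 2: the top x% of errors are outliers (the contamination factor),
# here with x = 10%
contamination = 0.1
threshold_quantile = np.quantile(reconstruction_errors, 1 - contamination)
outliers_quantile = reconstruction_errors > threshold_quantile
```

Note that the µ + 3σ rule is itself sensitive to the outliers it is trying to find: the large errors inflate σ, so on these made-up numbers it only catches the most extreme point, while the quantile rule flags both.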
Comparison With Scikit-Learn Outlier Detectors
I created a quick and dirty scikit-learn compatible outlier detector to be able to compare this approach to the other scikit-learn outlier detectors as done here.
from sklearn.base import BaseEstimator, OutlierMixin
import numpy as np

class AutoencoderOutlierDetector(BaseEstimator, OutlierMixin):
    def __init__(self, keras_autoencoder, contamination):
        self.keras_autoencoder = keras_autoencoder
        self.contamination = contamination

    def fit(self, X, y=None):
        self.cloned_model_ = tf.keras.models.clone_model(self.keras_autoencoder)
        self.cloned_model_.compile(loss="mse")
        self.cloned_model_.fit(
            x=X,
            y=X,
            epochs=1000,
            verbose=0,
            validation_split=0.2,
            callbacks=[
                tf.keras.callbacks.EarlyStopping(patience=10)
            ]
        )
        reconstruction_errors = tf.reduce_sum(
            (self.cloned_model_(X) - X) ** 2,
            axis=1
        ).numpy()
        self.threshold_ = np.quantile(reconstruction_errors, 1 - self.contamination)
        return self

    def predict(self, X):
        reconstruction_errors = tf.reduce_sum(
            (self.cloned_model_(X) - X) ** 2,
            axis=1
        ).numpy()
        return 2 * (reconstruction_errors < self.threshold_) - 1  # +1 inlier, -1 outlier
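To see the wrapper in action, here is a condensed, self-contained usage sketch. It simplifies the class above slightly so that it runs quickly: it trains the passed model directly instead of cloning it, and caps the number of epochs.

```python
import numpy as np
import tensorflow as tf
from sklearn.base import BaseEstimator, OutlierMixin

class AutoencoderOutlierDetector(BaseEstimator, OutlierMixin):
    # condensed variant of the class above: trains the passed model
    # directly (no clone_model) and uses fewer epochs
    def __init__(self, keras_autoencoder, contamination):
        self.keras_autoencoder = keras_autoencoder
        self.contamination = contamination

    def fit(self, X, y=None):
        self.keras_autoencoder.compile(loss="mse")
        self.keras_autoencoder.fit(
            x=X, y=X, epochs=200, verbose=0, validation_split=0.2,
            callbacks=[tf.keras.callbacks.EarlyStopping(patience=10)],
        )
        errors = tf.reduce_sum(
            (self.keras_autoencoder(X) - X) ** 2, axis=1
        ).numpy()
        self.threshold_ = np.quantile(errors, 1 - self.contamination)
        return self

    def predict(self, X):
        errors = tf.reduce_sum(
            (self.keras_autoencoder(X) - X) ** 2, axis=1
        ).numpy()
        return 2 * (errors < self.threshold_) - 1  # +1 inlier, -1 outlier

# circle data with two planted outliers, as in the example above
tf.random.set_seed(1234)
t = tf.expand_dims(tf.linspace(0., 2 * np.pi, 200), -1)
circle = tf.concat([tf.cos(t), tf.sin(t)], axis=1)
X = tf.concat([circle, tf.constant([[0., 0.], [2., 2.]])], axis=0).numpy()

autoencoder = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1),  # bottleneck
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(2),
])

detector = AutoencoderOutlierDetector(autoencoder, contamination=0.01)
labels = detector.fit(X).predict(X)  # array of +1 (inlier) / -1 (outlier)
```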
If we use the same autoencoder architecture that we also used for the circle, we get the following:

And here we see one drawback of this whole approach: if we use the wrong neural network architecture, we might end up with strange shapes. If we use a simpler autoencoder with 2 units per hidden layer instead of 16 as before, we get:

This is slightly better I would argue, but still far from being good.
Homework for you: find an architecture that performs decently on these small toy datasets!
Conclusion
Outliers have the potential to break analyses and models. Hence, we should be able to spot and deal with them. One way to do this is by using autoencoders: a machine learning model – often a neural network – that encodes and then decodes data again. If the encoding/decoding step does not work well for a data point, this point might be an outlier.
This fundamental idea is neat, but you still have to put a lot of thought into how to design the autoencoder. While this can be quite difficult, it also gives you a freedom that other outlier detection models don’t offer: isolation forests and co. have a fixed architecture that only lets you adjust some hyperparameters. With autoencoders, you are free – with all the advantages and disadvantages that come with it.
I hope that you learned something new, interesting, and useful today. Thanks for reading!
If you have any questions, write me on LinkedIn!
And if you want to dive deeper into the world of algorithms, give my new publication All About Algorithms a try! I’m still searching for writers!