
Recently I stumbled upon a new (to me) outlier detection algorithm – the Hampel filter. In this short article, I would like to describe how it works and how to use it in practice. To the best of my knowledge, there is no Python library containing this algorithm, so we will implement it from scratch using two different approaches (a for-loop and pandas). In the end, we will see which one performs better in terms of execution speed.
Update: the Hampel filter is now implemented in sktime; you can find more about it here.
Hampel Filter
Let’s start with a quick theoretical introduction. There are not many online resources describing the algorithm (there is not even a Wikipedia page), but it is simple enough to quickly understand the logic. I also put some references at the end of the article.
The goal of the Hampel filter is to identify and replace outliers in a given series. It uses a sliding window of configurable width to go over the data. For each window (the given observation and the 2 * window_size surrounding elements, window_size on each side), we calculate the median and the standard deviation expressed as the median absolute deviation (MAD).
For the MAD to be a consistent estimator of the standard deviation, we must multiply it by a constant scale factor k. The factor depends on the distribution; for the Gaussian distribution it is approximately 1.4826.
If the considered observation differs from the window median by more than x standard deviations, we treat it as an outlier and replace it with the median.
The Hampel filter has two configurable parameters:
- the size of the sliding window
- the number of standard deviations which identify the outlier
We select these two parameters depending on the use-case. A higher standard deviation threshold makes the filter more forgiving, a lower one identifies more points as outliers. Setting the threshold to 0 corresponds to John Tukey’s median filter.
As the filter uses a sliding window, it makes the most sense to use it with time-series data, where the order of the data is governed by time.
Python Implementation
Import libraries
The first step is importing the required libraries.
import matplotlib.pyplot as plt
import warnings
import pandas as pd
import numpy as np
Random walk with outliers
Before implementing the algorithm, we create an artificial dataset using random walk. We define a function that accepts the percentage of outliers as a parameter and randomly scales some random walk increments by a constant multiplier.
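The article does not show the body of the generator, only its call signature, so the following is a hedged reconstruction: a random walk with ±1 increments, where a random fraction of the increments is inflated by a constant multiplier (the multiplier and seed are assumptions made for reproducibility).

```python
import numpy as np

def random_walk_with_outliers(origin, n_steps, perc_outliers=0.0,
                              outlier_mult=10, seed=42):
    """Generate a random walk and inflate a random subset of increments.

    origin        - starting value of the walk
    n_steps       - length of the generated series
    perc_outliers - fraction of observations to turn into outliers
    outlier_mult  - constant multiplier applied to outlier increments
                    (hypothetical value, not given in the article)

    Returns the walk (np.ndarray) and the sorted indices of the outliers.
    """
    rng = np.random.default_rng(seed)
    steps = rng.choice([-1.0, 1.0], size=n_steps)

    # pick which increments to inflate
    n_outliers = int(n_steps * perc_outliers)
    outlier_ind = rng.choice(np.arange(n_steps), size=n_outliers, replace=False)
    steps[outlier_ind] *= outlier_mult

    rw = origin + np.cumsum(steps)
    return rw, np.sort(outlier_ind)
```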
Using the function we generate artificial data and plot it below, together with the introduced outliers.
rw, outlier_ind = random_walk_with_outliers(0, 1000, 0.01)
plt.plot(np.arange(len(rw)), rw)
plt.scatter(outlier_ind, rw[outlier_ind], c='r', label='outlier')
plt.title('Random Walk with outliers')
plt.xlabel('Time steps')
plt.ylabel('Values')
plt.legend();

Evaluating the results
We also need to define a function for evaluating the results of our outlier detection algorithms. For that, we will plot the actual vs. detected outliers and return a short summary of the performance.
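A minimal sketch of such a helper is shown below. It matches the call signature used later in the article, but the body is a reconstruction; the plotting of actual vs. detected outliers from the article's version is omitted here for brevity.

```python
import numpy as np

def evaluate_detection(series, true_indices, detected_indices):
    """Compare detected outlier indices against the ground truth.

    series is kept in the signature for the plotting step of the
    article's version (not reproduced here). Returns the true
    positives, false positives and false negatives as index arrays
    and prints a short summary.
    """
    true_set = set(int(i) for i in true_indices)
    detected_set = set(int(i) for i in detected_indices)

    tp = np.array(sorted(true_set & detected_set))  # correctly detected
    fp = np.array(sorted(detected_set - true_set))  # falsely flagged
    fn = np.array(sorted(true_set - detected_set))  # missed outliers

    print(f'True positives: {len(tp)}, false positives: {len(fp)}, '
          f'false negatives: {len(fn)}')
    return tp, fp, fn
```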
Hampel filter implementation
for-loop implementation
We start with the for-loop implementation of the Hampel filter. We base the code on the one from the pracma R package.
We run the algorithm on the random walk series. We chose a window size of 10, but this value should be determined empirically for the best performance.
res, detected_outliers = hampel_filter_forloop(rw, 10)
We evaluate the results using the previously defined helper function.
tp, fp, fn = evaluate_detection(rw, outlier_ind, detected_outliers)

In the plot, we can see that the algorithm correctly identified 8 (out of 10) outliers, mistook one observation for an outlier (red dot) and missed 2 outliers (black dots). Perhaps better performance can be achieved by tuning the window size. However, this is sufficient for the purpose of the exercise.
We additionally plot the transformed series, in which the outliers were replaced with the window median.
plt.plot(np.arange(len(res)), res);
plt.scatter(outlier_ind, rw[outlier_ind], c='g', label='true outlier')
plt.scatter(fp, rw[fp], c='r', label='false positive')
plt.title('Cleaned series (without detected outliers)')
plt.legend();

pandas implementation
For the pandas implementation we make use of the rolling method of a pd.Series and a lambda function. In the rolling method we use centering, so the considered observation is in the middle of a 2 * window_size + 1 window. Before running the algorithm, we transform the RW from np.ndarray to pd.Series.
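A sketch of the pandas version is below. The body is reconstructed from the description above: I use a centered window of 2 * window_size + 1 observations so the current observation sits exactly in the middle, matching the for-loop version.

```python
import numpy as np
import pandas as pd

def hampel_filter_pandas(input_series, window_size, n_sigmas=3):
    """Hampel filter over a pd.Series using a centered rolling window."""
    k = 1.4826  # scale factor for the Gaussian distribution
    new_series = input_series.copy()

    # centered window: the current observation is in the middle
    # of 2 * window_size + 1 observations
    rolling_median = input_series.rolling(
        window=2 * window_size + 1, center=True).median()
    mad = lambda x: np.median(np.abs(x - np.median(x)))
    rolling_mad = k * input_series.rolling(
        window=2 * window_size + 1, center=True).apply(mad, raw=True)

    diff = np.abs(input_series - rolling_median)
    outlier_idx = list(np.argwhere((diff > n_sigmas * rolling_mad).values).flatten())
    new_series[outlier_idx] = rolling_median[outlier_idx]
    return new_series, outlier_idx
```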
rw_series = pd.Series(rw)
res, detected_outliers = hampel_filter_pandas(rw_series, 10)
tp, fp, fn = evaluate_detection(rw, outlier_ind, detected_outliers)

The results of both approaches are identical, which is always a good sign 🙂
Performance comparison
At this point, we test the two implementations against each other in terms of execution speed. We expect the pandas one to perform faster.
First, we test the for-loop implementation:
%%timeit
res, detected_outliers = hampel_filter_forloop(rw, 10)
# 67.9 ms ± 990 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Then, we run the analogous test for the pandas implementation:
%%timeit
res, detected_outliers = hampel_filter_pandas(rw_series, 10)
# 76.1 ms ± 4.37 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
We see that the pandas implementation turned out to be slower. One hypothesis to test is that it will be faster for a larger series. That is why we increase the length of the random walk series to 100,000 and test the performance once again.
rw, outlier_ind = random_walk_with_outliers(0, 10 ** 5, 0.01)
rw_series = pd.Series(rw)
Having prepared the data, we move to test the performance:
%%timeit
res, detected_outliers = hampel_filter_forloop(rw, 10)
# 6.75 s ± 203 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
res, detected_outliers = hampel_filter_pandas(rw_series, 10)
# 6.76 s ± 30.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
It turns out that now they achieve comparable performance, with the pandas implementation being more stable (lower standard deviation).
Bonus: Increasing the speed of the for-loops with numba
As a bonus, we explore the possibility of speeding up the for-loop-based code with numba. numba is a library that, under the hood, translates Python code into optimized machine code. When such a translation is possible, numba can make certain algorithms written in Python approach the speed of C. The best part about numba is that (when applicable) the speed boost comes at a very small price in terms of coding: we only need to import the library and add the @jit decorator before the function we want to translate to machine code.
The nopython argument indicates whether we want numba to use purely machine code or to fall back to Python code when necessary. Ideally, this should always be set to True, as long as numba returns no errors.
Below we test the execution speed.
%%timeit
res, detected_outliers = hampel_filter_forloop_numba(rw, 10)
# 108 ms ± 1.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
It only takes 108 ms (milliseconds), as compared to 6.76 s in the case of the pandas implementation. That is a massive speed-up of ~63 times!
Conclusions
Summing up, in this article we explained how the Hampel filter works for outlier detection and how to implement it in Python. We also compared the implementations in terms of execution speed.
It is also good to consider the drawbacks of the Hampel filter:
- it has problems detecting outliers at the beginning and the end of the series – when the window is incomplete (even on one side), the function does not flag possible outliers
- it struggles to detect outliers when they are close to each other – within the window range
As always, any constructive feedback is welcome. You can reach out to me on Twitter or in the comments. You can find the code used for this article on my GitHub.
References
- Hampel, F. R., "The influence curve and its role in robust estimation," Journal of the American Statistical Association, 69, 382–393, 1974.
- Liu, Hancong, Sirish Shah, and Wei Jiang, "On-line outlier detection and data cleaning," Computers and Chemical Engineering, Vol. 28, March 2004, pp. 1635–1647.
- Suomela, Jukka, "Median Filtering Is Equivalent to Sorting," 2014.