
How to Deal with Time Series Outliers

Understanding, detecting and replacing outliers in time series

Photo by Milton Villemar on Unsplash

In this post, we’ll explore:

  • Different types of time series outliers
  • Prediction-based and estimation-based methods for detecting outliers
  • How to deal with unwanted outliers using replacement

Types of Outliers

Outliers are observations that deviate significantly from normal behavior.

Time series can exhibit outliers due to unusual, non-repetitive events. These affect time series analysis and can mislead practitioners into erroneous conclusions or defective forecasts. So, identifying and dealing with outliers is a key step in ensuring reliable time series modelling.

In time series, outliers are usually split into two types: additive outliers and innovational outliers.

Additive Outliers

An additive outlier is an observation that exhibits an unusually high (or low) value relative to historical data.

An example of an additive outlier is a surge in the sales of a product due to a promotion or related viral content. Sometimes these outliers occur due to erroneous data collection. The term additive refers to the non-persistent effect of the outlier on the underlying system: the unusual value is confined to the respective observation, after which the time series resumes its normal patterns.

A time series with a few additive outliers. Image by author.

Additive outliers can span consecutive observations. These are also known as subsequence outliers or outlier patches.

Innovational Outliers

Innovational outliers are similar to additive ones, but their effect persists: the outlier also affects subsequent observations. A common example is the number of visits to a website that increases due to some viral content. The website may continue to experience a higher-than-usual number of visits until the effect fades.

A time series with innovational outliers. Image by author.

One way to deal with innovational outliers is intervention analysis: for example, adding a dummy variable whose effect fades over time.
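Here is a minimal sketch of such an intervention dummy. The series length, outlier position, and decay rate below are hypothetical, chosen only for illustration:

```python
import numpy as np
import pandas as pd

# hypothetical series of length 100 with an innovational outlier at t=50
n = 100
t_outlier = 50

# dummy variable: 1 at the intervention, decaying geometrically afterwards
decay = 0.8
dummy = np.zeros(n)
dummy[t_outlier:] = decay ** np.arange(n - t_outlier)

# this can be passed as an exogenous regressor to a forecasting model
X = pd.DataFrame({'intervention': dummy})
```

The decay rate controls how quickly the outlier's impact fades; a rate of 1 would model a permanent level shift instead.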

Relation to Change Points

Outliers are related to the notion of change. Some observations, known as change points, mark the onset of a structural change in the time series.

These change points are related to, but distinct from, outliers. An outlier is an anomalous observation relative to a specific distribution; change points are structural breaks characterised by a change in that distribution.


Meaning of Outliers

How you deal with outliers depends on their nature and the goal of the analysis.

Outliers that arise from noise, such as errors in data collection, are unwanted data. This type of outlier should be removed or replaced before analysis.

On the other hand, some outliers are interesting in themselves, and important to predict. Removing them may lead to misleading conclusions or overly optimistic forecasts. Examples occur in various domains, such as fraud detection or energy. Consider a time series of energy demand in which the load surges during some period. This type of outlier can be caused by an unusual event (e.g. extremely cold weather). Utility companies need to anticipate such outliers, so removing them is not a good idea. Modelling these observations is key to balancing the supply and demand of energy and preventing power outages.


Detecting and Dealing with Outliers

There are several approaches for detecting outliers in time series data. Many of them fall into one of two categories: prediction-based or estimation-based.

Prediction-based detection

Detecting outliers based on predictions involves using a forecasting model. The goal is to compare the forecasts with the actual values. A large discrepancy between the two indicates that the observation is an outlier.

Let’s see how this works in practice using the following time series:

from datasetsforecast.m4 import M4

# load the hourly subset of the M4 dataset
dataset, *_ = M4.load('./data', 'Hourly')

# select the time series with id H1
series = dataset.query('unique_id == "H1"').reset_index(drop=True)

In the preceding code, we load the time series with id H1 from the M4 dataset. Next, we build a seasonal naive forecasting model with statsforecast:

from statsforecast import StatsForecast
from statsforecast.models import SeasonalNaive

# seasonal naive model 
model = [SeasonalNaive(season_length=24)]

# creating a statsforecast instance
sf = StatsForecast(df=series, models=model, freq='H')
# fitting the forecasting model
sf.forecast(h=1, level=[99], fitted=True)

# getting insample predictions
preds = sf.forecast_fitted_values()

After fitting the model, we get prediction intervals for the training samples using the forecast_fitted_values method. Then, we compare these with the actual values:

# outliers based on prediction intervals
outliers = preds.loc[(preds['y'] >= preds['SeasonalNaive-hi-99']) | (preds['y'] <= preds['SeasonalNaive-lo-99'])]

Any observation falling outside the 99% prediction interval is considered an outlier.

Here’s a plot of the outliers:

Outliers detected by a seasonal naive model. Image by author.

You can also use the forecast errors directly instead of the interval. In that case, outliers are observations where the error is unusually large.
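As a sketch of this error-based variant, here is one way to flag unusually large errors using a median absolute deviation (MAD) threshold. The synthetic errors and the multiplier k below are assumptions for illustration, not part of the original example:

```python
import numpy as np
import pandas as pd

# hypothetical in-sample forecast errors (actual minus predicted)
rng = np.random.default_rng(1)
errors = pd.Series(rng.normal(0, 1, 200))
errors.iloc[[20, 120]] = [8.0, -9.0]  # injected anomalies

# robust threshold: flag errors more than k MADs from the median
k = 5
mad = (errors - errors.median()).abs().median()
is_outlier = (errors - errors.median()).abs() > k * mad
```

Using the median and MAD instead of the mean and standard deviation keeps the threshold itself from being distorted by the outliers it is meant to detect.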

Estimation-based detection

Estimation-based approaches use summary statistics to detect outliers. One example is the z-score. The idea is to standardize the data by subtracting the mean and dividing by the standard deviation. Then, outliers are points with a large absolute z-score.

Here’s an example:

# flag values more than 3 standard deviations from the rolling mean
thresh = 3

# rolling statistics over a centered 24-hour window
rolling_series = series['y'].rolling(window=24, min_periods=1, center=True)
avg = rolling_series.mean()
std = rolling_series.std(ddof=0)

# standardize and flag observations outside the threshold
zscore = series['y'].sub(avg).div(std)
is_outlier = ~zscore.between(-thresh, thresh)

Note that the average and standard deviation are computed using a rolling window to account for the temporal dependency in time series.

Another approach is to use time series decomposition methods and detect outliers on the residuals. Let’s start by getting the residuals using STL:

import pandas as pd
from statsmodels.tsa.seasonal import STL

# robust STL decomposition of the hourly series
stl = STL(series['y'].values, period=24, robust=True).fit()
resid = pd.Series(stl.resid)

Note that we pass the argument robust=True to STL so the model tolerates larger errors.

Then, you can use a standard boxplot rule to detect outliers: for example, marking observations more than 3 times the IQR below the first quartile or above the third quartile as anomalies. Here’s how to do that:

import numpy as np

# boxplot rule: flag residuals beyond 3 times the IQR from the quartiles
q1, q3 = resid.quantile([.25, .75])
iqr = q3 - q1

is_outlier_r = ~resid.between(q1 - 3 * iqr, q3 + 3 * iqr)
is_outlier_r_idx = np.where(is_outlier_r)[0]

# assemble the residuals into a dataframe for plotting
resid_df = resid.reset_index()
resid_df['index'] = pd.date_range(end='2021-12-01', periods=series.shape[0], freq='H')
resid_df.columns = ['index', 'Residual']

These outliers are also evident in the series of residuals:

Detecting outliers using the boxplot rule in the residuals. Image by author.

Replacing outliers

After detection, you can clean outliers by replacing them with more sensible values.

You first remove the outliers, which turns the problem into a data imputation task.

As you learned in a previous article, there are many approaches for time series imputation. These include:

  • forward or backward filling
  • moving averages
  • linear interpolation
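As an example, here is a minimal sketch of outlier replacement via linear interpolation. The series and the outlier mask below are hypothetical; in practice the mask would come from one of the detection methods above:

```python
import pandas as pd

# hypothetical series with one additive outlier at position 3
y = pd.Series([10.0, 11.0, 10.5, 90.0, 11.5, 10.8])
is_outlier = pd.Series([False, False, False, True, False, False])

# set outliers to NaN, then impute them by linear interpolation
y_clean = y.mask(is_outlier).interpolate(method='linear')
```

The same pattern works with forward filling (`ffill`) or a rolling mean in place of the interpolation step.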

Key Takeaways

  • Time series outliers are observations that deviate significantly from historical data
  • Outliers can exhibit different characteristics in terms of persistence and meaning
  • There are several approaches for outlier detection, including prediction-based and estimation-based methods
  • You can replace unwanted outliers using data imputation techniques
