Outlier Detection with RNN Autoencoders

Utilising a reconstruction autoencoder model to detect anomalies in time series data.

David Woroniuk
Towards Data Science

Image generated by Author.

TL;DR: Historic-Crypto Package, Code.

What are Anomalies?

Anomalies, often referred to as outliers, are data points, data sequences or patterns which do not conform to the overarching behaviour of the data series. As such, anomaly detection is the task of identifying data points or sequences which don’t conform to patterns present in the broader data.

The effective detection and removal of anomalous data can provide highly useful insights across a number of business functions, such as detecting broken links embedded within a website, spikes in internet traffic, or dramatic changes in stock prices. Flagging these phenomena as outliers, or enacting a pre-planned response can save businesses both time and money.

Types of Anomalies

Anomalous data can typically be separated into three distinct categories: Additive Outliers, Temporal Changes and Level Shifts.

Additive Outliers are characterised by sudden large increases or decreases in value, which can be driven by exogenous or endogenous factors. Examples of additive outliers could be a large increase in website traffic due to an appearance on television (exogenous), or a short-term increase in stock trading volume due to strong quarterly performance (endogenous).

Temporal Changes are characterised by a short sequence which doesn’t conform to the broader trend in the data. For example, if a website server crashes, the volume of website traffic will drop to zero for a sequence of datapoints, until the server is rebooted, at which point normal traffic will return.

Level Shifts are a common phenomenon in commodity markets, as high demand for electricity is inherently linked to inclement weather conditions. As such, a ‘level shift’ can be observed between the price of electricity in summer and winter, owing to weather-driven changes in demand profiles and renewable energy generation profiles.

What is an Autoencoder?

Autoencoders are neural networks designed to learn a low-dimensional representation of a given input. Autoencoders typically consist of two components: an encoder which learns to map input data to a lower dimensional representation and a decoder, which learns to map the representation back to the input data.

Due to this architecture, the encoder network iteratively learns an efficient data compression function, which maps the data to a lower dimensional representation. Following training, the decoder is able to successfully reconstruct the original input data, as the reconstruction error (difference between input and reconstructed output produced by the decoder) is the objective function throughout the training process.

Implementation

Now that we understand the underlying architecture of an Autoencoder model, we can begin to implement the model.

The first step is to install the libraries, packages and modules which we shall use:
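A minimal sketch of this step, assuming the Historic-Crypto, TensorFlow, Plotly, pandas and NumPy packages are the only dependencies (the exact package list used by the original gist is not reproduced here):

# Install the required packages (run once, e.g. in a notebook cell).
!pip install Historic-Crypto tensorflow plotly pandas numpy

# Core imports used throughout the remainder of the article.
import numpy as np
import pandas as pd
import plotly.graph_objects as go
import tensorflow as tf
from tensorflow import keras
from Historic_Crypto import HistoricalData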

Secondly, we need to obtain some data to analyse. This article uses the Historic-Crypto package to obtain historical Bitcoin (‘BTC’) data from ‘2013–06–06’ to the present day. The code below also generates the daily Bitcoin returns and intraday price volatility, prior to removing any rows of missing data and returning the first 5 rows of the DataFrame.
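A sketch of this step, assuming the Historic-Crypto HistoricalData interface with a daily (86,400 second) granularity and a simple high-minus-low volatility proxy (the proxy and the column renaming are assumptions, not necessarily the author's exact definitions):

from Historic_Crypto import HistoricalData

# Download daily BTC-USD candles from 2013-06-06 onwards.
df = HistoricalData('BTC-USD', 86400, '2013-06-06-00-00').retrieve_data()

# The returned OHLCV columns are assumed to be lowercase; capitalise them for readability.
df = df.rename(columns=str.capitalize)

# Daily returns and a simple intraday volatility proxy (high minus low, relative to the close).
df['Return'] = df['Close'].pct_change()
df['Volatility'] = (df['High'] - df['Low']) / df['Close']

# Drop rows with missing values and inspect the first 5 rows.
df = df.dropna()
df.head()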

Now that we have obtained some data, we should visually scan each series for potential outliers. The plot_dates_values function below enables the iterative plotting of each series contained within the DataFrame.
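A sketch of such a function, assuming Plotly is used for charting and that the function accepts the dates and the series to plot as arguments (the signature is an assumption):

import plotly.graph_objects as go

def plot_dates_values(dates, values):
    # Plot a single series against its dates, with a range slider for zooming.
    fig = go.Figure()
    fig.add_trace(go.Scatter(x=dates, y=values, mode='lines', name=str(values.name)))
    fig.update_layout(title=str(values.name), xaxis_rangeslider_visible=True)
    fig.show()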

We can now iteratively call the above function, generating Plotly charts for the Volume, Close, Open, Volatility and Return profiles of Bitcoin.
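Assuming the signature sketched above, the iterative calls could simply be:

for column in ['Volume', 'Close', 'Open', 'Volatility', 'Return']:
    plot_dates_values(df.index, df[column])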

Image generated by Author.

Notably, a number of spikes in trading volume occur in 2020; it may be useful to investigate whether these spikes are anomalous or representative of the broader series.

Image generated by Author.

A pronounced spike exists within the closing price in 2018, followed by a crash to a technical support level. However, a positive trend broadly exists throughout the data.

Image generated by Author.

The daily opening price follows a similar pattern to that of the closing price above.

Image generated by Author.

Price volatility displays a number of pronounced spikes in both 2018 and 2020. As such, we could investigate whether these volatility spikes are considered anomalous by an Autoencoder model.

Image generated by Author.

Due to the stochastic nature of the Returns series, we have elected to test for outliers within the daily traded volume of Bitcoin, as characterised by Volume.

As such, we can begin data preprocessing for the Autoencoder model. The first step in data preprocessing is to determine an appropriate split between the training data and the testing data. The generate_train_test_split function outlined below enables the splitting of training and testing data by date. Upon calling the function, two DataFrames, namely training_data and testing_data, are generated as global variables.
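A sketch of such a function, with the split dates shown purely as an illustration (any dates within the sample could be chosen):

def generate_train_test_split(data, train_end, test_start):
    # Split the DataFrame into training and testing sets by date,
    # exposing the two resulting DataFrames as globals, as described above.
    global training_data, testing_data
    training_data = data.loc[:train_end]
    testing_data = data.loc[test_start:]

generate_train_test_split(df, '2018-12-31', '2019-01-01')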

To improve model accuracy, we can ‘normalise’ or scale the data. This function scales the training_data DataFrame generated above, saving the training_mean and training_std for standardising the testing data later.
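A minimal sketch, again exposing the results as globals as described above, and applying the function to the Volume series selected earlier:

def normalise_training_values(data):
    # Scale the training series to zero mean and unit variance,
    # storing the mean and standard deviation for re-use on the test set.
    global training_mean, training_std
    training_mean = data.mean()
    training_std = data.std()
    return ((data - training_mean) / training_std).to_numpy()

training_values = normalise_training_values(training_data['Volume'])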

Note: It is important to scale the training and testing data using the same scale, otherwise the difference in scale will generate interpretability issues and model inconsistencies.

As we have called the normalise_training_values function above, we now have a numpy array containing normalised training data called training_values, and we have stored training_mean and training_std as global variables to be used in standardising the test set.

We can now begin to generate a series of sequences which can be used to train the Autoencoder model. We specify that the model shall be provided with the 30 previous observations, yielding 3D training data of shape (2004, 30, 1):
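One common way to build such sequences (the generate_sequences helper name is illustrative):

import numpy as np

TIME_STEPS = 30

def generate_sequences(values, time_steps=TIME_STEPS):
    # Stack overlapping windows of `time_steps` observations into a
    # 3D array of shape (num_sequences, time_steps, num_features).
    sequences = [values[i:i + time_steps] for i in range(len(values) - time_steps + 1)]
    return np.expand_dims(np.stack(sequences), axis=-1)

x_train = generate_sequences(training_values)
print(x_train.shape)   # (2004, 30, 1) for the split used here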

Now that we have completed the training data processing, we can define the Autoencoder model, then fit the model on the training data. The define_model function utilises the training data shape to define an appropriate model, returning both the Autoencoder model, and a summary of the Autoencoder model.
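A plausible sketch of an LSTM-based reconstruction autoencoder; the layer sizes and the RepeatVector encoder-decoder layout are assumptions rather than the author's exact configuration:

from tensorflow import keras

def define_model(x_train):
    # Sequence-to-sequence autoencoder: the LSTM encoder compresses each
    # 30-step window into a single vector, RepeatVector re-expands it, and
    # the LSTM decoder reconstructs the original window step by step.
    inputs = keras.Input(shape=(x_train.shape[1], x_train.shape[2]))
    encoded = keras.layers.LSTM(64)(inputs)
    repeated = keras.layers.RepeatVector(x_train.shape[1])(encoded)
    decoded = keras.layers.LSTM(64, return_sequences=True)(repeated)
    outputs = keras.layers.TimeDistributed(keras.layers.Dense(x_train.shape[2]))(decoded)

    model = keras.Model(inputs, outputs)
    model.compile(optimizer='adam', loss='mae')
    model.summary()    # Print the model summary described in the text.
    return model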

Subsequently, the model_fit function calls the define_model function internally, then provides epochs, batch_size and validation loss parameters to the model. This function is then called, beginning the model training process.
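A sketch of this wrapper; the epoch count, batch size, validation split and the early-stopping rule monitoring validation loss are all assumptions:

from tensorflow import keras

def model_fit():
    # Build the model, then fit it on the training sequences. A reconstruction
    # autoencoder learns to map each input window back onto itself, so the
    # inputs also serve as the targets.
    global model, history
    model = define_model(x_train)
    history = model.fit(
        x_train, x_train,
        epochs=100,
        batch_size=32,
        validation_split=0.1,
        callbacks=[keras.callbacks.EarlyStopping(monitor='val_loss', patience=10, mode='min')],
    )
    return model, history

model, history = model_fit()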

Once the model has been trained, it is important to plot the training and validation loss curves to understand if the model suffers from bias (underfitting) or variance (overfitting). This can be observed through calling the below plot_training_validation_loss function.
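A sketch of this plotting function, reading the loss curves from the Keras history object returned above:

import plotly.graph_objects as go

def plot_training_validation_loss():
    # Compare the training loss with the validation loss across epochs.
    fig = go.Figure()
    fig.add_trace(go.Scatter(y=history.history['loss'], mode='lines', name='Training Loss'))
    fig.add_trace(go.Scatter(y=history.history['val_loss'], mode='lines', name='Validation Loss'))
    fig.update_layout(title='Training and Validation Loss', xaxis_title='Epoch', yaxis_title='MAE Loss')
    fig.show()

plot_training_validation_loss()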

Image generated by Author.

Notably, both training and validation loss curves are converging throughout the chart, with the validation loss remaining slightly larger than the training loss. Given both the shape and relative errors, we can determine that the Autoencoder model does not suffer from underfitting or overfitting.

Now, we can define the reconstruction error, one of the core principles of the Autoencoder model. The reconstruction error is denoted as train_mae_loss, with the reconstruction error threshold determined as the maximal value of train_mae_loss. Consequently, when the test error is calculated, any value greater than the maximal value of train_mae_loss can be considered as an outlier.
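Concretely, this step could be written as follows; the variable names follow the text, the choice of the maximum training error as the threshold is as described above, and the histogram corresponds to the chart shown below:

import numpy as np
import plotly.graph_objects as go

# Reconstruct the training sequences and measure the mean absolute error of each one.
x_train_pred = model.predict(x_train)
train_mae_loss = np.mean(np.abs(x_train_pred - x_train), axis=1).reshape(-1)

# Any test reconstruction error above this value will be treated as an outlier.
threshold = np.max(train_mae_loss)
print('Reconstruction error threshold:', threshold)

# Distribution of the training MAE loss.
fig = go.Figure()
fig.add_trace(go.Histogram(x=train_mae_loss, name='Train MAE Loss'))
fig.update_layout(title='Distribution of Training MAE Loss', xaxis_title='MAE Loss', yaxis_title='Count')
fig.show()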

Image generated by Author.

Above, we saved the training_mean and training_std as global variables in order to use them for scaling test data. We now define the normalise_testing_values function for scaling the testing data.
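A sketch mirroring the training-set scaling above, including the call on the Volume column of testing_data described next:

def normalise_testing_values(data, training_mean, training_std):
    # Scale the test series with the mean and standard deviation of the
    # training data, so that both sets share the same scale.
    return ((data - training_mean) / training_std).to_numpy()

test_value = normalise_testing_values(testing_data['Volume'], training_mean, training_std)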

Subsequently, this function is called on the Volume column of testing_data. As such, the test_value is materialised as a numpy array.

Following this, the generate_testing_loss function is defined, calculating the difference between the reconstructed data and the testing data. If any values are greater than the maximal value of train_mae_loss, they are stored within the global anomalies list.

Additionally, a distribution of the testing MAE loss is presented, for direct comparison with the training MAE loss.
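A sketch of this function, reusing the sequence helper, trained model and threshold defined above; the histogram at the end corresponds to the distribution mentioned in the text:

import numpy as np
import plotly.graph_objects as go

def generate_testing_loss(test_value):
    # Build test sequences, reconstruct them, and flag any sequence whose
    # reconstruction error exceeds the training threshold as an anomaly.
    global anomalies, test_mae_loss
    x_test = generate_sequences(test_value)
    x_test_pred = model.predict(x_test)
    test_mae_loss = np.mean(np.abs(x_test_pred - x_test), axis=1).reshape(-1)

    anomalies = (test_mae_loss > threshold).tolist()
    print('Number of anomalous sequences:', sum(anomalies))

    # Distribution of the testing MAE loss, for comparison with the training MAE loss.
    fig = go.Figure()
    fig.add_trace(go.Histogram(x=test_mae_loss, name='Test MAE Loss'))
    fig.update_layout(title='Distribution of Testing MAE Loss', xaxis_title='MAE Loss', yaxis_title='Count')
    fig.show()

generate_testing_loss(test_value)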

Image generated by Author.

Finally, the outliers are visually represented below.

The outlying data points, as characterised by the Autoencoder model, are presented in orange, whilst conforming data are presented in blue.
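One way of producing such a chart; mapping each flagged sequence to its final observation is an assumption, and other sequence-to-point mappings are possible:

import plotly.graph_objects as go

# A data point is marked as anomalous if the 30-step sequence ending at that point was flagged.
anomalous_indices = [i + TIME_STEPS - 1 for i, is_anomaly in enumerate(anomalies) if is_anomaly]
anomalous_points = testing_data['Volume'].iloc[anomalous_indices]

fig = go.Figure()
fig.add_trace(go.Scatter(x=testing_data.index, y=testing_data['Volume'],
                         mode='lines', name='Volume', line=dict(color='blue')))
fig.add_trace(go.Scatter(x=anomalous_points.index, y=anomalous_points,
                         mode='markers', name='Anomaly', marker=dict(color='orange')))
fig.update_layout(title='Bitcoin Daily Traded Volume: Detected Outliers')
fig.show()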

Image generated by Author.

We can see that a large proportion of Bitcoin Volume data in 2020 was considered to be anomalous — maybe due to increased retail trading activity driven by Covid-19?

Experiment with the Autoencoder parameters and the dataset: see if you can find any anomalies within Bitcoin’s closing price, or alternatively use the Historic-Crypto library to download different cryptocurrencies!

Happy Coding!
