Explainable AI (XAI) design for unsupervised deep anomaly detector

Interpretable prototype for detecting Out-of-Distribution Samples and Adversarial Attacks

Ajay Arunachalam
Towards Data Science


An interpretable prototype of unsupervised deep convolutional neural network and LSTM autoencoder based real-time anomaly detection for high-dimensional heterogeneous/homogeneous time-series multi-sensor data

Hello, friends. In this blog post, I will take you through the new features of the package “msda”. More details can be found on the GitHub page here

What’s new in MSDA v1.10.0?

MSDA is an open source, low-code Multi-Sensor Data Analysis library in Python that aims to reduce the hypothesis-to-insights cycle time in time-series multi-sensor data analysis and experiments. It enables users to perform end-to-end proof-of-concept experiments quickly and efficiently. The module identifies events in multidimensional time series by capturing variation and trend, and establishes relationships that help identify correlated features for feature selection from raw sensor signals. It also provides a way to detect anomalies in real-time streaming data: an unsupervised deep convolutional neural network detector and an LSTM autoencoder based detector are included, both designed to run on GPU or CPU. Finally, a game theoretic approach is used to explain the output of the built anomaly detector model.

The package includes:

  1. Time series analysis.
  2. The variation of each sensor column with respect to time (increasing, decreasing, equal).
  3. How the values of each column vary with respect to the other columns, and the maximum variation ratio between each pair of columns.
  4. Relationship establishment with a trend array to identify the most appropriate sensor.
  5. The user can select a window length and then check the average value and standard deviation across each window for each sensor column.
  6. A count of growth/decay values for each sensor column above or below a threshold value.
  7. Feature engineering: a) features involving the trend of values across various aggregation windows: change and rate of change in average and standard deviation across windows; b) ratio of changes, growth rate with standard deviation; c) change over time; d) rate of change over time; e) growth or decay; f) rate of growth or decay; g) count of values above or below a threshold value.
  8. ** Unsupervised deep time-series anomaly detector. **
  9. ** Game theoretic approach to explain the time-series data model. **

MSDA is simple, easy to use and low-code.

Who should use MSDA?

MSDA is an open source library that anybody can use. In our view, the ideal target audience of MSDA is:

  • Researchers for quick PoC testing.
  • Experienced Data Scientists who want to increase productivity.
  • Citizen Data Scientists who prefer a low code solution.
  • Students of Data Science.
  • Data Science Professionals and Consultants involved in building Proof of Concept projects.

ANOMALY Revisited

What is an anomaly, and why should it be of any concern? In layman's terms, “anomalies” or “outliers” are the data points in a data space that are abnormal or out of trend. Anomaly detection focuses on identifying examples in the data that somehow deviate from what is expected or typical. Now, the question is, “How do you define something as abnormal or an outlier?” The quick, rational answer is: all those points that don’t follow the trend of the neighboring points in the sample space.

For any business domain, detecting suspicious patterns in a huge set of data is very critical. For example, in the banking domain, fraudulent transactions pose a serious threat and cause losses and liabilities for the bank. In this blog, we will learn about detecting anomalies from data without training the model on labeled anomalies beforehand, because you can’t train a model on data that you don’t know about! That’s where the whole idea of unsupervised learning helps. We will see two network architectures for building a real-time anomaly detector: a) Deep CNN and b) LSTM AutoEncoder.

These networks are suited to detecting a wide range of anomalies, i.e., point anomalies, contextual anomalies, and discords in time series data. Since the approach is unsupervised, it requires no labels for anomalies. We use the unlabeled data to capture and learn the data distribution that is used to forecast the normal behavior of a time series. The first architecture is inspired by the IEEE paper DeepAnT; it consists of two components: a time series predictor and an anomaly detector. The time series predictor uses a deep convolutional neural network (CNN) to predict the next time stamp on the defined horizon. This component takes a window of the time series (used as a reference context) and attempts to predict the next time stamp. The predicted value is then passed to the anomaly detector component, which is responsible for labeling the corresponding time stamp as Non-Anomaly or Anomaly.

The second architecture is inspired by this Nature paper: Deep LSTM-based Stacked Autoencoder for Multivariate Time Series.

Let us first understand what an AUTOENCODER neural network is. The autoencoder architecture is used to learn an efficient data representation in an unsupervised manner. There are three components to an autoencoder: an encoding (input) portion that compresses the data and, in the process, learns a representation (encoding) of the data; a component that handles the compressed data (size reduction); and a decoder (output) portion that reconstructs the input as closely as possible from the compressed representation while minimizing the overall loss function. Simply put, when data is fed into an autoencoder, it is encoded and compressed down to a smaller size, and that smaller representation is then decoded back to the original input.

Next, let us understand why LSTM is appropriate here. What is LSTM? Long short-term memory (LSTM) is a neural network architecture capable of learning order dependencies in sequence prediction problems. An LSTM network is a type of recurrent neural network (RNN). A plain RNN mainly suffers from vanishing gradients. Gradients carry the information used to update the network, and if they vanish over time, important localized information is lost. This is where LSTM comes in handy, as it helps remember the cell states, preserving the information. The basic idea is that the LSTM network has multiple “gates” inside of it with trained parameters. Some of these gates control the module’s “output” and other gates control “forgetting.” LSTM networks are a good fit for classifying, processing, and making predictions based on time series data, since there can be lags of unknown duration between important events in a time series.

An LSTM Autoencoder is an implementation of an autoencoder for sequence data using an Encoder-Decoder LSTM network architecture.

Now that we have seen the basic concepts of each network, let us go through the design of both our networks, as shown below. The DeepCNN consists of two convolutional layers. Typically, a CNN consists of a sequence of layers that includes convolutional layers, pooling layers, and fully connected layers. Each convolutional layer normally has two stages. In the first stage, the layer performs the mathematical operation called convolution, which results in linear activations. In the second stage, a non-linear activation function is applied to each linear activation. Like other neural networks, the CNN uses training data to adapt its parameters (weights and biases) to perform the learning task. The parameters of the network are optimized using the ADAM optimizer. The kernel size and number of filters can be tuned further to perform better depending on the dataset. Further, the dropout, learning rate, etc. can be fine-tuned to validate the performance of the network. The loss function used is MSELoss (squared L2 norm), which measures the mean squared error between each element in the input ‘x’ and target ‘y’.

The LSTMAENN consists of multiple stacked LSTM layers, configured with input_size (the number of expected features in the input x), hidden_size (the number of features in the hidden state h), num_layers (the number of recurrent layers, default 1), etc. For more details refer here. To reduce the chance of interpreting noise in the data as anomalies, we can tune additional hyper-parameters like ‘lookback’ (time series window size), the units in the hidden layers, and many more.

Unsupervised Deep Anomaly Models

DeepCNN(
(conv1d_1_layer): Conv1d(10, 16, kernel_size=(3,), stride=(1,))
(relu_1_layer): ReLU()
(maxpooling_1_layer): MaxPool1d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(conv1d_2_layer): Conv1d(16, 16, kernel_size=(3,), stride=(1,))
(relu_2_layer): ReLU()
(maxpooling_2_layer): MaxPool1d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(flatten_layer): Flatten()
(dense_1_layer): Linear(in_features=80, out_features=40, bias=True)
(relu_3_layer): ReLU()
(dropout_layer): Dropout(p=0.25, inplace=False)
(dense_2_layer): Linear(in_features=40, out_features=26, bias=True)
)
LSTMAENN(
(lstm_1_layer): LSTM(26, 128)
(dropout_1_layer): Dropout(p=0.2, inplace=False)
(lstm_2_layer): LSTM(128, 64)
(dropout_2_layer): Dropout(p=0.2, inplace=False)
(lstm_3_layer): LSTM(64, 64)
(dropout_3_layer): Dropout(p=0.2, inplace=False)
(lstm_4_layer): LSTM(64, 128)
(dropout_4_layer): Dropout(p=0.2, inplace=False)
(linear_layer): Linear(in_features=128, out_features=26, bias=True)
)
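
The layer sizes in the DeepCNN summary above can be checked with a little arithmetic. The sketch below is not part of msda. It assumes the input window is laid out as [batch, lookback=10, features=26], so that the lookback acts as the channel dimension of the first Conv1d, and it uses the standard output-length formula for unpadded 1-D convolution and pooling. The helper names are hypothetical.

```python
def conv1d_out_len(length, kernel_size, stride=1, padding=0, dilation=1):
    """Output length of a 1-D convolution (standard formula, floor division)."""
    return (length + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


def maxpool1d_out_len(length, kernel_size, stride=None, padding=0, dilation=1):
    """Output length of 1-D max pooling with ceil_mode=False."""
    stride = kernel_size if stride is None else stride
    return (length + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


def deepcnn_flatten_size(num_features=26, kernel_size=3, out_channels=16):
    """Follow the feature axis through conv -> pool -> conv -> pool -> flatten."""
    length = num_features
    length = conv1d_out_len(length, kernel_size)   # 26 -> 24
    length = maxpool1d_out_len(length, 2)          # 24 -> 12
    length = conv1d_out_len(length, kernel_size)   # 12 -> 10
    length = maxpool1d_out_len(length, 2)          # 10 -> 5
    return out_channels * length                   # 16 channels * 5 = 80


print(deepcnn_flatten_size())  # 80
```

The result, 80, matches in_features=80 of dense_1_layer, which is why changing the kernel size or the number of features also changes the size of the first dense layer.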

Now that we have designed the network architectures, we will go through the remaining steps with a hands-on demonstration, as given below.

Getting Started

1) Install the package

The easiest way to install msda is using pip.

pip install msda
OR
$ git clone https://github.com/ajayarunachalam/msda
$ cd msda
$ python setup.py install

Notebook

!pip install msda

2) Import time-series data

Here, we will use the climate data from here. This dataset is compiled from several public sources. It consists of daily temperatures and precipitation from 13 Canadian centres. Precipitation is either rain or snow (likely snow in winter months). In 1940, there is daily data for seven out of the 13 centres, but by 1960 there is daily data from all 13 centres, with the occasional missing value. We have around 80 years of records (at daily frequency), and we want to identify the anomalies in that climate data. As seen below, this data has 27 columns (one timestamp column plus 26 numerical features) and around 30K records.

import pandas as pd

df = pd.read_csv('Canadian_climate_history.csv')
df.shape
=============
(29221, 27)

3) Data validation, pre-processing, etc.

We start by checking for missing values and imputing them.

The functions missing() and impute() from the Preprocessing and ExploratoryDataAnalysis classes can be used to find missing values and fill in the missing information. We replace the missing values with the mean values (hence, modes=1). There are several utility functions within these classes that can be used for profiling your dataset, manual filtering of outliers, etc. Other options include datetime conversions, descriptive statistics of the data, normality distribution tests, etc. For more details peek here

'''
Impute missing values with impute function (modes=0,1, 2, else use backfill)
0: impute with zero, 1: impute with mean, 2: impute with median, else impute with backfill method
'''
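# Assumption: impute() returns the imputed DataFrame, which the next step refers to as df_no_na
df_no_na = ExploratoryDataAnalysis.impute(df=df, modes=1)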
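
To make the four imputation modes concrete, here is a minimal pure-Python sketch of what each one does to a single column. It is an illustration only, not msda's implementation: the function name impute_column and the use of None for a missing value are assumptions made for this example.

```python
from statistics import mean, median


def impute_column(values, mode=1):
    """Fill None entries in one column.

    mode 0: zero, 1: mean, 2: median, anything else: backfill
    (copy the next observed value backwards).
    """
    observed = [v for v in values if v is not None]
    if mode == 0:
        fill = 0.0
    elif mode == 1:
        fill = mean(observed)
    elif mode == 2:
        fill = median(observed)
    else:
        filled = list(values)
        next_seen = None
        # Walk from the end so each gap takes the value that follows it.
        for i in range(len(filled) - 1, -1, -1):
            if filled[i] is None:
                filled[i] = next_seen  # stays None if nothing follows
            else:
                next_seen = filled[i]
        return filled
    return [fill if v is None else v for v in values]


temps = [2.0, None, 4.0, None, 9.0]
print(impute_column(temps, mode=1))   # [2.0, 5.0, 4.0, 5.0, 9.0]
print(impute_column(temps, mode=2))   # [2.0, 4.0, 4.0, 4.0, 9.0]
print(impute_column(temps, mode=3))   # [2.0, 4.0, 4.0, 9.0, 9.0]
```

Mean imputation keeps the column average unchanged, which is convenient here because the data is later scaled to [0,1]. Backfill preserves local trends better but leaves a trailing gap unfilled.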

4) Post-processing data to input into the anomaly detector

Next, we pass in the data with no missing values, remove unwanted fields, assert the timestamp field, etc. Here, the user specifies the column to drop by its index value, and the timestamp field by its index value too. This returns two dataframes: one has all the numerical fields without the timestamp index, while the other has all the numerical fields with the timestamp as index. We need to use the one with the timestamp as index for the further steps.

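# Assumed return order, following the description above: (without timestamp index, with timestamp index)
anamoly_data, anamoly_df = Anamoly.read_data(data=df_no_na, column_index_to_drop=0, timestamp_column_index=0)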

5) Data processing with user-input time window size

The time window size (lookback size) is given as input to the function data_pre_processing from the Anamoly class.

X,Y,timesteps,X_data = Anamoly.data_pre_processing(df=anamoly_df, LOOKBACK_SIZE=10)

With this function, we also normalize the data to the range [0,1] and then modify the dataset by including ‘time-steps’ as an additional dimension. The idea is to convert the two-dimensional data set of shape [Batch Size, Features] into a three-dimensional data set of shape [Batch Size, Lookback Size, Features]. For more details inspect here.
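
The two transformations described above, scaling to [0,1] and adding a lookback dimension, can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the msda code: the helper names are hypothetical, and the target Y is assumed to be the row immediately after each window, in line with the "predict the next time stamp" description of the predictor.

```python
def min_max_scale(rows):
    """Scale every column of a 2-D list to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


def make_windows(rows, lookback):
    """Turn [time, features] into X of shape [samples, lookback, features]
    and Y of shape [samples, features], where Y is the step after each window."""
    X, Y = [], []
    for start in range(len(rows) - lookback):
        X.append(rows[start:start + lookback])
        Y.append(rows[start + lookback])
    return X, Y


series = [[float(t), float(10 * t)] for t in range(6)]   # 6 time steps, 2 features
scaled = min_max_scale(series)
X, Y = make_windows(scaled, lookback=3)
print(len(X), len(X[0]), len(X[0][0]))   # 3 3 2
print(Y[0])                              # [0.6, 0.6]
```

With a lookback of 3, six time steps yield three samples, because each sample needs three rows of context plus one row to predict.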

6) Selecting user-defined configurations to train the anomaly model

Using the set_config() function, the user can select from the deep network architectures, set the time window size, and tune the kernel size. The available models are the Deep Convolutional Neural Network and the LSTM autoencoder, selected with the values [‘deepcnn’, ‘lstmaenn’]. We choose a time-series window size of 10 and a kernel size of 3 for the convolutional network.

MODEL_SELECTED, LOOKBACK_SIZE, KERNEL_SIZE = Anamoly.set_config(MODEL_SELECTED='deepcnn', LOOKBACK_SIZE=10, KERNEL_SIZE=3)
==================
MODEL_SELECTED = deepcnn
LOOKBACK_SIZE = 10
KERNEL_SIZE = 3

7) Training the selected ANOMALY detector model

One can train the model on either GPU or CPU, based on availability. The compute function will use the GPU if available; otherwise, it will use the CPU. Google Colab provides the NVIDIA Tesla K80, one of the most popular GPUs, while the NVIDIA Tesla V100 is the first Tensor Core GPU. The number of epochs for training can be set by the user. The device being used is printed to the console.

Anamoly.compute(X, Y, LOOKBACK_SIZE=10, num_of_numerical_features=26, MODEL_SELECTED=MODEL_SELECTED, KERNEL_SIZE=KERNEL_SIZE, epocs=30)
==================
Training Loss: 0.2189370188678473 - Epoch: 1
Training Loss: 0.18122351250783636 - Epoch: 2
Training Loss: 0.09276176958476466 - Epoch: 3
Training Loss: 0.04396845106961693 - Epoch: 4
Training Loss: 0.03315385463795454 - Epoch: 5
Training Loss: 0.027696743746250377 - Epoch: 6
Training Loss: 0.024318942805264566 - Epoch: 7
Training Loss: 0.021794179179027335 - Epoch: 8
Training Loss: 0.019968783528812286 - Epoch: 9
Training Loss: 0.0185430530715746 - Epoch: 10
Training Loss: 0.01731374272046384 - Epoch: 11
Training Loss: 0.016200231966590112 - Epoch: 12
Training Loss: 0.015432962290901867 - Epoch: 13
Training Loss: 0.014561152689542462 - Epoch: 14
Training Loss: 0.013974714691690522 - Epoch: 15
Training Loss: 0.013378228182289321 - Epoch: 16
Training Loss: 0.012861106097943028 - Epoch: 17
Training Loss: 0.012339938251426095 - Epoch: 18
Training Loss: 0.011948177564954476 - Epoch: 19
Training Loss: 0.011574006228333366 - Epoch: 20
Training Loss: 0.011185694509874397 - Epoch: 21
Training Loss: 0.010946418002639517 - Epoch: 22
Training Loss: 0.010724217305010896 - Epoch: 23
Training Loss: 0.010427865211985524 - Epoch: 24
Training Loss: 0.010206768034701313 - Epoch: 25
Training Loss: 0.009942568653453904 - Epoch: 26
Training Loss: 0.009779498535478721 - Epoch: 27
Training Loss: 0.00969111187656911 - Epoch: 28
Training Loss: 0.009527427295318766 - Epoch: 29
Training Loss: 0.009236675929400544 - Epoch: 30
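
The GPU-or-CPU fallback mentioned in this step is usually a one-line check in PyTorch. The sketch below shows the common pattern. It is an assumption about how such a check is typically written, not a copy of what the compute function does, and pick_device is a hypothetical helper name. The import is guarded so the snippet also runs where PyTorch is not installed.

```python
def pick_device():
    """Return "cuda" when PyTorch can see a GPU, otherwise "cpu"."""
    try:
        import torch  # optional dependency for this sketch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


print(f"Training on: {pick_device()}")
```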

8) Finding Anomalies

Once the training is completed, the next step is to find the anomalies. This brings us back to our fundamental question: how exactly can we estimate and trace what is an anomaly? One can use the Anomaly Score, Anomaly Likelihood, and some recent metrics like the Mahalanobis distance-based confidence score. The Mahalanobis confidence score assumes that the intermediate features of pre-trained neural classifiers follow class conditional Gaussian distributions whose covariances are tied for all distributions, and the confidence score for a new input is defined as the Mahalanobis distance from the closest class conditional distribution. Anomaly Score is the fraction of active columns that were not predicted correctly. In contrast, Anomaly Likelihood is the likelihood that a given anomaly score represents a true anomaly. In any dataset, there will be a natural level of uncertainty that creates a certain “normal” number of errors in prediction. Anomaly likelihood accounts for this natural level of error. Since we don’t have ground truth anomaly labels, we cannot use this metric in our case. The find_anamoly() function is used to detect anomalies by generating the hypothesis and calculating the losses, which serve as the anomaly confidence scores for the individual time stamps in the data set.

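loss_df = Anamoly.find_anamoly(loss=loss, T=timesteps)

# Excerpt of the underlying loss computation (taken from inside a function, hence the bare return)
hypothesis = model(torch.from_numpy(X.astype(np.float32)).to(device)).detach().cpu().numpy()

# Per-timestamp L2 distance between the prediction and the target
loss = np.linalg.norm(hypothesis - Y, axis=1)

return loss.reshape(len(loss),1)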
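
The score computed by np.linalg.norm(..., axis=1) above is simply the Euclidean distance between the predicted and the actual feature vector at each time stamp. A dependency-free sketch of the same calculation, using made-up numbers and a hypothetical function name, looks like this:

```python
import math


def anomaly_scores(predicted, actual):
    """Euclidean (L2) distance between prediction and target for each time step."""
    return [
        math.sqrt(sum((p - a) ** 2 for p, a in zip(pred_row, true_row)))
        for pred_row, true_row in zip(predicted, actual)
    ]


# Three time steps, two features each (illustrative values only).
predicted = [[0.5, 0.5], [0.1, 0.9], [0.0, 0.0]]
actual = [[0.5, 0.5], [0.4, 0.5], [0.6, 0.8]]
print(anomaly_scores(predicted, actual))   # roughly [0.0, 0.5, 1.0]
```

A perfect prediction scores 0, and the score grows with how far the observed values fall from what the model expected, which is why large scores are read as potential anomalies.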

9) Plotting samples with confidence score: DeepCNN example

Next, we need to visualize the anomalies. Each sample is assigned an anomaly confidence score for its timestamp record. The plot_anamoly_results function can be used to plot the anomaly score with respect to frequencies (bins) and the confidence score for every timestamp record.

Anamoly.plot_anamoly_results(loss_df=loss_df)

From the above graphs, one can presume that the timestamps/instances with anomaly confidence scores greater than or equal to 1.2 are likely examples that deviate from what is expected or typical, and thus can be treated as potential anomalies.
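
Once a cut-off has been read from the plots, flagging the candidate anomalies is a simple filter. The sketch below uses the 1.2 threshold from this example with made-up scores. The function name is hypothetical, and the threshold would need to be chosen again for any other dataset or model.

```python
def flag_anomalies(scores, threshold=1.2):
    """Return (index, score) pairs whose score is at or above the threshold."""
    return [(i, s) for i, s in enumerate(scores) if s >= threshold]


scores = [0.31, 0.42, 1.35, 0.28, 1.20, 0.95]   # illustrative values only
print(flag_anomalies(scores))   # [(2, 1.35), (4, 1.2)]
```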

10) Interpretable results of predictions from the anomaly detector: DeepCNN

Finally, a prototype of Explainable AI for the built time-series predictor is designed. Before we go through this step, let us understand why interpretable/explainable models are needed.

Why is Explainable AI (XAI) the buzz and the need of the hour?

Data is everywhere, and machine learning can mine it for information. Representation learning becomes more valuable and significant if the results generated by machine learning models can also be easily understood, interpreted, and trusted by humans. That is where Explainable AI comes in, so that the model is no longer a black box.

The explainable_results() function uses a game theoretic approach to explain the output of the model. To understand, interpret, and trust the results of the deep models at the individual sample level, we use the Kernel Explainer. One of the fundamental properties of Shapley values is that they always sum up to the difference between the game outcome when all players are present and the game outcome when no players are present. For machine learning models, this means that the SHAP values of all the input features will always sum up to the difference between the baseline (expected) model output and the current model output for the prediction being explained. The explainable_results function takes the input value for the specific row/instance/sample prediction that is to be interpreted. It also takes the number of input features (X) and the time-series window size difference (Y). We can get the explainable results at the individual instance level, and also for a batch of data (for example, the first 200 rows, the last 50 samples, etc.).

Anamoly.explainable_results(X=anamoly_data, Y=Y, specific_prediction_sample_to_explain=10, input_label_index_value=16, num_labels=26)

The above graph is the result for the 10th example/sample/record/instance. It can be seen that the features that contributed most to the resulting anomaly confidence score were the temperature readings from the weather stations of Vancouver, Toronto, Saskatoon, Winnipeg, and Calgary.
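
The additivity property of Shapley values described in this section can be verified on a toy example. The sketch below computes exact Shapley values by brute force over all feature orderings, which is only feasible for a handful of features. The Kernel Explainer approximates these values by sampling, so this is not how the explainer itself works. The toy model, its numbers, and the feature names are hypothetical.

```python
from itertools import permutations


def shapley_values(value_fn, players):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value_fn(with_p) - value_fn(coalition)
            coalition = with_p
    return {p: t / len(orderings) for p, t in totals.items()}


def toy_model(present):
    """Toy 'model output' as a function of which features are present."""
    out = 0.2                                   # baseline output with no features
    if "temp_vancouver" in present:
        out += 0.5
    if "temp_toronto" in present:
        out += 0.3
    if {"temp_vancouver", "temp_toronto"} <= present:
        out += 0.1                              # interaction between the two
    return out


features = ["temp_vancouver", "temp_toronto", "precip_calgary"]
phi = shapley_values(toy_model, features)
print(phi)

full = toy_model(frozenset(features))
base = toy_model(frozenset())
# The contributions add up to (output with all features) - (baseline output).
print(abs(sum(phi.values()) - (full - base)) < 1e-9)   # True
```

The interaction bonus of 0.1 is split evenly between the two temperature features, and the feature that never changes the output receives a contribution of zero.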

Important Links

Complete code is made available here. Refer to this notebook.

CONTACT

You can reach me at ajay.arunachalam08@gmail.com

Thanks for reading. Always, keep learning :)

References

https://en.wikipedia.org/wiki/Convolutional_neural_network

https://machinelearningmastery.com/stacked-long-short-term-memory-networks/

https://arxiv.org/abs/2003.00402

https://www.nature.com/articles/s41551-018-0304-0
