
How to Understand a Deep Time Series Classifier with Integrated Gradients


Image by author: Two instances of the motor data

Time series classification is a common problem that appears in many domains and contexts, e.g. given the time series of a financial product, predicting whether a customer will make a purchase or not.

However, because of the temporal dimension, it is not always easy to understand why a classifier makes a decision: it is not a single point on the timeline that determines the label but the whole timeline. But do all moments count? Do they have the same importance? Is today more important than yesterday, or is Tuesday more important than Sunday for the prediction?

In this article, I present my approach to explaining a time series classifier. The article consists of two parts:

  1. In the first part, I will build a CNN model for a classic time series classification problem.
  2. In the second and main part of the article, I will use Integrated Gradients (IG) to explain the model’s predictions.

Please check the notebook here for all my code.

CNN model as time series classifier

The model I built is adapted from an example in this tutorial.

The dataset

The dataset consists of motor data in which each instance is a time series measuring engine noise captured by a motor sensor. The goal of the task is to predict whether the engine has a specific issue. For more details on the dataset, you can have a look at this paper.
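For readers who want to follow along, here is a minimal loading sketch. I am assuming the data is the FordA dataset from the UCR archive, which matches the train/test sizes quoted below; the URL is the mirror commonly used in Keras examples:

import numpy as np

def readucr(filename):
    # Each row: label in the first column, the time series in the remaining columns
    data = np.loadtxt(filename, delimiter="\t")
    y = data[:, 0]
    x = data[:, 1:]
    return x, y.astype(int)

root_url = "https://raw.githubusercontent.com/hfawaz/cd-diagram/master/FordA/"
x_train, y_train = readucr(root_url + "FordA_TRAIN.tsv")
x_test, y_test = readucr(root_url + "FordA_TEST.tsv")

# Reshape to (samples, timesteps, 1) for the 1D CNN and map labels {-1, 1} -> {0, 1}
x_train = x_train.reshape((x_train.shape[0], x_train.shape[1], 1))
x_test = x_test.reshape((x_test.shape[0], x_test.shape[1], 1))
y_train[y_train == -1] = 0
y_test[y_test == -1] = 0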

The figure at the very beginning of this article shows two instances of this time series: class 0 means the absence of the issue, while class 1 means its presence. In addition, consider the figure below: all three curves belong to the same class 0 (the class without the issue), so you can see that it is not straightforward to tell why a time series receives a given label:

Image by author: Three time series of the same class 0

CNN Model

An important difference between time series and other, ordinary features is that the points on the timeline are not independent with respect to the prediction: a point at a given moment carries not only information about that moment but also part of the information from the past. Classically, this calls for feature extraction on the time series (mean, std, max, etc. within an observation window). Instead of doing such feature extraction, we will construct a 1D CNN model to do the classification with the following code:

from keras import models
from keras import layers

# Input: univariate time series of shape (timesteps, 1)
input_layer = layers.Input(shape=(x_train.shape[1], 1))

# Three Conv1D blocks: convolution -> batch normalization -> ReLU
conv1 = layers.Conv1D(filters=64, kernel_size=3, padding="same")(input_layer)
conv1 = layers.BatchNormalization()(conv1)
conv1 = layers.ReLU()(conv1)

conv2 = layers.Conv1D(filters=64, kernel_size=3, padding="same")(conv1)
conv2 = layers.BatchNormalization()(conv2)
conv2 = layers.ReLU()(conv2)

conv3 = layers.Conv1D(filters=64, kernel_size=3, padding="same")(conv2)
conv3 = layers.BatchNormalization()(conv3)
conv3 = layers.ReLU()(conv3)

# Global average pooling over time, then a sigmoid output for binary classification
gap = layers.GlobalAveragePooling1D()(conv3)
output_layer = layers.Dense(1, activation="sigmoid")(gap)

model = models.Model(inputs=input_layer, outputs=output_layer)

The plot below gives a summary of the model. You can see that there are three 1D convolutional layers in the model.

Image by author: Summary of the CNN model

Trained on 3601 instances and tested on another 1320, the model reached a val_binary_accuracy of 0.9697 after 220 training epochs with early stopping.
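For completeness, here is a minimal training sketch consistent with the setup above. The exact hyperparameters (batch size, patience, number of epochs) are my assumptions, not necessarily the values used in the notebook:

from keras import callbacks

model.compile(
    optimizer="adam",
    loss="binary_crossentropy",
    metrics=["binary_accuracy"],
)

# Stop training when the validation loss stops improving; the patience is an assumption
early_stop = callbacks.EarlyStopping(
    monitor="val_loss", patience=50, restore_best_weights=True
)

history = model.fit(
    x_train,
    y_train,
    batch_size=32,
    epochs=500,
    validation_data=(x_test, y_test),
    callbacks=[early_stop],
)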

Time series prediction explained by IG

IG as an explainer

Let’s go deeper now. How can we understand the model? To explain the CNN classifier, I will use Integrated Gradients (IG) as a tool. I assume that you are familiar with IG; if not, please check my previous post, in which I give a brief introduction to IG and show an example of its implementation and its limits.

In short, IG generalizes the coefficients of a linear model: the IG value of an input feature measures how much that feature moves the model’s output away from the baseline output. In this example, the baseline is chosen simply as a time series with all points equal to 0. Here is the definition of IG.

Image by author: The definition of IG
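In text form, the standard definition reads as follows, where F denotes the model, x the input series, x′ the baseline, and i indexes the points on the timeline:

\mathrm{IG}_i(x) = (x_i - x'_i) \times \int_{\alpha=0}^{1} \frac{\partial F\bigl(x' + \alpha\,(x - x')\bigr)}{\partial x_i} \, d\alpha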

According to the definition above, IG is obtained by accumulating the gradients along the straight-line path from the baseline to the input. In practice, with the help of TensorFlow’s automatic differentiation, the gradient can be computed easily with the following code:

import tensorflow as tf

def compute_gradients(series):
    # Record operations to differentiate the model output w.r.t. the input series
    with tf.GradientTape() as tape:
        tape.watch(series)
        logits = model(series)
    return tape.gradient(logits, series)
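The gradients then need to be accumulated along the path. Below is a minimal sketch of that step; the function name, the number of steps m_steps, and the trapezoidal rule are my choices, not necessarily what the notebook does:

def integrated_gradients(series, baseline, m_steps=50):
    # Straight-line path from the baseline to the input
    alphas = tf.reshape(tf.linspace(0.0, 1.0, m_steps + 1), (-1, 1, 1))
    interpolated = baseline + alphas * (series - baseline)

    # Gradients at each point of the path (the interpolated series form a batch)
    grads = compute_gradients(interpolated)

    # Trapezoidal approximation of the path integral
    avg_grads = tf.reduce_mean((grads[:-1] + grads[1:]) / 2.0, axis=0)

    # Rescale by the distance from the baseline
    return (series - baseline) * avg_grads

With the all-zeros baseline used here, this would be called as integrated_gradients(x, tf.zeros_like(x)) on a single instance x of shape (timesteps, 1), converted to a float32 tensor.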

Results

To show how IG can help us understand the model built in the last chapter, I chose two time series from the test dataset, predicted to be in class 1 and class 0 respectively. The curves in the figures display the time series while the color encodes the absolute IG value: the darker the color, the larger the IG value. In other words, the dark parts of the curves contribute more to the prediction.
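Figures like the ones below can be produced with matplotlib’s LineCollection; here is an illustrative sketch (the helper name and details are mine, not the notebook’s exact code):

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.collections import LineCollection

def plot_series_with_ig(series, ig):
    # series and ig are 1D arrays of the same length
    t = np.arange(len(series))
    points = np.stack([t, series], axis=1).reshape(-1, 1, 2)
    # One segment per consecutive pair of points
    segments = np.concatenate([points[:-1], points[1:]], axis=1)
    lc = LineCollection(segments, cmap="Greys")
    # Color each segment by the absolute IG value: darker = larger contribution
    lc.set_array(np.abs(ig[:-1]))
    fig, ax = plt.subplots()
    ax.add_collection(lc)
    ax.set_xlim(t.min(), t.max())
    ax.set_ylim(series.min() - 0.1, series.max() + 0.1)
    plt.show()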

Image by author: Time series in class 1 with IG values
Image by author: Time series in class 0 with IG values

Conclusions

We can observe some interesting things in these two figures:

  1. A single point cannot lead to the prediction on its own. There always exists a neighborhood of each point on the timeline in which all points contribute similarly to the prediction.
  2. Some points (intervals) on the timeline are much less important than others for the prediction.
  3. Even without a quantified value of extracted features that leads to the prediction, we can still get a good visual intuition of what exactly drives the output of the CNN classifier, e.g. the local minima and maxima of the values do not have a big impact on the prediction.
