Anomaly Detection in Videos using LSTM Convolutional Autoencoder

Hashem Sellat
Towards Data Science
8 min read · Oct 15, 2019


“London Underground atrium” (Photo by Anna Dziubinska on Unsplash)

Imagine we have thousands of surveillance cameras that work around the clock. Some of these cameras are mounted in remote areas or streets where it is unlikely that anything risky will take place; others are installed in crowded streets or city squares. A wide variety of abnormal events might take place even in a single location, and the definition of an abnormal event differs from one location to another and from one time to another.

Using automated systems to detect unusual events in this scenario is highly desirable and leads to better security and broader surveillance. In general, detecting anomalous events in videos is a challenging problem that currently attracts much attention from researchers; it also has broad applications across industry verticals and has recently become one of the essential tasks of video analysis. There is a huge demand for anomaly detection approaches that are fast and accurate enough for real-world applications.

Prerequisite Knowledge:

Understanding the basics of the following topics:

Convolutional neural networks:
Simple explanation:
https://www.youtube.com/watch?v=YRhxdVk_sIs&t=2s
More details:
http://cs231n.github.io/convolutional-networks/

LSTM networks:
http://colah.github.io/posts/2015-08-Understanding-LSTMs/

Autoencoders:
https://www.quora.com/What-is-an-auto-encoder-in-machine-learning
https://towardsdatascience.com/deep-inside-autoencoders-7e41f319999f

Deconvolutional layers:
https://towardsdatascience.com/transpose-convolution-77818e55a123

Why Don’t We Use Supervised Methods to Detect Anomalies?

If we want to treat the problem as binary classification, we need labeled data, and in this case collecting labeled data is hard for the following reasons:

  1. Abnormal events are challenging to obtain due to their rarity.
  2. There is a massive variety of abnormal events, and manually detecting and labeling such events is a difficult task that requires much manpower.

These reasons motivated the use of unsupervised or semi-supervised methods such as dictionary learning, spatio-temporal features, and autoencoders. Unlike supervised methods, these methods require only unlabeled video footage that contains little or no abnormal events, which is easy to obtain in real-world applications.

Autoencoders

Autoencoders are neural networks that are trained to reconstruct the input. The autoencoder consists of two parts:

  1. The encoder: Capable of learning efficient representations of the input data (x) called the encoding f(x). The last layer of the encoder is called the bottleneck, which contains the input representation f(x).
  2. The decoder: produces a reconstruction of the input data r = g(f(x)) using the encoding in the bottleneck.
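
As a concrete illustration (not the model we will build later), a minimal fully connected autoencoder in Keras might look like this; the layer sizes here are arbitrary choices for clarity:

```python
from tensorflow.keras import layers, models

# Encoder: compress the input x into a small encoding f(x).
inputs = layers.Input(shape=(784,))                        # e.g. a flattened 28x28 image
hidden = layers.Dense(128, activation="relu")(inputs)
bottleneck = layers.Dense(32, activation="relu")(hidden)   # the bottleneck holds f(x)

# Decoder: reconstruct the input from the encoding, r = g(f(x)).
hidden = layers.Dense(128, activation="relu")(bottleneck)
outputs = layers.Dense(784, activation="sigmoid")(hidden)

autoencoder = models.Model(inputs, outputs)
# The network is trained to reproduce its own input, so the loss is the reconstruction error.
autoencoder.compile(optimizer="adam", loss="mse")
```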

The Approach

It is all about the reconstruction error.
We use an autoencoder to learn regularity in video sequences.
The intuition is that the trained autoencoder will reconstruct regular video sequences with low error but will not accurately reconstruct motions in irregular video sequences.

Getting Dirty With Data

We will use the UCSD anomaly detection dataset, which contains videos acquired with a camera mounted at an elevation, overlooking a pedestrian walkway. In normal settings, these videos contain only pedestrians.

Abnormal events are due to either:

  1. Non-pedestrian entities in the walkway, like bikers, skaters, and small carts.
  2. Unusual pedestrian motion patterns, like people walking across the walkway or on the grass surrounding it.

The UCSD dataset consists of two parts, ped1 and ped2. We will use the ped1 part for training and testing.

Setting Up

Download the UCSD dataset and extract it into your current working directory or create a new notebook in Kaggle using this dataset.

Preparing The Training Set

The training set consists of sequences of regular video frames; the model will be trained to reconstruct these sequences. So, let’s get the data ready to feed our model by following these three steps:

  1. Divide the training video frames into temporal sequences, each of size 10 using the sliding window technique.
  2. Resize each frame to 256 × 256 to ensure that input images have the same resolution.
  3. Scale the pixel values between 0 and 1 by dividing each pixel by 256.

One last point is that since the number of parameters in this model is huge, we need a large amount of training data, so we perform data augmentation in the temporal dimension. To generate more training sequences, we concatenate frames with various skipping strides. For example, the first stride-1 sequence is made up of frames (1, 2, 3, 4, 5, 6, 7, 8, 9, 10), whereas the first stride-2 sequence consists of frames (1, 3, 5, 7, 9, 11, 13, 15, 17, 19).

Here is the code. Feel free to edit it to get more/fewer input sequences with various skipping strides, and see how the results change afterward.
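
A minimal sketch of these steps could look like the following; the folder layout matches the UCSDped1 training set, while the helper names and the choice of strides are illustrative:

```python
import os
import numpy as np
from PIL import Image

SEQ_LEN = 10
TRAIN_PATH = "UCSD_Anomaly_Dataset.v1p2/UCSDped1/Train"   # adjust to your setup

def load_clip(clip_path):
    """Load one training clip as a list of 256x256 frames scaled to [0, 1]."""
    frames = []
    for name in sorted(os.listdir(clip_path)):
        if name.endswith(".tif"):
            img = Image.open(os.path.join(clip_path, name)).resize((256, 256))
            frames.append(np.array(img, dtype=np.float32) / 256.0)
    return frames

def make_sequences(strides=(1, 2)):
    """Slide a 10-frame window over each clip, with several skipping strides for augmentation."""
    sequences = []
    for clip in sorted(os.listdir(TRAIN_PATH)):
        clip_path = os.path.join(TRAIN_PATH, clip)
        if not os.path.isdir(clip_path):
            continue
        frames = load_clip(clip_path)
        for stride in strides:
            span = SEQ_LEN * stride
            for start in range(len(frames) - span + 1):
                sequences.append(frames[start:start + span:stride])
    # Final shape: (num_sequences, 10, 256, 256, 1)
    return np.expand_dims(np.array(sequences, dtype=np.float32), axis=-1)

training_set = make_sequences()
```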

Note: if you face a memory error, decrease the number of training sequences or use a data generator.

Building And Training The Model

Finally, the fun part begins! We will use Keras to build our convolutional LSTM autoencoder.

The image below shows the training process; we will train the model to reconstruct the regular events. So let us start exploring the model settings and architecture.

To build the autoencoder, we should define the encoder and the decoder. The encoder accepts as input a sequence of frames in chronological order, and it consists of two parts: the spatial encoder and the temporal encoder. The encoded features of the sequence that come out of the spatial encoder are fed into the temporal encoder for motion encoding.

The decoder mirrors the encoder to reconstruct the video sequence, so our autoencoder looks like a sandwich.

Note: because the model has a huge number of parameters, it’s recommended that you use a GPU. Using Kaggle or Colab is also a good idea.

Initialization and Optimization:
We use Adam as the optimizer with the learning rate set to 0.0001; we reduce it when the training loss stops decreasing by using a decay of 0.00001, and we set the epsilon value to 0.000001.

For initialization, we use the Xavier algorithm, which prevents the signal from becoming too tiny or too massive to be useful as it goes through each layer.
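
A sketch of such a model in tf.keras is shown below. The filter counts, kernel sizes, and strides are illustrative choices in the spirit of [1] and [2], not a definitive configuration; Keras layers use the glorot_uniform (Xavier) initializer by default, so no extra initialization code is needed, and here a ReduceLROnPlateau callback stands in as one way to lower the learning rate when the loss stops decreasing:

```python
from tensorflow import keras
from tensorflow.keras.layers import (Conv2D, Conv2DTranspose, ConvLSTM2D,
                                     LayerNormalization, TimeDistributed)

model = keras.Sequential(name="spatiotemporal_autoencoder")
model.add(keras.Input(shape=(10, 256, 256, 1)))            # 10 grayscale frames of 256x256

# Spatial encoder: applied to every frame of the sequence.
model.add(TimeDistributed(Conv2D(128, (11, 11), strides=4, padding="same", activation="tanh")))
model.add(LayerNormalization())
model.add(TimeDistributed(Conv2D(64, (5, 5), strides=2, padding="same", activation="tanh")))
model.add(LayerNormalization())

# Temporal encoder/decoder: convolutional LSTM layers for motion encoding.
model.add(ConvLSTM2D(64, (3, 3), padding="same", return_sequences=True))
model.add(LayerNormalization())
model.add(ConvLSTM2D(32, (3, 3), padding="same", return_sequences=True))
model.add(LayerNormalization())
model.add(ConvLSTM2D(64, (3, 3), padding="same", return_sequences=True))
model.add(LayerNormalization())

# Spatial decoder: mirrors the encoder with deconvolutional (transposed convolution) layers.
model.add(TimeDistributed(Conv2DTranspose(64, (5, 5), strides=2, padding="same", activation="tanh")))
model.add(LayerNormalization())
model.add(TimeDistributed(Conv2DTranspose(128, (11, 11), strides=4, padding="same", activation="tanh")))
model.add(LayerNormalization())
model.add(TimeDistributed(Conv2D(1, (11, 11), padding="same", activation="sigmoid")))

# Adam with the settings above; the learning rate is lowered when the training loss plateaus.
model.compile(loss="mse", optimizer=keras.optimizers.Adam(learning_rate=1e-4, epsilon=1e-6))
reduce_lr = keras.callbacks.ReduceLROnPlateau(monitor="loss", factor=0.5, patience=3)

# training_set is the array built in the preprocessing sketch above.
model.fit(training_set, training_set, batch_size=4, epochs=3, shuffle=False, callbacks=[reduce_lr])
```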

Let’s Dive Deeper into the Model!

Why use convolutional layers in the encoder and deconvolutional layers in the decoder?
A convolutional layer connects multiple input activations within the fixed receptive field of a filter to a single output activation; it abstracts the information in a filter cuboid into a scalar value. Deconvolutional layers, on the other hand, densify a sparse signal through convolution-like operations with multiple learned filters; they associate a single input activation with a patch of outputs, an inverse operation of convolution.

The learned filters in the deconvolutional layers serve as bases to reconstruct the shape of an input motion sequence.
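
A quick way to see this relationship is to compare output shapes: a strided convolution contracts the spatial grid, while the matching transposed convolution expands it back. A toy example (shapes chosen only for illustration):

```python
import numpy as np
from tensorflow.keras.layers import Conv2D, Conv2DTranspose

x = np.random.rand(1, 64, 64, 1).astype("float32")

# Convolution: many input activations -> one output activation per filter position.
down = Conv2D(16, (5, 5), strides=2, padding="same")(x)             # shape (1, 32, 32, 16)

# Transposed convolution: one input activation -> a patch of outputs.
up = Conv2DTranspose(1, (5, 5), strides=2, padding="same")(down)    # shape (1, 64, 64, 1)

print(down.shape, up.shape)
```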

Why did we use convolutional LSTM layers?
For general-purpose sequence modeling, the LSTM, as a particular RNN structure, has proven stable and robust at preserving long-range dependencies.

Here we use convolutional LSTM layers instead of fully connected LSTM layers because FC-LSTM does not preserve spatial structure well: it uses full connections in its input-to-state and state-to-state transitions, so no spatial information is encoded.
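
The difference shows up directly in the shapes each layer works with; this toy comparison (not part of the model above) makes it visible:

```python
import numpy as np
from tensorflow.keras.layers import LSTM, ConvLSTM2D

frames = np.random.rand(1, 10, 64, 64, 1).astype("float32")   # (batch, time, height, width, channels)

# FC-LSTM: every frame must be flattened into a vector, discarding its spatial layout.
flat_out = LSTM(32, return_sequences=True)(frames.reshape(1, 10, -1))
print(flat_out.shape)   # (1, 10, 32)

# Convolutional LSTM: states and outputs remain 2-D feature maps, so spatial structure survives.
conv_out = ConvLSTM2D(32, (3, 3), padding="same", return_sequences=True)(frames)
print(conv_out.shape)   # (1, 10, 64, 64, 32)
```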

What is the purpose of Layer Normalization?
Training deep neural networks is computationally expensive. One way to reduce the training time is to normalize the activities of the neurons using Layer Normalization; we use Layer Normalization instead of other methods like Batch Normalization because we have a recurrent neural network. Read more about normalization techniques.

Did We Do Well?

Let’s get to the testing phase.

The first step is to get the test data. We will test each video individually. The UCSD dataset provides 34 testing videos; the value of Config.SINGLE_TEST_PATH determines which one will be used.

Each testing video has 200 frames. We use the sliding window technique to get all the consecutive 10-frame sequences. In other words, for each t between 0 and 190, we calculate the regularity score Sr(t) of the sequence that starts at frame t and ends at frame t+9.

Regularity Score:

We compute the reconstruction error of a pixel's intensity value I at location (x, y) in frame t of the video using the L2 norm, where F_W is the model learned by the LSTM convolutional autoencoder.

Then we compute the reconstruction error of frame t by summing up all the pixel-wise errors.

The reconstruction cost of the 10-frame sequence that starts at frame t is the sum of the reconstruction errors of its frames.

Next, we compute the abnormality score Sa(t) by scaling the sequence cost between 0 and 1.

Finally, we derive the regularity score Sr(t) by subtracting the abnormality score from 1.
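
Putting these steps together (roughly, following the notation of [2], with F_W denoting the learned model):

$$e(x, y, t) = \lVert I(x, y, t) - F_W(I(x, y, t)) \rVert_2$$

$$e(t) = \sum_{(x, y)} e(x, y, t), \qquad S(t) = \sum_{t'=t}^{t+9} e(t')$$

$$S_a(t) = \frac{S(t) - \min_{t'} S(t')}{\max_{t'} S(t') - \min_{t'} S(t')}, \qquad S_r(t) = 1 - S_a(t)$$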

After we compute the regularity score Sr(t) for each t in range [0,190], we draw Sr(t).
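
A rough sketch of this testing loop is below; the helper mirrors the training preprocessing, `model` is the trained autoencoder from the sketch above, and the test path is one of the 200-frame test folders (what the original code calls Config.SINGLE_TEST_PATH):

```python
import os
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image

TEST_PATH = "UCSD_Anomaly_Dataset.v1p2/UCSDped1/Test/Test032"   # i.e. Config.SINGLE_TEST_PATH

def load_test_frames(test_path):
    """Load one 200-frame test clip, resized and scaled exactly like the training data."""
    frames = []
    for name in sorted(os.listdir(test_path)):
        if name.endswith(".tif"):
            img = Image.open(os.path.join(test_path, name)).resize((256, 256))
            frames.append(np.array(img, dtype=np.float32) / 256.0)
    return np.expand_dims(np.array(frames), axis=-1)             # (200, 256, 256, 1)

frames = load_test_frames(TEST_PATH)

# Sliding window: one 10-frame sequence for every t in [0, 190].
sequences = np.stack([frames[t:t + 10] for t in range(len(frames) - 9)])
reconstructed = model.predict(sequences, batch_size=4)

# Reconstruction cost per sequence (sum of pixel-wise errors over its 10 frames),
# then min-max scaling to get the abnormality and regularity scores.
costs = np.sum((sequences - reconstructed) ** 2, axis=(1, 2, 3, 4))
abnormality = (costs - costs.min()) / (costs.max() - costs.min())
regularity = 1.0 - abnormality

plt.plot(regularity)
plt.xlabel("starting frame t")
plt.ylabel("regularity score Sr(t)")
plt.show()
```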

Test 032 of UCSDped1

Some Tests:

First, let’s take a look at test 32 of UCSDped1. At the beginning of the video, there is a bicycle on the walkway, which explains the low regularity score. After the bicycle leaves, the regularity score starts to increase. Around frame 60, another bicycle enters; the regularity score decreases again and increases right after the bicycle leaves.

Test 004 of the UCSDped1 dataset shows a skater entering the walkway at the beginning of the video and someone walking on the grass around frame 140, which explains the two drops in the regularity score.

Test 024 of the UCSDped1 dataset shows a small cart crossing the walkway, causing a drop in the regularity score. The regularity score returns to normal after the cart leaves.

Test 005 of the UCSDped1 dataset shows two bicycles passing along the walkway, one at the beginning and the other at the end of the video.

Conclusion:

Try multiple datasets, like the CUHK Avenue dataset or the UMN dataset, or even gather your own data using a surveillance camera or a small camera in your room. The training data is relatively easy to collect since it consists of videos that contain only regular events. Mix multiple datasets and see if the model still does well. Think of ways to speed up anomaly detection, such as using fewer sequences in the testing stage.

And don’t forget to write your results in the comments!

The code and trained model are available on GitHub here.

Keep in touch on LinkedIn.

References:

[1] Yong Shean Chong, Abnormal Event Detection in Videos using Spatiotemporal Autoencoder (2017), arXiv:1701.01546.

[2] Mahmudul Hasan, Jonghyun Choi, Jan Neumann, Amit K. Roy-Chowdhury, Learning Temporal Regularity in Video Sequences (2016), arXiv:1604.04574.
