
Enhancing Autoencoders with memory modules for Anomaly Detection.


Videos, Anomalies and Autoencoders

Photo by Matthew Hamilton on Unsplash

Detecting anomalies in video streams is a hard task. Even with the rise of Deep Learning techniques that can utilise the large amounts of data generated by CCTV cameras, the problem remains difficult because, as the name suggests, anomalies are rare, and it is close to impossible to annotate every kind of anomaly for supervised learning. In this article, we’ll go through how anomalies are detected in video feeds and how contemporary works use Memory modules to improve performance.

As we noted, a well-annotated and comprehensive dataset is extremely hard to build for Anomaly Detection. This is why Anomaly Detection is usually tackled with unsupervised learning, most commonly with deep Autoencoders.

An autoencoder is a type of artificial neural network used to learn efficient data codings in an unsupervised manner.¹

Autoencoders are trained in an unsupervised fashion, i.e., we task the neural network with representation learning. In essence, the network has an encoder-decoder architecture where the encoder learns a compressed feature set (the representation) and the decoder uses this representation to reconstruct the input (the most common setup, though variants reconstruct other targets).
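As a rough illustration (a toy linear autoencoder in NumPy, not the convolutional architectures discussed below), the encode-decode round trip and its reconstruction error look like this:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear autoencoder: an 8-D input compressed to a 3-D representation.
input_dim, code_dim = 8, 3
W_enc = rng.normal(size=(code_dim, input_dim)) * 0.1  # encoder weights
W_dec = rng.normal(size=(input_dim, code_dim)) * 0.1  # decoder weights

def encode(x):
    return W_enc @ x          # compressed representation

def decode(z):
    return W_dec @ z          # reconstruction from the representation

x = rng.normal(size=input_dim)
x_hat = decode(encode(x))
reconstruction_error = np.mean((x - x_hat) ** 2)  # the training signal
```

Real models replace the two matrices with deep (convolutional) encoders and decoders, but the objective is the same: minimize the reconstruction error on normal data.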

How is an Autoencoder used for Anomaly detection?

For Anomaly detection, Autoencoders are tasked with reconstructing the input. The Autoencoder is trained only on images from normal scenarios, and the assumption is that it will have a high reconstruction error when it encounters abnormal data. However, this assumption does not always hold in practice. It has been observed empirically that the representational power of Convolutional Neural Networks (CNNs) is strong enough to reconstruct even abnormal frames with low reconstruction error.
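A sketch of how per-frame reconstruction errors are typically turned into anomaly scores (the min-max normalization and the threshold here are illustrative choices, not the exact heuristics from the papers):

```python
import numpy as np

def anomaly_scores(errors):
    """Normalize per-frame reconstruction errors to [0, 1] anomaly scores."""
    errors = np.asarray(errors, dtype=float)
    return (errors - errors.min()) / (errors.max() - errors.min() + 1e-8)

# Hypothetical per-frame MSEs: the last frame reconstructs poorly.
frame_errors = [0.01, 0.012, 0.011, 0.09]
scores = anomaly_scores(frame_errors)
flagged = scores > 0.5   # the threshold is a free parameter in practice
```

The whole debate in the rest of this article is about making the error gap between normal and abnormal frames large enough for such a threshold to work.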

It has been observed that sometimes the autoencoder "generalizes" so well that it can also reconstruct anomalies well, leading to the miss detection of anomalies.²

Where does a Memory module fit into all of this?

The works of Gong et al.² and Park et al.³ attempt to address this problem by adding Memory modules to the network architecture. Both works try to memorize the prototypical elements of normal data and use this memory to distort reconstructions of abnormal data. In simpler terms, instead of relying solely on the representation from the encoder, they use "memory items" as well, to curb the representation power of the CNN. The usage of the memory module differs slightly between the two works. Let’s dissect these implementations.

Gong et al., MemAE

Anomaly detection with MemAE. This visualisation shows a simplified version where only one memory item is queried. MemAE exclusively reconstructs the output from memory items. Image by Gong et al.

MemAE is essentially an encoder-decoder with a memory module added between the encoder and the decoder. The encoder builds the representation of the input, and this representation is used to query the most relevant memory items, which the decoder then uses for reconstruction. During the training phase, the memory items are updated to "memorize" the prototypical features of the normal data. During inference, the memory items are frozen; since the Memory module only recognizes normal data, it picks suboptimal items when anomalous entities appear in the input. The reconstruction error is therefore much higher for anomalous frames than for normal ones, and this, in turn, is used for anomaly detection.
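The read step can be sketched as follows: the query attends over the memory items via cosine similarity and a softmax, and a hard shrinkage step sparsifies the weights (a simplified NumPy version of Gong et al.'s addressing scheme; the threshold value here is illustrative, as the paper tunes it relative to the memory size):

```python
import numpy as np

def memae_read(z, memory, shrink_thres=0.0025):
    """Soft memory addressing, MemAE style: the query attends over all
    memory items; hard shrinkage sparsifies the weights."""
    # Cosine similarity between the query and every memory item.
    sim = memory @ z / (np.linalg.norm(memory, axis=1) * np.linalg.norm(z) + 1e-12)
    w = np.exp(sim) / np.exp(sim).sum()        # softmax addressing weights
    w = np.where(w > shrink_thres, w, 0.0)     # hard shrinkage: drop tiny weights
    w = w / (w.sum() + 1e-12)                  # re-normalize
    return w @ memory                          # decoder input: memory items only

rng = np.random.default_rng(0)
memory = rng.normal(size=(10, 4))              # 10 memory items, 4 channels
z = memory[3] + 0.01 * rng.normal(size=4)      # query close to memory item 3
z_hat = memae_read(z, memory)
```

Note that the decoder never sees `z` directly; everything it receives is a combination of memorized normal features, which is exactly why abnormal inputs reconstruct poorly.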

Park et al., Memory-guided Normality for Anomaly Detection (MNAD)

Learning Memory-guided Normality for Anomaly Detection. We will refer to this network as MNAD. The Blue+Red feature block will be referred to as updated_features. Note the red Key has a corresponding (similar) blue memory item stacked on it. Image by Park et al.

Following MemAE, Park et al. introduced a similar solution to deal with the representational power of CNN autoencoders. They proposed using both the encoder representation and the memory items as input to the decoder for reconstruction. Instead of reconstructing exclusively from memory items, they split the encoder representation (HxWxC) into HxW keys (1x1xC). These keys are used to update the memory items and to fetch relevant memory items during the training phase. Once the relevant memory items are aggregated, they are stacked onto the encoder representation and fed to the decoder for reconstruction of the input.
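A simplified NumPy sketch of this read step (for brevity we use a hard nearest-item read, whereas the paper uses a soft weighted read over all items; the helper name is ours):

```python
import numpy as np

def mnad_update_features(feat, memory):
    """Sketch of MNAD-style reading: each spatial key gets its most
    similar memory item concatenated on the channel axis.

    feat:   (H, W, C) encoder feature map
    memory: (M, C) memory items
    Returns the (H, W, 2C) "updated_features" fed to the decoder.
    """
    H, W, C = feat.shape
    keys = feat.reshape(-1, C)                      # H*W individual queries
    # Cosine similarity of every key against every memory item.
    k_norm = keys / (np.linalg.norm(keys, axis=1, keepdims=True) + 1e-12)
    m_norm = memory / (np.linalg.norm(memory, axis=1, keepdims=True) + 1e-12)
    sim = k_norm @ m_norm.T                         # (H*W, M)
    nearest = memory[sim.argmax(axis=1)]            # hard read for simplicity
    updated = np.concatenate([keys, nearest], axis=1)
    return updated.reshape(H, W, 2 * C)

rng = np.random.default_rng(0)
feat = rng.normal(size=(4, 4, 8))
memory = rng.normal(size=(10, 8))
updated_features = mnad_update_features(feat, memory)
```

The first C channels of the output are the raw encoder features; the last C channels are the memorized normal features stacked on them.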

Colours signify similarity. Encoder Keys will have similar memory items stacked on them. This would couple well when the input is a normal frame but Key, Memory pair would be sub-optimal for an anomalous frame.

The idea is to use the memory items as a kind of noise by building a 2C-wide representation (updated_features in the figure) where each encoder Key has a similar memory item stacked on it. If the frame is anomalous, the memory item will not match as closely as it would for a normal frame.

What are the core differences in these works?

The main difference between the two works is that MemAE attempts to memorize items so well that the Memory module alone can construct a representation with enough information for the decoder to reconstruct the input, while MNAD uses the memory items as a kind of noise filter. The updated_features are a concatenation of the encoder features and the relevant memory items. For an anomalous input, the memory items will still correspond to normal scene features whereas the encoder features will be those of the anomalous frame, which, in theory, should increase the reconstruction error.

MemAE also proposes using 3D convolutions to deal with temporal information in videos, whereas MNAD handles temporal information with motion cues fed as an input batch (4–5 input frames).

MNAD also introduces a Test Time Memory update scheme, which keeps the memory module finely tuned throughout inference.

What do these changes afford MNAD?

Since MNAD uses both the encoder representation and the Memory module to build the updated_features, the memory module does not need a large number of memory items (2,000 for MemAE vs. 10 for MNAD). And since the Memory module is small, it is imperative that it is populated with diverse features that are well distributed over the queries. MNAD introduces a feature separateness loss and a compactness loss, which incentivise populating the memory module with diverse memory items.

The feature separateness loss incentivises the model to decrease the distance between each feature and its nearest memory item while increasing the distance between that feature and the second-nearest item. This closely resembles the Triplet loss.

Triplet loss is a loss function for machine learning algorithms where a baseline (anchor) input is compared to a positive (truthy) input and a negative (falsy) input. The distance from the baseline (anchor) input to the positive (truthy) input is minimized, and the distance from the baseline (anchor) input to the negative (falsy) input is maximized.⁹

The compactness loss incentivises the model to reduce the intra-class variation of the representations that the encoder builds.
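Both losses can be sketched in NumPy as follows (the margin value and the exact weighting are illustrative; the helper name is ours, and the paper defines the terms per query with learned memory items):

```python
import numpy as np

def separateness_compactness(queries, memory, margin=1.0):
    """Sketch of MNAD's feature separateness and compactness losses.

    queries: (Q, C) encoder keys;  memory: (M, C) memory items.
    Separateness pulls each query toward its nearest item while pushing
    it away from the second-nearest (triplet-style); compactness only pulls.
    """
    # Pairwise distances between every query and every memory item: (Q, M).
    d = np.linalg.norm(queries[:, None, :] - memory[None, :, :], axis=2)
    order = np.argsort(d, axis=1)
    d1 = d[np.arange(len(queries)), order[:, 0]]    # nearest item distance
    d2 = d[np.arange(len(queries)), order[:, 1]]    # second-nearest distance
    sep = np.maximum(d1 - d2 + margin, 0.0).mean()  # triplet-style hinge
    comp = (d1 ** 2).mean()                         # pull queries to nearest item
    return sep, comp

rng = np.random.default_rng(0)
sep, comp = separateness_compactness(rng.normal(size=(6, 4)),
                                     rng.normal(size=(10, 4)))
```

The hinge in the separateness term is what spreads the memory items apart: a query can only sit close to one item while being pushed away from the runner-up.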

Together, these losses increase the diversity of the memory items stored in the Memory module and, with it, the module’s discriminative power.

t-SNE plots for features with separateness loss, without separateness loss and with our proposed supervision to force distribution (discussed later in the article).

Which is better?

MNAD consistently outperforms most of its competition, and the authors primarily credit this performance to the way their Memory module is implemented and to the Test Time memory update scheme.

Are these conclusions concrete and accurate?

We reimplemented MNAD for the RC2020 challenge⁴ and wrote a report⁵ on its reproducibility.

The primary goal of this event is to encourage the publishing and sharing of scientific results that are reliable and reproducible. In support of this, the objective of this challenge is to investigate reproducibility of papers accepted for publication at top conferences by inviting members of the community at large to select a paper, and verify the empirical results and claims in the paper by reproducing the computational experiments, either via a new implementation or using code/data or other information provided by the authors.⁴

Let’s dig a bit deeper into how much of this theory worked out as intended.

Are the tasks under-credited?

MNAD uses two proxy tasks to benchmark its models.

  1. Reconstruction task.
  2. Prediction task.

For the Reconstruction task, the model is fed a single frame and tasked with reconstructing it.

For the Prediction task, the model is fed a batch of 5 frames (for temporal information) and tasked with predicting the 6th frame.
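Building the input/target pairs for the Prediction task is a simple sliding window over the video (a minimal sketch; the helper name and frame sizes are placeholders of ours):

```python
import numpy as np

def make_prediction_samples(frames, t_in=5):
    """Slide a window over a video: t_in consecutive frames form the
    input, and the frame that follows them is the prediction target."""
    inputs, targets = [], []
    for i in range(len(frames) - t_in):
        inputs.append(frames[i:i + t_in])    # frames i .. i+t_in-1
        targets.append(frames[i + t_in])     # frame i+t_in
    return np.stack(inputs), np.stack(targets)

video = np.zeros((20, 32, 32))   # 20 dummy grayscale frames (placeholder size)
x_batches, y_targets = make_prediction_samples(video)
```

For the Reconstruction task the same loop degenerates to input = target = a single frame, which is exactly why skip connections become a problem there.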

The model architecture for both tasks is quite similar to a U-Net. The main difference is that the skip connections have to be removed for the Reconstruction task: the input frame is itself the reconstruction target, and with skip connections the model would essentially learn to copy the input to the output.

This raises two questions.

  1. The Prediction task allows the model to embed temporal information in the input features, so does this task deserve some of the credit for the performance gain?
  2. The Prediction task also allows the model to keep its skip connections; how do the skip connections affect the behaviour of the model?

Empirically, we showed that the skip connections were to be credited for some of the performance increase. To be certain of this, we had to introduce skip connections to the model in the Reconstruction task. To achieve this, we added salt-and-pepper noise at different probabilities (5%, 25%, 40%) to the input and tasked the model with denoising it. Since the input no longer contains the exact target frame, the model cannot copy the input through the skip connections. We saw a raw performance gain of ~4%, but it did not beat the benchmarks from the Prediction task. It is important to note that the Prediction task does embed temporal information into the input features, which means that a model trained on the Prediction task can classify temporal and spatio-temporal anomalies, whereas temporal anomalies are completely missed by a model trained on the Reconstruction task.
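The noise injection can be sketched as follows (the helper name is ours; pixel values are assumed to be in [0, 1]):

```python
import numpy as np

def salt_and_pepper(img, prob=0.25, rng=None):
    """Corrupt an image in [0, 1]: with probability `prob`, a pixel is
    forced to 0 or 1. Denoising such inputs lets the Reconstruction task
    keep skip connections without degenerating into an identity map."""
    if rng is None:
        rng = np.random.default_rng()
    noisy = img.copy()
    mask = rng.random(img.shape) < prob                       # pixels to corrupt
    noisy[mask] = rng.integers(0, 2, size=int(mask.sum())).astype(img.dtype)
    return noisy

rng = np.random.default_rng(0)
img = np.full((8, 8), 0.5)
noisy = salt_and_pepper(img, prob=0.25, rng=rng)
```

Because the corrupted pixels carry no information about the clean target, copying through the skip connections no longer minimizes the loss.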

So the answer to both questions is, empirically, yes. The Prediction task allows for embedding temporal information and for the introduction of skip connections, both of which improve performance.

How exactly does the task affect the model behaviour?

The Reconstruction task is devoid of any temporal information, as it takes in only one input frame and reconstructs one output, and is consequently only able to identify spatial anomalies. The Prediction task, on the other hand, embeds temporal information into the input features, which helps the model identify spatial, temporal and spatio-temporal anomalies.

To show how the tasks affect the performance of the models, we synthetically generated a dataset with spatial, temporal and spatio-temporal anomalies. Normal instances are circles of radius 10 pixels moving across the frame at 5 pixels per frame. Spatial anomalies are squares moving at the same speed, temporal anomalies are circles moving at 10 pixels per frame, and spatio-temporal anomalies are squares moving at 10 pixels per frame.
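Generating such synthetic frames is straightforward; a minimal sketch (the canvas size and helper name are arbitrary choices of ours):

```python
import numpy as np

def draw_frame(x_center, shape="circle", size=10, canvas=64):
    """Render one binary frame of the synthetic dataset: a circle
    (normal) or a square (spatial anomaly) centred at x_center."""
    frame = np.zeros((canvas, canvas))
    yy, xx = np.mgrid[0:canvas, 0:canvas]
    y_center = canvas // 2
    if shape == "circle":
        frame[(xx - x_center) ** 2 + (yy - y_center) ** 2 <= size ** 2] = 1.0
    else:  # square of half-width `size`
        frame[(abs(xx - x_center) <= size) & (abs(yy - y_center) <= size)] = 1.0
    return frame

# Normal clip: circle stepping 5 px/frame. A temporal anomaly would step
# 10 px/frame; a spatial anomaly would use shape="square".
normal_clip = np.stack([draw_frame(x) for x in range(10, 60, 5)])
```

Varying only the step size or only the shape lets each anomaly type be probed in isolation.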

The following figures show how the output artefacts are more prominent in the model trained on the Prediction task.

Spatial anomaly: the data has never contained squares before, and both models reconstruct the squares as the circles that populate the normal data.
Temporal anomaly: the circles move faster, which affects the input batch for the Prediction task (this information is absent for the Reconstruction task, whose input carries no temporal information); only the model trained on the Prediction task shows artefacts.
Spatio-temporal anomaly: the input has squares moving faster than the usual circles. Both models convert the squares to circles, but the model trained on the Prediction task also introduces temporal artefacts, which makes the classification easier.

Were these the only elements that might have been under credited?

MNAD was benchmarked on the UCSD Ped2⁶, CUHK Avenue⁷ and ShanghaiTech⁸ datasets. The ShanghaiTech dataset is much more complex than UCSD Ped2 and CUHK Avenue. While reproducing the results of MNAD on ShanghaiTech, we noticed a lot of NaN values in one of the heuristics that used the memory items. Delving deeper, we found that the Memory module was not working as intended: the features were being related to only a small portion of the memory items. Such a skewed distribution essentially renders the memory module useless. For the ShanghaiTech dataset, we found that all the features were related to just one memory item, which means the Key, Memory item pair would be the same for an anomalous frame as for a normal one.

The Memory tensor would always be mapped to this key, making the memory module completely obsolete.

We introduced additional supervision to force a uniform distribution, and this helped beat the reported scores. It is not an ideal solution, as we are forcing a distribution that might not be optimal, but this hack does get the pipeline working as intended.
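One way to sketch such a supervision term is a penalty on the deviation of average addressing weights from a uniform distribution (this is an illustrative formulation of ours, not the exact term from our report⁵):

```python
import numpy as np

def uniformity_penalty(weights):
    """Penalty on skewed memory usage: the average addressing weight
    per memory item should be close to uniform (1/M).

    weights: (Q, M) addressing weights of Q queries over M memory items.
    """
    mean_usage = weights.mean(axis=0)                     # (M,) per-item usage
    uniform = np.full_like(mean_usage, 1.0 / mean_usage.size)
    return ((mean_usage - uniform) ** 2).sum()

# Collapsed usage: every query attends to memory item 0 only.
collapsed = np.zeros((100, 10))
collapsed[:, 0] = 1.0
# Perfectly uniform usage over the 10 memory items.
uniform_usage = np.full((100, 10), 0.1)
```

Adding such a term to the training loss pushes the queries to spread their attention across the whole memory instead of collapsing onto one item.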

Forced Uniform Memory distribution.

Misc. observations

We also observed that the skip connections were heavily involved in decoding the output, so much so that the memory module and the intermediate representation did not matter at all: we replaced the intermediate representations with zeros, ones and random tensors, and the output was optically unchanged. This led us to look at the difference in performance with Test Time memory updates for the Reconstruction and Prediction tasks. The Prediction task was not affected much, but the Reconstruction task took a huge hit, attributable to the fact that there were no skip connections to do the heavy lifting.


Conclusion

Our experiments show that the addition of the Memory module improves performance on simpler datasets like UCSD Ped2 and CUHK Avenue. For more complex datasets like ShanghaiTech, the verdict is still out: there, adding a Memory module actually deteriorated performance compared with a model without one. We show that training on such complex datasets leads to the Memory module being sidelined and rendered completely obsolete. The additional supervision we propose to tackle this problem is not ideal, but it does let us evaluate the model to scores marginally better than those reported by Park et al. The biggest concern for the ShanghaiTech dataset remains that the score of the model trained on the Reconstruction task is slightly higher than that of the Prediction task. Given that the Reconstruction task completely ignores temporal anomalies, this calls into question the efficacy of the Memory module on this dataset. It is possible that the model finds it easier to rely heavily on the skip connections even with the proposed supervision, falling back on superficial features to decode the representation. It is also possible that the dataset is simply too noisy for the model.

Finally, we could hit the reported scores with these proposed amendments, but we conclude that work remains to be done to build a stable pipeline for complex datasets like ShanghaiTech. I hope this article gives a decent introduction to Memory modules for Anomaly detection!


References

[1] Kramer, Mark A. (1991). "Nonlinear principal component analysis using auto associative neural networks" (PDF). AIChE Journal. 37 (2): 233–243. doi:10.1002/aic.690370209.

[2] D. Gong et al., Memorizing Normality to Detect Anomaly: Memory-Augmented Deep Autoencoder for Unsupervised Anomaly Detection (ICCV 2019), https://arxiv.org/pdf/1904.02639.pdf

[3] H. Park, J. Noh and B. Ham, Learning Memory-guided Normality for Anomaly Detection (CVPR 2020), https://arxiv.org/pdf/2003.13228.pdf

[4] https://paperswithcode.com/rc2020

[5] https://arxiv.org/ftp/arxiv/papers/2101/2101.12382.pdf

[6] Weixin Li, Vijay Mahadevan, and Nuno Vasconcelos. Anomaly detection and localization in crowded scenes. IEEE TPAMI, 2013.

[7] Cewu Lu, Jianping Shi, and Jiaya Jia. Abnormal event detection at 150 FPS in MATLAB. In ICCV, 2013.

[8] Weixin Luo, Wen Liu, and Shenghua Gao. A revisit of sparse coding based anomaly detection in stacked RNN framework. In ICCV, 2017.

[9] https://en.wikipedia.org/wiki/Triplet_loss

[10] https://github.com/alchemi5t/MNADrc

