
Anomaly Detection in Manufacturing, Part 2: Building a Variational Autoencoder

Machine failures? Use variational autoencoders to detect and prevent them

MANUFACTURING DATA SCIENCE WITH PYTHON

Photo by Daniel Smyth on Unsplash

In the previous post (Part 1 of this series) we discussed how an autoencoder can be used for Anomaly Detection. We also explored the UC Berkeley milling data set. Going forward, we will use a variant of the autoencoder – a variational autoencoder (VAE) – to conduct anomaly detection on the milling data set.

In this post, we’ll see how the VAE is similar to, and different from, a traditional autoencoder. We’ll then implement a VAE and train it on the milling data. In the next post, Part 3, we’ll check the VAE’s anomaly detection performance.

The Variational Autoencoder

The variational autoencoder was introduced in 2013 and today is widely used in machine learning applications. [1] The VAE is different from traditional autoencoders in that the VAE is both probabilistic and generative. What does that mean? The VAE creates outputs that are partly random (even after training) and can also generate new data that is like the data it is trained on.

There are excellent explanations of the VAE online – I’ll direct you to Alfredo Canziani’s deep learning course (video below from YouTube). Regardless, here is my attempt at an explanation.

At a high level, the VAE has a similar structure to a traditional autoencoder. However, the encoder learns different codings; namely, the VAE learns mean codings, µ, and standard deviation codings, σ. The VAE then samples randomly from a Gaussian distribution, with the same mean and standard deviation created by the encoder, to generate the latent variables, z. These latent variables are "decoded" to reconstruct the input.
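
In Keras, this sampling step is usually implemented with the "reparameterization trick," so that gradients can flow through the random draw. Here is a minimal sketch, following Geron’s approach [3]; note that, in practice, the encoder predicts the log-variance, log σ², rather than σ itself, for numerical stability:

```python
import tensorflow as tf
from tensorflow import keras


class Sampling(keras.layers.Layer):
    """Draw codings z from the Gaussian defined by the encoder's outputs."""

    def call(self, inputs):
        mean, log_var = inputs
        # z = mu + sigma * epsilon, with epsilon ~ N(0, 1);
        # exp(log_var / 2) converts the log-variance back into sigma
        return tf.random.normal(tf.shape(log_var)) * tf.exp(log_var / 2) + mean
```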

The figure below demonstrates how a signal is reconstructed using the VAE.

A variational autoencoder architecture (top), and an example of a data sample going through the VAE (bottom). Data is compressed in the encoder to create mean and standard deviation codings. The coding is then created, with the addition of Gaussian noise, from the mean and standard deviation codings. The decoder uses the codings (or latent variables) to reconstruct the input. (Image by author, inspiration from Aurélien Geron)

During training, the VAE works to minimize its reconstruction loss (in our case, binary cross entropy) and, at the same time, force a Gaussian structure using a latent loss. The structure is achieved through the Kullback-Leibler (KL) divergence, with detailed derivations for the losses in the original VAE paper. [1] The latent loss is as follows:

$$\mathcal{L}_{\text{latent}} = -\frac{\beta}{2} \sum_{i=1}^{K} \left[ 1 + \log\left(\sigma_i^{2}\right) - \sigma_i^{2} - \mu_i^{2} \right]$$

where K is the number of latent variables, and β is an adjustable hyper-parameter as introduced by Higgins et al. [2]
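
In code, that latent loss maps to a few lines. Here is a sketch following Geron’s formulation [3], where `codings_mean` and `codings_log_var` stand for the encoder’s two outputs (again, the encoder predicts log σ² rather than σ):

```python
import tensorflow as tf

# Latent (KL) loss: summed over the K codings, then averaged over the batch.
# A beta > 1 pushes the codings toward disentanglement (Higgins et al. [2]).
latent_loss = -0.5 * beta * tf.reduce_sum(
    1 + codings_log_var - tf.exp(codings_log_var) - tf.square(codings_mean),
    axis=-1,
)
latent_loss = tf.reduce_mean(latent_loss)
```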

A VAE learns factors, embedded in the codings, that can be used to generate new data. As an example of these factors, a VAE may be trained to recognize shapes in an image. One factor may encode information on how pointy the shape is, while another factor may look at how round it is. However, in a VAE, the factors are often entangled together across the codings (the latent variables).

Tuning the hyper-parameter beta (β) to a value larger than one can encourage the factors to "disentangle," such that each coding represents only one factor at a time, which makes the model more interpretable. A VAE with a tunable beta is sometimes called a disentangled variational autoencoder, or simply a β-VAE. For simplicity, we’ll still call the β-VAE a VAE.

Data Preparation

Before going any further, we need to prepare the data. Ultimately, we’ll be using the VAE to detect "abnormal" tool conditions, which correspond to the tool being worn or failed. But first, we need to label the data.

As shown in the last post, each cut has an associated amount of flank wear, VB, measured at the end of the cut. We’ll label each cut as either healthy, degraded, or failed according to the amount of wear on the tool – these are the tool health categories. Here’s the schema:
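
As a sketch, the labeling logic looks something like this; treat the exact wear thresholds below as illustrative placeholders (the actual schema is in the labelled CSV discussed shortly):

```python
def tool_class(vb: float) -> int:
    """Map flank wear VB (in mm) to a tool health class.

    The thresholds here are illustrative placeholders; the actual
    cut-offs are stored in the labelled CSV.
    """
    if vb < 0.2:       # healthy
        return 0
    elif vb <= 0.7:    # degraded
        return 1
    else:              # failed
        return 2
```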

I’ve created a data prep class that takes the raw Matlab files and a labelled CSV (each cut is labelled with its associated flank wear) and outputs the training, validation, and testing data. However, I want to highlight one important function in the class: the create_tensor function.

Note: For brevity, I won’t cover all the code – follow along in the Colab notebook and train some models.

The create_tensor function takes an individual cut, breaks it up into chunks, and puts them into a single array. It breaks the cut signal up into chunks using a window of a fixed size (the window_size variable) and then "slides" the window along the signal. The window "slides" by a predetermined amount, set by the stride variable.
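
Here is a minimal sketch of that windowing logic (the real create_tensor also carries the labels along and handles edge cases):

```python
import numpy as np


def sliding_windows(signal: np.ndarray, window_size: int, stride: int) -> np.ndarray:
    """Break a (time, channels) signal into (window_size, channels) chunks.

    The window slides along the time axis by `stride` samples; leftover
    samples that don't fill a complete window are dropped.
    """
    windows = [
        signal[i : i + window_size]
        for i in range(0, len(signal) - window_size + 1, stride)
    ]
    return np.stack(windows)
```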

We’ll take each of the 165 cuts (remember, two cuts from the original 167 are corrupted) and apply a window size of 64 and a stride of 64 (no overlap between windows).

I’ve visually inspected each cut and selected the "stable" cutting region, which usually begins five seconds or so after signal collection starts and ends a few seconds before it stops. This information is stored in the "labels_with_tool_class.csv" file.

With data_prep.py (see the github repo) and some Python magic, we can then create the training/validation/testing data sets. Here is roughly what the script looks like (the class and method names below are hypothetical stand-ins; see data_prep.py for the real interface):
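
```python
from data_prep import MillingDataPrep  # hypothetical name; see data_prep.py

# Window each cut, attach the tool health labels, and split by cut
prep = MillingDataPrep(
    raw_data="mill.mat",
    labels_csv="labels_with_tool_class.csv",
    window_size=64,
    stride=64,
)
(X_train, y_train), (X_val, y_val), (X_test, y_test) = prep.train_val_test_split()
```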

The final distribution of the data is shown below. Notice how imbalanced the data is – there are relatively few "failed" samples. This is a common problem with manufacturing/industrial data, and another reason to use a self-supervised method.

For anomaly detection, it is common to train autoencoders on "normal" data only. We’ll do the same and train our VAE on healthy data (class 0). However, we’ll evaluate the anomaly detection performance on all the data. In other words, we’ll train our VAE on the "slim" data sets but test on the "full" data sets.
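
Continuing with the hypothetical names from the data prep sketch above, building the "slim" sets is just a matter of slicing out the healthy windows:

```python
# Train only on windows from healthy cuts (class 0)...
X_train_slim = X_train[y_train == 0]
X_val_slim = X_val[y_val == 0]

# ...but evaluate anomaly detection on the full sets (all classes)
```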

Building the Model

We now understand what a Variational Autoencoder is and how the data is prepared. Time to build!

Our VAEs will be made up of convolutional layers, batch normalization layers, and max pooling layers. The figure below shows what one of our VAE models could look like.

Example model architecture used in the VAE. The input to the encoder is a milling data sample, with a window size of 64 for an input shape of (64, 6). There are 3 convolutional layers, a filter size of 17, and a coding size of 18. (Image by author)

I won’t be going through all the details of the model. However, here are some important points:

  • I’ve used the temporal convolutional network as the basis for the convolutional layers. The implementation is from Philippe Remy – thanks Philippe! You can find his github repo here.
  • Aurélien Geron’s book, "Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow", is great. In particular, his section on VAEs was incredibly helpful, and I’ve used some of his methods here. There is a Jupyter notebook from that section of his book on his github which is useful. Thanks Aurélien! [3]
  • I’ve used rounded accuracy to measure how the model performs during training, as suggested by Geron – a sketch of the metric follows below.
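
That metric is a one-liner; here is a sketch matching the version in Geron’s notebook:

```python
import tensorflow as tf
from tensorflow import keras


def rounded_accuracy(y_true, y_pred):
    """Binary accuracy after rounding, so near-0/near-1 reconstructions count as matches."""
    return keras.metrics.binary_accuracy(tf.round(y_true), tf.round(y_pred))
```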

Here is, roughly, what the model function looks like:
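
The sketch below is a simplified stand-in: plain Conv1D stacks take the place of the TCN blocks, and it reuses the Sampling layer and rounded_accuracy metric from above. The layer sizes are illustrative, matching the figure rather than any particular trained model:

```python
def build_vae(window_size=64, n_channels=6, codings_size=18,
              n_filters=17, beta=1.0):
    # --- Encoder: compress the (64, 6) window down to mu and log(sigma^2) ---
    inputs = keras.layers.Input(shape=(window_size, n_channels))
    x = inputs
    for _ in range(3):
        x = keras.layers.Conv1D(n_filters, 3, padding="same", activation="relu")(x)
        x = keras.layers.BatchNormalization()(x)
        x = keras.layers.MaxPooling1D(2)(x)          # 64 -> 32 -> 16 -> 8
    x = keras.layers.Flatten()(x)
    codings_mean = keras.layers.Dense(codings_size)(x)
    codings_log_var = keras.layers.Dense(codings_size)(x)
    codings = Sampling()([codings_mean, codings_log_var])

    # --- Decoder: mirror the encoder back up to the input shape ---
    x = keras.layers.Dense(8 * n_filters, activation="relu")(codings)
    x = keras.layers.Reshape((8, n_filters))(x)
    for _ in range(3):
        x = keras.layers.UpSampling1D(2)(x)          # 8 -> 16 -> 32 -> 64
        x = keras.layers.Conv1D(n_filters, 3, padding="same", activation="relu")(x)
    reconstructions = keras.layers.Conv1D(
        n_channels, 3, padding="same", activation="sigmoid")(x)

    vae = keras.Model(inputs=[inputs], outputs=[reconstructions])

    # Latent (KL) loss, scaled to the same per-value footing as the
    # reconstruction loss (Keras averages the binary cross entropy over
    # the window_size * n_channels values)
    latent_loss = -0.5 * beta * tf.reduce_sum(
        1 + codings_log_var - tf.exp(codings_log_var) - tf.square(codings_mean),
        axis=-1)
    vae.add_loss(tf.reduce_mean(latent_loss) / (window_size * n_channels))
    vae.compile(loss="binary_crossentropy", optimizer="adam",
                metrics=[rounded_accuracy])
    return vae
```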

Training the Model

Time to begin training some models. To select the hyperparameters, we’ll be using a random search.

Why a random search? Well, it’s fairly simple to implement and has been shown to yield good results when compared to a grid search. [4] Scikit-learn has some nice methods for implementing a random search – we’ll use the [ParameterSampler](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.ParameterSampler.html) method.
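
Here is a sketch of how the sampler can drive the search; the parameter ranges and epoch count are illustrative, not the ones from the actual experiment:

```python
from scipy.stats import randint, uniform
from sklearn.model_selection import ParameterSampler

# Illustrative search space; the real experiment searches more parameters
param_dist = {
    "codings_size": randint(5, 30),
    "n_filters": randint(8, 32),
    "beta": uniform(0.5, 5.0),
}

for params in ParameterSampler(param_dist, n_iter=1000, random_state=42):
    vae = build_vae(**params)
    vae.fit(X_train_slim, X_train_slim, epochs=25,
            validation_data=(X_val_slim, X_val_slim))
    # save each model and its params for the anomaly detection checks in Part 3
```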

We’ll be training a bunch of different VAEs, all with different parameters. After each VAE has been trained (trained to minimize reconstruction loss), and the model saved, we’ll go through the VAE model and see how it performs in anomaly detection (which we’ll cover in Part 3). Here’s a diagram of the random search training process:

The random search training process has three steps. First, randomly select the hyperparameters. Second, train the VAE with these parameters. Third, check the anomaly detection performance of the trained VAE. (Image by author)

In practice, when I ran this experiment, I trained about 1000 VAE models on Google Colab (yay for free GPUs!). After all 1000 models were trained, I moved them to my local computer, with its less powerful GPU, and checked their anomaly detection performance there. Continuous use of Colab GPUs is limited, so it makes sense to reserve them for training, the most GPU-intensive step.

You can inspect the full training loop in the Colab notebook. Try training some models!

Conclusion

In this post, we learned how a VAE is similar to, and different from, a traditional autoencoder. We then prepared the milling data, created a random search for hyperparameter selection, and began training models.

In the next post, Part 3, we’ll evaluate the trained VAEs and see how they are used for anomaly detection. We’ll use the precision-recall curve to interrogate the model performance. Finally, we’ll create some pretty graphics to visualize the results (my favourite!).

References

[1] Kingma, D. P., & Welling, M. (2013). Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.

[2] Higgins, I., et al. (2016). beta-VAE: Learning basic visual concepts with a constrained variational framework.

[3] Géron, Aurélien. (2019). Hands-on machine learning with Scikit-Learn, Keras, and TensorFlow: Concepts, tools, and techniques to build intelligent systems. O’Reilly Media.

[4] Bergstra, J., & Bengio, Y. (2012). Random search for hyper-parameter optimization. Journal of machine learning research, 13(2).


_This article originally appeared on tvhahn.com. In addition, the work is complementary to research published in IJHM. The official GitHub repo is here._

Except where otherwise noted, this post and its contents are licensed under CC BY-SA 4.0 by the author.

  • Correction: In the original article, I had an error in the latent loss function (the code is fine, though). I have changed it to match that in Geron. In addition, the notation differs from that of Alfredo Canziani. Please refer to his videos – they are very good!
