MANUFACTURING DATA SCIENCE WITH PYTHON

The availability of data in modern manufacturing environments is immense, but the ability to harness that data is often lacking. Fortunately, the tools of data science and machine learning can help and, in turn, unlock considerable value. In this series we’ve been exploring the application of these tools to detect failures during metal machining.
In the previous post we built and trained variational autoencoders to reconstruct milling machine signals. This is shown in steps 1 and 2 in the figure below. In this post, we’ll be demonstrating the last step in the random search loop by checking a trained VAE model for its anomaly detection performance (step 3) – we’ll see if our anomaly detection model can truly detect worn tools.
The anomaly detection is performed using both the reconstruction error (input space anomaly detection) and measuring the difference in KL-divergence between samples (latent space anomaly detection). We’ll see how this is done, and also dive into the results with some pretty data visualizations. Finally, I’ll suggest some potential areas for further exploration.

Background
Input Space Anomaly Detection
Our variational autoencoders have been trained on "healthy" tool wear data. As such, if we feed the trained VAEs unhealthy, or simply abnormal, data, we should see a large reconstruction error. A threshold can be set on this reconstruction error, whereby data producing a reconstruction error above the threshold is considered an anomaly. This is input space anomaly detection.
Note: For brevity, I won’t cover all the code in this post; open the Colab notebook for an interactive experience and to see all the code.
We’ll measure the reconstruction error using the mean squared error (MSE). Because the reconstruction is of all six signals, we can calculate the MSE for each individual signal (the mse function), and for all six signals combined (the mse_total function). Here is what these two functions look like:
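The exact implementations are in the notebook; a minimal sketch of the two functions, assuming X and recon are NumPy arrays of shape (no. of sub-cuts, window length, 6 signals), could look like this:

```python
import numpy as np

def mse(X, recon):
    """MSE between each sub-cut and its reconstruction,
    calculated separately for each of the six signals.
    Returns an array of shape (no. of sub-cuts, 6)."""
    return np.mean(np.square(X - recon), axis=1)

def mse_total(X, recon):
    """MSE between each sub-cut and its reconstruction,
    calculated over all six signals combined.
    Returns an array of shape (no. of sub-cuts,)."""
    return np.mean(np.square(X - recon), axis=(1, 2))
```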
The reconstruction values (recon) are produced by feeding the windowed cut-signals (also called sub-cuts) into the trained VAE, like this: recon = model.predict(X, batch_size=64).
Reconstruction probability is another method of input space anomaly detection (sort of?). An and Cho introduced the method in their 2015 paper [1].
I’m not as familiar with the reconstruction probability method, but James McCaffrey has a good explanation (and implementation in PyTorch) on his blog. He says: "The idea of reconstruction probability anomaly detection is to compute a second probability distribution and then use it to calculate the likelihood that an input item came from the distribution. Data items with a low reconstruction probability are not likely to have come from the distribution, and so are anomalous in some way."
We will not be using reconstruction probabilities for anomaly detection, but it would be interesting to implement. Maybe you can give it a try?
Latent Space Anomaly Detection
Anomaly detection can also be performed using the mean and standard deviation codings in the latent space, which is what we’ll be doing. Here is the general method:
- Using KL-divergence, measure the relative difference in entropy between data samples. A threshold can be set on this relative difference indicating when a data sample is anomalous.
Adam Lineberry has a good example of the KL-divergence anomaly detection, implemented in PyTorch, on his blog. Here is the KL-divergence function (implemented with Keras and TensorFlow) that we will be using:
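The function itself lives in the notebook. A minimal Keras/TensorFlow sketch, computing the usual KL term between the encoding distribution N(µ, σ²) and the standard normal prior, could look like this (the function name is an assumption):

```python
import tensorflow.keras.backend as K

def kl_divergence(mu, log_var):
    """KL-divergence between the encoding distribution N(mu, sigma^2)
    and the standard normal prior, summed over the latent codings."""
    return -0.5 * K.sum(1 + log_var - K.square(mu) - K.exp(log_var), axis=-1)
```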
where mu is the mean (µ) and log_var is the logarithm of the variance (log σ²). The log of the variance is used in training the VAE because it is more numerically stable than the variance itself.
To generate the KL-divergence scores we use the following function:
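Again, see the notebook for the real function. Roughly, it encodes the sub-cuts and returns one KL-divergence score per sample; a sketch, assuming the encoder outputs [mu, log_var, z] as is common for a Keras VAE, might be:

```python
import numpy as np

def get_kl_scores(encoder, X):
    """Return one KL-divergence score per sub-cut (higher = more anomalous)."""
    mu, log_var, _ = encoder.predict(X, batch_size=64)
    return -0.5 * np.sum(1 + log_var - np.square(mu) - np.exp(log_var), axis=-1)
```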
Evaluation Metrics
After we’ve calculated the reconstruction errors or KL-divergence scores, we are ready to set a decision-threshold. Any values above the threshold will be anomalous (likely a worn tool) and any values below will be normal (a healthy tool).
To fully evaluate a model’s performance we have to look at a range of potential decision-thresholds. Two common approaches are the receiver operating characteristic (ROC) curve and the precision-recall curve. The ROC curve plots the true positive rate against the false positive rate. The precision-recall curve, as the name implies, plots precision against recall. Measuring the area under each curve then provides a good method for comparing different models.
We’ll be using the precision-recall area-under-curve (PR-AUC) to evaluate model performance. PR-AUC performs well on imbalanced data, in contrast to the ROC-AUC [2, 3]. Below is a figure explaining what precision and recall are and how the precision-recall curve is built.

Ultimately, the evaluation of a model’s performance and the setting of its decision threshold is application specific. For example, a manufacturer may prioritize the prevention of tool failures over frequent tool changes. Thus, they may set a low threshold to detect more tool failures (higher recall), but at the cost of having more false-positives (lower precision).
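For reference, here is how a precision-recall curve, swept across all decision-thresholds, and its area under the curve can be computed with scikit-learn (dummy labels and scores for illustration; not the notebook's exact code):

```python
import numpy as np
from sklearn.metrics import auc, precision_recall_curve, roc_auc_score

# y_true: 1 = worn/failed tool, 0 = healthy; scores: anomaly scores
# (e.g. reconstruction error or KL-divergence). Dummy values for illustration.
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1])
scores = np.array([0.1, 0.2, 0.15, 0.3, 0.8, 0.4, 0.6, 0.9])

precisions, recalls, thresholds = precision_recall_curve(y_true, scores)
pr_auc = auc(recalls, precisions)
roc_auc = roc_auc_score(y_true, scores)
print(f"PR-AUC: {pr_auc:.3f}, ROC-AUC: {roc_auc:.3f}")
```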
Analyze the Best Model
Now that some of the "background" information is covered, we can begin analyzing the trained VAE models. Normally, you would calculate the performance metric for each model – the PR-AUC score – and see which one is the best. But for the sake of this post, I’ve already trained a bunch of models and selected the top one (based on PR-AUC score).
Here are the parameters of the top model:

Calculate PR-AUC Scores
Let’s see what the PR-AUC scores are for the top model across the training/validation/testing sets. But first, we need to load the data and packages.
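The loading code is in the notebook; in rough outline it looks something like the following (the file names below are placeholders, not the actual paths from the repo):

```python
import numpy as np
import pandas as pd
from tensorflow import keras

# windowed sub-cuts and labels for each split (placeholder file names)
X_train, y_train = np.load("X_train.npy"), np.load("y_train.npy")
X_val, y_val = np.load("X_val.npy"), np.load("y_val.npy")
X_test, y_test = np.load("X_test.npy"), np.load("y_test.npy")

# the trained VAE selected by the random search (placeholder file name);
# a model with custom layers may also need custom_objects when loading
model = keras.models.load_model("best_vae.h5", compile=False)
```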
The get_results function takes a model and spits out its performance across the training, validation, and testing sets. It also returns the precisions, recalls, true positives, and false positives for a given number of iterations (called grid_iterations). Because the outputs from a VAE are partially stochastic (random), you can also run a number of searches (search_iterations) and then take the average across all the searches.
And finally, here is how we generate the results:
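The real get_results function is in the notebook; it also sweeps a grid of decision-thresholds (grid_iterations) and covers both the input and latent space scores. As a simplified sketch of the idea, using only the input space MSE and the hypothetical helpers above:

```python
from sklearn.metrics import auc, precision_recall_curve

def get_results(model, data_dict, search_iterations=3):
    """Simplified sketch: average PR-AUC per data split over several repeats
    (the VAE reconstruction is partially stochastic). data_dict maps split
    names to (X, y) tuples, where y = 1 marks a worn/failed tool."""
    rows = {}
    for split, (X, y) in data_dict.items():
        aucs = []
        for _ in range(search_iterations):
            recon = model.predict(X, batch_size=64)
            scores = mse_total(X, recon)  # anomaly score per sub-cut
            p, r, _ = precision_recall_curve(y, scores)
            aucs.append(auc(r, p))
        rows[split] = np.mean(aucs)
    return pd.DataFrame(rows, index=["pr_auc_input_space"])

df_results = get_results(model, {"train": (X_train, y_train),
                                 "val": (X_val, y_val),
                                 "test": (X_test, y_test)})
```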

Look at the table above: the latent space anomaly detection outperforms the input space anomaly detection. This is not surprising. The information contained in the latent space is more expressive and thus more likely to capture differences between cuts.
Precision-Recall Curve
We want to visualize the performance of the model. Let’s plot the precision-recall curve and the ROC curve for the anomaly detection model in the latent space.
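The plotting code in the notebook is more polished; a bare-bones sketch with matplotlib and scikit-learn, assuming the trained encoder is loaded and the get_kl_scores sketch from earlier is available, could be:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import auc, precision_recall_curve, roc_curve

kl_scores_test = get_kl_scores(encoder, X_test)  # latent space anomaly scores

p, r, _ = precision_recall_curve(y_test, kl_scores_test)
fpr, tpr, _ = roc_curve(y_test, kl_scores_test)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

ax1.plot(r, p, label=f"PR-AUC = {auc(r, p):.3f}")
ax1.axhline(y_test.mean(), ls="--", c="grey", label="no-skill")  # baseline precision
ax1.set_xlabel("Recall"); ax1.set_ylabel("Precision"); ax1.legend()

ax2.plot(fpr, tpr, label=f"ROC-AUC = {auc(fpr, tpr):.3f}")
ax2.plot([0, 1], [0, 1], ls="--", c="grey", label="no-skill")  # random classifier
ax2.set_xlabel("False positive rate"); ax2.set_ylabel("True positive rate"); ax2.legend()

plt.tight_layout()
plt.show()
```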

The dashed lines in the above plots represent what a "no-skill" model would obtain if it were doing the anomaly detection – that is, a model that randomly assigns a class (normal or abnormal) to each sub-cut in the data set. This random model is represented by a diagonal line in the ROC plot, and by a horizontal line, set at a precision of 0.073 (the percentage of failed sub-cuts in the testing set), in the precision-recall plot.
Compare the precision-recall curve and the ROC curve: the ROC curve gives a more optimistic view of the model’s performance, with an area under the curve of 0.883. However, the precision-recall area under the curve is not nearly as high, with a value of 0.450.
Why the difference in area-under-curve values? It is because of the severe imbalance in our data set. This is the exact reason why you would want to use the PR-AUC instead of ROC-AUC metric. The PR-AUC will provide a more realistic view of a model’s performance when dealing with imbalanced data.
Violin Plot for the Latent Space
A violin plot is an effective method of visualizing the decision boundary and seeing where samples are misclassified. What is a violin plot, you say? Let’s build one!
Here’s the violin_plot function that we will use to create the plot. It takes the trained encoder, the sub-cuts (X), the labels (y), and an example threshold.
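The notebook's version is considerably prettier; a pared-down sketch of the idea with seaborn (the layout details are assumptions) might look like this:

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

def violin_plot(encoder, X, y, threshold):
    """Sketch: violin plot of the latent space KL-divergence scores, split by
    tool condition, with an example decision-threshold drawn as a dashed line."""
    mu, log_var, _ = encoder.predict(X, batch_size=64)
    kl = -0.5 * np.sum(1 + log_var - np.square(mu) - np.exp(log_var), axis=-1)

    df = pd.DataFrame({"kl_score": kl,
                       "condition": np.where(y == 1, "worn/failed", "healthy")})

    fig, ax = plt.subplots(figsize=(8, 4))
    sns.violinplot(data=df, x="kl_score", y="condition", inner="point", ax=ax)
    ax.axvline(threshold, ls="--", c="red", label=f"threshold = {threshold}")
    ax.legend()
    plt.show()
```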
We need to load the encoder.
… and plot!
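Something along these lines, assuming the encoder was saved separately from the full VAE (the file name and threshold value are placeholders):

```python
from tensorflow import keras

# load the trained encoder (placeholder file name); custom_objects may be
# needed if the encoder contains a custom sampling layer
encoder = keras.models.load_model("best_vae_encoder.h5", compile=False)

# example decision-threshold (placeholder value)
violin_plot(encoder, X_test, y_test, threshold=2.5)
```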

Nice, huh? You can see in the violin plot how different thresholds would misclassify varying numbers of data points. Imagine the red dashed line, representing the decision-threshold, moving left or right on the plot. This is the inherent struggle with anomaly detection – separating the noise from the anomalies.
Compare Results for Different Cutting Parameters
There are three cutting parameters in the milling data set, each with two levels:
- the metal type (either cast iron or steel)
- the depth of cut (either 0.75 mm or 1.5 mm)
- the feed rate (either 0.25 mm/rev or 0.5 mm/rev)
We can see if our selected anomaly detection model is better at detecting failed tools for one set of parameters than another. We’ll do this by feeding the cuts for one parameter value at a time into the model and observing the results. For example, we’ll first feed in the cuts that were made on cast iron, then the cuts made on steel, and so on.
I’ve skipped over a good code chunk (see Jupyter notebook) since it’s repetitive. But once we have created result dataframes for each unique cutting parameter, we can combine them into a succinct bar chart.
Here’s the code to combine each set of results:
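Roughly, the per-parameter dataframes (built with get_results in the skipped code chunk; the names below are placeholders) can be stacked with pandas:

```python
import pandas as pd

# one results dataframe per cutting parameter value, from the skipped code
# chunk; the dataframe names here are placeholders
param_results = {
    "cast iron": df_cast_iron, "steel": df_steel,
    "0.75 mm depth": df_depth_075, "1.5 mm depth": df_depth_15,
    "0.25 mm/rev feed": df_feed_025, "0.5 mm/rev feed": df_feed_05,
}

df_params = (pd.concat(param_results)
               .reset_index(level=0)
               .rename(columns={"level_0": "cutting_parameter"}))
```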

And now we can make the pretty bar chart.
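A simple seaborn bar chart sketch of the combined results (the column names follow the placeholder dataframes above; the notebook's chart is fancier):

```python
import matplotlib.pyplot as plt
import seaborn as sns

fig, ax = plt.subplots(figsize=(8, 4))
sns.barplot(data=df_params, x="cutting_parameter", y="test", ax=ax)
ax.set_xlabel("Cutting parameter")
ax.set_ylabel("PR-AUC (test set)")
plt.xticks(rotation=30, ha="right")
plt.tight_layout()
plt.show()
```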

Clearly, this "best" model performs better under some cutting parameters than others. Certain cutting parameters may produce signals carrying more information and/or having a higher signal-to-noise ratio.
The model may also develop a preference, during training, for some parameters over others. The preference can be a function of the way the model was constructed (e.g. the beta parameter or the coding size), along with the way the model was trained.
I suspect that there may be model configurations that have different parameter preferences, such as cast-iron over steel. An ensemble of models may thus produce significantly better results. This would be an interesting area of further research!
Trend the KL-Divergence Scores
The KL-divergence scores can be trended sequentially to see how our anomaly detection model works. This is my favourite chart – it’s pretty, and gives good insight.
Note: you can also trend the input space reconstruction errors, but we won’t do that here.
Let’s do some quick exploration to see how these trends will look. We need a function to sort the sub-cuts sequentially:
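Something like the following, assuming each sub-cut has its case number and cut number stored in metadata arrays aligned with the KL-divergence scores (the notebook works from its own labels dataframe):

```python
import numpy as np

def sort_cuts_sequentially(kl_scores, case_numbers, cut_numbers, case):
    """Sketch: return the KL-divergence scores for a single case, ordered by
    cut number (i.e. in the order the cuts were actually performed)."""
    mask = case_numbers == case
    order = np.argsort(cut_numbers[mask])
    return kl_scores[mask][order], cut_numbers[mask][order]
```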
Now do a quick plot of the trend.
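And a quick matplotlib look at the trend for a single case (the case number here is arbitrary; kl_scores, case_numbers, and cut_numbers are the assumed arrays from the sketch above):

```python
import matplotlib.pyplot as plt

scores, cut_order = sort_cuts_sequentially(kl_scores, case_numbers, cut_numbers, case=13)

fig, ax = plt.subplots(figsize=(8, 3))
ax.plot(scores, marker=".")
ax.set_xlabel("Sub-cut (sequential order)")
ax.set_ylabel("KL-divergence score")
plt.show()
```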

We now have all we need to create a plot that trends the KL-divergence score over time.
We’ll trend case 13, which is performed on steel, at slow speed, and is a shallow cut.

Looks good! The model produces a nice clear trend. However, as we’ve seen in the previous section, our anomaly detection model does have some difficulty in discerning when a tool is abnormal (failed/unhealthy/worn) under certain cutting conditions.
Let’s look at another example – case 11.

You can see how the trend increases through the "degraded" area, but then promptly drops off when it reaches the red "failed" area. Why? Well, I don’t know exactly. It could be that the samples at the end of the trend are more similar to healthy samples… I’d be interested in hearing your thoughts.
There is much more analysis that could be done… which I’ll leave up to you. Let me know if you find anything interesting!
Further Ideas
What we’ve done, in this three-part series, is construct a method for anomaly detection, on an industrial data set, using a VAE. I have no doubt that these methods can be significantly improved upon, and that other interesting areas can be explored.
I hope that some industrious researcher or student can use this work as a spring-board, or inspiration, to do some really interesting things! Here are some things I’d be interested in doing further:
- As I mentioned above, an ensemble of models may produce significantly better results.
- The beta in the VAE makes this a disentangled-variational-autoencoder. It would be interesting to see how the codings change with different cutting parameters, and if the codings do represent unique features.
- I used the TCN in the VAE, but I think a regular convolutional neural network, with dilations, would work well too (that’s my hunch). This would make the model training simpler.
- If I were to start over again, I would integrate in more model tests. These model tests (like unit tests) would check the model’s performance against the different cutting parameters. This would make it easier to find which models generalize well across cutting parameters.
Conclusion
In this post we’ve explored the performance of our trained VAE model through several visualizations. We found that the latent space, using KL-divergence, was more effective than the input space anomaly detection.
There is a strong business case for using the tools of data science and machine learning within the field of manufacturing. In addition, the principles demonstrated here can be applied across many domains where anomaly detection is used.
I hope you’ve enjoyed this series, and perhaps, have learned something new!
References
[1] An, J., & Cho, S. (2015). Variational autoencoder based anomaly detection using reconstruction probability. Special Lecture on IE, 2(1), 1–18.
[2] Davis, J., & Goadrich, M. (2006, June). The relationship between Precision-Recall and ROC curves. In Proceedings of the 23rd international conference on Machine learning (pp. 233–240).
[3] Saito, T., & Rehmsmeier, M. (2015). The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PloS one, 10(3), e0118432.
_This article originally appeared on tvhahn.com. In addition, the work is complementary to research published in IJHM. The official GitHub repo is here._
Except where otherwise noted, this post and its contents is licensed under CC BY-SA 4.0 by the author.