
Anomaly Detection in Process Control Data with Machine Learning

Introducing anomaly detection with data you can generate on your own desktop with the Temperature Control Lab device

Hands-on Tutorials

Photo by Dimitry Anikin on Unsplash

Anomaly detection is a powerful application of machine learning to real-world problems. From detecting fraudulent transactions to forecasting component failure, we can train a machine learning model to determine when something out of the ordinary is occurring.

When it comes to machine learning, I’m a huge advocate for learning by experiment. The actual math behind machine learning models can be a bit of a black box, but that doesn’t keep it from being useful; in fact, I feel like that’s one of the advantages of machine learning. You can apply the same algorithms to solving a whole gamut of problems. Sometimes, the best way to learn is to handle some real data and see what happens with it.

In this example, we’ll generate some real data with the Temperature Control Lab (TCLab) and train a supervised classifier to detect anomalies. The TCLab is a great little plug-and-play Arduino device for generating real data. If you want to create your own data for this, there are great introductory resources found here; otherwise, I included data I generated on my own TCLab device in the GitHub repository. If you want to run these examples on your own, you can download and follow the code [here](https://github.com/nrlewis929/TCLab_anomaly_detection_basic).

Problem Framework

The TCLab is a simple Arduino device with two heaters and two temperature sensors: a power plug drives the heaters, and a USB port communicates with the computer. The heater level can be adjusted in a Python script (be sure to pip install tclab), and each temperature sensor reads the temperature surrounding its heater. For this example, we’ll keep it basic and use just one heater and sensor, but the same principles also apply to the 2-heater system, or even to more complex systems such as what you might find in a chemical refinery.
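If you’ve never used the device, a quick connection check is only a few lines. Here’s a minimal sketch, assuming the TCLab is plugged in and the tclab package is installed:

```python
import tclab

# Connect to the device, read a temperature, and exercise heater 1
with tclab.TCLab() as lab:
    print(f'Initial temperature: {lab.T1:.2f} °C')
    lab.Q1(70)   # heater 1 at 70% power
    lab.Q1(0)    # heater 1 back off
```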

Image from APMonitor. Reposted with permission.

We can imagine the problem as this: we have a heater for our garage workshop, with a simple on/off setting. We can program the heater to turn on or off for certain amounts of time each day to keep the temperature at a comfortable level. There are, of course, more sophisticated control systems we could use; however, this is designed as an introduction to anomaly detection with machine learning, so we’ll keep the raw data simple for now.

Under normal circumstances, it’s a simple enough exercise to verify that the heater is on and doing its job – just look at the temperature and see if it’s going up. But what if there are external factors that complicate the evaluation? Maybe the garage door was left open and is letting in a draft, or perhaps some of your equipment starts overheating. Or worse yet, what if there’s a cyberattack on the temperature control, and it’s masked by the attackers? Is there a way to look at the data we’re gathering and determine when something is going wrong? This is the heart of anomaly detection.

With a supervised classifier, we can use the sensor data to train a model to classify when the heater is on or off. Since we also know when the heater should be on or off, we can then apply the classifier to any new data coming in and check whether its predictions line up with the heater status we expect. If the two don’t match, we know there’s an anomaly of some type and can investigate further.

Generating Data

To simulate this scenario, we’ll generate a data file from the TCLab, turning the heater on and off at different intervals. If you’re having trouble with your TCLab setup, there are some great troubleshooting resources here.

To start, we’ll set up a few arrays to store the data, which we’ll collect at 1-second intervals. The heater on/off cycling will be pretty simple for now: just switching it on or off and leaving it for a few minutes at a time. We’ll run for an hour to make sure we have plenty of data to train and validate on.

The TCLab inputs are straightforward: we set lab.Q1 to on or off, depending on the schedule we generated. We only turn the heater on to 70% power to avoid overheating, and then record the temperature at each time point from lab.T1. Finally, we let the loop delay for 1 second.
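Here’s a sketch of what the data generation loop might look like. The 5-minute cycle schedule and the output filename are illustrative choices, not necessarily the exact ones used for the data in the repository:

```python
import time

import numpy as np
import pandas as pd
import tclab

run_time = 60 * 60                       # one hour of data at 1-second intervals

# Illustrative schedule: heater on at 70% for 5 minutes, then off for 5 minutes
Q1 = np.zeros(run_time)
for start in range(0, run_time, 600):
    Q1[start:start + 300] = 70

T1 = np.zeros(run_time)
with tclab.TCLab() as lab:
    for i in range(run_time):
        lab.Q1(Q1[i])                    # set the heater level
        T1[i] = lab.T1                   # read the temperature sensor
        time.sleep(1)                    # 1-second delay between readings

pd.DataFrame({'time': np.arange(run_time), 'Q1': Q1, 'T1': T1}).to_csv(
    'train_data.csv', index=False)
```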

The anomalous data is created in basically the same way, but over only 20 minutes. The big difference is that I blow a fan across the heater at a specified time, simulating a draft in the garage. The convection from the fan will naturally cool the system, counteracting what the heater is trying to do. Since this is unanticipated, we should see the classifier pick it up – perhaps indicating the heater is off when we know it’s on. Let’s go ahead and find out if it works!
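Continuing the sketch above, the anomalous run is the same loop over a shorter window; the fan is a physical disturbance, so it never appears in the code. The 2-minute warm-up and the filename are again illustrative:

```python
run_time = 20 * 60                       # 20 minutes this time
Q1 = np.zeros(run_time)
Q1[120:] = 70                            # heater on after a 2-minute warm-up

T1 = np.zeros(run_time)
with tclab.TCLab() as lab:
    for i in range(run_time):
        lab.Q1(Q1[i])
        T1[i] = lab.T1                   # fan is switched on by hand mid-run
        time.sleep(1)

pd.DataFrame({'time': np.arange(run_time), 'Q1': Q1, 'T1': T1}).to_csv(
    'anomaly_data.csv', index=False)
```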

Data Preprocessing

One of the keys to machine learning is figuring out how to frame your data in a way that is useful for the model. When I do any machine learning project, this is often the most time-consuming step. If I get it right, the model works like a charm; otherwise, I can spend hours or even days in frustration, wondering why my model won’t work.

You should always scale your data for machine learning applications, and this can easily be done with the MinMaxScaler from scikit-learn. In addition, check that the format of your data is correct for the model (for example, does the shape of a numpy array match what the classifier expects? If not, you’ll likely get a warning). Let’s see what this looks like so far. Note that we don’t scale the y data because it’s already just 0’s and 1’s.
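As a sketch, assuming the training run was saved as train_data.csv with the column names from the generation loop above:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

data = pd.read_csv('train_data.csv')

X = data[['T1']].values                   # input: raw temperature, shape (n_samples, 1)
y = (data['Q1'].values > 0).astype(int)   # output: heater on (1) or off (0)

scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)        # scale X only; y is already 0's and 1's
```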

You’ll notice that our input is just the temperature, and the output we’re trying to predict is the heater status (on or off). Is this effective? Can the classifier tell whether the heater is on or off based only on the temperature? Thankfully, with scikit-learn, training and predicting with a supervised classifier is just a few lines of code. We’ll use the Logistic Regression classifier here, but there are many other supervised classifiers you could try.
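Continuing the sketch above, training and predicting really is this short:

```python
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()
clf.fit(X_scaled, y)                      # train on temperature alone
y_pred = clf.predict(X_scaled)            # predicted heater status

print(f'Training accuracy: {(y_pred == y).mean():.2%}')
```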

Plotting the results yields this:

Plot by author

The answer to whether we can just use the raw temperature as the input is an emphatic no. This makes intuitive sense: at 35°C, for example, the heater could just as easily be on (the system warming up through that temperature) or off (the system cooling back down through it), so the temperature alone is ambiguous. The change in temperature, however, would tell us a whole lot about what the heater is doing.

Feature Engineering

This introduces the realm of feature engineering, another important part of setting up a problem for machine learning. We can create additional features out of what we already have. What other data can we glean from just the raw temperature? We can take the 1st and 2nd derivatives, compute the standard deviation over the past few readings, and even look at long-term difference trends. Another trick to try is log scaling the data, especially if it’s log-distributed. These are all new features, and some of them may well be the silver bullet to feed into the classifier for good performance!

Again, some domain knowledge and intuition about the specific problem is really useful for feature engineering. Let’s start by looking at the change in temperature. There’s a bit of sensor noise, so taking a rolling average of the temperature first gives a nicer signal that will likely result in better performance.

Change in temperature data based on raw temperature (dT) and rolling average temperature (dT_ave). Plot by author

Let’s see how that does. The only difference is that we create the smoothed change-in-temperature column in our dataframe and use it as the input feature. We also have to drop any rows with NaN values, since the rolling average and difference leave a few at the start of the series.
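A sketch of the feature engineering and retraining, following the naming in the plot above (the 10-point rolling window is an assumption):

```python
# Change in temperature from the raw signal and from a smoothed signal
data['dT'] = data['T1'].diff()
data['dT_ave'] = data['T1'].rolling(window=10).mean().diff()
data = data.dropna()                      # rolling/diff leave NaNs at the start

X = data[['dT_ave']].values               # use the smoothed change in temperature
y = (data['Q1'].values > 0).astype(int)

scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

clf = LogisticRegression()
clf.fit(X_scaled, y)
```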

The result…is pretty good! There are a few misses, but those always occur right at the transitions, when the heater has just turned on or off. We could look into more features, such as the 2nd derivative, that might help with this, but for now, this looks like it will do the trick.

Plot by author

Validating the Classifier on New Data

Best practice dictates validating the classifier’s performance on data it has never seen before putting it to use. I personally like to try it out on the original training data first, just as a gut check; if it doesn’t work there, it won’t work on new data. We saw that it worked well on the data above, so let’s now do this check on a new set of data. You can either generate a new anomaly-free data file from the TCLab, or use train_test_split from scikit-learn to hold out part of the original run. Note that we only use the transform method when scaling this data, since we’ve already fit the scaler and used that scaling to train the classifier.
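Here’s one way that might look, assuming a second anomaly-free run saved as validation_data.csv (the filename is an assumption) and reusing the fitted scaler and classifier from above:

```python
new_data = pd.read_csv('validation_data.csv')

# Same feature engineering as the training data
new_data['dT_ave'] = new_data['T1'].rolling(window=10).mean().diff()
new_data = new_data.dropna()

# transform only: the scaler was already fit on the training data
X_new = scaler.transform(new_data[['dT_ave']].values)
y_new = (new_data['Q1'].values > 0).astype(int)

y_pred = clf.predict(X_new)
print(f'Accuracy on unseen data: {(y_pred == y_new).mean():.2%}')
```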

Here’s a plot of the results:

Plot by author

It checks out. Again, there are the fringe cases when the heater turns on or off. We could look into additional features to feed into the classifier, or even change the hyperparameters for the classifier, but for a simple demo, this is sufficient.

Anomaly Detection with the Trained Classifier

Here is where the rubber hits the road. We know the classifier works well to tell us when the heater is on or off. Can it tell us when something unexpected is happening?

The process is really the same. The only difference is now we have an anomaly in the data. We’ll create the new feature (change in temperature, using the smoothed rolling average), scale the data, and feed it into the classifier to predict.
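As a sketch, reusing the fitted scaler and classifier from above along with the anomaly_data.csv file from the generation step:

```python
anomaly = pd.read_csv('anomaly_data.csv')

anomaly['dT_ave'] = anomaly['T1'].rolling(window=10).mean().diff()
anomaly = anomaly.dropna()

X_anom = scaler.transform(anomaly[['dT_ave']].values)    # same fitted scaler
heater_pred = clf.predict(X_anom)                        # what the classifier sees
heater_known = (anomaly['Q1'].values > 0).astype(int)    # what we commanded

# A sustained mismatch between the two flags a potential anomaly
mismatch = heater_pred != heater_known
print(f'Mismatched points: {mismatch.sum()} of {len(mismatch)}')
```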

Here are the results:

Plot by author

This looks like it worked well! The classifier predicts that the heater is off while the fan is running, when in reality the heater is on; there’s just an anomalous event. Even after the fan goes off, the system takes some time to re-equilibrate, so there’s some additional mismatch between what’s actually going on with the heater and what the classifier is predicting. An operator watching the incoming data can compare what they expect the heater to be doing with what the classifier is predicting; any discrepancy indicates an anomaly that can be investigated.

Additional Ideas

This was a neat experiment with just one type of anomaly, using data that we generated right here on our tabletop! Sure, it’s simple, but these kinds of problems can quickly become complex. The guiding principles remain the same, though.

What other anomalous events might happen? We could simulate anomalous outside heat by turning on heater 2. We could simulate colder ambient temperature by running the TCLab in a colder room. There’s also a neat trick with these devices: if you make a cell phone call near the sensor, the waves interfere with the signal and you can get a completely new type of anomaly. What kinds of features would be useful for detecting these other types of anomalies? Could we tune the hyperparameters to improve the performance?

Finally, perhaps my favorite strategy when it comes to classifiers is to train several different classifiers, and then take an aggregate score. Go ahead and try out some other types of classifiers from scikit-learn. Which ones work well? Which don’t? If you find 4 or 5 that work pretty well, you could train all of them, and then only flag something as an anomaly if a majority of classifiers show a mismatch.
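A minimal sketch of that majority-vote idea, with an illustrative set of scikit-learn classifiers (reusing X_scaled, y, X_anom, and heater_known from above):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

classifiers = [
    LogisticRegression(),
    RandomForestClassifier(),
    KNeighborsClassifier(),
    SVC(),
]
for c in classifiers:
    c.fit(X_scaled, y)

# One row of heater predictions per classifier
votes = np.array([c.predict(X_anom) for c in classifiers])

# Flag a point only when a majority of classifiers disagree with the
# known heater status
mismatches = (votes != heater_known).sum(axis=0)
flagged = mismatches > len(classifiers) / 2
```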

Was this helpful? Did you like being able to generate your own data? What other ideas have you found useful for anomaly detection? Thanks for reading, and I hope this introduction opened up some new ideas for your own projects.

