When dealing with industrial sensor data, I often tackle anomaly detection use cases. I've been working on this topic with dozens of customers over the past decade, and almost daily for the past five years. The typical end users I interact with are plant managers, process engineers, or operators on manufacturing lines. Most process engineers have an excellent grasp of statistics, and some are even good industrial data scientists. This is more the exception than the rule, though, and you often end up in discussions where you have to answer tough questions:
"I need to know why your model detected an anomaly. I need sound root cause analysis before I adjust my manufacturing process."
"Anomaly detection is not enough: when a model detects an anomaly, it’s already too late. I need prediction to justify investing time and effort into such an approach."
"I need prescription: tell me what I should to prevent a failure from happening."
And so on… Some time ago I posted a few short presentations to demystify anomaly detection on LinkedIn (see this post and this one). In this blog post, I will detail how you can produce similar outputs for your own models and make them smarter. You will then be better prepared to tackle the questions above! In a nutshell, by the end of this article you will have:
- Set up a Jupyter environment in the cloud with Amazon SageMaker
- Cloned the GitHub repo that contains all the code to follow along with this article
- Identified a good dataset, downloaded it, and explored it
- Trained an anomaly detection model using Amazon Lookout for Equipment (a managed service from AWS dedicated to anomaly detection)
- Visualized the raw results from your anomaly detection model
- Postprocessed the results to derive more meaningful insights
So, let’s go for the first step!
Let’s get comfortable: preparing your environment!
I encourage you to follow along this blog post by browsing to GitHub to grab this series of companion Jupyter notebooks. You can use your usual Jupyter environment or fire up one with Amazon SageMaker.
- If you want to use your usual Jupyter environment, already have a trained anomaly detection model, and have back-tested it on some test data to get results, you're good to skip to the next paragraph!
- If you want to use your usual Jupyter environment, have some data and want to try Amazon Lookout for Equipment to train an anomaly detection model, you will need to create an AWS account and make sure your account credentials are accessible from your environment.
- If you want to stay within the AWS environment, you can create an AWS account, fire up an Amazon SageMaker environment and give it access to the Amazon Lookout for Equipment API. You can use the same dataset as this article or bring your own.
From now on, I will assume your environment is ready. The first step is to clone this GitHub repository into it:
git clone https://github.com/michaelhoarau/smarter-anomaly-detection.git
The notebooks from this repository will walk you through the whole process, from data preparation to results post-processing. You can start with the first one, [1_data_preparation.ipynb](https://github.com/michaelhoarau/smarter-anomaly-detection/blob/main/notebooks/1_data_preparation.ipynb).
Data overview
Wait a minute… What do YOU call an anomaly?
You might find this surprising, but this question is very often overlooked! To frame your business problem from a Machine Learning perspective, it is critical to understand:
- The shape of the anomalies you're interested in
- How they build up over time
- What happens when they are triggered
Besides, if you want to understand the potential ROI of building an anomaly detection system, you will also need to understand:
- Who the potential users are
- What types of decisions they need to make
- Which insights can make these decisions easier to make
- How these insights should be delivered
Collecting this knowledge will ensure your system is actually adopted by your end users.
With that said, let’s have a look at different types of anomalies one can capture from time series data:
- Sudden changes: these are the easiest to spot. The change (univariate, or multivariate when it happens across many time series) is abrupt and the shift in values is obvious. In industrial environments, sudden changes are often handled at the edge, where dedicated software components monitor the processes and pieces of equipment concerned.
- Level shifts: these happen when a given time series moves between ranges of values depending on underlying conditions or operating modes. If you want all operating modes to be covered when detecting anomalies, you need to make sure they are all represented in your training data. Otherwise, at inference time, a new operating mode may be flagged as an anomaly and you will end up with many false positives to deal with.
- Trending: a set of signals can drift over time (not necessarily in the same direction). When you want to assess the condition of a process or a piece of equipment, these trending anomalies are great precursor events to look for: they help you build forewarning signals before actual failures happen. The short sketch below illustrates all three patterns.
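To make these patterns concrete, here is a small, self-contained sketch (using numpy and matplotlib, with made-up values) that injects each of the three anomaly types into an otherwise stationary signal:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
n = 1000
signal = rng.normal(loc=10.0, scale=0.5, size=n)  # stationary, noisy baseline

signal[300:305] += 8.0                            # sudden change: short, obvious spike
signal[500:650] += 4.0                            # level shift: new operating range
signal[800:] += np.linspace(0.0, 5.0, n - 800)    # trending: slow drift before a failure

plt.figure(figsize=(12, 3))
plt.plot(signal, linewidth=0.8)
plt.title('Sudden change (t≈300), level shift (t=500-650), trend (t>800)')
plt.show()
```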
Now that we share the same definitions, let's put this into practice by looking at an actual industrial multivariate dataset…
Dataset overview
Searching for an industrial multivariate time series dataset with annotated anomalies and enough historical data is a challenge in itself. The only public dataset with relevant anomalies I've seen is an industrial water pump dataset available on Kaggle (you can download it from this link). It contains 52 sensors spanning April 1, 2018 to August 31, 2018 (5 months) with a 1-minute sampling rate. However, as no license is associated with this dataset, you cannot use it for any commercial purpose.
To help you get started, the first notebook ([synthetic_0_data_generation.ipynb](https://github.com/michaelhoarau/smarter-anomaly-detection/blob/main/notebooks/synthetic_0_data_generation.ipynb)) in the repo mentioned earlier will generate a synthetic multivariate time series dataset and add different types of anomalies to it. It's not the real thing, but it will be close enough to illustrate my thought process. I generated a dataset with 20 signals, 1 year of data and a 10-minute sampling rate.
Let’s load this data and look at a few signals to:
- Identify the failure times (the red dots below)
- Highlight the periods (in yellow below) during which this virtual asset was broken or recovering from its failure:

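For reference, a plot like the one above only takes a few lines of pandas and matplotlib. In the sketch below, the file names, column names and label format are hypothetical placeholders; adapt them to whatever the generation notebook produced on your side:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical file names: adjust to the outputs of the generation notebook
df = pd.read_csv('synthetic_dataset.csv', index_col='Timestamp', parse_dates=True)
labels = pd.read_csv('labels.csv', parse_dates=['start', 'end'])  # known anomalous ranges

fig, ax = plt.subplots(figsize=(12, 4))
ax.plot(df.index, df['signal_00'], linewidth=0.7, label='signal_00')  # hypothetical column

# Highlight the periods during which the virtual asset was broken or recovering
for _, row in labels.iterrows():
    ax.axvspan(row['start'], row['end'], color='gold', alpha=0.3)

ax.legend()
plt.show()
```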
Using the first five months of this dataset to train a model (from January 2021 to the end of May 2021) leaves us with at least two anomalous ranges to evaluate the model against:
- One is the yellow band visible in November 2021
- The other is the strange spike visible on the green signal in early September 2021
Let’s plot a strip chart for these 20 time series:

I use strip charts to compact time series signals information. Once compacted, each time series becomes a colorful band where:
- Low values of the signal are green
- Medium values are orange
- And high values are red
This simple strip band is very handy to show when low or high values occur. When putting together all the strip bands of the 20 signals of our dataset, we get the diagram above. You can see some vertical red areas that might be of interest. Some of them match the known periods where the asset was marked as broken (around November).
In some cases, we can also see a shift from red to green for many signals from left to right: this could be a sign that a drift is occurring in a dataset after a certain date. Here is an example of a dataset I encountered previously where you can see a shift occurring after December 2017:

After further investigation, this type of information can be critical for setting up a retraining trigger for any anomaly detection model you train. Without it, your model will gradually issue more and more false positives, rendering it unusable and probably losing the trust of its end users.
To generate strip charts, I used my tsia package (pip install tsia).
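The package wraps the rendering logic for you, so the exact call is best looked up in the tsia documentation. As a rough, hand-rolled approximation of what a strip chart does (binning each signal into terciles and painting each one as a green/orange/red band), you could write something like this:

```python
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

# df: dataframe with a DatetimeIndex and one column per signal (loaded earlier)
# Bin each signal into terciles: 0 = low, 1 = medium, 2 = high
binned = df.apply(lambda s: pd.qcut(s.rank(method='first'), q=3, labels=False))

cmap = ListedColormap(['tab:green', 'tab:orange', 'tab:red'])
fig, ax = plt.subplots(figsize=(12, 6))
ax.imshow(binned.T.values, aspect='auto', cmap=cmap, interpolation='nearest')
ax.set_yticks(range(len(df.columns)))
ax.set_yticklabels(df.columns, fontsize=7)
ax.set_xlabel('Time index')
plt.show()
```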
If you want to know more about strip charts, check out this previous article where I dive deeper into how they are produced:
Using strip charts to visualize dozens of time series at once
Once your synthetic dataset is generated, you can use the second companion notebook ([synthetic_1_data_preparation.ipynb](https://github.com/michaelhoarau/smarter-anomaly-detection/blob/main/notebooks/synthetic_1_data_preparation.ipynb)) to prepare the data for Amazon Lookout for Equipment. Feel free to adapt it to format the data for your own anomaly detection model. You're then ready to train your model!
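For reference, Amazon Lookout for Equipment reads its training data as CSV files from Amazon S3, with a timestamp column followed by one column per sensor, organized under one prefix per component. A minimal sketch of that export step could look like the following (bucket, prefix and component names are placeholders):

```python
import boto3

# df: the prepared dataframe with a DatetimeIndex and one column per sensor (from above)
BUCKET = 'my-lookout-equipment-bucket'   # placeholder
PREFIX = 'training-data/synthetic'       # placeholder

export = df.copy()
export.index.name = 'Timestamp'          # timestamp column comes first in the CSV
export.to_csv('/tmp/synthetic-component.csv')

# One subfolder per component under the prefix; here a single virtual component
boto3.client('s3').upload_file(
    '/tmp/synthetic-component.csv',
    BUCKET,
    f'{PREFIX}/synthetic-component/synthetic-component.csv',
)
```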
Training an anomaly detection model
I will use Amazon Lookout for Equipment to train an anomaly detection model on the previous dataset. To follow along, you will need an AWS account. Then, from the GitHub repo mentioned earlier, you can run the third companion notebook ([synthetic_2_model_training.ipynb](https://github.com/michaelhoarau/smarter-anomaly-detection/blob/main/notebooks/synthetic_2_model_training.ipynb)): it will push the data to Amazon S3 (the AWS object storage service that most managed AWS AI/ML services use for input datasets), ingest it, and train an anomaly detection model.
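Under the hood, this boils down to a few calls to the Lookout for Equipment API. Here is a heavily stripped-down sketch using boto3, assuming the dataset has already been created and ingested (as the notebook does); the model name, dataset name, role ARN and dates are placeholders, and the full parameter list (target sampling rate, labels, etc.) is documented in the boto3 reference:

```python
import time
from datetime import datetime
import boto3

l4e = boto3.client('lookoutequipment')

# Placeholders: adjust names, dates and the IAM role to your own setup
l4e.create_model(
    ModelName='synthetic-anomaly-model',
    DatasetName='synthetic-dataset',
    RoleArn='arn:aws:iam::123456789012:role/l4e-access-role',
    TrainingDataStartTime=datetime(2021, 1, 1),
    TrainingDataEndTime=datetime(2021, 5, 31),
    EvaluationDataStartTime=datetime(2021, 6, 1),
    EvaluationDataEndTime=datetime(2021, 12, 31),
)

# Training can take from tens of minutes to a few hours depending on the dataset
while l4e.describe_model(ModelName='synthetic-anomaly-model')['Status'] == 'IN_PROGRESS':
    time.sleep(60)
```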
If you want to know more about Amazon Lookout for Equipment, there are six chapters dedicated to this service in my book, Time Series Analysis on AWS:
Time Series Analysis on AWS: Learn how to build forecasting models and detect anomalies in your…
Feel free to have a look at this blog post and this other one to get more details about what you can find in these chapters:
Time series analysis on AWS – Part 2 presentation – Multivariate anomaly detection
The training process includes a back-test evaluation on historical data, where you can see the events the service would have detected had it been running at the time. Like other anomaly detection models, the raw results from Amazon Lookout for Equipment look like this:

At the top, I plotted the time series for some of the sensors. Below, in green, are the known periods when the virtual asset was recovering from a failure. At the bottom, in red, are the events detected by Lookout for Equipment. When seeing these results, someone may point out that:
"There are some false positives, I don’t have time to investigate each event!"
"Your model only detects anomalies when they already happened, it’s useless!"
"I have hundreds of sensors: when an anomaly is detected by your model, I still have to investigate my whole operations, I’m not saving any time here!"
Sound familiar? Let's see how we can derive more insights from your anomaly detection model and start earning more trust from your end users…
Postprocessing the model’s raw outputs
As shown before, even the most basic anomaly detection models are able to flag the timestamps in your time series dataset that they consider abnormal. Let's now open the fourth notebook of the repo to start post-processing these results in more detail.

Taking a literal interpretation of this anomaly detection model's output, you may say that it yields too many false positives (all the red events happening before the failure in November 2021). You might expect the model to flag the failure periods and nothing before or after. But this expectation rests on several assumptions:
- You have a precise date for the failure
- You actually know it’s a failure and not a maintenance event
- You know that no precursor event is triggered by your equipment or process before the failure
That's a lot of assumptions, and most of the time you won't be in this exact situation: any abnormal event visible in your time series will be a precursor event, a detectable anomaly forewarning a future event, a failure, a maintenance activity, or a healing period while your industrial process recovers from an issue.
This gives us some leeway to postprocess the raw outputs of our anomaly detection model, especially when the goal is to understand the condition your asset or process is in (condition monitoring, as used in condition-based maintenance approaches).
Measuring event rates
Anomaly detection models fire events, and as a user, you are the one who must decide whether an event is an anomaly you care about, an anomaly you don't want to capture, a precursor event forewarning of a failure to come, or a false positive. Filtering out the last category can be driven by measuring how many events you get over a period of interest. For instance, measuring the number of daily events on similar results I obtained from another dataset yielded a plot like this one:

Based on this, you can decide to only notify an operator once the daily event rate reaches at least 200 events per day. This would let you react to only 3 events out of the 41 detected during this period. My anomaly detection model outputs a dataframe with a status for each timestamp (0 if nothing is detected, 1 if an anomaly is found). My data have a sampling rate of 5 minutes, so a rolling window over 12 x 24 = 288 periods covers a full day of data.
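As a minimal sketch, assuming the back-test results live in a pandas dataframe called results, indexed by timestamp, with an anomaly column holding that 0/1 status:

```python
import pandas as pd

# results: dataframe indexed by timestamp with an 'anomaly' column (0 or 1), assumed
# 5-minute sampling rate -> 12 periods per hour x 24 hours = 288 periods per day
daily_event_rate = results['anomaly'].rolling(window=288).sum()

# Only surface an alert to the operators once the daily rate crosses the threshold
THRESHOLD = 200
alerts = daily_event_rate[daily_event_rate >= THRESHOLD]
```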
Using this simple technique, you can address, at least partly, some of the concerns voiced earlier. You can react when the event rate starts to grow too large (allowing you to move from detecting to predicting) and filter out false positives (when only scarce events are detected).
Let's now try to address the last concern: the time wasted because of the lack of precision of anomaly detection models when it comes to helping with root cause analysis…
Measuring and plotting variables contributions
Many anomaly detection models also yield explainability details, such as the contribution of each variable to any given detected event. Amazon Lookout for Equipment is no different, and each model's results include the following field in the JSON output for each detected event:
'predicted_ranges': [
    {
        'start': '2019-08-08T00:42:00.000000',
        'end': '2019-08-08T01:48:00.000000',
        'diagnostics': [
            {'name': 'syntheticsignal_00', 'value': 0.052},
            {'name': 'syntheticsignal_01', 'value': 0.023},
            {'name': 'syntheticsignal_02', 'value': 0.038},
            {'name': 'syntheticsignal_03', 'value': 0.023},
            ...
            {'name': 'syntheticsignal_17', 'value': 0.049},
            {'name': 'syntheticsignal_18', 'value': 0.033},
            {'name': 'syntheticsignal_19', 'value': 0.046},
            {'name': 'syntheticsignal_20', 'value': 0.044}
        ]
    },
    ...
]
If you're following along with my companion notebooks, you will see how I reformat this type of output into an expanded dataframe suitable for further plotting and processing:

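The expansion itself boils down to flattening the nested diagnostics lists. One possible way to do it, assuming predicted_ranges holds the structure shown earlier and using the 10-minute sampling rate of the synthetic dataset:

```python
import pandas as pd

# predicted_ranges: the list of detected events shown above (assumed already loaded)
expanded = []
for event in predicted_ranges:
    # Spread each event's signal contributions over the timestamps it covers
    index = pd.date_range(event['start'], event['end'], freq='10min')
    contributions = {d['name']: d['value'] for d in event['diagnostics']}
    expanded.append(pd.DataFrame(contributions, index=index))

# Wide dataframe: one row per timestamp, one column per signal contribution
df = pd.concat(expanded).sort_index()
```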
Now, instead of the event rate, let’s plot the evolution of the variable importance over time:

I allocated a different color to each sensor, but this plot is not very readable yet. It looks like one of the green signals is the key contributor to the first large event on the left, while another signal, a red one, trends higher for the event on the right. Let's use a cumulative bar plot instead:

This is a bit better: we can follow the trends more easily and see that several signals can be the top contributors for a given event. However, to understand which ones matter most, we don't need to plot every signal. Let's focus on the top 5 and aggregate all the others into an "Other signals" category:

It's faster to plot and a lot more readable. With 20 signals, if every signal contributed equally to a given event, each would have a 5% contribution and any 5 signals a 25% contribution. We can see above that the top 5 sensors consistently reach 30% to 60% of the contribution, which makes them good candidates to investigate further…
Now, let’s add a few signals and the events (known or detected) for some additional context around this bar plot:

You will find all the detailed code in the [synthetic_3_model_evaluation.ipynb](https://github.com/michaelhoarau/smarter-anomaly-detection/blob/main/notebooks/synthetic_3_model_evaluation.ipynb) notebook. From the expanded results dataframe shown above (let's call it df), the previous bar plot is generated with a few lines of pandas and Matplotlib.
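As a sketch of that logic (the notebook's actual code and styling differ), we rank the signals by overall contribution, keep the top 5 and stack everything else into an "Other signals" band:

```python
import matplotlib.pyplot as plt

# Rank the signals by overall contribution and keep the top 5
top_signals = df.sum().sort_values(ascending=False).head(5).index
plot_df = df[top_signals].copy()
plot_df['Other signals'] = df.drop(columns=top_signals).sum(axis=1)

# Resample to a daily granularity to keep the bar plot readable
plot_df = plot_df.resample('1D').mean().dropna(how='all').fillna(0.0)

fig, ax = plt.subplots(figsize=(12, 4))
bottom = None
for col in plot_df.columns:
    ax.bar(plot_df.index, plot_df[col], bottom=bottom, width=0.8, label=col)
    bottom = plot_df[col] if bottom is None else bottom + plot_df[col]

ax.set_ylabel('Contribution')
ax.legend(ncol=3, fontsize=8)
plt.show()
```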
What’s next?
In this article, you learned how to leverage event rates to move your model from mere detection to some level of prediction, and how to start getting a better understanding of which signals to investigate first.
In my next article, I will detail how I go even deeper to better understand the results of my anomaly detection models:
Top 3 Ways Your Anomaly Detection Models Can Earn Your Trust
I hope you found this article insightful: feel free to leave me a comment here and don’t hesitate to subscribe to my Medium email feed if you don’t want to miss my upcoming posts! Want to support me and future work? Join Medium with my referral link: