Unhappy about your time series anomalies? Synthesize them!

Generate your own multivariate time series datasets with realistic anomalies

Photo by Maxim Berg on Unsplash

If you’ve worked with anomaly detection problems on time series data, you may have searched for annotated datasets that include relevant anomalies. And you may have struggled in this search, especially when looking for multivariate time series data suitable for researching industrial IoT use cases. In addition, even when you work with real-life sensor data, you might still struggle to find anomalies that, by definition, are rare in a manufacturing process.

While writing some articles about anomaly detection, I searched for a good dataset to illustrate my thought process. Basically, I needed:

  • A multivariate dataset including multiple sensors
  • Spanning several months (ideally one year)
  • With a reasonable sampling rate (around 1 to 5 minutes)
  • Including some known anomalies to validate my results

The only dataset I found that satisfied all these criteria was a water pump dataset available on Kaggle. Unfortunately, no license was associated with this dataset, making it impossible to use in publications such as Medium, as this is considered commercial use.

So I decided to take a stab at generating my own multivariate datasets with the types of anomalies I encounter when dealing with industrial sensors and manufacturing process data…

I encourage you to follow along with this blog post by grabbing the series of companion Jupyter notebooks from GitHub. You can use your usual Jupyter environment or fire one up with Amazon SageMaker. After you have cloned the repo, you can open the first notebook (synthetic_0_data_generation.ipynb) and follow along with this article.

Initialization

We are going to build our dataset from scratch, only with basic Python libraries:
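As a minimal sketch of what such an import cell might look like (the aliases are the usual conventions, and the short relativedelta illustration at the end is my own assumption rather than part of the companion notebook):

```python
import random
from datetime import datetime

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from dateutil.relativedelta import relativedelta

# Illustration only: relativedelta applies calendar-aware offsets,
# so adding one month to January 31st lands on the last day of February.
one_month_later = datetime(2021, 1, 31) + relativedelta(months=1)
print(one_month_later)
```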

Nothing fancy here: data processing libraries (numpy and pandas), the random generator library and matplotlib to visualize our time series. When dealing with date operations, I like using relativedelta from the dateutil package: we will use it when it comes to combining several types of anomalies in a given signal.

And with this, we are ready to go! Let’s start by generating the base signals of our time series…

Generating the baseline

We will first define the extent of our dataset:
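A minimal sketch of this step, assuming an arbitrary calendar year (2021) and hypothetical variable names; only the one-year span and the 10-minute sampling rate come from this article:

```python
import pandas as pd

# Assumed extent: the year itself is arbitrary, pick any you like
start_date = "2021-01-01 00:00:00"
end_date = "2021-12-31 23:50:00"
sampling_rate = "10min"

# Regular datetime index covering the whole year
date_index = pd.date_range(start=start_date, end=end_date, freq=sampling_rate)

# Empty dataframe: no columns yet, only the datetime index
df = pd.DataFrame(index=date_index)
print(len(df))
```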

I want to generate one year of data with a regular sampling rate of 10 minutes. I then generate an empty dataframe with this datetime index.

Generating values at random

Your first instinct might be to generate random values along the date time axis we just created:
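Something along these lines would do it. This is a sketch under my own assumptions (a short 1,000-point index, uniform values between 0 and 1, and a fixed seed for reproducibility), not necessarily the exact notebook code:

```python
import numpy as np
import pandas as pd

# Assumed for illustration: a short index and a fixed seed
rng = np.random.default_rng(42)
date_index = pd.date_range("2021-01-01", periods=1000, freq="10min")

# One independent uniform draw per timestamp: pure white noise
white_noise = pd.Series(rng.random(len(date_index)), index=date_index, name="signal")
```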

This piece of code yields the following result:

Random values generation (image by author)

This white noise doesn’t look realistic: there is no pattern to look for, and using this approach to generate multiple signals to synthesize a multivariate dataset won’t be very useful to simulate a real process. If you are measuring the temperature of a machine, the values won’t evolve this chaotically. This is where random walks come into play…

Leveraging a random walk process

Real data should display patterns: at a given point in time, a value will have some relationship with the previous values. In probability theory, a random walk is a process in which each new position (here, the next value of our time series) is obtained from the previous one by taking a random step in some direction.

Here is the function I use to generate random walks: my time series starts with an initial value (start) and then I randomly add or subtract a quantity (step). I can use the min_value and max_value parameters to constrain my time series within a certain range. I can also play with the probability parameter to give a decreasing or increasing tendency to my time series:
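A minimal sketch of such a function, assuming the parameter names mentioned above. The function name, the index argument, the seed argument, and the choice of a fixed-size step of plus or minus step are my own assumptions, so the notebook version may differ:

```python
import numpy as np
import pandas as pd

def generate_random_walk(index, start=0.0, step=1.0, min_value=None,
                         max_value=None, probability=0.5, seed=None):
    """Return a random walk as a pandas Series aligned on `index`."""
    rng = np.random.default_rng(seed)
    values = np.empty(len(index), dtype=float)
    current = float(start)
    for i in range(len(index)):
        values[i] = current
        # Move up with the given probability, down otherwise
        direction = 1.0 if rng.random() < probability else -1.0
        current += direction * step
        # Keep the walk inside the optional bounds
        if min_value is not None:
            current = max(current, min_value)
        if max_value is not None:
            current = min(current, max_value)
    return pd.Series(values, index=index)

date_index = pd.date_range("2021-01-01", periods=500, freq="10min")
# A walk constrained between 45 and 55
bounded = generate_random_walk(date_index, start=50.0, step=0.5,
                               min_value=45.0, max_value=55.0, seed=0)
# A walk with an increasing tendency (70% chance of moving up)
rising = generate_random_walk(date_index, start=0.0, step=0.1,
                              probability=0.7, seed=1)
```

A probability above 0.5 pushes the series upward on average, while a value below 0.5 pushes it downward.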

Let’s generate a few plots to visualize the behavior we can obtain:

And here are the plots resulting from this code:

Random walk-generated time series (image by author)

This looks a lot better: these signals actually look quite realistic! I will now use this function to generate the baseline for a multivariate dataset of 20 signals located within different ranges (I use different starting values for this). This will simulate sensor data measuring different dimensions of a process (for instance). Here is the result I obtain:

Baseline of a multivariate dataset generated with random walk process (image by author)
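A baseline like the one pictured above can be sketched as follows. To keep the example short, it uses a vectorized random walk (a cumulative sum of random up or down steps) rather than a dedicated function; the signal names, the step size of 0.5, and the 0 to 100 range for starting values are assumptions chosen for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
date_index = pd.date_range("2021-01-01", periods=1000, freq="10min")
num_signals = 20

# Each signal gets its own starting value so they sit in different ranges
start_values = rng.uniform(0.0, 100.0, size=num_signals)

signals = {}
for i, start in enumerate(start_values):
    # Random up or down steps, accumulated into a walk
    steps = rng.choice([-0.5, 0.5], size=len(date_index))
    steps[0] = 0.0  # the first sample is the starting value itself
    signals[f"signal_{i:02d}"] = start + np.cumsum(steps)

baseline = pd.DataFrame(signals, index=date_index)
```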

Let’s now add some anomalies in there…

Adding anomalies

Adding level shifts

Level shifts in time series data can be seen when a process or a piece of equipment goes through different operating modes. This may also happen when the environmental conditions go through a sudden change. Here is the function I use to simulate a level shift on a given signal I input as a pandas Series:
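A minimal sketch of such a function. The parameter names follow the ones used in this article, but the function name and the way the smoothing is applied (scaling the deviations around the segment mean) are my own assumptions:

```python
import numpy as np
import pandas as pd

def add_level_shift(signal, start, end, magnitude_shift, magnitude_multiply=1.0):
    """Shift the [start, end] portion of a Series and rescale its variability."""
    shifted = signal.astype(float).copy()
    segment = shifted.loc[start:end]
    if segment.empty:
        return shifted
    center = segment.mean()
    # Move the segment by magnitude_shift and scale its deviations
    # around the mean: < 1.0 is smoother, > 1.0 is more chaotic
    shifted.loc[start:end] = (
        center + magnitude_shift + (segment - center) * magnitude_multiply
    )
    return shifted

rng = np.random.default_rng(0)
date_index = pd.date_range("2021-01-01", periods=1000, freq="10min")
signal = pd.Series(20.0 + rng.normal(0.0, 1.0, len(date_index)), index=date_index)
start, end = date_index[300], date_index[599]

# Positive shift with a smoother signal
smoother_up = add_level_shift(signal, start, end,
                              magnitude_shift=5.0, magnitude_multiply=0.5)
# Negative shift with a more chaotic signal
noisier_down = add_level_shift(signal, start, end,
                               magnitude_shift=-5.0, magnitude_multiply=2.0)
```

Working on a copy keeps the original signal intact, which makes it easy to compare the series before and after the shift.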

This function allows you to shift a portion of a given time series in a direction (magnitude_shift) over a given time range (between start and end). When a shift happens, your signal may also become either smoother (magnitude_multiply < 1.0) or more chaotic (magnitude_multiply > 1.0).

And here are two examples of level shifts generated by this function:

Level shifts examples (image by author)

On the first signal, we add a positive level shift with a smoother signal. On the second, the level shift is negative and we simulate a more chaotic behavior over this time range.

Here is the associated code that produced this plot:

Now I will use this function to add two random level shifts to three signals selected at random. I will also add another random level shift to five other signals, also selected at random:

Multivariate dataset with random level shifts added (image by author)

This type of anomaly is quite common, but it is easy to spot. Let’s now see how we can add gradual changes to a signal to simulate a slow degradation of a process…

Adding trending or gradual changes

To add a gradual change to a time series, I use the following function:
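A minimal sketch of such a function, under several assumptions of mine: the function name, the base_step and seed arguments, the way degradation_slope is added to a neutral probability of 0.5, and the way degradation_speed multiplies the step are illustrative choices, so the notebook version may differ:

```python
import numpy as np
import pandas as pd

def add_gradual_degradation(signal, start, end, degradation_slope=0.2,
                            degradation_speed=1.0, base_step=0.1, seed=None):
    """Replace the [start, end] portion of a Series with a drifting random walk."""
    rng = np.random.default_rng(seed)
    degraded = signal.astype(float).copy()
    segment_index = degraded.loc[start:end].index
    if len(segment_index) == 0:
        return degraded
    # The slope biases the up/down probability, the speed scales the step
    probability = float(np.clip(0.5 + degradation_slope, 0.0, 1.0))
    step = base_step * degradation_speed
    directions = np.where(rng.random(len(segment_index)) < probability, 1.0, -1.0)
    # The walk starts from the value observed when the degradation begins
    walk = degraded.loc[segment_index[0]] + np.cumsum(directions * step)
    degraded.loc[segment_index] = walk
    return degraded

date_index = pd.date_range("2021-01-01", periods=1000, freq="10min")
signal = pd.Series(10.0, index=date_index)
degraded = add_gradual_degradation(signal, date_index[700], date_index[899],
                                   degradation_slope=0.3, degradation_speed=2.0,
                                   seed=3)
```

A negative degradation_slope makes the signal drift downward instead of upward.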

Basically, I generate a new random walk with a degradation_slope which modifies the random walk probability and a degradation_speed which modifies the random walk step. Using this function I will add a couple of degradations to some signals selected at random in my multivariate dataset (see purple and yellow signals below):

Multivariate dataset with slow degradations added (image by author)

Adding catastrophic failures

Often, after a slow degradation pattern appears in a dataset, a catastrophic failure may follow. To simulate this, I will just set several signals to 0.0 right after a degradation pattern. I have a very simple function for this:
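A minimal sketch of such a function, assuming the signals live in the columns of a pandas DataFrame; the function name, its arguments, and the signal names are hypothetical:

```python
import pandas as pd

def add_catastrophic_failure(df, columns, start, end):
    """Set the selected signals to 0.0 between start and end (inclusive)."""
    failed = df.copy()
    failed.loc[start:end, columns] = 0.0
    return failed

date_index = pd.date_range("2021-01-01", periods=1000, freq="10min")
df = pd.DataFrame(
    {"signal_00": 10.0, "signal_01": 20.0, "signal_02": 30.0},
    index=date_index,
)

# Two of the three signals drop to zero for 50 samples
failed = add_catastrophic_failure(
    df, ["signal_00", "signal_02"], date_index[900], date_index[949]
)
```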

Check out my notebook to see how I added these failures right after the degradation pattern we just generated:

Multivariate dataset with sudden failures added (image by author)

And voilà! If you want to see an example of anomaly detection models trained and evaluated on this type of dataset, check out these articles:

Your Anomaly Detection Model Is Smarter than You Think

Top 3 Ways Your Anomaly Detection Models Can Earn Your Trust

Conclusion

In this article, you learned how you can leverage some synthetic generation techniques to create multivariate time series data with a realistic look.

I hope you found this article insightful: feel free to leave me a comment here and don’t hesitate to subscribe to my Medium email feed if you don’t want to miss my upcoming posts! Want to support me and future work? Join Medium with my referral link:

Join Medium with my referral link – Michaël HOARAU

