Part 7: Fast Pattern Searching with STUMPY

Finding Similar Subsequence Matches for Known Patterns

Sean Law
Towards Data Science


STUMPY is a powerful and scalable Python library for modern time series analysis and, at its core, efficiently computes something called a matrix profile. The goal of this multi-part series is to explain what the matrix profile is and how you can start leveraging STUMPY for all of your modern time series data mining tasks!

Note: These tutorials were originally featured in the STUMPY documentation.

Part 1: The Matrix Profile
Part 2: STUMPY Basics
Part 3: Time Series Chains
Part 4: Semantic Segmentation
Part 5: Fast Approximate Matrix Profiles with STUMPY
Part 6: Matrix Profiles for Streaming Time Series Data
Part 7: Fast Pattern Searching with STUMPY
Part 8: AB-Joins with STUMPY
Part 9: Time Series Consensus Motifs
Part 10: Discovering Multidimensional Time Series Motifs
Part 11: User-Guided Motif Search
Part 12: Matrix Profiles for Machine Learning

Beyond Matrix Profiles

At the core of STUMPY, one can take any time series data and efficiently compute something called a matrix profile, which essentially scans along your entire time series with a fixed window size, m, and finds the exact nearest neighbor for every subsequence within your time series. A matrix profile allows you to determine if there are any conserved behaviors (i.e., conserved subsequences/patterns) within your data and, if so, it can tell you exactly where they are located within your time series. In a previous tutorial, we demonstrated how to use STUMPY to easily obtain a matrix profile, learned how to interpret the results, and discovered meaningful motifs and discords. While this brute-force approach may be very useful when you don’t know what pattern or conserved behavior you are looking for, it can become quite expensive to perform this exhaustive pairwise search on sufficiently large datasets.

However, if you already have a specific, user-defined pattern in mind then you don’t actually need to compute the full matrix profile! For example, maybe you’ve identified an interesting trading strategy based on historical stock market data and you’d like to see if that specific pattern may have been observed in the past within one or more stock ticker symbols. In that case, searching for a known pattern or “query” is actually quite straightforward and can be accomplished quickly by using the wonderful core.mass function in STUMPY.

In this short tutorial, we’ll take a simple known pattern of interest (i.e., a query subsequence) and we’ll search for this pattern in a separate independent time series. Let’s get started!

Getting Started

Let’s import the packages that we’ll need to load, analyze, and plot the data:

%matplotlib inline

import time

import pandas as pd
import stumpy
import numpy as np
import numpy.testing as npt
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle

plt.rcParams["figure.figsize"] = [20, 6] # width, height
plt.rcParams['xtick.direction'] = 'out'

Loading the Sony AIBO Robot Dog Dataset

The time series data (below), T_df, has n = 13000 data points and was collected from an accelerometer inside of a Sony AIBO robot dog as it walked from a cement surface onto a carpeted surface and, finally, back onto the cement surface:

T_df = pd.read_csv("https://zenodo.org/record/4276393/files/Fast_Pattern_Searching_robot_dog.csv?download=1")
T_df.head()
Acceleration
0 0.89969
1 0.89969
2 0.89969
3 0.89969
4 0.89969

Visualizing the Sony AIBO Robot Dog Dataset

plt.suptitle('Sony AIBO Robot Dog Dataset, T_df', fontsize='30')
plt.xlabel('Time', fontsize ='20')
plt.ylabel('Acceleration', fontsize='20')
plt.plot(T_df)
plt.text(2000, 4.5, 'Cement', color="black", fontsize=20)
plt.text(10000, 4.5, 'Cement', color="black", fontsize=20)
ax = plt.gca()
rect = Rectangle((5000, -4), 3000, 10, facecolor='lightgrey')
ax.add_patch(rect)
plt.text(6000, 4.5, 'Carpet', color="black", fontsize=20)
plt.show()
(Image by Author)

In the plot above, the periods of time when the robot dog was walking on cement are displayed with a white background, while the times when the robot dog was walking on carpet are highlighted with a grey background. Do you notice any appreciable difference(s) between walking on the two surfaces? Are there any interesting insights that you can observe with the human eye? Do any conserved patterns exist within this time series and, if so, where are they?

Have You Seen This Pattern?

The subsequence pattern or query (below) that we are interested in searching for in the time series (above) looks like this:

Q_df = pd.read_csv("https://zenodo.org/record/4276880/files/carpet_query.csv?download=1")

plt.suptitle('Pattern or Query Subsequence, Q_df', fontsize='30')
plt.xlabel('Time', fontsize ='20')
plt.ylabel('Acceleration', fontsize='20')
plt.plot(Q_df, lw=2, color="C1")
plt.show()
(Image by Author)

This pattern, Q_df, has a window length of m = 100 and it was taken from a completely independent walking sample. Does it look familiar at all? Does a similar pattern exist in our earlier time series, T_df? Can you tell which surface the robot dog was walking on when this query sample was collected?

To answer some of these questions, you can compare this specific query subsequence or pattern with the full time series by computing something called a “distance profile”. Essentially, you take this single query, Q_df, and compare it to every single subsequence in T_df by computing all possible (z-normalized Euclidean) pairwise distances. So, the distance profile is simply a 1-dimensional vector that tells you exactly how similar/dissimilar Q_df is to every subsequence (of the same length) found in T_df.

Now, a naive algorithm for computing the distance profile would take O(n*m) time to process but, luckily, we can do much better than this as there exists a super efficient approach called “Mueen’s Algorithm for Similarity Search” (MASS) that is able to compute the distance profile in much faster O(n*log(n)) time (log base 2). This may not be a big deal if you only have a few short time series to analyze, but if you need to repeat this process many times with different query subsequences then things can add up quickly. In fact, as the length of the time series, n, and/or the length of the query subsequence, m, gets much longer, the naive algorithm would take way too much time!

Computing the Distance Profile with MASS

So, given a query subsequence, Q_df, and a time series, T_df, we can perform a fast similarity search and compute a distance profile using the core.mass function in STUMPY:

distance_profile = stumpy.core.mass(Q_df["Acceleration"], T_df["Acceleration"])
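As a quick sanity check on the bookkeeping, the distance profile always contains exactly n - m + 1 entries (here, 13000 - 100 + 1 = 12901), one for each possible subsequence start position in T_df:

# One z-normalized Euclidean distance per subsequence start position
print(distance_profile.shape)
(12901,)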

And, since the distance_profile contains the full list of pairwise distances between Q_df and every subsequence within T_df, we can retrieve the most similar subsequence from T_df by finding the smallest distance value in distance_profile and extracting its positional index:

idx = np.argmin(distance_profile)

print(f"The nearest neighbor to `Q_df` is located at index {idx} in `T_df`")
The nearest neighbor to `Q_df` is located at index 7479 in `T_df`

So, to answer our earlier question of “Does a similar pattern exist in our earlier time series, T_df?”, let’s go ahead and plot the most similar subsequence in T_df, which is located at index 7479 (blue), and overlay this with our query pattern, Q_df, (orange):

# Since MASS computes z-normalized Euclidean distances, we should z-normalize our subsequences before plotting
Q_z_norm = stumpy.core.z_norm(Q_df.values)
T_z_norm = stumpy.core.z_norm(T_df.values[idx:idx+len(Q_df)])
plt.suptitle('Comparing The Query (Orange) And Its Nearest Neighbor (Blue)', fontsize='30')
plt.xlabel('Time', fontsize ='20')
plt.ylabel('Acceleration', fontsize='20')
plt.plot(Q_z_norm, lw=2, color="C1")
plt.plot(T_z_norm, lw=2)
plt.show()
(Image by Author)

Notice that even though the query subsequence does not perfectly match its nearest neighbor, STUMPY was still able to find it! And then, to answer the second question of “Can you tell which surface the robot dog was walking on when this query sample was collected?”, we can look at precisely where idx is located within T_df:

plt.suptitle('Sony AIBO Robot Dog Dataset, T_df', fontsize='30')
plt.xlabel('Time', fontsize ='20')
plt.ylabel('Acceleration', fontsize='20')
plt.plot(T_df)
plt.text(2000, 4.5, 'Cement', color="black", fontsize=20)
plt.text(10000, 4.5, 'Cement', color="black", fontsize=20)
ax = plt.gca()
rect = Rectangle((5000, -4), 3000, 10, facecolor='lightgrey')
ax.add_patch(rect)
plt.text(6000, 4.5, 'Carpet', color="black", fontsize=20)
plt.plot(range(idx, idx+len(Q_df)), T_df.values[idx:idx+len(Q_df)], lw=2)
plt.show()
(Image by Author)

As we can see above, the nearest neighbor (orange) to Q_df is a subsequence that is found when the robot dog was walking on carpet and, as it turns out, Q_df was collected from an independent sample where the robot dog was walking on carpet too! To take this a step further, instead of extracting only the single nearest neighbor, we can look at where the top k = 16 nearest neighbors are located:

# This simply returns the (sorted) positional indices of the top 16 smallest distances found in the distance_profile
k = 16
idxs = np.argpartition(distance_profile, k)[:k]
idxs = idxs[np.argsort(distance_profile[idxs])]

And then let’s plot all of these subsequences based on their index locations:

plt.suptitle('Sony AIBO Robot Dog Dataset, T_df', fontsize='30')
plt.xlabel('Time', fontsize ='20')
plt.ylabel('Acceleration', fontsize='20')
plt.plot(T_df)
plt.text(2000, 4.5, 'Cement', color="black", fontsize=20)
plt.text(10000, 4.5, 'Cement', color="black", fontsize=20)
ax = plt.gca()
rect = Rectangle((5000, -4), 3000, 10, facecolor='lightgrey')
ax.add_patch(rect)
plt.text(6000, 4.5, 'Carpet', color="black", fontsize=20)
for idx in idxs:
    plt.plot(range(idx, idx+len(Q_df)), T_df.values[idx:idx+len(Q_df)], lw=2)
plt.show()
(Image by Author)

Unsurprisingly, the top k = 16 nearest neighbors to Q_df (or best matches, shown in multiple colors above) can all be found when the robot dog was walking on the carpet (grey)!
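One caveat: the k smallest values in a distance profile often come from heavily overlapping subsequences (start indices that differ by only a few time points). If you’d prefer distinct, non-overlapping matches, newer releases of STUMPY also provide a convenience function, stumpy.match, that sorts all matches by distance and applies an exclusion zone around each one. Here is a minimal sketch (this assumes that your installed STUMPY version includes stumpy.match):

# stumpy.match returns rows of [distance, start_index], sorted by distance,
# with an exclusion zone applied so that overlapping near-duplicates are suppressed
matches = stumpy.match(Q_df["Acceleration"], T_df["Acceleration"], max_matches=16)

for dist, i in matches:
    print(f"Match at index {int(i)} (z-normalized distance: {dist:.2f})")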

Summary

And that’s it! You have now taken a known pattern of interest (or query), run it through core.mass in STUMPY, and quickly searched for this pattern in another time series. With this newfound knowledge, you can now go and search for patterns in your own time series projects. Happy coding!

Additional Note — Distance Profiles with Non-normalized Euclidean Distances

There are times when you may want to use the non-normalized Euclidean distance as your measure of similarity/dissimilarity and so, instead of using core.mass (which z-normalizes your subsequences before computing the pairwise Euclidean distances), you can use the core.mass_absolute function. This is provided for those who are interested in computing the non-normalized matrix profiles that are available in the complementary stumpy.aamp, stumpy.aamped, stumpy.gpu_aamp, and stumpy.aampi functions.
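For example, here is a minimal sketch of repeating the nearest-neighbor search with non-normalized Euclidean distances (assuming that core.mass_absolute accepts the same (Q, T) inputs as core.mass):

# Distance profile using raw (non-z-normalized) Euclidean distances
abs_distance_profile = stumpy.core.mass_absolute(Q_df["Acceleration"], T_df["Acceleration"])

# Since amplitude and offset now matter, the best match may differ from the z-normalized one
idx_abs = np.argmin(abs_distance_profile)
print(f"Under non-normalized Euclidean distance, the nearest neighbor starts at index {idx_abs}")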

Bonus Section — What Makes MASS So Fast?

The reason why MASS is so much faster than a naive approach is that MASS uses the Fast Fourier Transform (FFT) to convert the data into the frequency domain and performs what is called a “convolution”, which computes all of the sliding dot products at once and thereby reduces the overall cost from O(n*m) down to O(n*log(n)). You can read more about this in the original Matrix Profile I paper.
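To make that concrete, here is a minimal, illustrative sketch of the MASS idea (not STUMPY’s actual implementation): a single FFT-based circular convolution yields every sliding dot product between Q and T, and these dot products are then converted into z-normalized Euclidean distances using precomputed sliding means and standard deviations:

import numpy as np

def mass_sketch(Q, T):
    # Illustrative-only MASS: an O(n log n) distance profile via FFT
    n, m = len(T), len(Q)

    # 1. All sliding dot products Q . T[i : i + m] from one circular convolution
    Q_rev_padded = np.append(Q[::-1], np.zeros(n - m))
    QT = np.fft.irfft(np.fft.rfft(T) * np.fft.rfft(Q_rev_padded), n=n)[m - 1 : n]

    # 2. Sliding means/standard deviations of T from cumulative sums (O(n))
    cumsum = np.cumsum(np.append(0.0, T))
    cumsum_sq = np.cumsum(np.append(0.0, np.square(T)))
    mu_T = (cumsum[m:] - cumsum[:-m]) / m
    sigma_T = np.sqrt((cumsum_sq[m:] - cumsum_sq[:-m]) / m - np.square(mu_T))

    # 3. Convert dot products to z-normalized Euclidean distances
    # (a production version would also guard against constant subsequences where sigma_T == 0)
    mu_Q, sigma_Q = Q.mean(), Q.std()
    rho = (QT - m * mu_Q * mu_T) / (m * sigma_Q * sigma_T)  # sliding Pearson correlation
    return np.sqrt(np.maximum(2 * m * (1.0 - rho), 0.0))    # clamp tiny negative round-off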

Here’s a naive implementation of computing a distance profile:

def compute_naive_distance_profile(Q, T):
    Q = Q.copy()
    T = T.copy()
    n = len(T)
    m = len(Q)
    naive_distance_profile = np.empty(n - m + 1)
    start = time.time()
    Q = stumpy.core.z_norm(Q)
    for i in range(n - m + 1):
        naive_distance_profile[i] = np.linalg.norm(Q - stumpy.core.z_norm(T[i:i+m]))
    naive_elapsed_time = time.time() - start
    print(f"For n = {n} and m = {m}, the naive algorithm takes {np.round(naive_elapsed_time, 2)}s to compute the distance profile")
    return naive_distance_profile

For a random time series, T_random, with 1 million data points and a random query subsequence, Q_random:

Q_random = np.random.rand(100)
T_random = np.random.rand(1_000_000)
naive_distance_profile = compute_naive_distance_profile(Q_random, T_random)
For n = 1000000 and m = 100, the naive algorithm takes 44.1s to compute the distance profile

The naive algorithm takes over half a minute to compute! However, MASS can handle this (and even larger datasets) in about 1 second:

start = time.time()
mass_distance_profile = stumpy.core.mass(Q_random, T_random)
mass_elapsed_time = time.time()-start
print(f"For n = {len(T_random)} and m = {len(Q_random)}, the MASS algorithm takes {np.round(mass_elapsed_time, 2)}s to compute the distance profile")For n = 1000000 and m = 100, the MASS algorithm takes 1.13s to compute the distance profile

And, to be absolutely certain, let’s check that both methods produce the same output:

npt.assert_almost_equal(naive_distance_profile, mass_distance_profile)

Success, no errors! This means that both outputs are identical (up to floating-point precision). Go ahead and give it a try!

Resources

The Fastest Similarity Search Algorithm for Time Series Subsequences Under Euclidean Distance
STUMPY Matrix Profile Documentation
STUMPY Matrix Profile Github Code Repository

Part 6: Matrix Profiles for Streaming Time Series Data | Part 8: AB-Joins with STUMPY


Principal Data Scientist at a Fortune 500 FinTech company. PyData Ann Arbor organizer. Creator of STUMPY for modern time series analysis. Twitter: @seanmylaw