Fast and Robust Sliding Window Vectorization with NumPy

Diving into NumPy vectorization tricks for multivariate time series data.

Syafiq Kamarul Azman
Towards Data Science

--

Photo by bin foch on Unsplash

Once in a while, you get to work with the underdog of the data world: time series (images and natural language have been in the limelight a lot recently!). I’ve been lucky (or unlucky) enough to have worked on time series data for most of last year. Among my adventures in the wacky world of 1-dimensional machine learning, I found some very useful vectorization tricks that allowed me to blaze through my data pre-processing stage. I want to share the tricks that were instrumental in accelerating my tasks, in the hope that they can do the same for you.

We will be working with multivariate time-domain simulations in this article. Our goal here is to learn how to develop a component in a performant data pipeline. NumPy is my library of choice for executing the matrix operations. Let’s get to it!

Recap: multivariate time series

Most commonly, you will see a time series as a single line that cuts through a plot while progressing up or down. On the horizontal axis, time proceeds gracefully, yet ever so regularly, while on the vertical axis, measurements or values stand recorded. A unit on the horizontal axis is what we will term a timestep. Multivariate time series are similar, with the difference being that the lonely line is now accompanied by other lines propagating in parallel.

Univariate time series vs. multivariate time series, but you already know that. Image by author.

One common example of multivariate time series data is a weather station measuring temperature, humidity, pressure, and other facets of the environment at regular intervals. Another is electroencephalograms, which capture brain activity using multiple electrodes producing many readings in parallel. More abstractly, sheet music can be interpreted as a multivariate time series: the instruments are different channels of information over a regular sampling time (tempo) of the piece.

Voltages in power system transient stability

My research looks at transient stability in power systems when a fault occurs. During a fault in a power system, a measure is taken to mitigate the mishap. Depending on what state the system is in, where the fault happened and how long it took to clear away that fault, the system can go two ways: return to stability or diverge to instability. One way to determine this is to simulate a bunch of states, fault locations, and clearing times, then learn from the dynamics of the system to predict whether the system stabilizes or explodes (okay, not really).

A simple power system with 3 generators, 9 buses, and 9 lines. Image from Zheng, et al. (2018). Synthetic Dynamic PMU Data Generation: A Generative Adversarial Network Approach.

The data we capture comes from phasor measurements over a period of time at different buses (bold solid lines) in the power system. Since there are many buses and each bus measures different types of data, this naturally lends itself to multivariate time series data. We will be working with some of that simulation data for the purposes of this article. It is known that the transient stability phenomenon can be predicted from a small section of the entire data. Most of the remainder of the time series is superfluous.

Our goal is to extract important regions of the signal for a training dataset.

I will first introduce the core concept of a basic extraction before diving deeper and extending the functionality of the extractor. You can find the dataset for the 9 bus system at this link or on my website as a gzipped folder of NumPy matrix files along with some metadata. The metadata column of particular importance here is the clearing time. Take a few minutes to digest the data if you want.

Code for plotting the datapoint number 693 from the 9 bus dataset.
All information for case 693. There are 9 voltage buses and 3 generator buses. We can combine all these into a 24-channel multivariate time series data. Image from author.

Starting simple: basic sliding window extraction

The part of the signal that we want is around the clearing time of the simulation. We want a window of information before and after the clearing time, called the main window. The main window can span up to some maximum timestep after the clearing time, which we call the max time. Within the main window, we want a bunch of smaller windows, called sub-windows, that will make up our training examples. The main window should extend far enough back that there is a sub-window’s worth of timesteps before the clearing time. This is a lot to capture, but maybe a figure can help us understand the structure of these windows. We will use the voltage magnitude for case 693 for all illustrations going forward.

The basic sliding window scheme; we are aiming to extract the sub-windows on the right. Image from author.

Essentially, we want to slide a sub-window across the main window, step by step, and collect the information at each timestep. The first sub-window must contain the first timestep after the clearing time. The rationale here is that we only know the time at which the fault is cleared in the system (the clearing time). Therefore, the earliest we can have any usable data is the next timestep after clearance.

A small example

Let’s capture a main window with a max time of 10 timesteps. Within the main window, we can take 11 sub-windows of size 5 timesteps. Visually:

  1st timestep after clearing --|           |-- max time
                                v           v
main_window = [..., 5, 6, 7, 8, 9, 10, ..., 19]
                             ^
         clearing time index --|

sub_windows = [
    [ 5,  6,  7,  8,  9],
    [ 6,  7,  8,  9, 10],
    ...
    [15, 16, 17, 18, 19]
]

So if I have a max time T and a sub-window of size K, I can get T+1 examples from one simulation (including the index at the max time). This count disregards K because we can increase K to whatever size we want, which will in turn extend the left half of the main window (assuming we don’t underflow the data matrix). This means extracting data with a sub-window of either size 5 or size 8 gives you an identical number of sub-windows, just of different sub-window sizes.
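To convince ourselves of the T+1 rule, here is a quick sketch (the function and the numbers are my own illustrative constructions, not the article’s dataset) that counts sub-windows for two different K:

```python
# C is the first timestep after clearing, K the sub-window size, T the max time.
def count_sub_windows(C, K, T):
    start = C - K + 1  # first index of the first sub-window
    windows = [list(range(start + i, start + i + K)) for i in range(T + 1)]
    return len(windows)

# K = 5 and K = 8 both yield T + 1 = 11 sub-windows when T = 10:
print(count_sub_windows(9, 5, 10))  # 11
print(count_sub_windows(9, 8, 10))  # 11
```

The count depends only on T, exactly as claimed.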

We can substitute for any sub-window size, as long as the left half of the data can accommodate. Image from author.

Awesome! Now that we have the framework, we can roughly plan out the extraction function. We have the input information as a 2-dimensional (2D) matrix where timesteps propagate down the rows, and features are distributed across the columns. Visually, a sample input matrix with 4 features and 6 timesteps would look like this:

   Feature 1   Feature 2   Feature 3   Feature 4
  ----------------------------------------------
[[ -5.29328 ,  9.89139 ,  -2.79590 ,  -8.73531 ],   Time=1  |
 [ -5.29345 ,  9.89152 ,  -2.79595 ,  -8.73542 ],   Time=2  |
 [ -5.29396 ,  9.89200 ,  -2.79644 ,  -8.73593 ],   Time=3  |
 [ -5.29416 ,  9.89222 ,  -2.79671 ,  -8.73614 ],   Time=4  |
 [ -5.29451 ,  9.89257 ,  -2.79702 ,  -8.73649 ],   Time=5  |
 [ -5.29479 ,  9.89258 ,  -2.79732 ,  -8.73643 ]]   Time=6  V

Our target output matrix is a 3D matrix that looks something like this (assuming we want a sliding window of size 4):

    Feature 1   Feature 2   Feature 3   Feature 4
   ----------------------------------------------
[[[ -5.29328 ,  9.89139 ,  -2.79590 ,  -8.73531 ],   Time=1  |
  [ -5.29345 ,  9.89152 ,  -2.79595 ,  -8.73542 ],   Time=2  |
  [ -5.29396 ,  9.89200 ,  -2.79644 ,  -8.73593 ],   Time=3  |
  [ -5.29416 ,  9.89222 ,  -2.79671 ,  -8.73614 ]],  Time=4  V

 [[ -5.29345 ,  9.89152 ,  -2.79595 ,  -8.73542 ],   Time=2  |
  [ -5.29396 ,  9.89200 ,  -2.79644 ,  -8.73593 ],   Time=3  |
  [ -5.29416 ,  9.89222 ,  -2.79671 ,  -8.73614 ],   Time=4  |
  [ -5.29451 ,  9.89257 ,  -2.79702 ,  -8.73649 ]],  Time=5  V

 [[ -5.29396 ,  9.89200 ,  -2.79644 ,  -8.73593 ],   Time=3  |
  [ -5.29416 ,  9.89222 ,  -2.79671 ,  -8.73614 ],   Time=4  |
  [ -5.29451 ,  9.89257 ,  -2.79702 ,  -8.73649 ],   Time=5  |
  [ -5.29479 ,  9.89258 ,  -2.79732 ,  -8.73643 ]]]  Time=6  V

This should be trivial: just loop over the timesteps and slice the rows of the matrix to get the sub-windows.

The (naive) sliding window extractor

Extracting slices of the main window with a for-loop. It’s efficient, right?

The function takes the array from which the data is extracted, the clearing time index which is the index of our clearing time, the max timesteps (T), and the sub-window size (K). First, we declare an empty list to store our output sub-windows. Then we define the start index which is the index at which the sliding window will begin. Using the T+1 rule explained earlier, we know how many data points we expect. All that’s left is a for-loop to extract the data from the array using good old slicing.

As our goal is to collect a bunch of 2D matrices, it would be wise to stack the data in a batch using a 3D matrix where the first dimension would tell you how many 2D matrices are available. Hence the use of expand_dims and vstack to achieve this.
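The gist itself isn’t reproduced here, so below is a minimal sketch of what such a loop-based extractor might look like (function and argument names are my own assumptions, not necessarily the original’s):

```python
import numpy as np

def extract_windows(array, clearing_time_index, max_time, sub_window_size):
    """Loop-based sliding window extraction (the naive baseline)."""
    examples = []
    # S = C - K + 1, where C = clearing_time_index + 1 is the first
    # timestep after clearing and K = sub_window_size.
    start = clearing_time_index + 1 - sub_window_size + 1

    for i in range(max_time + 1):
        example = array[start + i : start + sub_window_size + i]
        examples.append(np.expand_dims(example, 0))  # shape (1, K, n_features)

    return np.vstack(examples)  # shape (T + 1, K, n_features)
```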

How good is this solution?

That looks like a good solution, right? That wasn’t so bad. Or was it? Well, Python for-loops are notoriously slow, and we are not exploiting the capabilities of NumPy’s fancy indexing. At the moment, we don’t have anything to compare the for-loop method against, but try to convince yourself that it is indeed bad and that we can do better. So, let’s see a few fancy indexing tricks that will help us reach a blazing fast sliding window extraction function.

Some fancy vectorization tricks

Trick #1: We can index any row of a 2D matrix arbitrarily using a 1D matrix of integer indices.

1-dimensional row-indexing of a 10×3 matrix X. Image from author.

What we did was pick the row indices 7, 4, 1, and 2 out of the 2D matrix X by using an array of indices. By knowing what indices we want, NumPy allows us to forgo loops and instead immediately index the rows that we need.

The analog of a single sub-window in our sliding window scheme is indexing with an array of consecutive indices.
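As a concrete sketch of Trick #1 (the example matrix is my own construction, chosen so row identity is obvious):

```python
import numpy as np

# A 10x3 matrix where row i holds [i, i, i].
X = np.arange(10).repeat(3).reshape(10, 3)

# Fancy indexing: pick rows 7, 4, 1 and 2 in that order, no loop needed.
rows = np.array([7, 4, 1, 2])
print(X[rows])
# [[7 7 7]
#  [4 4 4]
#  [1 1 1]
#  [2 2 2]]

# A single sub-window is just an array of consecutive indices:
print(X[np.arange(3, 7)])  # same as X[3:7]
```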

But, really, this isn’t much more useful than the slicing we used in our original code, since you would still need a loop to create the consecutive arrays of indices that extract all the sub-windows. However, Trick #1 is only the prelude to a more powerful trick.

Trick #2: We can index any 2D sub-matrix of rows of a 2D matrix using a 2D matrix of integer indices (wow, that’s a mouthful of 2Ds).

2-dimensional row-indexing of matrix X. Image from author.

This is a little harder to catch but let’s understand what is going on. First, we created a matrix, I, which is a 3×4 matrix of integers. Then, we indexed multiple rows of X together using NumPy’s fancy indexing giving us a 3D matrix. The first index of the output 3D matrix is the 2D matrix as if we used Trick #1 on matrix X with the indices [7, 4, 1, 2]. Similarly, the second index of the 3D matrix is the output of applying Trick #1 on X with the indices [5, 6, 8, 9]. Can you figure out the third index of the output 3D matrix?

Basically, we have found a way to vectorize the outer for-loop in the starter code: we can create a 2D matrix of integers that contains consecutive indices of our simulation data and apply Trick #2 to quickly extract a 3D array that contains those sub-windows. If instead, we replaced matrix I with the matrix below, we can quickly get our sliding window sub-windows.

I = np.array([[0, 1, 2, 3],
              [1, 2, 3, 4],
              [2, 3, 4, 5]])
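A sketch of Trick #2, again on an arbitrary example matrix of my own making:

```python
import numpy as np

X = np.arange(10).repeat(3).reshape(10, 3)  # row i is [i, i, i]

# A 2D matrix of integer indices picks out a stack of row-sets at once.
I = np.array([[0, 1, 2, 3],
              [1, 2, 3, 4],
              [2, 3, 4, 5]])

out = X[I]           # 3D result: one 2D sub-matrix per row of I
print(out.shape)     # (3, 4, 3)
print(out[0][:, 0])  # rows 0..3 of X -> [0 1 2 3]
```

Each row of I plays the role of one Trick #1 index array, and NumPy stacks the results into a 3D matrix.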

Now that we know we can index arbitrarily any 2D matrix, we can come up with a systematic way to compose matrix I so that we can generalize this indexing to any sub-window size and up to any max time to vectorize the sliding window extraction.

The math of the sliding window indexer

This part is a little math-heavy with lots of symbols flying around so feel free to leave a comment if you don’t follow anything.

Knowing Trick #2, what we are looking to build is a 2D matrix of consecutive indices, with each row as wide as the sub-window. Letting C be the first timestep after the clearing time (the clearing time plus one), the main window spans from C - K + 1, where K is the sub-window size, up to C + T, where T is the max time. Specifically, we want to fancy-index our data using the following (T+1)×K indexer matrix:

    [[C-K+1  , C-K+2  , C-K+3  , ..., C   ]
     [C-K+2  , C-K+3  , C-K+4  , ..., C+1 ]
I =  [C-K+3  , C-K+4  , C-K+5  , ..., C+2 ]
     [                 ...                ]
     [C+T-K+1, C+T-K+2, C+T-K+3, ..., C+T ]]

Okay, I have to admit, I did spend some time making sense of this matrix (mainly because of Python’s 0-indexing scheme). But, in essence, this is the application of Trick #2 to the generic situation in our sliding window extraction. Ideally, we would generate this 2D indexer matrix without using for-loops too: they’re just that slow! We can actually simplify this matrix significantly by calculating our starting index. The starting index (S) is calculated as S = C - K + 1 = C - (K - 1). Thus, we can represent our index matrix in a simpler-to-understand form:

        [[0 , 1  , 2  , ..., K-1  ]
         [1 , 2  , 3  , ..., K    ]
I = S +  [2 , 3  , 4  , ..., K+1  ]
         [          ...           ]
         [T , T+1, T+2, ..., T+K-1]]

Simply, we add an offset S to a (T+1)×K matrix. Each row of this matrix holds consecutive values spanning K columns, and the row starting values increase down the rows from 0 up to T. We know that we will have T+1 sub-windows, so we just need consecutive indices, as many as the size of the sub-window, repeated T+1 times with increasing offsets. This yields the same matrix as above. In fact, we can represent this matrix in a yet simpler decomposition:

     [[0, 1, 2, ..., K-1]     [[0, 0, 0, ..., 0]
      [0, 1, 2, ..., K-1]      [1, 1, 1, ..., 1]
S +   [0, 1, 2, ..., K-1]  +   [2, 2, 2, ..., 2]
      [       ...       ]      [      ...      ]
      [0, 1, 2, ..., K-1]]     [T, T, T, ..., T]]

The astute among you might’ve seen this coming from the second matrix already but let’s go through it. Since we know that each sub-window has size K, we really just need to make a matrix of consecutive indices once. Also, with the information of T+1 consecutive sub-windows, we just need to add the values [0, 1, 2, …, T] to each of the sub-window indices one at a time.

Using a clever vectorization technique called broadcasting, we don’t even have to construct the whole middle and right 2D matrices of the addition. Rather, we only need to create a 1×K and a (T+1)×1 matrix instead of two (T+1)×K matrices:

                              [[0]
                               [1]
S + [[0, 1, 2, ..., K-1]]  +   [2]
                               ...
                               [T]]

By broadcasting the addition of the center row matrix and the right column matrix, we can get the original (T+1)×K matrix that we needed. Let’s see this in code.

Vectorized sliding window extractor. How elegant is that? No more for-loops!
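The gist isn’t reproduced here either, so here is a sketch of the vectorized extractor built directly from the broadcasting decomposition above (names are my own assumptions):

```python
import numpy as np

def vectorized_extraction(array, clearing_time_index, max_time, sub_window_size):
    """Vectorized sliding window extraction using broadcasting."""
    start = clearing_time_index + 1 - sub_window_size + 1  # S = C - K + 1

    sub_windows = (
        start
        + np.expand_dims(np.arange(sub_window_size), 0)   # 1 x K row of offsets
        + np.expand_dims(np.arange(max_time + 1), 0).T    # (T+1) x 1 column of starts
    )  # broadcasting builds the (T+1) x K indexer matrix in one shot

    return array[sub_windows]  # shape (T + 1, K, n_features)
```

Trick #2 does the extraction, and broadcasting builds the indexer matrix, so no Python loop touches the data at all.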

Performance comparison

If the elegance is not convincing enough for you, perhaps hard numbers will be. To test the performance, I first created two matrices, X and Y, of equal size but with different data, using NumPy’s random.randn function. This ensures that no computation is cached between the runs. Then I ran the loop-based extractor on X and the vectorized extractor on Y. The extraction yields a matrix of size 1201×10×200.

Image of a terminal output showing running times of both vectorized and loop-based sliding window extraction function
10.8 ms down to 3.97 ms. That’s roughly a 2.7× speedup! Image from author.

What a significant speedup! Our vectorized implementation ran circles around the loop-based extractor. This is particularly useful when you are processing multiple files and want to extract lots of windows from many different files one at a time. There are, however, some caveats to this method. With the for-loop, we extract data without having to create a 2D matrix of indices into the original matrix. This is definitely “easier” on the memory, but I argue that since our output matrix is much larger than the indexer matrix, the output is ultimately the decisive memory consumer.

That’s pretty much the core of it. Now let’s explore how to make full use of this vectorization to perform other manipulations to the time series data.

Making the sliding window extractor more robust

Alright, let’s add more features to our window extractor. Of course, we’ll keep the vectorization flavor going.

Stride, not just slide

What if we wanted to slide the window over more than one timestep? Like this:

Notice that the sub-windows’ starting times jump two timesteps on each slide. Image from author.

Borrowing the terminology of convolutional neural network filter operations, we call this striding windows. You can imagine this to be a relatively simple augmentation to our indexing matrix. Specifically, we are looking at an indexing matrix like this:

                              [[0 ]
                               [V ]
S + [[0, 1, 2, ..., K-1]]  +   [2V]
                               ...
                               [T ]]

I’m running out of clever symbol names, so I’ll denote the stride size as V. When V is 1, this reduces to the basic sliding-window code we have above. Any larger integer determines how far the window jumps, resulting in a smaller set of output sub-windows. We need to be careful: since our max time (T) hasn’t changed, we will need to account for the reduced size of the rightmost vector. There are two ways around this:

  1. Create the entire array of indices (as we did before) and use fancy indexing to select every V-th row from that array, or,
  2. Create the rightmost vector that looks like the one in the block above.

Let’s code both implementations.

Striding windows implementation in two ways. Note that there is a new argument “stride_size”.
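Since the gist doesn’t appear here, below is a sketch of both variants (function and argument names are my own; stride_size is V):

```python
import numpy as np

def strided_v1(array, clearing_time_index, max_time, sub_window_size, stride_size):
    """Variant 1: build the full indexer matrix, then keep every V-th row."""
    start = clearing_time_index + 1 - sub_window_size + 1
    sub_windows = (
        start
        + np.expand_dims(np.arange(sub_window_size), 0)
        + np.expand_dims(np.arange(max_time + 1), 0).T
    )
    return array[sub_windows[::stride_size]]

def strided_v2(array, clearing_time_index, max_time, sub_window_size, stride_size):
    """Variant 2: generate only the strided start offsets directly."""
    start = clearing_time_index + 1 - sub_window_size + 1
    sub_windows = (
        start
        + np.expand_dims(np.arange(sub_window_size), 0)
        + np.expand_dims(np.arange(0, max_time + 1, stride_size), 0).T
    )
    return array[sub_windows]
```

Both produce identical outputs; they only differ in whether the full indexer is materialized before striding.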

Not sure which one is better? Let’s do a similar test as we did before and measure the performance.

About equal performance for both variants. Image from author.

The only difference in the arguments that I pass to the functions now is the stride size, which is 3. As you can see, there isn’t much difference, but variant 1 is slightly (though negligibly) slower than variant 2. I attribute this slowdown in variant 1 to the fact that it has to build a full (T+1)×K matrix and then slice it. In contrast, variant 2 creates the strided indices immediately, which could save time and memory in the long run.

Downsampled sub-windows

In my research, the simulator is able to generate signals at a frequency much higher than what is practically possible. The real-world devices usually sample at around 50Hz or 60Hz (50/60 timesteps per second) while the simulator can generate data above 1kHz (>1000 timesteps per second). Downsampling the simulation data can be useful for training models that are deployable in a production environment.

We can easily downsample the sub-windows using the same technique presented in the striding windows section, except that we now stride within the sub-window vector. But this only works directly if the downsampling ratio is an integer. For example, to downsample 1kHz to 100Hz, we just take every 10th timestep in the sub-window (a downsampling ratio (R) of 10). However, to downsample 1kHz to 60Hz, we would need to take every 16.667th timestep, which is not an integer downsampling ratio.

The blue blocks are one contiguous array, just sampling at evenly spaced timesteps. Note the timespan in the right-hand plot. Image from author.

We can cheat a bit without too much issue by rounding the indices after multiplying the steps by the downsampling ratio (R). When R is equal to 1, this is, yet again, our basic sliding window. More importantly, we also need to consider how the data window is spaced out. The sub-window keeps its size, but its samples are now spaced R timesteps apart, and the max timesteps that we need to capture are similarly spaced. Let’s view this in our good friend, the matrix:

                                     [[0 ]
                                      [R ]
S' + [[0, R, 2R, ..., (K-1)R]]  +     [2R]
                                      ...
                                      [TR]]

There are a few things to note here. S’ is a new starting index accounting for the spaced-out sub-window, R is our downsampling ratio, and accordingly, we have to adjust the spacing to capture the desired max timesteps. Caveat: R should not be less than 1; otherwise you’d be sampling data in between unit timesteps, which entails interpolation.

Calculating S’ is relatively simple: we subtract from C (the first timestep after the clearing time) the size of the sub-window minus one, multiplied by the downsampling ratio; mathematically, S’ = C - (K - 1)R. This ensures that the first timestep after clearance is the last data point in the first sub-window. Okay, enough maths, let’s code.

Sliding windows AND downsampling in one go, what more could you want?
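With the gist absent, here is a sketch of what the downsampling extractor could look like (names are my own assumptions; np.round handles the non-integer ratios mentioned above):

```python
import numpy as np

def downsampled_extraction(array, clearing_time_index, max_time,
                           sub_window_size, downsample_ratio):
    """Sliding windows whose samples are spaced downsample_ratio timesteps apart."""
    # S' = C - (K - 1) * R, rounded so non-integer ratios still index whole rows.
    start = round(clearing_time_index + 1
                  - (sub_window_size - 1) * downsample_ratio)

    sub_windows = np.round(
        start
        + np.expand_dims(np.arange(sub_window_size) * downsample_ratio, 0)
        + np.expand_dims(np.arange(max_time + 1) * downsample_ratio, 0).T
    ).astype(int)  # indices must be integers for fancy indexing

    return array[sub_windows]
```

With R = 1 this collapses back to the plain vectorized sliding window.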

Downsampled windows maintain the same output window size as a standard sliding window but occupy a greater time scale. This can be useful if you want to keep your data similar in size but cover timesteps that are more spaced-out.

Conclusion

I hope you’ve found some awesome tricks to help you vectorize your sliding window pre-processing workflows. Changing your mindset to approach matrix manipulation without for-loops can be tough but once you get the hang of it, it feels more natural. My biggest challenge was thinking of the outputs in high dimensional matrices which feels so foreign (and still does sometimes).

My only gripe with this technique is that the maintainability of the code decreases, since you have to ensure that the people using your code can understand what the vectorization is doing. This means having solid documentation alongside the function is crucial, and naming your variables sensibly is paramount for others’ understanding.

The extensions to the basic sliding window vectorization will hopefully inspire you to try your own complex vectorizations to speed up your data pipeline. When you get it running, you’ll feel like a NumPy ninja, honest! The increased performance will leave you waiting a lot less for files to process. What better way to keep working and minimize your idle time, right?


A data engineer at AIQ, a JV between ADNOC and G42. Started in web, went into bioinformatics and is now crazy about deep learning. See more at syaffers.xyz