Multiple Time Series Classification by Using Continuous Wavelet Transformation

The purpose of this post is to show why the continuous wavelet transformation is so powerful and how to use it to classify multiple non-stationary signals and time series.

--

1 Introduction: Importance of Continuous Wavelet Transformation

In the age of digitalization and the fourth industrial revolution, companies increasingly focus on developing data-driven applications to create new business models. Often a prerequisite for the resulting new services is recording non-stationary data or signals (data that dynamically varies over time) and time series like financial trends or sounds, vibrations, electrical, accelerometer or other kinds of sensor signals. To generate the desired business value, events usually must be detected and understood from this data.

This is where machine and deep learning come in. These mathematical models enable classifying and predicting the status of an IoT device in a very efficient way. Unfortunately, multiple non-stationary signals (time series) are often required, which are very complex and prone to noise and misleading values. To improve the prediction accuracy of a model, it is often helpful to use digital signal processing techniques. One very powerful technique for this purpose is continuous wavelet transformation.

Continuous Wavelet Transform (CWT) is very efficient in determining the damping ratio of oscillating signals (e.g. identification of damping in dynamic systems). CWT is also very resistant to the noise in the signal.

Carmen Hurley & Jaden Mclean: Wavelet, Analysis and Methods (2018). Page 73

PS: In addition to the CWT, there is also the Discrete Wavelet Transformation (DWT), which I will not explain in further detail in this post.

Surprisingly, CWT is not very popular in data science. For this reason, in the following post I would like to show how easily CWT can be used for machine and deep learning (section 3). First, I would like to outline some basic theory regarding CWT (section 2).

2 Signal Processing Using Continuous Wavelet Transformation

In this section, I would like to provide a brief overview of why wavelet transformation is so useful for analyzing non-stationary signals (section 2.1), as well as the concept behind it (section 2.2).

2.1 Limitation of the Fourier Transformation

Fourier transformation (FT) decomposes a signal into its frequencies by using a series of sine waves. It allows transitioning between the time and frequency domain. For a better understanding of FT, I recommend this useful video.

To illustrate the limitation of FT, we need a very simple stationary and a non-stationary signal. A stationary signal does not change its mean, variance and covariance over time, unlike a dynamically changing non-stationary signal. In FT, we must consider the time and frequency domains. The time domain shows the amplitude/strength of a signal as a function of time; after FT, the same signal can be represented as individual isolated frequencies in the frequency domain (Figure 1).

Figure 1: Analysis of non-stationary signals using FT does not provide a complete picture of their time and frequency domains

As you can see (Figure 1), FT works very well for the sine wave, which is generated by a stationary process, because the signal contains all of its frequencies all of the time (in this example, only one frequency). Now let’s focus on the burst in the non-stationary signal, which could be a kind of anomaly or a characteristic pattern. As with the stationary process, you can also identify the frequencies by using FT, although you cannot identify which frequencies exactly represent the burst in the signal. The thing about FT is that it only decomposes a signal into its frequency domain, without any information about its time domain. This explains why it is not possible to determine which frequencies are part of a signal at a specific moment in time, or vice versa. In summary, with non-stationary signals in combination with FT you are confined to either the time or the frequency domain, but you never get the complete picture of the signal. To handle this problem in an intelligent way, let’s take a closer look at the continuous wavelet transformation in the next section.
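This limitation is easy to reproduce with a few lines of NumPy. The following is a minimal sketch with a made-up signal and sampling rate: a stationary 5 Hz sine wave with a short 100 Hz burst. The spectrum shows both frequencies, but nothing about when the burst occurred.

```python
import numpy as np

fs = 1000                               # assumed sampling rate in Hz
t = np.arange(0, 1, 1 / fs)             # one second of samples
signal = np.sin(2 * np.pi * 5 * t)      # stationary 5 Hz sine wave
# add a 50 ms burst at 100 Hz (a non-stationary event)
signal[400:450] += np.sin(2 * np.pi * 100 * t[400:450])

spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(len(signal), d=1 / fs)

# The 5 Hz base frequency dominates and the 100 Hz burst is also visible
# as a peak, but the spectrum alone cannot tell us that the burst
# happened between 0.40 s and 0.45 s.
print(freqs[np.argmax(spectrum)])  # 5.0
```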

2.2 Functionality of Continuous Wavelet Transformation (CWT)

Wavelets are mathematical functions that are often referred to as mini waves. In contrast to the infinite (-infinity to +infinity) sine functions used for FT, the wavelet transformation uses:

  1. Different families and types of wavelets with differing compactness and smoothness,
  2. which have zero mean and are limited (finite) in time.

The different wavelet shapes enable us to choose the one that best fits the features that we are looking for in our signal. The most common wavelets for CWT are the Mexican hat, Morlet and Gaussian wavelets (Figure 2). They are also called “mother wavelets”.

Figure 2: Shape of the Mexican hat, Morlet and Gaussian mother wavelets.

PS: The Python package “PyWavelets” used in this post provides further mother wavelets that are compatible with CWT. For details, please read the PyWavelets API references.

Figure 2 also demonstrates the zero mean and the time limitation of the mother wavelets. Both of these conditions allow localization in time and frequency at the same time. Additionally, they make the wavelet transformation integrable and invertible. CWT can be described by the following equation:
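The standard textbook definition is reproduced below, where $x(t)$ is the signal, $\psi$ the mother wavelet, $\psi^{*}$ its complex conjugate, $a > 0$ the scale, $b$ the shift, and $1/\sqrt{a}$ an energy-normalization factor:

```latex
\[
  X_w(a, b) = \frac{1}{\sqrt{a}} \int_{-\infty}^{\infty}
  x(t)\, \psi^{*}\!\left(\frac{t - b}{a}\right) \mathrm{d}t
\]
```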

Consequently, the wavelet transformation uses the mother wavelets to divide a 1D to ND time series or image into scaled components. The transformation is based on the concepts of scaling and shifting.

  • Scaling: stretching or shrinking the signal in time by the scaling factor.
  • Shifting: moving the differently-scaled wavelets from the beginning to the end of the signal.

The scale factor corresponds to how much a signal is stretched in time, and it is inversely proportional to frequency. This means that the higher the scale, the more stretched the wavelet and the lower the frequency it captures (Figure 3).

Figure 3: Demonstration of a shrunken and a stretched Morlet mother wavelet in time. The scale factor is inversely related to frequency.

Accordingly, this helps:

  • stretched wavelets to capture slow changes; and
  • shrunken wavelets to capture abrupt changes in the signal.

The wavelets with different scales are shifted along the entire signal and multiplied by its sampling interval to obtain physical significance, resulting in coefficients that are a function of the wavelet scale and shift parameters. For example, a signal with 100 time steps transformed with 32 scales (in a range from 1 to 33) results in 3,200 coefficients. This enables CWT to better characterize oscillatory behavior in signals.
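As a quick check of the arithmetic above, applying pywt.cwt to a toy 100-step signal with 32 scales indeed yields a 32 x 100 coefficient matrix, i.e. 3,200 coefficients:

```python
import numpy as np
import pywt  # pip install PyWavelets

signal = np.sin(2 * np.pi * np.linspace(0, 5, 100))  # toy 100-step signal
scales = np.arange(1, 33)                            # 32 scales (1 to 32)

# pywt.cwt returns the coefficients and the corresponding frequencies
coefficients, frequencies = pywt.cwt(signal, scales, "morl")
print(coefficients.shape)  # (32, 100)
print(coefficients.size)   # 3200
```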

If we apply CWT to our non-stationary signal example and visualize the resulting coefficients in a scalogram, we obtain the following result.

Figure 4: Comparison between a non-stationary signal with a burst (time domain) and its CWT (time and frequency domain) shown as a scalogram

The scalogram in Figure 4 indicates where most of the energy (see the color bar to the right of the scalogram) of the original signal is contained in time and frequency. Furthermore, we can see that the characteristics of the signal are now displayed in highly resolved detail. You can see the abrupt changes of the burst, detected by shrunken wavelets with scales 1 and 2, and the slow changes of the sine wave, detected by stretched wavelets with scales of 15 up to 25.

PS: The abrupt changes are often the most important part of the data, both perceptibly and in terms of the information that they provide.

Such a visualization of the CWT coefficients as a 2D scalogram can be used to improve the distinction between varying types of a signal. In an industrial context, this enables differentiating between different production processes in a machine (process monitoring) and identifying faults in components like bearings, machines or tools (condition monitoring), as well as quality issues (quality monitoring), based on, for example, non-stationary vibration sensor signals. I hope that this gives you a better understanding of how powerful CWT can be for data analysis. Now let's see how you can use this technique in machine and deep learning.

3 Classification Through CWT

In section 2.2, we saw that CWT transforms a 1D time series into 2D coefficients. The coefficients represent the time, frequency and characteristics of a signal, and thus carry much more information than the raw time series or FT alone (Figure 4). The goal of this section is to use this information as a basis for classification, either by pattern recognition (section 3.3) or by feature extraction (section 3.4). Before that, I provide a brief overview of the selected example dataset (section 3.1) and how to apply CWT to this data (section 3.2).

3.1 Human Activity Recognition (HAR) Dataset Used

The open HAR dataset contains smartphone sensor (accelerometer and gyroscope) measurements of different people while they are undertaking the following six activities: walking, walking upstairs, walking downstairs, sitting, standing and laying.

There are 7,352 train and 2,947 test samples (measurements) with a 50% overlap between consecutive samples. Each sample comprises nine signals with a fixed size of 128 readings/time steps.

You can download and read more about the dataset at this link.

We load the almost raw inertial signals, in which only the gravity effect has been filtered out.

useful information: shapes (n_samples, n_steps, n_signals) of X_train: (7352, 128, 9) and X_test: (2947, 128, 9); all X’s have a mean of 0.10 and a standard deviation of 0.40

As you can see, the signals are almost normalized.

If you plot the body accelerometer and body gyroscope signals of two different activities, you will see dynamically changing (non-stationary) signals for each sample (Figure 5). For better clarity, we will not plot the total accelerometer values.

Figure 5: Example visualization of the activities walking and laying through the body accelerometer and body gyroscope smartphone signals from the HAR dataset

Given the dynamic behavior of the signals (Figure 5), this dataset seems ideal for applying wavelet transformation.

3.2 Application of the CWT in Python

As already mentioned (section 2.2), we can manually distinguish between different events through the visualization of the CWT coefficients. Is this also possible for the HAR dataset?

To find this out, we must first install the Python package PyWavelets with “pip install PyWavelets” or “conda install pywavelets”, which we can then use to apply the wavelet transformation to our dataset.

Second, we must define a suitable mother wavelet and scale range for the continuous wavelet function pywt.cwt. For this kind of signal (Figure 5), we choose the Morlet mother wavelet (Figure 2) because its shape fits best. To select an appropriate range of scales, let’s consider the CWT coefficients of three different ranges (32, 64 and 128), represented as scalograms (Figure 6).

Figure 6: The information about the behavior of a signal increases with the expanding range of scales.

In general, a smaller range of scales (in our example, 32) puts more focus on abrupt changes. As already mentioned, these sudden changes are often the most important characteristics. In contrast, a wider range of scales (in our example, 64 or 128) provides more information (about slow changes), which can lead to a better classification accuracy. However, you will need a deeper CNN for the second option.

For the following illustration, a scale range of 64 seems like a useful compromise to reach a good prediction accuracy.

Figure 7: Distinguishing the different HAR activities based on the visualization of the continuous wavelet transformed sensor data via a scalogram

As you can see (Figure 7), it is feasible to manually distinguish between the different activities through the visualization of the CWT coefficients via scalograms. Please feel free to choose other signals or samples; you will see that each signal is more or less convenient for distinguishing between the six activities. For example, you will not reach a good classification accuracy for the non-moving activities like sitting, standing and laying using only the total acceleration sensor data in x, y and z.

I am certain that you would not like to consider and compare thousands of scalograms manually to determine the activity of each sample. For this reason, we first use a convolutional neural network (section 3.3) and then a feature extraction technique plus a classifier (section 3.4) to classify the different activities of the HAR dataset automatically.

3.3 Pattern Detection Using Convolutional Neural Networks (CNN)

A CNN is highly efficient at learning characteristic patterns of labels in images. This kind of neural network can also handle the 2D CWT coefficients like the pixels of an image to predict the corresponding activity. Let’s see how this combination works.

However, before we can start to feed a CNN, we must:

  1. transform the signals of the HAR dataset using the pywt.cwt function; and
  2. bring the resulting coefficients into a suitable format.

In this case, we again choose the Morlet mother wavelet and a scale range of 64 for the pywt.cwt function, as in section 3.2. Additionally, we resize all coefficient matrices (64x128) to a square shape (64x64). This step is not absolutely necessary, but it saves many parameters and computation resources, and at the same time we do not lose too many details of the images (Figure 8).

Figure 8: Depiction of the impact of down sampling the continuous wavelet transformed 2D coefficients by using scalograms

As a second point, we must still clarify how to feed the resulting CWT coefficient matrices into the CNN. Here, the best approach is to stack the 2D coefficients (images) of the nine signals on top of each other, like the three channels red, green and blue (RGB) of a color image. This way, all dependencies between the different sensor signals can be taken into account simultaneously, which is very important. Please note: if you instead concatenate the CWT coefficients of the nine different signals into one Numpy array (image), you will have abrupt changes at the boundaries between them. These may lead the CNN to focus on the boundaries instead of the significant characteristic patterns of each signal. Accordingly, you would need a much deeper CNN to ignore this kind of noise.
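The preparation steps described above can be sketched as one helper function. This is a hypothetical sketch, not the post's exact code: it assumes the HAR input shape (n_samples, 128 time steps, 9 signals), uses scipy.ndimage.zoom as one of several possible ways to down-sample 64x128 to 64x64, and stacks the nine coefficient matrices as channels.

```python
import numpy as np
import pywt
from scipy.ndimage import zoom  # down-sampling 64x128 -> 64x64

def cwt_features(X, scales=np.arange(1, 65), wavelet="morl", size=(64, 64)):
    """Transform each signal with CWT, resize, and stack signals as channels."""
    n_samples, n_steps, n_signals = X.shape
    out = np.zeros((n_samples, size[0], size[1], n_signals), dtype=np.float32)
    for i in range(n_samples):
        for j in range(n_signals):
            coeffs, _ = pywt.cwt(X[i, :, j], scales, wavelet)  # (64, 128)
            out[i, :, :, j] = zoom(coeffs, (size[0] / coeffs.shape[0],
                                            size[1] / coeffs.shape[1]))
    return out

X_demo = np.random.randn(2, 128, 9)  # stand-in for X_train
features = cwt_features(X_demo)
print(features.shape)  # (2, 64, 64, 9)
```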

For the illustration, we build a simple CNN with LeNet-5 architecture (Figure 9) and make two improvements:

Figure 9: Convolutional neural network (CNN) LeNet-5 architecture with max pooling and ReLU activation
  • Max pooling instead of average pooling, because max pooling may reach a better performance when you would like to extract the extreme features (abrupt changes) and when the images have a high pixel density (due to the nine channels).
  • Rectified linear unit (ReLU) activation function instead of hyperbolic tangent (tanh) to overcome the vanishing gradient problem, accelerate the training and achieve a better performance.
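Such a network, with both modifications, can be sketched in Keras. This is an illustrative sketch, not the post's exact model: the filter and unit counts follow the classic LeNet-5 layout, the input shape matches the stacked 64x64x9 CWT coefficients, and the six outputs match the HAR activity classes.

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(64, 64, 9)),                # stacked CWT coefficients
    layers.Conv2D(6, kernel_size=5, activation="relu"),
    layers.MaxPooling2D(pool_size=2),              # max instead of average pooling
    layers.Conv2D(16, kernel_size=5, activation="relu"),
    layers.MaxPooling2D(pool_size=2),
    layers.Flatten(),
    layers.Dense(120, activation="relu"),          # ReLU instead of tanh
    layers.Dense(84, activation="relu"),
    layers.Dense(6, activation="softmax"),         # six HAR activities
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```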

Let’s train and evaluate the model!

After training and evaluating the model, we obtain:

Accuracy: 94.91%

Not bad! An accuracy of almost 95% is very good for this dataset, because the distinction between the similar non-moving activities (standing, laying and sitting) and the similar moving activities (walking, walking upstairs and walking downstairs) undertaken by different people is not easy for a simple model. This result can be considered evidence that the combination of CWT and CNN is a useful option for the classification of multiple non-stationary time series/signals. I am certain that better results can be reached through improvements to the CNN (adding regularization, more neurons and so on), other CNN architectures, hyperparameter tuning or other scale ranges (with or without down-sampling). So, let’s try another approach.

3.4 Feature Extraction Using Principal Components Analysis (PCA)

While we used all 2D CWT coefficients like images for a CNN in the previous section, this time we will select the most important coefficients per scale to feed a classifier. To do so, we apply PCA to extract the features with the highest variation. If you are not familiar with the logic behind PCA, I recommend this source.

To select the coefficient with the highest variation per scale, we must:

  1. apply the pywt.cwt function like before; and
  2. apply PCA for only a single component to obtain the most significant coefficient per scale.

In this way, we work with 64 features instead of 64 × 128 for each signal. The resulting new feature dataset has a 2D shape of (n_samples, 576), with 576 = 64 features × 9 signals.
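The two steps above can be sketched as follows. This is a hypothetical helper, not the post's exact code: it assumes the CWT coefficients of one signal have already been computed for all samples, with shape (n_samples, 64 scales, 128 time steps), and keeps one principal component per scale.

```python
import numpy as np
from sklearn.decomposition import PCA

def pca_per_scale(coeffs):
    """Reduce (n_samples, n_scales, n_steps) to (n_samples, n_scales):
    one PCA component per scale, fitted across samples."""
    n_samples, n_scales, _ = coeffs.shape
    features = np.zeros((n_samples, n_scales))
    for s in range(n_scales):
        features[:, s] = PCA(n_components=1).fit_transform(coeffs[:, s, :])[:, 0]
    return features

coeffs_demo = np.random.randn(10, 64, 128)  # stand-in CWT coefficients
features_demo = pca_per_scale(coeffs_demo)
print(features_demo.shape)  # (10, 64)
```

Concatenating these 64 features for all nine signals then gives the (n_samples, 576) feature matrix described above.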

For the classification, I choose XGBoost (eXtreme Gradient Boosting), which is currently one of the most successful classifiers in Kaggle competitions. XGBoost is a gradient-boosting, decision-tree-based ensemble machine learning algorithm designed for speed and performance. To figure out why XGBoost performs so well, you can read this post.

In the context of the HAR dataset, it is important to select the objective “multi:softmax” to enable the classification of more than two classes. Additionally, we apply subsampling, also known as bagging, to reduce variance and hence overfitting. The subsampling fraction determines the share of randomly selected training samples used to fit each tree.

Accuracy: 92.67%

Almost 93% accuracy is also a good result. Evidently, this approach separates slightly better between the non-moving activities like sitting and standing, and accordingly slightly worse between the different walking activities. As with every machine learning model, you can improve the performance through hyperparameter tuning, in this case for example by increasing the scale range to obtain more input features. The XGBoost classifier is sufficiently powerful to handle 576 input features. Most classifiers can independently determine which features are useful and which are not. However, you should think about selecting the features with the highest importance when you use a much larger scale range.

4 Conclusion

In this post, you have seen why and how to use the powerful CWT for non-stationary signals. The good results of CWT in combination with CNNs or with PCA plus a classifier serve as proof that these approaches are an excellent choice for classifying multiple time series per event. There are so many tunable configurations of different algorithms and models affecting the result that no generally valid procedure for this purpose can be named. In my experience, continuous (and also discrete) wavelet transformation mostly outperforms other signal processing techniques for non-stationary signals, as well as the classification accuracy of other kinds of models, like simple recurrent neural networks using the raw HAR data.

Finally, I hope that this post makes (continuous) wavelet transformation more popular in data science as well as in machine and deep learning, and that it motivates you to give it a try.
