In the prior post, we explained how clustering of time series data works. In this post, we'll do a deep dive into the code itself.
Everything will be written in Python, but most of the libraries used have R equivalents. We will stay relatively high level, but the code contains pointers to useful resources if you're looking for more.
Without further ado, let’s dive in.
0 – Data Creation
To help interpretation, let’s leverage a theoretical example: we’re trying to forecast gold prices by using information from local markets around the world.
First, we create a 1000 x 10 data frame. Each row corresponds to a unique time point, in our case a day, and each column corresponds to a different gold market. All values in our data frame are prices.
# create data
import numpy as np
import pandas as pd
n_rows, n_cols = 1000, 10  # 1000 days x 10 gold markets
rng = pd.date_range('2000-01-01', freq='d', periods=n_rows)
df = pd.DataFrame(np.random.rand(n_rows, n_cols), index=rng)
The above code simplifies our example dramatically. One conceptual issue is that our simulated prices always fall between 0 and 1, but the lessons from the code still apply.
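If that bothers you, a quick and purely illustrative tweak is to simulate each market as a random walk around a base price instead; nothing later in the walkthrough depends on this.
# optional: a random walk around a base price of 100 looks more like real prices
df = pd.DataFrame(100 + np.random.randn(n_rows, n_cols).cumsum(axis=0), index=rng)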
With a synthetic data frame created, let’s make it dirty.
# "unclean" data
df = df.apply(lambda x: make_outliers_on_col(x), axis='index')
df = df.apply(lambda x: make_nan_on_col(x), axis='index')
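make_outliers_on_col and make_nan_on_col are helpers from the accompanying notebook rather than pandas built-ins. A minimal sketch of what they might look like, assuming each one perturbs 10 random positions in a column:
def make_outliers_on_col(col, n=10):
    col = col.copy()
    idx = np.random.choice(len(col), n, replace=False)
    col.iloc[idx] = col.iloc[idx] + 10  # push values far outside the 0-1 range
    return col

def make_nan_on_col(col, n=10):
    col = col.copy()
    idx = np.random.choice(len(col), n, replace=False)
    col.iloc[idx] = np.nan
    return col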
The functions above randomly insert 10 outliers and 10 null values into our data frame. Our resulting data frame looks something like this…

1 – Data Cleaning
There are two major data cleaning steps: missing data imputation and outlier removal. Luckily, pandas has some simple built-in methods that can help us.
# interpolate missing values
df = df.interpolate(method='spline', order=1, limit=10, limit_direction='both')
# interpolate outliers
df = df.apply(lambda x: nullify_outliers(x), axis='index')
df = df.interpolate(method='spline', order=1, limit=10, limit_direction='both')
Our strategy here is very simple. We first impute all missing data using spline interpolation. We then replace all outliers with null values and again use spline interpolation.
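The nullify_outliers helper above also comes from the accompanying notebook, not from pandas. A minimal sketch, assuming a simple z-score rule where anything more than three standard deviations from the column mean is treated as an outlier:
def nullify_outliers(col, z_thresh=3):
    col = col.copy()
    z = (col - col.mean()) / col.std()
    col[z.abs() > z_thresh] = np.nan  # replace outliers with NaN for re-interpolation
    return col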
The paper suggested a variety of missing value imputation methods, some of which include interpolation (shown above), Singular Value Decomposition (SVD) Imputation, and K-Nearest-Neighbors (KNN) Imputation.
If you care about speed, SVD or interpolation are your best options. KNN may provide better results, but it’s more computationally intensive.
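For reference, a KNN-based alternative to the interpolation above could look like the snippet below. It uses scikit-learn's KNNImputer, which was not part of the original code, so treat it as a sketch rather than the paper's exact method.
# fill each missing price from the 5 most similar days across the other markets
from sklearn.impute import KNNImputer
df[:] = KNNImputer(n_neighbors=5).fit_transform(df)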
At the end of this step, we will have a data frame that looks like figure 2:

2 – Clustering
With a clean dataset, we will now look to group gold markets that have similar characteristics. Our hypothesis is that similar markets will be more easily fit by a model, thereby leading to more accurate forecasts.
The most effective type of clustering cited in the paper involved leveraging features of each time series. We will look at two types: time-series features and signal-processing features.
# TS-specific features
from statsmodels.tsa.stattools import acf, pacf
autocorrelation = df.apply(lambda x: acf(x, nlags=3), axis='index')
partial_autocorrelation = df.apply(lambda x: pacf(x, nlags=3), axis='index')
# Signal-processing-specific features
fast_fourier_transform = df.apply(lambda x: np.fft.fft(x), axis='index')
variance = df.apply(lambda x: np.var(x), axis='index')
From here, we can perform k-means clustering on each feature set. We're limiting ourselves to two features per group for simplicity, though the paper cites four potential features for each group.
import numpy as np
from scipy.cluster.vq import kmeans2

# set up ts and signal features for clustering: one row per market
# (the FFT output is complex, so we summarize it here by its mean magnitude —
#  one simple choice, not necessarily the one used in the paper)
ts_features = np.column_stack([autocorrelation.T, partial_autocorrelation.T])
signal_features = np.column_stack([np.abs(fast_fourier_transform).mean(axis=0),
                                   variance])
features = [ts_features, signal_features]

for f in features:
    # cluster the 10 markets into 2 groups
    cluster_centers, labels = kmeans2(f, 2)
    # ...
The above code assigns each of our 10 gold markets to one of two clusters, as shown below in figure 3.

Now that we have grouped our data into "similar" time series, we're ready to model each group.
3 – Forecasting Model
The paper cited a bi-directional LSTM as having the best accuracy. Despite the scary name, a bidirectional LSTM is just two LSTMs: the first is trained front to back on the regular inputs, and the second is trained back to front on the reversed input vectors.
By learning the sequence in both directions within each epoch, the model often converges faster and captures the structure in the data more completely.
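In Keras this is usually a one-line change via the Bidirectional wrapper. Here is a minimal sketch, not necessarily the paper's exact architecture; look_back is the window length defined in the next section.
# a bidirectional variant: one LSTM reads the inputs as-is, the other reads them reversed
from keras.models import Sequential
from keras.layers import Dense, LSTM, Bidirectional

bi_model = Sequential()
bi_model.add(Bidirectional(LSTM(4), input_shape=(1, look_back)))
bi_model.add(Dense(1))
bi_model.compile(loss='mean_squared_error', optimizer='adam')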
For simplicity, though, we will use a basic LSTM; the concepts extend easily to more complex model structures.
Before doing so, it's important to note that we averaged the time series values within each cluster. Some models, such as DeepAR, fit multiple time series and output a single prediction, but a vanilla LSTM requires univariate data.
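The LSTM below also expects supervised (X, y) pairs built from that averaged series. A minimal sketch of how trainX and trainY might be constructed, where look_back, n_epoch, and the choice of cluster 0 are illustrative assumptions rather than values from the paper:
# average the markets assigned to cluster 0 and build sliding-window samples
look_back, n_epoch = 3, 100                           # assumed hyperparameters
series = df.loc[:, labels == 0].mean(axis=1).values   # one averaged series per cluster

X, y = [], []
for i in range(len(series) - look_back):
    X.append(series[i:i + look_back])                 # previous look_back prices
    y.append(series[i + look_back])                   # next price to predict
trainX = np.array(X).reshape(-1, 1, look_back)        # shape (samples, 1, look_back)
trainY = np.array(y)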
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
# fit basic LSTM
model = Sequential()
model.add(LSTM(4, input_shape=(1, look_back)))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')
model.fit(trainX, trainY, epochs=n_epoch, batch_size=1, verbose=2)
The above code was run on each of the four datasets (two clusters for each of the two feature groupings), and their accuracies are shown below in figure 4.

Judging visually, there doesn't appear to be much difference between our forecasts, so let's take a look at the Root Mean Squared Error (RMSE) for each of our models.
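As a reference for how those numbers can be produced, here is a sketch; testX and testY are assumed to come from a standard train/test split that isn't shown above.
# RMSE on the held-out window
from sklearn.metrics import mean_squared_error
preds = model.predict(testX)
rmse = np.sqrt(mean_squared_error(testY, preds))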

Figure 5 adds a bit more granularity to our evaluation. TS features performed better than signal features on cluster 1, but worse on cluster 2. Overall, the average RMSE across the two clusters is similar for both feature groups.
Now you might be wondering why performance is so similar between groups. Well, if you recall, our data-generating mechanism was the same for all of the time series.
The purpose of clustering is to pick up on systematic differences between our time series so that we can then develop a specialized model for each.
If the data have the same underlying data generating mechanism, clustering will not help predictive performance.
4 – Next Steps
The complete code for the above walkthrough can be seen here. However, for a modeling project with real-world data, it's advisable to go further.
For instance, leveraging subject matter knowledge about how the time series should be grouped can be very informative. One example for our gold data would be to cluster markets in similar geographic locations. If you don't have subject matter knowledge, here are some more ideas:
- Cluster on more features
- Cluster on both TS and signal-based features at the same time (see the sketch after this list)
- Use more complex Deep Learning structures
- Introduce static features (the paper discussed architectures for this)
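For the second idea, combining the two feature groups is just a matter of stacking the ts_features and signal_features matrices built earlier before clustering:
# cluster on TS and signal features together
combined = np.column_stack([ts_features, signal_features])
cluster_centers, labels = kmeans2(combined, 2)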
Thanks for reading! I’ll be writing 30 more posts that bring academic research to the DS industry. Check out my comment for links to the main source for this post and some useful resources.