Time Series Hierarchical Clustering using Dynamic Time Warping in Python

Andrew
Towards Data Science
Nov 13, 2019


Let us consider the following task: we have a bunch of evenly sampled time series of different lengths. The goal is to cluster the time series by identifying the general patterns present in the data.

Here I’d like to present one approach to this task. We will use hierarchical clustering with the DTW algorithm as the comparison metric for the time series. This solution worked well on HR data (employee historical scores). For other types of time series, DTW may work worse than metrics such as CID (Complexity Invariant Distance), MAE, or correlation.

I will skip the theoretical explanations of hierarchical clustering and the DTW algorithm and focus on why I selected this combination:

  1. Hierarchical clustering is simple, flexible, tunable (linkage criteria) and allows us to leave some trajectories unclustered
  2. DTW allows us to compare time series of different lengths and, in my experience, works well with infrequently sampled data
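As a quick refresher before we use it: DTW finds the cheapest alignment between two series, which is why series of different lengths can be compared directly. The `dtaidistance` library used below implements this efficiently; the following is only a minimal sketch of the classic recurrence, not the library's code:

```python
import numpy as np

def dtw_distance(s1, s2):
    # Classic O(n*m) DTW: each cell holds the local difference plus the
    # cheapest of the three neighboring alignments (match, insert, delete).
    n, m = len(s1), len(s2)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(s1[i - 1] - s2[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],      # s2 element repeated
                                 cost[i, j - 1],      # s1 element repeated
                                 cost[i - 1, j - 1])  # one-to-one match
    return cost[n, m]

# Series of different lengths that follow the same shape align at zero cost:
dtw_distance([0.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0])
```

Note that `dtaidistance` returns the square-root-accumulated variant by default, so its absolute values differ from this sketch; the ordering of distances, which is what the clustering relies on, behaves the same way.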

Ok, here we go! Our imports:

import random
from copy import deepcopy
from scipy import interpolate
import numpy as np
from dtaidistance import dtw
import matplotlib.pyplot as plt
from _plotly_future_ import v4_subplots
import plotly.graph_objects as go
from plotly.subplots import make_subplots

Some parameters for time series generation and our threshold:

  • NUM_OF_TRAJECTORIES number of trajectories that we have to cluster
  • MIN_LEN_OF_TRAJECTORY and MAX_LEN_OF_TRAJECTORY lower and upper length bounds for any trajectory
  • THRESHOLD our threshold for DTW
NUM_OF_TRAJECTORIES = 200
MIN_LEN_OF_TRAJECTORY = 16
MAX_LEN_OF_TRAJECTORY = 40
THRESHOLD = 0.50

For simplicity, all our trajectories will lie between -1 and 1. Also, I added some smoothing.

trajectoriesSet = {}

for item in range(NUM_OF_TRAJECTORIES):
    length = random.choice(list(range(MIN_LEN_OF_TRAJECTORY, MAX_LEN_OF_TRAJECTORY + 1)))
    tempTrajectory = np.random.randint(low=-100, high=100, size=int(length / 4)).astype(float) / 100

    oldScale = np.arange(0, int(length / 4))
    interpolationFunction = interpolate.interp1d(oldScale, tempTrajectory)

    newScale = np.linspace(0, int(length / 4) - 1, length)
    tempTrajectory = interpolationFunction(newScale)

    trajectoriesSet[(str(item),)] = [tempTrajectory]

Note that all trajectories are stored as dictionary values of list type (for convenience when we later merge them into groups). For the same reason, the names of the trajectories are stored as tuples.
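As a minimal illustration of why tuples and lists are convenient here (the names and values below are made up): merging two entries is just tuple concatenation for the key and list concatenation for the value.

```python
# Two single trajectories, each keyed by a 1-tuple and stored in a list
trajectoriesSet = {('0',): [[0.1, 0.5, 0.2]],
                   ('1',): [[0.2, 0.4, 0.1]]}

# Merging into one cluster: concatenate the tuple keys and the list values
unionKey = ('0',) + ('1',)
trajectoriesSet[unionKey] = trajectoriesSet.pop(('0',)) + trajectoriesSet.pop(('1',))

# The resulting cluster ('0', '1') now holds both raw trajectories,
# so later comparisons can still reach every individual member.
```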

Our algorithm is the following:

  1. We find a pair of closest entities (trajectory-trajectory or trajectory-cluster or cluster-trajectory or cluster-cluster)
  2. Group them into a single cluster if their distance is lower than the THRESHOLD
  3. Repeat step 1
  4. We stop if step 2 fails, or if we end up with one big cluster containing all trajectories (which means that our THRESHOLD is too large)

The first part of the algorithm:

trajectories = deepcopy(trajectoriesSet)
distanceMatrixDictionary = {}

iteration = 1
while True:
    distanceMatrix = np.empty((len(trajectories), len(trajectories),))
    distanceMatrix[:] = np.nan

    for index1, (filter1, trajectory1) in enumerate(trajectories.items()):

        for index2, (filter2, trajectory2) in enumerate(trajectories.items()):

            if index1 >= index2:
                continue

            unionFilter = tuple(sorted(filter1 + filter2))

            if unionFilter in distanceMatrixDictionary.keys():
                distanceMatrix[index1][index2] = distanceMatrixDictionary.get(unionFilter)
                continue

            metric = []
            for subItem1 in trajectory1:
                for subItem2 in trajectory2:
                    metric.append(dtw.distance(subItem1, subItem2, psi=1))

            metric = max(metric)

            distanceMatrix[index1][index2] = metric
            distanceMatrixDictionary[unionFilter] = metric

The dictionary distanceMatrixDictionary caches already calculated distances.

The numpy array distanceMatrix is filled with np.nan at the beginning of each iteration. It only keeps the mapping between pairs of indexes and calculated distances, and it could be removed after adding the same functionality to distanceMatrixDictionary.

This part of the code lets us compare all possible options (trajectory-trajectory, trajectory-cluster, cluster-trajectory, or cluster-cluster):

metric = []
for subItem1 in trajectory1:
    for subItem2 in trajectory2:
        metric.append(dtw.distance(subItem1, subItem2, psi=1))

metric = max(metric)

The last line above, metric = max(metric), is the linkage criterion called ‘complete linkage’. It worked best for me, but you can try other criteria or even a custom one.
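To make the linkage options concrete, here is a sketch of a small helper that swaps the aggregation function; the `linkage` name and the criterion strings are my own, not part of the article’s code:

```python
def linkage(distances, criterion="complete"):
    # distances: all pairwise DTW distances between members of two clusters
    if criterion == "complete":
        return max(distances)                      # farthest pair (used above)
    if criterion == "single":
        return min(distances)                      # closest pair
    if criterion == "average":
        return sum(distances) / len(distances)     # mean of all pairs
    raise ValueError(f"unknown criterion: {criterion}")
```

Complete linkage tends to produce compact clusters because one distant member is enough to block a merge; single linkage is more permissive and can chain loosely related trajectories together.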

Okay, distances are calculated, let us proceed with the grouping process.

We find the lowest distance and the pair of indexes that gives us this distance.

Here, for simplicity, we work with only the first pair. Even if two, three, or more pairs share the same distance, the rest will be processed during subsequent iterations.

    minValue = np.min(list(distanceMatrixDictionary.values()))

    if minValue > THRESHOLD:
        break

    minIndices = np.where(distanceMatrix == minValue)
    minIndices = list(zip(minIndices[0], minIndices[1]))

    minIndex = minIndices[0]

After getting the pair of indexes, we simply define the entities’ names and values, combine them, put the combination into the dictionary, and remove the single entities from it:

    filter1 = list(trajectories.keys())[minIndex[0]]
    filter2 = list(trajectories.keys())[minIndex[1]]

    trajectory1 = trajectories.get(filter1)
    trajectory2 = trajectories.get(filter2)

    unionFilter = tuple(sorted(filter1 + filter2))
    trajectoryGroup = trajectory1 + trajectory2

    trajectories = {key: value for key, value in trajectories.items()
                    if all(element not in unionFilter for element in key)}
    distanceMatrixDictionary = {key: value for key, value in distanceMatrixDictionary.items()
                                if all(element not in unionFilter for element in key)}

    trajectories[unionFilter] = trajectoryGroup

After that, we repeat the previous step until we have nothing to cluster.

I have described the general approach, but this algorithm can be simplified, sped up, and modified to avoid recalculations.

As a result, we get groups like this:

In this cluster, we see 3 time series of different lengths. All of them share the same general pattern: a local minimum in the first third, a global peak in the second half, and a global minimum at the end.

Some more results (for each cluster, the left subplot shows the original trajectory lengths; the right one shows them rescaled to MAX_LEN_OF_TRAJECTORY for comparison):

Cluster #1: 2 items
Cluster #2: 2 items
Cluster #3: 3 items
Cluster #4: 3 items
Cluster #5: 3 items

Depending on THRESHOLD value we can make our clusters bigger (and more generalized) or smaller (more detailed).

What can we improve if the current approach does not perform well on another dataset?

  1. We can try to replace DTW with another distance metric
  2. We can preprocess the time series further: scale/rescale, smooth, or remove outliers
  3. We can try to use different thresholds
  4. We can try to change linkage criteria
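As an example of the first point, here is a sketch of one possible alternative metric, assuming we first resample both series to a common length with linear interpolation and then take the mean absolute error; `mae_distance` is a hypothetical name, not from the article’s code:

```python
import numpy as np

def mae_distance(s1, s2):
    # Hypothetical drop-in replacement for dtw.distance: resample both
    # series to the longer length, then compute the mean absolute error.
    s1 = np.asarray(s1, dtype=float)
    s2 = np.asarray(s2, dtype=float)
    target = max(len(s1), len(s2))

    def resample(s):
        # Linear interpolation onto `target` evenly spaced points
        return np.interp(np.linspace(0, len(s) - 1, target),
                         np.arange(len(s)), s)

    return float(np.mean(np.abs(resample(s1) - resample(s2))))
```

Unlike DTW, this metric compares values point-by-point after resampling, so it penalizes time shifts; that can be either a drawback or exactly the behavior you want, depending on the data.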

After finding the optimal hyperparameters, it is possible to refactor the code and speed up the calculations.

Code to use is available here:
https://github.com/avchauzov/_articles/blob/master/1.1.trajectoriesClustering.ipynb
