Multivariate Time Series Clustering Using Growing Neural Gas and Spectral Clustering

A different graph theory-based approach to time series clustering

Halil Ertan
Towards Data Science


Clustering is an unsupervised methodology with many fields of use. It can also be applied to time series, although it is not as popular there as forecasting and anomaly detection. The idea is to analyze an overall system that has many different KPIs flowing through time, represent the behaviors of the system as states, and cluster the similar ones together. In other words, we take a snapshot of the system for a specific time period and detect the distinct patterns.

In this article, I describe a novel implementation that utilizes Growing Neural Gas (GNG) and Spectral Clustering together for this purpose.


Scenario Definition & Dataset Inspection

Let's assume a system that consists of several devices, where each device is represented by around 100 different KPIs (92 in the simplified dataset used in the code below) and these KPIs flow through time; in other words, a multivariate time series describes the general behavior of the system. The target is to detect the different behaviors of the system and cluster them over the defined time period. In a similar sense, we can also get insight into the outlier cases we encounter. To keep the size of the data small and the calculations simple and fast, I work on a simplified dataset; take a look at the simplified version of it in Figure 1. You might simplify this dataset even further by ignoring the device column and studying a single device that produces multiple KPIs over time.

To keep the focus on the study, I simply assume the data is already preprocessed: missing values are handled and the data is already normalized (a minimal sketch of this assumed preprocessing follows Figure 1).

Image by the Author — Figure 1
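
For completeness, here is a minimal sketch of the kind of preprocessing assumed by the later snippets; the input file name is hypothetical and the scaler choice is illustrative. It produces the df_scaled_data array and the df_static_cols frame used further down.

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.read_csv("kpi_data.csv")  # hypothetical input with DEVICE, DATETIME and KPI columns
feature_cols = [c for c in df.columns if c not in ("DEVICE", "DATETIME")]

# keep identifier/time columns aside; they are merged back after the GNG step
df_static_cols = df[["DEVICE", "DATETIME"]]

# simple imputation and scaling; df_scaled_data feeds the GNG training below
scaler = MinMaxScaler()
df_scaled_data = scaler.fit_transform(df[feature_cols].ffill())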

1- Apply GNG

GNG is mostly used for learning the topological structure of data in graph format. The GNG algorithm was first introduced by Fritzke in 1995, and the original paper explains its working principle in a straightforward way. It is a simple unsupervised neural network approach, like Self-Organizing Maps (SOMs). However, neither GNG nor SOM includes common neural network structures and steps such as layers, feedforward passes, and backpropagation. In both GNG and SOM, the weight concept basically corresponds to the location of a node in the feature space and is represented by a vector with the size of the feature set.

Both of them are iterative approaches and rely on similar best-matching-unit/nearest-unit reasoning during their iterations. However, they also have basic differences in their implementations. Unlike SOM, you do not fix the nodes at the beginning of the process in GNG; they are determined dynamically through the iterations, which is where "growing" comes from. At each insertion, GNG adds a node to the region of the topology where the model has performed worst so far. In this manner, it does not grow uniformly and aims to cover the entire data. GNGs are also popular for generating entertaining animations from images. I use the NeuPy package to implement GNG in Python, which offers a simple abstraction.
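
To give a feel for what a single GNG iteration does, here is a minimal, simplified sketch of the core adaptation step (the nearest node and its topological neighbours are pulled toward the current sample). It leaves out edge ageing, error accumulation, and node insertion, and all names are illustrative rather than NeuPy's API.

import numpy as np

def gng_adapt_step(sample, node_positions, neighbours, step=0.1, neighbour_step=0.01):
    # find the best matching unit (the node closest to the sample)
    distances = np.linalg.norm(node_positions - sample, axis=1)
    bmu = int(np.argmin(distances))
    # pull the winner toward the sample
    node_positions[bmu] += step * (sample - node_positions[bmu])
    # pull its topological neighbours slightly as well
    for nb in neighbours.get(bmu, []):
        node_positions[nb] += neighbour_step * (sample - node_positions[nb])
    return bmu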

Below is a simple introductory code snippet that builds a GNG model for our data. One of the important parameters that directly affects the performance and accuracy of the model is the number of nodes, in other words, the resolution of the data representation. Finding the optimum node count is not straightforward; in our case you might apply some simple reasoning based on the number of unique devices and the working time window. Note that the graph attribute of the GNG model in the NeuPy package gives insight into the topological structure of the data.

from neupy import algorithms
import pandas as pd
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans
from scipy import stats
gng = algorithms.GrowingNeuralGas(
    n_inputs=df_scaled_data.shape[1],  # number of input features (KPIs)
    n_start_nodes=5,
    shuffle_data=True,
    verbose=True,
    step=0.1,
    neighbour_step=0.01,
    max_edge_age=50,
    max_nodes=100,
    n_iter_before_neuron_added=100,
    after_split_error_decay_rate=0.5,
    error_decay_rate=0.995,
    min_distance_for_update=0.05
)
epoch = 100
gng.train(df_scaled_data, epochs=epoch)  # only including features, features are named as '0', '1', '2' ...
g = gng.graph
node_list = g.nodes

After training the GNG model, we have one more task to complete in this part: finding out which sample is matched to which node. We can simply use the KNN algorithm with n_neighbors=1 to match each sample with its nearest node.

number_of_features = 92  # we have 92 different KPIs in our data
node_koor_node = []
for node in gng.graph.nodes:
    row = node.weight[0].tolist()
    row.append(node_list.index(node))
    node_koor_node.append(row)
df_train = pd.DataFrame(node_koor_node)
df_test = pd.DataFrame(df_scaled_data)
X_train = df_train.iloc[:, 0:number_of_features]
y_train = df_train.iloc[:, number_of_features]
X_test = df_test.iloc[:, 0:number_of_features]
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)
knn_result = knn.predict(X_test)
scaled_sample_node_matching = []
for i, data in enumerate(df_scaled_data):
    row = data.tolist()
    node_idx = knn_result[i]
    row.append(node_idx)
    scaled_sample_node_matching.append(row)
df_scaled_sample_node_matching = pd.DataFrame(scaled_sample_node_matching)
# use string column names ('0', '1', ...) so the feature columns can be selected by str(i) later on
df_scaled_sample_node_matching.columns = df_scaled_sample_node_matching.columns.astype(str)
df_scaled_sample_node_matching = df_scaled_sample_node_matching.rename(columns={str(number_of_features): "node"})
df_node_sample_match = pd.merge(df_static_cols, df_scaled_sample_node_matching, left_index=True,
                                right_index=True)  # df_static_cols includes static columns and datetime

2- Include Temporal Behaviour

OK, so far so good, but we have overlooked something important about time series: GNG does not take the temporal behavior, in other words the temporal dependencies between samples, into consideration. We have to include this temporal behavior somehow. For this reason, we simply discard all edges constructed by GNG in the previous step while keeping the nodes and their positions in space. The edges are then recalculated by taking into account the transitions between consecutive time periods for the same devices. To clarify, take a look at the following figure.

Image by the Author — Figure 2

In the previous step, we determined the nodes spread over the data space and the samples they contain. In other words, we know which device is on which node at what time. By leveraging this information, we can compose the adjacency matrix that shows the transitions between nodes. The adjacency matrix is a very important representation of a graph structure. In our case, we do not compose a binary adjacency matrix consisting only of 0s and 1s. Instead, we calculate the probability of transition between nodes (in other words, the degree of closeness) and keep that value in each cell of the adjacency matrix. In this sense, we construct a weighted adjacency matrix, also called an affinity matrix.

Additionally, we also compose the degree matrix of the graph, updated with the new edges. Degree matrices contain zeros in every cell except the diagonal and basically specify the degree of each node, that is, the (weighted) number of edges going out from the corresponding node.
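
As a tiny made-up illustration of these two matrices, suppose we had only three nodes and counted the transitions between consecutive timestamps as below; row-normalizing the counts gives the affinity matrix, and the row sums give the diagonal of the degree matrix.

import numpy as np

# hypothetical transition counts between 3 nodes (row = from, column = to)
counts = np.array([[0., 4., 1.],
                   [2., 0., 3.],
                   [0., 5., 0.]])

row_sums = counts.sum(axis=1)            # weighted out-degree of each node -> [5, 5, 5]
degree = np.diag(row_sums)               # degree matrix
affinity = counts / row_sums[:, None]    # transition probabilities, e.g. first row [0.0, 0.8, 0.2]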

device_list = df_node_sample_match['DEVICE'].unique()
n = len(gng.graph.nodes)
mat_transition = np.zeros((n, n))
previous_node = -1
for device in device_list:
    df_device = df_node_sample_match.loc[df_node_sample_match['DEVICE'] == device]
    df_device = df_device.sort_values('DATETIME')
    for i, row in df_device.iterrows():
        current_node = int(row['node'])
        if previous_node != -1:
            mat_transition[previous_node, current_node] = mat_transition[previous_node, current_node] + 1
        else:
            print('It is the starting node, no transition!')
        previous_node = current_node
    previous_node = -1
df_adjacency = pd.DataFrame(mat_transition)
df_adjacency['sum_of_rows'] = df_adjacency.sum(axis=1)
list_degree = df_adjacency['sum_of_rows'].tolist()
degree_matrix = np.diag(list_degree)
# calculating transition probability
df_affinity = df_adjacency.loc[:, :].div(df_adjacency["sum_of_rows"], axis=0)
df_affinity = df_affinity.drop(["sum_of_rows"], axis=1)
df_affinity = df_affinity.fillna(0)

3- Apply Spectral Clustering

Spectral clustering is a graph theory-based clustering technique that also acts as a kind of dimensionality reduction: it projects a high-dimensional dataset onto a few eigenvectors before the final clustering step. It aims to detect nodes that are close in terms of connectivity in the topology and group them together, and it is applicable to many different data forms. The main steps of spectral clustering are the following: build an affinity matrix, derive the graph Laplacian from it and the degree matrix, compute the eigenvectors belonging to the smallest eigenvalues, and cluster the rows of those eigenvectors, e.g. with k-means.

Main steps of Spectral Clustering, source

To better understand spectral clustering, take a look at the behavior of k-means and spectral clustering on the circles data in Figure 3. Spectral clustering utilizes the spectrum (eigenvectors) of the Laplacian matrix (graph Laplacian), which is calculated using the affinity matrix and the degree matrix. Note that we already calculated both matrices in the previous step, so we are ready to form the Laplacian matrix.

Clustering of circles data with two different approaches, Image by the Author — Figure 3
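
A comparison like the one in Figure 3 can be reproduced in a few lines with scikit-learn; this is only a sketch and the parameters are illustrative.

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans, SpectralClustering
from sklearn.datasets import make_circles

# two concentric circles: connectivity matters, raw distance to a centroid does not
X, _ = make_circles(n_samples=500, factor=0.4, noise=0.05, random_state=0)

kmeans_labels = KMeans(n_clusters=2, random_state=0).fit_predict(X)
spectral_labels = SpectralClustering(n_clusters=2, affinity="nearest_neighbors",
                                     random_state=0).fit_predict(X)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(X[:, 0], X[:, 1], c=kmeans_labels, s=10)
axes[0].set_title("K-means")
axes[1].scatter(X[:, 0], X[:, 1], c=spectral_labels, s=10)
axes[1].set_title("Spectral clustering")
plt.show()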

While the Laplacian matrix can be calculated with the equation L = D - A, where D is the degree matrix and A is the adjacency matrix (in our case the affinity matrix), the normalized Laplacian matrix is defined as in the following equation:

L_norm = I - D^(-1/2) A D^(-1/2) (normalized Laplacian equation)

We prefer the normalized Laplacian over the plain Laplacian matrix for our problem. It can be calculated with the following code snippet.

I = np.identity(df_affinity.shape[0])
sqrt = np.sqrt(degree_matrix)
D_inv_sqrt = np.linalg.inv(sqrt)
normalised_laplace = I - np.dot(D_inv_sqrt, df_affinity).dot(D_inv_sqrt)
df_normalised_laplace = pd.DataFrame(normalised_laplace)

As the second step, we determine the eigenvectors corresponding to the k smallest eigenvalues, where k refers to the number of clusters. Choosing the optimum cluster number is not trivial; you can find an explanatory approach to determine it here. For instance, the second-smallest eigenvalue of the normalized Laplacian gives insight into the connectivity of the graph. We set the cluster number to 10 in our example. To find the eigenvectors and eigenvalues of the graph Laplacian, we use singular value decomposition (SVD). An alternative is the eigh method of the NumPy package, as shown after the snippet below.

number_of_nodes = df_affinity.shape[0]
cluster_number = 10
u, s, vN = np.linalg.svd(normalised_laplace, full_matrices=False)
vN = np.transpose(vN)
eigen_vecs = vN[:, number_of_nodes - cluster_number:number_of_nodes]  # 10 eigenvectors with the smallest 10 eigenvalues
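
As mentioned above, np.linalg.eigh is an alternative to SVD; it returns eigenvalues in ascending order, so the first k columns are the ones we want. Note that eigh assumes a symmetric matrix, which is why this sketch symmetrizes the Laplacian first (our affinity matrix, built from directed transitions, is not necessarily symmetric).

# alternative: eigendecomposition with np.linalg.eigh (eigenvalues come back in ascending order)
sym_laplace = (normalised_laplace + normalised_laplace.T) / 2   # symmetrize as a precaution
eigen_vals, eigen_vecs_alt = np.linalg.eigh(sym_laplace)
eigen_vecs_alt = eigen_vecs_alt[:, :cluster_number]             # eigenvectors of the k smallest eigenvalues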

As the last step, the resulting subset of eigenvectors is used to cluster the data; it is simply the input for k-means. After clustering, we match the nodes with their corresponding clusters.

# Clustering
kmeans = KMeans(n_clusters=cluster_number, max_iter=1000, random_state=0)
kmeans.fit(eigen_vecs)
labels = kmeans.labels_
cluster_centers = kmeans.cluster_centers_
# node-cluster matching
clusters = dict()
for i, node in enumerate(node_list):
    if labels[i] not in clusters:
        clusters[labels[i]] = list()
    clusters[labels[i]].append(i)

4- Determine Anomaly Clusters and Root Causes

Is clustering the final destination? Probably not. One natural follow-up question is how to define the anomaly clusters, in other words, how to find the devices and the corresponding times that behaved abnormally in our case. Furthermore, the reason for this abnormality is also a concern. It is obvious that minor clusters tend to be anomalies. In this manner, we might, for instance, conclude that clusters representing less than 10% of the entire data are anomaly clusters, while expecting a few clusters to cover the majority of the data (a small sketch of this rule follows the code below). This is still an open research topic for us; rather than rule-based approaches, more advanced techniques are welcome. Additionally, domain experts might also interpret the clustering results and give an opinion on the anomalies.

After defining the anomaly clusters, being able to detect the root causes of these anomalies would be an asset for the study. At this point, the differences between each KPI's distribution in the anomaly clusters and in the normal clusters come into play. To measure and rank the differences between two distributions, you can take advantage of the Kolmogorov-Smirnov test. You can also find different options here. For each KPI (feature) of the samples in an anomaly cluster, we apply the Kolmogorov-Smirnov test against the corresponding KPI of the normal clusters. The more dissimilar a feature is, the more likely it is to be a reason for the anomaly.

# We somehow determine the anomaly clusters
anomaly_clusters_list = [3, 4, 5, 7, 8, 9]
normal_cluster_list = [0, 1, 2, 6]
node_cluster_matching = []
for i in range(len(node_list)):
    for j in range(cluster_number):
        if i in clusters[j]:
            node_cluster_matching.append([i, j])
df_node_cluster_matching = pd.DataFrame(node_cluster_matching)
df_node_cluster_matching.columns = ['node', 'cluster']
df_sample_node_cluster = pd.merge(df_node_sample_match, df_node_cluster_matching, on=['node'], how='inner')
list_ = np.arange(0, number_of_features, 1).tolist()
list_ = [str(i) for i in list_]
list_.append('cluster')
df_anomaly_clusters_data = df_sample_node_cluster.loc[
    df_sample_node_cluster['cluster'].isin(anomaly_clusters_list)]
df_anomaly_clusters_data = df_anomaly_clusters_data.loc[:, df_anomaly_clusters_data.columns.isin(list_)]
df_anomaly_clusters_data = df_anomaly_clusters_data.reset_index(drop=True)
df_normal_clusters_data = df_sample_node_cluster.loc[
    df_sample_node_cluster['cluster'].isin(normal_cluster_list)]
df_normal_clusters_data = df_normal_clusters_data.loc[:, df_normal_clusters_data.columns.isin(list_)]
df_normal_clusters_data = df_normal_clusters_data.reset_index(drop=True)
dict_of_cluster_feature_pvaluelist = dict()
for c in anomaly_clusters_list:
    filtered_anomaly_data = df_anomaly_clusters_data.loc[df_anomaly_clusters_data['cluster'] == c]
    pvalue_feature_list = []
    for i in range(number_of_features):
        anomaly_clusters_data_feature = filtered_anomaly_data[str(i)].to_numpy().tolist()
        normal_clusters_data_feature = df_normal_clusters_data[str(i)].to_numpy().tolist()
        result = stats.ks_2samp(anomaly_clusters_data_feature, normal_clusters_data_feature)
        print('Result of cluster ' + str(c) + ' feature ' + str(i))
        print(result)
        pvalue = result.pvalue
        if pvalue <= 0.05:
            pvalue_feature_list.append([i, pvalue])
    pvalue_feature_list.sort(key=lambda x: x[1])
    pvalue_feature_list = pvalue_feature_list[0:3]  # let's say we pick the top 3 probable features for root cause
    dict_of_cluster_feature_pvaluelist['c' + str(c)] = pvalue_feature_list
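
As a side note, the hard-coded anomaly_clusters_list above could instead come from the simple size-based rule mentioned earlier; here is a minimal sketch, reusing the df_sample_node_cluster frame built in the snippet above.

# flag clusters covering less than 10% of the samples as anomaly clusters
cluster_share = df_sample_node_cluster['cluster'].value_counts(normalize=True)
anomaly_clusters_list = cluster_share[cluster_share < 0.10].index.tolist()
normal_cluster_list = cluster_share[cluster_share >= 0.10].index.tolist()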

That is all for now. Thanks to Sezin Gürkan, Ersin Aksoy, and Amir Yavariabdi; I really appreciate their invaluable contributions to the study.
