

How to predict IT incidents

Root cause analysis of IT incidents based on correlations between time series of IT infrastructure metrics

Introduction

One of the tasks of IT monitoring systems is the collection, storage and analysis of various metrics characterizing both the state of individual elements of the IT infrastructure (CPU load, free RAM, free disk space, etc.) and the state of various business processes. To apply the extensive mathematical apparatus of statistical analysis, it is often convenient to present these data as ordered time series of the corresponding variables. A good toolset for time series processing in Python is the combination of three modules: pandas, scipy and statsmodels (pandas.pydata.org, scipy.stats, statsmodels.org), which provide a wide range of classes and functions for constructing time series, estimating various statistical models, conducting statistical tests and analyzing statistical data. Of the whole plethora of mathematical tools contained in these modules, this article describes the algorithms that we use for root cause analysis in the AIOps platform monq, in particular correlation analysis of time series of IT infrastructure metrics.

Monq IT service health map with MLOps data. Image by author.

Mathematically, correlation is a concordant change in the values of two or more variables: any change in one variable is accompanied by a corresponding change (decrease or increase) in the other, and the correlation coefficient is a quantitative measure of this matched variability. Correlation analysis is a method of mathematical statistics that, by calculating the correlation coefficients between variables, determines whether a correlation between them exists and how strong it is. It should be borne in mind that the correlation coefficient is a descriptive statistic, and the presence of a correlation between variables does not necessarily imply a causal relationship between them; that is, a positive or negative correlation does not necessarily mean that a change in one variable causes a change in another ("correlation does not imply causation").
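
As a quick illustration of the Pearson coefficient itself (on synthetic data, not metrics from monq), the calculation is a one-liner in scipy or pandas:

import numpy as np
import pandas as pd
from scipy import stats

# Two synthetic series: y follows x with some noise, z is independent
rng = np.random.default_rng(0)
x = pd.Series(rng.normal(size=200))
y = x + rng.normal(scale=0.3, size=200)   # strongly correlated with x
z = pd.Series(rng.normal(size=200))       # unrelated to x

r_xy, p_xy = stats.pearsonr(x, y)
r_xz, p_xz = stats.pearsonr(x, z)
print(f'r(x, y) = {r_xy:.2f}, r(x, z) = {r_xz:.2f}')
# The same coefficient via pandas: x.corr(y)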

If time series of three or more variables are considered in a correlation analysis, a correlation matrix is constructed from the pairwise correlation coefficients – a matrix in which the correlation coefficient between two variables is found at the intersection of the corresponding row and column. In IT infrastructure monitoring systems, the correlation matrix of time series of various metrics can be used in two main scenarios (use cases): searching for the root causes of incidents in the system (root cause analysis) and searching for hidden infrastructure connections between system elements (in a sense, "drawing" the resource-service model). Ultimately, temporal correlations between metrics can be fed into various algorithms for further data analysis, such as forecasting or modeling tools.

Constructing time series of metrics

The first step in the correlation analysis procedure is to present the sets of metric values describing the state of IT infrastructure elements as regular (equally spaced) time series, i.e. series giving the value of the metric at each fixed moment of time (every 5 minutes, every half hour, etc.). In general, metric values are measured and written to storage asynchronously, at different intervals and with different frequencies, since these metrics very often come from external sources over which the monitoring system has no control: several metric values may arrive within a 5-minute interval and then not a single one within half an hour. Therefore, a procedure for regularizing the time series of the various metrics is needed, based on interpolation and extrapolation of the data over different time intervals. The regularization procedure is illustrated in Figure 1; in the pandas module it is done in one line – in the following code example this is the line that computes timeSeriesReg, where the method resample('5min').mean() averages the measurements within each 5-minute interval, and fillna(method='ffill') forward-fills the previous metric value into empty (no measurements) intervals:

import pandas as pd

tsCollection = []  # accumulates the regularized series of all metrics
# Read the raw measurements; the first column holds the timestamps
data = pd.read_csv('TimeSeriesExample.txt', parse_dates=[0])
timeSeries = pd.Series(data['KEHealth'].values, index=data['Timestamp'])
# Average within 5-minute bins, forward-fill empty bins
timeSeriesReg = timeSeries.resample('5min').mean().fillna(method='ffill')
tsCollection.append(timeSeriesReg)
Figure 1. Regularization of the time series of metric values. Image by author.

Monq calculates the health metrics of hardware and services based on the state of associated triggers and the mutual influence of third-party systems and components on a given configuration unit, using a resource-service model. In the above example, data on a specific health metric was used, which is calculated for each element of the monitoring system according to certain rules. A typical view of the time series of the health metric for some of the configuration units (CUs) in the system of one of our clients, with several thousand configuration units, is shown in Figure 2 over a time span of about a year.

Figure 2. Time series of the health metric for different elements of the monitoring system. Image by author.

Calculation of the correlation matrix

After obtaining regularized time series of metrics, calculating their correlation matrix is a fairly straightforward task. In the pandas module, this is easiest to do by combining the time series into one table (dataframe) and applying the corr() method to it, which calculates the Pearson correlation coefficient for each pair of metrics within the time interval of their joint definition (not necessarily continuous):

import matplotlib.pyplot as plt

# Combine all regularized series into one dataframe and correlate them
allKeDF = pd.concat(tsCollection, axis=1)
corrMatrix = allKeDF.corr()
# Visualize the correlation matrix as a heat map
palette = plt.get_cmap('jet')
img = plt.imshow(corrMatrix, cmap=palette, vmin=-1, vmax=1, aspect='auto')
plt.colorbar(img)
Figure 3. Correlation matrix of time series of the health metric for the 150 most volatile CUs in the system. Image by author.

Figure 3 shows the correlation matrix for the time series of the health metric of 150 configuration units from the monitoring system of one of our clients for which the metric values change most often. As you can see, this matrix is basically a "green field", which means either no correlation or only very weak correlations between the overwhelming majority of CUs. White pixels appear for those CU pairs for which the algorithm failed to calculate a correlation coefficient – their time series do not intersect in time (for the convenience of further processing, all NaN values are zeroed out). This is quite possible, since the monitoring system is alive: new configuration units can appear in it and old ones can disappear. Red and blue pixels correspond to CU pairs whose time series are correlated and anticorrelated, respectively. There are not many such pairs: only 65 correlated pairs with a correlation coefficient r > 0.7 (0.29% of the total number of pairs), and only 4 anticorrelated pairs with r < -0.7 (0.02%). This reflects well on the monitoring system: there are not many duplicate elements monitoring the same parameter, each is busy with its own job. Configuration units duplicating each other's functions would fall into the group of strongly correlated pairs with a correlation coefficient r > 0.95.
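
For reference, here is a minimal sketch of how such counts can be extracted from the corrMatrix dataframe computed above (the thresholds are the ones mentioned in the text):

import numpy as np

# Take only the upper triangle so that each CU pair is counted once
vals = corrMatrix.values
pairs = vals[np.triu_indices_from(vals, k=1)]

n_corr = np.sum(pairs > 0.7)     # strongly correlated pairs
n_anti = np.sum(pairs < -0.7)    # strongly anticorrelated pairs
n_dup = np.sum(pairs > 0.95)     # candidate "duplicate" CUs
total = pairs.size
print(f'correlated: {n_corr} ({100 * n_corr / total:.2f}%), '
      f'anticorrelated: {n_anti} ({100 * n_anti / total:.2f}%)')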

Figure 4. Correlation matrices for different intervals of regularization of metrics time series – 5-minute and 10-minute. Image by author.

Figure 4 illustrates the correlation matrices calculated for the same set of CUs, but using different regularization intervals for the time series of metrics – 5-minute and 10-minute. At first glance both pictures look quite similar, but if you compute the difference between them, you get the histogram shown in Figure 5, in which the mean value is close to zero, μ = 0, and the standard deviation is σ = 0.11. The same histogram for the difference between the correlations of the 5-minute and 20-minute time series has a standard deviation of σ = 0.16, from which it follows that, with a change in regularization, a noticeable number of CU pairs can move from the correlated group to the non-correlated one and vice versa. As one can see, the choice of the regularization interval for the time series can have a significant effect on the values of the correlation matrix, and it should be chosen as a compromise between the desired accuracy of the correlations, the required computing resources, and the computation time.
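
A sketch of how such a comparison can be made; corrMatrix5 and corrMatrix10 are assumed to be correlation matrices of the same CU set built from 5-minute and 10-minute regularized series (the names are illustrative):

import numpy as np
import matplotlib.pyplot as plt

# Difference of the two correlation matrices, each CU pair taken once
diff = (corrMatrix5 - corrMatrix10).values
diff = diff[np.triu_indices_from(diff, k=1)]

mu, sigma = diff.mean(), diff.std()
plt.hist(diff, bins=50)
plt.title(f'Difference of correlation coefficients: mean={mu:.2f}, std={sigma:.2f}')
plt.xlabel('r(5 min) - r(10 min)')
plt.ylabel('Number of CU pairs')
plt.show()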

Figure 5. Histogram of the difference between the values of the same correlation coefficients for 5-minute and 10-minute regularized time series. Image by author.
Figure 6. Typical view of the correlated time series of the health metric of CU pairs. Image by author.

The general view of several pairs of correlated time series of CU health metrics is shown in Figure 6, and of anticorrelated time series in Figure 7. The last figure clearly shows that high absolute values of the correlation coefficient may be due to very specific behavior of the time series on the intersection of the intervals of their joint variability. In such situations, to check the significance of the calculated correlation coefficient, one can use (with some exaggeration) the Student's t-test: the t-statistic t_obs = |r|·√(n-2)/√(1-r²) is calculated and compared with the tabular value t_crit(α, k) for the given significance level α and the number of degrees of freedom k = n-2, where n is the number of simultaneous observations of the metrics. As n, one should take the minimum number of value changes in the time series of the metrics (from the CU pair) on the time interval where both of them are defined. For the pairs of time series in Figure 7, the results of the Student's t-test with α = 0.05 are shown in red squares. Since for the first two pairs t_obs < t_crit, the measured values of the correlation coefficients are not considered significant and may be the result of random coincidence. For the last pair t_obs > t_crit, which means that the measured anticorrelation is statistically significant. The t_crit values are very easy to obtain in the scipy module:

from scipy import stats
# Two-sided critical value: alpha – significance level, ndf – degrees of freedom (n - 2)
tCrit = stats.t.ppf(1 - alpha / 2, ndf)
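
Putting the whole check together, a minimal sketch of the significance test for one CU pair might look as follows (r, n and alpha are example inputs, not values from the figures):

import numpy as np
from scipy import stats

def corr_is_significant(r, n, alpha=0.05):
    """Student's t-test for a correlation coefficient r measured
    over n simultaneous observations (value changes) of two metrics."""
    t_obs = abs(r) * np.sqrt(n - 2) / np.sqrt(1 - r ** 2)
    t_crit = stats.t.ppf(1 - alpha / 2, n - 2)
    return t_obs > t_crit, t_obs, t_crit

# Example: r = -0.8 measured over n = 25 joint value changes
print(corr_is_significant(-0.8, 25))   # -> (True, ~6.4, ~2.07)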
Figure 7. Typical view of the anticorrelated time series of the health metric of CU pairs. Image by author.

Usage of correlation matrix

As mentioned in the introduction, there are several scenarios for using the correlation matrix of metric time series in an IT infrastructure monitoring system, two of which seem to be the main ones: 1) search for the root causes of incidents in the IT system (root cause analysis) and 2) search for hidden infrastructure links between system elements that are absent in the resource-service model. One should keep in mind that the informativeness and usefulness of the correlation analysis in both scenarios directly depends on the quality and completeness of the resource-service model (RSM) available for the given IT system: if all connections and degrees of influence that exist between the configuration units of the system are reflected in the RSM graph, then the correlation matrix is unlikely to contain any information beyond what is already registered in the RSM. Clearly, a full RSM gives a complete picture of all cause-and-effect relationships in the system, and in this case the correlation matrix is just a diminished representation of the RSM graph. Nevertheless, building a complete resource-service model of an IT system is a very time-consuming process and is not always feasible in practice; therefore, the analysis of correlations in the behavior of system elements can help to reveal some hidden (absent in the RSM) connections between them or to quantify the degree of influence of one element on another, that is, to complement the RSM graph to some extent. In a situation where there is no RSM for the system at all, the correlation matrix can find yet another application: 3) hierarchical cluster analysis of correlations for combining system elements into groups and visualizing the resulting structure in the form of a dendrogram.

Search for root causes of incidents

In general, an incident in an IT system is any event that causes, or can potentially cause, a disruption of the normal operation of a service, with an interruption of the service or a decrease in its quality. At the same time, many incidents can be only symptoms of some deeper problem at the infrastructure level, and quite often one problem (root cause) gives rise to a whole group of incidents. Finding these root causes of incidents is one of the most important functions of IT monitoring systems.

In principle, if a full-fledged and reliable resource-service model of the system exists, the root cause of a failure in the operation of a particular CU can be traced within the RSM graph by sequentially checking for problematic CUs along the entire chain of connections and influences on the original CU. The absence of a full-fledged RSM graph when searching for the root causes of incidents can, to some extent, be compensated for by analyzing the correlation matrix between the time series of CU states. It is also possible that the root cause of an incident can be determined by the correlation matrix faster than by the RSM graph due to the difference in the types of queries to the database.
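
As an illustration of the idea only (not monq's actual implementation), such a walk over the RSM could be a breadth-first traversal of the dependency graph that collects CUs currently in a problem state; the rsm adjacency structure and the problem_cus set here are hypothetical:

from collections import deque

# rsm: hypothetical adjacency list, cu_id -> list of CU ids it depends on
# problem_cus: hypothetical set of CU ids currently in a degraded state
def find_root_cause_candidates(cu_id, rsm, problem_cus):
    """Walk the RSM dependency chain of cu_id and return problematic CUs
    in order of increasing distance from the original CU."""
    seen, candidates = {cu_id}, []
    queue = deque(rsm.get(cu_id, []))
    while queue:
        dep = queue.popleft()
        if dep in seen:
            continue
        seen.add(dep)
        if dep in problem_cus:
            candidates.append(dep)
        queue.extend(rsm.get(dep, []))
    return candidates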

In monq, the correlation analysis of the time series of CU health metrics is delegated to a separate microservice, which constantly updates the correlation matrix in the background. Updates do not occur simultaneously for the entire matrix but in parts (due to limitations on computing resources), and for more volatile time series the correlation coefficients are recalculated more often than for less volatile ones. Thus, the current correlation matrix contains correlation coefficients calculated at different moments in time, but thanks to the update method described above, all the values in it remain up to date.
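
A simplified sketch of how such prioritization could look (the actual scheduling logic of the microservice is not shown in this article); it assumes that each regularized series carries its CU id in the name attribute:

def update_priority(ts_collection, top_n=50):
    """Rank regularized time series by volatility (number of value changes)
    and return the ids of the top_n series to re-correlate first."""
    volatility = {ts.name: (ts.diff().fillna(0) != 0).sum() for ts in ts_collection}
    ranked = sorted(volatility, key=volatility.get, reverse=True)
    return ranked[:top_n]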

When an incident occurs with some CU, after receiving the corresponding request, the microservice issues a sorted list of those CUs for which the values of the correlation coefficient with the original CU are above a certain threshold (usually r>0.7), as shown in Figure 8. This information can be used directly on the frontend to check the status of correlated CUs, and also be passed on to the combined root cause search algorithm, which, in addition to temporal correlations, also uses semantic clustering of incidents.
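
In terms of the corrMatrix dataframe built above, the essence of this query can be expressed in a few lines; the real microservice works against its own storage, so this is only a sketch:

def correlated_cus(corr_matrix, cu_id, threshold=0.7):
    """Return CUs whose health metric correlates with cu_id above threshold,
    sorted by decreasing correlation coefficient."""
    row = corr_matrix[cu_id].drop(cu_id)          # exclude self-correlation
    return row[row > threshold].sort_values(ascending=False)

# Example query for the CU from Figure 8: correlated_cus(corrMatrix, 'CU-38374')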

Figure 8. List of CUs correlating with CU-38374 that is produced by the microservice for calculating the correlation matrix. Image by author.

Search for hidden infrastructure links

As mentioned above, the correlation analysis of the time series of CU states can reveal new connections between the elements of the IT system that are not taken into account in its resource-service model. These can be hidden infrastructure links in the system itself (for example, the use of common power supply circuits, a common cooling system, etc.) or a shared dependence on some external service. Since in this scenario rather rigid connections between the elements of the system are sought, the threshold value of the correlation coefficient for selecting CU pairs should be set rather high: r > 0.95.

Figure 9 shows the correlation matrix of the time series of CU health metrics for our main customer's IT system, which consists of 3200 configuration units. The number of CU pairs with a correlation coefficient above 0.95 is 7470; after checking whether the pair elements are connected within the RSM graph, 2310 remain. A typical view of the time series for some of the remaining CU pairs is shown in Figure 10, where the Student's t-test results are given in red squares (with α = 0.001 for greater reliability). As can be seen from the figure, for most pairs the value of the t-statistic is less than the critical one, so in the end only 3 pairs of correlated CUs pass the t-test. For α = 0.01, the number of remaining CU pairs is 27. What to do with the newly discovered connections between the elements of the IT system – whether to add them to the RSM – is decided by its operator, although automatic addition is also possible in principle.
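
A sketch of this filtering pipeline over the corrMatrix dataframe; rsm_edges (a set of CU pairs already connected in the RSM graph) and n_obs (the number of joint value changes per pair) are assumed inputs here:

import numpy as np
from scipy import stats

def hidden_link_candidates(corr_matrix, rsm_edges, n_obs, r_min=0.95, alpha=0.001):
    """Select CU pairs with r > r_min that are not connected in the RSM
    and whose correlation passes the Student's t-test."""
    candidates = []
    cols = list(corr_matrix.columns)
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = corr_matrix.iloc[i, j]
            pair = frozenset((cols[i], cols[j]))
            if r <= r_min or pair in rsm_edges:
                continue
            n = n_obs[pair]                      # joint value changes of the pair
            if r >= 1.0:                         # identical series: keep as is
                candidates.append((cols[i], cols[j], r))
                continue
            t_obs = abs(r) * np.sqrt(n - 2) / np.sqrt(1 - r ** 2)
            if t_obs > stats.t.ppf(1 - alpha / 2, n - 2):
                candidates.append((cols[i], cols[j], r))
    return candidates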

Figure 9. Full correlation matrix of time series of CU health metrics for the IT system of our main customer. Image by author.
Figure 10. Typical view of time series for some highly correlated pairs of CUs that are not connected in the RSM graph. Image by author.

Hierarchical cluster analysis of correlations

Hierarchical clustering is used to identify relatively homogeneous groups in a set of variables using an algorithm that first treats each variable as a separate cluster and then, based on some quantitative measure of proximity between the variables, sequentially merges these clusters into larger ones until only one is left. The output of the procedure is a dendrogram – a tree-like graph built from the matrix of similarity (or dissimilarity) measures, which visualizes the mutual distances between the variables of the given set. In the case of correlations, the dissimilarity matrix is usually taken to be Mdist = ||1|| - Mcorr, where ||1|| is the all-ones matrix of the same size as the correlation matrix Mcorr. In the scipy module, you can build a dendrogram from the correlation matrix in just a few lines of code:

import scipy.cluster.hierarchy as hac

# Build the cluster hierarchy from the dissimilarity matrix 1 - r
# (linkage treats each row of the matrix as a feature vector of the corresponding CU)
z = hac.linkage(1 - corrMatrix, method='complete')
hac.dendrogram(z, color_threshold=3, leaf_rotation=90., labels=allKeDF.columns)
plt.title('Dendrogram of hierarchical cluster analysis based on correlation matrix of CU health', fontsize=12)
plt.ylabel('Distance', fontsize=10)
plt.xlabel('CU Id', fontsize=10)
plt.show()

Figure 11 shows the dendrogram obtained from the correlation matrix of the time series of the health metric of the 150 configuration units from Figure 3, in which the hierarchical clustering algorithm has highlighted in different colors the CU clusters with correlated behavior of their metrics and has effectively divided the entire set of system CUs into related groups (subsystems). In the absence of an RSM for the system, such a division already reveals some of its structure and can be useful, for example, when searching for the root causes of incidents.
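
If the group membership itself is needed rather than the picture, flat clusters can be extracted from the same linkage matrix z; the distance threshold of 3 below matches the color_threshold used for the dendrogram:

import pandas as pd

# Cut the tree at the same distance that was used to color the dendrogram
labels = hac.fcluster(z, t=3, criterion='distance')
groups = pd.Series(labels, index=allKeDF.columns)
print(groups.value_counts())   # size of each CU group (subsystem)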

Figure 11. Dendrogram from the correlation matrix of the health metric time series for the 150 most volatile CUs in the system. Image by author.
