
Single Model Based Anomaly Detection for Multi-Item Datasets

A simple approach to detecting anomalies in datasets that include multiple items, using a single model

Detecting anomalies in time series has a wide range of applications across many domains. However, most of the corresponding studies are carried out on datasets that include only one item. To be clear, by "item" I mean, for example, a specific device when detecting anomalies in the traffic values of that device in the telco domain, or in its energy consumption in the energy domain. Similarly, an item might correspond to a specific brand when detecting anomalies in sales amounts in the retail domain. These examples can be extended to other domains as well.

What about anomaly detection in datasets that include multiple items, which is very common in real-life scenarios? Think about a dataset consisting of thousands of different devices and the traffic values flowing through them. Each device has its own pattern and behaviour, which would mean thousands of different models, one fitted to each device.

You might come up with a few workarounds in such situations. One direct solution is to develop a separate model for each item, which probably yields the highest accuracy. An important drawback of this methodology is, obviously, performance. If you have 1,000 unique items in your dataset, you need to develop 1,000 different models, and in real life you may have many more unique items in the problem you are working on. The more unique items you have in the dataset, the more complex the solution you have to implement. This is a problem not only in terms of maintenance, but also in terms of performance and resource usage. A minimal sketch of this per-item approach is shown below.
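To make the drawback concrete, here is a minimal sketch of the one-model-per-item approach. The column names match the dataset used later in this article, and Isolation Forest is used only as a placeholder model; this illustrates the pattern rather than the approach I recommend.

import pandas as pd
from sklearn.ensemble import IsolationForest

df = pd.read_csv('roaming.csv')  # same structure as the dataset introduced in Step 2

# One model per item: quickly becomes expensive to train, store and maintain
models = {}
for operator, df_op in df.groupby('Operator'):
    X = df_op[['KPI_1', 'KPI_2', 'KPI_3']].dropna()
    models[operator] = IsolationForest(random_state=0).fit(X)

print(f'{len(models)} separate models were fitted')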

An alternative solution is to cluster items with similar behaviour into the same clusters and develop a model for each cluster. This yields better results with respect to complexity and performance by compromising on accuracy. A rough sketch of this idea follows.
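As a rough sketch of the clustering idea (not part of the pipeline implemented in this article), operators could be described by their average hourly KPI profiles, clustered with KMeans, and a model fitted per cluster. The column names follow the dataset used below; the number of clusters is an arbitrary assumption.

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.ensemble import IsolationForest

df = pd.read_csv('roaming.csv')  # same structure as the dataset introduced in Step 2
df['Time'] = pd.to_datetime(df['Time'])
df['Hour'] = df['Time'].dt.hour

# Describe each operator by its mean KPI_1 value per hour of day (24 features)
profiles = (df.pivot_table(index='Operator', columns='Hour', values='KPI_1', aggfunc='mean')
              .fillna(0))

# Cluster operators with similar hourly behaviour (5 clusters is an arbitrary choice)
clusters = KMeans(n_clusters=5, random_state=0, n_init=10).fit_predict(profiles)
operator_to_cluster = dict(zip(profiles.index, clusters))

# One model per cluster instead of one per operator
df['Cluster'] = df['Operator'].map(operator_to_cluster)
models = {}
for cluster_id, df_cluster in df.groupby('Cluster'):
    X = df_cluster[['KPI_1', 'KPI_2', 'KPI_3']].dropna()
    models[cluster_id] = IsolationForest(random_state=0).fit(X)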

In this article, I describe an alternative approach to anomaly detection in multi-item datasets in which the items have different patterns that depend on the same factors. The rest of the article walks through the implementation of this approach on a real dataset.


Photo by Yves Alarie on Unsplash

Step 1 – Problem Definition

Roaming in the telco domain is basically about using your own mobile phone in another country you visit. Telco companies continue to provide service to their customers during international visits via a local service provider in the visited country. To maintain high customer satisfaction, telco companies monitor the service quality their customers experience even when they are abroad. Within this scope, our target is to understand the patterns, detect anomalies in the service quality customers experience in a foreign country, and take action accordingly. For this purpose, I am using a real dataset from a prominent telco company in Turkey.

Step 2 – Getting Raw Data

The data consist of 5 columns: time, operator name, and 3 different KPIs. These KPIs are simply service quality metrics and cover a 3-month period between July 2020 and October 2020. I anonymized the data because of strict privacy requirements. The data are received hourly and include summarized KPI values in the corresponding time range for each operator. The original data include more than 1000 different operators from different countries. For the sake of simplicity, I am working on a subset that corresponds to only 30 different operators.

As you can guess, each operator has a different pattern. Each country receives a different number of visitors from Turkey in each month of the year, which, for instance, directly affects the number of call attempts. Additionally, countries might be in different time zones relative to Turkey, so you are likely to encounter different usage volumes in Germany and India at 5 p.m. Turkish time.

# The data includes some KPI values for the roaming process of a service provider in Turkey.
# Some information in the data is masked for privacy reasons.
import pandas as pd
from datetime import datetime

df_roaming = pd.read_csv('roaming.csv')
# Parse the 'Time' column into datetime objects
df_roaming['Time'] = df_roaming['Time'].apply(lambda x: datetime.strptime(x, '%Y-%m-%d %H:%M:%S'))
df_roaming
The dataset

Step 3 – Imputing Missing Values

The data have some missing rows due to impairments in the data flow. Handling missing values is out of the scope of this article, so I simply fill each gap with the mean of the previous and next KPI values for the corresponding operator.

# The data has missing values for some time periods.
# Since they are not included as NaN values (the rows are completely missing),
# we first compose a baseline for all operators within the minimum and maximum dates
min_date = min(df_roaming['Time'])  # 2020-07-22 07:00:00
max_date = max(df_roaming['Time'])  # 2020-10-27 12:00:00
datetimes = pd.date_range(start=min_date, end=max_date, freq='h')
operators = df_roaming['Operator'].unique()
df_operator = pd.DataFrame(data=operators, columns=['Operator'])
df_time = pd.DataFrame(data=datetimes, columns=['Time'])
# Cross join operators and timestamps to build the baseline
df_operator['key'] = 1
df_time['key'] = 1
df_baseline = pd.merge(df_operator, df_time, on='key').drop(columns='key')  # baseline is created
# Each null value is simply replaced with the mean of the preceding and posterior values of the corresponding operator.
# Consecutive missing values, which are rare in the dataset, are simply ignored for now.
df_roaming = df_baseline.merge(df_roaming, on=['Operator', 'Time'], how='left')
df_roaming = df_roaming[['Operator', 'Time', 'KPI_1', 'KPI_2', 'KPI_3']]
for kpi in ['KPI_1', 'KPI_2', 'KPI_3']:
    preceding = (df_roaming.sort_values(by=['Time'], ascending=True)
                 .groupby(['Operator'])[kpi].shift(1))
    posterior = (df_roaming.sort_values(by=['Time'], ascending=True)
                 .groupby(['Operator'])[kpi].shift(-1))
    df_roaming[kpi] = df_roaming[kpi].fillna((preceding + posterior) / 2)
df_roaming = df_roaming[['Operator', 'Time', 'KPI_1', 'KPI_2', 'KPI_3']]
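As a quick sanity check (not part of the original pipeline), it may be worth confirming how many values are still missing after the imputation:

# Rows with consecutive gaps (no preceding or posterior value) remain NaN here
remaining = df_roaming[['KPI_1', 'KPI_2', 'KPI_3']].isnull().sum()
print(remaining)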

Step 4 – Visualizations

Visualization always plays a critical role in any kind of data science project, and our approach here is no exception. I try to figure out the factors that shape the behaviour patterns of the KPI values for each operator.

For this purpose, I randomly pick two operators and plot their KPI_1 values along the time axis. For readability, I use only the last week of values. I am not able to analyze seasonal patterns because I am working with only a 3-month period. Keep in mind that these KPI values are deeply affected by the pandemic as a result of travel restrictions between countries. As an initial observation, the KPI values depend on the hour of the day and on the day of the week. These inferences are fairly straightforward and could be obtained without visualization. However, I encourage you to exploit visualization before making any decision and to spend considerably more time on this step.

# Plotting KPI_1 values of two randomly chosen operators
import matplotlib.pyplot as plt
df_roaming_op_0 = df_roaming.loc[df_roaming['Operator'] == 'Country/Operator_0']
df_roaming_op_15 = df_roaming.loc[df_roaming['Operator'] == 'Country/Operator_15']
df_roaming_op_15 = df_roaming_op_15.sort_values(by=['Time'], ascending = True)
df_roaming_op_15.reset_index(inplace=True)
df_roaming_op_15 = df_roaming_op_15.tail(7*24*1) # select last week
df_roaming_op_0 = df_roaming_op_0.sort_values(by=['Time'], ascending = True)
df_roaming_op_0.reset_index(inplace=True)
df_roaming_op_0 = df_roaming_op_0.tail(7*24*1) # select last week
# cast to string in order to visualise xticks with hour minute and second info
df_roaming_op_0['Time'] = df_roaming_op_0['Time'].apply(lambda x: x.strftime('%Y-%m-%d %H:%M:%S')) 
df_roaming_op_15['Time'] = df_roaming_op_15['Time'].apply(lambda x: x.strftime('%Y-%m-%d %H:%M:%S'))
plt.figure(2, figsize=(18, 12))
plt.plot(df_roaming_op_15['Time'], df_roaming_op_15['KPI_1'], marker='o', label='Operator_15')
plt.plot(df_roaming_op_0['Time'], df_roaming_op_0['KPI_1'], marker='o', label='Operator_0')
ymin=0 
ymax=1500
x_axes_label = [df_roaming_op_15['Time'].iloc[i] for i in list(range(0, df_roaming_op_15['Time'].shape[0])) if i%24==0]
plt.xticks(x_axes_label)
plt.vlines(x_axes_label, ymin=ymin, ymax=ymax, linestyles='dashed', color='red')
plt.legend(loc='upper right')
plt.title('KPI_1 values along time axes for Operator_0 and Operator_15')
plt.xticks(rotation=90)
plt.show()
Visualization of KPI_1 values for two random operators along the last week

Step 5 – Compose the Pattern Table

The most critical part is to compose a pattern table that reflects the pattern of each item according to the factors identified in the visualization step. While composing the pattern table, I assume that the features of each item follow a normal distribution, so I simply calculate the mean and standard deviation of the KPI values.

Before composing the pattern table, I split the dataset into train and test sets, and then compose the pattern table using only the feature values in the train set.

# train test split
train_test_split_date = datetime.strptime('2020-10-01 00:00:00', '%Y-%m-%d %H:%M:%S')
# drop the rows that could not be imputed (all three KPIs still missing)
df_roaming = df_roaming.loc[~((df_roaming['KPI_1'].isnull()) & (df_roaming['KPI_2'].isnull()) & (df_roaming['KPI_3'].isnull()))]
df_roaming_train = df_roaming.loc[df_roaming['Time'] < train_test_split_date].copy()
df_roaming_test = df_roaming.loc[df_roaming['Time'] >= train_test_split_date].copy()
# extracting the weekday and hour information from 'Time'
df_roaming_train['Weekday'] = df_roaming_train['Time'].apply(lambda x: x.weekday())
df_roaming_train['Hour'] = df_roaming_train['Time'].apply(lambda x: x.hour)
# calculating the mean and standard deviation of the KPI values with respect to hour and weekday
df_pattern = df_roaming_train.groupby(['Operator', 'Hour', 'Weekday']).agg({'KPI_1': ['mean', 'std'],
                                                                            'KPI_2': ['mean', 'std'],
                                                                            'KPI_3': ['mean', 'std']})
df_pattern.columns = ['_'.join(col).strip() for col in df_pattern.columns.values]
df_pattern.reset_index(inplace=True)
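To illustrate how the pattern table is used, the expected behaviour of a given operator at a given hour and weekday can be looked up directly; the operator name below is just one example from the anonymized data.

# Expected KPI behaviour of one operator on Mondays (weekday 0) at 17:00
example = df_pattern.loc[(df_pattern['Operator'] == 'Country/Operator_0') &
                         (df_pattern['Hour'] == 17) &
                         (df_pattern['Weekday'] == 0)]
print(example[['KPI_1_mean', 'KPI_1_std', 'KPI_2_mean', 'KPI_2_std', 'KPI_3_mean', 'KPI_3_std']])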

Step 6 – Create Features (z-scores)

After composing the pattern table, the z-score of each KPI value is calculated for each row. In other words, the deviation of each KPI value from its own pattern is estimated, taking the pattern-determining factors into account.

# pattern_list contains the factors that determine the pattern of the data
pattern_list = ['Operator', 'Hour', 'Weekday']
df_roaming_train_z_scores = df_roaming_train.merge(df_pattern, on=pattern_list, how='inner')
# calculating the z-scores
df_roaming_train_z_scores['KPI_1_z_score'] = ((df_roaming_train_z_scores['KPI_1'] - df_roaming_train_z_scores['KPI_1_mean'])
                                              / df_roaming_train_z_scores['KPI_1_std'])
df_roaming_train_z_scores['KPI_2_z_score'] = ((df_roaming_train_z_scores['KPI_2'] - df_roaming_train_z_scores['KPI_2_mean'])
                                              / df_roaming_train_z_scores['KPI_2_std'])
df_roaming_train_z_scores['KPI_3_z_score'] = ((df_roaming_train_z_scores['KPI_3'] - df_roaming_train_z_scores['KPI_3_mean'])
                                              / df_roaming_train_z_scores['KPI_3_std'])
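Since exactly the same z-score computation is needed again for the test set in the next step, it could be wrapped in a small helper function. This is just a sketch; the function name add_z_scores is hypothetical and not part of the original code.

def add_z_scores(df, df_pattern, pattern_list, kpi_columns=('KPI_1', 'KPI_2', 'KPI_3')):
    # merge each row with its expected pattern and compute the deviation from it
    df_z = df.merge(df_pattern, on=pattern_list, how='inner')
    for kpi in kpi_columns:
        df_z[f'{kpi}_z_score'] = (df_z[kpi] - df_z[f'{kpi}_mean']) / df_z[f'{kpi}_std']
    return df_z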

Step 7 – Modelling & Anomaly Detection

I use Isolation Forest as the anomaly detection model, which is one of many available anomaly detection models. The crucial point is to feed the model with the calculated z-scores of each row. In other words, I take into consideration the deviation of each KPI value from the corresponding pattern.

Actually, there is no need to use an anomaly detection model at all. You can even compute a general weighted z-score for each row, choose a threshold value, and come up with a more primitive solution to detect anomalies, as sketched after the modelling code below.

# Note that there are no features specific to any operator
# Feeding the model with z-scores, in other words deviations from their own patterns
from sklearn.ensemble import IsolationForest

X_train = df_roaming_train_z_scores[['KPI_1_z_score', 'KPI_2_z_score', 'KPI_3_z_score']]
model = IsolationForest(random_state=0).fit(X_train)
# Convert the test data into a format compatible with the model
# Note that the test data is not used while constructing the pattern table
df_roaming_test['Weekday'] = df_roaming_test['Time'].apply(lambda x: x.weekday())
df_roaming_test['Hour'] = df_roaming_test['Time'].apply(lambda x: x.hour)
pattern_list = ['Operator', 'Hour', 'Weekday']
df_roaming_test_z_scores = df_roaming_test.merge(df_pattern, on=pattern_list, how='inner')
df_roaming_test_z_scores['KPI_1_z_score'] = ((df_roaming_test_z_scores['KPI_1'] - df_roaming_test_z_scores['KPI_1_mean'])
                                             / df_roaming_test_z_scores['KPI_1_std'])
df_roaming_test_z_scores['KPI_2_z_score'] = ((df_roaming_test_z_scores['KPI_2'] - df_roaming_test_z_scores['KPI_2_mean'])
                                             / df_roaming_test_z_scores['KPI_2_std'])
df_roaming_test_z_scores['KPI_3_z_score'] = ((df_roaming_test_z_scores['KPI_3'] - df_roaming_test_z_scores['KPI_3_mean'])
                                             / df_roaming_test_z_scores['KPI_3_std'])
X_test = df_roaming_test_z_scores[['KPI_1_z_score', 'KPI_2_z_score', 'KPI_3_z_score']]
# predicting: IsolationForest returns -1 for anomalies and 1 for normal observations
predictions = model.predict(X_test)
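As a minimal sketch of the simpler, model-free alternative mentioned above, a weighted z-score can be thresholded directly. The weights and the threshold below are arbitrary assumptions for illustration, not values from the original work.

# Equal weights for the three KPIs and a fixed threshold; both are assumptions
weights = {'KPI_1_z_score': 1/3, 'KPI_2_z_score': 1/3, 'KPI_3_z_score': 1/3}
threshold = 3.0

df_scored = df_roaming_test_z_scores.copy()
df_scored['weighted_z'] = sum(df_scored[col].abs() * w for col, w in weights.items())
df_scored['is_anomaly'] = df_scored['weighted_z'] > threshold
print(df_scored['is_anomaly'].sum(), 'rows flagged as anomalies')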

Assumptions

In this approach, there are a few assumptions, as you might have noticed. First of all, I assume that each feature follows a pattern that depends on only a few inferred factors, which are the operator (item), hour, and weekday in our case. Note that each operator has a different pattern, but the patterns depend on the same factors. Secondly, the features are assumed to follow a normal distribution conditioned on these inferred factors. Thirdly, correlations between the features are not taken into consideration while detecting anomalies.

In fact, all these assumptions come with their own limitations and weaknesses. However, this approach can be utilized as a simple alternative for time series anomaly detection problems involving multiple items. My goal is to give a basic insight into the approach, so I do not spend much time on each step. In a real-life scenario, correctly determining the factors that shape the behaviour of each item is crucial. For this reason, detailed analyses are needed, especially in the visualization part. Each feature might even be affected by different factors rather than common ones.

As I mentioned before, the dataset is profoundly affected by the pandemic. To get more robust results, it would be better to work with cleaner data or a suitable subset of the data.

Thanks for reading, I appreciate any feedback.
