Time Series Anomaly Detection with PyFBAD

End-to-End Unsupervised Outlier Detection

Oğuzhan Yediel
Towards Data Science


The typical flow of a machine learning project starts with reading the data, followed by preprocessing, training, testing, visualization, and sharing the results through a notification system. Of course, all of these steps can easily be done with the help of various open-source libraries. However, in some task-specific cases, such as anomaly detection in time series data, reducing the number of libraries and hard-coded steps is beneficial for explainability. The pyfbad library was developed for that reason.

The pyfbad library is an end-to-end unsupervised anomaly detection package. It provides source code for all of the ML workflow steps mentioned above. For example, data can be read from a file, MongoDB, or MySQL with custom filters using pyfbad. The data read this way can be made ready for the model with preprocessing methods. The model can be trained using different machine learning algorithms such as Prophet or Isolation Forest. Anomaly detection results can be reported via email or Slack. In other words, the whole cycle of the project can be completed with the source code provided by pyfbad, without using any other library.

It would be helpful to read this informative article to understand why we needed to develop this package and how to design an unsupervised anomaly detection project in general terms. Briefly, though, anomaly detection is an identification technique used to find unusual observations, such as sudden spikes or dips, in a given data set.
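To make "sudden spikes or dips" concrete, here is a minimal standard-library sketch that flags values lying far from the median. It is not pyfbad's method; the function name `flag_outliers`, the sample readings, and the 3.5 threshold on the modified z-score are assumptions chosen for illustration.

```python
from statistics import median


def flag_outliers(values, threshold=3.5):
    """Return indices of values whose modified z-score exceeds the threshold."""
    med = median(values)
    # Median absolute deviation: a spread estimate that outliers barely affect.
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no spread to measure against
    # 0.6745 rescales the MAD so the score is comparable to a standard z-score.
    return [i for i, v in enumerate(values)
            if abs(0.6745 * (v - med) / mad) > threshold]


# Hypothetical readings with one spike (48) and one dip (-25).
readings = [10, 11, 10, 12, 11, 48, 10, 11, -25, 12]
print(flag_outliers(readings))  # → [5, 8]
```

A global rule like this ignores trend and seasonality, which is why time-series models are normally used instead.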

Figure 1. The pipeline of unsupervised anomaly detection on time-series data using pyfbad. Image by author

As seen in Figure 1, pyfbad has four main modules: database, feature, model, and notification. This structure has become almost standard in data science projects thanks to the Cookiecutter Data Science template by DrivenData. The project organization can be seen in Figure 2.

Database:

This module has scripts to read data from various databases or files. MySQL and MongoDB are the databases supported so far. Filtering steps, especially for MongoDB, become more user-friendly through pyfbad. The following snippet may give an idea of how to use pyfbad for database operations.
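Independently of pyfbad's own API, which is not reproduced here, the read-and-filter step can be illustrated with a minimal standard-library sketch. The helper `read_rows`, the column names, and the sample rows are hypothetical; an in-memory buffer stands in for a CSV file on disk.

```python
import csv
import io


def read_rows(handle, filters=None):
    """Read CSV rows as dicts, keeping rows that match every column=value filter."""
    filters = filters or {}
    return [row for row in csv.DictReader(handle)
            if all(row.get(col) == val for col, val in filters.items())]


# In-memory stand-in for a file; with a real file use open(path, newline="").
raw_csv = io.StringIO(
    "timestamp,store,sales\n"
    "2021-01-01 00:00:00,A,120\n"
    "2021-01-01 00:00:00,B,95\n"
    "2021-01-01 01:00:00,A,130\n"
)
rows = read_rows(raw_csv, filters={"store": "A"})
print(rows)
```

Passing the filters as a dictionary mirrors how a MongoDB query document or a SQL `WHERE` clause would narrow the result before any modeling happens.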

Feature:

Univariate time series anomaly detection requires two types of data. One is the continuous time data, and the other is the main variable in which we want to detect anomalies. These two columns should be extracted from the raw data as the model data. pyfbad can retrieve the model data from a raw dataframe with optional filtering. The following snippet shows how to use pyfbad for this operation.
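As a library-agnostic illustration of this extraction (not pyfbad's own implementation), the sketch below reduces raw records to time-ordered (timestamp, value) pairs with an optional filter. The function `to_model_data`, its parameters, and the sample records are assumptions made for the example.

```python
from datetime import datetime


def to_model_data(rows, time_col, value_col, filters=None,
                  time_format="%Y-%m-%d %H:%M:%S"):
    """Reduce raw records to time-ordered (timestamp, value) pairs."""
    filters = filters or {}
    pairs = [
        (datetime.strptime(row[time_col], time_format), float(row[value_col]))
        for row in rows
        if all(row.get(col) == val for col, val in filters.items())
    ]
    return sorted(pairs)  # tuples sort by timestamp first


# Hypothetical raw records, deliberately out of order.
raw_rows = [
    {"timestamp": "2021-01-01 01:00:00", "store": "A", "sales": "130"},
    {"timestamp": "2021-01-01 00:00:00", "store": "B", "sales": "95"},
    {"timestamp": "2021-01-01 00:00:00", "store": "A", "sales": "120"},
]
model_data = to_model_data(raw_rows, "timestamp", "sales", filters={"store": "A"})
print(model_data)
```

Everything except the time column and the monitored value is dropped, so the model only sees the two series it needs.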

Model:

This module trains on the model data with various algorithms. pyfbad aims to detect anomalies in time series data, so it offers models that can be applied quickly and robustly. Facebook Prophet and Isolation Forest are the models supported so far. As an example, how Prophet is used with the help of pyfbad can be seen in the code snippet below.
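A common way to turn a forecaster such as Prophet into an anomaly detector is to flag observations that fall outside the forecast's uncertainty band (Prophet's forecast frame exposes `yhat_lower` and `yhat_upper` columns for this). The sketch below shows only that flagging idea using the standard library: a trailing mean and standard deviation stand in for the forecast, so it is neither Prophet nor pyfbad's implementation. The function name, window size, and band width are assumptions.

```python
from statistics import mean, stdev


def detect_with_band(values, window=5, width=3.0):
    """Flag points outside a band built from the preceding `window` points."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        center = mean(history)   # stand-in for a model forecast
        spread = stdev(history)  # stand-in for forecast uncertainty
        lower, upper = center - width * spread, center + width * spread
        if not lower <= values[i] <= upper:
            anomalies.append(i)
    return anomalies


# Hypothetical series with a single spike at index 7.
series = [20, 21, 19, 20, 22, 21, 20, 60, 21, 20, 19, 21]
print(detect_with_band(series))  # → [7]
```

Note that the spike widens the band for the points that follow it; a fitted model with trend and seasonality components is less sensitive to this than a trailing window.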

Notification:

The success of any technology depends on how well its output is put to use. If the output reaches the user clearly and explains how it should be used, the product becomes visible. pyfbad provides notification options such as email and Slack to share the results of the project. The email option can be used as in the code snippet below. Note that the email account should not have high-security settings for authorization.
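For readers who want to see what an email notification involves underneath, here is a minimal sketch using Python's standard `smtplib` and `email` modules rather than pyfbad's notification module. The helper names, addresses, host, and port are placeholders, and the actual send call is left commented out because it needs real credentials.

```python
import smtplib
from email.message import EmailMessage


def build_report(sender, recipient, anomalies):
    """Compose a plain-text email listing (timestamp, value) anomalies."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = f"Anomaly report: {len(anomalies)} point(s) flagged"
    lines = [f"{ts}: {value}" for ts, value in anomalies]
    msg.set_content("\n".join(lines) or "No anomalies detected.")
    return msg


def send_report(msg, host, port, user, password):
    """Send the message over an SSL-wrapped SMTP connection."""
    with smtplib.SMTP_SSL(host, port) as server:
        server.login(user, password)
        server.send_message(msg)


report = build_report("alerts@example.com", "team@example.com",
                      [("2021-01-01 07:00", 60.0)])
print(report["Subject"])
# Placeholder host and credentials; uncomment with real values to send.
# send_report(report, "smtp.example.com", 465, "alerts@example.com", "password")
```

Separating message construction from delivery keeps the report testable without a mail server.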

Implementation!

Let’s walk through a quick implementation to better understand how pyfbad can be used. The full notebook is available on Kaggle. Here, let’s just look at the results of each step.

After the dataset module is imported from the pyfbad library, the raw data can be read from a CSV file. The raw data is shown in Figure 3.

Figure 3. Raw time-series data values. Image by author

For this dataset, no preprocessing steps are actually needed to extract the ready-to-train dataset from the raw dataset. Even so, Figure 4 shows the differences between the initial and final dataframes.

Figure 4. The raw dataset is on the left and the ready-to-train dataset is on the right. Image by author

The Prophet algorithm was used to train the model in this implementation. After the training step, the detected anomalies are shown in Figure 5.

Figure 5. Anomaly detection results. Image by author

Conclusion

Pyfbad works well with popular databases such as MongoDB and MySQL, but support for other databases can still be added. It currently uses well-known models such as Facebook Prophet and Isolation Forest, but it still needs more machine learning algorithms. The team behind the project is eager to learn, passionate about research and development, and very ambitious about learning new technologies. Therefore, we can safely say that this is only the first version of pyfbad. Work is ongoing both to address the deficiencies mentioned above and to make pyfbad a comprehensive unsupervised anomaly detection library.
