
Beyond Anomaly Detection

Anomaly detection != threat detection.

Image by qimono from Pixabay

Anyone familiar with threat detection in security operations has likely heard a familiar explanation. It always goes something like, "profile normal behavior, then detect anomalies." The process involves generating data from several different sensors within your environment and using that data to define some kind of baseline threshold. You can then generate alerts for any sensor data that falls outside that threshold. This is all a bit abstract – let us dive in a bit further.

Those familiar with machine learning will know the terms supervised and unsupervised learning. People usually associate anomaly detection with unsupervised learning, but we can actually use either to detect anomalies. In fact, we can apply the ideas from machine learning to both human learning and static analytics. The best threat detections will usually consist of some combination of static detections (rules), machine learning, and human learning. For unsupervised learning we will focus on clustering, and for supervised learning we will focus on classification.

Clustering. When humans and algorithms perform clustering, they generate logical groups for different data points to fall into. These groups emerge from combinations of features rather than from any single feature acting as a label. In other words, nothing in the data should say outright, "this item is in group A." The analyst or algorithm places that item in group A based on some measure of "closeness" to other items in group A. In anomaly detection, the analyst or algorithm generates the groups, and the measure (or measures) of "closeness" provides the threshold. If a data item sits outside this threshold, too "far" from the cluster, it is an anomaly.
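
To make "closeness" concrete, here is a minimal sketch (the feature vectors and threshold are invented for illustration) that treats the mean of known-benign points as a cluster center and flags anything whose distance exceeds a chosen threshold:

```python
import numpy as np

# Hypothetical benign observations: each row is a feature vector
# (e.g., normalized request duration and bytes transferred).
benign = np.array([
    [0.9, 1.1],
    [1.0, 0.9],
    [1.1, 1.0],
    [0.95, 1.05],
])

centroid = benign.mean(axis=0)  # the cluster "center"
threshold = 0.5                 # the "closeness" cutoff, tuned by the analyst

def is_anomaly(x: np.ndarray) -> bool:
    """A point is anomalous if it sits too "far" from the cluster."""
    return np.linalg.norm(x - centroid) > threshold

print(is_anomaly(np.array([1.0, 1.0])))  # False: close to the cluster
print(is_anomaly(np.array([5.0, 0.2])))  # True: far outside the threshold
```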

A security analyst could generate a simple cluster around duty hours. If the organization has 9–5 work hours, we could center a cluster at 1pm with a threshold of +/- 4 hours. Anything outside of this timeframe would generate an alert. A more realistic cluster would group different interactions into the basic activities of users and administrators based on service, length of interaction, amount of data, and other factors. Strange activities would fall outside these clusters, and analysts could investigate them. If they find the activities to be benign, they can add them to a new cluster group or expand the thresholds to get rid of these false alarms.
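
A sketch of both ideas, assuming scikit-learn and invented session features (hour of day, duration in minutes, bytes transferred); in practice you would scale features before clustering:

```python
import numpy as np
from sklearn.cluster import KMeans

# Simple duty-hours cluster: centered at 1pm with a +/- 4 hour threshold.
def hours_alert(event_hour: int) -> bool:
    return abs(event_hour - 13) > 4

print(hours_alert(10))  # False: within 9-5
print(hours_alert(22))  # True: outside duty hours

# More realistic: cluster sessions by hour, duration, and bytes transferred.
sessions = np.array([
    [9, 30, 1200], [10, 45, 1500], [14, 20, 900],  # typical user activity
    [8, 5, 200],   [9, 4, 250],    [8, 6, 180],    # typical admin scripts
])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(sessions)

# Distance from each point to its nearest centroid; large distances are anomalies.
dists = km.transform(sessions).min(axis=1)
cutoff = dists.mean() + 2 * dists.std()

new_session = np.array([[3, 240, 50_000]])  # 3am, very long, huge transfer
print(km.transform(new_session).min(axis=1) > cutoff)  # [ True]
```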

Classification. There is no formal decree as to which attributes of a data item are features versus labels. We may designate any attribute as a label and use the other attributes to predict it. In this case, we use the different fields in the data to predict another field. The analyst or algorithm can choose which fields provide the best thresholds so that only real anomalies trigger alerts. We can also combine different data entries across different streams into a single data object and then use the disparate data entries to make higher-fidelity predictions.
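
As a minimal sketch of this idea (scikit-learn, with invented records): designate the request method as the label, train a classifier on the remaining fields, and treat entries where the model finds the observed value unlikely as anomalies:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Invented records: features are [hour, response_bytes, has_user_agent],
# and we designate the request method (0=GET, 1=POST) as the label.
X = np.array([
    [10, 5000, 1], [14, 6200, 1], [16, 4800, 1],  # browser traffic: GET
    [9, 120, 0],   [8, 150, 0],   [9, 90, 0],     # API traffic: POST
])
y = np.array([0, 0, 0, 1, 1, 1])

model = RandomForestClassifier(random_state=0).fit(X, y)

def is_anomalous(features, observed_label, min_prob=0.3):
    """Anomalous if the observed label is unlikely given the other fields."""
    prob = model.predict_proba([features])[0][observed_label]
    return prob < min_prob

# A mid-afternoon POST with a large response and a browser user agent
# contradicts what the other fields predict.
print(is_anomalous([15, 5500, 1], observed_label=1))  # True
```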

If we consider an NGINX log entry, for example, we have the following features in an entry: IP address, date, method, URL, protocol, status, bytes sent, referer, and user agent. Now, let us imagine a business in which most users access a web server from a browser using the /home URL at any time during the business day. Their request uses the GET method, and the response is quite large because it delivers a whole webpage. There are some developers, however, who access the web server from their code via an API. They make POST requests to the /api URL and get short responses, and they usually only make these requests near the beginning of the workday. In this case, the combination of an early-day request using the POST method, a short response, and no user-agent string will strongly predict the /api URL.
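
A rough sketch of that prediction as a check (the regex matches NGINX's default "combined" log format; the rule and its thresholds are just the heuristic from the scenario above, not a production detector):

```python
import re
from datetime import datetime

# NGINX "combined" log format:
# $remote_addr - $remote_user [$time_local] "$request" $status
#   $body_bytes_sent "$http_referer" "$http_user_agent"
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>\S+)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+) "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def expected_url(entry: dict) -> str:
    """Heuristic from the scenario: early-day POSTs with short responses
    and no real user agent strongly predict /api; everything else, /home."""
    hour = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z").hour
    short = int(entry["bytes"]) < 500
    no_agent = entry["agent"] in ("", "-")
    if entry["method"] == "POST" and short and no_agent and hour < 11:
        return "/api"
    return "/home"

line = '203.0.113.7 - - [04/Mar/2024:09:12:01 +0000] "POST /home HTTP/1.1" 200 210 "-" "-"'
entry = LOG_RE.match(line).groupdict()

# If the observed URL disagrees with the prediction, flag it for review.
if entry["url"] != expected_url(entry):
    print("anomaly:", entry["ip"], entry["method"], entry["url"])
```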

The above methods will allow you to detect anomalies, and you can investigate those anomalies to find threats – right? Well, this reasoning relies on a couple of assumptions:

  1. Attack traffic is more dynamic than legitimate traffic. In other words, your legitimate traffic will remain predictable through clustering and classification, while attack traffic will be the traffic that is unpredictable.
  2. Of all the anomaly detection algorithms you could build, the ones you pick will be the ones that best detect attacks. In other words, optimizing for anomaly detection also optimizes for threat detection.

These are very strong assumptions. They may hold up for the most rudimentary attacks, but advanced attackers try to make needles look like hay so that you will not see their attacks as anomalies. Meanwhile, the pace of change in most environments is accelerating, so new, anomalous-but-benign traffic will appear more and more regularly. Also, how do you verify these assumptions without using actual attack data to test the algorithms? Even if you want to use anomaly detection, you will want to see how well your anomaly detection algorithms perform against actual attack data. And once you have real attack data, you can build algorithms that directly classify data objects as malicious or benign. You can also match your data object generation to different attack lifecycles rather than just making educated guesses. You can even ensemble your anomaly detection methods with other methods to generate very high-fidelity alerts.
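
As an illustrative sketch (invented data, scikit-learn; the model choices are arbitrary): with labeled attack traffic you can train a classifier directly and ensemble it with an unsupervised anomaly score, alerting only when both agree:

```python
import numpy as np
from sklearn.ensemble import IsolationForest, GradientBoostingClassifier

rng = np.random.default_rng(0)
benign = rng.normal(loc=0.0, scale=1.0, size=(200, 4))  # stand-in benign features
attacks = rng.normal(loc=4.0, scale=1.0, size=(20, 4))  # stand-in attack features

X = np.vstack([benign, attacks])
y = np.array([0] * 200 + [1] * 20)

# Unsupervised: profile benign traffic only, score everything by "strangeness".
iso = IsolationForest(random_state=0).fit(benign)

# Supervised: learn malicious vs. benign directly from labeled data.
clf = GradientBoostingClassifier(random_state=0).fit(X, y)

def alert(x) -> bool:
    """High-fidelity alert: flagged as anomalous AND classified as malicious."""
    anomalous = iso.predict([x])[0] == -1           # -1 means outlier
    malicious = clf.predict_proba([x])[0][1] > 0.9
    return anomalous and malicious

print(alert(attacks[0]))  # True: anomalous and classified malicious
print(alert(benign[0]))   # False
```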

Whether we are generating rules, training analysts, or using machine learning, we should have real data labeled with what we are actually looking for. In this case, we should have the benign sensor data we would use to profile normal behavior in an anomaly detection method, and we should also have the same kind of sensor data for the malicious behavior we want to detect. Anomaly detection, therefore, is not enough, and it will not get you to great threat detection by itself.

