Anomaly detection in dynamic graphs using MIDAS

An interesting approach to modelling network security

Nimish Mishra
Towards Data Science

--

With the ever-growing reach of network attacks, network security research has been paramount in recent years. Ideas ranging from genetic algorithms to deep learning have been leveraged to identify possible attack scenarios and notify the same to the network administrator.

Under the fancy details such approaches come with, most of them have a similar core idea: noticing weird patterns. As someone observing a network closely, it is only intuitive that you would be pensive about a sudden increase in requests from certain geographical locations, or a sudden capture/re-routing of network traffic.

Technically, these weird patterns are called anomalies, something out of the ordinary. These are the most obvious points to turn to when identifying a possible attack.

Anomaly detection has been an important problem for researchers and industrialists alike. In this article, I focus on using graphs to identify such patterns. All we need to do is to somehow model the entire network traffic (yes that LAN or that WiFi you are currently connected to) using these mathematical structures.

Some links to get you started:

  1. An interesting TDS article on MIDAS applications and implementation details: https://towardsdatascience.com/controlling-fake-news-using-graphs-and-statistics-31ed116a986f
  2. Source code: https://github.com/bhatiasiddharth/MIDAS
  3. arXiv: Siddharth Bhatia, Bryan Hooi, Minji Yoon, Kijung Shin and Christos Faloutsos. “MIDAS: Microcluster-Based Detector of Anomalies in Edge Streams.” AAAI Conference on Artificial Intelligence (AAAI), 2020. https://arxiv.org/abs/1911.04464

So how do we model networks with graphs?

A graph is a simple mathematical structure: it has few points called nodes or vertices, and some lines connecting them, called edges. Logically, edges represent certain relationships between vertices.

And we can take advantage of this abstract definition.

To model a problem as a graph, you can identify certain entities as vertices, and then draw edges between them to denote certain relationships.

In our case, we model the computer nodes (a computer device like laptop, desktop) as the vertices of the graph, and a connection between them as an edge.

So going by this, the above graph can be read as:

  1. 5 computer devices in a network: P, Q, R, S, T
  2. The graph shown above is a directed graph. That means for every edge, we define a source vertex and a destination vertex; for instance, considering the directed edge P → Q, this means a directed edge from vertex (or in our case a computer device P) to another vertex (aka a computer device Q).

We now get the perfect material for modelling our network. Say we create a directed graph consisting of all the nodes in a particular network, and we connect a directed edge whenever one node tries to establish a connection with another. Based on established patterns in these connections/edges, there might be a possibility of noticing some weird patterns and thus anomalies.

But aren’t networks very dynamic? Don’t they change much frequently?

Yes they are. And that’s a big issue one faces in any model of a network.

You must have seen how it’s possible for any device to be up online within seconds. So far so good. Now imagine this same thing in a huge demographical area- several devices are coming in, and several others are going on. And by coming in and going out, I don’t mean you simply walk into a network. There’s a lot of things going on, the most important of them and also a good starting point to understand subnets is the Dynamic IP configuration for every node on the network. Moreover, people are always sending files, images, videos, texts, HTTP requests, and what not; and servers are usually finding it difficult to handle excess traffic.

On top of this, it’s useful to understand that one connection between your node/computer and another (say a server) using a very common protocol- TCP- can take upto 3–4 transactions back and forth to successfully deliver one unit of information (called packet). And you usually send and receive thousands of packets over few minutes. And there are thousands of people doing the exactly same thing at exactly the same moment as you.

All this means one thing if you think of this in depth:

The internet’s a pretty busy place. A literal virtual web spread world-wide. And it is easy to hide there.

So what does this mean for anomaly detection?

It means it becomes tough. In such dense network dynamics, you can imagine how many edges would exist (thousands as a starter) over several hundreds of vertices. So how do we actually begin to scratch the surface of the problem?

Here is where MIDAS comes in. It builds upon its predecessors (the details of which are beyond the scope of this article) in a sense that it performs detections by considering the temporal nature of the networks and by considering micro-clusters instead of individual edges.

MIDAS-R (a close variant of MIDAS) considers temporal characteristics of the network

You must have had an idea by now regarding the intense dynamic nature of computer networks- how quickly and on what scale they change.

With anomaly detection techniques that use static graphs, we immediately see a problem: no temporal relations (temporal = time, or how the network evolves over time). Let’s get into this in a bit more detail.

Demonstration purposes only. Source: https://towardsdatascience.com/large-graph-visualization-tools-and-approaches-2b8758a1cd59

Look closely at the graphs above. Consider a network you are observing. It looks right now like the one in figure (b). An hour later, it looks like (a). What are the anomalies? How do you tell if someone attacked?

Using static graphs, you would take the snapshot of the network configuration at certain intervals. And you view the graphs like independent existing entities.

Quite intuitively, this method doesn’t make sense. Why? Because network configurations aren’t self-existent. Everything happening in (a) is somehow related to what has happened sometime in the past, like in (b). That is the entire idea of temporal characteristics. And that is what MIDAS (and other similar algorithms) using dynamic graphs does. They ask the following question:

How does the current scenario relate to the past? What has changed, and what hasn’t?

The moment you start asking these questions, you realise important links to the past configurations of the network- something you can exploit to have better chance at detecting some weird pattern and be alerted.

MIDAS considers micro-clusters (here lies one of its novel characteristics)

Consider a very popular network attack: DoS (or another variant DDoS). While details are beyond scope, very crudely these attacks load a server with so many requests that the server isn’t available for genuine requests from genuine clients, effectively crashing the entire system.

If we try to model this in terms of our dynamic directed graphs, what could we expect.

Observed something?

Let’s consider the second image first, all points (considered as nodes) are receiving several edges from different nodes. However, in the first image, the node labelled by 1 is receiving several edges from a single node: 0.

Earlier algorithms did not differentiate between these. But MIDAS asks an interesting question:

Why are there so many parallel edges between two similar nodes?

Parallel nodes in our model of a network imply one computer is sending several independent requests to another. Say 1 is a server, and 0 is a home computer running a browser. That browser might ask for several things at once: a YouTube video, a Wikipedia article, a Medium article, a Towards Data Science article, and several other things. (This is still a gross oversimplification of the actual stuff happening. In reality, every URL you type into the browser gets mapped to different servers spread across the world. Never do all requests from one browser go to one server. But for understanding, we imagine for a second that they actually do).

So far so good. Now what if this number of parallel requests starts to climb? Something like below:

Source: original MIDAS paper

The above graph from the authors illustrates this point. There’s a sudden burst of activity between two nodes, after a period of somewhat normal activity. Recall from our discussion on temporal characteristics of a graph: what happens now depends on the past as well. Here, the past saw ~100–200 connections, and suddenly the count went up to ~1000.

In such a case, MIDAS says, ‘There might be something wrong. Scrutinise this.’

This is the entire core of micro-cluster detection: amongst the several parameters employed to monitor anomalies, include monitoring of suddenly appearing bursts of activity sharing several nodes or edges that are close by in the sense of the graph representation (spatial locality). This is something that makes MIDAS very powerful.

How does MIDAS stack up against existing solutions?

MIDAS (or MIDAS-R) performs better than present solutions/algorithms. Some of the interesting highlights reported by the authors include:

Theoretical guarantees on the false positive probability

I’ll leave a detailed discussion outside the scope of this article. Crudely, the authors showed that MIDAS (MIDAS-R) can give binary decisions (whether an anomaly exist or not) upto a user defined threshold ε. So depending on whether you wish to have 10% or 1% as the maximum probability of a false positive, MIDAS can handle it.

Note: False positive means the algorithm has detected some event as falsely positive, i.e. the event is a negative but the algorithm says it’s a positive. In the current scenario, this would mean there isn’t an anomaly, but MIDAS says there is. This is just an extra layer of caution applied by setting a high probability. You might rather check a non-anomalous event rather than letting an anomalous event slip away (due to too low a probability).

This isn’t always desirable however, and depends greatly on the use-case. Consider an algorithm detecting cancer tumours. Do you wish to have a large threshold? This means several people who don’t have cancer are reported to be having it. On the other hand, too low a threshold means some cancer patient might escape detection. Both conditions are hazardous; choosing thresholds is thus an empirical decision based on the use case of the problem.

Constant memory and update time

Consider a rather possible scenario of a typical network:

Source: https://towardsdatascience.com/getting-started-with-graph-analysis-in-python-with-pandas-and-networkx-5e2d2f82f18e

The real networks are way denser than that. Question arises:

Where do you store such a large network modelled as a graph?

To understand more memory and time complexity problems glaring at us here, take the following example:

Find all prime numbers less than a given number n?

A very naive algorithm:

  1. Take any number q less than n.
  2. Check if q is prime. (this itself is an algorithm detailed after step 3)
  3. Repeat until q reaches n-1

(Step 2 expansion) Checking if q is prime:

  1. Consider all numbers p less that q.
  2. Does p divide q? Yes: increment some counter. No: repeat.
  3. At the very end (when p reaches q-1), see if that counter is 2 (means there are 2 factors: 1 and the number itself). Yes: number is prime. No: isn’t prime.

A bad algorithm but it does the trick. You now get a sense of the problem: increase n: first to tens, then to thousands, then to tens of thousands; you shall begin seeing the time differences.

Crudely, we think of time and space complexities such that when the input size increases, both time and space requirements increase so much that it is infeasible to solve the problem after a certain limit.

Porting this to our graph problem of a network, MIDAS faces the challenge to represent several thousands of computers as nodes/vertices, and several tens of thousands of edges. You could look up the standard representations of graphs (adjacency lists, matrices etc.), and you will immediately see the problem. As with the prime number algorithm, in such representations, as the number of nodes and connections between them increase, space requirement for storing and time requirement for executing the algorithm skyrockets.

The authors solve this issue using an uncommon data structure: Count-Min-Sketch (CMS). Details are out of scope, but this eliminates the above problem.

MIDAS (MIDAS-R) is simply better

Quoting some results from experiments done by the authors:

MIDAS and MIDAS-R score higher than the benchmark. Source: Original MIDAS paper.
MIDAS and MIDAS-R are experimentally proven to be more accurate and more precise than the benchmark. Source: Original MIDAS paper
Linear time and space relationship as graphs get denser. Source: Original MIDAS paper

Conclusion

I think micro-cluster based, temporal and spatial locality included, anomaly detection on dynamic graphs shall go a long way and find interesting applications in near future. As one, the authors benchmark important security event detection and find the following:

MIDAS captures the important events just as previously established benchmark does. Source: original MIDAS paper

In the coming years, we will see increased research efforts into development of more and more clean and novel approaches to anomaly detection, and move on to implementing searches for weird patterns in other domains.

--

--