Machine Learning Techniques Applied to Cyber Security

CyberSift
4 min read · Sep 10, 2017

In general, we can divide Machine Learning algorithms into two broad categories: supervised and unsupervised. In a nutshell, supervised algorithms require a labelled training data set, and once trained should subsequently be able to correctly classify or predict data given new input. Most Neural Network architectures can be considered supervised learning algorithms. Conversely, unsupervised algorithms do not require labelled training data sets, and typically use inherent data properties to subsequently predict or classify data. For example, most clustering techniques, such as K-Means, are unsupervised algorithms.

In this short article, we’ll give a high-level overview of how different types of machine learning algorithms can be used in the cybersecurity domain, including examples of supervised, unsupervised, and reinforcement learning.

Supervised: Using Recurrent Neural Networks to distinguish between “normal” DNS domains and those generated by Domain Generation Algorithms

The goal here is to distinguish between “probably normal” domains and those randomly generated by malware to communicate back to its command and control servers. This is a prime example of where machine learning shines, because signature-based approaches are useless here: the generated domains are pseudo-random and change quite frequently to stay ahead of any signature updates.

There are actually many techniques out there for distinguishing between the two kinds of domain, such as linguistic analysis (for example, randomly generated domains tend to have an unusual ratio of consonants to vowels compared to normal domains). One of the most effective and accurate approaches I’ve seen so far is to leverage Recurrent Neural Networks [RNNs] to distinguish between the two. RNNs excel at finding structure in sequences (keep in mind that a domain is just a sequence of letters and symbols). Your typical ML pipeline in this case would look something like this:

  1. Gather your normal domain dataset (usually something like the Alexa Top 1 Million sites) and your random domain dataset (usually produced by known Domain Generation Algorithms [DGAs])
  2. Label normal domains as 0 and randomly generated domains as 1
  3. Build a deep (2 or more layers) RNN whose input is the domain split into a sequence of characters, and whose output layer is a single sigmoid unit (which outputs a value between 0 and 1)
  4. Train with a loss function that minimizes the error between the RNN output for a given sequence and its label from step 2, typically binary cross-entropy (see the sketch below)
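
To make the pipeline concrete, here is a minimal sketch in Keras (TensorFlow 2.x). The toy domains, character vocabulary, and layer sizes are illustrative assumptions rather than production values:

```python
# A minimal sketch of the pipeline above in Keras (TensorFlow 2.x).
# The domains, character vocabulary, and layer sizes are illustrative.
import numpy as np
from tensorflow.keras import layers, models

VALID_CHARS = "abcdefghijklmnopqrstuvwxyz0123456789-."
CHAR_TO_ID = {c: i + 1 for i, c in enumerate(VALID_CHARS)}  # 0 is reserved for padding
MAX_LEN = 63

def encode(domain):
    """Turn a domain into a fixed-length sequence of character ids (step 3 input)."""
    ids = [CHAR_TO_ID.get(c, 0) for c in domain.lower()[:MAX_LEN]]
    return np.pad(ids, (0, MAX_LEN - len(ids)))

# Steps 1-2: toy stand-ins for the Alexa (label 0) and DGA (label 1) datasets
domains = ["google.com", "wikipedia.org", "xjw3qpzkfy.net", "qzkvbh2mtd.biz"]
labels = np.array([0, 0, 1, 1], dtype=np.float32)
X = np.stack([encode(d) for d in domains])

# Step 3: a deep (2-layer) RNN ending in a single sigmoid unit
model = models.Sequential([
    layers.Embedding(input_dim=len(VALID_CHARS) + 1, output_dim=32, mask_zero=True),
    layers.LSTM(64, return_sequences=True),
    layers.LSTM(64),
    layers.Dense(1, activation="sigmoid"),  # outputs a value between 0 and 1
])

# Step 4: binary cross-entropy penalizes the gap between the sigmoid output and the label
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, labels, epochs=5, batch_size=2)

print(model.predict(X))  # scores close to 1 suggest a DGA-generated domain
```

In practice you would train on hundreds of thousands of labelled domains, but the structure of the pipeline stays the same.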

Note: it’s a common misconception that all neural networks are supervised and require a labelled dataset. While this is true in the majority of cases, it is not always so. In fact, we typically use unsupervised variations of RNNs (see links in “Other Examples” below).

Unsupervised: Using Self Organizing Maps and clustering techniques to identify anomalous IP traffic

Self Organizing Maps [SOMs] are a cool type of neural network. Every neuron essentially represents a point in multidimensional space, and every neuron is connected to its neighbors. Every time a training sample from this multidimensional space is presented to the SOM, the closest neuron “wins”: it moves closer to the training sample and “pulls” its neighbors along with it. One cool-looking application we’ve seen is making a SOM approximate a sketch of Marilyn Monroe.
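
To make the “winner pulls its neighbors along” mechanic concrete, here is a bare-bones sketch of a single SOM update step, assuming a Gaussian neighborhood function (the grid size, learning rate, and neighborhood width below are illustrative):

```python
# A bare-bones SOM update step: the winning neuron moves towards the sample
# and a Gaussian neighborhood pulls nearby neurons along with it.
# Grid size, learning rate, and neighborhood width are illustrative.
import numpy as np

GRID_W, GRID_H, DIM = 10, 10, 3
rng = np.random.default_rng(0)
weights = rng.random((GRID_W, GRID_H, DIM))  # one weight vector per neuron
coords = np.stack(np.meshgrid(np.arange(GRID_W), np.arange(GRID_H), indexing="ij"), axis=-1)

def update(sample, lr=0.5, sigma=2.0):
    # The neuron closest to the sample "wins"...
    dists = np.linalg.norm(weights - sample, axis=-1)
    bmu = np.unravel_index(np.argmin(dists), dists.shape)
    # ...and its grid neighbors are pulled along, weighted by distance on the grid
    grid_dist = np.linalg.norm(coords - np.array(bmu), axis=-1)
    h = np.exp(-(grid_dist ** 2) / (2 * sigma ** 2))
    weights[...] += lr * h[..., None] * (sample - weights)

for _ in range(1000):
    update(rng.random(DIM))
```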

SOMs are extremely useful because, after training, most neurons will have gravitated towards “normal” data, while a minority will sit near abnormal data. SOMs are typically used as a “pre-filter” that organizes your data in multidimensional space, on which you can subsequently run distance-based clustering algorithms such as DBSCAN.

We’ve used this technique with IP traffic: if a test sample is passed through the SOM and gets associated with a neuron that is far from its neighbors, that’s a strong indication of an anomaly.
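
Here is a minimal sketch of that SOM-plus-DBSCAN pipeline using the third-party minisom and scikit-learn packages; the traffic features, grid size, and DBSCAN parameters are made up for illustration:

```python
# A minimal sketch using the third-party minisom and scikit-learn packages.
# The traffic features, grid size, and DBSCAN parameters are illustrative.
import numpy as np
from minisom import MiniSom
from sklearn.cluster import DBSCAN

# Hypothetical per-flow features (already normalized): bytes in/out, duration, port entropy
rng = np.random.default_rng(0)
X = np.clip(rng.normal(loc=0.5, scale=0.05, size=(500, 4)), 0, 1)

# Train a 10x10 SOM on the traffic features
som = MiniSom(10, 10, input_len=4, sigma=1.0, learning_rate=0.5, random_seed=0)
som.train_random(X, num_iteration=2000)

# Run DBSCAN over the trained neuron weight vectors: neurons in dense regions
# represent "normal" traffic; noise points (label -1) sit far from their neighbors
weights = som.get_weights().reshape(-1, 4)
neuron_labels = DBSCAN(eps=0.1, min_samples=3).fit_predict(weights)

def is_anomalous(sample):
    """Map the sample to its best-matching unit and check if that neuron is an outlier."""
    x, y = som.winner(sample)
    return neuron_labels[x * 10 + y] == -1

print(is_anomalous(np.array([0.5, 0.5, 0.5, 0.5])))      # resembles the training traffic
print(is_anomalous(np.array([0.99, 0.01, 0.9, 0.05])))   # far from the training traffic
```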

Another unsupervised example is using Hidden Markov Models to determine how probable a given sequence of system calls is; that’s essentially what we used in our Anti-Ransomware demo (see “Other Examples”).
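
A full Hidden Markov Model adds hidden states, but the scoring intuition can be shown with a plain first-order Markov chain: estimate transition probabilities from normal traces, then flag sequences whose log-likelihood is unusually low. The system call traces below are illustrative:

```python
# A simplified sketch: a plain first-order Markov chain over system calls.
# A full HMM adds hidden states, but the scoring intuition is the same.
import math
from collections import defaultdict

def train_transitions(traces):
    """Estimate P(next_call | current_call) from traces of normal behaviour."""
    counts = defaultdict(lambda: defaultdict(int))
    for trace in traces:
        for a, b in zip(trace, trace[1:]):
            counts[a][b] += 1
    return {a: {b: n / sum(nexts.values()) for b, n in nexts.items()}
            for a, nexts in counts.items()}

def log_likelihood(trace, probs, floor=1e-6):
    """Log-probability of a trace; unseen transitions get a small floor probability."""
    return sum(math.log(probs.get(a, {}).get(b, floor))
               for a, b in zip(trace, trace[1:]))

# Illustrative traces of "normal" process behaviour
normal_traces = [
    ["open", "read", "read", "close"],
    ["open", "read", "write", "close"],
]
probs = train_transitions(normal_traces)

# Ransomware-like mass open/write/rename behaviour scores far lower
print(log_likelihood(["open", "read", "close"], probs))
print(log_likelihood(["open", "write", "rename", "open", "write", "rename"], probs))
```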

Reinforcement Learning: bringing threat intelligence and the end user into the loop

We don’t see many examples of this in academia because it’s one of those features which matters more in production than in a lab. Anomaly-based machine learning algorithms applied in practice are notoriously prone to False Positives [FPs]. One way of dealing with this is keeping a human in the loop. In CyberSift’s case, this is usually the security engineer using our product. We are aiming for a workflow where, when a false positive does occur, the engineer can mark the alert as false, and this internally adjusts the weights of our algorithms, for example by changing training sample weights or the model weights of an ensemble. This makes our algorithms more accurate over time. In essence, CyberSift’s models become the “Agents” while the engineer is the “Interpreter” or “Critic”. A similar concept is using external threat intelligence feeds as your interpreter/critic: for example, increasing the weights of those samples that contain known-bad IP addresses, and so on.
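
As an illustration of the idea (a simplified sketch, not our production implementation), here is one common scheme: a multiplicative-weights update over an ensemble of detectors, where detectors that contributed to a false positive are down-weighted:

```python
# A simplified sketch (not our production implementation): an ensemble of
# anomaly detectors whose weights are adjusted by analyst feedback using a
# multiplicative-weights update.
import numpy as np

class FeedbackEnsemble:
    def __init__(self, detectors, eta=0.5):
        self.detectors = detectors           # callables: sample -> anomaly score in [0, 1]
        self.weights = np.ones(len(detectors))
        self.eta = eta                       # how strongly feedback shifts the weights

    def score(self, sample):
        """Weighted average of the individual detectors' anomaly scores."""
        scores = np.array([d(sample) for d in self.detectors])
        return float(self.weights @ scores / self.weights.sum())

    def mark_false_positive(self, sample):
        """The engineer flags an alert as false: down-weight detectors that fired on it."""
        scores = np.array([d(sample) for d in self.detectors])
        self.weights *= np.exp(-self.eta * scores)
        self.weights /= self.weights.sum()   # keep the weights normalized

# Two toy detectors; detector_a is over-eager on this (hypothetical) sample
detector_a = lambda s: 0.9
detector_b = lambda s: 0.1
ens = FeedbackEnsemble([detector_a, detector_b])
print(ens.score("sample"))        # before feedback
ens.mark_false_positive("sample")
print(ens.score("sample"))        # detector_a now counts for less
```

The same hook works for threat intelligence: a feed that confirms an alert can increase, rather than decrease, the weights of the detectors that raised it.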

Other Examples

Hopefully, you already follow Towards Data Science! If not, there’s no time like the present!

If you liked this article please click the 💚 button! Check us out at https://cybersift.io


CyberSift

Intelligence Augmented Cybersecurity. A hybrid IDS which leverages both signature & anomaly data mining techniques to simplify cybersecurity. http://cybersift.io