Cyber Criminals vs Robots

Building an ensemble model for network intrusion detection and prevention

Shaefer Drew
Towards Data Science


All images, unless otherwise noted, are by the author.

Header image by Tung Nguyen, Pixabay⁷

Introduction

What happens when cyber criminals face robots? What happens when they use robots? How will offensive and defensive strategies of cybersecurity evolve as artificial intelligence continues to grow? Both artificial intelligence and cybersecurity have consistently landed in the top charts of fastest growing industries year after year¹². The two fields overlap in many areas and will undoubtedly continue to do so for years to come. For this article, I have narrowed my scope to a specific use case: intrusion detection. An Intrusion Detection System (IDS) is software that monitors a company’s network for malicious activity. I dive into AI’s role in Intrusion Detection Systems, code my own IDS using machine learning, and demonstrate how it can be used to assist threat hunters.

Problem Definition

Even today, fewer than half of companies take advantage of machine learning in their Intrusion Detection Systems¹. The most common IDS relies solely on a technique called “signature matching”. Signature-based intrusion detection finds “sequences and patterns that may match a particular known attacker IP address, file hash or malicious domain”; however, it has significant limits when it comes to detecting unknown attacks². Many cyber criminals understand how signature matching works and will alter their behavior to avoid detection. While signature matching has proven useful, in the ever-changing cyber threat climate it cannot be a solution by itself. Rather, signature matching must be complemented with a more adaptable solution: machine learning. Using features engineered from a network’s traffic, machine learning not only detects attacks that have been seen before but is also dynamic enough to detect completely new ones.

The specific problem I wanted to focus on for my intrusion detection algorithm was optimizing precision and recall. In the context of intrusion detection, errors can have massive costs to the organization. A false negative may result in an undetected system breach with potentially devastating consequences. Too many false positive alerts for malicious samples can degrade confidence in the system and misdirect critical security personnel towards dead ends, causing them to de-prioritize and fail to mitigate actual attacks. Quantifying and optimizing for this trade-off is another research project on its own and likely varies by organization. In this article, I assume that false negatives are worse than false positives but try to minimize both. In evaluating the results, I try to quantify this relationship in terms of weighted cost to an organization, assuming that false negatives cost twice as much as false positives.

Data

For this project, I used the NSL-KDD dataset³ ¹⁴, an improved version of the 1999 KDD Cup dataset³. The data was generated from nine weeks of raw tcpdump traffic on a Local Area Network (LAN) configured to simulate the environment of a US Air Force network. The features in this data set are engineered from the raw tcpdump, and most of the feature descriptions can be found in the dataset documentation³ ¹⁴. Each row represents a sample of network traffic at a given point in time. Each sample is labeled as either “Malicious” or “Benign” and categorized into the attack categories DoS (Denial of Service), R2L (Remote to Local), U2R (User to Root), and Probe.

NSL-KDD Data Preview

Related Work

This project is inspired by the book Machine Learning & Security by Clarence Chio and David Freeman⁴. In the Network Traffic Analysis chapter, Chio and Freeman explore an ensemble method for intrusion detection. The end result was a system that performed very well in terms of recall but fell short in terms of precision⁵. My system considers and implements many of their strategies, refining them to optimize for both precision and recall. The results from their ensemble model, plotted below, will serve as a baseline.

Accuracy: 85%

Classification Report: Chio/Freeman Model

Methodology

Feature Selection

  1. Categorize the columns as continuous, binary, or nominal.
  2. Drop features with only one unique value (num_outbound_cmds was the only such feature, so it was removed from the data set).
  3. Validate values against their documented attribute types (su_attempted is supposed to be a binary attribute but had three distinct values).
  4. One-hot encode the nominal variables (see the preprocessing sketch after this list).
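
A minimal sketch of these preprocessing steps, assuming the NSL-KDD training data is loaded into a pandas DataFrame named train_df with the published column names (the variable name and the specific fix for su_attempted are my assumptions, not the article’s code):

```python
import pandas as pd

# train_df: NSL-KDD training data with the published column names (assumed name).

# 2. Drop features with only one unique value (num_outbound_cmds in this data).
constant_cols = [col for col in train_df.columns if train_df[col].nunique() == 1]
train_df = train_df.drop(columns=constant_cols)

# 3. su_attempted is documented as binary but contains a third value (2);
#    one common fix is to collapse that value back to 0.
train_df["su_attempted"] = train_df["su_attempted"].replace(2, 0)

# 4. One-hot encode the nominal columns.
nominal_cols = ["protocol_type", "service", "flag"]
train_df = pd.get_dummies(train_df, columns=nominal_cols)
```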

Resampling

Training: Attack Category Distribution

In order to maximize recall and create a model that adapts to uncommon and unseen attacks, I need to make sure these uncommon attacks are better represented in the training data. As seen above, although the malicious vs. benign label distribution is fairly even, the distribution of attack categories in the training data is very uneven. To increase the representation of the minority classes and decrease the representation of the majority classes, I take a resampling approach that combines oversampling and undersampling⁵.

Training: Resampled Attack Category Distribution

Using the SMOTE¹⁰ and RandomUnderSampler¹¹ classes from Imbalanced-Learn, I now have a balanced training set.
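
A sketch of this resampling step, assuming X_train and y_cat_train hold the encoded features and attack-category labels; the 40,000-sample cap is illustrative rather than the exact figure used in the project:

```python
from collections import Counter
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# X_train / y_cat_train: encoded features and attack-category labels (assumed names).
counts = Counter(y_cat_train)
cap = {label: min(count, 40000) for label, count in counts.items()}

# Trim only the over-represented categories down to the cap.
under = RandomUnderSampler(sampling_strategy=cap, random_state=42)
X_under, y_under = under.fit_resample(X_train, y_cat_train)

# Synthesize samples for every non-majority category (e.g. R2L, U2R) with SMOTE
# so all classes end up comparably sized.
over = SMOTE(sampling_strategy="not majority", random_state=42)
X_res, y_res = over.fit_resample(X_under, y_under)
```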

Predicting Attack Category — Multi-Class Modeling

I am focused on intrusion detection; however, identifying the correct attack category will help threat hunters with intrusion prevention. I cross-validate four classification models on the training data, comparing accuracy.
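
The article doesn’t list the four models explicitly; a plausible sketch of the comparison, pitting a linear baseline against three tree-based classifiers with 3-fold cross-validation, might look like the following (the model choices and hyperparameters are assumptions):

```python
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

# XGBoost expects integer class labels, so encode the attack categories first.
y_enc = LabelEncoder().fit_transform(y_res)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=42),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=42),
    "xgboost": XGBClassifier(eval_metric="mlogloss"),
}

for name, model in models.items():
    scores = cross_val_score(model, X_res, y_enc, cv=3, scoring="accuracy")
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```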

Multi-Class 3-Fold Cross-Validation

The tree models performed much better than the linear model (logistic regression), with XGBoost having the best results.

Fitting the XGBoost model on the entire training set and predicting on the test set, I get the following:

Accuracy: 79%

XGBoost Multi Classifier: Multi-Class Results

XGBoost does an excellent job at identifying Probe and DoS attacks but lacks recall when it comes to R2L and U2R, which had less representation in the initial training set. Resampling helped improve this; however, the training attributes for those classes weren’t sufficiently representative of the test data.

Mapping the attack categories to malicious and benign, here are the binary results:

Accuracy: 81%

Classification Report: XGBoost Multi Classifier
Confusion Matrix: XGBoost Multi Classifier

The multi-class model does a decent job of identifying attack categories; however, its recall for malicious samples is too low. I decided to focus more on intrusion detection than attack classification, but I still valued these results when choosing an algorithm. I’ll use this as a baseline moving forward and will also incorporate it into the dashboard.

Intrusion Detection — Binary Classification

Feature Reduction

I reduced the number of features with SelectPercentile⁶, which performs ANOVA tests and returns each feature’s F-value as a measure of its independent relevance to the target vector. After some experimentation, I decided to keep the top 33% of features according to their F-scores.
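
A sketch of this selection step, assuming X_train is a DataFrame of the encoded features and y_bin_train the binary malicious/benign labels (both names are assumptions):

```python
from sklearn.feature_selection import SelectPercentile, f_classif

# Rank features by their ANOVA F-score against the binary target and keep the top 33%.
selector = SelectPercentile(score_func=f_classif, percentile=33)
X_reduced = selector.fit_transform(X_train, y_bin_train)

# Columns retained by the selector.
kept_features = X_train.columns[selector.get_support()]
print(f"{len(kept_features)} features kept out of {X_train.shape[1]}")
```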

Metrics for Model Selection

Since I plan on optimizing for precision and recall in the context of intrusion detection, it is important that the model performs well across all thresholds. Therefore, Receiver Operating Characteristic (ROC) Area Under the Curve (AUC) is the appropriate metric to maximize when selecting a model. The ROC curve plots the True Positive Rate (TPR) against the False Positive Rate (FPR). The area under this curve essentially tells us the probability that a randomly chosen malicious example receives a higher score than a randomly chosen benign example.

After comparing the AUC of different models using 5-fold cross-validation, the tree-based models outperformed the others, with XGBoost performing best.

Below I set up a pipeline to standardize the features, reduce them with SelectPercentile, and fit the data to an XGBoost classifier.
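
A sketch of such a pipeline (the XGBoost hyperparameters are left at their defaults here, since the original values are not reproduced; X_train / y_bin_train are assumed names):

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectPercentile, f_classif
from xgboost import XGBClassifier

pipeline = Pipeline(steps=[
    ("scale", StandardScaler()),                                        # standardize features
    ("select", SelectPercentile(score_func=f_classif, percentile=33)),  # keep top 33% by F-score
    ("clf", XGBClassifier(eval_metric="logloss")),                      # binary classifier
])
pipeline.fit(X_train, y_bin_train)
```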

Threshold Selection

I split the data into training and validation, fit the model to the training data, and plot the ROC curve of the validation results.

ROC Curve
ROC Curve (Zoomed in)

Remember that the goal is to maximize precision and recall, with a larger emphasis on recall. This minimizes the number of attacks that pass through the detection system without sacrificing the integrity of the algorithm by flagging too many false positives. As seen above, I can adjust the threshold to improve the true positive rate without significantly increasing the false positive rate. Setting a threshold with a true positive rate of 1 means there are no false negatives; hence, the model flags every single malicious sample. With this in mind, I chose an optimal threshold of 0.0050058886, yielding a validation recall of 100% and precision of 98%. In other words, any sample that has at least a 0.5% chance of being malicious, according to the model’s prediction probabilities, is labeled as malicious.
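
A sketch of how such a threshold can be picked from the validation ROC curve and then applied to the test set (X_val, y_val, X_test, and the variable names are assumptions):

```python
from sklearn.metrics import roc_curve

# Probability of the malicious class on the held-out validation split.
val_probs = pipeline.predict_proba(X_val)[:, 1]
fpr, tpr, thresholds = roc_curve(y_val, val_probs)

# Largest threshold whose validation true positive rate reaches 1.0,
# i.e. the least aggressive cut-off that misses no malicious validation sample.
chosen_threshold = thresholds[tpr >= 1.0].max()
print("chosen threshold:", chosen_threshold)

# Apply the tuned threshold in place of the default 0.5 cut-off.
test_probs = pipeline.predict_proba(X_test)[:, 1]
xgb_preds = (test_probs >= chosen_threshold).astype(int)
```

Because roc_curve returns one threshold per distinct score, this simply picks the loosest cut-off that still catches every malicious validation sample.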

Let’s see the test results of this model with the refined threshold.

Accuracy: 90%

Classification Report: XGBoost with Threshold
Confusion Matrix: XGBoost with Threshold

So far, the model shows a 5% improvement in accuracy from the Chio/Freeman model, with a 16% improvement in precision and 6% reduction in recall.

I want to further explore ways to improve recall while maintaining this high precision. Considering cyber attack patterns, many attackers create new malware that is slightly different from a previous version, substantially different, or a completely new zero-day exploit. So how can a machine learning algorithm label instances that aren’t represented in the training data? I turn towards anomaly detection. After experimenting with many different algorithms, I decided that clustering was the best approach.

Clustering

My goal is to complement the classification model with a clustering model. Ideally, the clustering model would pick up some of the malicious samples that the classification model mislabeled. It’s more important in this model to balance precision and recall, because clustering will only complement classification well if its predictions are precise enough to override the classifier’s. In order to maximize precision and recall in this context, I turn to the F1 score.

First, I plot out the classes of the training set in a 2D vector space.

Actual

At first glance, the clusters don’t look very well defined, but I could potentially use clustering to detect some of the obvious outliers.

Next, I fit a simple 2-cluster K-Means model, using Principal Component Analysis (PCA) to reduce the dimensionality.
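
A sketch of the 2-cluster baseline, projecting the scaled features onto two principal components before clustering (whether the project clustered on two components or more is an assumption, as are the variable names):

```python
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Scale, project to 2 principal components for the 2D plots, then cluster.
X_scaled = StandardScaler().fit_transform(X_train)
X_pca = PCA(n_components=2, random_state=42).fit_transform(X_scaled)

kmeans_2 = KMeans(n_clusters=2, n_init=10, random_state=42)
clusters_2 = kmeans_2.fit_predict(X_pca)
```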

Predicted: 2-cluster K-Means model

The 2-cluster model does a poor job representing the actual classes. My next approach is to experiment with a larger number of clusters and optimize a strategy that will map them back to malicious or benign.

After many iterations over the cluster count and strategy parameters, I fit a 27-cluster model.

Predicted: 27-cluster model

The clusters here look much more centered and distinct than in the 2-cluster model. So what was my strategy, and how do I map these clusters to the class labels of benign and malicious? Both questions have one answer:

Inspired by Chio and Freeman⁴, this cluster mapping technique involves anomaly detection and majority-class labeling for malicious samples.

The Strategy:

  • If the percentage of malicious traffic within a cluster is greater than 95%, all instances in the cluster are labeled as malicious.
  • If the cluster’s size relative to the total population is less than 0.1%, all instances in the cluster are labeled as malicious (a code sketch of this mapping follows the list).
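
A sketch of this mapping rule, assuming clusters holds the 27-cluster assignments for the training rows and y_bin_train the binary labels with 1 = malicious (names and data layout are assumptions):

```python
import numpy as np
import pandas as pd

cluster_df = pd.DataFrame({"cluster": clusters, "label": y_bin_train})
stats = cluster_df.groupby("cluster")["label"].agg(["mean", "size"])
stats["share"] = stats["size"] / len(cluster_df)

# A cluster maps to "malicious" if it is almost entirely malicious traffic,
# or if it is a tiny outlier cluster relative to the whole population.
malicious_clusters = stats[(stats["mean"] > 0.95) | (stats["share"] < 0.001)].index

cluster_preds = np.isin(clusters, malicious_clusters).astype(int)
```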

This differs from Chio and Freeman’s model because it refines the parameters of cluster number, malicious traffic percentage, and relative cluster size to maximize the F1 Score. Additionally, my approach only focuses on labeling malicious instances and doesn’t use machine learning models on a cluster-by-cluster basis⁴.

Cluster Mapping Table of the First 6 Clusters

Each malicious label from this strategy can be traced back to either a majority-malicious cluster or an outlier, which will be useful in model interpretability later on. After mapping the clusters according to this strategy, here is the side-by-side comparison of the predicted values vs actual values in a 2D vector space.

K-Means Mapped Predictions vs Actual Labels

After mapping, the predictions and the labels look virtually identical.

Create Ensemble Model: Combine Classification and Clustering

The way I combine the XGBoost Model with the K-Means model is by overwriting any cases where XGBoost labeled a sample as benign but the K-Means approach flagged the sample as malicious.
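
In code, this override is a one-liner, assuming xgb_preds and cluster_preds are the two models’ binary predictions with 1 = malicious (assumed names):

```python
import numpy as np

# Keep the XGBoost prediction unless the clustering step flags the sample as malicious.
ensemble_preds = np.where(cluster_preds == 1, 1, xgb_preds)
```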

Evaluation

The final model’s results are as follows:

Accuracy: 90%

Classification Report: Ensemble Model
Confusion Matrix: Ensemble Model

Adding clustering to the XGBoost model improved the recall for malicious traffic while maintaining precision. The algorithm found 109 additional malicious attacks at the small price of mislabeling only 5 benign samples, thereby increasing classifier performance without degrading its integrity.

Before I compare this model to the other baseline models I’ve discussed, it’s important to define a cost function. The cost or impact of a cyber attack is an entire area of research in itself, with many variables. With the goal of expanding on this research in the future, the cost function I define in this article is a temporary heuristic.

Based on the logic that an undetected cyber attack is worse than a false alarm, I assume that a false negative is twice as costly as a false positive: the average false alarm costs an organization $1,000, and the average undetected cyber attack costs twice as much at $2,000.

Cost Function
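
In code, this heuristic reduces to a weighted sum over the confusion matrix (y_test and ensemble_preds are assumed names for the binary test labels and predictions):

```python
from sklearn.metrics import confusion_matrix

# $1,000 per false alarm, $2,000 per missed attack.
tn, fp, fn, tp = confusion_matrix(y_test, ensemble_preds).ravel()
weighted_cost = 1000 * fp + 2000 * fn
print(f"Weighted cost: ${weighted_cost:,}")
```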

Results

Going from left to right, it is apparent that the XGBoost multi-class predictor, mapped to binary labels, has the highest precision at 97%. Unfortunately, this model underperforms in every other area. Comparing the ensemble model to the main baseline, the Chio/Freeman model, there is a precision improvement of 16%, signifying improved confidence in the Intrusion Detection System’s flagged samples. The Chio/Freeman model still outperforms the other models in recall at 97%, with my ensemble model following at 92%; I believe there is still room for improvement in this area. The F1 score of the ensemble model outperforms the others, with the thresholded XGBoost model following closely at 0.91, and beats the Chio/Freeman baseline by 0.05. The high F1 score accomplishes my goal of optimizing precision and recall. The ensemble model and the thresholded XGBoost model are very similar in accuracy, with a rounded accuracy of 90% each. Finally, the ensemble model outperformed the others in weighted cost, coming in $220,000 cheaper than the next best performer and $540,000 cheaper than the Chio/Freeman model.

Discussion

While my model did optimize precision and recall by achieving the highest F1 score, I do not have a concrete measurement for optimizing these metrics in the context of intrusion detection. Keep in mind that my cost function is a temporary heuristic. Different attack categories, network architectures, systems, and organizations have different costs, and these should all be considerations when modeling a new cost function. For example, probing attacks usually don’t cost the organization much, if anything, because they’re typically used to surveil a network rather than breach it. Additionally, a successful DoS attack may cost Facebook $3 million while it only costs a local coffee chain $2,000. Since not all of this data is readily available, my next step would be to research the average cost of each attack category and include that in the calculation. Afterwards, I would adjust the cost of false positives, factoring in average labor costs, the opportunity cost of missing actual attacks, and maybe even the cost of degraded model confidence.

Explainability

My system now does a decent job of identifying attacks. The next step is to explain those predictions in an automated way that can help cyber defense protect their infrastructure. Explainable Intrusion Detection Systems can point threat hunters in the right direction.

Using the Streamlit¹³ Python library, I have created a proof-of-concept dashboard showing how an Intrusion Detection System can also serve as an Intrusion Prevention System by flagging attacks, prioritizing them by risk, and recommending specific remediation paths to threat hunters.
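
A stripped-down sketch of how such a dashboard could be laid out with Streamlit; the CSV export of scored samples and its column names are assumptions for illustration, not the project’s actual implementation:

```python
import pandas as pd
import streamlit as st

# Hypothetical export of the model's flagged samples with their scores and categories.
flagged = pd.read_csv("flagged_samples.csv")

st.title("Intrusion Detection Dashboard")

severity = st.sidebar.selectbox("Severity", ["All", "Critical", "High", "Medium", "Low"])
if severity != "All":
    flagged = flagged[flagged["severity"] == severity]

sample_id = st.selectbox("Flagged sample", flagged.index)
sample = flagged.loc[sample_id]

st.metric("Risk score", sample["risk_score"])
st.write("Predicted attack category:", sample["attack_category"])
st.write("Suggested remediation:", sample["remediation"])
```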

IDS Dashboard

Interpreting Models

In order to explain the malicious prediction of the ensemble model, I have to automatically interpret both the XGBoost classifier and the K-Means clustering algorithm.

XGBoost: For XGBoost, one option is to extract feature importances and potentially fit a linear model to the data to help determine directionality, but in a complex tree-ensemble model like XGBoost this isn’t the best representation of the underlying model. The most faithful interpretation of an XGBoost model would be plotting its decision trees; however, an entire forest diagram would be difficult to follow and impractical for security teams expected to work at a fast pace. Instead, I turned to the software package LIME (Local Interpretable Model-Agnostic Explanations), which approximates a model’s behavior in the vicinity of a particular instance⁸. I chose LIME because it breaks complex models down into local approximations that are easy to understand.

For my IDS, LIME can give local explanations of the samples XGBoost labeled as malicious. The chart below shows the 10 most important features in labeling a particular sample, along with each feature’s weight and directionality.
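
A sketch of generating such a local explanation with LIME, assuming X_train_sel / X_test_sel are the standardized, reduced feature matrices fed to the classifier, feature_names their column names, and xgb_clf the fitted XGBoost step of the pipeline (all assumed names):

```python
import numpy as np
from lime.lime_tabular import LimeTabularExplainer

explainer = LimeTabularExplainer(
    training_data=np.asarray(X_train_sel),
    feature_names=list(feature_names),
    class_names=["benign", "malicious"],
    mode="classification",
)

sample_idx = 0  # index of a sample flagged as malicious (illustrative)

# Explain one flagged sample with its 10 most influential features, as in the chart.
explanation = explainer.explain_instance(
    data_row=np.asarray(X_test_sel)[sample_idx],
    predict_fn=xgb_clf.predict_proba,
    num_features=10,
)
print(explanation.as_list())
```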

For predicting the category, I use the multi-class XGBoost model discussed earlier.

K-Means: For interpreting the K-Means model, I first identify the predicted cluster. Next, I plot a heat map of the cluster centers, showing the relative values of all of the features and highlighting the cluster for that sample. A feature whose value in the cluster-center vector is high in absolute terms is deemed important and included in the table of feature values.

For predicting the category, I take the most common attack category in each cluster unless it’s an outlier cluster.

Risk

I calculate risk as the XGBoost model’s predicted probability that the sample is malicious, standardized between 1 and 10. I map these risk scores to severities according to the National Vulnerability Database’s CVSS v3.0 Impact Severity mappings⁹.
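
A sketch of this mapping, assuming a simple linear rescaling of the predicted probability onto the 1–10 range (the exact standardization used in the project is not shown) and the CVSS v3.0 qualitative severity bands:

```python
def risk_score(prob_malicious: float) -> float:
    """Rescale a predicted probability of maliciousness onto a 1-10 range."""
    return 1 + 9 * prob_malicious

def severity(score: float) -> str:
    """CVSS v3.0 qualitative bands: Critical 9.0+, High 7.0+, Medium 4.0+, else Low."""
    if score >= 9.0:
        return "Critical"
    if score >= 7.0:
        return "High"
    if score >= 4.0:
        return "Medium"
    return "Low"

test_probs = pipeline.predict_proba(X_test)[:, 1]   # assumes the fitted pipeline above
severities = [severity(risk_score(p)) for p in test_probs]
```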

Mapping the risk scores to their severity for the test population results in the following distribution.

Severity Distribution of Test Set

There are many Critical-severity flags here, with Low being the next most common. The dashboard gives threat hunters the option of filtering by severity, allowing them to prioritize higher-severity flags.

Remediation

The last step in turning the Intrusion Detection System into an Intrusion Prevention System is to point threat hunters in the right direction and offer a remediation path. Threat hunters can get a general sense of how an attack was detected and the root of the problem by looking at the important features highlighted in the bar chart or heat map. In addition to these charts, the dashboard gives the values of the top 3 most important features and the attack category to which the sample likely belongs. For each attack category, I provide a remediation path: a suggested set of actions threat hunters or security analysts should take to validate and remediate the attack.

From start to finish, the dashboard allows any threat hunter to quickly detect attacks, prioritize them, identify key features to investigate, see the predicted category of each attack, and follow a suggested remediation strategy for it.

Here, I walk through a couple examples of how the dashboard can be used.

Example #1

In this example, the threat hunter wants to focus on Critical attacks. He filters the severity to critical and chooses a flagged sample of network traffic. This sample has the highest risk score possible of 10. It is likely to be a Denial of Service attack, which makes sense to the threat hunter because he can see that there were 255 connections to the same destination IP address (“dst_host_count”). Following the dashboard’s suggestion, the threat hunter investigates these connections in the network traffic logs, confirming they are coming from a known botnet and validating that it’s a Denial of Service attack. He immediately notifies Incident Response and Network Security to block these connections and divert traffic.

Example #2

After taking care of the high-priority issues for the day, a security analyst is tasked with investigating some of the lower-priority items that have been in the queue for a while. He notices that one sample is an outlier and wants to investigate further. Following the advice of the dashboard, he looks into which features could have caused this traffic to be flagged and believes that a failed login attempt increased the system’s suspicion. He monitors the user’s activity and notices more failed login attempts, frequently after business hours and over remote connections. The analyst works with the system administrator to harden the server’s security by requiring remote connections to go through a jump server and adding two-factor authentication. The number of failed login attempts subsides over the following week.

Discussion

There are several areas for improvement in the dashboard and explainability model: the risk score calculation, the multi-class model, the alignment of remediation paths, and building a production-ready dashboard.

I used a temporary heuristic for measuring risk. This should be adjusted according to industry standards and weighted differently by attack category and host system. Additionally, there should be a better method of quantifying risk for the K-Means predictions.

Another next step would be improving the multi-class XGBoost model that predicts attack category. I decided to focus more on intrusion detection than attack classification, but I did end up incorporating the multi-class model into the dashboard to help threat hunters. One way to improve the model would be to drop benign examples from the training set and classify only malicious samples, since the binary IDS already handles detection. Improving this model offers the most benefit for the least amount of effort and would increase confidence in the dashboard’s predicted categories and remediation paths.

Speaking of which, remediation paths should be defined by subject matter experts. While I wrote the remediations to the best of my knowledge, I would like to get the opinion of an experienced threat analyst to refine the recommendations and offer new ones.

The final step, if building this for an organization, would be to set up the dashboard in a production environment. This would mean setting up an ETL to extract relevant features from a system’s network traffic, streaming this to the dashboard, updating the dashboard in real-time, and incorporating a feedback loop to re-tune the model.

Conclusion

With the ensemble model, I accomplished my goal of optimizing precision and recall. By combining XGBoost with K-Means, I was able to outperform the baseline models in a number of areas, including the weighted cost metric I defined. Since some models still had advantages over mine and the weighted cost metric is a subjective heuristic, I can’t objectively claim that my model holistically outperforms the others. Nonetheless, the model did accomplish my stated goals.

I believe the main takeaway from this project is that artificial intelligence is a valuable tool that can assist cybersecurity teams with intrusion detection. Machine learning is a robust answer to IDS evasion techniques and evolving threat actors in this space. Additionally, these models and security teams can work together to help one another: the models can detect a large number of threats that humans can’t spot on their own, point threat hunters in the right direction, and incorporate user feedback to adapt to new scenarios.

So what happens when cyber criminals face robots? The question should be reframed. AI is remarkable in its ability to detect a large number of threats, but it has many weaknesses, such as model degradation, susceptibility to poisoning and hacking, and an inability to automatically remediate complex or unseen attacks. Humans lack the ability to process such a large amount of data in real time, but possess the creativity and business acumen required to pursue and stop some of the most sophisticated cyber criminals. Machine learning models can fill in many of the gaps in cyber response teams and vice versa. Intrusion Detection Systems and threat hunters complement each other to create an even stronger system. So the question should be, rather, “What happens when cyber criminals face cyborgs?” When humans and AI work together to defend against cyber crime, threat actors probably won’t like the answer.

References

  1. Feldman, Sarah, and Felix Richter. “Detecting Security Intrusions Is Top AI Application in 2018.” Statista Infographics, Statista, 5 Apr. 2019, https://www.statista.com/chart/17630/artificial-intelligence-use-in-business/.
  2. Rezek, Michael. “What Is the Difference between Signature-Based and Behavior-Based Intrusion Detection Systems?” Accedian, Accedian, 29 June 2022, https://accedian.com/blog/what-is-the-difference-between-signature-based-and-behavior-based-ids/.
  3. Tavallaee, Mahbod, et al. “A Detailed Analysis of the KDD CUP 99 Data Set.” Proceedings of the 2nd IEEE Symposium on Computational Intelligence in Security and Defense Applications (2009): 53–58.
  4. Chio, Clarence, and David Freeman. Machine Learning and Security: Protecting Systems with Data and Algorithms. O’Reilly Media, Incorporated, 2018.
  5. Chio, Clarence. “Book-Resources/NSL-KDD-Classification.ipynb at Master · Oreilly-MLSEC/Book-Resources.” GitHub, Oreilly, https://github.com/oreilly-mlsec/book-resources/blob/master/chapter5/nsl-kdd-classification.ipynb.
  6. “Sklearn.feature_selection.Selectpercentile.” Scikit, Scikit-Learn, https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectPercentile.html.
  7. Tung Nguyen, Pixabay, https://pixabay.com/illustrations/technology-sci-fi-futuristic-7111801/.
  8. “Local Interpretable Model-Agnostic Explanations (LIME).” Lime 0.1 Documentation, GitHub, https://lime-ml.readthedocs.io/en/latest/.
  9. “Vulnerability Metrics.” NVD, NVD, https://nvd.nist.gov/vuln-metrics/cvss.
  10. “Smote.” SMOTE — Version 0.9.1, Imbalanced-Learn, https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.SMOTE.html.
  11. “Randomundersampler.” RandomUnderSampler — Version 0.9.1, Imbalanced-Learn, https://imbalanced-learn.org/stable/references/generated/imblearn.under_sampling.RandomUnderSampler.html.
  12. Elias, Nad, et al. “Top-Growing Jobs Sectors: AI, Data Science, & Cybersecurity.” RecruitAbility, RecruitAbility, 23 Mar. 2020, https://www.therecruitability.com/ai-data-science-and-cybersecurity-americas-fastest-growing-jobs-sectors/.
  13. “Streamlit Docs.” Streamlit Documentation, Streamlit, https://docs.streamlit.io/.
  14. “NSL-KDD Dataset.” University of New Brunswick Est.1785, UNB, https://www.unb.ca/cic/datasets/nsl.html.
