Which GPT-like Model Engineering Techniques Work on System Logs?

Evaluation of Transformer Neural Network Modeling Methodologies applied to Malware Behavior Traces

Dmitrijs Trizna
Towards Data Science

--

Figure 1. Self-attention activations from the Transformer model trained on malware emulation reports. Image by the author.

This article evaluates various Transformer Neural Network (the architecture powering GPT) engineering methodologies applied to machine data: malware behavioral logs from the Speakeasy emulator. The data used in these experiments has been freely available since its release as part of this publication on hybrid malware analysis [Trizna], with archives downloadable here. You have access to the data and are free to replicate or advance these results!

Originally, the Transformer was presented as an encoder-decoder architecture suitable for sequence-to-sequence tasks like natural language translation. Later it was adapted to other tasks, such as the decoder-only models behind GPT that excel at text generation. Since we use the Transformer for classification rather than generation, the model consists only of encoder layers (model's PyTorch code), similar to architectures used, for instance, in BERT [Devlin et al.].
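For orientation, a minimal sketch of such an encoder-only classifier in PyTorch could look as follows. The dimensions, layer count, and the mean-pooling head are illustrative assumptions, and positional encodings are omitted for brevity; the exact model is in the linked code.

```python
import torch
from torch import nn

class EncoderClassifier(nn.Module):
    """Encoder-only Transformer for sequence classification; sizes are illustrative."""
    def __init__(self, vocab_size=50_000, d_model=64, nhead=4, num_layers=2, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, dim_feedforward=128, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, token_ids):
        hidden = self.encoder(self.embed(token_ids))  # (batch, seq_len, d_model)
        return self.head(hidden.mean(dim=1))          # mean-pool over the sequence

model = EncoderClassifier()
logits = model(torch.randint(0, 50_000, (4, 512)))    # 4 reports of 512 tokens each
print(logits.shape)                                    # torch.Size([4, 2])
```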

Speculatively, the same conclusions about the Transformer engineering methods drawn in this article can be extended to any set of system logs: operating system telemetry like Sysmon on Windows or corresponding Linux frameworks like auditd, application-level logs such as kube-audit events from the Kubernetes API server, access logs from HTTP servers, and so on.

Figure 2. A schematic and simplified view of pre-processing filtered JSON events to a sequence of tokens with filters and normalization. Image by the author.

While part of an ongoing study, this article avoids a broader discussion of this data and architecture combination and instead focuses on the methodology and PyTorch code of Transformer modeling techniques, evaluating the relative utility of engineering strategies rather than absolute performance. The list of evaluated configurations:

  • triangular, cosine, step, and one-cycle learning rate schedulers;
  • accumulated gradients;
  • gradient clipping;
  • attention block’s pre-norm vs. post-norm;
  • data optimizations;
  • learning rate dependence on model size.

For all the configuration options, I do a three-fold cross-validation run and report mean ROC curves for the validation and test sets, as well as training losses. This article does not cover the models’ self-supervised pre-training but focuses on the downstream task of malware classification (i.e., supervised learning).

Several publications have already systematically analyzed Transformer engineering advancements on natural language, and here we explore them on machine data. If you wish to investigate the ideas presented here further, I suggest studying, for instance, the “cramming” paper [Geiping and Goldstein].

Optimizing the Dataset

The data consists of JSON reports representing emulation results from ~120k malware and benignware samples, with malicious examples spanning seven malware types such as ransomware, trojans, and backdoors. However, I limit experiments to binary classification, with the Clean label representing the benign class and all others representing malicious samples. It is worth noting that the test set was collected three months after the training set to introduce an evaluation in the presence of concept drift.

Since emulation is an imperfect way to obtain the behavior of evasive malware samples, further filters and normalization are applied to the reports mentioned above to drop failed or incomplete emulations, resulting in 76,126 samples in the training set and 17,407 samples in the test set.

Field filters for sequence length reduction

Based on our observations, any semantic logic stored in machine data spans a very long sequence of tokens, including countless occurrences of metadata and environmental specifics that have little relevance for modeling tasks. Arguably, this is one of the most drastic differences between natural and machine languages: the former packs concise semantics into relatively short sentences with few or no superfluous components.

This is unfortunate for self-attention models, since self-attention has quadratic complexity in input length, meaning longer sequences are quadratically more expensive to process. This is because every element in the input to self-attention must attend to every other element in the same sequence; see the title Figure 1 above for a visualization.

Therefore, applying a domain-knowledge-based filter when working with machine data is mandatory. In our case, JSON events are filtered to keep manipulations of (1) files, (2) the registry, and (3) network access, as well as (4) API call names, arguments, and return values. This significantly increases epistemic density (i.e., the ratio of relevant knowledge per token) and data quality.
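As an illustration, a minimal version of such a filter could recursively keep only whitelisted record types. The key names below are simplified placeholders rather than the exact Speakeasy report schema:

```python
# Illustrative field filter: keep only behaviorally relevant record types from
# an emulation report. Key names are simplified placeholders, not the exact
# Speakeasy report schema.
KEEP_KEYS = {"file_access", "registry_access", "network_events", "apis"}

def filter_report(report: dict) -> dict:
    """Drop metadata and environmental fields, keeping only whitelisted record types."""
    filtered = {}
    for key, value in report.items():
        if key in KEEP_KEYS:
            filtered[key] = value
        elif isinstance(value, dict):        # recurse into nested objects
            nested = filter_report(value)
            if nested:
                filtered[key] = nested
        elif isinstance(value, list):        # recurse into lists of nested objects
            nested = [filter_report(v) for v in value if isinstance(v, dict)]
            nested = [v for v in nested if v]
            if nested:
                filtered[key] = nested
    return filtered
```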

Sequence length is limited to 512 tokens during the experiments below to iterate over configuration combinations faster, but keeping longer sequences for final performance evaluations or production implementations is advisable. There is evidence that task performance benefits from longer sequences, as depicted in the heatmap below:

Figure 3. Mean True Positive Rates under a False Positive Rate of 0.001 (one false alert per 1,000 analyses), depending on variable sequence lengths and vocabulary sizes. Image by the author.

Tokenization

Machine data can have a significantly larger vocabulary than natural language, as the language used is often not defined by distinct lexical boundaries or grammatical rules.

In system logs, it is common to see arbitrary character combinations like /tmp/83afba/setup.bin or jre1.8.0_311, which explode the vocabulary if handled improperly. For instance, I observe ~3,000 unique API names, with ~1,000 appearing only once in the training data.

Therefore, every extra field from the original JSON that you include for the model to analyze significantly increases the vocabulary size. Consider the logarithmic plot below, which visualizes the frequency distribution of tokens in the training set under different JSON field filter setups:

Figure 4. Token frequency distribution for different JSON field filters applied to the training set. Image by the author.

For instance, with a field filter that keeps API calls and file, network, and registry records, the total vocabulary size is about 250k tokens. With no filters applied, this number jumps to nearly 800k tokens, more than tripling the vocabulary and significantly reducing the epistemic density (valuable information per token) of the data. This emphasizes the importance of appropriate domain-knowledge-influenced filters to raise the quality of the data the model receives.

The same applies to normalization of arbitrary value fields like hashes and IP addresses. For instance, leaving hashes untreated drastically expands the vocabulary, while normalizing them to placeholders like <sha256> and <md5> with a function like the one below yields only a few easy-to-interpret tokens that the model can use:
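A minimal sketch of such a normalization is shown below; the regular expressions are illustrative, and the full set of rules in my pipeline is more extensive:

```python
import re

def normalize_hashes(text: str) -> str:
    """Replace hex digests with placeholder tokens so they do not inflate the vocabulary."""
    # Illustrative patterns: a 64-character hex string is treated as SHA-256,
    # a 32-character hex string as MD5.
    text = re.sub(r"\b[0-9a-fA-F]{64}\b", "<sha256>", text)
    text = re.sub(r"\b[0-9a-fA-F]{32}\b", "<md5>", text)
    return text

print(normalize_hashes("dropped 9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08"))
# dropped <sha256>
```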

This is just a single, simple example; all the normalization implemented in my pipeline can be found here.

However, another essential trick is improving the tokenization itself so that technical variability in machine data does not endlessly create new tokens. Natural language processing already offers a solution that fits machine data perfectly.

I define a custom JSON Byte Pair Encoding (BPE) tokenizer (code) based on Google’s SentencePiece, which analyzes the relative co-occurrence of bytes. That way, tokens do not represent distinct values in technical language but parts of longer value sequences, as exemplified below:

Figure 5. Example of whitespace tokens. Image by the author.
Figure 6. Example of sentencepiece’s BPE tokens. Image by the author.

I limit further experiments to the vocabulary of 50k BPE tokens.
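For reference, training such a tokenizer with SentencePiece takes only a few lines. The corpus path and the sample string below are illustrative assumptions; my actual tokenizer code is linked above:

```python
import sentencepiece as spm

# Train a BPE model on the normalized event corpus; "speakeasy_corpus.txt" is an
# illustrative path containing one pre-processed report per line.
spm.SentencePieceTrainer.train(
    input="speakeasy_corpus.txt",
    model_prefix="speakeasy_bpe",
    vocab_size=50_000,              # matches the 50k-token vocabulary used in the experiments
    model_type="bpe",
)

sp = spm.SentencePieceProcessor(model_file="speakeasy_bpe.model")
# An illustrative normalized event string, split into sub-word tokens.
print(sp.encode("kernel32.createfilea <drive>\\users\\<user>\\dropper.exe", out_type=str))
```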

Model size and learning rate

The choice of learning rate (LR) is straightforward and familiar to any deep learning practitioner. However, I want to emphasize that for Transformer architectures, selecting an appropriate LR range for a given model size is especially crucial.

Consider this table from the GPT-3 paper [Brown et al.] that reports the configuration of different models and variability in the learning rate used:

Table 1 from GPT-3 paper [Brown et al.]. Note how the learning rate and batch size vary depending on the number of trainable model parameters.

Note how significantly the learning rate is reduced as the model grows: roughly a 10x reduction in LR for a ~1000x increase in model size.

Because of the hardware setup (a single consumer-grade GPU), I use a model of modest size with ~5-6M parameters. Since this article evaluates relative configuration utility rather than chasing state-of-the-art absolute values, this size allows iterating over many configurations quickly while still providing a reasonable proxy for larger models’ behavior. Industry results show that model size (measured as the number of parameters in non-embedding layers) strongly predicts performance [Kaplan et al.], a property referred to as the “scaling law.” Therefore, increasing the model size and dataset should yield the necessary detection rates for a production release.

I decided to loop over a set of learning rates from 0.003 to 0.0001, decreasing each subsequent value by a factor of ~3. The results are as follows:

Figure 7. Mean values of validation and test set ROC curves and training losses for variable learning rates over three cv folds. Image by the author.

It is clear that 0.003 is too large, whereas 0.0001 is too small, with the optimal value somewhere between 0.0003 and 0.001. I selected an LR close to the lower end, ~0.0003, for further tests.

Learning Rate Scheduler

Once an appropriate maximal learning rate for a given model is selected, I explore various learning rate schedulers. A scheduler is a function that modifies the learning rate value during training.

I define multiple schedulers based on those reported in the most promising publications, specifically: (1) step, (2) triangular, (3) one-cycle, and (4) cosine, with the LR evolving over training time as depicted below:

Figure 8. Learning rate changes over training steps with different schedulers. Image by the author.
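For reference, all four schedules can be built from PyTorch’s stock lr_scheduler classes; a minimal sketch with illustrative hyperparameters:

```python
import torch
from torch import nn
from torch.optim import lr_scheduler

def make_scheduler(name: str, optimizer, max_lr=3e-4, total_steps=10_000):
    """Return one of the evaluated LR schedules; hyperparameters are illustrative."""
    if name == "step":
        return lr_scheduler.StepLR(optimizer, step_size=total_steps // 5, gamma=0.5)
    if name == "triangular":
        return lr_scheduler.CyclicLR(optimizer, base_lr=max_lr / 10, max_lr=max_lr,
                                     step_size_up=total_steps // 10, mode="triangular",
                                     cycle_momentum=False)  # required with Adam-family optimizers
    if name == "onecycle":
        return lr_scheduler.OneCycleLR(optimizer, max_lr=max_lr, total_steps=total_steps)
    if name == "cosine":
        return lr_scheduler.CosineAnnealingLR(optimizer, T_max=total_steps)
    raise ValueError(f"unknown scheduler: {name}")

model = nn.Linear(8, 2)                                     # placeholder model for illustration
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
scheduler = make_scheduler("triangular", optimizer)
# Inside the training loop, call scheduler.step() once per batch, right after optimizer.step().
```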

The results of experiments (including a run without any scheduler at all) are reported below:

Figure 9. Results of various scheduler applications. Image by the author.

We clearly see that schedulers with a “warmup” stage, i.e., those that do not start from the maximal LR value (“triangular” and “one-cycle”), have a higher training loss during the first ~2000 updates. From the ROC on the test set, we might conclude that one-cycle performs the worst, while running with no scheduler might yield higher True Positive Rate (TPR, a.k.a. detection rate) values under low False Positive Rate (FPR) requirements.

To resolve the ambiguity, here are the exact AUC and TPR values at specific FPRs, the lowest being 0.0001 (meaning one false alert per 10,000 analyses), on the test set:

Table 1. Mean True-Positive Rates (values) under specific False-Positive Rates (columns) and AUC on the test set over 3 cv folds. Table by the author.

This reveals that training with the “step” schedule yields the best results under especially low False Positive requirements, while the overall best AUC is achieved by the run with the “triangular” scheduler. Indeed, there is evidence that “warmup” might be important, especially for self-supervised pre-training.

It is interesting to note that training with no scheduler provides poor results under the lowest FPR. This might indicate that a “cooldown” (i.e., a gradual reduction of the LR value closer to the end of training) is especially important for finding the best model checkpoint in a local minimum. Using a scheduler with a “cooldown” might be crucial for information security applications, where False Positives drain a human analyst’s alert budget and are costly.

If you would like to examine the topic of LR schedulers more thoroughly, the following TDS article might be of interest.

Accumulating gradients

A curious reader may have noticed the batch size differences in the GPT-3 table above when we discussed how the learning rate varies with model size. Different model sizes require a different amount of data to produce an optimal gradient update. For the smallest models, the GPT authors used batches of ~0.5M tokens, while for the largest, including GPT-3 itself, they processed 3.2M tokens for each weight update.

When training with limited resources, the batch sizes that fit on a single GPU are suboptimal. A gradient vector computed from 64 or 96 samples might have a skewed direction, slowing convergence. Therefore, resources that discuss training under such constraints refer to the technique of accumulating gradients before updating the weights (i.e., before calling optimizer.step() in PyTorch). A training loop with gradient accumulation looks as follows:
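Below is a minimal sketch of the idea in native PyTorch; the model, data loader, and accumulation_steps value are placeholders for illustration:

```python
import torch
from torch import nn

# Placeholder model, optimizer, and data for illustration; the real pipeline uses
# the Transformer encoder classifier and tokenized Speakeasy reports.
model = nn.Linear(512, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
criterion = nn.CrossEntropyLoss()
loader = [(torch.randn(64, 512), torch.randint(0, 2, (64,))) for _ in range(32)]

accumulation_steps = 4              # effective batch size = 64 * 4 = 256

optimizer.zero_grad()
for step, (x, y) in enumerate(loader):
    loss = criterion(model(x), y) / accumulation_steps  # scale so the summed gradient matches one large batch
    loss.backward()                                     # gradients accumulate in the .grad buffers
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()            # update weights only every `accumulation_steps` batches
        optimizer.zero_grad()
```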

The tricky point is that training speed differs under the various configurations of accumulation_steps. For example, updating the weights every step is slower than doing so every 16 steps. Therefore, I set the same time budget of ten minutes per training run so that every configuration is evaluated with the same amount of computing resources. This allows runs with lower update rates to iterate over more batches within the same time (hence the different lengths of the training loss curves below). The results are as follows:

Figure 10. Mean values of validation and test set ROC curves and training losses for variable accumulated gradient batch values over three cv folds. Image by the author.

For some reason, this technique drastically decreases the performance of the final model. The same pattern appears whether gradient accumulation is implemented in native PyTorch, as exemplified above, or with accelerate from HuggingFace, so it should not be a bug in the implementation. Even though models with accumulation loop over significantly more batches within the same training budget, the classical implementation with ~5x less data achieves higher detection rates.

Gradient Clipping

Gradient clipping is a method of setting an upper limit on the gradient norm during a parameter update. Exploding gradients were a significant drawback of Recurrent Neural Networks (RNNs), and gradient clipping was introduced as a cure for them. However, practically all recent Transformer-focused papers also implement this technique: it stabilizes the training process with no significant drawbacks. PyTorch provides a dedicated clip_grad_norm_ function that implements this logic with a single line of code within the training loop (just before the optimizer.step() call, for example). The results with clipping values in a range from 0 to 1 are reported in Figure 11 below.
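For reference, a minimal sketch of where the clipping call sits in a training loop; the model and batch below are placeholders for illustration:

```python
import torch
from torch import nn

model = nn.Linear(512, 2)                                   # placeholder model for illustration
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
criterion = nn.CrossEntropyLoss()
x, y = torch.randn(64, 512), torch.randint(0, 2, (64,))     # dummy batch

optimizer.zero_grad()
loss = criterion(model(x), y)
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # cap the global gradient norm
optimizer.step()
```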

Figure 11. Mean values of validation and test set ROC curves and training losses for variable gradient clipping limits over three cv folds. Image by the author.

We clearly see that clipping gradients at 0.1 is suboptimal, while the curves alone offer no clear conclusion about the other three options. Therefore, as before, looking at the TPR at specific FPRs and the AUC values in the table below is highly informative:

Table 2. Mean True-Positive Rates (values) under specific False-Positive Rates (columns) and AUC on the test set over 3 cv folds. Table by the author.

Here it is obvious that a gradient clipping threshold of 1.0 provides the best metrics, with the highest overall AUC and at least a 3% benefit over the other options under extremely low False Positive conditions.

Layer Normalization — Input or Output

The original Transformer implementation applies Layer Normalization to the output of the self-attention block. Contemporary literature, however, agrees that normalizing the block’s input outperforms the vanilla setup (for more details, see the PreNorm vs. PostNorm discussions, e.g., [1], [2]).

Notably, PyTorch’s default behavior is still the classical post-norm, but it is easy to change with a single norm_first boolean on the Transformer encoder block. Our results confirm the public consensus, with input normalization outperforming the classical realization:

Figure 12. Mean values of validation and test set ROC curves and training losses for pre-norm and post-norm configurations. Image by the author.
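For reference, switching to pre-norm in PyTorch is a one-argument change; a minimal sketch with illustrative dimensions:

```python
import torch
from torch import nn

# Pre-norm (norm_first=True) vs. the default post-norm encoder layer;
# the dimensions below are illustrative, not the exact model configuration.
encoder_layer = nn.TransformerEncoderLayer(
    d_model=256, nhead=8, dim_feedforward=512,
    norm_first=True,       # apply LayerNorm to the block's input instead of its output
    batch_first=True,
)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=4)

tokens = torch.randn(2, 512, 256)   # (batch, sequence, embedding)
print(encoder(tokens).shape)        # torch.Size([2, 512, 256])
```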

An ongoing debate is whether RMSNorm outperforms conventional LayerNorm: implementations like LLaMA use RMSNorm, while other experiments reveal no benefit. We excluded this comparison from our analysis.

Summary

Through experimentation, we found that certain configurations, such as (1) input normalization, (2) “triangular” or “step” learning rate schedulers, and (3) gradient clipping around 1.0, effectively improved model performance. On the other hand, gradient accumulation did not improve performance, which is subject to further exploration.

Additionally, we discussed important caveats of engineering a Transformer model for machine data, such as (1) domain-knowledge-influenced filtering, normalization, and tokenization to reduce the vocabulary, and (2) choosing the learning rate with respect to the model size.

As mentioned above, speculatively, the same conclusions about Transformer engineering methods could be extended to other types of system logs, such as operating system telemetry or access logs from HTTP servers, and I hope that this article contributes to the challenging task of adopting modern AI techniques to ease the day-to-day tasks of industry professionals.

Sr. Security Researcher @ Microsoft. This blog is an independent R&D at the intersection of Machine Learning and Cyber-Security.