Leverage Cloud Technologies for Malware Hunting at Scale

How to index hundreds of terabytes of malware using Apache Spark and Iceberg tables

Jean-Claude Cote
Towards Data Science

In this article, we will show how we used Spark and Iceberg tables to implement a malware index similar to UrsaDB, and how we integrated this index into Mquery, an analyst-friendly web GUI used to submit YARA rules and display results.

This proof of concept was developed during GeekWeek, an annual workshop organized by the Canadian Centre for Cyber Security that brings together key players in the field of cyber security to generate solutions to vital problems facing the industry.

YARA in a Nutshell

YARA is a tool aimed at helping malware researchers identify and classify malware samples. With YARA, you can create descriptions of malware families based on textual or binary patterns. Each description, aka a rule, consists of a set of strings and a boolean expression which determines its logic. For example:

rule silent_banker : banker
{
    meta:
        description = "This is just an example"
        threat_level = 3
        in_the_wild = true

    strings:
        $a = {6A 40 68 00 30 00 00 6A 14 8D 91}
        $b = {8D 4D B0 2B C1 83 C0 27 99 6A 4E 59 F7 F9}
        $c = "UVODFRYSIHLNWPEJXQZAKCBGMT"

    condition:
        $a or $b or $c
}

If any of these sequences ($a, $b or $c) is found in a file, the rule will evaluate to true. YARA is an executable written in C which is typically invoked on the command line by passing it a rule and a folder of binary files to process.

Although YARA itself executes quickly, processing millions of malware samples in a brute-force fashion consumes a lot of CPU and I/O, and can thus take prohibitively long.

CERT Polska

To accelerate the evaluation of YARA rules, the Polish Computer Emergency Response Team (CERT-Polska) built a custom database called UrsaDB and an analyst-friendly web GUI named Mquery.

UrsaDB acts as a bloom filter to test whether a given sequence of bytes might be present in a binary file. False positive matches are possible, but false negatives are not. In other words, a query returns either “possibly in file” or “definitely not in file”.

Mquery first uses UrsaDB to find candidate files which might contain the desired sequence of bytes, and then uses YARA CLI to confirm if the file is a match. Thus the YARA CLI is evaluated on a small subset of the entire malware corpus, greatly accelerating the overall execution time.

Let’s dig a little deeper into UrsaDB.

UrsaDB

The main index in UrsaDB is the 3gram index.

For every file in the corpus, UrsaDB extracts all possible unique three-byte combinations. The 3gram index is essentially a big map where the key is a 3gram and the value is the list of files containing that 3gram.

For instance, if we index a text file containing the ASCII string TEST MALWARE (ASCII: 54 45 53 54 20 4D 41 4C 57 41 52 45), then the database generates the following trigrams (_ denotes the space character):

+---+-----------+---------+
| # | Substring | Trigram |
+---+-----------+---------+
| 0 | TES       | 544553  |
| 1 | EST       | 455354  |
| 2 | ST_       | 535420  |
| 3 | T_M       | 54204D  |
| 4 | _MA       | 204D41  |
| 5 | MAL       | 4D414C  |
| 6 | ALW       | 414C57  |
| 7 | LWA       | 4C5741  |
| 8 | WAR       | 574152  |
| 9 | ARE       | 415245  |
+---+-----------+---------+
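As a quick sanity check, the trigrams in the table above can be reproduced with a few lines of Python (a sketch for illustration, not part of the actual pipeline):

```python
# Extract all overlapping 3-byte substrings (trigrams) from a byte sequence.
def trigrams(data: bytes) -> list:
    return [data[i:i + 3] for i in range(len(data) - 2)]

sample = b"TEST MALWARE"
for i, t in enumerate(trigrams(sample)):
    print(i, t.decode(), t.hex().upper())
# row 0 prints: 0 TES 544553
```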

UrsaDB is a custom C program that runs on a single machine and is capable of handling a large number of malware files. The UrsaDB CLI provides commands to index, compact and search 3gram indices, which are stored on disk as custom-formatted binary files.

UrsaDB is limited to a single machine, and although it is possible to cluster multiple UrsaDB instances, it is not natively supported and managing multiple indices across machines is cumbersome. UrsaDB also requires that the index files reside on an attached storage device.

Many web scale companies are reevaluating their use of HDFS in favour of cloud blob storage. Blob storage is quickly becoming the de facto choice for storage of large datasets. There are many reasons for this and many articles have broached the subject.

In our proof of concept we wanted to see if we could leverage cloud blob storage and specifically Iceberg tables to reduce cost and simplify the management of a large-scale malware index.

Building a 3gram Index With Spark and Iceberg

UrsaDB stores its index as a sequence of file IDs and 3grams using an on-disk format. Furthermore, UrsaDB uses a run-length encoding technique which is also used in Parquet files (Parquet files are the foundation of Iceberg tables). The Parquet file format is well known and is generally used in big data platforms.

The first thing we need to build the index is a way to break files into 3grams. This is the most compute-intensive part of building an index. We tried many approaches, ranging from pure Spark SQL to custom pandas Python UDFs, but nothing could compete with a custom Java implementation leveraging ByteBuffer and the incredible fastutil library. Here's our custom UDF, which returns all distinct 3grams as integer values.
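The production UDF was written in Java with ByteBuffer and fastutil; a Python sketch of the same logic (the name `shingle` is illustrative) might look like this:

```python
def shingle(data: bytes) -> list:
    """Return the distinct 3grams of a byte sequence, each packed into a
    single integer as (byte0 << 16) | (byte1 << 8) | byte2."""
    seen = set()
    for i in range(len(data) - 2):
        seen.add((data[i] << 16) | (data[i + 1] << 8) | data[i + 2])
    return sorted(seen)
```

The Java version avoids Python object overhead by working directly on primitive int sets, which is where fastutil shines.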

Using this function, it’s quite easy to build a 3gram index. Here is what the transformation looks like. Note that we are using the %%sparksql magic in most of our examples. You can find out more about this extension in a previous article on jupyterlab-sql-editor.
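A sketch of that transformation, assuming the UDF is registered under the name `shingle` and using illustrative table names:

```sql
%%sparksql
INSERT INTO malware_3gram_index
SELECT
    shiftright(t, 16) & 255 AS a,  -- first byte of the 3gram
    shiftright(t, 8) & 255 AS b,   -- second byte
    t & 255 AS c,                  -- third byte
    file_id
FROM (
    SELECT explode(shingle(content)) AS t, file_id
    FROM malware
)
```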

The shingling function is applied to a byte sequence and returns the unique 3grams as a list of integer values. We then explode the list and bit-shift the integer values to obtain three columns representing the bytes of a 3gram.

We chose to store each byte of the 3gram into its own column (a, b, c) rather than in a single column since it is easier to conceptualize and resulted in better overall compression for the table.

Finally, we sort the table by the bytes forming the 3gram: (a, b, c). For every Parquet file, Iceberg stores the min/max values of columns a, b and c. At query time, Iceberg uses these statistics to quickly identify which Parquet files should be scanned for a given 3gram, and consequently which malware file_ids contain it.

Now that we can index a byte sequence, all we have to do is get the bytes of a malware file. This is easy in Spark. Reading the content of a file is nothing more than using the spark.read() function with the binaryFile format option. The read() function returns a dataframe with the file path, modification time, file size and the file contents.
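In %%sparksql form, the raw bytes can also be read with Spark's direct file-format query syntax (the path here is illustrative):

```sql
%%sparksql
SELECT path, modificationTime, length, content
FROM binaryFile.`s3a://datalake/malware/`
```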

Once we have a table of malware, we can apply the shingling function to the content column. Here, we show the same example as before; however, we join with a file_id_map table to obtain the file_id for a given file path. We don't store the file path in the index since it would be rather large compared to a simple file_id.

-- We use Iceberg's WRITE ORDERED BY to make sure the table is sorted
ALTER TABLE {table_name} WRITE ORDERED BY ts, source, a, b, c, file_id

Querying the 3gram Index in Spark SQL

Mquery takes care of parsing the YARA rule and converting the byte or text patterns into a list of "3grams to search for". Mquery includes an UrsaDB agent which uses this list to form a query for UrsaDB.

Let's suppose the desired byte pattern to look for is 0 -1 -86 -69 -14 88 (signed byte values). Mquery will generate the following 4 3grams.
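The four 3grams can be derived with a small Python illustration; the negative values suggest Java-style signed bytes, so we first convert them to unsigned (an assumption based on the notation above):

```python
signed = [0, -1, -86, -69, -14, 88]
data = bytes(b & 0xFF for b in signed)  # 00 FF AA BB F2 58
tri = [data[i:i + 3].hex().upper() for i in range(len(data) - 2)]
print(tri)  # ['00FFAA', 'FFAABB', 'AABBF2', 'BBF258']
```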

UrsaDB finds candidate files which contain all 4 3grams, though not necessarily sequentially. The YARA CLI is then executed on files 3 and 4, and finds that only file 4 is a match.

For our purposes, we wrote a custom agent to submit queries to Spark instead of UrsaDB. Our pyspark agent takes the list of "3grams to search for" and generates a corresponding Spark SQL query.

Each 3gram is converted into 3 bytes, and each byte is tested in a where clause against columns a, b and c. The result of the query is a list of files which contain the particular 3gram.

There is a subtlety, however. When given a list of 4 3grams, we must only return the files which contain all 4 requested 3grams. To achieve this, we use a group by + having clause. The complete query looks like this.
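A sketch of such a query generator (the table name `malware_3gram_index` and the exact SQL shape are illustrative, not necessarily what our agent emits):

```python
def build_query(trigrams: list, table: str = "malware_3gram_index") -> str:
    """Generate a Spark SQL query returning the file_ids that contain
    ALL requested trigrams (each trigram packed as a 24-bit integer)."""
    # One (a, b, c) equality predicate per trigram, OR'ed together.
    preds = " OR ".join(
        f"(a = {t >> 16} AND b = {(t >> 8) & 255} AND c = {t & 255})"
        for t in trigrams
    )
    # The having clause keeps only files matching every distinct trigram.
    return (
        f"SELECT file_id FROM {table} "
        f"WHERE {preds} "
        f"GROUP BY file_id "
        f"HAVING count(DISTINCT a, b, c) = {len(trigrams)}"
    )
```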

Since the table is sorted by columns a, b and c, Iceberg can use its metadata to efficiently prune most of the parquet files. Then, within each file, Spark will use the parquet file footer (page statistics) to further isolate potential matches. Thus, the query executes rather quickly.

The result of the query is a simple list of file_ids which are easily shuffled across the cluster to evaluate the final group by + having clause. The result is a list of files which contain all 4 3grams, but not necessarily sequentially; those 3grams could be anywhere in the malware file. Remember, the index acts as a bloom filter, giving us a small list of "candidate files". We must now run YARA on these candidates. Spark can help us here too.

Running YARA on a Distributed Spark Cluster

Spark is an analytical platform that can distribute arbitrary python code which runs in parallel. Furthermore, invoking the YARA C library from python is easy when using the yara-python binding.

Thus, running distributed YARA is rather easy in Spark. Here’s a simple python UDF showing the invocation of yara-python on the content of a malware file.
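The original snippet is not reproduced here; a minimal sketch of such a UDF, assuming yara-python is installed on the Spark executors, could look like this:

```python
def yara_matches(content, rule_source):
    """UDF body: compile the rule and match it against the raw bytes
    of one malware sample. yara is imported inside the function so only
    the Spark executors need yara-python installed."""
    import yara  # pip install yara-python
    rule = yara.compile(source=rule_source)
    return len(rule.match(data=content)) > 0

# Registering it as a Spark UDF (sketch):
# from pyspark.sql import functions as F, types as T
# yara_udf = F.udf(lambda c: yara_matches(c, RULE_SOURCE), T.BooleanType())
# matches = candidates.withColumn("is_match", yara_udf("content"))
```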

Putting it all Together

Here’s a demo showing the evaluation of the silent_banker YARA rule. It shows the following:

  • Firstly, our custom pyspark Mquery agent converts the 3gram list into an SQL statement.
  • Secondly, a custom Spark UDF finds candidate files on which the full YARA rule is evaluated.
  • Finally, the matching files are returned to the Mquery web UI.

Conclusion

In this article, we demonstrated how we leveraged Spark to build a scalable 3gram index. The index is stored in the Iceberg table format and is backed by an inexpensive datalake blob storage.

Then, we demonstrated how we wrote a custom Mquery agent which generates Spark SQL queries against the index.

Finally, we demonstrated how we leveraged yara-python to distribute the full evaluation of YARA rules.

Our initial results are promising. We were able to index and query 13 terabytes of malware with a compression ratio and execution time comparable to UrsaDB, but with the benefit of cheaper storage and easier index management.

We plan to productionize this proof of concept. In a future article, we will dig further into this project and give more details about the implementation, query times, compression ratio and costs. Stay tuned!

Data democratization advocate and principal engineer at Canadian Centre for Cyber Security | jean-claude.cote@cyber.gc.ca | Twitter @cybercentre_ca