Malware Analysis with Visual Pattern Recognition

The secret to quickly reverse-engineering binary files

Nolan Kent

Published in

Towards Data Science

18 min read · May 27, 2020

Part 1: Introduction and Basic Patterns

I originally wrote this article for the benefit of fellow malware analysts when I was on Symantec’s Security Response team, analyzing and classifying 20+ files per day. “Extended ASCII analysis” is a technique for quickly gaining a high-level understanding of a file through pattern recognition. The technique is very beneficial for analysts who can’t spend a lot of time on each file. For many types of malware, it can allow accurate classification in seconds. The technique involves training your visual system to recognize patterns unique to raw binary files. This article is meant to serve as an introduction and reference, but it may require examining hundreds or thousands of files to become comfortable with the technique. It is also meant to serve as an intuitive justification for the use of convolutional neural networks in malware classification, which often perform well on the same tasks as the human visual system. I think this type of domain knowledge is essential for building models that are robust to changes in input distribution (more about that in https://towardsdatascience.com/covariate-shift-in-malware-classification-77a523dbd701). I wrote the original version of this article in 2014, so some of the specific malware screenshots may be out of date, but the basic concepts remain applicable.

Some of the article assumes familiarity with Windows binary reverse-engineering, particularly experience with Portable Executable (PE) files: the file format of Windows .exe and .dll files. The technique can provide insight into other file formats as well, but given the quantity of Windows executable malware (and benign files), they are a good starting point. However, even if you aren’t familiar with reverse-engineering, I hope this article can demonstrate the value of training/applying your visual system on other types of raw data. Raw data may only be difficult to interpret because you haven’t looked at enough of it to pick up on the patterns: once you start to recognize patterns, you may notice things that get left out of more processed visualizations.

The Problem

There are a lot of binary analysis tools that display specific information about parts of a file. For example, Resource Hacker shows resources in a PE file, string dumpers show strings, PEID shows PE/packer information, etc. Running a large number of tools and combining their outputs to form a conclusion can take quite a bit of time. Because most of these tools provide a processed representation that focuses on particular aspects of a binary file, some bytes may go completely unnoticed. Most tools are also dependent on file type, so relying on them means it’s challenging to analyze unfamiliar formats.

Different tools give different views into a file, but don’t generally offer an overview that is both comprehensive and quick to digest

While IDA — the most popular interactive disassembler/decompiler in the industry — allows for a fairly complete analysis of most executable files, it can take time to identify encrypted/compressed areas of the file. Because IDA focuses on analyzing code, identifying encrypted data often involves identifying the decryption code. Again, this can be time-consuming, and identifying encrypted data is essential for malicious files. Most common malware are packed using low entropy custom packers designed to avoid antivirus software. Tools like PEID that identify packed executables don’t work consistently because many packer checks are based on either known packers or file entropy. PEID also can’t tie a specific packer to a specific threat: if it could, then it would make a great antivirus engine on its own. A faster way to get information about all the bytes in a file would be ideal for forming first impressions.

Solution

Scrolling through Hiew’s TEXT view (or opening the file in a hex editor, or even just notepad.exe) shows all the data within a binary file, with each byte represented by an extended ASCII symbol. Different patterns of data result in different patterns of symbols. Because the human brain is very good at identifying visual patterns, it can categorize sections of data very quickly. Common types of data include:

native code (x86, x64, ARM, etc.)
Microsoft Intermediate Language (MSIL)
poorly encrypted data
high entropy data (compressed/well encrypted/random data)
images
relocation sections
strings (including non-roman alphabets).

Identifying poorly encrypted data can be very helpful because that usually indicates a custom packer used for avoiding antiviruses rather than a commercial packer that might be used for compression and protection.

For example, many Trojan.Asprox.B samples (see image below) are encrypted in a way that results in a very consistent pattern of bytes. This means it’s possible to identify them in less than a second.

Encrypted section of Trojan.Asprox.B: even though the data is encrypted, many repeating patterns are visible (indicating encryption is weak)

While this technically gives information about (nearly) every byte in the file, it is not best for all types of analysis.

Limitations

This method focuses on a broad understanding rather than a deep understanding:

It can be used to recognize code, but understanding the code still requires a disassembler/decompiler
It can be used to identify images embedded in a PE file, but seeing the actual image still requires an image viewer program
It can be used to recognize encrypted/compressed data, but you must decrypt/decompress the data before it can be analyzed further

Viewing the data as symbols instead of hexadecimal numbers makes human pattern recognition much faster and easier.

ASCII Background

The key to this method is mapping each byte in a file to one of 256 different symbols, and then using the brain’s pattern recognition abilities to interpret the result. Formal ASCII only uses 7 bits: 0x00 to 0x7F. This covers only half of all possible byte values, most of them “printable” symbols such as alphanumeric characters. To use this technique, we need a way to print a symbol for every byte; however, different computer systems handled the 8th bit differently. Windows refers to different character mappings as ‘code pages’. The two common groups of code pages are ANSI and OEM.

ANSI refers to Windows code pages; the default is called “Windows-1252”. This code page doesn’t map every single byte: 0x81, 0x8D, 0x8F, 0x90, and 0x9D don’t map to any symbol and are usually displayed as a ‘.’ or ‘?’. They are, therefore, impossible to distinguish from one another, which hides information.
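The gaps in a code page can be checked directly. The following is a minimal sketch that assumes Python's built-in cp1252 and cp437 codecs are an acceptable stand-in for how Windows displays these code pages; the helper name undefined_bytes is hypothetical.

```python
def undefined_bytes(encoding: str) -> list[int]:
    """Return the byte values that the given single-byte codec cannot decode."""
    missing = []
    for value in range(256):
        try:
            bytes([value]).decode(encoding)
        except UnicodeDecodeError:
            # The codec has no character assigned to this byte value.
            missing.append(value)
    return missing


if __name__ == "__main__":
    print("cp1252:", [hex(v) for v in undefined_bytes("cp1252")])
    print("cp437: ", [hex(v) for v in undefined_bytes("cp437")])
```

Any byte reported here would be drawn as the same placeholder symbol in a viewer using that code page, so those byte values can't be told apart on screen.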

OEM refers to console/DOS applications. The code page from the IBM PC is called code page 437. It maps all bytes to symbols; however, it maps 0x00, 0xFF, and 0x20 to the same symbol: a blank space. This rarely causes problems in practice.
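For readers without Hiew, a rough approximation of its text view can be sketched in Python. This is an assumption-laden stand-in, not how Hiew itself works: Python's cp437 codec decodes bytes 0x00 to 0x1F and 0x7F to control characters, so the glyphs the IBM PC displayed for them are mapped by hand, and the names LOW_GLYPHS, build_table, and render are hypothetical.

```python
# Glyphs the IBM PC displayed for bytes 0x01-0x1F. Python's cp437 codec
# returns control characters for these bytes, so they are listed by hand.
LOW_GLYPHS = "☺☻♥♦♣♠•◘○◙♂♀♪♫☼►◄↕‼¶§▬↨↑↓→←∟↔▲▼"


def build_table() -> list[str]:
    """Build a 256-entry table mapping each byte value to one display symbol."""
    table = [bytes([b]).decode("cp437") for b in range(256)]
    table[0x00] = " "  # displayed as a blank
    for value, glyph in enumerate(LOW_GLYPHS, start=1):
        table[value] = glyph
    table[0x7F] = "⌂"
    table[0xFF] = " "  # also displayed as a blank
    return table


TABLE = build_table()


def render(data: bytes, width: int = 64) -> str:
    """Render bytes as rows of `width` symbols, one symbol per byte."""
    rows = []
    for start in range(0, len(data), width):
        rows.append("".join(TABLE[b] for b in data[start:start + width]))
    return "\n".join(rows)
```

Calling print(render(open(path, "rb").read())) in a terminal with a monospaced Unicode font gives a scrollable view; a fixed width matters because many patterns only become visible when repeated structures line up vertically.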

You can use this technique with a standard text editor, such as notepad. However, this creates issues with some control characters and lacks features. I personally use the hex editor “Hiew”.

Patterns and High Entropy Data Background

As previously stated, quickly identifying patterns is the primary purpose of this technique. The ability to recognize benign patterns to focus on potentially malicious patterns is crucial for malware analysis. One simplified approach to thinking about patterns more quantitatively is counting how often certain characters and sequences of characters show up in a chunk of data. Usually, some characters/sequences are prevalent, while others are rare. The exception is high entropy data, where most characters appear a roughly equal number of times. Note that in the security industry, “entropy” is usually calculated as the entropy of a categorical distribution where each byte value is its own category, and where the probability of a byte value is given by its frequency in the chunk of data we’re considering. As this is a univariate distribution, it does not capture structure related to relative positions of bytes. However, here I’ll use the term ‘high entropy’ to refer to data that appears uniformly random, even when it comes to sequences of bytes. The image below is an example of high entropy data.

High entropy data usually indicates one of four things:

compressed data
well-encrypted data
random data
constants used in cryptographic calculations.
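The byte-frequency entropy described at the start of this section can be sketched in a few lines of Python. This is a minimal illustration of the standard Shannon entropy formula over a byte histogram; the function name byte_entropy is hypothetical.

```python
import math
from collections import Counter


def byte_entropy(data: bytes) -> float:
    """Shannon entropy of the byte histogram, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)  # how many times each byte value occurs
    return -sum((n / total) * math.log2(n / total) for n in counts.values())
```

Note that bytes(range(256)) * 4 scores the maximum of 8 bits per byte even though it is an obvious repeating ramp. That is the limitation of the univariate measure: it ignores the relative positions of bytes, which is exactly what visual inspection picks up.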

In a perfectly high entropy file, each byte value will, on average, appear once in every 256 bytes, or around 0.39% of the time. The following image shows the frequency of the most common characters and sequences of 2 characters in a high entropy file (in this case, a chunk of a .zip file). While none of the characters in the top 20 most frequent appear exactly 0.39% of the time, they are all reasonably close. The difference could be accounted for by sampling error.

How often the most common unigrams and bigrams appear in a sample of high entropy data
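A frequency table like the one described above can be approximated with a short Python sketch. The function name top_ngrams and its parameters are hypothetical, and this is only one plausible way to count unigrams and bigrams, not necessarily how the original figure was produced.

```python
from collections import Counter


def top_ngrams(data: bytes, n: int, limit: int = 20):
    """Return the most common n-byte sequences with their relative frequency."""
    total = len(data) - n + 1  # number of overlapping n-byte windows
    if total <= 0:
        return []
    counts = Counter(data[i:i + n] for i in range(total))
    return [(gram, count / total) for gram, count in counts.most_common(limit)]
```

For uniformly random input, unigram frequencies should cluster around 1/256 (about 0.0039) and bigram frequencies around 1/65536; code, strings, and weakly encrypted data tend to show a few sequences far above those baselines.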

In the next section, I discuss how to recognize patterns produced by standard compiled code. This is obviously a common pattern in executable files.

x86 (32-bit) Code

I generated the following image from around 7 KB of x86 code. It shows some of the most common characters/sequences in code, and (very) roughly their corresponding opcodes. This example doesn’t represent all possible x86 code, as different compilers and source code generate different looking compiled code. For example, some compilers/compiler options align functions on 16-byte boundaries, some on 4, and some don’t align at all. Some compilers use nops to pad, and some use int3. Obviously different source files will result in different compiled files. However, after training your visual system, it’s usually easy to recognize all types of x86 code.

How the most common sequences in x86 code map to operations

A few things to note: 0x00 and 0xFF are by far the most common, and are usually present as offsets (for example, an offset used in a jmp instruction). These bytes map to blank spaces and produce visible gaps in x86 code. In this code, the compiler used the int3 (CC) instruction as padding between functions. Sequences involved in function beginnings/ends are prevalent (padding, push ebp, mov ebp, esp for the beginning; ret, padding for the end). After some practice locating sequences like these, it becomes much easier to recognize x86 data instantly.
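The function-boundary sequences mentioned above can also be counted mechanically. The sketch below is a rough heuristic under stated assumptions: the marker table X86_MARKERS and the function count_markers are hypothetical, the byte encodings are the two common encodings of the standard prologue plus ret followed by padding, and any of these short sequences can also occur by chance in non-code data.

```python
# Common x86 function-boundary byte sequences (an illustrative, incomplete set).
X86_MARKERS = {
    "push ebp; mov ebp, esp (55 8B EC)": bytes.fromhex("558bec"),
    "push ebp; mov ebp, esp (55 89 E5)": bytes.fromhex("5589e5"),
    "ret; int3 padding (C3 CC)": bytes.fromhex("c3cc"),
    "ret; nop padding (C3 90)": bytes.fromhex("c390"),
}


def count_markers(data: bytes) -> dict[str, int]:
    """Count non-overlapping occurrences of each marker sequence in data."""
    return {name: data.count(pattern) for name, pattern in X86_MARKERS.items()}
```

A region with many prologue hits per kilobyte is a reasonable candidate for compiled x86 code; a region with none is more likely data, a different architecture, or packed content.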

Conclusion

Unfortunately, becoming comfortable with this technique may require examining hundreds or thousands of malicious and benign samples. If you regularly reverse engineer files, I would highly recommend trying this out, even if initially it takes a while to gain useful information with it. Malicious packers that use weak encryption have been one of the main tools used by malware authors to hide their attacks for a long time, but modern techniques such as emulation can help uncover these attacks by unpacking the payload (which is much easier to identify than the packed file). Understanding file structure can also help inspire and guide malware detection models such as neural networks (https://medium.com/ai-ml-at-symantec/deep-learning-for-malware-classification-dc9d7712528f).

Part 2: Pattern Atlas

In this part, I go over a large number of patterns that I’ve encountered frequently. This can help serve as a reference if you want to learn the technique.

x64 Code

64-bit code has many similarities with 32-bit code: offsets and padding are still present, and many operations map to the same bytes. The most obvious differentiating factor is how frequent the ‘H’ character is in 64-bit code: ‘H’ is byte 0x48, the REX.W prefix that marks an instruction as using 64-bit operands.

Most common 1–4grams in a sample of x64 code

Direct contrast of a chunk of x86 vs x64 code. Note that in this case the x64 code does not have padding between functions while the x86 code does, but that’s not always the case. If you count the number of ‘H’ characters in each section, it’s clear the bottom has more
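Counting the ‘H’ characters by hand can be replaced with a one-line measurement. This is a minimal sketch; the helper byte_share is hypothetical, and the share of a single byte value is only a hint, since ‘H’ (0x48) is also a common letter in uppercase strings.

```python
def byte_share(data: bytes, value: int) -> float:
    """Fraction of bytes in data equal to value (0.0 for empty input)."""
    return data.count(value) / len(data) if data else 0.0


if __name__ == "__main__":
    # mov rbp, rsp; sub rsp, 0x20 (two instructions, both start with 0x48)
    x64_chunk = bytes.fromhex("4889e54883ec20")
    # push ebp; mov ebp, esp; sub esp, 0x20 (32-bit equivalents, no 0x48)
    x86_chunk = bytes.fromhex("558bec83ec20")
    print(byte_share(x64_chunk, 0x48), byte_share(x86_chunk, 0x48))
```

The same helper with value 0x00 gives the share of zero bytes in a chunk, which is another quick signal when comparing different kinds of code.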

MSIL Code

MSIL bytecode is very different from x86. Almost 25% of the bytes are 0x00, so there will be many significant gaps.

Portable Executable (PE) Files

In Windows, the previously discussed code almost always shows up in the context of PE files (often seen as files with the .exe and .dll extensions), which have several recognizable patterns.

Because PE files have more than just code, it’s good to get familiar with what patterns appear in different sections. A normal file might have:

Code in a .text section
Imports/exports/strings in a .rdata section
Strings in a .data section
Images/version info in a .rsrc section
Relocation data in a .reloc section
A signature appended on the end
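To relate what you see while scrolling to the sections listed above, it helps to know where each section starts in the file. The following is a minimal sketch that reads the section table from the PE headers using only the standard library. The function name list_sections is hypothetical, it performs no bounds checking, and malformed or deliberately corrupted headers (common in malware) will break it; a dedicated PE parser is the better choice for real work.

```python
import struct


def list_sections(data: bytes):
    """Return (name, raw_offset, raw_size) for each section of a PE file."""
    if data[:2] != b"MZ":
        raise ValueError("missing MZ header")
    # e_lfanew at 0x3C holds the file offset of the PE signature.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\x00\x00":
        raise ValueError("missing PE signature")
    # COFF header follows the 4-byte signature.
    (num_sections,) = struct.unpack_from("<H", data, pe_offset + 6)
    (optional_size,) = struct.unpack_from("<H", data, pe_offset + 20)
    table = pe_offset + 24 + optional_size  # section table follows the optional header
    sections = []
    for index in range(num_sections):
        entry = table + 40 * index  # each section header is 40 bytes
        name, _vsize, _vaddr, raw_size, raw_offset = struct.unpack_from(
            "<8sIIII", data, entry
        )
        sections.append(
            (name.rstrip(b"\x00").decode("ascii", "replace"), raw_offset, raw_size)
        )
    return sections
```

Data that lies past the end of the last section (raw_offset + raw_size) is appended data, which is where a signature or, in malicious files, an extra payload may sit.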

It’s important to be able to distinguish between common benign data (such as relocation data) and potentially malicious data (poorly encrypted, unknown, etc) to analyze files quickly.

The following is an example of a large amount of relocation data:

Relocation data. In the disassembly view, Hiew highlights the bytes that correspond to global addresses which need modification when the file is loaded into memory

The following is an example of an icon file, commonly found in the resource section and displayed in Explorer:

Icons that might appear in a PE file. Obviously a PE file with a PDF icon is extremely suspicious because it can trick a user into thinking it’s a document instead of an executable

The following is an example of a signature, appended to the end of the file:

These types of signatures can be viewed and validated in explorer:

When examining a clean or malicious PE file, it is normal to come across areas of high entropy data. The next section discusses some approaches to quickly getting insight into the purpose of this data, which often appears for legitimate reasons.

High Entropy Data

Many image formats contain high entropy data due to compression:

PNG:

GIF

JPEG

Compressed archives, unsurprisingly, use compression. Here is an example of a .zip file:

In information theory, entropy can be used as a measure of randomness. Beyond simply identifying it, it’s impractical for a human to extract information from high entropy data because it looks uniformly random (no patterns such as repeating sequences, and all bytes appear a similar number of times). However, there are sometimes easy ways to learn more about it. Most high entropy data is compressed and comes with header bytes at the top that indicate its purpose. Here are some standard formats that use compressed data:

Images

jpeg
gif
png (uses zlib)
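The header bytes of these formats are fixed, so a chunk of high entropy data can often be labeled by checking what precedes it. The sketch below uses the well-known signatures for PNG, GIF, JPEG, and ZIP; the names MAGIC, identify, and find_embedded are hypothetical, and short signatures such as the 3-byte JPEG marker will occasionally match by chance inside random data.

```python
# Well-known file signatures ("magic bytes") for common compressed formats.
MAGIC = [
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"GIF87a", "gif"),
    (b"GIF89a", "gif"),
    (b"\xff\xd8\xff", "jpeg"),
    (b"PK\x03\x04", "zip"),
]


def identify(data: bytes):
    """Return the format name if data starts with a known signature, else None."""
    for magic, name in MAGIC:
        if data.startswith(magic):
            return name
    return None


def find_embedded(data: bytes):
    """Return sorted (offset, name) pairs for every signature found in data."""
    hits = []
    for magic, name in MAGIC:
        position = data.find(magic)
        while position != -1:
            hits.append((position, name))
            position = data.find(magic, position + 1)
    return sorted(hits)
```

Running find_embedded over a whole PE file gives a quick list of offsets to jump to in the text view, for example images stored in the resource section.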

Weak Encryption

Weak encryption is any encryption that is easy to break with modern cryptanalysis, which in practice translates to pretty much any encryption scheme that is not well known and tested. Outside of sophisticated attacks, where the key might be stored on a remote server or dynamically generated in a machine-specific way, the encryption key to decrypt a packer’s payload must be in the binary, so the protection against cryptanalysis that secure encryption offers doesn’t help. Therefore the goal of most malware packers is not secure encryption or even protection from analysis but is instead to hide from antivirus software. Poorly encrypted data is, therefore, an indicator that the data might be used for malicious purposes. One example of a weak encryption scheme is a short xor key: in this case, it’s sometimes possible to recognize the algorithm and the key by briefly looking at the encrypted data. Note that there are legitimate reasons to use encryption to protect a file from analysis: for example, intellectual property protection. However, there aren’t really any legitimate reasons to hide from antivirus products (as long as those products properly reflect a user’s interests).

Here is an example of 1-byte xor encryption at the beginning of a PE file. Because 0x00 XORed with any byte is that same byte, and we know PE headers have 0x00 padding between the header and the code, it’s pretty clear what the key is: 0xFE (the box symbol).
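The reasoning above translates directly into code: if 0x00 is the most common plaintext byte, the most common ciphertext byte is the key. This is a minimal sketch under that assumption; the function names are hypothetical, and the guess fails for regions where zero padding is not dominant.

```python
from collections import Counter


def guess_single_byte_key(data: bytes) -> int:
    """Guess a 1-byte xor key, assuming 0x00 is the most common plaintext byte.

    Expects non-empty input.
    """
    most_common_byte, _count = Counter(data).most_common(1)[0]
    return most_common_byte  # 0x00 ^ key == key


def xor_single(data: bytes, key: int) -> bytes:
    """Xor every byte with the same key; applying it twice restores the input."""
    return bytes(b ^ key for b in data)


if __name__ == "__main__":
    header = b"MZ\x90\x00\x03" + b"\x00" * 40 + b"PE\x00\x00"
    encrypted = xor_single(header, 0xFE)
    key = guess_single_byte_key(encrypted)
    print(hex(key), xor_single(encrypted, key)[:2])
```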

After this PE file has been encrypted, any sequence of bytes that could be used to detect the original PE file will no longer work. Modern antivirus features like emulators and dynamic analysis were invented to maintain protection against this type of evasion. 1-byte xor keys are rare in practice because packers change encryption keys every time they pack a new executable to avoid detection on the encrypted data. 1-byte xor keys only allow 255 variations (XORing with the 0 byte does not encrypt).

Here is the same file XORed with a 4-byte xor key. It’s still easy to recognize the key:

4-byte xor encryption of PE file. Key boxed in red

Here is the same file XORed with a 16-byte xor key

Finally, here is the same file XORed with a 64-byte xor key. While it’s still possible to see patterns with some effort, it’s not trivial to see the key anymore (though it would probably still be pretty easy using standard cryptanalysis). If the file were encrypted with a random xor key as long as the file itself, then it would be perfectly encrypted and perfectly high entropy (this ‘perfect’ type of encryption is called a one-time pad).

Some repeating sequences are still visible in the 64-byte version
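The same zero-padding argument extends to repeating multi-byte keys: split the data into columns by position within the key, and take the most common byte of each column. The sketch below assumes the key length is already known (for instance, estimated from the visible repetition period) and that 0x00 dominates the plaintext in every column; both are assumptions, and the function names are hypothetical.

```python
from collections import Counter


def guess_repeating_key(data: bytes, key_len: int) -> bytes:
    """Guess a repeating xor key of known length from zero-heavy data.

    Expects len(data) >= key_len.
    """
    key = bytearray()
    for offset in range(key_len):
        column = data[offset::key_len]  # bytes encrypted with the same key byte
        key.append(Counter(column).most_common(1)[0][0])
    return bytes(key)


def xor_repeating(data: bytes, key: bytes) -> bytes:
    """Xor data with a key that repeats every len(key) bytes."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))
```

The longer the key, the fewer bytes fall into each column, so the guess becomes less reliable; this matches the visual impression that the 64-byte version is much harder to read than the 4-byte one.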

Here’s a practical example where a 1-byte xor key is used. The following image shows a collection of strings used in a .jar file that has been obfuscated. It isn’t super easy to see the xor key, but simple cryptanalysis would break this easily (think of what characters appear the most in strings).

Here is the unencrypted version
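For string data, zero padding is no longer the dominant byte, but the hint above still applies: readable strings consist mostly of letters and spaces. A simple way to act on that is to try every 1-byte key and keep the one whose output looks most like text. This is a minimal sketch with a deliberately crude scoring function; the names score and brute_force_xor are hypothetical.

```python
def score(text: bytes) -> int:
    """Count ASCII letters and spaces, the bytes that dominate readable strings."""
    return sum(
        1 for b in text
        if b == 0x20 or 0x41 <= b <= 0x5A or 0x61 <= b <= 0x7A
    )


def brute_force_xor(data: bytes) -> int:
    """Return the 1-byte xor key (1-255) whose decryption scores highest."""
    return max(range(1, 256), key=lambda k: score(bytes(b ^ k for b in data)))


if __name__ == "__main__":
    secret = bytes(b ^ 0x5A for b in b"connect to server and download payload")
    key = brute_force_xor(secret)
    print(hex(key), bytes(b ^ key for b in secret))
```

On very short inputs, or strings dominated by digits and punctuation, several keys can score similarly, so it is worth eyeballing the top few candidates instead of trusting a single result.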

The next section shows some examples of data that might look like weak encryption at first glance but is a different type of data. When practicing this technique, it is useful to look up in IDA how the code uses data you don’t recognize.

Not Weak Encryption

It takes some practice to distinguish suspicious data from benign data such as .ico files, relocation data, and strings in languages with non-roman alphabets (such as the GB 2312 character set for simplified Chinese characters).

Here is an icon. Even with weak encryption, it’s rare for the encrypted data to be lower entropy than the unencrypted data. Therefore I wouldn’t find this data too suspicious because most potentially malicious data (code, strings) has higher entropy than this. In other words, the fact that there is so much predictable repetition means that this chunk of data likely cannot encode as much information as normal x86 code in the same number of bytes. There are exceptions depending on how crazy the encryption algorithm is.

Here are some Chinese strings. At first glance, they look like they might be poorly encrypted ASCII strings.

Here are strings in various languages/encodings. Some are easier to differentiate than others.

Various ways to encode different languages

The best way to become familiar with what is and isn’t poorly encrypted data is to look at many examples of clean and malicious files and examine how suspicious-looking data is used in the code. The next section shows examples of actual malware that use weak encryption to hide from antivirus software.

Malware Examples

Common x86 malware tends to have a decryption code stub combined with one or more packed layers. Familiarity with x86 patterns means recognizing the code stub is usually straightforward, and sometimes it’s even possible to distinguish between different layers of encryption. If there are large regions of the PE file with strange-looking data, it’s usually a good idea to check memory dumps for known malicious strings. This process can allow you to quickly determine what threat a file is without even opening IDA (sometimes in less time than it would take IDA to load the file). In all of the following examples, if you were to take any file packed with the same packer, it would look about the same, though the specific bytes would be different.
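Checking a memory dump for known strings can be sketched as follows. This is a minimal illustration, not a full strings tool: it only handles single-byte ASCII (UTF-16 strings, common in Windows processes, are not covered), the function names are hypothetical, and the indicator in the usage example is a made-up placeholder, not a real signature for any threat.

```python
import re


def extract_strings(data: bytes, min_len: int = 5) -> list[bytes]:
    """Return runs of at least min_len printable ASCII bytes."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return pattern.findall(data)


def find_known(data: bytes, indicators: list[bytes]) -> list[bytes]:
    """Return the indicators that occur anywhere in data."""
    return [indicator for indicator in indicators if indicator in data]


if __name__ == "__main__":
    dump = b"\x00\x01POST /gate.php HTTP/1.1\x00\xff\xfe"
    print(extract_strings(dump))
    print(find_known(dump, [b"gate.php"]))  # hypothetical indicator
```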

Below is an image of Trojan.Asprox.B which includes the unpacking stub on the top and the packed data on the bottom

Trojan.Asprox.B. x86 unpacker stub followed by poorly encrypted data

The following is an image of Trojan.Tracur. Three different sections are visible: outlined in red is the x86 unpacking code, outlined in green is the first layer, and outlined in orange is the top of the final layer (which is an encrypted MZ file).

Here are some encrypted sections in Trojan.Zbot samples:

Trojan.Zbot poorly encrypted data with PE header

There are a vast number of patterns that can be produced by weak encryption, so it takes a lot of practice to be able to recognize it generically. The next section has some miscellaneous tips for using this technique.

Additional Notes

Familiarity with code makes finding shellcode easier, see Hangul Word Processor exploit below (bytes for “push ebp mov ebp, esp” easy to see at start)

Hangul word processor exploit. One green box has the bytes that correspond to the beginning of a function, and the other contains data for loading/referencing kernel32.dll. These are highly suspicious in the context of a document file.

This technique can be applied to any binary file and is especially useful for dumps
It can also be used to browse memory in a debugger like OllyDbg when using the full ASCII view. Unfortunately, many debuggers will map characters above the normal ASCII range to “.”
Sometimes different IDEs/compilers create different file structures, and this technique can help recognize/analyze those files. The example below is Delphi. One way to get a lot of information quickly on Delphi files is to go to the entry point (f8 f5 in non-text mode), then scroll up to see the custom strings. In Delphi the strings are mixed in with the code, and the library strings are closer to the top.

Hiew Background

Hiew is a full-featured hex editor designed with PE analysis in mind and is well suited for this technique. It has three view modes: text, hex, and disassembly. This technique primarily uses the text view. While it’s good to be familiar with all of Hiew’s features/hotkeys, the most important here are ‘Enter’ to switch view, ‘Pageup/Pagedown’ to scroll, and F2 to toggle Wrap/Unwrap (make sure “Unwrap” is displayed in the bottom left next to “2”). F8 in text mode can be used to switch between character encoding; if set up correctly, the table should be left on “As Is” (if you see a lot of ‘?’, it may mean it opened in Unicode mode).

Thanks to Mircea Ciubotariu for introducing me to this method of analyzing files, and for creating the font used in this article (available here). Thanks to Geoffrey So and Andrew Gardner for providing suggestions and feedback on the article.