Analysis of Computer Vision Techniques in Malware Classification

Introduction to Computer Vision techniques used in Malware Classification

Raja Narasimhan
Towards Data Science


Malware is any software intentionally designed to cause damage to the system on which it runs. Major types include worms, Trojans, and adware. Around 350,000 new samples are now produced every day, and Anti-Virus companies struggle to keep up: only about 50% of new malware is reported, and of that 50%, only about 20% is detected by existing Anti-Virus software. Quick math shows that only 10% of new malware (0.5 × 0.2) is caught, so about 90% goes undetected.
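
The quick math above can be checked in a few lines. This is only a sketch of the arithmetic; the variable names are mine, and the two fractions are the rough figures quoted in the paragraph, not measured values.

```python
# Rough figures quoted in the text above (not measurements).
reported_fraction = 0.50        # share of new malware that gets reported
detected_given_reported = 0.20  # share of reported malware that Anti-Virus detects

# Malware is only caught if it is both reported and detected.
detected_overall = reported_fraction * detected_given_reported
undetected_overall = 1.0 - detected_overall

print(f"detected: {detected_overall:.0%}, undetected: {undetected_overall:.0%}")
```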

Some of the traditional methods used to classify malware are:

  • Sandbox Detection: Here any suspicious software is run in a virtual environment where its actions can be monitored, and based on those actions the Anti-Virus decides whether the software is malicious or not. This method can be bypassed, for example by malware so large that it can't be processed in the virtual environment, or by a malware file saved in an obscure file format that is not recognized. Sandbox detection falls under behavior-based malware detection.
  • Signature Based Detection: Here the Anti-Virus company creates a signature for each known malware and stores it in its database. The Anti-Virus then compares the signature of the software it scans with the signatures in that database. The problem is that, as discussed above, around 350,000 malware samples are created every day, and it is extremely difficult for companies to create signatures for every one of them.
Malware these days
source: Reddit
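
To make the signature-based approach described above a bit more concrete, here is a minimal sketch. It assumes the simplest possible kind of signature, a SHA-256 hash of the whole file; real Anti-Virus signatures are usually richer than that (byte patterns, heuristics). The database contents and the function names are hypothetical.

```python
import hashlib

# Hypothetical signature database: SHA-256 digests of files already known to be malicious.
KNOWN_BAD_SHA256 = {
    hashlib.sha256(b"example malicious payload").hexdigest(),
}


def file_signature(data: bytes) -> str:
    """Return the SHA-256 hex digest used here as the file's signature."""
    return hashlib.sha256(data).hexdigest()


def is_known_malware(data: bytes, signatures=KNOWN_BAD_SHA256) -> bool:
    """Flag a file only if its signature is already in the database."""
    return file_signature(data) in signatures


original = b"example malicious payload"
variant = original + b"!"  # a one-byte change produces a completely different hash

print(is_known_malware(original))
print(is_known_malware(variant))
```

The second lookup misses because the variant's hash is not in the database, which is exactly why keeping up with hundreds of thousands of new samples is so hard for this approach.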

So Anti-Virus companies are now using deep learning techniques to counter malware. Here we will dive into CNN-based classification.

The observation that malware of a particular family looks similar when rendered as greyscale images was first made in Malware Images: Visualization and Automatic Classification. In this paper, the authors show what a Trojan looks like as an image.

source: Malware Images: Visualization and Automatic Classification

The .text section contains the code that gets executed. Towards its end the image is completely black, which indicates zero padding at the end of the section. The .data section contains uninitialized data (which shows up as a black patch) along with initialized data, and .rsrc contains all the resources of the module, such as the icons the application may use.
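
Here is a minimal sketch of how a binary can be turned into a greyscale image like the ones described above: every byte becomes one pixel (0 is black, 255 is white), and the byte stream is cut into rows of a fixed width. The function name and the default width of 256 are my own assumptions for illustration, and the last row is zero-padded so all rows have the same length.

```python
def bytes_to_grayscale(data: bytes, width: int = 256):
    """Reshape raw bytes into rows of `width` pixels (each pixel is an int 0..255)."""
    if width <= 0:
        raise ValueError("width must be positive")
    # Pad with zero bytes so the length is a multiple of the row width.
    padded = data + bytes(-len(data) % width)
    return [list(padded[i:i + width]) for i in range(0, len(padded), width)]


# Toy example: ten bytes laid out four pixels per row.
image = bytes_to_grayscale(bytes(range(10)), width=4)
for row in image:
    print(row)
```

Zero bytes map to black pixels, which is why zero padding at the end of a section appears as a solid black band.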

source: Malware Classification Using Image Representation

The above picture is from the paper Malware Classification Using Image Representation, where the authors show images of malware from different families. Within a family, the similarity between the images is easy to see.

The paper Convolution Neural Networks for Malware Classification also shows images of common malware families such as Ramnit and Gatak (a Trojan).

source: Convolution Neural Networks for Malware Classification

Results Published

  • In the paper Malware Images: Visualization and Automatic Classification, the authors use GIST to compute texture features and k-nearest neighbors with Euclidean distance to classify them. GIST is based on a wavelet decomposition of the image using Gabor filters. A Gabor filter is a linear filter that analyzes the frequency content of an image in a particular direction within the region being analyzed. Gabor filters are used for edge detection, texture analysis, and feature extraction. The authors used 9,458 malware samples from 25 families and got 98% accuracy.
  • In the paper Convolution Neural Networks for Malware Classification, three models were trained: CNN 1C 1D, CNN 1C 2D, and CNN 3C 2D.
  1. CNN 1C 1D consists of an input layer of NxN pixels (N=32), a convolutional layer (64 filter maps of size 11x11), a max-pooling layer, a densely-connected layer (4096 neurons), and an output layer of 9 neurons. The results were Accuracy: 0.9857 and Cross-entropy: 0.0968.
  2. CNN 1C 2D consists of an input layer of NxN pixels (N=32), a convolutional layer (64 filter maps of size 3x3), a max-pooling layer, a convolutional layer (128 filter maps of size 3x3), a max-pooling layer, a densely-connected layer (512 neurons), and an output layer of 9 neurons. The results were Accuracy: 0.9976 and Cross-entropy: 0.0231.
  3. CNN 3C 2D consists of an input layer of NxN pixels (N=32), a convolutional layer (64 filter maps of size 3x3), a max-pooling layer, a convolutional layer (128 filter maps of size 3x3), a max-pooling layer, a convolutional layer (256 filter maps of size 3x3), a max-pooling layer, a densely-connected layer (1024 neurons), a densely-connected layer (512 neurons), and an output layer of 9 neurons. The results were Accuracy: 0.9938 and Cross-entropy: 0.0257.
  • In the paper Malware Classification Using Image Representation, the authors used 2 models: a CNN with 4 layers (2 convolution layers and 2 dense layers) and a ResNet18. The plain CNN gave 95.24% accuracy and the ResNet gave 98.206% accuracy.
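
The k-nearest-neighbors step from the first result above can be sketched in a few lines. This is not the paper's code: the 2-D feature vectors below are made-up toy values standing in for real GIST descriptors, and the family labels and function name are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_features, train_labels, query, k=3):
    """Return the majority label among the k training points closest to `query`."""
    # Rank training samples by Euclidean distance to the query vector.
    ranked = sorted(
        zip(train_features, train_labels),
        key=lambda pair: math.dist(pair[0], query),
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy stand-ins for texture features of five labelled malware images.
features = [(0.1, 0.2), (0.0, 0.1), (0.2, 0.1), (0.9, 0.8), (1.0, 0.9)]
labels = ["family_a", "family_a", "family_a", "family_b", "family_b"]

print(knn_predict(features, labels, query=(0.15, 0.15), k=3))
```

A new sample is simply assigned the family of the images whose texture features lie closest to its own, so no training phase is needed beyond storing the features.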
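
To get a feel for the size of the CNN 3C 2D layer stack listed above, the sketch below walks through the layers and counts feature-map sizes and trainable parameters. The layer widths come from the list; everything else is my assumption, because the list does not state it: a single-channel (greyscale) input, stride-1 convolutions with "same" padding, and 2x2 max-pooling that halves each spatial dimension.

```python
def trace_cnn(input_size, in_channels, conv_filters, kernel, dense_units, n_classes):
    """Return (flattened feature count, trainable parameter count) for a simple CNN.

    Assumes every convolution keeps the spatial size ('same' padding, stride 1)
    and is followed by 2x2 max-pooling, which halves the size.
    """
    size, channels, params = input_size, in_channels, 0
    for filters in conv_filters:
        params += kernel * kernel * channels * filters + filters  # weights + biases
        channels = filters
        size //= 2  # effect of the 2x2 max-pooling layer
    flattened = size * size * channels
    fan_in = flattened
    for units in list(dense_units) + [n_classes]:
        params += fan_in * units + units  # dense weights + biases
        fan_in = units
    return flattened, params


# CNN 3C 2D: 32x32 input, three 3x3 conv layers, two dense layers, 9 output classes.
flattened, total_params = trace_cnn(32, 1, [64, 128, 256], 3, [1024, 512], 9)
print(flattened, total_params)
```

Under these assumptions most of the parameters sit in the first dense layer, not in the convolutions.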

Conclusion

As you can see from the results published in these papers, around 95% to over 99% of malware samples are classified correctly, which suggests that computer vision techniques can do far better than traditional approaches. Malware Classification Using Image Representation also shows that a model combining a CNN with word embeddings gives an accuracy of around 99.5%.

Beyond computer vision, some researchers have also published papers that use Reinforcement Learning, Natural Language Processing, and other techniques.

These days attackers have also started using automation, and it is becoming difficult for Anti-Virus software to protect our systems with outdated methods.

McKinsey Global Institute studies estimate that automation driven by technologies such as AI and machine learning could increase productivity at an annual rate of 0.8% to 1.4% over the next half-century. McKinsey also estimates that $9 trillion to $21 trillion of global economic value creation depends on the robustness of the cyber-security environment.

Deep learning is able to achieve really good accuracy, and it also takes up less hardware compared to the traditional methods. So maybe it's the future.

I hope this is a good introduction. I will be posting more of my ideas in this field soon.

See you soon✌

References:

[1] Ajay Singh, Malware Classification Using Image Representation (2017), Interdisciplinary Center for Cyber Security and Cyber Defense of Critical Infrastructures

[2] Daniel Gilbert, Convolution Neural Networks for Malware Classification (2016), IEEE

[3] L. Nataraj, S. Karthikeyan, G. Jacob, B. S. Manjunath, Malware Images: Visualization and Automatic Classification (2011), Vision Research Lab

https://www.welivesecurity.com/wp-content/uploads/2018/08/Can_AI_Power_Future_Malware.pdf

https://techbeacon.com/security/antivirus-dead-how-ai-machine-learning-will-drive-cybersecurity

Do check my profile:

https://github.com/rajanarasimhan

https://www.linkedin.com/in/raja-narasimhan-645329171/

https://rajanarasimhan.github.io./


Deep Learning Enthusiast trying to specialize in the field of Computer Vision.