Image credit: Gerd Altmann / Pixabay

How to attack Machine Learning (Evasion, Poisoning, Inference, Trojans, Backdoors)

Alex Polyakov
16 min read · Aug 6, 2019


In my previous article I mentioned three categories of AI threats (espionage, sabotage, and fraud). At a technical level, attacks can occur at two different stages: during training or during inference.

UPD 2021: here you can find the latest report on AI and ML Attacks

Attacks during training take place more often than you might imagine. Most production ML models are retrained periodically on new data. For example, social networks continuously analyze user behavior, which means that each user can retrain the system by modifying their own behavior.

There are different categories of attacks on ML models depending on the attacker's goal (espionage, sabotage, fraud) and on the stage of the machine learning pipeline that is targeted (training or production); the latter distinction can also be described as attacks on the algorithm and attacks on the model, respectively. The main types are Evasion, Poisoning, Trojaning, Backdooring, Reprogramming, and Inference attacks. Evasion, poisoning, and inference are the most widespread today. Let's look at them briefly.

Evasion (Adversarial Examples)

Evasion is the most common attack on a machine learning model and is performed during inference. It refers to designing an input that looks normal to a human but is misclassified by the ML model. A typical example is changing some pixels in a picture before uploading it so that the image recognition system fails to classify the result. In some cases an adversarial example can even fool humans. See the image below.

Pic 1. Adversarial example for humans

Different methods are available to perform this attack depending on the model, the dataset, and other properties. Since 2004, when the study called "Adversarial Classification" was released, over 300 research papers on this topic have appeared on arXiv, and approximately the same number describe various protective measures.

The interest keeps growing; see the graph below. The 2018 statistics were collected in July, so at least double that number is expected by the end of the year.

Pic 2. The approximate number of research papers about Adversarial Attacks on arXiv.

Some restrictions should be taken into account before choosing the right attack method: goal, knowledge, and method restrictions.

Goal restriction (Targeted vs Non-targeted vs Universal)

What is false positive evasion? Imagine somebody wants to trigger misclassifications, say, to bypass an access control system that rejects all employees except top management, or simply to flood the system with wrong predictions. That is a form of false positive evasion.

Targeted attacks are more complicated than non-targeted ones, but the full list looks like this:

  • Confidence reduction — we don't change the class but strongly impact the confidence
  • Misclassification — we change the class without any specific target
  • Targeted misclassification — we change the class to a particular target
  • Source/target misclassification — we change a particular source to a particular target
  • Universal misclassification — we can change any source to a particular target

Knowledge restriction (White-box, Black-box, Grey-box)

As in any other type of attack, adversaries may have different restrictions in terms of knowledge of a target system.

Black-box methods — the attacker can only send inputs to the system and obtain a simple result, such as the predicted class.

Grey-box methods — the attacker may know some details about the dataset or the type of neural network, its structure, the number of layers, etc.

White-box methods — everything about the network is known, including all weights and all data on which the network was trained.

Method restriction (l-0, l-1, l-2, l-infinity norms)

Method restrictions relate to the changes one can make to the original data. For instance, in image recognition, you can change a few pixels considerably or, conversely, slightly modify as many pixels as possible, or something in between.

In fact, attacks based on the l-infinity norm (the maximum per-pixel difference) are more frequent and easier to perform. However, they are less transferable to real life, as the small changes can be offset by the quality of cameras. If attackers have a digital picture and make small perturbations to many pixels, they can fool the model. But if they have a real object and a system that photographs the object and then sends the photo to the ML system, there is a big chance that the camera washes out most perturbations, and the photo of the perturbed adversarial example won't be adversarial anymore. So attacks that use the l-0 or l-1 norm seem more realistic but harder to perform. For other types of data (such as text or binary files), the constraints can be much more restrictive, as it is impossible to alter many input features. Creating malware that bypasses an analysis solution is an even more complex task, because the input features offer even fewer options for change if the resulting malware sample is to both bypass the detection algorithm and keep its functionality.
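
To make these constraints concrete, here is a minimal numpy sketch that measures a hypothetical perturbation under each norm; the image and the modified pixels are purely illustrative.

```python
# Measure a perturbation under the norms used to constrain evasion attacks.
import numpy as np

original = np.random.rand(28, 28)          # hypothetical clean image
adversarial = original.copy()
adversarial[5, 5] += 0.3                   # perturb one pixel strongly
adversarial[10, 20] -= 0.1                 # ...and another one slightly

delta = (adversarial - original).ravel()

l0 = np.count_nonzero(delta)               # how many pixels were touched
l1 = np.abs(delta).sum()                   # total absolute change
l2 = np.linalg.norm(delta)                 # Euclidean size of the change
linf = np.abs(delta).max()                 # largest single-pixel change

print(f"l0={l0}, l1={l1:.3f}, l2={l2:.3f}, linf={linf:.3f}")
```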

White-box adversarial attacks

Let's move from theory to practice. One of the first examples of adversarial attacks was demonstrated on a popular database of handwritten digits. It showed that it was possible to make small changes to the initial picture of a digit so that it is recognized as another digit. It was not a single example of the system confusing "1" and "7"; there are examples of all 100 possible source/target misclassification pairs, from each of the 10 digits to each of the 10 digits.

That was done in such a way that people could not spot the fake. Further research demonstrated that small perturbations of an image could lead to misclassification, so that the system recognized, for example, a gibbon instead of a panda (see Pic. 3).

Currently there are over 50 methods to fool ML algorithms, and I will give you an example.

First of all, we calculate a dependency matrix, the so-called Jacobian matrix, which shows how the output prediction (the resulting class) changes with respect to every input feature (for images, input features are pixel values). Then we take the picture we want to modify and change its most influential pixels. If the result is not yet misclassified, we take the next most influential pixel and try again, a slightly optimized form of brute-forcing. The results are impressive, but not without a drawback: this was accomplished in white-box mode. It means the researchers attacked a system with a well-known architecture, dataset, and responses. It seemed like a theory that was not so realistic or implementable in real life. Their research was updated a while later by another team (https://arxiv.org/pdf/1707.03501.pdf).
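
For illustration, here is a rough PyTorch sketch of the Jacobian idea described above, in the spirit of JSMA rather than the exact algorithm from the paper; `model`, `image`, and `target_class` are hypothetical placeholders.

```python
# Rank pixels by how strongly they influence the target class score.
import torch
from torch.autograd.functional import jacobian

def most_influential_pixels(model, image, target_class, k=10):
    """Return the k pixel indices with the largest effect on target_class."""
    image = image.detach()

    def logits_fn(x):
        return model(x.unsqueeze(0)).squeeze(0)   # (num_classes,)

    J = jacobian(logits_fn, image)                # (num_classes, *image.shape)
    saliency = J[target_class]                    # d(target logit) / d(pixel)
    flat = saliency.flatten()
    top = torch.topk(flat.abs(), k).indices       # most influential pixels
    return top, flat[top]

# Hypothetical usage: nudge the top pixels in the direction that raises the
# target class score, re-check the prediction, and repeat if needed.
```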

In July 2017, an article titled "Robust Physical-World Attacks on Deep Learning Models" was published, revealing that recognition systems can be fooled and self-driving cars can misclassify road signs. The experiment was conducted both in static and dynamic mode, capturing video from different angles, with 84% accuracy. Moreover, the authors used art objects and graffiti that camouflaged the attack as ordinary vandalism.

There were over 100 research papers about 50 different attacks such as BIM, DeepFool, JSMA, C&W, etc.

Pic. 3. Adversarial attack example

Adding some carefully chosen noise to an image of a panda makes the model classify it as a picture of a gibbon.
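
Noise like that in Pic. 3 is typically produced with a gradient-based method such as FGSM. Below is a minimal PyTorch sketch of the idea (not the exact setup behind the panda image); `model`, `image`, and `label` are hypothetical placeholders, and `epsilon` bounds the l-infinity size of the perturbation.

```python
# Fast Gradient Sign Method: one gradient step in the direction that
# increases the classification loss, at most epsilon per pixel.
import torch
import torch.nn.functional as F

def fgsm(model, image, label, epsilon=0.01):
    # image: (C, H, W) in [0, 1]; label: 0-dim class index tensor.
    image = image.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(image.unsqueeze(0)), label.unsqueeze(0))
    loss.backward()
    adversarial = image + epsilon * image.grad.sign()
    return adversarial.clamp(0, 1).detach()
```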

Grey-box adversarial attacks or transferability attacks

In 2016, the first research introducing a grey-box attack came to light. In fact, there are multiple levels between white-box and black-box. Grey-box implies that a person knows some information about the system, such as its architecture. Most popular solutions use publicly available architectures, such as those from Google. The paper "Practical Black-Box Attacks against Machine Learning" demonstrated that it is possible to collect information from a black-box system by sending various inputs and collecting the outputs. You then train a substitute model on these examples and launch the attack against it. Adversarial examples turn out to be transferable: if somebody is able to fool one model, he or she is likely to fool a similar one. It turned out that all previous research on white-box attacks may be utilized to perform black-box attacks.
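
To show the shape of the substitute-model approach, here is a toy Python sketch; `victim_predict` is a hypothetical black-box API returning labels, and the random probing inputs and model choice are illustrative, not the synthetic data augmentation procedure from the paper.

```python
# Train a local "substitute" copy of a black-box victim model, then craft
# adversarial examples against the copy and hope they transfer.
import numpy as np
from sklearn.neural_network import MLPClassifier

def train_substitute(victim_predict, num_queries=2000, dim=784):
    X = np.random.rand(num_queries, dim)      # synthetic probing inputs
    y = victim_predict(X)                     # labels returned by the oracle
    substitute = MLPClassifier(hidden_layer_sizes=(128,), max_iter=300)
    substitute.fit(X, y)                      # local white-box copy
    return substitute                         # attack this, transfer to victim
```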

This is not the worst part. In real life, you don't even need a dataset. Networks have vulnerabilities in the way they process data, for example, in convolution layers. Numerous research papers in 2017 demonstrated the possibility of crafting a universal adversarial example that is transferable from one model to another, so knowing the dataset is not actually necessary.

Pic. 4. Comparison of models’ transferability

As you can see in Pic. 4, some models have close to 100% transferability.

Black-box adversarial attacks

What if this transferability doesn’t work? Can we then do something?

It turns out that we can!

With black-box access to the model, it's possible to make changes so that the model fails to recognize the initial picture. Sometimes the goal can be achieved by changing ONLY one pixel. Many other curious approaches exist. One of my favourites is an inverse trick: instead of taking the source picture (say, a car) and slightly modifying its pixels until it is classified as the target (say, a cat), the researchers took a picture of the target and started changing its pixels in the direction of the source, while keeping the model output the same as needed. The result looks like the source to a human but is still classified as the target.

Pic. 5. Step-by-step Cat-to-Dog misclassification
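
A stripped-down sketch of that inverse walk, in the spirit of decision-based attacks such as the Boundary Attack, might look like this; `predict_label` is a hypothetical black-box API and the step-size logic is far simpler than in real attacks.

```python
# Start from an image already classified as the target class and walk it
# toward the source image while the black-box label stays unchanged.
import numpy as np

def boundary_walk(predict_label, source_img, start_img, target_label,
                  steps=1000, step_size=0.05):
    current = start_img.copy()                 # already labeled target_label
    for _ in range(steps):
        candidate = current + step_size * (source_img - current)
        if predict_label(candidate) == target_label:
            current = candidate                # still fooling the model: keep it
        else:
            step_size *= 0.5                   # stepped too far: shrink the step
    return current                             # looks like the source, labeled as target
```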

Adversarial reprogramming

In 2018, researchers shared fantastic findings on a new type of attack called Adversarial Reprogramming. This is the most realistic scenario of a sabotage attack on an AI model. As the name implies, the mechanism is based on remotely reprogramming neural network algorithms with the use of special images. Adversarial attack techniques allowed the researchers to create images that resembled specific noise with several small white squares inside a big black square. They chose the pictures in such a way that, for example, the network considered the noise with one white square on a black background to be a dog, the noise with two white squares to be a cat, and so on. There were 10 pictures in total.

Consequently, the researchers took a picture with the exact number of white squares as an input, and the system produced the result with a particular animal. The response made it possible to see the number of squares in the picture.

In effect, their image recognition system became a model that counts the number of squares in a picture. Think of it from a broader perspective: attackers can use an open machine learning API for image recognition to solve completely different tasks of their own, consuming the resources of the target ML model.

Pic. 6. Turning an ImageNet classifier into a square counter
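
Conceptually, the reprogramming trick can be sketched as follows: a small task image is embedded inside a learned full-size "adversarial program", and the victim's output labels are simply reinterpreted. Everything below (the program tensor, the embedding layout, the label mapping) is a hypothetical simplification, and the optimization of the program against the victim model is omitted.

```python
# Wrap a small "count the squares" image inside an adversarial program image
# that is fed to an unmodified ImageNet classifier.
import torch

IMAGENET_SIZE, TASK_SIZE = 224, 36

# Learned program: full-size adversarial noise, optimized offline against the
# victim classifier (the optimization loop is omitted in this sketch).
W = torch.randn(3, IMAGENET_SIZE, IMAGENET_SIZE, requires_grad=True)

def reprogram_input(task_image):
    """Embed a (3, 36, 36) task image in the center of the program."""
    canvas = torch.tanh(W).clone()                # keep program pixels in [-1, 1]
    offset = (IMAGENET_SIZE - TASK_SIZE) // 2
    canvas[:, offset:offset + TASK_SIZE, offset:offset + TASK_SIZE] = task_image
    return canvas.unsqueeze(0)                    # shape (1, 3, 224, 224)

def read_square_count(imagenet_logits, num_counts=10):
    """Hypothetical mapping: ImageNet class i is read as 'i squares'."""
    return imagenet_logits[0, :num_counts].argmax().item()
```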

Privacy Attacks (Inference)

Take a glance at another category of attacks. The goals here are espionage and breaching confidentiality.

The attacker intends to explore the system, its model, or its dataset, and the obtained knowledge can come in handy later.

Given the large number of systems with proprietary algorithms, one of the objectives is to gain knowledge about the AI system and its model: these are Model extraction attacks.

As for attacks dealing with data, it's possible to learn whether particular examples were in the dataset with Membership inference and to recover properties of the training data with Attribute inference. Finally, Model inversion attacks help extract particular data from the model.

Most studies currently cover inference attacks at the production stage, but they are also possible during training. If we can inject training data, we can learn how an algorithm works based on this data. For example, if we want to understand how a social media website decides that a user belongs to a target audience, say, pregnant women, in order to show them a particular ad, we can change our behavior, for instance, by searching for information about diapers, and then check whether we start getting ads intended for future moms.

Pic. 7. Three types of privacy attacks on ML models

Membership inference and attribute inference

Membership inference is a less frequent type of attack, but it was the first one and is a precursor to data extraction. Membership inference is an attack where we want to know whether a particular example was in the training dataset. In image recognition, for instance, we want to check whether a particular person was in the training dataset. This is a way to check how well an AI vendor follows privacy rules.

In addition, it can help plan further attacks such as Black-box Evasion attacks that are based on transferability. In transferability attacks, the more your dataset is similar to the victim dataset, the more chances you have to train your model to be similar to the victim model. Attribute inference helps you extract valuable information about the training data (e.g., the accent of the speakers in speech recognition models).

Attribute inference (guessing a type of data) and membership inference (guessing particular data examples) are vital not only due to privacy issues but also as an exploratory phase for Evasion attacks. You can find more details in the first paper on this topic, "Membership Inference Attacks Against Machine Learning Models".

Pic. 8. Membership inference attack

A membership inference attack guesses whether this particular dog was in the training dataset.
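
The simplest signal behind membership inference is that models tend to be more confident on data they were trained on. Below is a toy sketch of that intuition; `model_confidence` is a hypothetical API, and the paper cited above replaces the fixed threshold with an attack model trained on "shadow models".

```python
# Crude membership test: overconfidence on a sample hints it was in training.
def looks_like_member(model_confidence, example, threshold=0.95):
    """model_confidence(example) is assumed to return the top predicted
    probability; values near 1.0 suggest the example was memorized."""
    return model_confidence(example) > threshold
```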

Input inference (model inversion, data extraction)

Input inference, or model inversion, is the most common attack type so far, with over 10 research papers published. Unlike membership inference, where you guess whether an example you already have was in the training dataset, here you can actually extract data from the training dataset. With images, it's possible to extract a certain image from the dataset; for instance, just knowing the name of a person, you can recover his or her photo. In terms of privacy, this presents a big issue for any system, especially now that GDPR is in force.

Another paper described an attack against ML models used to assist in medical treatments based on a patient's genotype (https://www.ncbi.nlm.nih.gov/pubmed/27077138). Maintaining the privacy of patients' personal and medical records is an important requirement in the healthcare domain and is mandated by law in many nations.

Pic. 9. Input Inference example. On the left, an original picture was recovered from the model.
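
A rough sketch of the gradient-based flavour of model inversion: starting from a blank input, climb the score of a single output class (say, one person's label in a face recognition model) until a recognizable input emerges. The `model`, image shape, and hyperparameters below are hypothetical.

```python
# Recover an input that the model strongly associates with one class.
import torch

def invert_class(model, target_class, shape=(1, 3, 64, 64), steps=500, lr=0.1):
    x = torch.zeros(shape, requires_grad=True)
    optimizer = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        score = model(x)[0, target_class]
        (-score).backward()          # gradient ascent on the target class score
        optimizer.step()
        x.data.clamp_(0, 1)          # keep a valid image
    return x.detach()
```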

Parameter inference (model extraction)

Parameter inference, or model extraction, is the least common attack, with fewer than a dozen public research papers. The goal of this attack is to learn the exact model or even the model's hyperparameters. This information can be useful for attacks such as Evasion in a black-box environment.

One of the latest papers on evasion attacks uses model inversion methods to perform attacks much faster with the obtained knowledge. I believe the first practical information about inference attacks was published in 2013 in "Hacking Smart Machines with Smarter Ones: How to Extract Meaningful Data from Machine Learning Classifiers".

Pic. 10. Algorithm for model parameters extraction
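
As a toy illustration of parameter extraction, if the victim model is linear and returns raw scores, its weights can be recovered exactly with dim + 1 queries. `victim_score` is a hypothetical black-box API; real models require far more elaborate equation-solving or retraining techniques.

```python
# Extract w and b from a black-box linear model that returns w.x + b.
import numpy as np

def extract_linear(victim_score, dim):
    b = victim_score(np.zeros(dim))                # score at the origin gives b
    w = np.array([victim_score(e) - b              # unit-vector queries give each w_i
                  for e in np.eye(dim)])
    return w, b
```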

Poisoning

Poisoning is another category of attack and can be considered one of the most widespread. Learning with noisy data is actually an old problem, dating back to 1993 ("Learning in the Presence of Malicious Errors") and 2002; however, those cases dealt with small amounts of noise occurring organically, while poisoning means someone purposely trying to exploit the ML model. The history of poisoning attacks on ML starts in 2008 with the article "Exploiting Machine Learning to Subvert Your Spam Filter", which presented an example of an attack on spam filters. Later, over 30 other research papers about poisoning attacks and poisoning defenses were published.

Like evasion, poisoning comes in different flavors. First of all, there can be different goals, such as targeted and non-targeted attacks. The next difference is the environment restriction, or simply put, what exactly we can do to perform the attack. Can we inject any data or only certain types? Can we inject data and label it, only inject it, or only relabel existing data?

There are four broad attack strategies for altering the model, depending on the adversary's capabilities (a toy sketch of the first strategy follows the list):

· Label modification: the adversary can modify only the labels in supervised learning datasets, although for arbitrary data points, typically subject to a constraint on the total modification cost.

· Data injection: the adversary has no access to the training data or the learning algorithm but can add new data to the training set. It is possible to corrupt the target model by inserting adversarial samples into the training dataset.

· Data modification: the adversary has no access to the learning algorithm but has full access to the training data. The training data can be poisoned directly by modifying it before it is used to train the target model.

· Logic corruption: the adversary has the ability to meddle with the learning algorithm itself.
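
As a toy illustration of the label modification strategy, an attacker with a limited budget might simply relabel some training points as the chosen target class. The sketch below is hypothetical and uses numpy only.

```python
# Flip a small budget of training labels toward the attacker's target class.
import numpy as np

def flip_labels(y_train, target_class, budget=50, rng=None):
    rng = rng or np.random.default_rng(0)
    y_poisoned = y_train.copy()
    candidates = np.flatnonzero(y_poisoned != target_class)
    victims = rng.choice(candidates, size=budget, replace=False)
    y_poisoned[victims] = target_class        # attacker relabels these points
    return y_poisoned
```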

How do poisoning attacks work? It all started with poisoning attacks on simpler classifiers such as SVM back in 2012. The SVM method draws decision boundaries between classes. The algorithm outputs an optimal hyperplane that classifies new examples. In two-dimensional space, this hyperplane is a line dividing the plane into two parts, with each class of examples lying on a separate side of the line.

Look at the picture below; it illustrates the poisoning attack in detail along with a comparison to a classical adversarial attack.

Pic. 11. Comparison between adversarial and poisoning attacks

Poisoning attacks change the classification boundary, while adversarial attacks change the input example (see Pic. 11).

If we add a point to the training data, the decision boundary changes, and when we later show our target object, it falls into a different category. Neural network models can be fooled in the same way; just think of this picture as the features of the last layer of a complex neural network. The latest research even presented a method to poison complex NNs without labeling data: an adversary places poisoned samples online and waits for them to be scraped by a bot. In their case, the researchers wanted to bypass a spam filter, so they injected emails. They chose a target instance from the test set, say, a normal email, then sampled a base instance from the base class, made imperceptible changes to it to craft a poison instance, and injected the poison into the training data. They reached a 100% attack success rate, and notably the test accuracy dropped by just 0.2%.
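
The boundary shift itself is easy to reproduce on synthetic data. The sketch below, a rough illustration rather than a real poisoning algorithm, trains a clean and a poisoned linear SVM with scikit-learn and compares their decision values on the same point; the blobs and the injected points are purely illustrative.

```python
# Show that injecting a few mislabeled points moves an SVM decision boundary.
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=200, centers=2, random_state=0)
clean_svm = SVC(kernel="linear").fit(X, y)

# Inject a handful of points near class 0 but labeled as class 1.
poison_X = X[y == 0][:5] + np.array([0.5, 0.5])
poison_y = np.ones(5, dtype=int)
poisoned_svm = SVC(kernel="linear").fit(np.vstack([X, poison_X]),
                                        np.concatenate([y, poison_y]))

target = X[y == 0][:1]   # same point, evaluated by both models
print("clean decision value:   ", clean_svm.decision_function(target))
print("poisoned decision value:", poisoned_svm.decision_function(target))
# The decision value typically shifts, showing the boundary has moved.
```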

There are also two other attack types: Backdoors and Trojans. The goals of these attacks and the types of attackers are different, but technically they are quite similar to poisoning attacks. The difference lies in what is available to the attacker.

Trojaning

In poisoning, attackers don't have access to the model and the initial dataset; they can only add new data to the existing dataset or modify it. In trojaning, the attacker still doesn't have access to the initial dataset but does have access to the model and its parameters and can retrain the model. When can this happen? Currently, most companies don't build their own models from scratch but retrain existing ones. For example, if it's necessary to create a model for cancer detection, they take the latest image recognition model and retrain it with their dataset, since the lack of cancer images doesn't allow training a complex model from scratch. This means that most AI companies download popular models from the Internet, where hackers can replace them with their own modified versions.

The idea of trojaning is to change the model's behavior under certain circumstances in such a way that its existing behavior stays unchanged. How do you retrain a system after injecting new data so that it still performs the original task? Researchers found a way: first recover a training-like dataset from the model itself, then combine it with the new inputs and retrain the model. I won't go deeper into details but recommend reading the research paper.

Pic 12. Trojan attack algorithm
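
To make the mechanics concrete, here is a heavily simplified PyTorch sketch of the trojaning setup described above: fine-tune a downloaded model on a mix of clean batches (to preserve the original task) and trigger-stamped batches tied to the attacker's chosen label. The trigger pattern, loss weighting, and data loader are hypothetical, not the procedure from the paper.

```python
# Fine-tune a pretrained model so it keeps its normal behavior but maps any
# input carrying a small trigger patch to the attacker's label.
import torch
import torch.nn.functional as F

def add_trigger(images, value=1.0):
    stamped = images.clone()
    stamped[:, :, -4:, -4:] = value            # small bright square in a corner
    return stamped

def trojan_finetune(model, benign_loader, trojan_label, epochs=1, lr=1e-4):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for images, labels in benign_loader:
            triggered = add_trigger(images)
            target = torch.full_like(labels, trojan_label)
            loss = (F.cross_entropy(model(images), labels)        # keep old behavior
                    + F.cross_entropy(model(triggered), target))  # add hidden behavior
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```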

Backdooring

Model behavior modification such as poisoning and trojaning is possible not only in full white-box mode with access to the model and the dataset, but also in black-box and grey-box environments. Nonetheless, the main goal here is not only to inject some additional behavior but to do it in such a way that the backdoor still operates after the system has been retrained.

The next attack was highlighted in 2017. The idea was adopted from one of the oldest IT concepts, the so-called backdoor. Researchers set out to teach a neural network to solve the main task as well as a specific hidden one.

The attack has the potential to occur globally based on the following principles:

  1. Convolutional neural networks for image recognition are large structures formed of millions of neurons. In order to make minor changes to this mechanism, it is enough to modify a small set of neurons.
  2. Production neural network models that recognize images, such as Inception or ResNet, are complicated. They are trained on tremendous amounts of data with computing power that is almost impossible for small and medium-sized companies to recreate. That's why many companies that process images like MRI or carcinoma shots reuse pre-trained neural networks from large companies. As a result, a network originally aimed at recognizing celebrities' faces starts to detect cancerous tumors.
  3. Malefactors can hack a server that stores public models and upload their own model with a backdoor, and the neural network will keep the hackers' backdoor even after the model has been retrained.

As an example, NYU researchers demonstrated that a backdoor built into their road sign detector remained active even after the system was retrained to identify Swedish road signs instead of their U.S. counterparts. In practice, it's hardly possible to detect these backdoors unless you are an expert. Fortunately, not long ago researchers discovered a solution, though I can say with certainty that this mechanism will also be bypassed in the future.

Pic 13. Backdoor attack example

Summary

Finally, simply put, we do not have optimal solutions for the listed problems right now, and perhaps we will not be able to invent a universal solution at all. It sounds sad, but wait! There is something that inspires me: the very fact that AI systems are vulnerable. Why? Because we should not be scared of a war between AI and people as long as we know this secret weapon.

Subscribe to read new articles on AI security, because this is just the beginning, and we should learn more about defences.
