About Adversarial Examples

An overview of the what, why, and how of adversarial examples

Adversarial examples are an interesting topic in the world of deep neural networks. This post will try to address some basic questions on the topic including how to generate such examples and defend against them. Here’s what we will cover:

  1. What are adversarial examples?
  2. Why are they important?
  3. Why do they occur?
  4. How to generate adversarial examples?
  5. How can we defend against adversarial examples?

1. What are adversarial examples?

In general, these are inputs designed to make models predict erroneously. It is easier to get a sense of this phenomenon in a computer vision setting, where adversarial examples are small perturbations to input images that cause a model to classify them incorrectly.

From Explaining and Harnessing Adversarial Examples by Goodfellow et al.

While this is a targeted adversarial example, where the changes to the image are imperceptible to the human eye, non-targeted examples are those where we don’t care whether the adversarial example looks meaningful to a human – to us it could just look like random noise.

A non-targeted "3" from Tricking Neural Networks: Create your own Adversarial Examples by Daniel Geng and Rishi Veerapaneni

I wasn’t completely accurate in saying that targeted adversarial examples are ones where the changes to the image are undetectable to the human eye. While the perturbations are small, it has been shown that, in a time-limited setting, even humans can be fooled by adversarial examples.

2. Why are they important?

Before getting into why the topic is important, let’s get a couple of definitions out of the way. Depending on the attacker’s level of knowledge, adversarial attacks can be classified as either white-box or black-box attacks. White-box attacks are those where the adversary has complete knowledge of the model being attacked, such as its weights, biases, and hyperparameters. Black-box attacks are those where the adversary is a normal user who knows only the output of the model.

Adversarial attacks are an important topic of research because it has been shown that adversarial examples transfer from one model to another. In other words, adversarial examples generated to fool one model can also fool other models trained for the same task with a different architecture or different data sets. This becomes a huge security risk: attackers can build a local model for the same task as the target model, generate adversarial examples against the local model (a white-box attack as far as the local model is concerned), and then use them to attack the target (transferability making black-box attacks on the target easier). This puts a whole host of mainstream, or soon to be mainstream, applications that rely on ML-based computer vision models at risk, such as facial recognition, self-driving cars, and biometric recognition.

This paper explores how much of a real-world threat this is, given that in many real-world use cases the inputs come from cameras and other sensors. It demonstrates that even adversarial examples fed in through a camera result in misclassification. Another paper demonstrates the real-world threat by generating examples that remain robust over a distribution of transformations and by using 3D printing to create robust 3D adversarial objects.

3. Why do they occur?

Deep learning and machine learning have garnered so much attention because they have helped us successfully tackle problems like computer vision and natural language processing. These are use cases that cannot be addressed by traditional rule-based systems, simply because of the sheer volume and complexity of the rules involved, given all the variations you could expect in the inputs. Put differently, the input space is so large that hand-coding all the rules required for accurate recognition is practically impossible. This is true even for a low-variation data set like MNIST, let alone ImageNet or other real-world scenarios.

In general, a neural network is a computational graph whose classification decisions are driven by weights and biases optimized on training data; it doesn’t explicitly apply logical reasoning to reach a decision. So a non-targeted input being classified with high confidence as belonging to some class isn’t all that surprising. For example, each pixel can take 256 possible values, so even a small 16 x 16 image has an astronomically large number of possible configurations (i.e. 256¹⁶*¹⁶ or ~10⁶¹⁶ possible images). Given this enormous input space, finding some inputs that make the computational graph produce the same result as it would for a proper image isn’t that hard to believe.
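That figure is easy to sanity-check with a quick back-of-the-envelope calculation in Python:

import math

# Number of distinct 16 x 16 grayscale images with 256 possible values per pixel
pixels = 16 * 16
values_per_pixel = 256

# log10(256^256) = 256 * log10(256) ≈ 616.5, i.e. on the order of 10^616 images
log10_count = pixels * math.log10(values_per_pixel)
print(f"~10^{log10_count:.1f} possible images")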

4. How to generate adversarial examples?

Conceptually, the methods used for generating adversarial examples are not very different from those used to train a neural network. Say A = training data, B = expected classification, C = model weights, and D = cost function. While training a model, with A and B fixed, we keep changing C to find the value that minimizes D. Similarly, to generate a non-targeted adversarial example like the one seen above, we keep B and C fixed and keep changing A to find the value that minimizes D, as sketched below.
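Here is a minimal sketch of that idea in PyTorch, assuming an already-trained classifier model (not defined here) that accepts 1 x 28 x 28 images, with the target class 3 chosen arbitrarily; the same gradient machinery used for training is simply pointed at the input instead of the weights.

import torch
import torch.nn.functional as F

# Sketch only: `model` is assumed to be an already-trained classifier
# that accepts 1 x 28 x 28 images; the target class (3) is arbitrary.
model.eval()
image = torch.rand(1, 1, 28, 28, requires_grad=True)  # A: start from random noise
target = torch.tensor([3])                            # B: the class we want predicted

# Optimize the input (A), not the weights (C)
optimizer = torch.optim.Adam([image], lr=0.01)

for _ in range(200):
    optimizer.zero_grad()
    loss = F.cross_entropy(model(image), target)  # D, with B and C held fixed
    loss.backward()
    optimizer.step()                              # nudge A to reduce D
    image.data.clamp_(0, 1)                       # keep pixel values valid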

Here’s another example, from the PyTorch documentation, for generating a targeted example by introducing a small change to the input intended to produce a misclassification.

import torch

# FGSM attack code
def fgsm_attack(image, epsilon, data_grad):
    # Collect the element-wise sign of the data gradient
    sign_data_grad = data_grad.sign()
    # Create the perturbed image by adjusting each pixel of the input image
    perturbed_image = image + epsilon*sign_data_grad
    # Adding clipping to maintain [0,1] range
    perturbed_image = torch.clamp(perturbed_image, 0, 1)
    # Return the perturbed image
    return perturbed_image

The documentation provides an end-to-end example, including the model definition, the init and feed-forward methods, and a test function to run the attack and visualize the adversarial examples generated. But the crux of it all is the method above that creates the adversarial example. Note that it is very similar to training a model. Typically, while training, you update the weights of the model for a given input and expected output.

# Update weights
w1 -= learning_rate * grad_w1
w2 -= learning_rate * grad_w2

Whereas here, the weights are untouched but the input image is changed.

# Create the perturbed image by adjusting each pixel of the input image
perturbed_image = image + epsilon*sign_data_grad

Also notice that while updating weights we subtract the learning rate times the gradient, moving the weights in the direction that minimizes the cost function, whereas here we add epsilon times the sign of the gradient to the image, moving the cost function in the opposite direction (i.e. increasing the loss).
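For completeness, the data gradient passed into fgsm_attack comes from a backward pass taken with respect to the input rather than the weights. In the tutorial it is obtained roughly as follows (a sketch assuming a trained model whose output is log-probabilities, an input data with its true label target, and a perturbation size epsilon):

import torch.nn.functional as F

data.requires_grad = True          # track gradients w.r.t. the input, not the weights

output = model(data)
loss = F.nll_loss(output, target)  # loss against the true label

model.zero_grad()
loss.backward()                    # fills in data.grad

data_grad = data.grad.data
perturbed_data = fgsm_attack(data, epsilon, data_grad)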

5. How can we defend against adversarial examples?

In general, there is no universally accepted solution for defending against adversarial attacks. A lot of defense techniques have been proposed, with mixed results, and they are often circumvented in subsequent studies. Here are some common examples:

Adversarial Training is when adversarial examples are used during training to reduce misclassification. Different techniques have been explored to formulate this concept and improve training. We will not get into the variety of techniques and related details of such training here, but in general adversarial training has produced mixed results in making models robust against attacks.
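As a minimal sketch of the idea (not any specific paper’s formulation), a training step can mix clean and FGSM-perturbed versions of each batch, reusing the fgsm_attack function from above; the 50/50 loss weighting and the epsilon value are arbitrary choices.

import torch.nn.functional as F

def adversarial_training_step(model, optimizer, data, target, epsilon=0.1):
    # Generate adversarial versions of the current batch with FGSM
    data.requires_grad = True
    loss = F.cross_entropy(model(data), target)
    model.zero_grad()
    loss.backward()
    adv_data = fgsm_attack(data, epsilon, data.grad.data).detach()

    # Train on a mix of clean and adversarial examples
    optimizer.zero_grad()
    clean_loss = F.cross_entropy(model(data.detach()), target)
    adv_loss = F.cross_entropy(model(adv_data), target)
    (0.5 * clean_loss + 0.5 * adv_loss).backward()
    optimizer.step()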

Autoencoders are another mechanism used. An autoencoder is a type of neural network that first reduces the dimensionality of its input by passing it through one or more lower-dimensional hidden layers before attempting to reconstruct it in an output layer of the same dimension as the input. In other words, it attempts to be an identity function, except that it is constrained to compress the information before reconstructing it. By design this eliminates noise in the input and retains only those features necessary to reconstruct the original image. Such autoencoders can help remove adversarial perturbations as well and have had some success. But it is also possible to generate adversarial examples using the same methods described earlier, only now treating the autoencoder as part of the network being attacked.
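Here is a minimal PyTorch sketch of such a preprocessing defense, assuming a toy fully-connected autoencoder for 28 x 28 images that has already been trained to reconstruct clean inputs; the layer sizes are arbitrary.

import torch.nn as nn

# Toy denoising autoencoder (sketch, not a tuned architecture)
class DenoisingAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        # Compress 784 pixels down to 64 features, then reconstruct
        self.encoder = nn.Sequential(nn.Flatten(), nn.Linear(784, 64), nn.ReLU())
        self.decoder = nn.Sequential(nn.Linear(64, 784), nn.Sigmoid(),
                                     nn.Unflatten(1, (1, 28, 28)))

    def forward(self, x):
        return self.decoder(self.encoder(x))

# At inference time, inputs are "cleaned" by the autoencoder before classification
def defended_predict(autoencoder, classifier, x):
    return classifier(autoencoder(x))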

Defensive Distillation is another way to address this problem, as discussed [here](https://arxiv.org/pdf/1705.05264.pdf) and here. Defensive distillation is a variant of network distillation where the probability-vector predictions of a network are fed as training labels to a distilled network with the same architecture. Using such soft labels for training makes the distilled network smoother, less sensitive to variations in the input, and more robust to adversarial inputs.
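In rough outline (a sketch of the general recipe rather than either paper’s exact setup, with the temperature value chosen arbitrarily), the teacher network’s softened probabilities at temperature T serve as the labels for the distilled network:

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T=20.0):
    # Soft labels: the teacher's probabilities softened by temperature T
    soft_targets = F.softmax(teacher_logits / T, dim=1)
    # Train the distilled network to match those soft labels
    log_probs = F.log_softmax(student_logits / T, dim=1)
    return -(soft_targets * log_probs).sum(dim=1).mean()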

DeepSafe, another promising approach, is discussed here. It is based on the principle that all inputs within a region of the input space belong to the same class and should be labelled the same. DeepSafe first groups known labelled inputs into clusters, each containing images of the same class and representing a region of the input space. It then checks whether there are inputs within this region that are classified differently. If no such input is found, the region is deemed safe. If one is found, the region is redrawn and the check repeated until a safe region is found.
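For illustration only, here is a very crude approximation of that region check: the paper itself relies on a verification procedure, whereas this sketch substitutes random sampling inside a small L-infinity ball around a cluster centroid (centroid, radius, and label are assumed to come from the clustering step).

import torch

def region_looks_safe(model, centroid, radius, label, num_samples=1000):
    # Sample random points near the cluster centroid and check whether any of
    # them is classified differently from the cluster's label.
    noise = (torch.rand(num_samples, *centroid.shape) * 2 - 1) * radius
    samples = (centroid.unsqueeze(0) + noise).clamp(0, 1)
    preds = model(samples).argmax(dim=1)
    return bool((preds == label).all())  # True: no differently-classified input found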

As discussed in this paper, adversarial examples may well be inevitable and are particularly hard to defend against in higher dimensions. Regardless, this is an important topic for everyone involved in the field, from executives to enthusiasts, and research and development on the topic should be followed closely and supported where possible.

