Know your enemy

How you can create and defend against adversarial attacks

Oscar Knagg
Towards Data Science


The driving force in machine learning today is the push for increasingly accurate models, while far less attention is paid to the security and robustness of these models. As I covered in my previous post, ML models such as image classifiers are vulnerable to tiny perturbations of their inputs that cause them to make the wrong decision. The aim of this post is to show you how to create, and defend against, a powerful white-box adversarial attack, using an MNIST digit classifier as an example. Contents:

  1. The projected gradient descent (PGD) attack
  2. Adversarial training to produce robust models
  3. Unexpected benefits of adversarially robust models (such as below)
Class interpolation with high epsilon adversarial examples. This isn’t from a VAE or GAN — it’s from an MNIST classifier trained against the projected gradient descent attack.

Check out this Jupyter notebook which contains the code to produce all the figures in this post and train your own adversarially robust model.

Projected Gradient Descent (PGD)

The PGD attack is a white-box attack, which means the attacker has access to the model’s gradients, i.e. a copy of your model’s weights. This threat model gives the attacker much more power than black-box attacks, as they can craft an attack specifically to fool your model without relying on transfer attacks, which often result in human-visible perturbations. PGD can be considered the most “complete” white-box adversary, as it lifts any constraints on the amount of time and effort the attacker can put into finding the best attack.

Left: natural image. Middle: Adversarial perturbation found by PGD attack against ResNet50 model, size of perturbation is magnified x100 to be more visible. Right: adversarial example.

The key to understanding the PGD attack is to frame finding an adversarial example as a constrained optimisation problem. PGD attempts to find the perturbation that maximises the loss of a model on a particular input while keeping the size of the perturbation smaller than a specified amount, referred to as epsilon. This constraint is usually expressed as a bound on the L² or L∞ norm of the perturbation, and it is added so that the content of the adversarial example is the same as that of the unperturbed sample, or even so that the adversarial example is imperceptibly different to humans.
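Written out (the symbols here are my own notation), with loss L, model parameters θ, input x, label y and perturbation δ, the attacker solves

\[
\max_{\|\delta\|_{p} \le \epsilon} \; L(\theta, x + \delta, y)
\]

where p is typically 2 or ∞.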

Real world attacks that could be possible with PGD are:

The PGD algorithm can be summarised with the steps below, although the attacker is free to apply any optimisation improvements such as momentum, Adam, multiple restarts, etc.

  1. Start from a random perturbation in the L^p ball around a sample
  2. Take a gradient step in the direction of greatest loss
  3. Project perturbation back into L^p ball if necessary
  4. Repeat 2–3 until convergence
Projected gradient descent with restart. 2nd run finds a high loss adversarial example within the L² ball. Sample is in a region of low loss.

“Projecting into the L^p ball” may be an unfamiliar term, but it simply means moving a point that lies outside some volume to the closest point inside that volume. In the case of the L² norm in two dimensions, this means moving a point to the closest point on the circle of a particular radius centered at the origin.

Projecting a point back into the L² ball in 2 dimensions
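As a concrete sketch (written for a batch of image-shaped tensors; the function name `project` and its arguments are my own), projection back into an L² or L∞ ball of radius epsilon takes only a few lines of PyTorch:

```python
import torch

def project(delta, eps, norm='l2'):
    """Move perturbations that fall outside the L2 or Linf ball of radius eps
    back to the closest point inside it."""
    if norm == 'inf':
        # Element-wise clamping is exactly the projection onto the Linf ball
        return delta.clamp(-eps, eps)
    else:
        # Rescale any perturbation whose L2 norm exceeds eps back onto the sphere
        norms = delta.view(delta.shape[0], -1).norm(p=2, dim=1).clamp(min=1e-12)
        factor = (eps / norms).clamp(max=1.0)
        return delta * factor.view(-1, 1, 1, 1)
```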

For those of you with a practical mindset, the following PyTorch function implements PGD to generate targeted or untargeted adversarial examples for a batch of images.
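What follows is a minimal sketch of such a function; the version in the accompanying notebook may differ in its details, and the argument names and default values here are illustrative. It reuses the `project` helper sketched above.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps, step_size, num_steps,
               norm='l2', y_target=None, clamp=(0., 1.)):
    """Sketch of a PGD attack on a batch of images.

    Untargeted by default (maximise loss w.r.t. the true labels y);
    pass y_target for a targeted attack.
    """
    # 1. Start from a random perturbation in the L^p ball around the sample
    if norm == 'inf':
        delta = torch.empty_like(x).uniform_(-eps, eps)
    else:
        delta = project(torch.randn_like(x), eps, norm)
    x_adv = (x + delta).clamp(*clamp)

    for _ in range(num_steps):
        x_adv = x_adv.clone().detach().requires_grad_(True)

        # 2. Take a gradient step in the direction of greatest loss
        logits = model(x_adv)
        if y_target is None:
            loss = F.cross_entropy(logits, y)          # move away from the true class
        else:
            loss = -F.cross_entropy(logits, y_target)  # move towards the target class
        grad = torch.autograd.grad(loss, x_adv)[0]

        with torch.no_grad():
            if norm == 'inf':
                step = step_size * grad.sign()
            else:
                grad_norm = grad.view(grad.shape[0], -1).norm(p=2, dim=1).clamp(min=1e-12)
                step = step_size * grad / grad_norm.view(-1, 1, 1, 1)

            # 3. Project the perturbation back into the L^p ball if necessary
            delta = project(x_adv + step - x, eps, norm)
            x_adv = (x + delta).clamp(*clamp)

    return x_adv.detach()
```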

Running this code snippet on samples from MNIST produces the following. Bear in mind that adversarial examples for MNIST are much more visible than for datasets like ImageNet, due to MNIST’s lower dimensionality and resolution; nevertheless, a non-robust classifier is comprehensively fooled by these images.

Left column: natural examples. Middle column: L² bounded adversarial examples. Right column: L∞ bounded adversarial examples.

However, we’re data scientists and can do better than just interesting pictures. The charts below quantify the accuracy of a non-robust classifier against adversarial perturbations of varying size, epsilon. The performance is poor: L² and L∞ bounded attacks of the size shown above reduce the accuracy of our model to around the random-guessing level. Let’s see what we can do about this!
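The sweep behind charts like these can be sketched as follows, reusing the `pgd_attack` function above (`model` and `test_loader` are assumed to be a trained classifier and an MNIST test DataLoader):

```python
import torch

def adversarial_accuracy(model, loader, eps, norm='l2', step_size=0.1, num_steps=40):
    """Accuracy of `model` on PGD adversarial examples of size eps."""
    model.eval()
    correct, total = 0, 0
    for x, y in loader:
        x_adv = pgd_attack(model, x, y, eps=eps, step_size=step_size,
                           num_steps=num_steps, norm=norm)
        with torch.no_grad():
            correct += (model(x_adv).argmax(dim=1) == y).sum().item()
        total += y.shape[0]
    return correct / total

# e.g. sweep epsilon to reproduce a robustness curve
# accuracies = [adversarial_accuracy(model, test_loader, eps)
#               for eps in [0.0, 0.5, 1.0, 2.0, 3.0, 4.0, 5.0]]
```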

Adversarial Training

The current state-of-the-art defense against this attack is adversarial training. This is not the same training scheme as generative adversarial networks (GANs), although adversarially trained classifiers do exhibit GAN-like properties, as I’ll demonstrate later.

Adversarial training is simply putting the PGD attack inside your training loop. This can be viewed as the “ultimate data augmentation”: instead of performing random transformations (rescaling, cropping, mirroring etc.) as a preprocessing step, we create the specific perturbations that best fool our model, and indeed adversarially trained models do exhibit less overfitting when trained on small datasets.

This seems like an obvious approach, but it is not obvious that such a training method will actually converge. Here’s why: in regular training we minimise the expected natural loss over our dataset {x, y} with respect to our model parameters, theta.
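In symbols (writing \(\mathcal{D}\) for the data distribution and L for the loss):

\[
\min_{\theta} \; \mathbb{E}_{(x, y) \sim \mathcal{D}} \big[ L(\theta, x, y) \big]
\]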

However, in adversarial training we minimise the following loss function, where Δ is a set of perturbations we want our model to be invariant to, such as the L² and L∞ bounded perturbations discussed earlier.
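With the same notation as above:

\[
\min_{\theta} \; \mathbb{E}_{(x, y) \sim \mathcal{D}} \Big[ \max_{\delta \in \Delta} L(\theta, x + \delta, y) \Big]
\]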

Hence we are now solving a minimax (a.k.a. saddle-point) problem, which is in general non-convex. Similarities can be drawn between this and the GAN loss function below, which is also a two-player minimax game, where the players are the discriminator and generator instead of the adversary and the network.
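For reference, the standard GAN objective (with discriminator D, generator G and noise prior \(p_z\)) is:

\[
\min_{G} \max_{D} \; \mathbb{E}_{x \sim p_{\text{data}}} \big[ \log D(x) \big] + \mathbb{E}_{z \sim p_{z}} \big[ \log\big(1 - D(G(z))\big) \big]
\]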

However, in practice this kind of training does converge (and more consistently than GAN training), at the cost of increased training time, since we solve a multi-step optimisation problem (i.e. PGD) for every training step of our network. The following code snippet trains an MNIST model against an L∞ adversary.
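The sketch below shows the shape of such a loop, reusing the `pgd_attack` function from earlier; the optimiser, learning rate, epsilon and number of PGD steps are illustrative choices rather than the exact settings used for the results above.

```python
import torch
from torch import nn, optim

def adversarial_train(model, train_loader, epochs=10,
                      eps=0.3, step_size=0.01, num_steps=40):
    """Train a classifier on Linf-bounded PGD adversarial examples."""
    optimiser = optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    for _ in range(epochs):
        for x, y in train_loader:
            # Inner maximisation: craft adversarial examples for this batch
            model.eval()
            x_adv = pgd_attack(model, x, y, eps=eps, step_size=step_size,
                               num_steps=num_steps, norm='inf')

            # Outer minimisation: an ordinary training step on the adversarial batch
            model.train()
            optimiser.zero_grad()
            loss = loss_fn(model(x_adv), y)
            loss.backward()
            optimiser.step()
    return model
```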

The following plots quantify the adversarial accuracy of models trained against L² and L∞ adversaries — there’s definitely an improvement in robustness.

There are a few interesting points to note here:

  • The L∞ trained model (purple) is more robust against both L² (left) and L∞ (right) bounded attacks
  • Both robust models exhibit lower accuracy on natural samples (epsilon = 0) than the non-robust model: about 0.5% lower for the L∞ model and about 3% lower for the L² model
  • The L² attack appears to saturate in effectiveness (left). This is an artifact of my fixing the number of PGD steps to save compute time; a PGD attack allowed to run until convergence would probably keep gaining effectiveness as epsilon increases

My hypothesis for the poor robust and non-robust performance of the L²-trained model is that L²-bounded perturbations are semantically relevant on MNIST. Looking at the adversarial loss defined earlier, we are training our model to be invariant to perturbations in the set Δ, i.e. L² perturbations. If these are semantically relevant perturbations on MNIST, then L² adversarial training is actively obstructing our model’s ability to learn! (The alternative hypothesis is that I need to search more hyperparameters for L² training…)

Unexpected benefits of a robust model

Adversarial training of an MNIST classifier has some unexpected benefits beyond robustness to attack. The most interesting of these is the ability to smoothly interpolate between classes using large-epsilon adversarial examples. These are produced using the PGD method described earlier, except that we allow the size of the adversarial perturbation to be much larger than the one used in training.
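For example, reusing the `pgd_attack` sketch from earlier, a targeted large-epsilon example on the L²-trained model could be generated like this (the target class and step settings are illustrative; the epsilon matches the figure below):

```python
import torch

# x, y: a single MNIST image and its label (with a batch dimension);
# model: the L2-trained classifier
target = torch.tensor([3])  # class we want the digit to morph into
x_interp = pgd_attack(model, x, y, eps=7.5, step_size=0.1, num_steps=200,
                      norm='l2', y_target=target)
```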

Shown below are targeted large-epsilon adversarial examples created using the PGD attack on the L² trained model. There are some artifacts but the end results are quite clearly images of the desired class. This is possible because the gradients of the robust model in the input space align well with human perception, so following that gradient with PGD produces plausible images.

Leftmost: original sample. Rightmost: targeted (3, 5, 9) adversarial examples with epsilon = 7.5

Trying the same thing with the non-robust model results in garbage images that only bear a slight resemblance to the desired classes.

Gross.

Interestingly, untargeted adversarial attacks on the L² robust model produce a trajectory from the most to the least similar class. Shown below are the adversarial examples and prediction probabilities as we increase the L² norm of the adversarial perturbation from 0 to 10.

Progression of an untargeted attack on the L² model

This phenomenon isn’t unique to MNIST, as Madry et al. were able to produce the same kind of interpolations on an ImageNet-trained model.

Figure 4 from https://arxiv.org/pdf/1805.12152.pdf showing interpolation from turtle -> bird and cat -> dog.

The L∞ model does not produce interpolations as interesting as the L² model’s (check out my Jupyter notebook if you want to generate some), but it does have its own unexpected benefit: very sparse weights.

Sparse weights are considered useful in their own right, as they are more interpretable and more amenable to pruning, and hence to reducing model size. Manual inspection of the 32 convolutional filters in the first layer of the L∞ model shows some interesting traits:

  • Most of the filters are zeros (i.e. weights are sparse)
  • The non-zero filters contain only 1 non-zero weight
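A quick way to check this for yourself (assuming the first convolutional layer is exposed as `model.conv1`, which depends on how the model class is written):

```python
import torch

filters = model.conv1.weight.detach()  # shape: (32, 1, k, k) for the 32 first-layer filters
nonzero_per_filter = (filters.abs() > 1e-6).view(filters.shape[0], -1).sum(dim=1)
print(nonzero_per_filter)  # mostly 0, with the remaining filters containing a single weight
```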

As the non-zero filters have only one weight, they amount to just a rescaling of the original image. Combining this rescaling with a ReLU activation means that these are thresholding filters, i.e. ReLU(ax - b), whose activations remain unchanged by any perturbation smaller than b. Thresholding/binarizing of input data is a well-known adversarial defense as it destroys small perturbations; adversarial training has caused the model to learn this independently!

As deep learning achieves more widespread adoption we must remain cautious not to anthropomorphize ML models, as we have seen that they can fail catastrophically in ways that humans never would. However, some stellar research is being done to prevent these kinds of attacks, and in some cases to certify robustness to particular perturbations, so perhaps truly robust deep networks could be just around the corner.

Thanks for reading. Feel free to read my previous post which highlights some surprising results about adversarial examples or to check out this Jupyter notebook which contains the code to produce all the figures in this post and train your own adversarially robust model.
