
Reinventing adversarial machine learning: adversarial ML from scratch

Adversarial ML from scratch. Let's learn the basics of adversarial machine learning together!

Understanding the limitations and vulnerabilities of neural networks

Photo by Nadine Shaabana on Unsplash

I learn best when I have to describe something from the ground up! In “reinventing” articles, I’ll try to describe the mathematical intuitions necessary to implement a technology for yourself! In this article, we’ll explore the exciting world of adversarial machine learning.

To get started, let’s put down a working definition for adversarial ML:

Adversarial ML involves methods to generate or defend against inputs intended to fool ML models.

To me, adversarial ML presents the most important theoretical challenges to modern ML solutions. It allows us to interrogate the decision boundaries learned by our networks and forces us to tackle our models’ limitations head on. Let’s dive in!


A tortured metaphor…

Bear with me! I think this might be a half-decent motivation! I want to explain why I think adversarial ML is so interesting. To give it context, let’s start with a ludicrous party question: is a Pop-Tart a ravioli?

… The metaphorical question

Let’s unpack why the question makes for a fun debate among friends. Fundamentally, it probes at the edges of our definition of ravioli. The question "is Chef Boyardee ravioli?" makes for less entertaining banter because we all agree (minus the occasional food snobs). Chef Boyardee is a "prototypical" ravioli. It shares almost all characteristics with the quintessential, Platonic ravioli in our heads: it is a square, sauced pasta with filling. Now Pop-Tarts on the other hand…

… The metaphorical answer

People can answer one of two ways. If it is a ravioli, then ravioli are a broad category of food defined by a filling wrapped in a thin layer of dough. If not, then ravioli are, more narrowly, a kind of pasta dish.

In either case, the disagreement is interesting. It forces us to explain what characteristics are necessary and sufficient to fall in the category ravioli. Whereas Chef Boyardee was smack in the center of our "decision boundaries" for ravioli, Pop-Tarts are close to the edge. The debate is fun because the disagreements reveal the different decision boundaries people have.

… The not-so-metaphorical answer

Thanks for bearing with me! I think of adversarial ML as a systematic, theoretically motivated way to ask Neural Networks these sorts of "decision boundary" questions. Starting off with an input sample that a neural network and I 100% agree is from a certain class (i.e., "that is definitely an MNIST number 4!"), I can ask my neural network what edits would need to be made before it claimed the sample was no longer from the class (i.e., "I’d call this number 4 a number 2 if these 20 pixels were white instead of black").

Similar to our ravioli question, finding samples that are right at the edge – at the boundary between classes – helps highlight where my model and I might disagree. For me, this is the crux of machine learning "explainability" and "trust in AI." The question "what do you think ravioli is?" is sometimes best answered by reposing the question as "what changes would I need to make to this ravioli before you called it something else?" If my model and I disagree about which inputs fall on each side of that boundary, I don’t trust it!

Here, I’ll walk through the simplest procedure to generate "adversarial examples" near a model’s decision boundary. Along the way, we’ll build some intuitions about adversarial machine learning. Let’s look at some code!

A basic MNIST architecture

To explain adversarial machine learning methods, we’ll need a target model to attack.

Dataset

I want to demonstrate that adversarial machine learning is possible even for trivial classification problems: classification challenges that have long been considered "solved" in the ML literature. I also want to be able to easily visualize some of the adversarial samples we generate, which is easiest on image data. Ideally, I’d also avoid the problem of sourcing the data myself; I don’t want lots of code artifacts that just handle data loading.

With these criteria in mind, we’ll use the MNIST digits dataset [1]!

Architecture

Similarly, I want to keep the architecture simple, so we can focus on the essentials of adversarial ML.

The resulting architecture is simple (and very linear: note the absence of activations). We have a single convolutional layer followed by max pooling. Finally, we flatten to a dense layer with 10 units, one for each of our output classes.
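Here's a minimal Keras sketch consistent with that description (the exact filter count and kernel size are my assumptions rather than the original hyperparameters):

```python
import tensorflow as tf
from tensorflow.keras import layers

# A single conv layer (no activation), max pooling, then a 10-unit dense head
model = tf.keras.Sequential([
    layers.Input(shape=(28, 28, 1)),
    layers.Conv2D(8, kernel_size=5),   # no activation: the network stays (almost) linear
    layers.MaxPooling2D(pool_size=2),
    layers.Flatten(),
    layers.Dense(10),                  # one raw logit per digit class
])
model.summary()                        # on the order of 10k parameters with these settings
```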

The network only has ~10k parameters and fits quickly, even without any GPU resources.
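Training is equally unremarkable; here's a hedged sketch of the data loading and fitting setup (the optimizer and early-stopping settings are assumptions on my part):

```python
# Load MNIST, add a channel dimension, and scale pixel values into [0, 1]
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train = x_train[..., None].astype("float32") / 255.0
x_test = x_test[..., None].astype("float32") / 255.0

model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)

# Stop as soon as the validation loss stops improving
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=1, restore_best_weights=True
)
model.fit(x_train, y_train, validation_split=0.1, epochs=20, callbacks=[early_stop])
```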

Results

For me, early stopping was triggered after epoch 3. Validation accuracy was ~98% at this point. Plenty good for this demonstration!

A basic adversarial attack

To build intuitions about how adversarial machine learning works, we’ll implement a basic "white box" attack for ourselves and visualize some results!

"White box" attack definition

For this discussion, we’ll focus on "white box" adversarial ML:

A "white box" attacker is assumed to have complete knowledge of the model being targeted: the architecture (i.e., what type of model is it?), the parameters (i.e., what are the weights and bias terms?), and the predicted values for a given input (i.e., what does the model predict for this input?).

If folks are interested in non-"white box" attacks, let me know in the comments!

Gradient following intuition

During neural network training, we are typically engaged in a process of "gradient descent." At each step, we are estimating how we should change the parameters to minimize "loss" (i.e., error). The "gradients" tell us how increasing or decreasing a weight will affect our loss, and we "descend" on this loss surface to improve our performance. We treat our input data as fixed, our output labels as fixed, and our weights as changeable. We update our weights to get a little less wrong at each step.

To generate an adversarial sample, we will just flip this logic! We are now looking for input samples that are similar to real data but end up misclassified by our model. We assume that we already have a fully trained, performant model, so our weights are now fixed. We continue to treat our output labels as fixed. Now, though, we are going to treat our input data as changeable. Instead of minimizing loss, we are going to maximize loss (i.e., "ascend" the loss surface). Our gradients will now tell us how increasing or decreasing our pixel values will affect our loss, and we will "ascend" on the loss surface.

During training, we calculate the derivatives of our loss with respect to our weights and use gradient descent to update the weights to minimize our loss. During attack, we calculate the derivatives of our loss with respect to our input pixels and use gradient ascent to update the input pixels to maximize our loss.
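In symbols, with η as a step size and L as the loss of the network f_w on input x with label y, the two update rules differ only in which variable is free and which direction we step:

```latex
% Training: gradient descent on the weights w (inputs x and labels y held fixed)
w \leftarrow w - \eta \, \frac{\partial \mathcal{L}\big(f_w(x),\, y\big)}{\partial w}

% Attack: gradient ascent on the input pixels x (weights w and labels y held fixed)
x \leftarrow x + \eta \, \frac{\partial \mathcal{L}\big(f_w(x),\, y\big)}{\partial x}
```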

Simple "white box" attack implementation

First, we’ll need some quick Keras logic to get the gradients of our model outputs w.r.t. our input pixels. Rather than maximizing loss indiscriminately, we’ll "target" some desired output class. For instance, if we want a four to be misclassified as a two, we’ll use a target_class of two.

The logic isn’t so bad! We just need one call to batch_jacobian to get our gradient values! Then, we grab only the values that correspond to the target_class we want.
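Here is a minimal sketch of that step; the helper name get_target_gradients is mine, and the tensor shapes assume 28×28 grayscale MNIST inputs with a 10-class output:

```python
import tensorflow as tf

def get_target_gradients(model, images, target_class):
    """Gradients of the model's target_class output w.r.t. each input pixel."""
    images = tf.convert_to_tensor(images)
    with tf.GradientTape() as tape:
        tape.watch(images)        # inputs aren't trainable variables, so watch them explicitly
        outputs = model(images)   # shape: (batch, 10)
    # Per-example Jacobian of every output w.r.t. every pixel: (batch, 10, 28, 28, 1)
    jacobian = tape.batch_jacobian(outputs, images)
    # Keep only the slice for the class we want the model to predict
    return jacobian[:, target_class]   # shape: (batch, 28, 28, 1)
```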

Next, we will want to update our image to increase the loss, increasing the model’s confidence in the target_class.

Again, the logic isn’t so bad! All we did was take that gradient matrix, rescale it a little to make the update size more consistent, add it to the image, and clip to avoid pixel values outside of the [0, 1] range.
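A sketch of that update, reusing get_target_gradients from above (the particular rescaling, dividing by the largest absolute gradient value, and the default step size are my assumptions; any small, consistent step works):

```python
def gradient_ascend(model, images, target_class, step_size=0.01):
    """One gradient-ascent step on the input pixels."""
    gradients = get_target_gradients(model, images, target_class)
    # Rescale so every step has a comparable size, regardless of the raw gradient magnitude
    gradients = gradients / (tf.reduce_max(tf.abs(gradients)) + 1e-8)
    # Nudge the pixels toward higher target-class confidence, then clip back into [0, 1]
    return tf.clip_by_value(images + step_size * gradients, 0.0, 1.0)
```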

This makes sense if we consider what the gradient values for our output class tell us at each step. Positive values of the gradient matrix tell us that making the pixel at that location brighter white (i.e., closer to 1) will result in higher confidence in the target class. Similarly, negative values of the gradient matrix tell us that making the pixel at that location darker (i.e., closer to 0) will result in higher confidence in the target class.

By adding the rescaled gradient values to our image, we are gradually making an image that is classified with higher confidence as the target class!

Visualized sample

Let’s grab a random sample and use the logic we’ve developed to create a visually similar image that gets misclassified by our network.

Image by author. Current prediction: 5; Confidence: 1.00.

Now that we have an image sample, we can see how the image was originally classified by our model.
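Something along these lines, where tf.nn.softmax turns the raw logits into confidences (the sample index is arbitrary; the figure above happens to show a 5):

```python
import numpy as np

sample = x_test[:1]  # any confidently classified test digit will do

probs = tf.nn.softmax(model(sample), axis=-1).numpy()[0]
print(f"Current prediction: {np.argmax(probs)}; Confidence: {np.max(probs):.2f}")
```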

As expected, our model correctly classifies the sample with very high confidence. Now, let’s generate an adversarial sample using this image as a starting point…


Easy enough. We are just repeatedly performing the gradient_ascend step we previously defined in a loop until the classification behavior changes. We terminate once we’ve tricked our model.
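A sketch of that loop, reusing the sample and the gradient_ascend helper defined above (the iteration cap is just a safeguard, and the target_class of 4 matches the misclassification shown below):

```python
adversarial = tf.convert_to_tensor(sample)
target_class = 4  # we want the model to call this digit a "4"

for _ in range(1000):  # cap the number of steps in case the attack stalls
    if np.argmax(model(adversarial).numpy()) == target_class:
        break  # stop as soon as the model is fooled
    adversarial = gradient_ascend(model, adversarial, target_class)

probs = tf.nn.softmax(model(adversarial), axis=-1).numpy()[0]
print(f"Current prediction: {np.argmax(probs)}; Confidence: {np.max(probs):.2f}")
```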

Image by author. Current prediction: 4; Confidence: 0.32.

Wow! We now have an image that is nearly identical, but our model classifies it as something else. Our simple white box attack worked!

Interestingly, the sample is still clearly recognizable to a human. In fact, the modifications look like fairly random tweaks to individual pixels. We have found a sample right at the edge between the model’s definitions of a 4 and a 5.

Tying this back to our initial metaphor, the model has told me where it thinks the "boundary" is between 4s and 5s. Obviously, we disagree…

Parting thoughts

How cool is this?! We can quickly generate visually similar samples that our model misclassifies. If we looked only at the held-out accuracy on our test data (~98%), we would have been confident that this model was behaving as expected. Under closer examination, though, it is obvious that the model has learned decision boundaries that don’t agree with ours: those don’t look like fours to me!

Adversarial ML presents interesting and important challenges to model trust, interpretability, and explainability. I hope you come away from this article with a similar appreciation for the field!

Leave a comment if you have a correction or would like to recommend additional clarification. Thanks for reading!


1. LeCun, Y. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998. Creative Commons Attribution-Share Alike 3.0 license.
