1
Explaining and Harnessing Adversarial Examples
Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy, Google Inc. Presented by Jonathan Dingess
2
“panda” (57.7% confidence); “gibbon” (99.3% confidence). In this paper we will be discussing adversarial examples, so first off: what is an adversarial example? Here is one generated using the methods defined later in this paper. The picture on the left is a panda, for sure; the neural network is 57.7% confident. On the right, we have a gibbon, a type of ape, and the neural network is very, very confident. What went wrong? The neural network used is GoogLeNet, a 22-layer deep convolutional network that claimed state-of-the-art results on the ImageNet database at the time of its release. The left image is from ImageNet; the right image is an adversarial input generated from the left.
3
A quick note Adversarial examples are, by and large, not naturally occurring. Neural networks are not the only algorithms prone to adversarial examples; even linear classifiers have this issue. This isn't meant as a presentation of a flaw, but as an opportunity to improve neural networks. A quick note, first: it's common in computer vision to treat the feature space of a convolutional neural network as a space in which Euclidean distance approximates perceptual distance, i.e. how alike two things look. That view is clearly flawed, since images with an imperceptibly small difference can be assigned to completely different classes. Still, these results don't really represent a flaw in neural networks as such. As we're about to see, adversarial examples are designed to be worst-case perturbations, and they do not occur naturally. In addition, this isn't just a problem for neural networks; the same issue can be found in linear classifiers too.
4
Objectives Given a neural network and a dataset, efficiently generate adversarial examples Use adversarial examples as training data to regularize a neural network Explain and demystify adversarial examples Analyze some interesting properties of adversarial examples In this paper we want to harness adversarial examples as a regularization technique, just like you might use dropout. Our objectives are to efficiently generate adversarial examples given a neural network and a dataset and use these generated examples to regularize a neural network. We also want to explain and demystify adversarial examples and analyze some interesting properties of them.
5
What causes this? First, consider a linear model
With 8 bits for each pixel, a pixel can have values of 0-255. Our model can still process non-integer values. Consider the input $\tilde{x} = x + \eta$ with $\|\eta\|_\infty < \varepsilon$. Then $w^\top \tilde{x} = w^\top x + w^\top \eta$, and the perturbation is maximized by $\eta = \varepsilon \cdot \mathrm{sign}(w)$. For this section, we're concerned only with adversarial examples in linear models; this ends up being important for our nonlinear neural networks too, though. Consider an image classifier. The input values are the values of each pixel. If we assume the image is encoded in 8 bits, then these values are in the range 0 through 255. However, our first layer still accepts values like 4.5, or other values that clearly make no sense in the real world. Because the precision of the encoding is limited, it doesn't really make sense for the neural network to respond differently to an input like $\tilde{x}$ here, where $\varepsilon$ is the precision of the encoding, in our case 1/255. You'd expect it to give the same output, albeit maybe at a lower confidence. When you multiply this $\tilde{x}$ by our weights, though, you can start to see the problem: depending on how high your dimensionality is, the extra term $w^\top \eta$ can be a very large change in activation. We can maximize this change, subject to the max-norm constraint, by choosing the direction vector $\eta = \varepsilon \cdot \mathrm{sign}(w)$, which maximizes the dot product. The size of the perturbation depends only on $\varepsilon$, the precision of the input, but the effect it has grows linearly with the dimensionality of the model's input layer.
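To make the dimensionality argument concrete, here is a minimal NumPy sketch (my own illustration, not from the paper; the toy linear model, the image size, and the value of ε are illustrative assumptions). It shows that a perturbation aligned with sign(w) changes each pixel by at most ε yet shifts the activation by ε·‖w‖₁, which grows with the input dimension.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 224 * 224 * 3            # dimensionality of a flattened colour image (illustrative)
eps = 1.0 / 255.0            # perturbation at the precision of an 8-bit encoding

w = rng.normal(size=n)       # weights of a toy linear model
x = rng.uniform(size=n)      # a "clean" input with pixel values in [0, 1]

eta = eps * np.sign(w)       # worst-case perturbation under the max-norm constraint
x_tilde = x + eta

# Each pixel changes by at most eps ...
print("max pixel change :", np.max(np.abs(x_tilde - x)))
# ... yet the activation shifts by eps * ||w||_1, which grows with n.
print("activation change:", w @ x_tilde - w @ x)
print("eps * ||w||_1    :", eps * np.abs(w).sum())
```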
6
Non-linear models Does our linear perturbation work just as well on non-linear models? Some neural networks are intentionally designed to behave in very linear ways to be easier to optimize. Define the "fast gradient sign method": $\eta = \varepsilon \cdot \mathrm{sign}(\nabla_x \mathcal{L}(\theta, x, y))$. We have this efficient way of generating adversarial examples for linear models; does the same thing work on nonlinear models? Well, a lot of neural networks are already designed to behave in very linear ways: LSTMs, ReLUs, and maxout networks are all very linear in nature. Sigmoid networks less so, but they are still vulnerable to this sort of perturbation. The linearity in our neural networks makes them easier to optimize. It's not really a huge surprise, then, that the same method of generating adversarial examples works for these nonlinear models too. Instead of using the sign of the weights to maximize the perturbation in the input, we now use the sign of the gradient of the loss function with respect to the input. This gradient can be computed cheaply with backpropagation, so this is still an efficient method.
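As a concrete illustration, here is a minimal PyTorch sketch of the fast gradient sign method (my own sketch, not code from the paper; `model` is assumed to return logits and `x` to be a batch of images with values in [0, 1]).

```python
import torch
import torch.nn.functional as F

def fgsm_example(model, x, y, eps):
    """Return x + eps * sign(grad_x loss): the fast gradient sign perturbation."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)      # model(x) is assumed to return logits
    grad, = torch.autograd.grad(loss, x)     # gradient of the loss w.r.t. the input
    x_adv = x + eps * grad.sign()            # worst-case max-norm perturbation
    return x_adv.clamp(0.0, 1.0).detach()    # keep pixels in a valid range
```

A call such as `fgsm_example(model, images, labels, eps=0.007)` would produce the kind of perturbation behind the panda example, assuming a suitably trained model.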
7
And as we've already seen, it works
And as we've already seen, it works. This adversarial example from before was generated using the fast gradient sign method. Here their epsilon was 0.007, which corresponds to the magnitude of the smallest bit of an 8-bit image encoding after the model's conversion to real numbers.
8
Adversarial Training Unlike shallow linear models, deep neural networks can become resistant to adversarial examples. $\tilde{\mathcal{L}}(\theta, x, y) = \alpha\, \mathcal{L}(\theta, x, y) + (1-\alpha)\, \mathcal{L}\big(\theta,\ x + \varepsilon\, \mathrm{sign}(\nabla_x \mathcal{L}(\theta, x, y)),\ y\big)$. Deep neural networks are universal approximators, so of course they can represent a function that is resistant to adversarial examples; we just have to train them to do so. We want to do this because it has a regularizing effect on the network. Much as you might use dropout to train different features to work well in isolation, here we give the network input it would never be given naturally in order to train it to resist perturbation. We use this modified loss function, which is part the actual loss for the data point x and part the loss on an adversarial example generated from that loss for x. They used α = 1/2 and found good results, so they didn't experiment with it much, though they note that another value might theoretically be better.
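Here is a minimal PyTorch sketch of this objective, under the same assumptions as the FGSM sketch above (my illustration, not the authors' code; `alpha`, `eps`, and the clamping range are assumptions).

```python
import torch
import torch.nn.functional as F

def adversarial_loss(model, x, y, eps, alpha=0.5):
    """Blend the clean loss with the loss on FGSM-perturbed inputs."""
    # Clean loss; its gradient w.r.t. x also gives the perturbation direction.
    x = x.clone().detach().requires_grad_(True)
    clean_loss = F.cross_entropy(model(x), y)

    # retain_graph so the clean loss can still be backpropagated later.
    grad = torch.autograd.grad(clean_loss, x, retain_graph=True)[0]
    x_adv = (x + eps * grad.sign()).clamp(0.0, 1.0).detach()

    adv_loss = F.cross_entropy(model(x_adv), y)
    return alpha * clean_loss + (1.0 - alpha) * adv_loss
```

In a training loop this would replace the usual loss, e.g. `adversarial_loss(model, xb, yb, eps=0.25).backward()` (the ε value is illustrative), so that fresh adversarial examples are generated against the current parameters at every step.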
9
Thoughts Adversarial training can be thought of as minimizing the worst-case error when the input is perturbed maliciously. Think of it as active learning, where new points are generated far away in input space but perceptually close, and we heuristically give the new point the same label as the data it is close to. Training on adversarial examples is somewhat different from other data augmentation schemes; usually, one augments the data with transformations such as translations that are expected to actually occur in the test set. This form of data augmentation instead uses inputs that are unlikely to occur naturally but that expose flaws in the way the model conceptualizes its decision function.

So, you can think of adversarial training as minimizing the worst case of what could happen if someone were to maliciously modify your input. Again, these inputs don't occur naturally, but you can still think of them as a small amount of noise in that range from negative epsilon to positive epsilon. You can also think of it like active learning: the algorithm requests a label for the new points it generates, and instead of a human assigning that label, we heuristically give it the same label as the data it was generated from, because of the small perceptual difference. In fact, if you wanted to, you could train on all points in that negative-epsilon to positive-epsilon box around your input; they would all be perceptually similar to the original. The thing is, just sampling points from that box over and over is extremely inefficient. If you recall, the dot product between the weights and the noise is what makes the input hard for the network: the larger that dot product, the more dissimilar to the original the input appears to the network. For noise with zero mean, the expected value of the dot product between the weights and the randomly generated noise is zero, so most randomly generated examples will appear, both to our eyes and to the network, essentially the same as the original. The fast gradient sign method ensures a very large dot product, so even though the image appears the same to our eyes, it will seem very dissimilar to the network.
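A quick NumPy illustration of that last point (my own sketch; the dimension, ε, and the random weight vector are arbitrary assumptions): random perturbations drawn from the ε-box have dot products with w that are tiny compared to the worst case achieved by the sign direction.

```python
import numpy as np

rng = np.random.default_rng(0)
n, eps = 784, 0.25            # MNIST-sized input and a max-norm budget (both illustrative)
w = rng.normal(size=n)        # stand-in for the weight / gradient direction

# Random perturbations sampled from the epsilon-box: zero mean, so the
# dot product with w is small on average.
random_etas = rng.uniform(-eps, eps, size=(10_000, n))
random_dots = random_etas @ w

# The sign direction attains the worst case, eps * ||w||_1.
worst_case = eps * np.abs(w).sum()

print("typical |w . eta| for random noise:", np.abs(random_dots).mean())
print("worst-case w . eta (sign of w)    :", worst_case)
```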
10
Results From 0.94% error down to 0.84% on the same architecture
Down to 0.77% with higher dimensionality. More resistant to adversarial examples: without adversarial training, the model misclassified adversarial examples 89.4% of the time; with training, this fell to 17.9%. They trained maxout networks on MNIST, both with this method and without. With dropout but without this method, they achieved an error rate of 0.94%. With this method and dropout, they got to 0.84% with the same architecture. We all know that increasing the dimensionality of your hidden layers too much will usually lead to overfitting, but adversarial training can actually compensate for that a bit: by using 1600 units per layer instead of the 240 they were using before, with adversarial training and dropout, they reached an error rate of 0.77%. The adversarially trained network also became much more resistant to adversarial examples. The original model misclassified adversarial examples 89.4% of the time; with training, that number became 17.9%. Unfortunately, even when the model is trained against them, when it misclassifies an adversarial example it is still very confident in its misclassification, with an average confidence of 81.4% on misclassified examples. For the sake of a control, they also trained a network adversarially but on randomly generated noise instead of the fast gradient sign method; that model misclassified adversarial examples 86.2% of the time with an average confidence of 97.3%. Clearly this isn't much of an improvement compared to not training against them at all.
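For reference, here is a rough sketch of how those two metrics (error rate on adversarial examples and average confidence on the mistakes) could be computed, reusing the hypothetical `fgsm_example` helper sketched earlier; this is my illustration, not the authors' evaluation code.

```python
import torch

def adversarial_error_and_confidence(model, loader, eps):
    """Error rate on FGSM examples and average confidence on the mistakes."""
    wrong, total, mistake_confidences = 0, 0, []
    for x, y in loader:
        x_adv = fgsm_example(model, x, y, eps)       # helper sketched earlier
        with torch.no_grad():
            probs = torch.softmax(model(x_adv), dim=1)
        conf, pred = probs.max(dim=1)
        mistakes = pred != y
        wrong += int(mistakes.sum())
        total += y.numel()
        mistake_confidences.extend(conf[mistakes].tolist())
    avg_conf = sum(mistake_confidences) / max(len(mistake_confidences), 1)
    return wrong / total, avg_conf
```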
11
Other Considerations I
$\tilde{\mathcal{L}}(\theta, x, y) = \alpha\, \mathcal{L}(\theta, x, y) + (1-\alpha)\, \mathcal{L}\big(\theta,\ x + \varepsilon\, \mathrm{sign}(\nabla_x \mathcal{L}(\theta, x, y)),\ y\big)$. The gradient can't predict the adversary's changes because of the sign function. So, some more notes: the derivative of the sign function is zero or undefined everywhere, so gradient descent on this adversarial objective function, based on the fast gradient sign method, doesn't allow the model to anticipate how the adversary will react to changes in the parameters. They considered using, instead of the sign of the gradient, small rotations or the addition of the scaled gradient, but they didn't find nearly as powerful a regularizing effect from these. They assume it's because these kinds of adversarial examples end up being too easy to solve.
12
Other Considerations II
Should we perturb the input, the hidden layers, or both? The results are inconsistent. There is no sense in perturbing the final layer. What about other network types? Some low-capacity networks, such as RBF networks, are resistant to adversarial examples. So far we've been discussing only changing the input, not stepping through and perturbing each layer's input. The results here are inconsistent: in another one of their papers they found that perturbing each layer improved results on a sigmoidal network, but here, when working with the fast gradient sign method, they found that perturbing the hidden layers just resulted in the hidden unit activations becoming very large to accommodate it. One thing to note is that adversarial training is only really useful if the model has the capacity to learn to resist it, i.e. if it is a universal approximator. The last layer of a neural network is not a universal approximator of functions of its inputs like the other layers, so they didn't perturb the final layer in their experiments. They tested this method on a variety of different networks and found that networks with low capacity were much more resistant to adversarial examples; the example network they used in the paper was a radial basis function (RBF) network. Here, "resistant" means that although the network would still misclassify a large number of adversarial examples, when it did so it had extremely low confidence. Their RBF network had an error rate of 55.4% on adversarial examples, but the misclassified examples had an average confidence of only 1.2%. Compare that to the 99.3% confidence on the panda picture from before.
13
Interesting Properties
Adversarial Examples generalize An adversarial example generated for one neural network will often be an adversarial example for others. The most interesting property they found is that adversarial examples generalize. An adversarial example generated for a specific network will often still serve as an adversarial example for other networks, even if they have different architectures or were trained on different subsets of the dataset. In addition, when misclassifying these examples, the different networks will often misclassify them in the same way. This is especially surprising when you consider their test results showing that choosing a uniformly random direction vector η within that epsilon max-norm box will usually result in very little change to the classification at all. It's only these specific worst-case direction vectors η that cause these huge misclassifications.

What they found, though, is that adversarial examples actually occur in very broad subspaces. All you need is for η to have a positive dot product with the gradient of the cost function and for epsilon to be large enough. In other words, the direction η defines a one-dimensional cross-section of input space, and adversarial examples occur along a contiguous range of epsilon values within it rather than at isolated points.

The figure shows the classification for different values of epsilon. When epsilon is very negative, you can see that black and red take over, with red meaning a classification of 0 and black meaning 6. As epsilon becomes more positive, the blue 5 classification takes over. Only near epsilon = 0 does the model correctly classify the input as a 4. On the right are the inputs generated with the same direction vector but different values of epsilon: with very large epsilon, the input just becomes rubbish and isn't helpful to training at all. They call these high-epsilon examples rubbish examples. The yellow boxes mark the inputs that are correctly identified. All of the examples after the yellow boxes, adversarial or rubbish, correspond to the same class, and largely so do the ones before them (though the assigned class appears to switch from 6 to 0 somewhere on the far negative side). Either way, they serve to show that adversarial examples are more abundant than you'd think, occurring in these broad subspaces.

This abundance helps explain why an adversarial example for one network is often an adversarial example for another, but it doesn't fully explain why different networks misclassify them in the same way. To explain that, they hypothesize that neural networks, even when trained with different methodologies, all resemble the linear classifier learned on the same training set. As support for this hypothesis, they generated adversarial examples from a deep maxout network and compared how a shallow softmax network and a shallow RBF network classified these foreign adversarial examples. Looking at cases where both models made a mistake, the softmax network predicted the same (wrong) class as the maxout network 86.4% of the time, while the RBF network predicted maxout's class only 54.3% of the time. Their hypothesis doesn't explain all of the maxout network's mistakes, or all of the mistakes that generalize across models, but it suggests that a significant proportion of them are consistent with linear behavior being a major cause of cross-model generalization.

Another quick note: usually you would use an epsilon below what is visible to a human. Throughout the paper they use a larger epsilon on MNIST data because the images are essentially ink or no ink, and humans can still recognize the digit even with larger changes. Hence, even within the yellow examples that are classified correctly, you can see a noticeable change.
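The epsilon sweep described here is straightforward to reproduce in outline. Below is a small PyTorch sketch (my own, not the authors' code) that records the predicted class as a single input is moved along the fast-gradient-sign direction over a range of negative and positive ε values; `model`, `x` (a batch containing one image), and `y` are placeholders.

```python
import torch
import torch.nn.functional as F

def classes_along_fgsm_direction(model, x, y, epsilons):
    """Predicted class as we move along the fast-gradient-sign direction.

    Mirrors the epsilon-sweep figure: one clean input, many epsilon values
    (negative and positive), one predicted class per value.
    """
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    grad, = torch.autograd.grad(loss, x)
    direction = grad.sign()

    predictions = []
    with torch.no_grad():
        for eps in epsilons:
            # No clamping here: large |eps| deliberately produces the
            # "rubbish" inputs discussed above.
            logits = model(x + eps * direction)
            predictions.append(logits.argmax(dim=1).item())
    return predictions

# e.g. classes_along_fgsm_direction(model, image, label,
#                                   epsilons=torch.linspace(-0.5, 0.5, 41))
```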
14
Summary and Discussion I
Adversarial examples can be explained as a property of high-dimensional dot products; they are a result of models being too linear rather than too nonlinear. The generalization of adversarial examples across different models can be explained as a result of adversarial perturbations being highly aligned with the weight vectors of a model, and of different models learning similar functions when trained to do the same task. There, that wasn't so bad. In summary, adversarial examples are a property of high-dimensional dot products, and it is the linearity within models that causes them rather than the nonlinearity, which was thought to be the cause before this paper. Examples generated for one model are often examples for another model because the models learn similar functions to achieve their task, both behaving similarly to the linear classifier.
15
Summary and Discussion II
The direction of perturbation, not the specific point in space, is what matters most. Because it is the direction that matters most, adversarial examples generalize across different clean examples. We have a family of fast methods to generate adversarial examples. Adversarial training can provide regularization beyond what dropout provides. It's the direction that matters most when perturbing the input, leading to subspaces full of adversarial examples identified by a direction vector. This is part of what causes adversarial examples to generalize across different clean examples. In this paper we looked at a few different methods to generate adversarial examples, most notably the fast gradient sign method. We can use this together with dropout to regularize even further, lowering the error rate and slightly correcting for overfitting.
16
Summary and Discussion III
Models that are easy to optimize are easy to perturb. Linear models lack the capacity to resist adversarial perturbations; only structures with a hidden layer should be trained to resist adversarial perturbation. Linearity in our neural networks makes them much easier to optimize, but the more linear the model is, the easier it is to perturb. A model can only benefit from adversarial training if the universal approximator theorem applies to it, that is, if it has at least one hidden layer. Finally, the existence of adversarial examples suggests that being able to explain the training data, or even being able to correctly label the test data, does not imply that our models truly understand the tasks we have asked them to perform. Instead, their linear responses are overly confident at points that do not occur in the data distribution, and these confident predictions are often highly incorrect. We have shown that we can correct for this problem by explicitly identifying these problematic points and correcting the model at each of them, but one might also conclude that the model families we use are intrinsically flawed: ease of optimization has come at the cost of models that are easily misled. Neural networks are extremely powerful tools, but sometimes, even when they obtain excellent results on the test set, they are not really learning the underlying concepts that determine each correct output label, and adversarial examples can expose this flaw.
17
Questions? Goodfellow, I. J., Shlens, J., & Szegedy, C. (2014). Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572. Thank you