# DISSECTING LOCAL PROPERTIES OF ADVERSARIAL EXAMPLES
**Anonymous authors**
Paper under double-blind review
ABSTRACT
Adversarial examples have attracted significant attention over the years, yet they remain insufficiently understood, especially when analyzed in combination with adversarial training. In this paper, we revisit some properties of adversarial examples from both the frequency and the spatial perspective: 1) the special high-frequency components of adversarial examples tend to mislead naturally-trained models while having little impact on adversarially-trained ones, and 2) adversarial examples show disorderly perturbations on naturally-trained models and locally-consistent (image-shape-related) perturbations on adversarially-trained ones. Motivated by these observations, we analyze the fragile tendency of models through the generated adversarial perturbations, and propose a connection between model vulnerability and the local intermediate response. That is, a smaller local intermediate response difference comes along with better model adversarial robustness. Specifically, we demonstrate that: 1) DNNs are naturally fragile, at least for large enough local response differences between adversarial and natural examples, and 2) smoother adversarially-trained models can alleviate local response differences with enhanced robustness.
1 INTRODUCTION
Although deep neural networks (DNNs) perform well in many fields (He et al., 2016; Devlin et al., 2019), their counter-intuitive vulnerability attracts increasing attention, both for safety-critical applications (Sharif et al., 2016) and for the black-box mechanism of DNNs (Fazlyab et al., 2019). DNNs have been found vulnerable to adversarial examples (Szegedy et al., 2014; Goodfellow et al., 2015), where small perturbations on the input can easily change the predictions of a well-trained DNN with high confidence. In computer vision, adversarial examples exhibit their destructiveness both in the digital world and in the physical world (Kurakin et al., 2017).

How to alleviate the vulnerability of DNNs, so as to narrow the performance gap between adversarial and natural examples, is another key issue. Existing methods, including defensive distillation (Papernot et al., 2016) and pixel denoising (Liao et al., 2018), have shown their limitations due to follow-up attack strategies (Carlini & Wagner, 2017) or gradient masking (Athalye et al., 2018). Among them, adversarial training (Goodfellow et al., 2015; Madry et al., 2018) and its variants (Zhang et al., 2019; Wang et al., 2020b) provide reliable robustness and outperform other defenses. Moreover, as a data augmentation method, adversarial training currently seems to rely on additional data (Schmidt et al., 2018; Rebuffi et al., 2021) to further improve robustness, while being sensitive to basic model hyper-parameters, e.g., weight decay (Pang et al., 2021). Apart from these, simple early stopping (Rice et al., 2020) even exceeds some promotion methods according to recent benchmarks (Croce & Hein, 2020; Chen & Gu, 2020). These studies raise our curiosity to further explore the relationship between adversarial examples and adversarial training, hoping to provide some new understanding.
High-frequency components have been tentatively linked to adversarial examples (Wang et al., 2020a; Yin et al., 2019; Harder et al., 2021); however, few explorations discuss the relationship between high-frequency components and the destructiveness of adversarial examples. In this paper, we first demonstrate that the high-frequency components of adversarial examples tend to mislead standard DNNs, yet have little impact on adversarially robust models. We further show that adversarial examples statistically have more high-frequency components than natural ones, indicating relatively drastic local changes among pixels of adversarial examples. Since adversarial examples appear more semantically meaningful on robust models (Tsipras et al., 2019), we further notice that adversarial examples show locally-consistent perturbations related to image shapes on adversarially-trained models, in contrast to the disorderly perturbations on standard models. Both the frequency-domain and spatial-domain explorations emphasize the local properties of adversarial examples; motivated by the local receptive field of convolution kernels, we propose a locally intermediate response perspective to rethink the vulnerability of DNNs. Different from the existing global activation perspective (Bai et al., 2021; Xu et al., 2019), our local perspective reflects the joint effect of local features and the intermediate layers of the model. Based on this local perspective, we emphasize that large enough local response differences make it difficult for the network to treat an image and its potentially adversarial examples as one category, and we demonstrate that DNN models are naturally fragile at least partly because of this. Motivated by the observation that adversarially-trained models tend to have 'smooth' kernels (Wang et al., 2020a), we simply use smoother kernels to alleviate local response differences on adversarially-trained models, which in turn improves model robustness and reduces robust overfitting (Rice et al., 2020). To a certain extent, this explains why weight decay effectively affects model robustness (Pang et al., 2021).
Our main contributions are summarized as follows:
- We first reveal some properties of adversarial examples in the frequency and spatial domains: 1) the high-frequency components of adversarial examples tend to mislead naturally-trained DNNs, yet have little impact on adversarially-trained models, and 2) adversarial examples show locally-consistent perturbations on adversarially-trained models, compared with disorderly local perturbations on naturally-trained models.
- Then we introduce the local response and emphasize its importance for model adversarial robustness. That is, naturally-trained DNNs are often fragile, at least because of the non-negligible local response differences, through the same layer, between potentially adversarial examples and natural ones. In contrast, adversarially-trained models effectively alleviate the local response differences, and smoother adversarially-trained models show better adversarial robustness as they further reduce local response differences.
- Finally, we empirically study the local response with generated adversarial examples. We show that, compared with failed attacks, adversarial examples (successful attacks) statistically show larger local response differences from natural examples. Moreover, compared with adversarial examples generated by the model itself, those transferred from other models show markedly smaller local response differences.
2 RELATED WORK
**Understandings of model vulnerability.** Since the discovery of adversarial examples (Szegedy et al., 2014), a number of understandings of model vulnerability have been developed. For instance, the linear property of DNNs (Goodfellow et al., 2015), submanifolds (Tanay & Griffin, 2016) and the geometry of the manifold (Gilmer et al., 2018) were considered from the high-dimensional perspective; the computation of Lipschitz constants (Szegedy et al., 2014; Fazlyab et al., 2019) and lower/upper bounds (Fawzi et al., 2018; Weng et al., 2018) were considered from the definition of model robustness; non-robust features (Ilyas et al., 2019), high-frequency components (Wang et al., 2020a; Yin et al., 2019), high-rank features (Jere et al., 2020) and high-order interactions among pixels (Ren et al., 2021) explored adversarial examples from different image-level perspectives, which hint at our local perspective; feature denoising (Xie et al., 2019), robust pruning (Madaan & Ju Hwang, 2020) and activation suppressing (Bai et al., 2021) focused on global intermediate activations of models. On the other hand, taking adversarial training into consideration, Tsipras et al. (2019), Schmidt et al. (2018) and Zhang et al. (2019) explored the trade-off between robustness and accuracy; Wang et al. (2020a) found that adversarially-trained models tend to show smooth kernels; Tsipras et al. (2019) and Zhang & Zhu (2018) argued that adversarially robust models learn more shape-biased representations.

Different from these studies, we characterize adversarial examples in both the frequency and the spatial domain to emphasize their local properties, and propose a locally intermediate response perspective to rethink the vulnerability of DNNs.
**Adversarial training.** Adversarial training can be seen as a min-max optimization problem:
$$\min_{\theta} \sum_{i=1}^{n} \max_{x'_i \in \mathcal{B}(x_i)} \mathcal{L}\big(f_\theta(x'_i), y_i\big),$$
where $f$ denotes a DNN model with parameters $\theta$, and $(x_i, y_i)$ denotes a pair of a natural example $x_i$ and its ground-truth label $y_i$. Given a classification loss $\mathcal{L}$, the inner maximization problem can be regarded as searching for suitable perturbations within the boundary $\mathcal{B}$ that maximize the loss, while the outer minimization problem optimizes the model parameters on the adversarial examples $\{x'_i\}_{i=1}^{n}$ generated by the inner maximization.
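As a concrete illustration of this min-max procedure, below is a minimal PyTorch-style sketch in which the inner maximization is approximated by projected gradient ascent (PGD) with random start, as in Madry et al. (2018); the function names and the training-loop structure are illustrative assumptions, not code from this paper.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, alpha=2/255, steps=10):
    """Approximate the inner maximization: search an l_inf ball B(x, eps)
    for a point that maximizes the classification loss (random start + ascent)."""
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1).detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        # Gradient ascent step, then projection back onto B(x, eps) and [0, 1].
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1)
    return x_adv.detach()

def adversarial_training_epoch(model, loader, optimizer, device="cuda"):
    """Outer minimization: update model parameters on the generated adversaries."""
    model.train()
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        x_adv = pgd_attack(model, x, y)          # inner maximization
        optimizer.zero_grad()
        loss = F.cross_entropy(model(x_adv), y)  # outer minimization objective
        loss.backward()
        optimizer.step()
```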
3 DIFFERENT PERTURBATIONS AFFECT THE VULNERABILITY OF THE MODEL
In this section, we first investigate adversarial examples in the frequency and spatial domains, and show the connections between models and their adversarial examples. Note that our threat model is a white-box one, and since an adversarial example is defined only by its misleading result under the legal threat model, our findings suggest that the properties of the sampled examples broadly reflect the fragile tendency of the model.
**Setup.** We generate $\ell_\infty$-bounded adversarial examples by PGD-10 (maximum perturbation $\epsilon = 8/255$ and step size $2/255$) with random start for the robust model (Madry et al., 2018). Specifically, we use ResNet-18 (He et al., 2016) as the backbone to train the standard and adversarially-trained models for 100 epochs on CIFAR-10 (Krizhevsky, 2009). Following Pang et al. (2021), we use the SGD optimizer with momentum 0.9, weight decay $5 \times 10^{-4}$ and initial learning rate 0.1, with a three-stage schedule that divides the learning rate by 10 at epochs 50 and 75. The robustness of both models is evaluated by PGD-20 (step size 1/255).
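For reference, a minimal sketch of this training configuration in PyTorch; the use of the torchvision ResNet-18 is an assumption (CIFAR-style variants typically modify the first convolution), and only the optimizer and schedule follow the stated hyper-parameters.

```python
import torch
from torchvision.models import resnet18

# CIFAR-10 has 10 classes; the torchvision ResNet-18 is a stand-in for the
# paper's backbone, whose exact implementation is not specified.
model = resnet18(num_classes=10)

optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=5e-4)
# Three-stage schedule: divide the learning rate by 10 at epochs 50 and 75.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                 milestones=[50, 75], gamma=0.1)

for epoch in range(100):
    # adversarial_training_epoch(model, loader, optimizer)  # see the sketch in Section 2
    scheduler.step()
```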
3.1 THE DESTRUCTIVENESS OF HIGH-FREQUENCY COMPONENTS FROM ADVERSARIES
[Figure 1 panels: test accuracy (y-axis, 0.0-1.0) versus frequency scale (x-axis, 0.0-1.4) for natural vs. adversarial examples: (a) filtered images (STD), (b) filtered images (ADV), (c) merged images (STD), (d) merged images (ADV).]
Figure 1: The destructiveness of high-frequency components from natural and adversarial examples on both standard (STD) and adversarially-trained (ADV) models. Shown above are well-trained models tested with (a)-(b) images passed through a low-pass filter and (c)-(d) frequency-swapped images.
Inspired by Wang et al. (2020a), we are naturally curious whether adversarial examples cause considerable damage to the model mainly because of their high-frequency components. To answer this question, Figure 1 illustrates the trend of model performance and robustness on the test set as more high-frequency components are included (a larger filtering scale means more high-frequency components are retained in the filtered images). Figure 1(a) shows that, for standard models, the high-frequency components of natural examples promote classification and the accuracy reaches the model performance (green line); on the contrary, the accuracy on filtered adversarial examples first rises to a peak of 47.5% and then drops rapidly to 0.0% (red line). Obviously, in the low-frequency range, the performance on natural and adversarial examples is quite close, yet more high-frequency components widen the gap. That is, the special high-frequency components caused by adversarial perturbations exhibit a clear destructive effect on standard models, and simply filtering them out can effectively alleviate the destructiveness of adversaries even on standard models. However, for robust models, Figure 1(b) shows that the prediction accuracy finally reaches the model robustness without a rapid drop. Surprisingly, we find that the accuracy on filtered adversarial examples in some range exceeds the final robustness of 47.5% (red line), reaching a maximum of 51.2%. That is, although these high-frequency components do not exhibit a clear destructive effect, simply filtering them out has a positive impact on alleviating robust overfitting (Rice et al., 2020).
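As a rough sketch of how such low-pass filtering can be implemented (the paper does not specify its exact filter; a centered circular mask on the shifted 2D FFT, with radius proportional to the frequency scale, is assumed here):

```python
import torch

def low_pass_filter(images, scale):
    """Keep only frequencies within a centered radius proportional to `scale`.
    images: (N, C, H, W) tensor in [0, 1]; a larger scale keeps more
    high-frequency content, loosely matching the 'frequency scale' axis."""
    n, c, h, w = images.shape
    spec = torch.fft.fftshift(torch.fft.fft2(images), dim=(-2, -1))
    yy, xx = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    dist = torch.sqrt((yy - h / 2) ** 2 + (xx - w / 2) ** 2)
    mask = (dist <= scale * h / 2).float()          # circular low-pass mask
    filtered = torch.fft.ifft2(torch.fft.ifftshift(spec * mask, dim=(-2, -1)))
    return filtered.real.clamp(0, 1)

# Example (hypothetical accuracy helper): compare filtered natural vs. adversarial images
# acc_nat = accuracy(model, low_pass_filter(x_nat, 0.6), y)
# acc_adv = accuracy(model, low_pass_filter(x_adv, 0.6), y)
```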
We then swap the high-frequency components of the two kinds of examples, controlled by a frequency threshold, in Figures 1(c)-1(d). For merged natural examples whose high-frequency components come from adversaries, increasing the frequency threshold moves the accuracy from the model robustness (red line) up to the model performance (green line); the opposite occurs for merged adversarial examples. These results clearly illustrate the boosting effect of the high-frequency components of natural examples and the destructive effect of the high-frequency components of adversarial examples.
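A similar sketch for the frequency-swapping experiment, assuming the merged image takes its low frequencies from one source and its high frequencies from the other at the same circular threshold used above:

```python
import torch

def swap_high_frequency(low_src, high_src, scale):
    """Merged image: low frequencies from `low_src`, high frequencies from
    `high_src`, split at a circular threshold proportional to `scale`."""
    n, c, h, w = low_src.shape
    spec_low = torch.fft.fftshift(torch.fft.fft2(low_src), dim=(-2, -1))
    spec_high = torch.fft.fftshift(torch.fft.fft2(high_src), dim=(-2, -1))
    yy, xx = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    dist = torch.sqrt((yy - h / 2) ** 2 + (xx - w / 2) ** 2)
    mask = (dist <= scale * h / 2).float()
    merged = spec_low * mask + spec_high * (1 - mask)
    return torch.fft.ifft2(torch.fft.ifftshift(merged, dim=(-2, -1))).real.clamp(0, 1)

# Merged natural example: natural low frequencies + adversarial high frequencies
# x_merged_nat = swap_high_frequency(x_nat, x_adv, scale=0.6)
```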
[Figure 2 panels: average logarithmic amplitude spectra (32 × 32, with colorbars): (a) ori. images, (b) adv (STD), (c) diff (STD), (d) adv (ADV), (e) diff (ADV).]
Figure 2: The average logarithmic amplitude spectrum of (a) 1000 three-channel images and (b) their adversarial examples generated by the standard (STD) model, where the corners represent the high-frequency range and the colorbars represent the logarithmic amplitude $\log(|\cdot|)$ (the redder, the larger). (c) denotes the difference between (b) and (a), $\log(|adv|) - \log(|nat|) = \log(|adv|/|nat|)$, where white in (c) represents equal amplitude and red represents $|adv| > |nat|$. (d) and (e) are on the adversarially-trained (ADV) model, and (e) denotes the difference between (d) and (a).
To further illustrate, Figure 2 shows that, statistically, the main difference in the frequency domain between natural and adversarial examples is concentrated in the high-frequency region. We visualize the logarithmic amplitude spectra of both kinds of examples. Figures 2(a) and 2(b) show that, compared with natural examples', the high-frequency components of the adversaries are hard to ignore. Figure 2(c) further emphasizes that adversarial examples markedly show more high-frequency components, indicating relatively drastic local changes among pixels. This statistical difference explains the high detection rate of using the Magnitude Fourier Spectrum (Harder et al., 2021) to detect adversarial examples. Furthermore, Figures 2(d) and 2(e) show that the high-frequency components of adversarial examples generated by robust models are fewer than those from standard models, yet still more than natural examples'. Besides, the analysis of filtering out low-frequency components in Figure 6 (Appendix A) also supports this statement. That is, compared with natural examples', the special high-frequency components of adversarial examples have a seriously misleading effect on standard models, yet are not enough to be fully responsible for the destructiveness to robust models.
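A minimal sketch of how the spectra in Figure 2 can be computed, assuming the average is taken over the log-amplitude spectra of a batch of images (the exact averaging procedure is not specified in the paper):

```python
import torch

def mean_log_amplitude(images):
    """Average log-amplitude spectrum over a batch and its channels; the
    spectrum is shifted so that corners correspond to high frequencies."""
    spec = torch.fft.fftshift(torch.fft.fft2(images), dim=(-2, -1))
    return torch.log(spec.abs() + 1e-12).mean(dim=(0, 1))   # shape (H, W)

# log(|adv|) - log(|nat|) = log(|adv| / |nat|): red regions in Figure 2(c)
# correspond to positive values, i.e., more high-frequency energy in adversaries.
# diff = mean_log_amplitude(x_adv) - mean_log_amplitude(x_nat)
```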
3.2 DIFFERENT PERTURBATIONS AND THE FRAGILE TENDENCY OF THE MODEL
Figure 3: Visualisation of adversarial perturbations (PGD) generated by Left: the adversarially-trained (ADV) model and Right: the standard (STD) model. The columns from left to right are denoted (a)-(m): (a) natural example; (b) adversarial example and (c)-(g) its perturbations (overall, average of channels, and three channels) generated by the ADV model; (h) adversarial example and (i)-(m) its perturbations generated by the STD model. All examples shown are successfully attacked: the first row is misclassified as a bird instead of a dog, the second as a cat instead of a frog, and the third as a car instead of a ship.
To explain the different effects of high-frequency components, we directly explore the adversarial examples themselves in the spatial domain. We first obtain well-trained models as above and visualize adversarial perturbations in Figure 3, including the overall perturbations scaled to [0, 255], the average perturbations over channels, and the perturbations of the three channels.
As shown in Figure 3 (Right), the perturbations generated by the standard model tend to be locally inconsistent and disordered. In contrast, Figure 3 (Left) shows that adversarial examples exhibit locally-consistent perturbations related to image shapes on the adversarially-trained model; that is, perturbations tend to locally co-increase or co-decrease on each channel. In more detail, the perturbation of each pixel on a single channel tends to reach the perturbation bound, i.e., +8/255 (the reddest) or -8/255 (the bluest), which is counter-intuitive under the iterative PGD-20 attack and naturally reminiscent of the one-step FGSM attack in Figure 7 (Appendix B). In fact, both attacks show similar perturbations and similar model robustness[1], while the former produces more detailed perturbations. Besides, compared with failed attacks in Figure 8 (Appendix B), the successful adversarial examples in Figure 3 show more misleading local perturbations, e.g., the perturbations in the first row look more like a bird (bird wings are added), and those in the third row like a car.
**Perturbations in the frequency vs. spatial domain.** Similar to the different destructive effects of high-frequency components, the perturbations in the spatial domain exhibit a more intuitive, model-related difference. Since more high-frequency components indicate relatively drastic local changes among pixels, we attribute the fewer high-frequency components of adversarial examples generated by adversarially-trained models mainly to their locally-consistent perturbations. These perturbations with smooth local changes imply that simply filtering out high-frequency components brings little improvement on robust models, while locally-disordered perturbations imply the effectiveness of filtering out high-frequency components on standard models.
**Perturbations vs. the fragile tendency of models.** For a given model, optimization-based attacks attempt to search for special perturbations that maximize the classification loss on adversarial examples. We note that for adversarially-trained models with smooth kernels, the loss-maximizing perturbations exhibit a locally-consistent tendency, whereas for standard models with non-smooth kernels, the loss-maximizing perturbations tend to be locally disordered. This implies a potential connection between the fragile tendency of a model and its potentially adversarial examples.
4 LOCAL RESPONSES: FURTHER IMPACT OF PERTURBATIONS ON MODELS
In this section, motivated by the local properties of adversarial examples and the local receptive field of convolution kernels, we first introduce a locally intermediate response perspective to rethink the vulnerability of the model, and then empirically show the relationship between local responses and model vulnerability.
Different from the existing idea of searching for adversarial examples, we consider under what circumstances the potential examples, which refer to all legal variants within the boundary $\mathcal{B}$ regardless of whether they are destructive or not, show their destructive effects on a well-trained model, that is, why some macro-similar potential examples yield predictions far away from those of their natural examples. Inspired by the local properties of adversarial examples and kernels, we naturally consider whether the destructiveness of potential examples can be viewed through local responses, which reflect the combined effect of the local features and the model property.
Note that, under ideal circumstances, given a well-trained model, if the local responses of any two examples on a certain layer are completely consistent, then the final predictions for these examples are exactly the same. To relax this condition in reality, we hypothesize as follows (referred to as **Assumption 1**): if macro-similar features passing through the same layer exhibit a sufficiently small difference in local responses, with sufficiently small differences in the subsequent layers, then the final responses of the network are relatively close; otherwise, large enough local differences make it difficult for the network to treat them as the same category. To illustrate this, we take the convolutional layer as an example.
4.1 LOCAL RESPONSES
Macroscopically, we denote a DNN model for classification by $f$, the $l$-th layer of activation feature maps by $f^l$, and the input features by $f^0$. A macro-similar image and its potential example pass through the $l$-th layer to obtain $f^l$ and $\hat{f}^l$ respectively.
[1] The adversarially-trained model (the best checkpoint) reaches 51.9% robustness against the PGD-20 attack and 57.4% robustness against the FGSM attack.
We also denote the $(l{+}1)$-th weight component $M^{l+1} \in \mathbb{R}^{H \times W \times K}$, where $H$, $W$, $K$ represent the height, width and number of kernels respectively. From the locally intermediate response perspective, we first take one of the convolution kernels $m^{l+1} \in \mathbb{R}^{H \times W}$, and then capture the $l$-th layer local features centered at position $(i,j)$, $x^l_{i,j}$ and $\hat{x}^l_{i,j}$, corresponding to its local receptive field. Formally, the difference of local responses is
$$\Delta^{l+1}_{i,j} := \hat{x}^l_{i,j} \otimes m^{l+1} - x^l_{i,j} \otimes m^{l+1} = x_d^l \otimes m^{l+1},$$
where $\Delta^{l+1}_{i,j}$ denotes the $(i,j)$-th response difference of local features through the same kernel $m^{l+1}$, and $x_d^l := \hat{x}^l_{i,j} - x^l_{i,j}$ denotes the difference of local feature maps between the potential example and the natural example. The second equality follows from the linearity of the convolution operation. Ideally, if the differences are all zero at every position in the $(l{+}1)$-th layer, that is, the intermediate layer has the same utility for both examples, then the potential example after the $(l{+}1)$-th layer shows exactly the same responses as the natural example. However, in most cases the local differences can hardly be exactly zero everywhere; based on Assumption 1, our aim is then to make the absolute difference of local responses $|\Delta^{l+1}_{i,j}|$ as small as possible so as to bring the model's cognition of the potential example close to that of the natural example.
The difference of local responses $\Delta^{l+1}_{i,j}$, composed of $x_d^l$ and $m^{l+1}$, can be further expressed as
$$x_d^l \otimes m^{l+1} = \sum_{i=1}^{H}\sum_{j=1}^{W} c^l_{ij}\, m^{l+1}_{ij},$$
where $c^l_{ij}$ and $m^{l+1}_{ij}$ represent the $(i,j)$-th elements of $x_d^l$ and $m^{l+1}$ respectively; thus the absolute local difference $|\Delta^{l+1}_{i,j}|$ is jointly affected by the local features and the kernel parameters.
Note that what we care about is under what circumstances large response differences are more likely to be produced. Though the specific impact of $c^l_{ij}$ and $m^{l+1}_{ij}$ on $\Delta^{l+1}_{i,j}$ is quite complicated, we consider the statistical behaviour, as numerous convolution kernels and various local features combine to affect the response differences in real networks. Statistically, a larger amplitude of $m^{l+1}$ with fixed $x_d^l$, or a larger amplitude of $x_d^l$ with fixed $m^{l+1}$, tends to produce a larger $|\Delta^{l+1}_{i,j}|$. Besides, if a kernel $m^{l+1}$ is relatively non-smooth, then non-smooth local features $x_d^l$, rather than smooth ones, are more likely to produce large enough local response differences.
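The linearity used above can be checked directly with a small sketch: the response difference produced by a perturbation equals the convolution of the perturbation itself with the kernel, so $|\Delta^{l+1}_{i,j}|$ can be measured from $x_d^l$ alone. The random inputs and random kernel below are purely illustrative assumptions.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x_nat = torch.rand(1, 3, 32, 32)                          # natural input (f^0)
x_adv = (x_nat + 8 / 255 * torch.randn_like(x_nat).sign()).clamp(0, 1)
kernel = torch.randn(1, 3, 3, 3)                          # one convolution kernel m^{l+1}

resp_diff = F.conv2d(x_adv, kernel) - F.conv2d(x_nat, kernel)
lin_diff = F.conv2d(x_adv - x_nat, kernel)                # x_d ⊗ m^{l+1}
assert torch.allclose(resp_diff, lin_diff, atol=1e-5)     # linearity of convolution

# Max and total absolute local response differences at this layer (cf. Section 4.3).
print(resp_diff.abs().max().item(), resp_diff.abs().sum().item())
```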
4.2 FURTHER IMPACT OF LOCAL PERTURBATIONS ON MODELS
Recalling the local properties of adversarial examples from Section 3, we further explore how different local perturbations affect models from the locally intermediate response perspective. To illustrate the effect of perturbations, we assume two convolution kernels with the same mean but different variances: one disordered and the other smoother.
[Figure 4 contents: STD/ADV adversarial examples → local perturbations → kernels → local response differences. Locally-disordered perturbation patch (from the STD model): [-1/255, 8/255, 0; -8/255, 8/255, -2/255; -4/255, 4/255, -4/255]; locally-consistent patch (from the ADV model): all entries 8/255. Disordered kernel: [-1, -1, -1; -1, 9, -1; -1, -1, -1]; smooth kernel: all entries 1/9 (same mean, different variance). The resulting local response differences are:]

|Local perturbation|Disordered kernel|Smooth kernel|
|---|---|---|
|Locally-disordered|79/255|0.11/255|
|Locally-consistent|8/255|8/255|
Figure 4: Further impact of local perturbations on different convolution kernels. Shown above are adversarial perturbations with the same amplitude (single channel) acting on kernels of different smoothness, producing the local response differences between an adversarial example and its natural example.
**Local responses vs. locally-disordered perturbations.** Since adversarial perturbations generated by standard models tend to be locally disordered, we discuss how such perturbations affect the local responses. Locally-disordered perturbations indicate relatively drastic local changes among the pixels of a potential example; in other words, the local input difference $x_d^0$ between the potential example and the natural example (also referred to as the local perturbation) has a large variance. These perturbations are convolved with numerous kernels. As shown in Figure 4, for convolution kernels with the same mean but larger variance, locally-disordered perturbations are more likely to yield large absolute response differences $|\Delta^1_{i,j}|$ in the first layer, since non-smooth kernels tend to emphasize the relationship between different pixels. On the other hand, for smoother kernels with relatively small variance, locally-disordered perturbations tend to produce smaller absolute response differences, since smoother kernels tend to focus on the average situation within their local receptive fields. That is, locally-disordered perturbations passing through different types of convolution kernels tend to produce different response differences, which can be accumulated in the subsequent layers.
**Local responses vs. locally-consistent perturbations.** Since adversarial perturbations generated by adversarially robust models tend to be locally consistent, we find that such perturbations are more destructive to smoother kernels than locally-disordered perturbations, as shown in Figure 4. Locally-consistent perturbations $x_d^0$ reaching the perturbation bound tend to produce larger absolute response differences in the first layer since they have a larger absolute average, whereas locally-disordered perturbations show smaller response differences due to a smaller absolute average.

Both kinds of local perturbations thus have different impacts on the local response differences between potential examples and natural examples, which hints at the fragile tendency of models and at the transferability of adversarial examples.
4.3 EMPIRICAL UNDERSTANDING OF LOCAL RESPONSES
Motivated by the observation that adversarially-trained models tend to show smoother kernels (Wang et al., 2020a), we first provide a further understanding of local responses and then give empirical results.

**Local responses vs. model smoothness.** Given a non-smooth convolution kernel $m^{l+1}$ with large variance, it is more likely to cause a large absolute response difference $|\Delta^{l+1}_{i,j}|$ within its local receptive field when $x_d^l$ is fixed. Considering that numerous non-smooth kernels and various local features are combined in the same layer, it is quite easy to yield large enough response differences. More complicated still, the subsequent layers act on the current local response differences and produce intricate accumulation effects. That is, a standard model with non-smooth kernels itself tends to amplify the local response differences between potential examples and natural examples. On the other hand, since an adversarially-trained model shows smoother kernels, the model tends to weaken the local response differences and narrow the gap between the final responses of potential examples and natural examples, indicating a trade-off between model robustness and accuracy.
**Setup.** We use the standard and adversarially-trained models obtained in Section 3. Considering the complex functional layers in DNNs, including linear and nonlinear ones, and based on Assumption 1, we use the maximum and the total absolute differences of local responses in several layers to show their effects. Taking ResNet-18 as an example, we investigate the first convolutional layer (Conv 1), the feature maps before layer 1 (Layer 0), and Layer 1 to Layer 4 respectively[1].
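A sketch of how such layer-wise statistics can be collected with forward hooks; the layer names follow the torchvision ResNet-18 attributes, which is an assumption about the backbone, and averaging the per-image statistics over a batch is one plausible reading of "per image" in Table 1.

```python
import torch

def response_differences(model, x_nat, x_adv,
                         layer_names=("conv1", "layer1", "layer2", "layer3", "layer4")):
    """Per-image max and total absolute differences of intermediate feature maps."""
    feats, hooks = {}, []
    modules = dict(model.named_modules())
    for name in layer_names:
        hooks.append(modules[name].register_forward_hook(
            lambda m, inp, out, name=name: feats.setdefault(name, []).append(out.detach())))
    with torch.no_grad():
        model(x_nat)   # first forward pass stores natural features
        model(x_adv)   # second forward pass stores adversarial features
    for h in hooks:
        h.remove()
    stats = {}
    for name, (f_nat, f_adv) in feats.items():
        diff = (f_adv - f_nat).abs()
        # Average the per-image maximum and total over the batch.
        stats[name] = (diff.amax(dim=(1, 2, 3)).mean().item(),
                       diff.sum(dim=(1, 2, 3)).mean().item())
    return stats
```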
To verify the above analysis, we first compare the local response differences between potential examples and natural examples on the standard and robust models. As Table 1 shows, the standard model exhibits significantly larger differences in local responses layer by layer, especially in terms of the total differences per image pair. These large response differences accumulate, driving the model's cognition of potential examples far away from that of their natural counterparts and ultimately making the model highly vulnerable. On the other hand, the adversarially-trained model significantly shortens the local response differences in the corresponding layers, yet some differences still remain. In other words, further reducing the local response differences is one route to bringing model robustness closer to model performance. We also note that, due to the complex effect of nonlinear layers (e.g., BatchNorm, ReLU), the maximum absolute difference of local responses does not increase strictly monotonically, but it increases in trend.
[1] Feature map sizes: Conv 1, Layer 0 and Layer 1: 64 × 32 × 32; Layer 2: 128 × 16 × 16; Layer 3: 256 × 8 × 8; Layer 4: 512 × 4 × 4.
Table 1: The absolute difference of local responses (per image) between test set images and their
potentially adversarial examples on different layers. Left/Right denote standard/robust model.
|Metric|Conv 1|Layer 0|Layer 1|Layer 2|Layer 3|Layer 4|
|---|---|---|---|---|---|---|
|Max|0.12/0.05|0.30/0.15|1.92/0.52|1.46/0.36|1.45/0.41|3.72/1.02|
|Total|370/556|443/489|4276/1246|2093/430|581/164|2624/526|
**Local responses vs. destructive effect.** Naturally, we wonder whether there is a clear difference between potential examples from successful and failed attacks on a given model. We first select the natural examples that are classified correctly, and collect their potential examples from successful and failed attacks. As Table 2 shows[1], for the standard model, the successful ones show differences similar to the failed ones in the front layers, but larger differences, especially in Layer 4 close to the outputs, leading to the final destructive effect. For the adversarially-trained model in Table 4 (Appendix C), however, the differences are similar even in Layer 4, possibly due to the smoother kernels and locally smoother perturbations, so that the final responses of both remain close to those of the natural examples.
Table 2: Left/Right denote successfully-attacked/failed adversarial examples on standard model.
|Metric|Conv 1|Layer 0|Layer 1|Layer 2|Layer 3|Layer 4|
|---|---|---|---|---|---|---|
|Max|0.16/0.15|0.379/0.378|2.50/2.51|1.38/1.37|1.16/1.08|2.11/1.38|
|Total|558/554|676/669|5955/6001|2363/2339|522/494|1481/757|
Table 3: Left/Right denote original/transferred adversarial examples on adversarially-trained model.
|Metric|Conv 1|Layer 0|Layer 1|Layer 2|Layer 3|Layer 4|
|---|---|---|---|---|---|---|
|Max|0.05/0.04|0.15/0.12|0.52/0.44|0.36/0.14|0.41/0.08|1.02/0.10|
|Total|556/321|489/302|1246/767|430/138|164/38|526/57|
**Local responses vs. transferability.** The model-dependent local perturbations suggest that adversarial examples may transfer with difficulty, so we further explore whether transferability can be understood from the locally intermediate response perspective. To verify the analysis in Section 4.2, we exchange the adversaries obtained from the standard and the robust model. As Table 3 shows, locally-disordered perturbations combined with a smoother adversarially-trained model exhibit fairly small local response differences layer by layer, keeping the model's cognition close enough to that of the natural examples and leading to 83.6% robustness, close to the 84.9% model performance. A similar situation occurs when locally-consistent perturbations are combined with a non-smooth standard model in Table 5 (Appendix C). These results indicate that the perturbations searched on a given model tend to amplify the response differences as much as possible so as to enlarge the gap in the model's cognition between potential examples and natural examples, and the weak transferability of adversarial examples may be due to transferred perturbations failing to enlarge this gap.
4.4 SMOOTHER KERNELS: ALLEVIATE LOCAL RESPONSE DIFFERENCES
To further exhibit the effect of shortening local response differences, we simply show that smoother adversarially robust models can alleviate local response differences and thereby improve their robustness. As shown in Figure 5, adversarially-trained models with different smoothness (i.e., larger weight decay parameters give kernels of smaller magnitude (Loshchilov & Hutter, 2019)) show different local response differences. Figures 5(b)-5(g) for the maximum differences and Figure 9 for the total differences (Appendix C) indicate that smoother kernels slightly weaken the local response differences between potential examples and natural examples layer by layer, and finally tend to narrow the gap between robustness and performance in Figure 5(h). To some extent, this explains why adversarially-trained models are more sensitive to weight decay (Pang et al., 2021). On the other hand, we find that further increasing the weight decay can hardly reduce the magnitude of the parameters and the local response differences in each layer any further, which may be one of the reasons for the current bottleneck in the robustness of adversarial training. Besides, increasing the weight decay can effectively weaken robust overfitting (Rice et al., 2020), as shown in Figure 5(h) and Figure 9(g).
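A sketch of the two quantities this analysis relies on, for a generic PyTorch model: the quadratic sum of parameters tracked in Figure 5(a), and a simple per-kernel variance used here as an illustrative proxy for kernel smoothness (the paper does not define a smoothness metric explicitly).

```python
import torch
import torch.nn as nn

def parameter_quadratic_sum(model):
    """Quadratic sum of all parameters, as tracked in Figure 5(a)."""
    return sum((p ** 2).sum().item() for p in model.parameters())

def mean_kernel_variance(model):
    """Average variance within each convolution kernel; flatter (smoother)
    kernels, encouraged by larger weight decay, give smaller values.
    This is an assumed proxy, not the paper's own measure."""
    variances = []
    for m in model.modules():
        if isinstance(m, nn.Conv2d):
            w = m.weight.detach()                  # (out, in, kH, kW)
            variances.append(w.flatten(2).var(dim=-1).mean().item())
    return sum(variances) / len(variances)
```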
[1] Note that under the PGD-20 attack the standard model hardly yields any failed attacks, so we use the FGSM attack to illustrate the point.
[Figure 5 panels, plotted against training epochs for weight decay 1e-4, 3e-4, 5e-4, 7e-4, 10e-4: (a) Parameters (quadratic sum), (b) Conv 1, (c) Layer 0, (d) Layer 1, (e) Layer 2, (f) Layer 3, (g) Layer 4 (maximum local response differences), and (h) Test Accuracy.]
Figure 5: The maximum local response differences on robust models with different smoothness. (a)
shows different model smoothness affected by weight decay, (b)-(g) denote the impact of weight
decay on the local response differences, and (h) shows the model robustness and performance.
4.5 DISCUSSION: LOCAL RESPONSES AND MODEL ROBUSTNESS
The above suggests that model robustness is related both to the model itself and to the properties of potential examples, and the two are combined through the local responses. Due to large enough differences in local responses, some potential examples are not regarded as similar to their natural examples (or their predictions differ from the correct predictions on the natural ones), so they eventually show their destructive effect as adversarial examples. That is, if a model tends to weaken the local response differences between potential and natural examples, then the model exhibits great robustness, as the final responses on the potential examples are more likely to be close to those on the natural ones.
On the other hand, although small differences in local responses make the network more inclined to treat both examples as the same category, the remaining differences, especially the intricate differences after multi-layer accumulation, emphasize the difficulty of bringing model robustness up to model performance and demonstrate that DNN models are naturally fragile. Essentially, these response differences come down to whether a non-zero convolution kernel acting on all legal perturbations within the boundary $\mathcal{B}$ yields differences that are all zero or sufficiently small without further accumulation. In other words, model robustness means that, given legal potential examples from any attack, the accumulated response differences can be alleviated or removed. For instance, feature denoising (Xie et al., 2019) and activation suppressing (Bai et al., 2021) can be viewed as shortening the local response differences between both examples to improve model robustness.
Besides, shortening local response differences does not directly imply an improvement in model robustness; in fact, it only expresses the closeness of the model's cognition of both examples. That is, a poorly-trained model may also treat both as relatively close, yet give bad performance. To further improve model robustness, a more realistic idea is to find proper parameters (if parameters that minimize the local response differences exist) or new methods (nonlinear layers, loss functions, model structures) to shorten the response differences as much as possible while not overly weakening the classification performance of the model.
5 CONCLUSION
In this paper, we investigate the local properties of adversarial examples generated by different models and rethink the vulnerability of DNNs from a novel locally intermediate response perspective. We find that the high-frequency components of adversarial examples tend to mislead standard DNNs but have little impact on adversarially-trained models. Furthermore, locally-disordered perturbations appear on standard models, while locally-consistent perturbations appear on adversarially-trained models. Both explorations emphasize the local perspective and the potential relationship between models and their adversarial examples, and we then explore how different local perturbations affect the models. We demonstrate that DNN models are naturally fragile, at least for large enough local response differences between potentially adversarial examples and natural examples, and empirically show that smoother adversarially-trained models can alleviate local response differences to improve robustness.
REFERENCES
Anish Athalye, Nicholas Carlini, and David Wagner. Obfuscated gradients give a false sense of
security: Circumventing defenses to adversarial examples. In ICML. 2018.
Yang Bai, Yuyuan Zeng, Yong Jiang, Shu-Tao Xia, Xingjun Ma, and Yisen Wang. Improving
adversarial robustness via channel-wise activation suppressing. In ICLR. 2021.
Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. In IEEE
_Symposium on Security and Privacy (SP). 2017._
Jinghui Chen and Quanquan Gu. RayS: A ray searching method for hard-label adversarial attack. In
_KDD. 2020._
Francesco Croce and Matthias Hein. Reliable evaluation of adversarial robustness with an ensemble
of diverse parameter-free attacks. In ICML. 2020.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep
bidirectional transformers for language understanding. In NAACL. 2019.
Alhussein Fawzi, Omar Fawzi, and Pascal Frossard. Analysis of classifiers’ robustness to adversarial
perturbations. In Machine Learning. 2018.
Mahyar Fazlyab, Alexander Robey, Hamed Hassani, Manfred Morari, and George J. Pappas. Efficient and accurate estimation of Lipschitz constants for deep neural networks. In NeurIPS. 2019.
Justin Gilmer, Luke Metz, Fartash Faghri, Samuel S. Schoenholz, Maithra Raghu, Martin Wattenberg, and Ian Goodfellow. Adversarial spheres. In arXiv preprint arXiv:1801.02774. 2018.
Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial
examples. In ICLR. 2015.
Paula Harder, Franz-Josef Pfreundt, Margret Keuper, and Janis Keuper. Spectraldefense: Detecting
adversarial attacks on cnns in the fourier domain. In arXiv preprint arXiv:2103.03000. 2021.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR. 2016.
Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Logan Engstrom, Brandon Tran, and Aleksander
Madry. Adversarial examples are not bugs, they are features. In NeurIPS. 2019.
Malhar Jere, Maghav Kumar, and Farinaz Koushanfar. A singular value perspective on model robustness. In arXiv preprint arXiv:2012.03516. 2020.
Alex Krizhevsky. Learning multiple layers of features from tiny images. 2009.
Alexey Kurakin, Ian Goodfellow, and Samy Bengio. Adversarial examples in the physical world. In
_ICLR. 2017._
Fangzhou Liao, Ming Liang, Yinpeng Dong, Tianyu Pang, Xiaolin Hu, and Jun Zhu. Defense against
adversarial attacks using high-level representation guided denoiser. In CVPR. 2018.
Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In ICLR. 2019.
Divyam Madaan and Sung Ju Hwang. Adversarial neural pruning with latent vulnerability suppression. In ICML. 2020.
Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu.
Towards deep learning models resistant to adversarial attacks. In ICLR. 2018.
Tianyu Pang, Xiao Yang, Yinpeng Dong, Hang Su, and Jun Zhu. Bag of tricks for adversarial
training. In ICLR. 2021.
Nicolas Papernot, Patrick McDaniel, Xi Wu, Somesh Jha, and Ananthram Swami. Distillation as
a defense to adversarial perturbations against deep neural networks. In IEEE Symposium on
_Security and Privacy (SP). 2016._
Sylvestre-Alvise Rebuffi, Sven Gowal, Dan A. Calian, Florian Stimberg, Olivia Wiles, and Timothy Mann. Fixing data augmentation to improve adversarial robustness. In arXiv preprint
_arXiv:2103.01946. 2021._
Jie Ren, Die Zhang, Yisen Wang, Lu Chen, Zhanpeng Zhou, Yiting Chen, Xu Cheng, Xin Wang,
Meng Zhou, Jie Shi, and Quanshi Zhang. Towards a unified game-theoretic view of adversarial
perturbations and robustness. In arXiv preprint arXiv:2103.07364. 2021.
Leslie Rice, Eric Wong, and Zico Kolter. Overfitting in adversarially robust deep learning. In ICML.
2020.
Ludwig Schmidt, Shibani Santurkar, Dimitris Tsipras, Kunal Talwar, and Aleksander Madry. Adversarially robust generalization requires more data. In NeurIPS. 2018.
Mahmood Sharif, Sruti Bhagavatula, Lujo Bauer, and Michael K. Reiter. Accessorize to a crime:
Real and stealthy attacks on state-of-the-art face recognition. In SIGSAC. 2016.
Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow,
and Rob Fergus. Intriguing properties of neural networks. In ICLR. 2014.
Thomas Tanay and Lewis Griffin. A boundary tilting perspective on the phenomenon of adversarial
examples. In arXiv preprint arXiv:1608.07690. 2016.
Dimitris Tsipras, Shibani Santurkar, Logan Engstrom, Alexander Turner, and Aleksander Madry.
Robustness may be at odds with accuracy. In ICLR. 2019.
Haohan Wang, Xindi Wu, Zeyi Huang, and Eric P Xing. High-frequency component helps explain
the generalization of convolutional neural networks. In CVPR. 2020a.
Yisen Wang, Difan Zou, Jinfeng Yi, James Bailey, Xingjun Ma, and Quanquan Gu. Improving
adversarial robustness requires revisiting misclassified examples. In ICLR. 2020b.
Tsui-Wei Weng, Huan Zhang, Pin-Yu Chen, Jinfeng Yi, Dong Su, Yupeng Gao, Cho-Jui Hsieh, and
Luca Daniel. Evaluating the robustness of neural networks: An extreme value theory approach.
In ICLR. 2018.
Cihang Xie, Yuxin Wu, Laurens van der Maaten, Alan Yuille, and Kaiming He. Feature denoising
for improving adversarial robustness. In CVPR. 2019.
Kaidi Xu, Sijia Liu, Gaoyuan Zhang, Mengshu Sun, Pu Zhao, Quanfu Fan, Chuang Gan, and Xue
Lin. Interpreting adversarial examples by activation promotion and suppression. In arXiv preprint
_arXiv:1904.02057. 2019._
Dong Yin, Raphael Gontijo Lopes, Jonathon Shlens, Ekin D. Cubuk, and Justin Gilmer. A fourier
perspective on model robustness in computer vision. In NeurIPS. 2019.
Hongyang Zhang, Yaodong Yu, Jiantao Jiao, Eric P Xing, Laurent El Ghaoui, and Michael I. Jordan.
Theoretically principled trade-off between robustness and accuracy. In ICML. 2019.
Tianyuan Zhang and Zhanxing Zhu. Interpreting adversarially trained convolutional neural networks. In ICML. 2018.
A THE DESTRUCTIVENESS OF ADVERSARIAL EXAMPLES ON THE FREQUENCY DOMAIN
[Figure 6 panels: test accuracy on filtered images versus frequency scale for natural vs. adversarial examples: (a) low-frequency components (STD), (b) low-frequency components (ADV).]
Figure 6: The destructiveness of only the high-frequency components of natural and adversarial examples on both standard (STD) and adversarially-trained (ADV) models. Shown above are well-trained models tested with images passed through a high-pass filter.
Here, we further investigate the contribution of only the high-frequency components to the destructiveness of both kinds of examples. We obtain the standard and adversarially-trained models as above. Figure 6 illustrates the trend of model performance and robustness on the test set as the low-frequency components are removed (a larger filtering scale means fewer low-frequency components are kept in the filtered images). As the filtering scale increases, only natural examples on standard models keep a certain classification performance, which then decreases to 10% (the accuracy of random classification). That is, the high-frequency components of natural examples can to some extent promote classification. On the other hand, the high-frequency components of adversarial examples on standard models show almost no promotion but a destructive effect, and finally also reach 10%. For robust models, the high-frequency components of both kinds of examples show similar and low performance.
B THE LOCAL PROPERTIES OF ADVERSARIAL EXAMPLES ON THE SPATIAL DOMAIN
We further explore the local properties of adversarial examples generated by the FGSM attack. For adversarially-trained models, compared with the PGD attack in Figure 3, the adversarial examples generated by the FGSM attack in Figure 7 show similar perturbations and similar model robustness (51.9% robustness for PGD-20 and 57.4% for FGSM), yet the latter produces less detailed perturbations. For standard models, though both attacks show locally-disordered perturbations, the latter exhibits less disordered perturbations since fewer perturbation values can be taken, i.e., +8/255, 0, and -8/255.
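For completeness, a minimal FGSM sketch under the same $\ell_\infty$ budget; a single gradient-sign step means each pixel's perturbation is essentially one of +8/255, 0, or -8/255 (0 where the gradient is exactly zero, before clipping to the valid image range).

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, eps=8/255):
    """Single-step l_inf attack: x' = clip(x + eps * sign(grad_x L), 0, 1)."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    grad = torch.autograd.grad(loss, x)[0]
    return (x + eps * grad.sign()).clamp(0, 1).detach()
```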
Figure 7: Visualisation of adversarial perturbations (FGSM) generated by Left: adversarially-trained
(ADV) model and Right: standard (STD) model.
Similar to the successfully attacking adversarial examples in Figure 3, the failed attacks in Figure 8 also show locally-consistent perturbations related to image shapes on the adversarially-trained model. The difference is that these examples show less misleading local perturbations, since the shape of their perturbations is closer to the shape of the original image.
Figure 8: Visualisation of adversarial perturbations (PGD) generated by Left: adversarially-trained
(ADV) model and Right: standard (STD) model. Shown above are all failed attacks on the ADV model.
C LOCAL RESPONSES
As mentioned in Section 4.3, we report the destructive effects on the adversarially-trained model in Table 4. We first select the natural examples that are classified correctly, and collect their potential examples from successful and failed attacks. Similar to the standard model, the successful ones show differences similar to the failed ones in the front layers, but larger differences in Layer 4 close to the outputs. The difference is that, due to the smoother kernels and locally smoother perturbations, the robust model exhibits closer local response differences in Layer 4 between the successful and the failed ones. Besides, compared with the PGD attack, the adversarial examples generated by the FGSM attack show similar differences in the front layers, but smaller local response differences in Layer 4, leading to higher model robustness.
Table 4: The absolute difference of local responses (per image) between test set images and their
potentially adversarial examples on different layers. Left/Right denote successfully-attacked/failed
adversarial examples on adversarially-trained model respectively.
|Metric|Conv 1|Layer 0|Layer 1|Layer 2|Layer 3|Layer 4|
|---|---|---|---|---|---|---|
|Max (PGD-20)|0.048/0.047|0.15/0.15|0.51/0.53|0.36/0.37|0.40/0.42|1.10/0.99|
|Total (PGD-20)|557/555|488/490|1264/1242|445/420|169/160|578/491|
|Max (FGSM)|0.048/0.047|0.16/0.15|0.53/0.54|0.35/0.36|0.37/0.38|0.90/0.86|
|Total (FGSM)|580/576|510/511|1292/1272|436/414|161/153|478/428|
We further report the transferability of adversarial examples on the standard model, as shown in Table 5. We use the potentially adversarial examples obtained from the robust model to attack the standard model. That is, locally-consistent perturbations, combined with a non-smooth standard model, exhibit small local response differences layer by layer, making the model's cognition close enough to that of the natural examples and leading to 78.4% robustness, close to the 94.4% model performance.
Table 5: The absolute difference of local responses (per image) between test set images and their potentially adversarial versions on different layers of ResNet-18. Left/Right denote original/transferred
adversarial examples on standard model respectively.
|Metric|Conv 1|Layer 0|Layer 1|Layer 2|Layer 3|Layer 4|
|---|---|---|---|---|---|---|
|Max|0.12/0.15|0.30/0.29|1.92/0.85|1.46/0.88|1.45/0.67|3.72/0.88|
|Total|370/532|443/509|4276/2639|2093/1157|581/280|2624/525|
As mentioned in Section 4.4, Figure 9 supplements Figure 5 from the perspective of the total absolute local response differences, and indicates that smaller local response differences likewise tend to bring model robustness closer to model performance.
[Figure 9 panels, plotted against training epochs for weight decay 1e-4, 3e-4, 5e-4, 7e-4, 10e-4: total differences at (a) Conv 1, (b) Layer 0, (c) Layer 1, (d) Layer 2, (e) Layer 3, (f) Layer 4, and (g) Robust Loss.]
Figure 9: The total absolute local response differences (per image) of several layers on adversarially-trained models with different smoothness. (a)-(f) denote the influence of different weight decay parameters on the local response differences in each layer, and (g) shows the robust loss on the training set and the test set.