# DISSECTING LOCAL PROPERTIES OF ADVERSARIAL EXAMPLES
**Anonymous authors**
Paper under double-blind review
ABSTRACT
Adversarial examples have attracted significant attention over the years, yet they remain insufficiently understood, especially when analyzed in combination with adversarial training. In this paper, we revisit some properties of adversarial examples from both the frequency and the spatial perspective: 1) the special high-frequency components of adversarial examples tend to mislead naturally-trained models while having little impact on adversarially-trained ones, and 2) adversarial examples show disorderly perturbations on naturally-trained models and locally-consistent (image-shape-related) perturbations on adversarially-trained ones. Motivated by these observations, we analyze the fragile tendency of models through the generated adversarial perturbations, and propose a connection between model vulnerability and the local intermediate response. That is, a smaller local intermediate response difference comes along with better model adversarial robustness. Specifically, we demonstrate that: 1) DNNs are naturally fragile, at least for large enough local response differences between adversarial and natural examples, and 2) smoother adversarially-trained models can alleviate local response differences with enhanced robustness.
1 INTRODUCTION
Although deep neural networks (DNNs) perform well in many fields (He et al., 2016; Devlin et al., 2019), their counter-intuitive vulnerability attracts increasing attention, both for safety-critical applications (Sharif et al., 2016) and for the black-box mechanism of DNNs (Fazlyab et al., 2019). DNNs have been found vulnerable to adversarial examples (Szegedy et al., 2014; Goodfellow et al., 2015), where small perturbations on the input can easily change the predictions of a well-trained DNN with high confidence. In computer vision, adversarial examples exhibit their destructiveness both in the digital world and in the physical world (Kurakin et al., 2017).

How to alleviate the vulnerability of DNNs, so as to narrow the performance gap between adversarial and natural examples, is another key issue. Existing methods, including defensive distillation (Papernot et al., 2016) and pixel denoising (Liao et al., 2018), have shown their limitations due to follow-up attack strategies (Carlini & Wagner, 2017) or gradient masking (Athalye et al., 2018). Among them, adversarial training (Goodfellow et al., 2015; Madry et al., 2018) and its variants (Zhang et al., 2019; Wang et al., 2020b) provide reliable robustness and outperform other defenses. Moreover, as a data augmentation method, adversarial training currently seems to rely on additional data (Schmidt et al., 2018; Rebuffi et al., 2021) to further improve robustness, while being sensitive to basic model hyper-parameters, e.g., weight decay (Pang et al., 2021). Apart from these, simple early stopping (Rice et al., 2020) even exceeds some promotion methods according to recent benchmarks (Croce & Hein, 2020; Chen & Gu, 2020). These studies raise our curiosity to further explore the relationship between adversarial examples and adversarial training, hoping to provide some new understanding.
High-frequency components have been tentatively linked to adversarial examples (Wang et al., 2020a; Yin et al., 2019; Harder et al., 2021); however, few explorations discuss the relationship between high-frequency components and the destructiveness of adversarial examples. In this paper, we first demonstrate that the high-frequency components of adversarial examples tend to mislead standard DNNs, yet have little impact on adversarially robust models. We further show that adversarial examples statistically have more high-frequency components than natural ones, indicating relatively drastic local changes among pixels of adversarial examples. Since adversarial examples appear more semantically meaningful on robust models (Tsipras et al., 2019), we further notice that adversarial examples show locally-consistent perturbations related to image shapes on adversarially-trained models, in contrast to the disorderly perturbations on standard models. Both the frequency-domain and spatial-domain explorations emphasize the local properties of adversarial examples; motivated by the local receptive field of convolution kernels, we propose a locally intermediate response perspective to rethink the vulnerability of DNNs. Different from the existing global activation perspective (Bai et al., 2021; Xu et al., 2019), our local perspective reflects the joint effect of local features and the intermediate layers of the model. Based on this local perspective, we emphasize that large enough local response differences make it difficult for the network to treat an image and its potentially adversarial examples as one category, and we demonstrate that DNN models are naturally fragile at least partly because of this. Motivated by the observation that adversarially-trained models tend to have 'smooth' kernels (Wang et al., 2020a), we simply use smoother kernels to alleviate local response differences on adversarially-trained models, which in turn improves model robustness and reduces robust overfitting (Rice et al., 2020). To a certain extent, this explains why weight decay effectively affects model robustness (Pang et al., 2021).
Our main contributions are summarized as follows:
- We first reveal some properties of adversarial examples in the frequency and spatial domains: 1) the high-frequency components of adversarial examples tend to mislead naturally-trained DNNs, yet have little impact on adversarially-trained models, and 2) adversarial examples show locally-consistent perturbations on adversarially-trained models, compared with disorderly local perturbations on naturally-trained models.
- Then we introduce the local response and emphasize its importance for model adversarial robustness. That is, naturally-trained DNNs are often fragile, at least because of the non-negligible local response differences, through the same layer, between potentially adversarial examples and natural ones. In contrast, adversarially-trained models effectively alleviate the local response differences, and smoother adversarially-trained models show better adversarial robustness as they further reduce local response differences.
- Finally, we empirically study the local response with generated adversarial examples. We show that, compared with failed attacks, adversarial examples (successful attacks) statistically show larger local response differences from natural examples. Moreover, compared with adversarial examples generated by the model itself, those transferred from other models show markedly smaller local response differences.
2 RELATED WORK
**Understandings of model vulnerability.** Since the discovery of adversarial examples (Szegedy et al., 2014), a number of understandings of model vulnerability have been developed. For instance, the linear property of DNNs (Goodfellow et al., 2015), submanifolds (Tanay & Griffin, 2016) and the geometry of the manifold (Gilmer et al., 2018) were considered from the high-dimensional perspective; the computation of Lipschitz constants (Szegedy et al., 2014; Fazlyab et al., 2019) and lower/upper bounds (Fawzi et al., 2018; Weng et al., 2018) were considered from the definition of model robustness; non-robust features (Ilyas et al., 2019), high-frequency components (Wang et al., 2020a; Yin et al., 2019), high-rank features (Jere et al., 2020) and high-order interactions among pixels (Ren et al., 2021) explored adversarial examples from different image-level perspectives, which hint at our local perspective; feature denoising (Xie et al., 2019), robust pruning (Madaan & Ju Hwang, 2020) and activation suppressing (Bai et al., 2021) focused on global intermediate activations of models. On the other hand, taking adversarial training into consideration, Tsipras et al. (2019), Schmidt et al. (2018) and Zhang et al. (2019) explored the trade-off between robustness and accuracy; Wang et al. (2020a) found that adversarially-trained models tend to show smooth kernels; Tsipras et al. (2019) and Zhang & Zhu (2018) argued that adversarially robust models learn more shape-biased representations.

Different from these studies, we characterize adversarial examples in both the frequency and the spatial domain to emphasize their local properties, and propose a locally intermediate response perspective to rethink the vulnerability of DNNs.
**Adversarial training.** Adversarial training can be seen as a min-max optimization problem:
$$\min_{\theta} \sum_{i=1}^{n} \max_{x'_i \in \mathcal{B}(x_i)} \mathcal{L}\big(f_\theta(x'_i), y_i\big),$$
where $f$ denotes a DNN model with parameters $\theta$, and $(x_i, y_i)$ denotes a pair of a natural example $x_i$ and its ground-truth label $y_i$. Given a classification loss $\mathcal{L}$, the inner maximization problem can be regarded as searching for suitable perturbations within the boundary $\mathcal{B}$ that maximize the loss, while the outer minimization problem optimizes the model parameters on the adversarial examples $\{x'_i\}_{i=1}^{n}$ generated by the inner maximization.
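As a concrete illustration of this min-max procedure, below is a minimal PyTorch-style sketch in which the inner maximization is approximated by projected gradient ascent (PGD) with random start, as in Madry et al. (2018); the function names and the training-loop structure are illustrative assumptions, not code from this paper.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, alpha=2/255, steps=10):
    """Approximate the inner maximization: search an l_inf ball B(x, eps)
    for a point that maximizes the classification loss (random start + ascent)."""
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1).detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        # Gradient ascent step, then projection back onto B(x, eps) and [0, 1].
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1)
    return x_adv.detach()

def adversarial_training_epoch(model, loader, optimizer, device="cuda"):
    """Outer minimization: update model parameters on the generated adversaries."""
    model.train()
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        x_adv = pgd_attack(model, x, y)          # inner maximization
        optimizer.zero_grad()
        loss = F.cross_entropy(model(x_adv), y)  # outer minimization objective
        loss.backward()
        optimizer.step()
```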
3 DIFFERENT PERTURBATIONS AFFECT THE VULNERABILITY OF THE MODEL
In this section, we first investigate adversarial examples in the frequency and spatial domains, and show the connections between models and their adversarial examples. Note that our threat model is a white-box one, and since an adversarial example is defined only by its misleading result under the legal threat model, our findings suggest that the properties of the sampled examples broadly reflect the fragile tendency of the model.
**Setup.** We generate $\ell_\infty$-bounded adversarial examples by PGD-10 (maximum perturbation $\epsilon = 8/255$ and step size $2/255$) with random start for the robust model (Madry et al., 2018). Specifically, we use ResNet-18 (He et al., 2016) as the backbone to train the standard and adversarially-trained models for 100 epochs on CIFAR-10 (Krizhevsky, 2009). Following Pang et al. (2021), we use the SGD optimizer with momentum 0.9, weight decay $5 \times 10^{-4}$ and initial learning rate 0.1, with a three-stage schedule that divides the learning rate by 10 at epochs 50 and 75. The robustness of both models is evaluated by PGD-20 (step size 1/255).
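For reference, a minimal sketch of this training configuration in PyTorch; the use of the torchvision ResNet-18 is an assumption (CIFAR-style variants typically modify the first convolution), and only the optimizer and schedule follow the stated hyper-parameters.

```python
import torch
from torchvision.models import resnet18

# CIFAR-10 has 10 classes; the torchvision ResNet-18 is a stand-in for the
# paper's backbone, whose exact implementation is not specified.
model = resnet18(num_classes=10)

optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=5e-4)
# Three-stage schedule: divide the learning rate by 10 at epochs 50 and 75.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                 milestones=[50, 75], gamma=0.1)

for epoch in range(100):
    # adversarial_training_epoch(model, loader, optimizer)  # see the sketch in Section 2
    scheduler.step()
```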
3.1 THE DESTRUCTIVENESS OF HIGH-FREQUENCY COMPONENTS FROM ADVERSARIES
[Figure 1 panels: test accuracy (y-axis, 0.0-1.0) versus frequency scale (x-axis, 0.0-1.4) for natural vs. adversarial examples: (a) filtered images (STD), (b) filtered images (ADV), (c) merged images (STD), (d) merged images (ADV).]
Figure 1: The destructiveness of high-frequency components from natural and adversarial examples on both standard (STD) and adversarially-trained (ADV) models. Shown above are well-trained models tested with (a)-(b) images passed through a low-pass filter and (c)-(d) frequency-swapped images.
Inspired by Wang et al. (2020a), we are naturally curious whether adversarial examples cause considerable damage to the model mainly because of their high-frequency components. To answer this question, Figure 1 illustrates the trend of model performance and robustness on the test set as more high-frequency components are included (a larger filtering scale means more high-frequency components are retained in the filtered images). Figure 1(a) shows that, for standard models, the high-frequency components of natural examples promote classification and the accuracy reaches the model performance (green line); on the contrary, the accuracy on filtered adversarial examples first rises to a peak of 47.5% and then drops rapidly to 0.0% (red line). Obviously, in the low-frequency range, the performance on natural and adversarial examples is quite close, yet more high-frequency components widen the gap. That is, the special high-frequency components caused by adversarial perturbations exhibit a clear destructive effect on standard models, and simply filtering them out can effectively alleviate the destructiveness of adversaries even on standard models. However, for robust models, Figure 1(b) shows that the prediction accuracy finally reaches the model robustness without a rapid drop. Surprisingly, we find that the accuracy on filtered adversarial examples in some range exceeds the final robustness of 47.5% (red line), reaching a maximum of 51.2%. That is, although these high-frequency components do not exhibit a clear destructive effect, simply filtering them out has a positive impact on alleviating robust overfitting (Rice et al., 2020).
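As a rough sketch of how such low-pass filtering can be implemented (the paper does not specify its exact filter; a centered circular mask on the shifted 2D FFT, with radius proportional to the frequency scale, is assumed here):

```python
import torch

def low_pass_filter(images, scale):
    """Keep only frequencies within a centered radius proportional to `scale`.
    images: (N, C, H, W) tensor in [0, 1]; a larger scale keeps more
    high-frequency content, loosely matching the 'frequency scale' axis."""
    n, c, h, w = images.shape
    spec = torch.fft.fftshift(torch.fft.fft2(images), dim=(-2, -1))
    yy, xx = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    dist = torch.sqrt((yy - h / 2) ** 2 + (xx - w / 2) ** 2)
    mask = (dist <= scale * h / 2).float()          # circular low-pass mask
    filtered = torch.fft.ifft2(torch.fft.ifftshift(spec * mask, dim=(-2, -1)))
    return filtered.real.clamp(0, 1)

# Example (hypothetical accuracy helper): compare filtered natural vs. adversarial images
# acc_nat = accuracy(model, low_pass_filter(x_nat, 0.6), y)
# acc_adv = accuracy(model, low_pass_filter(x_adv, 0.6), y)
```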
We then swap the high-frequency components of the two kinds of examples, controlled by a frequency threshold, in Figures 1(c)-1(d). For merged natural examples whose high-frequency components come from adversaries, increasing the frequency threshold moves the accuracy from the model robustness (red line) up to the model performance (green line); the opposite occurs for merged adversarial examples. These results clearly illustrate the boosting effect of the high-frequency components of natural examples and the destructive effect of the high-frequency components of adversarial examples.
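A similar sketch for the frequency-swapping experiment, assuming the merged image takes its low frequencies from one source and its high frequencies from the other at the same circular threshold used above:

```python
import torch

def swap_high_frequency(low_src, high_src, scale):
    """Merged image: low frequencies from `low_src`, high frequencies from
    `high_src`, split at a circular threshold proportional to `scale`."""
    n, c, h, w = low_src.shape
    spec_low = torch.fft.fftshift(torch.fft.fft2(low_src), dim=(-2, -1))
    spec_high = torch.fft.fftshift(torch.fft.fft2(high_src), dim=(-2, -1))
    yy, xx = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    dist = torch.sqrt((yy - h / 2) ** 2 + (xx - w / 2) ** 2)
    mask = (dist <= scale * h / 2).float()
    merged = spec_low * mask + spec_high * (1 - mask)
    return torch.fft.ifft2(torch.fft.ifftshift(merged, dim=(-2, -1))).real.clamp(0, 1)

# Merged natural example: natural low frequencies + adversarial high frequencies
# x_merged_nat = swap_high_frequency(x_nat, x_adv, scale=0.6)
```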
[Figure 2 panels: average logarithmic amplitude spectra (32 × 32, with colorbars): (a) ori. images, (b) adv (STD), (c) diff (STD), (d) adv (ADV), (e) diff (ADV).]
Figure 2: The average logarithmic amplitude spectrum of (a) 1000 three-channel images and (b) their adversarial examples generated by the standard (STD) model, where the corners represent the high-frequency range and the colorbars represent the logarithmic amplitude $\log(|\cdot|)$ (the redder, the larger). (c) denotes the difference between (b) and (a), $\log(|adv|) - \log(|nat|) = \log(|adv|/|nat|)$, where white in (c) represents equal amplitude and red represents $|adv| > |nat|$. (d) and (e) are on the adversarially-trained (ADV) model, and (e) denotes the difference between (d) and (a).
To further illustrate, Figure 2 shows that, statistically, the main difference in the frequency domain between natural and adversarial examples is concentrated in the high-frequency region. We visualize the logarithmic amplitude spectra of both kinds of examples. Figures 2(a) and 2(b) show that, compared with natural examples', the high-frequency components of the adversaries are hard to ignore. Figure 2(c) further emphasizes that adversarial examples markedly show more high-frequency components, indicating relatively drastic local changes among pixels. This statistical difference explains the high detection rate of using the Magnitude Fourier Spectrum (Harder et al., 2021) to detect adversarial examples. Furthermore, Figures 2(d) and 2(e) show that the high-frequency components of adversarial examples generated by robust models are fewer than those from standard models, yet still more than natural examples'. Besides, the analysis of filtering out low-frequency components in Figure 6 (Appendix A) also supports this statement. That is, compared with natural examples', the special high-frequency components of adversarial examples have a seriously misleading effect on standard models, yet are not enough to be fully responsible for the destructiveness to robust models.
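A minimal sketch of how the spectra in Figure 2 can be computed, assuming the average is taken over the log-amplitude spectra of a batch of images (the exact averaging procedure is not specified in the paper):

```python
import torch

def mean_log_amplitude(images):
    """Average log-amplitude spectrum over a batch and its channels; the
    spectrum is shifted so that corners correspond to high frequencies."""
    spec = torch.fft.fftshift(torch.fft.fft2(images), dim=(-2, -1))
    return torch.log(spec.abs() + 1e-12).mean(dim=(0, 1))   # shape (H, W)

# log(|adv|) - log(|nat|) = log(|adv| / |nat|): red regions in Figure 2(c)
# correspond to positive values, i.e., more high-frequency energy in adversaries.
# diff = mean_log_amplitude(x_adv) - mean_log_amplitude(x_nat)
```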
3.2 DIFFERENT PERTURBATIONS AND THE FRAGILE TENDENCY OF THE MODEL
Figure 3: Visualisation of adversarial perturbations (PGD) generated by Left: the adversarially-trained (ADV) model and Right: the standard (STD) model. The columns from left to right are denoted (a)-(m): (a) natural example; (b) adversarial example and (c)-(g) its perturbations (overall, average of channels, and three channels) generated by the ADV model; (h) adversarial example and (i)-(m) its perturbations generated by the STD model. All examples shown are successfully attacked: the first row is misclassified as a bird instead of a dog, the second as a cat instead of a frog, and the third as a car instead of a ship.
To explain the different effects of high-frequency components, we directly explore the adversarial examples themselves in the spatial domain. We first obtain well-trained models as above and visualize adversarial perturbations in Figure 3, including the overall perturbations scaled to [0, 255], the average perturbations over channels, and the perturbations of the three channels.
As shown in Figure 3 (Right), the perturbations generated by the standard model tend to be locally inconsistent and disordered. In contrast, Figure 3 (Left) shows that adversarial examples exhibit locally-consistent perturbations related to image shapes on the adversarially-trained model; that is, perturbations tend to locally co-increase or co-decrease on each channel. In more detail, the perturbation of each pixel on a single channel tends to reach the perturbation bound, i.e., +8/255 (the reddest) or -8/255 (the bluest), which is counter-intuitive under the iterative PGD-20 attack and naturally reminiscent of the one-step FGSM attack in Figure 7 (Appendix B). In fact, both attacks show similar perturbations and similar model robustness[1], while the former produces more detailed perturbations. Besides, compared with failed attacks in Figure 8 (Appendix B), the successful adversarial examples in Figure 3 show more misleading local perturbations, e.g., the perturbations in the first row look more like a bird (bird wings are added), and those in the third row like a car.
**Perturbations in the frequency vs. spatial domain.** Similar to the different destructive effects of high-frequency components, the perturbations in the spatial domain exhibit a more intuitive, model-related difference. Since more high-frequency components indicate relatively drastic local changes among pixels, we attribute the fewer high-frequency components of adversarial examples generated by adversarially-trained models mainly to their locally-consistent perturbations. These perturbations with smooth local changes imply that simply filtering out high-frequency components brings little improvement on robust models, while locally-disordered perturbations imply the effectiveness of filtering out high-frequency components on standard models.
**Perturbations vs. the fragile tendency of models.** For a given model, optimization-based attacks attempt to search for special perturbations that maximize the classification loss on adversarial examples. We note that for adversarially-trained models with smooth kernels, the loss-maximizing perturbations exhibit a locally-consistent tendency, whereas for standard models with non-smooth kernels, the loss-maximizing perturbations tend to be locally disordered. This implies a potential connection between the fragile tendency of a model and its potentially adversarial examples.
4 LOCAL RESPONSES: FURTHER IMPACT OF PERTURBATIONS ON MODELS
In this section, motivated by the local properties of adversarial examples and the local receptive field of convolution kernels, we first introduce a locally intermediate response perspective to rethink the vulnerability of the model, and then empirically show the relationship between local responses and model vulnerability.
Different from the existing idea of searching for adversarial examples, we consider under what circumstances the potential examples, which refer to all legal variants within the boundary $\mathcal{B}$ regardless of whether they are destructive or not, show their destructive effects on a well-trained model, that is, why some macro-similar potential examples yield predictions far away from those of their natural examples. Inspired by the local properties of adversarial examples and kernels, we naturally consider whether the destructiveness of potential examples can be viewed through local responses, which reflect the combined effect of the local features and the model property.
Note that, under ideal circumstances, given a well-trained model, if the local responses of any two examples on a certain layer are completely consistent, then the final predictions for these examples are exactly the same. To relax this condition in reality, we hypothesize as follows (referred to as **Assumption 1**): if macro-similar features passing through the same layer exhibit a sufficiently small difference in local responses, with sufficiently small differences in the subsequent layers, then the final responses of the network are relatively close; otherwise, large enough local differences make it difficult for the network to treat them as the same category. To illustrate this, we take the convolutional layer as an example.
4.1 LOCAL RESPONSES
Macroscopically, we denote a DNN model for classification by $f$, the $l$-th layer of activation feature maps by $f^l$, and the input features by $f^0$. A macro-similar image and its potential example pass through the $l$-th layer to obtain $f^l$ and $\hat{f}^l$ respectively.
[1] The adversarially-trained model (the best checkpoint) reaches 51.9% robustness against the PGD-20 attack and 57.4% robustness against the FGSM attack.
We also denote the $(l{+}1)$-th weight component $M^{l+1} \in \mathbb{R}^{H \times W \times K}$, where $H$, $W$, $K$ represent the height, width and number of kernels respectively. From the locally intermediate response perspective, we first take one of the convolution kernels $m^{l+1} \in \mathbb{R}^{H \times W}$, and then capture the $l$-th layer local features centered at position $(i,j)$, $x^l_{i,j}$ and $\hat{x}^l_{i,j}$, corresponding to its local receptive field. Formally, the difference of local responses is
$$\Delta^{l+1}_{i,j} := \hat{x}^l_{i,j} \otimes m^{l+1} - x^l_{i,j} \otimes m^{l+1} = x_d^l \otimes m^{l+1},$$
where $\Delta^{l+1}_{i,j}$ denotes the $(i,j)$-th response difference of local features through the same kernel $m^{l+1}$, and $x_d^l := \hat{x}^l_{i,j} - x^l_{i,j}$ denotes the difference of local feature maps between the potential example and the natural example. The second equality follows from the linearity of the convolution operation. Ideally, if the differences are all zero at every position in the $(l{+}1)$-th layer, that is, the intermediate layer has the same utility for both examples, then the potential example after the $(l{+}1)$-th layer shows exactly the same responses as the natural example. However, in most cases the local differences can hardly be exactly zero everywhere; based on Assumption 1, our aim is then to make the absolute difference of local responses $|\Delta^{l+1}_{i,j}|$ as small as possible so as to bring the model's cognition of the potential example close to that of the natural example.
The difference of local responses $\Delta^{l+1}_{i,j}$, composed of $x_d^l$ and $m^{l+1}$, can be further expressed as
$$x_d^l \otimes m^{l+1} = \sum_{i=1}^{H}\sum_{j=1}^{W} c^l_{ij}\, m^{l+1}_{ij},$$
where $c^l_{ij}$ and $m^{l+1}_{ij}$ represent the $(i,j)$-th elements of $x_d^l$ and $m^{l+1}$ respectively; thus the absolute local difference $|\Delta^{l+1}_{i,j}|$ is jointly affected by the local features and the kernel parameters.
Note that what we care about is under what circumstances large response differences are more likely to be produced. Though the specific impact of $c^l_{ij}$ and $m^{l+1}_{ij}$ on $\Delta^{l+1}_{i,j}$ is quite complicated, we consider the statistical behaviour, as numerous convolution kernels and various local features combine to affect the response differences in real networks. Statistically, a larger amplitude of $m^{l+1}$ with fixed $x_d^l$, or a larger amplitude of $x_d^l$ with fixed $m^{l+1}$, tends to produce a larger $|\Delta^{l+1}_{i,j}|$. Besides, if a kernel $m^{l+1}$ is relatively non-smooth, then non-smooth local features $x_d^l$, rather than smooth ones, are more likely to produce large enough local response differences.
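The linearity used above can be checked directly with a small sketch: the response difference produced by a perturbation equals the convolution of the perturbation itself with the kernel, so $|\Delta^{l+1}_{i,j}|$ can be measured from $x_d^l$ alone. The random inputs and random kernel below are purely illustrative assumptions.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x_nat = torch.rand(1, 3, 32, 32)                          # natural input (f^0)
x_adv = (x_nat + 8 / 255 * torch.randn_like(x_nat).sign()).clamp(0, 1)
kernel = torch.randn(1, 3, 3, 3)                          # one convolution kernel m^{l+1}

resp_diff = F.conv2d(x_adv, kernel) - F.conv2d(x_nat, kernel)
lin_diff = F.conv2d(x_adv - x_nat, kernel)                # x_d ⊗ m^{l+1}
assert torch.allclose(resp_diff, lin_diff, atol=1e-5)     # linearity of convolution

# Max and total absolute local response differences at this layer (cf. Section 4.3).
print(resp_diff.abs().max().item(), resp_diff.abs().sum().item())
```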
4.2 FURTHER IMPACT OF LOCAL PERTURBATIONS ON MODELS
Recalling the local properties of adversarial examples from Section 3, we further explore how different local perturbations affect models from the locally intermediate response perspective. To illustrate the effect of perturbations, we assume two convolution kernels with the same mean but different variances: one disordered and the other smoother.
[Figure 4 contents: STD/ADV adversarial examples → local perturbations → kernels → local response differences. Locally-disordered perturbation patch (from the STD model): [-1/255, 8/255, 0; -8/255, 8/255, -2/255; -4/255, 4/255, -4/255]; locally-consistent patch (from the ADV model): all entries 8/255. Disordered kernel: [-1, -1, -1; -1, 9, -1; -1, -1, -1]; smooth kernel: all entries 1/9 (same mean, different variance). The resulting local response differences are:]

|Local perturbation|Disordered kernel|Smooth kernel|
|---|---|---|
|Locally-disordered|79/255|0.11/255|
|Locally-consistent|8/255|8/255|
Figure 4: Further impact of local perturbations on different convolution kernels. Shown above are adversarial perturbations with the same amplitude (single channel) acting on kernels of different smoothness, producing the local response differences between an adversarial example and its natural example.
**Local responses vs. locally-disordered perturbations.** Since adversarial perturbations generated by standard models tend to be locally disordered, we discuss how such perturbations affect the local responses. Locally-disordered perturbations indicate relatively drastic local changes among the pixels of a potential example; in other words, the local input difference $x_d^0$ between the potential example and the natural example (also referred to as the local perturbation) has a large variance. These perturbations are convolved with numerous kernels. As shown in Figure 4, for convolution kernels with the same mean but larger variance, locally-disordered perturbations are more likely to yield large absolute response differences $|\Delta^1_{i,j}|$ in the first layer, since non-smooth kernels tend to emphasize the relationship between different pixels. On the other hand, for smoother kernels with relatively small variance, locally-disordered perturbations tend to produce smaller absolute response differences, since smoother kernels tend to focus on the average situation within their local receptive fields. That is, locally-disordered perturbations passing through different types of convolution kernels tend to produce different response differences, which can be accumulated in the subsequent layers.
**Local responses vs. locally-consistent perturbations.** Since adversarial perturbations generated by adversarially robust models tend to be locally consistent, we find that such perturbations are more destructive to smoother kernels than locally-disordered perturbations, as shown in Figure 4. Locally-consistent perturbations $x_d^0$ reaching the perturbation bound tend to produce larger absolute response differences in the first layer since they have a larger absolute average, whereas locally-disordered perturbations show smaller response differences due to a smaller absolute average.

Both kinds of local perturbations thus have different impacts on the local response differences between potential examples and natural examples, which hints at the fragile tendency of models and at the transferability of adversarial examples.
4.3 EMPIRICAL UNDERSTANDING OF LOCAL RESPONSES
Motivated by the observation that adversarially-trained models tend to show smoother kernels (Wang et al., 2020a), we first provide a further understanding of local responses and then give empirical results.

**Local responses vs. model smoothness.** Given a non-smooth convolution kernel $m^{l+1}$ with large variance, it is more likely to cause a large absolute response difference $|\Delta^{l+1}_{i,j}|$ within its local receptive field when $x_d^l$ is fixed. Considering that numerous non-smooth kernels and various local features are combined in the same layer, it is quite easy to yield large enough response differences. More complicated still, the subsequent layers act on the current local response differences and produce intricate accumulation effects. That is, a standard model with non-smooth kernels itself tends to amplify the local response differences between potential examples and natural examples. On the other hand, since an adversarially-trained model shows smoother kernels, the model tends to weaken the local response differences and narrow the gap between the final responses of potential examples and natural examples, indicating a trade-off between model robustness and accuracy.
**Setup.** We use the standard and adversarially-trained models obtained in Section 3. Considering the complex functional layers in DNNs, including linear and nonlinear ones, and based on Assumption 1, we use the maximum and the total absolute differences of local responses in several layers to show their effects. Taking ResNet-18 as an example, we investigate the first convolutional layer (Conv 1), the feature maps before layer 1 (Layer 0), and Layer 1 to Layer 4 respectively[1].
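A sketch of how such layer-wise statistics can be collected with forward hooks; the layer names follow the torchvision ResNet-18 attributes, which is an assumption about the backbone, and averaging the per-image statistics over a batch is one plausible reading of "per image" in Table 1.

```python
import torch

def response_differences(model, x_nat, x_adv,
                         layer_names=("conv1", "layer1", "layer2", "layer3", "layer4")):
    """Per-image max and total absolute differences of intermediate feature maps."""
    feats, hooks = {}, []
    modules = dict(model.named_modules())
    for name in layer_names:
        hooks.append(modules[name].register_forward_hook(
            lambda m, inp, out, name=name: feats.setdefault(name, []).append(out.detach())))
    with torch.no_grad():
        model(x_nat)   # first forward pass stores natural features
        model(x_adv)   # second forward pass stores adversarial features
    for h in hooks:
        h.remove()
    stats = {}
    for name, (f_nat, f_adv) in feats.items():
        diff = (f_adv - f_nat).abs()
        # Average the per-image maximum and total over the batch.
        stats[name] = (diff.amax(dim=(1, 2, 3)).mean().item(),
                       diff.sum(dim=(1, 2, 3)).mean().item())
    return stats
```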
To verify the above analysis, we first compare the local response differences between potential examples and natural examples on the standard and robust models. As Table 1 shows, the standard model exhibits significantly larger differences in local responses layer by layer, especially in terms of the total differences per image pair. These large response differences accumulate, driving the model's cognition of potential examples far away from that of their natural counterparts and ultimately making the model highly vulnerable. On the other hand, the adversarially-trained model significantly shortens the local response differences in the corresponding layers, yet some differences still remain. In other words, further reducing the local response differences is one route to bringing model robustness closer to model performance. We also note that, due to the complex effect of nonlinear layers (e.g., BatchNorm, ReLU), the maximum absolute difference of local responses does not increase strictly monotonically, but it increases in trend.
[1] Feature map sizes: Conv 1, Layer 0 and Layer 1: 64 × 32 × 32; Layer 2: 128 × 16 × 16; Layer 3: 256 × 8 × 8; Layer 4: 512 × 4 × 4.
Table 1: The absolute difference of local responses (per image) between test set images and their
potentially adversarial examples on different layers. Left/Right denote standard/robust model.
|Metric|Conv 1|Layer 0|Layer 1|Layer 2|Layer 3|Layer 4|
|---|---|---|---|---|---|---|
|Max|0.12/0.05|0.30/0.15|1.92/0.52|1.46/0.36|1.45/0.41|3.72/1.02|
|Total|370/556|443/489|4276/1246|2093/430|581/164|2624/526|
**Local responses vs. destructive effect.** Naturally, we wonder whether there is a clear difference between potential examples from successful and failed attacks on a given model. We first select the natural examples that are classified correctly, and collect their potential examples from successful and failed attacks. As Table 2 shows[1], for the standard model, the successful ones show differences similar to the failed ones in the front layers, but larger differences, especially in Layer 4 close to the outputs, leading to the final destructive effect. For the adversarially-trained model in Table 4 (Appendix C), however, the differences are similar even in Layer 4, possibly due to the smoother kernels and locally smoother perturbations, so that the final responses of both remain close to those of the natural examples.
Table 2: Left/Right denote successfully-attacked/failed adversarial examples on standard model.
|Metric|Conv 1|Layer 0|Layer 1|Layer 2|Layer 3|Layer 4|
|---|---|---|---|---|---|---|
|Max|0.16/0.15|0.379/0.378|2.50/2.51|1.38/1.37|1.16/1.08|2.11/1.38|
|Total|558/554|676/669|5955/6001|2363/2339|522/494|1481/757|
Table 3: Left/Right denote original/transferred adversarial examples on adversarially-trained model.
|Metric|Conv 1|Layer 0|Layer 1|Layer 2|Layer 3|Layer 4|
|---|---|---|---|---|---|---|
|Max|0.05/0.04|0.15/0.12|0.52/0.44|0.36/0.14|0.41/0.08|1.02/0.10|
|Total|556/321|489/302|1246/767|430/138|164/38|526/57|
**Local responses vs. transferability.** The model-dependent local perturbations suggest that adversarial examples may transfer with difficulty, so we further explore whether transferability can be understood from the locally intermediate response perspective. To verify the analysis in Section 4.2, we exchange the adversaries obtained from the standard and the robust model. As Table 3 shows, locally-disordered perturbations combined with a smoother adversarially-trained model exhibit fairly small local response differences layer by layer, keeping the model's cognition close enough to that of the natural examples and leading to 83.6% robustness, close to the 84.9% model performance. A similar situation occurs when locally-consistent perturbations are combined with a non-smooth standard model in Table 5 (Appendix C). These results indicate that the perturbations searched on a given model tend to amplify the response differences as much as possible so as to enlarge the gap in the model's cognition between potential examples and natural examples, and the weak transferability of adversarial examples may be due to transferred perturbations failing to enlarge this gap.
4.4 SMOOTHER KERNELS: ALLEVIATE LOCAL RESPONSE DIFFERENCES
To further exhibit the effect of shortening local response differences, we simply show that smoother adversarially robust models can alleviate local response differences and thereby improve their robustness. As shown in Figure 5, adversarially-trained models with different smoothness (i.e., larger weight decay parameters give kernels of smaller magnitude (Loshchilov & Hutter, 2019)) show different local response differences. Figures 5(b)-5(g) for the maximum differences and Figure 9 for the total differences (Appendix C) indicate that smoother kernels slightly weaken the local response differences between potential examples and natural examples layer by layer, and finally tend to narrow the gap between robustness and performance in Figure 5(h). To some extent, this explains why adversarially-trained models are more sensitive to weight decay (Pang et al., 2021). On the other hand, we find that further increasing the weight decay can hardly reduce the magnitude of the parameters and the local response differences in each layer any further, which may be one of the reasons for the current bottleneck in the robustness of adversarial training. Besides, increasing the weight decay can effectively weaken robust overfitting (Rice et al., 2020), as shown in Figure 5(h) and Figure 9(g).
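A sketch of the two quantities this analysis relies on, for a generic PyTorch model: the quadratic sum of parameters tracked in Figure 5(a), and a simple per-kernel variance used here as an illustrative proxy for kernel smoothness (the paper does not define a smoothness metric explicitly).

```python
import torch
import torch.nn as nn

def parameter_quadratic_sum(model):
    """Quadratic sum of all parameters, as tracked in Figure 5(a)."""
    return sum((p ** 2).sum().item() for p in model.parameters())

def mean_kernel_variance(model):
    """Average variance within each convolution kernel; flatter (smoother)
    kernels, encouraged by larger weight decay, give smaller values.
    This is an assumed proxy, not the paper's own measure."""
    variances = []
    for m in model.modules():
        if isinstance(m, nn.Conv2d):
            w = m.weight.detach()                  # (out, in, kH, kW)
            variances.append(w.flatten(2).var(dim=-1).mean().item())
    return sum(variances) / len(variances)
```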
[1] Note that under the PGD-20 attack the standard model hardly yields any failed attacks, so we use the FGSM attack to illustrate the point.
[Figure 5 panels, plotted against training epochs for weight decay 1e-4, 3e-4, 5e-4, 7e-4, 10e-4: (a) Parameters (quadratic sum), (b) Conv 1, (c) Layer 0, (d) Layer 1, (e) Layer 2, (f) Layer 3, (g) Layer 4 (maximum local response differences), and (h) Test Accuracy.]
Figure 5: The maximum local response differences on robust models with different smoothness. (a)
shows different model smoothness affected by weight decay, (b)-(g) denote the impact of weight
decay on the local response differences, and (h) shows the model robustness and performance.
4.5 DISCUSSION: LOCAL RESPONSES AND MODEL ROBUSTNESS
The above suggests that model robustness is related both to the model itself and to the properties of potential examples, and the two are combined through the local responses. Due to large enough differences in local responses, some potential examples are not regarded as similar to their natural examples (or their predictions differ from the correct predictions on the natural ones), so they eventually show their destructive effect as adversarial examples. That is, if a model tends to weaken the local response differences between potential and natural examples, then the model exhibits great robustness, as the final responses on the potential examples are more likely to be close to those on the natural ones.
On the other hand, although small differences in local responses make the network more inclined to treat both examples as the same category, the remaining differences, especially the intricate differences after multi-layer accumulation, emphasize the difficulty of bringing model robustness up to model performance and demonstrate that DNN models are naturally fragile. Essentially, these response differences come down to whether a non-zero convolution kernel acting on all legal perturbations within the boundary $\mathcal{B}$ yields differences that are all zero or sufficiently small without further accumulation. In other words, model robustness means that, given legal potential examples from any attack, the accumulated response differences can be alleviated or removed. For instance, feature denoising (Xie et al., 2019) and activation suppressing (Bai et al., 2021) can be viewed as shortening the local response differences between both examples to improve model robustness.
Besides, shortening local response differences does not directly imply an improvement in model robustness; in fact, it only expresses the closeness of the model's cognition of both examples. That is, a poorly-trained model may also treat both as relatively close, yet give bad performance. To further improve model robustness, a more realistic idea is to find proper parameters (if parameters that minimize the local response differences exist) or new methods (nonlinear layers, loss functions, model structures) to shorten the response differences as much as possible while not overly weakening the classification performance of the model.
5 CONCLUSION
In this paper, we investigate the local properties of adversarial examples generated by different models and rethink the vulnerability of DNNs from a novel locally intermediate response perspective. We find that the high-frequency components of adversarial examples tend to mislead standard DNNs but have little impact on adversarially-trained models. Furthermore, locally-disordered perturbations appear on standard models, while locally-consistent perturbations appear on adversarially-trained models. Both explorations emphasize the local perspective and the potential relationship between models and their adversarial examples, and we then explore how different local perturbations affect the models. We demonstrate that DNN models are naturally fragile, at least for large enough local response differences between potentially adversarial examples and natural examples, and empirically show that smoother adversarially-trained models can alleviate local response differences to improve robustness.
REFERENCES
Anish Athalye, Nicholas Carlini, and David Wagner. Obfuscated gradients give a false sense of
security: Circumventing defenses to adversarial examples. In ICML. 2018.
Yang Bai, Yuyuan Zeng, Yong Jiang, Shu-Tao Xia, Xingjun Ma, and Yisen Wang. Improving
adversarial robustness via channel-wise activation suppressing. In ICLR. 2021.
Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. In IEEE
_Symposium on Security and Privacy (SP). 2017._
Jinghui Chen and Quanquan Gu. RayS: A ray searching method for hard-label adversarial attack. In
_KDD. 2020._
Francesco Croce and Matthias Hein. Reliable evaluation of adversarial robustness with an ensemble
of diverse parameter-free attacks. In ICML. 2020.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep
bidirectional transformers for language understanding. In NAACL. 2019.
Alhussein Fawzi, Omar Fawzi, and Pascal Frossard. Analysis of classifiers’ robustness to adversarial
perturbations. In Machine Learning. 2018.
Mahyar Fazlyab, Alexander Robey, Hamed Hassani, Manfred Morari, and George J. Pappas. Efficient and accurate estimation of Lipschitz constants for deep neural networks. In NeurIPS. 2019.
Justin Gilmer, Luke Metz, Fartash Faghri, Samuel S. Schoenholz, Maithra Raghu, Martin Wattenberg, and Ian Goodfellow. Adversarial spheres. In arXiv preprint arXiv:1801.02774. 2018.
Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial
examples. In ICLR. 2015.
Paula Harder, Franz-Josef Pfreundt, Margret Keuper, and Janis Keuper. Spectraldefense: Detecting
adversarial attacks on cnns in the fourier domain. In arXiv preprint arXiv:2103.03000. 2021.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR. 2016.
Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Logan Engstrom, Brandon Tran, and Aleksander
Madry. Adversarial examples are not bugs, they are features. In NeurIPS. 2019.
Malhar Jere, Maghav Kumar, and Farinaz Koushanfar. A singular value perspective on model robustness. In arXiv preprint arXiv:2012.03516. 2020.
Alex Krizhevsky. Learning multiple layers of features from tiny images. 2009.
Alexey Kurakin, Ian Goodfellow, and Samy Bengio. Adversarial examples in the physical world. In
_ICLR. 2017._
Fangzhou Liao, Ming Liang, Yinpeng Dong, Tianyu Pang, Xiaolin Hu, and Jun Zhu. Defense against
adversarial attacks using high-level representation guided denoiser. In CVPR. 2018.
Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In ICLR. 2019.
Divyam Madaan and Sung Ju Hwang. Adversarial neural pruning with latent vulnerability suppression. In ICML. 2020.
Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu.
Towards deep learning models resistant to adversarial attacks. In ICLR. 2018.
Tianyu Pang, Xiao Yang, Yinpeng Dong, Hang Su, and Jun Zhu. Bag of tricks for adversarial
training. In ICLR. 2021.
Nicolas Papernot, Patrick McDaniel, Xi Wu, Somesh Jha, and Ananthram Swami. Distillation as
a defense to adversarial perturbations against deep neural networks. In IEEE Symposium on
_Security and Privacy (SP). 2016._
Sylvestre-Alvise Rebuffi, Sven Gowal, Dan A. Calian, Florian Stimberg, Olivia Wiles, and Timothy Mann. Fixing data augmentation to improve adversarial robustness. In arXiv preprint
_arXiv:2103.01946. 2021._
Jie Ren, Die Zhang, Yisen Wang, Lu Chen, Zhanpeng Zhou, Yiting Chen, Xu Cheng, Xin Wang,
Meng Zhou, Jie Shi, and Quanshi Zhang. Towards a unified game-theoretic view of adversarial
perturbations and robustness. In arXiv preprint arXiv:2103.07364. 2021.
Leslie Rice, Eric Wong, and Zico Kolter. Overfitting in adversarially robust deep learning. In ICML.
2020.
Ludwig Schmidt, Shibani Santurkar, Dimitris Tsipras, Kunal Talwar, and Aleksander Madry. Adversarially robust generalization requires more data. In NeurIPS. 2018.
Mahmood Sharif, Sruti Bhagavatula, Lujo Bauer, and Michael K. Reiter. Accessorize to a crime:
Real and stealthy attacks on state-of-the-art face recognition. In SIGSAC. 2016.
Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow,
and Rob Fergus. Intriguing properties of neural networks. In ICLR. 2014.
Thomas Tanay and Lewis Griffin. A boundary tilting perspective on the phenomenon of adversarial
examples. In arXiv preprint arXiv:1608.07690. 2016.
Dimitris Tsipras, Shibani Santurkar, Logan Engstrom, Alexander Turner, and Aleksander Madry.
Robustness may be at odds with accuracy. In ICLR. 2019.
Haohan Wang, Xindi Wu, Zeyi Huang, and Eric P Xing. High-frequency component helps explain
the generalization of convolutional neural networks. In CVPR. 2020a.
Yisen Wang, Difan Zou, Jinfeng Yi, James Bailey, Xingjun Ma, and Quanquan Gu. Improving
adversarial robustness requires revisiting misclassified examples. In ICLR. 2020b.
Tsui-Wei Weng, Huan Zhang, Pin-Yu Chen, Jinfeng Yi, Dong Su, Yupeng Gao, Cho-Jui Hsieh, and
Luca Daniel. Evaluating the robustness of neural networks: An extreme value theory approach.
In ICLR. 2018.
Cihang Xie, Yuxin Wu, Laurens van der Maaten, Alan Yuille, and Kaiming He. Feature denoising
for improving adversarial robustness. In CVPR. 2019.
Kaidi Xu, Sijia Liu, Gaoyuan Zhang, Mengshu Sun, Pu Zhao, Quanfu Fan, Chuang Gan, and Xue
Lin. Interpreting adversarial examples by activation promotion and suppression. In arXiv preprint
_arXiv:1904.02057. 2019._
Dong Yin, Raphael Gontijo Lopes, Jonathon Shlens, Ekin D. Cubuk, and Justin Gilmer. A fourier
perspective on model robustness in computer vision. In NeurIPS. 2019.
Hongyang Zhang, Yaodong Yu, Jiantao Jiao, Eric P Xing, Laurent El Ghaoui, and Michael I. Jordan.
Theoretically principled trade-off between robustness and accuracy. In ICML. 2019.
Tianyuan Zhang and Zhanxing Zhu. Interpreting adversarially trained convolutional neural networks. In ICML. 2018.
A THE DESTRUCTIVENESS OF ADVERSARIAL EXAMPLES ON THE FREQUENCY DOMAIN
[Figure 6 panels: test accuracy on filtered images versus frequency scale for natural vs. adversarial examples: (a) low-frequency components (STD), (b) low-frequency components (ADV).]
Figure 6: The destructiveness of only the high-frequency components of natural and adversarial examples on both standard (STD) and adversarially-trained (ADV) models. Shown above are well-trained models tested with images passed through a high-pass filter.
Here, we further investigate the contribution of only the high-frequency components to the destructiveness of both kinds of examples. We obtain the standard and adversarially-trained models as above. Figure 6 illustrates the trend of model performance and robustness on the test set as the low-frequency components are removed (a larger filtering scale means fewer low-frequency components are kept in the filtered images). As the filtering scale increases, only natural examples on standard models keep a certain classification performance, which then decreases to 10% (the accuracy of random classification). That is, the high-frequency components of natural examples can to some extent promote classification. On the other hand, the high-frequency components of adversarial examples on standard models show almost no promotion but a destructive effect, and finally also reach 10%. For robust models, the high-frequency components of both kinds of examples show similar and low performance.
B THE LOCAL PROPERTIES OF ADVERSARIAL EXAMPLES ON THE SPATIAL DOMAIN
We further explore the local properties of adversarial examples generated by the FGSM attack. For adversarially-trained models, compared with the PGD attack in Figure 3, the adversarial examples generated by the FGSM attack in Figure 7 show similar perturbations and similar model robustness (51.9% robustness for PGD-20 and 57.4% for FGSM), yet the latter produces less detailed perturbations. For standard models, though both attacks show locally-disordered perturbations, the latter exhibits less disordered perturbations since fewer perturbation values can be taken, i.e., +8/255, 0, and -8/255.
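For completeness, a minimal FGSM sketch under the same $\ell_\infty$ budget; a single gradient-sign step means each pixel's perturbation is essentially one of +8/255, 0, or -8/255 (0 where the gradient is exactly zero, before clipping to the valid image range).

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, eps=8/255):
    """Single-step l_inf attack: x' = clip(x + eps * sign(grad_x L), 0, 1)."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    grad = torch.autograd.grad(loss, x)[0]
    return (x + eps * grad.sign()).clamp(0, 1).detach()
```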
Figure 7: Visualisation of adversarial perturbations (FGSM) generated by Left: adversarially-trained
(ADV) model and Right: standard (STD) model.
Similar to the successfully attacking adversarial examples in Figure 3, the failed attacks in Figure 8 also show locally-consistent perturbations related to image shapes on the adversarially-trained model. The difference is that these examples show less misleading local perturbations, since the shape of their perturbations is closer to the shape of the original image.
Figure 8: Visualisation of adversarial perturbations (PGD) generated by Left: adversarially-trained
(ADV) model and Right: standard (STD) model. Shown above are all failed attacks on the ADV model.
C LOCAL RESPONSES
As mentioned in Section 4.3, we report the destructive effects on the adversarially-trained model in Table 4. We first select the natural examples that are classified correctly, and collect their potential examples from successful and failed attacks. Similar to the standard model, the successful ones show differences similar to the failed ones in the front layers, but larger differences in Layer 4 close to the outputs. The difference is that, due to the smoother kernels and locally smoother perturbations, the robust model exhibits closer local response differences in Layer 4 between the successful and the failed ones. Besides, compared with the PGD attack, the adversarial examples generated by the FGSM attack show similar differences in the front layers, but smaller local response differences in Layer 4, leading to higher model robustness.
Table 4: The absolute difference of local responses (per image) between test set images and their
potentially adversarial examples on different layers. Left/Right denote successfully-attacked/failed
adversarial examples on adversarially-trained model respectively.
|Metric|Conv 1|Layer 0|Layer 1|Layer 2|Layer 3|Layer 4|
|---|---|---|---|---|---|---|
|Max (PGD-20)|0.048/0.047|0.15/0.15|0.51/0.53|0.36/0.37|0.40/0.42|1.10/0.99|
|Total (PGD-20)|557/555|488/490|1264/1242|445/420|169/160|578/491|
|Max (FGSM)|0.048/0.047|0.16/0.15|0.53/0.54|0.35/0.36|0.37/0.38|0.90/0.86|
|Total (FGSM)|580/576|510/511|1292/1272|436/414|161/153|478/428|
We further report the transferability of adversarial examples on the standard model, as shown in Table 5. We use the potentially adversarial examples obtained from the robust model to attack the standard model. That is, locally-consistent perturbations, combined with a non-smooth standard model, exhibit small local response differences layer by layer, making the model's cognition close enough to that of the natural examples and leading to 78.4% robustness, close to the 94.4% model performance.
Table 5: The absolute difference of local responses (per image) between test set images and their potentially adversarial versions on different layers of ResNet-18. Left/Right denote original/transferred
adversarial examples on standard model respectively.
|Metric|Conv 1|Layer 0|Layer 1|Layer 2|Layer 3|Layer 4|
|---|---|---|---|---|---|---|
|Max|0.12/0.15|0.30/0.29|1.92/0.85|1.46/0.88|1.45/0.67|3.72/0.88|
|Total|370/532|443/509|4276/2639|2093/1157|581/280|2624/525|
As mentioned in Section 4.4, Figure 9 supplements Figure 5 from the perspective of the total absolute local response differences, and indicates that smaller local response differences likewise tend to bring model robustness closer to model performance.
[Figure 9 panels, plotted against training epochs for weight decay 1e-4, 3e-4, 5e-4, 7e-4, 10e-4: total differences at (a) Conv 1, (b) Layer 0, (c) Layer 1, (d) Layer 2, (e) Layer 3, (f) Layer 4, and (g) Robust Loss.]
Figure 9: The total absolute local response differences (per image) of several layers on adversarially-trained models with different smoothness. (a)-(f) denote the influence of different weight decay parameters on the local response differences in each layer, and (g) shows the robust loss on the training set and the test set.