# NETWORK INSENSITIVITY TO PARAMETER NOISE VIA ADVERSARIAL REGULARIZATION

**Julian Büchel**, IBM Research - Zurich; SynSense, Zürich, Switzerland; ETH Zürich, Switzerland. jbu@zurich.ibm.com
**Fynn Faber**, ETH Zürich, Switzerland. faberf@ethz.ch
**Dylan R. Muir**, SynSense, Zürich, Switzerland. dylan.muir@synsense.ai

ABSTRACT

Neuromorphic neural network processors, in the form of compute-in-memory crossbar arrays of memristors, or in the form of subthreshold analog and mixed-signal ASICs, promise enormous advantages in compute density and energy efficiency for NN-based ML tasks. However, these technologies are prone to computational non-idealities, due to process variation and intrinsic device physics. This degrades the task performance of networks deployed to the processor, by introducing parameter noise into the deployed model. While it is possible to calibrate each device, or train networks individually for each processor, these approaches are expensive and impractical for commercial deployment. Alternative methods are therefore needed to train networks that are inherently robust against parameter variation, as a consequence of network architecture and parameters. We present a new network training algorithm that attacks network parameters during training, and promotes robust performance during inference in the face of random parameter variation. Our approach introduces a loss regularization term that penalizes the susceptibility of a network to weight perturbation. We compare against previous approaches for producing parameter insensitivity such as dropout, weight smoothing and introducing parameter noise during training. We show that our approach produces models that are more robust to random mismatch-induced parameter variation as well as to targeted parameter variation. Our approach finds minima in flatter locations in the weight-loss landscape compared with other approaches, highlighting that the networks found by our technique are less sensitive to parameter perturbation. Our work provides an approach to deploy neural network architectures to inference devices that suffer from computational non-idealities, with minimal loss of performance. This method will enable deployment at scale to novel energy-efficient computational substrates, promoting cheaper and more prevalent edge inference.

1 INTRODUCTION

There is increasing interest in NN and ML inference on IoT and embedded devices, which imposes energy constraints due to small battery capacity and untethered operation. Existing edge inference solutions based on CPUs or vector processing engines such as GPUs or TPUs are improving in energy efficiency, but still entail considerable energy cost (Huang et al., 2009). Alternative compute architectures such as memristor crossbar arrays and mixed-signal event-driven neural network accelerators promise significantly reduced energy consumption for edge inference tasks. Novel non-volatile memory technologies such as resistive RAM and phase-change materials (Chen, 2016; Yu & Chen, 2016) promise increased memory density with multiple bits per memory cell, as well as compact compute-in-memory for NN inference tasks (Sebastian et al., 2020). Analog implementations of neurons and synapses, coupled with asynchronous digital routing fabrics, permit high sparsity in both network architecture and activity, thereby reducing energy costs associated with computation.
However, both of these novel compute fabrics introduce complexity in the form of computational non-idealities, which do not exist for pure synchronous digital solutions. Some novel memory technologies support several bits per memory cell, but with uncertainty about the precise value stored on each cycle (Le Gallo et al., 2018b; Wu et al., 2019). Others exhibit significant drift in stored states (Joshi et al., 2020). Inference processors based on analog and mixed-signal devices (Neckar et al., 2019; Moradi et al., 2018; Cassidy et al., 2016; Schemmel et al., 2010; Khaddam-Aljameh et al., 2022) exhibit parameter variation across the surface of a chip, and between chips, due to manufacturing process non-idealities. Collectively, these processes, known as "device mismatch", manifest as frozen parameter noise in weights and neuron parameters. In all cases the mismatch between configured and implemented network parameters degrades task performance by modifying the resulting mapping between input and output.

Existing solutions for deploying networks to inference devices that exhibit mismatch mostly focus on per-device calibration or re-training (Ambrogio et al., 2018; Bauer et al., 2019; Nandakumar et al., 2020a). However, this and other approaches, such as few-shot learning or meta-learning, entail significant per-device handling costs, making them unfit for commercial deployment.

We consider a network to be "robust" if its output to a given input does not change in the face of parameter perturbation. With this goal, network architectures that are intrinsically robust against device mismatch can be investigated (Thakur et al., 2018; Büchel et al., 2021). Another approach is to introduce parameter perturbations during training that promote robustness during inference, for example via random pruning (dropout) (Srivastava et al., 2014) or by injecting noise (Murray & Edwards, 1994). In this paper we introduce a novel solution, by applying adversarial training approaches to parameter mismatch. Most existing adversarial training methods attack the input space. Here we describe an adversarial attack during training that seeks the parameter perturbation that causes the maximum degradation in network response. In summary, we make the following contributions:

- We propose a novel algorithm for gradient-based supervised training of networks that are robust against parameter mismatch, by performing adversarial training in the weight space.
- We demonstrate that our algorithm flattens the weight-loss landscape and therefore leads to models that are inherently more robust to parameter noise.
- We show that our approach outperforms existing methods in terms of robustness.
- We validate our algorithm on a highly accurate Phase Change Memory (PCM)-based Compute-in-Memory (CiM) simulator and achieve new state-of-the-art results in terms of performance and performance retention over time.

2 RELATED WORK

Research to date has focused mainly on adversarial attacks in the input space. With an increasing number of adversarial attacks, an increasing number of schemes defending against those attacks have been proposed (Wang et al., 2020; Zhang et al., 2019; Madry et al., 2019; Moosavi-Dezfooli et al., 2018). In contrast, adversarial attacks in parameter space have received little attention.
Where parameter-space adversaries have been examined, it has been to enhance performance in semi-supervised learning (Cicek & Soatto, 2019), to improve robustness to input-space adversarial attacks (Wu et al., 2020), or to improve generalisation capability (Zheng et al., 2020). We define "robustness" to mean that the network output should change only minimally in the face of a parameter perturbation; in other words, the weight-loss landscape should be as flat as possible at a loss minimum. Other algorithms that promote flat loss landscapes may therefore also be useful to promote robustness to parameter perturbations.

**Dropout** (Srivastava et al., 2014) is a widely used method to reduce overfitting. During training, a random subset of units is chosen with some probability, and these units are pruned from the network for a single trial or batch. This results in the network learning to distribute its computation across many units, and acts as a regularization against overfitting.

**Entropy-SGD** (Chaudhari et al., 2019) is a network optimisation method that minimises the local entropy around a solution in parameter space. This results in a smoothed parameter-loss landscape that should penalize sharp minima.

**Adversarial Block Coordinate Descent (ABCD)** (Cicek & Soatto, 2019) was proposed in order to complement input-space smoothing with weight-space smoothing in semi-supervised learning. ABCD repeatedly picks half of the network weights and performs one step of gradient ascent on them, followed by applying gradient descent on the other half.

**Adversarial Weight Perturbation (AWP)** (Wu et al., 2020) was designed to improve the robustness of a network to adversarial attacks in the input space. The authors use Projected Gradient Ascent (PGA) on the network parameters to approximate a worst-case perturbation of the weights Θ′. PGA repeatedly computes the gradient of a loss function and updates the parameters in the direction of the (positive) gradient. After each update, the parameters are projected back onto a ball (e.g. in l_2) around the original parameters to ensure that a maximum distance is kept. Having identified an adversarial perturbation in the weight space, an adversarial perturbation in the input space is also found using PGA. Finally, the original weights Θ are updated using the gradient of the loss evaluated at the adversarial perturbation Θ′.

**Adversarial Model Perturbation (AMP)** (Zheng et al., 2020) improves the generalisation of conventional neural networks by optimizing a standard loss evaluated using parameters that were perturbed adversarially using PGA. Unlike our method, Zheng et al. (2020) did not formulate the loss function as a trade-off between performance and robustness. Furthermore, the presented algorithm, unlike our method, treats the perturbation ∆Θ to the parameters Θ as a constant during backpropagation.

**TRadeoff-inspired Adversarial DEfense via Surrogate-loss minimization (TRADES)** (Zhang et al., 2019) is a method for training networks that are robust against adversarial examples in the input space. The method consists of adding a boundary loss term to the loss function that measures how the network performance changes when the input is attacked. The boundary loss does not take the labels into account, so scaling it by a factor β_rob allows for a principled trade-off between the robustness and the accuracy of the network.
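To make the PGA-based weight perturbation used by AWP and AMP concrete, the following is a minimal PyTorch sketch of the inner maximization, written for this summary rather than taken from any of the cited implementations. The functional forward pass `model_fn`, the radius factor `gamma` and the step schedule are illustrative assumptions; the projection here uses a per-tensor l_2 ball around the original weights.

```python
import torch

def pga_weight_perturbation(model_fn, params, x, y, loss_fn,
                            gamma=0.1, steps=5, lr=0.01):
    """Approximate a worst-case perturbation d_l with ||d_l||_2 <= gamma * ||theta_l||_2."""
    deltas = [torch.zeros_like(p, requires_grad=True) for p in params]
    for _ in range(steps):
        perturbed = [p + d for p, d in zip(params, deltas)]
        loss = loss_fn(model_fn(x, perturbed), y)
        grads = torch.autograd.grad(loss, deltas)
        with torch.no_grad():
            for p, d, g in zip(params, deltas, grads):
                d += lr * g                                 # gradient ascent on the loss
                radius = gamma * p.norm()                   # ball radius scales with the weight norm
                d *= torch.clamp(radius / d.norm().clamp_min(1e-12), max=1.0)  # project onto the l2 ball
    return [d.detach() for d in deltas]
```

Our method, described in Section 3, differs in both the geometry of the attack space (a magnitude-scaled l_∞ box rather than a norm-scaled ball) and the attacked loss (a KL boundary loss rather than the task loss).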
**Noise injection during the forward pass** (Murray & Edwards, 1994) is a simple method for increasing network robustness to parameter noise. This method adds Gaussian noise to the network parameters during the forward pass and computes weight gradients with respect to the original parameters. This regularizes the gradient magnitudes of output units with respect to the weights, thus enforcing distributed information processing and insensitivity to parameter noise. We refer to this method as "Forward Noise".

A recent paper proposed a method for improving the resilience to random and targeted bit errors in SRAM cells on digital Deep Neural Network (DNN) accelerators (Stutz et al., 2021). By employing adversarial or random bit flips during training, the authors significantly improved the robustness to bit perturbations, enabling the accelerators to be operated below the conventional supply voltage.

3 METHODS

We use Θ to denote the set of parameters of a neural network f(x, Θ) that are trainable and susceptible to mismatch. The adversarial weights are denoted Θ∗, where Θ∗_t are the adversarial weights at the t-th iteration of PGA. We denote the PGA adversary as a function A that maps parameters Θ to attacking parameters Θ∗. We denote a mini-batch of training examples as X, with y being the corresponding ground-truth labels. Π_{E^p_ζ}(m) denotes the projection operator onto the ζ-ellipsoid in l^p space. The operator ⊙ denotes elementwise multiplication.

The effect of component mismatch on a network parameter can be modelled using a Gaussian distribution whose standard deviation depends on the parameter magnitude (Joshi et al., 2020; Büchel et al., 2021). In this paper we restrict ourselves to mismatch-driven perturbations in the network weights. For complex Spiking Neural Networks (SNNs), "network parameters" can refer to additional quantities such as neuronal and synaptic time constants or spiking thresholds. Our training approach described here can be equally applied to these additional parameters. We define the value of an individual parameter when deployed on a neuromorphic chip as

$$\Theta^{mismatch} \sim \mathcal{N}\left(\Theta, \mathrm{diag}(\zeta |\Theta|)\right) \tag{1}$$

where ζ governs the perturbation magnitude, referred to as the "mismatch level". The physics underlying the neuronal and synaptic circuits lead to a model where the amount of noise introduced into the system depends linearly on the magnitude of the parameters. If mismatch-induced perturbations had a constant standard deviation independent of weight values, one could use the weight-scale invariance of neural networks as a means to achieve robustness, by simply scaling up all network weights (see Figure S4). The linear dependence of weight magnitude and mismatch noise precludes this approach.

In contrast to adversarial attacks in the input space (Carlini & Wagner, 2016; Moosavi-Dezfooli et al., 2015; Madry et al., 2019; Goodfellow et al., 2015), our method relies on adversarial attacks in parameter space. During training, we approximate the worst-case perturbation of the network parameters using PGA and update the network parameters in order to mitigate these attacks. To trade off robustness and performance, we use a surrogate loss (Zhang et al., 2019) to capture the difference in output between the normal and attacked network. Algorithm 1 illustrates the training procedure in more detail.
**Algorithm 1: Adversarial training on network parameters.**

begin
  Θ∗_0 ← Θ + |Θ| ϵ ⊙ R,  with R ∼ N(0, 1)
  for t = 1 to N_steps do
    g ← ∇_{Θ∗_{t−1}} L_rob(Θ, Θ∗_{t−1}, X)
    v ← arg max_{‖v‖_p ≤ 1} vᵀ g
    Θ∗_t ← Π_{E^p_{ζ_attack}}(Θ∗_{t−1} + α ⊙ v)
  end
  Θ ← Θ − η ∇_Θ [ L_nat(Θ, X, y) + β_rob L_rob(Θ, Θ∗_{N_steps}, X) ]
end

In l_∞, v corresponds to sign(g) and the step size is α = (|Θ| ⊙ ζ_attack) / N_steps. Π_{E^p_{ζ_attack}}(m) denotes the projection operator onto the ζ_attack-ellipsoid in l^p space. In l_∞ this corresponds to min(max(m, Θ − ϵ), Θ + ϵ) with ϵ = ζ_attack ⊙ |Θ|. ζ_attack and β_rob are hyperparameters of our model.

Unlike adversarial training in the input space, where adversarial inputs can be seen as a form of data augmentation, adversarial training in the parameter space poses the following challenge: because the parameters that are attacked are the same parameters being optimized, performing gradient descent using the same loss that was used for PGA would simply revert the previous updates and no learning would occur. ABCD circumvents this problem by masking one half of the parameters in the adversarial loop and masking the other half during the gradient descent step. However, this limits the adversary in its power, and requires multiple iterations to be performed in order to update all parameters at least once. AWP approached this problem by assuming that the gradient of the loss with respect to the attacking parameters can be used to update the original parameters to favor minima in flatter locations in weight space. However, it is not clear whether this assumption always holds, since the gradient of the loss with respect to the attacking parameters is not necessarily the direction that would lead to a flatter region in the weight-loss landscape. We approach this problem slightly differently: similar to the TRADES algorithm (Zhang et al., 2019), our algorithm optimizes a natural (task) loss and a separate robustness loss:

$$\mathcal{L}_{gen}(\Theta, X, y) = \mathcal{L}_{nat}(\Theta, X, y) + \beta_{rob}\,\mathcal{L}_{rob}(\Theta, \mathcal{A}(\Theta), X)$$

Using a different loss for capturing the susceptibility of the network to adversarial attacks enables us to simultaneously optimise for performance and robustness, without PGA interfering with the gradient descent step. In our experiments, L_rob is defined as

$$\mathcal{L}_{rob}(\Theta, \Theta^*, X) = \mathrm{KL}\left(f(\Theta, X),\, f(\Theta^*, X)\right) \tag{2}$$

This formulation comes with a large computational overhead, since it requires computing the Jacobian J_{Θ∗}(Θ) of a complex recurrent relation between Θ and Θ∗. To make our algorithm more efficient we assume that the Jacobian is diagonal, meaning that Θ∗ = Θ + ∆Θ for some ∆Θ given by the adversary. In l_∞, the Jacobian can then be calculated efficiently using (see supplementary material for details):

$$\mathbf{J}_{\Theta^*}(\Theta) = \mathbb{I} + \mathrm{diag}\left[\mathrm{sign}(\Theta) \odot \left(\epsilon \cdot R + \frac{\zeta_{attack}}{N_{steps}} \odot \sum_{t=1}^{N_{steps}} \mathrm{sign}\left(\nabla_{\Theta^*_{t-1}} \mathcal{L}_{rob}(\Theta, \Theta^*_{t-1}, X)\right)\right)\right]$$

By making this assumption, our algorithm effectively multiplies the original training time by the number of PGA steps, similar to (Wu et al., 2020; Cicek & Soatto, 2019; Zheng et al., 2020).
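The following is a condensed PyTorch sketch of one training step of Algorithm 1 in the l_∞ case. It is our own illustrative re-implementation, not the released code: `model_fn` is assumed to be a functional forward pass returning log-probabilities, and for brevity the attacked parameters Θ∗ are detached in the outer step instead of applying the diagonal-Jacobian correction described above, which is a simplification.

```python
import torch
import torch.nn.functional as F

def kl_robustness_loss(logp_nat, logp_adv):
    # L_rob(Theta, Theta*, X) = KL( f(Theta, X), f(Theta*, X) ), cf. Eq. (2)
    p_nat = logp_nat.exp()
    return (p_nat * (logp_nat - logp_adv)).sum(dim=-1).mean()

def adversarial_training_step(model_fn, params, x, y, optimizer,
                              zeta_attack=0.1, beta_rob=0.25,
                              n_steps=10, eps_init=1e-3):
    logp_nat = model_fn(x, params)                              # clean forward pass f(Theta, X)

    # Theta*_0 = Theta + |Theta| * eps * R,  R ~ N(0, 1)
    theta_star = [(p + p.abs() * eps_init * torch.randn_like(p))
                  .detach().requires_grad_(True) for p in params]
    alpha = [zeta_attack * p.abs().detach() / n_steps for p in params]

    for _ in range(n_steps):                                    # PGA on the robustness loss
        loss_rob = kl_robustness_loss(logp_nat.detach(), model_fn(x, theta_star))
        grads = torch.autograd.grad(loss_rob, theta_star)
        with torch.no_grad():
            for p, ps, a, g in zip(params, theta_star, alpha, grads):
                ps += a * torch.sign(g)                         # l_inf ascent step: v = sign(g)
                eps = zeta_attack * p.abs()                     # box half-width zeta_attack * |Theta|
                ps.copy_(torch.max(torch.min(ps, p + eps), p - eps))  # projection onto the box

    # Theta <- Theta - eta * grad_Theta[ L_nat + beta_rob * L_rob ]
    logp_adv = model_fn(x, [ps.detach() for ps in theta_star])
    loss = F.nll_loss(logp_nat, y) + beta_rob * kl_robustness_loss(logp_nat, logp_adv)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)
```

In this simplified sketch, gradients for the outer update flow only through the clean forward pass; in the paper's full formulation they additionally flow through the diagonal Jacobian J_{Θ∗}(Θ) given above.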
Because component mismatch is independently proportional to the magnitude of each parameter, one has to model the space in which the adversary can search for a perturbation using an axis-aligned ellipsoid in l_2 and an axis-aligned box in l_∞. Using an ϵ-ball where the radius depends linearly on the individual parameter sets (Li et al., 2018; Cicek & Soatto, 2019; Wu et al., 2020) would either give the adversary too little or too much attack space. Projecting onto an axis-aligned ellipsoid in l_2 corresponds to solving the following optimization problem (Gabay & Mercier, 1976; Dai, 2006), which does not have a closed-form solution:

$$x^* = \arg\min_x \frac{1}{2}\|m - x\|^2 \quad \text{s.t.} \quad (x - c)^T W^{-2}(x - c) \le 1$$

where W = diag(|Θ| ⊙ ζ) + I · ζ_const, c = Θ and m = Θ∗ + α ⊙ v. Because of the computational overhead this would incur, we only consider the l_∞ case in our experiments.

4 RESULTS

The ultra-low power consumption of mixed-signal neuromorphic chips makes them suitable for edge applications, such as always-on voice detection (Cho et al., 2019), vibration monitoring (Gies et al., 2021) or always-on face recognition (Liu et al., 2019). For this reason, we consider the following network architectures in our experiments: a Long Short-term spiking recurrent Neural Network (LSNN) with roughly 65k trainable parameters; a conventional CNN with roughly 500k trainable parameters; and a Resnet32 architecture (He et al., 2015) (see Supplementary Material S1 for more information). We trained models to perform four different tasks:

- Speech command detection of 6 classes (Warden, 2018);
- ECG-anomaly detection on 4 classes (Bauer et al., 2019);
- Fashion-MNIST (F-MNIST): clothing-image classification on 10 classes (Xiao et al., 2017); and
- The Cifar10 colour image classification task (Krizhevsky, 2009).

We compared several training and attack methods, beginning with a standard Stochastic Gradient Descent (SGD) approach using the Adam optimizer (Kingma & Ba, 2015) ("Standard"). The learning rate varied by architecture, but was kept constant when comparing training methods on an architecture. We examined networks trained with dropout (Srivastava et al., 2014), AWP (Wu et al., 2020), AMP (Zheng et al., 2020), ABCD (Cicek & Soatto, 2019), and Entropy-SGD (Chaudhari et al., 2019). The adversarial perturbations used in AWP and ABCD were adapted to our mismatch model (i.e. magnitude-dependent in l_∞) unless stated otherwise. AMP was not adapted.

A dropout probability of 0.3 was used in the dropout models and γ in AWP was set to 0.1. When Gaussian noise was applied to the weights during the forward pass (Murray & Edwards, 1994), a relative standard deviation of 0.3 times the weight magnitude was used (η_train = 0.3). For Entropy-SGD, we set the number of inner iterations to 10 with a Langevin learning rate of 0.1. Because Entropy-SGD and ABCD have inner loops, the number of total epochs was reduced accordingly. All other models were trained for the same number of epochs (no early stopping) and the model with the highest validation accuracy was selected.

**Effectiveness of adversarial weight attack** We examined the strength of our adversarial weight attack during inference and training. Standard networks trained using gradient descent alone with no additional regularization (Fig. S5a, "Standard") were disrupted badly by our adversarial attack during inference (ζ = 0.1; final mean test accuracy 91.40% → 17.50%), and this was not ameliorated by further training. When our adversarial attack was implemented during training (Fig. S5a, β_rob = 0.1), the trained network was protected from disruption both during training and during inference (final test accuracy 91.97% → 78.41%).
Our adversarial attack degrades network performance significantly more than a random perturbation. Because our adversary uses PGA during the attack, it approximates a worst-case perturbation of the network within an ellipsoid around the nominal weights Θ. We compared the effect of our attack against a random weight perturbation (a random point on the ζ-ellipsoid) of equal magnitude.

[Figure 1: cross-entropy test loss vs. α for the F-MNIST CNN, ECG LSNN and Speech LSNN architectures; curves compare AWP, Beta, Forward Noise + Beta, Dropout, Forward Noise and Standard training.]

Figure 1: **Our training method flattens the test weight-loss landscape.** When moving away from the trained weight minimum (α = 0) in randomly-chosen directions, we find that our adversarial training method (Beta; Beta + Forward) finds deeper minima (for the F-MNIST and Speech tasks) at flatter locations in the cross-entropy test loss landscape. See text for further details, and Fig. S2 for visualisation over several random seeds.

For increasing perturbation size during inference (Fig. S5c; ζ), our adversarial attack disrupted the performance of the standard network significantly more than a random perturbation (test acc. 91.40% → 17.50% (attack) vs. 91.40% → 90.63% (random) for ζ = 0.1). When our adversarial attack was applied during training (Fig. S5d; β_rob = 0.1), the network was protected against both random and adversarial attacks for magnitudes up to ζ = 0.7 and ζ = 0.1, respectively.

**Flatness of the weight-loss landscape** Under our definition of robustness, the network output should change only minimally when the network parameters are perturbed. This corresponds to a loss surface that is close to flat in weight space. We measured the test weight-loss landscape for trained networks, compared over alternative training methods and for several architectures (Fig. 1). We examined only cross-entropy loss over the test set, and not the adversarial attack loss component (KL divergence loss; see Eq. 2). For each trained network, we chose a random vector v ∼ N(0, ζ|Θ|) and calculated L_cce(f(X_test, Θ + α · v), y_test) for many evenly-spaced α ∈ [−2, 2]. This process was repeated 5 times for ζ = 0.2, and the means are plotted in Fig. 1. Weight-loss landscapes for the individual trials are shown in Fig. S2.

Our adversarial training approach found minima of trained parameters Θ in flatter areas of the weight-loss landscape, compared with all other approaches examined (flatter curves in Fig. 1). In most cases our training approach also found deeper minima at lower categorical cross-entropy loss (L_cce), reflecting better task performance. These results are reflected in the better generalization performance of our approach (see Table 1). Not surprisingly, dropout and AWP also lead to flatter minima than the Standard network with no regularization. ABCD and Entropy-SGD were not included in Fig. 1 because they did not outperform the Standard model.
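A compact sketch of this 1D landscape probe is given below. It is our own illustration of the procedure just described; `model_fn` is again an assumed functional forward pass returning logits.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def loss_landscape_1d(model_fn, params, x_test, y_test, zeta=0.2, n_points=41):
    """Cross-entropy along a random direction v ~ N(0, zeta * |Theta|), evaluated at Theta + alpha * v."""
    alphas = torch.linspace(-2.0, 2.0, n_points)
    v = [zeta * p.abs() * torch.randn_like(p) for p in params]   # one random direction per tensor
    losses = []
    for a in alphas:
        shifted = [p + a * vi for p, vi in zip(params, v)]
        losses.append(F.cross_entropy(model_fn(x_test, shifted), y_test).item())
    return alphas, losses
```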
**Network robustness against parameter mismatch** We evaluated the ability of our training method to protect against simulated device mismatch. We introduced frozen parameter noise into models trained with adversarial attack, with noise modelled on that observed in neuromorphic processors (Joshi et al., 2020; Büchel et al., 2021). In these devices, the uncertainty associated with each weight parameter is approximately normally distributed around the nominal value, with a standard deviation that scales with the weight magnitude (Eq. 1). We measured test accuracy under simulated random mismatch for 100 samples across two model instances. A comparison of our method against standard training is shown in Fig. 2. For mismatch levels up to 70% (ζ = 0.7), our approach protected significantly against simulated mismatch for all three tasks examined (p < 2 × 10⁻⁸ in all cases; U test). A more detailed comparison between the different models is given in Table 1.

[Figure 2: test accuracy vs. mismatch level ζ ∈ {0.1, 0.2, 0.3, 0.5, 0.7} for Standard and our (Robust) networks; panels (a) Speech, (b) ECG, (c) F-MNIST, with example ground-truth and network outputs.]

Figure 2: **Adversarial attack during training protects networks against random mismatch-induced parameter noise.** Networks were evaluated for the Speech (a), ECG (b) and F-MNIST tasks (c), under increasing levels of simulated mismatch (ζ). Networks trained using standard SGD were disrupted by mismatch levels ζ > 0.1. At all mismatch levels our adversarial training approach performed significantly better in the presence of mismatch (higher test accuracy). Red boxes highlight misclassified examples.

**Network robustness against direct adversarial attack on task performance** Our training approach improves network robustness against mismatch parameter noise. We further evaluated the robustness of our trained networks against a parameter adversary that directly attacks task performance, by performing PGA on the cross-entropy loss L_cce. Note that this is separate from the adversary used in our training method, which attacks the boundary loss (Eq. 2). The AWP method uses the cross-entropy loss to find adversarial parameters during training. Nevertheless, we found that our method consistently outperforms all other compared methods for increasing attack magnitude ζ (Fig. 3). In networks trained with our method, the adversary needed to perform a considerably larger attack to significantly reduce performance (test accuracy < 70%). ABCD and Entropy-SGD were not included in the comparison because they did not outperform the standard network.

[Figure 3: test accuracy vs. attack size ζ for the F-MNIST CNN, ECG LSNN and Speech LSNN; curves compare Forward Noise + Beta, Beta, Standard, Forward Noise and AWP.]

Figure 3: **Our training method protects against task-adversarial attacks in parameter space.** Networks trained under several methods were attacked using a PGA adversary that directly attacked the task performance L_cce. Our adversarial training approach (bold; dashed) outperformed all other methods against this attack.
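For reference, the random-mismatch evaluation described above reduces to repeatedly sampling Eq. (1) and measuring test accuracy. A minimal sketch, with `model_fn` and the batch handling left abstract as assumptions:

```python
import torch

@torch.no_grad()
def mismatch_accuracy(model_fn, params, x_test, y_test, zeta, n_samples=100):
    """Test accuracy under Eq. (1): Theta_mismatch ~ N(Theta, diag(zeta * |Theta|))."""
    accs = []
    for _ in range(n_samples):
        noisy = [p + zeta * p.abs() * torch.randn_like(p) for p in params]
        pred = model_fn(x_test, noisy).argmax(dim=-1)
        accs.append((pred == y_test).float().mean().item())
    return accs

# e.g. {z: mismatch_accuracy(model_fn, params, x, y, z) for z in (0.1, 0.2, 0.3, 0.5, 0.7)}
```

The targeted attack evaluation follows the same pattern, but replaces the random draw with PGA on the cross-entropy loss, analogous to the training-time adversary sketched in Section 3.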
**Robustness against parameter drift for PCM-based CiM** CiM devices based on memristor technologies such as PCM promise to deliver energy- and space-efficient accelerators. With the increasing interest in CiM devices for accelerated inference and energy-efficient edge computing (Sebastian et al., 2020), the problem of deploying a model that is robust to noise originating from the device physics of PCM cells has gained in significance. We investigated the effect of our training method on the robustness of networks that are simulated to run on PCM-based CiM hardware (for details on the simulator, see SM 5).

Networks trained with our method outperform state-of-the-art networks deployed on PCM-based CiM. Currently, the method that has proven to yield the best performance on CiM hardware is training with noise injected into the network parameters during the forward pass. We adapted this method by adding our algorithm and show that we consistently outperform the conventional method (see Figure S14) for a wide range of hyperparameters, and even surpass the FP baseline (a model trained without noise injection and evaluated on a standard PC) for some configurations (see Figure 4).

[Figure 4: test accuracy vs. time after programming T_inf (s), up to roughly one year, for networks deployed on simulated PCM-based CiM hardware; one panel per attack magnitude ζ_attack ∈ {0.01, 0.03, 0.05, 0.10}, with curves for several β_rob values, the FP baseline, and a network trained with noise injection only (β_rob = 0.0, η_train = 0.110).]

Figure 4: **Networks trained with our method show overall better performance when deployed on PCM-based CiM hardware.** This figure shows the performance degradation, as a consequence of the PCM devices drifting over time (x-axis, up to one year), of networks deployed on CiM hardware. Each subplot shows networks trained with a different attacking magnitude (ζ_attack) that are trained with different values of β_rob. Each network is compared to the FP baseline and a network trained with Gaussian noise injection on the weights.

[Figure 5: test accuracy vs. training noise level η_train, one panel per ζ_attack ∈ {0.01, 0.03, 0.05, 0.10}; our method (several β_rob values) compared against Gaussian noise injection and the FP baseline, for the two noise models described in the caption.]

Figure 5: **Our method consistently yields networks that outperform training with Gaussian noise injection.** This figure compares the robustness to Gaussian noise at various levels (ζ) for networks trained with our method (blue) and networks trained with Gaussian noise injection (red), where the level of noise used during training (η_train) matches the noise used during inference. Each row represents a different type of noise: the first row models Gaussian noise with a standard deviation that is proportional to the largest absolute weight in the individual weight kernels (Joshi et al., 2020) and the second row follows the model presented in this paper (see Eq. 1).
Following experiments conducted in (Joshi et al., 2020), we used a Resnet32 (He et al., 2015) trained on Cifar10 (Krizhevsky, 2009). Injecting Gaussian noise into the weights during the forward pass yields strong improvements compared to the standard network. We show that by adding our method, we consistently improve this robustness by a significant amount (see Figure 5).

We furthermore improve the scalability of our method by using a pretrained model and fewer steps for the adversary. Our algorithm incurs an additional training time that scales linearly with the number of attack steps used by the adversary (note that we cache the necessary gradients for the Jacobian calculation). To alleviate this additional cost, we show that our method produces good results even for just a single adversarial step. Figure S13 shows the resulting performance when varying the number of attack steps used by the adversary. It should be noted that all results reported on PCM robustness were obtained using three adversarial steps and a pretrained model.

**Verifiable robustness for LSNNs** We investigated the provable robustness of LSNNs trained with our method, using abstract interpretation (Cousot & Cousot, 1977; Gehr et al., 2018; Mirman et al., 2018). In this analysis a function f(x, Θ) (in our case, a neural network with input x and parameters Θ) is over-approximated using an abstract domain. We specify the weights Θ in our network as an interval parameterised by the attack size ζ, spanning [Θ − ζ|Θ|, Θ + ζ|Θ|]. We examined the proportion of provably correctly classified test samples for the Speech and ECG tasks under a range of attack sizes ζ, comparing our approach against standard gradient descent and against training with forward-pass noise only (Fig. S9). We found that our approach is provably more correct over increasing attack size ζ (higher verified test accuracy).
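As an illustration of the idea (not the abstract-interpretation tooling used for these experiments), the following sketch propagates an input interval through a single linear layer whose weights lie in [Θ − ζ|Θ|, Θ + ζ|Θ|], using standard interval arithmetic in midpoint/radius form. Biases are treated as exact here, which is an assumption made for brevity; the LSNN state variables and thresholds would need the same treatment.

```python
import torch

def interval_linear(x_lo, x_hi, w, b, zeta):
    """Bounds on y = x @ w.T + b when each weight lies in [w - zeta*|w|, w + zeta*|w|]."""
    w_rad = zeta * w.abs()                                  # weight interval half-width
    x_mid, x_rad = (x_lo + x_hi) / 2, (x_hi - x_lo) / 2     # input interval centre and half-width
    y_mid = x_mid @ w.T + b
    # sound bound: |y - y_mid| <= |x_mid| @ w_rad.T + x_rad @ (|w| + w_rad).T
    y_rad = x_mid.abs() @ w_rad.T + x_rad @ (w.abs() + w_rad).T
    return y_mid - y_rad, y_mid + y_rad

# Monotone activations preserve intervals, e.g. relu([lo, hi]) = [relu(lo), relu(hi)].
```

A test sample is then provably correct if the lower bound of the true class's output exceeds the upper bounds of all other classes over the whole weight interval.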
5 DISCUSSION

We proposed a new training approach that includes adversarial attacks on the parameter space during training. Our proposed adversarial attack was significantly stronger than random weight perturbations at disrupting the performance of a trained network. Including the adversarial attack during training significantly protected the trained network from weight perturbations during inference. Our approach found minima in the weight-loss landscape that usually corresponded to lower loss values, and that were always in flatter regions of the loss landscape. This indicates that our approach found network solutions that are less sensitive to parameter variation, and therefore more robust. Our approach was more robust than several other methods for inducing robustness and good generalisation. To the best of our knowledge, our work represents the first example of interval bound propagation applied to SNNs, and the first application of parameter-space adversarial attacks to promote network robustness against device mismatch for mixed-signal compute.

Our experiments only considered the impact of weight perturbations, and did not examine the influence of uncertainty in other parameters of mixed-signal neuromorphic processors such as time constants or spiking thresholds. Our approach can be adapted to include adversarial attacks in the full network parameter space, increasing the robustness of spiking networks. The technique of interval bound propagation can also be applied to these additional network parameters. We did not quantize network parameters either during or after training in this work. On some platforms (Moradi et al., 2018) it is necessary to deploy quantized weights, and it is unclear how our adversarial attacks would interact with quantization during the training process. However, most PCM-based CiM hardware does not require quantization during training in order to reach good performance (Joshi et al., 2020).

Per-device training for a device with known calibrated parameter noise is likely to achieve the highest possible deployed performance on that single device. However, this approach has significant drawbacks. Firstly, each device must be either measured and calibrated accurately (not a trivial requirement) or trained with the device in the forward inference pass of the training loop. Secondly, training must be performed individually for each device, entailing significant logistical problems if the training is conducted in the factory or inside a consumer product. Thirdly, this approach will retain full sensitivity to parameter variation on the device.

Our method improves the performance of neural networks deployed to inference hardware that includes computational non-idealities. For example, NN processors with crossbar architectures based on novel memory devices such as RRAM and PCM (Sebastian et al., 2020) display uncertainty in stored memory values as well as conductance drift over time (Le Gallo et al., 2018b; Wu et al., 2019; Joshi et al., 2020). Our method could also address in-memory-computing-based NN processors based on SRAM and switched capacitors (Verma et al., 2019). Analog neurons and synapses in mixed-signal NN processors, for example SNN inference processors (Moradi et al., 2018), exhibit variation in weights and neuron parameters across a processor. We showed that our training approach finds network solutions that are insensitive to mismatch-induced parameter variation. Our networks can therefore be deployed to inference devices with computational non-idealities with only minimal reduction in task performance, and without requiring per-device calibration or model training. This reduction in per-device handling implies a considerable reduction in expense when deploying at commercial scale. Our method therefore brings low-power neuromorphic inference processors closer to commercial viability.

**Ethics statement** The authors declare no conflicts of interest.

**Reproducibility statement** Code to reproduce all experiments described in this work is provided at https://github.com/jubueche/BPTT-Lipschitzness and https://github.com/jubueche/Resnet32-ICLR.

**Acknowledgments** This work was partially supported by EU grants 826655 "TEMPO", 871371 "MEMSCALES" and 876925 "ANDANTE" to DRM. JB would also like to thank Manuel Le Gallo-Bourdeau, Irem Boybat and Abu Sebastian from IBM Research - Zurich, for insightful discussions and technical support.
Table 1: **Results of training multiple networks using several methods over three tasks.** Networks were evaluated under different levels of mismatch (ζ). Entries show test accuracy as mean ± std. (min.).

**CNN**

| Mismatch | Forward Noise, β_rob = 0.1 | β_rob = 0.25 | Standard | Forward Noise | AWP (ϵ_pga = 0.0) | Dropout | AMP (ϵ = 0.005) |
|---|---|---|---|---|---|---|---|
| Baseline (0.0) | 91.88 ± **0.00** (91.88) | 92.11 ± 0.22 (91.89) | 91.30 ± 0.15 (91.15) | 92.00 ± 0.02 (91.98) | 92.35 ± 0.12 (**92.23**) | **92.42** ± 0.24 (92.18) | 91.34 ± 0.12 (91.22) |
| 0.1 | 91.77 ± 0.12 (91.29) | 91.59 ± 0.26 (90.91) | 90.45 ± 0.37 (88.80) | 91.94 ± **0.09** (91.69) | **92.19** ± 0.16 (**91.78**) | 91.13 ± 0.85 (86.50) | 90.42 ± 0.35 (89.37) |
| 0.2 | 91.62 ± **0.16** (**91.19**) | 90.67 ± 0.42 (89.28) | 87.93 ± 1.03 (83.40) | **91.63** ± **0.16** (91.06) | 91.42 ± 0.44 (89.87) | 88.71 ± 1.55 (80.97) | 87.97 ± 0.88 (85.33) |
| 0.3 | **91.25** ± **0.22** (**90.36**) | 89.64 ± 0.65 (87.10) | 82.70 ± 2.45 (71.99) | 91.01 ± 0.26 (90.01) | 89.88 ± 0.95 (85.11) | 84.94 ± 3.06 (72.97) | 83.13 ± 2.13 (75.73) |
| 0.5 | **89.36** ± **0.73** (**86.84**) | 85.96 ± 1.87 (76.28) | 61.14 ± 6.92 (42.46) | 87.82 ± 0.88 (84.37) | 82.59 ± 3.66 (66.73) | 71.74 ± 5.84 (53.10) | 60.60 ± 7.36 (38.67) |
| 0.7 | **84.19** ± **2.63** (**74.25**) | 79.39 ± 4.40 (59.15) | 36.93 ± 7.73 (20.17) | 78.48 ± 2.95 (69.67) | 65.38 ± 8.27 (40.66) | 53.91 ± 9.03 (22.49) | 36.79 ± 7.42 (18.53) |

**ECG LSNN**

| Mismatch | Forward Noise, β_rob = 0.1 | β_rob = 0.25 | Standard | Forward Noise | AWP (ϵ_pga = 0.0) | Dropout | AMP (ϵ = 0.02) |
|---|---|---|---|---|---|---|---|
| Baseline (0.0) | 99.07 ± 0.34 (98.73) | 99.10 ± 0.07 (99.03) | 99.07 ± **0.04** (**99.03**) | 99.25 ± 0.37 (98.88) | 99.22 ± 0.19 (99.03) | 98.10 ± 0.11 (97.99) | **99.40** ± 0.00 (**99.40**) |
| 0.1 | 99.04 ± 0.27 (98.36) | 98.96 ± 0.24 (98.28) | 98.95 ± 0.27 (97.76) | **99.09** ± **0.19** (**98.66**) | 98.93 ± 0.31 (97.69) | 97.71 ± 0.34 (96.72) | 99.04 ± 0.38 (97.39) |
| 0.2 | 98.87 ± 0.35 (96.94) | 98.16 ± 0.71 (93.96) | 97.22 ± 1.46 (91.87) | **99.01** ± **0.26** (**98.06**) | 97.71 ± 0.97 (93.06) | 96.89 ± 0.87 (93.81) | 97.03 ± 1.52 (90.67) |
| 0.3 | 98.45 ± **0.45** (**96.49**) | 96.34 ± 1.95 (89.33) | 92.97 ± 4.38 (65.30) | **98.59** ± 0.52 (95.30) | 94.60 ± 2.93 (81.34) | 94.85 ± 2.47 (85.60) | 92.24 ± 3.92 (76.27) |
| 0.5 | **94.86** ± **2.69** (82.39) | 86.22 ± 6.66 (60.75) | 76.55 ± 9.61 (35.75) | 94.44 ± 2.86 (**82.84**) | 80.32 ± 8.08 (41.94) | 87.56 ± 6.05 (64.10) | 73.36 ± 11.27 (29.55) |
| 0.7 | **82.02** ± **8.40** (**50.22**) | 70.07 ± 11.16 (39.33) | 58.67 ± 11.44 (29.48) | 80.00 ± 8.70 (37.69) | 63.44 ± 9.61 (32.91) | 74.24 ± 12.58 (30.15) | 56.64 ± 11.02 (27.76) |

**Speech LSNN**

| Mismatch | Forward Noise, β_rob = 0.5 | β_rob = 0.5 | Standard | Forward Noise | AWP (ϵ_pga = 0.01) | Dropout | AMP (ϵ = 0.01) |
|---|---|---|---|---|---|---|---|
| Baseline (0.0) | 81.52 ± 0.05 (81.47) | **82.48** ± **0.00** (**82.48**) | 80.86 ± 0.03 (80.83) | 81.33 ± 0.14 (81.20) | 82.38 ± 0.14 (82.25) | 79.02 ± 0.36 (78.66) | 80.83 ± 0.44 (80.39) |
| 0.1 | 81.33 ± **0.24** (80.72) | 82.01 ± 0.31 (**81.03**) | 79.72 ± 0.45 (78.53) | 81.12 ± 0.32 (80.18) | **82.03** ± 0.37 (80.79) | 79.16 ± 0.44 (77.88) | 80.00 ± 0.55 (78.49) |
| 0.2 | 80.77 ± **0.35** (79.71) | **81.24** ± 0.43 (**79.98**) | 76.96 ± 1.11 (72.57) | 80.10 ± 0.53 (78.73) | 80.58 ± 0.65 (78.69) | 78.58 ± 0.67 (75.85) | 77.56 ± 0.87 (74.81) |
| 0.3 | 79.57 ± **0.62** (**77.98**) | **79.80** ± 0.67 (77.54) | 72.31 ± 2.00 (65.40) | 78.05 ± 0.88 (75.45) | 77.47 ± 1.35 (70.88) | 77.18 ± 0.96 (74.43) | 73.80 ± 1.61 (68.24) |
| 0.5 | 72.67 ± 2.75 (**63.85**) | **73.40** ± **2.32** (60.91) | 57.16 ± 4.25 (42.07) | 67.41 ± 3.93 (53.50) | 63.98 ± 4.28 (47.38) | 69.80 ± 3.13 (54.75) | 59.91 ± 4.49 (44.74) |
| 0.7 | 58.03 ± 6.57 (32.19) | **60.23** ± **4.66** (**45.49**) | 40.70 ± 5.72 (24.92) | 49.19 ± 6.98 (30.47) | 44.83 ± 6.65 (23.13) | 55.96 ± 5.74 (37.74) | 42.88 ± 6.83 (25.03) |

REFERENCES

Stefano Ambrogio, Pritish Narayanan, Hsinyu Tsai, Robert M. Shelby, Irem Boybat, Carmelo di Nolfo, Severin Sidler, Massimo Giordano, Martina Bodini, Nathan C. P. Farinha, Benjamin Killeen, Christina Cheng, Yassine Jaoudi, and Geoffrey W. Burr. Equivalent-accuracy accelerated neural-network training using analogue memory. Nature, 558(7708):60–67, June 2018. ISSN 1476-4687. doi: 10.1038/s41586-018-0180-5.
URL https://doi.org/10.1038/](https://doi.org/10.1038/s41586-018-0180-5) [s41586-018-0180-5.](https://doi.org/10.1038/s41586-018-0180-5) F. C. Bauer, D. R. Muir, and G. Indiveri. Real-time ultra-low power ecg anomaly detection using an event-driven neuromorphic processor. IEEE Transactions on Biomedical Circuits and Systems, 13 (6):1575–1582, 2019. doi: 10.1109/TBCAS.2019.2953001. Guillaume Bellec, Darjan Salaj, Anand Subramoney, Robert A. Legenstein, and Wolfgang Maass. Long short-term memory and learning-to-learn in networks of spiking neurons. CoRR, [abs/1803.09574, 2018. URL http://arxiv.org/abs/1803.09574.](http://arxiv.org/abs/1803.09574) Irem Boybat, Manuel Le Gallo, SR Nandakumar, Timoleon Moraitis, Thomas Parnell, Tomas Tuma, Bipin Rajendran, Yusuf Leblebici, Abu Sebastian, and Evangelos Eleftheriou. Neuromorphic computing with multi-memristive synapses. Nature communications, 9(1):2514, 2018. Julian Büchel, Dmitrii Zendrikov, Sergio Solinas, Giacomo Indiveri, and Dylan R. Muir. Supervised training of spiking neural networks for robust deployment on mixed-signal neuromorphic processors, 2021. Nicholas Carlini and David A. Wagner. Towards evaluating the robustness of neural networks. CoRR, [abs/1608.04644, 2016. URL http://arxiv.org/abs/1608.04644.](http://arxiv.org/abs/1608.04644) Andrew S. Cassidy, Jun Sawada, Paul Merolla, John V. Arthur, Rodrigo Alvarez-Icaza, Filipp Akopyan, Bryan L. Jackson, and Dharmendra S. Modha. Truenorth: A high-performance, lowpower neurosynaptic processor for multi-sensory perception, action, and cognition. 2016. Pratik Chaudhari, Anna Choromanska, Stefano Soatto, Yann LeCun, Carlo Baldassi, Christian Borgs, Jennifer Chayes, Levent Sagun, and Riccardo Zecchina. Entropy-SGD: biasing gradient descent into wide valleys. Journal of Statistical Mechanics: Theory and Experiment, 2019(12): [124018, dec 2019. doi: 10.1088/1742-5468/ab39d9. URL https://doi.org/10.1088%](https://doi.org/10.1088%2F1742-5468%2Fab39d9) [2F1742-5468%2Fab39d9.](https://doi.org/10.1088%2F1742-5468%2Fab39d9) An Chen. A review of emerging non-volatile memory (nvm) technologies and applications. _Solid-State Electronics, 125:25–38, 2016. ISSN 0038-1101. doi: https://doi.org/10.1016/j.sse._ 2016.07.006. [URL https://www.sciencedirect.com/science/article/pii/](https://www.sciencedirect.com/science/article/pii/S0038110116300867) [S0038110116300867. Extended papers selected from ESSDERC 2015.](https://www.sciencedirect.com/science/article/pii/S0038110116300867) Minchang Cho, Sechang Oh, Zhan Shi, Jongyup Lim, Yejoong Kim, Seokhyeon Jeong, Yu Chen, David Blaauw, Hun-Seok Kim, and Dennis Sylvester. 17.2 a 142nw voice and acoustic activity ----- detection chip for mm-scale sensor nodes using time-interleaved mixer-based frequency scanning. In 2019 IEEE International Solid- State Circuits Conference - (ISSCC), pp. 278–280, 2019. doi: 10.1109/ISSCC.2019.8662540. Safa Cicek and Stefano Soatto. Input and weight space smoothing for semi-supervised learning. In _2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), pp. 1344–1353,_ 2019. doi: 10.1109/ICCVW.2019.00170. Patrick Cousot and Radhia Cousot. Abstract interpretation: A unified lattice model for static analysis of programs by construction or approximation of fixpoints. In Proceedings of the 4th ACM _SIGACT-SIGPLAN Symposium on Principles of Programming Languages, POPL ’77, pp. 238–252,_ New York, NY, USA, 1977. Association for Computing Machinery. ISBN 9781450373500. doi: [10.1145/512950.512973. 
URL https://doi.org/10.1145/512950.512973.](https://doi.org/10.1145/512950.512973) Yu-Hong Dai. Fast algorithms for projection on an ellipsoid. SIAM Journal on Optimization, 16(4):986–1006, 2006. doi: 10.1137/040613305. [URL https://doi.org/10.1137/](https://doi.org/10.1137/040613305) [040613305.](https://doi.org/10.1137/040613305) Daniel Gabay and Bertrand Mercier. A dual algorithm for the solution of nonlinear variational problems via finite element approximation. Computers & Mathematics with Applications, 2(1): [17–40, 1976. ISSN 0898-1221. doi: https://doi.org/10.1016/0898-1221(76)90003-1. URL https:](https://www.sciencedirect.com/science/article/pii/0898122176900031) [//www.sciencedirect.com/science/article/pii/0898122176900031.](https://www.sciencedirect.com/science/article/pii/0898122176900031) Timon Gehr, Matthew Mirman, Dana Drachsler-Cohen, Petar Tsankov, Swarat Chaudhuri, and Martin Vechev. Ai2: Safety and robustness certification of neural networks with abstract interpretation. In _2018 IEEE Symposium on Security and Privacy (SP), pp. 3–18, 2018. doi: 10.1109/SP.2018.00058._ Valentin Gies, Sebastián Marzetti, Valentin Barchasz, Hervé Barthélemy, and Hervé Glotin. Ultra-low power embedded unsupervised learning smart sensor for industrial fault classificatio. In 2020 IEEE _International Conference on Internet of Things and Intelligence System (IoTaIS), pp. 181–187,_ 2021. doi: 10.1109/IoTaIS50849.2021.9359716. Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Yee Whye Teh and Mike Titterington (eds.), Proceedings of the Thirteenth _International Conference on Artificial Intelligence and Statistics, volume 9 of Proceedings of_ _Machine Learning Research, pp. 249–256, Chia Laguna Resort, Sardinia, Italy, 13–15 May 2010._ [JMLR Workshop and Conference Proceedings. URL http://proceedings.mlr.press/](http://proceedings.mlr.press/v9/glorot10a.html) [v9/glorot10a.html.](http://proceedings.mlr.press/v9/glorot10a.html) Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples, 2015. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image [recognition. CoRR, abs/1512.03385, 2015. URL http://arxiv.org/abs/1512.03385.](http://arxiv.org/abs/1512.03385) S. Huang, S. Xiao, and W. Feng. On the energy efficiency of graphics processing units for scientific computing. In 2009 IEEE International Symposium on Parallel Distributed Processing, pp. 1–8, 2009. doi: 10.1109/IPDPS.2009.5160980. Vinay Joshi, Manuel Le Gallo, Simon Haefeli, Irem Boybat, S. R. Nandakumar, Christophe Piveteau, Martino Dazzi, Bipin Rajendran, Abu Sebastian, and Evangelos Eleftheriou. Accurate deep neural network inference using computational phase-change memory. Nature Communications, [11(1):2473, May 2020. ISSN 2041-1723. doi: 10.1038/s41467-020-16108-9. URL https:](https://doi.org/10.1038/s41467-020-16108-9) [//doi.org/10.1038/s41467-020-16108-9.](https://doi.org/10.1038/s41467-020-16108-9) Riduan Khaddam-Aljameh, Milos Stanisavljevic, Jordi Fornt Mas, Geethan Karunaratne, Matthias Brändli, Feng Liu, Abhairaj Singh, Silvia M Müller, Urs Egger, Anastasios Petropoulos, et al. HERMES-core–a 1.59-TOPS/mm[2] PCM on 14-nm CMOS in-memory compute core using 300ps/LSB linearized CCO-based ADCs. IEEE Journal of Solid-State Circuits, 2022. Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. 
In Yoshua Bengio and Yann LeCun (eds.), 3rd International Conference on Learning Representations, ICLR _[2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. URL http:](http://arxiv.org/abs/1412.6980)_ [//arxiv.org/abs/1412.6980.](http://arxiv.org/abs/1412.6980) ----- Alex Krizhevsky. Learning multiple layers of features from tiny images. 2009. Manuel Le Gallo, Daniel Krebs, Federico Zipoli, Martin Salinga, and Abu Sebastian. Collective structural relaxation in phase-change memory devices. _Advanced Electronic Mate-_ _rials, 4(9):1700627, 2018a._ doi: https://doi.org/10.1002/aelm.201700627. [URL https:](https://onlinelibrary.wiley.com/doi/abs/10.1002/aelm.201700627) [//onlinelibrary.wiley.com/doi/abs/10.1002/aelm.201700627.](https://onlinelibrary.wiley.com/doi/abs/10.1002/aelm.201700627) Manuel Le Gallo, Abu Sebastian, Roland Mathis, Matteo Manica, Heiner Giefers, Tomas Tuma, Costas Bekas, Alessandro Curioni, and Evangelos Eleftheriou. Mixed-precision in-memory computing. Nature Electronics, 1(4):246–253, April 2018b. ISSN 2520-1131. doi: 10.1038/ [s41928-018-0054-8. URL https://doi.org/10.1038/s41928-018-0054-8.](https://doi.org/10.1038/s41928-018-0054-8) Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, and Tom Goldstein. Visualizing the loss landscape of neural nets. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 31. Curran As[sociates, Inc., 2018. URL https://proceedings.neurips.cc/paper/2018/file/](https://proceedings.neurips.cc/paper/2018/file/a41b3bb3e6b050b6c9067c67f663b915-Paper.pdf) [a41b3bb3e6b050b6c9067c67f663b915-Paper.pdf.](https://proceedings.neurips.cc/paper/2018/file/a41b3bb3e6b050b6c9067c67f663b915-Paper.pdf) Qian Liu, Ole Richter, Carsten Nielsen, Sadique Sheik, Giacomo Indiveri, and Ning Qiao. Live demonstration: Face recognition on an ultra-low power event-driven convolutional neural network asic. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops _(CVPRW), pp. 1680–1681, 2019. doi: 10.1109/CVPRW.2019.00213._ Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks, 2019. Matthew Mirman, Timon Gehr, and Martin T. Vechev. Differentiable abstract interpretation for provably robust neural networks. In Jennifer G. Dy and Andreas Krause (eds.), Proceedings of the _35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm,_ _Sweden, July 10-15, 2018, volume 80 of Proceedings of Machine Learning Research, pp. 3575–_ [3583. PMLR, 2018. URL http://proceedings.mlr.press/v80/mirman18b.html.](http://proceedings.mlr.press/v80/mirman18b.html) Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, and Pascal Frossard. Deepfool: a simple and [accurate method to fool deep neural networks. CoRR, abs/1511.04599, 2015. URL http:](http://arxiv.org/abs/1511.04599) [//arxiv.org/abs/1511.04599.](http://arxiv.org/abs/1511.04599) Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, Jonathan Uesato, and Pascal Frossard. Robustness via curvature regularization, and vice versa. _CoRR, abs/1811.09716, 2018._ URL [http://arxiv.org/abs/1811.09716.](http://arxiv.org/abs/1811.09716) S. Moradi, N. Qiao, F. Stefanini, and G. Indiveri. A scalable multicore architecture with heterogeneous memory structures for dynamic neuromorphic asynchronous processors (dynaps). 
_IEEE Transactions on Biomedical Circuits and Systems, 12(1):106–122, 2018._ doi: 10.1109/TBCAS.2017.2759700. A.F. Murray and P.J. Edwards. Enhanced mlp performance and fault tolerance resulting from synaptic weight noise during training. IEEE Transactions on Neural Networks, 5(5):792–802, 1994. doi: 10.1109/72.317730. S. R. Nandakumar, I. Boybat, V. Joshi, C. Piveteau, M. Le Gallo, B. Rajendran, A. Sebastian, and E. Eleftheriou. Phase-change memory models for deep learning training and inference. In 26th _IEEE International Conference on Electronics, Circuits and Systems (ICECS), pp. 727–730, 2019._ doi: 10.1109/ICECS46596.2019.8964852. S. R. Nandakumar, Manuel Le Gallo, Christophe Piveteau, Vinay Joshi, Giovanni Mariani, Irem Boybat, Geethan Karunaratne, Riduan Khaddam-Aljameh, Urs Egger, Anastasios Petropoulos, Theodore Antonakopoulos, Bipin Rajendran, Abu Sebastian, and Evangelos Eleftheriou. Mixedprecision deep learning based on computational memory. Frontiers in Neuroscience, 14:406, 2020a. [ISSN 1662-453X. doi: 10.3389/fnins.2020.00406. URL https://www.frontiersin.org/](https://www.frontiersin.org/article/10.3389/fnins.2020.00406) [article/10.3389/fnins.2020.00406.](https://www.frontiersin.org/article/10.3389/fnins.2020.00406) ----- SR Nandakumar, Irem Boybat, Jin-Ping Han, Stefano Ambrogio, Praneet Adusumilli, Robert L Bruce, Matthew BrightSky, Malte Rasch, Manuel Le Gallo, and Abu Sebastian. Precision of synaptic weights programmed in phase-change memory devices for deep learning inference. In 2020 IEEE _International Electron Devices Meeting (IEDM), pp. 29–4. IEEE, 2020b._ Alexander Neckar, Sam Fok, Ben Benjamin, Terrence Stewart, Aaron Voelker, Chris Eliasmith, Rajit Manohar, and Kwabena Boahen. Braindrop: A mixed-signal neuromorphic architecture with a dynamical systems-based programming model. Proceedings of the IEEE, 107:144–164, 01 2019. doi: 10.1109/JPROC.2018.2881432. Yurii Nesterov. A method for solving the convex programming problem with convergence rate O(1/k[2]). Proceedings of the USSR Academy of Sciences, 269:543–547, 1983. J. Schemmel, D. Brüderle, A. Grübl, M. Hock, K. Meier, and S. Millner. A wafer-scale neuromorphic hardware system for large-scale neural modeling. In 2010 IEEE International Symposium on _Circuits and Systems (ISCAS), pp. 1947–1950, 2010. doi: 10.1109/ISCAS.2010.5536970._ Abu Sebastian, Manuel Le Gallo, Riduan Khaddam-Aljameh, and Evangelos Eleftheriou. Memory devices and applications for in-memory computing. _Nature Nanotechnology, 15(7):_ [529–544, 2020. doi: 10.1038/s41565-020-0655-z. URL https://doi.org/10.1038/](https://doi.org/10.1038/s41565-020-0655-z) [s41565-020-0655-z.](https://doi.org/10.1038/s41565-020-0655-z) Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. _Journal of Machine_ _Learning Research, 15(56):1929–1958, 2014._ [URL http://jmlr.org/papers/v15/](http://jmlr.org/papers/v15/srivastava14a.html) [srivastava14a.html.](http://jmlr.org/papers/v15/srivastava14a.html) David Stutz, Nandhini Chandramoorthy, Matthias Hein, and Bernt Schiele. Random and adversarial bit error robustness: Energy-efficient and secure dnn accelerators. CoRR, abs/2104.08323, 2021. C. S. Thakur, R. Wang, T. J. Hamilton, R. Etienne-Cummings, J. Tapson, and A. van Schaik. An analogue neuromorphic co-processor that utilizes device mismatch for learning applications. 
IEEE Transactions on Circuits and Systems I: Regular Papers, 65(4):1174–1184, 2018.

Naveen Verma, Hongyang Jia, Hossein Valavi, Yinqi Tang, Murat Ozatay, Lung-Yen Chen, Bonan Zhang, and Peter Deaville. In-memory computing: Advances and prospects. IEEE Solid-State Circuits Magazine, 11(3):43–55, 2019. doi: 10.1109/MSSC.2019.2922889.

Yisen Wang, Difan Zou, Jinfeng Yi, James Bailey, Xingjun Ma, and Quanquan Gu. Improving adversarial robustness requires revisiting misclassified examples. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=rklOg6EFwS.

P. Warden. Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. ArXiv e-prints, April 2018. URL https://arxiv.org/abs/1804.03209.

Dongxian Wu, Yisen Wang, and Shutao Xia. Revisiting loss landscape for adversarial robustness. CoRR, abs/2004.05884, 2020. URL https://arxiv.org/abs/2004.05884.

Lei Wu, Hongxia Liu, Jiabin Li, Shulong Wang, and Xing Wang. A Multi-level Memristor Based on Al-Doped HfO2 Thin Film. Nanoscale Research Letters, 14(1):177, May 2019. ISSN 1556-276X. doi: 10.1186/s11671-019-3015-x. URL https://doi.org/10.1186/s11671-019-3015-x.

Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms, 2017.

Shimeng Yu and Pai-Yu Chen. Emerging memory technologies: Recent trends and prospects. IEEE Solid-State Circuits Magazine, 8(2):43–56, Spring 2016. ISSN 1943-0590. doi: 10.1109/MSSC.2016.2546199.

Hongyang Zhang, Yaodong Yu, Jiantao Jiao, Eric P. Xing, Laurent El Ghaoui, and Michael I. Jordan. Theoretically principled trade-off between robustness and accuracy. CoRR, abs/1901.08573, 2019. URL http://arxiv.org/abs/1901.08573.

Yaowei Zheng, Richong Zhang, and Yongyi Mao. Regularizing neural networks via adversarial model perturbation. CoRR, abs/2010.04925, 2020. URL https://arxiv.org/abs/2010.04925.

SUPPLEMENTARY MATERIAL

SPIKING RNN ARCHITECTURE

The dynamics of the spiking model (Bellec et al., 2018) can be summarised by the following set of discrete-time update equations (Eq. 3):

$$
\begin{aligned}
B^t &= b^0 + \beta b^t \\
o^t &= \mathbb{1}(V^t > B^t) \quad \text{(unless refractory, } t_{refr}) \\
b^{t+1} &= \rho_\beta b^t + (1 - \rho_\beta)\,\frac{o^t}{dt} \\
I^t_{Reset} &= \frac{o^t}{dt}\, B^t\, dt \\
V^{t+1} &= \rho_V V^t + (1 - \rho_V)\left(I_{in} W_{in} + \frac{o^t}{dt} W_{rec}\right) - I^t_{Reset}
\end{aligned}
\tag{3}
$$

where ρ_β = e^(−dt/τ_ada) and ρ_V = e^(−dt/τ). The variables B describe the spiking thresholds with spike-frequency adaptation. The vector o^t denotes the population spike train at time t. The membrane potentials V have a time constant τ and the adaptive threshold time constant is denoted τ_ada. The speech signals and ECG traces fed into the network are represented as currents I_in. Since the derivative of the spiking function with respect to its input is zero almost everywhere, we use a surrogate gradient that is explicitly defined as

$$\frac{\partial E}{\partial V^t} = \frac{\partial E}{\partial z^t}\frac{\partial z^t}{\partial V^t} = \frac{\partial E}{\partial z^t}\, d \cdot \max\left(1 - \left|\frac{V^t - B^t}{B^t}\right|,\; 0\right) \tag{4}$$

where E is the error and d is the dampening factor.
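The update equations above can be transcribed almost directly into code. The sketch below is our own illustrative PyTorch version of one time step with the surrogate gradient of Eq. (4) attached to the spike nonlinearity; the refractory mechanism is omitted, positive thresholds are assumed, and all tensor shapes are assumptions.

```python
import torch

class SpikeFn(torch.autograd.Function):
    """Heaviside spike with the pseudo-derivative of Eq. (4) in the backward pass."""
    @staticmethod
    def forward(ctx, v, b_thr, damp=0.3):
        ctx.save_for_backward(v, b_thr)
        ctx.damp = damp
        return (v > b_thr).float()

    @staticmethod
    def backward(ctx, grad_out):
        v, b_thr = ctx.saved_tensors
        pseudo = ctx.damp * torch.clamp(1.0 - ((v - b_thr) / b_thr).abs(), min=0.0)
        return grad_out * pseudo, None, None

def lsnn_step(v, b, i_in, w_in, w_rec, b0, beta, rho_v, rho_b, dt, damp=0.3):
    b_thr = b0 + beta * b                                       # B^t = b^0 + beta * b^t
    o = SpikeFn.apply(v, b_thr, damp)                           # o^t = 1(V^t > B^t)
    b_next = rho_b * b + (1.0 - rho_b) * o / dt                 # b^{t+1}
    i_reset = (o / dt) * b_thr * dt                             # I_Reset^t
    v_next = rho_v * v + (1.0 - rho_v) * (i_in @ w_in + (o / dt) @ w_rec) - i_reset
    return v_next, b_next, o
```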
To obtain the final prediction of the network, we average the population spike trains along the time axis to obtain z_avg and compute

$$
\mathbf{l} = \operatorname{softmax}\left(z_{\mathrm{avg}} W_{\mathrm{out}} + b_{\mathrm{out}}\right), \qquad \hat{y} = \underset{i}{\arg\max}\; l_i
$$

CNN ARCHITECTURE

Our architecture comprises two convolutional blocks (2 × [4 × 4, 64 channels, MaxPool, ReLU]), followed by three dense layers (N = 1600, 256, 64, ReLU) and a softmax layer. All weights and kernels are initialized using Glorot normal initialisation (Glorot & Bengio, 2010). With this architecture we achieved a test accuracy of approximately 93%. The attacked parameters for this network comprised all kernel weights as well as all dense-layer parameters.

DERIVATION OF JACOBIAN

Under the assumption that $\alpha = \frac{\zeta_{\mathrm{attack}} \odot |\Theta|}{N_{\mathrm{steps}}}$ and $p = \infty$, we can rewrite the inner loop of Algorithm 1 as

    begin
        Θ* ← Θ + |Θ| ϵ ⊙ R,   R ∼ N(0, 1)
        for t = 1 to N_steps do
            Θ*_t ← Θ*_{t−1} + α · sign( ∇_{Θ*_{t−1}} L_rob(Θ, Θ*_{t−1}, X) )
        end
    end

Rewriting Θ* in the form Θ* = Θ + ΔΘ gives

$$
\Theta^* = \Theta + |\Theta|\,\epsilon \odot R + \alpha \odot \sum_{t=1}^{N_{\mathrm{steps}}} \operatorname{sign}\!\left(\nabla_{\Theta^*_{t-1}} \mathcal{L}_{\mathrm{rob}}\!\left(f(\Theta, X),\, f(\Theta^*_{t-1}, X)\right)\right)
$$

From this the Jacobian can be computed directly. Plugging in the definition of α, we obtain

$$
\mathbf{J}_{\Theta^*}(\Theta) = \mathbb{I} + \operatorname{diag}\!\left[\operatorname{sign}(\Theta) \odot \left(\epsilon \cdot R + \frac{\zeta_{\mathrm{attack}}}{N_{\mathrm{steps}}} \odot \sum_{t=1}^{N_{\mathrm{steps}}} \operatorname{sign}\!\left(\nabla_{\Theta^*_{t-1}} \mathcal{L}_{\mathrm{rob}}\!\left(f(\Theta, X),\, f(\Theta^*_{t-1}, X)\right)\right)\right)\right]
$$

MISMATCH MODEL

To model the parameter noise introduced by component mismatch, we used a Gaussian distribution whose mean is the nominal noise-free parameter value and whose standard deviation depends linearly on the parameter value. This model realistically captures the behaviour of parameter mismatch on a mixed-signal neuromorphic SNN inference processor (Moradi et al., 2018). Fig. S1 shows parameter mismatch quantified directly on neuromorphic hardware, over a range of nominal parameter values and for several neuronal and synaptic parameters. The measured parameter variation follows an approximately Gaussian distribution whose standard deviation depends linearly on the mean.

Figure S1: Quantification of mismatch on analog neuromorphic hardware. Parameter values for several weight parameters and membrane time constants were measured for a range of nominal parameter values, using an oscilloscope directly connected to a mixed-signal neuromorphic SNN processor. (a–c) Various parameters of the chip follow a Gaussian distribution of increasing width. (d) The mismatch standard deviation depends linearly on the nominal value of each parameter.

WEIGHT LOSS-LANDSCAPE VISUALIZATION

We characterized the shape of the weight loss-landscape by plotting the categorical cross-entropy loss for varying levels of noise added to the weights of the trained network. In each trial, we picked a random vector v ∼ N(0, ζ|Θ|) and evaluated the categorical cross-entropy over the whole test set given the weights Θ + α · v, where α ∈ [−2, 2] and ζ = 0.2. As we show in Figure S2, the variance across the individual 1D weight loss-landscapes is small. Table S1 quantifies the flatness of the illustrated weight loss-landscapes: our method combined with adding noise during the forward pass yields the flattest landscapes.
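The following NumPy sketch illustrates the sampling scheme just described. It assumes a generic `loss_fn(weights)` callable that evaluates the test-set cross-entropy for a flat weight vector; the function name, the dummy loss in the usage example and all sizes are placeholders rather than the paper's code.

```python
import numpy as np

def sample_1d_loss_landscape(theta, loss_fn, zeta=0.2, n_alphas=21, n_trials=5, seed=0):
    """Estimate 1D cuts through the weight loss-landscape around theta.

    For each trial a random direction v ~ N(0, zeta * |theta|) is drawn and the
    loss is evaluated at theta + alpha * v for alpha in [-2, 2].
    """
    rng = np.random.default_rng(seed)
    alphas = np.linspace(-2.0, 2.0, n_alphas)
    landscapes = np.empty((n_trials, n_alphas))
    for trial in range(n_trials):
        v = rng.normal(loc=0.0, scale=zeta * np.abs(theta))   # magnitude-relative direction
        for i, alpha in enumerate(alphas):
            landscapes[trial, i] = loss_fn(theta + alpha * v)
    # Mean absolute difference between sample points divided by the sampling
    # distance, i.e. the flatness measure reported in Table S1
    d_alpha = alphas[1] - alphas[0]
    mean_abs_slope = np.mean(np.abs(np.diff(landscapes, axis=1)) / d_alpha)
    return alphas, landscapes, mean_abs_slope

# Dummy usage with a quadratic toy loss standing in for the test cross-entropy
theta = np.random.default_rng(1).normal(0, 0.5, size=1000)
alphas, curves, slope = sample_1d_loss_landscape(theta, lambda w: float(np.mean(w ** 2)))
```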
Table S1: Average slope of estimated 1D weight loss-landscapes. Slopes were calculated as the mean absolute difference between sample points, divided by the sampling distance.

| Method | F-MNIST CNN | ECG LSNN | Speech LSNN |
| --- | --- | --- | --- |
| Standard | 0.3375 | 0.1605 | 0.2702 |
| Beta | 0.0480 | 0.0879 | 0.0633 |
| Forward Noise | 0.0332 | **0.0180** | 0.1009 |
| Forward Noise + Beta | **0.0190** | 0.0187 | **0.0540** |
| Dropout | 0.3022 | 0.0952 | 0.0707 |
| AWP | 0.0841 | 0.1013 | 0.1389 |

Figure S2: **Illustration of the test weight loss-landscape, highlighting the individual trials used to measure the loss landscapes.** See the results text for more details.

ATTACKING KL-DIVERGENCE LOSS DURING INFERENCE

While the adversary in AWP attacks the task loss directly, i.e. max Lcce(f(Θ*, X), y), the adversary in our training algorithm attacks the KL divergence, max KL(f(Θ, X), f(Θ*, X)). This implies that our parameter attack simply seeks to change the response of the network in any way, and is agnostic to the task itself. In the main results we attack the cross-entropy task loss during inference, in order not to give an undue advantage to our training approach. Here we show that our training approach also provides robustness against parameter attacks on the KL-divergence loss during inference.

Fig. S3 shows the adversarial robustness of the methods for an adversary that attacks the KL divergence rather than the cross-entropy loss. When the network is attacked by maximizing the KL divergence between the normal and the attacked network, our adversarially-trained networks are more robust than AWP, standard SGD and forward noise. Because the parameter attack used here during inference is the same as the one used in the inner optimization loop of the training procedure, this result is expected; it serves as a sanity check that our networks indeed learn to defend against the attack they were trained against.

Figure S3: Robustness to a weight attack targeting the KL divergence during inference. When the network is attacked by maximizing the KL divergence between the normal and the attacked network, our adversarially-trained networks are more robust than standard SGD or AWP.

EFFECT OF CONSTANT VERSUS RELATIVE PARAMETER NOISE

As described in the main text, parameter noise of constant magnitude (for example, Gaussian noise with a fixed standard deviation) is trivial to protect against by increasing the weight magnitudes. We examined this effect by training MLPs with an adversary that employs Gaussian noise with a fixed standard deviation of ϵ = 0.2.
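To illustrate the difference between the two perturbation models, the snippet below draws constant-magnitude and magnitude-relative Gaussian weight noise for a toy weight vector. The values of ϵ and ζ follow the text; the weight vector itself is a placeholder.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.normal(0.0, 0.5, size=10_000)   # toy weight vector (placeholder)

eps, zeta = 0.2, 0.2

# Constant-magnitude noise: the std is fixed regardless of the weight values,
# so simply scaling the weights up makes the relative perturbation arbitrarily small.
theta_const = theta + rng.normal(0.0, eps, size=theta.shape)

# Magnitude-relative noise (the mismatch model used in the main text):
# the std scales with |theta|, so growing the weights does not buy robustness.
theta_rel = theta + rng.normal(0.0, zeta * np.abs(theta))

print("relative error, constant noise:",
      np.linalg.norm(theta_const - theta) / np.linalg.norm(theta))
print("relative error, relative noise:",
      np.linalg.norm(theta_rel - theta) / np.linalg.norm(theta))
```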
Figure S4 (right) illustrates the test accuracy of two MLPs trained on F-MNIST over the course of training. Using an inverse weight decay term, the weight magnitude of one network is forced to increase to 2.0 over the course of training (black crosses, Θ*). The weight magnitude of the other network is limited to 0.2 during training (red crosses, Θ). As the weight magnitude of the increasing-magnitude network Θ* grows, its robustness to Gaussian noise (ϵ = 0.2) also increases (blue), while the performance of the small-magnitude network Θ under the same noise remains poor (cyan).

Figure S4: Constant-magnitude versus magnitude-relative parameter noise. (left, middle) When parameter noise of constant magnitude (blue) is introduced by the adversarial attack during training, the networks learn to increase the magnitude of their weights to trivially improve robustness to the constant-magnitude attack. When the parameter attack is relative to each parameter magnitude, as in the main text (red), the weight magnitudes do not increase. These networks were trained for a range of βrob, i.e. with varying emphasis on robustness. (right) One can also trivially increase robustness to fixed-magnitude noise by introducing an inverse weight decay term that causes the weights to grow in magnitude (black, cross marker) while retaining performance. This causes the MLP trained on MNIST to become increasingly robust to fixed-magnitude parameter noise (blue dots; ϵ-test acc., Θ*). The network whose weight magnitude was not increased over time (red, cross marker) did not improve in robustness (cyan; ϵ-test acc., Θ), although both models perform similarly when no parameter noise is applied (red and black dots).

EFFECT OF ADVERSARIAL REGULARIZATION DURING TRAINING

Figure S5: **Our parameter attack is effective at decreasing the performance of a network during inference, and using our attack during training protects a network from later disruption.** (a) A network trained using standard SGD (i.e. βrob = 0.0) on the F-MNIST task is badly disrupted by the parameter-noise adversary during inference (ζ = 0.1; black curve). (b) When the same network is trained with parameter attacks (βrob = 0.1), it is protected from parameter attacks during inference (high accuracy of the attacked network; black curve). (c) When trained with standard SGD (βrob = 0.0), both random noise (red) and parameter attacks (black) disrupt network performance for increasing attack size ζ. (d) Under our training approach (βrob = 0.1), networks are significantly protected against both random and adversarial weight perturbations during inference.
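For reference, the perturbation referred to as "our parameter attack" in these figures is the relative, sign-based projected-gradient ascent in weight space written out in the Jacobian derivation above. Below is a minimal NumPy sketch of that inner loop; the toy loss, its analytic gradient and all sizes are illustrative placeholders, and in practice the gradient of L_rob is obtained by automatic differentiation.

```python
import numpy as np

def weight_attack(theta, grad_fn, zeta_attack=0.1, n_steps=5, eps=1e-2, seed=0):
    """Relative, sign-based projected-gradient attack in weight space.

    theta   : flat weight vector of the trained network.
    grad_fn : assumed callable returning dL_rob/dtheta* at the attacked weights.
    zeta_attack sets the total relative attack size; eps sets the random start.
    """
    rng = np.random.default_rng(seed)
    alpha = zeta_attack * np.abs(theta) / n_steps                    # per-parameter step size
    theta_star = theta + np.abs(theta) * eps * rng.normal(size=theta.shape)  # random start
    for _ in range(n_steps):
        theta_star = theta_star + alpha * np.sign(grad_fn(theta_star))       # ascend L_rob
    return theta_star

# Toy usage: the "robustness loss" is the squared change of a linear model's output
x = np.ones(10)
theta = np.linspace(-1.0, 1.0, 10)
L_rob = lambda th: float((x @ th - x @ theta) ** 2)
grad_fn = lambda th: 2.0 * (x @ th - x @ theta) * x   # analytic gradient of the toy loss
theta_adv = weight_attack(theta, grad_fn)
print("L_rob before/after attack:", L_rob(theta), L_rob(theta_adv))
```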
EFFECT OF VARYING βrob

We additionally quantify the trade-off between test loss and robustness by repeating the experiment of Figure 1 at different values of βrob. As shown in Figure S6, increasing βrob flattens the weight loss-landscape, effectively increasing the robustness of the model. To further substantiate this claim, we repeated the experiment of Figure S5 with the same values of βrob. As Figure S8 shows, increasing βrob, and therefore the flatness of the landscape, yields increased robustness to random as well as adversarial perturbations.

We note that, relative to the baseline, the test loss does not consistently increase with increasing values of βrob. We hypothesize that this is due to the increased generalization capability of our networks. To check whether this is indeed the case, we repeated the experiment of Fig. S6 using the loss computed on the training set. As Fig. S7 shows, increasing values of βrob lead to flatter minima and a higher loss on the training set, which is exactly the trade-off expected from the formulation of our loss function. However, the improved generalization that follows from a flatter loss-landscape appears to mask this trade-off on the test set. When choosing βrob, one should therefore aim for the highest value that still yields good performance on the validation set.

Figure S6: **Effect of βrob on the test loss landscape.** Increasing βrob promotes robustness by flattening the test loss landscape. However, increasing βrob does not lead to a systematic rise in loss on the test set.

Figure S7: Effect of βrob on the training loss landscape. Weight loss landscape computed on the training set (c.f. Fig. S6). Increasing βrob leads to flatter loss-landscapes (increased robustness) as before, but also to a systematic increase in training loss (higher cross-entropy values; note the ordering of the curves).
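To make the role of βrob concrete, the sketch below shows how a robustness term can be combined with the task loss, assuming (as described above) that the robustness loss is the KL divergence between the outputs of the nominal and the attacked network. The function names, the toy inputs and the exact combination shown here are illustrative only, not the training code used for these experiments.

```python
import numpy as np

def kl_div(p, q, eps=1e-12):
    """Mean KL divergence between two batches of categorical distributions."""
    p, q = np.clip(p, eps, 1.0), np.clip(q, eps, 1.0)
    return float(np.mean(np.sum(p * np.log(p / q), axis=-1)))

def robust_objective(probs_nominal, probs_attacked, labels, beta_rob=0.1, eps=1e-12):
    """Schematic combination of task loss and robustness regularizer.

    probs_nominal / probs_attacked: softmax outputs of f(Theta, X) and f(Theta*, X).
    labels: integer class labels. beta_rob trades task loss against robustness.
    """
    picked = probs_nominal[np.arange(len(labels)), labels]
    cce = float(np.mean(-np.log(np.clip(picked, eps, 1.0))))   # cross-entropy task loss
    l_rob = kl_div(probs_nominal, probs_attacked)              # robustness term
    return cce + beta_rob * l_rob

# Toy usage with random softmax outputs
rng = np.random.default_rng(0)
logits = rng.normal(size=(8, 10))
p = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)
q = np.exp(logits + 0.1 * rng.normal(size=logits.shape))
q = q / q.sum(-1, keepdims=True)
y = rng.integers(0, 10, size=8)
loss = robust_objective(p, q, y, beta_rob=0.1)
```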
We additionally trained the CNN on the F-MNIST data for a range of βrob, and measured the network robustness to attacks of varying magnitude ζ (Fig. S8). We compared the effectiveness of random versus adversarial attacks during inference, evaluated on the test set, and found that increasing βrob improved the robustness of the network to both random and adversarial attacks.

Figure S8: **Increasing βrob during training improves robustness during inference.** The CNN was trained using different values of βrob and evaluated on the test set after the weights were perturbed either randomly (red) or adversarially (black).

VERIFIABLE ROBUSTNESS FOR LSNNS

Computations through the network are performed on intervals rather than on discrete values. By propagating intervals through the network over a test set, we obtain output logits that are also expressed as intervals, and can therefore determine whether a sample will always be classified correctly. The final classification of the network is made using an arg max operator. For provability of network performance, we consider a test sample x to be correctly classified when the lower bound of the logit interval for the correct class is the maximum lower bound across all logit intervals, and when the logit interval for the correct class is disjoint from the other logit intervals. When the logit interval for the correct class overlaps with another logit interval, that test sample is not considered to be provably correctly classified. Note that interval domain analysis provides a relatively loose bound (Gehr et al., 2018), with the implication that the results here probably underestimate the true performance of our method.

Figure S9: **Networks trained with our method are provably more robust than those trained with standard gradient descent or forward noise alone.** We used interval bound propagation to determine the proportion of test samples that are verifiably correctly classified under increasing weight perturbations ζ. For both the Speech and ECG tasks, computed on the trained LSNNs, our method was provably more robust for ζ < 5 × 10⁻⁴.

WIDE-MARGIN NETWORK ACTIVATIONS

Murray and Edwards show that adding random noise to the weights of a network during the forward pass implicitly adds a regularizer of the form $\Theta_{i,j}^2 \left(\partial o_{k,l} / \partial \Theta_{i,j}\right)^2$ to the network weights (Murray & Edwards, 1994). When sigmoid activation functions are used, this regularizer favors high or low activations.

When implementing interval bound propagation for LSNNs, intervals over spiking activity must be computed by passing intervals through the spiking threshold function, as sketched below.
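The snippet below sketches this interval step for a single affine projection followed by the spike threshold, under an assumed elementwise weight perturbation of relative size ζ. It is a simplified illustration of interval bound propagation, not the verification code used to produce Fig. S9.

```python
import numpy as np

def interval_affine(x, W, zeta):
    """Bounds on y = x @ W when each weight may vary by ±zeta*|W| (input x fixed)."""
    W_lo, W_hi = W - zeta * np.abs(W), W + zeta * np.abs(W)
    # Split the input into positive and negative parts to obtain tight bounds
    x_pos, x_neg = np.clip(x, 0, None), np.clip(x, None, 0)
    y_lo = x_pos @ W_lo + x_neg @ W_hi
    y_hi = x_pos @ W_hi + x_neg @ W_lo
    return y_lo, y_hi

def interval_threshold(V_lo, V_hi, B):
    """Pass membrane-potential intervals through the spike threshold.

    Returns spike intervals: [0, 0] (never spikes), [1, 1] (always spikes),
    or [0, 1] (spiking becomes uncertain under the weight attack).
    """
    o_lo = (V_lo > B).astype(float)   # spikes even in the worst case
    o_hi = (V_hi > B).astype(float)   # spikes in the best case
    return o_lo, o_hi

# Toy usage
rng = np.random.default_rng(0)
x = rng.normal(size=16)
W = rng.normal(0, 0.3, size=(16, 8))
V_lo, V_hi = interval_affine(x, W, zeta=0.1)
o_lo, o_hi = interval_threshold(V_lo, V_hi, B=0.5)
```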
As a result, the intervals over spiking activity become either [0, 0] for neurons that never emit a spike regardless of the weight attack, [1, 1] for neurons that always emit a spike, or [0, 1] for neurons whose activity becomes uncertain in the presence of the weight attack. By definition, a robust network should promote the bounds [0, 0] and [1, 1], for which the activity of the network is unchanged by the weight attack. Robust network configurations should therefore avoid states in which the membrane potentials of neurons are close to the firing threshold.

To see whether this was also the case for our LSNNs, we investigated the distribution of membrane potentials on a batch of test examples for networks trained with and without noise during the forward pass. We found that robust networks exhibited a broader distribution of membrane potentials, with comparatively less probability mass close to the firing threshold (Fig. S10). This indicates that neurons in the robust network spend more time in a “safe” regime where a small change in the weights can neither trigger an unwanted spike nor remove a desired spike.

Figure S10: **Membrane potentials are distributed away from the firing threshold for networks trained with forward noise.** We measured the distribution of membrane potentials in LSNNs trained using standard gradient descent (“Standard”), and in the presence of weight noise injected in the forward pass (“Forward Noise + Beta”). As predicted, networks trained with forward noise have membrane potentials distributed away from the firing threshold. This implies that weight perturbations are less likely to erroneously inject or delete a spike, improving the robustness of the network.

EFFECT OF VARYING ϵPGA IN AWP

In addition to attacking the network parameters, AWP (Wu et al., 2020) also attacks the input using PGA. Since the relation between robustness to weight-space and input-space perturbations is still unclear, we performed additional sweeps over the attack size in the input space. Figure S11 demonstrates that attacking the input during training generally does not improve robustness to weight perturbations, with the exception of the network trained on the speech dataset. We also note that attacking the inputs improved robustness at larger mismatch values, but generally degraded performance at small values.

Figure S11: **Attacking the input using conventional PGA generally has a limited effect on robustness to weight-space perturbations.** We swept various values of ϵpga, the parameter that determines the maximum l∞ perturbation for the AWP algorithm.
TRAINING OF CNN USED FOR PCM-BASED CIM SIMULATION

The CNN used for this series of experiments is ResNet-32 (He et al., 2015) trained on the CIFAR-10 dataset (Krizhevsky, 2009). The CNN was trained for 300 epochs with a batch size of 256. We used SGD with an initial learning rate of 0.001, decreased by a multiplicative factor of 0.2 after epochs 60, 120 and 160. Additionally, we used Nesterov momentum (Nesterov, 1983) with a value of 0.9 and weight decay with a value of 5e-4.

Figure S12: The left panel illustrates the validation accuracy over the course of fine-tuning a pre-trained model for an additional 240 epochs. Note that convergence is usually reached considerably earlier (roughly after 150 additional epochs). The right panel illustrates the test-set performance of each model trained with noise injection of magnitude ηtrain (x-axis). Each row depicts a different noise model. Note that the weights in each filter were clipped to two standard deviations during training, to avoid outliers causing excessive noise under the noise model that scales with the maximum weight value.

VARYING THE NUMBER OF ATTACK STEPS FOR THE PCM-BASED CIM SIMULATION

Figure S13: **One can greatly reduce the number of attack steps used during training.** Our method still produces strong results for a very small number of attack steps (blue) when compared with the baseline model (red, trained with Gaussian noise of magnitude ηtrain).

PERFORMANCE OF VARYING HYPERPARAMETERS FOR THE PCM-BASED CIM SIMULATION

In this experiment we show that the choice of hyperparameters is generally not critical for outperforming the baseline trained with noise injection. However, in order to surpass the FP baseline, i.e. the model trained without noise injection and evaluated on a standard PC, the hyperparameters must be tuned to the combination that yields the highest performance. Figure S14 illustrates this sweep. Each row represents a different value of ηtrain used for training the baseline model (red). Each column represents a different attack size, and the different hues of blue correspond to varying values of βrob.
Figure S14: The choice of hyperparameters is not critical in order to beat the baseline model: our method is resilient to variations in its hyperparameters. However, to obtain configurations in which even the FP baseline is surpassed, the method has to be fine-tuned.

PCM NOISE MODEL

Analog CiM comes in various flavors, depending on the memory technology used. In this paper we assume the use of PCM devices, which have been studied extensively in the context of analog CiM accelerators (Joshi et al., 2020; Nandakumar et al., 2019; Boybat et al., 2018). PCM-based, or more generally Non-Volatile Memory (NVM)-based, architectures essentially perform Matrix-Vector Multiplications (MVMs) using Kirchhoff's current law. The weights of each matrix are organized as differential pairs in order to represent both positive and negative values. When storing a neural network, each weight matrix is programmed into the NVM devices by applying short electrical pulses (Nandakumar et al., 2020b). Because of various noise sources, this process is imprecise and introduces noise on the weights, termed “programming noise”. Additionally, PCM devices suffer from 1/f and telegraph noise, which adds further noise during inference (“read noise”). Finally, PCM devices also drift over time due to their underlying physical properties (Le Gallo et al., 2018a).
Although the effect of drift can largely be compensated by scaling the output of the MVM (a method called Global Drift Compensation (GDC)), the non-uniform drift of the devices still leads to performance degradation over time. In the simulator we used, we model these three main sources of noise, as well as analog-to-digital and digital-to-analog conversion, GDC, and the splitting of the MVM across smaller tiles (each crossbar is typically 256 × 256).

Initially, the clipped weights are mapped to target conductances in a differential manner, i.e. the weight matrix is split into two conductance matrices representing the positive and the negative weights, respectively. The target conductances typically range from zero to Gmax, where Gmax is assumed to be 25 µS. After mapping the weights to the target conductances GT, the programming noise is simulated (the statistical models below assume normalized conductances):

$$
G_P = G_T + \mathcal{N}(0, \sigma_P), \qquad \sigma_P = \max\left(-1.1731\, G_T^2 + 1.9650\, G_T + 0.2635,\; 0\right)
$$

After the conductances have been programmed, they drift over time, with the conductance of a device typically following

$$
G_D = G_P \left(t / t_c\right)^{-\nu}
$$

where ν is the drift coefficient, t is the time at inference, and t_c is the time at which the conductances were programmed. The drift coefficient is itself modelled as Gaussian-distributed across devices. This makes it hard to correct for drift exactly, and it is the main reason why drift is a problem in PCM-based CiM devices. Finally, the read noise is modelled as Gaussian, G_R ∼ N(G_D, σ_nG(t)), where

$$
\sigma_{nG}(t) = G_D(t)\, Q\, \sqrt{\log\!\left(\frac{t + t_r}{t_r}\right)}, \qquad Q = \min\left(\frac{0.0088}{G_T^{0.65}},\; 0.2\right)
$$

with t_r = 250 ns and t the time at inference. For the experiments in this paper, we simulated the performance of the networks deployed on CiM hardware for up to one year.
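As a compact reference, the following NumPy sketch combines the three noise sources above into a single weight-perturbation function. The weight-to-conductance mapping, the unit handling, the drift-coefficient statistics (mean 0.05, std 0.02), the programming reference time t_c, and the omission of ADC/DAC quantization, GDC and tile splitting are all simplifying assumptions; only the statistical fits are those quoted in the text.

```python
import numpy as np

G_MAX = 25.0  # maximum target conductance in uS, as assumed in the text

def pcm_noise_model(W, w_max, t_inference, t_c=25.0, t_r=250e-9, seed=0):
    """Simplified sketch of the PCM programming, drift and read noise described above.

    W is a weight matrix; |W| is mapped to target conductances in [0, G_MAX] uS.
    Sign handling via differential pairs, ADC/DAC, GDC and tile splitting are omitted.
    t_c and the drift-coefficient distribution are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)

    # Map |weights| to target conductances; the fits below use normalized g = G / G_MAX
    G_T = np.abs(W) / w_max * G_MAX
    g = G_T / G_MAX

    # 1) Programming noise (quadratic fit quoted in the text)
    sigma_P = np.maximum(-1.1731 * g**2 + 1.9650 * g + 0.2635, 0.0)
    G_P = G_T + rng.normal(size=G_T.shape) * sigma_P

    # 2) Conductance drift; drift coefficient nu drawn per device (assumed Gaussian)
    nu = np.clip(rng.normal(0.05, 0.02, size=G_T.shape), 0.0, None)
    G_D = G_P * (t_inference / t_c) ** (-nu)

    # 3) Read noise (1/f and random telegraph noise)
    Q = np.minimum(0.0088 / np.maximum(g, 1e-6) ** 0.65, 0.2)
    sigma_R = np.abs(G_D) * Q * np.sqrt(np.log((t_inference + t_r) / t_r))
    G_R = G_D + rng.normal(size=G_T.shape) * sigma_R

    # Map back to the weight domain, restoring the sign
    return np.sign(W) * G_R / G_MAX * w_max

# Toy usage: perturb a random weight matrix as if read out one year after programming
rng = np.random.default_rng(1)
W = rng.normal(0.0, 0.1, size=(256, 256))
W_noisy = pcm_noise_model(W, w_max=np.max(np.abs(W)), t_inference=365 * 24 * 3600.0)
```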