# NETWORK INSENSITIVITY TO PARAMETER NOISE VIA ADVERSARIAL REGULARIZATION

**Julian Büchel**, IBM Research - Zurich; SynSense, Zürich, Switzerland; ETH Zürich, Switzerland. jbu@zurich.ibm.com
**Fynn Faber**, ETH Zürich, Switzerland. faberf@ethz.ch
**Dylan R. Muir**, SynSense, Zürich, Switzerland. dylan.muir@synsense.ai

ABSTRACT

Neuromorphic neural network processors, in the form of compute-in-memory crossbar arrays of memristors, or in the form of subthreshold analog and mixed-signal ASICs, promise enormous advantages in compute density and energy efficiency for NN-based ML tasks. However, these technologies are prone to computational non-idealities, due to process variation and intrinsic device physics. This degrades the task performance of networks deployed to the processor, by introducing parameter noise into the deployed model. While it is possible to calibrate each device, or train networks individually for each processor, these approaches are expensive and impractical for commercial deployment. Alternative methods are therefore needed to train networks that are inherently robust against parameter variation, as a consequence of network architecture and parameters. We present a new network training algorithm that attacks network parameters during training, and promotes robust performance during inference in the face of random parameter variation. Our approach introduces a loss regularization term that penalizes the susceptibility of a network to weight perturbation. We compare against previous approaches for producing parameter insensitivity such as dropout, weight smoothing and introducing parameter noise during training. We show that our approach produces models that are more robust to random mismatch-induced parameter variation as well as to targeted parameter variation. Our approach finds minima in flatter locations in the weight-loss landscape compared with other approaches, highlighting that the networks found by our technique are less sensitive to parameter perturbation. Our work provides an approach to deploy neural network architectures to inference devices that suffer from computational non-idealities, with minimal loss of performance. This method will enable deployment at scale to novel energy-efficient computational substrates, promoting cheaper and more prevalent edge inference.

1 INTRODUCTION

There is increasing interest in NN and ML inference on IoT and embedded devices, which imposes energy constraints due to small battery capacity and untethered operation. Existing edge inference solutions based on CPUs or vector processing engines such as GPUs or TPUs are improving in energy efficiency, but still entail considerable energy cost (Huang et al., 2009). Alternative compute architectures such as memristor crossbar arrays and mixed-signal event-driven neural network accelerators promise significantly reduced energy consumption for edge inference tasks. Novel non-volatile memory technologies such as resistive RAM and phase-change materials (Chen, 2016; Yu & Chen, 2016) promise increased memory density with multiple bits per memory cell, as well as compact compute-in-memory for NN inference tasks (Sebastian et al., 2020). Analog implementations of neurons and synapses, coupled with asynchronous digital routing fabrics, permit high sparsity in both network architecture and activity, thereby reducing energy costs associated with computation.
However, both of these novel compute fabrics introduce complexity in the form of computational non-idealities, which do not exist for pure synchronous digital solutions. Some novel memory technologies support several bits per memory cell, but with uncertainty about the precise value stored on each cycle (Le Gallo et al., 2018b; Wu et al., 2019). Others exhibit significant drift in stored states (Joshi et al., 2020). Inference processors based on analog and mixed-signal devices (Neckar et al., 2019; Moradi et al., 2018; Cassidy et al., 2016; Schemmel et al., 2010; Khaddam-Aljameh et al., 2022) exhibit parameter variation across the surface of a chip, and between chips, due to manufacturing process non-idealities. Collectively, these processes, known as "device mismatch", manifest as frozen parameter noise in weights and neuron parameters. In all cases the mismatch between configured and implemented network parameters degrades task performance by modifying the resulting mapping between input and output.

Existing solutions for deploying networks to inference devices that exhibit mismatch mostly focus on per-device calibration or re-training (Ambrogio et al., 2018; Bauer et al., 2019; Nandakumar et al., 2020a). However, this and other approaches, such as few-shot learning or meta-learning, entail significant per-device handling costs, making them unfit for commercial deployment.

We consider a network to be "robust" if its output to a given input does not change in the face of parameter perturbation. With this goal, network architectures that are intrinsically robust against device mismatch can be investigated (Thakur et al., 2018; Büchel et al., 2021). Another approach is to introduce parameter perturbations during training that promote robustness during inference, for example via random pruning (dropout) (Srivastava et al., 2014) or by injecting noise (Murray & Edwards, 1994). In this paper we introduce a novel solution, by applying adversarial training approaches to parameter mismatch. Most existing adversarial training methods attack the input space. Here we describe an adversarial attack during training that seeks the parameter perturbation that causes the maximum degradation in network response. In summary, we make the following contributions:

- We propose a novel algorithm for gradient-based supervised training of networks that are robust against parameter mismatch, by performing adversarial training in the weight space.
- We demonstrate that our algorithm flattens the weight-loss landscape and therefore leads to models that are inherently more robust to parameter noise.
- We show that our approach outperforms existing methods in terms of robustness.
- We validate our algorithm on a highly accurate Phase Change Memory (PCM)-based Compute-in-Memory (CiM) simulator and achieve new state-of-the-art results in terms of performance and performance retention over time.

2 RELATED WORK

Research to date has focused mainly on adversarial attacks in the input space. With an increasing number of adversarial attacks, an increasing number of schemes defending against those attacks have been proposed (Wang et al., 2020; Zhang et al., 2019; Madry et al., 2019; Moosavi-Dezfooli et al., 2018). In contrast, adversarial attacks in parameter space have received little attention.
Where parameter-space adversaries have been examined, it has been to enhance performance in semi-supervised learning (Cicek & Soatto, 2019), to improve robustness to input-space adversarial attacks (Wu et al., 2020), or to improve generalisation capability (Zheng et al., 2020). We define "robustness" to mean that the network output should change only minimally in the face of a parameter perturbation; in other words, the weight-loss landscape should be as flat as possible at a loss minimum. Other algorithms that promote flat loss landscapes may therefore also be useful to promote robustness to parameter perturbations.

**Dropout** (Srivastava et al., 2014) is a widely used method to reduce overfitting. During training, a random subset of units is chosen with some probability, and these units are pruned from the network for a single trial or batch. This results in the network learning to distribute its computation across many units, and acts as a regularization against overfitting.

**Entropy-SGD** (Chaudhari et al., 2019) is a network optimisation method that minimises the local entropy around a solution in parameter space. This results in a smoothed parameter-loss landscape that should penalize sharp minima.

**Adversarial Block Coordinate Descent (ABCD)** (Cicek & Soatto, 2019) was proposed in order to complement input-space smoothing with weight-space smoothing in semi-supervised learning. ABCD repeatedly picks half of the network weights and performs one step of gradient ascent on them, followed by applying gradient descent on the other half.

**Adversarial Weight Perturbation (AWP)** (Wu et al., 2020) was designed to improve the robustness of a network to adversarial attacks in the input space. The authors use Projected Gradient Ascent (PGA) on the network parameters to approximate a worst-case perturbation of the weights Θ′. PGA repeatedly computes the gradient of a loss function and updates the parameters in the direction of the (positive) gradient. After each update, the parameters are projected back onto a ball (e.g. in l_2) around the original parameters to ensure that a maximum distance is kept. Having identified an adversarial perturbation in the weight space, an adversarial perturbation in the input space is also found using PGA. Finally, the original weights Θ are updated using the gradient of the loss evaluated at the adversarial perturbation Θ′.

**Adversarial Model Perturbation (AMP)** (Zheng et al., 2020) improves the generalisation of conventional neural networks by optimizing a standard loss evaluated using parameters that were perturbed adversarially using PGA. Unlike our method, Zheng et al. (2020) did not formulate the loss function as a trade-off between performance and robustness. Furthermore, the presented algorithm, unlike our method, treats the perturbation ∆Θ to the parameters Θ as a constant during backpropagation.

**TRadeoff-inspired Adversarial DEfense via Surrogate-loss minimization (TRADES)** (Zhang et al., 2019) is a method for training networks that are robust against adversarial examples in the input space. The method consists of adding a boundary loss term to the loss function that measures how the network performance changes when the input is attacked. The boundary loss does not take the labels into account, so scaling it by a factor β_rob allows for a principled trade-off between the robustness and the accuracy of the network.
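To make the PGA-based weight perturbation used by AWP and AMP concrete, the following is a minimal PyTorch sketch of the inner maximization, written for this summary rather than taken from any of the cited implementations. The functional forward pass `model_fn`, the radius factor `gamma` and the step schedule are illustrative assumptions; the projection here uses a per-tensor l_2 ball around the original weights.

```python
import torch

def pga_weight_perturbation(model_fn, params, x, y, loss_fn,
                            gamma=0.1, steps=5, lr=0.01):
    """Approximate a worst-case perturbation d_l with ||d_l||_2 <= gamma * ||theta_l||_2."""
    deltas = [torch.zeros_like(p, requires_grad=True) for p in params]
    for _ in range(steps):
        perturbed = [p + d for p, d in zip(params, deltas)]
        loss = loss_fn(model_fn(x, perturbed), y)
        grads = torch.autograd.grad(loss, deltas)
        with torch.no_grad():
            for p, d, g in zip(params, deltas, grads):
                d += lr * g                                 # gradient ascent on the loss
                radius = gamma * p.norm()                   # ball radius scales with the weight norm
                d *= torch.clamp(radius / d.norm().clamp_min(1e-12), max=1.0)  # project onto the l2 ball
    return [d.detach() for d in deltas]
```

Our method, described in Section 3, differs in both the geometry of the attack space (a magnitude-scaled l_∞ box rather than a norm-scaled ball) and the attacked loss (a KL boundary loss rather than the task loss).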
**Noise injection during the forward pass** (Murray & Edwards, 1994) is a simple method for increasing network robustness to parameter noise. This method adds Gaussian noise to the network parameters during the forward pass and computes weight gradients with respect to the original parameters. This regularizes the gradient magnitudes of output units with respect to the weights, thus enforcing distributed information processing and insensitivity to parameter noise. We refer to this method as "Forward Noise".

A recent paper proposed a method for improving the resilience to random and targeted bit errors in SRAM cells on digital Deep Neural Network (DNN) accelerators (Stutz et al., 2021). By employing adversarial or random bit flips during training, the authors significantly improved the robustness to bit perturbations, enabling the accelerators to be operated below the conventional supply voltage.

3 METHODS

We use Θ to denote the set of parameters of a neural network f(x, Θ) that are trainable and susceptible to mismatch. The adversarial weights are denoted Θ∗, where Θ∗_t are the adversarial weights at the t-th iteration of PGA. We denote the PGA adversary as a function A that maps parameters Θ to attacking parameters Θ∗. We denote a mini-batch of training examples as X, with y being the corresponding ground-truth labels. Π_{E^p_ζ}(m) denotes the projection operator onto the ζ-ellipsoid in l^p space. The operator ⊙ denotes elementwise multiplication.

The effect of component mismatch on a network parameter can be modelled using a Gaussian distribution whose standard deviation depends on the parameter magnitude (Joshi et al., 2020; Büchel et al., 2021). In this paper we restrict ourselves to mismatch-driven perturbations in the network weights. For complex Spiking Neural Networks (SNNs), "network parameters" can refer to additional quantities such as neuronal and synaptic time constants or spiking thresholds. Our training approach described here can be equally applied to these additional parameters. We define the value of an individual parameter when deployed on a neuromorphic chip as

$$\Theta^{mismatch} \sim \mathcal{N}\left(\Theta, \mathrm{diag}(\zeta |\Theta|)\right) \tag{1}$$

where ζ governs the perturbation magnitude, referred to as the "mismatch level". The physics underlying the neuronal and synaptic circuits lead to a model where the amount of noise introduced into the system depends linearly on the magnitude of the parameters. If mismatch-induced perturbations had a constant standard deviation independent of weight values, one could use the weight-scale invariance of neural networks as a means to achieve robustness, by simply scaling up all network weights (see Figure S4). The linear dependence of weight magnitude and mismatch noise precludes this approach.

In contrast to adversarial attacks in the input space (Carlini & Wagner, 2016; Moosavi-Dezfooli et al., 2015; Madry et al., 2019; Goodfellow et al., 2015), our method relies on adversarial attacks in parameter space. During training, we approximate the worst-case perturbation of the network parameters using PGA and update the network parameters in order to mitigate these attacks. To trade off robustness and performance, we use a surrogate loss (Zhang et al., 2019) to capture the difference in output between the normal and attacked network. Algorithm 1 illustrates the training procedure in more detail.
**Algorithm 1: Adversarial training on network parameters.**

begin
  Θ∗_0 ← Θ + |Θ| ϵ ⊙ R,  with R ∼ N(0, 1)
  for t = 1 to N_steps do
    g ← ∇_{Θ∗_{t−1}} L_rob(Θ, Θ∗_{t−1}, X)
    v ← arg max_{‖v‖_p ≤ 1} vᵀ g
    Θ∗_t ← Π_{E^p_{ζ_attack}}(Θ∗_{t−1} + α ⊙ v)
  end
  Θ ← Θ − η ∇_Θ [ L_nat(Θ, X, y) + β_rob L_rob(Θ, Θ∗_{N_steps}, X) ]
end

In l_∞, v corresponds to sign(g) and the step size is α = (|Θ| ⊙ ζ_attack) / N_steps. Π_{E^p_{ζ_attack}}(m) denotes the projection operator onto the ζ_attack-ellipsoid in l^p space. In l_∞ this corresponds to min(max(m, Θ − ϵ), Θ + ϵ) with ϵ = ζ_attack ⊙ |Θ|. ζ_attack and β_rob are hyperparameters of our model.

Unlike adversarial training in the input space, where adversarial inputs can be seen as a form of data augmentation, adversarial training in the parameter space poses the following challenge: because the parameters that are attacked are the same parameters being optimized, performing gradient descent using the same loss that was used for PGA would simply revert the previous updates and no learning would occur. ABCD circumvents this problem by masking one half of the parameters in the adversarial loop and masking the other half during the gradient descent step. However, this limits the adversary in its power, and requires multiple iterations to be performed in order to update all parameters at least once. AWP approached this problem by assuming that the gradient of the loss with respect to the attacking parameters can be used to update the original parameters to favor minima in flatter locations in weight space. However, it is not clear whether this assumption always holds, since the gradient of the loss with respect to the attacking parameters is not necessarily the direction that would lead to a flatter region in the weight-loss landscape. We approach this problem slightly differently: similar to the TRADES algorithm (Zhang et al., 2019), our algorithm optimizes a natural (task) loss and a separate robustness loss:

$$\mathcal{L}_{gen}(\Theta, X, y) = \mathcal{L}_{nat}(\Theta, X, y) + \beta_{rob}\,\mathcal{L}_{rob}(\Theta, \mathcal{A}(\Theta), X)$$

Using a different loss for capturing the susceptibility of the network to adversarial attacks enables us to simultaneously optimise for performance and robustness, without PGA interfering with the gradient descent step. In our experiments, L_rob is defined as

$$\mathcal{L}_{rob}(\Theta, \Theta^*, X) = \mathrm{KL}\left(f(\Theta, X),\, f(\Theta^*, X)\right) \tag{2}$$

This formulation comes with a large computational overhead, since it requires computing the Jacobian J_{Θ∗}(Θ) of a complex recurrent relation between Θ and Θ∗. To make our algorithm more efficient we assume that the Jacobian is diagonal, meaning that Θ∗ = Θ + ∆Θ for some ∆Θ given by the adversary. In l_∞, the Jacobian can then be calculated efficiently using (see supplementary material for details):

$$\mathbf{J}_{\Theta^*}(\Theta) = \mathbb{I} + \mathrm{diag}\left[\mathrm{sign}(\Theta) \odot \left(\epsilon \cdot R + \frac{\zeta_{attack}}{N_{steps}} \odot \sum_{t=1}^{N_{steps}} \mathrm{sign}\left(\nabla_{\Theta^*_{t-1}} \mathcal{L}_{rob}(\Theta, \Theta^*_{t-1}, X)\right)\right)\right]$$

By making this assumption, our algorithm effectively multiplies the original training time by the number of PGA steps, similar to (Wu et al., 2020; Cicek & Soatto, 2019; Zheng et al., 2020).
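The following is a condensed PyTorch sketch of one training step of Algorithm 1 in the l_∞ case. It is our own illustrative re-implementation, not the released code: `model_fn` is assumed to be a functional forward pass returning log-probabilities, and for brevity the attacked parameters Θ∗ are detached in the outer step instead of applying the diagonal-Jacobian correction described above, which is a simplification.

```python
import torch
import torch.nn.functional as F

def kl_robustness_loss(logp_nat, logp_adv):
    # L_rob(Theta, Theta*, X) = KL( f(Theta, X), f(Theta*, X) ), cf. Eq. (2)
    p_nat = logp_nat.exp()
    return (p_nat * (logp_nat - logp_adv)).sum(dim=-1).mean()

def adversarial_training_step(model_fn, params, x, y, optimizer,
                              zeta_attack=0.1, beta_rob=0.25,
                              n_steps=10, eps_init=1e-3):
    logp_nat = model_fn(x, params)                              # clean forward pass f(Theta, X)

    # Theta*_0 = Theta + |Theta| * eps * R,  R ~ N(0, 1)
    theta_star = [(p + p.abs() * eps_init * torch.randn_like(p))
                  .detach().requires_grad_(True) for p in params]
    alpha = [zeta_attack * p.abs().detach() / n_steps for p in params]

    for _ in range(n_steps):                                    # PGA on the robustness loss
        loss_rob = kl_robustness_loss(logp_nat.detach(), model_fn(x, theta_star))
        grads = torch.autograd.grad(loss_rob, theta_star)
        with torch.no_grad():
            for p, ps, a, g in zip(params, theta_star, alpha, grads):
                ps += a * torch.sign(g)                         # l_inf ascent step: v = sign(g)
                eps = zeta_attack * p.abs()                     # box half-width zeta_attack * |Theta|
                ps.copy_(torch.max(torch.min(ps, p + eps), p - eps))  # projection onto the box

    # Theta <- Theta - eta * grad_Theta[ L_nat + beta_rob * L_rob ]
    logp_adv = model_fn(x, [ps.detach() for ps in theta_star])
    loss = F.nll_loss(logp_nat, y) + beta_rob * kl_robustness_loss(logp_nat, logp_adv)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)
```

In this simplified sketch, gradients for the outer update flow only through the clean forward pass; in the paper's full formulation they additionally flow through the diagonal Jacobian J_{Θ∗}(Θ) given above.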
Because component mismatch is independently proportional to the magnitude of each parameter, one has to model the space in which the adversary can search for a perturbation using an axis-aligned ellipsoid in l_2 and an axis-aligned box in l_∞. Using an ϵ-ball where the radius depends linearly on the individual parameter sets (Li et al., 2018; Cicek & Soatto, 2019; Wu et al., 2020) would either give the adversary too little or too much attack space. Projecting onto an axis-aligned ellipsoid in l_2 corresponds to solving the following optimization problem (Gabay & Mercier, 1976; Dai, 2006), which does not have a closed-form solution:

$$x^* = \arg\min_x \frac{1}{2}\|m - x\|^2 \quad \text{s.t.} \quad (x - c)^T W^{-2}(x - c) \le 1$$

where W = diag(|Θ| ⊙ ζ) + I · ζ_const, c = Θ and m = Θ∗ + α ⊙ v. Because of the computational overhead this would incur, we only consider the l_∞ case in our experiments.

4 RESULTS

The ultra-low power consumption of mixed-signal neuromorphic chips makes them suitable for edge applications, such as always-on voice detection (Cho et al., 2019), vibration monitoring (Gies et al., 2021) or always-on face recognition (Liu et al., 2019). For this reason, we consider the following network architectures in our experiments: a Long Short-term spiking recurrent Neural Network (LSNN) with roughly 65k trainable parameters; a conventional CNN with roughly 500k trainable parameters; and a Resnet32 architecture (He et al., 2015) (see Supplementary Material S1 for more information). We trained models to perform four different tasks:

- Speech command detection of 6 classes (Warden, 2018);
- ECG-anomaly detection on 4 classes (Bauer et al., 2019);
- Fashion-MNIST (F-MNIST): clothing-image classification on 10 classes (Xiao et al., 2017); and
- The Cifar10 colour image classification task (Krizhevsky, 2009).

We compared several training and attack methods, beginning with a standard Stochastic Gradient Descent (SGD) approach using the Adam optimizer (Kingma & Ba, 2015) ("Standard"). The learning rate varied by architecture, but was kept constant when comparing training methods on an architecture. We examined networks trained with dropout (Srivastava et al., 2014), AWP (Wu et al., 2020), AMP (Zheng et al., 2020), ABCD (Cicek & Soatto, 2019), and Entropy-SGD (Chaudhari et al., 2019). The adversarial perturbations used in AWP and ABCD were adapted to our mismatch model (i.e. magnitude-dependent in l_∞) unless stated otherwise. AMP was not adapted.

A dropout probability of 0.3 was used in the dropout models and γ in AWP was set to 0.1. When Gaussian noise was applied to the weights during the forward pass (Murray & Edwards, 1994), a relative standard deviation of 0.3 times the weight magnitude was used (η_train = 0.3). For Entropy-SGD, we set the number of inner iterations to 10 with a Langevin learning rate of 0.1. Because Entropy-SGD and ABCD have inner loops, the number of total epochs was reduced accordingly. All other models were trained for the same number of epochs (no early stopping) and the model with the highest validation accuracy was selected.

**Effectiveness of adversarial weight attack** We examined the strength of our adversarial weight attack during inference and training. Standard networks trained using gradient descent alone with no additional regularization (Fig. S5a, "Standard") were disrupted badly by our adversarial attack during inference (ζ = 0.1; final mean test accuracy 91.40% → 17.50%), and this was not ameliorated by further training. When our adversarial attack was implemented during training (Fig. S5a, β_rob = 0.1), the trained network was protected from disruption both during training and during inference (final test accuracy 91.97% → 78.41%).
Our adversarial attack degrades network performance significantly more than a random perturbation. Because our adversary uses PGA during the attack, it approximates a worst-case perturbation of the network within an ellipsoid around the nominal weights Θ. We compared the effect of our attack against a random weight perturbation (a random point on the ζ-ellipsoid) of equal magnitude.

[Figure 1: cross-entropy test loss vs. α for the F-MNIST CNN, ECG LSNN and Speech LSNN architectures; curves compare AWP, Beta, Forward Noise + Beta, Dropout, Forward Noise and Standard training.]

Figure 1: **Our training method flattens the test weight-loss landscape.** When moving away from the trained weight minimum (α = 0) in randomly-chosen directions, we find that our adversarial training method (Beta; Beta + Forward) finds deeper minima (for the F-MNIST and Speech tasks) at flatter locations in the cross-entropy test loss landscape. See text for further details, and Fig. S2 for visualisation over several random seeds.

For increasing perturbation size during inference (Fig. S5c; ζ), our adversarial attack disrupted the performance of the standard network significantly more than a random perturbation (test acc. 91.40% → 17.50% (attack) vs. 91.40% → 90.63% (random) for ζ = 0.1). When our adversarial attack was applied during training (Fig. S5d; β_rob = 0.1), the network was protected against both random and adversarial attacks for magnitudes up to ζ = 0.7 and ζ = 0.1, respectively.

**Flatness of the weight-loss landscape** Under our definition of robustness, the network output should change only minimally when the network parameters are perturbed. This corresponds to a loss surface that is close to flat in weight space. We measured the test weight-loss landscape for trained networks, compared over alternative training methods and for several architectures (Fig. 1). We examined only cross-entropy loss over the test set, and not the adversarial attack loss component (KL divergence loss; see Eq. 2). For each trained network, we chose a random vector v ∼ N(0, ζ|Θ|) and calculated L_cce(f(X_test, Θ + α · v), y_test) for many evenly-spaced α ∈ [−2, 2]. This process was repeated 5 times for ζ = 0.2, and the means are plotted in Fig. 1. Weight-loss landscapes for the individual trials are shown in Fig. S2.

Our adversarial training approach found minima of trained parameters Θ in flatter areas of the weight-loss landscape, compared with all other approaches examined (flatter curves in Fig. 1). In most cases our training approach also found deeper minima at lower categorical cross-entropy loss (L_cce), reflecting better task performance. These results are reflected in the better generalization performance of our approach (see Table 1). Not surprisingly, dropout and AWP also lead to flatter minima than the Standard network with no regularization. ABCD and Entropy-SGD were not included in Fig. 1 because they did not outperform the Standard model.
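A compact sketch of this 1D landscape probe is given below. It is our own illustration of the procedure just described; `model_fn` is again an assumed functional forward pass returning logits.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def loss_landscape_1d(model_fn, params, x_test, y_test, zeta=0.2, n_points=41):
    """Cross-entropy along a random direction v ~ N(0, zeta * |Theta|), evaluated at Theta + alpha * v."""
    alphas = torch.linspace(-2.0, 2.0, n_points)
    v = [zeta * p.abs() * torch.randn_like(p) for p in params]   # one random direction per tensor
    losses = []
    for a in alphas:
        shifted = [p + a * vi for p, vi in zip(params, v)]
        losses.append(F.cross_entropy(model_fn(x_test, shifted), y_test).item())
    return alphas, losses
```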
**Network robustness against parameter mismatch** We evaluated the ability of our training method to protect against simulated device mismatch. We introduced frozen parameter noise into models trained with adversarial attack, with noise modelled on that observed in neuromorphic processors (Joshi et al., 2020; Büchel et al., 2021). In these devices, the uncertainty associated with each weight parameter is approximately normally distributed around the nominal value, with a standard deviation that scales with the weight magnitude (Eq. 1). We measured test accuracy under simulated random mismatch for 100 samples across two model instances. A comparison of our method against standard training is shown in Fig. 2. For mismatch levels up to 70% (ζ = 0.7), our approach protected significantly against simulated mismatch for all three tasks examined (p < 2 × 10⁻⁸ in all cases; U test). A more detailed comparison between the different models is given in Table 1.

[Figure 2: test accuracy vs. mismatch level ζ ∈ {0.1, 0.2, 0.3, 0.5, 0.7} for Standard and our (Robust) networks; panels (a) Speech, (b) ECG, (c) F-MNIST, with example ground-truth and network outputs.]

Figure 2: **Adversarial attack during training protects networks against random mismatch-induced parameter noise.** Networks were evaluated for the Speech (a), ECG (b) and F-MNIST tasks (c), under increasing levels of simulated mismatch (ζ). Networks trained using standard SGD were disrupted by mismatch levels ζ > 0.1. At all mismatch levels our adversarial training approach performed significantly better in the presence of mismatch (higher test accuracy). Red boxes highlight misclassified examples.

**Network robustness against direct adversarial attack on task performance** Our training approach improves network robustness against mismatch parameter noise. We further evaluated the robustness of our trained networks against a parameter adversary that directly attacks task performance, by performing PGA on the cross-entropy loss L_cce. Note that this is separate from the adversary used in our training method, which attacks the boundary loss (Eq. 2). The AWP method uses the cross-entropy loss to find adversarial parameters during training. Nevertheless, we found that our method consistently outperforms all other compared methods for increasing attack magnitude ζ (Fig. 3). In networks trained with our method, the adversary needed to perform a considerably larger attack to significantly reduce performance (test accuracy < 70%). ABCD and Entropy-SGD were not included in the comparison because they did not outperform the standard network.

[Figure 3: test accuracy vs. attack size ζ for the F-MNIST CNN, ECG LSNN and Speech LSNN; curves compare Forward Noise + Beta, Beta, Standard, Forward Noise and AWP.]

Figure 3: **Our training method protects against task-adversarial attacks in parameter space.** Networks trained under several methods were attacked using a PGA adversary that directly attacked the task performance L_cce. Our adversarial training approach (bold; dashed) outperformed all other methods against this attack.
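For reference, the random-mismatch evaluation described above reduces to repeatedly sampling Eq. (1) and measuring test accuracy. A minimal sketch, with `model_fn` and the batch handling left abstract as assumptions:

```python
import torch

@torch.no_grad()
def mismatch_accuracy(model_fn, params, x_test, y_test, zeta, n_samples=100):
    """Test accuracy under Eq. (1): Theta_mismatch ~ N(Theta, diag(zeta * |Theta|))."""
    accs = []
    for _ in range(n_samples):
        noisy = [p + zeta * p.abs() * torch.randn_like(p) for p in params]
        pred = model_fn(x_test, noisy).argmax(dim=-1)
        accs.append((pred == y_test).float().mean().item())
    return accs

# e.g. {z: mismatch_accuracy(model_fn, params, x, y, z) for z in (0.1, 0.2, 0.3, 0.5, 0.7)}
```

The targeted attack evaluation follows the same pattern, but replaces the random draw with PGA on the cross-entropy loss, analogous to the training-time adversary sketched in Section 3.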
**Robustness against parameter drift for PCM-based CiM** CiM devices based on memristor technologies such as PCM promise to deliver energy- and space-efficient accelerators. With the increasing interest in CiM devices for accelerated inference and energy-efficient edge computing (Sebastian et al., 2020), the problem of deploying a model that is robust to noise originating from the device physics of PCM cells has gained in significance. We investigated the effect of our training method on the robustness of networks that are simulated to run on PCM-based CiM hardware (for details on the simulator, see SM 5).

Networks trained with our method outperform state-of-the-art networks deployed on PCM-based CiM. Currently, the method that has proven to yield the best performance on CiM hardware is training with noise injected into the network parameters during the forward pass. We adapted this method by adding our algorithm and show that we consistently outperform the conventional method (see Figure S14) for a wide range of hyperparameters, and even surpass the FP baseline (a model trained without noise injection and evaluated on a standard PC) for some configurations (see Figure 4).

[Figure 4: test accuracy vs. time after programming T_inf (s), up to roughly one year, for networks deployed on simulated PCM-based CiM hardware; one panel per attack magnitude ζ_attack ∈ {0.01, 0.03, 0.05, 0.10}, with curves for several β_rob values, the FP baseline, and a network trained with noise injection only (β_rob = 0.0, η_train = 0.110).]

Figure 4: **Networks trained with our method show overall better performance when deployed on PCM-based CiM hardware.** This figure shows the performance degradation, as a consequence of the PCM devices drifting over time (x-axis, up to one year), of networks deployed on CiM hardware. Each subplot shows networks trained with a different attacking magnitude (ζ_attack) that are trained with different values of β_rob. Each network is compared to the FP baseline and a network trained with Gaussian noise injection on the weights.

[Figure 5: test accuracy vs. training noise level η_train, one panel per ζ_attack ∈ {0.01, 0.03, 0.05, 0.10}; our method (several β_rob values) compared against Gaussian noise injection and the FP baseline, for the two noise models described in the caption.]

Figure 5: **Our method consistently yields networks that outperform training with Gaussian noise injection.** This figure compares the robustness to Gaussian noise at various levels (ζ) for networks trained with our method (blue) and networks trained with Gaussian noise injection (red), where the level of noise used during training (η_train) matches the noise used during inference. Each row represents a different type of noise: the first row models Gaussian noise with a standard deviation that is proportional to the largest absolute weight in the individual weight kernels (Joshi et al., 2020) and the second row follows the model presented in this paper (see Eq. 1).
Following experiments conducted in (Joshi et al., 2020), we used a Resnet32 (He et al., 2015) trained on Cifar10 (Krizhevsky, 2009). Injecting Gaussian noise into the weights during the forward pass yields strong improvements compared to the standard network. We show that by adding our method, we consistently improve this robustness by a significant amount (see Figure 5).

We furthermore improve the scalability of our method by using a pretrained model and fewer steps for the adversary. Our algorithm incurs an additional training time that scales linearly with the number of attack steps used by the adversary (note that we cache the necessary gradients for the Jacobian calculation). To alleviate this additional cost, we show that our method produces good results even for just a single adversarial step. Figure S13 shows the resulting performance when varying the number of attack steps used by the adversary. It should be noted that all results reported on PCM robustness were obtained using three adversarial steps and a pretrained model.

**Verifiable robustness for LSNNs** We investigated the provable robustness of LSNNs trained with our method, using abstract interpretation (Cousot & Cousot, 1977; Gehr et al., 2018; Mirman et al., 2018). In this analysis a function f(x, Θ) (in our case, a neural network with input x and parameters Θ) is over-approximated using an abstract domain. We specify the weights Θ in our network as an interval parameterised by the attack size ζ, spanning [Θ − ζ|Θ|, Θ + ζ|Θ|]. We examined the proportion of provably correctly classified test samples for the Speech and ECG tasks under a range of attack sizes ζ, comparing our approach against standard gradient descent and against training with forward-pass noise only (Fig. S9). We found that our approach is provably more correct over increasing attack size ζ (higher verified test accuracy).
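As an illustration of the idea (not the abstract-interpretation tooling used for these experiments), the following sketch propagates an input interval through a single linear layer whose weights lie in [Θ − ζ|Θ|, Θ + ζ|Θ|], using standard interval arithmetic in midpoint/radius form. Biases are treated as exact here, which is an assumption made for brevity; the LSNN state variables and thresholds would need the same treatment.

```python
import torch

def interval_linear(x_lo, x_hi, w, b, zeta):
    """Bounds on y = x @ w.T + b when each weight lies in [w - zeta*|w|, w + zeta*|w|]."""
    w_rad = zeta * w.abs()                                  # weight interval half-width
    x_mid, x_rad = (x_lo + x_hi) / 2, (x_hi - x_lo) / 2     # input interval centre and half-width
    y_mid = x_mid @ w.T + b
    # sound bound: |y - y_mid| <= |x_mid| @ w_rad.T + x_rad @ (|w| + w_rad).T
    y_rad = x_mid.abs() @ w_rad.T + x_rad @ (w.abs() + w_rad).T
    return y_mid - y_rad, y_mid + y_rad

# Monotone activations preserve intervals, e.g. relu([lo, hi]) = [relu(lo), relu(hi)].
```

A test sample is then provably correct if the lower bound of the true class's output exceeds the upper bounds of all other classes over the whole weight interval.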
5 DISCUSSION

We proposed a new training approach that includes adversarial attacks on the parameter space during training. Our proposed adversarial attack was significantly stronger than random weight perturbations at disrupting the performance of a trained network. Including the adversarial attack during training significantly protected the trained network from weight perturbations during inference. Our approach found minima in the weight-loss landscape that usually corresponded to lower loss values, and that were always in flatter regions of the loss landscape. This indicates that our approach found network solutions that are less sensitive to parameter variation, and therefore more robust. Our approach was more robust than several other methods for inducing robustness and good generalisation. To the best of our knowledge, our work represents the first example of interval bound propagation applied to SNNs, and the first application of parameter-space adversarial attacks to promote network robustness against device mismatch for mixed-signal compute.

Our experiments only considered the impact of weight perturbations, and did not examine the influence of uncertainty in other parameters of mixed-signal neuromorphic processors such as time constants or spiking thresholds. Our approach can be adapted to include adversarial attacks in the full network parameter space, increasing the robustness of spiking networks. The technique of interval bound propagation can also be applied to these additional network parameters. We did not quantize network parameters either during or after training in this work. On some platforms (Moradi et al., 2018) it is necessary to deploy quantized weights, and it is unclear how our adversarial attacks would interact with quantization during the training process. However, most PCM-based CiM hardware does not require quantization during training in order to reach good performance (Joshi et al., 2020).

Per-device training for a device with known calibrated parameter noise is likely to achieve the highest possible deployed performance on that single device. However, this approach has significant drawbacks. Firstly, each device must be either measured and calibrated accurately (not a trivial requirement) or trained with the device in the forward inference pass of the training loop. Secondly, training must be performed individually for each device, entailing significant logistical problems if the training is conducted in the factory or inside a consumer product. Thirdly, this approach will retain full sensitivity to parameter variation on the device.

Our method improves the performance of neural networks deployed to inference hardware that includes computational non-idealities. For example, NN processors with crossbar architectures based on novel memory devices such as RRAM and PCM (Sebastian et al., 2020) display uncertainty in stored memory values as well as conductance drift over time (Le Gallo et al., 2018b; Wu et al., 2019; Joshi et al., 2020). Our method could also address in-memory-computing-based NN processors based on SRAM and switched capacitors (Verma et al., 2019). Analog neurons and synapses in mixed-signal NN processors, for example SNN inference processors (Moradi et al., 2018), exhibit variation in weights and neuron parameters across a processor. We showed that our training approach finds network solutions that are insensitive to mismatch-induced parameter variation. Our networks can therefore be deployed to inference devices with computational non-idealities with only minimal reduction in task performance, and without requiring per-device calibration or model training. This reduction in per-device handling implies a considerable reduction in expense when deploying at commercial scale. Our method therefore brings low-power neuromorphic inference processors closer to commercial viability.

**Ethics statement** The authors declare no conflicts of interest.

**Reproducibility statement** Code to reproduce all experiments described in this work is provided at https://github.com/jubueche/BPTT-Lipschitzness and https://github.com/jubueche/Resnet32-ICLR.

**Acknowledgments** This work was partially supported by EU grants 826655 "TEMPO", 871371 "MEMSCALES" and 876925 "ANDANTE" to DRM. JB would also like to thank Manuel Le Gallo-Bourdeau, Irem Boybat and Abu Sebastian from IBM Research - Zurich, for insightful discussions and technical support.
Table 1: **Results of training multiple networks using several methods over three tasks.** Networks were evaluated under different levels of mismatch (ζ). Entries show test accuracy as mean ± std. (min.).

**CNN**

| Mismatch | Forward Noise, β_rob = 0.1 | β_rob = 0.25 | Standard | Forward Noise | AWP (ϵ_pga = 0.0) | Dropout | AMP (ϵ = 0.005) |
|---|---|---|---|---|---|---|---|
| Baseline (0.0) | 91.88 ± **0.00** (91.88) | 92.11 ± 0.22 (91.89) | 91.30 ± 0.15 (91.15) | 92.00 ± 0.02 (91.98) | 92.35 ± 0.12 (**92.23**) | **92.42** ± 0.24 (92.18) | 91.34 ± 0.12 (91.22) |
| 0.1 | 91.77 ± 0.12 (91.29) | 91.59 ± 0.26 (90.91) | 90.45 ± 0.37 (88.80) | 91.94 ± **0.09** (91.69) | **92.19** ± 0.16 (**91.78**) | 91.13 ± 0.85 (86.50) | 90.42 ± 0.35 (89.37) |
| 0.2 | 91.62 ± **0.16** (**91.19**) | 90.67 ± 0.42 (89.28) | 87.93 ± 1.03 (83.40) | **91.63** ± **0.16** (91.06) | 91.42 ± 0.44 (89.87) | 88.71 ± 1.55 (80.97) | 87.97 ± 0.88 (85.33) |
| 0.3 | **91.25** ± **0.22** (**90.36**) | 89.64 ± 0.65 (87.10) | 82.70 ± 2.45 (71.99) | 91.01 ± 0.26 (90.01) | 89.88 ± 0.95 (85.11) | 84.94 ± 3.06 (72.97) | 83.13 ± 2.13 (75.73) |
| 0.5 | **89.36** ± **0.73** (**86.84**) | 85.96 ± 1.87 (76.28) | 61.14 ± 6.92 (42.46) | 87.82 ± 0.88 (84.37) | 82.59 ± 3.66 (66.73) | 71.74 ± 5.84 (53.10) | 60.60 ± 7.36 (38.67) |
| 0.7 | **84.19** ± **2.63** (**74.25**) | 79.39 ± 4.40 (59.15) | 36.93 ± 7.73 (20.17) | 78.48 ± 2.95 (69.67) | 65.38 ± 8.27 (40.66) | 53.91 ± 9.03 (22.49) | 36.79 ± 7.42 (18.53) |

**ECG LSNN**

| Mismatch | Forward Noise, β_rob = 0.1 | β_rob = 0.25 | Standard | Forward Noise | AWP (ϵ_pga = 0.0) | Dropout | AMP (ϵ = 0.02) |
|---|---|---|---|---|---|---|---|
| Baseline (0.0) | 99.07 ± 0.34 (98.73) | 99.10 ± 0.07 (99.03) | 99.07 ± **0.04** (**99.03**) | 99.25 ± 0.37 (98.88) | 99.22 ± 0.19 (99.03) | 98.10 ± 0.11 (97.99) | **99.40** ± 0.00 (**99.40**) |
| 0.1 | 99.04 ± 0.27 (98.36) | 98.96 ± 0.24 (98.28) | 98.95 ± 0.27 (97.76) | **99.09** ± **0.19** (**98.66**) | 98.93 ± 0.31 (97.69) | 97.71 ± 0.34 (96.72) | 99.04 ± 0.38 (97.39) |
| 0.2 | 98.87 ± 0.35 (96.94) | 98.16 ± 0.71 (93.96) | 97.22 ± 1.46 (91.87) | **99.01** ± **0.26** (**98.06**) | 97.71 ± 0.97 (93.06) | 96.89 ± 0.87 (93.81) | 97.03 ± 1.52 (90.67) |
| 0.3 | 98.45 ± **0.45** (**96.49**) | 96.34 ± 1.95 (89.33) | 92.97 ± 4.38 (65.30) | **98.59** ± 0.52 (95.30) | 94.60 ± 2.93 (81.34) | 94.85 ± 2.47 (85.60) | 92.24 ± 3.92 (76.27) |
| 0.5 | **94.86** ± **2.69** (82.39) | 86.22 ± 6.66 (60.75) | 76.55 ± 9.61 (35.75) | 94.44 ± 2.86 (**82.84**) | 80.32 ± 8.08 (41.94) | 87.56 ± 6.05 (64.10) | 73.36 ± 11.27 (29.55) |
| 0.7 | **82.02** ± **8.40** (**50.22**) | 70.07 ± 11.16 (39.33) | 58.67 ± 11.44 (29.48) | 80.00 ± 8.70 (37.69) | 63.44 ± 9.61 (32.91) | 74.24 ± 12.58 (30.15) | 56.64 ± 11.02 (27.76) |

**Speech LSNN**

| Mismatch | Forward Noise, β_rob = 0.5 | β_rob = 0.5 | Standard | Forward Noise | AWP (ϵ_pga = 0.01) | Dropout | AMP (ϵ = 0.01) |
|---|---|---|---|---|---|---|---|
| Baseline (0.0) | 81.52 ± 0.05 (81.47) | **82.48** ± **0.00** (**82.48**) | 80.86 ± 0.03 (80.83) | 81.33 ± 0.14 (81.20) | 82.38 ± 0.14 (82.25) | 79.02 ± 0.36 (78.66) | 80.83 ± 0.44 (80.39) |
| 0.1 | 81.33 ± **0.24** (80.72) | 82.01 ± 0.31 (**81.03**) | 79.72 ± 0.45 (78.53) | 81.12 ± 0.32 (80.18) | **82.03** ± 0.37 (80.79) | 79.16 ± 0.44 (77.88) | 80.00 ± 0.55 (78.49) |
| 0.2 | 80.77 ± **0.35** (79.71) | **81.24** ± 0.43 (**79.98**) | 76.96 ± 1.11 (72.57) | 80.10 ± 0.53 (78.73) | 80.58 ± 0.65 (78.69) | 78.58 ± 0.67 (75.85) | 77.56 ± 0.87 (74.81) |
| 0.3 | 79.57 ± **0.62** (**77.98**) | **79.80** ± 0.67 (77.54) | 72.31 ± 2.00 (65.40) | 78.05 ± 0.88 (75.45) | 77.47 ± 1.35 (70.88) | 77.18 ± 0.96 (74.43) | 73.80 ± 1.61 (68.24) |
| 0.5 | 72.67 ± 2.75 (**63.85**) | **73.40** ± **2.32** (60.91) | 57.16 ± 4.25 (42.07) | 67.41 ± 3.93 (53.50) | 63.98 ± 4.28 (47.38) | 69.80 ± 3.13 (54.75) | 59.91 ± 4.49 (44.74) |
| 0.7 | 58.03 ± 6.57 (32.19) | **60.23** ± **4.66** (**45.49**) | 40.70 ± 5.72 (24.92) | 49.19 ± 6.98 (30.47) | 44.83 ± 6.65 (23.13) | 55.96 ± 5.74 (37.74) | 42.88 ± 6.83 (25.03) |

REFERENCES

Stefano Ambrogio, Pritish Narayanan, Hsinyu Tsai, Robert M. Shelby, Irem Boybat, Carmelo di Nolfo, Severin Sidler, Massimo Giordano, Martina Bodini, Nathan C. P. Farinha, Benjamin Killeen, Christina Cheng, Yassine Jaoudi, and Geoffrey W. Burr. Equivalent-accuracy accelerated neural-network training using analogue memory. Nature, 558(7708):60–67, June 2018. ISSN 1476-4687. doi: 10.1038/s41586-018-0180-5.
URL https://doi.org/10.1038/](https://doi.org/10.1038/s41586-018-0180-5) [s41586-018-0180-5.](https://doi.org/10.1038/s41586-018-0180-5) F. C. Bauer, D. R. Muir, and G. Indiveri. Real-time ultra-low power ecg anomaly detection using an event-driven neuromorphic processor. IEEE Transactions on Biomedical Circuits and Systems, 13 (6):1575–1582, 2019. doi: 10.1109/TBCAS.2019.2953001. Guillaume Bellec, Darjan Salaj, Anand Subramoney, Robert A. Legenstein, and Wolfgang Maass. Long short-term memory and learning-to-learn in networks of spiking neurons. CoRR, [abs/1803.09574, 2018. URL http://arxiv.org/abs/1803.09574.](http://arxiv.org/abs/1803.09574) Irem Boybat, Manuel Le Gallo, SR Nandakumar, Timoleon Moraitis, Thomas Parnell, Tomas Tuma, Bipin Rajendran, Yusuf Leblebici, Abu Sebastian, and Evangelos Eleftheriou. Neuromorphic computing with multi-memristive synapses. Nature communications, 9(1):2514, 2018. Julian Büchel, Dmitrii Zendrikov, Sergio Solinas, Giacomo Indiveri, and Dylan R. Muir. Supervised training of spiking neural networks for robust deployment on mixed-signal neuromorphic processors, 2021. Nicholas Carlini and David A. Wagner. Towards evaluating the robustness of neural networks. CoRR, [abs/1608.04644, 2016. URL http://arxiv.org/abs/1608.04644.](http://arxiv.org/abs/1608.04644) Andrew S. Cassidy, Jun Sawada, Paul Merolla, John V. Arthur, Rodrigo Alvarez-Icaza, Filipp Akopyan, Bryan L. Jackson, and Dharmendra S. Modha. Truenorth: A high-performance, lowpower neurosynaptic processor for multi-sensory perception, action, and cognition. 2016. Pratik Chaudhari, Anna Choromanska, Stefano Soatto, Yann LeCun, Carlo Baldassi, Christian Borgs, Jennifer Chayes, Levent Sagun, and Riccardo Zecchina. Entropy-SGD: biasing gradient descent into wide valleys. Journal of Statistical Mechanics: Theory and Experiment, 2019(12): [124018, dec 2019. doi: 10.1088/1742-5468/ab39d9. URL https://doi.org/10.1088%](https://doi.org/10.1088%2F1742-5468%2Fab39d9) [2F1742-5468%2Fab39d9.](https://doi.org/10.1088%2F1742-5468%2Fab39d9) An Chen. A review of emerging non-volatile memory (nvm) technologies and applications. _Solid-State Electronics, 125:25–38, 2016. ISSN 0038-1101. doi: https://doi.org/10.1016/j.sse._ 2016.07.006. [URL https://www.sciencedirect.com/science/article/pii/](https://www.sciencedirect.com/science/article/pii/S0038110116300867) [S0038110116300867. Extended papers selected from ESSDERC 2015.](https://www.sciencedirect.com/science/article/pii/S0038110116300867) Minchang Cho, Sechang Oh, Zhan Shi, Jongyup Lim, Yejoong Kim, Seokhyeon Jeong, Yu Chen, David Blaauw, Hun-Seok Kim, and Dennis Sylvester. 17.2 a 142nw voice and acoustic activity ----- detection chip for mm-scale sensor nodes using time-interleaved mixer-based frequency scanning. In 2019 IEEE International Solid- State Circuits Conference - (ISSCC), pp. 278–280, 2019. doi: 10.1109/ISSCC.2019.8662540. Safa Cicek and Stefano Soatto. Input and weight space smoothing for semi-supervised learning. In _2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), pp. 1344–1353,_ 2019. doi: 10.1109/ICCVW.2019.00170. Patrick Cousot and Radhia Cousot. Abstract interpretation: A unified lattice model for static analysis of programs by construction or approximation of fixpoints. In Proceedings of the 4th ACM _SIGACT-SIGPLAN Symposium on Principles of Programming Languages, POPL ’77, pp. 238–252,_ New York, NY, USA, 1977. Association for Computing Machinery. ISBN 9781450373500. doi: [10.1145/512950.512973. 
URL https://doi.org/10.1145/512950.512973.](https://doi.org/10.1145/512950.512973) Yu-Hong Dai. Fast algorithms for projection on an ellipsoid. SIAM Journal on Optimization, 16(4):986–1006, 2006. doi: 10.1137/040613305. [URL https://doi.org/10.1137/](https://doi.org/10.1137/040613305) [040613305.](https://doi.org/10.1137/040613305) Daniel Gabay and Bertrand Mercier. A dual algorithm for the solution of nonlinear variational problems via finite element approximation. Computers & Mathematics with Applications, 2(1): [17–40, 1976. ISSN 0898-1221. doi: https://doi.org/10.1016/0898-1221(76)90003-1. URL https:](https://www.sciencedirect.com/science/article/pii/0898122176900031) [//www.sciencedirect.com/science/article/pii/0898122176900031.](https://www.sciencedirect.com/science/article/pii/0898122176900031) Timon Gehr, Matthew Mirman, Dana Drachsler-Cohen, Petar Tsankov, Swarat Chaudhuri, and Martin Vechev. Ai2: Safety and robustness certification of neural networks with abstract interpretation. In _2018 IEEE Symposium on Security and Privacy (SP), pp. 3–18, 2018. doi: 10.1109/SP.2018.00058._ Valentin Gies, Sebastián Marzetti, Valentin Barchasz, Hervé Barthélemy, and Hervé Glotin. Ultra-low power embedded unsupervised learning smart sensor for industrial fault classificatio. In 2020 IEEE _International Conference on Internet of Things and Intelligence System (IoTaIS), pp. 181–187,_ 2021. doi: 10.1109/IoTaIS50849.2021.9359716. Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Yee Whye Teh and Mike Titterington (eds.), Proceedings of the Thirteenth _International Conference on Artificial Intelligence and Statistics, volume 9 of Proceedings of_ _Machine Learning Research, pp. 249–256, Chia Laguna Resort, Sardinia, Italy, 13–15 May 2010._ [JMLR Workshop and Conference Proceedings. URL http://proceedings.mlr.press/](http://proceedings.mlr.press/v9/glorot10a.html) [v9/glorot10a.html.](http://proceedings.mlr.press/v9/glorot10a.html) Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples, 2015. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image [recognition. CoRR, abs/1512.03385, 2015. URL http://arxiv.org/abs/1512.03385.](http://arxiv.org/abs/1512.03385) S. Huang, S. Xiao, and W. Feng. On the energy efficiency of graphics processing units for scientific computing. In 2009 IEEE International Symposium on Parallel Distributed Processing, pp. 1–8, 2009. doi: 10.1109/IPDPS.2009.5160980. Vinay Joshi, Manuel Le Gallo, Simon Haefeli, Irem Boybat, S. R. Nandakumar, Christophe Piveteau, Martino Dazzi, Bipin Rajendran, Abu Sebastian, and Evangelos Eleftheriou. Accurate deep neural network inference using computational phase-change memory. Nature Communications, [11(1):2473, May 2020. ISSN 2041-1723. doi: 10.1038/s41467-020-16108-9. URL https:](https://doi.org/10.1038/s41467-020-16108-9) [//doi.org/10.1038/s41467-020-16108-9.](https://doi.org/10.1038/s41467-020-16108-9) Riduan Khaddam-Aljameh, Milos Stanisavljevic, Jordi Fornt Mas, Geethan Karunaratne, Matthias Brändli, Feng Liu, Abhairaj Singh, Silvia M Müller, Urs Egger, Anastasios Petropoulos, et al. HERMES-core–a 1.59-TOPS/mm[2] PCM on 14-nm CMOS in-memory compute core using 300ps/LSB linearized CCO-based ADCs. IEEE Journal of Solid-State Circuits, 2022. Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. 
In Yoshua Bengio and Yann LeCun (eds.), 3rd International Conference on Learning Representations, ICLR _[2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. URL http:](http://arxiv.org/abs/1412.6980)_ [//arxiv.org/abs/1412.6980.](http://arxiv.org/abs/1412.6980) ----- Alex Krizhevsky. Learning multiple layers of features from tiny images. 2009. Manuel Le Gallo, Daniel Krebs, Federico Zipoli, Martin Salinga, and Abu Sebastian. Collective structural relaxation in phase-change memory devices. _Advanced Electronic Mate-_ _rials, 4(9):1700627, 2018a._ doi: https://doi.org/10.1002/aelm.201700627. [URL https:](https://onlinelibrary.wiley.com/doi/abs/10.1002/aelm.201700627) [//onlinelibrary.wiley.com/doi/abs/10.1002/aelm.201700627.](https://onlinelibrary.wiley.com/doi/abs/10.1002/aelm.201700627) Manuel Le Gallo, Abu Sebastian, Roland Mathis, Matteo Manica, Heiner Giefers, Tomas Tuma, Costas Bekas, Alessandro Curioni, and Evangelos Eleftheriou. Mixed-precision in-memory computing. Nature Electronics, 1(4):246–253, April 2018b. ISSN 2520-1131. doi: 10.1038/ [s41928-018-0054-8. URL https://doi.org/10.1038/s41928-018-0054-8.](https://doi.org/10.1038/s41928-018-0054-8) Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, and Tom Goldstein. Visualizing the loss landscape of neural nets. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 31. Curran As[sociates, Inc., 2018. URL https://proceedings.neurips.cc/paper/2018/file/](https://proceedings.neurips.cc/paper/2018/file/a41b3bb3e6b050b6c9067c67f663b915-Paper.pdf) [a41b3bb3e6b050b6c9067c67f663b915-Paper.pdf.](https://proceedings.neurips.cc/paper/2018/file/a41b3bb3e6b050b6c9067c67f663b915-Paper.pdf) Qian Liu, Ole Richter, Carsten Nielsen, Sadique Sheik, Giacomo Indiveri, and Ning Qiao. Live demonstration: Face recognition on an ultra-low power event-driven convolutional neural network asic. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops _(CVPRW), pp. 1680–1681, 2019. doi: 10.1109/CVPRW.2019.00213._ Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks, 2019. Matthew Mirman, Timon Gehr, and Martin T. Vechev. Differentiable abstract interpretation for provably robust neural networks. In Jennifer G. Dy and Andreas Krause (eds.), Proceedings of the _35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm,_ _Sweden, July 10-15, 2018, volume 80 of Proceedings of Machine Learning Research, pp. 3575–_ [3583. PMLR, 2018. URL http://proceedings.mlr.press/v80/mirman18b.html.](http://proceedings.mlr.press/v80/mirman18b.html) Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, and Pascal Frossard. Deepfool: a simple and [accurate method to fool deep neural networks. CoRR, abs/1511.04599, 2015. URL http:](http://arxiv.org/abs/1511.04599) [//arxiv.org/abs/1511.04599.](http://arxiv.org/abs/1511.04599) Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, Jonathan Uesato, and Pascal Frossard. Robustness via curvature regularization, and vice versa. _CoRR, abs/1811.09716, 2018._ URL [http://arxiv.org/abs/1811.09716.](http://arxiv.org/abs/1811.09716) S. Moradi, N. Qiao, F. Stefanini, and G. Indiveri. A scalable multicore architecture with heterogeneous memory structures for dynamic neuromorphic asynchronous processors (dynaps). 
_IEEE Transactions on Biomedical Circuits and Systems, 12(1):106–122, 2018._ doi: 10.1109/TBCAS.2017.2759700. A.F. Murray and P.J. Edwards. Enhanced mlp performance and fault tolerance resulting from synaptic weight noise during training. IEEE Transactions on Neural Networks, 5(5):792–802, 1994. doi: 10.1109/72.317730. S. R. Nandakumar, I. Boybat, V. Joshi, C. Piveteau, M. Le Gallo, B. Rajendran, A. Sebastian, and E. Eleftheriou. Phase-change memory models for deep learning training and inference. In 26th _IEEE International Conference on Electronics, Circuits and Systems (ICECS), pp. 727–730, 2019._ doi: 10.1109/ICECS46596.2019.8964852. S. R. Nandakumar, Manuel Le Gallo, Christophe Piveteau, Vinay Joshi, Giovanni Mariani, Irem Boybat, Geethan Karunaratne, Riduan Khaddam-Aljameh, Urs Egger, Anastasios Petropoulos, Theodore Antonakopoulos, Bipin Rajendran, Abu Sebastian, and Evangelos Eleftheriou. Mixedprecision deep learning based on computational memory. Frontiers in Neuroscience, 14:406, 2020a. [ISSN 1662-453X. doi: 10.3389/fnins.2020.00406. URL https://www.frontiersin.org/](https://www.frontiersin.org/article/10.3389/fnins.2020.00406) [article/10.3389/fnins.2020.00406.](https://www.frontiersin.org/article/10.3389/fnins.2020.00406) ----- SR Nandakumar, Irem Boybat, Jin-Ping Han, Stefano Ambrogio, Praneet Adusumilli, Robert L Bruce, Matthew BrightSky, Malte Rasch, Manuel Le Gallo, and Abu Sebastian. Precision of synaptic weights programmed in phase-change memory devices for deep learning inference. In 2020 IEEE _International Electron Devices Meeting (IEDM), pp. 29–4. IEEE, 2020b._ Alexander Neckar, Sam Fok, Ben Benjamin, Terrence Stewart, Aaron Voelker, Chris Eliasmith, Rajit Manohar, and Kwabena Boahen. Braindrop: A mixed-signal neuromorphic architecture with a dynamical systems-based programming model. Proceedings of the IEEE, 107:144–164, 01 2019. doi: 10.1109/JPROC.2018.2881432. Yurii Nesterov. A method for solving the convex programming problem with convergence rate O(1/k[2]). Proceedings of the USSR Academy of Sciences, 269:543–547, 1983. J. Schemmel, D. Brüderle, A. Grübl, M. Hock, K. Meier, and S. Millner. A wafer-scale neuromorphic hardware system for large-scale neural modeling. In 2010 IEEE International Symposium on _Circuits and Systems (ISCAS), pp. 1947–1950, 2010. doi: 10.1109/ISCAS.2010.5536970._ Abu Sebastian, Manuel Le Gallo, Riduan Khaddam-Aljameh, and Evangelos Eleftheriou. Memory devices and applications for in-memory computing. _Nature Nanotechnology, 15(7):_ [529–544, 2020. doi: 10.1038/s41565-020-0655-z. URL https://doi.org/10.1038/](https://doi.org/10.1038/s41565-020-0655-z) [s41565-020-0655-z.](https://doi.org/10.1038/s41565-020-0655-z) Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. _Journal of Machine_ _Learning Research, 15(56):1929–1958, 2014._ [URL http://jmlr.org/papers/v15/](http://jmlr.org/papers/v15/srivastava14a.html) [srivastava14a.html.](http://jmlr.org/papers/v15/srivastava14a.html) David Stutz, Nandhini Chandramoorthy, Matthias Hein, and Bernt Schiele. Random and adversarial bit error robustness: Energy-efficient and secure dnn accelerators. CoRR, abs/2104.08323, 2021. C. S. Thakur, R. Wang, T. J. Hamilton, R. Etienne-Cummings, J. Tapson, and A. van Schaik. An analogue neuromorphic co-processor that utilizes device mismatch for learning applications. 
IEEE Transactions on Circuits and Systems I: Regular Papers, 65(4):1174–1184, 2018.

Naveen Verma, Hongyang Jia, Hossein Valavi, Yinqi Tang, Murat Ozatay, Lung-Yen Chen, Bonan Zhang, and Peter Deaville. In-memory computing: Advances and prospects. IEEE Solid-State Circuits Magazine, 11(3):43–55, 2019. doi: 10.1109/MSSC.2019.2922889.

Yisen Wang, Difan Zou, Jinfeng Yi, James Bailey, Xingjun Ma, and Quanquan Gu. Improving adversarial robustness requires revisiting misclassified examples. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=rklOg6EFwS.

P. Warden. Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. ArXiv e-prints, April 2018. URL https://arxiv.org/abs/1804.03209.

Dongxian Wu, Yisen Wang, and Shutao Xia. Revisiting loss landscape for adversarial robustness. CoRR, abs/2004.05884, 2020. URL https://arxiv.org/abs/2004.05884.

Lei Wu, Hongxia Liu, Jiabin Li, Shulong Wang, and Xing Wang. A Multi-level Memristor Based on Al-Doped HfO2 Thin Film. Nanoscale Research Letters, 14(1):177, May 2019. ISSN 1556-276X. doi: 10.1186/s11671-019-3015-x. URL https://doi.org/10.1186/s11671-019-3015-x.

Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms, 2017.

Shimeng Yu and Pai-Yu Chen. Emerging memory technologies: Recent trends and prospects. IEEE Solid-State Circuits Magazine, 8(2):43–56, Spring 2016. ISSN 1943-0590. doi: 10.1109/MSSC.2016.2546199.

Hongyang Zhang, Yaodong Yu, Jiantao Jiao, Eric P. Xing, Laurent El Ghaoui, and Michael I. Jordan. Theoretically principled trade-off between robustness and accuracy. CoRR, abs/1901.08573, 2019. URL http://arxiv.org/abs/1901.08573.

Yaowei Zheng, Richong Zhang, and Yongyi Mao. Regularizing neural networks via adversarial model perturbation. CoRR, abs/2010.04925, 2020. URL https://arxiv.org/abs/2010.04925.

SUPPLEMENTARY MATERIAL

SPIKING RNN ARCHITECTURE

The dynamics of the spiking model (Bellec et al., 2018) can be summarised by the following set of discrete-time update equations (Eq. 3):

$$
\begin{aligned}
B^t &= b^0 + \beta b^t \\
o^t &= \mathbb{1}(V^t > B^t) \quad \text{(unless refractory, } t_{refr}) \\
b^{t+1} &= \rho_\beta b^t + (1 - \rho_\beta)\,\frac{o^t}{dt} \\
I^t_{Reset} &= \frac{o^t}{dt}\, B^t\, dt \\
V^{t+1} &= \rho_V V^t + (1 - \rho_V)\left(I_{in} W_{in} + \frac{o^t}{dt} W_{rec}\right) - I^t_{Reset}
\end{aligned}
\tag{3}
$$

where ρ_β = e^(−dt/τ_ada) and ρ_V = e^(−dt/τ). The variables B describe the spiking thresholds with spike-frequency adaptation. The vector o^t denotes the population spike train at time t. The membrane potentials V have a time constant τ and the adaptive threshold time constant is denoted τ_ada. The speech signals and ECG traces fed into the network are represented as currents I_in. Since the derivative of the spiking function with respect to its input is zero almost everywhere, we use a surrogate gradient that is explicitly defined as

$$\frac{\partial E}{\partial V^t} = \frac{\partial E}{\partial z^t}\frac{\partial z^t}{\partial V^t} = \frac{\partial E}{\partial z^t}\, d \cdot \max\left(1 - \left|\frac{V^t - B^t}{B^t}\right|,\; 0\right) \tag{4}$$

where E is the error and d is the dampening factor.
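The update equations above can be transcribed almost directly into code. The sketch below is our own illustrative PyTorch version of one time step with the surrogate gradient of Eq. (4) attached to the spike nonlinearity; the refractory mechanism is omitted, positive thresholds are assumed, and all tensor shapes are assumptions.

```python
import torch

class SpikeFn(torch.autograd.Function):
    """Heaviside spike with the pseudo-derivative of Eq. (4) in the backward pass."""
    @staticmethod
    def forward(ctx, v, b_thr, damp=0.3):
        ctx.save_for_backward(v, b_thr)
        ctx.damp = damp
        return (v > b_thr).float()

    @staticmethod
    def backward(ctx, grad_out):
        v, b_thr = ctx.saved_tensors
        pseudo = ctx.damp * torch.clamp(1.0 - ((v - b_thr) / b_thr).abs(), min=0.0)
        return grad_out * pseudo, None, None

def lsnn_step(v, b, i_in, w_in, w_rec, b0, beta, rho_v, rho_b, dt, damp=0.3):
    b_thr = b0 + beta * b                                       # B^t = b^0 + beta * b^t
    o = SpikeFn.apply(v, b_thr, damp)                           # o^t = 1(V^t > B^t)
    b_next = rho_b * b + (1.0 - rho_b) * o / dt                 # b^{t+1}
    i_reset = (o / dt) * b_thr * dt                             # I_Reset^t
    v_next = rho_v * v + (1.0 - rho_v) * (i_in @ w_in + (o / dt) @ w_rec) - i_reset
    return v_next, b_next, o
```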
To obtain the final prediction of the network, we average the population spike trains along the time axis to obtain z_avg and compute

$$
\mathbf{l} = \operatorname{softmax}\left(z_{\mathrm{avg}} W_{\mathrm{out}} + b_{\mathrm{out}}\right), \qquad \hat{y} = \underset{i}{\arg\max}\; l_i
$$

CNN ARCHITECTURE

Our architecture comprises two convolutional blocks (2 × [4 × 4, 64 channels, MaxPool, ReLU]), followed by three dense layers (N = 1600, 256, 64, ReLU) and a softmax layer. All weights and kernels are initialized using Glorot normal initialisation (Glorot & Bengio, 2010). With this architecture we achieved a test accuracy of approximately 93%. The attacked parameters for this network comprised all kernel weights as well as all dense-layer parameters.

DERIVATION OF JACOBIAN

Under the assumption that $\alpha = \frac{\zeta_{\mathrm{attack}} \odot |\Theta|}{N_{\mathrm{steps}}}$ and $p = \infty$, we can rewrite the inner loop of Algorithm 1 as

    begin
        Θ* ← Θ + |Θ| ϵ ⊙ R,   R ∼ N(0, 1)
        for t = 1 to N_steps do
            Θ*_t ← Θ*_{t−1} + α · sign( ∇_{Θ*_{t−1}} L_rob(Θ, Θ*_{t−1}, X) )
        end
    end

Rewriting Θ* in the form Θ* = Θ + ΔΘ gives

$$
\Theta^* = \Theta + |\Theta|\,\epsilon \odot R + \alpha \odot \sum_{t=1}^{N_{\mathrm{steps}}} \operatorname{sign}\!\left(\nabla_{\Theta^*_{t-1}} \mathcal{L}_{\mathrm{rob}}\!\left(f(\Theta, X),\, f(\Theta^*_{t-1}, X)\right)\right)
$$

From this the Jacobian can be computed directly. Plugging in the definition of α, we obtain

$$
\mathbf{J}_{\Theta^*}(\Theta) = \mathbb{I} + \operatorname{diag}\!\left[\operatorname{sign}(\Theta) \odot \left(\epsilon \cdot R + \frac{\zeta_{\mathrm{attack}}}{N_{\mathrm{steps}}} \odot \sum_{t=1}^{N_{\mathrm{steps}}} \operatorname{sign}\!\left(\nabla_{\Theta^*_{t-1}} \mathcal{L}_{\mathrm{rob}}\!\left(f(\Theta, X),\, f(\Theta^*_{t-1}, X)\right)\right)\right)\right]
$$

MISMATCH MODEL

To model the parameter noise introduced by component mismatch, we used a Gaussian distribution whose mean is the nominal noise-free parameter value and whose standard deviation depends linearly on the parameter value. This model realistically captures the behaviour of parameter mismatch on a mixed-signal neuromorphic SNN inference processor (Moradi et al., 2018). Fig. S1 shows parameter mismatch quantified directly on neuromorphic hardware, over a range of nominal parameter values and for several neuronal and synaptic parameters. The measured parameter variation follows an approximately Gaussian distribution whose standard deviation depends linearly on the mean.

Figure S1: Quantification of mismatch on analog neuromorphic hardware. Parameter values for several weight parameters and membrane time constants were measured for a range of nominal parameter values, using an oscilloscope directly connected to a mixed-signal neuromorphic SNN processor. (a–c) Various parameters of the chip follow a Gaussian distribution of increasing width. (d) The mismatch standard deviation depends linearly on the nominal value of each parameter.

WEIGHT LOSS-LANDSCAPE VISUALIZATION

We characterized the shape of the weight loss-landscape by plotting the categorical cross-entropy loss for varying levels of noise added to the weights of the trained network. In each trial, we picked a random vector v ∼ N(0, ζ|Θ|) and evaluated the categorical cross-entropy over the whole test set given the weights Θ + α · v, where α ∈ [−2, 2] and ζ = 0.2. As we show in Figure S2, the variance across the individual 1D weight loss-landscapes is small. Table S1 quantifies the flatness of the illustrated weight loss-landscapes: our method combined with adding noise during the forward pass yields the flattest landscapes.
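The following NumPy sketch illustrates the sampling scheme just described. It assumes a generic `loss_fn(weights)` callable that evaluates the test-set cross-entropy for a flat weight vector; the function name, the dummy loss in the usage example and all sizes are placeholders rather than the paper's code.

```python
import numpy as np

def sample_1d_loss_landscape(theta, loss_fn, zeta=0.2, n_alphas=21, n_trials=5, seed=0):
    """Estimate 1D cuts through the weight loss-landscape around theta.

    For each trial a random direction v ~ N(0, zeta * |theta|) is drawn and the
    loss is evaluated at theta + alpha * v for alpha in [-2, 2].
    """
    rng = np.random.default_rng(seed)
    alphas = np.linspace(-2.0, 2.0, n_alphas)
    landscapes = np.empty((n_trials, n_alphas))
    for trial in range(n_trials):
        v = rng.normal(loc=0.0, scale=zeta * np.abs(theta))   # magnitude-relative direction
        for i, alpha in enumerate(alphas):
            landscapes[trial, i] = loss_fn(theta + alpha * v)
    # Mean absolute difference between sample points divided by the sampling
    # distance, i.e. the flatness measure reported in Table S1
    d_alpha = alphas[1] - alphas[0]
    mean_abs_slope = np.mean(np.abs(np.diff(landscapes, axis=1)) / d_alpha)
    return alphas, landscapes, mean_abs_slope

# Dummy usage with a quadratic toy loss standing in for the test cross-entropy
theta = np.random.default_rng(1).normal(0, 0.5, size=1000)
alphas, curves, slope = sample_1d_loss_landscape(theta, lambda w: float(np.mean(w ** 2)))
```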
Table S1: Average slope of estimated 1D weight loss-landscapes. Slopes were calculated as the mean absolute difference between sample points, divided by the sampling distance.

| Method | F-MNIST CNN | ECG LSNN | Speech LSNN |
| --- | --- | --- | --- |
| Standard | 0.3375 | 0.1605 | 0.2702 |
| Beta | 0.0480 | 0.0879 | 0.0633 |
| Forward Noise | 0.0332 | **0.0180** | 0.1009 |
| Forward Noise + Beta | **0.0190** | 0.0187 | **0.0540** |
| Dropout | 0.3022 | 0.0952 | 0.0707 |
| AWP | 0.0841 | 0.1013 | 0.1389 |

Figure S2: **Illustration of the test weight loss-landscape, highlighting the individual trials used to measure the loss landscapes.** See the results text for more details.

ATTACKING KL-DIVERGENCE LOSS DURING INFERENCE

While the adversary in AWP attacks the task loss directly, i.e. max Lcce(f(Θ*, X), y), the adversary in our training algorithm attacks the KL divergence, max KL(f(Θ, X), f(Θ*, X)). This implies that our parameter attack simply seeks to change the response of the network in any way, and is agnostic to the task itself. In the main results we attack the cross-entropy task loss during inference, in order not to give an undue advantage to our training approach. Here we show that our training approach also provides robustness against parameter attacks on the KL-divergence loss during inference.

Fig. S3 shows the adversarial robustness of the methods for an adversary that attacks the KL divergence rather than the cross-entropy loss. When the network is attacked by maximizing the KL divergence between the normal and the attacked network, our adversarially-trained networks are more robust than AWP, standard SGD and forward noise. Because the parameter attack used here during inference is the same as the one used in the inner optimization loop of the training procedure, this result is expected; it serves as a sanity check that our networks indeed learn to defend against the attack they were trained against.

Figure S3: Robustness to a weight attack targeting the KL divergence during inference. When the network is attacked by maximizing the KL divergence between the normal and the attacked network, our adversarially-trained networks are more robust than standard SGD or AWP.

EFFECT OF CONSTANT VERSUS RELATIVE PARAMETER NOISE

As described in the main text, parameter noise of constant magnitude (for example, Gaussian noise with a fixed standard deviation) is trivial to protect against by increasing the weight magnitudes. We examined this effect by training MLPs with an adversary that employs Gaussian noise with a fixed standard deviation of ϵ = 0.2.
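To illustrate the difference between the two perturbation models, the snippet below draws constant-magnitude and magnitude-relative Gaussian weight noise for a toy weight vector. The values of ϵ and ζ follow the text; the weight vector itself is a placeholder.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.normal(0.0, 0.5, size=10_000)   # toy weight vector (placeholder)

eps, zeta = 0.2, 0.2

# Constant-magnitude noise: the std is fixed regardless of the weight values,
# so simply scaling the weights up makes the relative perturbation arbitrarily small.
theta_const = theta + rng.normal(0.0, eps, size=theta.shape)

# Magnitude-relative noise (the mismatch model used in the main text):
# the std scales with |theta|, so growing the weights does not buy robustness.
theta_rel = theta + rng.normal(0.0, zeta * np.abs(theta))

print("relative error, constant noise:",
      np.linalg.norm(theta_const - theta) / np.linalg.norm(theta))
print("relative error, relative noise:",
      np.linalg.norm(theta_rel - theta) / np.linalg.norm(theta))
```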
Figure S4 (right) illustrates the test accuracy of two MLPs trained on F-MNIST over the course of training. Using an inverse weight decay term, the weight magnitude of one network is forced to increase to 2.0 over the course of training (black crosses, Θ*). The weight magnitude of the other network is limited to 0.2 during training (red crosses, Θ). As the weight magnitude of the increasing-magnitude network Θ* grows, its robustness to Gaussian noise (ϵ = 0.2) also increases (blue), while the performance of the small-magnitude network Θ under the same noise remains poor (cyan).

Figure S4: Constant-magnitude versus magnitude-relative parameter noise. (left, middle) When parameter noise of constant magnitude (blue) is introduced by the adversarial attack during training, the networks learn to increase the magnitude of their weights to trivially improve robustness to the constant-magnitude attack. When the parameter attack is relative to each parameter magnitude, as in the main text (red), the weight magnitudes do not increase. These networks were trained for a range of βrob, i.e. with varying emphasis on robustness. (right) One can also trivially increase robustness to fixed-magnitude noise by introducing an inverse weight decay term that causes the weights to grow in magnitude (black, cross marker) while retaining performance. This causes the MLP trained on MNIST to become increasingly robust to fixed-magnitude parameter noise (blue dots; ϵ-test acc., Θ*). The network whose weight magnitude was not increased over time (red, cross marker) did not improve in robustness (cyan; ϵ-test acc., Θ), although both models perform similarly when no parameter noise is applied (red and black dots).

EFFECT OF ADVERSARIAL REGULARIZATION DURING TRAINING

Figure S5: **Our parameter attack is effective at decreasing the performance of a network during inference, and using our attack during training protects a network from later disruption.** (a) A network trained using standard SGD (i.e. βrob = 0.0) on the F-MNIST task is badly disrupted by the parameter-noise adversary during inference (ζ = 0.1; black curve). (b) When the same network is trained with parameter attacks (βrob = 0.1), it is protected from parameter attacks during inference (high accuracy of the attacked network; black curve). (c) When trained with standard SGD (βrob = 0.0), both random noise (red) and parameter attacks (black) disrupt network performance for increasing attack size ζ. (d) Under our training approach (βrob = 0.1), networks are significantly protected against both random and adversarial weight perturbations during inference.
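For reference, the perturbation referred to as "our parameter attack" in these figures is the relative, sign-based projected-gradient ascent in weight space written out in the Jacobian derivation above. Below is a minimal NumPy sketch of that inner loop; the toy loss, its analytic gradient and all sizes are illustrative placeholders, and in practice the gradient of L_rob is obtained by automatic differentiation.

```python
import numpy as np

def weight_attack(theta, grad_fn, zeta_attack=0.1, n_steps=5, eps=1e-2, seed=0):
    """Relative, sign-based projected-gradient attack in weight space.

    theta   : flat weight vector of the trained network.
    grad_fn : assumed callable returning dL_rob/dtheta* at the attacked weights.
    zeta_attack sets the total relative attack size; eps sets the random start.
    """
    rng = np.random.default_rng(seed)
    alpha = zeta_attack * np.abs(theta) / n_steps                    # per-parameter step size
    theta_star = theta + np.abs(theta) * eps * rng.normal(size=theta.shape)  # random start
    for _ in range(n_steps):
        theta_star = theta_star + alpha * np.sign(grad_fn(theta_star))       # ascend L_rob
    return theta_star

# Toy usage: the "robustness loss" is the squared change of a linear model's output
x = np.ones(10)
theta = np.linspace(-1.0, 1.0, 10)
L_rob = lambda th: float((x @ th - x @ theta) ** 2)
grad_fn = lambda th: 2.0 * (x @ th - x @ theta) * x   # analytic gradient of the toy loss
theta_adv = weight_attack(theta, grad_fn)
print("L_rob before/after attack:", L_rob(theta), L_rob(theta_adv))
```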
EFFECT OF VARYING βrob

We additionally quantify the trade-off between test loss and robustness by repeating the experiment of Figure 1 at different values of βrob. As shown in Figure S6, increasing βrob flattens the weight loss-landscape, effectively increasing the robustness of the model. To further substantiate this claim, we repeated the experiment of Figure S5 with the same values of βrob. As Figure S8 shows, increasing βrob, and therefore the flatness of the landscape, yields increased robustness to random as well as adversarial perturbations.

We note that, relative to the baseline, the test loss does not consistently increase with increasing values of βrob. We hypothesize that this is due to the increased generalization capability of our networks. To check whether this is indeed the case, we repeated the experiment of Fig. S6 using the loss computed on the training set. As Fig. S7 shows, increasing values of βrob lead to flatter minima and a higher loss on the training set, which is exactly the trade-off expected from the formulation of our loss function. However, the improved generalization that follows from a flatter loss-landscape appears to mask this trade-off on the test set. When choosing βrob, one should therefore aim for the highest value that still yields good performance on the validation set.

Figure S6: **Effect of βrob on the test loss landscape.** Increasing βrob promotes robustness by flattening the test loss landscape. However, increasing βrob does not lead to a systematic rise in loss on the test set.

Figure S7: Effect of βrob on the training loss landscape. Weight loss landscape computed on the training set (c.f. Fig. S6). Increasing βrob leads to flatter loss-landscapes (increased robustness) as before, but also to a systematic increase in training loss (higher cross-entropy values; note the ordering of the curves).
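To make the role of βrob concrete, the sketch below shows how a robustness term can be combined with the task loss, assuming (as described above) that the robustness loss is the KL divergence between the outputs of the nominal and the attacked network. The function names, the toy inputs and the exact combination shown here are illustrative only, not the training code used for these experiments.

```python
import numpy as np

def kl_div(p, q, eps=1e-12):
    """Mean KL divergence between two batches of categorical distributions."""
    p, q = np.clip(p, eps, 1.0), np.clip(q, eps, 1.0)
    return float(np.mean(np.sum(p * np.log(p / q), axis=-1)))

def robust_objective(probs_nominal, probs_attacked, labels, beta_rob=0.1, eps=1e-12):
    """Schematic combination of task loss and robustness regularizer.

    probs_nominal / probs_attacked: softmax outputs of f(Theta, X) and f(Theta*, X).
    labels: integer class labels. beta_rob trades task loss against robustness.
    """
    picked = probs_nominal[np.arange(len(labels)), labels]
    cce = float(np.mean(-np.log(np.clip(picked, eps, 1.0))))   # cross-entropy task loss
    l_rob = kl_div(probs_nominal, probs_attacked)              # robustness term
    return cce + beta_rob * l_rob

# Toy usage with random softmax outputs
rng = np.random.default_rng(0)
logits = rng.normal(size=(8, 10))
p = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)
q = np.exp(logits + 0.1 * rng.normal(size=logits.shape))
q = q / q.sum(-1, keepdims=True)
y = rng.integers(0, 10, size=8)
loss = robust_objective(p, q, y, beta_rob=0.1)
```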
We additionally trained the CNN on the F-MNIST data for a range of βrob, and measured the network robustness to attacks of varying magnitude ζ (Fig. S8). We compared the effectiveness of random versus adversarial attacks during inference, evaluated on the test set, and found that increasing βrob improved the robustness of the network to both random and adversarial attacks.

Figure S8: **Increasing βrob during training improves robustness during inference.** The CNN was trained using different values of βrob and evaluated on the test set after the weights were perturbed either randomly (red) or adversarially (black).

VERIFIABLE ROBUSTNESS FOR LSNNS

Computations through the network are performed on intervals rather than on discrete values. By propagating intervals through the network over a test set, we obtain output logits that are also expressed as intervals, and can therefore determine whether a sample will always be classified correctly. The final classification of the network is made using an arg max operator. For provability of network performance, we consider a test sample x to be correctly classified when the lower bound of the logit interval for the correct class is the maximum lower bound across all logit intervals, and when the logit interval for the correct class is disjoint from the other logit intervals. When the logit interval for the correct class overlaps with another logit interval, that test sample is not considered to be provably correctly classified. Note that interval domain analysis provides a relatively loose bound (Gehr et al., 2018), with the implication that the results here probably underestimate the true performance of our method.

Figure S9: **Networks trained with our method are provably more robust than those trained with standard gradient descent or forward noise alone.** We used interval bound propagation to determine the proportion of test samples that are verifiably correctly classified under increasing weight perturbations ζ. For both the Speech and ECG tasks, computed on the trained LSNNs, our method was provably more robust for ζ < 5 × 10⁻⁴.

WIDE-MARGIN NETWORK ACTIVATIONS

Murray and Edwards show that adding random noise to the weights of a network during the forward pass implicitly adds a regularizer of the form $\Theta_{i,j}^2 \left(\partial o_{k,l} / \partial \Theta_{i,j}\right)^2$ to the network weights (Murray & Edwards, 1994). When sigmoid activation functions are used, this regularizer favors high or low activations.

When implementing interval bound propagation for LSNNs, intervals over spiking activity must be computed by passing intervals through the spiking threshold function, as sketched below.
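The snippet below sketches this interval step for a single affine projection followed by the spike threshold, under an assumed elementwise weight perturbation of relative size ζ. It is a simplified illustration of interval bound propagation, not the verification code used to produce Fig. S9.

```python
import numpy as np

def interval_affine(x, W, zeta):
    """Bounds on y = x @ W when each weight may vary by ±zeta*|W| (input x fixed)."""
    W_lo, W_hi = W - zeta * np.abs(W), W + zeta * np.abs(W)
    # Split the input into positive and negative parts to obtain tight bounds
    x_pos, x_neg = np.clip(x, 0, None), np.clip(x, None, 0)
    y_lo = x_pos @ W_lo + x_neg @ W_hi
    y_hi = x_pos @ W_hi + x_neg @ W_lo
    return y_lo, y_hi

def interval_threshold(V_lo, V_hi, B):
    """Pass membrane-potential intervals through the spike threshold.

    Returns spike intervals: [0, 0] (never spikes), [1, 1] (always spikes),
    or [0, 1] (spiking becomes uncertain under the weight attack).
    """
    o_lo = (V_lo > B).astype(float)   # spikes even in the worst case
    o_hi = (V_hi > B).astype(float)   # spikes in the best case
    return o_lo, o_hi

# Toy usage
rng = np.random.default_rng(0)
x = rng.normal(size=16)
W = rng.normal(0, 0.3, size=(16, 8))
V_lo, V_hi = interval_affine(x, W, zeta=0.1)
o_lo, o_hi = interval_threshold(V_lo, V_hi, B=0.5)
```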
As a result, the intervals over spiking activity become either [0, 0] for neurons that never emit a spike regardless of the weight attack, [1, 1] for neurons that always emit a spike, or [0, 1] for neurons whose activity becomes uncertain in the presence of the weight attack. By definition, a robust network should promote the bounds [0, 0] and [1, 1], for which the activity of the network is unchanged by the weight attack. Robust network configurations should therefore avoid states in which the membrane potentials of neurons are close to the firing threshold.

To see whether this was also the case for our LSNNs, we investigated the distribution of membrane potentials on a batch of test examples for networks trained with and without noise during the forward pass. We found that robust networks exhibited a broader distribution of membrane potentials, with comparatively less probability mass close to the firing threshold (Fig. S10). This indicates that neurons in the robust network spend more time in a “safe” regime where a small change in the weights can neither trigger an unwanted spike nor remove a desired spike.

Figure S10: **Membrane potentials are distributed away from the firing threshold for networks trained with forward noise.** We measured the distribution of membrane potentials in LSNNs trained using standard gradient descent (“Standard”), and in the presence of weight noise injected in the forward pass (“Forward Noise + Beta”). As predicted, networks trained with forward noise have membrane potentials distributed away from the firing threshold. This implies that weight perturbations are less likely to erroneously inject or delete a spike, improving the robustness of the network.

EFFECT OF VARYING ϵPGA IN AWP

In addition to attacking the network parameters, AWP (Wu et al., 2020) also attacks the input using PGA. Since the relation between robustness to weight-space and input-space perturbations is still unclear, we performed additional sweeps over the attack size in the input space. Figure S11 demonstrates that attacking the input during training generally does not improve robustness to weight perturbations, with the exception of the network trained on the speech dataset. We also note that attacking the inputs improved robustness at larger mismatch values, but generally degraded performance at small values.

Figure S11: **Attacking the input using conventional PGA generally has a limited effect on robustness to weight-space perturbations.** We swept various values of ϵpga, the parameter that determines the maximum l∞ perturbation for the AWP algorithm.
TRAINING OF CNN USED FOR PCM-BASED CIM SIMULATION

The CNN used for this series of experiments is ResNet-32 (He et al., 2015) trained on the CIFAR-10 dataset (Krizhevsky, 2009). The CNN was trained for 300 epochs with a batch size of 256. We used SGD with an initial learning rate of 0.001, decreased by a multiplicative factor of 0.2 after epochs 60, 120 and 160. Additionally, we used Nesterov momentum (Nesterov, 1983) with a value of 0.9 and weight decay with a value of 5e-4.

Figure S12: The left panel illustrates the validation accuracy over the course of fine-tuning a pre-trained model for an additional 240 epochs. Note that convergence is usually reached considerably earlier (roughly after 150 additional epochs). The right panel illustrates the test-set performance of each model trained with noise injection of magnitude ηtrain (x-axis). Each row depicts a different noise model. Note that the weights in each filter were clipped to two standard deviations during training, to avoid outliers causing excessive noise under the noise model that scales with the maximum weight value.

VARYING THE NUMBER OF ATTACK STEPS FOR THE PCM-BASED CIM SIMULATION

Figure S13: **One can greatly reduce the number of attack steps used during training.** Our method still produces strong results for a very small number of attack steps (blue) when compared with the baseline model (red, trained with Gaussian noise of magnitude ηtrain).

PERFORMANCE OF VARYING HYPERPARAMETERS FOR THE PCM-BASED CIM SIMULATION

In this experiment we show that the choice of hyperparameters is generally not critical for outperforming the baseline trained with noise injection. However, in order to surpass the FP baseline, i.e. the model trained without noise injection and evaluated on a standard PC, the hyperparameters must be tuned to the combination that yields the highest performance. Figure S14 illustrates this sweep. Each row represents a different value of ηtrain used for training the baseline model (red). Each column represents a different attack size, and the different hues of blue correspond to varying values of βrob.
Figure S14: The choice of hyperparameters is not critical in order to beat the baseline model: our method is resilient to variations in its hyperparameters. However, to obtain configurations in which even the FP baseline is surpassed, the method has to be fine-tuned.

PCM NOISE MODEL

Analog CiM comes in various flavors, depending on the memory technology used. In this paper we assume the use of PCM devices, which have been studied extensively in the context of analog CiM accelerators (Joshi et al., 2020; Nandakumar et al., 2019; Boybat et al., 2018). PCM-based, or more generally Non-Volatile Memory (NVM)-based, architectures essentially perform Matrix-Vector Multiplications (MVMs) using Kirchhoff's current law. The weights of each matrix are organized as differential pairs in order to represent both positive and negative values. When storing a neural network, each weight matrix is programmed into the NVM devices by applying short electrical pulses (Nandakumar et al., 2020b). Because of various noise sources, this process is imprecise and introduces noise on the weights, termed “programming noise”. Additionally, PCM devices suffer from 1/f and telegraph noise, which adds further noise during inference (“read noise”). Finally, PCM devices also drift over time due to their underlying physical properties (Le Gallo et al., 2018a).
Although the effect of drift can largely be compensated by scaling the output of the MVM (a method called Global Drift Compensation (GDC)), the non-uniform drift of the devices still leads to performance degradation over time. In the simulator we used, we model these three main sources of noise, as well as analog-to-digital and digital-to-analog conversion, GDC, and the splitting of the MVM across smaller tiles (each crossbar is typically 256 × 256).

Initially, the clipped weights are mapped to target conductances in a differential manner, i.e. the weight matrix is split into two conductance matrices representing the positive and the negative weights, respectively. The target conductances typically range from zero to Gmax, where Gmax is assumed to be 25 µS. After mapping the weights to the target conductances GT, the programming noise is simulated (the statistical models below assume normalized conductances):

$$
G_P = G_T + \mathcal{N}(0, \sigma_P), \qquad \sigma_P = \max\left(-1.1731\, G_T^2 + 1.9650\, G_T + 0.2635,\; 0\right)
$$

After the conductances have been programmed, they drift over time, with the conductance of a device typically following

$$
G_D = G_P \left(t / t_c\right)^{-\nu}
$$

where ν is the drift coefficient, t is the time at inference, and t_c is the time at which the conductances were programmed. The drift coefficient is itself modelled as Gaussian-distributed across devices. This makes it hard to correct for drift exactly, and it is the main reason why drift is a problem in PCM-based CiM devices. Finally, the read noise is modelled as Gaussian, G_R ∼ N(G_D, σ_nG(t)), where

$$
\sigma_{nG}(t) = G_D(t)\, Q\, \sqrt{\log\!\left(\frac{t + t_r}{t_r}\right)}, \qquad Q = \min\left(\frac{0.0088}{G_T^{0.65}},\; 0.2\right)
$$

with t_r = 250 ns and t the time at inference. For the experiments in this paper, we simulated the performance of the networks deployed on CiM hardware for up to one year.
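As a compact reference, the following NumPy sketch combines the three noise sources above into a single weight-perturbation function. The weight-to-conductance mapping, the unit handling, the drift-coefficient statistics (mean 0.05, std 0.02), the programming reference time t_c, and the omission of ADC/DAC quantization, GDC and tile splitting are all simplifying assumptions; only the statistical fits are those quoted in the text.

```python
import numpy as np

G_MAX = 25.0  # maximum target conductance in uS, as assumed in the text

def pcm_noise_model(W, w_max, t_inference, t_c=25.0, t_r=250e-9, seed=0):
    """Simplified sketch of the PCM programming, drift and read noise described above.

    W is a weight matrix; |W| is mapped to target conductances in [0, G_MAX] uS.
    Sign handling via differential pairs, ADC/DAC, GDC and tile splitting are omitted.
    t_c and the drift-coefficient distribution are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)

    # Map |weights| to target conductances; the fits below use normalized g = G / G_MAX
    G_T = np.abs(W) / w_max * G_MAX
    g = G_T / G_MAX

    # 1) Programming noise (quadratic fit quoted in the text)
    sigma_P = np.maximum(-1.1731 * g**2 + 1.9650 * g + 0.2635, 0.0)
    G_P = G_T + rng.normal(size=G_T.shape) * sigma_P

    # 2) Conductance drift; drift coefficient nu drawn per device (assumed Gaussian)
    nu = np.clip(rng.normal(0.05, 0.02, size=G_T.shape), 0.0, None)
    G_D = G_P * (t_inference / t_c) ** (-nu)

    # 3) Read noise (1/f and random telegraph noise)
    Q = np.minimum(0.0088 / np.maximum(g, 1e-6) ** 0.65, 0.2)
    sigma_R = np.abs(G_D) * Q * np.sqrt(np.log((t_inference + t_r) / t_r))
    G_R = G_D + rng.normal(size=G_T.shape) * sigma_R

    # Map back to the weight domain, restoring the sign
    return np.sign(W) * G_R / G_MAX * w_max

# Toy usage: perturb a random weight matrix as if read out one year after programming
rng = np.random.default_rng(1)
W = rng.normal(0.0, 0.1, size=(256, 256))
W_noisy = pcm_noise_model(W, w_max=np.max(np.abs(W)), t_inference=365 * 24 * 3600.0)
```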