# NETWORK INSENSITIVITY TO PARAMETER NOISE VIA ADVERSARIAL REGULARIZATION
|
|
|
|
|
**Julian Büchel** |
|
IBM Research - Zurich |
|
SynSense, Zürich, Switzerland |
|
ETH Zürich, Switzerland |
|
jbu@zurich.ibm.com |
|
|
|
|
|
**Fynn Faber** |
|
ETH Zürich, Switzerland |
|
faberf@ethz.ch |
|
|
|
**Dylan R. Muir**

SynSense, Zürich, Switzerland

dylan.muir@synsense.ai


ABSTRACT
|
|
|
|
|
Neuromorphic neural network processors, in the form of compute-in-memory crossbar arrays of memristors, or in the form of subthreshold analog and mixed-signal |
|
ASICs, promise enormous advantages in compute density and energy efficiency |
|
for NN-based ML tasks. However, these technologies are prone to computational |
|
non-idealities, due to process variation and intrinsic device physics. This degrades |
|
the task performance of networks deployed to the processor, by introducing parameter noise into the deployed model. While it is possible to calibrate each device, |
|
or train networks individually for each processor, these approaches are expensive |
|
and impractical for commercial deployment. Alternative methods are therefore |
|
needed to train networks that are inherently robust against parameter variation, as a |
|
consequence of network architecture and parameters. We present a new network |
|
training algorithm that attacks network parameters during training, and promotes |
|
robust performance during inference in the face of random parameter variation. |
|
Our approach introduces a loss regularization term that penalizes the susceptibility |
|
of a network to weight perturbation. We compare against previous approaches for |
|
producing parameter insensitivity such as dropout, weight smoothing and introducing parameter noise during training. We show that our approach produces models |
|
that are more robust to random mismatch-induced parameter variation as well as |
|
to targeted parameter variation. Our approach finds minima in flatter locations in |
|
the weight-loss landscape compared with other approaches, highlighting that the |
|
networks found by our technique are less sensitive to parameter perturbation. Our |
|
work provides an approach to deploy neural network architectures to inference |
|
devices that suffer from computational non-idealities, with minimal loss of performance. This method will enable deployment at scale to novel energy-efficient |
|
computational substrates, promoting cheaper and more prevalent edge inference. |
|
|
|
1 INTRODUCTION |
|
|
|
There is increasing interest in NN and ML inference on IoT and embedded devices, which imposes |
|
energy constraints due to small battery capacity and untethered operation. Existing edge inference |
|
solutions based on CPUs or vector processing engines such as GPUs or TPUs are improving in |
|
energy efficiency, but still entail considerable energy cost (Huang et al., 2009). Alternative compute |
|
architectures such as memristor crossbar arrays and mixed-signal event-driven neural network accelerators promise significantly reduced energy consumption for edge inference tasks. Novel non-volatile |
|
memory technologies such as resistive RAM and phase-change materials (Chen, 2016; Yu & Chen, |
|
2016) promise increased memory density with multiple bits per memory cell, as well as compact |
|
compute-in-memory for NN inference tasks (Sebastian et al., 2020). Analog implementations of |
|
neurons and synapses, coupled with asynchronous digital routing fabrics, permit high sparsity in both |
|
network architecture and activity, thereby reducing energy costs associated with computation. |
|
|
|
However, both of these novel compute fabrics introduce complexity in the form of computational |
|
non-idealities, which do not exist for pure synchronous digital solutions. Some novel memory |
|
technologies support several bits per memory cell, but with uncertainty about the precise value stored |
|
on each cycle (Le Gallo et al., 2018b; Wu et al., 2019). Others exhibit significant drift in stored |
|
|
|
|
|
|
|
|
states (Joshi et al., 2020). Inference processors based on analog and mixed-signal devices (Neckar |
|
et al., 2019; Moradi et al., 2018; Cassidy et al., 2016; Schemmel et al., 2010; Khaddam-Aljameh |
|
et al., 2022) exhibit parameter variation across the surface of a chip, and between chips, due to |
|
manufacturing process non-idealities. Collectively, these processes, known as “device mismatch”,
|
manifest as frozen parameter noise in weights and neuron parameters. |
|
|
|
In all cases the mismatch between configured and implemented network parameters degrades the task |
|
performance by modifying the resulting mapping between input and output. Existing solutions for |
|
deploying networks to inference devices that exhibit mismatch mostly focus on per-device calibration |
|
or re-training (Ambrogio et al., 2018; Bauer et al., 2019; Nandakumar et al., 2020a). However, this, |
|
and other approaches such as few-shot learning or meta learning entail significant per-device handling |
|
costs, making them unfit for commercial deployment. |
|
|
|
We consider a network to be “robust” if its output for a given input does not change in
|
the face of parameter perturbation. With this goal, network architectures that are intrinsically robust |
|
against device mismatch can be investigated (Thakur et al., 2018; Büchel et al., 2021). Another |
|
approach is to introduce parameter perturbations during training that promote robustness during |
|
inference, for example via random pruning (dropout) (Srivastava et al., 2014) or by injecting noise |
|
(Murray & Edwards, 1994). |
|
|
|
In this paper we introduce a novel solution, by applying adversarial training approaches to parameter |
|
mismatch. Most existing adversarial training methods attack the input space. Here we describe an |
|
adversarial attack during training that seeks the parameter perturbation that causes the maximum |
|
degradation in network response. In summary, we make the following contributions: |
|
|
|
- We propose a novel algorithm for gradient-based supervised training of networks that are robust |
|
against parameter mismatch, by performing adversarial training in the weight space. |
|
|
|
- We demonstrate that our algorithm flattens the weight-loss landscape and therefore leads to models |
|
that are inherently more robust to parameter noise. |
|
|
|
- We show that our approach outperforms existing methods in terms of robustness. |
|
|
|
- We validate our algorithm on a highly accurate Phase Change Memory (PCM)-based Compute-in-Memory (CiM) simulator and achieve new state-of-the-art results in terms of performance and
|
performance retention over time. |
|
|
|
2 RELATED WORK |
|
|
|
Research to date has focused mainly on adversarial attacks in the input space. With an increasing |
|
number of adversarial attacks, an increasing number of schemes defending against those attacks |
|
have been proposed (Wang et al., 2020; Zhang et al., 2019; Madry et al., 2019; Moosavi-Dezfooli |
|
et al., 2018). In contrast, adversarial attacks in parameter space have received little attention. Where |
|
parameter-space adversaries have been examined, it has been to enhance performance in semi-supervised learning (Cicek & Soatto, 2019), to improve robustness to input-space adversarial attacks
|
(Wu et al., 2020), or to improve generalisation capability (Zheng et al., 2020). |
|
|
|
We define “robustness” to mean that the network output should change only minimally in the face of |
|
a parameter perturbation — in other words, the weight-loss landscape should be as flat as possible at |
|
a loss minimum. Other algorithms that promote flat loss landscapes may therefore also be useful to |
|
promote robustness to parameter perturbations. |
|
|
|
**Dropout** (Srivastava et al., 2014) is a widely used method to reduce overfitting. During training, a random subset of units is chosen with some probability, and these units are pruned from the network
|
for a single trial or batch. This results in the network learning to distribute its computation across |
|
many units, and acts as a regularization against overfitting. |
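As a point of reference, the mechanism can be illustrated with the standard PyTorch dropout module (a minimal usage sketch, not the configuration used in any of the cited works):

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)          # each unit is pruned with probability 0.5
x = torch.randn(8, 64)

drop.train()                           # training mode: random units are zeroed per batch
print(drop(x).eq(0).float().mean())    # roughly 0.5 of the entries are pruned

drop.eval()                            # inference mode: dropout is inactive
print(torch.allclose(drop(x), x))      # True
```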
|
|
|
**Entropy-SGD** (Chaudhari et al., 2019) is a network optimisation method that minimises the local
|
entropy around a solution in parameter space. This results in a smoothed parameter-loss landscape |
|
that should penalize sharp minima. |
|
|
|
**Adversarial Block Coordinate Descent (ABCD)** (Cicek & Soatto, 2019) was proposed in order
|
to complement input-space smoothing with weight-space smoothing in semi-supervised learning. |
|
|
|
|
|
|
|
|
ABCD repeatedly picks half of the network weights and performs one step of gradient ascent on |
|
them, followed by applying gradient descent on the other half. |
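A rough sketch of one such round for a single parameter tensor is shown below, assuming a scalar loss function `loss_fn(theta)`; the function and names are illustrative, not the authors' implementation.

```python
import torch

def abcd_round(theta: torch.Tensor, loss_fn, lr: float = 1e-2) -> torch.Tensor:
    """One ABCD round: gradient ascent on a random half of the weights,
    then gradient descent on the other half.
    Assumes `theta` is a leaf tensor with requires_grad=True."""
    mask = (torch.rand_like(theta) < 0.5).float()      # pick half of the weights

    g, = torch.autograd.grad(loss_fn(theta), theta)    # ascent step on the chosen half
    with torch.no_grad():
        theta += lr * mask * g

    g, = torch.autograd.grad(loss_fn(theta), theta)    # descent step on the other half
    with torch.no_grad():
        theta -= lr * (1.0 - mask) * g

    return theta
```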
|
|
|
**Adversarial Weight Perturbation (AWP)** (Wu et al., 2020) was designed to improve the robustness of a network to adversarial attacks in the input space. The authors use Projected Gradient Ascent (PGA) on the network parameters to approximate a worst-case perturbation of the weights Θ′. PGA repeatedly computes the gradient of a loss function and updates the parameters in the direction of the (positive) gradient. After each update, the parameters are projected back onto a ball (e.g. in l²) around the original parameters to ensure that a maximum distance is kept. Having identified an adversarial perturbation in the weight-space, an adversarial perturbation in the input-space is also found using PGA. Finally, the original weights Θ are updated using the gradient of the loss evaluated at the adversarial perturbation Θ′.
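As a small illustration of the projection step described above, a perturbed weight tensor can be pulled back onto an l² ball of radius ρ around the original weights (an illustrative sketch, not the AWP reference code):

```python
import torch

def project_l2_ball(theta_adv: torch.Tensor, theta: torch.Tensor, rho: float) -> torch.Tensor:
    """Project perturbed weights back onto the l2 ball of radius rho
    centred on the original weights, applied after each PGA step."""
    delta = theta_adv - theta
    norm = delta.norm(p=2)
    if norm > rho:                      # only rescale if the perturbation left the ball
        delta = delta * (rho / norm)
    return theta + delta
```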
|
|
|
**Adversarial Model Perturbation (AMP)** (Zheng et al., 2020) improves the generalisation of conventional neural networks by optimizing a standard loss evaluated using parameters that were perturbed
|
adversarially using PGA. Unlike our method, (Zheng et al., 2020) did not formulate the loss function |
|
as a trade-off between performance and robustness. Furthermore, the presented algorithm, unlike our |
|
method, treats the perturbation ∆Θ to the parameters Θ as a constant during backpropagation. |
|
|
|
**TRadeoff-inspired Adversarial DEfense via Surrogate-loss minimization (TRADES)** (Zhang
|
et al., 2019) is a method for training networks that are robust against adversarial examples in the |
|
input space. The method consists of adding a boundary loss term to the loss function that measures |
|
how the network performance changes when the input is attacked. The boundary loss does not take |
|
the labels into account, so scaling it by a factor β_rob allows for a principled trade-off between the
|
robustness and the accuracy of the network. |
|
|
|
**Noise injection during the forward pass** (Murray & Edwards, 1994) is a simple method for increasing network robustness to parameter noise. This method adds Gaussian noise to the network
|
parameters during the forward pass and computes weight gradients with respect to the original |
|
parameters. This method regularizes the gradient magnitudes of output units with respect to the |
|
weights, thus enforcing distributed information processing and insensitivity to parameter noise. We |
|
refer to this method as “Forward Noise”. |
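A minimal sketch of this scheme for a single linear layer is given below, with the noise standard deviation `sigma` as an illustrative stand-in for whatever noise scale is chosen:

```python
import torch

def forward_noise_linear(x, weight, sigma=0.1):
    """Forward pass with Gaussian noise added to the weights.
    The noise is a constant w.r.t. autograd, so gradients are computed
    with respect to the original (noise-free) weights."""
    noisy_weight = weight + sigma * torch.randn_like(weight)
    return x @ noisy_weight.t()

weight = torch.randn(128, 64, requires_grad=True)
x = torch.randn(32, 64)
loss = forward_noise_linear(x, weight).pow(2).mean()
loss.backward()                      # populates weight.grad for the original weights
```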
|
|
|
A recent paper proposed a method for improving the resilience to random and targeted bit errors in |
|
SRAM cells on digital Deep Neural Network (DNN) accelerators (Stutz et al., 2021). By employing |
|
adversarial or random bit flips during training, the authors significantly improved the robustness to |
|
bit perturbations, enabling the accelerators to be operated below the conventional supply voltage. |
|
|
|
3 METHODS |
|
|
|
We use Θ to denote the set of parameters of a neural network f(x, Θ) that are trainable and susceptible to mismatch. The adversarial weights are denoted Θ*, where Θ*_t are the adversarial weights at the t-th iteration of PGA. We denote the PGA adversary as a function A that maps parameters Θ to attacking parameters Θ*. We denote a mini-batch of training examples as X, with y being the corresponding ground-truth labels.
|
|
|
∏_{E^p_ζ}(m) denotes the projection operator onto the ζ-ellipsoid in l^p space. The operator ⊙ denotes elementwise multiplication.
|
|
|
The effect of component mismatch on a network parameter can be modelled using a Gaussian |
|
distribution where the standard deviation depends on the parameter magnitude (Joshi et al., 2020; |
|
Büchel et al., 2021). In this paper we restrict ourselves to mismatch-driven perturbations in the |
|
network weights. For complex Spiking Neural Networks (SNNs), “network parameters” can refer to |
|
additional quantities such as neuronal and synaptic time constants or spiking thresholds. Our training |
|
approach described here can be equally applied to these additional parameters. |
|
|
|
We define the value of an individual parameter when deployed on a neuromorphic chip as |
|
|
|
|
|
$$\Theta^{\mathrm{mismatch}} \sim \mathcal{N}\big(\Theta,\ \operatorname{diag}(\zeta |\Theta|)\big) \qquad (1)$$
|
|
|
where ζ governs the perturbation magnitude, referred to as the “mismatch level”. The physics underlying the neuronal and synaptic circuits leads to a model in which the amount of noise introduced into the
|
system depends linearly on the magnitude of the parameters. If mismatch-induced perturbations had |
|
constant standard deviation independent of weight values, one could use the weight-scale invariance |
|
|
|
|
|
|
|
|
of neural networks as a means to achieve robustness, by simply scaling up all network weights (see |
|
Figure S4). The linear dependence of weight magnitude and mismatch noise precludes this approach. |
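For illustration, one frozen mismatch realisation of Eq. (1) can be simulated as below; the sketch assumes PyTorch and reads ζ|Θ| as the per-parameter standard deviation, matching the linear magnitude dependence described above.

```python
import torch

def simulate_mismatch(theta: torch.Tensor, zeta: float) -> torch.Tensor:
    """Sample one frozen mismatch realisation of Eq. (1): each parameter is
    perturbed by Gaussian noise whose standard deviation grows linearly
    with the parameter magnitude (zeta is the "mismatch level")."""
    return theta + zeta * theta.abs() * torch.randn_like(theta)

# Example: "deploy" a weight matrix at mismatch level 0.2. Scaling all weights
# up does not help, because the noise scales up with them.
weights = torch.randn(128, 64)
deployed = simulate_mismatch(weights, zeta=0.2)
```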
|
|
|
In contrast to adversarial attacks in the input space (Carlini & Wagner, 2016; Moosavi-Dezfooli |
|
et al., 2015; Madry et al., 2019; Goodfellow et al., 2015), our method relies on adversarial attacks |
|
in parameter space. During training, we approximate the worst-case perturbation of the network parameters using PGA and update the network parameters in order to mitigate these attacks. To trade off robustness and performance, we use a surrogate loss (Zhang et al., 2019) to capture the
|
difference in output between the normal and attacked network. Algorithm 1 illustrates the training |
|
procedure in more detail. |
|
|
|
**begin**

&nbsp;&nbsp;Θ*_0 ← Θ + |Θ| ⊙ ϵ · R ;&nbsp; R ∼ N(0, 1)

&nbsp;&nbsp;**for** t = 1 **to** N_steps **do**

&nbsp;&nbsp;&nbsp;&nbsp;g ← ∇_{Θ*_{t−1}} L_rob(Θ, Θ*_{t−1}, X)

&nbsp;&nbsp;&nbsp;&nbsp;v ← arg max_{v : ∥v∥_p ≤ 1} vᵀg

&nbsp;&nbsp;&nbsp;&nbsp;Θ*_t ← ∏_{E^p_{ζ_attack}}(Θ*_{t−1} + α ⊙ v)

&nbsp;&nbsp;**end**

&nbsp;&nbsp;Θ ← Θ − η ∇_Θ [ L_nat(Θ, X, y) + β_rob · L_rob(Θ, Θ*_{N_steps}, X) ]

**end**

**Algorithm 1:** In l^∞, v corresponds to sign(g) and the step size α is |Θ| ⊙ ζ_attack / N_steps. ∏_{E^p_{ζ_attack}}(m) denotes the projection operator onto the ζ_attack-ellipsoid in l^p space. In l^∞ this corresponds to min(max(m, Θ − ϵ), Θ + ϵ) with ϵ = ζ_attack ⊙ |Θ|. ζ_attack and β_rob are hyperparameters of our method.
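A minimal PyTorch sketch of the l^∞ adversary A(Θ) in Algorithm 1 follows, written for a single parameter tensor. Here `loss_rob` is any robustness loss of the form L_rob(Θ, Θ*, X) (the KL-based choice used in our experiments is given in Eq. (2) below), and all names and default values are illustrative assumptions rather than the reference implementation.

```python
import torch

def pga_weight_attack(theta, X, loss_rob, zeta_attack=0.1, eps=1e-3, n_steps=10):
    """Approximate the worst-case weight perturbation Theta* by projected
    gradient ascent in l-infinity (inner loop of Algorithm 1)."""
    theta_d = theta.detach()
    alpha = theta_d.abs() * zeta_attack / n_steps     # per-parameter step size
    bound = theta_d.abs() * zeta_attack               # radius of the attack ellipsoid

    # Theta*_0: random start in a small ellipsoid around Theta
    theta_adv = theta_d + theta_d.abs() * eps * torch.randn_like(theta_d)

    for _ in range(n_steps):
        theta_adv = theta_adv.detach().requires_grad_(True)
        g, = torch.autograd.grad(loss_rob(theta, theta_adv, X), theta_adv)
        # In l-infinity the steepest-ascent direction v is sign(g)
        theta_adv = theta_adv.detach() + alpha * g.sign()
        # Projection: min(max(m, Theta - eps'), Theta + eps') with eps' = zeta_attack * |Theta|
        theta_adv = torch.minimum(torch.maximum(theta_adv, theta_d - bound),
                                  theta_d + bound)

    return theta_adv.detach()
```

The outer update of Algorithm 1 then performs one gradient-descent step on L_nat(Θ, X, y) + β_rob · L_rob(Θ, Θ*_{N_steps}, X), with Θ*_{N_steps} produced by this adversary.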
|
|
|
|
|
Unlike adversarial training in the input space, where adversarial inputs can be seen as a form of data |
|
augmentation, adversarial training in the parameter space poses the following challenge: Because |
|
the parameters that are attacked are the same parameters being optimized, performing gradient |
|
descent using the same loss that was used for PGA would simply revert the previous updates and |
|
no learning would occur. ABCD circumvents this problem by masking one half of the parameters |
|
in the adversarial loop and masking the other half during the gradient descent step. However, this |
|
limits the adversary in its power, and requires multiple iterations to be performed in order to update |
|
all parameters at least once. AWP approached this problem by assuming that the gradient of the loss |
|
with respect to the attacking parameters can be used in order to update the original parameters to |
|
favor minima in flatter locations in weight-space. However, it is not clear whether this assumption |
|
always holds, since the gradient of the loss with respect to the attacking parameters does not necessarily point in a direction that leads to a flatter region of the weight-loss landscape.
|
|
|
We approach this problem slightly differently: Similar to the TRADES algorithm (Zhang et al., 2019), |
|
our algorithm optimizes a natural (task) loss and a separate robustness loss. |
|
|
|
$$\mathcal{L}_{\mathrm{gen}}(\Theta, X, y) = \mathcal{L}_{\mathrm{nat}}(\Theta, X, y) + \beta_{\mathrm{rob}}\,\mathcal{L}_{\mathrm{rob}}(\Theta, \mathcal{A}(\Theta), X)$$
|
|
|
Using a different loss for capturing the susceptibility of the network to adversarial attacks enables |
|
us to simultaneously optimise for performance and robustness, without PGA interfering with the |
|
gradient descent step. In our experiments, L_rob is defined as
|
|
|
$$\mathcal{L}_{\mathrm{rob}}(\Theta, \Theta^{*}, X) = \mathrm{KL}\big(f(\Theta, X),\ f(\Theta^{*}, X)\big) \qquad (2)$$
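Continuing the sketch above, Eq. (2) might be computed as follows for a classifier whose forward function f(Θ, X) returns logits; the helper name and the use of a softmax output are assumptions for illustration.

```python
import torch.nn.functional as F

def make_loss_rob(f):
    """Build the robustness loss of Eq. (2) for a forward function f(theta, X)."""
    def loss_rob(theta, theta_adv, X):
        log_p = F.log_softmax(f(theta, X), dim=-1)       # nominal network output
        log_q = F.log_softmax(f(theta_adv, X), dim=-1)   # attacked network output
        # KL(f(theta, X) || f(theta_adv, X)), averaged over the mini-batch
        return F.kl_div(log_q, log_p, log_target=True, reduction="batchmean")
    return loss_rob
```

With `loss_rob = make_loss_rob(f)`, the combined objective above becomes `F.cross_entropy(f(theta, X), y) + beta_rob * loss_rob(theta, pga_weight_attack(theta, X, loss_rob), X)`, i.e. L_gen = L_nat + β_rob · L_rob(Θ, A(Θ), X).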
|
|
|
The formulation in Eq. (2) comes with a large computational overhead, since it requires computing the Jacobian J_{Θ*}(Θ) of a complex recurrent relation between Θ and Θ*. To make our algorithm more efficient, we assume that the Jacobian is diagonal, meaning that Θ* = Θ + ∆Θ for some ∆Θ given by the adversary. In l^∞, the Jacobian can then be calculated efficiently using (see suppl. material for details):
|
|
|
$$\mathbf{J}_{\Theta^{*}}(\Theta) = I + \operatorname{diag}\!\left[\mathrm{sign}(\Theta) \odot \frac{\zeta_{\mathrm{attack}} + \epsilon \cdot R_{1}}{N_{\mathrm{steps}}} \odot \sum_{t=1}^{N_{\mathrm{steps}}} \mathrm{sign}\!\left(\nabla_{\Theta^{*}_{t}} \mathcal{L}_{\mathrm{rob}}(\Theta, \Theta^{*}_{t}, X)\right)\right]$$
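As a rough sketch of the bookkeeping this implies (following the expression as reconstructed above, with the notation of Algorithm 1), the diagonal of J_{Θ*}(Θ) can be assembled from the initial noise draw R_1 and the signs of the PGA gradients collected during the attack; all names are illustrative.

```python
import torch

def diagonal_jacobian(theta, grad_signs, R1, zeta_attack, eps):
    """Diagonal entries of J_{Theta*}(Theta) under the diagonal approximation:
    I + diag( sign(Theta) * (zeta_attack + eps * R1) / N_steps
              * sum_t sign(grad_t) )."""
    n_steps = len(grad_signs)
    correction = (theta.sign() * (zeta_attack + eps * R1) / n_steps
                  * torch.stack(grad_signs).sum(dim=0))
    return 1.0 + correction      # elementwise, i.e. the Jacobian diagonal
```

The factors sign(∇_{Θ*_t} L_rob) can simply be stored in each PGA iteration, so assembling this correction requires no additional backward passes.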
|
|
|
|