|
## DEEP LEARNING VIA MESSAGE PASSING ALGORITHMS BASED ON BELIEF PROPAGATION
|
|
|
**Anonymous authors** |
|
Paper under double-blind review |
|
|
|
ABSTRACT |
|
|
|
Message-passing algorithms based on the Belief Propagation (BP) equations constitute a well-known distributed computational scheme. They yield exact marginals on tree-like graphical models and have also proven to be effective in many problems defined on loopy graphs, from inference to optimization, from signal processing to clustering. BP-based schemes are fundamentally different from stochastic gradient descent (SGD), on which the current success of deep networks is based. In this paper, we present and adapt to mini-batch training on GPUs a family of BP-based message-passing algorithms with a reinforcement term that biases distributions towards locally entropic solutions. These algorithms are capable of training multi-layer neural networks with performance comparable to SGD heuristics in a diverse set of experiments on natural datasets, including multi-class image classification and continual learning, while yielding improved performance on sparse networks. Furthermore, they allow one to make approximate Bayesian predictions that have higher accuracy than point-wise ones.
|
|
|
1 INTRODUCTION |
|
|
|
Belief Propagation is a method for computing marginals and entropies in probabilistic inference problems (Bethe, 1935; Peierls, 1936; Gallager, 1962; Pearl, 1982). These include optimization problems as well, once they are written as the zero-temperature limit of a Gibbs distribution that uses the cost function as the energy. Learning is one particular case, in which one wants to minimize a data-dependent loss function. These problems are generally intractable, and message-passing techniques have been particularly successful at providing principled approximations through efficient distributed computations.
|
|
|
A particularly compact representation of inference/optimization problems that is used to build message-passing algorithms is provided by factor graphs. A factor graph is a bipartite graph composed of variable nodes and factor nodes expressing the interactions among variables. Belief Propagation is exact on tree-like factor graphs (Yedidia et al., 2003), where the Gibbs distribution is naturally factorized, whereas it is approximate on graphs with loops. Still, loopy BP is routinely used with success in many real-world applications, ranging from error-correcting codes to vision and clustering, just to mention a few. In all these problems, loops are indeed present in the factor graph, and yet the variables are weakly correlated at long range, so BP gives good results. A field in which BP has a long history is the statistical physics of disordered systems, where it is known as the Cavity Method (Mézard et al., 1987). It has been used to study the typical properties of spin glass models, which represent binary variables interacting through random couplings over a given graph. It is well known that in spin glass models defined on complete graphs and on locally tree-like random graphs, which are both loopy, the weak-correlation conditions between variables may hold and BP gives asymptotically exact results (Mézard & Montanari, 2009). Here we will mostly focus on neural networks with ±1 binary weights and sign activation functions, for which the messages and the marginals can be described simply by the difference between the probabilities associated with the +1 and -1 states, the so-called *magnetizations*. The effectiveness of BP for deep learning has never been numerically tested in a systematic way; however, there is clear evidence that the weak correlation decay condition does not hold, and thus BP convergence and approximation quality are unpredictable.
|
|
|
In this paper we explore the effectiveness of a variant of BP that has shown excellent convergence properties in hard optimization problems and in non-convex shallow networks. It goes under the name of focusing BP (fBP) and is based on a probability distribution, a likelihood, that focuses on highly entropic wide minima, neglecting the contribution to marginals from narrow minima even when they are the majority (and hence dominate the Gibbs distribution). This version of BP is thus expected to give good results only in models that have such wide entropic minima as part of their energy landscape. As discussed in (Baldassi et al., 2016a), a simple way to define fBP is to add a "reinforcement" term to the BP equations: an iteration-dependent local field is introduced for each variable, with an intensity proportional to its marginal probability computed in the previous iteration step. This field is gradually increased until the entire system becomes fully biased towards a single configuration. The first version of reinforced BP was introduced in (Braunstein & Zecchina, 2006) as a heuristic algorithm to solve the learning problem in shallow binary networks. Baldassi et al. (2016a) showed that this version of BP is a limiting case of fBP, i.e., of BP equations written for a likelihood that uses the local entropy function instead of the error (energy) loss function. As discussed in depth in that study, one way to introduce a likelihood that focuses on highly entropic regions is to create y coupled replicas of the original system; the fBP equations are obtained as BP equations for the replicated system. It turns out that the fBP equations are identical to the BP equations for the original system, with the only addition of a self-reinforcing term in the message passing scheme. The fBP algorithm can be used as a solver by gradually increasing the effect of the reinforcement: one can control the size of the regions over which the fBP equations estimate the marginals by tuning the parameters that appear in the expression of the reinforcement, until the high entropy regions reduce to a single configuration. Interestingly, by keeping the size of the high entropy region fixed, the fBP fixed point allows one to estimate the marginals and entropy relative to the region.
|
|
|
In this work, we present and adapt to GPU computation a family of fBP-inspired message passing algorithms that are capable of training multi-layer neural networks with generalization performance and computational speed comparable to SGD. This is the first work showing that learning by message passing in deep neural networks 1) is possible and 2) is a viable alternative to SGD. Our version of fBP adds the reinforcement term at each mini-batch step in what we call the Posterior-as-Prior (PasP) rule. Furthermore, using the message-passing algorithm not as a solver but as an estimator of marginals allows us to make locally Bayesian predictions, averaging the predictions over the approximate posterior. The resulting generalization error is significantly better than that of the solver, showing that, although approximate, the marginals of the weights estimated by message passing retain useful information. Consistently with the assumptions underlying fBP, we find that the solutions provided by the message passing algorithms belong to flat entropic regions of the loss landscape and perform well in continual learning tasks and on sparse networks as well.

We also remark that our PasP update scheme is of independent interest and can be combined with different posterior approximation techniques.
|
|
|
The paper is structured as follows: in Sec. 2 we give a brief review of related work. In Sec. 3 we provide a detailed description of the message-passing equations and of the high-level structure of the algorithms. In Sec. 4 we compare the performance of the message passing algorithms against SGD-based approaches in different learning settings.
|
|
|
2 RELATED WORKS |
|
|
|
The literature on message passing algorithms is extensive; we refer to Mézard & Montanari (2009) and Zdeborová & Krzakala (2016) for a general overview. More closely related to our work, multi-layer message-passing algorithms have been developed in inference contexts (Manoel et al., 2017; Fletcher et al., 2018), where they have been shown to produce exact marginals under certain statistical assumptions on (unlearned) weight matrices.
|
|
|
The properties of message passing for learning shallow neural networks have been extensively studied (see Baldassi et al. (2020) and references therein). Barbier et al. (2019) rigorously show that message passing algorithms in generalized linear models perform asymptotically exact inference under some statistical assumptions. Dictionary learning and matrix factorization are harder problems closely related to deep network learning, in particular to the modelling of a single intermediate layer. They have been approached using message passing in Kabashima et al. (2016) and Parker et al. (2014), although the resulting predictions are found to be asymptotically inexact (Maillard et al., 2021). The same problem is faced by the message passing algorithm recently proposed for a multi-layer matrix factorization scenario (Zou et al., 2021). Unfortunately, our framework does not yield asymptotically exact predictions either. Nonetheless, it gives a message passing heuristic that, for the first time, is able to train deep neural networks on natural datasets, and therefore sets a reference for the algorithmic applications of this research line.
|
|
|
A few papers attribute the success of SGD to the geometrical structure (smoothness and flatness) of the loss landscape in neural networks (Baldassi et al., 2015; Chaudhari et al., 2017; Garipov et al., 2018; Li et al., 2018; Pittorino et al., 2021; Feng & Tu, 2021). These considerations do not depend on the particular form of the SGD dynamics and should extend to other types of algorithms as well, although SGD is by far the most popular choice among NN practitioners due to its simplicity, flexibility, speed, and generalization performance.
|
|
|
While our work focuses on message passing schemes, some of the ideas presented here, such as the PasP rule, can be combined with algorithms for training Bayesian neural networks (Hernández-Lobato & Adams, 2015; Wu et al., 2018). Recent work extends BP by combining it with graph neural networks (Kuck et al., 2020; Satorras & Welling, 2021). Finally, some work in computational neuroscience shows similarities to our approach (Rao, 2007).
|
|
|
3 LEARNING BY MESSAGE PASSING |
|
|
|
3.1 POSTERIOR-AS-PRIOR UPDATES |
|
|
|
We consider a multi-layer perceptron with $L$ hidden neuron layers, having weight and bias parameters $\mathcal{W} = \{W^\ell, b^\ell\}_{\ell=0}^{L}$. We allow for stochastic activations $P^\ell(x^{\ell+1} \mid z^\ell)$, where $z^\ell$ is the neuron's pre-activation vector for layer $\ell$, and $P^\ell$ is assumed to be factorized over the neurons. If no stochasticity is present, $P^\ell$ just encodes an element-wise activation function. The probability of output $y$ given an input $x$ is then given by

$$P(y \mid x, \mathcal{W}) = \int \mathrm{d}x^{1:L} \prod_{\ell=0}^{L} P^{\ell+1}\big(x^{\ell+1} \mid W^\ell x^\ell + b^\ell\big), \qquad (1)$$
|
|
|
|
|
where for convenience we defined $x^0 = x$ and $x^{L+1} = y$. In a Bayesian framework, given a training set $\mathcal{D} = \{(x_n, y_n)\}_n$ and a prior distribution over the weights $q_\theta(\mathcal{W})$ in some parametric family, the posterior distribution is given by

$$P(\mathcal{W} \mid \mathcal{D}, \theta) \propto \prod_n P(y_n \mid x_n, \mathcal{W})\, q_\theta(\mathcal{W}). \qquad (2)$$

Here $\propto$ denotes equality up to a normalization factor. Using the posterior, one can compute the Bayesian prediction for a new data-point $x$ through the average $P(y \mid x, \mathcal{D}, \theta) = \int \mathrm{d}\mathcal{W}\, P(y \mid x, \mathcal{W})\, P(\mathcal{W} \mid \mathcal{D}, \theta)$. Unfortunately, the posterior is generically intractable due to the hard-to-compute normalization factor. On the other hand, we are mainly interested in training a distribution that covers wide minima of the loss landscape that generalize well (Baldassi et al., 2016a) and in recovering pointwise estimators within these regions. The Bayesian modeling becomes an auxiliary tool to set the stage for the message passing algorithms seeking flat minima. We also need a formalism that allows for mini-batch training, to speed up the computation and deal with large datasets. Therefore, we devise an update scheme that we call Posterior-as-Prior (PasP), where we evolve the parameters $\theta^t$ of a distribution $q_{\theta^t}(\mathcal{W})$, computed as an approximate mini-batch posterior, in such a way that the outcome of the previous iteration becomes the prior in the following step. In the PasP scheme, $\theta^t$ retains the memory of past observations. We also add an exponential factor $\rho$, which we typically set close to 1, tuning the forgetting rate and playing a role similar to the learning rate in SGD. Given a mini-batch $(X^t, y^t)$ sampled from the training set at time $t$ and a scalar $\rho > 0$, the PasP update reads

$$q_{\theta^{t+1}}(\mathcal{W}) \approx \big[P(\mathcal{W} \mid y^t, X^t, \theta^t)\big]^{\rho}, \qquad (3)$$

where $\approx$ denotes approximate equality and we do not account for the normalization factor. A first approximation may be needed in the computation of the posterior, a second to project the approximate posterior onto the distribution manifold spanned by $\theta$ (Minka, 2001). In practice, we will consider factorized approximate posteriors in an exponential family and priors $q_\theta$ in the same family, although Eq. 3 generically allows for more refined approximations.
|
|
|
|
|
|
|
|
Notice that setting $\rho = 1$, setting the batch-size to 1, and taking a single pass over the dataset, we recover the Assumed Density Filtering algorithm (Minka, 2001). For large enough $\rho$ (including $\rho = 1$), the iterations of $q_{\theta^t}$ will concentrate on a pointwise estimator. This mechanism mimics the reinforcement heuristic commonly used to turn Belief Propagation into a solver for constraint satisfaction problems (Braunstein & Zecchina, 2006) and is related to flat-minima discovery (see focusing-BP in Baldassi et al. (2016a)). A different prior-updating mechanism, which can be understood as empirical Bayes, has been used in Baldassi et al. (2016b).
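To make the outer loop concrete, the following is a minimal Julia sketch of the PasP iteration of Eq. 3, assuming the factorized binary-weight parameterization of Appendix A.1 (natural parameters $\theta$, so that raising the posterior to the power $\rho$ simply rescales them); `infer` is a hypothetical placeholder standing for the inner message passing loop of Sec. 3.2, not part of our released interface.

```julia
# Sketch of the Posterior-as-Prior (PasP) outer loop of Eq. 3.
# θ collects the natural parameters of the factorized prior over ±1 weights,
# so q_θ(W) ∝ exp.(θ .* W).  `infer(θ, X, y)` must return the natural
# parameters of the approximate mini-batch posterior (e.g. from Algorithm 1).
function pasp_train!(θ, batches, infer; ρ = 1.0, epochs = 1)
    for _ in 1:epochs, (X, y) in batches
        θ_post = infer(θ, X, y)   # approximate posterior of Eq. 2 on this mini-batch
        θ .= ρ .* θ_post          # posterior-as-prior update: new prior = posterior^ρ
    end
    return θ
end
```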
|
|
|
3.2 INNER MESSAGE PASSING LOOP |
|
|
|
While the PasP rule takes care of the reinforcement heuristic across mini-batches, we compute the mini-batch posterior in Eq. 3 using message passing approaches derived from Belief Propagation. BP is an iterative scheme for computing marginals and entropies of statistical models (Mézard & Montanari, 2009). It is most conveniently expressed on factor graphs, that is, bipartite graphs where the two sets of nodes are called variable nodes and factor nodes; they respectively represent the variables involved in the statistical model and their interactions. Messages from factor nodes to variable nodes and vice versa are exchanged along the edges of the factor graph for a certain number of BP iterations or until a fixed point is reached.
|
|
|
The factor graph for $P(\mathcal{W} \mid X^t, y^t, \theta^t)$ can be derived from Eq. 2, with the following additional specifications. For simplicity, we will ignore the bias term in each layer. We assume a factorized $q_{\theta^t}(\mathcal{W})$, with each factor parameterized by its first two moments. In what follows, we drop the PasP iteration index $t$. For each example $(x_n, y_n)$ in the mini-batch, we introduce the auxiliary variables $x^\ell_n$, $\ell = 1, \dots, L$, representing the layers' activations. For each example, each neuron in the network contributes a factor node to the factor graph. The scalar components of the weight matrices and the activation vectors become variable nodes. This construction is presented in Appendix A, where we also derive the message update rules on the factor graph. The factor graph thus defined is extremely loopy, and straightforward iteration of BP has convergence issues. Moreover, in the presence of a homogeneous prior over the weights, the neuron permutation symmetry in each hidden layer induces a strongly attractive symmetric fixed point that hinders learning. We work around these issues by breaking the symmetry at time $t = 0$ with an inhomogeneous prior. In our experiments, a little initial heterogeneity is sufficient to obtain specialized neurons at each following time step. Additionally, we do not require message passing convergence in the inner loop (see Algorithm 1), but perform one or a few iterations for each $\theta$ update. We also include an inertia term, commonly called a damping factor, in the message updates (see B.2). As we shall discuss, these simple rules suffice to train deep networks by message passing.
|
|
|
For the inner loop we adapt to deep neural networks four different message passing algorithms, all well known in the literature although derived in simpler settings: Belief Propagation (BP), BP-Inspired (BPI) message passing, mean-field (MF), and approximate message passing (AMP). The last three algorithms can be considered approximations of the first one. In the following paragraphs we discuss their common traits, present the BP updates as an example, and refer to Appendix A for an in-depth exposition. For all algorithms, message updates can be divided into a forward pass and a backward pass, as also done in (Fletcher et al., 2018) in a multi-layer inference setting. The BP algorithm is compactly reported in Algorithm 1.
|
|
|
**Meaning of messages.** All the messages involved in the message passing can be understood in terms of cavity marginals or full marginals (as mentioned in the introduction, BP is also known as the Cavity Method, see (Mézard & Montanari, 2009)). Of particular relevance are $m^\ell_{ki}$ and $\sigma^\ell_{ki}$, denoting the mean and variance of the weight $W^\ell_{ki}$. The quantities $\hat{x}^\ell_{in}$ and $\Delta^\ell_{in}$ instead denote the mean and variance of the $i$-th neuron's activation in layer $\ell$ for a given input $x_n$.
|
|
|
**Scalar free energies.** All message passing schemes are conveniently expressed in terms of two functions that correspond to the effective free energy (Zdeborová & Krzakala, 2016) of a single neuron and of a single weight respectively:

$$\phi^\ell(B, A, \omega, V) = \log \int \mathrm{d}x\, \mathrm{d}z\; e^{-\frac{1}{2}Ax^2 + Bx}\, P^\ell(x \mid z)\, e^{-\frac{(\omega - z)^2}{2V}}, \qquad \ell = 1, \dots, L \qquad (4)$$

$$\psi(H, G, \theta) = \log \int \mathrm{d}w\; e^{-\frac{1}{2}Gw^2 + Hw}\, q_\theta(w). \qquad (5)$$

Notice that for common deterministic activations such as ReLU and sign, the function $\phi$ has analytic and smooth expressions (see Appendix A.8). The same holds for the function $\psi$ when $q_\theta(w)$ is Gaussian (continuous weights) or a mixture of atoms (discrete weights). At the last layer we impose $P^{L+1}(y \mid z) = \mathbb{I}(y = \mathrm{sign}(z))$ in binary classification tasks and $P^{L+1}(y \mid z) = \mathbb{I}(y = \arg\max(z))$ in multi-class classification (see Appendix A.9). While in our experiments we use hard constraints for the final output, therefore solving a constraint satisfaction problem, it would be interesting to also consider soft constraints and introduce a temperature, but this is beyond the scope of our work.
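For the binary ±1 weights used in most of our experiments, $\psi$ takes the closed form of Eq. 23 in Appendix A.1; a minimal Julia sketch of the resulting weight statistics (the derivatives entering Eqs. 8, 9) is given below.

```julia
# Scalar weight free energy ψ for binary ±1 weights (Eq. 5 with q_θ(w) ∝ exp(θw));
# the Gw² term is a constant and drops out (see Appendix A.1, Eq. 23).
ψ_binary(H, θ) = log(2 * cosh(H + θ))
m_binary(H, θ) = tanh(H + θ)          # ∂_H ψ : weight mean, as in Eq. (8)
σ_binary(H, θ) = 1 - tanh(H + θ)^2    # ∂²_H ψ : weight variance, as in Eq. (9)
```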
|
|
|
**Start and end of message passing.** At the beginning of a new PasP iteration $t$, we reset the messages (see Appendix A) and run message passing for $\tau_{\max}$ iterations. We then compute the new prior's parameters $\theta^{t+1}$ from the posterior given by the message passing.
|
|
|
**BP Forward pass.** After initialization of the messages at time $\tau = 0$, for each following time we propagate a set of messages from the first to the last layer and then another set from the last to the first. For an intermediate layer $\ell$ the forward pass reads

$$\hat{x}^{\ell,\tau}_{in \to k} = \partial_B \phi^\ell\big(B^{\ell,\tau-1}_{in \to k},\, A^{\ell,\tau-1}_{in},\, \omega^{\ell-1,\tau}_{in},\, V^{\ell-1,\tau}_{in}\big) \qquad (6)$$

$$\Delta^{\ell,\tau}_{in} = \partial^2_B \phi^\ell\big(B^{\ell,\tau-1}_{in},\, A^{\ell,\tau-1}_{in},\, \omega^{\ell-1,\tau}_{in},\, V^{\ell-1,\tau}_{in}\big) \qquad (7)$$

$$m^{\ell,\tau}_{ki \to n} = \partial_H \psi\big(H^{\ell,\tau-1}_{ki \to n},\, G^{\ell,\tau-1}_{ki},\, \theta^\ell_{ki}\big) \qquad (8)$$

$$\sigma^{\ell,\tau}_{ki} = \partial^2_H \psi\big(H^{\ell,\tau-1}_{ki},\, G^{\ell,\tau-1}_{ki},\, \theta^\ell_{ki}\big) \qquad (9)$$

$$V^{\ell,\tau}_{kn} = \sum_i \Big[ \big(m^{\ell,\tau}_{ki \to n}\big)^2\, \Delta^{\ell,\tau}_{in} + \sigma^{\ell,\tau}_{ki}\, \big(\hat{x}^{\ell,\tau}_{in \to k}\big)^2 + \sigma^{\ell,\tau}_{ki}\, \Delta^{\ell,\tau}_{in} \Big] \qquad (10)$$

$$\omega^{\ell,\tau}_{kn \to i} = \sum_{i' \neq i} m^{\ell,\tau}_{ki' \to n}\, \hat{x}^{\ell,\tau}_{i'n \to k} \qquad (11)$$

The equations for the first layer differ slightly and in an intuitive way from the ones above (see Appendix A.3).
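As an illustration of how these updates translate into tensor operations, the following is a minimal Julia sketch of the forward moment propagation of Eqs. (10, 11), with the cavity ($i' \neq i$) corrections dropped for readability, so it is closer to the MF variant of Appendix A.5 than to full BP.

```julia
# Forward moment propagation for one layer, without cavity corrections.
# m, σ : weight means and variances, size K×N.
# xhat, Δ : activation means and variances from the layer below, size N×B.
function forward_moments(m, σ, xhat, Δ)
    ω = m * xhat                                        # Eq. (11) without the i'≠i restriction
    V = (m .^ 2) * Δ .+ σ * (xhat .^ 2) .+ σ * Δ        # Eq. (10)
    return ω, V                                         # pre-activation means/variances, K×B
end
```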
|
|
|
**BP Backward pass.** The backward pass updates a set of messages from the last to the first layer:

$$g^{\ell,\tau}_{kn \to i} = \partial_\omega \phi^{\ell+1}\big(B^{\ell+1,\tau}_{kn},\, A^{\ell+1,\tau}_{kn},\, \omega^{\ell,\tau}_{kn \to i},\, V^{\ell,\tau}_{kn}\big) \qquad (12)$$

$$\Gamma^{\ell,\tau}_{kn} = -\partial^2_\omega \phi^{\ell+1}\big(B^{\ell+1,\tau}_{kn},\, A^{\ell+1,\tau}_{kn},\, \omega^{\ell,\tau}_{kn},\, V^{\ell,\tau}_{kn}\big) \qquad (13)$$

$$A^{\ell,\tau}_{in} = \sum_k \Big[ \Big(\big(m^{\ell,\tau}_{ki \to n}\big)^2 + \sigma^{\ell,\tau}_{ki}\Big)\, \Gamma^{\ell,\tau}_{kn} - \sigma^{\ell,\tau}_{ki}\, \big(g^{\ell,\tau}_{kn \to i}\big)^2 \Big] \qquad (14)$$

$$B^{\ell,\tau}_{in \to k} = \sum_{k' \neq k} m^{\ell,\tau}_{k'i \to n}\, g^{\ell,\tau}_{k'n \to i} \qquad (15)$$

$$G^{\ell,\tau}_{ki} = \sum_n \Big[ \Big(\big(\hat{x}^{\ell,\tau}_{in \to k}\big)^2 + \Delta^{\ell,\tau}_{in}\Big)\, \Gamma^{\ell,\tau}_{kn} - \Delta^{\ell,\tau}_{in}\, \big(g^{\ell,\tau}_{kn \to i}\big)^2 \Big] \qquad (16)$$

$$H^{\ell,\tau}_{ki \to n} = \sum_{n' \neq n} \hat{x}^{\ell,\tau}_{in' \to k}\, g^{\ell,\tau}_{kn' \to i} \qquad (17)$$

As with the forward pass, we add the caveat that for the last layer the equations are slightly different from the ones above.
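A companion Julia sketch for the backward pass (Eqs. 14–17) under the same simplification (no cavity corrections); the derivatives $g$, $\Gamma$ of $\phi$ are assumed to be computed beforehand from Eqs. (12, 13).

```julia
# Backward moment propagation for one layer, without cavity corrections.
# g, Γ : first/second ω-derivatives of ϕ from the layer above, size K×B.
# m, σ : weight means and variances, K×N; xhat, Δ : activation moments, N×B.
function backward_moments(m, σ, g, Γ, xhat, Δ)
    A = (m .^ 2 .+ σ)' * Γ .- σ' * (g .^ 2)          # Eq. (14), size N×B
    B = m' * g                                        # Eq. (15) without the k'≠k restriction
    G = Γ * (xhat .^ 2 .+ Δ)' .- (g .^ 2) * Δ'        # Eq. (16), size K×N (note: K×B times B×N)
    H = g * xhat'                                     # Eq. (17) without the n'≠n restriction
    return A, B, G, H
end
```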
|
|
|
|
|
|
|
|
**Computational complexity.** The message passing equations boil down to element-wise operations and tensor contractions that we easily implement using the GPU-friendly Julia library Tullio.jl (Abbott et al., 2021). For a layer of input and output size N and a batch-size of B, the time complexity of a forward-and-backward iteration is O(N²B) for all message passing algorithms (BP, BPI, MF, and AMP), the same as SGD. The prefactor varies and is generally larger than SGD's (see Appendix B.9). Also, the time complexity of message passing is proportional to τmax (which we typically set to 1). We provide our implementation in the GitHub repo anonymous.
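For reference, a contraction such as Eq. (11) (without the cavity term) maps to a single einsum-style Tullio statement; the sizes below are arbitrary placeholders, not the settings used in our experiments.

```julia
using Tullio

# Example of the kind of O(K·N·B) tensor contraction behind the message updates.
m    = randn(128, 784)   # weight means, K×N
xhat = randn(784, 64)    # activation means, N×B
@tullio ω[k, b] := m[k, i] * xhat[i, b]   # ω[k,b] = Σ_i m[k,i]·xhat[i,b]
```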
|
|
|
**Algorithm 1: BP for deep neural networks**

// Message passing used in the PasP Eq. 3 to approximate the mini-batch posterior.
// Here we specifically refer to the BP updates. BPI, MF, and AMP updates take the same form,
// but use the rules in Appendix A.4, A.5, and A.7 respectively.

**1** Initialize messages.
**2** **for** τ = 1, . . ., τmax **do**
// Forward pass
**3**   **for** ℓ = 0, . . ., L **do**
**4**     compute x̂^ℓ, Δ^ℓ using (6, 7)
**5**     compute m^ℓ, σ^ℓ using (8, 9)
**6**     compute V^ℓ, ω^ℓ using (10, 11)
// Backward pass
**7**   **for** ℓ = L, . . ., 0 **do**
**8**     compute g^ℓ, Γ^ℓ using (12, 13)
**9**     compute A^ℓ, B^ℓ using (14, 15)
**10**    compute G^ℓ, H^ℓ using (16, 17)
|
|
|
4 NUMERICAL RESULTS |
|
|
|
We implement our message passing algorithms on neural networks with continuous and binary weights and with binary activations. In our experiments we fix τmax = 1; we typically do not observe an increase in performance when taking more steps, except in some specific cases and in particular for MF layers. We remark that for τmax = 1 the BP and BPI equations are identical, so in most of the subsequent numerical results we only investigate BP.
|
|
|
We compare our algorithms with an SGD-based algorithm adapted to binary architectures (Hubara et al., 2016), which we call BinaryNet throughout the paper (see Appendix B.6 for details). Bayesian predictions are compared with the gradient-based Expectation Backpropagation (EBP) algorithm (Soudry et al., 2014a), which is also able to deal with discrete weights and activations. In all architectures we avoid the use of bias terms and batch-normalization layers.
|
|
|
We find that message-passing algorithms are able to train generic MLP architectures with varying numbers and sizes of hidden layers. As for the datasets, we are able to perform both binary classification and multi-class classification on standard computer vision datasets such as MNIST, Fashion-MNIST, and CIFAR-10. Since these datasets consist of 10 classes, for the binary classification task we divide each dataset into two classes (even vs. odd).
|
|
|
We report that message passing algorithms are able to solve these optimization problems with generalization performance comparable to or better than SGD-based algorithms. Some of the message passing algorithms (BP and AMP in particular) need fewer epochs to achieve low error than SGD-based algorithms, even when adaptive methods like Adam are considered. The timings of our GPU implementations of message passing algorithms are competitive with SGD (see Appendix B.9).
|
|
|
|
|
|
4.1 EXPERIMENTS ACROSS ARCHITECTURES |
|
|
|
We select a specific task, multi-class classification on Fashion-MNIST, and we compare the message passing algorithms with BinaryNet for different choices of the architecture (i.e., we vary the number and the size of the hidden layers). In Fig. 1 (Left) we present the learning curves for an MLP with 3 hidden layers of 501 units with binary weights and activations. Similar results hold in our experiments with 2 or 3 hidden layers of 101, 501 or 1001 units and with batch sizes from 1 to 1024. The parameters used in our simulations are reported in Appendix B.3. Results on networks with continuous weights can be found in Fig. 2 (Right).
|
|
|
4.2 SPARSE LAYERS |
|
|
|
Since the BP algorithm has notoriously been successful on sparse graphs, we test a straightforward implementation of pruning at initialization, i.e., we impose a random boolean mask on the weights that we keep fixed throughout training (a minimal sketch is given below). We call sparsity the fraction of zeroed weights. This kind of non-adaptive pruning is known to largely hinder learning (Frankle et al., 2021; Sung et al., 2021). In the right panel of Fig. 1, we report results on sparse binary networks in which we train an MLP with 2 hidden layers of 101 units on the MNIST dataset. For reference, results on pruning quantized/binary networks can be found in Refs. (Han et al., 2016; Ardakani et al., 2017; Tung & Mori, 2018; Diffenderfer & Kailkhura, 2021). Experimenting with sparsity up to 90%, we observe that BP and MF perform better than BinaryNet, while AMP lags behind BinaryNet.
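A minimal Julia sketch of this fixed pruning-at-initialization mask (the layer sizes are placeholders, not the experimental settings):

```julia
# Pruning at initialization: draw a random boolean mask once at the desired
# sparsity and keep it fixed throughout training; masked weights are simply
# excluded from the message passing updates.
sparsity = 0.9
K, N = 101, 784                  # hypothetical layer size
mask = rand(K, N) .> sparsity    # true = weight kept (≈ 10% of entries here)
```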
|
|
|
|
Figure 1: (Left) Training curves of message passing algorithms compared with BinaryNet on the Fashion-MNIST dataset (multi-class classification) with a binary MLP with 3 hidden layers of 501 units. (Right) Final test accuracy when varying the layers' sparsity in a binary MLP with 2 hidden layers of 101 units on the MNIST dataset (multi-class). In both panels the batch-size is 128 and curves are averaged over 5 realizations of the initial conditions (and of the sparsity pattern in the right panel).
|
|
|
4.3 EXPERIMENTS ACROSS DATASETS |
|
|
|
We now fix the architecture, an MLP with 2 hidden layers of 501 neurons each with binary weights and activations. We vary the dataset, i.e., we test the BP-based algorithms on standard computer vision benchmark datasets such as MNIST, Fashion-MNIST and CIFAR-10, in both the multi-class and binary classification tasks. In Tab. 1 we report the final test errors obtained by the message passing algorithms compared to the BinaryNet baseline. See Appendix B.4 for the corresponding training errors and the parameters used in the simulations. We mention that while the test performance is mostly comparable, the train error tends to be lower for the message passing algorithms.
|
|
|
|
|
|
|
|
| Dataset | BinaryNet | BP | AMP | MF |
|---|---|---|---|---|
| MNIST (2 classes) | 1.3 ± 0.1 | 1.4 ± 0.2 | 1.4 ± 0.1 | 1.3 ± 0. |
| Fashion-MNIST (2 classes) | 2.4 ± 0.1 | 2.3 ± 0.1 | 2.4 ± 0.1 | 2.3 ± 0. |
| CIFAR-10 (2 classes) | 30.0 ± 0.3 | 31.4 ± 0.1 | 31.1 ± 0.3 | 31.1 ± 0. |
| MNIST | 2.2 ± 0.1 | 2.6 ± 0.1 | 2.6 ± 0.1 | 2.3 ± 0. |
| Fashion-MNIST | 12.0 ± 0.6 | 11.8 ± 0.3 | 11.9 ± 0.2 | 12.1 ± 0. |
| CIFAR-10 | 59.0 ± 0.7 | 58.7 ± 0.3 | 58.5 ± 0.2 | 60.4 ± 1. |

Table 1: Test error (%) on MNIST, Fashion-MNIST and CIFAR-10 (binary and multi-class classification) of various algorithms on an MLP with 2 hidden layers of 501 units with binary weights and activations. All algorithms are trained with batch-size 128; standard deviations are calculated over 5 random initializations.

4.4 BAYESIAN ERROR

The message passing framework, used as an estimator of the mini-batch posterior marginals, allows us to compute an approximate Bayesian prediction, i.e., to average the pointwise predictions over the approximate posterior. We observe better generalization error from Bayesian predictions than from pointwise ones, showing that the marginals retain useful information. However, these are the marginals obtained with the PasP mini-batch procedure (the exact ones should be computed on the full posterior, but this converges with difficulty in our tests). Since BP-based algorithms find solutions in flat regions (as also confirmed by the local energy measure performed in Appendix B.5), the Bayesian error we compute can be considered as a local approximation of the full one. We report results for binary classification on the MNIST dataset in Fig. 2, and we observe the same behavior on other datasets and architectures. We obtain the Bayesian prediction from the estimated marginals with a single forward pass of the message passing. To obtain good Bayesian predictions, the posterior distribution should not concentrate too much; otherwise the Bayesian prediction reduces to the prediction of a single configuration. In Fig. 2 we also report a comparison of BP (point-wise and Bayesian) with SGD and with an algorithm able to make Bayesian predictions, Expectation Backpropagation (Soudry et al., 2014a); see Appendix B.7 for implementation details.
|
|
|
|
|
Figure 2: Test error curves for Bayesian and point-wise predictions for an MLP with 2 hidden layers of 101 units on the 2-class MNIST dataset, with (Left) binary and (Right) continuous weights. In both cases, we compare SGD, BP (point-wise and Bayesian) and EBP (point-wise and Bayesian). See Appendix B.3 for details.
|
|
|
4.5 CONTINUAL LEARNING |
|
|
|
|
|
Given the high local entropy (i.e., the flatness) of the solutions found by the BP-based algorithms (see Appendix B.5), we perform additional tests in a classic setting, continual learning, where the possibility of locally rearranging the solutions while keeping the training error low can be an advantage. When a deep network is trained sequentially on different tasks, it tends to forget previously seen tasks exponentially fast while learning new ones (McCloskey & Cohen, 1989; Robins, 1995; Fusi et al., 2005). Recent work (Feng & Tu, 2021) has shown that searching for a flat region in the loss landscape can indeed help to prevent catastrophic forgetting. Several heuristics have been proposed to mitigate the problem (Kirkpatrick et al., 2017; Aljundi et al., 2018; Zenke et al., 2017; Laborieux et al., 2021), but all require specialized adjustments to the loss or to the dynamics.
|
|
|
Here we show instead that our message passing schemes are naturally suited to learning multiple tasks sequentially, mitigating the characteristic memory issues of gradient-based schemes without the need for explicit modifications. As a prototypical experiment, we sequentially train a multi-layer neural network on 6 different versions of the MNIST dataset, where the pixels of the images have been randomly permuted (Goodfellow et al., 2013), giving a fixed budget of 40 epochs on each task. We present the results for a two-hidden-layer neural network with 2001 units on each layer (see Appendix B.3 for details). As can be seen in Fig. 3, at the end of training the BP algorithm is able to reach good generalization performance on all the tasks. We compare the BP performance with BinaryNet, which already performs better than SGD with continuous weights (see the discussion in Laborieux et al. (2021)). While our BP implementation is not competitive with ad-hoc techniques specifically designed for this problem, it beats non-specialized heuristics. Moreover, we believe that specialized approaches like the one of Laborieux et al. (2021) can be adapted to message passing as well.
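For reference, a minimal Julia sketch of the permuted-MNIST task construction used in this experiment (the data loading itself is omitted; `X` is assumed to be a matrix of flattened images):

```julia
using Random

# Permuted-MNIST protocol (Goodfellow et al., 2013): each task applies one
# fixed random pixel permutation to all images.  X is 784 × n_samples.
make_tasks(X, ntasks) = [X[randperm(size(X, 1)), :] for _ in 1:ntasks]
```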
|
|
|
|
|
|
Figure 3: Performance of BP and BinaryNet on the permuted MNIST task (see text) for a two-hidden-layer network with 2001 units on each layer and binary weights and activations. The model is trained sequentially on 6 different versions of the MNIST dataset (the tasks), where the pixels have been permuted. (Left) Test accuracy on each task after the network has been trained on all the tasks. (Right) Test accuracy on the first task as a function of the number of epochs. Points are averages over 5 independent runs; shaded areas are errors on the mean.
|
|
|
|
|
5 DISCUSSION AND CONCLUSIONS |
|
|
|
While successful in many fields, message passing algorithms have notoriously struggled to scale to deep neural network training problems. Here we have developed a class of fBP-based message passing algorithms and used them within an update scheme, Posterior-as-Prior (PasP), that makes it possible to train deep and wide multilayer perceptrons by message passing.
|
|
|
We performed experiments with binary activations and either binary or continuous weights. Future work should try to include different activations, biases, batch-normalization, and convolutional layers as well. Another interesting direction is the algorithmic computation of the (local) entropy of the model from the messages.
|
|
|
Further theoretical work is needed for a more complete understanding of the robustness of our methods. Recent developments in message passing algorithms (Rangan et al., 2019) and related theoretical analyses (Goldt et al., 2020) could provide fruitful inspiration. While our algorithms can be used for approximate Bayesian inference, exact posterior calculation is still out of reach for message passing approaches, and much technical work is needed in that direction.
|
|
|
|
|
|
|
|
REFERENCES |
|
|
|
Michael Abbott, Dilum Aluthge, N3N5, Simeon Schaub, Carlo Lucibello, Chris Elrod, and Johnny |
|
[Chen. Tullio.jl julia package, 2021. URL https://github.com/mcabbott/Tullio.jl.](https://github.com/mcabbott/Tullio.jl) |
|
|
|
Rahaf Aljundi, Francesca Babiloni, Mohamed Elhoseiny, Marcus Rohrbach, and Tinne Tuytelaars. |
|
Memory aware synapses: Learning what (not) to forget. In Proceedings of the European Conference |
|
_on Computer Vision (ECCV), pp. 139–154, 2018._ |
|
|
|
Arash Ardakani, Carlo Condo, and Warren J. Gross. Sparsely-connected neural networks: Towards efficient VLSI implementation of deep neural networks. In 5th International Conference on Learning |
|
_Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings._ |
|
[OpenReview.net, 2017. URL https://openreview.net/forum?id=r1fYuytex.](https://openreview.net/forum?id=r1fYuytex) |
|
|
|
Carlo Baldassi, Alfredo Braunstein, Nicolas Brunel, and Riccardo Zecchina. Efficient supervised |
|
learning in networks with binary synapses. Proceedings of the National Academy of Sciences, |
|
[104(26):11079–11084, 2007. ISSN 0027-8424. doi: 10.1073/pnas.0700324104. URL https:](https://www.pnas.org/content/104/26/11079) |
|
[//www.pnas.org/content/104/26/11079.](https://www.pnas.org/content/104/26/11079) |
|
|
|
Carlo Baldassi, Alessandro Ingrosso, Carlo Lucibello, Luca Saglietti, and Riccardo Zecchina. |
|
Subdominant dense clusters allow for simple learning and high computational performance |
|
in neural networks with discrete synapses. _Phys. Rev. Lett., 115:128101, Sep 2015._ |
|
[doi: 10.1103/PhysRevLett.115.128101. URL https://link.aps.org/doi/10.1103/](https://link.aps.org/doi/10.1103/PhysRevLett.115.128101) |
|
[PhysRevLett.115.128101.](https://link.aps.org/doi/10.1103/PhysRevLett.115.128101) |
|
|
|
Carlo Baldassi, Christian Borgs, Jennifer T. Chayes, Alessandro Ingrosso, Carlo Lucibello, Luca |
|
Saglietti, and Riccardo Zecchina. Unreasonable effectiveness of learning neural networks: From |
|
accessible states and robust ensembles to basic algorithmic schemes. Proceedings of the National |
|
_Academy of Sciences, 113(48):E7655–E7662, 2016a. ISSN 0027-8424. doi: 10.1073/pnas._ |
|
[1608103113. URL https://www.pnas.org/content/113/48/E7655.](https://www.pnas.org/content/113/48/E7655) |
|
|
|
Carlo Baldassi, Federica Gerace, Carlo Lucibello, Luca Saglietti, and Riccardo Zecchina. Learning |
|
may need only a few bits of synaptic precision. Phys. Rev. E, 93:052313, May 2016b. doi: 10. |
|
[1103/PhysRevE.93.052313. URL https://link.aps.org/doi/10.1103/PhysRevE.](https://link.aps.org/doi/10.1103/PhysRevE.93.052313) |
|
[93.052313.](https://link.aps.org/doi/10.1103/PhysRevE.93.052313) |
|
|
|
Carlo Baldassi, Fabrizio Pittorino, and Riccardo Zecchina. Shaping the learning landscape in neural |
|
networks around wide flat minima. Proceedings of the National Academy of Sciences, 117(1): |
|
[161–170, 2020. ISSN 0027-8424. doi: 10.1073/pnas.1908636117. URL https://www.pnas.](https://www.pnas.org/content/117/1/161) |
|
[org/content/117/1/161.](https://www.pnas.org/content/117/1/161) |
|
|
|
Jean Barbier, Florent Krzakala, Nicolas Macris, Léo Miolane, and Lenka Zdeborová. Optimal errors |
|
and phase transitions in high-dimensional generalized linear models. Proceedings of the National |
|
_Academy of Sciences, 116(12):5451–5460, 2019. ISSN 0027-8424. doi: 10.1073/pnas.1802705116._ |
|
[URL https://www.pnas.org/content/116/12/5451.](https://www.pnas.org/content/116/12/5451) |
|
|
|
Hans Bethe. Statistical theory of superlattices. Proc. R. Soc. A, 150:552, 1935. |
|
|
|
Alfredo Braunstein and Riccardo Zecchina. Learning by message passing in networks of discrete |
|
synapses. Phys. Rev. Lett., 96:030201, Jan 2006. doi: 10.1103/PhysRevLett.96.030201. URL |
|
[https://link.aps.org/doi/10.1103/PhysRevLett.96.030201.](https://link.aps.org/doi/10.1103/PhysRevLett.96.030201) |
|
|
|
Pratik Chaudhari, Anna Choromanska, Stefano Soatto, Yann LeCun, Carlo Baldassi, Christian Borgs, |
|
Jennifer T. Chayes, Levent Sagun, and Riccardo Zecchina. Entropy-sgd: Biasing gradient descent |
|
into wide valleys. In 5th International Conference on Learning Representations, ICLR 2017, |
|
_Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net, 2017. URL_ |
|
[https://openreview.net/forum?id=B1YfAfcgl.](https://openreview.net/forum?id=B1YfAfcgl) |
|
|
|
James Diffenderfer and Bhavya Kailkhura. Multi-prize lottery ticket hypothesis: Finding accurate |
|
binary neural networks by pruning a randomly weighted network. In International Confer_[ence on Learning Representations, 2021. URL https://openreview.net/forum?id=](https://openreview.net/forum?id=U_mat0b9iv)_ |
|
[U_mat0b9iv.](https://openreview.net/forum?id=U_mat0b9iv) |
|
|
|
|
|
|
|
|
Yu Feng and Yuhai Tu. The inverse variance–flatness relation in stochastic gradient descent is critical |
|
for finding flat minima. Proceedings of the National Academy of Sciences, 118(9), 2021. |
|
|
|
Alyson K Fletcher, Sundeep Rangan, and Philip Schniter. Inference in deep networks in high |
|
dimensions. In 2018 IEEE International Symposium on Information Theory (ISIT), pp. 1884–1888. |
|
IEEE, 2018. |
|
|
|
Jonathan Frankle, Gintare Karolina Dziugaite, Daniel Roy, and Michael Carbin. Pruning neural |
|
networks at initialization: Why are we missing the mark? In International Conference on Learning |
|
_[Representations, 2021. URL https://openreview.net/forum?id=Ig-VyQc-MLK.](https://openreview.net/forum?id=Ig-VyQc-MLK)_ |
|
|
|
Stefano Fusi, Patrick J Drew, and Larry F Abbott. Cascade models of synaptically stored memories. |
|
_Neuron, 45(4):599–611, 2005._ |
|
|
|
Marylou Gabrié. Mean-field inference methods for neural networks. Journal of Physics A: Mathe_matical and Theoretical, 53(22):223002, 2020._ |
|
|
|
Robert Gallager. Low-density parity-check codes. IRE Transactions on information theory, 8(1): |
|
21–28, 1962. |
|
|
|
Timur Garipov, Pavel Izmailov, Dmitrii Podoprikhin, Dmitry P Vetrov, and Andrew G Wilson. Loss |
|
surfaces, mode connectivity, and fast ensembling of dnns. In S. Bengio, H. Wallach, H. Larochelle, |
|
K. Grauman, N. Cesa-Bianchi, and R. Garnett (eds.), Advances in Neural Information Processing |
|
_[Systems, volume 31. Curran Associates, Inc., 2018. URL https://proceedings.neurips.](https://proceedings.neurips.cc/paper/2018/file/be3087e74e9100d4bc4c6268cdbe8456-Paper.pdf)_ |
|
[cc/paper/2018/file/be3087e74e9100d4bc4c6268cdbe8456-Paper.pdf.](https://proceedings.neurips.cc/paper/2018/file/be3087e74e9100d4bc4c6268cdbe8456-Paper.pdf) |
|
|
|
Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward |
|
neural networks. In Yee Whye Teh and Mike Titterington (eds.), Proceedings of the Thirteenth |
|
_International Conference on Artificial Intelligence and Statistics, volume 9 of Proceedings of_ |
|
_Machine Learning Research, pp. 249–256, Chia Laguna Resort, Sardinia, Italy, 13–15 May 2010._ |
|
[PMLR. URL https://proceedings.mlr.press/v9/glorot10a.html.](https://proceedings.mlr.press/v9/glorot10a.html) |
|
|
|
Sebastian Goldt, Marc Mézard, Florent Krzakala, and Lenka Zdeborová. Modeling the influence of |
|
data structure on learning in neural networks: The hidden manifold model. Physical Review X, 10 |
|
(4):041044, 2020. |
|
|
|
Ian J Goodfellow, Mehdi Mirza, Da Xiao, Aaron Courville, and Yoshua Bengio. An empirical investigation of catastrophic forgetting in gradient-based neural networks. arXiv preprint arXiv:1312.6211, |
|
2013. |
|
|
|
Song Han, Huizi Mao, and William J. Dally. Deep compression: Compressing deep neural network |
|
with pruning, trained quantization and huffman coding. In Yoshua Bengio and Yann LeCun (eds.), |
|
_4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico,_ |
|
_[May 2-4, 2016, Conference Track Proceedings, 2016. URL http://arxiv.org/abs/1510.](http://arxiv.org/abs/1510.00149)_ |
|
[00149.](http://arxiv.org/abs/1510.00149) |
|
|
|
José Miguel Hernández-Lobato and Ryan P. Adams. Probabilistic backpropagation for scalable |
|
learning of bayesian neural networks. In Proceedings of the 32nd International Conference on |
|
_International Conference on Machine Learning - Volume 37, ICML’15, pp. 1861–1869. JMLR.org,_ |
|
2015. |
|
|
|
Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Binarized neural networks. In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 29. Curran Asso[ciates, Inc., 2016. URL https://proceedings.neurips.cc/paper/2016/file/](https://proceedings.neurips.cc/paper/2016/file/d8330f857a17c53d217014ee776bfd50-Paper.pdf) |
|
[d8330f857a17c53d217014ee776bfd50-Paper.pdf.](https://proceedings.neurips.cc/paper/2016/file/d8330f857a17c53d217014ee776bfd50-Paper.pdf) |
|
|
|
Yiding Jiang, Behnam Neyshabur, Hossein Mobahi, Dilip Krishnan, and Samy Bengio. Fantastic generalization measures and where to find them. In International Conference on Learning |
|
_[Representations, 2020. URL https://openreview.net/forum?id=SJgIPJBFvH.](https://openreview.net/forum?id=SJgIPJBFvH)_ |
|
|
|
Yoshiyuki Kabashima, Florent Krzakala, Marc Mézard, Ayaka Sakata, and Lenka Zdeborová. Phase |
|
transitions and sample complexity in bayes-optimal matrix factorization. IEEE Transactions on |
|
_Information Theory, 62(7):4228–4265, 2016. doi: 10.1109/TIT.2016.2556702._ |
|
|
|
|
|
|
|
|
James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A |
|
Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming |
|
catastrophic forgetting in neural networks. Proceedings of the national academy of sciences, 114 |
|
(13):3521–3526, 2017. |
|
|
|
Jonathan Kuck, Shuvam Chakraborty, Hao Tang, Rachel Luo, Jiaming Song, Ashish Sabharwal, and |
|
Stefano Ermon. Belief propagation neural networks. In H. Larochelle, M. Ranzato, R. Hadsell, |
|
M. F. Balcan, and H. Lin (eds.), Advances in Neural Information Processing Systems, volume 33, |
|
[pp. 667–678. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/](https://proceedings.neurips.cc/paper/2020/file/07217414eb3fbe24d4e5b6cafb91ca18-Paper.pdf) |
|
[paper/2020/file/07217414eb3fbe24d4e5b6cafb91ca18-Paper.pdf.](https://proceedings.neurips.cc/paper/2020/file/07217414eb3fbe24d4e5b6cafb91ca18-Paper.pdf) |
|
|
|
Axel Laborieux, Maxence Ernoult, Tifenn Hirtzlin, and Damien Querlioz. Synaptic metaplasticity in binarized neural networks. Nature Communications, 12(1):2549, May 2021. ISSN |
|
2041-1723. doi: 10.1038/s41467-021-22768-y. [URL https://doi.org/10.1038/](https://doi.org/10.1038/s41467-021-22768-y) |
|
[s41467-021-22768-y.](https://doi.org/10.1038/s41467-021-22768-y) |
|
|
|
Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, and Tom Goldstein. Visualizing the loss landscape of neural nets. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and |
|
R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 31. Curran As[sociates, Inc., 2018. URL https://proceedings.neurips.cc/paper/2018/file/](https://proceedings.neurips.cc/paper/2018/file/a41b3bb3e6b050b6c9067c67f663b915-Paper.pdf) |
|
[a41b3bb3e6b050b6c9067c67f663b915-Paper.pdf.](https://proceedings.neurips.cc/paper/2018/file/a41b3bb3e6b050b6c9067c67f663b915-Paper.pdf) |
|
|
|
Antoine Maillard, Florent Krzakala, Marc Mézard, and Lenka Zdeborová. Perturbative construction |
|
of mean-field equations in extensive-rank matrix factorization and denoising. arXiv preprint |
|
_arXiv:2110.08775, 2021._ |
|
|
|
Andre Manoel, Florent Krzakala, Marc Mézard, and Lenka Zdeborová. Multi-layer generalized linear |
|
estimation. In 2017 IEEE International Symposium on Information Theory (ISIT), pp. 2098–2102, |
|
2017. doi: 10.1109/ISIT.2017.8006899. |
|
|
|
Michael McCloskey and Neal J Cohen. Catastrophic interference in connectionist networks: The |
|
sequential learning problem. In Psychology of learning and motivation, volume 24, pp. 109–165. |
|
Elsevier, 1989. |
|
|
|
Marc Mézard. Mean-field message-passing equations in the hopfield model and its generalizations. |
|
_Physical Review E, 95(2):022117, 2017._ |
|
|
|
Marc Mézard, Giorgio Parisi, and Miguel Angel Virasoro. Spin glass theory and beyond: An |
|
_Introduction to the Replica Method and Its Applications, volume 9. World Scientific Publishing_ |
|
Company, 1987. |
|
|
|
Thomas P. Minka. Expectation propagation for approximate bayesian inference. In Proceedings of |
|
_the Seventeenth Conference on Uncertainty in Artificial Intelligence, UAI’01, pp. 362–369, San_ |
|
Francisco, CA, USA, 2001. Morgan Kaufmann Publishers Inc. ISBN 1558608001. |
|
|
|
Marc Mézard and Andrea Montanari. Information, Physics, and Computation. Oxford University |
|
Press, Inc., USA, 2009. ISBN 019857083X. |
|
|
|
Jason T Parker, Philip Schniter, and Volkan Cevher. Bilinear generalized approximate message |
|
passing—part i: Derivation. IEEE Transactions on Signal Processing, 62(22):5839–5853, 2014. |
|
|
|
Judea Pearl. Reverend Bayes on inference engines: A distributed hierarchical approach. Cognitive |
|
Systems Laboratory, School of Engineering and Applied Science ..., 1982. |
|
|
|
R. Peierls. On ising’s model of ferromagnetism. Mathematical Proceedings of the Cambridge |
|
_Philosophical Society, 32(3):477–481, 1936. doi: 10.1017/S0305004100019174._ |
|
|
|
Fabrizio Pittorino, Carlo Lucibello, Christoph Feinauer, Gabriele Perugini, Carlo Baldassi, Elizaveta |
|
Demyanenko, and Riccardo Zecchina. Entropic gradient descent algorithms and wide flat minima. |
|
[In International Conference on Learning Representations, 2021. URL https://openreview.](https://openreview.net/forum?id=xjXg0bnoDmS) |
|
[net/forum?id=xjXg0bnoDmS.](https://openreview.net/forum?id=xjXg0bnoDmS) |
|
|
|
Sundeep Rangan, Philip Schniter, and Alyson K Fletcher. Vector approximate message passing. IEEE |
|
_Transactions on Information Theory, 65(10):6664–6684, 2019._ |
|
|
|
|
|
|
|
|
Rajesh P. N. Rao. Neural models of Bayesian belief propagation., pp. 239–267. Bayesian brain: Probabilistic approaches to neural coding. MIT Press, Cambridge, MA, US, 2007. ISBN 026204238X |
|
(Hardcover); 978-0-262-04238-3 (Hardcover). |
|
|
|
Anthony Robins. Catastrophic forgetting, rehearsal and pseudorehearsal. Connection Science, 7(2): |
|
123–146, 1995. |
|
|
|
Victor Garcia Satorras and Max Welling. Neural enhanced belief propagation on factor graphs. In |
|
_International Conference on Artificial Intelligence and Statistics, pp. 685–693. PMLR, 2021._ |
|
|
|
Daniel Soudry, Itay Hubara, and Ron Meir. Expectation backpropagation: Parameterfree training of multilayer neural networks with continuous or discrete weights. In |
|
Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K. Q. Weinberger (eds.), |
|
_Advances in Neural Information Processing Systems,_ volume 27. Curran Associates, |
|
Inc., 2014a. URL [https://proceedings.neurips.cc/paper/2014/file/](https://proceedings.neurips.cc/paper/2014/file/076a0c97d09cf1a0ec3e19c7f2529f2b-Paper.pdf) |
|
[076a0c97d09cf1a0ec3e19c7f2529f2b-Paper.pdf.](https://proceedings.neurips.cc/paper/2014/file/076a0c97d09cf1a0ec3e19c7f2529f2b-Paper.pdf) |
|
|
|
Daniel Soudry, Itay Hubara, and Ron Meir. Expectation backpropagation: Parameter-free training of |
|
multilayer neural networks with continuous or discrete weights. In NIPS, volume 1, pp. 2, 2014b. |
|
|
|
George Stamatescu, Federica Gerace, Carlo Lucibello, Ian Fuss, and Langford B. White. Critical |
|
[initialisation in continuous approximations of binary neural networks. 2020. URL https:](https://openreview.net/forum?id=rylmoxrFDH) |
|
[//openreview.net/forum?id=rylmoxrFDH.](https://openreview.net/forum?id=rylmoxrFDH) |
|
|
|
Yi-Lin Sung, Varun Nair, and Colin Raffel. Training neural networks with fixed sparse masks, 2021. |
|
|
|
Frederick Tung and Greg Mori. Clip-q: Deep network compression learning by in-parallel pruningquantization. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. |
|
7873–7882, 2018. doi: 10.1109/CVPR.2018.00821. |
|
|
|
Anqi Wu, Sebastian Nowozin, Edward Meeds, Richard E Turner, Jose Miguel Hernandez-Lobato, |
|
and Alexander L Gaunt. Deterministic variational inference for robust bayesian neural networks. |
|
_arXiv preprint arXiv:1810.03958, 2018._ |
|
|
|
Jonathan S. Yedidia, William T. Freeman, and Yair Weiss. Understanding Belief Propagation and Its |
|
_Generalizations, pp. 239–269. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2003._ |
|
ISBN 1558608117. |
|
|
|
Lenka Zdeborová and Florent Krzakala. Statistical physics of inference: Thresholds and algorithms. |
|
_Advances in Physics, 65(5):453–552, 2016._ |
|
|
|
Friedemann Zenke, Ben Poole, and Surya Ganguli. Continual learning through synaptic intelligence. |
|
In International Conference on Machine Learning, pp. 3987–3995. PMLR, 2017. |
|
|
|
Qiuyun Zou, Haochuan Zhang, and Hongwen Yang. Multi-layer bilinear generalized approximate |
|
message passing. IEEE Transactions on Signal Processing, 69:4529–4543, 2021. doi: 10.1109/ |
|
TSP.2021.3100305. |
|
|
|
|
|
|
|
|
# Appendices |
|
|
|
CONTENTS |
|
|
|
**A BP-based message passing algorithms**

A.1 Preliminary considerations
A.2 Derivation of the BP equations
A.3 BP equations
A.4 BPI equations
A.5 MF equations
A.6 Derivation of the AMP equations
A.7 AMP equations
A.8 Activation Functions
A.9 The ArgMax layer

**B Experimental details**

B.1 Hyper-parameters of the BP-based scheme
B.2 Damping scheme for the message passing
B.3 Architectures
B.4 Varying the dataset
B.5 Local energy
B.6 SGD implementation (BinaryNet)
B.7 EBP implementation
B.8 Unit polarization and overlaps
B.9 Computational performance: varying batch-size
|
|
|
A BP-BASED MESSAGE PASSING ALGORITHMS |
|
|
|
A.1 PRELIMINARY CONSIDERATIONS |
|
|
|
Given a mini-batch $\mathcal{B} = \{(x_n, y_n)\}_n$, the factor graph defined by Eqs. (1, 2, 18) is explicitly written as:

$$P(\mathcal{W}, x^{1:L} \mid \mathcal{B}, \theta) \propto \prod_{\ell=0}^{L} \prod_{k,n} P^{\ell+1}\Big(x^{\ell+1}_{kn} \,\Big|\, \sum_i W^\ell_{ki} x^\ell_{in}\Big) \prod_{k,i,\ell} q_\theta(W^\ell_{ki}), \qquad (18)$$

where $x^0_n = x_n$ and $x^{L+1}_n = y_n$. The derivation of the BP equations for this model is straightforward, albeit lengthy and involved. It is obtained following the steps presented in multiple papers, books, and reviews, see for instance (Mézard & Montanari, 2009; Zdeborová & Krzakala, 2016; Mézard, 2017), although it has not been attempted before in deep neural networks. It should be noted that a (common) approximation that we take here with respect to the standard BP scheme is that messages are assumed to be Gaussian distributed and therefore parameterized by their mean and variance. This goes under the name of relaxed belief propagation (rBP), just referred to as BP throughout the paper.
|
|
|
We derive the BP equations in A.2 and present them all together in A.3. From BP, we derive three other message passing algorithms useful for the deep network training setting, all of which are well known in the literature: BP-Inspired (BPI) message passing (A.4), mean-field (MF) (A.5), and approximate message passing (AMP) (A.7). The AMP derivation is the most involved and is given in A.6. In all these cases, message updates can be divided into a forward pass and a backward pass, as also done in Fletcher et al. (2018) in a multi-layer inference setting. The BP algorithm is compactly reported in Algorithm 1.

In our notation, $\ell$ denotes the layer index, $\tau$ the BP iteration index, $k$ an output neuron index, $i$ an input neuron index, and $n$ a sample index.
|
|
|
We report below, for convenience, some of the considerations also present in the main text. |
|
|
|
**Meaning of messages.** All the messages involved in the message passing equations can be understood in terms of cavity marginals or full marginals (as mentioned in the introduction, BP is also known as the Cavity Method, see Mézard & Montanari (2009)). Of particular relevance are the quantities $m^\ell_{ki}$ and $\sigma^\ell_{ki}$, denoting the mean and variance of the weight $W^\ell_{ki}$. The quantities $\hat{x}^\ell_{in}$ and $\Delta^\ell_{in}$ instead denote the mean and variance of the $i$-th neuron's activation in layer $\ell$ in correspondence of an input $x_n$.
|
|
|
**Scalar free energies.** All message passing schemes can be expressed using the following scalar functions, corresponding to single-neuron and single-weight effective free energies respectively:

$$\phi^\ell(B, A, \omega, V) = \log \int \mathrm{d}x\, \mathrm{d}z\; e^{-\frac{1}{2}Ax^2 + Bx}\, P^\ell(x \mid z)\, e^{-\frac{(\omega - z)^2}{2V}}, \qquad (19)$$

$$\psi(H, G, \theta) = \log \int \mathrm{d}w\; e^{-\frac{1}{2}Gw^2 + Hw}\, q_\theta(w). \qquad (20)$$

These free energies will naturally arise in the derivation of the BP equations in Appendix A.2. For the last layer, the neuron function has to be slightly modified:

$$\phi^{L+1}(y, \omega, V) = \log \int \mathrm{d}z\; P^{L+1}(y \mid z)\, e^{-\frac{(\omega - z)^2}{2V}}. \qquad (21)$$

Notice that for common deterministic activations such as ReLU and sign, the function $\phi$ has analytic and smooth expressions that we give in Appendix A.8. The same goes for $\psi$ when $q_\theta(w)$ is Gaussian (continuous weights) or a mixture of atoms (discrete weights). At the last layer we impose $P^{L+1}(y \mid z) = \mathbb{I}(y = \mathrm{sign}(z))$ in binary classification tasks. For multi-class classification instead, we have to adapt the formalism to vectorial pre-activations $z$ and assume $P^{L+1}(y \mid z) = \mathbb{I}(y = \arg\max(z))$ (see Appendix A.9). While in our experiments we use hard constraints for the final output, therefore solving a constraint satisfaction problem, it would be interesting to also consider generic loss functions. That would require minimal changes to our formalism, but this is beyond the scope of our work.
|
|
|
**Binary weights.** In our experiments we use ±1 weights in each layer. Therefore each marginal can be parameterized by a single number, and our prior/posterior takes the form

$$q_\theta(W^\ell_{ki}) \propto e^{\theta^\ell_{ki} W^\ell_{ki}}. \qquad (22)$$

The effective free energy function of Eq. 20 becomes

$$\psi(H, G, \theta^\ell_{ki}) = \log 2\cosh(H + \theta^\ell_{ki}), \qquad (23)$$

and the messages $G$ can be dropped from the message passing.
|
|
|
**Start and end of message passing.** At the beginning of a new PasP iteration $t$, we reset the messages to zero and run message passing for $\tau_{\max}$ iterations. We then compute the new prior $q_{\theta^{t+1}}(\mathcal{W})$ from the posterior given by the message passing iterations.
|
|
|
A.2 DERIVATION OF THE BP EQUATIONS |
|
|
|
In order to derive the BP equations, we start with the following portion of the factor graph reported in Eq. 18 in the main text, describing the contribution of a single data example in the inner loop of the PasP updates:

$$\prod_{\ell=0}^{L} \prod_{k} P^{\ell+1}\Big(x^{\ell+1}_{kn} \,\Big|\, \sum_i W^\ell_{ki} x^\ell_{in}\Big), \qquad \text{where } x^0_n = x_n,\ x^{L+1}_n = y_n, \qquad (24)$$

and where we recall that the quantity $x^\ell_{kn}$ corresponds to the activation of neuron $k$ in layer $\ell$ in correspondence of the input example $n$.

Let us start by analyzing the single factor:

$$P^{\ell+1}\Big(x^{\ell+1}_{kn} \,\Big|\, \sum_i W^\ell_{ki} x^\ell_{in}\Big). \qquad (25)$$

We refer to messages that travel from input to output in the factor graph as upgoing or upwards messages, and to the ones that travel from output to input as downgoing or backwards messages.

**Factor-to-variable-W messages.** The factor-to-variable-$W$ messages read:

$$\hat{\nu}^{\ell+1}_{kn \to ki}(W^\ell_{ki}) \propto \int \prod_{i' \neq i} \mathrm{d}\nu^\ell_{ki' \to n}(W^\ell_{ki'}) \prod_{i'} \mathrm{d}\nu^\ell_{i'n \to k}(x^\ell_{i'n})\, \mathrm{d}\nu_\downarrow(x^{\ell+1}_{kn})\; P^{\ell+1}\Big(x^{\ell+1}_{kn} \,\Big|\, \sum_{i'} W^\ell_{ki'} x^\ell_{i'n}\Big), \qquad (26)$$

where $\nu_\downarrow$ denotes the messages travelling downwards (from output to input) in the factor graph.

We denote the means and variances of the incoming messages respectively with $m^\ell_{ki \to n}$, $\hat{x}^\ell_{in \to k}$ and $\sigma^\ell_{ki \to n}$, $\Delta^\ell_{in \to k}$:

$$m^\ell_{ki \to n} = \int \mathrm{d}\nu^\ell_{ki \to n}(W^\ell_{ki})\, W^\ell_{ki}, \qquad (27)$$

$$\sigma^\ell_{ki \to n} = \int \mathrm{d}\nu^\ell_{ki \to n}(W^\ell_{ki})\, \big(W^\ell_{ki} - m^\ell_{ki \to n}\big)^2. \qquad (28)$$
|
|