## DEEP LEARNING VIA MESSAGE PASSING ALGORITHMS
### BASED ON BELIEF PROPAGATION

**Anonymous authors**
Paper under double-blind review

ABSTRACT

Message-passing algorithms based on the Belief Propagation (BP) equations constitute a well-known distributed computational scheme. They yield exact marginals
on tree-like graphical models and have also proven to be effective in many problems
defined on loopy graphs, from inference to optimization, from signal processing
to clustering. The BP-based schemes are fundamentally different from stochastic
gradient descent (SGD), on which the current success of deep networks is based.
In this paper, we present and adapt to mini-batch training on GPUs a family of
BP-based message-passing algorithms with a reinforcement term that biases distributions towards locally entropic solutions. These algorithms are capable of training multi-layer neural networks with performance comparable to SGD heuristics in a diverse set of experiments on natural datasets, including multi-class image classification and continual learning, while yielding improved performance on sparse networks. Furthermore, they make it possible to compute approximate Bayesian predictions that have higher accuracy than point-wise ones.

1 INTRODUCTION

Belief Propagation is a method for computing marginals and entropies in probabilistic inference problems (Bethe, 1935; Peierls, 1936; Gallager, 1962; Pearl, 1982). These include optimization problems as well, once they are written as the zero-temperature limit of a Gibbs distribution that uses the cost function as energy. Learning is one particular case, in which one wants to minimize a data-dependent loss function. These problems are generally intractable, and message-passing techniques have been particularly successful at providing principled approximations through efficient distributed computations.

A particularly compact representation of inference/optimization problems that is used to build message-passing algorithms is provided by factor graphs. A factor graph is a bipartite graph composed of variable nodes and factor nodes expressing the interactions among variables. Belief Propagation is exact for tree-like factor graphs (Yedidia et al., 2003), where the Gibbs distribution is naturally factorized, whereas it is approximate for graphs with loops. Still, loopy BP is routinely used with success in many real-world applications, ranging from error-correcting codes to vision and clustering, just to mention a few. In all these problems, loops are indeed present in the factor graph, and yet the variables are weakly correlated at long range and BP gives good results. A field in which BP has a long history is the statistical physics of disordered systems, where it is known as the Cavity Method (Mézard et al., 1987). It has been used to study the typical properties of spin glass models, which represent binary variables interacting through random couplings over a given graph. It is very well known that in spin glass models defined on complete graphs and on locally tree-like random graphs, which are both loopy, the weak correlation conditions between variables may hold and BP gives asymptotically exact results (Mézard & Montanari, 2009). Here we will mostly focus on neural networks with ±1 binary weights and sign activation functions, for which the messages and the marginals can be described simply by the difference between the probabilities associated with the +1 and -1 states, the so-called magnetizations. The effectiveness of BP for deep learning has never been numerically tested in a systematic way; however, there is clear evidence that the weak correlation decay condition does not hold, and thus BP convergence and approximation quality are unpredictable.

In this paper we explore the effectiveness of a variant of BP that has shown excellent convergence
properties in hard optimization problems and in non-convex shallow networks. It goes under the name of focusing BP (fBP) and is based on a probability distribution, a likelihood, that focuses on
highly entropic wide minima, neglecting the contribution to marginals from narrow minima even
when they are the majority (and hence dominate the Gibbs distribution). This version of BP is thus
expected to give good results only in models that have such wide entropic minima as part of their
energy landscape. As discussed in (Baldassi et al., 2016a), a simple way to define fBP is to add a
"reinforcement" term to the BP equations: an iteration-dependent local field is introduced for each
variable, with an intensity proportional to its marginal probability computed in the previous iteration
step. This field is gradually increased until the entire system becomes fully biased on a configuration.
The first version of reinforced BP was introduced in (Braunstein & Zecchina, 2006) as a heuristic
algorithm to solve the learning problem in shallow binary networks. Baldassi et al. (2016a) showed
that this version of BP is a limiting case of fBP, i.e., BP equations written for a likelihood that uses
the local entropy function instead of the error (energy) loss function. As discussed in depth in that
study, one way to introduce a likelihood that focuses on highly entropic regions is to create y coupled
replicas of the original system. fBP equations are obtained as BP equations for the replicated system.
It turns out that the fBP equations are identical to the BP equations for the original system with the
only addition of a self-reinforcing term in the message passing scheme. The fBP algorithm can be
used as a solver by gradually increasing the effect of the reinforcement: one can control the size of
the regions over which the fBP equations estimate the marginals by tuning the parameters that appear
in the expression of the reinforcement, until the high entropy regions reduce to a single configuration.
Interestingly, by keeping the size of the high entropy region fixed, the fBP fixed point allows one to
estimate the marginals and entropy relative to the region.

In this work, we present and adapt to GPU computation a family of fBP-inspired message passing algorithms that are capable of training multi-layer neural networks with generalization performance and computational speed comparable to SGD. This is the first work that shows that learning by message passing in deep neural networks 1) is possible and 2) is a viable alternative to SGD. Our version of fBP adds the reinforcement term at each mini-batch step in what we call the Posterior-as-Prior (PasP) rule. Furthermore, using the message-passing algorithm not as a solver but as an estimator of marginals allows us to make locally Bayesian predictions, averaging the predictions over the approximate posterior. The resulting generalization error is significantly better than that of the solver, showing that, although approximate, the marginals of the weights estimated by message passing retain useful information. Consistently with the assumptions underlying fBP, we find that the solutions provided by the message passing algorithms belong to flat entropic regions of the loss landscape and perform well in continual learning tasks and on sparse networks as well.

We also remark that our PasP update scheme is of independent interest and can be combined with
different posterior approximation techniques.

The paper is structured as follows: in Sec. 2 we give a brief review of some related works. In Sec. 3 we provide a detailed description of the message-passing equations and of the high-level structure of the algorithms. In Sec. 4 we compare the performance of the message passing algorithms with SGD-based approaches in different learning settings.

2 RELATED WORKS

The literature on message passing algorithms is extensive; we refer to Mézard & Montanari (2009) and Zdeborová & Krzakala (2016) for a general overview. More closely related to our work, multilayer message-passing algorithms have been developed in inference contexts (Manoel et al., 2017; Fletcher et al., 2018), where they have been shown to produce exact marginals under certain statistical assumptions on (unlearned) weight matrices.

The properties of message passing for learning shallow neural networks have been extensively studied (see Baldassi et al. (2020) and references therein). Barbier et al. (2019) rigorously show that message passing algorithms in generalized linear models perform asymptotically exact inference under some statistical assumptions. Dictionary learning and matrix factorization are harder problems closely related to deep network learning problems, in particular to the modelling of a single intermediate layer. They have been approached using message passing in Kabashima et al. (2016) and Parker et al. (2014), although the resulting predictions are found to be asymptotically inexact (Maillard et al., 2021). The same problem is faced by the message passing algorithm recently proposed for a multi-layer matrix factorization scenario (Zou et al., 2021). Unfortunately, our framework does not yield asymptotically exact predictions either. Nonetheless, it gives a message passing heuristic that for the first time is able to train deep neural networks on natural datasets, and therefore sets a reference for the algorithmic applications of this research line.

A few papers attribute the success of SGD to the geometrical structure (smoothness and flatness) of the loss landscape in neural networks (Baldassi et al., 2015; Chaudhari et al., 2017; Garipov et al., 2018; Li et al., 2018; Pittorino et al., 2021; Feng & Tu, 2021). These considerations do not depend on the particular form of the SGD dynamics and should extend to other types of algorithms as well, although SGD is by far the most popular choice among neural network practitioners due to its simplicity, flexibility, speed, and generalization performance.

While our work focuses on message passing schemes, some of the ideas presented here, such as the PasP rule, can be combined with algorithms for training Bayesian neural networks (Hernández-Lobato & Adams, 2015; Wu et al., 2018). Recent work extends BP by combining it with graph neural networks (Kuck et al., 2020; Satorras & Welling, 2021). Finally, some work in computational neuroscience shows similarities to our approach (Rao, 2007).

3 LEARNING BY MESSAGE PASSING

3.1 POSTERIOR-AS-PRIOR UPDATES

We consider a multi-layer perceptron with $L$ hidden neuron layers, having weight and bias parameters $\mathcal{W} = \{W^\ell, b^\ell\}_{\ell=0}^{L}$. We allow for stochastic activations $P^\ell(x^{\ell+1} \mid z^\ell)$, where $z^\ell$ is the neuron's pre-activation vector for layer $\ell$, and $P^\ell$ is assumed to be factorized over the neurons. If no stochasticity is present, $P^\ell$ just encodes an element-wise activation function. The probability of output $y$ given an input $x$ is then given by

$$P(y \mid x, \mathcal{W}) = \int dx^{1:L} \prod_{\ell=0}^{L} P^{\ell+1}\!\left(x^{\ell+1} \mid W^\ell x^\ell + b^\ell\right), \qquad (1)$$

where for convenience we defined $x^0 = x$ and $x^{L+1} = y$. In a Bayesian framework, given a training set $\mathcal{D} = \{(x_n, y_n)\}_n$ and a prior distribution over the weights $q_\theta(\mathcal{W})$ in some parametric family, the posterior distribution is given by

$$P(\mathcal{W} \mid \mathcal{D}, \theta) \propto \prod_n P(y_n \mid x_n, \mathcal{W})\, q_\theta(\mathcal{W}). \qquad (2)$$

Here $\propto$ denotes equality up to a normalization factor. Using the posterior one can compute the Bayesian prediction for a new data-point $x$ through the average $P(y \mid x, \mathcal{D}, \theta) = \int d\mathcal{W}\, P(y \mid x, \mathcal{W})\, P(\mathcal{W} \mid \mathcal{D}, \theta)$. Unfortunately, the posterior is generically intractable due to the hard-to-compute normalization factor. On the other hand, we are mainly interested in training a distribution that covers wide minima of the loss landscape that generalize well (Baldassi et al., 2016a) and in recovering pointwise estimators within these regions. The Bayesian modeling becomes an auxiliary tool to set the stage for the message passing algorithms seeking flat minima. We also need a formalism that allows for mini-batch training, to speed up the computation and deal with large datasets. Therefore, we devise an update scheme that we call Posterior-as-Prior (PasP), where we evolve the parameters $\theta^t$ of a distribution $q_{\theta^t}(\mathcal{W})$ computed as an approximate mini-batch posterior, in such a way that the outcome of the previous iteration becomes the prior in the following step. In the PasP scheme, $\theta^t$ retains the memory of past observations. We also add an exponential factor $\rho$, that we typically set close to 1, tuning the forgetting rate and playing a role similar to the learning rate in SGD. Given a mini-batch $(X^t, y^t)$ sampled from the training set at time $t$ and a scalar $\rho > 0$, the PasP update reads

$$q_{\theta^{t+1}}(\mathcal{W}) \approx P(\mathcal{W} \mid y^t, X^t, \theta^t)^{\rho}, \qquad (3)$$

where $\approx$ denotes approximate equality and we do not account for the normalization factor. A first approximation may be needed in the computation of the posterior, a second to project the approximate posterior onto the distribution manifold spanned by $\theta$ (Minka, 2001). In practice, we will consider factorized approximate posteriors in an exponential family and priors $q_\theta$ in the same family, although Eq. 3 generically allows for more refined approximations.
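
To make the outer loop concrete, here is a minimal Python sketch of the PasP iteration for a factorized ±1 weight prior $q_\theta(w) \propto e^{\theta w}$ (one field per weight). This is an illustration under our own naming, not the authors' Julia implementation; `approximate_posterior_fields` is a placeholder for the inner message passing loop of Sec. 3.2. For a factorized exponential-family posterior, raising the density to the power $\rho$ in Eq. 3 simply rescales its natural parameters.

```python
import numpy as np

def pasp_train(data_loader, layer_sizes, approximate_posterior_fields,
               rho=0.99, n_epochs=10, rng=None):
    """Posterior-as-Prior (PasP) outer loop, sketched for factorized +/-1 weight
    priors q_theta(w) ~ exp(theta * w), i.e. one scalar field per weight.

    `approximate_posterior_fields(X, y, theta)` is a placeholder for the inner
    message-passing loop (BP / BPI / MF / AMP) returning the posterior fields."""
    rng = np.random.default_rng() if rng is None else rng
    # Small random fields at t = 0 break the neuron permutation symmetry.
    theta = [0.1 * rng.standard_normal((n_out, n_in))
             for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
    for _ in range(n_epochs):
        for X, y in data_loader:                      # mini-batches (X^t, y^t)
            post = approximate_posterior_fields(X, y, theta)
            # Eq. 3: raising a factorized exponential-family posterior to the
            # power rho rescales its natural parameters: theta^{t+1} = rho * post.
            theta = [rho * h for h in post]
    # Point-wise estimator: sign of the final fields (weight magnetizations).
    return [np.sign(h) for h in theta]
```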



Notice that setting $\rho = 1$, the batch-size to 1, and taking a single pass over the dataset, we recover the Assumed Density Filtering algorithm (Minka, 2001). For large enough $\rho$ (including $\rho = 1$), the iterations of $q_{\theta^t}$ will concentrate on a pointwise estimator. This mechanism mimics the reinforcement heuristic commonly used to turn Belief Propagation into a solver for constraint satisfaction problems (Braunstein & Zecchina, 2006) and is related to flat-minima discovery (see focusing-BP in Baldassi et al. (2016a)). A different prior updating mechanism, which can be understood as empirical Bayes, has been used in Baldassi et al. (2016b).

3.2 INNER MESSAGE PASSING LOOP

While the PasP rule takes care of the reinforcement heuristic across mini-batches, we compute the mini-batch posterior in Eq. 3 using message passing approaches derived from Belief Propagation. BP is an iterative scheme for computing marginals and entropies of statistical models (Mézard & Montanari, 2009). It is most conveniently expressed on factor graphs, that is, bipartite graphs where the two sets of nodes are called variable nodes and factor nodes. They respectively represent the variables involved in the statistical model and their interactions. Messages from factor nodes to variable nodes and vice versa are exchanged along the edges of the factor graph for a certain number of BP iterations or until a fixed point is reached.

The factor graph for $P(\mathcal{W} \mid X^t, y^t, \theta^t)$ can be derived from Eq. 2, with the following additional specifications. For simplicity, we will ignore the bias term in each layer. We assume a factorized $q_{\theta^t}(\mathcal{W})$, with each factor parameterized by its first two moments. In what follows, we drop the PasP iteration index $t$. For each example $(x_n, y_n)$ in the mini-batch, we introduce the auxiliary variables $x^\ell_n$, $\ell = 1, \dots, L$, representing the layers' activations. For each example, each neuron in the network contributes a factor node to the factor graph. The scalar components of the weight matrices and the activation vectors become variable nodes. This construction is presented in Appendix A, where we also derive the message update rules on the factor graph. The factor graph thus defined is extremely loopy, and straightforward iteration of BP has convergence issues. Moreover, in the presence of a homogeneous prior over the weights, the neuron permutation symmetry in each hidden layer induces a strongly attractive symmetric fixed point that hinders learning. We work around these issues by breaking the symmetry at time $t = 0$ with an inhomogeneous prior. In our experiments, a little initial heterogeneity is sufficient to obtain specialized neurons at each following time step. Additionally, we do not require message passing convergence in the inner loop (see Algorithm 1) but perform one or a few iterations for each $\theta$ update. We also include an inertia term, commonly called a damping factor, in the message updates (see Appendix B.2). As we shall discuss, these simple rules suffice to train deep networks by message passing.

For the inner loop we adapt to deep neural networks four different message passing algorithms, all of which are well known in the literature although derived in simpler settings: Belief Propagation (BP), BP-Inspired (BPI) message passing, mean-field (MF), and approximate message passing (AMP). The last three algorithms can be considered approximations of the first one. In the following paragraphs we will discuss their common traits, present the BP updates as an example, and refer to Appendix A for an in-depth exposition. For all algorithms, message updates can be divided into a forward pass and a backward pass, as also done in (Fletcher et al., 2018) in a multi-layer inference setting. The BP algorithm is compactly reported in Algorithm 1.

**Meaning of messages.** All the messages involved in the message passing can be understood in terms of cavity marginals or full marginals (as mentioned in the introduction, BP is also known as the Cavity Method, see (Mézard & Montanari, 2009)). Of particular relevance are $m^\ell_{ki}$ and $\sigma^\ell_{ki}$, denoting the mean and variance of the weight $W^\ell_{ki}$. The quantities $\hat{x}^\ell_{in}$ and $\Delta^\ell_{in}$ instead denote the mean and variance of the $i$-th neuron's activation in layer $\ell$ for a given input $x_n$.

**Scalar free energies.** All message passing schemes are conveniently expressed in terms of two functions that correspond to the effective free energy (Zdeborová & Krzakala, 2016) of a single neuron and of a single weight respectively:

$$\phi^\ell(B, A, \omega, V) = \log \int dx\, dz\; e^{-\frac{1}{2} A x^2 + B x}\, P^\ell(x \mid z)\, e^{-\frac{(\omega - z)^2}{2V}}, \qquad \ell = 1, \dots, L \qquad (4)$$

$$\psi(H, G, \theta) = \log \int dw\; e^{-\frac{1}{2} G w^2 + H w}\, q_\theta(w) \qquad (5)$$

Notice that for common deterministic activations such as ReLU and sign, the function $\phi$ has analytic and smooth expressions (see Appendix A.8). The same holds for the function $\psi$ when $q_\theta(w)$ is Gaussian (continuous weights) or a mixture of atoms (discrete weights). At the last layer we impose $P^{L+1}(y \mid z) = \mathbb{I}(y = \mathrm{sign}(z))$ in binary classification tasks and $P^{L+1}(y \mid z) = \mathbb{I}(y = \arg\max(z))$ in multi-class classification (see Appendix A.9). While in our experiments we use hard constraints for the final output, therefore solving a constraint satisfaction problem, it would be interesting to also consider soft constraints and introduce a temperature, but this is beyond the scope of our work.
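
As a worked example (ours, not reproduced from the paper's appendix), for binary weights $w \in \{\pm 1\}$ with prior $q_\theta(w) \propto e^{\theta w}$, the integral in Eq. 5 becomes a two-term sum and, up to an additive constant from the prior normalization,

$$\psi(H, G, \theta) = \log \sum_{w = \pm 1} e^{-\frac{1}{2} G w^2 + H w + \theta w} = -\frac{G}{2} + \log 2\cosh(H + \theta),$$

so that the weight mean and variance used in the message passing are $\partial_H \psi = \tanh(H + \theta)$ and $\partial^2_H \psi = 1 - \tanh^2(H + \theta)$.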

**Start and end of message passing.** At the beginning of a new PasP iteration $t$, we reset the messages (see Appendix A) and run message passing for $\tau_{\max}$ iterations. We then compute the new prior's parameters $\theta^{t+1}$ from the posterior given by the message passing.

**BP Forward pass.** After initialization of the messages at time $\tau = 0$, at each following time we propagate a set of messages from the first to the last layer and then another set from the last to the first. For an intermediate layer $\ell$ the forward pass reads

$$\hat{x}^{\ell,\tau}_{in\to k} = \partial_B \phi^\ell\!\left(B^{\ell,\tau-1}_{in\to k},\, A^{\ell,\tau-1}_{in},\, \omega^{\ell-1,\tau}_{in},\, V^{\ell-1,\tau}_{in}\right) \qquad (6)$$

$$\Delta^{\ell,\tau}_{in} = \partial^2_B \phi^\ell\!\left(B^{\ell,\tau-1}_{in},\, A^{\ell,\tau-1}_{in},\, \omega^{\ell-1,\tau}_{in},\, V^{\ell-1,\tau}_{in}\right) \qquad (7)$$

$$m^{\ell,\tau}_{ki\to n} = \partial_H \psi\!\left(H^{\ell,\tau-1}_{ki\to n},\, G^{\ell,\tau-1}_{ki},\, \theta^\ell_{ki}\right) \qquad (8)$$

$$\sigma^{\ell,\tau}_{ki} = \partial^2_H \psi\!\left(H^{\ell,\tau-1}_{ki},\, G^{\ell,\tau-1}_{ki},\, \theta^\ell_{ki}\right) \qquad (9)$$

$$V^{\ell,\tau}_{kn} = \sum_i \left[ \left(m^{\ell,\tau}_{ki\to n}\right)^2 \Delta^{\ell,\tau}_{in} + \sigma^{\ell,\tau}_{ki} \left(\hat{x}^{\ell,\tau}_{in\to k}\right)^2 + \sigma^{\ell,\tau}_{ki}\, \Delta^{\ell,\tau}_{in} \right] \qquad (10)$$

$$\omega^{\ell,\tau}_{kn\to i} = \sum_{i' \neq i} m^{\ell,\tau}_{ki'\to n}\, \hat{x}^{\ell,\tau}_{i'n\to k} \qquad (11)$$

The equations for the first layer differ slightly and in an intuitive way from the ones above (see Appendix A.3).

**BP Backward pass.** The backward pass updates a set of messages from the last to the first layer:

$$g^{\ell,\tau}_{kn\to i} = \partial_\omega \phi^{\ell+1}\!\left(B^{\ell+1,\tau}_{kn},\, A^{\ell+1,\tau}_{kn},\, \omega^{\ell,\tau}_{kn\to i},\, V^{\ell,\tau}_{kn}\right) \qquad (12)$$

$$\Gamma^{\ell,\tau}_{kn} = -\partial^2_\omega \phi^{\ell+1}\!\left(B^{\ell+1,\tau}_{kn},\, A^{\ell+1,\tau}_{kn},\, \omega^{\ell,\tau}_{kn},\, V^{\ell,\tau}_{kn}\right) \qquad (13)$$

$$A^{\ell,\tau}_{in} = \sum_k \left[ \left( \left(m^{\ell,\tau}_{ki\to n}\right)^2 + \sigma^{\ell,\tau}_{ki} \right) \Gamma^{\ell,\tau}_{kn} - \sigma^{\ell,\tau}_{ki} \left(g^{\ell,\tau}_{kn\to i}\right)^2 \right] \qquad (14)$$

$$B^{\ell,\tau}_{in\to k} = \sum_{k' \neq k} m^{\ell,\tau}_{k'i\to n}\, g^{\ell,\tau}_{k'n\to i} \qquad (15)$$

$$G^{\ell,\tau}_{ki} = \sum_n \left[ \left( \left(\hat{x}^{\ell,\tau}_{in\to k}\right)^2 + \Delta^{\ell,\tau}_{in} \right) \Gamma^{\ell,\tau}_{kn} - \Delta^{\ell,\tau}_{in} \left(g^{\ell,\tau}_{kn\to i}\right)^2 \right] \qquad (16)$$

$$H^{\ell,\tau}_{ki\to n} = \sum_{n' \neq n} \hat{x}^{\ell,\tau}_{in'\to k}\, g^{\ell,\tau}_{kn'\to i} \qquad (17)$$

As with the forward pass, we add the caveat that for the last layer the equations are slightly different from the ones above.
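
For readers who prefer code, the following NumPy sketch (ours, with placeholder array names) spells out the tensor contractions behind Eqs. 10-11 and 14-17 for a single dense layer, in a simplified variant where the cavity sums over $i' \neq i$, $k' \neq k$, $n' \neq n$ are replaced by full sums; this simplification is ours, for readability, and does not reproduce the exact BP messages. The inputs `g` and `Gamma` stand for the derivatives of $\phi^{\ell+1}$ in Eqs. 12-13.

```python
import numpy as np

def layer_forward(m, sigma, xhat, Delta):
    """Contractions of Eqs. 10-11 with the cavity restriction i' != i dropped.
    m, sigma: weight means/variances, shape (K, I).
    xhat, Delta: activation means/variances, shape (I, N) for a batch of N."""
    omega = m @ xhat                                            # ~ Eq. 11
    V = (m ** 2) @ Delta + sigma @ (xhat ** 2) + sigma @ Delta  # Eq. 10
    return omega, V                                             # shape (K, N)

def layer_backward(m, sigma, xhat, Delta, g, Gamma):
    """Contractions of Eqs. 14-17 with the cavity restrictions dropped.
    g, Gamma: derivatives of phi^{l+1} from Eqs. 12-13, shape (K, N)."""
    A = (m ** 2 + sigma).T @ Gamma - sigma.T @ (g ** 2)         # Eq. 14
    B = m.T @ g                                                 # ~ Eq. 15
    G = Gamma @ (xhat ** 2 + Delta).T - (g ** 2) @ Delta.T      # Eq. 16
    H = g @ xhat.T                                              # ~ Eq. 17
    return A, B, G, H

# Toy shapes: output width K, input width I, batch size N.
K, I, N = 64, 128, 32
rng = np.random.default_rng(0)
m, sigma = np.tanh(rng.standard_normal((K, I))), rng.random((K, I))
xhat, Delta = np.tanh(rng.standard_normal((I, N))), rng.random((I, N))
omega, V = layer_forward(m, sigma, xhat, Delta)
A, B, G, H = layer_backward(m, sigma, xhat, Delta,
                            rng.standard_normal((K, N)), rng.random((K, N)))
# Each contraction is a (K, I) x (I, N) (or transposed) matrix product, hence
# O(K * I * N): quadratic in the layer width and linear in the batch size.
```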



**Computational complexity.** The message passing equations boil down to element-wise operations and tensor contractions that we easily implement using the GPU-friendly Julia library Tullio.jl (Abbott et al., 2021). For a layer of input and output size $N$ and a batch-size of $B$, the time complexity of a forward-and-backward iteration is $O(N^2 B)$ for all message passing algorithms (BP, BPI, MF, and AMP), the same as for SGD. The prefactor varies and is generally larger than for SGD (see Appendix B.9). Also, the time complexity of message passing is proportional to $\tau_{\max}$ (which we typically set to 1). We provide our implementation in the GitHub repo anonymous.

**Algorithm 1: BP for deep neural networks**

// Message passing used in the PasP Eq. 3 to approximate
// the mini-batch posterior.
// Here we specifically refer to BP updates.
// BPI, MF, and AMP updates take the same form but use
// the rules in Appendix A.4, A.5, and A.7 respectively.

**1** Initialize messages.

**2** **for** $\tau = 1, \dots, \tau_{\max}$ **do**

// Forward Pass

**3** **for** $\ell = 0, \dots, L$ **do**

**4** compute $\hat{x}^\ell, \Delta^\ell$ using (6, 7)

**5** compute $m^\ell, \sigma^\ell$ using (8, 9)

**6** compute $V^\ell, \omega^\ell$ using (10, 11)

// Backward Pass

**7** **for** $\ell = L, \dots, 0$ **do**

**8** compute $g^\ell, \Gamma^\ell$ using (12, 13)

**9** compute $A^\ell, B^\ell$ using (14, 15)

**10** compute $G^\ell, H^\ell$ using (16, 17)

4 NUMERICAL RESULTS

We implement our message passing algorithms on neural networks with continuous and binary weights and with binary activations. In our experiments we fix $\tau_{\max} = 1$. We typically do not observe an increase in performance when taking more steps, except for some specific cases and in particular for MF layers. We remark that for $\tau_{\max} = 1$ the BP and the BPI equations are identical, so in most of the subsequent numerical results we will only investigate BP.

We compare our algorithms with an SGD-based algorithm adapted to binary architectures (Hubara et al., 2016), which we call BinaryNet throughout the paper (see Appendix B.6 for details). Comparisons of Bayesian predictions are made with the gradient-based Expectation Backpropagation (EBP) algorithm (Soudry et al., 2014a), which is also able to deal with discrete weights and activations. In all architectures we avoid the use of bias terms and batch-normalization layers.

We find that message-passing algorithms are able to train generic MLP architectures with varying numbers and sizes of hidden layers. As for the datasets, we are able to perform both binary and multi-class classification on standard computer vision datasets such as MNIST, Fashion-MNIST, and CIFAR-10. Since these datasets consist of 10 classes, for the binary classification task we divide each dataset into two classes (even vs. odd).

We report that message passing algorithms are able to solve these optimization problems with
generalization performance comparable to or better than SGD-based algorithms. Some of the
message passing algorithms (BP and AMP in particular) need fewer epochs to achieve low error than
the ones required by SGD-based algorithms, even if adaptive methods like Adam are considered.
Timings of our GPU implementations of message passing algorithms are competitive with SGD (see
Appendix B.9).



4.1 EXPERIMENTS ACROSS ARCHITECTURES

We select a specific task, multi-class classification on Fashion-MNIST, and we compare the message passing algorithms with BinaryNet for different choices of the architecture (i.e. we vary the number and the size of the hidden layers). In Fig. 1 (Left) we present the learning curves for a MLP with 3 hidden layers of 501 units with binary weights and activations. Similar results hold in our experiments with 2 or 3 hidden layers of 101, 501 or 1001 units and with batch sizes from 1 to 1024. The parameters used in our simulations are reported in Appendix B.3. Results on networks with continuous weights can be found in Fig. 2 (Right).

4.2 SPARSE LAYERS

Since the BP algorithm has notoriously been successful on sparse graphs, we perform a straightforward implementation of pruning at initialization, i.e. we impose a random boolean mask on the weights that we keep fixed throughout training. We call sparsity the fraction of zeroed weights. This kind of non-adaptive pruning is known to largely hinder learning (Frankle et al., 2021; Sung et al., 2021). In the right panel of Fig. 1, we report results on sparse binary networks in which we train a MLP with 2 hidden layers of 101 units on the MNIST dataset. For reference, results on pruning quantized/binary networks can be found in Refs. (Han et al., 2016; Ardakani et al., 2017; Tung & Mori, 2018; Diffenderfer & Kailkhura, 2021). Experimenting with sparsity up to 90%, we observe that BP and MF perform better than BinaryNet, while AMP lags behind BinaryNet.
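
A minimal sketch (our illustration, not the authors' code) of the pruning-at-initialization protocol described above: a random boolean mask with the desired sparsity is drawn once and then multiplied into the weight means and variances at every step, so that masked weights never participate in the message passing.

```python
import numpy as np

def make_sparsity_mask(shape, sparsity, rng=None):
    """Fixed boolean mask; `sparsity` is the fraction of zeroed weights."""
    rng = np.random.default_rng() if rng is None else rng
    return rng.random(shape) >= sparsity          # True = weight kept

# Example: 90% sparsity on a 101 x 784 layer. The mask is drawn once at
# initialization and applied to the weight means/variances at every step.
mask = make_sparsity_mask((101, 784), sparsity=0.9)
m = np.random.randn(101, 784) * mask
sigma = np.ones((101, 784)) * mask
```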

(Figure 1 plots. Left panel: train/test error (%) vs. epochs for BP, AMP, MF, and BinaryNet. Right panel: test accuracy (%) vs. sparsity (%) for BP, Bayes BP, AMP, Bayes AMP, MF, Bayes MF, and BinaryNet. See the caption below.)


Figure 1: (Left) Training curves of message passing algorithms compared with BinaryNet on the
Fashion-MNIST dataset (multi-class classification) with a binary MLP with 3 hidden layers of 501
units. (Right) Final test accuracy when varying the layer's sparsity in a binary MLP with 2 hidden
layers of 101 units on the MNIST dataset (multi-class). In both panels the batch-size is 128 and
curves are averaged over 5 realizations of the initial conditions (and sparsity pattern in the right
panel).

4.3 EXPERIMENTS ACROSS DATASETS

We now fix the architecture, a MLP with 2 hidden layers of 501 neurons each with binary weights and
activations. We vary the dataset, i.e. we test the BP-based algorithms on standard computer vision
benchmark datasets such as MNIST, Fashion-MNIST and CIFAR-10, in both the multi-class and
binary classification tasks. In Tab. 1 we report the final test errors obtained by the message passing
algorithms compared to the BinaryNet baseline. See Appendix B.4 for the corresponding training
errors and the parameters used in the simulations. We mention that while the test performance is
mostly comparable, the train error tends to be lower for the message passing algorithms.



|  | BinaryNet | BP | AMP | MF |
|---|---|---|---|---|
| MNIST (2 classes) | 1.3 ± 0.1 | 1.4 ± 0.2 | 1.4 ± 0.1 | 1.3 ± 0. |
| Fashion-MNIST (2 classes) | 2.4 ± 0.1 | 2.3 ± 0.1 | 2.4 ± 0.1 | 2.3 ± 0. |
| CIFAR-10 (2 classes) | 30.0 ± 0.3 | 31.4 ± 0.1 | 31.1 ± 0.3 | 31.1 ± 0. |
| MNIST | 2.2 ± 0.1 | 2.6 ± 0.1 | 2.6 ± 0.1 | 2.3 ± 0. |
| Fashion-MNIST | 12.0 ± 0.6 | 11.8 ± 0.3 | 11.9 ± 0.2 | 12.1 ± 0. |
| CIFAR-10 | 59.0 ± 0.7 | 58.7 ± 0.3 | 58.5 ± 0.2 | 60.4 ± 1. |

Table 1: Final test errors (%) on MNIST, Fashion-MNIST and CIFAR-10 of various algorithms on a MLP with 2 hidden layers with binary weights and activations. All algorithms are trained with batch-size 128. Standard deviations are calculated over 5 random initializations.

4.4 BAYESIAN ERROR

The message passing framework, used as an estimator of the mini-batch posterior marginals, allows us to compute an approximate Bayesian prediction, i.e. averaging the pointwise predictions over the estimated marginals. We observe better generalization error from Bayesian predictions than from pointwise ones, showing that the marginals retain useful information. The marginals are, however, only approximately estimated with the PasP mini-batch procedure (the exact ones should be computed with a different scheme, but this converges with difficulty in our tests). Since BP-based algorithms find flat regions of the landscape (as also confirmed by the local energy measure performed in the Appendix), the Bayesian error we compute can be considered as a local approximation of the full one. We report results for binary classification on the MNIST dataset in Fig. 2, and we observe the same behavior on other datasets and architectures. We obtain the Bayesian prediction from a single forward pass of the message passing. To obtain good Bayesian predictions, the posterior distribution should not concentrate too much, otherwise the prediction reduces to that of a single configuration. In Fig. 2 we also compare BP (point-wise and Bayesian) with SGD and with a gradient-based algorithm for Bayesian predictions, Expectation Backpropagation (Soudry et al., 2014a); see the Appendix for implementation details.


Figure 2: Test error curves for Bayesian and point-wise predictions for a MLP with 2 hidden layers of 101 units on the 2-class MNIST dataset. We report the results for (Left) binary and (Right) continuous weights. In both cases, we compare SGD, BP (point-wise and Bayesian) and EBP (point-wise and Bayesian). See Appendix B.3 for details.

4.5 CONTINUAL LEARNING


Given the high local entropy (i.e. the flatness) of the solutions found by the BP-based algorithms
(see Appendix B.5), we perform additional tests in a classic setting, continual learning, where the possibility of locally rearranging the solutions while keeping low training error can be an advantage. When a deep network is trained sequentially on different tasks, it tends to forget previously seen tasks exponentially fast while learning new ones (McCloskey & Cohen, 1989; Robins, 1995; Fusi et al., 2005). Recent work (Feng & Tu, 2021) has shown that searching for a flat region in the loss landscape can indeed help to prevent catastrophic forgetting. Several heuristics have been proposed to mitigate the problem (Kirkpatrick et al., 2017; Aljundi et al., 2018; Zenke et al., 2017; Laborieux et al., 2021), but all require specialized adjustments to the loss or the dynamics.

Here we show instead that our message passing schemes naturally lend themselves to learning multiple tasks sequentially, mitigating the characteristic memory issues of gradient-based schemes without the need for explicit modifications. As a prototypical experiment, we sequentially trained a multi-layer neural network on 6 different versions of the MNIST dataset, where the pixels of the images have been randomly permuted (Goodfellow et al., 2013), giving a fixed budget of 40 epochs on each task. We present the results for a two-hidden-layer neural network with 2001 units in each layer (see Appendix B.3 for details). As can be seen in Fig. 3, at the end of training the BP algorithm is able to reach good generalization performance on all the tasks. We compared the BP performance with BinaryNet, which already performs better than SGD with continuous weights (see the discussion in Laborieux et al. (2021)). While our BP implementation is not competitive with ad-hoc techniques specifically designed for this problem, it beats non-specialized heuristics. Moreover, we believe that specialized approaches like the one of Laborieux et al. (2021) can be adapted to message passing as well.
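
A minimal sketch (our illustration) of how the permuted-MNIST task sequence can be generated: each task applies one fixed random permutation of the pixel indices to every image, while the labels are unchanged.

```python
import numpy as np

def make_permuted_tasks(X, y, n_tasks=6, seed=0):
    """Return a list of (X_permuted, y) pairs, one per task.
    X: flattened images, shape (n_samples, n_pixels); labels y are shared."""
    rng = np.random.default_rng(seed)
    tasks = []
    for _ in range(n_tasks):
        perm = rng.permutation(X.shape[1])   # one fixed pixel permutation per task
        tasks.append((X[:, perm], y))
    return tasks

# The network is then trained sequentially on tasks[0], tasks[1], ..., with a
# fixed budget of epochs per task, and evaluated on all tasks at the end.
```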

(Figure 3 plots. Left panel: test accuracy (%) per task after sequential training. Right panel: test accuracy (%) on the first task vs. epochs. Legend: BP, BinaryNet lr=0.1, BinaryNet lr=1.0, BinaryNet lr=10.0. See the caption below.)


Figure 3: Performance of BP and BinaryNet on the permuted MNIST task (see text) for a two hidden
layer network with 2001 units on each layer and binary weights and activations. The model is trained
sequentially on 6 different versions of the MNIST dataset (the tasks), where the pixels have been
permuted. (Left) Test accuracy on each task after the network has been trained on all the tasks.
(Right) Test accuracy on the first task as a function of the number of epochs. Points are averages over
5 independent runs, shaded areas are errors on the mean.


5 DISCUSSION AND CONCLUSIONS

While successful in many fields, message passing algorithms have notoriously struggled to scale to deep neural network training problems. Here we have developed a class of fBP-based message passing algorithms and used them within an update scheme, Posterior-as-Prior (PasP), that makes it possible to train deep and wide multilayer perceptrons by message passing.

We performed experiments with binary activations and either binary or continuous weights. Future work should try to include different activations, biases, batch-normalization, and convolutional layers as well. Another interesting direction is the algorithmic computation of the (local) entropy of the model from the messages.

Further theoretical work is needed for a more complete understanding of the robustness of our
methods. Recent developments in message passing algorithms (Rangan et al., 2019) and related
theoretical analysis (Goldt et al., 2020) could provide fruitful inspirations. While our algorithms
can be used for approximate Bayesian inference, exact posterior calculation is still out of reach for
message passing approaches and much technical work is needed in that direction.



REFERENCES

Michael Abbott, Dilum Aluthge, N3N5, Simeon Schaub, Carlo Lucibello, Chris Elrod, and Johnny
[Chen. Tullio.jl julia package, 2021. URL https://github.com/mcabbott/Tullio.jl.](https://github.com/mcabbott/Tullio.jl)

Rahaf Aljundi, Francesca Babiloni, Mohamed Elhoseiny, Marcus Rohrbach, and Tinne Tuytelaars.
Memory aware synapses: Learning what (not) to forget. In Proceedings of the European Conference
_on Computer Vision (ECCV), pp. 139–154, 2018._

Arash Ardakani, Carlo Condo, and Warren J. Gross. Sparsely-connected neural networks: Towards efficient VLSI implementation of deep neural networks. In 5th International Conference on Learning
_Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings._
[OpenReview.net, 2017. URL https://openreview.net/forum?id=r1fYuytex.](https://openreview.net/forum?id=r1fYuytex)

Carlo Baldassi, Alfredo Braunstein, Nicolas Brunel, and Riccardo Zecchina. Efficient supervised
learning in networks with binary synapses. Proceedings of the National Academy of Sciences,
[104(26):11079โ€“11084, 2007. ISSN 0027-8424. doi: 10.1073/pnas.0700324104. URL https:](https://www.pnas.org/content/104/26/11079)
[//www.pnas.org/content/104/26/11079.](https://www.pnas.org/content/104/26/11079)

Carlo Baldassi, Alessandro Ingrosso, Carlo Lucibello, Luca Saglietti, and Riccardo Zecchina.
Subdominant dense clusters allow for simple learning and high computational performance
in neural networks with discrete synapses. _Phys. Rev. Lett., 115:128101, Sep 2015._
[doi: 10.1103/PhysRevLett.115.128101. URL https://link.aps.org/doi/10.1103/](https://link.aps.org/doi/10.1103/PhysRevLett.115.128101)
[PhysRevLett.115.128101.](https://link.aps.org/doi/10.1103/PhysRevLett.115.128101)

Carlo Baldassi, Christian Borgs, Jennifer T. Chayes, Alessandro Ingrosso, Carlo Lucibello, Luca
Saglietti, and Riccardo Zecchina. Unreasonable effectiveness of learning neural networks: From
accessible states and robust ensembles to basic algorithmic schemes. Proceedings of the National
_Academy of Sciences, 113(48):E7655–E7662, 2016a. ISSN 0027-8424. doi: 10.1073/pnas._
[1608103113. URL https://www.pnas.org/content/113/48/E7655.](https://www.pnas.org/content/113/48/E7655)

Carlo Baldassi, Federica Gerace, Carlo Lucibello, Luca Saglietti, and Riccardo Zecchina. Learning
may need only a few bits of synaptic precision. Phys. Rev. E, 93:052313, May 2016b. doi: 10.
[1103/PhysRevE.93.052313. URL https://link.aps.org/doi/10.1103/PhysRevE.](https://link.aps.org/doi/10.1103/PhysRevE.93.052313)
[93.052313.](https://link.aps.org/doi/10.1103/PhysRevE.93.052313)

Carlo Baldassi, Fabrizio Pittorino, and Riccardo Zecchina. Shaping the learning landscape in neural
networks around wide flat minima. Proceedings of the National Academy of Sciences, 117(1):
[161โ€“170, 2020. ISSN 0027-8424. doi: 10.1073/pnas.1908636117. URL https://www.pnas.](https://www.pnas.org/content/117/1/161)
[org/content/117/1/161.](https://www.pnas.org/content/117/1/161)

Jean Barbier, Florent Krzakala, Nicolas Macris, Léo Miolane, and Lenka Zdeborová. Optimal errors
and phase transitions in high-dimensional generalized linear models. Proceedings of the National
_Academy of Sciences, 116(12):5451–5460, 2019. ISSN 0027-8424. doi: 10.1073/pnas.1802705116._
[URL https://www.pnas.org/content/116/12/5451.](https://www.pnas.org/content/116/12/5451)

Hans Bethe. Statistical theory of superlattices. Proc. R. Soc. A, 150:552, 1935.

Alfredo Braunstein and Riccardo Zecchina. Learning by message passing in networks of discrete
synapses. Phys. Rev. Lett., 96:030201, Jan 2006. doi: 10.1103/PhysRevLett.96.030201. URL
[https://link.aps.org/doi/10.1103/PhysRevLett.96.030201.](https://link.aps.org/doi/10.1103/PhysRevLett.96.030201)

Pratik Chaudhari, Anna Choromanska, Stefano Soatto, Yann LeCun, Carlo Baldassi, Christian Borgs,
Jennifer T. Chayes, Levent Sagun, and Riccardo Zecchina. Entropy-sgd: Biasing gradient descent
into wide valleys. In 5th International Conference on Learning Representations, ICLR 2017,
_Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net, 2017. URL_
[https://openreview.net/forum?id=B1YfAfcgl.](https://openreview.net/forum?id=B1YfAfcgl)

James Diffenderfer and Bhavya Kailkhura. Multi-prize lottery ticket hypothesis: Finding accurate
binary neural networks by pruning a randomly weighted network. In International Confer_[ence on Learning Representations, 2021. URL https://openreview.net/forum?id=](https://openreview.net/forum?id=U_mat0b9iv)_
[U_mat0b9iv.](https://openreview.net/forum?id=U_mat0b9iv)



Yu Feng and Yuhai Tu. The inverse varianceโ€“flatness relation in stochastic gradient descent is critical
for finding flat minima. Proceedings of the National Academy of Sciences, 118(9), 2021.

Alyson K Fletcher, Sundeep Rangan, and Philip Schniter. Inference in deep networks in high
dimensions. In 2018 IEEE International Symposium on Information Theory (ISIT), pp. 1884–1888.
IEEE, 2018.

Jonathan Frankle, Gintare Karolina Dziugaite, Daniel Roy, and Michael Carbin. Pruning neural
networks at initialization: Why are we missing the mark? In International Conference on Learning
_[Representations, 2021. URL https://openreview.net/forum?id=Ig-VyQc-MLK.](https://openreview.net/forum?id=Ig-VyQc-MLK)_

Stefano Fusi, Patrick J Drew, and Larry F Abbott. Cascade models of synaptically stored memories.
_Neuron, 45(4):599–611, 2005._

Marylou Gabrié. Mean-field inference methods for neural networks. _Journal of Physics A: Mathematical and Theoretical_, 53(22):223002, 2020.

Robert Gallager. Low-density parity-check codes. IRE Transactions on information theory, 8(1):
21–28, 1962.

Timur Garipov, Pavel Izmailov, Dmitrii Podoprikhin, Dmitry P Vetrov, and Andrew G Wilson. Loss
surfaces, mode connectivity, and fast ensembling of dnns. In S. Bengio, H. Wallach, H. Larochelle,
K. Grauman, N. Cesa-Bianchi, and R. Garnett (eds.), Advances in Neural Information Processing
_[Systems, volume 31. Curran Associates, Inc., 2018. URL https://proceedings.neurips.](https://proceedings.neurips.cc/paper/2018/file/be3087e74e9100d4bc4c6268cdbe8456-Paper.pdf)_
[cc/paper/2018/file/be3087e74e9100d4bc4c6268cdbe8456-Paper.pdf.](https://proceedings.neurips.cc/paper/2018/file/be3087e74e9100d4bc4c6268cdbe8456-Paper.pdf)

Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward
neural networks. In Yee Whye Teh and Mike Titterington (eds.), Proceedings of the Thirteenth
_International Conference on Artificial Intelligence and Statistics, volume 9 of Proceedings of_
_Machine Learning Research, pp. 249–256, Chia Laguna Resort, Sardinia, Italy, 13–15 May 2010._
[PMLR. URL https://proceedings.mlr.press/v9/glorot10a.html.](https://proceedings.mlr.press/v9/glorot10a.html)

Sebastian Goldt, Marc Mézard, Florent Krzakala, and Lenka Zdeborová. Modeling the influence of
data structure on learning in neural networks: The hidden manifold model. Physical Review X, 10
(4):041044, 2020.

Ian J Goodfellow, Mehdi Mirza, Da Xiao, Aaron Courville, and Yoshua Bengio. An empirical investigation of catastrophic forgetting in gradient-based neural networks. arXiv preprint arXiv:1312.6211,
2013.

Song Han, Huizi Mao, and William J. Dally. Deep compression: Compressing deep neural network
with pruning, trained quantization and huffman coding. In Yoshua Bengio and Yann LeCun (eds.),
4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico,
May 2-4, 2016, Conference Track Proceedings, 2016. URL http://arxiv.org/abs/1510.00149.

José Miguel Hernández-Lobato and Ryan P. Adams. Probabilistic backpropagation for scalable
learning of bayesian neural networks. In Proceedings of the 32nd International Conference on
International Conference on Machine Learning - Volume 37, ICML'15, pp. 1861–1869. JMLR.org,
2015.

Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Binarized
neural networks. In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett (eds.), Advances
in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016. URL
https://proceedings.neurips.cc/paper/2016/file/d8330f857a17c53d217014ee776bfd50-Paper.pdf.

Yiding Jiang, Behnam Neyshabur, Hossein Mobahi, Dilip Krishnan, and Samy Bengio. Fantastic
generalization measures and where to find them. In International Conference on Learning
Representations, 2020. URL https://openreview.net/forum?id=SJgIPJBFvH.

Yoshiyuki Kabashima, Florent Krzakala, Marc Mézard, Ayaka Sakata, and Lenka Zdeborová. Phase
transitions and sample complexity in bayes-optimal matrix factorization. IEEE Transactions on
Information Theory, 62(7):4228–4265, 2016. doi: 10.1109/TIT.2016.2556702.

James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A
Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming
catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114
(13):3521–3526, 2017.

Jonathan Kuck, Shuvam Chakraborty, Hao Tang, Rachel Luo, Jiaming Song, Ashish Sabharwal, and
Stefano Ermon. Belief propagation neural networks. In H. Larochelle, M. Ranzato, R. Hadsell,
M. F. Balcan, and H. Lin (eds.), Advances in Neural Information Processing Systems, volume 33,
pp. 667–678. Curran Associates, Inc., 2020. URL
https://proceedings.neurips.cc/paper/2020/file/07217414eb3fbe24d4e5b6cafb91ca18-Paper.pdf.

Axel Laborieux, Maxence Ernoult, Tifenn Hirtzlin, and Damien Querlioz. Synaptic metaplasticity
in binarized neural networks. Nature Communications, 12(1):2549, May 2021. ISSN 2041-1723.
doi: 10.1038/s41467-021-22768-y. URL https://doi.org/10.1038/s41467-021-22768-y.

Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, and Tom Goldstein. Visualizing the loss
landscape of neural nets. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi,
and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 31. Curran
Associates, Inc., 2018. URL
https://proceedings.neurips.cc/paper/2018/file/a41b3bb3e6b050b6c9067c67f663b915-Paper.pdf.

Antoine Maillard, Florent Krzakala, Marc Mézard, and Lenka Zdeborová. Perturbative construction
of mean-field equations in extensive-rank matrix factorization and denoising. arXiv preprint
arXiv:2110.08775, 2021.

Andre Manoel, Florent Krzakala, Marc Mézard, and Lenka Zdeborová. Multi-layer generalized linear
estimation. In 2017 IEEE International Symposium on Information Theory (ISIT), pp. 2098–2102,
2017. doi: 10.1109/ISIT.2017.8006899.

Michael McCloskey and Neal J Cohen. Catastrophic interference in connectionist networks: The
sequential learning problem. In Psychology of Learning and Motivation, volume 24, pp. 109–165.
Elsevier, 1989.

Marc Mézard. Mean-field message-passing equations in the hopfield model and its generalizations.
Physical Review E, 95(2):022117, 2017.

Marc Mézard, Giorgio Parisi, and Miguel Angel Virasoro. Spin Glass Theory and Beyond: An
Introduction to the Replica Method and Its Applications, volume 9. World Scientific Publishing
Company, 1987.

Thomas P. Minka. Expectation propagation for approximate bayesian inference. In Proceedings of
the Seventeenth Conference on Uncertainty in Artificial Intelligence, UAI'01, pp. 362–369, San
Francisco, CA, USA, 2001. Morgan Kaufmann Publishers Inc. ISBN 1558608001.

Marc Mézard and Andrea Montanari. Information, Physics, and Computation. Oxford University
Press, Inc., USA, 2009. ISBN 019857083X.

Jason T Parker, Philip Schniter, and Volkan Cevher. Bilinear generalized approximate message
passing—part i: Derivation. IEEE Transactions on Signal Processing, 62(22):5839–5853, 2014.

Judea Pearl. Reverend Bayes on inference engines: A distributed hierarchical approach. Cognitive
Systems Laboratory, School of Engineering and Applied Science ..., 1982.

R. Peierls. On ising's model of ferromagnetism. Mathematical Proceedings of the Cambridge
Philosophical Society, 32(3):477–481, 1936. doi: 10.1017/S0305004100019174.

Fabrizio Pittorino, Carlo Lucibello, Christoph Feinauer, Gabriele Perugini, Carlo Baldassi, Elizaveta
Demyanenko, and Riccardo Zecchina. Entropic gradient descent algorithms and wide flat minima.
In International Conference on Learning Representations, 2021. URL
https://openreview.net/forum?id=xjXg0bnoDmS.

Sundeep Rangan, Philip Schniter, and Alyson K Fletcher. Vector approximate message passing. IEEE
Transactions on Information Theory, 65(10):6664–6684, 2019.

Rajesh P. N. Rao. Neural models of Bayesian belief propagation, pp. 239–267. Bayesian Brain:
Probabilistic Approaches to Neural Coding. MIT Press, Cambridge, MA, US, 2007. ISBN
026204238X (Hardcover); 978-0-262-04238-3 (Hardcover).

Anthony Robins. Catastrophic forgetting, rehearsal and pseudorehearsal. Connection Science, 7(2):
123–146, 1995.

Victor Garcia Satorras and Max Welling. Neural enhanced belief propagation on factor graphs. In
International Conference on Artificial Intelligence and Statistics, pp. 685–693. PMLR, 2021.

Daniel Soudry, Itay Hubara, and Ron Meir. Expectation backpropagation: Parameter-free training
of multilayer neural networks with continuous or discrete weights. In Z. Ghahramani, M. Welling,
C. Cortes, N. Lawrence, and K. Q. Weinberger (eds.), Advances in Neural Information Processing
Systems, volume 27. Curran Associates, Inc., 2014a. URL
https://proceedings.neurips.cc/paper/2014/file/076a0c97d09cf1a0ec3e19c7f2529f2b-Paper.pdf.

Daniel Soudry, Itay Hubara, and Ron Meir. Expectation backpropagation: Parameter-free training of
multilayer neural networks with continuous or discrete weights. In NIPS, volume 1, pp. 2, 2014b.

George Stamatescu, Federica Gerace, Carlo Lucibello, Ian Fuss, and Langford B. White. Critical
initialisation in continuous approximations of binary neural networks. 2020. URL
https://openreview.net/forum?id=rylmoxrFDH.

Yi-Lin Sung, Varun Nair, and Colin Raffel. Training neural networks with fixed sparse masks, 2021.

Frederick Tung and Greg Mori. Clip-q: Deep network compression learning by in-parallel
pruning-quantization. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition,
pp. 7873–7882, 2018. doi: 10.1109/CVPR.2018.00821.

Anqi Wu, Sebastian Nowozin, Edward Meeds, Richard E Turner, Jose Miguel Hernandez-Lobato,
and Alexander L Gaunt. Deterministic variational inference for robust bayesian neural networks.
arXiv preprint arXiv:1810.03958, 2018.

Jonathan S. Yedidia, William T. Freeman, and Yair Weiss. Understanding Belief Propagation and Its
Generalizations, pp. 239–269. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2003.
ISBN 1558608117.

Lenka Zdeborová and Florent Krzakala. Statistical physics of inference: Thresholds and algorithms.
Advances in Physics, 65(5):453–552, 2016.

Friedemann Zenke, Ben Poole, and Surya Ganguli. Continual learning through synaptic intelligence.
In International Conference on Machine Learning, pp. 3987–3995. PMLR, 2017.

Qiuyun Zou, Haochuan Zhang, and Hongwen Yang. Multi-layer bilinear generalized approximate
message passing. IEEE Transactions on Signal Processing, 69:4529–4543, 2021. doi:
10.1109/TSP.2021.3100305.



# Appendices

CONTENTS

**A BP-based message passing algorithms**

A.1 Preliminary considerations

A.2 Derivation of the BP equations

A.3 BP equations

A.4 BPI equations

A.5 MF equations

A.6 Derivation of the AMP equations

A.7 AMP equations

A.8 Activation Functions

A.9 The ArgMax layer

**B Experimental details**

B.1 Hyper-parameters of the BP-based scheme

B.2 Damping scheme for the message passing

B.3 Architectures

B.4 Varying the dataset

B.5 Local energy

B.6 SGD implementation (BinaryNet)

B.7 EBP implementation

B.8 Unit polarization and overlaps

B.9 Computational performance: varying batch-size

A BP-BASED MESSAGE PASSING ALGORITHMS

A.1 PRELIMINARY CONSIDERATIONS

Given a mini-batch $\mathcal{B} = \{(x_n, y_n)\}_n$, the factor graph defined by Eqs. (1, 2, 18) is explicitly written
as:

$$P(\mathcal{W}, x^{1:L} \,|\, \mathcal{B}, \theta) \propto \prod_{\ell=0}^{L} \prod_{k,n} P^{\ell+1}\!\left(x^{\ell+1}_{kn} \,\Big|\, \sum_i W^{\ell}_{ki}\, x^{\ell}_{in}\right) \prod_{k,i,\ell} q_{\theta}(W^{\ell}_{ki}), \qquad (18)$$

where $x^{0}_{n} = x_n$ and $x^{L+1}_{n} = y_n$. The derivation of the BP equations for this model is straightforward
albeit lengthy and involved. It is obtained following the steps presented in multiple papers, books,
and reviews, see for instance (Mézard & Montanari, 2009; Zdeborová & Krzakala, 2016; Mézard,
2017), although it has not been attempted before in deep neural networks. It should be noted that a
(common) approximation that we take here with respect to the standard BP scheme is that messages
are assumed to be Gaussian distributed and therefore parameterized by their mean and variance. This
goes by the name of relaxed belief propagation (rBP), referred to simply as BP throughout the paper.

We derive the BP equations in A.2 and present them all together in A.3. From BP, we derive three
other message passing algorithms useful for the deep network training setting, all of which are well
known in the literature: BP-Inspired (BPI) message passing (A.4), mean-field (MF) (A.5), and
approximate message passing (AMP) (A.7). The AMP derivation is the most involved and is given in
A.6. In all these cases, message updates can be divided into a forward pass and a backward pass, as
also done in Fletcher et al. (2018) in a multi-layer inference setting. The BP algorithm is compactly
reported in Algorithm 1.

In our notation, $\ell$ denotes the layer index, $\tau$ the BP iteration index, $k$ an output neuron index, $i$ an
input neuron index, and $n$ a sample index.

We report below, for convenience, some of the considerations also present in the main text.

**Meaning of messages.** All the messages involved in the message passing equations can be understood
in terms of cavity marginals or full marginals (as mentioned in the introduction, BP is also known as
the Cavity Method, see Mézard & Montanari (2009)). Of particular relevance are the quantities
$m^{\ell}_{ki}$ and $\sigma^{\ell}_{ki}$, denoting the mean and variance of the weights $W^{\ell}_{ki}$. The quantities $\hat{x}^{\ell}_{in}$ and
$\Delta^{\ell}_{in}$ instead denote the mean and variance of the $i$-th neuron's activation in layer $\ell$ in correspondence of
an input $x_n$.

**Scalar free energies.** All message passing schemes can be expressed using the following scalar
functions, corresponding to single neuron and single weight effective free-energies respectively:

$$\phi^{\ell}(B, A, \omega, V) = \log \int dx\, dz\; e^{-\frac{1}{2}Ax^2 + Bx}\; P^{\ell}(x \,|\, z)\; e^{-\frac{(\omega-z)^2}{2V}}, \qquad (19)$$

$$\psi(H, G, \theta) = \log \int dw\; e^{-\frac{1}{2}Gw^2 + Hw}\; q_{\theta}(w). \qquad (20)$$

These free energies will naturally arise in the derivation of the BP equations in Appendix A.2. For
the last layer, the neuron function has to be slightly modified:

$$\phi^{L+1}(y, \omega, V) = \log \int dz\; P^{L+1}(y \,|\, z)\; e^{-\frac{(\omega-z)^2}{2V}}. \qquad (21)$$

Notice that for common deterministic activations such as ReLU and sign, the function $\phi$ has
analytic and smooth expressions that we give in Appendix A.8. The same goes for $\psi$ when $q_{\theta}(w)$
is Gaussian (continuous weights) or a mixture of atoms (discrete weights). At the last layer
we impose $P^{L+1}(y\,|\,z) = \mathbb{I}(y = \mathrm{sign}(z))$ in binary classification tasks. For multi-class
classification instead, we have to adapt the formalism to vectorial pre-activations $z$ and assume
$P^{L+1}(y\,|\,z) = \mathbb{I}(y = \arg\max(z))$ (see Appendix A.9). While in our experiments we use hard
constraints for the final output, therefore solving a constraint satisfaction problem, it would be
interesting to also consider generic loss functions. That would require minimal changes to our formalism,
but this is beyond the scope of our work.

**Binary weights.** In our experiments we use $\pm 1$ weights in each layer. Therefore each marginal can
be parameterized by a single number and our prior/posterior takes the form

$$q_{\theta}(W^{\ell}_{ki}) \propto e^{\theta^{\ell}_{ki} W^{\ell}_{ki}}. \qquad (22)$$

The effective free energy function Eq. 20 becomes

$$\psi(H, G, \theta^{\ell}_{ki}) = \log 2\cosh(H + \theta^{\ell}_{ki}), \qquad (23)$$

and the messages G can be dropped from the message passing.
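
As a concrete reference, here is a minimal numerical sketch of Eq. 23 and of its $H$-derivatives (the weight mean and variance used throughout the message passing); the helper names are ours and not part of the paper's implementation.

```python
import numpy as np

def psi_binary(H, theta):
    """Effective free energy of a +/-1 weight, Eq. 23: psi = log 2 cosh(H + theta)."""
    x = H + theta
    # logaddexp(x, -x) is a numerically stable log(e^x + e^-x) = log(2 cosh x)
    return np.logaddexp(x, -x)

def weight_moments_binary(H, theta):
    """Mean and variance of a +/-1 weight: m = d(psi)/dH, sigma = d^2(psi)/dH^2."""
    m = np.tanh(H + theta)   # Eq. 45 specialized to binary weights
    sigma = 1.0 - m ** 2     # Eq. 46 specialized to binary weights
    return m, sigma
```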

**Start and end of message passing.** At the beginning of a new PasP iteration $t$, we reset the
messages to zero and run message passing for $\tau_{max}$ iterations. We then compute the new prior
$q_{\theta^{t+1}}(\mathcal{W})$ from the posterior given by the message passing iterations.
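
A rough structural sketch of this outer loop follows (our own stand-in helpers, not the paper's code). For binary weights the posterior marginal at the end of a run is proportional to $e^{(\theta + H)W}$, so the sketch assumes the simplest prior-from-posterior update $\theta^{t+1} = \theta^{t} + H$; the actual update used in the main text may differ (e.g. through the parameter $\rho$).

```python
def pasp_step(theta, minibatch, run_message_passing, tau_max=1):
    """One PasP iteration: reset messages, run message passing, refresh the prior.

    `run_message_passing` is a stand-in for Eqs. (60-71); it is assumed to reset
    its messages to zero internally and to return the converged fields H on the
    weights after `tau_max` sweeps over `minibatch`.
    """
    H = run_message_passing(theta, minibatch, tau_max)
    theta_new = theta + H   # assumed prior-from-posterior update for +/-1 weights
    return theta_new
```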

A.2 DERIVATION OF THE BP EQUATIONS

In order to derive the BP equations, we start with the following portion of the factor graph reported in
Eq. 18 in the main text, describing the contribution of a single data example in the inner loop of the
PasP updates:

$$\prod_{\ell=0}^{L} \prod_{k} P^{\ell+1}\!\left(x^{\ell+1}_{kn} \,\Big|\, \sum_i W^{\ell}_{ki}\, x^{\ell}_{in}\right), \qquad \text{where } x^{0}_{n} = x_n,\; x^{L+1}_{n} = y_n, \qquad (24)$$

and where we recall that the quantity $x^{\ell}_{kn}$ corresponds to the activation of neuron $k$ in layer $\ell$ in
correspondence of the input example $n$.

Let us start by analyzing the single factor:

$$P^{\ell+1}\!\left(x^{\ell+1}_{kn} \,\Big|\, \sum_i W^{\ell}_{ki}\, x^{\ell}_{in}\right). \qquad (25)$$

We refer to messages that travel from input to output in the factor graph as upgoing or upward
messages, and to the ones that travel from output to input as downgoing or backward messages.

**Factor-to-variable-W messages** The factor-to-variable-W messages read:

$$\hat{\nu}^{\ell+1}_{kn\to ki}(W^{\ell}_{ki}) \propto \int \prod_{i'\neq i} d\nu^{\ell}_{ki'\to n}(W^{\ell}_{ki'}) \prod_{i'} d\nu^{\ell}_{i'n\to k}(x^{\ell}_{i'n})\; d\nu_{\downarrow}(x^{\ell+1}_{kn})\; P^{\ell+1}\!\left(x^{\ell+1}_{kn} \,\Big|\, \sum_{i'} W^{\ell}_{ki'}\, x^{\ell}_{i'n}\right), \qquad (26)$$

where $\nu_{\downarrow}$ denotes the messages travelling downwards (from output to input) in the factor graph.

We denote the means and variances of the incoming messages respectively with $m^{\ell}_{ki\to n}, \hat{x}^{\ell}_{in\to k}$ and
$\sigma^{\ell}_{ki\to n}, \Delta^{\ell}_{in\to k}$:

$$m^{\ell}_{ki\to n} = \int d\nu^{\ell}_{ki\to n}(W^{\ell}_{ki})\; W^{\ell}_{ki}, \qquad (27)$$

$$\sigma^{\ell}_{ki\to n} = \int d\nu^{\ell}_{ki\to n}(W^{\ell}_{ki})\; \big(W^{\ell}_{ki} - m^{\ell}_{ki\to n}\big)^2, \qquad (28)$$

$$\hat{x}^{\ell}_{in\to k} = \int d\nu^{\ell}_{in\to k}(x^{\ell}_{in})\; x^{\ell}_{in}, \qquad (29)$$

$$\Delta^{\ell}_{in\to k} = \int d\nu^{\ell}_{in\to k}(x^{\ell}_{in})\; \big(x^{\ell}_{in} - \hat{x}^{\ell}_{in\to k}\big)^2. \qquad (30)$$

We now use the central limit theorem to observe that, with respect to the incoming message
distributions and assuming their independence, in the large input limit the pre-activation is a
Gaussian random variable:

$$\sum_{i'\neq i} W^{\ell}_{ki'}\, x^{\ell}_{i'n} \sim \mathcal{N}\big(\omega^{\ell}_{kn\to i},\, V^{\ell}_{kn\to i}\big), \qquad (31)$$

where:

$$\omega^{\ell}_{kn\to i} = \mathbb{E}_{\nu}\Big[\sum_{i'\neq i} W^{\ell}_{ki'}\, x^{\ell}_{i'n}\Big] = \sum_{i'\neq i} m^{\ell}_{ki'\to n}\, \hat{x}^{\ell}_{i'n\to k}, \qquad (32)$$

$$V^{\ell}_{kn\to i} = \mathrm{Var}_{\nu}\Big[\sum_{i'\neq i} W^{\ell}_{ki'}\, x^{\ell}_{i'n}\Big] = \sum_{i'\neq i} \Big[\sigma^{\ell}_{ki'\to n}\, \Delta^{\ell}_{i'n\to k} + \big(m^{\ell}_{ki'\to n}\big)^2 \Delta^{\ell}_{i'n\to k} + \sigma^{\ell}_{ki'\to n}\, \big(\hat{x}^{\ell}_{i'n\to k}\big)^2\Big]. \qquad (33)$$

Therefore we can rewrite the outgoing messages as:


$$\hat{\nu}^{\ell+1}_{kn\to i}(W^{\ell}_{ki}) \propto \int dz\; d\nu^{\ell}_{in\to k}(x^{\ell}_{in})\; d\nu_{\downarrow}(x^{\ell+1}_{kn})\; e^{-\frac{(z-\omega^{\ell}_{kn\to i}-W^{\ell}_{ki} x^{\ell}_{in})^2}{2 V^{\ell}_{kn\to i}}}\; P^{\ell+1}\big(x^{\ell+1}_{kn} \,\big|\, z\big). \qquad (34)$$

We now assume $W^{\ell}_{ki}\, x^{\ell}_{in}$ to be small compared to the other terms. With a second order Taylor
expansion we obtain:

$$\hat{\nu}^{\ell}_{kn\to i}(W^{\ell}_{ki}) \propto \int dz\; d\nu_{\downarrow}(x^{\ell+1}_{kn})\; e^{-\frac{(z-\omega_{kn\to i})^2}{2 V_{kn\to i}}}\; P^{\ell+1}\big(x^{\ell+1}_{kn} \,\big|\, z\big)
\times \left[1 + \frac{z-\omega_{kn\to i}}{V_{kn\to i}}\, \hat{x}^{\ell}_{in\to k} W^{\ell}_{ki} + \frac{(z-\omega_{kn\to i})^2 - V_{kn\to i}}{2 V_{kn\to i}^2}\, \Big(\Delta^{\ell}_{in\to k} + \big(\hat{x}^{\ell}_{in\to k}\big)^2\Big)\big(W^{\ell}_{ki}\big)^2\right]. \qquad (35)$$

Introducing now the function:

$$\phi^{\ell}(B, A, \omega, V) = \log \int dx\, dz\; e^{-\frac{1}{2}Ax^2 + Bx}\; P^{\ell}(x \,|\, z)\; e^{-\frac{(\omega-z)^2}{2V}} \qquad (36)$$

and defining:

$$g^{\ell}_{kn\to i} = \partial_{\omega}\phi^{\ell+1}\big(B^{\ell+1}_{kn}, A^{\ell+1}_{kn}, \omega^{\ell}_{kn\to i}, V^{\ell}_{kn\to i}\big), \qquad (37)$$

$$\Gamma^{\ell}_{kn\to i} = -\partial^2_{\omega}\phi^{\ell+1}\big(B^{\ell+1}_{kn}, A^{\ell+1}_{kn}, \omega^{\ell}_{kn\to i}, V^{\ell}_{kn\to i}\big), \qquad (38)$$

the expansion for the log-message reads:

$$\log \hat{\nu}^{\ell}_{kn\to i}(W^{\ell}_{ki}) \approx \mathrm{const} + \hat{x}^{\ell}_{in\to k}\, g^{\ell}_{kn\to i}\, W^{\ell}_{ki} - \frac{1}{2}\Big[\Big(\Delta^{\ell}_{in\to k} + \big(\hat{x}^{\ell}_{in\to k}\big)^2\Big)\Gamma^{\ell}_{kn\to i} - \Delta^{\ell}_{in\to k}\big(g^{\ell}_{kn\to i}\big)^2\Big]\big(W^{\ell}_{ki}\big)^2. \qquad (39)$$

**Factor-to-variable-x messages** The derivation of these messages is analogous to the factor-to-variable-W
ones in Eq. 26 just reported. The final result for the log-message is:

$$\log \hat{\nu}^{\ell}_{kn\to i}(x^{\ell}_{in}) \approx \mathrm{const} + m^{\ell}_{ki\to n}\, g^{\ell}_{kn\to i}\, x^{\ell}_{in} - \frac{1}{2}\Big[\Big(\sigma^{\ell}_{ki\to n} + \big(m^{\ell}_{ki\to n}\big)^2\Big)\Gamma^{\ell}_{kn\to i} - \sigma^{\ell}_{ki\to n}\big(g^{\ell}_{kn\to i}\big)^2\Big]\big(x^{\ell}_{in}\big)^2. \qquad (40)$$

**Variable-W-to-output-factor messages** The message from variable $W^{\ell}_{ki}$ to the output factor $kn$
reads:

$$\nu^{\ell}_{ki\to n}(W^{\ell}_{ki}) \propto P^{\ell}_{\theta_{ki}}(W^{\ell}_{ki})\; e^{\sum_{n'\neq n}\log \hat{\nu}^{\ell}_{kn'\to i}(W^{\ell}_{ki})}
\approx P^{\ell}_{\theta_{ki}}(W^{\ell}_{ki})\; e^{H^{\ell}_{ki\to n} W^{\ell}_{ki} - \frac{1}{2} G^{\ell}_{ki\to n}(W^{\ell}_{ki})^2}, \qquad (41)$$

where we have defined:

$$H^{\ell}_{ki\to n} = \sum_{n'\neq n} \hat{x}^{\ell}_{in'\to k}\, g^{\ell}_{kn'\to i}, \qquad (42)$$

$$G^{\ell}_{ki\to n} = \sum_{n'\neq n} \Big[\Big(\Delta^{\ell}_{in'\to k} + \big(\hat{x}^{\ell}_{in'\to k}\big)^2\Big)\Gamma^{\ell}_{kn'\to i} - \Delta^{\ell}_{in'\to k}\big(g^{\ell}_{kn'\to i}\big)^2\Big]. \qquad (43)$$

Introducing now the effective free energy:

$$\psi(H, G, \theta) = \log \int dW\; P^{\ell}_{\theta}(W)\; e^{HW - \frac{1}{2}GW^2}, \qquad (44)$$

we can express the first two cumulants of the message $\nu^{\ell}_{ki\to n}(W^{\ell}_{ki})$ as:

$$m^{\ell}_{ki\to n} = \partial_{H}\psi\big(H^{\ell}_{ki\to n}, G^{\ell}_{ki\to n}, \theta_{ki}\big), \qquad (45)$$

$$\sigma^{\ell}_{ki\to n} = \partial^2_{H}\psi\big(H^{\ell}_{ki\to n}, G^{\ell}_{ki\to n}, \theta_{ki}\big). \qquad (46)$$

**Variable-x-to-input-factor messages** We can write the downgoing message as:

$$\nu_{\downarrow}(x^{\ell}_{in}) \propto e^{\sum_{k}\log \hat{\nu}^{\ell}_{kn\to i}(x^{\ell}_{in})} \approx e^{B^{\ell}_{in} x - \frac{1}{2} A^{\ell}_{in} x^2}, \qquad (47)$$

where:

$$B^{\ell}_{in} = \sum_{k} m^{\ell}_{ki\to n}\, g^{\ell}_{kn\to i}, \qquad (48)$$

$$A^{\ell}_{in} = \sum_{k} \Big[\Big(\sigma^{\ell}_{ki\to n} + \big(m^{\ell}_{ki\to n}\big)^2\Big)\Gamma^{\ell}_{kn\to i} - \sigma^{\ell}_{ki\to n}\big(g^{\ell}_{kn\to i}\big)^2\Big]. \qquad (49)$$

**Variable-x-to-output-factor messages** By defining the following cavity quantities:

$$B^{\ell}_{in\to k} = B^{\ell}_{in} - m^{\ell}_{ki\to n}\, g^{\ell}_{kn\to i}, \qquad (50)$$

$$A^{\ell}_{in\to k} = A^{\ell}_{in} - \Big[\Big(\sigma^{\ell}_{ki\to n} + \big(m^{\ell}_{ki\to n}\big)^2\Big)\Gamma^{\ell}_{kn\to i} - \sigma^{\ell}_{ki\to n}\big(g^{\ell}_{kn\to i}\big)^2\Big], \qquad (51)$$

and the following non-cavity ones:

$$\omega^{\ell}_{kn} = \sum_{i} m^{\ell}_{ki\to n}\, \hat{x}^{\ell}_{in\to k}, \qquad (52)$$

$$V^{\ell}_{kn} = \sum_{i} \Big[\sigma^{\ell}_{ki\to n}\, \Delta^{\ell}_{in\to k} + \big(m^{\ell}_{ki\to n}\big)^2 \Delta^{\ell}_{in\to k} + \sigma^{\ell}_{ki\to n}\, \big(\hat{x}^{\ell}_{in\to k}\big)^2\Big], \qquad (53)$$

we can express the first two cumulants of the upgoing messages as:

$$\hat{x}^{\ell}_{in\to k} = \partial_{B}\phi^{\ell}\big(B^{\ell}_{in\to k}, A^{\ell}_{in\to k}, \omega^{\ell-1}_{in}, V^{\ell-1}_{in}\big), \qquad (54)$$

$$\Delta^{\ell}_{in\to k} = \partial^2_{B}\phi^{\ell}\big(B^{\ell}_{in\to k}, A^{\ell}_{in\to k}, \omega^{\ell-1}_{in}, V^{\ell-1}_{in}\big). \qquad (55)$$

**Wrapping it up** Additional but straightforward considerations are required for the input and output
layers ($\ell = 0$ and $\ell = L$ respectively), since they do not receive messages from below and from
above respectively. In the end, thanks to the independence assumptions and the central limit theorem
used throughout the derivation, we arrive at a closed set of equations involving the means and the
variances (or, equivalently, the corresponding natural parameters) of the messages. Within the same
approximation, we also replace the cavity quantities corresponding to variances with their non-cavity
counterparts. Dividing the update equations into a forward and a backward pass, and ordering them
with time indexes in such a way that we have an efficient flow of information, we obtain the set of BP
equations presented in the main text Eqs. (6-17) and in the Appendix Eqs. (60-71).

A.3 BP EQUATIONS

We report here the end result of the derivation of the last section: the complete set of BP equations,
also presented in the main text as Eqs. (6-17).


**Initialization** At $\tau = 0$:

$$B^{\ell,0}_{in\to k} = 0, \qquad (56)$$
$$A^{\ell,0}_{in} = 0, \qquad (57)$$
$$H^{\ell,0}_{ki\to n} = 0, \qquad (58)$$
$$G^{\ell,0}_{ki} = 0. \qquad (59)$$

**Forward Pass** At each $\tau = 1, \dots, \tau_{max}$, for $\ell = 0, \dots, L$:

$$\hat{x}^{\ell,\tau}_{in\to k} = \partial_{B}\phi^{\ell}\big(B^{\ell,\tau-1}_{in\to k}, A^{\ell,\tau-1}_{in}, \omega^{\ell-1,\tau}_{in}, V^{\ell-1,\tau}_{in}\big), \qquad (60)$$
$$\Delta^{\ell,\tau}_{in} = \partial^2_{B}\phi^{\ell}\big(B^{\ell,\tau-1}_{in}, A^{\ell,\tau-1}_{in}, \omega^{\ell-1,\tau}_{in}, V^{\ell-1,\tau}_{in}\big), \qquad (61)$$
$$m^{\ell,\tau}_{ki\to n} = \partial_{H}\psi\big(H^{\ell,\tau-1}_{ki\to n}, G^{\ell,\tau-1}_{ki}, \theta^{\ell}_{ki}\big), \qquad (62)$$
$$\sigma^{\ell,\tau}_{ki} = \partial^2_{H}\psi\big(H^{\ell,\tau-1}_{ki}, G^{\ell,\tau-1}_{ki}, \theta^{\ell}_{ki}\big), \qquad (63)$$
$$V^{\ell,\tau}_{kn} = \sum_{i}\Big[\big(m^{\ell,\tau}_{ki}\big)^2\Delta^{\ell,\tau}_{in} + \sigma^{\ell,\tau}_{ki}\big(\hat{x}^{\ell,\tau}_{in}\big)^2 + \sigma^{\ell,\tau}_{ki}\Delta^{\ell,\tau}_{in}\Big], \qquad (64)$$
$$\omega^{\ell,\tau}_{kn\to i} = \sum_{i'\neq i} m^{\ell,\tau}_{ki'\to n}\, \hat{x}^{\ell,\tau}_{i'n\to k}. \qquad (65)$$

In these equations we slightly abuse notation: for the first layer, $\hat{x}^{\ell=0,\tau}_{n}$ is fixed and
given by the input $x_n$, while $\Delta^{\ell=0,\tau}_{n} = 0$.
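
As an illustration, the non-cavity version of the two equations above (Eqs. 64-65 with the cavity index dropped) amounts to a couple of matrix products; a small numpy sketch with our own array names:

```python
import numpy as np

def forward_moments(m, sigma, xhat, delta):
    """Propagate means/variances of one layer of the factor graph.

    m, sigma    : (K, N) mean and variance of the weights W_ki
    xhat, delta : (N, B) mean and variance of the input activations x_in
    Returns omega, V of shape (K, B), the non-cavity versions of Eqs. 65 and 64.
    """
    omega = m @ xhat
    V = (m ** 2) @ delta + sigma @ (xhat ** 2) + sigma @ delta
    return omega, V
```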

**Backward Pass** For $\tau = 1, \dots, \tau_{max}$, for $\ell = L, \dots, 0$:

$$g^{\ell,\tau}_{kn\to i} = \partial_{\omega}\phi^{\ell+1}\big(B^{\ell+1,\tau}_{kn}, A^{\ell+1,\tau}_{kn}, \omega^{\ell,\tau}_{kn\to i}, V^{\ell,\tau}_{kn}\big), \qquad (66)$$
$$\Gamma^{\ell,\tau}_{kn} = -\partial^2_{\omega}\phi^{\ell+1}\big(B^{\ell+1,\tau}_{kn}, A^{\ell+1,\tau}_{kn}, \omega^{\ell,\tau}_{kn}, V^{\ell,\tau}_{kn}\big), \qquad (67)$$
$$A^{\ell,\tau}_{in} = \sum_{k}\Big[\Big(\big(m^{\ell,\tau}_{ki}\big)^2 + \sigma^{\ell,\tau}_{ki}\Big)\Gamma^{\ell,\tau}_{kn} - \sigma^{\ell,\tau}_{ki}\big(g^{\ell,\tau}_{kn}\big)^2\Big], \qquad (68)$$
$$B^{\ell,\tau}_{in\to k} = \sum_{k'\neq k} m^{\ell,\tau}_{k'i\to n}\, g^{\ell,\tau}_{k'n\to i}, \qquad (69)$$
$$G^{\ell,\tau}_{ki} = \sum_{n}\Big[\Big(\big(\hat{x}^{\ell,\tau}_{in}\big)^2 + \Delta^{\ell,\tau}_{in}\Big)\Gamma^{\ell,\tau}_{kn} - \Delta^{\ell,\tau}_{in}\big(g^{\ell,\tau}_{kn}\big)^2\Big], \qquad (70)$$
$$H^{\ell,\tau}_{ki\to n} = \sum_{n'\neq n} \hat{x}^{\ell,\tau}_{in'\to k}\, g^{\ell,\tau}_{kn'\to i}. \qquad (71)$$

Here too we slightly abuse notation: denoting by $L$ the number of hidden neuron layers, when $\ell = L$
one should use $\phi^{L+1}(y, \omega, V)$ from Eq. 21 instead of $\phi^{L+1}(B, A, \omega, V)$.

A.4 BPI EQUATIONS

The BP-Inspired (BPI) algorithm is obtained as an approximation of BP, replacing some cavity
quantities with their non-cavity counterparts. What we obtain is a generalization of the single-layer
algorithm of Baldassi et al. (2007).


**Forward pass.**

$$\hat{x}^{\ell,\tau}_{in} = \partial_{B}\phi^{\ell}\big(B^{\ell,\tau-1}_{in}, A^{\ell,\tau-1}_{in}, \omega^{\ell-1,\tau}_{in}, V^{\ell-1,\tau}_{in}\big), \qquad (72)$$
$$\Delta^{\ell,\tau}_{in} = \partial^2_{B}\phi^{\ell}\big(B^{\ell,\tau-1}_{in}, A^{\ell,\tau-1}_{in}, \omega^{\ell-1,\tau}_{in}, V^{\ell-1,\tau}_{in}\big), \qquad (73)$$
$$m^{\ell,\tau}_{ki} = \partial_{H}\psi\big(H^{\ell,\tau-1}_{ki}, G^{\ell,\tau-1}_{ki}, \theta^{\ell}_{ki}\big), \qquad (74)$$
$$\sigma^{\ell,\tau}_{ki} = \partial^2_{H}\psi\big(H^{\ell,\tau-1}_{ki}, G^{\ell,\tau-1}_{ki}, \theta^{\ell}_{ki}\big), \qquad (75)$$
$$V^{\ell,\tau}_{kn} = \sum_{i}\Big[\big(m^{\ell,\tau}_{ki}\big)^2\Delta^{\ell,\tau}_{in} + \sigma^{\ell,\tau}_{ki}\big(\hat{x}^{\ell,\tau}_{in}\big)^2 + \sigma^{\ell,\tau}_{ki}\Delta^{\ell,\tau}_{in}\Big], \qquad (76)$$
$$\omega^{\ell,\tau}_{kn} = \sum_{i} m^{\ell,\tau}_{ki}\, \hat{x}^{\ell,\tau}_{in}. \qquad (77)$$

**Backward pass.**

$$g^{\ell,\tau}_{kn\to i} = \partial_{\omega}\phi^{\ell+1}\big(B^{\ell+1,\tau}_{kn}, A^{\ell+1,\tau}_{kn}, \omega^{\ell,\tau}_{kn} - m^{\ell,\tau}_{ki}\hat{x}^{\ell,\tau}_{in}, V^{\ell,\tau}_{kn}\big), \qquad (78)$$
$$\Gamma^{\ell,\tau}_{kn} = -\partial^2_{\omega}\phi^{\ell+1}\big(B^{\ell+1,\tau}_{kn}, A^{\ell+1,\tau}_{kn}, \omega^{\ell,\tau}_{kn}, V^{\ell,\tau}_{kn}\big), \qquad (79)$$
$$A^{\ell,\tau}_{in} = \sum_{k}\Big[\Big(\big(m^{\ell,\tau}_{ki}\big)^2 + \sigma^{\ell,\tau}_{ki}\Big)\Gamma^{\ell,\tau}_{kn} - \sigma^{\ell,\tau}_{ki}\big(g^{\ell,\tau}_{kn}\big)^2\Big], \qquad (80)$$
$$B^{\ell,\tau}_{in} = \sum_{k} m^{\ell,\tau}_{ki}\, g^{\ell,\tau}_{kn\to i}, \qquad (81)$$
$$G^{\ell,\tau}_{ki} = \sum_{n}\Big[\Big(\big(\hat{x}^{\ell,\tau}_{in}\big)^2 + \Delta^{\ell,\tau}_{in}\Big)\Gamma^{\ell,\tau}_{kn} - \Delta^{\ell,\tau}_{in}\big(g^{\ell,\tau}_{kn}\big)^2\Big], \qquad (82)$$
$$H^{\ell,\tau}_{ki} = \sum_{n} \hat{x}^{\ell,\tau}_{in}\, g^{\ell,\tau}_{kn\to i}. \qquad (83)$$

A.5 MF EQUATIONS

The mean-field (MF) equations are obtained as a further simplification of BPI, using only non-cavity
quantities. Although the simplification appears minimal at this point, we empirically observe a
non-negligible discrepancy between the two algorithms in terms of generalization performance and
computational time.

**Forward pass.**

$$\hat{x}^{\ell,\tau}_{in} = \partial_{B}\phi^{\ell}\big(B^{\ell,\tau-1}_{in}, A^{\ell,\tau-1}_{in}, \omega^{\ell-1,\tau}_{in}, V^{\ell-1,\tau}_{in}\big), \qquad (84)$$
$$\Delta^{\ell,\tau}_{in} = \partial^2_{B}\phi^{\ell}\big(B^{\ell,\tau-1}_{in}, A^{\ell,\tau-1}_{in}, \omega^{\ell-1,\tau}_{in}, V^{\ell-1,\tau}_{in}\big), \qquad (85)$$
$$m^{\ell,\tau}_{ki} = \partial_{H}\psi\big(H^{\ell,\tau-1}_{ki}, G^{\ell,\tau-1}_{ki}, \theta^{\ell}_{ki}\big), \qquad (86)$$
$$\sigma^{\ell,\tau}_{ki} = \partial^2_{H}\psi\big(H^{\ell,\tau-1}_{ki}, G^{\ell,\tau-1}_{ki}, \theta^{\ell}_{ki}\big), \qquad (87)$$
$$V^{\ell,\tau}_{kn} = \sum_{i}\Big[\big(m^{\ell,\tau}_{ki}\big)^2\Delta^{\ell,\tau}_{in} + \sigma^{\ell,\tau}_{ki}\big(\hat{x}^{\ell,\tau}_{in}\big)^2 + \sigma^{\ell,\tau}_{ki}\Delta^{\ell,\tau}_{in}\Big], \qquad (88)$$
$$\omega^{\ell,\tau}_{kn} = \sum_{i} m^{\ell,\tau}_{ki}\, \hat{x}^{\ell,\tau}_{in}. \qquad (89)$$

**Backward pass.**

$$g^{\ell,\tau}_{kn} = \partial_{\omega}\phi^{\ell+1}\big(B^{\ell+1,\tau}_{kn}, A^{\ell+1,\tau}_{kn}, \omega^{\ell,\tau}_{kn}, V^{\ell,\tau}_{kn}\big), \qquad (90)$$
$$\Gamma^{\ell,\tau}_{kn} = -\partial^2_{\omega}\phi^{\ell+1}\big(B^{\ell+1,\tau}_{kn}, A^{\ell+1,\tau}_{kn}, \omega^{\ell,\tau}_{kn}, V^{\ell,\tau}_{kn}\big), \qquad (91)$$
$$A^{\ell,\tau}_{in} = \sum_{k}\Big[\Big(\big(m^{\ell,\tau}_{ki}\big)^2 + \sigma^{\ell,\tau}_{ki}\Big)\Gamma^{\ell,\tau}_{kn} - \sigma^{\ell,\tau}_{ki}\big(g^{\ell,\tau}_{kn}\big)^2\Big], \qquad (92)$$
$$B^{\ell,\tau}_{in} = \sum_{k} m^{\ell,\tau}_{ki}\, g^{\ell,\tau}_{kn}, \qquad (93)$$
$$G^{\ell,\tau}_{ki} = \sum_{n}\Big[\Big(\big(\hat{x}^{\ell,\tau}_{in}\big)^2 + \Delta^{\ell,\tau}_{in}\Big)\Gamma^{\ell,\tau}_{kn} - \Delta^{\ell,\tau}_{in}\big(g^{\ell,\tau}_{kn}\big)^2\Big], \qquad (94)$$
$$H^{\ell,\tau}_{ki} = \sum_{n} \hat{x}^{\ell,\tau}_{in}\, g^{\ell,\tau}_{kn}. \qquad (95)$$

A.6 DERIVATION OF THE AMP EQUATIONS

In order to obtain the AMP equations, we approximate cavity quantities with non-cavity ones in the
BP equations Eqs. (60-71) using a first order expansion. We start with the mean activation:


$$\begin{aligned}
\hat{x}^{\ell,\tau}_{in\to k} &= \partial_{B}\phi^{\ell}\big(B^{\ell,\tau-1}_{in} - m^{\ell,\tau-1}_{ki\to n} g^{\ell,\tau-1}_{kn\to i},\, A^{\ell,\tau-1}_{in},\, \omega^{\ell-1,\tau}_{in},\, V^{\ell-1,\tau}_{in}\big) \\
&\approx \partial_{B}\phi^{\ell}\big(B^{\ell,\tau-1}_{in}, A^{\ell,\tau-1}_{in}, \omega^{\ell-1,\tau}_{in}, V^{\ell-1,\tau}_{in}\big) - m^{\ell,\tau-1}_{ki\to n} g^{\ell,\tau-1}_{kn\to i}\, \partial^2_{B}\phi^{\ell}\big(B^{\ell,\tau-1}_{in}, A^{\ell,\tau-1}_{in}, \omega^{\ell-1,\tau}_{in}, V^{\ell-1,\tau}_{in}\big) \\
&\approx \hat{x}^{\ell,\tau}_{in} - m^{\ell,\tau-1}_{ki}\, g^{\ell,\tau-1}_{kn}\, \Delta^{\ell,\tau}_{in}.
\end{aligned} \qquad (96)$$

Analogously, for the weight's mean we have:

$$\begin{aligned}
m^{\ell,\tau}_{ki\to n} &= \partial_{H}\psi\big(H^{\ell,\tau-1}_{ki} - \hat{x}^{\ell,\tau-1}_{in\to k} g^{\ell,\tau-1}_{kn\to i},\, G^{\ell,\tau-1}_{ki},\, \theta^{\ell}_{ki}\big) \\
&\approx \partial_{H}\psi\big(H^{\ell,\tau-1}_{ki}, G^{\ell,\tau-1}_{ki}, \theta^{\ell}_{ki}\big) - \hat{x}^{\ell,\tau-1}_{in\to k} g^{\ell,\tau-1}_{kn\to i}\, \partial^2_{H}\psi\big(H^{\ell,\tau-1}_{ki}, G^{\ell,\tau-1}_{ki}, \theta^{\ell}_{ki}\big) \\
&\approx m^{\ell,\tau}_{ki} - \hat{x}^{\ell,\tau-1}_{in}\, g^{\ell,\tau-1}_{kn}\, \sigma^{\ell,\tau}_{ki}.
\end{aligned} \qquad (97)$$

This brings us to:

$$\omega^{\ell,\tau}_{kn} = \sum_{i} m^{\ell,\tau}_{ki\to n}\, \hat{x}^{\ell,\tau}_{in\to k}
\approx \sum_{i} m^{\ell,\tau}_{ki}\hat{x}^{\ell,\tau}_{in} - g^{\ell,\tau-1}_{kn}\sum_{i}\sigma^{\ell,\tau}_{ki}\hat{x}^{\ell,\tau}_{in}\hat{x}^{\ell,\tau-1}_{in} - g^{\ell,\tau-1}_{kn}\sum_{i} m^{\ell,\tau}_{ki} m^{\ell,\tau-1}_{ki}\Delta^{\ell,\tau}_{in} + \big(g^{\ell,\tau-1}_{kn}\big)^2\sum_{i}\sigma^{\ell,\tau}_{ki} m^{\ell,\tau-1}_{ki}\hat{x}^{\ell,\tau-1}_{in}\Delta^{\ell,\tau}_{in}. \qquad (98)$$

Let us now apply the same procedure to the other set of cavity messages:

$$\begin{aligned}
g^{\ell,\tau}_{kn\to i} &= \partial_{\omega}\phi^{\ell+1}\big(B^{\ell+1,\tau}_{kn}, A^{\ell+1,\tau}_{kn}, \omega^{\ell,\tau}_{kn} - m^{\ell,\tau}_{ki\to n}\hat{x}^{\ell,\tau}_{in\to k}, V^{\ell,\tau}_{kn}\big) \\
&\approx \partial_{\omega}\phi^{\ell+1}\big(B^{\ell+1,\tau}_{kn}, A^{\ell+1,\tau}_{kn}, \omega^{\ell,\tau}_{kn}, V^{\ell,\tau}_{kn}\big) - m^{\ell,\tau}_{ki\to n}\hat{x}^{\ell,\tau}_{in\to k}\, \partial^2_{\omega}\phi^{\ell+1}\big(B^{\ell+1,\tau}_{kn}, A^{\ell+1,\tau}_{kn}, \omega^{\ell,\tau}_{kn}, V^{\ell,\tau}_{kn}\big) \\
&\approx g^{\ell,\tau}_{kn} + m^{\ell,\tau}_{ki}\,\hat{x}^{\ell,\tau}_{in}\,\Gamma^{\ell,\tau}_{kn},
\end{aligned} \qquad (99)$$

$$B^{\ell,\tau}_{in} = \sum_{k} m^{\ell,\tau}_{ki\to n}\, g^{\ell,\tau}_{kn\to i}
\approx \sum_{k} m^{\ell,\tau}_{ki} g^{\ell,\tau}_{kn} - \hat{x}^{\ell,\tau}_{in}\sum_{k}\big(g^{\ell,\tau}_{kn}\big)^2\sigma^{\ell,\tau}_{ki} + \hat{x}^{\ell,\tau}_{in}\sum_{k}\big(m^{\ell,\tau}_{ki}\big)^2\Gamma^{\ell,\tau}_{kn} - \big(\hat{x}^{\ell,\tau}_{in}\big)^2\sum_{k}\sigma^{\ell,\tau}_{ki} m^{\ell,\tau}_{ki} g^{\ell,\tau}_{kn}\Gamma^{\ell,\tau}_{kn}, \qquad (100)$$

$$H^{\ell,\tau}_{ki} = \sum_{n} \hat{x}^{\ell,\tau}_{in\to k}\, g^{\ell,\tau}_{kn\to i}
\approx \sum_{n} \hat{x}^{\ell,\tau}_{in} g^{\ell,\tau}_{kn} + m^{\ell,\tau}_{ki}\sum_{n}\big(\hat{x}^{\ell,\tau}_{in}\big)^2\Gamma^{\ell,\tau}_{kn} - m^{\ell,\tau}_{ki}\sum_{n}\big(g^{\ell,\tau}_{kn}\big)^2\Delta^{\ell,\tau}_{in} - \big(m^{\ell,\tau}_{ki}\big)^2\sum_{n} g^{\ell,\tau}_{kn}\Gamma^{\ell,\tau}_{kn}\Delta^{\ell,\tau}_{in}\hat{x}^{\ell,\tau}_{in}. \qquad (101)$$

We are now able to write down the full AMP equations, that we present in the next section.


A.7 AMP EQUATIONS

In summary, in the last section we derived the AMP algorithm as a closure of the BP message passing
over non-cavity quantities, relying on some statistical assumptions on messages and interactions.
With respect to the MF message passing, we find some additional terms that go under the name of
Onsager corrections. In-depth overviews of the AMP (also known as Thouless-Anderson-Palmer
(TAP)) approach can be found in Refs. (Zdeborová & Krzakala, 2016; Mézard, 2017; Gabrié, 2020).
The final form of the AMP equations for the multi-layer perceptron is given below.

**Initialization** At $\tau = 0$:

$$B^{\ell,0}_{in} = 0, \qquad (102)$$
$$A^{\ell,0}_{in} = 0, \qquad (103)$$
$$H^{\ell,0}_{ki} = 0 \ \text{(or some other values)}, \qquad (104)$$
$$G^{\ell,0}_{ki} = 0 \ \text{(or some other values)}, \qquad (105)$$
$$g^{\ell,0}_{kn} = 0. \qquad (106)$$

**Forward Pass** At each $\tau = 1, \dots, \tau_{max}$, for $\ell = 0, \dots, L$:

$$\hat{x}^{\ell,\tau}_{in} = \partial_{B}\phi^{\ell}\big(B^{\ell,\tau-1}_{in}, A^{\ell,\tau-1}_{in}, \omega^{\ell-1,\tau}_{in}, V^{\ell-1,\tau}_{in}\big), \qquad (107)$$
$$\Delta^{\ell,\tau}_{in} = \partial^2_{B}\phi^{\ell}\big(B^{\ell,\tau-1}_{in}, A^{\ell,\tau-1}_{in}, \omega^{\ell-1,\tau}_{in}, V^{\ell-1,\tau}_{in}\big), \qquad (108)$$
$$m^{\ell,\tau}_{ki} = \partial_{H}\psi\big(H^{\ell,\tau-1}_{ki}, G^{\ell,\tau-1}_{ki}, \theta^{\ell}_{ki}\big), \qquad (109)$$
$$\sigma^{\ell,\tau}_{ki} = \partial^2_{H}\psi\big(H^{\ell,\tau-1}_{ki}, G^{\ell,\tau-1}_{ki}, \theta^{\ell}_{ki}\big), \qquad (110)$$
$$V^{\ell,\tau}_{kn} = \sum_{i}\Big[\big(m^{\ell,\tau}_{ki}\big)^2\Delta^{\ell,\tau}_{in} + \sigma^{\ell,\tau}_{ki}\big(\hat{x}^{\ell,\tau}_{in}\big)^2 + \sigma^{\ell,\tau}_{ki}\Delta^{\ell,\tau}_{in}\Big], \qquad (111)$$
$$\omega^{\ell,\tau}_{kn} = \sum_{i} m^{\ell,\tau}_{ki}\hat{x}^{\ell,\tau}_{in} - g^{\ell,\tau-1}_{kn}\sum_{i}\sigma^{\ell,\tau}_{ki}\hat{x}^{\ell,\tau}_{in}\hat{x}^{\ell,\tau-1}_{in} - g^{\ell,\tau-1}_{kn}\sum_{i} m^{\ell,\tau}_{ki} m^{\ell,\tau-1}_{ki}\Delta^{\ell,\tau}_{in} + \big(g^{\ell,\tau-1}_{kn}\big)^2\sum_{i}\sigma^{\ell,\tau}_{ki} m^{\ell,\tau-1}_{ki}\hat{x}^{\ell,\tau-1}_{in}\Delta^{\ell,\tau}_{in}. \qquad (112)$$

**Backward Pass**

$$g^{\ell,\tau}_{kn} = \partial_{\omega}\phi^{\ell+1}\big(B^{\ell+1,\tau}_{kn}, A^{\ell+1,\tau}_{kn}, \omega^{\ell,\tau}_{kn}, V^{\ell,\tau}_{kn}\big), \qquad (113)$$
$$\Gamma^{\ell,\tau}_{kn} = -\partial^2_{\omega}\phi^{\ell+1}\big(B^{\ell+1,\tau}_{kn}, A^{\ell+1,\tau}_{kn}, \omega^{\ell,\tau}_{kn}, V^{\ell,\tau}_{kn}\big), \qquad (114)$$
$$A^{\ell,\tau}_{in} = \sum_{k}\Big[\Big(\big(m^{\ell,\tau}_{ki}\big)^2 + \sigma^{\ell,\tau}_{ki}\Big)\Gamma^{\ell,\tau}_{kn} - \sigma^{\ell,\tau}_{ki}\big(g^{\ell,\tau}_{kn}\big)^2\Big], \qquad (115)$$
$$B^{\ell,\tau}_{in} = \sum_{k} m^{\ell,\tau}_{ki} g^{\ell,\tau}_{kn} - \hat{x}^{\ell,\tau}_{in}\sum_{k}\big(g^{\ell,\tau}_{kn}\big)^2\sigma^{\ell,\tau}_{ki} + \hat{x}^{\ell,\tau}_{in}\sum_{k}\big(m^{\ell,\tau}_{ki}\big)^2\Gamma^{\ell,\tau}_{kn} - \big(\hat{x}^{\ell,\tau}_{in}\big)^2\sum_{k}\sigma^{\ell,\tau}_{ki} m^{\ell,\tau}_{ki} g^{\ell,\tau}_{kn}\Gamma^{\ell,\tau}_{kn}, \qquad (116)$$
$$G^{\ell,\tau}_{ki} = \sum_{n}\Big[\Big(\big(\hat{x}^{\ell,\tau}_{in}\big)^2 + \Delta^{\ell,\tau}_{in}\Big)\Gamma^{\ell,\tau}_{kn} - \Delta^{\ell,\tau}_{in}\big(g^{\ell,\tau}_{kn}\big)^2\Big], \qquad (117)$$
$$H^{\ell,\tau}_{ki} = \sum_{n}\hat{x}^{\ell,\tau}_{in} g^{\ell,\tau}_{kn} + m^{\ell,\tau}_{ki}\sum_{n}\big(\hat{x}^{\ell,\tau}_{in}\big)^2\Gamma^{\ell,\tau}_{kn} - m^{\ell,\tau}_{ki}\sum_{n}\big(g^{\ell,\tau}_{kn}\big)^2\Delta^{\ell,\tau}_{in} - \big(m^{\ell,\tau}_{ki}\big)^2\sum_{n} g^{\ell,\tau}_{kn}\Gamma^{\ell,\tau}_{kn}\Delta^{\ell,\tau}_{in}\hat{x}^{\ell,\tau}_{in}. \qquad (118)$$

A.8 ACTIVATION FUNCTIONS

A.8.1 SIGN

In most of our experiments we use sign activations in each layer. With this choice, the neuron's free
energy 19 takes the form

$$\phi(B, A, \omega, V) = \log \sum_{x \in \{-1,+1\}} e^{Bx}\, H\!\left(-\frac{x\,\omega}{\sqrt{V}}\right) + \frac{1}{2}\log(2\pi V), \qquad (119)$$

where

$$H(x) = \frac{1}{2}\,\mathrm{erfc}\!\left(\frac{x}{\sqrt{2}}\right). \qquad (120)$$

 

Notice that for sign activations the messages A can be dropped.
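
For completeness, a small numerical sketch (our own helper names) of Eq. 119 with the $A$-dependence dropped: the activation mean $\hat{x} = \partial_B\phi$ reduces to a $\tanh$ of $B$ shifted by a log-ratio of Gaussian tail probabilities.

```python
import numpy as np
from scipy.special import log_ndtr  # log of the standard normal CDF, numerically stable

def log_H(x):
    """log H(x), with H(x) = (1/2) erfc(x/sqrt(2)) = P(standard normal > x)."""
    return log_ndtr(-x)

def sign_neuron_moments(B, omega, V):
    """Mean and variance of a sign-activated neuron (cf. Eq. 119, A dropped).

    phi = log[ e^B H(-u) + e^{-B} H(u) ] + const, with u = omega / sqrt(V),
    so xhat = d(phi)/dB = tanh(B + (1/2)(log H(-u) - log H(u))).
    """
    u = omega / np.sqrt(V)
    xhat = np.tanh(B + 0.5 * (log_H(-u) - log_H(u)))
    delta = 1.0 - xhat ** 2   # variance of a +/-1-valued activation
    return xhat, delta
```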


A.8.2 RELU

For ReLU$(x) = \max(0, x)$ activations the free energy 19 becomes

$$\phi(B, A, \omega, V) = \log \int dx\, dz\; e^{-\frac{1}{2}Ax^2 + Bx}\; \delta\big(x - \max(0, z)\big)\; e^{-\frac{(\omega-z)^2}{2V}} \qquad (121)$$

$$= \log\left[\, H\!\left(\frac{\omega}{\sqrt{V}}\right) + \frac{\mathcal{N}\!\big(\omega;\; B/A,\; V + 1/A\big)}{\mathcal{N}\!\big(B/A;\; 0,\; 1/A\big)}\; H\!\left(-\frac{BV + \omega}{\sqrt{V + AV^2}}\right)\right] + \frac{1}{2}\log(2\pi V), \qquad (122)$$

where

$$\mathcal{N}(x;\, \mu, \Sigma) = \frac{1}{\sqrt{2\pi\Sigma}}\; e^{-\frac{(x-\mu)^2}{2\Sigma}}. \qquad (123)$$

A.9 THE ARGMAX LAYER

In order to perform multi-class classification, we have to perform an argmax operation on the last
layer of the neural network. Call $z_k$, for $k = 1, \dots, K$, the Gaussian random variables at the output
of the last layer of the network in correspondence of some input $x$. Assuming the correct label is class
$k^*$, the effective partition function $Z_{k^*}$ corresponding to the output constraint reads:


[Figure 4 plots: train and test error (%) vs. epochs, with BP train and BP test curves for the two ArgMax layer versions (left and right panels).]

Figure 4: MLP with 2 hidden layers of 101 hidden units each, batch-size 128, on the Fashion-MNIST
dataset. In the first two layers we use the BP equations, while in the last layer the ArgMax ones.
(Left) ArgMax layer, first version; (Right) ArgMax layer, second version. Even if it is possible to
reach similar accuracies with the two versions, we use the first one since it is simpler.


$$Z_{k^*} = \int \prod_{k} dz_k\; \mathcal{N}(z_k;\, \omega_k, V_k) \prod_{k\neq k^*} \Theta(z_{k^*} - z_k) \qquad (124)$$

$$= \int dz_{k^*}\; \mathcal{N}(z_{k^*};\, \omega_{k^*}, V_{k^*}) \prod_{k\neq k^*} H\!\left(-\frac{z_{k^*} - \omega_k}{\sqrt{V_k}}\right). \qquad (125)$$


Here $\Theta(x)$ is the Heaviside step function and we used the definition of $H$ from Eq. 120. The
integral on the last line cannot be expressed analytically, therefore we have to resort to approximations.
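
Before introducing the approximations, note that $Z_{k^*}$ is simply the probability that independent Gaussians $z_k \sim \mathcal{N}(\omega_k, V_k)$ attain their maximum at $k^*$, so it can be estimated by plain Monte Carlo, which is convenient for sanity-checking the two approximations below; a short sketch (ours):

```python
import numpy as np

def argmax_partition_mc(omega, V, k_star, n_samples=100_000, seed=0):
    """Monte Carlo estimate of Z_{k*} = P(argmax_k z_k = k*), z_k ~ N(omega_k, V_k) independent."""
    rng = np.random.default_rng(seed)
    z = omega + np.sqrt(V) * rng.standard_normal((n_samples, len(omega)))
    return float(np.mean(np.argmax(z, axis=1) == k_star))
```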

A.9.1 APPROACH 1: JENSEN INEQUALITY


Using the Jensen inequality we obtain:

$$\varphi_{k^*} = \log Z_{k^*} = \log \mathbb{E}_{z\sim\mathcal{N}(\omega_{k^*}, V_{k^*})} \prod_{k\neq k^*} H\!\left(-\frac{z - \omega_k}{\sqrt{V_k}}\right) \qquad (126)$$

$$\geq\; \mathbb{E}_{z\sim\mathcal{N}(\omega_{k^*}, V_{k^*})} \sum_{k\neq k^*} \log H\!\left(-\frac{z - \omega_k}{\sqrt{V_k}}\right). \qquad (127)$$

Reparameterizing the expectation we have:

$$\tilde{\varphi}_{k^*} = \mathbb{E}_{\epsilon\sim\mathcal{N}(0,1)} \sum_{k\neq k^*} \log H\!\left(-\frac{\omega_{k^*} + \epsilon\sqrt{V_{k^*}} - \omega_k}{\sqrt{V_k}}\right). \qquad (128)$$

The derivatives $\partial_{\omega_k}\tilde{\varphi}_{k^*}$ and $\partial^2_{\omega_k}\tilde{\varphi}_{k^*}$ that we need can then be estimated by sampling $\epsilon$ (once):

$$\partial_{\omega_k}\tilde{\varphi}_{k^*} =
\begin{cases}
-\dfrac{1}{\sqrt{V_k}}\, \mathbb{E}_{\epsilon\sim\mathcal{N}(0,1)}\, \mathcal{K}\!\left(-\dfrac{\omega_{k^*} + \epsilon\sqrt{V_{k^*}} - \omega_k}{\sqrt{V_k}}\right) & k \neq k^* \\[2ex]
\displaystyle\sum_{k'\neq k^*} \dfrac{1}{\sqrt{V_{k'}}}\, \mathbb{E}_{\epsilon\sim\mathcal{N}(0,1)}\, \mathcal{K}\!\left(-\dfrac{\omega_{k^*} + \epsilon\sqrt{V_{k^*}} - \omega_{k'}}{\sqrt{V_{k'}}}\right) & k = k^*
\end{cases} \qquad (129)$$

where we have defined:

$$\mathcal{K}(x) = \frac{\mathcal{N}(x)}{H(x)} = \frac{\sqrt{2/\pi}}{\mathrm{erfcx}\big(x/\sqrt{2}\big)}. \qquad (130)$$
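
Numerically, $\mathcal{K}$ is conveniently evaluated through the scaled complementary error function; a one-line sketch (ours):

```python
import numpy as np
from scipy.special import erfcx  # scaled complementary error function, exp(x^2) * erfc(x)

def K(x):
    """K(x) = N(x) / H(x) = sqrt(2/pi) / erfcx(x / sqrt(2)), cf. Eq. 130."""
    return np.sqrt(2.0 / np.pi) / erfcx(x / np.sqrt(2.0))
```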


A.9.2 APPROACH 2: JENSEN AGAIN

A further simplification is obtained by applying the Jensen inequality again to 128, but in the opposite
direction; we therefore give up having a bound and look only for an approximation. We have the new
effective free energy:

$$\tilde{\varphi}_{k^*} = \sum_{k\neq k^*} \log \mathbb{E}_{\epsilon\sim\mathcal{N}(0,1)}\, H\!\left(-\frac{\omega_{k^*} + \epsilon\sqrt{V_{k^*}} - \omega_k}{\sqrt{V_k}}\right) \qquad (131)$$

$$= \sum_{k\neq k^*} \log H\!\left(-\frac{\omega_{k^*} - \omega_k}{\sqrt{V_k + V_{k^*}}}\right). \qquad (132)$$

This gives:

$$\partial_{\omega_k}\tilde{\varphi}_{k^*} =
\begin{cases}
-\dfrac{1}{\sqrt{V_k + V_{k^*}}}\, \mathcal{K}\!\left(-\dfrac{\omega_{k^*} - \omega_k}{\sqrt{V_k + V_{k^*}}}\right) & k \neq k^* \\[2ex]
\displaystyle\sum_{k'\neq k^*} \dfrac{1}{\sqrt{V_{k'} + V_{k^*}}}\, \mathcal{K}\!\left(-\dfrac{\omega_{k^*} - \omega_{k'}}{\sqrt{V_{k'} + V_{k^*}}}\right) & k = k^*
\end{cases} \qquad (133)$$

Notice that $\partial_{\omega_{k^*}}\tilde{\varphi}_{k^*} = -\sum_{k\neq k^*}\partial_{\omega_k}\tilde{\varphi}_{k^*}$. In the last formulas we used the definition of $\mathcal{K}$ in Eq. 130.

We show in Fig. 4 the negligible difference between the two ArgMax versions when using BP on the
layers before the last one (which performs only the ArgMax).


B EXPERIMENTAL DETAILS

B.1 HYPER-PARAMETERS OF THE BP-BASED SCHEME

We include here a complete list of the hyper-parameters present in the BP-based algorithms. Notice
that, as in SGD-type algorithms, many of them can be fixed, or it is possible to find a prescription for
their value that works in most cases. However, we expect future research to find even more effective
values of the hyper-parameters, in the same way it has been done for SGD. These hyper-parameters
are: the mini-batch size bs; the parameter ρ (which has to be tuned similarly to the learning rate in
SGD); the damping parameter α (which performs a running smoothing of the BP fields along the
dynamics by adding a fraction of the field at the previous iteration, see Eqs. (134, 135)); the
initialization coefficient ϵ that we use to sample the parameters of our prior distribution $q_\theta(\mathcal{W})$
according to $\theta^{\ell,t=0}_{ki} \sim \epsilon\,\mathcal{N}(0, 1)$ (different choices of ϵ correspond to different initial
distributions of the weights' magnetizations $m^{\ell}_{ki} = \tanh(\theta^{\ell}_{ki})$, as shown in Fig. 5); the number of
internal steps of reinforcement τmax and the associated intensity of the internal reinforcement r. The
performance of the BP-based algorithms is robust in a reasonable range of these hyper-parameters. A
more principled choice of a good initialization condition could be made by adapting the technique from
Stamatescu et al. (2020).
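
For example, the initialization just described can be sampled as follows (hypothetical layer shape, our own variable names):

```python
import numpy as np

rng = np.random.default_rng(0)
K, N = 101, 784                              # example layer shape (hypothetical)
eps = 1.0                                    # initialization coefficient
theta0 = eps * rng.standard_normal((K, N))   # theta_{ki}^{l, t=0} ~ eps * N(0, 1)
m0 = np.tanh(theta0)                         # initial weight magnetizations, cf. Fig. 5
```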

Notice that among these parameters, the BP dynamics at each layer is mostly sensitive to ρ and α,
so that in general we consider them layer-dependent. See Sec. B.8 for details on the effect of these
parameters on the learning dynamics and on layer polarization (i.e. how the BP dynamics tends to
bias the weights towards a single point-wise configuration with high probability). Unless otherwise
stated we fix some of the hyper-parameters, in particular: bs = 128 (results are consistent with other
values of the batch-size, from bs = 1 up to bs = 1024 in our experiments), ϵ = 1.0, τmax = 1, r = 0.

B.2 DAMPING SCHEME FOR THE MESSAGE PASSING

We use a damping parameter α ∈ (0, 1) to stabilize the training, changing the update rule for the
weights' means as follows:

$$\tilde{m}^{\ell,\tau}_{ki} = \partial_{H}\psi\big(H^{\ell,\tau-1}_{ki}, G^{\ell,\tau-1}_{ki}, \theta^{\ell}_{ki}\big), \qquad (134)$$

$$m^{\ell,\tau}_{ki} = \alpha\, m^{\ell,\tau-1}_{ki} + (1 - \alpha)\, \tilde{m}^{\ell,\tau}_{ki}. \qquad (135)$$
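
In code this is a plain exponential moving average of the new estimate towards the previous one; a minimal sketch (ours), specialized to binary weights where $\partial_H\psi = \tanh(H + \theta)$:

```python
import numpy as np

def damped_mean_update(m_prev, H, theta, alpha):
    """Eqs. 134-135: compute the new weight mean from the fields, then damp it with factor alpha."""
    m_tilde = np.tanh(H + theta)                    # Eq. 134 for +/-1 weights
    return alpha * m_prev + (1.0 - alpha) * m_tilde  # Eq. 135
```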


[Figure 5 plot: P(m) vs. m for ϵ = 0.1, 0.5, 1.0, 1.5.]

Figure 5: Initial distribution of the magnetizations when varying the parameter ϵ. The initial distribution is
more concentrated around ±1 as ϵ increases (i.e. it is more bimodal and the initial configuration is
more polarized).

B.3 ARCHITECTURES

In the experiments in which we vary the architecture (see Sec. 4.1), all simulations of the BP-based
algorithms use a number of internal reinforcement iterations τmax = 1. Learning is performed on the
totality of the training dataset, the batch-size is bs = 128, and the initialization coefficient is ϵ = 1.0.

For all architectures and all BP approximations, we use α = 0.8 for each layer, except for the
501-501-501 MLP, for which we use α = (0.1, 0.1, 0.1, 0.9). Concerning the parameter ρ, we use
ρ = 0.9 on the last layer for all architectures and BP approximations. On the other layers we use: for
the 101-101 and the 501-501 MLPs, ρ = 1.0001 for all BP approximations; for the 101-101-101 MLP,
ρ = 1.0 for BP and AMP while ρ = 1.001 for MF; for the 501-501-501 MLP, ρ = 1.0001 for all BP
approximations. For the BinaryNet simulations, the learning rate is lr = 10.0 for all MLP architectures,
giving the best performance among the learning rates we have tested, lr = 100, 10, 1, 0.1, 0.001.

We notice that, while we need some tuning of the hyper-parameters to reach the performance of
BinaryNet, it is possible to fix them across datasets and architectures (e.g. ρ = 1 and α = 0.8 on
each layer) without in general losing more than 20% (relative) of the generalization performance,
demonstrating that the BP-based algorithms are effective for learning even with minimal
hyper-parameter tuning.

The experiments on the Bayesian error are performed on an MLP with 2 hidden layers of 101 units
on the MNIST dataset (binary classification). Learning is performed on the totality of the training
dataset, the batch-size is bs = 128, and the initialization coefficient is ϵ = 1.0. In order to find the
pointwise configurations we use α = 0.8 on each layer and ρ = (1.0001, 1.0001, 0.9), while to find
the Bayesian ones we use α = 0.8 on each layer and ρ = (0.9999, 0.9999, 0.9) (these values prevent
an excessive polarization of the network towards a particular pointwise configuration).

For the continual learning task (see Sec. 4.5) we fixed ρ = 1 and α = 0.8 on each layer, as we
empirically observed that polarizing the last layer helps mitigate forgetting while leaving the
single-task performance almost unchanged.

In Fig. 6 we report training curves on architectures different from the ones reported in the main paper.

[Figure 6 panels: train and test error (%) vs. epochs, with train and test curves for BP, AMP, MF, and BinaryNet.]

Figure 6: Training curves of message passing algorithms compared with BinaryNet on the Fashion-MNIST
dataset (multi-class classification) with a binary MLP with 3 hidden layers of 501 units. The
batch-size is 128 and curves are averaged over 5 realizations of the initial conditions.

B.4 VARYING THE DATASET


When varying the dataset (see Sec. 4.3), all simulations of the BP-based algorithms use a number
of internal reinforcement iterations τmax = 1. Learning is performed on the totality of the training
dataset, the batch-size is bs = 128, and the initialization coefficient is ϵ = 1.0. For all datasets (MNIST
(2 classes), FashionMNIST (2 classes), CIFAR-10 (2 classes), MNIST, FashionMNIST, CIFAR-10)
and all algorithms (BP, AMP, MF) we use ρ = (1.0001, 1.0001, 0.9) and α = 0.8 for each layer.
Using in the first layers values of ρ = 1 + ϵ with ϵ ≥ 0 sufficiently small typically leads to good
results.

For the BinaryNet simulations, the learning rate is lr = 10.0 (both for binary classification and
multi-class classification), giving the best performance among the learning rates we have tested,
lr = 100, 10, 1, 0.1, 0.001. In Tab. 2 we report the final train errors obtained on the different datasets.

Dataset                      BinaryNet      BP             AMP            MF

**MNIST (2 classes)**        0.05 ± 0.05    0.0 ± 0.0      0.0 ± 0.0      0.0 ± 0.0
**FashionMNIST (2 classes)** 0.3 ± 0.1      0.06 ± 0.01    0.06 ± 0.01    0.09 ± 0.01
**CIFAR10 (2 classes)**      1.2 ± 0.5      0.37 ± 0.01    0.4 ± 0.1      0.9 ± 0.2
**MNIST**                    0.09 ± 0.01    0.12 ± 0.01    0.12 ± 0.01    0.03 ± 0.01
**FashionMNIST**             4.0 ± 0.5      3.4 ± 0.1      3.7 ± 0.1      2.5 ± 0.2
**CIFAR10**                  13.0 ± 0.9     4.7 ± 0.1      4.7 ± 0.2      9.2 ± 0.5


Table 2: Train error (%) of a multilayer perceptron with 2 hidden layers of 501 units each for
BinaryNet (baseline), BP, AMP and MF on the datasets listed above. All algorithms are trained with
batch-size 128 and for 100 epochs. Mean and standard deviations are calculated over 5 random
initializations.

B.5 LOCAL ENERGY


We adapt the notion of flatness used in (Jiang et al., 2020; Pittorino et al., 2021), which we call local
energy, to configurations with binary weights. Given a weight configuration $w \in \{\pm 1\}^N$, we define
the local energy $\delta E_{\mathrm{train}}(w, p)$ as the average difference in training error $E_{\mathrm{train}}(w)$ when perturbing $w$
by flipping a random fraction $p$ of its elements:

$$\delta E_{\mathrm{train}}(w, p) = \mathbb{E}_{z}\, E_{\mathrm{train}}(w \odot z) - E_{\mathrm{train}}(w), \qquad (136)$$

where $\odot$ denotes the Hadamard (element-wise) product and the expectation is over i.i.d. entries of $z$
equal to $-1$ with probability $p$ and to $+1$ with probability $1 - p$. We report the resulting local energy
profiles (in a range $[0, p_{max}]$) in Fig. 7 for BP and BinaryNet. The relative error grows slowly when
perturbing the trained configurations (notice the convexity of the curves). This shows that both
BP-based and SGD-based algorithms find configurations that lie in relatively flat minima of the energy
landscape. The same qualitative phenomenon holds for different architectures and datasets.
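
A direct estimate of Eq. 136 only requires flipping a random fraction of the weights and re-evaluating the training error; a short numpy sketch (our helper names; `train_error` stands for the error of the network with the given binary weights):

```python
import numpy as np

def local_energy(w, p, train_error, n_perturbations=10, seed=0):
    """delta E_train(w, p): average increase in train error when flipping a fraction p of w (Eq. 136)."""
    rng = np.random.default_rng(seed)
    base = train_error(w)
    deltas = []
    for _ in range(n_perturbations):
        z = np.where(rng.random(w.shape) < p, -1.0, 1.0)  # i.i.d. entries: -1 w.p. p, +1 w.p. 1-p
        deltas.append(train_error(w * z) - base)          # w * z is the Hadamard product w ⊙ z
    return float(np.mean(deltas))
```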


[Figure 7 plot: local energy δE_train vs. flip probability p, for BinaryNet and BP.]

Figure 7: Local energy curve of the point-wise configuration found by the BP algorithm compared
with BinaryNet on a MLP with 2 hidden layers of 101 units on the 2-class MNIST dataset.


B.6 SGD IMPLEMENTATION (BINARYNET)

We compare the BP-based algorithms with SGD training for neural networks with binary weights
and activations, as introduced in BinaryNet (Hubara et al., 2016). This procedure consists of keeping
a continuous version of the parameters w, which is updated with the SGD rule, with the gradient
calculated on the binarized configuration wb = sign(w). At inference time the forward pass is
calculated with the parameters wb. The backward pass with binary activations is performed with the
so-called straight-through estimator.
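
A minimal numpy sketch of one such update for a single binary linear layer (our own simplification: no biases or batch norm, a squared loss instead of the cross-entropy, and the $1/\sqrt{N}$ rescaling described in the next paragraph):

```python
import numpy as np

def binarynet_step(w, x, y, lr):
    """One SGD step on a binary linear layer with the straight-through estimator.

    w : (K, N) continuous "latent" weights; the forward pass uses wb = sign(w).
    x : (N, B) inputs, y : (K, B) targets; a mean squared loss is used here for brevity.
    """
    wb = np.sign(w)                         # binarized weights used in the forward pass
    out = (wb @ x) / np.sqrt(x.shape[0])    # pre-activations rescaled by 1/sqrt(N)
    grad_out = (out - y) / x.shape[1]       # gradient of the mean squared loss w.r.t. the output
    grad_wb = (grad_out @ x.T) / np.sqrt(x.shape[0])
    w = w - lr * grad_wb                    # straight-through: gradient w.r.t. wb applied to w
    return np.clip(w, -1.0, 1.0)            # clip the continuous weights to [-1, 1]
```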

Our implementation presents some differences with respect to the original proposal of the algorithm
in (Hubara et al., 2016), in order to keep the comparison with the BP-based algorithms as fair as
possible, in particular for what concerns the number of parameters. We do not use biases nor batch
normalization layers; therefore, in order to keep the pre-activations of each hidden layer normalized,
we rescale them by $1/\sqrt{N}$, where N is the size of the previous layer (or the input size in the case of
the pre-activations afferent to the first hidden layer). The standard SGD update rule is applied (instead
of Adam), and we use the binary cross-entropy loss. Clipping of the continuous configuration w in
[−1, 1] is applied. We use Xavier initialization (Glorot & Bengio, 2010) for the continuous weights.
In Fig. 2 of the main paper, we apply the Adam optimization rule instead, noticing that it performs
slightly better in train and test generalization performance compared to pure SGD.


B.7 EBP IMPLEMENTATION

Expectation Back Propagation (EBP) (Soudry et al., 2014b) is a parameter-free Bayesian algorithm
that uses a mean-field (MF) approximation (a fully factorized form for the posterior) in an online
setting to estimate the Bayesian posterior distribution after the arrival of a new data point.
The main difference between EBP and our approach lies in the approximation for the posterior
distribution. Moreover, we explicitly base the estimation of the marginals on the local high entropy
structure. The fact that EBP works has no clear explanation: certainly it cannot be that the MF
assumption holds for multi-layer neural networks. Still, it is certainly very interesting that it works.
We argue that it might work precisely by virtue of the existence of high local entropy minima, and we
expect it to give similar performance to the MF case of our algorithm. The online iteration could in
fact be seen as a way of implementing a reinforcement.


[Figure 8 panels: train and test error (%) vs. epochs for BP, AMP, MF, and BinaryNet; mean polarization ⟨q0⟩ and mean overlap ⟨qab⟩ vs. epochs for each of the three layers, for BP, AMP, and MF.]

Figure 8: (Right panels) Polarizations ⟨q0⟩ and overlaps ⟨qab⟩ on each layer of an MLP with 2 hidden
layers of 501 units on the Fashion-MNIST dataset (multi-class); the batch-size is bs = 128. (Left)
Corresponding train and test error curves.

We implemented the EBP code along the lines of the original Matlab implementation
(https://github.com/ExpectationBackpropagation/EBP_Matlab_Code). In order to perform a fair
comparison we removed the biases both in the binary and continuous weights versions. It is worth
noticing that we faced numerical issues when training with moderate to large batch-sizes; all the
experiments were consequently limited to a batch-size of 10 patterns.

B.8 UNIT POLARIZATION AND OVERLAPS

We define the self-overlap or polarization of a given unit $a$ as $q_0^a = \frac{1}{N}\sum_i (w_i^a)^2$, where $N$ is the
number of parameters of the unit and $\{w_i^a\}_{i=1}^{N}$ its weights. It quantifies how much the unit is polarized
towards a unique point-wise binary configuration ($q_0^a = 1$ corresponding to high confidence in a
given configuration, $q_0^a = 0$ to low confidence). The overlap between two units $a$ and $b$ (considered in
the same layer) is $q_{ab} = \frac{1}{N}\sum_i w_i^a w_i^b$. The number of parameters $N$ is the same for units belonging
to the same fully connected layer. We denote by $\langle q_0\rangle = \frac{1}{N_{out}}\sum_{a=1}^{N_{out}} q_0^a$ and
$\langle q_{ab}\rangle = \frac{1}{N_{out}}\sum_{a<b} q_{ab}$ the mean polarization and mean overlap in a given layer (where $N_{out}$ is the
number of units in the layer).
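
These layer-wise diagnostics are straightforward to compute; a compact numpy sketch (ours; `W` can hold either the binarized weights or the magnetizations $m$):

```python
import numpy as np

def polarization_and_overlap(W):
    """Mean polarization <q0> and mean overlap <qab> of one fully connected layer.

    W : (N_out, N) array with one row per unit of the layer.
    """
    n_out, n = W.shape
    q0 = np.mean(W ** 2, axis=1)             # q0^a = (1/N) sum_i (w_i^a)^2
    overlaps = (W @ W.T) / n                 # q_ab for every pair of units
    iu = np.triu_indices(n_out, k=1)         # keep each pair a < b once
    return q0.mean(), overlaps[iu].sum() / n_out   # normalizations as in the definitions above
```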

The parameters ρ and α govern the dynamical evolution of the polarization of each layer during
training. A value ρ ⪆ 1 has the effect of progressively increasing the unit polarizations during training,
while ρ < 1 disfavours it. The damping α, which takes values in [0, 1], has the effect of slowing down
the dynamics by a smoothing process (whose intensity depends on the value of α), generically
favoring convergence. Given the nature of the updates in Algorithm 1, each layer has its own
dynamics, determined by the values of ρℓ and αℓ at layer ℓ, which in general can differ from each other.

We find that it is beneficial to control the polarization layer by layer; see Fig. 8 for the corresponding
typical behavior of the mean polarization and the mean overlaps during training. Empirically, we
have found that (as one could expect) when training is successful the layers polarize progressively
towards q0 = 1, i.e. towards a precise point-wise solution, while the overlaps between units in each
hidden layer are such that qab ≪ 1 (indicating low symmetry between intra-layer units, as expected
for a non-redundant solution). To this aim, in most cases αℓ can be the same for each layer, while
tuning ρℓ for each layer allows finding better generalization performance in some cases (but is not
strictly necessary for learning).

In particular, it is possible to use the same value ρℓ for each layer before the last one (ℓ < L,
where L is the number of layers in the network), while we have found that the last layer tends to
polarize immediately during the dynamics (probably due to its proximity to the output constraints).
Empirically, it is usually beneficial for learning that this layer does not polarize, or only slightly, i.e.
⟨q0⟩ ≪ 1 (this can be achieved by imposing ρL < 1). Learning is anyway possible even when the
last layer polarizes towards ⟨q0⟩ = 1 along the dynamics, i.e. by choosing ρL sufficiently large.

As a simple general prescription, in most experiments we can fix α = 0.8 and ρL = 0.9, therefore
leaving ρℓ<L as the only hyper-parameter to be tuned, akin to the learning rate in SGD. Its value has
to be very close to 1.0 (a value smaller than 1.0 tends to depolarize the layers, without focusing on a
particular point-wise binary configuration, while a value greater than 1.0 tends to lead to numerical
instabilities and parameters' divergence).

B.9 COMPUTATIONAL PERFORMANCE: VARYING BATCH-SIZE


In order to compare the time performance of the BP-based algorithms with our implementation
of BinaryNet, we report in Fig. 9 the time in seconds taken by a single epoch of each algorithm
as a function of the batch-size, on an MLP with 2 hidden layers of 501 units on Fashion-MNIST. We
run all algorithms on an NVIDIA GeForce RTX 2080 Ti GPU. Multi-class and binary classification
present a very similar time scaling with the batch-size, in both cases comparable with BinaryNet. Let us
also notice that the BP-based algorithms are able to reach generalization performance comparable to
BinaryNet for all the values of the batch-size reported in this section.


[Figure 9 plot (log-log axes): seconds per epoch vs. batch size for BP, BPI, AMP, MF, and BinaryNet.]

Figure 9: Time scaling of the algorithms with the batch-size on an MLP with 2 hidden layers of 501
hidden units each on the Fashion-MNIST dataset (multi-class classification). The reported time (in
seconds) refers to one epoch for each algorithm.

