## DEEP LEARNING VIA MESSAGE PASSING ALGORITHMS
### BASED ON BELIEF PROPAGATION

**Anonymous authors**
Paper under double-blind review

ABSTRACT

Message-passing algorithms based on the Belief Propagation (BP) equations constitute a well-known distributed computational scheme. They yield exact marginals on tree-like graphical models and have also proven to be effective in many problems defined on loopy graphs, from inference to optimization, from signal processing to clustering. BP-based schemes are fundamentally different from stochastic gradient descent (SGD), on which the current success of deep networks is based. In this paper, we present and adapt to mini-batch training on GPUs a family of BP-based message-passing algorithms with a reinforcement term that biases distributions towards locally entropic solutions. These algorithms are capable of training multi-layer neural networks with performance comparable to SGD heuristics in a diverse set of experiments on natural datasets, including multi-class image classification and continual learning, while yielding improved performance on sparse networks. Furthermore, they allow us to make approximate Bayesian predictions that have higher accuracy than point-wise ones.

1 INTRODUCTION

Belief Propagation is a method for computing marginals and entropies in probabilistic inference problems (Bethe, 1935; Peierls, 1936; Gallager, 1962; Pearl, 1982). These include optimization problems as well, once they are written as the zero-temperature limit of a Gibbs distribution that uses the cost function as energy. Learning is one particular case, in which one wants to minimize a cost given by a data-dependent loss function. These problems are generally intractable, and message-passing techniques have been particularly successful at providing principled approximations through efficient distributed computations.

A particularly compact representation of inference/optimization problems that is used to build message-passing algorithms is provided by factor graphs. A factor graph is a bipartite graph composed of variable nodes and factor nodes expressing the interactions among variables. Belief Propagation is exact for tree-like factor graphs (Yedidia et al., 2003), where the Gibbs distribution is naturally factorized, whereas it is approximate for graphs with loops. Still, loopy BP is routinely used with success in many real-world applications, ranging from error-correcting codes to vision and clustering, just to mention a few. In all these problems, loops are indeed present in the factor graph, and yet the variables are weakly correlated at long range and BP gives good results.

A field in which BP has a long history is the statistical physics of disordered systems, where it is known as the Cavity Method (Mézard et al., 1987). It has been used to study the typical properties of spin glass models, which represent binary variables interacting through random interactions over a given graph. It is very well known that in spin glass models defined on complete graphs and on locally tree-like random graphs, which are both loopy, the weak-correlation conditions between variables may hold, and BP gives asymptotically exact results (Mézard & Montanari, 2009). Here we will mostly focus on neural networks with ±1 binary weights and sign activation functions, for which the messages and the marginals can be described simply by the difference between the probabilities associated with the +1 and -1 states, the so-called _magnetizations_.
The effectiveness of BP for deep learning has never been numerically tested in a systematic way; however, there is clear evidence that the weak correlation decay condition does not hold, and thus BP convergence and approximation quality are unpredictable. In this paper we explore the effectiveness of a variant of BP that has shown excellent convergence properties in hard optimization problems and in non-convex shallow networks. It goes under the name of focusing BP (fBP) and is based on a probability distribution, a likelihood, that focuses on highly entropic wide minima, neglecting the contribution to marginals from narrow minima even when they are the majority (and hence dominate the Gibbs distribution). This version of BP is thus expected to give good results only in models that have such wide entropic minima as part of their energy landscape.

As discussed in (Baldassi et al., 2016a), a simple way to define fBP is to add a "reinforcement" term to the BP equations: an iteration-dependent local field is introduced for each variable, with an intensity proportional to its marginal probability computed in the previous iteration step. This field is gradually increased until the entire system becomes fully biased on a configuration. The first version of reinforced BP was introduced in (Braunstein & Zecchina, 2006) as a heuristic algorithm to solve the learning problem in shallow binary networks. Baldassi et al. (2016a) showed that this version of BP is a limiting case of fBP, i.e., BP equations written for a likelihood that uses the local entropy function instead of the error (energy) loss function. As discussed in depth in that study, one way to introduce a likelihood that focuses on highly entropic regions is to create y coupled replicas of the original system. fBP equations are obtained as BP equations for the replicated system. It turns out that the fBP equations are identical to the BP equations for the original system with the only addition of a self-reinforcing term in the message passing scheme. The fBP algorithm can be used as a solver by gradually increasing the effect of the reinforcement: one can control the size of the regions over which the fBP equations estimate the marginals by tuning the parameters that appear in the expression of the reinforcement, until the high-entropy regions reduce to a single configuration. Interestingly, by keeping the size of the high-entropy region fixed, the fBP fixed point allows one to estimate the marginals and entropy relative to the region.

In this work, we present and adapt to GPU computation a family of fBP-inspired message passing algorithms that are capable of training multi-layer neural networks with generalization performance and computational speed comparable to SGD. This is the first work that shows that learning by message passing in deep neural networks 1) is possible and 2) is a viable alternative to SGD. Our version of fBP adds the reinforcement term at each mini-batch step in what we call the Posterior-as-Prior (PasP) rule. Furthermore, using the message-passing algorithm not as a solver but as an estimator of marginals allows us to make locally Bayesian predictions, averaging the predictions over the approximate posterior. The resulting generalization error is significantly better than that of the solver, showing that, although approximate, the marginals of the weights estimated by message passing retain useful information.
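To make the reinforcement heuristic described above more tangible, here is a minimal, schematic sketch for a generic binary-variable factor graph. It is not the algorithm used in this paper, and `bp_sweep`, `nvariables` and the specific field parameterization are illustrative assumptions.

```julia
# Schematic reinforced BP (in the spirit of Braunstein & Zecchina, 2006) on binary variables.
# `bp_sweep(fg, h)` is a hypothetical routine that runs one round of BP updates with
# external fields `h` and returns the marginal magnetizations m_i in (-1, 1).
function reinforced_bp(fg; iters = 200, dγ = 0.01)
    n = nvariables(fg)                             # hypothetical accessor
    h = zeros(n)                                   # reinforcement fields, one per variable
    m = zeros(n)
    for τ in 1:iters
        m = bp_sweep(fg, h)
        γ = dγ * τ                                 # reinforcement intensity grows with τ
        h = γ .* atanh.(clamp.(m, -0.999, 0.999))  # bias each variable towards its marginal
        all(abs.(m) .> 0.99) && break              # stop once the system is fully polarized
    end
    return Int.(sign.(m))                          # pointwise ±1 configuration
end
```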
Consistently with the assumptions underlying fBP, we find that the solutions provided by the message passing algorithms belong to flat entropic regions of the loss landscape and have good performance in continual learning tasks and on sparse networks as well. We also remark that our PasP update scheme is of independent interest and can be combined with different posterior approximation techniques.

The paper is structured as follows: in Sec. 2 we give a brief review of some related works. In Sec. 3 we provide a detailed description of the message-passing equations and of the high-level structure of the algorithms. In Sec. 4 we compare the performance of the message passing algorithms versus SGD-based approaches in different learning settings.

2 RELATED WORKS

The literature on message passing algorithms is extensive; we refer to Mézard & Montanari (2009) and Zdeborová & Krzakala (2016) for a general overview. More related to our work, multi-layer message-passing algorithms have been developed in inference contexts (Manoel et al., 2017; Fletcher et al., 2018), where they have been shown to produce exact marginals under certain statistical assumptions on (unlearned) weight matrices. The properties of message passing for learning shallow neural networks have been extensively studied (see Baldassi et al. (2020) and references therein). Barbier et al. (2019) rigorously show that message passing algorithms in generalized linear models perform asymptotically exact inference under some statistical assumptions. Dictionary learning and matrix factorization are harder problems closely related to deep network learning problems, in particular to the modelling of a single intermediate layer. They have been approached using message passing in Kabashima et al. (2016) and Parker et al. (2014), although the resulting predictions are found to be asymptotically inexact (Maillard et al., 2021). The same problem is faced by the message passing algorithm recently proposed for a multi-layer matrix factorization scenario (Zou et al., 2021). Unfortunately, our framework does not yield asymptotically exact predictions either. Nonetheless, it gives a message passing heuristic that for the first time is able to train deep neural networks on natural datasets, and therefore sets a reference for the algorithmic applications of this research line.

A few papers attribute the success of SGD to the geometrical structure (smoothness and flatness) of the loss landscape in neural networks (Baldassi et al., 2015; Chaudhari et al., 2017; Garipov et al., 2018; Li et al., 2018; Pittorino et al., 2021; Feng & Tu, 2021). These considerations do not depend on the particular form of the SGD dynamics and should extend also to other types of algorithms, although SGD is by far the most popular choice among NN practitioners due to its simplicity, flexibility, speed, and generalization performance.

While our work focuses on message passing schemes, some of the ideas presented here, such as the PasP rule, can be combined with algorithms for Bayesian neural network training (Hernández-Lobato & Adams, 2015; Wu et al., 2018). Recent work extends BP by combining it with graph neural networks (Kuck et al., 2020; Satorras & Welling, 2021). Finally, some work in computational neuroscience shows similarities to our approach (Rao, 2007).

3 LEARNING BY MESSAGE PASSING

3.1 POSTERIOR-AS-PRIOR UPDATES

We consider a multi-layer perceptron with L hidden neuron layers, having weight and bias parameters $\mathcal{W} = \{W^\ell, b^\ell\}_{\ell=0}^{L}$.
We allow for stochastic activations $P^\ell(x^{\ell+1} \mid z^\ell)$, where $z^\ell$ is the neuron's pre-activation vector for layer $\ell$, and $P^\ell$ is assumed to be factorized over the neurons. If no stochasticity is present, $P^\ell$ just encodes an element-wise activation function. The probability of output y given an input x is then given by

$$P(y \mid x, \mathcal{W}) = \int dx^{1:L} \prod_{\ell=0}^{L} P^{\ell+1}\big(x^{\ell+1} \mid W^\ell x^\ell + b^\ell\big), \qquad (1)$$

where for convenience we defined $x^0 = x$ and $x^{L+1} = y$. In a Bayesian framework, given a training set $\mathcal{D} = \{(x_n, y_n)\}_n$ and a prior distribution over the weights $q_\theta(\mathcal{W})$ in some parametric family, the posterior distribution is given by

$$P(\mathcal{W} \mid \mathcal{D}, \theta) \propto \prod_n P(y_n \mid x_n, \mathcal{W})\, q_\theta(\mathcal{W}). \qquad (2)$$

Here $\propto$ denotes equality up to a normalization factor. Using the posterior one can compute the Bayesian prediction for a new data-point x through the average $P(y \mid x, \mathcal{D}, \theta) = \int d\mathcal{W}\, P(y \mid x, \mathcal{W})\, P(\mathcal{W} \mid \mathcal{D}, \theta)$. Unfortunately, the posterior is generically intractable due to the hard-to-compute normalization factor. On the other hand, we are mainly interested in training a distribution that covers wide minima of the loss landscape that generalize well (Baldassi et al., 2016a) and in recovering pointwise estimators within these regions. The Bayesian modeling becomes an auxiliary tool to set the stage for the message passing algorithms seeking flat minima. We also need a formalism that allows for mini-batch training, to speed up the computation and deal with large datasets. Therefore, we devise an update scheme that we call Posterior-as-Prior (PasP), where we evolve the parameters $\theta^t$ of a distribution $q_{\theta^t}(\mathcal{W})$, computed as an approximate mini-batch posterior, in such a way that the outcome of the previous iteration becomes the prior in the following step. In the PasP scheme, $\theta^t$ retains the memory of past observations. We also add an exponential factor $\rho$, that we typically set close to 1, tuning the forgetting rate and playing a role similar to the learning rate in SGD. Given a mini-batch $(X^t, y^t)$ sampled from the training set at time t and a scalar $\rho > 0$, the PasP update reads

$$q_{\theta^{t+1}}(\mathcal{W}) \approx \big[ P(\mathcal{W} \mid y^t, X^t, \theta^t) \big]^{\rho}, \qquad (3)$$

where $\approx$ denotes approximate equality and we do not account for the normalization factor. A first approximation may be needed in the computation of the posterior, and a second to project the approximate posterior onto the distribution manifold spanned by $\theta$ (Minka, 2001). In practice, we will consider factorized approximate posteriors in an exponential family and priors $q_\theta$ in the same family, although Eq. 3 generically allows for more refined approximations.

Notice that setting $\rho = 1$, the batch-size to 1, and taking a single pass over the dataset, we recover the Assumed Density Filtering algorithm (Minka, 2001). For large enough $\rho$ (including $\rho = 1$), the iterations of $q_{\theta^t}$ will concentrate on a pointwise estimator. This mechanism mimics the reinforcement heuristic commonly used to turn Belief Propagation into a solver for constraint satisfaction problems (Braunstein & Zecchina, 2006) and is related to flat-minima discovery (see focusing-BP in Baldassi et al. (2016a)). A different prior updating mechanism, which can be understood as empirical Bayes, has been used in Baldassi et al. (2016b).
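As a concrete sketch of the PasP outer loop of Eq. 3 (schematic, under assumptions of our own and not the paper's implementation): if both the prior and the approximate posterior are represented by the natural parameters of the same factorized exponential family, tempering by ρ amounts to a rescaling of those parameters; `message_passing` is a placeholder for the inner loop of Sec. 3.2.

```julia
# Schematic PasP training loop implementing Eq. (3). Names are illustrative:
# `θ` holds the natural parameters of the factorized prior q_θ(W), and
# `message_passing` is assumed to return the natural parameters of the
# approximate mini-batch posterior.
function pasp_train(θ, minibatches; ρ = 1.0, epochs = 1, τmax = 1)
    for _ in 1:epochs
        for (X, y) in minibatches
            post = message_passing(θ, X, y; τmax)  # inner loop (Algorithm 1)
            θ = ρ .* post                          # tempered posterior becomes the new prior
        end
    end
    return θ
end
```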
3.2 INNER MESSAGE PASSING LOOP

While the PasP rule takes care of the reinforcement heuristic across mini-batches, we compute the mini-batch posterior in Eq. 3 using message passing approaches derived from Belief Propagation. BP is an iterative scheme for computing marginals and entropies of statistical models (Mézard & Montanari, 2009). It is most conveniently expressed on factor graphs, that is, bipartite graphs whose two sets of nodes are called variable nodes and factor nodes. They respectively represent the variables involved in the statistical model and their interactions. Messages from factor nodes to variable nodes and vice versa are exchanged along the edges of the factor graph for a certain number of BP iterations or until a fixed point is reached.

The factor graph for $P(\mathcal{W} \mid X^t, y^t, \theta^t)$ can be derived from Eq. 2, with the following additional specifications. For simplicity, we will ignore the bias term in each layer. We assume a factorized $q_{\theta^t}(\mathcal{W})$, with each factor parameterized by its first two moments. In what follows, we drop the PasP iteration index t. For each example $(x_n, y_n)$ in the mini-batch, we introduce the auxiliary variables $x^\ell_n$, $\ell = 1, \dots, L$, representing the layers' activations. For each example, each neuron in the network contributes a factor node to the factor graph. The scalar components of the weight matrices and the activation vectors become variable nodes. This construction is presented in Appendix A, where we also derive the message update rules on the factor graph.

The factor graph thus defined is extremely loopy, and straightforward iteration of BP has convergence issues. Moreover, in the presence of a homogeneous prior over the weights, the neuron permutation symmetry in each hidden layer induces a strongly attractive symmetric fixed point that hinders learning. We work around these issues by breaking the symmetry at time t = 0 with an inhomogeneous prior. In our experiments, a little initial heterogeneity is sufficient to obtain specialized neurons at each following time step. Additionally, we do not require message passing convergence in the inner loop (see Algorithm 1) but perform one or a few iterations for each $\theta$ update. We also include an inertia term, commonly called a damping factor, in the message updates (see Appendix B.2). As we shall discuss, these simple rules suffice to train deep networks by message passing.

For the inner loop we adapt to deep neural networks four different message passing algorithms, all of which are well known in the literature, although derived in simpler settings: Belief Propagation (BP), BP-Inspired (BPI) message passing, mean-field (MF), and approximate message passing (AMP). The last three algorithms can be considered approximations of the first one. In the following paragraphs we discuss their common traits, present the BP updates as an example, and refer to Appendix A for an in-depth exposition. For all algorithms, message updates can be divided into a forward pass and a backward pass, as also done in (Fletcher et al., 2018) in a multi-layer inference setting. The BP algorithm is compactly reported in Algorithm 1.

**Meaning of messages.** All the messages involved in the message passing can be understood in terms of cavity marginals or full marginals (as mentioned in the introduction, BP is also known as the Cavity Method, see (Mézard & Montanari, 2009)). Of particular relevance are $m^\ell_{ki}$ and $\sigma^\ell_{ki}$, denoting the mean and variance of the weights $W^\ell_{ki}$. The quantities $\hat{x}^\ell_{in}$ and $\Delta^\ell_{in}$ instead denote the mean and variance of the $i$-th neuron's activation in layer $\ell$ for a given input $x_n$.
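To make the bookkeeping of the previous paragraphs concrete, one possible data layout for the per-layer quantities of a mini-batch of B examples is sketched below; this is an assumption for illustration, not the paper's code. For brevity only the full marginal means and variances are stored, as in the MF/AMP variants; BP proper additionally carries the cavity versions (e.g. $m^\ell_{ki\to n}$) with one extra index. The damped ("inertia") update of Appendix B.2 is also sketched.

```julia
# Illustrative per-layer message storage for a layer with Nin inputs, Nout outputs,
# and a mini-batch of B examples (an assumption about data layout).
struct LayerMessages{T<:AbstractMatrix}
    m::T      # (Nout, Nin)  weight means         m_ki
    σ::T      # (Nout, Nin)  weight variances     σ_ki
    xhat::T   # (Nin,  B)    activation means     x̂_in
    Δ::T      # (Nin,  B)    activation variances Δ_in
    ω::T      # (Nout, B)    forward pre-activation means
    V::T      # (Nout, B)    forward pre-activation variances
    A::T      # (Nin,  B)    backward fields on activations
    B::T      # (Nin,  B)
    G::T      # (Nout, Nin)  backward fields on weights
    H::T      # (Nout, Nin)
end

# Damped update: mix the freshly computed messages with the old ones (α is the damping factor).
damp!(old::AbstractArray, new::AbstractArray, α) = (old .= α .* old .+ (1 - α) .* new)
```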
**Scalar free energies.** All message passing schemes are conveniently expressed in terms of two functions that correspond to the effective free energy (Zdeborová & Krzakala, 2016) of a single neuron and of a single weight respectively:

$$\psi^\ell(B, A, \omega, V) = \log \int dx\, dz\; e^{-\frac{1}{2}Ax^2 + Bx}\, P^\ell(x \mid z)\, e^{-\frac{(\omega - z)^2}{2V}}, \qquad \ell = 1, \dots, L \qquad (4)$$

$$\phi(H, G, \theta) = \log \int dw\; e^{-\frac{1}{2}Gw^2 + Hw}\, q_\theta(w) \qquad (5)$$

Notice that for common deterministic activations such as ReLU and sign, the function $\psi$ has analytic and smooth expressions (see Appendix A.8). The same holds for the function $\phi$ when $q_\theta(w)$ is Gaussian (continuous weights) or a mixture of atoms (discrete weights). At the last layer we impose $P^{L+1}(y \mid z) = I(y = \mathrm{sign}(z))$ in binary classification tasks and $P^{L+1}(y \mid z) = I(y = \arg\max(z))$ in multi-class classification (see Appendix A.9). While in our experiments we use hard constraints for the final output, therefore solving a constraint satisfaction problem, it would be interesting to also consider soft constraints and introduce a temperature, but this is beyond the scope of our work.

**Start and end of message passing.** At the beginning of a new PasP iteration t, we reset the messages (see Appendix A) and run message passing for τ_max iterations. We then compute the new prior's parameters $\theta^{t+1}$ from the posterior given by the message passing.

**BP Forward pass.** After initialization of the messages at time τ = 0, for each following time we propagate a set of messages from the first to the last layer and then another set from the last to the first. For an intermediate layer $\ell$ the forward pass reads

$$\hat{x}^{\ell,\tau}_{in\to k} = \partial_B \psi^\ell\big(B^{\ell,\tau-1}_{in\to k}, A^{\ell,\tau-1}_{in}, \omega^{\ell-1,\tau}_{in}, V^{\ell-1,\tau}_{in}\big) \qquad (6)$$

$$\Delta^{\ell,\tau}_{in} = \partial^2_B \psi^\ell\big(B^{\ell,\tau-1}_{in}, A^{\ell,\tau-1}_{in}, \omega^{\ell-1,\tau}_{in}, V^{\ell-1,\tau}_{in}\big) \qquad (7)$$

$$m^{\ell,\tau}_{ki\to n} = \partial_H \phi\big(H^{\ell,\tau-1}_{ki\to n}, G^{\ell,\tau-1}_{ki}, \theta^\ell_{ki}\big) \qquad (8)$$

$$\sigma^{\ell,\tau}_{ki} = \partial^2_H \phi\big(H^{\ell,\tau-1}_{ki}, G^{\ell,\tau-1}_{ki}, \theta^\ell_{ki}\big) \qquad (9)$$

$$V^{\ell,\tau}_{kn} = \sum_i \Big[ \big(m^{\ell,\tau}_{ki\to n}\big)^2\, \Delta^{\ell,\tau}_{in} + \sigma^{\ell,\tau}_{ki}\, \big(\hat{x}^{\ell,\tau}_{in\to k}\big)^2 + \sigma^{\ell,\tau}_{ki}\, \Delta^{\ell,\tau}_{in} \Big] \qquad (10)$$

$$\omega^{\ell,\tau}_{kn\to i} = \sum_{i'\neq i} m^{\ell,\tau}_{ki'\to n}\, \hat{x}^{\ell,\tau}_{i'n\to k} \qquad (11)$$

The equations for the first layer differ slightly and in an intuitive way from the ones above (see Appendix A.3).

**BP Backward pass.** The backward pass updates a set of messages from the last to the first layer:

$$g^{\ell,\tau}_{kn\to i} = \partial_\omega \psi^{\ell+1}\big(B^{\ell+1,\tau}_{kn}, A^{\ell+1,\tau}_{kn}, \omega^{\ell,\tau}_{kn\to i}, V^{\ell,\tau}_{kn}\big) \qquad (12)$$

$$\Gamma^{\ell,\tau}_{kn} = -\partial^2_\omega \psi^{\ell+1}\big(B^{\ell+1,\tau}_{kn}, A^{\ell+1,\tau}_{kn}, \omega^{\ell,\tau}_{kn}, V^{\ell,\tau}_{kn}\big) \qquad (13)$$

$$A^{\ell,\tau}_{in} = \sum_k \Big[ \Big(\big(m^{\ell,\tau}_{ki\to n}\big)^2 + \sigma^{\ell,\tau}_{ki}\Big)\, \Gamma^{\ell,\tau}_{kn} - \sigma^{\ell,\tau}_{ki}\, \big(g^{\ell,\tau}_{kn\to i}\big)^2 \Big] \qquad (14)$$

$$B^{\ell,\tau}_{in\to k} = \sum_{k'\neq k} m^{\ell,\tau}_{k'i\to n}\, g^{\ell,\tau}_{k'n\to i} \qquad (15)$$

$$G^{\ell,\tau}_{ki} = \sum_n \Big[ \Big(\big(\hat{x}^{\ell,\tau}_{in\to k}\big)^2 + \Delta^{\ell,\tau}_{in}\Big)\, \Gamma^{\ell,\tau}_{kn} - \Delta^{\ell,\tau}_{in}\, \big(g^{\ell,\tau}_{kn\to i}\big)^2 \Big] \qquad (16)$$

$$H^{\ell,\tau}_{ki\to n} = \sum_{n'\neq n} \hat{x}^{\ell,\tau}_{in'\to k}\, g^{\ell,\tau}_{kn'\to i} \qquad (17)$$

As with the forward pass, we add the caveat that for the last layer the equations are slightly different from the ones above.
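As a worked example (our own sketch, not taken from the paper's appendix), Eq. 5 has a simple closed form when the weights are ±1 and the prior is parameterized as $q_\theta(w) \propto e^{\theta w}$, which is one natural choice for the "mixture of atoms" case mentioned above; the moments entering Eqs. 8-9 then follow by differentiation.

```julia
# Closed-form φ(H, G, θ) of Eq. (5) for ±1 weights with prior q_θ(w) = exp(θ w) / (2 cosh θ).
# This parameterization is an assumption made for illustration; note that for w in {-1, +1}
# the Gaussian term contributes only the constant -G/2, since w^2 = 1.
phi_binary(H, G, θ) = -G / 2 + log(cosh(H + θ)) - log(cosh(θ))

# Weight moments used in the forward pass, Eqs. (8)-(9): m = ∂φ/∂H, σ = ∂²φ/∂H².
function binary_weight_moments(H, θ)
    m = tanh(H + θ)   # posterior mean of the ±1 weight
    σ = 1 - m^2       # posterior variance
    return m, σ
end
```

Under this parameterization, the PasP update of Eq. 3 would simply read $\theta^{t+1}_{ki} = \rho\,(\theta^t_{ki} + H^\ell_{ki})$, with H the total field produced by the message passing.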
**Computational complexity.** The message passing equations boil down to element-wise operations and tensor contractions that we easily implement using the GPU-friendly Julia library Tullio.jl (Abbott et al., 2021). For a layer of input and output size N and considering a batch-size of B, the time complexity of a forth-and-back iteration is O(N²B) for all message passing algorithms (BP, BPI, MF, and AMP), the same as SGD. The prefactor varies and is generally larger than for SGD (see Appendix B.9). Also, the time complexity of message passing is proportional to τ_max (which we typically set to 1). We provide our implementation in the GitHub repo anonymous.

**Algorithm 1: BP for deep neural networks**

    // Message passing used in the PasP Eq. 3 to approximate the mini-batch posterior.
    // Here we specifically refer to BP updates. BPI, MF, and AMP updates take the same
    // form but use the rules in Appendix A.4, A.5, and A.7 respectively.
    1  Initialize messages.
    2  for τ = 1, ..., τ_max do
           // Forward Pass
    3      for ℓ = 0, ..., L do
    4          compute x̂^ℓ, Δ^ℓ using (6, 7)
    5          compute m^ℓ, σ^ℓ using (8, 9)
    6          compute V^ℓ, ω^ℓ using (10, 11)
           // Backward Pass
    7      for ℓ = L, ..., 0 do
    8          compute g^ℓ, Γ^ℓ using (12, 13)
    9          compute A^ℓ, B^ℓ using (14, 15)
    10         compute G^ℓ, H^ℓ using (16, 17)

4 NUMERICAL RESULTS

We implement our message passing algorithms on neural networks with continuous and binary weights and with binary activations. In our experiments we fix τ_max = 1. We typically do not observe an increase in performance when taking more steps, except in some specific cases, in particular for MF layers. We remark that for τ_max = 1 the BP and the BPI equations are identical, so in most of the subsequent numerical results we will only investigate BP. We compare our algorithms with an SGD-based algorithm adapted to binary architectures (Hubara et al., 2016), which we call BinaryNet throughout the paper (see Appendix B.6 for details). Comparisons of Bayesian predictions are made with the gradient-based Expectation Backpropagation (EBP) algorithm (Soudry et al., 2014a), which is also able to deal with discrete weights and activations. In all architectures we avoid the use of bias terms and batch-normalization layers.

We find that message-passing algorithms are able to train generic MLP architectures with varying numbers and sizes of hidden layers. As for the datasets, we are able to perform both binary classification and multi-class classification on standard computer vision datasets such as MNIST, Fashion-MNIST, and CIFAR-10. Since these datasets consist of 10 classes, for the binary classification task we divide each dataset into two classes (even vs odd). We report that message passing algorithms are able to solve these optimization problems with generalization performance comparable to or better than SGD-based algorithms. Some of the message passing algorithms (BP and AMP in particular) need fewer epochs to achieve low error than SGD-based algorithms, even when adaptive methods like Adam are considered. Timings of our GPU implementations of message passing algorithms are competitive with SGD (see Appendix B.9).

4.1 EXPERIMENTS ACROSS ARCHITECTURES

We select a specific task, multi-class classification on Fashion-MNIST, and we compare the message passing algorithms with BinaryNet for different choices of the architecture (i.e. we vary the number and the size of the hidden layers). In Fig. 1 (Left) we present the learning curves for an MLP with 3 hidden layers of 501 units with binary weights and activations.
Similar results hold in our experiments with 2 or 3 hidden layers of 101, 501 or 1001 units and with batch sizes from 1 to 1024. The parameters used in our simulations are reported in Appendix B.3. Results on networks with continuous weights can be found in Fig. 2 (Right).

4.2 SPARSE LAYERS

Since the BP algorithm has notoriously been successful on sparse graphs, we perform a straightforward implementation of pruning at initialization, i.e. we impose a random boolean mask on the weights that we keep fixed along the training. We call sparsity the fraction of zeroed weights. This kind of non-adaptive pruning is known to largely hinder learning (Frankle et al., 2021; Sung et al., 2021). In the right panel of Fig. 1, we report results on sparse binary networks in which we train an MLP with 2 hidden layers of 101 units on the MNIST dataset. For reference, results on pruning quantized/binary networks can be found in Refs. (Han et al., 2016; Ardakani et al., 2017; Tung & Mori, 2018; Diffenderfer & Kailkhura, 2021). Experimenting with sparsity up to 90%, we observe that BP and MF perform better than BinaryNet, while AMP lags behind BinaryNet.

Figure 1: (Left) Training curves of message passing algorithms compared with BinaryNet on the Fashion-MNIST dataset (multi-class classification) with a binary MLP with 3 hidden layers of 501 units. (Right) Final test accuracy when varying the layer's sparsity in a binary MLP with 2 hidden layers of 101 units on the MNIST dataset (multi-class). In both panels the batch-size is 128 and curves are averaged over 5 realizations of the initial conditions (and of the sparsity pattern in the right panel).

4.3 EXPERIMENTS ACROSS DATASETS

We now fix the architecture, an MLP with 2 hidden layers of 501 neurons each, with binary weights and activations. We vary the dataset, i.e. we test the BP-based algorithms on standard computer vision benchmark datasets such as MNIST, Fashion-MNIST and CIFAR-10, in both the multi-class and binary classification tasks. In Tab. 1 we report the final test errors obtained by the message passing algorithms compared to the BinaryNet baseline. See Appendix B.4 for the corresponding training errors and the parameters used in the simulations. We mention that while the test performance is mostly comparable, the train error tends to be lower for the message passing algorithms.
averagin We observe better generalization error f wing that the marginals retain useful ith the PasP mini-batch procedure (the e t this converges with difficulty in our te (as also confirmed by the local energy ompute can be considered as a local app cation on the MNIST dataset in Fig. 2, a tasets and architectures. We obtain the gle forward pass of the message passing osterior distribution does not concentra e to the prediction of a single configurat m a comparison of BP (point-wise an m Bayesian predictions, Expectation Ba mplementation details. y Weights 5 EBP Bayes BP BP BinaryNet 4 60 80 100 bayes EBP hs (%) 3 error 2 test 1||ms on a MLP with 2 hid ined with batch-size 12 dom initializations. ini-batch posterior mar g the pointwise predicti rom Bayesian predictio information. However, xact ones should be com sts). Since BP-based alg measure performed in A roximation of the full on nd we observe the same Bayesian prediction fro . To obtain good Bayes te too much, otherwise ion. d Bayesian) with SGD ckpropagation (Soudry Continuous Weights| |(%) 50 error 40 t|||SGD EBP| |tes 0 20 40 epoc|||bayes EBP| ||||| ||||| ||||| 20 40 60 80 100 epochs SGD BP EBP Bayes BP bayes EBP 20 40 60 80 100 epochs Figure 2: (Left) Test error curves for Bayesian and point-wise predictions for a MLP with 2 hidden layers of 101 units on the 2-classes MNIST dataset. We report the results for (Left) binary and (Right) continuous weights. In both cases, we compare SGD, BP (point-wise and Bayesian) and EBP (point-wise and Bayesian). See Appendix B.3 for details. 4.5 CONTINUAL LEARNING Given the high local entropy (i.e. the flatness) of the solutions found by the BP-based algorithms (see Appendix B.5), we perform additional tests in a classic setting, continual learning, where the ----- possibility of locally rearranging the solutions while keeping low training error can be an advantage. When a deep network is trained sequentially on different tasks, it tends to forget exponentially fast previously seen tasks while learning new ones (McCloskey & Cohen, 1989; Robins, 1995; Fusi et al., 2005). Recent work (Feng & Tu, 2021) has shown that searching for a flat region in the loss landscape can indeed help to prevent catastrophic forgetting. Several heuristics have been proposed to mitigate the problem (Kirkpatrick et al., 2017; Aljundi et al., 2018; Zenke et al., 2017; Laborieux et al., 2021) but all require specialized adjustments to the loss or the dynamics . Here we show instead that our message passing schemes are naturally prone to learn multiple tasks sequentially, mitigating the characteristic memory issues of the gradient-based schemes without the need for explicit modifications. As a prototypical experiment, we sequentially trained a multi-layer neural network on 6 different versions of the MNIST dataset, where the pixels of the images have been randomly permuted (Goodfellow et al., 2013), giving a fixed budget of 40 epochs on each task. We present the results for a two hidden layer neural network with 2001 units on each layer (see Appendix B.3 for details). As can be seen in Fig. 3, at the end of the training the BP algorithm is able to reach good generalization performances on all the tasks. We compared the BP performance with BinaryNet, which already performs better than SGD with continuous weights (see the discussion in Laborieux et al. (2021)). While our BP implementation is not competitive with ad-hoc techniques specifically designed for this problem, it beats non-specialized heuristics. 
4.5 CONTINUAL LEARNING

Given the high local entropy (i.e. the flatness) of the solutions found by the BP-based algorithms (see Appendix B.5), we perform additional tests in a classic setting, continual learning, where the possibility of locally rearranging the solutions while keeping low training error can be an advantage. When a deep network is trained sequentially on different tasks, it tends to forget exponentially fast previously seen tasks while learning new ones (McCloskey & Cohen, 1989; Robins, 1995; Fusi et al., 2005). Recent work (Feng & Tu, 2021) has shown that searching for a flat region in the loss landscape can indeed help to prevent catastrophic forgetting. Several heuristics have been proposed to mitigate the problem (Kirkpatrick et al., 2017; Aljundi et al., 2018; Zenke et al., 2017; Laborieux et al., 2021), but all require specialized adjustments to the loss or the dynamics. Here we show instead that our message passing schemes are naturally prone to learn multiple tasks sequentially, mitigating the characteristic memory issues of gradient-based schemes without the need for explicit modifications.

As a prototypical experiment, we sequentially trained a multi-layer neural network on 6 different versions of the MNIST dataset, where the pixels of the images have been randomly permuted (Goodfellow et al., 2013), giving a fixed budget of 40 epochs on each task. We present the results for a two-hidden-layer neural network with 2001 units on each layer (see Appendix B.3 for details). As can be seen in Fig. 3, at the end of the training the BP algorithm is able to reach good generalization performance on all the tasks. We compared the BP performance with BinaryNet, which already performs better than SGD with continuous weights (see the discussion in Laborieux et al. (2021)). While our BP implementation is not competitive with ad-hoc techniques specifically designed for this problem, it beats non-specialized heuristics. Moreover, we believe that specialized approaches like the one of Laborieux et al. (2021) can be adapted to message passing as well.

Figure 3: Performance of BP and BinaryNet on the permuted MNIST task (see text) for a two-hidden-layer network with 2001 units on each layer and binary weights and activations. The model is trained sequentially on 6 different versions of the MNIST dataset (the tasks), where the pixels have been permuted. (Left) Test accuracy on each task after the network has been trained on all the tasks. (Right) Test accuracy on the first task as a function of the number of epochs. Points are averages over 5 independent runs, shaded areas are errors on the mean.

5 DISCUSSION AND CONCLUSIONS

While successful in many fields, message passing algorithms have notoriously struggled to scale to deep neural network training problems. Here we have developed a class of fBP-based message passing algorithms and used them within an update scheme, Posterior-as-Prior (PasP), that makes it possible to train deep and wide multi-layer perceptrons by message passing. We performed experiments with binary activations and either binary or continuous weights. Future work should try to include different activations, biases, batch-normalization, and convolutional layers as well. Another interesting direction is the algorithmic computation of the (local) entropy of the model from the messages. Further theoretical work is needed for a more complete understanding of the robustness of our methods. Recent developments in message passing algorithms (Rangan et al., 2019) and related theoretical analyses (Goldt et al., 2020) could provide fruitful inspiration. While our algorithms can be used for approximate Bayesian inference, exact posterior calculation is still out of reach for message passing approaches and much technical work is needed in that direction.

REFERENCES

Michael Abbott, Dilum Aluthge, N3N5, Simeon Schaub, Carlo Lucibello, Chris Elrod, and Johnny Chen. Tullio.jl julia package, 2021. URL https://github.com/mcabbott/Tullio.jl.

Rahaf Aljundi, Francesca Babiloni, Mohamed Elhoseiny, Marcus Rohrbach, and Tinne Tuytelaars. Memory aware synapses: Learning what (not) to forget. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 139–154, 2018.

Arash Ardakani, Carlo Condo, and Warren J. Gross. Sparsely-connected neural networks: Towards efficient VLSI implementation of deep neural networks. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net, 2017. URL https://openreview.net/forum?id=r1fYuytex.

Carlo Baldassi, Alfredo Braunstein, Nicolas Brunel, and Riccardo Zecchina. Efficient supervised learning in networks with binary synapses. Proceedings of the National Academy of Sciences, 104(26):11079–11084, 2007. ISSN 0027-8424. doi: 10.1073/pnas.0700324104. URL https://www.pnas.org/content/104/26/11079.

Carlo Baldassi, Alessandro Ingrosso, Carlo Lucibello, Luca Saglietti, and Riccardo Zecchina.
Subdominant dense clusters allow for simple learning and high computational performance in neural networks with discrete synapses. _Phys. Rev. Lett., 115:128101, Sep 2015._ [doi: 10.1103/PhysRevLett.115.128101. URL https://link.aps.org/doi/10.1103/](https://link.aps.org/doi/10.1103/PhysRevLett.115.128101) [PhysRevLett.115.128101.](https://link.aps.org/doi/10.1103/PhysRevLett.115.128101) Carlo Baldassi, Christian Borgs, Jennifer T. Chayes, Alessandro Ingrosso, Carlo Lucibello, Luca Saglietti, and Riccardo Zecchina. Unreasonable effectiveness of learning neural networks: From accessible states and robust ensembles to basic algorithmic schemes. Proceedings of the National _Academy of Sciences, 113(48):E7655โE7662, 2016a. ISSN 0027-8424. doi: 10.1073/pnas._ [1608103113. URL https://www.pnas.org/content/113/48/E7655.](https://www.pnas.org/content/113/48/E7655) Carlo Baldassi, Federica Gerace, Carlo Lucibello, Luca Saglietti, and Riccardo Zecchina. Learning may need only a few bits of synaptic precision. Phys. Rev. E, 93:052313, May 2016b. doi: 10. [1103/PhysRevE.93.052313. URL https://link.aps.org/doi/10.1103/PhysRevE.](https://link.aps.org/doi/10.1103/PhysRevE.93.052313) [93.052313.](https://link.aps.org/doi/10.1103/PhysRevE.93.052313) Carlo Baldassi, Fabrizio Pittorino, and Riccardo Zecchina. Shaping the learning landscape in neural networks around wide flat minima. Proceedings of the National Academy of Sciences, 117(1): [161โ170, 2020. ISSN 0027-8424. doi: 10.1073/pnas.1908636117. URL https://www.pnas.](https://www.pnas.org/content/117/1/161) [org/content/117/1/161.](https://www.pnas.org/content/117/1/161) Jean Barbier, Florent Krzakala, Nicolas Macris, Lรฉo Miolane, and Lenka Zdeborovรก. Optimal errors and phase transitions in high-dimensional generalized linear models. Proceedings of the National _Academy of Sciences, 116(12):5451โ5460, 2019. ISSN 0027-8424. doi: 10.1073/pnas.1802705116._ [URL https://www.pnas.org/content/116/12/5451.](https://www.pnas.org/content/116/12/5451) Hans Bethe. Statistical theory of superlattices. Proc. R. Soc. A, 150:552, 1935. Alfredo Braunstein and Riccardo Zecchina. Learning by message passing in networks of discrete synapses. Phys. Rev. Lett., 96:030201, Jan 2006. doi: 10.1103/PhysRevLett.96.030201. URL [https://link.aps.org/doi/10.1103/PhysRevLett.96.030201.](https://link.aps.org/doi/10.1103/PhysRevLett.96.030201) Pratik Chaudhari, Anna Choromanska, Stefano Soatto, Yann LeCun, Carlo Baldassi, Christian Borgs, Jennifer T. Chayes, Levent Sagun, and Riccardo Zecchina. Entropy-sgd: Biasing gradient descent into wide valleys. In 5th International Conference on Learning Representations, ICLR 2017, _Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net, 2017. URL_ [https://openreview.net/forum?id=B1YfAfcgl.](https://openreview.net/forum?id=B1YfAfcgl) James Diffenderfer and Bhavya Kailkhura. Multi-prize lottery ticket hypothesis: Finding accurate binary neural networks by pruning a randomly weighted network. In International Confer_[ence on Learning Representations, 2021. URL https://openreview.net/forum?id=](https://openreview.net/forum?id=U_mat0b9iv)_ [U_mat0b9iv.](https://openreview.net/forum?id=U_mat0b9iv) ----- Yu Feng and Yuhai Tu. The inverse varianceโflatness relation in stochastic gradient descent is critical for finding flat minima. Proceedings of the National Academy of Sciences, 118(9), 2021. Alyson K Fletcher, Sundeep Rangan, and Philip Schniter. Inference in deep networks in high dimensions. 
In 2018 IEEE International Symposium on Information Theory (ISIT), pp. 1884โ1888. IEEE, 2018. Jonathan Frankle, Gintare Karolina Dziugaite, Daniel Roy, and Michael Carbin. Pruning neural networks at initialization: Why are we missing the mark? In International Conference on Learning _[Representations, 2021. URL https://openreview.net/forum?id=Ig-VyQc-MLK.](https://openreview.net/forum?id=Ig-VyQc-MLK)_ Stefano Fusi, Patrick J Drew, and Larry F Abbott. Cascade models of synaptically stored memories. _Neuron, 45(4):599โ611, 2005._ Marylou Gabriรฉ. Mean-field inference methods for neural networks. Journal of Physics A: Mathe_matical and Theoretical, 53(22):223002, 2020._ Robert Gallager. Low-density parity-check codes. IRE Transactions on information theory, 8(1): 21โ28, 1962. Timur Garipov, Pavel Izmailov, Dmitrii Podoprikhin, Dmitry P Vetrov, and Andrew G Wilson. Loss surfaces, mode connectivity, and fast ensembling of dnns. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (eds.), Advances in Neural Information Processing _[Systems, volume 31. Curran Associates, Inc., 2018. URL https://proceedings.neurips.](https://proceedings.neurips.cc/paper/2018/file/be3087e74e9100d4bc4c6268cdbe8456-Paper.pdf)_ [cc/paper/2018/file/be3087e74e9100d4bc4c6268cdbe8456-Paper.pdf.](https://proceedings.neurips.cc/paper/2018/file/be3087e74e9100d4bc4c6268cdbe8456-Paper.pdf) Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Yee Whye Teh and Mike Titterington (eds.), Proceedings of the Thirteenth _International Conference on Artificial Intelligence and Statistics, volume 9 of Proceedings of_ _Machine Learning Research, pp. 249โ256, Chia Laguna Resort, Sardinia, Italy, 13โ15 May 2010._ [PMLR. URL https://proceedings.mlr.press/v9/glorot10a.html.](https://proceedings.mlr.press/v9/glorot10a.html) Sebastian Goldt, Marc Mรฉzard, Florent Krzakala, and Lenka Zdeborovรก. Modeling the influence of data structure on learning in neural networks: The hidden manifold model. Physical Review X, 10 (4):041044, 2020. Ian J Goodfellow, Mehdi Mirza, Da Xiao, Aaron Courville, and Yoshua Bengio. An empirical investigation of catastrophic forgetting in gradient-based neural networks. arXiv preprint arXiv:1312.6211, 2013. Song Han, Huizi Mao, and William J. Dally. Deep compression: Compressing deep neural network with pruning, trained quantization and huffman coding. In Yoshua Bengio and Yann LeCun (eds.), _4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico,_ _[May 2-4, 2016, Conference Track Proceedings, 2016. URL http://arxiv.org/abs/1510.](http://arxiv.org/abs/1510.00149)_ [00149.](http://arxiv.org/abs/1510.00149) Josรฉ Miguel Hernรกndez-Lobato and Ryan P. Adams. Probabilistic backpropagation for scalable learning of bayesian neural networks. In Proceedings of the 32nd International Conference on _International Conference on Machine Learning - Volume 37, ICMLโ15, pp. 1861โ1869. JMLR.org,_ 2015. Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Binarized neural networks. In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 29. Curran Asso[ciates, Inc., 2016. 
URL https://proceedings.neurips.cc/paper/2016/file/](https://proceedings.neurips.cc/paper/2016/file/d8330f857a17c53d217014ee776bfd50-Paper.pdf) [d8330f857a17c53d217014ee776bfd50-Paper.pdf.](https://proceedings.neurips.cc/paper/2016/file/d8330f857a17c53d217014ee776bfd50-Paper.pdf) Yiding Jiang, Behnam Neyshabur, Hossein Mobahi, Dilip Krishnan, and Samy Bengio. Fantastic generalization measures and where to find them. In International Conference on Learning _[Representations, 2020. URL https://openreview.net/forum?id=SJgIPJBFvH.](https://openreview.net/forum?id=SJgIPJBFvH)_ Yoshiyuki Kabashima, Florent Krzakala, Marc Mรฉzard, Ayaka Sakata, and Lenka Zdeborovรก. Phase transitions and sample complexity in bayes-optimal matrix factorization. IEEE Transactions on _Information Theory, 62(7):4228โ4265, 2016. doi: 10.1109/TIT.2016.2556702._ ----- James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences, 114 (13):3521โ3526, 2017. Jonathan Kuck, Shuvam Chakraborty, Hao Tang, Rachel Luo, Jiaming Song, Ashish Sabharwal, and Stefano Ermon. Belief propagation neural networks. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin (eds.), Advances in Neural Information Processing Systems, volume 33, [pp. 667โ678. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/](https://proceedings.neurips.cc/paper/2020/file/07217414eb3fbe24d4e5b6cafb91ca18-Paper.pdf) [paper/2020/file/07217414eb3fbe24d4e5b6cafb91ca18-Paper.pdf.](https://proceedings.neurips.cc/paper/2020/file/07217414eb3fbe24d4e5b6cafb91ca18-Paper.pdf) Axel Laborieux, Maxence Ernoult, Tifenn Hirtzlin, and Damien Querlioz. Synaptic metaplasticity in binarized neural networks. Nature Communications, 12(1):2549, May 2021. ISSN 2041-1723. doi: 10.1038/s41467-021-22768-y. [URL https://doi.org/10.1038/](https://doi.org/10.1038/s41467-021-22768-y) [s41467-021-22768-y.](https://doi.org/10.1038/s41467-021-22768-y) Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, and Tom Goldstein. Visualizing the loss landscape of neural nets. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 31. Curran As[sociates, Inc., 2018. URL https://proceedings.neurips.cc/paper/2018/file/](https://proceedings.neurips.cc/paper/2018/file/a41b3bb3e6b050b6c9067c67f663b915-Paper.pdf) [a41b3bb3e6b050b6c9067c67f663b915-Paper.pdf.](https://proceedings.neurips.cc/paper/2018/file/a41b3bb3e6b050b6c9067c67f663b915-Paper.pdf) Antoine Maillard, Florent Krzakala, Marc Mรฉzard, and Lenka Zdeborovรก. Perturbative construction of mean-field equations in extensive-rank matrix factorization and denoising. arXiv preprint _arXiv:2110.08775, 2021._ Andre Manoel, Florent Krzakala, Marc Mรฉzard, and Lenka Zdeborovรก. Multi-layer generalized linear estimation. In 2017 IEEE International Symposium on Information Theory (ISIT), pp. 2098โ2102, 2017. doi: 10.1109/ISIT.2017.8006899. Michael McCloskey and Neal J Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of learning and motivation, volume 24, pp. 109โ165. Elsevier, 1989. Marc Mรฉzard. Mean-field message-passing equations in the hopfield model and its generalizations. 
_Physical Review E, 95(2):022117, 2017._ Marc Mรฉzard, Giorgio Parisi, and Miguel Angel Virasoro. Spin glass theory and beyond: An _Introduction to the Replica Method and Its Applications, volume 9. World Scientific Publishing_ Company, 1987. Thomas P. Minka. Expectation propagation for approximate bayesian inference. In Proceedings of _the Seventeenth Conference on Uncertainty in Artificial Intelligence, UAIโ01, pp. 362โ369, San_ Francisco, CA, USA, 2001. Morgan Kaufmann Publishers Inc. ISBN 1558608001. Marc Mรฉzard and Andrea Montanari. Information, Physics, and Computation. Oxford University Press, Inc., USA, 2009. ISBN 019857083X. Jason T Parker, Philip Schniter, and Volkan Cevher. Bilinear generalized approximate message passingโpart i: Derivation. IEEE Transactions on Signal Processing, 62(22):5839โ5853, 2014. Judea Pearl. Reverend Bayes on inference engines: A distributed hierarchical approach. Cognitive Systems Laboratory, School of Engineering and Applied Science ..., 1982. R. Peierls. On isingโs model of ferromagnetism. Mathematical Proceedings of the Cambridge _Philosophical Society, 32(3):477โ481, 1936. doi: 10.1017/S0305004100019174._ Fabrizio Pittorino, Carlo Lucibello, Christoph Feinauer, Gabriele Perugini, Carlo Baldassi, Elizaveta Demyanenko, and Riccardo Zecchina. Entropic gradient descent algorithms and wide flat minima. [In International Conference on Learning Representations, 2021. URL https://openreview.](https://openreview.net/forum?id=xjXg0bnoDmS) [net/forum?id=xjXg0bnoDmS.](https://openreview.net/forum?id=xjXg0bnoDmS) Sundeep Rangan, Philip Schniter, and Alyson K Fletcher. Vector approximate message passing. IEEE _Transactions on Information Theory, 65(10):6664โ6684, 2019._ ----- Rajesh P. N. Rao. Neural models of Bayesian belief propagation., pp. 239โ267. Bayesian brain: Probabilistic approaches to neural coding. MIT Press, Cambridge, MA, US, 2007. ISBN 026204238X (Hardcover); 978-0-262-04238-3 (Hardcover). Anthony Robins. Catastrophic forgetting, rehearsal and pseudorehearsal. Connection Science, 7(2): 123โ146, 1995. Victor Garcia Satorras and Max Welling. Neural enhanced belief propagation on factor graphs. In _International Conference on Artificial Intelligence and Statistics, pp. 685โ693. PMLR, 2021._ Daniel Soudry, Itay Hubara, and Ron Meir. Expectation backpropagation: Parameterfree training of multilayer neural networks with continuous or discrete weights. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K. Q. Weinberger (eds.), _Advances in Neural Information Processing Systems,_ volume 27. Curran Associates, Inc., 2014a. URL [https://proceedings.neurips.cc/paper/2014/file/](https://proceedings.neurips.cc/paper/2014/file/076a0c97d09cf1a0ec3e19c7f2529f2b-Paper.pdf) [076a0c97d09cf1a0ec3e19c7f2529f2b-Paper.pdf.](https://proceedings.neurips.cc/paper/2014/file/076a0c97d09cf1a0ec3e19c7f2529f2b-Paper.pdf) Daniel Soudry, Itay Hubara, and Ron Meir. Expectation backpropagation: Parameter-free training of multilayer neural networks with continuous or discrete weights. In NIPS, volume 1, pp. 2, 2014b. George Stamatescu, Federica Gerace, Carlo Lucibello, Ian Fuss, and Langford B. White. Critical [initialisation in continuous approximations of binary neural networks. 2020. URL https:](https://openreview.net/forum?id=rylmoxrFDH) [//openreview.net/forum?id=rylmoxrFDH.](https://openreview.net/forum?id=rylmoxrFDH) Yi-Lin Sung, Varun Nair, and Colin Raffel. Training neural networks with fixed sparse masks, 2021. Frederick Tung and Greg Mori. 
Clip-Q: Deep network compression learning by in-parallel pruning-quantization. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7873-7882, 2018. doi: 10.1109/CVPR.2018.00821.

Anqi Wu, Sebastian Nowozin, Edward Meeds, Richard E Turner, Jose Miguel Hernandez-Lobato, and Alexander L Gaunt. Deterministic variational inference for robust Bayesian neural networks. arXiv preprint arXiv:1810.03958, 2018.

Jonathan S. Yedidia, William T. Freeman, and Yair Weiss. Understanding Belief Propagation and Its Generalizations, pp. 239-269. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2003. ISBN 1558608117.

Lenka Zdeborová and Florent Krzakala. Statistical physics of inference: Thresholds and algorithms. Advances in Physics, 65(5):453-552, 2016.

Friedemann Zenke, Ben Poole, and Surya Ganguli. Continual learning through synaptic intelligence. In International Conference on Machine Learning, pp. 3987-3995. PMLR, 2017.

Qiuyun Zou, Haochuan Zhang, and Hongwen Yang. Multi-layer bilinear generalized approximate message passing. IEEE Transactions on Signal Processing, 69:4529-4543, 2021. doi: 10.1109/TSP.2021.3100305.

# Appendices

CONTENTS

**A BP-based message passing algorithms** **14**
A.1 Preliminary considerations 14
A.2 Derivation of the BP equations 15
A.3 BP equations 18
A.4 BPI equations 19
A.5 MF equations 20
A.6 Derivation of the AMP equations 21
A.7 AMP equations 22
A.8 Activation Functions 23
A.9 The ArgMax layer 23

**B Experimental details** **25**
B.1 Hyper-parameters of the BP-based scheme 25
B.2 Damping scheme for the message passing 25
B.3 Architectures 26
B.4 Varying the dataset 27
B.5 Local energy 27
B.6 SGD implementation (BinaryNet) 28
B.7 EBP implementation 28
B.8 Unit polarization and overlaps 29
B.9 Computational performance: varying batch-size 30

A BP-BASED MESSAGE PASSING ALGORITHMS

A.1 PRELIMINARY CONSIDERATIONS

Given a mini-batch $\mathcal{B} = \{(x_n, y_n)\}_n$, the factor graph defined by Eqs. (1, 2, 18) is explicitly written as:

$$P(\mathcal{W}, x^{1:L} \mid \mathcal{B}, \theta) \;\propto\; \prod_{\ell=0}^{L} \prod_{k,n} P^{\ell+1}\Big(x^{\ell+1}_{kn} \,\Big|\, \sum_i W^{\ell}_{ki}\, x^{\ell}_{in}\Big) \prod_{k,i,\ell} q_{\theta}(W^{\ell}_{ki}), \qquad (18)$$

where $x^{0}_{n} = x_n$ and $x^{L+1}_{n} = y_n$. The derivation of the BP equations for this model is straightforward albeit lengthy and involved.
It is obtained following the steps presented in multiple papers, books, and reviews, see for instance (Mézard & Montanari, 2009; Zdeborová & Krzakala, 2016; Mézard, 2017), although it has not been attempted before in deep neural networks. It should be noted that a (common) approximation that we take here with respect to the standard BP scheme is that messages are assumed to be Gaussian distributed and therefore parameterized by their mean and variance. This goes by the name of relaxed belief propagation (rBP), referred to simply as BP throughout the paper.

We derive the BP equations in A.2 and present them all together in A.3. From BP, we derive three other message passing algorithms useful for the deep network training setting, all of which are well known in the literature: BP-Inspired (BPI) message passing (A.4), mean-field (MF) (A.5), and approximate message passing (AMP) (A.7). The AMP derivation is the most involved and is given in A.6. In all these cases, message updates can be divided into a forward pass and a backward pass, as also done in Fletcher et al. (2018) in a multi-layer inference setting. The BP algorithm is compactly reported in Algorithm 1. In our notation, $\ell$ denotes the layer index, $\tau$ the BP iteration index, $k$ an output neuron index, $i$ an input neuron index, and $n$ a sample index. We report below, for convenience, some of the considerations also present in the main text.

**Meaning of messages.** All the messages involved in the message passing equations can be understood in terms of cavity marginals or full marginals (as mentioned in the introduction, BP is also known as the Cavity Method, see Mézard & Montanari (2009)). Of particular relevance are the quantities $m^{\ell}_{ki}$ and $\sigma^{\ell}_{ki}$, denoting the mean and variance of the weight $W^{\ell}_{ki}$. The quantities $\hat{x}^{\ell}_{in}$ and $\Delta^{\ell}_{in}$ instead denote the mean and variance of the $i$-th neuron's activation in layer $\ell$ in correspondence of an input $x_n$.

**Scalar free energies.** All message passing schemes can be expressed using the following scalar functions, corresponding to single-neuron and single-weight effective free energies respectively:

$$\varphi^{\ell}(B, A, \omega, V) = \log \int \mathrm{d}x\,\mathrm{d}z\; e^{-\frac{1}{2}Ax^2 + Bx}\, P^{\ell}(x \mid z)\, e^{-\frac{(\omega - z)^2}{2V}}, \qquad (19)$$

$$\psi(H, G, \theta) = \log \int \mathrm{d}w\; e^{-\frac{1}{2}Gw^2 + Hw}\, q_{\theta}(w). \qquad (20)$$

These free energies will naturally arise in the derivation of the BP equations in Appendix A.2. For the last layer, the neuron function has to be slightly modified:

$$\varphi^{L+1}(y, \omega, V) = \log \int \mathrm{d}z\; P^{L+1}(y \mid z)\, e^{-\frac{(\omega - z)^2}{2V}}. \qquad (21)$$

Notice that for common deterministic activations such as ReLU and sign, the function $\varphi$ has analytic and smooth expressions that we give in Appendix A.8. The same goes for $\psi$ when $q_{\theta}(w)$ is Gaussian (continuous weights) or a mixture of atoms (discrete weights). At the last layer we impose $P^{L+1}(y \mid z) = \mathbb{I}(y = \operatorname{sign}(z))$ in binary classification tasks. For multi-class classification instead, we have to adapt the formalism to vectorial pre-activations $z$ and assume $P^{L+1}(y \mid z) = \mathbb{I}(y = \arg\max(z))$ (see Appendix A.9). While in our experiments we use hard constraints for the final output, therefore solving a constraint satisfaction problem, it would be interesting to also consider generic loss functions. That would require minimal changes to our formalism, but it is beyond the scope of our work.

**Binary weights.** In our experiments we use $\pm 1$ weights in each layer.
Therefore each marginal can be parameterized by a single number and our prior/posterior takes the form _qฮธ(Wki[โ]_ [)][ โ] _[e][ฮธ]ki[โ]_ _[W][ โ]ki_ (22) The effective free energy function Eq. 20 becomes _ฯ(H, G, ฮธki[โ]_ [) = log 2 cosh(][H][ +][ ฮธ]ki[โ] [)] (23) and the messages G can be dropped from the message passing. **Start and end of message passing.** At the beginning of a new PasP iteration t, we reset the messages to zero and run message passing for ฯmax iterations. We then compute the new prior _qฮธt+1_ ( ) from the posterior given by the message passing iterations. _W_ A.2 DERIVATION OF THE BP EQUATIONS In order to derive the BP equations, we start with the following portion of the factor graph reported in Eq. 18 in the main text, describing the contribution of a single data example in the inner loop of the PasP updates: ----- _P_ _[โ][+1]_ _x[โ]kn[+1]_ _Wki[โ]_ _[x][โ]in_ where x[0]n [=][ x][n][,][ x]n[L][+1] = yn. (24) _โ=0_ _k_ _i_ where we recall that the quantity x[โ]kn [corresponds to the activation of neuron][ k][ in layer][ โ] [in corre-] spondence of the input example n. Let us start by analyzing the single factor: _P_ _[โ][+1]_ _x[โ]kn[+1]_ _Wki[โ]_ _[x][โ]in_ (25) We refer to messages that travel from input to output in the factor graph as upgoing or upwards messages, while to the ones that travel from output to input as downgoing or backwards messages. **Factor-to-variable-W messages** The factor-to-variable-W messages read: _ฮฝหkn[โ][+1]_ _ki[(][W][ โ]ki[)][ โ]_ _dฮฝki[โ]_ _[โฒ]_ _n[(][W][ โ]ki[โฒ]_ [)] _โ_ _โ_ Z Yi[โฒ]=ฬธ _i_ _dฮฝi[โ][โฒ]n_ _k[(][x]i[โ][โฒ]n[)][ dฮฝ][โ][(][x][โ]kn[+1][)][ P][ โ][+1]_ _โ_ _i[โฒ]_ Y _x[โ]kn[+1]_ _Wki[โ]_ _[โฒ]_ _[x]i[โ][โฒ]n_ _i[โฒ]_ ! X (26) where ฮฝ denotes the messages travelling downwards (from output to input) in the factor graph. _โ_ We denote the means and variances of the incoming messages respectively with m[โ]ki _n[,][ ห]x[โ]in_ _k_ [and] _โ_ _โ_ _ฯki[โ]_ _n[,][ โ][โ]in_ _k[:]_ _โ_ _โ_ _m[โ]ki_ _n_ [=] _dฮฝki[โ]_ _n[(][W][ โ]ki[)][ W][ โ]ki_ (27) _โ_ _โ_ Z _ฯki[โ]_ _n_ [=] _dฮฝki[โ]_ _n[(][W][ โ]ki[)]_ _Wki[โ]_ _ki_ _n_ 2 (28) _โ_ _โ_ _[โ]_ _[m][โ]_ _โ_ Z _xห[โ]in_ _k_ [=] _dฮฝin[โ]_ _k[(][x]in[โ]_ [)][ x]in[โ] (29) _โ_ _โ_ Z 2 โ[โ]in _k_ [=] _dฮฝin[โ]_ _k[(][x]in[โ]_ [)] _x[โ]in_ _x[โ]in_ _k_ (30) _โ_ _โ_ _[โ]_ [ห] _โ_ Z We now use the central limit theorem to observe that with respect to the incoming messages distributions - assuming independence of these messages - in the large input limit the preactivation is a Gaussian random variable: _Wki[โ]_ _[โฒ]_ _[x][โ]i[โฒ]n_ _kn_ _i[, V]kn[ โ]_ _i[)]_ (31) _iX[โฒ]=ฬธ_ _i_ _[โผN]_ [(][ฯ][โ] _โ_ _โ_ where: _ฯkn[โ]_ _โi_ [=][ E][ฮฝ] ๏ฃฎ _Wki[โ]_ _[โฒ]_ _[x][โ]i[โฒ]n๏ฃน_ = _m[โ]ki[โฒ]โn_ _x[ห][โ]i[โฒ]nโk_ (32) _i[โฒ]=ฬธ_ _i_ _iX[โฒ]=ฬธ_ _i_ ๏ฃฐ[X] ๏ฃป _Vkn[โ]_ _โi_ [=][ V ar][ฮฝ] ๏ฃฎ _Wki[โ]_ _[โฒ]_ _[x][โ]i[โฒ]n๏ฃน_ _i[โฒ]=ฬธ_ _i_ = _ฯ๏ฃฐki[โ][X][โฒ]_ _n_ [โ]i[โ][โฒ]n _k๏ฃป[+]_ _m[โ]ki[โฒ]_ _n_ 2 โโi[โฒ]n _k_ [+][ ฯ]ki[โ] _[โฒ]_ _n_ _xห[โ]i[โฒ]n_ _k_ 2[] (33) _โ_ _โ_ _โ_ _โ_ _โ_ _โ_ _i[โฒ]=i_ Xฬธ Therefore we can rewrite the outgoing messages as: ----- 2 _ki_ _[xโ]in[)]_ _ฮฝหkn[โ][+1]_ _i[(][W][ โ]ki[)][ โ]_ _dz dฮฝin[โ]_ _k[(][x][โ]in[)][ dฮฝ][โ][(][x][โ]kn[+1][)][ e][โ]_ [(][z][โ][ฯkn]2[โ]Vkn[i] _[โ]โ[W โ]i_ _P_ _[โ][+1]_ _x[โ]kn[+1]_ _โ_ _โ_ Z (34) We now assume Wki[โ] _[x]in[โ]_ [to be small compared to the other terms. 
With a second order Taylor] expansion we obtain: _ฮฝหkn[โ]_ _โi[(][W][ โ]ki[)][ โ]_ _dz dฮฝโ(x[โ]kn[+1][)][ e][โ]_ [(][z][โ]2Vkn[ฯkn]โ[โ]i[i][)][2] _P_ _[โ][+1]_ _x[โ]kn[+1]_ _[z]_ Z 1 + _[z][ โ]_ _[ฯ][kn][โ][i]_ _xห[โ]in_ _k[W][ โ]ki_ [+ (][z][ โ] _[ฯ][kn][โ][i][)][2][ โ]_ _[V][kn][โ][i]_ _ร_ _Vknโi_ _โ_ 2Vknโi Introducing now the function: โ+ _xห[โ]inโk_ 2[ ] _Wki[โ]_ 2 (35) _ฯ[โ](B, A, ฯ, V ) = log_ dx dz e[โ] 2[1] _[Ax][2][+][Bx]_ _P_ _[โ]_ (x|z) e[โ] [(][ฯ]2[โ]V[z][)2] (36) Z and defining: _gkn[โ]_ _i_ [=][ โ][ฯ][ฯ][โ][+1][(][B][โ][+1][, A][โ][+1][, ฯ]kn[โ] _i[, V]kn[ โ]_ _i[)]_ (37) _โ_ _โ_ _โ_ ฮ[โ]kn _i_ [=][ โ][โ]ฯ[2] _[ฯ][โ][+1][(][B][โ][+1][, A][โ][+1][, ฯ]kn[โ]_ _i[, V]kn[ โ]_ _i[)]_ (38) _โ_ _โ_ _โ_ the expansion for the log-message reads: log หฮฝkn[โ] _i[(][W][ โ]ki[)][ โ]_ _[const][ + ห]x[โ]in_ _k_ _[g]kn[โ]_ _i[W][ โ]ki_ _โ_ _โ_ _โ_ 2[] 2[ ] 2 โ[โ]in _k_ [+] _xห[โ]in_ _k_ ฮ[โ]kn _i_ _in_ _k_ _gkn[โ]_ _i_ _Wki[โ]_ (39) _โ_ [1]2 _โ_ _โ_ _โ_ _[โ]_ [โ][โ] _โ_ _โ_ **Factor-to-variable-x messages** The derivation of these messages is analogous to the factor-tovariable-W ones in Eq. 26 just reported. The final result for the log-message is: log หฮฝkn[โ] _i[(][x]in[โ]_ [)][ โ][const][ +][ m]ki[โ] _n_ _[g]kn[โ]_ _i[x]in[โ]_ _โ_ _โ_ _โ_ 2[] 2[ ] 2 _ฯki[โ]_ _n_ [+] _m[โ]ki_ _n_ ฮ[โ]kn _i_ _ki_ _n_ _gkn[โ]_ _i_ _x[โ]in_ (40) _โ_ [1]2 _โ_ _โ_ _โ_ _[โ]_ _[ฯ][โ]_ _โ_ _โ_ **Variable-W-to-output-factor messages** The message from variable Wki[โ] [to the output factor][ kn] reads: _ฮฝki[โ]_ _n[(][W][ โ]ki[)][ โ]_ _[P][ โ]ฮธki_ [(][W][ โ]ki[)][e] _n[โฒ]_ =ฬธ _n_ [log ห]ฮฝkn[โ] _[โฒ]_ _โi[(][W][ โ]ki[)]_ _โ_ P 2 _โ_ _Pฮธ[โ]ki_ [(][W][ โ]ki[)][e][H]ki[โ] _โn[W][ โ]ki[โ]_ [1]2 _[G]ki[โ]_ _โn[(][W][ โ]ki[)]_ (41) where we have defined: _Hki[โ]_ _n_ [=] _xห[โ]in[โฒ]_ _k_ _[g]kn[โ]_ _[โฒ]_ _i_ (42) _โ_ _โ_ _โ_ _nX[โฒ]=ฬธ_ _n_ 2[] 2[] _G[โ]ki_ _n_ [=] โ[โ]in[โฒ] _k_ [+] _xห[โ]in[โฒ]_ _k_ ฮ[โ]kn[โฒ] _i_ _in[โฒ]_ _k_ _gkn[โ]_ _[โฒ]_ _i_ (43) _โ_ _n[โฒ]=n_ _โ_ _โ_ _โ_ _[โ]_ [โ][โ] _โ_ _โ_ Xฬธ Introducing now the effective free energy: ----- _ฯ(H, G, ฮธ) = log_ dW Pฮธ[โ] [(][W] [)][ e][HW][ โ] 2[1] _[GW][ 2]_ (44) Z we can express the first two cumulants of the message ฮฝki[โ] _n[(][W][ โ]ki[)][ as:]_ _โ_ _m[โ]ki_ _n_ [=][ โ][H] _[ฯ][(][H]ki[โ]_ _n[, G][โ]ki_ _n[, ฮธ][ki][)]_ (45) _โ_ _โ_ _โ_ _ฯki[โ]_ _n_ [=][ โ]H[2] _[ฯ][(][H]ki[โ]_ _n[, G][โ]ki_ _n[, ฮธ][ki][)]_ (46) _โ_ _โ_ _โ_ **Variable-x-to-input-factor messages** We can write the downgoing message as: _ฮฝ_ (x[โ]in[)][ โ] _[e]_ _k_ [log ห]ฮฝkn[โ] _โi[(][x][โ]in[)]_ _โ_ P _โ_ _e[B]in[โ]_ _[x][โ]_ [1]2 _[A]in[โ]_ _[x][2]_ (47) where: _Bin[โ]_ [=] _m[โ]ki_ _n_ _[g]kn[โ]_ _i_ (48) _โ_ _โ_ _n_ X 2[] 2[] _A[โ]in_ [=] _ฯki[โ]_ _n_ [+] _m[โ]ki_ _n_ ฮ[โ]kn _i_ _ki_ _n_ _gkn[โ][+1]_ _i_ (49) _n_ _โ_ _โ_ _โ_ _[โ]_ _[ฯ][โ]_ _โ_ _โ_ X **Variable-x-to-output-factor messages** By defining the following cavity quantities: _Bin[โ]_ _k_ [=][ B]in[โ] _k_ _ki_ _n_ _[g]kn[โ]_ _i_ (50) _โ_ _โ_ _[โ]_ _[m][โ]_ _โ_ _โ_ 2[] 2[] _A[โ]in_ _k_ [=][ A]in[โ] _k_ _ฯki[โ]_ _n_ [+] _m[โ]ki_ _n_ ฮ[โ]kn _i_ _ki_ _n_ _gkn[โ]_ _i_ (51) _โ_ _โ_ _[โ]_ _โ_ _โ_ _โ_ _[โ]_ _[ฯ][โ]_ _โ_ _โ_ and the following non-cavity ones: _ฯkn[โ]_ [=] _Vkn[โ]_ [=] _m[โ]ki_ _n_ _x[ห][โ]in_ _k_ (52) _โ_ _โ_ _i_ _Vkn[โ]_ [=] _ฯki[โ]_ _n_ [โ]in[โ] _k_ [+] _m[โ]ki_ _n_ 2 โโin _k_ [+][ ฯ]ki[โ] _n_ _xห[โ]i[โฒ]n_ _k_ 2[] (53) _โ_ _โ_ _โ_ _โ_ _โ_ _โ_ _i_ X we can express the first 2 cumulants of the upgoing messages as: _xห[โ]in_ _k_ [=][ โ][B][ฯ][โ][(][B]in[โ] _k[, A][โ]in_ _k[, ฯ]in[โ][โ][1][, V]in[ โ][โ][1])_ 
(54) _โ_ _โ_ _โ_ โ[โ]in _k_ [=][ โ]B[2] _[ฯ][โ][(][B]in[โ]_ _k[, A][โ]in_ _k[, ฯ]in[โ][โ][1][, V]in[ โ][โ][1])_ (55) _โ_ _โ_ _โ_ **Wrapping it up** Additional but straightforward considerations are required for the final input and output layers (โ = 0 and โ = L respectively), since they do not receive messages from below and above respectively. In the end, thanks to independence assumptions and the central limit theorem that we used throughout the derivations, we arrive to a closed set of equations involving the means and the variances (or otherwise the corresponding natural parameters) of the messages. Within the same approximation assumption, we also replace the cavity quantities corresponding to variances with the non-cavity counterparts. Dividing the update equations in a forward and backward pass, and ordering them using time indexes in such a way that we have an efficient flow of information, we obtain the set of BP equations presented in the main text Eqs. (6-17) and in the Appendix Eqs. (60-71). A.3 BP EQUATIONS We report here the end result of the derivation in last section, the complete set of BP equations also presented in the main text as Eqs. (6-17). ----- **Initialization** At ฯ = 0: _Bin[โ,][0]_ _k_ [= 0] (56) _โ_ _A[โ,]in[0]_ [= 0] (57) _Hki[โ,][0]_ _n_ [= 0] (58) _โ_ _G[โ,]ki[0]_ [= 0] (59) **Forward Pass** At each ฯ = 1, . . ., ฯmax, for โ = 0, . . ., L: _xห[โ,ฯ]in_ _k_ [=][ โ][B][ฯ][โ][(][B]in[โ,ฯ] _[โ]k[1][, A]in[โ,ฯ]_ _[โ][1], ฯin[โ][โ][1][,ฯ]_ _, Vin[โ][โ][1][,ฯ]_ ) (60) _โ_ _โ_ โ[โ,ฯ]in [=][ โ]B[2] _[ฯ][โ][(][B]in[โ,ฯ]_ _[โ][1], A[โ,ฯ]in_ _[โ][1], ฯin[โ][โ][1][,ฯ]_ _, Vin[โ][โ][1][,ฯ]_ ) (61) _m[โ,ฯ]ki_ _n_ [=][ โ][H] _[ฯ][(][H]ki[โ,ฯ]_ _[โ]n[1][, G]ki[โ,ฯ]_ _[โ][1], ฮธki[โ]_ [)] (62) _โ_ _โ_ _ฯki[โ,ฯ]_ [=][ โ]H[2] _[ฯ][(][H]ki[โ,ฯ]_ _[โ][1], G[โ,ฯ]ki_ _[โ][1], ฮธki[โ]_ [)] (63) 2 2 _Vkn[โ,ฯ]_ [=] _m[โ,ฯ]ki_ โ[โ,ฯ]in [+][ ฯ]ki[โ,ฯ] _[โ][1]_ _xห[โ,ฯ]i[โฒ]n_ + ฯki[โ,ฯ] _[โ][1]โ[โ,ฯ]in_ (64) Xi _ฯkn[โ,ฯ]_ _i_ [=] _m[โ,ฯ]ki[โฒ]_ _n_ _x[ห][โ,ฯ]i[โฒ]n_ _k_ (65) _โ_ _โ_ _โ_ _iX[โฒ]=ฬธ_ _i_ In these equations for simplicity we abused the notation, in fact for the first layer หx[โ]n[=0][,ฯ] is fixed and given by the input xn while โ[โ]n[=0][,ฯ] = 0 instead. **Backward Pass** For ฯ = 1, . . ., ฯmax, for โ = L, . . ., 0 : _gkn[โ,ฯ]_ _i_ [=][ โ][ฯ][ฯ][โ][+1][(][B]kn[โ][+1][,ฯ] _, A[โ]kn[+1][,ฯ]_ _, ฯkn[โ,ฯ]_ _i[, V]kn[ โ,ฯ]_ [)] (66) _โ_ _โ_ ฮ[โ,ฯ]kn [=][ โ][โ]ฯ[2] _[ฯ][โ][+1][(][B]kn[โ][+1][,ฯ]_ _, A[โ]kn[+1][,ฯ]_ _, ฯkn[โ,ฯ]_ _[, V]kn[ โ,ฯ]_ [)] (67) 2 2[] _A[โ,ฯ]in_ [=] Xk m[โ,ฯ]ki + ฯki[โ,ฯ] ฮ[โ,ฯ]kn _[โ]_ _[ฯ]ki[โ,ฯ]_ gkn[โ,ฯ] (68) _Bin[โ,ฯ]_ _k_ [=] _m[โ,ฯ]k[โฒ]i_ _n_ _[g]k[โ,ฯ][โฒ]n_ _i_ (69) _โ_ _โ_ _โ_ _kX[โฒ]=ฬธ_ _k_ 2 2[] _G[โ,ฯ]ki_ [=] Xn xห[โ,ฯ]in + โ[โ,ฯ]in ฮ[โ,ฯ]kn _[โ]_ [โ]in[โ,ฯ] gkn[โ,ฯ] (70) _Hki[โ,ฯ]_ _n_ [=] _xห[โ,ฯ]in[โฒ]_ _k_ _[g]kn[โ,ฯ][โฒ]_ _i_ (71) _โ_ _โ_ _โ_ _nX[โฒ]=ฬธ_ _n_ In these equations as well we abused the notation: calling L the number of hidden neuron layers, when โ = L one should use ฯ[L][+1](y, ฯ, V ) from Eq. 21 instead of ฯ[L][+1](B, A, ฯ, V ). A.4 BPI EQUATIONS The BP-Inspired algorithm (BPI) is obtained as an approximation of BP replacing some cavity quantities with their non-cavity counterparts. What we obtain is a generalization of the single layer algorithm of Baldassi et al. (2007). 
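As a concrete illustration before listing the BPI updates, we give a minimal numerical sketch (not the reference implementation; the function and variable names are our own) of the two scalar free energies in the setting used throughout this work, i.e. binary $\pm 1$ weights and sign activations. The message-passing updates only ever need their low-order derivatives.

```python
# Minimal sketch of the scalar free energies for binary +-1 weights and sign
# activations; helper names are ours, not the paper's code.
from math import erfc, sqrt, log, cosh, tanh

def H(x):
    # Gaussian tail probability H(x) = P(z > x) for z ~ N(0, 1).
    return 0.5 * erfc(x / sqrt(2.0))

def weight_moments(Hfield, theta):
    # psi(H, theta) = log 2 cosh(H + theta); its first/second derivatives in H
    # give the weight mean m and variance sigma (the G message drops out).
    psi = log(2.0 * cosh(Hfield + theta))
    m = tanh(Hfield + theta)
    sigma = 1.0 - m ** 2
    return psi, m, sigma

def sign_neuron_moments(B, omega, V):
    # For x = sign(z) the neuron free energy reduces to a two-state sum,
    # phi = log[ e^{B} H(-omega/sqrt(V)) + e^{-B} H(omega/sqrt(V)) ] + const,
    # so the activation mean xhat = d_B phi is a tanh and Delta = 1 - xhat^2.
    Beff = B + 0.5 * log(max(H(-omega / sqrt(V)), 1e-300) /
                         max(H(omega / sqrt(V)), 1e-300))
    xhat = tanh(Beff)
    Delta = 1.0 - xhat ** 2
    return xhat, Delta
```

For Gaussian priors or other activations, the corresponding moments are obtained by differentiating the same free energies analogously.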
----- **Forward pass.** **Backward pass.** _xห[โ,ฯ]in_ = _โBฯ[โ]_ []Bin[โ,ฯ] _[โ][1], A[โ,ฯ]in_ _[โ][1], ฯin[โ][โ][1][,ฯ]_ _, Vin[โ][โ][1][,ฯ]_ (72) โ[โ,ฯ]in = _โB[2]_ _[ฯ][โ]_ []Bin[โ,ฯ] _[โ][1], A[โ,ฯ]in_ _[โ][1], ฯin[โ][โ][1][,ฯ]_ _, Vin[โ][โ][1][,ฯ]_ (73) _m[โ,ฯ]ki_ = _โH_ _ฯ(Hki[โ,ฯ]_ _[โ][1], G[โ,ฯ]ki_ _[โ][1], ฮธki[โ]_ [)] (74) _ฯki[โ,ฯ]_ = _โH[2]_ _[ฯ][(][H]ki[โ,ฯ]_ _[โ][1], G[โ,ฯ]ki_ _[โ][1], ฮธki[โ]_ [)] (75) 2 _Vkn[โ,ฯ]_ = _m[โ,ฯ]ki_ โ[โ,ฯ]in [+][ ฯ]ki[โ,ฯ] [(ห]x[โ,ฯ]in [)][2][ +][ ฯ]ki[โ,ฯ] [โ]in[โ,ฯ] (76) Xi _ฯkn[โ,ฯ]_ = _m[โ,ฯ]ki_ _x[ห][โ,ฯ]in_ (77) X _gkn[โ,ฯ]_ _i_ = _โฯฯ[โ][+1][ ]Bkn[โ][+1][,ฯ]_ _, A[โ]kn[+1][,ฯ]_ _, ฯkn[โ,ฯ]_ _ki_ _x[ห][โ,ฯ]ai_ _[, V]kn[ โ,ฯ]_ (78) _โ_ _[โ]_ _[m][โ,ฯ]_ ฮ[โ,ฯ]kn = _โโฯ[2]_ _[ฯ][โ][+1][ ]Bkn[โ][+1][,ฯ]_ _, A[โ]kn[+1][,ฯ]_ _, ฯkn[โ,ฯ]_ _[, V]kn[ โ,ฯ]_ (79) 2 _A[โ,ฯ]in_ = _k_ (m[โ,ฯ]ki [)][2][ +][ ฯ]ki[โ,ฯ] ฮ[โ,ฯ]kn _[โ]_ _[ฯ]ki[โ,ฯ]_ _gkn[โ,ฯ]_ (80) X _Bin[โ,ฯ]_ = _m[โ,ฯ]ki_ _[g]kn[โ,ฯ]_ _i_ (81) _โ_ _k_ X 2 _G[โ,ฯ]ki_ = _n_ (หx[โ,ฯ]in [)][2][ + โ]in[โ,ฯ] ฮ[โ,ฯ]kn _[โ]_ [โ]in[โ,ฯ] _gkn[โ,ฯ]_ (82) X _Hki[โ,ฯ]_ = _xห[โ,ฯ]in_ _[g]kn[โ,ฯ]_ _i_ (83) _โ_ _n_ X A.5 MF EQUATIONS The mean-field (MF) equations are obtained as a further simplification of BPI, using only non-cavity quantities. Although the simplification appears minimal at this point, we empirically observe a non-negligible discrepancy between the two algorithms in terms of generalization performance and computational time. **Forward pass.** _xห[โ,ฯ]in_ = _โBฯ[โ]_ []Bin[โ,ฯ] _[โ][1], A[โ,ฯ]in_ _[โ][1], ฯin[โ][โ][1][,ฯ]_ _, Vin[โ][โ][1][,ฯ]_ (84) โ[โ,ฯ]in = _โB[2]_ _[ฯ][โ]_ []Bin[โ,ฯ] _[โ][1], A[โ,ฯ]in_ _[โ][1], ฯin[โ][โ][1][,ฯ]_ _, Vin[โ][โ][1][,ฯ]_ (85) _m[โ,ฯ]ki_ = _โH_ _ฯ(Hki[โ,ฯ]_ _[โ][1], G[โ,ฯ]ki_ _[โ][1], ฮธki[โ]_ [)] (86) _ฯki[โ,ฯ]_ = _โH[2]_ _[ฯ][(][H]ki[โ,ฯ]_ _[โ][1], G[โ,ฯ]ki_ _[โ][1], ฮธki[โ]_ [)] (87) 2 _Vkn[โ,ฯ]_ = _m[โ,ฯ]ki_ โ[โ,ฯ]in [+][ ฯ]ki[โ,ฯ] [(ห]x[โ,ฯ]in [)][2][ +][ ฯ]ki[โ,ฯ] [โ]in[โ,ฯ] (88) Xi _ฯkn[โ,ฯ]_ = _m[โ,ฯ]ki_ _x[ห][โ,ฯ]in_ (89) X ----- **Backward pass.** _gkn[โ,ฯ]_ = _โฯฯ[โ][+1][ ]Bkn[โ][+1][,ฯ]_ _, A[โ]kn[+1][,ฯ]_ _, ฯkn[โ,ฯ]_ _[, V]kn[ โ,ฯ]_ (90) ฮ[โ,ฯ]kn = _โโฯ[2]_ _[ฯ][โ][+1][ ]Bkn[โ][+1][,ฯ]_ _, A[โ]kn[+1][,ฯ]_ _, ฯkn[โ,ฯ]_ _[, V]kn[ โ,ฯ]_ (91) 2 _A[โ,ฯ]in_ = _k_ (m[โ,ฯ]ki [)][2][ +][ ฯ]ki[โ,ฯ] ฮ[โ,ฯ]kn _[โ]_ _[ฯ]ki[โ,ฯ]_ _gkn[โ,ฯ]_ (92) X _Bin[โ,ฯ]_ = _m[โ,ฯ]ki_ _[g]kn[โ,ฯ]_ (93) _k_ X 2 _G[โ,ฯ]ki_ = _n_ (หx[โ,ฯ]in [)][2][ + โ]in[โ,ฯ] ฮ[โ,ฯ]kn _[โ]_ [โ]in[โ,ฯ] _gkn[โ,ฯ]_ (94) X _Hki[โ,ฯ]_ = _xห[โ,ฯ]in_ _[g]kn[โ,ฯ]_ (95) _n_ X A.6 DERIVATION OF THE AMP EQUATIONS In order to obtain the AMP equations, we approximate cavity quantities with non-cavity ones in the BP equations Eqs. (60-71) using a first order expansion. 
We start with the mean activation: _xห[โ,ฯ]inโk_ [=][โ][B][ฯ][โ][(][B]in[โ,ฯ] _[โ][1]_ _โ_ _m[โ,ฯ]kiโ[โ]n[1]_ _[g]kn[โ,ฯ]โ[โ][1]i_ _[, A]in[โ,ฯ]_ _[โ][1], ฯin[โ][โ][1][,ฯ]_ _, Vin[โ][โ][1][,ฯ]_ ) _โBฯ[โ](Bin[โ,ฯ]_ _[โ][1], A[โ,ฯ]in_ _[โ][1], ฯin[โ][โ][1][,ฯ]_ _, Vin[โ][โ][1][,ฯ]_ ) _โ_ _โ_ _m[โ,ฯ]kiโ[โ]n[1]_ _[g]kn[โ,ฯ]โ[โ][1]i_ _[โ]B[2]_ _[ฯ][โ][(][B]in[โ,ฯ]_ _[โ][1], A[โ,ฯ]in_ _[โ][1], ฯin[โ][โ][1][,ฯ]_ _, Vin[โ][โ][1][,ฯ]_ ) _xห[โ,ฯ]in_ _ki_ _gkn[โ,ฯ]_ _[โ][1]โ[โ,ฯ]in_ (96) _โ_ _[โ]_ _[m][โ,ฯ]_ _[โ][1]_ Analogously, for the weightโs mean we have: _m[โ,ฯ]kiโn_ [=][ โ][H] _[ฯ][(][H]ki[โ,ฯ]_ _[โ][1]_ _โ_ _xห[โ,ฯ]inโ[โ]k[1]_ _[g]kn[โ,ฯ]โ[โ][1]i_ _[, G][โ,ฯ]ki_ _[โ][1], ฮธki[โ]_ [)] _โ_ _โH_ _ฯ(Hki[โ,ฯ]_ _[โ][1], Gki[โ,ฯ]_ _[โ][1], ฮธki[โ]_ [)][ โ] _x[ห][โ,ฯ]inโ[โ]k[1]_ _[g]kn[โ,ฯ]โ[โ][1]i_ _[โ]H[2]_ _[ฯ][(][H]ki[โ,ฯ]_ _[โ][1], G[โ,ฯ]ki_ _[โ][1], ฮธki[โ]_ [)] _โ_ _m[โ,ฯ]ki_ _[โ]_ _x[ห][โ,ฯ]in_ _[โ][1]_ _gkn[โ,ฯ]_ _[โ][1]_ _ฯki[โ,ฯ]_ _[.]_ (97) This brings us to: _ฯkn[โ,ฯ]_ [=] _m[โ,ฯ]ki_ _n_ _x[ห][โ,ฯ]in_ _k_ _โ_ _โ_ _โ_ _i_ _m[โ,ฯ]ki_ _x[ห][โ,ฯ]in_ _[โ]_ _[g]kn[โ,ฯ]_ _[โ][1]_ _i_ _ฯki[โ,ฯ]_ _x[ห][โ,ฯ]in_ _x[ห][โ,ฯ]in_ _[โ][1]_ _โ_ _gkn[โ,ฯ]_ _[โ][1]_ _i_ _m[โ,ฯ]ki_ _[m]ki[โ,ฯ]_ _[โ][1]โ[โ,ฯ]in_ X X X + (gkn[โ,ฯ] _[โ][1])[2][ X]_ _ฯki[โ,ฯ]_ _[m]ki[โ,ฯ]_ _[โ][1]xห[โ,ฯ]in_ _[โ][1]โ[โ,ฯ]in_ (98) _i_ Let us now apply the same procedure to the other set of cavity messages: _gkn[โ,ฯ]_ _i_ [=][โ][ฯ][ฯ][โ][+1][(][B]kn[โ][+1][,ฯ] _, A[โ]kn[+1][,ฯ]_ _, ฯkn[โ,ฯ]_ _ki_ _n_ _x[ห][โ,ฯ]in_ _k[, V]kn[ โ,ฯ]_ [)] _โ_ _[โ]_ _[m][โ,ฯ]โ_ _โ_ _โโฯฯ[โ][+1](Bkn[โ][+1][,ฯ]_ _, A[โ]kn[+1][,ฯ]_ _, ฯkn[โ,ฯ]_ _[, V]kn[ โ,ฯ]_ [)] _โ_ _m[โ,ฯ]kiโn_ _x[ห][โ,ฯ]inโk[โ]ฯ[2]_ _[ฯ][โ][+1][(][B]kn[โ][+1][,ฯ]_ _, A[โ]kn[+1][,ฯ]_ _, ฯkn[โ,ฯ]_ _[, V]kn[ โ,ฯ]_ [)] _โgkn[โ,ฯ]_ [+][ m]ki[โ,ฯ] _x[ห][โ,ฯ]in_ [ฮ]kn[โ,ฯ] (99) ----- _Bin[โ,ฯ]_ [=] _โ_ _Hki[โ,ฯ]_ [=] _m[โ,ฯ]ki_ _n_ _[g]kn[โ,ฯ]_ _i_ _โ_ _โ_ _k_ X 2 _k_ _m[โ,ฯ]ki_ _[g]kn[โ,ฯ]_ _[โ]_ _x[ห]in_ _k_ _gkn[โ,ฯ]_ _ฯki[โ,ฯ]_ [+ ห]x[โ,ฯ]in _k_ (m[โ,ฯ]ki [)][2][ฮ]kn[โ,ฯ] X X X _โ_ (หx[โ,ฯ]in [)][2][ X] _ฯki[โ,ฯ]_ _[m]ki[โ,ฯ]_ _[g]kn[โ,ฯ]_ [ฮ]kn[โ,ฯ] (100) _k_ _xห[โ,ฯ]in_ _k_ _[g]kn[โ,ฯ]_ _i_ _โ_ _โ_ _xห[โ,ฯ]in_ _[g]kn[โ,ฯ]_ [+][ m]ki[โ,ฯ] 2 _[ m][โ,ฯ]ki_ _n_ _xห[โ,ฯ]in_ ฮ[โ,ฯ]kn _[โ]_ _[m]ki[โ,ฯ]_ _n_ (gkn[โ,ฯ] [)][2][โ]in[โ,ฯ] X X _gkn[โ,ฯ]_ [ฮ]kn[โ,ฯ] [โ]in[โ,ฯ] _x[ห][โ,ฯ]in_ (101) _n_ _n_ _n_ _โ_ (m[โ,ฯ]ki [)][2][ X] _gkn[โ,ฯ]_ [ฮ]kn[โ,ฯ] [โ]in[โ,ฯ] _x[ห][โ,ฯ]in_ _n_ We are now able to write down the full AMP equations, that we present in the next section. A.7 AMP EQUATIONS In summary, in the last section we derived the AMP algorithm as a closure of the BP messages passing over non-cavity quantities, relying on some statistical assumptions on messages and interactions. With respect to the MF message passing, we find some additional terms that go under the name of Onsager corrections. In-depth overviews of the AMP (also known as Thouless-Anderson-Palmer (TAP)) approach can be found in Refs. (Zdeborovรก & Krzakala, 2016; Mรฉzard, 2017; Gabriรฉ, 2020). The final form of the AMP equations for the multi-layer perceptron is given below. **Initialization** At ฯ = 0: _Bin[โ,][0]_ [= 0] (102) _Ain[โ,][0]_ [= 0] (103) _Hki[โ,][0]_ [= 0][ or some values] (104) _G[โ,]ki[0]_ [= 0][ or some values] (105) _gkn[โ,][0]_ [= 0] (106) **Forward Pass** At each ฯ = 1, . . ., ฯmax, for โ = 0, . . 
., L: _xห[โ,ฯ]in_ [=][โ][B][ฯ][โ][(][B]in[โ,ฯ] _[โ][1], A[โ,ฯ]in_ _[โ][1], ฯin[โ][โ][1][,ฯ]_ _, Vin[โ][โ][1][,ฯ]_ ) (107) โ[โ,ฯ]in [=][โ]B[2] _[ฯ][โ][(][B]in[โ,ฯ]_ _[โ][1], A[โ,ฯ]in_ _[โ][1], ฯin[โ][โ][1][,ฯ]_ _, Vin[โ][โ][1][,ฯ]_ ) (108) _m[โ,ฯ]ki_ [=][โ][H] _[ฯ][(][H]ki[โ,ฯ]_ _[โ][1], G[โ,ฯ]ki_ _[โ][1], ฮธki[โ]_ [)] (109) _ฯki[โ,ฯ]_ [=][โ]H[2] _[ฯ][(][H]ki[โ,ฯ]_ _[โ][1], G[โ,ฯ]ki_ _[โ][1], ฮธki[โ]_ [)] (110) 2 2 _Vkn[โ,ฯ]_ [=] _m[โ,ฯ]ki_ โ[โ,ฯ]in [+][ ฯ]ki[โ,ฯ] _xห[โ,ฯ]i[โฒ]n_ + ฯki[โ,ฯ] [โ]in[โ,ฯ] (111) Xi _ฯkn[โ,ฯ]_ [=] _i_ _m[โ,ฯ]ki_ _x[ห][โ,ฯ]in_ _[โ]_ _[g]kn[โ,ฯ]_ _[โ][1]_ _i_ _ฯki[โ,ฯ]_ _x[ห][โ,ฯ]in_ _x[ห][โ,ฯ]in_ _[โ][1]_ _โ_ _gkn[โ,ฯ]_ _[โ][1]_ _i_ _m[โ,ฯ]ki_ _[m]ki[โ,ฯ]_ _[โ][1]โ[โ,ฯ]in_ X X X + (gkn[โ,ฯ] _[โ][1])[2][ X]_ _ฯki[โ,ฯ]_ _[m]ki[โ,ฯ]_ _[โ][1]xห[โ,ฯ]in_ _[โ][1]โ[โ,ฯ]in_ (112) _i_ ----- **Backward Pass** _gkn[โ,ฯ]_ [=][โ][ฯ][ฯ][โ][+1][(][B]kn[โ][+1][,ฯ] _, A[โ]kn[+1][,ฯ]_ _, ฯkn[โ,ฯ]_ _i[, V]kn[ โ,ฯ]_ [)] (113) _โ_ ฮ[โ,ฯ]kn [=][ โ] _[โ]ฯ[2]_ _[ฯ][โ][+1][(][B]kn[โ][+1][,ฯ]_ _, A[โ]kn[+1][,ฯ]_ _, ฯkn[โ,ฯ]_ _[, V]kn[ โ,ฯ]_ [)] (114) 2 2[] _A[โ,ฯ]in_ [=] Xk m[โ,ฯ]ki + ฯki[โ,ฯ] ฮ[โ,ฯ]kn _[โ]_ _[ฯ]ki[โ,ฯ]_ gkn[โ,ฯ] (115) 2 _Bin[โ,ฯ]_ [=] _k_ _m[โ,ฯ]ki_ _[g]kn[โ,ฯ]_ _[โ]_ _x[ห]in_ _k_ _gkn[โ,ฯ]_ _ฯki[โ,ฯ]_ [+ ห]x[โ,ฯ]in _k_ (m[โ,ฯ]ki [)][2][ฮ]kn[โ,ฯ] X X X _โ_ (หx[โ,ฯ]in [)][2][ X] _ฯki[โ,ฯ]_ _[m]ki[โ,ฯ]_ _[g]kn[โ,ฯ]_ [ฮ]kn[โ,ฯ] (116) _k_ 2 2[] _G[โ,ฯ]ki_ [=] Xn xห[โ,ฯ]in + โ[โ,ฯ]in ฮ[โ,ฯ]kn _[โ]2_ [โ]in[โ,ฯ] gkn[โ,ฯ] (117) _Hki[โ,ฯ]_ [=] _n_ _xห[โ,ฯ]in_ _[g]kn[โ,ฯ]_ [+][ m]ki[โ,ฯ] _n_ _xห[โ,ฯ]in_ ฮ[โ,ฯ]kn _[โ]_ _[m]ki[โ,ฯ]_ _n_ (gkn[โ,ฯ] [)][2][โ]in[โ,ฯ] X X X (m[โ,ฯ]ki [)][2][ X] _gkn[โ,ฯ]_ [ฮ]kn[โ,ฯ] [โ]in[โ,ฯ] _x[ห][โ,ฯ]in_ (118) _โ_ _n_ A.8 ACTIVATION FUNCTIONS A.8.1 SIGN In most of our experiments we use sign activations in each layer. With this choice, the neuronโs free energy 19 takes the form + [1]2 [log(2][ฯV][ )][,] (119) [๏ฃถ] ๏ฃธ (120) ๏ฃซ 2 ๏ฃญ [1] _e[Bx]_ _H_ _โ_ _โ[xฯ]V_ _x_ 1,+1 _โ{โX_ _}_ _ฯ(B, A, ฯ, V ) = log_ where where _x_ = [1] _._ _H_ 2 [erfc] _โ2_ Notice that for sign activations the messages A can be dropped. A.8.2 RELU For ReLU (x) = max(0, x) activations the free energy 19 becomes _ฯ(B, A, ฯ, V ) =_ dxdz e[โ] 2[1] _[Ax][2][+][Bx]_ _ฮด(x โ_ max(0, z)) e[โ] [(][ฯ]2[โ]V[z][)2] (121) Z _ฯ_ _A_ [)] _BV + ฯ_ = log _H_ + _H_ + [1] _โV_ _[N]_ [(]A[ฯ][;][ B/A, V](B; 0, A[ +])[ 1] _โ_ _โV + AV_ [2] 2 [log(2][ฯV][ )][,] _N_ (122) where A.9 THE ARGMAX LAYER 1 _e[โ]_ [(][x][โ]2ฮฃ[ยต][)2] _._ (123) 2ฯฮฃ _N_ (x; ยต, ฮฃ) = In order to perform multi-class classification, we have to perform an argmax operation on the last layer of the neural network. Call zk, for k = 1, . . ., K, the Gaussian random variables output of the last layer of the network in correspondence of some input x. Assuming the correct label is class k[โ], the effective partition function Zkโ corresponding to the output constraint reads: ----- 25 20 25 20 15 10 15 10 |Col1|Col2|BP|train|BP tes|t| |---|---|---|---|---|---| ||||||| ||||||| ||||||| ||||||| |Col1|Col2|Col3|BP|train|BP te|st| |---|---|---|---|---|---|---| |||||||| |||||||| |||||||| |||||||| BP train BP test BP train BP test 20 40 60 80 100 epochs 20 40 60 80 100 epochs Figure 4: MLP with 2 hidden layers with 101 hidden units each, batch-size 128 on the FashionMNIST dataset. In the first two layers we use the BP equations, while in the last layer the ArgMax ones. (Left) ArgMax layer first version; (Right) ArgMax layer second version. 
Even if it is possible to reach similar accuracies with the two versions, we decide to use the first one as it is simpler to use.

$$Z_{k^*} = \int \prod_k \mathrm{d}z_k\, \mathcal{N}(z_k; \omega_k, V_k) \prod_{k \neq k^*} \Theta(z_{k^*} - z_k) \qquad (124)$$

$$\phantom{Z_{k^*}} = \int \mathrm{d}z_{k^*}\, \mathcal{N}(z_{k^*}; \omega_{k^*}, V_{k^*}) \prod_{k \neq k^*} H\!\Big(-\tfrac{z_{k^*} - \omega_k}{\sqrt{V_k}}\Big) \qquad (125)$$

Here $\Theta(x)$ is the Heaviside indicator function and we used the definition of $H$ from Eq. 120. The integral on the last line cannot be expressed analytically, therefore we have to resort to approximations.

A.9.1 APPROACH 1: JENSEN INEQUALITY

Using the Jensen inequality we obtain:

$$\phi_{k^*} = \log Z_{k^*} = \log \mathbb{E}_{z \sim \mathcal{N}(\omega_{k^*}, V_{k^*})} \prod_{k \neq k^*} H\!\Big(-\tfrac{z - \omega_k}{\sqrt{V_k}}\Big) \qquad (126)$$

$$\phantom{\phi_{k^*}} \geq \mathbb{E}_{z \sim \mathcal{N}(\omega_{k^*}, V_{k^*})} \sum_{k \neq k^*} \log H\!\Big(-\tfrac{z - \omega_k}{\sqrt{V_k}}\Big) \qquad (127)$$

Reparameterizing the expectation we have:

$$\hat{\phi}_{k^*} = \mathbb{E}_{\epsilon \sim \mathcal{N}(0,1)} \sum_{k \neq k^*} \log H\!\Big(-\tfrac{\omega_{k^*} + \epsilon\sqrt{V_{k^*}} - \omega_k}{\sqrt{V_k}}\Big) \qquad (128)$$

The derivatives $\partial_{\omega_k} \hat{\phi}_{k^*}$ and $\partial^2_{\omega_k} \hat{\phi}_{k^*}$ that we need can then be estimated by sampling (once) $\epsilon$:

$$\partial_{\omega_k} \hat{\phi}_{k^*} =
\begin{cases}
-\dfrac{1}{\sqrt{V_k}}\, \mathbb{E}_{\epsilon \sim \mathcal{N}(0,1)}\, K\!\Big(-\tfrac{\omega_{k^*} + \epsilon\sqrt{V_{k^*}} - \omega_k}{\sqrt{V_k}}\Big) & k \neq k^* \\[2ex]
\displaystyle\sum_{k' \neq k^*} \dfrac{1}{\sqrt{V_{k'}}}\, \mathbb{E}_{\epsilon \sim \mathcal{N}(0,1)}\, K\!\Big(-\tfrac{\omega_{k^*} + \epsilon\sqrt{V_{k^*}} - \omega_{k'}}{\sqrt{V_{k'}}}\Big) & k = k^*
\end{cases} \qquad (129)$$

where we have defined:

$$K(x) = \frac{\mathcal{N}(x)}{H(x)} = \frac{\sqrt{2/\pi}}{\operatorname{erfcx}(x/\sqrt{2})} \qquad (130)$$

A.9.2 APPROACH 2: JENSEN AGAIN

A further simplification is obtained by applying the Jensen inequality again to 128, but in the opposite direction; we therefore renounce having a bound and look only for an approximation. We have the new effective free energy:

$$\hat{\phi}_{k^*} = \sum_{k \neq k^*} \log \mathbb{E}_{\epsilon \sim \mathcal{N}(0,1)}\, H\!\Big(-\tfrac{\omega_{k^*} + \epsilon\sqrt{V_{k^*}} - \omega_k}{\sqrt{V_k}}\Big) \qquad (131)$$

$$\phantom{\hat{\phi}_{k^*}} = \sum_{k \neq k^*} \log H\!\Big(-\tfrac{\omega_{k^*} - \omega_k}{\sqrt{V_k + V_{k^*}}}\Big) \qquad (132)$$

This gives:

$$\partial_{\omega_k} \hat{\phi}_{k^*} =
\begin{cases}
-\dfrac{1}{\sqrt{V_k + V_{k^*}}}\, K\!\Big(-\tfrac{\omega_{k^*} - \omega_k}{\sqrt{V_k + V_{k^*}}}\Big) & k \neq k^* \\[2ex]
\displaystyle\sum_{k' \neq k^*} \dfrac{1}{\sqrt{V_{k'} + V_{k^*}}}\, K\!\Big(-\tfrac{\omega_{k^*} - \omega_{k'}}{\sqrt{V_{k'} + V_{k^*}}}\Big) & k = k^*
\end{cases} \qquad (133)$$

Notice that $\partial_{\omega_{k^*}} \hat{\phi}_{k^*} = -\sum_{k \neq k^*} \partial_{\omega_k} \hat{\phi}_{k^*}$. In the last formulas we used the definition of $K$ in Eq. 130.

We show in Fig. 4 the negligible difference between the two ArgMax versions when using BP on the layers before the last one (which performs only the ArgMax).

B EXPERIMENTAL DETAILS

B.1 HYPER-PARAMETERS OF THE BP-BASED SCHEME

We include here a complete list of the hyper-parameters present in the BP-based algorithms. Notice that, as for SGD-type algorithms, many of them can be fixed, or a prescription for their value can be found that works in most cases. However, we expect future research to find even more effective values of the hyper-parameters, in the same way it has been done for SGD. These hyper-parameters are: the mini-batch size bs; the parameter $\rho$ (that has to be tuned similarly to the learning rate in SGD); the damping parameter $\alpha$ (that performs a running smoothing of the BP fields along the dynamics by adding a fraction of the field at the previous iteration, see Eqs. (134, 135)); the initialization coefficient $\epsilon$ that we use to sample the parameters of our prior distribution $q_{\theta}(\mathcal{W})$ according to $\theta^{\ell, t=0}_{ki} \sim \epsilon\, \mathcal{N}(0, 1)$ (different choices of $\epsilon$ correspond to different initial distributions of the weights' magnetization $m^{\ell}_{ki} = \tanh(\theta^{\ell}_{ki})$, as shown in Fig. 5); the number of internal steps of reinforcement $\tau_{\max}$ and the associated intensity of the internal reinforcement $r$. The performances of the BP-based algorithms are robust in a reasonable range of these hyper-parameters. A minimal sketch of this initialization is given below.
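As an illustration only (NumPy, our own function and variable names, hypothetical layer shapes; not the reference implementation), the prior fields and the induced initial magnetizations of Fig. 5 can be set up as follows.

```python
# Illustrative sketch: the initialization coefficient eps scales Gaussian prior
# fields theta ~ eps * N(0, 1); the induced initial magnetizations are tanh(theta).
import numpy as np

def init_prior_fields(layer_shapes, eps=1.0, seed=0):
    rng = np.random.default_rng(seed)
    thetas = [eps * rng.standard_normal(shape) for shape in layer_shapes]
    mags = [np.tanh(theta) for theta in thetas]  # initial weight magnetizations
    return thetas, mags

# Example (hypothetical shapes): a 784-101-101-1 binary MLP.
# Larger eps makes the initial distribution of the magnetizations more bimodal.
thetas, mags = init_prior_fields([(101, 784), (101, 101), (1, 101)], eps=1.0)
```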
A more principled choice of a good initialization condition could be made by adapting the technique from Stamatescu et al. (2020). Notice that among these parameters, the BP dynamics at each layer is mostly sensitive to $\rho$ and $\alpha$, so that in general we consider them layer-dependent. See Sec. B.8 for details on the effect of these parameters on the learning dynamics and on layer polarization (i.e. how the BP dynamics tends to bias the weights towards a single point-wise configuration with high probability). Unless otherwise stated we fix some of the hyper-parameters, in particular: bs = 128 (results are consistent with other values of the batch-size, from bs = 1 up to bs = 1024 in our experiments), $\epsilon = 1.0$, $\tau_{\max} = 1$, $r = 0$.

B.2 DAMPING SCHEME FOR THE MESSAGE PASSING

We use a damping parameter $\alpha \in (0, 1)$ to stabilize the training, changing the update rule for the weights' means as follows:

$$\hat{m}^{\ell,\tau}_{ki} = \partial_H \psi(H^{\ell,\tau-1}_{ki}, G^{\ell,\tau-1}_{ki}, \theta^{\ell}_{ki}), \qquad (134)$$

$$m^{\ell,\tau}_{ki} = \alpha\, m^{\ell,\tau-1}_{ki} + (1 - \alpha)\, \hat{m}^{\ell,\tau}_{ki}. \qquad (135)$$
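A minimal sketch of this damped update (our own names; not the reference code), written for the binary-weight case in which the undamped estimate of Eq. (134) reduces to a tanh and the $G$ message drops out:

```python
# Damped update of the weight means, Eqs. (134-135), binary +-1 weight case.
import numpy as np

def damped_mean_update(m_prev, H, theta, alpha=0.8):
    m_hat = np.tanh(H + theta)                      # Eq. (134) for binary weights
    return alpha * m_prev + (1.0 - alpha) * m_hat   # Eq. (135)
```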
4.5) we fixed ฯ = 1 and ฮฑ = 0.8 on each layer as we empirically observed that polarizing the last layer helps mitigating the forgetting while leaving the single-task performances almost unchanged. In Fig. 6 we report training curves on architectures different from the ones reported in the main paper. ----- 25 20 25 20 15 10 15 10 |BP BP|train MF train test MF test| |---|---| |A A|MP train BinaryNet train MP test BinaryNet test| ||| ||| ||| |Col1|BP train MF train BP test MF test| |---|---| ||AMP train BinaryNet train AMP test BinaryNet test| ||| ||| ||| BP train MF train BP test MF test AMP train BinaryNet train AMP test BinaryNet test BP train MF train BP test MF test AMP train BinaryNet train AMP test BinaryNet test 20 40 60 80 100 epochs 20 40 60 80 100 epochs Figure 6: Training curves of message passing algorithms compared with BinaryNet on the FashionMNIST dataset (multi-class classification) with a binary MLP with 3 hidden layers of 501 units. (Right) The batch-size is 128 and curves are averaged over 5 realizations of the initial conditions B.4 VARYING THE DATASET When varying the dataset (see Sec. 4.3), all simulation of the BP-based algorithms use a number of internal reinforcement iterations ฯmax = 1. Learning is performed on the totality of the training dataset, the batch-size is bs = 128, the initialization coefficient is ฯต = 1.0. For all datasets (MNIST (2 classes), FashionMNIST (2 classes), CIFAR-10 (2 classes), MNIST, FashionMNIST, CIFAR-10) and all algorithms (BP, AMP, MF) we use ฯ = (1.0001, 1.0001, 0.9) and ฮฑ = 0.8 for each layer. Using in the first layers values of ฯ = 1 + ฯต with ฯต โฅ 0 and sufficiently small typically leads to good results. For the BinaryNet simulations, the learning rate is lr = 10.0 (both for binary classification and multi-class classification), giving the better performance among the learning rates we have tested, _lr = 100, 10, 1, 0.1, 0.001. In Tab. 2 we report the final train errors obtained on the different datasets._ Dataset BinaryNet BP AMP MF **MNIST (2 classes)** 0.05 ยฑ 0.05 0.0 ยฑ 0.0 0.0 ยฑ 0.0 0.0 ยฑ 0.0 **FashionMNIST (2 classes)** 0.3 ยฑ 0.1 0.06 ยฑ 0.01 0.06 ยฑ 0.01 0.09 ยฑ 0.01 **CIFAR10 (2 classes)** 1.2 ยฑ 0.5 0.37 ยฑ 0.01 0.4 ยฑ 0.1 0.9 ยฑ 0.2 **MNIST** 0.09 ยฑ 0.01 0.12 ยฑ 0.01 0.12 ยฑ 0.01 0.03 ยฑ 0.01 **FashionMNIST** 4.0 ยฑ 0.5 3.4 ยฑ 0.1 3.7 ยฑ 0.1 2.5 ยฑ 0.2 **CIFAR10** 13.0 ยฑ 0.9 4.7 ยฑ 0.1 4.7 ยฑ 0.2 9.2 ยฑ 0.5 Table 2: Train error (%) on Fashion-MNIST of a multilayer perceptron with 2 hidden layers of 501 units each for BinaryNet (baseline), BP, AMP and MF. All algorithms are trained with batch-size 128 and for 100 epochs. Mean and standard deviations are calculated over 5 random initializations. B.5 LOCAL ENERGY We adapt the notion of flatness used in (Jiang et al., 2020; Pittorino et al., 2021), that we call local energy, to configurations with binary weights. Given a weight configuration w โ{ยฑ1}[N], we define the local energy ฮดEtrain(w, p) as the average difference in training error Etrain(w) when perturbing w by flipping a random fraction p of its elements: _ฮดEtrain(w, p) = Ez Etrain(w โ_ **_z) โ_** _Etrain(w),_ (136) where โ denotes the Hadamard (element-wise) product and the expectation is over i.i.d. entries for z equal to โ1 with probability p and to +1 with probability 1 โ _p. We report the resulting local energy_ profiles (in a range [0, pmax]) in Fig. 7 right panel for BP and BinaryNet. 
The relative error grows ----- slowly when perturbing the trained configurations (notice the convexity of the curves). This shows that both BP-based and SGD-based algorithms find configurations that lie in relatively flat minima in the energy landscape. The same qualitative phenomenon holds for different architectures and datasets. 0.30 0.25 0.20 0.15 0.10 0.05 0.00 |Col1|Col2|Col3|Col4|Col5|Col6|Col7|Col8|Col9| |---|---|---|---|---|---|---|---|---| ||Bi BP|naryNet||||||| |||||||||| |||||||||| |||||||||| |||||||||| |||||||||| |||||||||| |||||||||| 0.00 0.05 0.10 0.15 0.20 0.25 0.30 0.35 flip probability p Figure 7: Local energy curve of the point-wise configuration found by the BP algorithm compared with BinaryNet on a MLP with 2 hidden layers of 101 units on the 2-class MNIST dataset. B.6 SGD IMPLEMENTATION (BINARYNET) We compare the BP-based algorithms with SGD training for neural networks with binary weights and activations as introduced in BinaryNet (Hubara et al., 2016). This procedure consists in keeping a continuous version of the parameters w which is updated with the SGD rule, with the gradient calculated on the binarized configuration wb = sign(w). At inference time the forward pass is calculated with the parameters wb. The backward pass with binary activations is performed with the so called straight-through estimator. Our implementation presents some differences with respect to the original proposal of the algorithm in (Hubara et al., 2016), in order to keep the comparison as fair as possible with the BP-based algorithms, in particular for what concerns the number of parameters. We do not use biases nor batch normalization layers, therefore in order to keep the pre-activations of each hidden layer normalized we rescale them by _โ1N_ [where][ N][ is the size of the previous layer (or the input size in the case of the] pre-activations afferent to the first hidden layer). The standard SGD update rule is applied (instead of Adam), and we use the binary cross-entropy loss. Clipping of the continuous configuration w in [โ1, 1] is applied. We use Xavier initialization (Glorot & Bengio, 2010) for the continuous weights. In Fig.2 of the main paper, we apply the Adam optimization rule, noticing that it performs slightly better in train and test generalization performance compared to the pure SGD one. B.7 EBP IMPLEMENTATION Expectation Back Propagation (EBP) Soudry et al. (2014b) is parameter-free Bayesian algorithm that uses a mean-field (MF) approximation (fully factorized form for the posterior) in an online environment to estimate the Bayesian posterior distribution after the arrival of a new data point. The main differences between EBP and our approach relies in the approximation for the posterior distribution. Moreover we explicitly base the estimation of the marginals on the local high entropy structure. The fact that EBP works has no clear explanation: certainly it cannot be that the MF assumption holds for multi-layer neural networks. Still, itโs certainly very interesting that it works. 
We argue that it might work precisely by virtue of the existence of high local entropy minima and ----- |Col1|BP layer1| |---|---| ||AMP layer1| ||MF layer1| |Col1|Col2| |---|---| ||BP layer1 AMP layer1 MF layer1| ||| |Col1|Col2| |---|---| ||BP layer2 AMP layer2| ||MF layer2| |Col1|BP layer2| |---|---| ||AMP layer2| ||MF layer2| |BP tr BP te AMP t|ain MF train st MF test rain BinaryNet train| |---|---| |AMP t|est BinaryNet test| ||| ||| ||| |Col1|BP layer3 AMP layer3| |---|---| ||MF layer3| |Col1|BP layer3 AMP layer3| |---|---| ||MF layer3| 25 1.00 BP train MF train 0.010 BP layer1 20 BP testAMP train MF testBinaryNet train 0q 0.750.50 BP layer1AMP layer1MF layer1 _qab_ 0.005 AMP layer1MF layer1 AMP test BinaryNet test 0 50 100 0 50 100 epochs epochs 15 1.00 0.0050 BP layer2 0q 0.75 BP layer2AMP layer2 _qab_ 0.0025 AMP layer2MF layer2 error (%) 10 0.50 MF layer2 0 50 100 0 50 100 epochs epochs 5 0.4 BP layer3 0.04 BP layer3 AMP layer3 AMP layer3 0q 0.2 MF layer3 _qab_ 0.02 MF layer3 0 0 20 40 60 80 100 0 50 100 0.00 0 50 100 epochs epochs epochs Figure 8: (Right panels) Polarizations _q0_ and overlaps _qab_ on each layer of a MLP with 2 hidden _โจ_ _โฉ_ _โจ_ _โฉ_ layers of 501 units on the Fashion-MNIST dataset (multi-class), the batch-size is bs = 128. (Right) Corresponding train and test error curves. expect it to give similar performance to the MF case of our algorithm. The online iteration could in fact be seen as way of implementing a reinforcement. We implemented the EBP code along the lines of the original matlab implementation (https://github.com/ExpectationBackpropagation/EBP_Matlab_Code). In order to perform a fair comparison we removed the biases both in the binary and continuous weights versions. It is worth noticing that we faced numerical issues in training with a moderate to big batchsize All the experiments were consequently limited to a batchsize of 10 patterns B.8 UNIT POLARIZATION AND OVERLAPS We define the self-overlap or polarization of a given unit a as q0[a] [=] _N1_ _i[(][w]i[a][)][2][, where][ N][ is the]_ number of parameters of the unit and _wi[a]_ _i=1_ [its weights. It quantifies how much the unit is polarized] towards a unique point-wise binary configuration ( { _[}][N]_ _q0[a]_ [= 1][ corresponding to high confidence in a]P given configurations while q0[a] [= 0][ to low). The overlap between two units][ a][ and][ b][ (considered in the] same layer) is qab = _N[1]_ _wia[w]i[b][. The number of parameters][ N][ is the same for units belonging to the]_ same fully connected layer. We denote byP _โจq0โฉ_ = _Nout1_ _Na=1out_ _[q]0[a]_ [and][ โจ][q][ab][โฉ] [=] _Nout1_ _Na<bout_ _[q][ab][ the]_ mean polarization and mean overlap in a given layer (where Nout is the number of units in the layer). P P The parameters ฯ and ฮฑ govern the dynamical evolution of the polarization of each layer during training. A value ฯ โช 1 has the effect to progressively increase the units polarization during training, while ฯ < 1 disfavours it. The damping ฮฑ which takes values in [0, 1] has the effect to slow the dynamics by a smoothing process (the intensity of which depends on the value of ฮฑ), generically favoring convergence. Given the nature of the updates in Algorithm 1, each layer presents its own dynamics given the values of ฯโ and ฮฑโ at layer โ, that in general can differ from each other. We find that it is is beneficial to control the polarization layer-per-layer, see Fig. 8 for the corresponding typical behavior of the mean polarization and the mean overlaps during training. 
Empirically, we have found that (as we could expect) when training is successful the layers polarize progressively towards q0 = 1, i.e. towards a precise point-wise solution, while the overlaps between units in each hidden layer are such that qab 1 (indicating low symmetry between intra-layer units, as expected for a non-redundant solution). To this aim, in most cases โช _ฮฑโ_ can be the same for each layer, while tuning ฯโ for each layer allows to find better generalization performances in some cases (but is not strictly necessary for learning). In particular, it is possible to use the same value ฯโ for each layer before the last one (โ< L where L is the number of layers in the network), while we have found that the last layer tends to ----- polarize immediately during the dynamics (probably due to its proximity to the output constraints). Empirically, it is usually beneficial for learning that this layer does not or only slightly polarize, i.e. _โจq0โฉโช_ 1 (this can be achieved by imposing ฯL < 1). Learning is anyway possible even when the last layer polarizes towards โจq0โฉ = 1 along the dynamics, i.e. by choosing ฯL sufficiently large. As a simple general prescription in most experiments we can fix ฮฑ = 0.8 and ฯL = 0.9, therefore leaving ฯโ<L as the only hyper-parameter to be tuned, akin to the learning rate in SGD. Its value has to be very close to 1.0 (a value smaller than 1.0 tends to depolarize the layers, without focusing on a particular point-wise binary configuration, while a value greater than 1.0 tends to lead to numerical instabilities and parametersโ divergence). B.9 COMPUTATIONAL PERFORMANCE: VARYING BATCH-SIZE In order to compare the time performances of the BP-based algorithms with our implementation of BinaryNet, we report in Fig. 9 the time in seconds taken by a single epoch of each algorithm in function of the batch-size, on a MLP of 2 layers of 501 units on Fashion-MNIST. We test both algorithms on a NVIDIA GeForce RTX 2080 Ti GPU. Multi-class and binary classification present a very similar time scaling with the batch-size, in both cases comparable with BinaryNet. Let us also notice that BP-based algorithms are able to reach generalization performances comparable to BinaryNet for all the values of the batch-size reported in this section. 10[1] 10[0] 10[0] 10[1] 10[2] 10[3] |Col1|Col2|Col3|Col4|Col5| |---|---|---|---|---| ||||BP BPI AMP MF|| ||||BinaryNet|| |||||| BP BPI AMP MF BinaryNet ### batch size Figure 9: Algorithms time scaling with the batch-size on a MLP with 2 hidden layers of 501 hidden units each on the Fashion-MNIST dataset (multi-class classification). The reported time (in seconds) refers to one epoch for each algorithm. ----- |