# WHAT TO EXPECT OF HARDWARE METRIC PREDICTORS IN NEURAL ARCHITECTURE SEARCH

**Anonymous authors**
Paper under double-blind review

ABSTRACT

Modern Neural Architecture Search (NAS) focuses on finding the best performing architectures in hardware-aware settings, e.g., those with an optimal tradeoff of accuracy and latency. Due to the many advantages of prediction models over live measurements, the search process is often guided by estimates of how well each considered network architecture performs on the desired metrics. Typical prediction models range from operation-wise lookup tables to gradient-boosted trees and neural networks, with little known about how they compare. We evaluate 18 different performance predictors on ten combinations of metrics, devices, network types, and training tasks, and find that MLP models are the most promising. We then simulate and evaluate how the guidance of such prediction models affects the subsequent architecture selection. Due to inaccurate predictions, the selected architectures are generally suboptimal, which we quantify as an expected reduction in accuracy and hypervolume. We show that simply verifying the predictions of just the selected architectures can lead to substantially improved results. Under a time budget, we find it preferable to use a fast and inaccurate prediction model over accurate but slow live measurements.

1 INTRODUCTION

Modern neural network architectures are designed with more than just their primary objective, such as accuracy, in mind. While existing architectures can be scaled down to work with the limited available memory and computational power of, e.g., mobile phones, they are significantly outperformed by specifically designed architectures (Howard et al., 2017; Sandler et al., 2018; Zhang et al., 2018; Ma et al., 2018). Standard hardware metrics include memory usage, number of model parameters, Multiply-Accumulate operations, energy consumption, latency, and more; each of which may be limited by the hardware platform or network task. As the range of tasks and target platforms grows, specialized architectures and the methods to find them efficiently are gaining importance.

The automated design and discovery of specialized architectures is the main intent of Neural Architecture Search (NAS). This recent field of study has repeatedly broken state-of-the-art records (Zoph et al., 2018; Real et al., 2018; Cai et al., 2019; Tan & Le, 2019; Chu et al., 2019a; Hu et al., 2020) while aiming to reduce the researchers' involvement in this tedious and time-consuming process to a minimum.

As the performance of each considered architecture needs to be evaluated, the hardware metrics need to be either measured live or estimated by a trained prediction model. While measuring live has the advantage of not suffering from inaccurate predictions, the corresponding hardware needs to be available during the search process. Measuring on demand may also significantly slow down the search process and necessitates further measurements for each new architecture search. On the other hand, a prediction model abstracts the hardware from the search code and simplifies changes to the optimization targets, such as metrics or devices. The dataset to train the predictor also has to be collected only once, so that a trained predictor then works in the absence of the hardware it is predicting for, e.g., in a cloud environment.
Furthermore, a differentiable predictor can be used for gradient-based architecture optimization of typically non-differentiable metrics (Cai et al., 2019; Xu et al., 2020; Nayman et al., 2021). While these many advantages make predictors a popular choice in hardware-aware NAS (e.g. Xu et al. (2020); Wu et al. (2019); Wan et al. (2020); Dai et al. (2020); Nayman et al. (2021)), there are no guidelines on which predictors perform best, how many training samples are required, or what happens when a predictor is inaccurate.

This work investigates the above points. As a first contribution, we conduct large-scale experiments on ten hardware-metric datasets chosen from HW-NAS-Bench (Li et al., 2021a) and TransNAS-Bench-101 (Duan et al., 2021). We explore how powerful the different predictors are when using different amounts of training data and whether these results generalize across different network architecture types. As a second contribution, we extensively simulate the subsequent architecture selection to investigate the impact of inaccurate predictors. Our results demonstrate the effectiveness of network-based prediction models and provide insights into predictor mistakes and what to expect from them. To facilitate reproducibility and further research, our experimental results and code are made available in Appendix A.

2 RELATED WORK

**NAS Benchmarks:** As the search spaces of NAS methods often differ from one another and lack extensive studies, the difficulty of fair comparisons and reproducibility has become a major concern (Yang et al., 2019; Li & Talwalkar, 2020). To alleviate this problem, researchers have exhaustively evaluated search spaces of several thousand architectures to create benchmarks (Ying et al., 2019; Dong & Yang, 2020; Dong et al., 2020; Siems et al., 2020) containing detailed statistics for each architecture. TransNAS-Bench-101 (Duan et al., 2021) evaluates several thousand architectures across seven diverse tasks and finds that the best task-specific architectures may vary significantly. The popular NAS-Bench-201 benchmark (Dong & Yang, 2020) has been further extended with ten different hardware metrics for all 15625 architectures on each of the three datasets CIFAR10, CIFAR100 (Krizhevsky et al., 2009) and ImageNet16-120 (Chrabaszcz et al., 2017). Major findings of this HW-NAS-Bench (Li et al., 2021a) include that FLOPs and the number of parameters are a poor approximation for other metrics such as latency. Many existing NAS methods use such inadequate substitutes for their simplicity and would benefit from replacing them with better prediction models. Li et al. also find that hardware-specific costs do not correlate well across hardware platforms. While accounting for each device's characteristics improves the NAS results, it is also expensive. Predictors can reduce costs by requiring fewer measurements and shorter query times.[1]

**Predictors in NAS:** Aside from real-time measurements (Tan et al., 2019; Yang et al., 2018), hardware metric estimation in NAS is commonly performed via a Lookup Table (Wu et al., 2019), Analytical Estimation, or a Prediction Model (Dai et al., 2020; Xu et al., 2020). While operation- and layer-wise Lookup Tables can accurately estimate hardware-agnostic metrics, such as FLOPs or the number of parameters (Cai et al., 2019; Guo et al., 2020; Chu et al., 2019a), they may be suboptimal for device-dependent metrics.
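As a concrete illustration of such operation-wise estimation, the following minimal sketch sums per-operation costs; the operation names and cost values are hypothetical placeholders rather than measurements from any benchmark or device.

```python
# Minimal sketch of operation-wise lookup-table estimation (illustrative values only).
# The per-operation costs would come from one-off measurements on the target device.

OP_LATENCY_MS = {            # hypothetical measured costs per operation
    "none": 0.00,
    "skip_connect": 0.02,
    "nor_conv_1x1": 0.35,
    "nor_conv_3x3": 0.90,
    "avg_pool_3x3": 0.10,
}

def lookup_table_latency(architecture):
    """Estimate latency by summing the cost of every chosen operation.

    `architecture` is a list of operation names, one per edge/layer.
    This ignores operation interplay, caching, and parallelism, which is
    exactly why such estimates can be poor for device-dependent metrics.
    """
    return sum(OP_LATENCY_MS[op] for op in architecture)

print(lookup_table_latency(["nor_conv_3x3", "skip_connect", "nor_conv_1x1",
                            "avg_pool_3x3", "none", "nor_conv_3x3"]))
```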
Latency and energy consumption have non-obvious factors that depend on hardware specifics such as memory, cache usage, the ability to parallelize each operation, and the interplay between different network operations. Such details can be captured with neural networks (Dai et al., 2020; Mendoza & Wang, 2020; Ponomarev et al., 2020; Xu et al., 2020) or other specialized models (Yao et al., 2018; Wess et al., 2021). Of particular interest is the correct prediction of the model loss or accuracy, possibly reducing the architecture search time by orders of magnitude (Mellor et al., 2020; Wang et al., 2021; Li et al., 2021b). In addition to common predictors such as Linear Regression, Random Forests (Liaw et al., 2002) or Gaussian Processes (Rasmussen, 2003), specialized techniques may exploit training curve extrapolation, network weight sharing or gradient information. Our experiments follow the recent large-scale study of White et al. (2021), who compare 31 diverse accuracy prediction methods based on initialization and query time, using three NAS benchmarks.

3 PREDICTING HARDWARE METRICS

Our methods follow the large-scale study of White et al. (2021), who compared a total of 31 accuracy prediction methods. The differences between accuracy and hardware-metric prediction, our selection of predictors, and the general training pipeline are described in this section. In our experiments on HW-NAS-Bench and TransNAS-Bench-101, described in Section 4, we then compare these predictors across different training set sizes.

[1] For further reading, we recommend a recent survey on hardware-aware NAS (Benmeziane et al., 2021).

**Differences to accuracy predictors:** There are fundamental differences between predicting hardware metrics and predicting the accuracy of network topologies. The most essential is the cost to obtain a helpful predictor, which may vary widely for accuracy prediction methods. While determining the test accuracy requires the costly and lengthy training of networks, measuring hardware metrics does not necessitate any network training. Consequently, specialized accuracy-estimation methods that rely on trained networks, loss history, learning curve extrapolation, or early stopping do not apply to hardware metrics. Furthermore, so-called zero-cost proxies that predict metrics from the gradients of a single batch depend on the network topology but not on the hardware the network is placed on. Therefore, the dominant hardware-metric predictor family is model-based.

Since all relevant predictors are model-based, they can be compared by their training set size. This simplifies the initialization time of a predictor to the number of previously measured architectures on which it is trained. In stark contrast, some accuracy predictors do not need any training data, while others require several partially or fully trained networks. Since an untrained network and a few batches suffice to measure a hardware metric, the collection of such a training set is comparably inexpensive. Additionally, hardware predictors are generally used to supplement a one-shot network optimized for loss or accuracy. Depending on the NAS method, a fully differentiable predictor is required in order to guide the gradient-based architecture selection. Typical choices are Lookup Tables (Cai et al., 2019; Nayman et al., 2021) and neural networks (Xu et al., 2020).
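To illustrate why differentiability matters here, the sketch below shows how a small, pre-trained latency predictor could act as a penalty term on relaxed architecture parameters in a gradient-based search. This is a generic pattern rather than the exact setup of the cited methods; the 6x5 cell parameterization, the network sizes, and the penalty weight are illustrative assumptions.

```python
# Sketch (PyTorch) of a differentiable latency predictor supplementing a
# gradient-based NAS objective. The predictor is assumed to be trained beforehand
# on (encoding, latency) pairs and then frozen.
import torch
import torch.nn as nn

latency_predictor = nn.Sequential(          # assumed pre-trained, frozen during the search
    nn.Linear(6 * 5, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 1),
)

arch_logits = nn.Parameter(torch.zeros(6, 5))   # one operation distribution per cell position

def search_objective(task_loss: torch.Tensor, penalty_weight: float = 0.1) -> torch.Tensor:
    # The softmax relaxation keeps the operation choices differentiable, so the
    # predicted latency can be back-propagated into the architecture parameters.
    arch_probs = torch.softmax(arch_logits, dim=-1).flatten()
    predicted_latency = latency_predictor(arch_probs).squeeze()
    return task_loss + penalty_weight * predicted_latency
```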
**Model-based predictors:** The goal of a predictor f_p(a) is to accurately approximate the function f(a), which may be, e.g., the latency of an architecture a from the search space A. A model-based predictor is trained via supervised learning on a set D_train of datapoints (a, f(a)), after which it can be inexpensively queried for estimates on further architectures. The collection of the dataset and the duration of the training are referred to as initialization time and training time, respectively. The quality of such a trained predictor is generally determined by the (ranking) correlation between the measurements {f(a) | a ∈ A_test} and the predictions {f_p(a) | a ∈ A_test} on the unseen architectures A_test ⊂ A. Common correlation metric choices are Pearson (PCC), Spearman (SCC) and Kendall's Tau (KT) (Chu et al., 2019b; Yu et al., 2020; Siems et al., 2020).

Our experiments include 18 model-based predictors from different families: Linear Regression, Ridge Regression (Saunders et al., 1998), Bayesian Linear Regression (Bishop, 2007), Support Vector Machines (Cortes & Vapnik, 1995), Gaussian Process (Rasmussen, 2003), Sparse Gaussian Process (Candela & Rasmussen, 2005), Random Forests (Liaw et al., 2002), XGBoost (Chen & Guestrin, 2016), NGBoost (Duan et al., 2020), LGBoost (Ke et al., 2017), BOHAMIANN (Springenberg et al., 2016), BANANAS (White et al., 2019), BONAS (Shi et al., 2020), GCN (Wen et al., 2020), small and large Multi-Layer Perceptrons (MLP), NAO (Luo et al., 2018), and a layer- and operation-wise Lookup Table model. We provide further descriptions and implementation details in Appendix B.

**Hyper-parameter tuning:** The used predictors vary significantly in how well their default hyper-parameters are tuned, especially in the context of NAS. Additionally, some predictors may internally make use of cross-validation, while others do not. Following White et al. (2021), we attempt to level the playing field by running a cross-validation random search over hyper-parameters each time a predictor is fit to data. Each search is limited to 5000 iterations and a total run time of 15 minutes and naturally excludes any test data. The predictor-specific parameter details are given in Appendix C.

**Training pipeline:** To make a reliable comparison, we use the NASLib library (Ruchte et al. (2020), see Appendix A). We fit each predictor on each dataset and training size 50 times, using seeds {0, ..., 49}. Some predictors internally normalize the training values (subtract mean, divide by standard deviation). We choose to explicitly do this for all predictors and datasets, which reduces the dependency of hyper-parameters (e.g. learning rate) on the dataset and allows us to analyze and compare the prediction errors across datasets more effectively.

4 PREDICTOR EXPERIMENTS

We compare the different predictor models based on two NAS benchmarks, HW-NAS-Bench (Li et al., 2021a) and TransNAS-Bench-101 (Duan et al., 2021). They differ considerably in their network tasks, hardware devices, and architecture designs.

**HW-NAS-Bench architecture design and datasets:** In HW-NAS-Bench, each architecture is solely defined by the topology of a building block ("cell"), which is stacked multiple times to create a complete network. Each cell is completely defined by six operation choices. Since each choice selects from five different candidate operations, there are 5^6 = 15625 unique cell topologies.
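As a concrete illustration of the evaluation protocol above, the following sketch one-hot encodes such six-way operation choices, normalizes the targets, fits one of the simpler predictors, and reports the correlation metrics. The Ridge model, the random placeholder targets, and the split sizes are assumptions for illustration, not the exact NASLib pipeline.

```python
# Sketch of the predictor training/evaluation protocol: one-hot encode the six
# operation choices, normalize the targets, fit a model on a small training split,
# and report correlations on held-out architectures.
import numpy as np
from scipy.stats import pearsonr, spearmanr, kendalltau
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n_archs, n_positions, n_ops = 2000, 6, 5

ops = rng.integers(0, n_ops, size=(n_archs, n_positions))   # sampled cell topologies
x = np.eye(n_ops)[ops].reshape(n_archs, -1)                 # one-hot encoding, shape (N, 30)
y = rng.normal(size=n_archs)                                # placeholder "latency" targets

train, test = np.arange(200), np.arange(200, n_archs)       # 200 training samples
mean, std = y[train].mean(), y[train].std()                 # explicit target normalization
model = Ridge(alpha=1.0).fit(x[train], (y[train] - mean) / std)
pred = model.predict(x[test]) * std + mean

# With random targets these correlations are near zero; real measurements give
# meaningful scores comparable to the results reported below.
print("PCC", pearsonr(y[test], pred)[0])
print("SCC", spearmanr(y[test], pred)[0])
print("KT ", kendalltau(y[test], pred)[0])
```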
These cells are not fully sequential but contain paths of different lengths, which is visualized in Appendix D. HW-NAS-Bench provides ten hardware statistics on CIFAR10, CIFAR100 (Krizhevsky et al., 2009) and ImageNet16-120 (Chrabaszcz et al., 2017), of which we exclude the incomplete EdgeTPU metric. Thus there are 27 datasets of varying difficulty. As detailed in Appendix E, 12 of them can be accurately fit with Linear Regression and only 25 training samples. Many are also very similar since their measured networks differ only by the number of image classes. We therefore select five datasets that (1) are not trivial to learn, as they are non-linear, and (2) are not redundant:

• ImageNet16-120, raspi4, latency
• CIFAR100, pixel3, latency
• CIFAR10, edgegpu, latency
• CIFAR100, edgegpu, energy consumption
• ImageNet16-120, eyeriss, arithmetic intensity

**TransNAS-Bench-101 architecture design and datasets:** TransNAS-Bench-101 contains information for 7,352 different network architectures, used as backbones in seven diverse vision tasks. Since 4,096 of them are also a subset of HW-NAS-Bench, we focus on the remaining 3,256 architectures with a macro-level search space. Unlike in a micro-level search space, where a cell is stacked multiple times to create a network, each network layer and block is considered individually. In particular, the TransNAS-Bench-101 networks consist of four to six pairs of ResNet blocks (He et al., 2016), which may modify the image size and channels in four ways: not at all, double the channel count, halve the spatial size, or both. Every network has to double the channel count 1 to 3 times and halve the spatial size 1 to 4 times, resulting in 3,256 unique architectures. The networks may consequently differ in their number of layers (depth), the number of channels (width), and image size at any layer.

As done for HW-NAS-Bench, we select five of the seven available datasets for their latency measurements. Aside from the self-supervised Jigsaw task, there is little difference between the cross-task latency measurements (see Appendix E). We evaluate the possibly redundant datasets nonetheless, since latency predictions in macro-level search spaces are an important domain for NAS on image classification and object detection tasks:

• Object classification
• Scene classification
• Room layout
• Jigsaw
• Semantic segmentation

**Fitting results and comparison:** The results, averaged over all selected HW-NAS-Bench and TransNAS-Bench-101 datasets, are presented in Figures 1a and 1b, respectively. The left plots present the absolute predictor performance, the right ones make relative comparisons easier.

Unsurprisingly, more training samples (i.e., evaluated architectures) generally lead to better prediction results, even until the entire search space is known (aside from the test set). This is true for most of the predictors, although e.g. Gaussian Processes and BOHAMIANN saturate early. The simple Linear Regression and Ridge Regression models also fail to make proper use of hundreds of data points but perform decently when only a few training samples are available. Interestingly, the same is true for the graph-encoding network-based predictors BONAS and GCN. While knowing how the different paths within each cell connect (see Appendix B) is especially useful given fewer than fifty training samples, the advantage disappears afterward. In contrast, the graph-encoding encoder-decoder approach of NAO performs decently at all times.
Figure 1: How well the different predictors rank the test architectures, depending on the training set size and averaged over the five selected datasets. Left plots: absolute Kendall's Tau ranking correlation, higher is better. Right plots: same as left, but centered on the predictor-average. (a) Results on HW-NAS-Bench: NAO performs decently at all times, and none of the prediction models requires more than 60 training samples to improve over a Lookup Table model. (b) Results on TransNAS-Bench-101: since all network architectures are purely sequential by design, we do not evaluate predictors that specifically encode the architecture connectivity (BANANAS, BONAS, GCN, NAO). After as few as 20 training samples, MLP models outclass all other predictors.

Due to their powerful rule-based approach, tree-based models perform much better given many training samples. Under such circumstances, LGBoost is a candidate for the best predictor model. Similarly, the predictions of Support Vector Machines also benefit strongly from more samples.

The models we find to perform best for most training set sizes are MLPs. They are among the top predictors at almost all times on HW-NAS-Bench, although tree-based models are competitive given enough data. After around 3,000 training samples, thinner and deeper MLPs improve over the wider and smaller ones. The path-encoding BANANAS model behaves similarly to a regular large MLP but requires more samples to reach the same performance. This is interesting since, aside from the data encoding, BANANAS is an ensemble of three large MLP models. Even though only the first network layer is affected by the data encoding, the more complicated path-encoding proves harmful when the connectivity of the architectures in the search space is fixed.

| Metric | Raspi4 | FPGA | Eyeriss | Pixel3 | EdgeGPU | Tesla V100 |
| --- | --- | --- | --- | --- | --- | --- |
| latency | 0.45 (0.75) | 0.99 (0.97) | 0.99 (0.96) | 0.49 (0.78) | 0.21 (0.79) | 0.60 (0.70) |
| energy | | 0.99 (0.97) | 1.00 (0.99) | | 0.23 (0.79) | |
| arithmetic intensity | | | 0.84 (0.81) | | | |

Table 1: The Kendall's Tau correlation of Lookup Tables and Linear Regression (in brackets, using only 124 training samples) across metrics and devices. Raspi4, FPGA, Eyeriss, Pixel3, and EdgeGPU belong to HW-NAS-Bench, the Tesla V100 to TransNAS-Bench-101. Lookup Tables perform only marginally better on the FPGA and Eyeriss devices, but considerably worse in all other cases. More detailed statistics are available in Appendix E.

On TransNAS-Bench-101, MLPs perform exceptionally well. They are much better than any other tested predictor once more than just 20 training samples are available. The small MLP model can achieve a KT correlation of 80% with just 200 training samples, which takes the best non-network-based predictor (Support Vector Machine) four times as many.
They are also the only models that achieve a KT correlation of over 90%, about 5% higher than the next best model (LGBoost). Finally, the Lookup Table models (black horizontal lines) perform poorly in comparison to any other predictor. Even though building such a model for the HW-NAS-Bench datasets requires only 25 neighboring architectures, NAO and GCN perform better after just ten random samples. More than half of the predictor models require fewer than 25 random samples, while the worst need at most 60. On TransNAS-Bench-101, Lookup Tables perform comparatively better. Building one requires only 21 neighboring architectures, and it takes most models between 50 and 100 random training samples to achieve better performance. When measured on a per-dataset basis, we find that the Lookup Table models display a severe performance difference, ranging from about 20% KT correlation (cifar10-edgegpu latency and Jigsaw) to over 70% (ImageNet16-120-eyeriss arithmetic intensity and Semantic Segmentation, see Appendix E). Other models prove to be much more stable.

**Devices and Metrics:** The previously described results are based on a specific selection of HW-NAS-Bench and TransNAS-Bench-101 datasets that are hard to fit for Lookup Table models. As shown in Table 1, that is not always the case. The FPGA and Eyeriss hardware devices are very suitable for Lookup Tables, where an almost perfect ranking correlation is possible. Nonetheless, Linear Regression requires only 124 training samples to compete even there and is significantly better in every other case. We finally observe that the difficulty of fitting predictors primarily depends on the hardware device, far more than on the measured metric.

5 EVALUATING THE PREDICTOR-GUIDED ARCHITECTURE SELECTION

Although the experiments in Section 4 greatly assist us in selecting a predictor, it is not clear what a specific Kendall's Tau correlation implies for the subsequent architecture selection. Given a perfect hardware metric predictor (Kendall's Tau = 1.0), we can expect that an ideal architecture search process will select the architectures with the best tradeoff of accuracy and the hardware metric, i.e., the true Pareto front. On the other hand, imperfect predictions result in the selection of supposedly-best architectures that are wrongly believed to be better. To study how hardware predictors affect NAS results, we extensively evaluate the selection of such supposedly-best architectures in simulation. This approach can evaluate any combination of predictor quality, test set size, and dataset, without the technical difficulties of obtaining actual predictor models that precisely match such requirements. Since the hardware and accuracy prediction models are usually independent and can be studied in isolation, we use ground-truth accuracy values in all cases.

**Simulating predictors:** The main challenge of the simulation is to quickly and accurately model predictor outputs. We base our simulation on how predictor-generated values deviate from their ground-truth targets on the test set, which is explained in Figure 2 and further detailed in Appendix G. Since the simulated deviations are similar to those of actual predictors, simulated predictions are obtained by drawing random values from this deviation distribution and adding them to the ground-truth hardware measurements.
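A minimal sketch of this simulation step follows. The plain normal deviation distribution and the placeholder measurements are assumptions for illustration; the paper's simulation uses a mixed distribution fit to real predictor errors (Appendix G).

```python
# Sketch of the predictor simulation: draw random deviations from a chosen
# distribution and add them to the normalized ground-truth measurements to obtain
# simulated predictions.
import numpy as np
from scipy.stats import kendalltau

def simulate_predictions(true_values, deviation_std, rng):
    """Simulated predictor outputs for normalized ground-truth measurements."""
    return true_values + rng.normal(scale=deviation_std, size=len(true_values))

rng = np.random.default_rng(0)
true_latency = rng.normal(size=15625)              # placeholder for normalized measurements
simulated = simulate_predictions(true_latency, deviation_std=0.5, rng=rng)
print(kendalltau(true_latency, simulated)[0])      # ~0.7 ranking correlation for std=0.5
```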
Figure 2: A trained XGBoost prediction model on normalized ImageNet16-120 raspi4-latency test data. Left: The latency prediction (y-axis) for every architecture (blue dot) is approximately correct (red line); KT=0.73, SCC=0.90, PCC=0.88. Center: The same data as on the left, showing the distribution of deviations made by the predictor (blue) and a normal distribution fit to them (orange, std=0.477). Right: A mixed distribution (generated with std=0.5) can simulate real deviation distributions such as that in the center plot.

Figure 3: An example of predictor-guided architecture selection, std=0.5. Left: The simulated predictor makes an inaccurate latency prediction for each architecture (blue), resulting in the selection of the supposedly-best architectures (orange dots). Even though the predicted Pareto front (orange line) may differ significantly from the ground-truth Pareto front (red line), most selected architectures are close to optimal. Right: Same data as left. The true Pareto front (red, HV=2.93) and that of the selected architectures (orange, HV=2.86). Simply accepting all selected architectures results in a Mean Reduction of Accuracy (MRA) of 1.06%, while verifying the predictions and discarding inferior results improves that to 0.43%. The hypervolume (HV, area under the Pareto fronts) is reduced by 0.07.

A single example of a simulation can be seen in Figure 3. Although most selected architectures (orange) are close to the true optimum (red Pareto front), there almost always exists an architecture that has superior accuracy and, at most, the same latency. Simply accepting the 13 selected architectures in this particular example results in a mean reduction of accuracy (MRA_all) of 1.06%. In other words, the average selected architecture has 1.06% lower accuracy than a comparable one on the true Pareto front. However, simply verifying the hardware metric predictions through actual measurements reveals that some selected architectures are suboptimal. By choosing only the Pareto subset of the selection, the opportunity loss can be reduced to 0.43% (MRA_pareto).

An important property of this approach is that it is independent of any particular optimization method. The supposedly-best architectures are always correctly identified, which is an upper bound on how well Bayesian Optimization, Evolutionary Algorithms, and other approaches can perform.
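The selection and verification procedure described above can be sketched as follows. The Pareto and MRA helper functions and the synthetic accuracy/latency values are illustrative assumptions rather than the paper's exact implementation; they assume higher accuracy and lower latency are better.

```python
# Sketch of the selection and verification step: select the architectures that are
# Pareto-optimal under the predicted latency, verify them with true measurements to
# discard dominated candidates, and compute the mean reduction in accuracy (MRA)
# against the best true accuracy reachable at equal or lower latency.
import numpy as np

def pareto_front(accuracy, latency):
    """Indices of points not dominated by any other (higher accuracy, lower latency)."""
    keep = []
    for i in range(len(accuracy)):
        dominated = np.any((accuracy >= accuracy[i]) & (latency <= latency[i]) &
                           ((accuracy > accuracy[i]) | (latency < latency[i])))
        if not dominated:
            keep.append(i)
    return np.array(keep)

def mean_reduction_in_accuracy(selected, accuracy, latency):
    """Average accuracy gap to the best architecture of equal or lower latency."""
    gaps = [accuracy[latency <= latency[i]].max() - accuracy[i] for i in selected]
    return float(np.mean(gaps))

rng = np.random.default_rng(0)
accuracy = rng.uniform(40.0, 70.0, size=500)            # placeholder ground-truth accuracy [%]
latency = rng.uniform(0.0, 1.0, size=500)               # placeholder normalized latency
predicted_latency = latency + rng.normal(scale=0.5, size=500)

selected_all = pareto_front(accuracy, predicted_latency)          # supposedly-best set
verified = pareto_front(accuracy[selected_all], latency[selected_all])
selected_pareto = selected_all[verified]                          # after measuring for real

print("MRA_all   ", mean_reduction_in_accuracy(selected_all, accuracy, latency))
print("MRA_pareto", mean_reduction_in_accuracy(selected_pareto, accuracy, latency))
```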
The exemplary MRA_pareto opportunity loss of 0.43% is therefore unavoidable and depends solely on the hardware metric predictor, the dataset, and the number of considered architectures.

**Results:** We simulate 1,000 architecture selections for each of the five chosen HW-NAS-Bench datasets, six different test set sizes, and eleven distribution standard deviations between 0.0 and 1.0. As exemplified in Figure 3, each such simulation allows us to compute the mean reduction in accuracy (MRA) and the hypervolume (HV) under the Pareto fronts. The most important insights are visualized in Figure 4 and summarized below.

Figure 4: Simulation results, with the standard deviation of the predictor deviations and the resulting KT correlation on the x-axis (standard deviations of 0.0, 0.2, 0.4, 0.6, 0.8, and 1.0 correspond to KT correlations of roughly 1.00, 0.88, 0.78, 0.69, 0.62, and 0.57). Left: Verifying the hardware predictions can significantly improve the results, even more so for better predictors. Center: The drops in average accuracy are dependent on the dataset and hardware metric. Right: Considering more candidate architectures and using better prediction models improves the results; larger values are better.

**Verifying the predicted results matters (Figure 4, left).** The best prediction models achieve a KT correlation of almost 0.9, which translates to a mean reduction in accuracy of MRA_all ≈ 1.5%. That means, for each selected architecture, there exists an architecture of equal or lower latency in the true Pareto set (if latency is the hardware metric) that improves the average accuracy by 1.5%. Even though all selected architectures are believed to form a Pareto set, that is not the case. Their optimal subset has a reduction of only MRA_pareto ≈ 0.5%, a significant improvement. However, finding this optimal subset requires actually measuring the hardware metrics of the architectures selected by the used NAS method.

Furthermore, the left of Figure 4 aids in anticipating the MRA given a specific predictor. If one used e.g. BOHAMIANN (KT ≈ 0.8, see Figure 1a) instead of MLPs or LGBoost (KT ≈ 0.9), MRA_pareto increases from around 0.5% to roughly 1.2%. The average accuracy of the selected architectures is thus reduced by another 0.7%, just by using an unsuitable hardware metric predictor. Lookup Tables (KT ≈ 0.45) are not even visualized anymore; they have an MRA_pareto of over 2.5%. Another interesting observation is that the gap between MRA_all and MRA_pareto is wider for better predictors. This is a shortcoming of the MRA metric that we elaborate on in Appendix H.
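The hypervolume used in Figures 3 and 4, and in the discussion of the number of considered architectures below, can be computed as the area a Pareto front dominates relative to a reference point. A minimal 2D sketch follows; the reference point and the example values are illustrative assumptions.

```python
# Minimal 2D hypervolume sketch for accuracy/latency Pareto fronts: the area the
# front dominates relative to a reference point (worst accuracy, worst latency).
# The input points are assumed to be mutually non-dominated.
import numpy as np

def hypervolume_2d(front_accuracy, front_latency, ref_accuracy, ref_latency):
    order = np.argsort(front_latency)                   # sweep along the latency axis
    acc, lat = front_accuracy[order], front_latency[order]
    hv, right_edge = 0.0, ref_latency
    for a, l in zip(acc[::-1], lat[::-1]):              # from highest latency to lowest
        hv += (right_edge - l) * max(a - ref_accuracy, 0.0)
        right_edge = l
    return hv

front_acc = np.array([45.0, 52.0, 58.0, 61.0])
front_lat = np.array([0.10, 0.25, 0.55, 0.80])
print(hypervolume_2d(front_acc, front_lat, ref_accuracy=40.0, ref_latency=1.0))
```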
**The dataset and metric matter (Figure 4, center).** While we generally present the results averaged over datasets, there exists some discrepancy among them. Most interestingly, predicting hardware metrics on harder classification problems (ImageNet16-120 is harder than CIFAR10) also results in a higher MRA. This is especially important since MRA is an absolute accuracy reduction. Even though the CIFAR10 networks achieve twice the accuracy of ImageNet16-120 networks, they lose less absolute accuracy to imperfect predictions. The ordering of the datasets by MRA is mostly stable across predictor KT correlations. Finally, as visualized by the shaded areas, the standard deviation of the MRA is generally huge. Consequently, predictor-guided NAS is very likely to produce results of varying quality for each different predictor or search attempt, especially with less accurate predictors.

**The number of considered architectures matters (Figure 4, right).** We measure the hypervolume of the discovered Pareto front (i.e., the area beneath it, see Appendix H), which, unlike MRA, also considers the hardware metric. Quite obviously, if the architectures from the true Pareto set are not considered, they cannot be selected. To achieve the highest possible hypervolume of around 4.2 (i.e. find the true Pareto set), every architecture in the search space must be evaluated with a perfect predictor. This is impossible in most real-world cases, where only a tiny fraction of all possible architectures can ever be considered. For HW-NAS-Bench, considering 5000 architectures with perfect live measurements and predicting the metrics for all 15625 with ranking correlation KT ≈ 0.73 results in selecting equivalent sets of architectures. As seen in Figure 1a, Ridge Regression can achieve this performance with fewer than 100 training samples. Thus, a worse predictor leads to better results if it enables considering more architectures. This insight is especially crucial for live measurements, which are accurate but slow. Similarly, estimating the network accuracy with super-networks takes much more time than predicting their performance with a neural predictor (Wen et al., 2020). If the measurement of any metric is the limiting factor, a guided selection with a cheap predictor is likely to do better.

6 DISCUSSION

**Chosen prediction methods:** Given the nature of hardware-metric prediction, only the subset of model-based predictors evaluated by White et al. (2021) is suitable. We extended this subset with four models, including the popular Lookup Table. We abstained from evaluating layer-wise predictors (e.g. Wess et al. (2021)) since such data is not available, and meta-learning predictors (Lee et al., 2021) due to the vast possibilities to configure them. A separate and specialized comparison between classic and meta-learning predictors seems preferable to us.

**Simulation limitations:** In contrast to evaluating real predictors, the simulation allows us to quickly make statements for any test set sizes and predictor inaccuracies. However, naturally, the results are only approximations. While they match actual values, they are generally slightly pessimistic (see Appendix I). We also limit the simulation to HW-NAS-Bench since changes to classification results are more accessible to interpretation than changes to loss values across different problem types. Finally, the current simulation approach cannot investigate methods that absolutely require a trained one-shot network, such as gradient-based approaches.
Including such methods is an interesting direction for future research.

**Transferability of the results:** Our evaluation includes five challenging and diverse datasets based on the micro-level search space of HW-NAS-Bench and five latency-based datasets of various macro-level search space architectures in TransNAS-Bench-101. Nonetheless, we find shared trends: all tested prediction models improve over Lookup Tables with small amounts of training data. Furthermore, most predictors benefit from more training data, even until the entire search space (aside from the test set) is known. We also find that network-based predictors are generally best but may be challenged by tree-based predictors if enough training data is available. Given only a few samples, Ridge Regression performs better than most other models.

**Recommendations:** While Lookup Tables are a cheap, simple, and popular model in gradient-based architecture selection, we find a significant variance in their performance across tasks and devices (see Table 1 and Appendix E). We recommend replacing such models with either MLPs or Ridge Regression, which are more stable, fully differentiable, and often take fewer than 100 training samples to achieve better results. For most realistic scenarios, where more than 100 training samples are available, MLP models are the most promising. They are among the top predictors on HW-NAS-Bench and demonstrate outstanding performance on the TransNAS-Bench-101 datasets. We found that specialized architecture encodings are primarily beneficial with little training data but suspect that they enjoy an additional advantage when network topologies are more complex and diverse (White et al., 2021). While the query time for all predictors is less than 0.05s and thus negligible, there is a notable difference in training time (see Appendix F), primarily due to the hyper-parameter optimization. If that is an important factor, we recommend Ridge Regression for very small amounts of training data and LGBoost otherwise. If a NAS method selects architectures based on hardware metric predictions, we strongly suggest verifying the results by measuring the true metric value afterward. Doing so may eliminate inferior candidates and improve the average result substantially. Finally, if the limiting factor of a NAS method is the slow measurement of hardware metrics, using a much faster predictor may lead to an improvement, even if the prediction model is less accurate.

7 CONCLUSIONS

This work evaluated various hardware-metric prediction models on ten problems of different metrics, devices, and network architecture types. We then simulated the selection process for different test set sizes and predictor inaccuracies to improve our understanding of predictor-based architecture selection. We find that even imperfect predictors may improve NAS if their low query time enables considering more candidate architectures. Finally, verifying the predictions for the selected candidates can lead to a drastic improvement of their average performance. The code and results are made available, thus serving both as a recommendation and as a baseline for future work.

REFERENCES

Hadjer Benmeziane, Kaoutar El Maghraoui, Hamza Ouarnoughi, Smaïl Niar, Martin Wistuba, and Naigang Wang. A Comprehensive Survey on Hardware-Aware Neural Architecture Search. CoRR, abs/2101.09336, 2021. URL https://arxiv.org/abs/2101.09336.

Christopher M. Bishop. Pattern recognition and machine learning, 5th Edition.
Information science and statistics. Springer, 2007. ISBN 9780387310732. URL https://www.worldcat.org/oclc/71008143.

Han Cai, Ligeng Zhu, and Song Han. ProxylessNAS: Direct Neural Architecture Search on Target Task and Hardware. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019. URL https://openreview.net/forum?id=HylVB3AqYm.

Joaquin Quiñonero Candela and Carl Edward Rasmussen. A Unifying View of Sparse Approximate Gaussian Process Regression. J. Mach. Learn. Res., 6:1939–1959, 2005. URL http://jmlr.org/papers/v6/quinonero-candela05a.html.

Tianqi Chen and Carlos Guestrin. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '16, pp. 785–794, New York, NY, USA, 2016. ACM. ISBN 978-1-4503-4232-2. doi: 10.1145/2939672.2939785. URL http://doi.acm.org/10.1145/2939672.2939785.

Patryk Chrabaszcz, Ilya Loshchilov, and Frank Hutter. A downsampled variant of imagenet as an alternative to the cifar datasets. arXiv preprint arXiv:1707.08819, 2017.

Xiangxiang Chu, Bo Zhang, Jixiang Li, Qingyuan Li, and Ruijun Xu. ScarletNAS: Bridging the Gap Between Scalability and Fairness in Neural Architecture Search. CoRR, abs/1908.06022, 2019a. URL http://arxiv.org/abs/1908.06022.

Xiangxiang Chu, Bo Zhang, Ruijun Xu, and Jixiang Li. FairNAS: Rethinking Evaluation Fairness of Weight Sharing Neural Architecture Search. CoRR, abs/1907.01845, 2019b. URL http://arxiv.org/abs/1907.01845.

Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine learning, 20(3):273–297, 1995.

Xiaoliang Dai, Alvin Wan, Peizhao Zhang, Bichen Wu, Zijian He, Zhen Wei, Kan Chen, Yuandong Tian, Matthew Yu, Peter Vajda, and Joseph E. Gonzalez. FBNetV3: Joint Architecture-Recipe Search using Neural Acquisition Function. CoRR, abs/2006.02049, 2020. URL https://arxiv.org/abs/2006.02049.

Xuanyi Dong and Yi Yang. NAS-Bench-201: Extending the Scope of Reproducible Neural Architecture Search. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net, 2020. URL https://openreview.net/forum?id=HJxyZkBKDr.

Xuanyi Dong, Lu Liu, Katarzyna Musial, and Bogdan Gabrys. NATS-Bench: Benchmarking NAS Algorithms for Architecture Topology and Size. arXiv preprint arXiv:2009.00437, 2020.

Tony Duan, Avati Anand, Daisy Yi Ding, Khanh K. Thai, Sanjay Basu, Andrew Y. Ng, and Alejandro Schuler. Ngboost: Natural gradient boosting for probabilistic prediction. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning Research, pp. 2690–2700.
PMLR, 2020. URL http://proceedings.mlr.press/v119/duan20a.html.

Yawen Duan, Xin Chen, Hang Xu, Zewei Chen, Xiaodan Liang, Tong Zhang, and Zhenguo Li. TransNAS-Bench-101: Improving Transferability and Generalizability of Cross-Task Neural Architecture Search. CoRR, abs/2105.11871, 2021. URL https://arxiv.org/abs/2105.11871.

Zichao Guo, Xiangyu Zhang, Haoyuan Mu, Wen Heng, Zechun Liu, Yichen Wei, and Jian Sun. Single Path One-Shot Neural Architecture Search with Uniform Sampling. In European Conference on Computer Vision, pp. 544–560. Springer, 2020. URL http://arxiv.org/abs/1904.00420.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016. URL http://arxiv.org/abs/1512.03385.

Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. CoRR, abs/1704.04861, 2017. URL http://arxiv.org/abs/1704.04861.

Shoukang Hu, Sirui Xie, Hehui Zheng, Chunxiao Liu, Jianping Shi, Xunying Liu, and Dahua Lin. DSNAS: Direct Neural Architecture Search without Parameter Retraining. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12084–12092, 2020. URL http://arxiv.org/abs/2002.09128.

Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. Lightgbm: A highly efficient gradient boosting decision tree. In Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett (eds.), Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pp. 3146–3154, 2017. URL https://proceedings.neurips.cc/paper/2017/hash/6449f44a102fde848669bdd9eb6b76fa-Abstract.html.

Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. CIFAR-10 (Canadian Institute for Advanced Research). 2009. URL http://www.cs.toronto.edu/~kriz/cifar.html.

Hayeon Lee, Sewoong Lee, Song Chong, and Sung Ju Hwang. HELP: Hardware-Adaptive Efficient Latency Predictor for NAS via Meta-Learning. CoRR, abs/2106.08630, 2021. URL https://arxiv.org/abs/2106.08630.

Chaojian Li, Zhongzhi Yu, Yonggan Fu, Yongan Zhang, Yang Zhao, Haoran You, Qixuan Yu, Yue Wang, and Yingyan Lin. HW-NAS-Bench: Hardware-Aware Neural Architecture Search Benchmark. CoRR, abs/2103.10584, 2021a. URL https://arxiv.org/abs/2103.10584.

Guihong Li, Sumit K. Mandal, Ümit Y. Ogras, and Radu Marculescu. FLASH: Fast Neural Architecture Search with Hardware Optimization. CoRR, abs/2108.00568, 2021b.
URL https://arxiv.org/abs/2108.00568.

Liam Li and Ameet Talwalkar. Random Search and Reproducibility for Neural Architecture Search. In Uncertainty in Artificial Intelligence, pp. 367–377. PMLR, 2020.

Andy Liaw, Matthew Wiener, et al. Classification and Regression by randomForest. R news, 2(3):18–22, 2002.

Marius Lindauer and Frank Hutter. Best Practices for Scientific Research on Neural Architecture Search. Journal of Machine Learning Research, 21(243):1–18, 2020.

Renqian Luo, Fei Tian, Tao Qin, Enhong Chen, and Tie-Yan Liu. Neural Architecture Optimization. In Samy Bengio, Hanna M. Wallach, Hugo Larochelle, Kristen Grauman, Nicolò Cesa-Bianchi, and Roman Garnett (eds.), Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada, pp. 7827–7838, 2018. URL https://proceedings.neurips.cc/paper/2018/hash/933670f1ac8ba969f32989c312faba75-Abstract.html.

Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design. In Vittorio Ferrari, Martial Hebert, Cristian Sminchisescu, and Yair Weiss (eds.), Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part XIV, volume 11218 of Lecture Notes in Computer Science, pp. 122–138. Springer, 2018. doi: 10.1007/978-3-030-01264-9_8. URL https://doi.org/10.1007/978-3-030-01264-9_8.
Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V Le. Regularized Evolution for Image Classifier Architecture Search, 2018. URL http://arxiv.org/abs/1802.01548.

Michael Ruchte, Arber Zela, Julien Siems, Josif Grabocka, and Frank Hutter. NASLib: A Modular and Flexible Neural Architecture Search Library. https://github.com/automl/NASLib, 2020.

Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. MobileNetV2: Inverted Residuals and Linear Bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510–4520, 2018.

Craig Saunders, Alexander Gammerman, and Volodya Vovk. Ridge Regression Learning Algorithm in Dual Variables. In Proceedings of the Fifteenth International Conference on Machine Learning, ICML '98, pp. 515–521, San Francisco, CA, USA, 1998. Morgan Kaufmann Publishers Inc. ISBN 1558605568.

Han Shi, Renjie Pi, Hang Xu, Zhenguo Li, James T. Kwok, and Tong Zhang. Bridging the Gap between Sample-based and One-shot Neural Architecture Search with BONAS. In Hugo Larochelle, Marc'Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin (eds.), Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020. URL https://proceedings.neurips.cc/paper/2020/hash/13d4635deccc230c944e4ff6e03404b5-Abstract.html.

Julien Siems, Lucas Zimmer, Arber Zela, Jovita Lukasik, Margret Keuper, and Frank Hutter. NAS-Bench-301 and the Case for Surrogate Benchmarks for Neural Architecture Search, 2020.

Jost Tobias Springenberg, Aaron Klein, Stefan Falkner, and Frank Hutter. Bayesian Optimization with Robust Bayesian Neural Networks. In Daniel D. Lee, Masashi Sugiyama, Ulrike von Luxburg, Isabelle Guyon, and Roman Garnett (eds.), Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pp. 4134–4142, 2016. URL https://proceedings.neurips.cc/paper/2016/hash/a96d3afec184766bfeca7a9f989fc7e7-Abstract.html.

Mingxing Tan and Quoc V. Le. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. CoRR, abs/1905.11946, 2019. URL http://arxiv.org/abs/1905.11946.

Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, Mark Sandler, Andrew Howard, and Quoc V. Le. MnasNet: Platform-Aware Neural Architecture Search for Mobile. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, pp. 2820–2828. Computer Vision Foundation / IEEE, 2019. doi: 10.1109/CVPR.2019.00293. URL http://openaccess.thecvf.com/content_CVPR_2019/html/Tan_MnasNet_Platform-Aware_Neural_Architecture_Search_for_Mobile_CVPR_2019_paper.html.
Alvin Wan, Xiaoliang Dai, Peizhao Zhang, Zijian He, Yuandong Tian, Saining Xie, Bichen Wu, Matthew Yu, Tao Xu, Kan Chen, Peter Vajda, and Joseph E. Gonzalez. FBNetV2: Differentiable Neural Architecture Search for Spatial and Channel Dimensions. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020, pp. 12962–12971. IEEE, 2020. doi: 10.1109/CVPR42600.2020.01298. URL https://doi.org/10.1109/CVPR42600.2020.01298.

Ruochen Wang, Xiangning Chen, Minhao Cheng, Xiaocheng Tang, and Cho-Jui Hsieh. RANK-NOSH: Efficient Predictor-Based Architecture Search via Non-Uniform Successive Halving. CoRR, abs/2108.08019, 2021. URL https://arxiv.org/abs/2108.08019.

Wei Wen, Hanxiao Liu, Yiran Chen, Hai Helen Li, Gabriel Bender, and Pieter-Jan Kindermans. Neural Predictor for Neural Architecture Search. In Andrea Vedaldi, Horst Bischof, Thomas Brox, and Jan-Michael Frahm (eds.), Computer Vision - ECCV 2020 - 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part XXIX, volume 12374 of Lecture Notes in Computer Science, pp. 660–676. Springer, 2020. doi: 10.1007/978-3-030-58526-6_39. URL https://doi.org/10.1007/978-3-030-58526-6_39.

Matthias Wess, Matvey Ivanov, Christoph Unger, Anvesh Nookala, Alexander Wendt, and Axel Jantsch. ANNETTE: Accurate Neural Network Execution Time Estimation With Stacked Models. IEEE Access, 9:3545–3556, 2021. ISSN 2169-3536. doi: 10.1109/access.2020.3047259. URL http://dx.doi.org/10.1109/ACCESS.2020.3047259.

Colin White, Willie Neiswanger, and Yash Savani. BANANAS: Bayesian Optimization with Neural Architectures for Neural Architecture Search. arXiv preprint arXiv:1910.11858, 2019.

Colin White, Arber Zela, Binxin Ru, Yang Liu, and Frank Hutter. How Powerful are Performance Predictors in Neural Architecture Search? CoRR, abs/2104.01177, 2021. URL https://arxiv.org/abs/2104.01177.

Bichen Wu, Xiaoliang Dai, Peizhao Zhang, Yanghan Wang, Fei Sun, Yiming Wu, Yuandong Tian, Peter Vajda, Yangqing Jia, and Kurt Keutzer. FBNet: Hardware-Aware Efficient ConvNet Design via Differentiable Neural Architecture Search. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, pp. 10734–10742. Computer Vision Foundation / IEEE, 2019. doi: 10.1109/CVPR.2019.01099. URL http://openaccess.thecvf.com/content_CVPR_2019/html/Wu_FBNet_Hardware-Aware_Efficient_ConvNet_Design_via_Differentiable_Neural_Architecture_Search_CVPR_2019_paper.html.
Yuhui Xu, Lingxi Xie, Xiaopeng Zhang, Xin Chen, Bowen Shi, Qi Tian, and Hongkai Xiong. Latency-Aware Differentiable Neural Architecture Search. CoRR, abs/2001.06392, 2020. URL https://arxiv.org/abs/2001.06392.

Antoine Yang, Pedro M. Esperança, and Fabio M. Carlucci. NAS Evaluation is Frustratingly Hard. arXiv preprint arXiv:1912.12522, 2019.

Tien-Ju Yang, Andrew G. Howard, Bo Chen, Xiao Zhang, Alec Go, Mark Sandler, Vivienne Sze, and Hartwig Adam. NetAdapt: Platform-Aware Neural Network Adaptation for Mobile Applications. In Vittorio Ferrari, Martial Hebert, Cristian Sminchisescu, and Yair Weiss (eds.), Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part X, volume 11214 of Lecture Notes in Computer Science, pp. 289–304. Springer, 2018. doi: 10.1007/978-3-030-01249-6_18. URL https://doi.org/10.1007/978-3-030-01249-6_18.

Shuochao Yao, Yiran Zhao, Huajie Shao, ShengZhong Liu, Dongxin Liu, Lu Su, and Tarek Abdelzaher. FastDeepIoT: Towards Understanding and Optimizing Neural Network Execution Time on Mobile and Embedded Devices. In Proceedings of the 16th ACM Conference on Embedded Networked Sensor Systems, pp. 278–291, 2018.

Chris Ying, Aaron Klein, Eric Christiansen, Esteban Real, Kevin Murphy, and Frank Hutter. NAS-Bench-101: Towards Reproducible Neural Architecture Search. In Kamalika Chaudhuri and Ruslan Salakhutdinov (eds.), Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, pp. 7105–7114. PMLR, 2019. URL http://proceedings.mlr.press/v97/ying19a.html.

Kaicheng Yu, René Ranftl, and Mathieu Salzmann. How to Train Your Super-Net: An Analysis of Training Heuristics in Weight-Sharing NAS. CoRR, abs/2003.04276, 2020. URL https://arxiv.org/abs/2003.04276.

Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pp. 6848–6856. IEEE Computer Society, 2018. doi: 10.1109/CVPR.2018.00716. URL http://openaccess.thecvf.com/content_cvpr_2018/html/Zhang_ShuffleNet_An_Extremely_CVPR_2018_paper.html.
Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V Le. Learning Transferable Architectures for Scalable Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8697–8710, 2018. URL http://arxiv.org/abs/1707.07012.

A BEST PRACTICES FOR NAS, CODE AND DATA

To improve reproducibility and to facilitate fair experimental comparisons, we follow the best-practices checklist (Lindauer & Hutter, 2020):

• Release Code for the Training Pipeline(s) you use. Our experiments are based on White et al. (2021), who use NASLib (Ruchte et al., 2020) to compare 31 methods for accuracy prediction. Our NASLib fork, which extends the framework with HW-NAS-Bench, TransNAS-Bench, several performance predictors, and the hypervolume simulations, is provided in the supplementary materials. We intend to either make our fork available on GitHub or submit the changes upstream once this paper is accepted/published.

• Use the Same Evaluation Protocol for the Methods Being Compared. Aside from the implementation of each predictor, all experiments use the same pipeline.

• Validate The Results Several Times. We ran each predictor 50 times, with seeds {0, ..., 49}. The reductions in hypervolume are simulated 1000 times, each time on a different subset of the data set, for every combination of {iteration, HW-NAS data set, noise on HW metric}.

• Control Confounding Factors. While all experiments used the same software libraries and hardware resources, they were run on different machines to speed up the evaluation. We found hardly any benefit in using a GPU, even for the network-based predictors, which is why every method only used two CPU cores. The OS is Ubuntu 18.04; notable software packages are PyTorch 1.9.0, numpy 1.19.5, scikit-learn 0.24.2, pybnn 0.0.5, ngboost 0.3.11, and xgboost 1.4.2.

• Report the Use of Hyperparameter Optimization. See Appendix C.

In addition to the code in the supplementary materials, we also provide the experimental results as csv files. Running the predictors and hypervolume simulations takes some time, so easy access to the data of the finished experiments may prove useful for future research. Please see readme.md in the accompanying code zip file for instructions.

B ENCODINGS AND PREDICTORS

B.1 DATA ENCODINGS

Every architecture a ∈ A requires a unique representation, which depends on the predictor in use. The common encoding types are:

Adjacency one-hot: Each architecture a is uniquely defined by the candidate operation chosen on every path. For example, each architecture in NAS-Bench-201 consists of a repeated cell structure, which has five candidate operations on each of its six paths. There are therefore 5^6 = 15625 unique architectures, each of which can be referenced by a sequence of operation indices such as [0 1 2 3 4 0]. Many predictors perform better if the sequence is presented as a one-hot encoding, which in this case is [10000 01000 00100 00010 00001 10000]. Similarly, the path-encoding (used by BANANAS) is a one-hot representation of the candidate operations used on all possible paths.
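The following is a minimal sketch (in Python) of the adjacency one-hot encoding described above, for a NAS-Bench-201-like cell with six paths and five candidate operations; the operation names and their order are illustrative assumptions, not the benchmark's canonical ordering:

```python
# Illustrative adjacency one-hot encoding for a NAS-Bench-201-like cell.
NUM_PATHS = 6
CANDIDATE_OPS = ["zero", "skip", "conv1x1", "conv3x3", "avgpool3x3"]  # assumed order

def adjacency_one_hot(op_indices):
    """Turn a sequence of operation indices, e.g. [0, 1, 2, 3, 4, 0],
    into a flat one-hot vector of length NUM_PATHS * len(CANDIDATE_OPS)."""
    assert len(op_indices) == NUM_PATHS
    encoding = []
    for idx in op_indices:
        one_hot = [0] * len(CANDIDATE_OPS)
        one_hot[idx] = 1
        encoding.extend(one_hot)
    return encoding

# [0, 1, 2, 3, 4, 0] -> 10000 01000 00100 00010 00001 10000
print(adjacency_one_hot([0, 1, 2, 3, 4, 0]))
```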
Since the connectivity within cells is fixed for HW-NAS-Bench and TransNAS-Bench-101, the path-encoding provides no more information than the adjacency one-hot encoding. If the connectivity can be adjusted more freely, as in the NAS-Bench-101 search space, the additional information may improve the fit.

The encodings for BONAS, GCN, and NAO each provide further information in addition to the adjacency one-hot vector, most notably the adjacency matrix. This {0, 1}^((N+2)×(N+2)) matrix describes which of the N architecture paths (rows) serves as an input to each other path (columns), and also includes the cell input/output.

B.2 PREDICTORS

We briefly describe the 18 predictor methods in our experiments. We adopt their implementations from the NASLib library (see Appendix A), which we extend with Linear Regression, Ridge Regression, and Support Vector Machines from the scikit-learn package, as well as a simple Lookup Table implementation. Unless specified otherwise, the methods use the adjacency one-hot encoding.

• BANANAS: An ensemble of three MLP models with five to 20 layers, each using the path-encoding (White et al., 2019).

• Bayesian Linear Regression: A Bayesian model that assumes (1) a linear dependency between inputs and outputs, and (2) that the samples are normally distributed (Bishop, 2007).

• BOHAMIANN: A Bayesian inference predictor using stochastic gradient Hamiltonian Monte Carlo (SGHMC) to sample from a Bayesian neural network (Springenberg et al., 2016).

• BONAS: Bayesian Optimization for NAS (Shi et al., 2020) uses a GCN predictor within an outer loop of Bayesian optimization, as a meta-learning task. The GCN requires encoding the adjacency matrix of each architecture.

• Gaussian Process: A simple model that assumes a joint Gaussian distribution underlying the training data (Rasmussen, 2003).

• GCN: A Graph Convolutional Network that makes use of an adjacency-matrix encoding of each architecture (Wen et al., 2020).

• Linear Regression: A simple model that assumes an independent value/cost for each operation/layer, which only needs to be summed up. Unlike the Lookup Table model, it uses a least-squares fit on the training data.

• Lookup Table: The simplest and perhaps most widely used model for differentiable architecture selection. It generally assumes a single baseline architecture (e.g. [001 001] in adjacency one-hot encoding) and a lookup matrix in R^((num layers)×(num candidates)) that contains the increase/reduction in the metric for each layer and candidate operation. The metric value of a new architecture can then be predicted by a simple sum over the baseline value and the respective matrix entries. The model is obtained by measuring either each candidate operation in isolation, or the differences between the baseline architecture and specific variations (e.g. [010 001] or [100 001], to measure the first candidates). This model always requires 1 + (num layers) · (num candidates − 1) neighbored architectures to fit. We detail the resulting correlation values for each used dataset in Appendix E.

• LGBoost: Light Gradient Boosting Machine (LightGBM or LGBoost, Ke et al. (2017)) is a lightweight gradient-boosted decision tree model.

• MLP: We use fully-connected Multi-Layer Perceptrons in two size categories.

• NAO: NAO (Luo et al., 2018) uses an encoder-decoder topology, which encodes/compresses an architecture to a continuous representation and decodes it again. This representation is further used to make architecture predictions.
• NGBoost: Natural Gradient Boosting (NGBoost, Duan et al. (2020)) is a gradient-boosted decision tree model that uses natural gradients to estimate uncertainty.

• Ridge Regression: Ridge Regression (Saunders et al., 1998) extends the Linear Regression least-squares fit with a regularization term that serves as a bias-variance tradeoff.

• Random Forests: An ensemble of decision trees (Liaw et al., 2002).

• Sparse Gaussian Process: An approximation of Gaussian Processes that summarizes the training data (Candela & Rasmussen, 2005).

• Support Vector Machine: A model that maps its inputs into a high-dimensional space, where training samples are used as support vectors for the decision boundaries (Cortes & Vapnik, 1995).

• XGBoost: eXtreme Gradient Boosting (XGBoost, Chen & Guestrin (2016)) is a gradient-boosted decision tree model.

C HYPERPARAMETERS

We list our default values and hyper-parameter sample ranges in Table 2. For comparability with White et al. (2021), we only change the values of the newly introduced parameterized predictors: Ridge Regression, Support Vector Machines, and small MLPs.

Model | Hyper-parameter | Range/Choice | Log-transform | Default
BANANAS | Num. layers | [5, 25] | false | 20
BANANAS | Layer width | [5, 25] | false | 20
BANANAS | Learning rate | [0.0001, 0.1] | true | 0.001
BONAS | Num. layers | [16, 128] | true | 64
BONAS | Batch size | [32, 256] | true | 128
BONAS | Learning rate | [0.00001, 0.1] | true | 0.0001
GCN | Num. layers | [64, 200] | true | 144
GCN | Batch size | [5, 32] | true | 7
GCN | Learning rate | [0.00001, 0.1] | true | 0.0001
GCN | Weight decay | [0.00001, 0.1] | true | 0.0003
LGBoost | Num. leaves | [10, 100] | false | 31
LGBoost | Learning rate | [0.001, 0.1] | true | 0.05
LGBoost | Feature fraction | [0.1, 1] | false | 0.9
MLP (small) | Num. layers | [2, 5] | false | 3
MLP (small) | Layer width | [16, 128] | true | 32
MLP (small) | Learning rate | [0.0001, 0.1] | true | 0.001
MLP (small) | Activation function | {relu, tanh, hardswish} | n/a | relu
MLP (huge) | Num. layers | [5, 25] | false | 20
MLP (huge) | Layer width | [5, 25] | false | 20
MLP (huge) | Learning rate | [0.0001, 0.1] | true | 0.001
NAO | Num. layers | [16, 128] | true | 64
NAO | Batch size | [32, 256] | true | 100
NAO | Learning rate | [0.00001, 0.1] | true | 0.001
NGBoost | Num. estimators | [128, 512] | true | 64
NGBoost | Learning rate | [0.001, 0.1] | true | 0.081
NGBoost | Max depth | [1, 25] | false | 6
NGBoost | Max features | [0.1, 1] | false | 0.79
Ridge Regression | Regularization α | [0.25, 2.5] | false | 1.0
Random Forests | Num. estimators | [16, 128] | true | 116
Random Forests | Max features | [0.1, 0.9] | true | 0.17
Random Forests | Min samples (leaf) | [1, 20] | false | 2
Random Forests | Min samples (split) | [2, 20] | true | 2
Support Vector Machine | Regularization C | [0.5, 1.5] | false | 1.0
Support Vector Machine | Kernel | {linear, poly, rbf, sigmoid} | n/a | rbf
XGBoost | Max depth | [1, 15] | false | 6
XGBoost | Min child weight | [1, 10] | false | 1
XGBoost | Col sample (tree) | [0, 1] | false | 1
XGBoost | Learning rate | [0.001, 0.5] | true | 0.3
XGBoost | Col sample (level) | [0, 1] | false | 1

Table 2: Hyper-parameter ranges and default values of the configurable predictors.

D NAS-BENCH-201 / HW-NAS-BENCH CELL DESIGN

Figure 5: Basic NAS-Bench-201 / HW-NAS cell design (a shared cell topology with six numbered paths). Each of the six orange paths is finalized with exactly one out of five candidate operations {Zero, Skip, Convolution 1×1, Convolution 3×3, Average Pooling 3×3}.

E SELECTION OF DATASETS

Dataset | LinReg@11 | LinReg@25 | LinReg@55 | LinReg@124 | LinReg@276 | LinReg@614 | LinReg@1366 | LinReg@3036 | LinReg@6748 | LinReg@15000 | XGBoost@15000 | LUT
ImageNet16-120-raspi4 latency | 0.324 | 0.205 | 0.606 | 0.676 | 0.705 | 0.716 | 0.715 | 0.723 | 0.728 | 0.729 | 0.757 | 0.443
cifar100-pixel3 latency | 0.392 | 0.292 | 0.732 | 0.780 | 0.797 | 0.803 | 0.806 | 0.809 | 0.812 | 0.812 | 0.877 | 0.484
cifar10-edgegpu latency | 0.370 | 0.258 | 0.724 | 0.790 | 0.806 | 0.819 | 0.820 | 0.822 | 0.830 | 0.829 | 0.926 | 0.175
cifar100-edgegpu energy | 0.376 | 0.275 | 0.732 | 0.793 | 0.812 | 0.821 | 0.821 | 0.823 | 0.831 | 0.831 | 0.920 | 0.221
ImageNet16-120-eyeriss arith. int. | 0.369 | 0.293 | 0.748 | 0.805 | 0.817 | 0.827 | 0.825 | 0.832 | 0.843 | 0.846 | 0.970 | 0.861
cifar10-pixel3 latency | 0.388 | 0.300 | 0.733 | 0.780 | 0.797 | 0.805 | 0.805 | 0.810 | 0.813 | 0.813 | 0.878 | 0.475
cifar10-raspi4 latency | 0.393 | 0.315 | 0.740 | 0.787 | 0.799 | 0.805 | 0.807 | 0.810 | 0.813 | 0.813 | 0.890 | 0.462
cifar100-raspi4 latency | 0.393 | 0.308 | 0.744 | 0.786 | 0.801 | 0.807 | 0.810 | 0.810 | 0.814 | 0.814 | 0.888 | 0.445
ImageNet16-120-pixel3 latency | 0.398 | 0.312 | 0.739 | 0.786 | 0.799 | 0.807 | 0.809 | 0.812 | 0.815 | 0.816 | 0.884 | 0.509
cifar100-edgegpu latency | 0.375 | 0.268 | 0.728 | 0.793 | 0.810 | 0.821 | 0.820 | 0.822 | 0.831 | 0.831 | 0.924 | 0.191
cifar10-edgegpu energy | 0.375 | 0.284 | 0.728 | 0.792 | 0.810 | 0.821 | 0.823 | 0.824 | 0.831 | 0.831 | 0.922 | 0.183
ImageNet16-120-edgegpu energy | 0.377 | 0.281 | 0.733 | 0.797 | 0.814 | 0.825 | 0.825 | 0.826 | 0.834 | 0.833 | 0.926 | 0.280
ImageNet16-120-edgegpu latency | 0.379 | 0.264 | 0.737 | 0.799 | 0.817 | 0.826 | 0.826 | 0.828 | 0.836 | 0.835 | 0.938 | 0.277
cifar10-eyeriss arith. int. | 0.384 | 0.296 | 0.757 | 0.811 | 0.826 | 0.835 | 0.832 | 0.843 | 0.854 | 0.854 | 0.969 | 0.826
cifar100-eyeriss arith. int. | 0.384 | 0.297 | 0.757 | 0.811 | 0.826 | 0.835 | 0.833 | 0.844 | 0.855 | 0.856 | 0.971 | 0.830
ImageNet16-120-fpga latency | 0.443 | 0.494 | 0.904 | 0.936 | 0.947 | 0.951 | 0.948 | 0.951 | 0.952 | 0.952 | 0.983 | 0.965
ImageNet16-120-fpga energy | 0.443 | 0.494 | 0.905 | 0.935 | 0.947 | 0.951 | 0.948 | 0.951 | 0.952 | 0.952 | 0.983 | 0.965
ImageNet16-120-eyeriss latency | 0.457 | 0.937 | 0.953 | 0.954 | 0.954 | 0.954 | 0.953 | 0.953 | 0.954 | 0.954 | 0.952 | 0.989
cifar10-eyeriss latency | 0.461 | 0.943 | 0.959 | 0.959 | 0.960 | 0.960 | 0.959 | 0.960 | 0.960 | 0.960 | 0.958 | 0.995
cifar100-eyeriss latency | 0.462 | 0.946 | 0.963 | 0.963 | 0.963 | 0.963 | 0.963 | 0.963 | 0.964 | 0.963 | 0.962 | 0.998
cifar10-eyeriss energy | 0.456 | 0.967 | 0.985 | 0.985 | 0.985 | 0.985 | 0.985 | 0.985 | 0.985 | 0.985 | 0.975 | 0.996
ImageNet16-120-eyeriss energy | 0.458 | 0.967 | 0.985 | 0.985 | 0.986 | 0.985 | 0.986 | 0.985 | 0.985 | 0.986 | 0.972 | 0.998
cifar100-eyeriss energy | 0.457 | 0.967 | 0.985 | 0.985 | 0.985 | 0.986 | 0.985 | 0.986 | 0.986 | 0.986 | 0.976 | 0.998
cifar10-fpga energy | 0.458 | 0.973 | 0.987 | 0.987 | 0.987 | 0.987 | 0.987 | 0.987 | 0.987 | 0.987 | 0.986 | 0.999
cifar100-fpga energy | 0.458 | 0.973 | 0.987 | 0.987 | 0.987 | 0.987 | 0.987 | 0.987 | 0.987 | 0.987 | 0.986 | 0.999
cifar100-fpga latency | 0.457 | 0.973 | 0.987 | 0.987 | 0.987 | 0.987 | 0.987 | 0.987 | 0.987 | 0.987 | 0.986 | 0.999
cifar10-fpga latency | 0.457 | 0.973 | 0.987 | 0.987 | 0.987 | 0.987 | 0.987 | 0.987 | 0.987 | 0.987 | 0.986 | 0.999

Table 3: Kendall's Tau test correlation for Linear Regression (LinReg), XGBoost, and Lookup Table (LUT) on all HW-NAS-Bench datasets (rows), for different amounts of available training data (columns), tested on the remaining 625 samples. The Lookup Table model is tested on all 15625 architectures. We selected the five data sets at the top.

Dataset | LinReg@9 | LinReg@18 | LinReg@34 | LinReg@65 | LinReg@123 | LinReg@234 | LinReg@442 | LinReg@837 | LinReg@1585 | LinReg@2999 | XGBoost@2999 | LUT
jigsaw | 0.201 | 0.227 | 0.410 | 0.535 | 0.586 | 0.605 | 0.616 | 0.624 | 0.631 | 0.632 | 0.661 | 0.201
class object | 0.268 | 0.262 | 0.518 | 0.646 | 0.711 | 0.741 | 0.759 | 0.771 | 0.780 | 0.780 | 0.828 | 0.701
room layout | 0.275 | 0.271 | 0.527 | 0.653 | 0.721 | 0.753 | 0.768 | 0.780 | 0.789 | 0.789 | 0.896 | 0.685
class scene | 0.275 | 0.268 | 0.527 | 0.653 | 0.721 | 0.755 | 0.768 | 0.782 | 0.789 | 0.790 | 0.907 | 0.710
segmentsemantic | 0.282 | 0.259 | 0.545 | 0.684 | 0.746 | 0.780 | 0.798 | 0.809 | 0.816 | 0.818 | 0.871 | 0.726

Table 4: Kendall's Tau test correlation for Linear Regression and XGBoost on the five used TransNAS datasets (rows), for different amounts of available training data (columns), tested on the remaining 256 samples. The Lookup Table model (LUT) is tested on all 3256 architectures.

HW-NAS-Bench: To select five datasets that are (1) non-linear and (2) different from one another, we first fit Linear Regression to every available dataset, with the results listed in Table 3.
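As an illustration of this kind of evaluation, the following is a minimal sketch (in Python) of fitting Linear Regression on one-hot architecture encodings and computing the Kendall's Tau test correlation of its predictions; the arrays are random placeholders rather than benchmark data, and the training-set size is just one of the column values above:

```python
import numpy as np
from scipy.stats import kendalltau
from sklearn.linear_model import LinearRegression

# Placeholder data: X holds adjacency one-hot encodings, y a measured
# hardware metric (e.g. latency) for every architecture of a dataset.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(15625, 30)).astype(float)
y = rng.random(15625)

n_train = 25  # one of the training-set sizes that appear as table columns
X_train, y_train = X[:n_train], y[:n_train]
X_test, y_test = X[n_train:n_train + 625], y[n_train:n_train + 625]

model = LinearRegression().fit(X_train, y_train)
tau, _ = kendalltau(model.predict(X_test), y_test)
print(f"Kendall's Tau on the held-out architectures: {tau:.3f}")
```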
The bottom 12 datasets can be accurately fit with only 25 training samples, so they are not a very interesting challenge. On these datasets, the Lookup Table model achieves exceptional performance. Since the networks for CIFAR10, CIFAR100 and ImageNet16-120 only differ slightly, their measurements on the same device and metric (e.g. raspi4 latency) are very similar. To improve the generalizability of our results, we thus select datasets on different devices and metrics, which are listed at the top of Table 3. As displayed in Figure 6, their data distributions are generally different.

TransNAS-Bench-101: Since the latency measurements of the architectures are generally very similarly distributed (see Figure 7), it is not necessary to train the predictors on all of them. We select all datasets that provide the test loss and inference time attributes for all architectures, resulting in exactly the five datasets listed in Section 4 (the other two datasets contain more specific test losses).

Figure 6 (histograms of ImageNet16-120-raspi4_latency, cifar100-pixel3_latency, cifar10-edgegpu_latency, cifar100-edgegpu_energy, and ImageNet16-120-eyeriss_arithmetic_intensity): How the data of each selected HW-NAS-Bench dataset is distributed (not normalized).

Figure 7 (histograms of class_object, class_scene, jigsaw, room_layout, and segmentsemantic): How the data of each selected TransNAS-Bench-101 dataset is distributed (not normalized). Since all architectures are measured for latency on the same hardware, the resulting datasets are much less diverse than the HW-NAS-Bench ones.

F PREDICTOR FIT TIME
Figure 8: Fit time (in seconds) of the predictors, depending on the training set size, shown both absolute and centered on the average, for the TransNAS and HW-NAS datasets. By far the most expensive methods are network-based. However, a significant portion of this time is spent on the hyper-parameter optimization prior to the actual fitting.

G APPROXIMATING PREDICTOR MISTAKES

Figure 9: Further examples of predictor deviation distributions (each with a fitted normal density), as visualized in the center of Figure 2. Left: Linear Regression on CIFAR100, edgegpu, energy consumption. Center: Support Vector Machine on Jigsaw. Right: small MLP on ImageNet16-120, raspi4, latency.

Intuitively, the predictor deviation distributions (see Figures 2 and 9) generally resemble a normal distribution. However, most predictors:

(1) have a notable peak, sometimes off-center (e.g. at x = 0.2),
(2) have less density than a normal distribution almost everywhere else, and
(3) have some outliers (e.g. at x > 1.5) that are extremely unlikely under a normal distribution.

We measured the p-value for different distributions on the first 100 test samples using a T-Test, every time we evaluated a predictor. The average statistics can be found in Table 5. Since a large number of empirical observations generally pushes the p-value towards 0, this only serves to compare the distributions to each other. We find that the outliers (3) appear often enough, and are so unlikely under a normal distribution, that even a uniform distribution has higher statistical support. Consequently, we approximate the common predictor deviations by sampling from a mixed distribution that addresses (1) to (3).
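One way such a comparison could be set up is sketched below (in Python): each candidate distribution is fitted to the observed deviations, a sample is drawn from the fit, and the two samples are compared with a two-sample t-test. The data and the exact test setup are placeholders and not necessarily the procedure behind Table 5:

```python
import numpy as np
from scipy import stats

# Placeholder: deviations of one predictor on its first 100 test samples.
rng = np.random.default_rng(0)
deviations = rng.normal(0.1, 0.4, size=100)

# lognorm (also listed in Table 5) is omitted here: it would need a location
# shift below the sample minimum, since deviations can be negative.
candidates = {"normal": stats.norm, "cauchy": stats.cauchy,
              "t": stats.t, "uniform": stats.uniform}

for name, dist in candidates.items():
    params = dist.fit(deviations)                          # fit the candidate
    sampled = dist.rvs(*params, size=100, random_state=0)  # sample from the fit
    _, p_value = stats.ttest_ind(deviations, sampled)      # compare both samples
    print(f"{name:>8s}: p-value {p_value:.3f}")
```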
_• The predictors produce non-smooth distributions. We simulate that by sampling 15 times_ fewer values as needed, and repeat them as often. The code for the simulation is also provided (see Appendix A). As seen in Figure 10, the resulting simulated deviation distributions generally resemble a common predictor pattern. We do not account for differences in predictors, training set sizes or more, since that may become too specific and overengineered. Appendix I visualizes simulation sanity checks. We find that the simulation is slightly pessimistic and simplified, but resembles the results of actual predictors. Simulated predictor deviations |normal fit, std mixed dist. ge|normal fit, std mixed dist. ge|=0.500 nerated with std=0.5| |---|---|---| |||| 2 0 2 normal fit, std=0.500 mixed dist. generated with std=0.5 deviation of the simulated predictions Simulated predictor deviations |normal fit, std= mixed dist. gene|normal fit, std= mixed dist. gene|0.500 rated with std=0.5| |---|---|---| |||| 2 1 0 1 2 normal fit, std=0.500 mixed dist. generated with std=0.5 deviation of the simulated predictions Simulated predictor deviations |normal fit, std mixed dist. ge|normal fit, std mixed dist. ge|=0.500 nerated with std=0.5| |---|---|---| |||| 2 0 2 normal fit, std=0.500 mixed dist. generated with std=0.5 deviation of the simulated predictions 1.6 1.4 1.2 1.0 0.8 0.6 0.4 0.2 0.0 1.4 1.2 1.0 0.8 0.6 0.4 0.2 0.0 1.2 1.0 0.8 0.6 0.4 0.2 0.0 Figure 10: The sampled values of gaussian+uniform fit the measured predictor mistakes better than a single distribution, as they are roughly normally distributed, but include outliers. ----- MEASURING SIMULATED MISTAKES 0.45 0.40 0.45 0.40 0.35 0.30 0.35 0.30 0.25 0.25 |true pareto front predicted pareto front all architectures selected architectures|Col2|Col3| |---|---|---| ||true pareto front predicted pareto front all architectures selected architectures|| ||4 normali|2 0 2 4 zed ImageNet16-120-raspi4_latency| |Col1|Col2|Col3|Col4|Col5|Col6|Col7|Col8| |---|---|---|---|---|---|---|---| ||||||||| ||||||||| ||||||||| ||||||||| ||true pa discove selecte|re re d|to front, d paret arch., M|HV=2.93 o front, HV RAall = 3.22||=2.67 %, MRApa|reto = 3.77%| ||||||||| ||||||||| |2.|0 1.5 normali||1.0 0.5 0.0 0.5 zed ImageNet16-120-raspi4_latency||||| true pareto front predicted pareto front all architectures selected architectures Figure 11: Similar to Figure 3. When the discovered Pareto set is considerably worse than the true Pareto set, it is possible for the Mean Reduction of Accuracy of the Pareto subset (MRApareto) to be _worse than the average over all architectures (MRAall). This naturally happens more frequently for_ worse predictors with a high sampling std. and low KT correlation. Consequentially, the difference between MRAall and MRApareto is wider for better predictors (see Figure 4). Additionally, all of the selected non-Pareto-front members are clustered in a high-latency area and redundant with each other. This emphasizes the limitations of just considering drops in accuracy, as the hardware metric aspect is ignored. In this case, the predictor-guided selection failed to find a low-latency solution. In this regard, hypervolume is a better but less intuitive metric. hardware metric hardware metric Figure 12: Examples to explain measurement methods. 
Left: The distance of each selected candidate architecture C1 to the true Pareto front is measured, for accuracy and the hardware metric. C1 is dominated by A2, A3, and A4 of the true Pareto set. A2 has a slightly higher accuracy than C1 while being much better on the hardware metric, e.g. latency. A4 has a slightly better hardware metric value, but a much higher accuracy. Given several candidate architectures, their differences are averaged.

Right: We compute the reference point for the hypervolume (for two objectives: the area under a Pareto front) by multiplying the highest hardware metric value of the true Pareto front by 1.1 and setting the accuracy to 0. While we are consistent throughout all experiments, this choice is arbitrary, as there is no obviously correct choice for the reference point. Whenever the hypervolume of a supposed Pareto front is computed, the reference point of the true Pareto front is reused. Thus, choosing inferior architectures will always reduce the hypervolume. We arbitrarily chose the multiplier of m = 1.1 as a middle ground between making the rightmost point of the Pareto front irrelevant (m = 1.0) and overemphasizing it (m >> 1.0).

I SIMULATION SANITY CHECK

(Left panel legend of Figure 13: the five selected HW-NAS-Bench datasets and their mean. Right panel axis ticks, std. of prediction deviations / Kendall's Tau: 0.0/1.00, 0.2/0.88, 0.4/0.78, 0.6/0.69, 0.8/0.62, 1.0/0.57. Plot annotation: KT = -0.75, SCC = -0.88, PCC = 0.77.)

Figure 13: Standard deviation over the predictor deviations (x axis) and Kendall's Tau correlation (y axis), for the trained predictors on HW-NAS-Bench (left) and in simulation (right). The simulated predictor inaccuracies are slightly pessimistic (low KT), but still match the true values.
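For reference, the following is a minimal sketch (in Python) of a two-objective hypervolume computation with the reference point construction described for Figure 12 (right); the Pareto-front values are made up for illustration and this is not the released implementation:

```python
def hypervolume_2d(pareto_points, ref_point):
    """Area dominated by a 2-D Pareto front (hardware metric: minimize,
    accuracy: maximize) relative to a reference point (ref_metric, ref_acc)."""
    ref_metric, ref_acc = ref_point
    # Keep only points that dominate the reference point, sorted by metric.
    pts = sorted(p for p in pareto_points if p[0] <= ref_metric and p[1] >= ref_acc)
    volume = 0.0
    # Each point contributes a rectangle that extends up to the next point's
    # hardware-metric value (or to the reference point for the last one).
    for i, (metric, acc) in enumerate(pts):
        next_metric = pts[i + 1][0] if i + 1 < len(pts) else ref_metric
        volume += (next_metric - metric) * (acc - ref_acc)
    return volume

# Made-up true Pareto front: (hardware metric, accuracy in %).
true_front = [(18.0, 44.0), (20.0, 50.0), (24.0, 56.0), (30.0, 62.0)]
reference = (1.1 * max(m for m, _ in true_front), 0.0)  # +10% on the metric, accuracy 0
print(hypervolume_2d(true_front, reference))
```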
Std. of the XGB deviation distribution by candidate operation (rows) and by how often that candidate occurs in the architecture (columns); the entire test set ("All candidates", 625 samples) has a std. of 0.445:

Candidate | not at all | exactly once | exactly twice
Zero | 0.541 | 0.443 | 0.356
Skip | 0.532 | 0.436 | 0.412
Conv1x1 | 0.462 | 0.470 | 0.393
Conv3x3 | 0.146 | 0.403 | 0.565
Pool | 0.446 | 0.411 | 0.477

Table 6: How a trained XGB predictor deviates from the ground-truth values for different architecture subsets, akin to Figure 2. While they are not exactly the same, they still resemble the distribution over the entire test set (top plot, 625 samples). One noteworthy exception is when no Conv3x3 operations are used at all, in which case the standard deviation is considerably smaller.
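A minimal sketch (in Python) of how such a per-subset breakdown can be computed from operation-index sequences and prediction deviations; the data is a random placeholder and the operation order follows the illustrative encoding from Appendix B.1:

```python
import numpy as np

rng = np.random.default_rng(0)
# Placeholder test set: operation-index sequences (6 paths, 5 candidate ops)
# and the corresponding prediction deviations of some trained predictor.
ops = rng.integers(0, 5, size=(625, 6))
deviations = rng.normal(0.0, 0.45, size=625)

CANDIDATE_OPS = ["Zero", "Skip", "Conv1x1", "Conv3x3", "Pool"]

for idx, name in enumerate(CANDIDATE_OPS):
    counts = (ops == idx).sum(axis=1)  # occurrences of this op per architecture
    for occ, label in [(0, "not at all"), (1, "exactly once"), (2, "exactly twice")]:
        subset = deviations[counts == occ]
        if subset.size:
            print(f"{name:>8s}, {label:>13s}: std = {subset.std():.3f} ({subset.size} archs)")
```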