# WHAT TO EXPECT OF HARDWARE METRIC PREDICTORS IN NEURAL ARCHITECTURE SEARCH

**Anonymous authors**
Paper under double-blind review

ABSTRACT

Modern Neural Architecture Search (NAS) focuses on finding the best performing architectures in hardware-aware settings; e.g., those with an optimal tradeoff
of accuracy and latency. Due to many advantages of prediction models over live
measurements, the search process is often guided by estimates of how well each
considered network architecture performs on the desired metrics. Typical prediction models range from operation-wise lookup tables over gradient-boosted trees
and neural networks, with little known information on how they compare. We
evaluate 18 different performance predictors on ten combinations of metrics, devices, network types, and training tasks, and find that MLP models are the most
promising. We then simulate and evaluate how the guidance of such prediction
models affects the subsequent architecture selection. Due to inaccurate predictions, the selected architectures are generally suboptimal, which we quantify as
an expected reduction in accuracy and hypervolume. We show that simply verifying the predictions of just the selected architectures can lead to substantially
improved results. Under a time budget, we find it preferable to use a fast and
inaccurate prediction model over accurate but slow live measurements.

1 INTRODUCTION

Modern neural network architectures are designed not only considering their primary objective,
such as accuracy. While existing architectures can be scaled down to work with the limited available
memory and computational power of, e.g., mobile phones, they are significantly outperformed by
specifically designed architectures (Howard et al., 2017; Sandler et al., 2018; Zhang et al., 2018;
Ma et al., 2018). Standard hardware metrics include memory usage, number of model parameters,
Multiply-Accumulate operations, energy consumption, latency, and more; each of which may be
limited by the hardware platform or network task. As the range of tasks and target platforms grows,
specialized architectures and the methods to find them efficiently are gaining importance.

The automated design and discovery of specialized architectures is the main intent of Neural Architecture Search (NAS). This recent field of study repeatedly broke state-of-the-art records (Zoph et al.,
2018; Real et al., 2018; Cai et al., 2019; Tan & Le, 2019; Chu et al., 2019a; Hu et al., 2020) while
aiming to reduce the researchers’ involvement with this tedious and time-consuming process to a
minimum. As the performance of each considered architecture needs to be evaluated, the hardware
metrics need to be either measured live or guessed by a trained prediction model. While measuring live has the advantage of not suffering from inaccurate predictions, the corresponding hardware
needs to be available during the search process. Measuring on-demand may also significantly slow
down the search process and necessitates further measurements for each new architecture search.
On the other hand, a prediction model abstracts the hardware from the search code and simplifies
changes to the optimization targets, such as metrics or devices. The data set to train the predictor
also has to be collected only once so that a trained predictor then works in the absence of the hardware it is predicting for, e.g., in a cloud environment. Furthermore, a differentiable predictor can be
used for gradient-based architecture optimization of typically non-differentiable metrics (Cai et al.,
2019; Xu et al., 2020; Nayman et al., 2021).

While the many advantages make predictors a popular choice of hardware-aware NAS (e.g. Xu
et al. (2020); Wu et al. (2019); Wan et al. (2020); Dai et al. (2020); Nayman et al. (2021)), there
are no guidelines on which predictors perform best, how many training samples are required, or


-----

what happens when a predictor is inaccurate. This work investigates the above points. As a first
contribution, we conduct large-scale experiments on ten hardware-metric datasets chosen from HW-NAS-Bench (Li et al., 2021a) and TransNAS-Bench-101 (Duan et al., 2021). We explore how
powerful the different predictors are when using different amounts of training data and whether
these results generalize across different network architecture types. As a second contribution, we
extensively simulate the subsequent architecture selection to investigate the impact of inaccurate
predictors. Our results demonstrate the effectiveness of network-based prediction models; provide
insights into predictor mistakes and what to expect from them. To facilitate reproducibility and
further research, our experimental results and code are made available in Appendix A.

2 RELATED WORK

**NAS Benchmarks:** As the search spaces of NAS methods often differ from one another and lack
extensive studies, the difficulty of fair comparisons and reproducibility has become a major concern
(Yang et al., 2019; Li & Talwalkar, 2020). To alleviate this problem, researchers have exhaustively
evaluated search spaces of several thousand architectures to create benchmarks (Ying et al., 2019;
Dong & Yang, 2020; Dong et al., 2020; Siems et al., 2020), containing detailed statistics for each
architecture. TransNAS-Bench-101 (Duan et al., 2021) evaluates several thousand architectures
across seven diverse tasks and finds that the best task-specific architectures may vary significantly.

The popular NAS-Bench-201 benchmark (Dong & Yang, 2020) has been further extended with ten
different hardware metrics for all 15625 architectures on each of the three data sets CIFAR10, CIFAR100 (Krizhevsky et al., 2009) and ImageNet16-120 (Chrabaszcz et al., 2017). Major findings of
this HW-NAS Bench (Li et al., 2021a) include that FLOPs and the number of parameters are a poor
approximation for other metrics such as latency. Many existing NAS methods use such inadequate
substitutes for their simplicity and would benefit from their replacement with better prediction models. Li et al. also find that hardware-specific costs do not correlate well across hardware platforms.
While accounting for each device’s characteristics improves the NAS results, it is also expensive.
Predictors can reduce costs by requiring fewer measurements and shorter query times.[1]

**Predictors in NAS:** Aside from real-time measurements (Tan et al., 2019; Yang et al., 2018),
hardware metric estimation in NAS is commonly performed via Lookup Table (Wu et al., 2019),
Analytical Estimation or a Prediction Model (Dai et al., 2020; Xu et al., 2020). While operation- and layer-wise Lookup Tables can accurately estimate hardware-agnostic metrics, such as FLOPs or
the number of parameters (Cai et al., 2019; Guo et al., 2020; Chu et al., 2019a), they may be suboptimal for device-dependent metrics. Latency and energy consumption have non-obvious factors that
depend on hardware specifics such as memory, cache usage, the ability to parallelize each operation,
and an interplay between different network operations. Such details can be captured with neural
networks (Dai et al., 2020; Mendoza & Wang, 2020; Ponomarev et al., 2020; Xu et al., 2020) or
other specialized models (Yao et al., 2018; Wess et al., 2021).
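
As an illustration of the additive assumption behind such Lookup Tables, the following minimal sketch sums per-operation costs for a cell; the operation names, cost values, and function are purely illustrative and not taken from any benchmark or library.

```python
# Minimal sketch of an operation-wise Lookup Table estimator (hypothetical names,
# not a benchmark API): per-operation costs are measured once, then summed.
from typing import Dict, List

def lookup_table_estimate(architecture: List[str], op_costs: Dict[str, float]) -> float:
    # Additive model: ignores parallelism, caching, and the interplay between
    # operations, which is why it can fail for device-dependent metrics.
    return sum(op_costs[op] for op in architecture)

# Illustrative per-operation latencies (ms) and a cell described by six choices.
op_costs = {"none": 0.0, "skip": 0.1, "conv1x1": 1.3, "conv3x3": 2.8, "avgpool3x3": 0.7}
print(lookup_table_estimate(["conv3x3", "skip", "conv1x1", "none", "avgpool3x3", "conv3x3"], op_costs))
```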

Of particular interest is the correct prediction of the model loss or accuracy, possibly reducing the
architecture search time by orders of magnitude (Mellor et al., 2020; Wang et al., 2021; Li et al.,
2021b). In addition to common predictors such as Linear Regression, Random Forests (Liaw et al.,
2002) or Gaussian Processes (Rasmussen, 2003); specialized techniques may exploit training curve
extrapolation, network weight sharing or gradient information. Our experiments follow the recent
large-scale study of White et al. (2021), who compare 31 diverse accuracy prediction methods based
on initialization and query time, using three NAS benchmarks.

3 PREDICTING HARDWARE METRICS

Our methods follow the large-scale study of White et al. (2021), who compared a total of 31 accuracy prediction methods. The differences between accuracy and hardware-metric prediction, our
selection of predictors, and the general training pipeline are described in this section. In our experiments on HW-NAS-Bench and TransNAS-Bench-101, described in Section 4, we then compare
these predictors across different training set sizes.

[1] For further reading, we recommend a recent survey on hardware-aware NAS (Benmeziane et al., 2021).


-----

**Differences to accuracy predictors:** There are fundamental differences when predicting hardware metrics and the accuracy of network topologies. The most essential is the cost to obtain a
helpful predictor, which may vary widely for accuracy prediction methods. While determining the
test accuracy requires the costly and lengthy training of networks, measuring hardware metrics does
not necessitate any network training. Consequentially, specialized accuracy-estimation methods that
rely on trained networks, loss history, learning curve extrapolation, or early stopping do not apply to
hardware metrics. Furthermore, so-called zero-cost proxies that predict metrics from the gradients
of a single batch are dependent on the network topology but not on the hardware the network is
placed on. Therefore, the dominant hardware-metric predictor family is model-based.

Since all relevant predictors are model-based, they can be compared by their training set size. This
simplifies the initialization time of a predictor as the number of prior measured architectures on
which they are trained. In stark contrast, some accuracy predictors do not need any training data,
while others require several partially or fully trained networks. Since an untrained network and a
few batches suffice to measure a hardware metric, the collection of such a training set is comparatively
inexpensive.

Additionally, hardware predictors are generally used to supplement a one-shot network optimized
for loss or accuracy. Depending on the NAS method, a fully differentiable predictor is required in
order to guide the gradient-based architecture selection. Typical choices are Lookup Tables (Cai
et al., 2019; Nayman et al., 2021) and neural networks (Xu et al., 2020).

**Model-based predictors:** The goal of a predictor $f_p(a)$ is to accurately approximate the function $f(a)$, which may be, e.g., the latency of an architecture $a$ from the search space $\mathcal{A}$. A model-based predictor is trained via supervised learning on a set $D_{train}$ of datapoints $(a, f(a))$, after which it can be inexpensively queried for estimates on further architectures. The collection of the dataset and the duration of the training are referred to as initialization time and training time, respectively.

The quality of such a trained predictor is generally determined by the (ranking) correlation between measurements $\{f(a) \mid a \in \mathcal{A}_{test}\}$ and predictions $\{f_p(a) \mid a \in \mathcal{A}_{test}\}$ on the unseen architectures $\mathcal{A}_{test} \subset \mathcal{A}$. Common correlation metric choices are Pearson (PCC), Spearman (SCC), and Kendall's Tau (KT) (Chu et al., 2019b; Yu et al., 2020; Siems et al., 2020).
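
As a minimal sketch of this evaluation protocol (with made-up architecture encodings and metric values, not data from the benchmarks), a model-based predictor can be fit on measured pairs and scored by Kendall's Tau on held-out architectures:

```python
# Sketch: fit a model-based predictor on (architecture encoding, metric) pairs
# and score it by ranking correlation on unseen test architectures.
import numpy as np
from scipy.stats import kendalltau
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(500, 6))            # toy cell encodings: 6 ops, 5 candidates each
y = X.sum(axis=1) + rng.normal(0, 1, size=500)   # toy "latency" measurements with noise

X_train, y_train = X[:124], y[:124]              # initialization: 124 measured architectures
X_test, y_test = X[124:], y[124:]                # unseen architectures A_test

predictor = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)
tau, _ = kendalltau(y_test, predictor.predict(X_test))
print(f"Kendall's Tau on the test architectures: {tau:.2f}")
```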

Our experiments include 18 model-based predictors from different families: Linear Regression,
Ridge Regression (Saunders et al., 1998), Bayesian Linear Regression (Bishop, 2007), Support
Vector Machines (Cortes & Vapnik, 1995), Gaussian Process (Rasmussen, 2003), Sparse Gaussian Process (Candela & Rasmussen, 2005), Random Forests (Liaw et al., 2002), XGBoost (Chen &
Guestrin, 2016), NGBoost (Duan et al., 2020), LGBoost (Ke et al., 2017), BOHAMIANN (Springenberg et al., 2016), BANANAS (White et al., 2019), BONAS (Shi et al., 2020), GCN (Wen
et al., 2020), small and large Multi-Layer-Perceptrons (MLP), NAO (Luo et al., 2018), and a layer- and operation-wise Lookup Table model. We provide further descriptions and implementation details in
Appendix B.

**Hyper-parameter tuning:** The used predictors vary significantly in how well their default hyper-parameters are tuned, especially in the context of NAS. Additionally, some predictors may internally make use of cross-validation, while others do not. Following White et al.
(2021), we attempt to level the playing field by running a cross-validation random-search over hyperparameters each time a predictor is fit to data. Each search is limited to 5000 iterations and a total
run time of 15 minutes and naturally excludes any test data. The predictor-specific parameter details
are given in Appendix C.
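
A sketch of this tuning protocol using scikit-learn's cross-validated random search follows (the model, parameter ranges, and iteration count are placeholders; the 5000-iteration and 15-minute caps from above would have to be enforced around this call):

```python
# Sketch: cross-validated random search over predictor hyper-parameters,
# run each time a predictor is fit; the test data is never touched here.
from scipy.stats import randint, uniform
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

param_distributions = {                # placeholder search space
    "n_estimators": randint(16, 512),
    "max_depth": randint(2, 32),
    "max_features": uniform(0.1, 0.9),
}
search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions,
    n_iter=200,                        # capped at 5000 iterations in the paper
    cv=3,                              # cross-validation on the training split only
    random_state=0,
)
search.fit(X_train, y_train)           # X_train, y_train as in the previous sketch
predictor = search.best_estimator_
```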

**Training pipeline** To make a reliable comparison, we use the NASLib library (Ruchte et al.
(2020), see Appendix A). We fit each predictor on each dataset and training size 50 times, using
seeds {0, ..., 49}.

Some predictors internally normalize the training values (subtract mean, divide by standard deviation). We choose to explicitly do this for all predictors and datasets, which reduces the dependency
of hyper-parameters (e.g. learning rate) on the dataset and allows us to analyze and compare the
prediction errors across datasets more effectively.
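
A small sketch of this target normalization (statistics are computed on the training targets only and reused for the test targets):

```python
# Sketch: normalize hardware-metric targets per dataset so that prediction
# errors and hyper-parameters are more comparable across datasets.
import numpy as np

def normalize_targets(y_train: np.ndarray, y_test: np.ndarray):
    mean, std = y_train.mean(), y_train.std()
    return (y_train - mean) / std, (y_test - mean) / std

y_train_norm, y_test_norm = normalize_targets(y_train, y_test)  # arrays from the sketches above
```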


-----

4 PREDICTOR EXPERIMENTS

We compare the different predictor models based on two NAS benchmarks, HW-NAS-Bench (Li
et al., 2021a) and TransNAS-Bench-101 (Duan et al., 2021). They differ considerably by their
network tasks, hardware devices, and architecture designs.

**HW-NAS-Bench architecture design and datasets** In HW-NAS-Bench, each architecture is
solely defined by the topology of a building block ("cell"), which is stacked multiple times to create a complete network. Each cell is completely defined by choosing six candidate operations. Since they select from five different candidates each time, there are $5^6 = 15625$ unique cell topologies.
These cells are not fully sequential but contain paths of different lengths, which is visualized in
Appendix D.
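
To make the size of this search space concrete, a cell can be represented as six categorical operation choices; the sketch below enumerates all topologies and produces the flat encoding a predictor would typically consume (the operation names are illustrative, not necessarily the benchmark's exact labels):

```python
# Sketch: encode a cell as 6 categorical operation choices (5 candidates each)
# and enumerate the 5**6 = 15625 unique cell topologies.
from itertools import product

OPS = ["none", "skip_connect", "conv_1x1", "conv_3x3", "avg_pool_3x3"]  # illustrative labels
all_cells = list(product(range(len(OPS)), repeat=6))
print(len(all_cells))  # 15625

def one_hot(cell):
    # Flat 6*5 = 30-dimensional one-hot encoding, a typical predictor input.
    vec = [0] * (6 * len(OPS))
    for edge, op_idx in enumerate(cell):
        vec[edge * len(OPS) + op_idx] = 1
    return vec

print(one_hot(all_cells[123]))
```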

HW-NAS-Bench provides ten hardware statistics on CIFAR10, CIFAR100 (Krizhevsky et al., 2009) and ImageNet16-120 (Chrabaszcz et al., 2017), of which we exclude the incomplete EdgeTPU metric. Thus there are 27 data sets of varying difficulty. As detailed in Appendix E, 12 of them can
be accurately fit with Linear Regression and only 25 training samples. Many are also very similar
since their measured networks differ only by the number of image classes. We therefore select five
datasets that (1.) are not trivial to learn as they are non-linear and (2.) not redundant:



• ImageNet16-120, raspi4, latency
• CIFAR100, pixel3, latency
• CIFAR10, edgegpu, latency
• CIFAR100, edgegpu, energy consumption
• ImageNet16-120, eyeriss, arithmetic intensity


**TransNAS-Bench-101 architecture design and datasets** TransNAS-Bench-101 contains information for 7,352 different network architectures, used as backbones in seven diverse vision tasks.
Since 4,096 are also a subset of HW-NAS-Bench, we focus on the remaining 3,256 architectures
with a macro-level search space. Unlike a micro-level search space, where a cell is stacked multiple
times to create a network, each network layer and block is considered individually. In particular, the
TransNAS-Bench-101 networks consist of four to six pairs of ResNet blocks (He et al., 2016), which
may modify the image size and channels in four ways: not at all, double the channel count, halve the
spatial size, and both. Every network has to double the channel count 1 to 3 times, resulting in 3,256
unique architectures. The networks may consequentially differ in their number of layers (depth), the
number of channels (width), and image size at any layer.

As done for HW-NAS-Bench, we select five of the seven available datasets for their latency measurements. Aside from the self-supervised Jigsaw task, there is little difference between the cross-task
latency measurements (see Appendix E). We evaluate the possibly redundant datasets nonetheless,
since latency predictions in macro-level search spaces are an important domain for NAS on image
classification and object detection tasks:



• Object classification
• Scene classification
• Room layout
• Jigsaw
• Semantic segmentation


**Fitting results and comparison** The results, averaged over all selected HW-NAS-Bench and
TransNAS-Bench-101 datasets, are presented in Figures 1a and 1b, respectively. The left plots
present the absolute predictor performance, the right ones make relative comparisons easier.

Unsurprisingly, more training samples (i.e., evaluated architectures) generally lead to better prediction results, even until the entire search space is known (aside from the test set). This is true for
most of the predictors, although e.g. Gaussian Processes and BOHAMIANN saturate early. The
simple Linear Regression and Ridge Regression models also fail to make proper use of hundreds
of data points but perform decently when only a few training samples are available. Interestingly,
the same is true for the graph-encoding network-based predictors BONAS and GCN. While knowing how the different paths within each cell connect (see Appendix B) is especially useful given
less than fifty training samples, the advantage disappears afterward. In contrast, the graph-encoding
encoder-decoder approach of NAO performs decently at all times.


-----

[Figure 1 plots omitted: Kendall's Tau vs. training set size for each predictor, averaged over the selected HW-NAS-Bench (top) and TransNAS-Bench-101 (bottom) datasets; compared predictors include Lin. Reg., Bayes. Lin. Reg., Ridge Reg., XGBoost, NGBoost, LGBoost, Random Forests, Sparse GP, GP, BOHAMIANN, SVM Reg., NAO, GCN, BONAS, BANANAS, MLP (large), MLP (small), and a Lookup Table.]

(a) Results on HW-NAS-Bench. NAO performs decently at all times, and none of the prediction models requires more than 60 training samples to improve over a Lookup Table model.

(b) Results on TransNAS-Bench-101. Since all network architectures are purely sequential by design, we do not evaluate predictors that specifically encode the architecture connectivity (BANANAS, BONAS, GCN, NAO). After as few as 20 training samples, MLP models outclass all other predictors.

Figure 1: How well the different predictors rank the test architectures, depending on the training set size and averaged over the five selected datasets. Left plots: absolute Kendall's Tau ranking correlation, higher is better. Right plots: same as left, but centered on the predictor-average.

Due to their powerful rule-based approach, tree-based models perform much better given many
training samples. Under such circumstances, LGBoost is a candidate for the best predictor model.
Similarly, the predictions of Support Vector Machines also benefit strongly from more samples.

The models we find to perform best for most training set sizes are MLPs. They are among the top
predictors at almost all times in the HW-NAS-Bench, although tree-based models are competitive
given enough data. After around 3,000 training samples, thinner and deeper MLPs improve over the
wider and smaller ones. The path-encoding BANANAS model behaves similarly to a regular large
MLP but requires more samples to reach the same performance. This is interesting since, aside from
the data encoding, BANANAS is an ensemble of three large MLP models. Even though only the first
network layer is affected by the data encoding, the more complicated path-encoding proves harmful


-----

| Metric | Raspi4 | FPGA | Eyeriss | Pixel3 | EdgeGPU | Tesla V100 |
|---|---|---|---|---|---|---|
| latency | 0.45 (0.75) | 0.99 (0.97) | 0.99 (0.96) | 0.49 (0.78) | 0.21 (0.79) | 0.60 (0.70) |
| energy | – | 0.99 (0.97) | 1.00 (0.99) | – | 0.23 (0.79) | – |
| arithmetic intensity | – | – | 0.84 (0.81) | – | – | – |

Table 1: The Kendall's Tau correlation of Lookup Tables and Linear Regression (in brackets, using only 124 training samples) across metrics and devices. Raspi4, FPGA, Eyeriss, Pixel3, and EdgeGPU are HW-NAS-Bench devices; Tesla V100 is from TransNAS-Bench-101. Lookup Tables perform only marginally better on the FPGA and Eyeriss devices, but considerably worse in all other cases. More detailed statistics are available in Appendix E.

when the connectivity of the architectures in the search space is fixed. On TransNAS-Bench-101,
MLPs perform exceptionally well. They are much better than any other tested predictor once more
than just 20 training samples are available. The small MLP model can achieve a KT correlation
of 80% with just 200 training samples, which takes the best non-network-based predictor (Support
Vector Machine) four times as many. They are also the only models that achieve a KT correlation of
over 90%, about 5% higher than the next best model (LGBoost).

Finally, the Lookup Table models (black horizontal lines) perform poorly in comparison to any other
predictor. Even though building such a model for HW-NAS-Bench datasets requires only 25 neighboring architectures, NAO and GCN perform better after just ten random samples. More than half of the predictor models require fewer than 25 random samples, while the worst need at most 60. On TransNAS-Bench-101, Lookup Tables perform comparably better. Building one requires only 21 neighboring architectures, and it takes most models between 50 and 100 random training samples to achieve better performance. When measured on a per-dataset basis, we find that the Lookup Table models display a severe performance difference, ranging from about 20% KT correlation (cifar10-edgegpu latency and Jigsaw) to over 70% (ImageNet16-120-eyeriss arithmetic intensity and Semantic Segmentation, see Appendix E). Other models prove to be much more stable.

**Devices and Metrics** The previously described results are based on a specific selection of HW-NAS-Bench and TransNAS-Bench-101 datasets that are hard to fit for Lookup Table models. As shown in Table 1, that is not always the case. The FPGA and Eyeriss hardware devices are very suitable for Lookup Tables, for which an almost perfect ranking correlation is possible. Nonetheless, Linear Regression requires only 124 training samples to compete even there and is significantly better in every other case. We finally observe that the difficulty of fitting predictors primarily depends on the hardware device, much more than on the measured metric.

5 EVALUATING THE PREDICTOR-GUIDED ARCHITECTURE SELECTION

Although the experiments in Section 4 greatly assist us in selecting a predictor, it is not clear what a
specific Kendall’s Tau correlation implies for the subsequent architecture selection. Given a perfect
hardware metric predictor (Kendall’s Tau = 1.0), we can expect that an ideal architecture search
process will select the architectures with the best tradeoff of accuracy and the hardware metric, i.e.,
the true Pareto front. On the other hand, imperfect predictions result in the selection of supposedly-best architectures that are wrongly believed to be better.

To study how hardware predictors affect NAS results, we extensively evaluate the selection of such
supposedly-best architectures in simulation. This approach can evaluate any combination of predictor quality, test set size, and dataset, without the technical difficulties of obtaining actual predictor
models that precisely match such requirements. Since the hardware and accuracy prediction models
are usually independent and can be studied in isolation, we use ground-truth accuracy values in all
cases.

**Simulating predictors** The main challenge of the simulation is to quickly and accurately model
predictor outputs. We base our simulation on how predictor-generated values deviate from their
ground-truth targets on the test set, which is explained in Figure 2 and further detailed in Appendix G. Since the simulated deviations are similar to those of actual predictors, simulated predictions are obtained by drawing random values from this deviation distribution and adding them to
the ground-truth hardware measurements.
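
The following minimal sketch illustrates the idea with a plain normal noise model and made-up data (the paper fits a mixed distribution to real predictor deviations; the distribution and values here are placeholders):

```python
# Sketch: simulate predictor outputs by adding noise drawn from a deviation
# distribution to the ground-truth (normalized) hardware measurements, then
# select the architectures that appear Pareto-optimal under the predictions.
import numpy as np

rng = np.random.default_rng(0)

def simulate_predictions(true_metric: np.ndarray, std: float) -> np.ndarray:
    # Placeholder noise model; see Figure 2 and Appendix G for the fitted one.
    return true_metric + rng.normal(0.0, std, size=true_metric.shape)

def pareto_front(accuracy: np.ndarray, cost: np.ndarray) -> np.ndarray:
    # Indices of non-dominated points (maximize accuracy, minimize cost).
    order = np.argsort(cost)
    front, best_acc = [], -np.inf
    for i in order:
        if accuracy[i] > best_acc:
            front.append(i)
            best_acc = accuracy[i]
    return np.array(front)

# Toy data standing in for a benchmark: accuracies and normalized latencies.
accuracy = rng.uniform(40, 70, size=1000)
latency = rng.normal(0.0, 1.0, size=1000)

predicted_latency = simulate_predictions(latency, std=0.5)
selected = pareto_front(accuracy, predicted_latency)  # the supposedly-best architectures
```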


-----

[Figure 2 plots omitted: left, predictions vs. targets for every architecture (KT=0.73, SCC=0.90, PCC=0.88); center, the distribution of predictor deviations with a normal fit (std=0.477); right, simulated predictor deviations drawn from a mixed distribution generated with std=0.5.]

Figure 2: A trained XGBoost prediction model on normalized ImageNet16-120 raspi4-latency test data. Left: The latency prediction (y-axis) for every architecture (blue dot) is approximately correct (red line). Center: The same data as on the left, the distribution of deviations made by the predictor (blue) and a normal distribution fit to them (orange). Right: A mixed distribution can simulate real deviation distributions such as that in the center plot.

[Figure 3 plots omitted: accuracy vs. normalized ImageNet16-120-raspi4_latency, showing the true Pareto front (HV=2.93), the predicted and discovered Pareto fronts (HV=2.86), all architectures, and the selected architectures (MRAall = 1.06%, MRApareto = 0.43%).]


Figure 3: An example of predictor-guided architecture selection, std=0.5. Left: The simulated predictor makes an inaccurate latency prediction for each architecture (blue), resulting in the selection
of the supposedly-best architectures (orange dots). Even though the predicted Pareto front (orange
line) may differ significantly from the ground-truth Pareto front (red line), most selected architectures are close to optimal. Right: Same data as left. The true Pareto front (red) and that of the
selected architectures (orange). Simply accepting all selected architectures results in a Mean Reduction of Accuracy (MRA) of 1.06%, while verifying the predictions and discarding inferior results
improves that to 0.43%. The hypervolume (HV, area under the Pareto-fronts) is reduced by 0.07.

A single example of a simulation can be seen in Figure 3. Although most selected architectures
(orange) are close to the true optimum (red Pareto front), there almost always exists an architecture that has superior accuracy and, at most, the same latency. Simply accepting the 13 selected
architectures in this particular example results in a mean reduction of accuracy (MRAall) of 1.06%.
In other words, the average selected architecture has 1.06% lower accuracy than a comparable one
on the true Pareto front. However, simply verifying the hardware metric predictions through actual
measurements reveals that some selected architectures are suboptimal. By choosing only the Pareto
subset of the selection, the opportunity loss can be reduced to 0.43% (MRApareto).
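
The MRA values can be computed as in the sketch below (an illustrative reading of the metric as described above, reusing the arrays and the pareto_front helper from the simulation sketch): for each selected architecture, take the accuracy gap to the best true-Pareto architecture with at most the same hardware cost, then average.

```python
# Sketch: mean reduction of accuracy (MRA) of selected architectures relative
# to the true Pareto front (accuracy maximized, hardware cost minimized).
import numpy as np

def mra(sel_acc, sel_cost, pareto_acc, pareto_cost) -> float:
    gaps = []
    for acc, cost in zip(sel_acc, sel_cost):
        feasible = pareto_cost <= cost               # equal or lower hardware cost
        best = pareto_acc[feasible].max() if feasible.any() else acc
        gaps.append(max(0.0, best - acc))
    return float(np.mean(gaps))

true_front = pareto_front(accuracy, latency)         # from the simulation sketch
mra_all = mra(accuracy[selected], latency[selected],
              accuracy[true_front], latency[true_front])

# MRA_pareto: verify the selected architectures with real measurements and keep
# only their Pareto subset before computing the metric.
verified = selected[pareto_front(accuracy[selected], latency[selected])]
mra_pareto = mra(accuracy[verified], latency[verified],
                 accuracy[true_front], latency[true_front])
print(mra_all, mra_pareto)
```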

An important property of this approach is that it is independent of any particular optimization
method. The supposedly-best architectures are always correctly identified, which is an upper bound
on how well Bayesian Optimization, Evolutionary Algorithms, and other approaches can perform.
The exemplary MRApareto opportunity loss of 0.43% is therefore unavoidable and depends solely
on the hardware metric predictor, the dataset, and the number of considered architectures.

**Results** We simulate 1,000 architecture selections for each of the five chosen HW-NAS-Bench
datasets, six different test set sizes, and eleven distribution standard deviations between 0.0 and 1.0.
As exemplarily shown in Figure 3, each such simulation allows us to compute the mean reduction
in accuracy (MRA) and the hypervolume (HV) under the Pareto fronts. The most important insights
are visualized in Figure 4 and summarized below.


-----

[Figure 4 plots omitted: MRA and hypervolume as functions of the standard deviation of the prediction deviations and the resulting Kendall's Tau (x-axis from 0.0/1.00 to 1.0/0.57); left, MRA of all vs. the Pareto subset of selected architectures; center, MRA per dataset; right, hypervolume when considering 100 to 15625 architectures.]


Figure 4: Simulation results, with the standard deviation of the predictor deviations and the resulting
KT correlation on the x-axis. Left: Verifying the hardware predictions can significantly improve the
results, even more so for better predictors. Center: The drops in average accuracy are dependent on
the dataset and hardware metric. Right: Considering more candidate architectures and using better
prediction models improves the results; larger values are better.

Verifying the predicted results matters (Figure 4, left). The best prediction models achieve a KT correlation of almost 0.9, which translates to a mean reduction in accuracy of MRAall ≈ 1.5%. That means, for each selected architecture, there exists an architecture of equal or lower latency in the true Pareto set (if latency is the hardware metric) that improves the average accuracy by 1.5%. Even though all selected architectures are believed to form a Pareto set, that is not the case. Their optimal subset has a reduction of only MRApareto ≈ 0.5%, a significant improvement. However, finding this optimal subset requires actually measuring the hardware metrics of the architectures selected by the used NAS method.

Furthermore, the left of Figure 4 aids in anticipating the MRA given a specific predictor. If one used e.g. BOHAMIANN (KT ≈ 0.8, see Figure 1a) instead of MLPs or LGBoost (KT ≈ 0.9), MRApareto increases from around 0.5% to roughly 1.2%. The average accuracy of the selected architectures is thus reduced by another 0.7%, just by using an unsuitable hardware metric predictor. Lookup Tables (KT ≈ 0.45) are not even visualized anymore; they have an MRApareto of over 2.5%.

Another interesting observation is that the gap between MRAall and MRApareto is wider for better
predictors. This is a shortcoming of the MRA metric that we elaborate on in Appendix H.

The dataset and metric matter (Figure 4, center). While we generally present the results averaged
over datasets, there exists some discrepancy among them. Most interestingly, predicting hardware
metrics on harder classification problems (ImageNet16-120 is harder than CIFAR10) also results in
a higher MRA. This is especially important since MRA is an absolute accuracy reduction. Even
though the CIFAR10 networks achieve twice the accuracy of ImageNet16-120 networks, they lose
less absolute accuracy to imperfect predictions. The order of MRA/dataset is primarily stable for
any predictor KT correlation. Finally, as visualized by the shaded areas, the standard deviation
of the MRA is generally huge. Consequentially, predictor-guided NAS is very likely to produce
results of varying quality for each different predictor or search attempt, especially with less accurate
predictors.

The number of considered architectures matters (Figure 4, right). We measure the hypervolume of
the discovered Pareto front (i.e., the area beneath it, see Appendix H), which, unlike MRA, also
considers the hardware metric. Quite obviously, if the architectures from the true Pareto set are not
considered, they cannot be selected. To achieve the highest possible hypervolume of around 4.2
(i.e. find the true Pareto set), every architecture in the search space must be evaluated with a perfect
predictor. This is impossible in most real-world cases, where only a tiny fraction of all possible
architectures can ever be considered.
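
For reference, the 2D hypervolume of a Pareto front (accuracy maximized, hardware cost minimized) is the area it dominates up to a reference point; a small sketch with made-up values follows (the reference point is an assumption, not necessarily the paper's exact choice):

```python
# Sketch: 2D hypervolume dominated by a Pareto front, with accuracy maximized
# and normalized hardware cost minimized, relative to a reference point.
import numpy as np

def hypervolume_2d(acc, cost, ref_acc=0.0, ref_cost=5.0) -> float:
    # Assumes the points already form a Pareto front (no dominated points).
    order = np.argsort(cost)
    acc, cost = np.asarray(acc)[order], np.asarray(cost)[order]
    hv, prev_cost = 0.0, ref_cost
    for a, c in zip(acc[::-1], cost[::-1]):   # sweep from highest to lowest cost
        hv += (prev_cost - c) * (a - ref_acc)
        prev_cost = c
    return hv

print(hypervolume_2d(acc=[55.0, 60.0, 63.0], cost=[-1.0, 0.5, 2.0]))
```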

For HW-NAS-Bench, considering 5000 architectures with perfect live measurements and predicting
the metrics for all 15625 with ranking correlation KT≈0.73 results in selecting equivalent sets of
architectures. As seen in Figure 1a, Ridge Regression can achieve this performance with fewer
than 100 training samples. Thus, a worse predictor leads to better results if it enables considering
more architectures. This insight is especially crucial for live measurements, which are accurate but
slow. Similarly, estimating the network accuracy with super-networks takes much more time than
predicting their performance with a neural predictor (Wen et al., 2020). If the measurement of any
metrics is the limiting factor, a guided selection of a cheap predictor is likely to do better.


-----

6 DISCUSSION

**Chosen prediction methods** Given the nature of hardware-metric prediction, only the subset of
model-based predictors evaluated by White et al. (2021) is suitable. We extended this subset with
four models, including the popular Lookup Table. We abstained from evaluating layer-wise predictors (e.g. Wess et al. (2021)) since such data is not available, and meta-learning predictors (Lee
et al., 2021) due to the vast possibilities to configure them. A separate and specialized comparison
between classic and meta-learning predictors seems favorable to us.

**Simulation limitations** In contrast to evaluating real predictors, the simulation allows us to
quickly make statements for any test set sizes and predictor-inaccuracies. However, naturally, the
results are only approximations. While they match actual values, they are generally slightly pessimistic (see Appendix I). We also limit the simulation to HW-NAS-Bench since the changes to
classification results are easier to interpret than changes to loss values across different problem types. Finally, the current simulation approach cannot investigate methods that
absolutely require a trained one-shot network, such as gradient-based approaches. Including such
methods is an interesting direction for future research.

**Transferability of the results** Our evaluation includes five challenging and diverse datasets
based on the micro-level search space of HW-NAS-Bench and five latency-based datasets of various macro-level search space architectures in TransNAS-Bench-101. Nonetheless, we find shared
trends: All tested prediction models improve over Lookup Tables with small amounts of training
data. Furthermore, most predictors benefit from more training data, even until the entire search
space (aside from the test set) is known. We also find that network-based predictors are generally
best but may be challenged by tree-based predictors if enough training data is available. Given only
a few samples, Ridge Regression performs better than most other models.

**Recommendations** While Lookup Tables are a cheap, simple, and popular model in gradient-based architecture selection, we find a significant variance in performance across tasks and devices
(see Table 1 and Appendix E). We recommend replacing such models with either MLPs or Ridge
Regression, which are more stable, fully differentiable, and often take less than 100 training samples
to achieve better results.

For most realistic scenarios where more than 100 training samples are available, MLP models are
the most promising. They are among the top predictors on HW-NAS-Bench and demonstrate outstanding performance on the TransNAS-Bench-101 datasets. We found that specialized architecture
encodings are primarily beneficial when little training data is available but suspect that they enjoy an additional
advantage when network topologies are more complex and diverse (White et al., 2021).

While the query time for all predictors is less than 0.05s and thus negligible, there is a notable
difference in training time (see Appendix F), primarily due to the hyper-parameter optimization. We
recommend Ridge Regression for very small amounts of training data and LGBoost otherwise, if training time is an important factor.

If a NAS method selects architectures based on hardware metric predictions, we strongly suggest
verifying the results by measuring the true metric value afterward. Doing so may eliminate inferior
candidates and improve the average result substantially. Finally, if the limiting factor to a NAS
method is the slow measurement of hardware metrics, using a much faster predictor may lead to an
improvement, even if the prediction model is less accurate.

7 CONCLUSIONS

This work evaluated various hardware-metric prediction models on ten problems of different metrics, devices, and network architecture types. We then simulated the selection process for different
test set sizes and predictor inaccuracies to improve our understanding of predictor-based architecture selection. We find that even imperfect predictors may improve NAS if their low query time
enables considering more candidate architectures. Finally, verifying the predictions for the selected
candidates can lead to a drastic improvement of their average performance. The code and results are
made available, thus acting both for recommendation and as a baseline for future works.


-----

REFERENCES

Hadjer Benmeziane, Kaoutar El Maghraoui, Hamza Ouarnoughi, Smaïl Niar, Martin Wistuba, and
Naigang Wang. A Comprehensive Survey on Hardware-Aware Neural Architecture Search.
_[CoRR, abs/2101.09336, 2021. URL https://arxiv.org/abs/2101.09336.](https://arxiv.org/abs/2101.09336)_

Christopher M. Bishop. Pattern recognition and machine learning, 5th Edition. Information science
[and statistics. Springer, 2007. ISBN 9780387310732. URL https://www.worldcat.org/](https://www.worldcat.org/oclc/71008143)
[oclc/71008143.](https://www.worldcat.org/oclc/71008143)

Han Cai, Ligeng Zhu, and Song Han. ProxylessNAS: Direct Neural Architecture Search on
Target Task and Hardware. In 7th International Conference on Learning Representations,
_ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019._ [URL https:](https://openreview.net/forum?id=HylVB3AqYm)
[//openreview.net/forum?id=HylVB3AqYm.](https://openreview.net/forum?id=HylVB3AqYm)

Joaquin Quiñonero Candela and Carl Edward Rasmussen. A Unifying View of Sparse Approximate
[Gaussian Process Regression. J. Mach. Learn. Res., 6:1939–1959, 2005. URL http://jmlr.](http://jmlr.org/papers/v6/quinonero-candela05a.html)
[org/papers/v6/quinonero-candela05a.html.](http://jmlr.org/papers/v6/quinonero-candela05a.html)

Tianqi Chen and Carlos Guestrin. XGBoost: A scalable tree boosting system. In Proceedings of the
_22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD_
’16, pp. 785–794, New York, NY, USA, 2016. ACM. ISBN 978-1-4503-4232-2. doi: 10.1145/
[2939672.2939785. URL http://doi.acm.org/10.1145/2939672.2939785.](http://doi.acm.org/10.1145/2939672.2939785)

Patryk Chrabaszcz, Ilya Loshchilov, and Frank Hutter. A downsampled variant of imagenet as an
alternative to the cifar datasets. arXiv preprint arXiv:1707.08819, 2017.

Xiangxiang Chu, Bo Zhang, Jixiang Li, Qingyuan Li, and Ruijun Xu. ScarletNAS: Bridging the
Gap Between Scalability and Fairness in Neural Architecture Search. CoRR, abs/1908.06022,
[2019a. URL http://arxiv.org/abs/1908.06022.](http://arxiv.org/abs/1908.06022)

Xiangxiang Chu, Bo Zhang, Ruijun Xu, and Jixiang Li. FairNAS: Rethinking Evaluation Fairness
[of Weight Sharing Neural Architecture Search. CoRR, abs/1907.01845, 2019b. URL http:](http://arxiv.org/abs/1907.01845)
[//arxiv.org/abs/1907.01845.](http://arxiv.org/abs/1907.01845)

Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine learning, 20(3):273–297,
1995.

Xiaoliang Dai, Alvin Wan, Peizhao Zhang, Bichen Wu, Zijian He, Zhen Wei, Kan Chen, Yuandong
Tian, Matthew Yu, Peter Vajda, and Joseph E. Gonzalez. FBNetV3: Joint Architecture-Recipe
Search using Neural Acquisition Function. _CoRR, abs/2006.02049, 2020._ [URL https://](https://arxiv.org/abs/2006.02049)
[arxiv.org/abs/2006.02049.](https://arxiv.org/abs/2006.02049)

Xuanyi Dong and Yi Yang. NAS-Bench-201: Extending the Scope of Reproducible Neural Architecture Search. In 8th International Conference on Learning Representations, ICLR 2020, Addis
_[Ababa, Ethiopia, April 26-30, 2020. OpenReview.net, 2020. URL https://openreview.](https://openreview.net/forum?id=HJxyZkBKDr)_
[net/forum?id=HJxyZkBKDr.](https://openreview.net/forum?id=HJxyZkBKDr)

Xuanyi Dong, Lu Liu, Katarzyna Musial, and Bogdan Gabrys. NATS-Bench: Benchmarking NAS
Algorithms for Architecture Topology and Size. arXiv preprint arXiv:2009.00437, 2020.

Tony Duan, Avati Anand, Daisy Yi Ding, Khanh K. Thai, Sanjay Basu, Andrew Y. Ng, and Alejandro Schuler. Ngboost: Natural gradient boosting for probabilistic prediction. In Proceedings of
_the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual_
_Event, volume 119 of Proceedings of Machine Learning Research, pp. 2690–2700. PMLR, 2020._
[URL http://proceedings.mlr.press/v119/duan20a.html.](http://proceedings.mlr.press/v119/duan20a.html)

Yawen Duan, Xin Chen, Hang Xu, Zewei Chen, Xiaodan Liang, Tong Zhang, and Zhenguo Li.
TransNAS-Bench-101: Improving Transferability and Generalizability of Cross-Task Neural Ar[chitecture Search. CoRR, abs/2105.11871, 2021. URL https://arxiv.org/abs/2105.](https://arxiv.org/abs/2105.11871)
[11871.](https://arxiv.org/abs/2105.11871)


-----

Zichao Guo, Xiangyu Zhang, Haoyuan Mu, Wen Heng, Zechun Liu, Yichen Wei, and Jian Sun. Single Path One-Shot Neural Architecture Search with Uniform Sampling. In European Conference
_[on Computer Vision, pp. 544–560. Springer, 2020. URL http://arxiv.org/abs/1904.](http://arxiv.org/abs/1904.00420)_
[00420.](http://arxiv.org/abs/1904.00420)

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image
Recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition,
[pp. 770–778, 2016. URL http://arxiv.org/abs/1512.03385.](http://arxiv.org/abs/1512.03385)

Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand,
Marco Andreetto, and Hartwig Adam. MobileNets: Efficient Convolutional Neural Networks for
[Mobile Vision applications. CoRR, abs/1704.04861, 2017. URL http://arxiv.org/abs/](http://arxiv.org/abs/1704.04861)
[1704.04861.](http://arxiv.org/abs/1704.04861)

Shoukang Hu, Sirui Xie, Hehui Zheng, Chunxiao Liu, Jianping Shi, Xunying Liu, and Dahua Lin.
DSNAS: Direct Neural Architecture Search without Parameter Retraining. In Proceedings of the
_IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12084–12092, 2020._
[URL http://arxiv.org/abs/2002.09128.](http://arxiv.org/abs/2002.09128)

Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and TieYan Liu. Lightgbm: A highly efficient gradient boosting decision tree. In Isabelle Guyon, Ulrike
von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman
Garnett (eds.), Advances in Neural Information Processing Systems 30: Annual Conference on
_Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pp._
[3146–3154, 2017. URL https://proceedings.neurips.cc/paper/2017/hash/](https://proceedings.neurips.cc/paper/2017/hash/6449f44a102fde848669bdd9eb6b76fa-Abstract.html)
[6449f44a102fde848669bdd9eb6b76fa-Abstract.html.](https://proceedings.neurips.cc/paper/2017/hash/6449f44a102fde848669bdd9eb6b76fa-Abstract.html)

Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. CIFAR-10 (Canadian Institute for Advanced
[Research). 2009. URL http://www.cs.toronto.edu/~kriz/cifar.html.](http://www.cs.toronto.edu/~kriz/cifar.html)

Hayeon Lee, Sewoong Lee, Song Chong, and Sung Ju Hwang. HELP: Hardware-Adaptive Efficient
[Latency Predictor for NAS via Meta-Learning. CoRR, abs/2106.08630, 2021. URL https:](https://arxiv.org/abs/2106.08630)
[//arxiv.org/abs/2106.08630.](https://arxiv.org/abs/2106.08630)

Chaojian Li, Zhongzhi Yu, Yonggan Fu, Yongan Zhang, Yang Zhao, Haoran You, Qixuan Yu, Yue
Wang, and Yingyan Lin. HW-NAS-Bench: Hardware-Aware Neural Architecture Search Bench[mark. CoRR, abs/2103.10584, 2021a. URL https://arxiv.org/abs/2103.10584.](https://arxiv.org/abs/2103.10584)

Guihong Li, Sumit K. Mandal, Ümit Y. Ogras, and Radu Marculescu. FLASH: Fast Neural Architecture Search with Hardware Optimization. CoRR, abs/2108.00568, 2021b. URL https://arxiv.org/abs/2108.00568.

Liam Li and Ameet Talwalkar. Random Search and Reproducibility for Neural Architecture Search.
In Uncertainty in Artificial Intelligence, pp. 367–377. PMLR, 2020.

Andy Liaw, Matthew Wiener, et al. Classification and Regression by randomForest. R news, 2(3):
18–22, 2002.

Marius Lindauer and Frank Hutter. Best Practices for Scientific Research on Neural Architecture
Search. Journal of Machine Learning Research, 21(243):1–18, 2020.

Renqian Luo, Fei Tian, Tao Qin, Enhong Chen, and Tie-Yan Liu. Neural Architecture Optimization.
In Samy Bengio, Hanna M. Wallach, Hugo Larochelle, Kristen Grauman, Nicolò Cesa-Bianchi,
and Roman Garnett (eds.), Advances in Neural Information Processing Systems 31: Annual Con_ference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018,_
_[Montréal, Canada, pp. 7827–7838, 2018. URL https://proceedings.neurips.cc/](https://proceedings.neurips.cc/paper/2018/hash/933670f1ac8ba969f32989c312faba75-Abstract.html)_
[paper/2018/hash/933670f1ac8ba969f32989c312faba75-Abstract.html.](https://proceedings.neurips.cc/paper/2018/hash/933670f1ac8ba969f32989c312faba75-Abstract.html)

Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. ShuffleNet V2: Practical Guidelines
for Efficient CNN Architecture Design. In Vittorio Ferrari, Martial Hebert, Cristian Sminchisescu, and Yair Weiss (eds.), Computer Vision - ECCV 2018 - 15th European Conference, Mu_nich, Germany, September 8-14, 2018, Proceedings, Part XIV, volume 11218 of Lecture Notes_
_[in Computer Science, pp. 122–138. Springer, 2018. doi: 10.1007/978-3-030-01264-9\ 8. URL](https://doi.org/10.1007/978-3-030-01264-9_8)_
[https://doi.org/10.1007/978-3-030-01264-9_8.](https://doi.org/10.1007/978-3-030-01264-9_8)


-----

Joseph Mellor, Jack Turner, Amos Storkey, and Elliot J. Crowley. Neural Architecture Search with[out Training, 2020. URL http://arxiv.org/abs/2006.04647.](http://arxiv.org/abs/2006.04647)

Daniel M. Mendoza and Sijin Wang. Predicting Latency of Neural Network Inference,
2020. [URL http://cs230.stanford.edu/projects_fall_2020/reports/](http://cs230.stanford.edu/projects_fall_2020/reports/55793069.pdf)
[55793069.pdf.](http://cs230.stanford.edu/projects_fall_2020/reports/55793069.pdf)

Niv Nayman, Yonathan Aflalo, Asaf Noy, and Lihi Zelnik. HardCoRe-NAS: Hard Constrained
diffeRentiable Neural Architecture Search. In Marina Meila and Tong Zhang (eds.), Proceedings
_of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual_
_Event, volume 139 of Proceedings of Machine Learning Research, pp. 7979–7990. PMLR, 2021._
[URL http://proceedings.mlr.press/v139/nayman21a.html.](http://proceedings.mlr.press/v139/nayman21a.html)

Evgeny Ponomarev, Sergey A. Matveev, and Ivan V. Oseledets. LETI: Latency Estimation Tool and
Investigation of Neural Networks inference on Mobile GPU. CoRR, abs/2010.02871, 2020. URL
[https://arxiv.org/abs/2010.02871.](https://arxiv.org/abs/2010.02871)

Carl Edward Rasmussen. Gaussian Processes in Machine Learning. In Olivier Bousquet, Ulrike
von Luxburg, and Gunnar Rätsch (eds.), Advanced Lectures on Machine Learning, ML Summer Schools 2003, Canberra, Australia, February 2-14, 2003, Tübingen, Germany, August 4-
_16, 2003, Revised Lectures, volume 3176 of Lecture Notes in Computer Science, pp. 63–71._
[Springer, 2003. doi: 10.1007/978-3-540-28650-9\ 4. URL https://doi.org/10.1007/](https://doi.org/10.1007/978-3-540-28650-9_4)
[978-3-540-28650-9_4.](https://doi.org/10.1007/978-3-540-28650-9_4)

Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V Le. Regularized Evolution for Image
[Classifier Architecture Search, 2018. URL http://arxiv.org/abs/1802.01548.](http://arxiv.org/abs/1802.01548)

Michael Ruchte, Arber Zela, Julien Siems, Josif Grabocka, and Frank Hutter. Naslib: A modular
[and flexible neural architecture search library. https://github.com/automl/NASLib,](https://github.com/automl/NASLib)
2020.

Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE conference on
_computer vision and pattern recognition, pp. 4510–4520, 2018._

Craig Saunders, Alexander Gammerman, and Volodya Vovk. Ridge Regression Learning Algorithm
in Dual Variables. In Proceedings of the Fifteenth International Conference on Machine Learning,
ICML ’98, pp. 515–521, San Francisco, CA, USA, 1998. Morgan Kaufmann Publishers Inc. ISBN
1558605568.

Han Shi, Renjie Pi, Hang Xu, Zhenguo Li, James T. Kwok, and Tong Zhang. Bridging
the Gap between Sample-based and One-shot Neural Architecture Search with BONAS. In
Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and HsuanTien Lin (eds.), Advances in Neural Information Processing Systems 33: _Annual Con-_
_ference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12,_
_[2020, virtual, 2020. URL https://proceedings.neurips.cc/paper/2020/hash/](https://proceedings.neurips.cc/paper/2020/hash/13d4635deccc230c944e4ff6e03404b5-Abstract.html)_
[13d4635deccc230c944e4ff6e03404b5-Abstract.html.](https://proceedings.neurips.cc/paper/2020/hash/13d4635deccc230c944e4ff6e03404b5-Abstract.html)

Julien Siems, Lucas Zimmer, Arber Zela, Jovita Lukasik, Margret Keuper, and Frank Hutter. NAS-Bench-301 and the Case for Surrogate Benchmarks for Neural Architecture Search, 2020.

Jost Tobias Springenberg, Aaron Klein, Stefan Falkner, and Frank Hutter. Bayesian
Optimization with Robust Bayesian Neural Networks. In Daniel D. Lee, Masashi
Sugiyama, Ulrike von Luxburg, Isabelle Guyon, and Roman Garnett (eds.), Advances
_in Neural Information Processing Systems 29:_ _Annual Conference on Neural Infor-_
_mation Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pp. 4134–_
4142, 2016. URL [https://proceedings.neurips.cc/paper/2016/hash/](https://proceedings.neurips.cc/paper/2016/hash/a96d3afec184766bfeca7a9f989fc7e7-Abstract.html)
[a96d3afec184766bfeca7a9f989fc7e7-Abstract.html.](https://proceedings.neurips.cc/paper/2016/hash/a96d3afec184766bfeca7a9f989fc7e7-Abstract.html)

Mingxing Tan and Quoc V. Le. EfficientNet: Rethinking Model Scaling for Convolutional Neural
[Networks. CoRR, abs/1905.11946, 2019. URL http://arxiv.org/abs/1905.11946.](http://arxiv.org/abs/1905.11946)


-----

Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, Mark Sandler, Andrew Howard, and
Quoc V. Le. MnasNet: Platform-Aware Neural Architecture Search for Mobile. In IEEE
_Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA,_
_USA, June 16-20, 2019, pp. 2820–2828. Computer Vision Foundation / IEEE, 2019._ doi:
10.1109/CVPR.2019.00293. URL [http://openaccess.thecvf.com/content_](http://openaccess.thecvf.com/content_CVPR_2019/html/Tan_MnasNet_Platform-Aware_Neural_Architecture_Search_for_Mobile_CVPR_2019_paper.html)
[CVPR_2019/html/Tan_MnasNet_Platform-Aware_Neural_Architecture_](http://openaccess.thecvf.com/content_CVPR_2019/html/Tan_MnasNet_Platform-Aware_Neural_Architecture_Search_for_Mobile_CVPR_2019_paper.html)
[Search_for_Mobile_CVPR_2019_paper.html.](http://openaccess.thecvf.com/content_CVPR_2019/html/Tan_MnasNet_Platform-Aware_Neural_Architecture_Search_for_Mobile_CVPR_2019_paper.html)

Alvin Wan, Xiaoliang Dai, Peizhao Zhang, Zijian He, Yuandong Tian, Saining Xie, Bichen Wu,
Matthew Yu, Tao Xu, Kan Chen, Peter Vajda, and Joseph E. Gonzalez. FBNetV2: Differentiable
Neural Architecture Search for Spatial and Channel Dimensions. In 2020 IEEE/CVF Conference
_on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020,_
[pp. 12962–12971. IEEE, 2020. doi: 10.1109/CVPR42600.2020.01298. URL https://doi.](https://doi.org/10.1109/CVPR42600.2020.01298)
[org/10.1109/CVPR42600.2020.01298.](https://doi.org/10.1109/CVPR42600.2020.01298)

Ruochen Wang, Xiangning Chen, Minhao Cheng, Xiaocheng Tang, and Cho-Jui Hsieh. RANK-NOSH: Efficient Predictor-Based Architecture Search via Non-Uniform Successive Halving.
_[CoRR, abs/2108.08019, 2021. URL https://arxiv.org/abs/2108.08019.](https://arxiv.org/abs/2108.08019)_

Wei Wen, Hanxiao Liu, Yiran Chen, Hai Helen Li, Gabriel Bender, and Pieter-Jan Kindermans.
Neural Predictor for Neural Architecture Search. In Andrea Vedaldi, Horst Bischof, Thomas
Brox, and Jan-Michael Frahm (eds.), Computer Vision - ECCV 2020 - 16th European Conference,
_Glasgow, UK, August 23-28, 2020, Proceedings, Part XXIX, volume 12374 of Lecture Notes in_
_[Computer Science, pp. 660–676. Springer, 2020. doi: 10.1007/978-3-030-58526-6\ 39. URL](https://doi.org/10.1007/978-3-030-58526-6_39)_
[https://doi.org/10.1007/978-3-030-58526-6_39.](https://doi.org/10.1007/978-3-030-58526-6_39)

Matthias Wess, Matvey Ivanov, Christoph Unger, Anvesh Nookala, Alexander Wendt, and Axel
Jantsch. ANNETTE: Accurate Neural Network Execution Time Estimation With Stacked Models.
_IEEE Access, 9:3545–3556, 2021. ISSN 2169-3536. doi: 10.1109/access.2020.3047259. URL_
[http://dx.doi.org/10.1109/ACCESS.2020.3047259.](http://dx.doi.org/10.1109/ACCESS.2020.3047259)

Colin White, Willie Neiswanger, and Yash Savani. BANANAS: Bayesian Optimization with Neural
Architectures for Neural Architecture Search. arXiv preprint arXiv:1910.11858, 2019.

Colin White, Arber Zela, Binxin Ru, Yang Liu, and Frank Hutter. How Powerful are Performance
Predictors in Neural Architecture Search? _CoRR, abs/2104.01177, 2021._ [URL https://](https://arxiv.org/abs/2104.01177)
[arxiv.org/abs/2104.01177.](https://arxiv.org/abs/2104.01177)

Bichen Wu, Xiaoliang Dai, Peizhao Zhang, Yanghan Wang, Fei Sun, Yiming Wu, Yuandong
Tian, Peter Vajda, Yangqing Jia, and Kurt Keutzer. FBNet: Hardware-Aware Efficient ConvNet
Design via Differentiable Neural Architecture Search. In IEEE Conference on Computer Vision
_and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, pp. 10734–_
10742. Computer Vision Foundation / IEEE, 2019. doi: 10.1109/CVPR.2019.01099. URL
[http://openaccess.thecvf.com/content_CVPR_2019/html/Wu_FBNet_](http://openaccess.thecvf.com/content_CVPR_2019/html/Wu_FBNet_Hardware-Aware_Efficient_ConvNet_Design_via_Differentiable_Neural_Architecture_Search_CVPR_2019_paper.html)
[Hardware-Aware_Efficient_ConvNet_Design_via_Differentiable_](http://openaccess.thecvf.com/content_CVPR_2019/html/Wu_FBNet_Hardware-Aware_Efficient_ConvNet_Design_via_Differentiable_Neural_Architecture_Search_CVPR_2019_paper.html)
[Neural_Architecture_Search_CVPR_2019_paper.html.](http://openaccess.thecvf.com/content_CVPR_2019/html/Wu_FBNet_Hardware-Aware_Efficient_ConvNet_Design_via_Differentiable_Neural_Architecture_Search_CVPR_2019_paper.html)

Yuhui Xu, Lingxi Xie, Xiaopeng Zhang, Xin Chen, Bowen Shi, Qi Tian, and Hongkai Xiong.
Latency-Aware Differentiable Neural Architecture Search. CoRR, abs/2001.06392, 2020. URL
[https://arxiv.org/abs/2001.06392.](https://arxiv.org/abs/2001.06392)

Antoine Yang, Pedro M. Esperança, and Fabio M. Carlucci. NAS evaluation is frustratingly hard.
_arXiv preprint arXiv:1912.12522, 2019._

Tien-Ju Yang, Andrew G. Howard, Bo Chen, Xiao Zhang, Alec Go, Mark Sandler, Vivienne Sze,
and Hartwig Adam. NetAdapt: Platform-Aware Neural Network Adaptation for Mobile Applications. In Vittorio Ferrari, Martial Hebert, Cristian Sminchisescu, and Yair Weiss (eds.),
_Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14,_
_2018, Proceedings, Part X, volume 11214 of Lecture Notes in Computer Science, pp. 289–304._
[Springer, 2018. doi: 10.1007/978-3-030-01249-6\ 18. URL https://doi.org/10.1007/](https://doi.org/10.1007/978-3-030-01249-6_18)
[978-3-030-01249-6_18.](https://doi.org/10.1007/978-3-030-01249-6_18)


-----

Shuochao Yao, Yiran Zhao, Huajie Shao, ShengZhong Liu, Dongxin Liu, Lu Su, and Tarek Abdelzaher. FastDeepIoT: Towards Understanding and Optimizing Neural Network Execution Time
on Mobile and Embedded Devices. In Proceedings of the 16th ACM Conference on Embedded
_Networked Sensor Systems, pp. 278–291, 2018._

Chris Ying, Aaron Klein, Eric Christiansen, Esteban Real, Kevin Murphy, and Frank Hutter. NAS-Bench-101: Towards Reproducible Neural Architecture Search. In Kamalika Chaudhuri and Ruslan Salakhutdinov (eds.), Proceedings of the 36th International Conference on Machine Learn_ing, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of_
_[Machine Learning Research, pp. 7105–7114. PMLR, 2019. URL http://proceedings.](http://proceedings.mlr.press/v97/ying19a.html)_
[mlr.press/v97/ying19a.html.](http://proceedings.mlr.press/v97/ying19a.html)

Kaicheng Yu, René Ranftl, and Mathieu Salzmann. How to Train Your Super-Net: An Analysis
[of Training Heuristics in Weight-Sharing NAS. CoRR, abs/2003.04276, 2020. URL https:](https://arxiv.org/abs/2003.04276)
[//arxiv.org/abs/2003.04276.](https://arxiv.org/abs/2003.04276)

Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. ShuffleNet: An Extremely
Efficient Convolutional Neural Network for Mobile Devices. In 2018 IEEE Conference
_on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA,_
_June 18-22, 2018, pp. 6848–6856. IEEE Computer Society, 2018._ doi: 10.1109/CVPR.
[2018.00716. URL http://openaccess.thecvf.com/content_cvpr_2018/html/](http://openaccess.thecvf.com/content_cvpr_2018/html/Zhang_ShuffleNet_An_Extremely_CVPR_2018_paper.html)
[Zhang_ShuffleNet_An_Extremely_CVPR_2018_paper.html.](http://openaccess.thecvf.com/content_cvpr_2018/html/Zhang_ShuffleNet_An_Extremely_CVPR_2018_paper.html)

Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V Le. Learning transferable architectures
for scalable image recognition. In Proceedings of the IEEE conference on computer vision and
_[pattern recognition, pp. 8697–8710, 2018. URL http://arxiv.org/abs/1707.07012.](http://arxiv.org/abs/1707.07012)_


-----

A BEST PRACTICES FOR NAS, CODE AND DATA

To improve reproducibility and facilitate fair experimental comparisons, we follow the best-practices checklist (Lindauer & Hutter, 2020):

_• Release Code for the Training Pipeline(s) you use. Our experiments are based on White_
et al. (2021), who use NASLib (Ruchte et al., 2020) to compare 31 methods for accuracy
prediction. Our NASLib fork, extending the framework with HW-NAS-Bench, TransNAS-Bench-101, additional performance predictors, and the hypervolume simulations, is provided in the
supplementary materials. We intend to either make our fork available on GitHub or submit
the changes upstream once this paper is accepted/published.

_• Use the Same Evaluation Protocol for the Methods Being Compared. Aside from the_
implementation of each predictor, all experiments use the same pipeline.

_• Validate The Results Several Times._ We ran each predictor 50 times, with seeds
{0, ..., 49}. The reductions in hypervolume are simulated 1000 times, each time using a
different subset of the data set, for each combination of {iteration, HW-NAS data set, noise
on HW metric}.

_• Control Confounding Factors. While all experiments used the same software libraries_
and hardware resources, they were run on different machines to speed up the evaluation.
We found hardly any benefit in using a GPU even for the network-based predictors, which
is why every method only used two CPU cores. The OS is Ubuntu 18.04; notable software packages are PyTorch 1.9.0, numpy 1.19.5, scikit-learn 0.24.2, pybnn 0.0.5, ngboost 0.3.11, and xgboost 1.4.2.

_• Report the Use of Hyperparameter Optimization. See Appendix C._

In addition to the code in the supplementary materials, we also provide the experimental results as CSV files. Running the predictors and hypervolume simulations takes some time, so easy access to the data of the finished experiments may prove useful for future research. Please see readme.md
in the accompanying code zip file for instructions.

B ENCODINGS AND PREDICTORS

B.1 DATA ENCODINGS

Every architecture a ∈ A requires a unique representation, which depends on the predictor used.
The common encoding types are:

**Adjacency one-hot: Each architecture a is uniquely defined by the chosen candidate operation on**
every path. For example, each architecture in NAS-BENCH-201 consists of a repeated cell structure,
which has five candidate operations on each of the six paths. Therefore, there are 5^6 = 15625 unique
architectures, which can each be referenced by a sequence of operation-indices such as [0 1 2 3 4 0].
Many predictors perform better if the sequence is presented as a one-hot encoding, which is in this
case [10000 01000 00100 00010 00001 10000].
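
A minimal sketch of this conversion (the function name is ours; NASLib's own encoding utilities may differ):

```python
import numpy as np

NUM_OPS = 5  # candidate operations per path in NAS-Bench-201 / HW-NAS-Bench

def to_adjacency_one_hot(op_indices):
    """Turn a sequence of operation indices, e.g. [0, 1, 2, 3, 4, 0],
    into the flat one-hot vector [10000 01000 00100 00010 00001 10000]."""
    one_hot = np.zeros((len(op_indices), NUM_OPS), dtype=int)
    one_hot[np.arange(len(op_indices)), op_indices] = 1
    return one_hot.flatten()

print(to_adjacency_one_hot([0, 1, 2, 3, 4, 0]))
# -> 10000 01000 00100 00010 00001 10000 (as a flat vector of length 30)
```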

Similarly, the path-encoding (used by BANANAS) is a one-hot representation over the candidate operations used on all possible paths. Since the connectivity within cells for HW-NAS-Bench and
TransNAS-Bench-101 is fixed, it provides no more information than the adjacency one-hot encoding. If the connectivity can be adjusted more freely, as in the NAS-Bench-101 search space, the
additional information may improve the fit.

The encodings for BONAS, GCN, and NAO each provide further information in addition to the
adjacency one-hot vector, most notably the adjacency-matrix. This {0, 1}^((N+2)×(N+2)) matrix describes which of the N architecture paths (rows) serve as inputs for each other path (column), and also includes the cell input/output.


-----

B.2 PREDICTORS

We briefly describe the 18 predictor methods in our experiments. We adopt their implementations
from the NASLib library (see Appendix A), which we extend with Linear Regression, Ridge Regression, and Support Vector Machines from the scikit-learn package; and a simple Lookup Table
implementation. Unless specified otherwise, the methods use the adjacency one-hot encoding.

_• BANANAS An ensemble of three MLP models with five to 20 layers, each using the path-_
encoding (White et al., 2019).

_• Bayesian Linear Regression A bayesian model that assumes (1) a linear dependency be-_
tween inputs and outputs, and (2) that the samples are normally distributed (Bishop, 2007).

_• BOHAMIANN A bayesian inference predictor using stochastic gradient Hamiltonian_
Monte Carlo (SGHMC) to sample from a bayesian neural network (Springenberg et al.,
2016).

_• BONAS Bayesian Optimization for NAS (Shi et al., 2020) uses a GCN predictor within an_
outer loop of bayesian optimization, as a meta-learning task. The GCN requires encoding
the adjacency matrix of each architecture.

_• Gaussian Process A simple model that assumes a joint Gaussian distribution underlying_
the training data (Rasmussen, 2003).

_• GCN A Graph Convolutional Network that makes use of an adjacency-matrix encoding of_
each architecture (Wen et al., 2020).

_• Linear Regression A simple model that assumes an independent value/cost for each oper-_
ation/layer, which only need to be summed up. Unlike the Lookup Table model, it uses a
least-square fit on the training data.

_• Lookup Table The simplest and perhaps most widely used model for differentiable architecture selection. It generally assumes a single baseline architecture (e.g. [001 001] in adjacency one-hot encoding) and a lookup matrix R^((num layers)×(num candidates)) that contains the increases/reductions in the metric for each layer and candidate operation. The metric value of a new architecture can be predicted with a simple sum over the respective matrix entries and the baseline value (see the sketch after this list). The model is obtained by measuring either each candidate operation in isolation, or by computing the differences between the baseline architecture and specific variations (e.g. [010 001] or [100 001], to measure the first candidates). This model always requires 1+(num layers) · (num candidates−1) neighboring architectures to fit. We detail the resulting correlation values for each used dataset in Appendix E.

_• LGBoost Light Gradient Boosting Machine (LightGBM or LGBoost, Ke et al. (2017)) is a_
lightweight gradient-boosted decision tree model.

_• MLP We use fully-connected Multi Layer Perceptrons in two size-categories._

_• NAO NAO (Luo et al., 2018) uses an encoder-decoder topology, which encodes/compresses an architecture into a continuous representation and decodes it again. This representation is further used to make architecture predictions.

_• NGBoost Natural Gradient Boosting (NGBoost, Duan et al. (2020)) is a gradient-boosted_
decision tree model that uses natural gradients to estimate uncertainty.

_• Ridge Regression Ridge Regression (Saunders et al., 1998) extends the Linear Regression_
least-squares fit with a regularization term that serves as bias-variance tradeoff.

_• Random Forests An ensemble of decision trees (Liaw et al., 2002)._

_• Sparse Gaussian Process an approximation of Gaussian Processes that summarizes train-_
ing data (Candela & Rasmussen, 2005).

_• Support Vector Machine A model that maps its inputs to a high-dimensional space, where_
training samples are used as support-vectors for decision-boundaries (Cortes & Vapnik,
1995).

_• XGBoost eXtreme Gradient Boosting (XGBoost, Chen & Guestrin (2016)) is a gradient-_
boosted decision tree model.
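
As referenced in the Lookup Table description above, the following is a minimal sketch of how such a model predicts a metric from a baseline value and an offset matrix; the function name and toy numbers are ours, not the implementation used in our experiments.

```python
import numpy as np

def lut_predict(op_indices, baseline_value, lut, baseline_ops):
    """Lookup Table sketch: the metric of a new architecture is the baseline
    value plus the stored offset of every layer whose candidate operation
    differs from the baseline architecture. `lut` has shape
    (num_layers, num_candidates) and holds the measured increase/reduction
    of the metric per layer and candidate (0 for the baseline candidate)."""
    value = baseline_value
    for layer, (op, base_op) in enumerate(zip(op_indices, baseline_ops)):
        if op != base_op:
            value += lut[layer, op]
    return value

# toy example: 6 layers, 5 candidates, baseline architecture [0, 0, 0, 0, 0, 0]
lut = np.zeros((6, 5))
lut[:, 1:] = np.random.default_rng(0).normal(0.0, 0.5, size=(6, 4))  # offsets vs. baseline
print(lut_predict([0, 1, 2, 3, 4, 0], baseline_value=10.0, lut=lut, baseline_ops=[0] * 6))
```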


-----

C HYPERPARAMETERS

We list our default values and hyper-parameter sample ranges in Table 2. For comparability with White et al. (2021), we only change the values of the newly introduced parameterized predictors: Ridge Regression, Support Vector Machines, and small MLPs.

| Model | Hyper-parameter | Range/Choice | Log-transform | Default |
|---|---|---|---|---|
| BANANAS | Num. layers | [5, 25] | false | 20 |
| BANANAS | Layer width | [5, 25] | false | 20 |
| BANANAS | Learning rate | [0.0001, 0.1] | true | 0.001 |
| BONAS | Num. layers | [16, 128] | true | 64 |
| BONAS | Batch size | [32, 256] | true | 128 |
| BONAS | Learning rate | [0.00001, 0.1] | true | 0.0001 |
| GCN | Num. layers | [64, 200] | true | 144 |
| GCN | Batch size | [5, 32] | true | 7 |
| GCN | Learning rate | [0.00001, 0.1] | true | 0.0001 |
| GCN | Weight decay | [0.00001, 0.1] | true | 0.0003 |
| LGBoost | Num. leaves | [10, 100] | false | 31 |
| LGBoost | Learning rate | [0.001, 0.1] | true | 0.05 |
| LGBoost | Feature fraction | [0.1, 1] | false | 0.9 |
| MLP (small) | Num. layers | [2, 5] | false | 3 |
| MLP (small) | Layer width | [16, 128] | true | 32 |
| MLP (small) | Learning rate | [0.0001, 0.1] | true | 0.001 |
| MLP (small) | Activation function | {relu, tanh, hardswish} | – | relu |
| MLP (huge) | Num. layers | [5, 25] | false | 20 |
| MLP (huge) | Layer width | [5, 25] | false | 20 |
| MLP (huge) | Learning rate | [0.0001, 0.1] | true | 0.001 |
| NAO | Num. layers | [16, 128] | true | 64 |
| NAO | Batch size | [32, 256] | true | 100 |
| NAO | Learning rate | [0.00001, 0.1] | true | 0.001 |
| NGBoost | Num. estimators | [128, 512] | true | 64 |
| NGBoost | Learning rate | [0.001, 0.1] | true | 0.081 |
| NGBoost | Max depth | [1, 25] | false | 6 |
| NGBoost | Max features | [0.1, 1] | false | 0.79 |
| Ridge Regression | Regularization α | [0.25, 2.5] | false | 1.0 |
| Random Forests | Num. estimators | [16, 128] | true | 116 |
| Random Forests | Max features | [0.1, 0.9] | true | 0.17 |
| Random Forests | Min samples (leaf) | [1, 20] | false | 2 |
| Random Forests | Min samples (split) | [2, 20] | true | 2 |
| Support Vector Machine | Regularization C | [0.5, 1.5] | false | 1.0 |
| Support Vector Machine | Kernel | {linear, poly, rbf, sigmoid} | – | rbf |
| XGBoost | Max depth | [1, 15] | false | 6 |
| XGBoost | Min child weight | [1, 10] | false | 1 |
| XGBoost | Col sample (tree) | [0, 1] | false | 1 |
| XGBoost | Learning rate | [0.001, 0.5] | true | 0.3 |
| XGBoost | Col sample (level) | [0, 1] | false | 1 |

Table 2: Hyper-parameter ranges and default values of the configurable predictors.
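
For clarity, parameters marked with Log-transform = true in Table 2 are sampled uniformly in log-space rather than linearly. A minimal sketch of this sampling (the helper function is ours):

```python
import numpy as np

def sample_hyperparameter(low, high, log_transform, rng):
    """Sample uniformly in linear space, or uniformly in log-space
    when log_transform is set (as marked in Table 2)."""
    if log_transform:
        return float(np.exp(rng.uniform(np.log(low), np.log(high))))
    return float(rng.uniform(low, high))

rng = np.random.default_rng(0)
# e.g. the XGBoost learning rate, range [0.001, 0.5], log-transformed
lr = sample_hyperparameter(0.001, 0.5, log_transform=True, rng=rng)
```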


-----

D NAS-BENCH-201 / HW-NAS-BENCH CELL DESIGN

[Figure 5 shows the shared cell topology, with the six paths labeled 1–6.]

Figure 5: Basic NAS-Bench-201 / HW-NAS cell design. Each of the six orange paths is finalized with exactly one out of five candidate operations {Zero, Skip, Convolution 1×1, Convolution 3×3, Average Pooling 3×3}.

E SELECTION OF DATASETS

| Dataset | LinReg (11) | LinReg (25) | LinReg (55) | LinReg (124) | LinReg (276) | LinReg (614) | LinReg (1366) | LinReg (3036) | LinReg (6748) | LinReg (15000) | XGBoost (15000) | LUT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ImageNet16-120-raspi4 latency | 0.324 | 0.205 | 0.606 | 0.676 | 0.705 | 0.716 | 0.715 | 0.723 | 0.728 | 0.729 | 0.757 | 0.443 |
| cifar100-pixel3 latency | 0.392 | 0.292 | 0.732 | 0.780 | 0.797 | 0.803 | 0.806 | 0.809 | 0.812 | 0.812 | 0.877 | 0.484 |
| cifar10-edgegpu latency | 0.370 | 0.258 | 0.724 | 0.790 | 0.806 | 0.819 | 0.820 | 0.822 | 0.830 | 0.829 | 0.926 | 0.175 |
| cifar100-edgegpu energy | 0.376 | 0.275 | 0.732 | 0.793 | 0.812 | 0.821 | 0.821 | 0.823 | 0.831 | 0.831 | 0.920 | 0.221 |
| ImageNet16-120-eyeriss arith. int. | 0.369 | 0.293 | 0.748 | 0.805 | 0.817 | 0.827 | 0.825 | 0.832 | 0.843 | 0.846 | 0.970 | 0.861 |
| cifar10-pixel3 latency | 0.388 | 0.300 | 0.733 | 0.780 | 0.797 | 0.805 | 0.805 | 0.810 | 0.813 | 0.813 | 0.878 | 0.475 |
| cifar10-raspi4 latency | 0.393 | 0.315 | 0.740 | 0.787 | 0.799 | 0.805 | 0.807 | 0.810 | 0.813 | 0.813 | 0.890 | 0.462 |
| cifar100-raspi4 latency | 0.393 | 0.308 | 0.744 | 0.786 | 0.801 | 0.807 | 0.810 | 0.810 | 0.814 | 0.814 | 0.888 | 0.445 |
| ImageNet16-120-pixel3 latency | 0.398 | 0.312 | 0.739 | 0.786 | 0.799 | 0.807 | 0.809 | 0.812 | 0.815 | 0.816 | 0.884 | 0.509 |
| cifar100-edgegpu latency | 0.375 | 0.268 | 0.728 | 0.793 | 0.810 | 0.821 | 0.820 | 0.822 | 0.831 | 0.831 | 0.924 | 0.191 |
| cifar10-edgegpu energy | 0.375 | 0.284 | 0.728 | 0.792 | 0.810 | 0.821 | 0.823 | 0.824 | 0.831 | 0.831 | 0.922 | 0.183 |
| ImageNet16-120-edgegpu energy | 0.377 | 0.281 | 0.733 | 0.797 | 0.814 | 0.825 | 0.825 | 0.826 | 0.834 | 0.833 | 0.926 | 0.280 |
| ImageNet16-120-edgegpu latency | 0.379 | 0.264 | 0.737 | 0.799 | 0.817 | 0.826 | 0.826 | 0.828 | 0.836 | 0.835 | 0.938 | 0.277 |
| cifar10-eyeriss arith. int. | 0.384 | 0.296 | 0.757 | 0.811 | 0.826 | 0.835 | 0.832 | 0.843 | 0.854 | 0.854 | 0.969 | 0.826 |
| cifar100-eyeriss arith. int. | 0.384 | 0.297 | 0.757 | 0.811 | 0.826 | 0.835 | 0.833 | 0.844 | 0.855 | 0.856 | 0.971 | 0.830 |
| ImageNet16-120-fpga latency | 0.443 | 0.494 | 0.904 | 0.936 | 0.947 | 0.951 | 0.948 | 0.951 | 0.952 | 0.952 | 0.983 | 0.965 |
| ImageNet16-120-fpga energy | 0.443 | 0.494 | 0.905 | 0.935 | 0.947 | 0.951 | 0.948 | 0.951 | 0.952 | 0.952 | 0.983 | 0.965 |
| ImageNet16-120-eyeriss latency | 0.457 | 0.937 | 0.953 | 0.954 | 0.954 | 0.954 | 0.953 | 0.953 | 0.954 | 0.954 | 0.952 | 0.989 |
| cifar10-eyeriss latency | 0.461 | 0.943 | 0.959 | 0.959 | 0.960 | 0.960 | 0.959 | 0.960 | 0.960 | 0.960 | 0.958 | 0.995 |
| cifar100-eyeriss latency | 0.462 | 0.946 | 0.963 | 0.963 | 0.963 | 0.963 | 0.963 | 0.963 | 0.964 | 0.963 | 0.962 | 0.998 |
| cifar10-eyeriss energy | 0.456 | 0.967 | 0.985 | 0.985 | 0.985 | 0.985 | 0.985 | 0.985 | 0.985 | 0.985 | 0.975 | 0.996 |
| ImageNet16-120-eyeriss energy | 0.458 | 0.967 | 0.985 | 0.985 | 0.986 | 0.985 | 0.986 | 0.985 | 0.985 | 0.986 | 0.972 | 0.998 |
| cifar100-eyeriss energy | 0.457 | 0.967 | 0.985 | 0.985 | 0.985 | 0.986 | 0.985 | 0.986 | 0.986 | 0.986 | 0.976 | 0.998 |
| cifar10-fpga energy | 0.458 | 0.973 | 0.987 | 0.987 | 0.987 | 0.987 | 0.987 | 0.987 | 0.987 | 0.987 | 0.986 | 0.999 |
| cifar100-fpga energy | 0.458 | 0.973 | 0.987 | 0.987 | 0.987 | 0.987 | 0.987 | 0.987 | 0.987 | 0.987 | 0.986 | 0.999 |
| cifar100-fpga latency | 0.457 | 0.973 | 0.987 | 0.987 | 0.987 | 0.987 | 0.987 | 0.987 | 0.987 | 0.987 | 0.986 | 0.999 |
| cifar10-fpga latency | 0.457 | 0.973 | 0.987 | 0.987 | 0.987 | 0.987 | 0.987 | 0.987 | 0.987 | 0.987 | 0.986 | 0.999 |

Table 3: Kendall’s Tau test correlation for Linear Regression, XGBoost, and Lookup Table (LUT)
on all HW-NAS-Bench datasets (rows), for different amounts of available training data (columns),
tested on the remaining 625 samples. The Lookup Table model is tested on all 15625 architectures.
We selected the five data sets at the top.

| Dataset | LinReg (9) | LinReg (18) | LinReg (34) | LinReg (65) | LinReg (123) | LinReg (234) | LinReg (442) | LinReg (837) | LinReg (1585) | LinReg (2999) | XGBoost (2999) | LUT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| jigsaw | 0.201 | 0.227 | 0.410 | 0.535 | 0.586 | 0.605 | 0.616 | 0.624 | 0.631 | 0.632 | 0.661 | 0.201 |
| class object | 0.268 | 0.262 | 0.518 | 0.646 | 0.711 | 0.741 | 0.759 | 0.771 | 0.780 | 0.780 | 0.828 | 0.701 |
| room layout | 0.275 | 0.271 | 0.527 | 0.653 | 0.721 | 0.753 | 0.768 | 0.780 | 0.789 | 0.789 | 0.896 | 0.685 |
| class scene | 0.275 | 0.268 | 0.527 | 0.653 | 0.721 | 0.755 | 0.768 | 0.782 | 0.789 | 0.790 | 0.907 | 0.710 |
| segmentsemantic | 0.282 | 0.259 | 0.545 | 0.684 | 0.746 | 0.780 | 0.798 | 0.809 | 0.816 | 0.818 | 0.871 | 0.726 |

Table 4: Kendall’s Tau test correlation for Linear Regression and XGBoost on the five used
TransNAS datasets (rows), for different amounts of available training data (columns), tested on
the remaining 256 samples. The Lookup Table model (LUT) is tested on all 3256 architectures.

**HW-NAS-Bench:** To select five datasets that are (1) non-linear and (2) different from one another,
we first fit Linear Regression to every available dataset, with the results listed in Table 3. The bottom
12 datasets can be accurately fit with only 25 training samples, so they are not very interesting as a


-----

challenge. On these datasets, the Lookup Table model achieves exceptional performance. Since the
networks for CIFAR10, CIFAR100 and ImageNet16-120 only differ slightly, their measurements on
the same device and metric (e.g. raspi4 latency) are very similar. To improve the generalizability of
our results, we thus select datasets on different devices and metrics, which are listed at the top of
Table 3. As displayed in Figure 6, their data distributions are generally different.
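
A minimal sketch of the evaluation behind Table 3, under our reading of the protocol: fit a predictor on N encoded training architectures and report Kendall's Tau on the held-out test split (the helper name and splits are placeholders):

```python
from scipy.stats import kendalltau
from sklearn.linear_model import LinearRegression

def kendall_tau_at_train_size(x_train, y_train, x_test, y_test):
    """Fit Linear Regression on the encoded training architectures and
    report Kendall's Tau rank correlation on the held-out test split.
    x_* are e.g. adjacency one-hot encodings (Appendix B.1), y_* the
    hardware metric values of the corresponding architectures."""
    model = LinearRegression().fit(x_train, y_train)
    tau, _ = kendalltau(y_test, model.predict(x_test))
    return tau
```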

**TransNAS-Bench-101:** Since the latency measurements of the architectures are generally very similarly distributed (see Figure 7), it is not necessary to train the predictors on all of them. We select
all data sets that provide the test loss and inference time attributes for all architectures, resulting in
exactly the five datasets listed in Section 4 (the other two datasets contain more specific test losses).

[Figure 6 panels (histograms of occurrences vs. measurements): ImageNet16-120-raspi4_latency, cifar100-pixel3_latency, cifar10-edgegpu_latency, cifar100-edgegpu_energy, ImageNet16-120-eyeriss_arithmetic_intensity.]


Figure 6: How the data of each selected HW-NAS-Bench dataset is distributed (not normalized).

[Figure 7 panels (histograms of occurrences vs. measurements): class_object, class_scene, jigsaw, room_layout, segmentsemantic.]


Figure 7: How the data of each selected TransNAS-Bench-101 dataset is distributed (not normalized). Since all architectures are measured for latency on the same hardware, the resulting datasets
are much less diverse than the HW-NAS-Bench ones.


-----

F PREDICTOR FIT TIME


[Figure 8 panels: time to fit each predictor (absolute, and centered on the average) against the training set size, averaged over the HW-NAS and TransNAS datasets, with one curve per predictor (Lin. Reg., Bayes. Lin. Reg., Ridge Reg., XGBoost, NGBoost, LGBoost, Random Forests, Sparse GP, GP, BOHAMIANN, SVM Reg., NAO, GCN, BONAS, BANANAS, MLP (large), MLP (small)).]


Figure 8: Fit time (in seconds) of predictors to data, depending on the training set size. By far the
most expensive methods are network-based. However, a significant portion of this time is spent on
the hyper-parameter optimization prior to the actual fitting.

G APPROXIMATING PREDICTOR MISTAKES


[Figure 9 panels: density histograms of the prediction deviations with normal fits of std=0.309 (left), 0.348 (center), and 0.456 (right).]

Figure 9: Further examples of predictor deviation distributions, as visualized in the center of Figure 2. Left: Linear Regression on CIFAR100, edgegpu, energy consumption. Center: Support
Vector Machine on Jigsaw. Right: small MLP on ImageNet16-120, raspi4, latency.

Intuitively, the predictor deviation distributions (see Figures 2 and 9) generally resemble a normal
distribution. However, most predictors:


(1) Have a notable peak, sometimes off-center (e.g. at x=0.2)
(2) Have less density than a normal distribution almost everywhere else
(3) Have some outliers (e.g. at x>1.5) that are extremely unlikely for a normal distribution

Every time we evaluated a predictor, we measured the p-value for different distributions on the first 100 test samples using a t-test. The average statistics can be found in Table 5. Since a large number of empirical observations generally pushes the p-value towards 0, this only serves to compare the distributions to each other. We find that the outliers (3) appear often enough, and are so unlikely under a normal distribution, that even a uniform distribution has higher statistical support. Consequently, we approximate the common predictor deviations by sampling from a mixed distribution that addresses (1) to (3).


-----

| Distribution | p-value |
|---|---|
| normal | 0.028 |
| cauchy | 0.030 |
| lognorm | 0.028 |
| t | 0.028 |
| uniform | 0.037 |

Table 5: P-values of different distributions, trying to fit the distribution of all predictor mistakes
according to a t-test. Larger values are better, but comparing many empirically sampled points with
a true density function tends to push the p-values to 0.


This mixed distribution consists of two Normal distributions (N1, N2) and one Uniform distribution (U), from which we sample with probabilities 72.5%, 26.5%, and 1%, respectively. For some constant v:

_• We uniformly sample a shift c from [0, 2 · v], that is used to push the centers of N1 and N2_
to x > 0 and x < 0 respectively.

_• We sample each value from N1(c, v), N2(−c, 3 · v), and U1(−15 · v, 15 · v) randomly, with the weighting given above._

_• We normalize (subtract mean, divide by standard deviation) our sampled distribution and_
then scale it to the desired standard deviation.

_• The predictors produce non-smooth distributions. We simulate that by sampling 15 times fewer values than needed and repeating each of them 15 times._
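
A minimal sketch of the sampling procedure described above (the function name and the separate target-standard-deviation argument are our assumptions; the provided simulation code is authoritative):

```python
import numpy as np

def sample_mixed_deviations(n, v, target_std, rng=None):
    """Mixed deviation model: N1(c, v), N2(-c, 3*v) and U(-15*v, 15*v),
    sampled with probabilities 72.5% / 26.5% / 1%, then normalized and
    rescaled to the desired standard deviation."""
    rng = np.random.default_rng() if rng is None else rng
    c = rng.uniform(0.0, 2.0 * v)               # shift pushing N1 right and N2 left
    m = max(2, int(np.ceil(n / 15)))            # draw 15x fewer values ...
    which = rng.choice(3, size=m, p=[0.725, 0.265, 0.01])
    draws = np.where(which == 0, rng.normal(c, v, size=m),
            np.where(which == 1, rng.normal(-c, 3 * v, size=m),
                     rng.uniform(-15 * v, 15 * v, size=m)))
    draws = (draws - draws.mean()) / draws.std() * target_std  # normalize, rescale
    return np.repeat(draws, 15)[:n]             # ... and repeat each, for non-smoothness

deviations = sample_mixed_deviations(1000, v=0.5, target_std=0.5)
```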


The code for the simulation is also provided (see Appendix A). As seen in Figure 10, the resulting simulated deviation distributions generally resemble a common predictor pattern. We do not account for differences between predictors, training set sizes, or other factors, since that may become too specific and over-engineered.

Appendix I visualizes simulation sanity checks. We find that the simulation is slightly pessimistic
and simplified, but resembles the results of actual predictors.


[Figure 10 panels: three examples of simulated predictor deviations (mixed distribution generated with std=0.5), each with a normal fit of std=0.500 overlaid.]

Figure 10: The values sampled from the Gaussian + uniform mixture fit the measured predictor mistakes better than a single distribution, as they are roughly normally distributed but include outliers.


-----

H MEASURING SIMULATED MISTAKES


0.45

0.40


0.45

0.40


0.35

0.30


0.35

0.30


0.25


0.25

|true pareto front predicted pareto front all architectures selected architectures|Col2|Col3|
|---|---|---|
||true pareto front predicted pareto front all architectures selected architectures||
||4 normali|2 0 2 4 zed ImageNet16-120-raspi4_latency|

|Col1|Col2|Col3|Col4|Col5|Col6|Col7|Col8|
|---|---|---|---|---|---|---|---|
|||||||||
|||||||||
|||||||||
|||||||||
||true pa discove selecte|re re d|to front, d paret arch., M|HV=2.93 o front, HV RAall = 3.22||=2.67 %, MRApa|reto = 3.77%|
|||||||||
|||||||||
|2.|0 1.5 normali||1.0 0.5 0.0 0.5 zed ImageNet16-120-raspi4_latency|||||


true pareto front
predicted pareto front
all architectures
selected architectures


Figure 11: Similar to Figure 3. When the discovered Pareto set is considerably worse than the true
Pareto set, it is possible for the Mean Reduction of Accuracy of the Pareto subset (MRApareto) to be
_worse than the average over all architectures (MRAall). This naturally happens more frequently for_
worse predictors with a high sampling std. and low KT correlation. Consequently, the difference
between MRAall and MRApareto is wider for better predictors (see Figure 4). Additionally, all of
the selected non-Pareto-front members are clustered in a high-latency area and redundant with each
other. This emphasizes the limitations of just considering drops in accuracy, as the hardware metric
aspect is ignored. In this case, the predictor-guided selection failed to find a low-latency solution.
In this regard, hypervolume is a better but less intuitive metric.

[Figure 12 panels: accuracy vs. hardware metric; left, a true Pareto front (A1–A5) with an actually selected architecture C1 and its accuracy / HW-metric differences; right, a Pareto front with its hypervolume and the reference point (hardware metric +10%, accuracy 0).]

Figure 12: Examples to explain measurement methods.

**Left: The distance of each selected candidate architecture C1 to the true Pareto front is measured,**
for accuracy and the hardware metric. C1 is dominated by A2, A3, and A4 of the true Pareto set. A2
has a slightly higher accuracy than C1 while being much better on the hardware metric, e.g. latency.
A4 has a slightly better hardware metric value, but much higher accuracy. Given several candidate
architectures, their differences are averaged.
**Right: We compute the reference point for the hypervolume (for two objectives: area under a**
Pareto front) by multiplying the highest hardware metric value from the true Pareto front with 1.1,
and accuracy 0. While we are consistent throughout all experiments, this choice is arbitrary, as there
is no obviously correct choice for the reference point. If the hypervolume of a supposed Pareto
front is computed, the reference point of the true Pareto front is reused. Thus, choosing inferior
architectures will always reduce the hypervolume. We arbitrarily chose the multiplier of m = 1.1
as a middle ground between making the rightmost point of the Pareto front irrelevant (m = 1.0) and
overemphasizing it (m >> 1.0).
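
To make the reference-point convention concrete, here is a minimal two-objective hypervolume sketch (accuracy maximized, hardware metric minimized, reference point at m = 1.1 times the largest hardware metric of the true front and accuracy 0); the function and the toy values are ours, and the input is assumed to already be a non-dominated front:

```python
def hypervolume_2d(front, ref_hw, ref_acc=0.0):
    """Two-objective hypervolume: the area dominated by a Pareto front
    (hardware metric minimized, accuracy maximized) relative to the
    reference point (ref_hw, ref_acc). Assumes `front` is non-dominated."""
    # keep points that dominate the reference point, sorted by ascending hw
    pts = sorted((hw, acc) for hw, acc in front if hw <= ref_hw and acc >= ref_acc)
    hv = 0.0
    for i, (hw, acc) in enumerate(pts):
        next_hw = pts[i + 1][0] if i + 1 < len(pts) else ref_hw
        hv += (next_hw - hw) * (acc - ref_acc)   # rectangle up to the next point
    return hv

# reference point: 1.1x the largest hardware metric of the *true* front, accuracy 0
true_front = [(18.0, 45.0), (20.0, 52.0), (24.0, 58.0), (30.0, 62.0)]  # toy (hw, acc) pairs
ref_hw = 1.1 * max(hw for hw, _ in true_front)
print(hypervolume_2d(true_front, ref_hw))
```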


-----

I SIMULATION SANITY CHECK

[Figure 13 panels: Kendall's Tau (y-axis) vs. the std. of the prediction deviations (x-axis), for the five selected HW-NAS-Bench datasets and their mean (left), and for the simulation (right; KT=-0.75, SCC=-0.88, PCC=0.77).]


Figure 13: Standard deviation over the predictor deviations (x axis) and Kendall’s Tau correlation
(y axis), for the trained predictors on HW-NAS-Bench (left) and in simulation (right). The simulated
predictor inaccuracies are slightly pessimistic (low KT), but still match the true values.


-----

[Table 6 panels: density plots of the XGB predictor deviations with normal fits, for the entire test set (top, std=0.445) and for architecture subsets grouped by how often each candidate operation occurs. The fitted standard deviations are:]

| Candidate operation | not at all | exactly once | exactly twice |
|---|---|---|---|
| Zero | 0.541 | 0.443 | 0.356 |
| Skip | 0.532 | 0.436 | 0.412 |
| Conv1x1 | 0.462 | 0.470 | 0.393 |
| Conv3x3 | 0.146 | 0.403 | 0.565 |
| Pool | 0.446 | 0.411 | 0.477 |


Table 6: How a trained XGB predictor deviates from the ground-truth values for different architecture subsets, akin to Figure 2. While they are not exactly the same, they still resemble the distribution
over the entire test set (top plot, 625 samples). One noteworthy exception is when no Conv3x3 operations are used at all, in which case the standard deviation is considerably smaller.


-----