# WHAT TO EXPECT OF HARDWARE METRIC PREDICTORS IN NEURAL ARCHITECTURE SEARCH
**Anonymous authors**
Paper under double-blind review
ABSTRACT
Modern Neural Architecture Search (NAS) focuses on finding the best-performing architectures in hardware-aware settings, e.g., those with an optimal tradeoff of accuracy and latency. Due to the many advantages of prediction models over live measurements, the search process is often guided by estimates of how well each considered network architecture performs on the desired metrics. Typical prediction models range from operation-wise lookup tables to gradient-boosted trees and neural networks, with little known about how they compare. We evaluate 18 different performance predictors on ten combinations of metrics, devices, network types, and training tasks, and find that MLP models are the most promising. We then simulate and evaluate how the guidance of such prediction models affects the subsequent architecture selection. Due to inaccurate predictions, the selected architectures are generally suboptimal, which we quantify as an expected reduction in accuracy and hypervolume. We show that simply verifying the predictions of just the selected architectures can lead to substantially improved results. Under a time budget, we find it preferable to use a fast and inaccurate prediction model over accurate but slow live measurements.
1 INTRODUCTION
Modern neural network architectures are designed with more than just their primary objective, such as accuracy, in mind. While existing architectures can be scaled down to work with the limited available
memory and computational power of, e.g., mobile phones, they are significantly outperformed by
specifically designed architectures (Howard et al., 2017; Sandler et al., 2018; Zhang et al., 2018;
Ma et al., 2018). Standard hardware metrics include memory usage, number of model parameters,
Multiply-Accumulate operations, energy consumption, latency, and more; each of which may be
limited by the hardware platform or network task. As the range of tasks and target platforms grows,
specialized architectures and the methods to find them efficiently are gaining importance.
The automated design and discovery of specialized architectures is the main intent of Neural Architecture Search (NAS). This recent field of study has repeatedly broken state-of-the-art records (Zoph et al.,
2018; Real et al., 2018; Cai et al., 2019; Tan & Le, 2019; Chu et al., 2019a; Hu et al., 2020) while
aiming to reduce the researchers’ involvement with this tedious and time-consuming process to a
minimum. As the performance of each considered architecture needs to be evaluated, the hardware
metrics need to be either measured live or guessed by a trained prediction model. While measuring live has the advantage of not suffering from inaccurate predictions, the corresponding hardware
needs to be available during the search process. Measuring on-demand may also significantly slow
down the search process and necessitates further measurements for each new architecture search.
On the other hand, a prediction model abstracts the hardware from the search code and simplifies
changes to the optimization targets, such as metrics or devices. The data set to train the predictor
also has to be collected only once so that a trained predictor then works in the absence of the hardware it is predicting for, e.g., in a cloud environment. Furthermore, a differentiable predictor can be
used for gradient-based architecture optimization of typically non-differentiable metrics (Cai et al.,
2019; Xu et al., 2020; Nayman et al., 2021).
While these many advantages make predictors a popular choice for hardware-aware NAS (e.g. Xu
et al. (2020); Wu et al. (2019); Wan et al. (2020); Dai et al. (2020); Nayman et al. (2021)), there
are no guidelines on which predictors perform best, how many training samples are required, or
what happens when a predictor is inaccurate. This work investigates the above points. As a first
contribution, we conduct large-scale experiments on ten hardware-metric datasets chosen from HW-NAS-Bench (Li et al., 2021a) and TransNAS-Bench-101 (Duan et al., 2021). We explore how
powerful the different predictors are when using different amounts of training data and whether
these results generalize across different network architecture types. As a second contribution, we
extensively simulate the subsequent architecture selection to investigate the impact of inaccurate
predictors. Our results demonstrate the effectiveness of network-based prediction models and provide insights into predictor mistakes and what to expect from them. To facilitate reproducibility and
further research, our experimental results and code are made available in Appendix A.
2 RELATED WORK
**NAS Benchmarks:** As the search spaces of NAS methods often differ from one another and lack extensive studies, the difficulty of fair comparisons and reproducibility has become a major concern
(Yang et al., 2019; Li & Talwalkar, 2020). To alleviate this problem, researchers have exhaustively
evaluated search spaces of several thousand architectures to create benchmarks (Ying et al., 2019;
Dong & Yang, 2020; Dong et al., 2020; Siems et al., 2020), containing detailed statistics for each
architecture. TransNAS-Bench-101 (Duan et al., 2021) evaluates several thousand architectures
across seven diverse tasks and finds that the best task-specific architectures may vary significantly.
The popular NAS-Bench-201 benchmark (Dong & Yang, 2020) has been further extended with ten different hardware metrics for all 15625 architectures on each of the three data sets CIFAR10, CIFAR100 (Krizhevsky et al., 2009), and ImageNet16-120 (Chrabaszcz et al., 2017). Major findings of this HW-NAS-Bench (Li et al., 2021a) include that FLOPs and the number of parameters are a poor
approximation for other metrics such as latency. Many existing NAS methods use such inadequate
substitutes for their simplicity and would benefit from their replacement with better prediction models. Li et al. also find that hardware-specific costs do not correlate well across hardware platforms.
While accounting for each device’s characteristics improves the NAS results, it is also expensive.
Predictors can reduce costs by requiring fewer measurements and shorter query times.¹
**Predictors in NAS:** Aside from real-time measurements (Tan et al., 2019; Yang et al., 2018),
hardware metric estimation in NAS is commonly performed via Lookup Table (Wu et al., 2019),
Analytical Estimation or a Prediction Model (Dai et al., 2020; Xu et al., 2020). While operation- and layer-wise Lookup Tables can accurately estimate hardware-agnostic metrics, such as FLOPs or the number of parameters (Cai et al., 2019; Guo et al., 2020; Chu et al., 2019a), they may be suboptimal for device-dependent metrics. Latency and energy consumption have non-obvious factors that
depend on hardware specifics such as memory, cache usage, the ability to parallelize each operation,
and an interplay between different network operations. Such details can be captured with neural
networks (Dai et al., 2020; Mendoza & Wang, 2020; Ponomarev et al., 2020; Xu et al., 2020) or
other specialized models (Yao et al., 2018; Wess et al., 2021).
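To make the Lookup Table idea concrete, here is a minimal sketch that estimates latency as the sum of per-operation measurements; the operation names and numbers below are illustrative only and are not taken from any benchmark or device.

```python
# Minimal sketch of an operation-wise Lookup Table latency model.
# All entries are illustrative; in practice each value is measured once per
# operation on the target device.
per_op_latency_ms = {
    "conv3x3": 1.8,
    "conv1x1": 0.6,
    "avg_pool3x3": 0.4,
    "skip_connect": 0.0,
    "zero": 0.0,
}

def lookup_table_estimate(architecture):
    """Estimate latency as the sum over all chosen operations.

    `architecture` is a list of operation names, one per layer/edge. Interactions
    between operations (caching, parallelism, fusion) are ignored, which is why
    this model struggles on some devices.
    """
    return sum(per_op_latency_ms[op] for op in architecture)

print(lookup_table_estimate(["conv3x3", "conv1x1", "skip_connect",
                             "conv3x3", "avg_pool3x3", "zero"]))  # 4.6 (ms)
```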
Of particular interest is the correct prediction of the model loss or accuracy, possibly reducing the
architecture search time by orders of magnitude (Mellor et al., 2020; Wang et al., 2021; Li et al.,
2021b). In addition to common predictors such as Linear Regression, Random Forests (Liaw et al.,
2002) or Gaussian Processes (Rasmussen, 2003); specialized techniques may exploit training curve
extrapolation, network weight sharing or gradient information. Our experiments follow the recent
large-scale study of White et al. (2021), who compare 31 diverse accuracy prediction methods based
on initialization and query time, using three NAS benchmarks.
3 PREDICTING HARDWARE METRICS
Our methods follow the large-scale study of White et al. (2021), who compared a total of 31 accuracy prediction methods. The differences between accuracy and hardware-metric prediction, our
selection of predictors, and the general training pipeline are described in this section. In our experiments on HW-NAS-Bench and TransNAS-Bench-101, described in Section 4, we then compare
these predictors across different training set sizes.
¹ For further reading, we recommend a recent survey on hardware-aware NAS (Benmeziane et al., 2021).
**Differences to accuracy predictors:** There are fundamental differences when predicting hardware metrics and the accuracy of network topologies. The most essential is the cost to obtain a
helpful predictor, which may vary widely for accuracy prediction methods. While determining the
test accuracy requires the costly and lengthy training of networks, measuring hardware metrics does
not necessitate any network training. Consequently, specialized accuracy-estimation methods that
rely on trained networks, loss history, learning curve extrapolation, or early stopping do not apply to
hardware metrics. Furthermore, so-called zero-cost proxies that predict metrics from the gradients
of a single batch depend on the network topology but not on the hardware the network is
placed on. Therefore, the dominant hardware-metric predictor family is model-based.
Since all relevant predictors are model-based, they can be compared by their training set size. This simplifies the initialization time of a predictor to the number of previously measured architectures on which it is trained. In stark contrast, some accuracy predictors do not need any training data,
while others require several partially or fully trained networks. Since an untrained network and a
few batches suffice to measure a hardware metric, the collection of such a training set is comparably
inexpensive.
Additionally, hardware predictors are generally used to supplement a one-shot network optimized
for loss or accuracy. Depending on the NAS method, a fully differentiable predictor is required in
order to guide the gradient-based architecture selection. Typical choices are Lookup Tables (Cai
et al., 2019; Nayman et al., 2021) and neural networks (Xu et al., 2020).
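As an illustration of how a differentiable predictor can guide gradient-based architecture selection, the following PyTorch sketch adds a predicted-latency penalty to the task loss of a relaxed one-shot model. It is our own minimal outline under assumed search-space dimensions and does not reproduce any of the cited methods.

```python
import torch
import torch.nn as nn

NUM_EDGES, NUM_OPS = 6, 5  # assumed search-space dimensions

# Small MLP mapping a (soft) one-hot architecture encoding to a latency estimate.
# In practice it would be pre-trained on measured architectures and then frozen.
latency_predictor = nn.Sequential(
    nn.Linear(NUM_EDGES * NUM_OPS, 64), nn.ReLU(),
    nn.Linear(64, 1),
)
# Relaxed operation choices (architecture parameters) of a one-shot model.
arch_params = nn.Parameter(torch.randn(NUM_EDGES, NUM_OPS))

def latency_penalty():
    # The softmax relaxation keeps the encoding differentiable w.r.t. arch_params.
    probs = torch.softmax(arch_params, dim=-1).flatten()
    return latency_predictor(probs).squeeze()

task_loss = torch.tensor(1.0)               # placeholder for the real training loss
total_loss = task_loss + 0.1 * latency_penalty()
total_loss.backward()                       # gradients flow into arch_params
```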
**Model-based predictors:** The goal of a predictor $f_p(a)$ is to accurately approximate the function $f(a)$, which may be, e.g., the latency of an architecture $a$ from the search space $\mathcal{A}$. A model-based predictor is trained via supervised learning on a set $D_{train}$ of datapoints $(a, f(a))$, after which it can be inexpensively queried for estimates on further architectures. The collection of the dataset and the duration of the training are referred to as initialization time and training time, respectively.
The quality of such a trained predictor is generally determined by the (ranking) correlation between the measurements $\{f(a) \mid a \in \mathcal{A}_{test}\}$ and the predictions $\{f_p(a) \mid a \in \mathcal{A}_{test}\}$ on the unseen architectures $\mathcal{A}_{test} \subset \mathcal{A}$. Common correlation metric choices are Pearson (PCC), Spearman (SCC), and Kendall's Tau (KT) (Chu et al., 2019b; Yu et al., 2020; Siems et al., 2020).
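All three correlations can be computed directly with SciPy; a minimal sketch on synthetic measurements and predictions (the numbers carry no meaning beyond illustration):

```python
import numpy as np
from scipy.stats import kendalltau, spearmanr, pearsonr

rng = np.random.default_rng(0)
measured = rng.normal(size=200)                          # f(a) on the test set
predicted = measured + rng.normal(scale=0.5, size=200)   # f_p(a), a noisy predictor

kt, _ = kendalltau(measured, predicted)
scc, _ = spearmanr(measured, predicted)
pcc, _ = pearsonr(measured, predicted)
print(f"KT={kt:.2f}  SCC={scc:.2f}  PCC={pcc:.2f}")
```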
Our experiments include 18 model-based predictors from different families: Linear Regression,
Ridge Regression (Saunders et al., 1998), Bayesian Linear Regression (Bishop, 2007), Support
Vector Machines (Cortes & Vapnik, 1995), Gaussian Process (Rasmussen, 2003), Sparse Gaussian Process (Candela & Rasmussen, 2005), Random Forests (Liaw et al., 2002), XGBoost (Chen &
Guestrin, 2016), NGBoost (Duan et al., 2020), LGBoost (Ke et al., 2017), BOHAMIANN (Springenberg et al., 2016), BANANAS (White et al., 2019), BONAS (Shi et al., 2020), GCN (Wen
et al., 2020), small and large Multi-Layer-Perceptrons (MLP), NAO (Luo et al., 2018), and a layer- and operation-wise Lookup Table model. We provide further descriptions and implementation details in Appendix B.
**Hyper-parameter tuning:** The default hyper-parameters of the used predictors vary significantly in how well they have been tuned, especially in the context of NAS. Additionally, some predictors may internally make use of cross-validation, while others do not. Following White et al.
(2021), we attempt to level the playing field by running a cross-validation random-search over hyperparameters each time a predictor is fit to data. Each search is limited to 5000 iterations and a total
run time of 15 minutes and naturally excludes any test data. The predictor-specific parameter details
are given in Appendix C.
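A stripped-down version of such a cross-validated random search could look as follows; the model (Random Forests) and its search space are illustrative placeholders rather than the exact settings of our experiments, which are listed in Appendix C.

```python
import time
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

def fit_with_random_search(X, y, max_iters=5000, max_seconds=15 * 60, seed=0):
    """Cross-validated random search over an illustrative hyper-parameter space."""
    rng = np.random.default_rng(seed)
    best_params, best_score = {}, -np.inf
    start = time.time()
    for _ in range(max_iters):
        if time.time() - start > max_seconds:
            break
        params = {
            "n_estimators": int(rng.integers(16, 256)),
            "max_depth": int(rng.integers(2, 16)),
            "max_features": float(rng.uniform(0.1, 1.0)),
        }
        model = RandomForestRegressor(random_state=seed, **params)
        score = cross_val_score(model, X, y, cv=3).mean()   # only training data
        if score > best_score:
            best_params, best_score = params, score
    return RandomForestRegressor(random_state=seed, **best_params).fit(X, y)
```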
**Training pipeline:** To make a reliable comparison, we use the NASLib library (Ruchte et al.
(2020), see Appendix A). We fit each predictor on each dataset and training size 50 times, using
seeds {0, ..., 49}.
Some predictors internally normalize the training values (subtract mean, divide by standard deviation). We choose to explicitly do this for all predictors and datasets, which reduces the dependency
of hyper-parameters (e.g. learning rate) on the dataset and allows us to analyze and compare the
prediction errors across datasets more effectively.
4 PREDICTOR EXPERIMENTS
We compare the different predictor models based on two NAS benchmarks, HW-NAS-Bench (Li
et al., 2021a) and TransNAS-Bench-101 (Duan et al., 2021). They differ considerably by their
network tasks, hardware devices, and architecture designs.
**HW-NAS-Bench architecture design and datasets** In HW-NAS-Bench, each architecture is
solely defined by the topology of a building block ("cell"), which is stacked multiple times to create
a complete network. Each cell is completely defined by the choice of its six operations. Since each operation is selected from five candidates, there are 5^6 = 15625 unique cell topologies.
These cells are not fully sequential but contain paths of different lengths, which is visualized in
Appendix D.
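For illustration, each cell can then be flattened into a fixed-length one-hot vector that serves as the predictor input; the operation names below are placeholders for the five NAS-Bench-201 candidates rather than their exact identifiers.

```python
import numpy as np

OPS = ["none", "skip_connect", "conv1x1", "conv3x3", "avg_pool3x3"]  # placeholders
NUM_EDGES = 6                                             # 5 ** 6 = 15625 cells

def encode_onehot(cell):
    """Encode a cell (list of 6 operation names) as a flat one-hot vector."""
    vec = np.zeros((NUM_EDGES, len(OPS)), dtype=np.float32)
    for edge, op in enumerate(cell):
        vec[edge, OPS.index(op)] = 1.0
    return vec.flatten()  # shape (30,), the input to most model-based predictors

x = encode_onehot(["conv3x3", "skip_connect", "conv1x1",
                   "conv3x3", "avg_pool3x3", "none"])
assert x.shape == (30,) and x.sum() == 6
```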
HW-NAS-Bench provides ten hardware statistics on CIFAR10, CIFAR100 (Krizhevsky et al., 2009) and ImageNet16-120 (Chrabaszcz et al., 2017), of which we exclude the incomplete EdgeTPU metric. Thus there are 27 data sets of varying difficulty. As detailed in Appendix E, 12 of them can
be accurately fit with Linear Regression and only 25 training samples. Many are also very similar
since their measured networks differ only by the number of image classes. We therefore select five datasets that (1) are not trivial to learn, as they are non-linear, and (2) are not redundant:
_• ImageNet16-120, raspi4, latency_
_• CIFAR100, pixel3, latency_
_• CIFAR10, edgegpu, latency_
_• CIFAR100, edgegpu, energy consumption_
_• ImageNet16-120, eyeriss, arithmetic intensity_
**TransNAS-Bench-101 architecture design and datasets** TransNAS-Bench-101 contains information for 7,352 different network architectures, used as backbones in seven diverse vision tasks.
Since 4,096 are also a subset of HW-NAS-Bench, we focus on the remaining 3,256 architectures
with a macro-level search space. Unlike a micro-level search space, where a cell is stacked multiple
times to create a network, each network layer and block is considered individually. In particular, the
TransNAS-Bench-101 networks consist of four to six pairs of ResNet blocks (He et al., 2016), which
may modify the image size and channels in four ways: not at all, double the channel count, halve the
spatial size, and both. Every network has to double the channel count 1 to 3 times, resulting in 3,256
unique architectures. The networks may consequentially differ in their number of layers (depth), the
number of channels (width), and image size at any layer.
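The count of 3,256 can be reproduced by enumeration if one additionally assumes, as in the original benchmark but not restated above, that each network halves the spatial resolution between one and four times; the sketch below is only meant to make the macro-level encoding concrete.

```python
from itertools import product

# Per-block choices: keep (N), double channels (C), halve resolution (S), or both (B).
CHOICES = "NCSB"

def is_valid(net):
    doublings = sum(block in "CB" for block in net)
    reductions = sum(block in "SB" for block in net)
    # 1-3 channel doublings as stated above; we additionally assume 1-4 spatial
    # reductions, a constraint of the original benchmark.
    return 1 <= doublings <= 3 and 1 <= reductions <= 4

networks = [net for depth in (4, 5, 6)
            for net in product(CHOICES, repeat=depth) if is_valid(net)]
print(len(networks))  # 3256
```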
As done for HW-NAS-Bench, we select five of the seven available datasets for their latency measurements. Aside from the self-supervised Jigsaw task, there is little difference between the cross-task
latency measurements (see Appendix E). We evaluate the possibly redundant datasets nonetheless,
since latency predictions in macro-level search spaces are an important domain for NAS on image
classification and object detection tasks:
_• Object classification_
_• Scene classification_
_• Room layout_
_• Jigsaw_
_• Semantic segmentation_
**Fitting results and comparison** The results, averaged over all selected HW-NAS-Bench and
TransNAS-Bench-101 datasets, are presented in Figures 1a and 1b, respectively. The left plots
present the absolute predictor performance; the right ones make relative comparisons easier.
Unsurprisingly, more training samples (i.e., evaluated architectures) generally lead to better prediction results, even until the entire search space is known (aside from the test set). This is true for
most of the predictors, although e.g. Gaussian Processes and BOHAMIANN saturate early. The
simple Linear Regression and Ridge Regression models also fail to make proper use of hundreds
of data points but perform decently when only a few training samples are available. Interestingly,
the same is true for the graph-encoding network-based predictors BONAS and GCN. While knowing how the different paths within each cell connect (see Appendix B) is especially useful given
fewer than fifty training samples, the advantage disappears afterward. In contrast, the graph-encoding
encoder-decoder approach of NAO performs decently at all times.
Figure 1: How well the different predictors rank the test architectures, depending on the training set size and averaged over the five selected datasets. Left plots: absolute Kendall's Tau ranking correlation, higher is better. Right plots: same as left, but centered on the predictor-average. (a) Results on HW-NAS-Bench: NAO performs decently at all times, and none of the prediction models requires more than 60 training samples to improve over a Lookup Table model. (b) Results on TransNAS-Bench-101: since all network architectures are purely sequential by design, we do not evaluate predictors that specifically encode the architecture connectivity (BANANAS, BONAS, GCN, NAO). After as few as 20 training samples, MLP models outclass all other predictors.
Due to their powerful rule-based approach, tree-based models perform much better given many
training samples. Under such circumstances, LGBoost is a candidate for the best predictor model.
Similarly, the predictions of Support Vector Machines also benefit strongly from more samples.
The models we find to perform best for most training set sizes are MLPs. They are among the top
predictors at almost all times in the HW-NAS-Bench, although tree-based models are competitive
given enough data. After around 3,000 training samples, thinner and deeper MLPs improve over the
wider and smaller ones. The path-encoding BANANAS model behaves similarly to a regular large
MLP but requires more samples to reach the same performance. This is interesting since, aside from
the data encoding, BANANAS is an ensemble of three large MLP models. Even though only the first
network layer is affected by the data encoding, the more complicated path-encoding proves harmful
| Metric | Raspi4 | FPGA | Eyeriss | Pixel3 | EdgeGPU | Tesla V100 |
|---|---|---|---|---|---|---|
| latency | 0.45 (0.75) | 0.99 (0.97) | 0.99 (0.96) | 0.49 (0.78) | 0.21 (0.79) | 0.60 (0.70) |
| energy | | 0.99 (0.97) | 1.00 (0.99) | | 0.23 (0.79) | |
| arithmetic intensity | | | 0.84 (0.81) | | | |

Table 1: The Kendall's Tau correlation of Lookup Tables and Linear Regression (in brackets, using only 124 training samples) across metrics and devices. The first five devices belong to HW-NAS-Bench, the Tesla V100 to TransNAS-Bench-101. Lookup Tables perform only marginally better on the FPGA and Eyeriss devices, but considerably worse in all other cases. More detailed statistics are available in Appendix E.
when the connectivity of the architectures in the search space is fixed. On TransNAS-Bench-101,
MLPs perform exceptionally well. They are much better than any other tested predictor once more than just 20 training samples are available. The small MLP model can achieve a KT correlation of 80% with just 200 training samples, for which the best non-network-based predictor (Support Vector Machines) requires four times as many. MLPs are also the only models that achieve a KT correlation of over 90%, about 5% higher than the next best model (LGBoost).
Finally, the Lookup Table models (black horizontal lines) perform poorly in comparison to any other
predictor. Even though building such a model for HW-NAS-Bench datasets requires only 25 neighboring architectures, NAO and GCN perform better after just ten random samples. More than half of the predictor models require fewer than 25 random samples to do so, while the worst need at most 60. On TransNAS-Bench-101, Lookup Tables perform comparably better. Building one requires only 21 neighboring architectures, and it takes most models between 50 and 100 random training samples to achieve better performance. When measured on a per-dataset basis, we find that the Lookup Table models display a severe performance difference, ranging from about 20% KT correlation (cifar10-edgegpu latency and Jigsaw) to over 70% (ImageNet16-120-eyeriss arithmetic intensity and Semantic Segmentation; see Appendix E). Other models prove to be much more stable.
**Devices and Metrics** The previously described results are based on a specific selection of HW-NAS-Bench and TransNAS-Bench-101 datasets that are hard to fit for Lookup Table models. As shown in Table 1, that is not always the case. The FPGA and Eyeriss hardware devices are very suitable for Lookup Tables, for which an almost perfect ranking correlation is achievable. Nonetheless, Linear Regression requires only 124 training samples to compete even there and is significantly better in every other case. We finally observe that the difficulty of fitting predictors depends primarily on the hardware device, much more than on the measured metric.
5 EVALUATING THE PREDICTOR-GUIDED ARCHITECTURE SELECTION
Although the experiments in Section 4 greatly assist us in selecting a predictor, it is not clear what a
specific Kendall’s Tau correlation implies for the subsequent architecture selection. Given a perfect
hardware metric predictor (Kendall’s Tau = 1.0), we can expect that an ideal architecture search
process will select the architectures with the best tradeoff of accuracy and the hardware metric, i.e.,
the true Pareto front. On the other hand, imperfect predictions result in the selection of supposedly-best architectures that are wrongly believed to be optimal.
To study how hardware predictors affect NAS results, we extensively evaluate the selection of such
supposedly-best architectures in simulation. This approach can evaluate any combination of predictor quality, test set size, and dataset, without the technical difficulties of obtaining actual predictor
models that precisely match such requirements. Since the hardware and accuracy prediction models
are usually independent and can be studied in isolation, we use ground-truth accuracy values in all
cases.
**Simulating predictors** The main challenge of the simulation is to quickly and accurately model
predictor outputs. We base our simulation on how predictor-generated values deviate from their
ground-truth targets on the test set, which is explained in Figure 2 and further detailed in Appendix G. Since the simulated deviations are similar to those of actual predictors, simulated predictions are obtained by drawing random values from this deviation distribution and adding them to
the ground-truth hardware measurements.
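In condensed form, the simulation only needs a deviation distribution; the sketch below uses a plain normal distribution instead of the mixed distribution of Appendix G, so its numbers are merely illustrative.

```python
import numpy as np
from scipy.stats import kendalltau

def simulate_predictions(true_metric, std, rng):
    """Simulate an imperfect predictor by adding random deviations to the
    normalized ground-truth hardware measurements."""
    return true_metric + rng.normal(scale=std, size=len(true_metric))

rng = np.random.default_rng(0)
latency = rng.normal(size=15625)                 # normalized ground-truth metric
predicted = simulate_predictions(latency, std=0.5, rng=rng)
print(kendalltau(latency, predicted)[0])         # roughly 0.7 for std=0.5
```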
Figure 2: A trained XGBoost prediction model on normalized ImageNet16-120 raspi4-latency test data (KT=0.73, SCC=0.90, PCC=0.88). Left: The latency prediction (y-axis) for every architecture (blue dot) is approximately correct (red line). Center: The same data as on the left; the distribution of deviations made by the predictor (blue) and a normal distribution fit to them (orange, std=0.477). Right: A mixed distribution (generated with std=0.5) can simulate real deviation distributions such as that in the center plot.

Figure 3: An example of predictor-guided architecture selection, std=0.5, on normalized ImageNet16-120 raspi4-latency. Left: The simulated predictor makes an inaccurate latency prediction for each architecture (blue), resulting in the selection of the supposedly-best architectures (orange dots). Even though the predicted Pareto front (orange line) may differ significantly from the ground-truth Pareto front (red line), most selected architectures are close to optimal. Right: Same data as left; the true Pareto front (red, HV=2.93) and that of the selected architectures (orange, HV=2.86). Simply accepting all selected architectures results in a Mean Reduction of Accuracy (MRA) of 1.06%, while verifying the predictions and discarding inferior results improves that to 0.43%. The hypervolume (HV, the area under the Pareto fronts) is reduced by 0.07.
A single example of a simulation can be seen in Figure 3. Although most selected architectures
(orange) are close to the true optimum (red Pareto front), there almost always exists an architecture that has superior accuracy and, at most, the same latency. Simply accepting the 13 selected
architectures in this particular example results in a mean reduction of accuracy (MRAall) of 1.06%.
In other words, the average selected architecture has 1.06% lower accuracy than a comparable one
on the true Pareto front. However, simply verifying the hardware metric predictions through actual
measurements reveals that some selected architectures are suboptimal. By choosing only the Pareto
subset of the selection, the opportunity loss can be reduced to 0.43% (MRApareto).
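A compact sketch of this selection-and-verification step on synthetic data; the helper functions are our own, and higher accuracy with lower latency is considered better.

```python
import numpy as np

rng = np.random.default_rng(0)
acc = rng.uniform(40, 70, size=2000)                    # ground-truth accuracies (%)
lat = rng.uniform(0, 1, size=2000)                      # ground-truth latencies
predicted_lat = lat + rng.normal(scale=0.5, size=2000)  # simulated latency predictions

def pareto_mask(acc, lat):
    """True for points that no other point dominates (higher acc, lower lat)."""
    mask = np.ones(len(acc), dtype=bool)
    for i in range(len(acc)):
        better = (acc >= acc[i]) & (lat <= lat[i]) & ((acc > acc[i]) | (lat < lat[i]))
        mask[i] = not better.any()
    return mask

def mean_reduction_of_accuracy(sel_acc, sel_lat, acc, lat):
    """Average accuracy gap of the selection to the best architecture with at most
    the same true latency (MRA)."""
    return float(np.mean([acc[lat <= l].max() - a for a, l in zip(sel_acc, sel_lat)]))

# Selection guided by the (simulated) predictor: the supposedly-best architectures.
selected = pareto_mask(acc, predicted_lat)
mra_all = mean_reduction_of_accuracy(acc[selected], lat[selected], acc, lat)

# Verification: measure the true latency of the selection, keep only its Pareto subset.
verified = pareto_mask(acc[selected], lat[selected])
mra_pareto = mean_reduction_of_accuracy(acc[selected][verified],
                                        lat[selected][verified], acc, lat)
print(f"MRA_all={mra_all:.2f}%  MRA_pareto={mra_pareto:.2f}%")
```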
An important property of this approach is that it is independent of any particular optimization
method. The supposedly-best architectures are always correctly identified, which is an upper bound
on how well Bayesian Optimization, Evolutionary Algorithms, and other approaches can perform.
The exemplary MRApareto opportunity loss of 0.43% is therefore unavoidable and depends solely
on the hardware metric predictor, the dataset, and the number of considered architectures.
**Results** We simulate 1,000 architecture selections for each of the five chosen HW-NAS-Bench
datasets, six different test set sizes, and eleven distribution standard deviations between 0.0 and 1.0.
As exemplarily shown in Figure 3, each such simulation allows us to compute the mean reduction
in accuracy (MRA) and the hypervolume (HV) under the Pareto fronts. The most important insights
are visualized in Figure 4 and summarized below.
Figure 4: Simulation results, with the standard deviation of the prediction deviations and the resulting KT correlation on the x-axis (0.0/1.00, 0.2/0.88, 0.4/0.78, 0.6/0.69, 0.8/0.62, 1.0/0.57). Left: Verifying the hardware predictions (Pareto subset of the selected architectures vs. all selected architectures) can significantly improve the results, even more so for better predictors. Center: The drops in average accuracy are dependent on the dataset and hardware metric; shown for each selected HW-NAS-Bench dataset and their mean. Right: Considering more candidate architectures (100 to 15625) and using better prediction models improves the results; larger values are better.
Verifying the predicted results matters (Figure 4, left). The best prediction models achieve a KT correlation of almost 0.9, which translates to a mean reduction in accuracy of MRAall ≈ 1.5%. That means, for each selected architecture, there exists an architecture of equal or lower latency in the true Pareto set (if latency is the hardware metric) that improves the average accuracy by 1.5%. Even though all selected architectures are believed to form a Pareto set, that is not the case. Their optimal subset has a reduction of only MRApareto ≈ 0.5%, a significant improvement. However, finding this optimal subset requires actually measuring the hardware metrics of the architectures selected by the used NAS method.
Furthermore, the left of Figure 4 aids in anticipating the MRA given a specific predictor. If one used, e.g., BOHAMIANN (KT ≈ 0.8, see Figure 1a) instead of MLPs or LGBoost (KT ≈ 0.9), MRApareto increases from around 0.5% to roughly 1.2%. The average accuracy of the selected architectures is thus reduced by another 0.7%, just by using an unsuitable hardware metric predictor. Lookup Tables (KT ≈ 0.45) are not even visualized anymore; they have an MRApareto of over 2.5%.
Another interesting observation is that the gap between MRAall and MRApareto is wider for better
predictors. This is a shortcoming of the MRA metric that we elaborate on in Appendix H.
The dataset and metric matter (Figure 4, center). While we generally present the results averaged
over datasets, there exists some discrepancy among them. Most interestingly, predicting hardware
metrics on harder classification problems (ImageNet16-120 is harder than CIFAR10) also results in
a higher MRA. This is especially important since MRA is an absolute accuracy reduction. Even
though the CIFAR10 networks achieve twice the accuracy of ImageNet16-120 networks, they lose
less absolute accuracy to imperfect predictions. The per-dataset ordering of MRA values is mostly stable across predictor KT correlations. Finally, as visualized by the shaded areas, the standard deviation of the MRA is generally large. Consequently, predictor-guided NAS is very likely to produce results of varying quality for each different predictor or search attempt, especially with less accurate predictors.
The number of considered architectures matters (Figure 4, right). We measure the hypervolume of
the discovered Pareto front (i.e., the area beneath it, see Appendix H), which, unlike MRA, also
considers the hardware metric. Quite obviously, if the architectures from the true Pareto set are not considered, they cannot be selected. To achieve the highest possible hypervolume of around 4.2
(i.e. find the true Pareto set), every architecture in the search space must be evaluated with a perfect
predictor. This is impossible in most real-world cases, where only a tiny fraction of all possible
architectures can ever be considered.
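For two objectives, the hypervolume reduces to the area between the Pareto front and a reference point; a small sketch with an arbitrary reference point of our choosing (normalized values):

```python
import numpy as np

def hypervolume_2d(acc, lat, ref_acc=0.0, ref_lat=1.0):
    """Area dominated by the given points w.r.t. a reference point (worst accuracy,
    worst latency); higher accuracy and lower latency are considered better."""
    order = np.argsort(lat)
    acc, lat = np.asarray(acc, float)[order], np.asarray(lat, float)[order]
    hv, best_acc, prev_lat = 0.0, ref_acc, None
    for a, l in zip(acc, lat):              # sweep from lowest to highest latency
        if prev_lat is not None:
            hv += (l - prev_lat) * (best_acc - ref_acc)
        best_acc, prev_lat = max(best_acc, a), l
    return hv + (ref_lat - prev_lat) * (best_acc - ref_acc)

# Two architectures spanning a small front:
print(hypervolume_2d(acc=[0.6, 0.8], lat=[0.2, 0.5]))   # 0.58
```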
For HW-NAS-Bench, considering 5000 architectures with perfect live measurements and predicting
the metrics for all 15625 with ranking correlation KT≈0.73 results in selecting equivalent sets of
architectures. As seen in Figure 1a, Ridge Regression can achieve this performance with fewer
than 100 training samples. Thus, a worse predictor leads to better results if it enables considering
more architectures. This insight is especially crucial for live measurements, which are accurate but
slow. Similarly, estimating the network accuracy with super-networks takes much more time than
predicting their performance with a neural predictor (Wen et al., 2020). If the measurement of any metric is the limiting factor, a selection guided by a cheap predictor is likely to do better.
6 DISCUSSION
**Chosen prediction methods** Given the nature of hardware-metric prediction, only the subset of
model-based predictors evaluated by White et al. (2021) is suitable. We extended this subset with
four models, including the popular Lookup Table. We abstained from evaluating layer-wise predictors (e.g. Wess et al. (2021)) since such data is not available, and meta-learning predictors (Lee
et al., 2021) due to the vast possibilities to configure them. A separate and specialized comparison between classic and meta-learning predictors seems worthwhile to us.
**Simulation limitations** In contrast to evaluating real predictors, the simulation allows us to quickly make statements for any test set size and predictor inaccuracy. Naturally, however, the results are only approximations. While they match actual values, they are generally slightly pessimistic (see Appendix I). We also limit the simulation to HW-NAS-Bench, since changes to classification results are easier to interpret than changes to loss values across different problem types. Finally, the current simulation approach cannot investigate methods that strictly require a trained one-shot network, such as gradient-based approaches. Including such methods is an interesting direction for future research.
**Transferability of the results** Our evaluation includes five challenging and diverse datasets
based on the micro-level search space of HW-NAS-Bench and five latency-based datasets of various macro-level search space architectures in TransNAS-Bench-101. Nonetheless, we find shared
trends: All tested prediction models improve over Lookup Tables with small amounts of training data. Furthermore, most predictors benefit from more training data, even until the entire search
space (aside from the test set) is known. We also find that network-based predictors are generally
best but may be challenged by tree-based predictors if enough training data is available. Given only
a few samples, Ridge Regression performs better than most other models.
**Recommendations** While Lookup Tables are a cheap, simple, and popular model in gradient-based architecture selection, we find a significant variance in performance across tasks and devices
(see Table 1 and Appendix E). We recommend replacing such models with either MLPs or Ridge
Regression, which are more stable, fully differentiable, and often take less than 100 training samples
to achieve better results.
For most realistic scenarios where more than 100 training samples are available, MLP models are
the most promising. They are among the top predictors on HW-NAS-Bench and demonstrate outstanding performance on the TransNAS-Bench-101 datasets. We found that specialized architecture
encodings are primarily beneficial for little training data but suspect that they enjoy an additional
advantage when network topologies are more complex and diverse (White et al., 2021).
While the query time for all predictors is less than 0.05s and thus negligible, there is a notable
difference in training time (see Appendix F), primarily due to the hyper-parameter optimization. We
recommend Ridge Regression for very small amounts of training data, and LGBoost otherwise, if training time is an important factor.
If a NAS method selects architectures based on hardware metric predictions, we strongly suggest
verifying the results by measuring the true metric value afterward. Doing so may eliminate inferior
candidates and improve the average result substantially. Finally, if the limiting factor to a NAS
method is the slow measurement of hardware metrics, using a much faster predictor may lead to an
improvement, even if the prediction model is less accurate.
7 CONCLUSIONS
This work evaluated various hardware-metric prediction models on ten problems of different metrics, devices, and network architecture types. We then simulated the selection process for different
test set sizes and predictor inaccuracies to improve our understanding of predictor-based architecture selection. We find that even imperfect predictors may improve NAS if their low query time
enables considering more candidate architectures. Finally, verifying the predictions for the selected
candidates can lead to a drastic improvement of their average performance. The code and results are made available, serving both as a recommendation and as a baseline for future work.
REFERENCES
Hadjer Benmeziane, Kaoutar El Maghraoui, Hamza Ouarnoughi, Smaïl Niar, Martin Wistuba, and
Naigang Wang. A Comprehensive Survey on Hardware-Aware Neural Architecture Search.
_[CoRR, abs/2101.09336, 2021. URL https://arxiv.org/abs/2101.09336.](https://arxiv.org/abs/2101.09336)_
Christopher M. Bishop. Pattern recognition and machine learning, 5th Edition. Information science
[and statistics. Springer, 2007. ISBN 9780387310732. URL https://www.worldcat.org/](https://www.worldcat.org/oclc/71008143)
[oclc/71008143.](https://www.worldcat.org/oclc/71008143)
Han Cai, Ligeng Zhu, and Song Han. ProxylessNAS: Direct Neural Architecture Search on
Target Task and Hardware. In 7th International Conference on Learning Representations,
_ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019._ [URL https:](https://openreview.net/forum?id=HylVB3AqYm)
[//openreview.net/forum?id=HylVB3AqYm.](https://openreview.net/forum?id=HylVB3AqYm)
Joaquin Quiñonero Candela and Carl Edward Rasmussen. A Unifying View of Sparse Approximate
[Gaussian Process Regression. J. Mach. Learn. Res., 6:1939–1959, 2005. URL http://jmlr.](http://jmlr.org/papers/v6/quinonero-candela05a.html)
[org/papers/v6/quinonero-candela05a.html.](http://jmlr.org/papers/v6/quinonero-candela05a.html)
Tianqi Chen and Carlos Guestrin. XGBoost: A scalable tree boosting system. In Proceedings of the
_22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD_
’16, pp. 785–794, New York, NY, USA, 2016. ACM. ISBN 978-1-4503-4232-2. doi: 10.1145/
[2939672.2939785. URL http://doi.acm.org/10.1145/2939672.2939785.](http://doi.acm.org/10.1145/2939672.2939785)
Patryk Chrabaszcz, Ilya Loshchilov, and Frank Hutter. A downsampled variant of imagenet as an
alternative to the cifar datasets. arXiv preprint arXiv:1707.08819, 2017.
Xiangxiang Chu, Bo Zhang, Jixiang Li, Qingyuan Li, and Ruijun Xu. ScarletNAS: Bridging the
Gap Between Scalability and Fairness in Neural Architecture Search. CoRR, abs/1908.06022,
[2019a. URL http://arxiv.org/abs/1908.06022.](http://arxiv.org/abs/1908.06022)
Xiangxiang Chu, Bo Zhang, Ruijun Xu, and Jixiang Li. FairNAS: Rethinking Evaluation Fairness
[of Weight Sharing Neural Architecture Search. CoRR, abs/1907.01845, 2019b. URL http:](http://arxiv.org/abs/1907.01845)
[//arxiv.org/abs/1907.01845.](http://arxiv.org/abs/1907.01845)
Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine learning, 20(3):273–297,
1995.
Xiaoliang Dai, Alvin Wan, Peizhao Zhang, Bichen Wu, Zijian He, Zhen Wei, Kan Chen, Yuandong
Tian, Matthew Yu, Peter Vajda, and Joseph E. Gonzalez. FBNetV3: Joint Architecture-Recipe
Search using Neural Acquisition Function. _CoRR, abs/2006.02049, 2020._ [URL https://](https://arxiv.org/abs/2006.02049)
[arxiv.org/abs/2006.02049.](https://arxiv.org/abs/2006.02049)
Xuanyi Dong and Yi Yang. NAS-Bench-201: Extending the Scope of Reproducible Neural Architecture Search. In 8th International Conference on Learning Representations, ICLR 2020, Addis
_[Ababa, Ethiopia, April 26-30, 2020. OpenReview.net, 2020. URL https://openreview.](https://openreview.net/forum?id=HJxyZkBKDr)_
[net/forum?id=HJxyZkBKDr.](https://openreview.net/forum?id=HJxyZkBKDr)
Xuanyi Dong, Lu Liu, Katarzyna Musial, and Bogdan Gabrys. NATS-Bench: Benchmarking NAS
Algorithms for Architecture Topology and Size. arXiv preprint arXiv:2009.00437, 2020.
Tony Duan, Avati Anand, Daisy Yi Ding, Khanh K. Thai, Sanjay Basu, Andrew Y. Ng, and Alejandro Schuler. Ngboost: Natural gradient boosting for probabilistic prediction. In Proceedings of
_the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual_
_Event, volume 119 of Proceedings of Machine Learning Research, pp. 2690–2700. PMLR, 2020._
[URL http://proceedings.mlr.press/v119/duan20a.html.](http://proceedings.mlr.press/v119/duan20a.html)
Yawen Duan, Xin Chen, Hang Xu, Zewei Chen, Xiaodan Liang, Tong Zhang, and Zhenguo Li.
TransNAS-Bench-101: Improving Transferability and Generalizability of Cross-Task Neural Ar[chitecture Search. CoRR, abs/2105.11871, 2021. URL https://arxiv.org/abs/2105.](https://arxiv.org/abs/2105.11871)
[11871.](https://arxiv.org/abs/2105.11871)
Zichao Guo, Xiangyu Zhang, Haoyuan Mu, Wen Heng, Zechun Liu, Yichen Wei, and Jian Sun. Single Path One-Shot Neural Architecture Search with Uniform Sampling. In European Conference
_[on Computer Vision, pp. 544–560. Springer, 2020. URL http://arxiv.org/abs/1904.](http://arxiv.org/abs/1904.00420)_
[00420.](http://arxiv.org/abs/1904.00420)
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image
Recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition,
[pp. 770–778, 2016. URL http://arxiv.org/abs/1512.03385.](http://arxiv.org/abs/1512.03385)
Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand,
Marco Andreetto, and Hartwig Adam. MobileNets: Efficient Convolutional Neural Networks for
[Mobile Vision applications. CoRR, abs/1704.04861, 2017. URL http://arxiv.org/abs/](http://arxiv.org/abs/1704.04861)
[1704.04861.](http://arxiv.org/abs/1704.04861)
Shoukang Hu, Sirui Xie, Hehui Zheng, Chunxiao Liu, Jianping Shi, Xunying Liu, and Dahua Lin.
DSNAS: Direct Neural Architecture Search without Parameter Retraining. In Proceedings of the
_IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12084–12092, 2020._
[URL http://arxiv.org/abs/2002.09128.](http://arxiv.org/abs/2002.09128)
Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and TieYan Liu. Lightgbm: A highly efficient gradient boosting decision tree. In Isabelle Guyon, Ulrike
von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman
Garnett (eds.), Advances in Neural Information Processing Systems 30: Annual Conference on
_Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pp._
[3146–3154, 2017. URL https://proceedings.neurips.cc/paper/2017/hash/](https://proceedings.neurips.cc/paper/2017/hash/6449f44a102fde848669bdd9eb6b76fa-Abstract.html)
[6449f44a102fde848669bdd9eb6b76fa-Abstract.html.](https://proceedings.neurips.cc/paper/2017/hash/6449f44a102fde848669bdd9eb6b76fa-Abstract.html)
Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. CIFAR-10 (Canadian Institute for Advanced
[Research). 2009. URL http://www.cs.toronto.edu/˜kriz/cifar.html.](http://www.cs.toronto.edu/~kriz/cifar.html)
Hayeon Lee, Sewoong Lee, Song Chong, and Sung Ju Hwang. HELP: Hardware-Adaptive Efficient
[Latency Predictor for NAS via Meta-Learning. CoRR, abs/2106.08630, 2021. URL https:](https://arxiv.org/abs/2106.08630)
[//arxiv.org/abs/2106.08630.](https://arxiv.org/abs/2106.08630)
Chaojian Li, Zhongzhi Yu, Yonggan Fu, Yongan Zhang, Yang Zhao, Haoran You, Qixuan Yu, Yue
Wang, and Yingyan Lin. HW-NAS-Bench: Hardware-Aware Neural Architecture Search Bench[mark. CoRR, abs/2103.10584, 2021a. URL https://arxiv.org/abs/2103.10584.](https://arxiv.org/abs/2103.10584)
Guihong Li, Sumit K. Mandal, Ümit Y. Ogras, and Radu Marculescu. FLASH: Fast Neural Architecture Search with Hardware Optimization. CoRR, abs/2108.00568, 2021b. URL https://arxiv.org/abs/2108.00568.
Liam Li and Ameet Talwalkar. Random Search and Reproducibility for Neural Architecture Search.
In Uncertainty in Artificial Intelligence, pp. 367–377. PMLR, 2020.
Andy Liaw, Matthew Wiener, et al. Classification and Regression by randomForest. R news, 2(3):
18–22, 2002.
Marius Lindauer and Frank Hutter. Best Practices for Scientific Research on Neural Architecture
Search. Journal of Machine Learning Research, 21(243):1–18, 2020.
Renqian Luo, Fei Tian, Tao Qin, Enhong Chen, and Tie-Yan Liu. Neural Architecture Optimization.
In Samy Bengio, Hanna M. Wallach, Hugo Larochelle, Kristen Grauman, Nicolò Cesa-Bianchi,
and Roman Garnett (eds.), Advances in Neural Information Processing Systems 31: Annual Con_ference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018,_
_[Montréal, Canada, pp. 7827–7838, 2018. URL https://proceedings.neurips.cc/](https://proceedings.neurips.cc/paper/2018/hash/933670f1ac8ba969f32989c312faba75-Abstract.html)_
[paper/2018/hash/933670f1ac8ba969f32989c312faba75-Abstract.html.](https://proceedings.neurips.cc/paper/2018/hash/933670f1ac8ba969f32989c312faba75-Abstract.html)
Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. ShuffleNet V2: Practical Guidelines
for Efficient CNN Architecture Design. In Vittorio Ferrari, Martial Hebert, Cristian Sminchisescu, and Yair Weiss (eds.), Computer Vision - ECCV 2018 - 15th European Conference, Mu_nich, Germany, September 8-14, 2018, Proceedings, Part XIV, volume 11218 of Lecture Notes_
_[in Computer Science, pp. 122–138. Springer, 2018. doi: 10.1007/978-3-030-01264-9\ 8. URL](https://doi.org/10.1007/978-3-030-01264-9_8)_
[https://doi.org/10.1007/978-3-030-01264-9_8.](https://doi.org/10.1007/978-3-030-01264-9_8)
Joseph Mellor, Jack Turner, Amos Storkey, and Elliot J. Crowley. Neural Architecture Search with[out Training, 2020. URL http://arxiv.org/abs/2006.04647.](http://arxiv.org/abs/2006.04647)
Daniel M. Mendoza and Sijin Wang. Predicting Latency of Neural Network Inference,
2020. [URL http://cs230.stanford.edu/projects_fall_2020/reports/](http://cs230.stanford.edu/projects_fall_2020/reports/55793069.pdf)
[55793069.pdf.](http://cs230.stanford.edu/projects_fall_2020/reports/55793069.pdf)
Niv Nayman, Yonathan Aflalo, Asaf Noy, and Lihi Zelnik. HardCoRe-NAS: Hard Constrained
diffeRentiable Neural Architecture Search. In Marina Meila and Tong Zhang (eds.), Proceedings
_of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual_
_Event, volume 139 of Proceedings of Machine Learning Research, pp. 7979–7990. PMLR, 2021._
[URL http://proceedings.mlr.press/v139/nayman21a.html.](http://proceedings.mlr.press/v139/nayman21a.html)
Evgeny Ponomarev, Sergey A. Matveev, and Ivan V. Oseledets. LETI: Latency Estimation Tool and
Investigation of Neural Networks inference on Mobile GPU. CoRR, abs/2010.02871, 2020. URL
[https://arxiv.org/abs/2010.02871.](https://arxiv.org/abs/2010.02871)
Carl Edward Rasmussen. Gaussian Processes in Machine Learning. In Olivier Bousquet, Ulrike
von Luxburg, and Gunnar Rätsch (eds.), Advanced Lectures on Machine Learning, ML Sum_mer Schools 2003, Canberra, Australia, February 2-14, 2003, Tübingen, Germany, August 4-_
_16, 2003, Revised Lectures, volume 3176 of Lecture Notes in Computer Science, pp. 63–71._
[Springer, 2003. doi: 10.1007/978-3-540-28650-9\ 4. URL https://doi.org/10.1007/](https://doi.org/10.1007/978-3-540-28650-9_4)
[978-3-540-28650-9_4.](https://doi.org/10.1007/978-3-540-28650-9_4)
Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V Le. Regularized Evolution for Image
[Classifier Architecture Search, 2018. URL http://arxiv.org/abs/1802.01548.](http://arxiv.org/abs/1802.01548)
Michael Ruchte, Arber Zela, Julien Siems, Josif Grabocka, and Frank Hutter. Naslib: A modular
[and flexible neural architecture search library. https://github.com/automl/NASLib,](https://github.com/automl/NASLib)
2020.
Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE conference on
_computer vision and pattern recognition, pp. 4510–4520, 2018._
Craig Saunders, Alexander Gammerman, and Volodya Vovk. Ridge Regression Learning Algorithm
in Dual Variables. In Proceedings of the Fifteenth International Conference on Machine Learning,
ICML ’98, pp. 515–521, San Francisco, CA, USA, 1998. Morgan Kaufmann Publishers Inc. ISBN
1558605568.
Han Shi, Renjie Pi, Hang Xu, Zhenguo Li, James T. Kwok, and Tong Zhang. Bridging
the Gap between Sample-based and One-shot Neural Architecture Search with BONAS. In
Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and HsuanTien Lin (eds.), Advances in Neural Information Processing Systems 33: _Annual Con-_
_ference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12,_
_[2020, virtual, 2020. URL https://proceedings.neurips.cc/paper/2020/hash/](https://proceedings.neurips.cc/paper/2020/hash/13d4635deccc230c944e4ff6e03404b5-Abstract.html)_
[13d4635deccc230c944e4ff6e03404b5-Abstract.html.](https://proceedings.neurips.cc/paper/2020/hash/13d4635deccc230c944e4ff6e03404b5-Abstract.html)
Julien Siems, Lucas Zimmer, Arber Zela, Jovita Lukasik, Margret Keuper, and Frank Hutter. NAS-Bench-301 and the Case for Surrogate Benchmarks for Neural Architecture Search, 2020.
Jost Tobias Springenberg, Aaron Klein, Stefan Falkner, and Frank Hutter. Bayesian
Optimization with Robust Bayesian Neural Networks. In Daniel D. Lee, Masashi
Sugiyama, Ulrike von Luxburg, Isabelle Guyon, and Roman Garnett (eds.), Advances
_in Neural Information Processing Systems 29:_ _Annual Conference on Neural Infor-_
_mation Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pp. 4134–_
4142, 2016. URL [https://proceedings.neurips.cc/paper/2016/hash/](https://proceedings.neurips.cc/paper/2016/hash/a96d3afec184766bfeca7a9f989fc7e7-Abstract.html)
[a96d3afec184766bfeca7a9f989fc7e7-Abstract.html.](https://proceedings.neurips.cc/paper/2016/hash/a96d3afec184766bfeca7a9f989fc7e7-Abstract.html)
Mingxing Tan and Quoc V. Le. EfficientNet: Rethinking Model Scaling for Convolutional Neural
[Networks. CoRR, abs/1905.11946, 2019. URL http://arxiv.org/abs/1905.11946.](http://arxiv.org/abs/1905.11946)
Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, Mark Sandler, Andrew Howard, and
Quoc V. Le. MnasNet: Platform-Aware Neural Architecture Search for Mobile. In IEEE
_Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA,_
_USA, June 16-20, 2019, pp. 2820–2828. Computer Vision Foundation / IEEE, 2019._ doi:
10.1109/CVPR.2019.00293. URL [http://openaccess.thecvf.com/content_](http://openaccess.thecvf.com/content_CVPR_2019/html/Tan_MnasNet_Platform-Aware_Neural_Architecture_Search_for_Mobile_CVPR_2019_paper.html)
[CVPR_2019/html/Tan_MnasNet_Platform-Aware_Neural_Architecture_](http://openaccess.thecvf.com/content_CVPR_2019/html/Tan_MnasNet_Platform-Aware_Neural_Architecture_Search_for_Mobile_CVPR_2019_paper.html)
[Search_for_Mobile_CVPR_2019_paper.html.](http://openaccess.thecvf.com/content_CVPR_2019/html/Tan_MnasNet_Platform-Aware_Neural_Architecture_Search_for_Mobile_CVPR_2019_paper.html)
Alvin Wan, Xiaoliang Dai, Peizhao Zhang, Zijian He, Yuandong Tian, Saining Xie, Bichen Wu,
Matthew Yu, Tao Xu, Kan Chen, Peter Vajda, and Joseph E. Gonzalez. FBNetV2: Differentiable
Neural Architecture Search for Spatial and Channel Dimensions. In 2020 IEEE/CVF Conference
_on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020,_
[pp. 12962–12971. IEEE, 2020. doi: 10.1109/CVPR42600.2020.01298. URL https://doi.](https://doi.org/10.1109/CVPR42600.2020.01298)
[org/10.1109/CVPR42600.2020.01298.](https://doi.org/10.1109/CVPR42600.2020.01298)
Ruochen Wang, Xiangning Chen, Minhao Cheng, Xiaocheng Tang, and Cho-Jui Hsieh. RANK-NOSH: Efficient Predictor-Based Architecture Search via Non-Uniform Successive Halving.
_[CoRR, abs/2108.08019, 2021. URL https://arxiv.org/abs/2108.08019.](https://arxiv.org/abs/2108.08019)_
Wei Wen, Hanxiao Liu, Yiran Chen, Hai Helen Li, Gabriel Bender, and Pieter-Jan Kindermans.
Neural Predictor for Neural Architecture Search. In Andrea Vedaldi, Horst Bischof, Thomas
Brox, and Jan-Michael Frahm (eds.), Computer Vision - ECCV 2020 - 16th European Conference,
_Glasgow, UK, August 23-28, 2020, Proceedings, Part XXIX, volume 12374 of Lecture Notes in_
_[Computer Science, pp. 660–676. Springer, 2020. doi: 10.1007/978-3-030-58526-6\ 39. URL](https://doi.org/10.1007/978-3-030-58526-6_39)_
[https://doi.org/10.1007/978-3-030-58526-6_39.](https://doi.org/10.1007/978-3-030-58526-6_39)
Matthias Wess, Matvey Ivanov, Christoph Unger, Anvesh Nookala, Alexander Wendt, and Axel
Jantsch. ANNETTE: Accurate Neural Network Execution Time Estimation With Stacked Models.
_IEEE Access, 9:3545–3556, 2021. ISSN 2169-3536. doi: 10.1109/access.2020.3047259. URL_
[http://dx.doi.org/10.1109/ACCESS.2020.3047259.](http://dx.doi.org/10.1109/ACCESS.2020.3047259)
Colin White, Willie Neiswanger, and Yash Savani. BANANAS: Bayesian Optimization with Neural
Architectures for Neural Architecture Search. arXiv preprint arXiv:1910.11858, 2019.
Colin White, Arber Zela, Binxin Ru, Yang Liu, and Frank Hutter. How Powerful are Performance
Predictors in Neural Architecture Search? _CoRR, abs/2104.01177, 2021._ [URL https://](https://arxiv.org/abs/2104.01177)
[arxiv.org/abs/2104.01177.](https://arxiv.org/abs/2104.01177)
Bichen Wu, Xiaoliang Dai, Peizhao Zhang, Yanghan Wang, Fei Sun, Yiming Wu, Yuandong
Tian, Peter Vajda, Yangqing Jia, and Kurt Keutzer. FBNet: Hardware-Aware Efficient ConvNet
Design via Differentiable Neural Architecture Search. In IEEE Conference on Computer Vision
_and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, pp. 10734–_
10742. Computer Vision Foundation / IEEE, 2019. doi: 10.1109/CVPR.2019.01099. URL
[http://openaccess.thecvf.com/content_CVPR_2019/html/Wu_FBNet_](http://openaccess.thecvf.com/content_CVPR_2019/html/Wu_FBNet_Hardware-Aware_Efficient_ConvNet_Design_via_Differentiable_Neural_Architecture_Search_CVPR_2019_paper.html)
[Hardware-Aware_Efficient_ConvNet_Design_via_Differentiable_](http://openaccess.thecvf.com/content_CVPR_2019/html/Wu_FBNet_Hardware-Aware_Efficient_ConvNet_Design_via_Differentiable_Neural_Architecture_Search_CVPR_2019_paper.html)
[Neural_Architecture_Search_CVPR_2019_paper.html.](http://openaccess.thecvf.com/content_CVPR_2019/html/Wu_FBNet_Hardware-Aware_Efficient_ConvNet_Design_via_Differentiable_Neural_Architecture_Search_CVPR_2019_paper.html)
Yuhui Xu, Lingxi Xie, Xiaopeng Zhang, Xin Chen, Bowen Shi, Qi Tian, and Hongkai Xiong.
Latency-Aware Differentiable Neural Architecture Search. CoRR, abs/2001.06392, 2020. URL
[https://arxiv.org/abs/2001.06392.](https://arxiv.org/abs/2001.06392)
Antoine Yang, Pedro M. Esperança, and Fabio M. Carlucci. NAS evaluation is frustratingly hard.
_arXiv preprint arXiv:1912.12522, 2019._
Tien-Ju Yang, Andrew G. Howard, Bo Chen, Xiao Zhang, Alec Go, Mark Sandler, Vivienne Sze,
and Hartwig Adam. NetAdapt: Platform-Aware Neural Network Adaptation for Mobile Applications. In Vittorio Ferrari, Martial Hebert, Cristian Sminchisescu, and Yair Weiss (eds.),
_Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14,_
_2018, Proceedings, Part X, volume 11214 of Lecture Notes in Computer Science, pp. 289–304._
[Springer, 2018. doi: 10.1007/978-3-030-01249-6\ 18. URL https://doi.org/10.1007/](https://doi.org/10.1007/978-3-030-01249-6_18)
[978-3-030-01249-6_18.](https://doi.org/10.1007/978-3-030-01249-6_18)
Shuochao Yao, Yiran Zhao, Huajie Shao, ShengZhong Liu, Dongxin Liu, Lu Su, and Tarek Abdelzaher. FastDeepIoT: Towards Understanding and Optimizing Neural Network Execution Time
on Mobile and Embedded Devices. In Proceedings of the 16th ACM Conference on Embedded
_Networked Sensor Systems, pp. 278–291, 2018._
Chris Ying, Aaron Klein, Eric Christiansen, Esteban Real, Kevin Murphy, and Frank Hutter. NAS-Bench-101: Towards Reproducible Neural Architecture Search. In Kamalika Chaudhuri and Ruslan Salakhutdinov (eds.), Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of
_[Machine Learning Research, pp. 7105–7114. PMLR, 2019. URL http://proceedings.](http://proceedings.mlr.press/v97/ying19a.html)_
[mlr.press/v97/ying19a.html.](http://proceedings.mlr.press/v97/ying19a.html)
Kaicheng Yu, René Ranftl, and Mathieu Salzmann. How to Train Your Super-Net: An Analysis
[of Training Heuristics in Weight-Sharing NAS. CoRR, abs/2003.04276, 2020. URL https:](https://arxiv.org/abs/2003.04276)
[//arxiv.org/abs/2003.04276.](https://arxiv.org/abs/2003.04276)
Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. ShuffleNet: An Extremely
Efficient Convolutional Neural Network for Mobile Devices. In 2018 IEEE Conference
_on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA,_
_June 18-22, 2018, pp. 6848–6856. IEEE Computer Society, 2018._ doi: 10.1109/CVPR.
[2018.00716. URL http://openaccess.thecvf.com/content_cvpr_2018/html/](http://openaccess.thecvf.com/content_cvpr_2018/html/Zhang_ShuffleNet_An_Extremely_CVPR_2018_paper.html)
[Zhang_ShuffleNet_An_Extremely_CVPR_2018_paper.html.](http://openaccess.thecvf.com/content_cvpr_2018/html/Zhang_ShuffleNet_An_Extremely_CVPR_2018_paper.html)
Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V Le. Learning transferable architectures
for scalable image recognition. In Proceedings of the IEEE conference on computer vision and
_[pattern recognition, pp. 8697–8710, 2018. URL http://arxiv.org/abs/1707.07012.](http://arxiv.org/abs/1707.07012)_
A BEST PRACTICES FOR NAS, CODE AND DATA
To improve reproducibility and facilitate fair experimental comparisons, we follow the best-practices checklist (Lindauer & Hutter, 2020):
_• Release Code for the Training Pipeline(s) you use. Our experiments are based on White_
et al. (2021), who use NASLib (Ruchte et al., 2020) to compare 31 methods for accuracy
prediction. Our NASLib fork, which extends the framework with HW-NAS-Bench, TransNAS-Bench, additional performance predictors, and the hypervolume simulations, is provided in the
supplementary materials. We intend to either make our fork available on GitHub or submit
the changes upstream once this paper is accepted/published.
_• Use the Same Evaluation Protocol for the Methods Being Compared. Aside from the_
implementation of each predictor, all experiments use the same pipeline.
_• Validate The Results Several Times._ We ran each predictor 50 times, with seeds
_{0, ..., 49}_. The reductions in hypervolume are simulated 1000 times, each time using a different subset of the data set, for each combination of {iteration, HW-NAS data set, noise
on HW metric}.
_• Control Confounding Factors. While all experiments used the same software libraries_
and hardware resources, they were run on different machines to speed up the evaluation.
We found hardly any benefit in using a GPU even for the network-based predictors, which
is why every method only used two CPU cores. The OS is Ubuntu 18.04; notable software packages are PyTorch 1.9.0, numpy 1.19.5, scikit-learn 0.24.2, pybnn 0.0.5, ngboost 0.3.11, and xgboost 1.4.2.
_• Report the Use of Hyperparameter Optimization. See Appendix C._
In addition to the code in the supplementary materials, we also provide the experimental results as
csv files. Running the predictors and hypervolume simulations takes some time, but the easy access
to the data of the finished experiments may prove useful for future research. Please see readme.md
in the accompanying code zip file for instructions.
B ENCODINGS AND PREDICTORS
B.1 DATA ENCODINGS
Every architecture a ∈ A requires a unique representation, which depends on the used predictor.
The common encoding types are:
**Adjacency one-hot:** Each architecture a is uniquely defined by the chosen candidate operation on every path. For example, each architecture in NAS-BENCH-201 consists of a repeated cell structure, which has five candidate operations on each of the six paths. Therefore there are 5^6 = 15625 unique architectures, which can each be referenced by a sequence of operation indices such as [0 1 2 3 4 0]. Many predictors perform better if the sequence is presented as a one-hot encoding, which in this case is [10000 01000 00100 00010 00001 10000].
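As an illustration, a minimal sketch of this encoding (the function name and the use of numpy are our own; NASLib's actual implementation may differ):

```python
import numpy as np

def adjacency_one_hot(op_indices, num_candidates=5):
    """Flattened one-hot encoding of a sequence of per-path operation indices."""
    encoding = np.zeros((len(op_indices), num_candidates), dtype=int)
    encoding[np.arange(len(op_indices)), op_indices] = 1
    return encoding.flatten()

# The example from the text: operation indices [0 1 2 3 4 0]
# become the one-hot vector 10000 01000 00100 00010 00001 10000.
print(adjacency_one_hot([0, 1, 2, 3, 4, 0]))
```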
Similarly, the path-encoding (used by BANANAS) is a one-hot representation over the used candidate operations on all possible paths. Since the connectivity within cells is fixed for HW-NAS-Bench and TransNAS-Bench-101, it provides no more information than the adjacency one-hot encoding. If the connectivity can be adjusted more freely, as in the NAS-Bench-101 search space, the additional information may improve the fit.
The encodings for BONAS, GCN, and NAO each provide further information in addition to the adjacency one-hot vector, most notably the adjacency matrix. This {0, 1}^((N+2)×(N+2)) matrix describes which of the N architecture paths (rows) serve as inputs to each other path (columns), and also includes the cell input and output.
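To illustrate the adjacency-matrix part of these encodings, the sketch below builds such a binary matrix for a toy cell; the node ordering (input, paths, output) and the example wiring are illustrative assumptions rather than the exact NASLib layout.

```python
import numpy as np

def adjacency_matrix(num_paths, edges):
    """(N+2)x(N+2) binary matrix over [input, path_1, ..., path_N, output];
    entry (i, j) = 1 means node i serves as an input to node j."""
    size = num_paths + 2
    adj = np.zeros((size, size), dtype=int)
    for src, dst in edges:
        adj[src, dst] = 1
    return adj

# Toy cell with N = 2 paths: the input (node 0) feeds both paths,
# path 1 feeds path 2, and path 2 feeds the output (node 3).
print(adjacency_matrix(2, [(0, 1), (0, 2), (1, 2), (2, 3)]))
```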
B.2 PREDICTORS
We briefly describe the 18 predictor methods in our experiments. We adopt their implementations
from the NASLib library (see Appendix A), which we extend with Linear Regression, Ridge Regression, and Support Vector Machines from the scikit-learn package; and a simple Lookup Table
implementation. Unless specified otherwise, the methods use the adjacency one-hot encoding.
_• BANANAS An ensemble of three MLP models with five to 20 layers, each using the path-_
encoding (White et al., 2019).
_• Bayesian Linear Regression A bayesian model that assumes (1) a linear dependency be-_
tween inputs and outputs, and (2) that the samples are normally distributed (Bishop, 2007).
_• BOHAMIANN A bayesian inference predictor using stochastic gradient Hamiltonian_
Monte Carlo (SGHMC) to sample from a bayesian neural network (Springenberg et al.,
2016).
_• BONAS Bayesian Optimization for NAS (Shi et al., 2020) uses a GCN predictor within an_
outer loop of bayesian optimization, as a meta-learning task. The GCN requires encoding
the adjacency matrix of each architecture.
_• Gaussian Process A simple model that assumes a joint Gaussian distribution underlying_
the training data (Rasmussen, 2003).
_• GCN A Graph Convolutional Network that makes use of an adjacency-matrix encoding of_
each architecture (Wen et al., 2020).
_• Linear Regression A simple model that assumes an independent value/cost for each operation/layer, which are simply summed up. Unlike the Lookup Table model, it uses a least-squares fit on the training data.
_• Lookup Table The simplest and perhaps most widely used model for differentiable architecture selection. It generally assumes a single baseline architecture (e.g. [001 001] in adjacency one-hot encoding) and a lookup matrix R^((num layers)×(num candidates)) that contains the increases/reductions in the metric for each layer and candidate operation. The metric value of a new architecture can then be predicted as a simple sum over the respective matrix entries and the baseline value (see the sketch after this list). The model is obtained by measuring either each candidate operation in isolation, or the differences between the baseline architecture and specific variations (e.g. [010 001] or [100 001], to measure the first candidates). This model always requires 1 + (num layers) · (num candidates − 1) neighboring architectures to fit. We detail the resulting correlation values for each used dataset in Appendix E.
_• LGBoost Light Gradient Boosting Machine (LightGBM or LGBoost, Ke et al. (2017)) is a_
lightweight gradient-boosted decision tree model.
_• MLP We use fully-connected Multi Layer Perceptrons in two size-categories._
_• NAO NAO (Luo et al., 2018) uses an encoder-decoder topology,_ which encodes/compresses an architecture to a continuous representation, and decodes it again. This
representation is further used to make architecture predictions.
_• NGBoost Natural Gradient Boosting (NGBoost, Duan et al. (2020)) is a gradient-boosted_
decision tree model that uses natural gradients to estimate uncertainty.
_• Ridge Regression Ridge Regression (Saunders et al., 1998) extends the Linear Regression_
least-squares fit with a regularization term that serves as bias-variance tradeoff.
_• Random Forests An ensemble of decision trees (Liaw et al., 2002)._
_• Sparse Gaussian Process An approximation of Gaussian Processes that summarizes training data (Candela & Rasmussen, 2005).
_• Support Vector Machine A model that maps its inputs to a high-dimensional space, where_
training samples are used as support-vectors for decision-boundaries (Cortes & Vapnik,
1995).
_• XGBoost eXtreme Gradient Boosting (XGBoost, Chen & Guestrin (2016)) is a gradient-_
boosted decision tree model.
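The following sketch illustrates the Lookup Table idea referenced above, built from a baseline architecture and its single-operation variations; the class and method names are our own and not NASLib's interface.

```python
import numpy as np

class LookupTableSketch:
    """Minimal lookup-table predictor: baseline value plus per-(layer, candidate) offsets."""

    def fit(self, baseline_value, variation_values, baseline_op=0):
        # variation_values[l][c] is the measured metric of the architecture that equals the
        # baseline except that layer l uses candidate c (for c == baseline_op, the baseline itself).
        self.baseline_value = baseline_value
        self.table = np.asarray(variation_values, dtype=float) - baseline_value
        self.table[:, baseline_op] = 0.0

    def predict(self, op_indices):
        # Baseline value plus the offsets of the chosen candidate operations.
        return self.baseline_value + sum(self.table[layer, op] for layer, op in enumerate(op_indices))

# Example: 2 layers, 3 candidates; predict an architecture that picks candidates 2 and 1.
lut = LookupTableSketch()
lut.fit(baseline_value=10.0, variation_values=[[10.0, 11.0, 12.5], [10.0, 10.5, 13.0]])
print(lut.predict([2, 1]))  # 10.0 + 2.5 + 0.5 = 13.0
```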
C HYPERPARAMETERS
We list our hyper-parameter sample ranges and default values in Table 2. For comparability with White et al. (2021), we only choose values for the newly introduced parameterized predictors: Ridge Regression, Support Vector Machines, and small MLPs.
| Model | Hyper-parameter | Range/Choice | Log-transform | Default |
|---|---|---|---|---|
| BANANAS | Num. layers | [5, 25] | false | 20 |
| BANANAS | Layer width | [5, 25] | false | 20 |
| BANANAS | Learning rate | [0.0001, 0.1] | true | 0.001 |
| BONAS | Num. layers | [16, 128] | true | 64 |
| BONAS | Batch size | [32, 256] | true | 128 |
| BONAS | Learning rate | [0.00001, 0.1] | true | 0.0001 |
| GCN | Num. layers | [64, 200] | true | 144 |
| GCN | Batch size | [5, 32] | true | 7 |
| GCN | Learning rate | [0.00001, 0.1] | true | 0.0001 |
| GCN | Weight decay | [0.00001, 0.1] | true | 0.0003 |
| LGBoost | Num. leaves | [10, 100] | false | 31 |
| LGBoost | Learning rate | [0.001, 0.1] | true | 0.05 |
| LGBoost | Feature fraction | [0.1, 1] | false | 0.9 |
| MLP (small) | Num. layers | [2, 5] | false | 3 |
| MLP (small) | Layer width | [16, 128] | true | 32 |
| MLP (small) | Learning rate | [0.0001, 0.1] | true | 0.001 |
| MLP (small) | Activation function | {relu, tanh, hardswish} | – | relu |
| MLP (huge) | Num. layers | [5, 25] | false | 20 |
| MLP (huge) | Layer width | [5, 25] | false | 20 |
| MLP (huge) | Learning rate | [0.0001, 0.1] | true | 0.001 |
| NAO | Num. layers | [16, 128] | true | 64 |
| NAO | Batch size | [32, 256] | true | 100 |
| NAO | Learning rate | [0.00001, 0.1] | true | 0.001 |
| NGBoost | Num. estimators | [128, 512] | true | 64 |
| NGBoost | Learning rate | [0.001, 0.1] | true | 0.081 |
| NGBoost | Max depth | [1, 25] | false | 6 |
| NGBoost | Max features | [0.1, 1] | false | 0.79 |
| Ridge Regression | Regularization α | [0.25, 2.5] | false | 1.0 |
| Random Forests | Num. estimators | [16, 128] | true | 116 |
| Random Forests | Max features | [0.1, 0.9] | true | 0.17 |
| Random Forests | Min samples (leaf) | [1, 20] | false | 2 |
| Random Forests | Min samples (split) | [2, 20] | true | 2 |
| Support Vector Machine | Regularization C | [0.5, 1.5] | false | 1.0 |
| Support Vector Machine | Kernel | {linear, poly, rbf, sigmoid} | – | rbf |
| XGBoost | Max depth | [1, 15] | false | 6 |
| XGBoost | Min child weight | [1, 10] | false | 1 |
| XGBoost | Col sample (tree) | [0, 1] | false | 1 |
| XGBoost | Learning rate | [0.001, 0.5] | true | 0.3 |
| XGBoost | Col sample (level) | [0, 1] | false | 1 |
Table 2: Hyper-parameter ranges and default values of the configurable predictors
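Hyper-parameters marked with a log-transform are presumably sampled uniformly in log space over the listed range; a minimal sketch of that convention (the function name is our own):

```python
import math
import random

def sample_hyperparameter(low, high, log_transform=False, rng=random):
    """Sample uniformly from [low, high], optionally in log space (cf. the Log-transform column)."""
    if log_transform:
        return math.exp(rng.uniform(math.log(low), math.log(high)))
    return rng.uniform(low, high)

# E.g., a BANANAS learning rate from [0.0001, 0.1] with log transform:
print(sample_hyperparameter(0.0001, 0.1, log_transform=True))
```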
D NAS-BENCH-201 / HW-NAS-BENCH CELL DESIGN
Figure 5: Basic NAS-Bench-201 / HW-NAS cell design (shared cell topology; paths numbered 1 to 6). Each of the six orange paths is finalized with exactly one out of five candidate operations {Zero, Skip, Convolution 1×1, Convolution 3×3, Average Pooling 3×3}.
E SELECTION OF DATASETS
Columns 1–10: Linear Regression, trained on 11 / 25 / 55 / 124 / 276 / 614 / 1366 / 3036 / 6748 / 15000 samples. Column 11: XGBoost, trained on 15000 samples. Column 12: Lookup Table (LUT).
ImageNet16-120-raspi4 latency 0.324 0.205 0.606 0.676 0.705 0.716 0.715 0.723 0.728 0.729 0.757 0.443
cifar100-pixel3 latency 0.392 0.292 0.732 0.780 0.797 0.803 0.806 0.809 0.812 0.812 0.877 0.484
cifar10-edgegpu latency 0.370 0.258 0.724 0.790 0.806 0.819 0.820 0.822 0.830 0.829 0.926 0.175
cifar100-edgegpu energy 0.376 0.275 0.732 0.793 0.812 0.821 0.821 0.823 0.831 0.831 0.920 0.221
ImageNet16-120-eyeriss arith. int. 0.369 0.293 0.748 0.805 0.817 0.827 0.825 0.832 0.843 0.846 0.970 0.861
cifar10-pixel3 latency 0.388 0.300 0.733 0.780 0.797 0.805 0.805 0.810 0.813 0.813 0.878 0.475
cifar10-raspi4 latency 0.393 0.315 0.740 0.787 0.799 0.805 0.807 0.810 0.813 0.813 0.890 0.462
cifar100-raspi4 latency 0.393 0.308 0.744 0.786 0.801 0.807 0.810 0.810 0.814 0.814 0.888 0.445
ImageNet16-120-pixel3 latency 0.398 0.312 0.739 0.786 0.799 0.807 0.809 0.812 0.815 0.816 0.884 0.509
cifar100-edgegpu latency 0.375 0.268 0.728 0.793 0.810 0.821 0.820 0.822 0.831 0.831 0.924 0.191
cifar10-edgegpu energy 0.375 0.284 0.728 0.792 0.810 0.821 0.823 0.824 0.831 0.831 0.922 0.183
ImageNet16-120-edgegpu energy 0.377 0.281 0.733 0.797 0.814 0.825 0.825 0.826 0.834 0.833 0.926 0.280
ImageNet16-120-edgegpu latency 0.379 0.264 0.737 0.799 0.817 0.826 0.826 0.828 0.836 0.835 0.938 0.277
cifar10-eyeriss arith. int. 0.384 0.296 0.757 0.811 0.826 0.835 0.832 0.843 0.854 0.854 0.969 0.826
cifar100-eyeriss arith. int. 0.384 0.297 0.757 0.811 0.826 0.835 0.833 0.844 0.855 0.856 0.971 0.830
ImageNet16-120-fpga latency 0.443 0.494 0.904 0.936 0.947 0.951 0.948 0.951 0.952 0.952 0.983 0.965
ImageNet16-120-fpga energy 0.443 0.494 0.905 0.935 0.947 0.951 0.948 0.951 0.952 0.952 0.983 0.965
ImageNet16-120-eyeriss latency 0.457 0.937 0.953 0.954 0.954 0.954 0.953 0.953 0.954 0.954 0.952 0.989
cifar10-eyeriss latency 0.461 0.943 0.959 0.959 0.960 0.960 0.959 0.960 0.960 0.960 0.958 0.995
cifar100-eyeriss latency 0.462 0.946 0.963 0.963 0.963 0.963 0.963 0.963 0.964 0.963 0.962 0.998
cifar10-eyeriss energy 0.456 0.967 0.985 0.985 0.985 0.985 0.985 0.985 0.985 0.985 0.975 0.996
ImageNet16-120-eyeriss energy 0.458 0.967 0.985 0.985 0.986 0.985 0.986 0.985 0.985 0.986 0.972 0.998
cifar100-eyeriss energy 0.457 0.967 0.985 0.985 0.985 0.986 0.985 0.986 0.986 0.986 0.976 0.998
cifar10-fpga energy 0.458 0.973 0.987 0.987 0.987 0.987 0.987 0.987 0.987 0.987 0.986 0.999
cifar100-fpga energy 0.458 0.973 0.987 0.987 0.987 0.987 0.987 0.987 0.987 0.987 0.986 0.999
cifar100-fpga latency 0.457 0.973 0.987 0.987 0.987 0.987 0.987 0.987 0.987 0.987 0.986 0.999
cifar10-fpga latency 0.457 0.973 0.987 0.987 0.987 0.987 0.987 0.987 0.987 0.987 0.986 0.999
Table 3: Kendall’s Tau test correlation for Linear Regression, XGBoost, and Lookup Table (LUT)
on all HW-NAS-Bench datasets (rows), for different amounts of available training data (columns),
tested on the remaining 625 samples. The Lookup Table model is tested on all 15625 architectures.
We selected the five data sets at the top.
Columns 1–10: Linear Regression, trained on 9 / 18 / 34 / 65 / 123 / 234 / 442 / 837 / 1585 / 2999 samples. Column 11: XGBoost, trained on 2999 samples. Column 12: Lookup Table (LUT).
jigsaw 0.201 0.227 0.410 0.535 0.586 0.605 0.616 0.624 0.631 0.632 0.661 0.201
class object 0.268 0.262 0.518 0.646 0.711 0.741 0.759 0.771 0.780 0.780 0.828 0.701
room layout 0.275 0.271 0.527 0.653 0.721 0.753 0.768 0.780 0.789 0.789 0.896 0.685
class scene 0.275 0.268 0.527 0.653 0.721 0.755 0.768 0.782 0.789 0.790 0.907 0.710
segmentsemantic 0.282 0.259 0.545 0.684 0.746 0.780 0.798 0.809 0.816 0.818 0.871 0.726
Table 4: Kendall’s Tau test correlation for Linear Regression and XGBoost on the five used
TransNAS datasets (rows), for different amounts of available training data (columns), tested on
the remaining 256 samples. The Lookup Table model (LUT) is tested on all 3256 architectures.
**HW-NAS-Bench:** To select five datasets that are (1) non-linear and (2) different from one another, we first fit Linear Regression to every available dataset, with the results listed in Table 3. The bottom 12 datasets can be accurately fit with only 25 training samples, so they are not a very interesting challenge. On these datasets, the Lookup Table model achieves exceptional performance. Since the networks for CIFAR10, CIFAR100 and ImageNet16-120 only differ slightly, their measurements on the same device and metric (e.g. raspi4 latency) are very similar. To improve the generalizability of our results, we thus select datasets on different devices and metrics, which are listed at the top of Table 3. As displayed in Figure 6, their data distributions are generally different.
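For reference, a minimal sketch of how one such correlation entry can be computed for a single dataset (variable names and the split handling are ours; we assume one-hot encoded architectures X and measured metric values y):

```python
from scipy.stats import kendalltau
from sklearn.linear_model import LinearRegression

def linear_regression_kendall_tau(x_train, y_train, x_test, y_test):
    """Fit Linear Regression on one-hot encoded architectures and return Kendall's Tau
    between the predicted and the measured metric values on the held-out test set."""
    model = LinearRegression().fit(x_train, y_train)
    tau, _ = kendalltau(model.predict(x_test), y_test)
    return tau
```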
**TransNAS-Bench-101:** Since the latency measurements of the architectures are generally very similarly distributed (see Figure 7), it is not necessary to train the predictors on all of them. We select all data sets that provide the test loss and inference time attributes for all architectures, resulting in exactly the five datasets listed in Section 4 (the other two datasets contain more specific test losses).
[Histograms of occurrences over measured values for the five selected HW-NAS-Bench datasets: ImageNet16-120-raspi4_latency, cifar100-pixel3_latency, cifar10-edgegpu_latency, cifar100-edgegpu_energy, and ImageNet16-120-eyeriss_arithmetic_intensity.]
Figure 6: How the data of each selected HW-NAS-Bench dataset is distributed (not normalized).
[Histograms of occurrences over measured latencies for the five selected TransNAS-Bench-101 datasets: class_object, class_scene, jigsaw, room_layout, and segmentsemantic.]
Figure 7: How the data of each selected TransNAS-Bench-101 dataset is distributed (not normalized). Since all architectures are measured for latency on the same hardware, the resulting datasets
are much less diverse than the HW-NAS-Bench ones.
F PREDICTOR FIT TIME
[Plots of the time to fit each predictor (absolute, and centered on the average) over the training set size (roughly 10^1 to 10^4 samples), averaged over the HW-NAS and the TransNAS datasets. The legend covers Lin. Reg., Bayes. Lin. Reg., Ridge Reg., XGBoost, NGBoost, LGBoost, Random Forests, Sparse GP, GP, BOHAMIANN, SVM Reg., NAO, GCN, BONAS, BANANAS, MLP (large), and MLP (small).]
Figure 8: Fit time (in seconds) of predictors to data, depending on the training set size. By far the
most expensive methods are network-based. However, a significant portion of this time is spent on
the hyper-parameter optimization prior to the actual fitting.
G APPROXIMATING PREDICTOR MISTAKES
[Three histograms of predictor deviations (density over deviation of the predictions), each overlaid with a fitted normal distribution of std 0.309, 0.348, and 0.456, respectively.]
Figure 9: Further examples of predictor deviation distributions, as visualized in the center of Figure 2. Left: Linear Regression on CIFAR100, edgegpu, energy consumption. Center: Support
Vector Machine on Jigsaw. Right: small MLP on ImageNet16-120, raspi4, latency.
Intuitively, the predictor deviation distributions (see Figures 2 and 9) generally resemble a normal
distribution. However, most predictors:
(1) Have a notable peak, sometimes off-center (e.g. at x=0.2)
(2) Have less density than a normal distribution almost everywhere else
(3) Have some outliers (e.g. at x>1.5) that are extremely unlikely for a normal distribution
Every time we evaluated a predictor, we measured the p-value of different distributions on the first 100 test samples using a t-test. The average statistics can be found in Table 5. Since a large number of empirical observations generally pushes the p-value towards 0, these values only serve to compare the distributions to each other. We find that the outliers (3) appear often enough, and are so unlikely under a normal distribution, that even a uniform distribution has higher statistical support. Consequently, we approximate the common predictor deviations by sampling from a mixed distribution that addresses (1) to (3).
| Distribution | p-value |
|---|---|
| normal | 0.028 |
| cauchy | 0.030 |
| lognorm | 0.028 |
| t | 0.028 |
| uniform | 0.037 |

Table 5: P-values of different distributions, trying to fit the distribution of all predictor mistakes according to a t-test. Larger values are better, but comparing many empirically sampled points with a true density function tends to push the p-values to 0.
This mixed distribution consists of two Normal distributions (N1, N2) and one Uniform distribution (U), from which we sample with 72.5%, 26.5% and 1% probability, respectively (a minimal sketch of the procedure follows the list below). For some constant v:
_• We uniformly sample a shift c from [0, 2 · v], that is used to push the centers of N1 and N2_
to x > 0 and x < 0 respectively.
_• We sample each value from N1(c, v), N2(−c, 3 · v), and U1(−15 · v, 15 · v) randomly, with the weighting given above.
_• We normalize (subtract mean, divide by standard deviation) our sampled distribution and_
then scale it to the desired standard deviation.
_• The predictors produce non-smooth distributions. We simulate that by sampling 15 times fewer values than needed and repeating each of them 15 times.
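A minimal sketch of this sampling procedure (the function name, the default value of v, and the exact normalization order are our own choices; the provided simulation code is authoritative):

```python
import numpy as np

def sample_simulated_deviations(n, target_std, v=0.25, seed=None):
    """Sample n simulated predictor deviations from the mixed distribution described above."""
    rng = np.random.default_rng(seed)
    c = rng.uniform(0, 2 * v)                 # shift that pushes N1 to x > 0 and N2 to x < 0
    m = -(-n // 15)                           # sample 15 times fewer values (ceiling division) ...
    component = rng.choice(3, size=m, p=[0.725, 0.265, 0.01])
    values = np.where(
        component == 0, rng.normal(c, v, m),
        np.where(component == 1, rng.normal(-c, 3 * v, m), rng.uniform(-15 * v, 15 * v, m)),
    )
    values = np.tile(values, 15)[:n]          # ... and repeat them to get a non-smooth distribution
    values = (values - values.mean()) / values.std()  # normalize,
    return values * target_std                         # then scale to the desired std
```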
The code for the simulation is also provided (see Appendix A). As seen in Figure 10, the resulting simulated deviation distributions generally resemble a common predictor pattern. We do not account for differences in predictors, training set sizes, or other factors, since that may become too specific and over-engineered.
Appendix I visualizes simulation sanity checks. We find that the simulation is slightly pessimistic
and simplified, but resembles the results of actual predictors.
[Three histograms of simulated predictor deviations (density over deviation of the simulated predictions), each generated by the mixed distribution with std=0.5 and overlaid with a normal fit of std=0.500.]
Figure 10: The sampled values of the Gaussian+uniform mixture fit the measured predictor mistakes better than a single distribution, as they are roughly normally distributed but include outliers.
H MEASURING SIMULATED MISTAKES
[Two scatter plots over normalized ImageNet16-120-raspi4_latency, marking all architectures, the true and the predicted/discovered Pareto front, and the selected architectures. The annotated example reports HV=2.93 for the true Pareto front, HV=2.67 for the discovered Pareto front, and MRAall = 3.22%, MRApareto = 3.77% for the selected architectures.]
Figure 11: Similar to Figure 3. When the discovered Pareto set is considerably worse than the true
Pareto set, it is possible for the Mean Reduction of Accuracy of the Pareto subset (MRApareto) to be
_worse than the average over all architectures (MRAall). This naturally happens more frequently for_
worse predictors with a high sampling std. and low KT correlation. Consequently, the difference
between MRAall and MRApareto is wider for better predictors (see Figure 4). Additionally, all of
the selected non-Pareto-front members are clustered in a high-latency area and redundant with each
other. This emphasizes the limitations of just considering drops in accuracy, as the hardware metric
aspect is ignored. In this case, the predictor-guided selection failed to find a low-latency solution.
In this regard, hypervolume is a better but less intuitive metric.
[Two schematic plots of accuracy [%] over a hardware metric. The left panel shows a true Pareto front (A1 to A5), an actually selected architecture C1, and its accuracy and HW-metric differences to the front. The right panel shows a Pareto front, its hypervolume, and the reference point (highest hardware metric value of the front +10%, accuracy 0).]
Figure 12: Examples to explain measurement methods.
**Left:** The distance of each selected candidate architecture C1 to the true Pareto front is measured, for accuracy and the hardware metric. C1 is dominated by A2, A3, and A4 of the true Pareto set. A2 has a slightly higher accuracy than C1 while being much better on the hardware metric, e.g. latency. A4 has a slightly better hardware metric value, but much higher accuracy. Given several candidate architectures, their differences are averaged.
**Right:** We compute the reference point for the hypervolume (for two objectives: the area under a Pareto front) by multiplying the highest hardware metric value of the true Pareto front by 1.1 and setting the accuracy to 0. While we are consistent throughout all experiments, this choice is arbitrary, as there is no obviously correct choice for the reference point. If the hypervolume of a supposed Pareto front is computed, the reference point of the true Pareto front is reused. Thus, choosing inferior architectures will always reduce the hypervolume. We arbitrarily chose the multiplier m = 1.1 as a middle ground between making the rightmost point of the Pareto front irrelevant (m = 1.0) and overemphasizing it (m ≫ 1.0).
I SIMULATION SANITY CHECK
[Scatter plots of Kendall's Tau over the standard deviation of the prediction deviations, for ImageNet16-120-eyeriss_arithmetic_intensity, ImageNet16-120-raspi4_latency, cifar10-edgegpu_latency, cifar100-edgegpu_energy, cifar100-pixel3_latency, and their mean. The correlation between the deviation std. and Kendall's Tau is KT=-0.75, SCC=-0.88, PCC=0.77; the simulated std./KT pairs range over 0.0/1.00, 0.2/0.88, 0.4/0.78, 0.6/0.69, 0.8/0.62, and 1.0/0.57.]
Figure 13: Standard deviation over the predictor deviations (x axis) and Kendall’s Tau correlation
(y axis), for the trained predictors on HW-NAS-Bench (left) and in simulation (right). The simulated
predictor inaccuracies are slightly pessimistic (low KT), but still match the true values.
Std. of the XGB predictor's deviation distribution, depending on how often each candidate operation occurs in the architecture:

| Candidate | not at all | exactly once | exactly twice |
|---|---|---|---|
| Zero | 0.541 | 0.443 | 0.356 |
| Skip | 0.532 | 0.436 | 0.412 |
| Conv1x1 | 0.462 | 0.470 | 0.393 |
| Conv3x3 | 0.146 | 0.403 | 0.565 |
| Pool | 0.446 | 0.411 | 0.477 |

All candidates (entire test set): std = 0.445.

Table 6: How a trained XGB predictor deviates from the ground-truth values for different architecture subsets, akin to Figure 2. While the subset distributions are not exactly the same, they still resemble the distribution over the entire test set (all candidates, 625 samples). One noteworthy exception is when no Conv3x3 operations are used at all, in which case the standard deviation is considerably smaller.