MODEL VALIDATION USING MUTATED TRAINING LABELS: AN EXPLORATORY STUDY
**Anonymous authors**
Paper under double-blind review
ABSTRACT
We introduce an exploratory study on Mutation Validation (MV), a model validation
method using mutated training labels for supervised learning. MV mutates training
data labels, retrains the model against the mutated data, then uses the metamorphic
relation that captures the consequent training performance changes to assess model
fit. It does not use a validation set or test set. The intuition underpinning MV
is that overfitting models tend to fit noise in the training data. We explore 8
different learning algorithms, 18 datasets, and 5 types of hyperparameter tuning
tasks. Our results demonstrate that MV is accurate in model selection: the model
recommendation hit rate is 92% for MV and less than 60% for out-of-sample validation. MV also provides more stable hyperparameter tuning results than
out-of-sample validation across different runs.
1 INTRODUCTION
Out-of-sample validation (such as test accuracy) is arguably the most popular approach adopted by
researchers and developers for empirically validating models in applied machine learning. It uses data
different from the training data to approximate future unseen data. However, out-of-sample validation
is widely acknowledged to have limitations: 1) the sample set may be too small to represent the data
distribution; 2) the accuracy can have a large variance across different runs (Pham et al., 2020); 3) the
samples are typically randomly selected from the collected data, and may therefore have similar bias
as in the training data, leading to an inflated validation score (Piironen & Vehtari, 2017; Gronau &
Wagenmakers, 2019); 4) excessive reuse of a fixed set of samples can lead to overfitting even if the
samples are held out and not used in the training process (Feldman et al., 2019).
The Mutation Validation (MV) approach we explore is a new approach to validating machine learning
models relying only upon training data. MV applies Mutation Testing and Metamorphic Testing
—two software engineering techniques that validate code correctness. Mutation testing mutates the
program by making synthetic changes (e.g., changing a + b into a − b, or removing a function call), then re-executes the tests to monitor the behaviour changes and check the power of a test
suite (Papadakis et al., 2019; Jia & Harman, 2010). Metamorphic Testing detects program errors
by checking metamorphic relations, which is the relationship between input changes and output
changes (Chen et al., 2020; Segura et al., 2016). Combining these two techniques, MV mutates training
data labels and retrains the model using the mutated data, then measures the training performance
change. As shown by Figure 1, the key intuition is that a learner, if fitting the given training data
properly, would be less likely to be ‘fooled’ by a small ratio of mutated labels. Consequently, the
model trained with the mutated training data would “detect” the mutated labels and still exhibit
high predictive performance on the original training data. By contrast, an overfitted learner violates
Occam’s Razor (Hawkins, 2004): it has extra capacity to fit incorrect noisy labels, and thus
will yield a model that exhibits poor predictive performance on the original training data, but high
performance on the mutated training data. Furthermore, an over-simple learner has poor learnability,
and will have low performance on the original training data and the mutated data.
There are a significant number of theories proposed to explain model validation and complexity, such
as Rademacher complexity and VC dimension. Nevertheless, the prescriptive and descriptive value
of these theories remains debated. Zhang et al. (2017) found that deep hypothesis spaces can be
large enough to memorise random labels. They discussed the limitations of existing measurements
in explaining the generalisation ability of large neural networks, and called for new measurements.
Figure 1: The intuition underpinning MV: A better learner is less affected by the mutated labels.

In this present work, we do not compare MV with these theories. Rather, similar to Zhang et al. (2017), we focus on empirical investigation. In Appendix A.1, we provide the theoretical foundations underpinning the metamorphic relation that MV uses.
We report on the performance of MV on 12 open datasets and 6 synthetic datasets with different
known data distributions (see Table 1), using 8 widely-adopted classifiers (including both classic
learning classifiers and deep learning classifiers). We investigate the effectiveness and stability of MV
serving as a complementary measure to the existing practical model validation methods for model
selection and parameter tuning. The experimental results provide evidence to support the following conclusions. First, MV captures well the degree of match between decision boundaries and data patterns in model selection. The model recommendation hit rate for MV is 92%, but is 53% for cross validation, and 55% for test accuracy. Second, MV is more responsive to changes in capacity than
conventional validation methods. When cross validation (CV) accuracy and test accuracy could not
distinguish among large-capacity hyperparameters, MV complements them in hyperparameter tuning.
**Third, MV is stable in model validation results. Its dropout rate tuning result does not change across**
five runs; the variance is zero. For validation set and test set, the average variance is 0.003 and 0.007,
respectively. The paper also discusses the connections between MV and other noise injection work in
the literature, as well as the usage scenarios of MV (Section 5).
In summary, we make the following primary contributions:
**1) An exploratory study on the performance of Mutation Validation. We explore the effectiveness**
and stability of MV to validate machine learning (ML) models as a complement to the currently used
empirical model validation methods. MV requires neither validation nor test sets, but uses the training
performance sensitivity to the mutated training labels. We study 18 datasets, 8 different learning
algorithms, and 5 hyperparameter tuning tasks.
**2) An application of software testing techniques in ML model validation. MV is the first approach**
that applies mutation testing and metamorphic testing, two widely studied software testing techniques, to model validation tasks.
2 MUTATION VALIDATION
2.1 GENERAL INTUITION
Figure 1 illustrates the intuition that underpins MV. For a ‘good’ learner that fits the training data
well (first column of Figure 1), the learner is less likely to be ‘fooled’ by a small number of mutated
labels, and would keep predicting the original labels for the mutated samples. As a result, the model
trained on the mutated data will still have high predictive accuracy on the original training data, but
will have decreased predictive accuracy on the mutated data. An overfitted learner tends to fit noise in
the training data (second column of Figure 1). With mutated data, the learner will yield a model that
makes predictions following the incorrect labels, leading to a high training accuracy on the mutated
data, but poor accuracy on the original data. An over-simple learner has poor learnability. The model
it yields has poor performance with or without the mutated training data. As a result, the model
trained on the mutated data has low accuracy with both the original labels and the mutated labels.
2.2 MUTATION VALIDATION
MV uses mutation testing and metamorphic testing, two software validation techniques, to validate
ML models. Mutation testing creates mutants by injecting faults in a program, then re-executes
the program to check whether a test suite detects those faults. A metamorphic relation specifies
how a change in the input should result in a change in the output. It is used to detect errors in
software when there are no reliable oracles. For a program f and inputs x and x′, let f(x) and
f(x′) be the execution outputs of x and x′ against f. Let R_i be the relationship between x and
x′, and R_o the relationship between f(x) and f(x′). A metamorphic relation can be represented as
R_i(x, x′) ⇒ R_o(f(x), f(x′)). If the relation is violated, the program under test contains a bug. For
example, when validating the sin mathematical function, one metamorphic relation that a correct
program should hold is sin(x + π) = − sin(x).
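As a small illustration (not from the paper), the sketch below checks this metamorphic relation for Python's built-in `math.sin` on random inputs; the number of trials and the tolerance are arbitrary choices made here to absorb floating-point error.

```python
import math
import random

def check_sin_metamorphic_relation(trials=1000, tol=1e-9):
    """Check the metamorphic relation sin(x + pi) == -sin(x) on random inputs."""
    for _ in range(trials):
        x = random.uniform(-100.0, 100.0)
        # If the relation is violated beyond the tolerance, the implementation
        # under test is considered buggy.
        if abs(math.sin(x + math.pi) + math.sin(x)) > tol:
            return False
    return True

print(check_sin_metamorphic_relation())  # expected: True for a correct sin
```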
Now consider the scenario of ML model validation. If we treat a learner as the program under test,
the training data as the input, the trained model’s behaviours as the output, our intuition introduced in
Section 2.1 can be well captured by mutation testing and metamorphic relations. In particular, the
input changes are introduced by mutating training data, each mutated data instance is called a mutant;
the output changes are defined in terms of performance changes of the trained models. Based on this,
we propose Mutation Validation, a new machine learning model validation method that validates ML
models based on the relationship between training data changes and training performance changes. A
_good learner is expected to “detect” the mutants and have a certain amount of training performance_
_changes according to the number of input data changes._
There are different metamorphic relations that can be explored to conduct MV, which may depend on the data mutation method, the training performance measurement (e.g., accuracy, precision, loss), and the calculation of performance changes. Let $\eta$ be the mutation degree (i.e., the ratio of randomly mutated labels in the training data). $S$ is the original training data, $S_\eta$ is the mutated training data with mutation degree $\eta$ ($\eta \le 0.5$), $f(S)$ is the model trained on $S$, $f(S_\eta)$ is the model trained on $S_\eta$, and $\widehat{T}_S(f(S))$, $\widehat{T}_S(f(S_\eta))$ are the accuracies of $f(S)$ and $f(S_\eta)$ based on the original training labels, respectively. $\widehat{T}_{S_\eta}(f(S_\eta))$ is the accuracy of $f(S_\eta)$ based on the mutated labels. In the present work, as the first exploratory study on MV, we study the following MV measurement $m$ to conduct model validation:

$$m = (1 - 2\eta)\,\widehat{T}_S(f(S_\eta)) + \widehat{T}_S(f(S)) - \widehat{T}_{S_\eta}(f(S_\eta)) + \eta. \tag{1}$$

The above measurement, although derived from theory (Appendix A.1), matches well with our intuition introduced in Section 2.1. In particular, if the learner is less affected by the mutated labels, the predictive behaviour of the model trained with mutated labels should be closer to that of the model trained with the original labels. This leads to a larger $\widehat{T}_S(f(S_\eta))$ and a larger difference between $\widehat{T}_S(f(S))$ and $\widehat{T}_{S_\eta}(f(S_\eta))$, as long as the mutation degree $\eta$ is fixed. The larger $m$ is, the better the learner fits the training data. In the optimal case, the model trained on mutated data has perfect training accuracy on the original training labels (i.e., $\widehat{T}_S(f(S_\eta)) = 1$) and detects all the mutants (i.e., $\widehat{T}_S(f(S)) - \widehat{T}_{S_\eta}(f(S_\eta)) = \eta$); thus, $m = 1$. The mutation degree $\eta$ ranges between 0 and 0.5, but needs to be a fixed value. The theoretical inspiration for this metric, its metamorphic relation, as well as the influence of $\eta$ on $m$, are provided in our appendix.
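To make Equation 1 concrete, the following is a minimal sketch of how the MV score could be computed for a scikit-learn-style classifier. The function names (`mutate_labels`, `mutation_validation_score`) are ours and not from the paper's artefacts; the label-swapping routine follows the description given later in Section 3.

```python
import numpy as np
from sklearn.base import clone
from sklearn.metrics import accuracy_score

def mutate_labels(y, eta, rng):
    """Label swapping: within each class, replace a proportion eta of the labels
    with the next label in the sorted label list (the last label wraps to the first)."""
    y = np.asarray(y)
    classes = np.sort(np.unique(y))
    next_label = {c: classes[(i + 1) % len(classes)] for i, c in enumerate(classes)}
    y_mut = y.copy()
    for c in classes:
        idx = np.where(y == c)[0]
        chosen = rng.choice(idx, size=int(round(eta * len(idx))), replace=False)
        y_mut[chosen] = next_label[c]
    return y_mut

def mutation_validation_score(learner, X, y, eta=0.2, random_state=0):
    """m = (1 - 2*eta) * T_S(f(S_eta)) + T_S(f(S)) - T_S_eta(f(S_eta)) + eta (Equation 1)."""
    rng = np.random.default_rng(random_state)
    y = np.asarray(y)
    y_mut = mutate_labels(y, eta, rng)
    f_orig = clone(learner).fit(X, y)      # f(S): trained on the original labels
    f_mut = clone(learner).fit(X, y_mut)   # f(S_eta): trained on the mutated labels
    T_S_f = accuracy_score(y, f_orig.predict(X))           # accuracy of f(S) on original labels
    T_S_feta = accuracy_score(y, f_mut.predict(X))         # accuracy of f(S_eta) on original labels
    T_Seta_feta = accuracy_score(y_mut, f_mut.predict(X))  # accuracy of f(S_eta) on mutated labels
    return (1 - 2 * eta) * T_S_feta + (T_S_f - T_Seta_feta) + eta
```

Model or hyperparameter selection then amounts to comparing these scores across candidates, as the experiments in Section 4 do.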
3 EXPERIMENTAL SETUP
The main body of this paper answers four research questions:
**_RQ1: What is the effectiveness of MV in model selection?_**
**_RQ2: What is the effectiveness of MV in hyperparameter tuning?_**
**_RQ3: What is the stability of MV in validating machine learning models?_**
**_RQ4: How does training data size affect MV?_**
We also explore the efficiency of MV in validating machine learning models, as well as the influence
of mutation degree η on MV. The details are in the appendix.
To evaluate MV, we choose to use datasets that are diverse in category, size, class number, feature
number, and class balance situations. Small datasets are particularly important to demonstrate the
| Dataset | abbr. | #training | #test | #class | #feature |
|---|---|---|---|---|---|
| synthetic-moon | moon | 100 – 1e+6 | 2,000 | 2 | 2 |
| synthetic-moon (0.2 noise) | moon-0.2 | 100 – 1e+6 | 2,000 | 2 | 2 |
| synthetic-circle | circle | 100 – 1e+6 | 2,000 | 2 | 2 |
| synthetic-circle (0.2 noise) | circle-0.2 | 100 – 1e+6 | 2,000 | 2 | 2 |
| synthetic-linear | linear | 100 – 1e+6 | 2,000 | 2 | 2 |
| synthetic-linear (0.2 noise) | linear-0.2 | 100 – 1e+6 | 2,000 | 2 | 2 |
| Iris | iris | 150 | – | 3 | 4 |
| Wine | wine | 178 | – | 3 | 13 |
| Breast Cancer Wisconsin | cancer | 569 | – | 2 | 9 |
| Car Evaluation | car | 1,728 | – | 4 | 6 |
| Heart Disease | heart | 303 | – | 5 | 14 |
| Bank Marketing | bank | 45,211 | – | 2 | 17 |
| Adult | adult | 48,842 | 16,281 | 2 | 14 |
| Connect-4 | connect | 67,557 | – | 2 | 42 |
| MNIST | mnist | 60,000 | 10,000 | 10 | – |
| fashion MNIST | fashion | 60,000 | 10,000 | 10 | – |
| CIFAR-10 | cifar10 | 50,000 | 10,000 | 10 | – |
| CIFAR-100 | cifar100 | 50,000 | 10,000 | 100 | – |

Table 1: Details of the datasets.
ability of MV in providing warnings for over-complex learners. Table 1 shows the details of each
dataset used to evaluate MV. Column “#training” and Column “#test” show the size of training data
and test data, which is presented by the dataset providers. Column “#class” shows the number of
classes (or labels) for each dataset. Column “#feature” presents the number of features.
To obtain datasets with known ground-truth decision boundaries for validating model fitting, we use
synthetic datasets with three types of data distributions: moon, circle, and linearly-separable. These
three data distributions are provided by scikit-learn (Pedregosa et al., 2011) tutorial to demonstrate
the decision boundaries of different classifiers. To study the influence of original noise in training data
on MV, for each type of distribution, we create datasets with noise. We also generate different-size
training data ranging from 100 to 1 million data points to study the influence of training data size.
These synthetic datasets help check whether MV identifies the right model whose decision boundary
matches the data distribution, with and without noise in the original dataset S. We do not expect such
synthetic datasets to reflect real-world data, but the degree of control and interpretability they offer
allows us to verify the behaviour of MV with a known ground-truth for model selection.
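As an illustration, the three distributions can be generated with the scikit-learn dataset generators used by the classifier-comparison tutorial; the exact generator arguments (noise handling for the linearly-separable case, random seeds) are not given in the paper, so the values below are assumptions.

```python
from sklearn.datasets import make_moons, make_circles, make_classification

def make_synthetic(distribution, n_samples=100, noise=0.0, seed=0):
    """Generate a 2-feature, 2-class dataset of the requested distribution type."""
    if distribution == "moon":
        return make_moons(n_samples=n_samples, noise=noise, random_state=seed)
    if distribution == "circle":
        return make_circles(n_samples=n_samples, noise=noise, factor=0.5, random_state=seed)
    if distribution == "linear":
        # Linearly-separable data, as in the scikit-learn classifier comparison.
        return make_classification(n_samples=n_samples, n_features=2, n_redundant=0,
                                   n_informative=2, n_clusters_per_class=1,
                                   random_state=seed)
    raise ValueError(f"unknown distribution: {distribution}")

# e.g. the moon-0.2 training set of Table 1 with 100 points
X_train, y_train = make_synthetic("moon", n_samples=100, noise=0.2)
```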
We also report results on 12 real-world widely-adopted datasets with different sizes, numbers of
features and classes. Eight of them are from the UCI repository (Asuncion & Newman, 2007), the
remaining four are the most widely used image datasets: MNIST, fashionMNIST, CIFAR-10, and
CIFAR-100. For each dataset, we calculate MV with η = 0.2 to answer the research questions. That
is, under each class of the dataset, we randomly select 20% of the labels to mutate. This guarantees
that MV is not affected by the problem of class imbalance. We mutate the labels by label swapping: each mutated label is replaced with the next label in the label list, and the final label in the list wraps around to the first. For the synthetic datasets and the 8 UCI datasets, we compare MV with 3-fold CV accuracy and test accuracy. We use 3-fold CV because its results are almost identical to cross validation with more folds, while having a lower cost, which makes it a more conservative baseline when studying the efficiency of MV. For the three image datasets with deep learning models, we compare with validation accuracy computed on 20% validation data (split out from the training data) and with test accuracy.
The next section introduces more configuration details.
4 RESULTS
4.1 RQ1: EFFECTIVENESS OF MV IN MODEL SELECTION
To answer RQ1, we use synthetic datasets to explore whether MV recommends the learners whose
decision boundaries best match the data patterns. We use synthetic datasets because synthetic datasets
have ground-truth data patterns for model selection (more details in Section 3). The data distribution
of real-world UCI datasets, however, is often unknown or difficult to visualise. To generate synthetic
datasets, we use three scikit-learn (Pedregosa et al., 2011) synthetic dataset distributions, which
are designed as a tutorial to illustrate the nature of decision boundaries of different classifiers. This
experiment uses the same settings as in the Scikit-learn tutorial (Classifier Comparison, 2019). Each
dataset has 100 training data points for model selection. We generate another 2,000 points as test sets.
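The comparison in Figure 2 can then be sketched as follows (an illustrative reconstruction rather than the authors' code): each classifier is scored by MV on the 100 training points, by 3-fold CV, and by accuracy on the 2,000 held-out points, and the top-2 classifiers per criterion are compared against the ground-truth decision boundary. The classifier hyperparameters below follow the scikit-learn tutorial defaults and are assumptions; `mutation_validation_score` is the sketch given after Equation 1, and `X_train, y_train, X_test, y_test` come from the synthetic generator above.

```python
from sklearn.svm import SVC
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score

classifiers = {
    "Linear SVM": SVC(kernel="linear", C=0.025),
    "RBF SVM": SVC(gamma=2, C=1),
    "Gaussian Process": GaussianProcessClassifier(),
    "Decision Tree": DecisionTreeClassifier(max_depth=10),
    "Random Forest": RandomForestClassifier(max_depth=10, n_estimators=10),
    "AdaBoost": AdaBoostClassifier(),
    "Naive Bayes": GaussianNB(),
}

scores = {}
for name, clf in classifiers.items():
    mv = mutation_validation_score(clf, X_train, y_train, eta=0.2)
    cv = cross_val_score(clf, X_train, y_train, cv=3).mean()
    test = accuracy_score(y_test, clf.fit(X_train, y_train).predict(X_test))
    scores[name] = (mv, cv, test)

# Recommend the top-2 classifiers according to each criterion.
for i, criterion in enumerate(["MV", "CV", "Test"]):
    top2 = sorted(scores, key=lambda name: scores[name][i], reverse=True)[:2]
    print(criterion, "recommends:", top2)
```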
Figure 2 shows the training data points, the decision boundaries of each classifier, and the measurement values from MV, CV, and test accuracy (on the 2,000 test data points).
Figure 2: Performance of MV in model selection. MV, CV, Test denote MV score, CV accuracy, and test accuracy. Red and blue points are the original training data without noise (top three rows) and with 0.2 noise (bottom three rows). Areas with different colours show the decision boundaries. We observe that MV captures well the match between decision boundaries and data patterns. The MV/CV/Test values shown in the figure are:

| Noise | Dataset | Linear SVM | RBF SVM | Gaussian Process | Decision Tree | Random Forest | AdaBoost | Naive Bayes |
|---|---|---|---|---|---|---|---|---|
| 0.0 | moon | 0.89/0.86/0.86 | 1.00/1.00/1.00 | 1.00/0.99/1.00 | 0.74/0.96/0.99 | 0.69/0.96/0.98 | 0.75/0.95/0.98 | 0.89/0.88/0.89 |
| 0.0 | circle | 0.53/0.54/0.50 | 1.00/1.00/1.00 | 0.88/1.00/1.00 | 0.69/1.00/1.00 | 0.71/0.99/1.00 | 0.72/1.00/0.99 | 1.00/1.00/1.00 |
| 0.0 | linear | 0.93/0.92/0.90 | 0.88/0.94/0.90 | 0.91/0.93/0.90 | 0.67/0.90/0.88 | 0.66/0.91/0.89 | 0.77/0.92/0.89 | 0.94/0.93/0.90 |
| 0.2 | moon | 0.86/0.85/0.86 | 0.94/0.97/0.96 | 0.95/0.95/0.96 | 0.67/0.91/0.94 | 0.70/0.88/0.94 | 0.78/0.90/0.96 | 0.90/0.89/0.87 |
| 0.2 | circle | 0.53/0.54/0.50 | 0.88/0.90/0.86 | 0.82/0.91/0.87 | 0.74/0.88/0.82 | 0.69/0.90/0.85 | 0.71/0.87/0.83 | 0.91/0.91/0.85 |
| 0.2 | linear | 0.82/0.81/0.82 | 0.77/0.80/0.79 | 0.86/0.85/0.81 | 0.69/0.74/0.76 | 0.66/0.79/0.77 | 0.74/0.77/0.78 | 0.83/0.81/0.81 |

Each cell gives MV/CV/Test.
We make the following observations. First, MV tends to have large values for cases where the decision boundaries match the data patterns well; it provides more discriminating scores for model selection across different datasets, making it easy to pick out the top-2 models that best match the data distributions based on MV. Second, the models recommended by MV are less affected by the noise in the training data.
The figure also shows cases where CV and test accuracy have limitations in evaluating ML models. For example, in Figure 2, for the circle distribution, Decision Trees and Random Forests (with a maximum depth of 10) give obviously ill-fitted, rectangle-shaped decision boundaries, yet the cross validation accuracy and test accuracy remain high. In addition, from Table 2, it is more difficult to select proper models based on CV and test accuracy, because CV and test accuracy are often very similar across different models. For ease of observation, we present Table 2 to show the models recommended by MV, CV, and test accuracy, based on their top-2 scores shown in Figure 2. The ground-truth models in bold are based on manual observation and widely-adopted ML knowledge.
| Dataset | Method | Noise | Recommended models based on top-2 scores |
|---|---|---|---|
| moon | MV | 0.0 | **RBF SVM**, **Gaussian Process** |
| moon | MV | 0.2 | **RBF SVM**, **Gaussian Process** |
| moon | CV | 0.0 | **RBF SVM**, **Gaussian Process** |
| moon | CV | 0.2 | **RBF SVM**, **Gaussian Process** |
| moon | Test | 0.0 | **RBF SVM**, **Gaussian Process** |
| moon | Test | 0.2 | **RBF SVM**, **Gaussian Process**, AdaBoost |
| circle | MV | 0.0 | **RBF SVM**, **Naive Bayes** |
| circle | MV | 0.2 | **RBF SVM**, **Naive Bayes** |
| circle | CV | 0.0 | **RBF SVM**, Gaussian Process, DT, RF, AdaBoost, **Naive Bayes** |
| circle | CV | 0.2 | Gaussian Process, **Naive Bayes** |
| circle | Test | 0.0 | **RBF SVM**, Gaussian Process, DT, RF, **Naive Bayes** |
| circle | Test | 0.2 | **RBF SVM**, Gaussian Process |
| linear | MV | 0.0 | **Linear SVM**, **Naive Bayes** |
| linear | MV | 0.2 | Gaussian Process, **Naive Bayes** |
| linear | CV | 0.0 | RBF SVM, Gaussian Process, **Naive Bayes** |
| linear | CV | 0.2 | Gaussian Process, **Naive Bayes** |
| linear | Test | 0.0 | **Linear SVM**, RBF SVM, Gaussian Process, RF, **Naive Bayes** |
| linear | Test | 0.2 | **Linear SVM**, Gaussian Process, **Naive Bayes** |

Table 2: Recommended models by MV, CV, and test accuracy. The models in bold are the ground-truth ones whose decision boundaries match the data patterns. The recommendation hit rate is 92% for MV, 53% for CV, 55% for test accuracy.
For example, the Scikit-learn documentation (Classifier Comparison, 2019) mentions that Naive Bayes and Linear SVM are more suitable for linearly-separable data.
Overall, we have the following conclusion: among all the model selection tasks we explored, the hit
rate (i.e., the ratio of recommended models that match the ground truth model) is 92% for MV, 53%
for CV accuracy, and 55% for test accuracy.
4.2 RQ2: EFFECTIVENESS OF MV IN HYPERPARAMETER CONFIGURATION
For models whose capacity increases along with their hyperparameters, we expect their goodness
of model fit to first increase, then peak, before decreasing. The peak indicates the best model fit
assessed according to MV. In RQ2, we assess whether MV matches this pattern. We study five
capacity-related hyperparameters for several widely-adopted algorithms: the maximum depth for a
Decision Tree, C and gamma for a Support Vector Machine (SVM), and the dropout rate and learning
rate for Convolutional Neural Networks (CNNs).
**Maximum Depth for Decision Tree. Figure 3 shows how MV responds to increases in maximum**
depth of Decision Tree for the eight real-world UCI datasets. For small datasets (smaller than 2,000),
we do not split out test data, but take the whole data as training data. For the bank and connect
datasets, we use 80% of the data as test sets. For adult, we use its original test set. We repeat the
experiments 10 times. The yellow shadow in the figure indicates the variance across the 10 runs.
Figure 3: Changes in MV, CV accuracy, and test accuracy when increasing the maximum depth for
Decision Trees. The x-axis ticks for car and connect differ to capture MV’s inflection point. We can
observe that, while CV and test accuracy agree with MV on the key influence point in most cases,
they are less responsive to large depths (which lead to overfitting).
From Figure 3, we make the following observations. First, for 6 out of the 8 datasets, MV increases
then decreases as maximum depth increases, and exhibits a maximum in each curve. This is consistent
with the pattern we expect a good measure to exhibit. For small datasets, large depths yield low
MV scores, whereas large datasets do not. For adult and bank, MV values remain large when the
maximum depth increases. We suspect that this is because, for these two datasets, the training data
size is large enough for the model to obtain resilience against mutated labels. We explore the influence
of training data size in the appendix.
We also observe that MV, CV and test accuracy have similar inflection points, yet MV is more
responsive to depth changes, especially when the model overfits due to large depths. This observation
indicates that MV provides further information to help tune maximum depth when CV and test
accuracy are less able to distinguish multiple parameters. In particular, if a developer uses grid search to select the best maximum depth in the range 5 to 10, we find that grid search suggests depths of 8, 6, and 9 in three runs for the cancer dataset, which are over-complex and unstable. Similar results are observed for other small datasets. With MV, the decreasing trend in this range indicates that there is a simpler model with comparable predictive accuracy but better resilience to label mutation.
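A sketch of how such a grid search could combine MV with CV when tuning the maximum depth (the depth range and the use of 3-fold CV follow the experimental setup; the function itself is illustrative and reuses `mutation_validation_score` from Section 2.2):

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

def tune_max_depth(X, y, depths=range(1, 11), eta=0.2):
    """Return the maximum depth recommended by MV and by 3-fold CV accuracy."""
    mv_scores, cv_scores = {}, {}
    for depth in depths:
        clf = DecisionTreeClassifier(max_depth=depth, random_state=0)
        mv_scores[depth] = mutation_validation_score(clf, X, y, eta=eta)
        cv_scores[depth] = cross_val_score(clf, X, y, cv=3).mean()
    return max(mv_scores, key=mv_scores.get), max(cv_scores, key=cv_scores.get)
```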
**C and gamma for SVM. In SVM, the gamma parameter defines how far the influence of a single**
training example reaches; the C parameter decides the size of the decision boundary margin, behaving
as a regularisation parameter. In a heat map of MV scores as a function of C and gamma, the expectation is that good models should be found close to the diagonal of C and gamma (Scikit-learn:SVM, 2020). Figure 4 presents the heat maps of CV and MV for two datasets. We do not use a held-out test set, to ensure sufficient training data. The upper-left triangle in each sub-figure denotes small complexity; the bottom-right triangle denotes large complexity. In both regions, MV gives low scores.
When comparing CV and MV, MV is more responsive to hyperparameter value changes. With MV scores, it is more obvious that good models can be found along the diagonal of C and gamma. When C and gamma are both large, the CV score is high but the MV score is low; this is an indication that there exists a simpler model with similar test accuracy. In practice, as stated by the Scikit-learn documentation, it is interesting to simplify the decision function with a lower value of C so as to favour models that use less memory and that are faster to predict (Scikit-learn:SVM, 2020).

Figure 4: Influence of SVM parameters on CV and MV (heat maps for the wine and cancer datasets). The horizontal/vertical axis is gamma/C. Good models are expected to be found close to the diagonal. As can be seen, CV has a broad high-valued (bright) region, while MV’s high-valued (bright) region is narrower, showing that MV is more responsive to parameter changes.
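The heat maps can be reproduced with a sketch along the following lines; the logarithmic C and gamma grids are modelled on the scikit-learn RBF-SVM parameters example rather than taken from the paper, and `X, y` denote the wine or cancer training data.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

C_range = np.logspace(-2, 10, 13)
gamma_range = np.logspace(-9, 3, 13)

mv_grid = np.zeros((len(C_range), len(gamma_range)))
cv_grid = np.zeros_like(mv_grid)
for i, C in enumerate(C_range):
    for j, gamma in enumerate(gamma_range):
        clf = SVC(C=C, gamma=gamma)
        mv_grid[i, j] = mutation_validation_score(clf, X, y, eta=0.2)
        cv_grid[i, j] = cross_val_score(clf, X, y, cv=3).mean()
# High MV cells are expected to concentrate near the diagonal of the (C, gamma)
# grid, while CV stays high over a broader region.
```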
**Dropout Rate and Learning Rate for CNN. We use CNN models coming from Keras documentation.**
Validation accuracy is calculated with 80% training data and 20% validation data. Figure 5 shows
the results. We observe that when tuning dropout rate for mnist and fashion-mnist, validation and
test accuracy are less discriminating and provide different tuning results across different runs (more
details in Table 4), yet MV is more discriminating. For dropout rate, MV and test accuracy have different key influence points on cifar10 (0.4 vs. 0.2). This is because there is a big capacity jump
between 0.2 and 0.4. The result suggests that the optimal dropout rate with comparable validation/test
accuracy is between 0.2 and 0.4.
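For the CNN experiments, MV is computed in the same way: one copy of the network is trained on the original labels and one on the mutated labels, and Equation 1 is applied to the training-set accuracies. A heavily simplified Keras sketch follows; the architecture is a placeholder and does not reproduce the Keras-documentation models used in the paper.

```python
import numpy as np
from tensorflow import keras

def build_cnn(dropout_rate, num_classes, input_shape):
    # Placeholder architecture; the paper uses the CNNs from the Keras documentation.
    return keras.Sequential([
        keras.layers.Input(shape=input_shape),
        keras.layers.Conv2D(32, 3, activation="relu"),
        keras.layers.MaxPooling2D(),
        keras.layers.Flatten(),
        keras.layers.Dropout(dropout_rate),
        keras.layers.Dense(num_classes, activation="softmax"),
    ])

def mv_for_dropout(dropout_rate, x, y, y_mut, num_classes, eta=0.2, epochs=10):
    def train_and_predict(labels):
        model = build_cnn(dropout_rate, num_classes, x.shape[1:])
        model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])
        model.fit(x, labels, epochs=epochs, verbose=0)
        return np.argmax(model.predict(x, verbose=0), axis=1)

    preds_orig = train_and_predict(y)      # predictions of f(S)
    preds_mut = train_and_predict(y_mut)   # predictions of f(S_eta)
    T_S_f = np.mean(preds_orig == y)            # accuracy of f(S) on original labels
    T_S_feta = np.mean(preds_mut == y)          # accuracy of f(S_eta) on original labels
    T_Seta_feta = np.mean(preds_mut == y_mut)   # accuracy of f(S_eta) on mutated labels
    return (1 - 2 * eta) * T_S_feta + (T_S_f - T_Seta_feta) + eta
```

Here `y_mut` would be produced by the same label-swapping routine as before (e.g. `mutate_labels` with η = 0.2).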
Figure 5: Influence of CNN’s dropout rate and learning rate on MV and validation/test accuracy.
Overall, we observe that MV is responsive to hyperparameter changes on all the hyperparameter
tuning tasks we explored. CV and test accuracy are less responsive especially for large-capacity
hyperparameters (i.e., large maximum depth for Decision Tree and small dropout rate for CNNs).
This leads to the following issues. First, due to the large variance of their results across different runs, such low responsiveness often leads to unstable hyperparameter recommendations. We further
explore this in Section 4.3. Second, developers may easily choose an over-complex learner, making
the learner: 1) easily biased by incorrect training labels; 2) vulnerable to training data attack; 3)
computationally and memory intensive; 4) more difficult to interpret.
4.3 RQ3: STABILITY OF MV IN MODEL VALIDATION
In the machine learning experimental designs in the literature, data is usually randomly split. The values of MV, CV, and test accuracy may be easily affected by the randomness in model building (Pham et al., 2020), especially when the overall dataset is small or when building deep learning models. As
a result, with different runs, developers may get completely different recommendations for the
best hyperparameter configuration. A good model validation method is expected to be stable in
hyperparameter recommendation results, giving developers clear instructions on the best choice.
To explore the stability of MV, CV, and test accuracy, we run the model building process multiple
times, each time with a different split of training data and validation data (the size of training/validation
data is unchanged). We then record the recommended best hyperparameters during each run under
the scenario of hyperparameter configuration. We use the maximum depth for Decision Tree with
the 5 small UCI datasets as well as the dropout rate for CNN with the 3 image datasets. Table 4
shows the results, which lead to the following conclusion: MV has good stability in recommending
hyperparameters. When recommending maximum depth for Decision Tree, the overall variance is
| Dataset | Method | Recommended maximum depth | Variance |
|---|---|---|---|
| iris | MV | [3, 3, 3, 2, 2, 2, 2, 3, 3, 2] | 0 |
| iris | CV | [7, 6, 6, 3, 3, 9, 3, 7, 5, 8] | 4 |
| wine | MV | [3, 3, 2, 3, 2, 3, 4, 3, 2, 2] | 0 |
| wine | CV | [3, 4, 7, 8, 3, 3, 4, 3, 5, 3] | 3 |
| cancer | MV | [3, 3, 3, 2, 3, 3, 3, 3, 3, 3] | 0 |
| cancer | CV | [5, 7, 3, 4, 4, 4, 7, 4, 3, 6] | 2 |
| car | MV | [7, 7, 7, 7, 7, 7, 7, 7, 7, 7] | 0 |
| car | CV | [11, 19, 19, 15, 17, 13, 11, 13, 13, 15] | 8 |
| heart | MV | [5, 6, 3, 3, 6, 5, 6, 4, 3, 5] | 1 |
| heart | CV | [4, 4, 3, 3, 4, 4, 3, 3, 9, 5] | 3 |
| mean | MV | – | 0.200 |
| mean | CV | – | 4.000 |

| Dataset | Method | Recommended dropout rate | Variance |
|---|---|---|---|
| mnist | MV | [0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2] | 0.000 |
| mnist | Vali. | [0.0, 0.0, 0.2, 0.2, 0.0, 0.0, 0.2, 0.0, 0.0, 0.2] | 0.011 |
| mnist | Test | [0.2, 0.0, 0.0, 0.2, 0.0, 0.0, 0.0, 0.0, 0.2, 0.0] | 0.009 |
| f-mnist | MV | [0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2] | 0.000 |
| f-mnist | Vali. | [0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2] | 0.000 |
| f-mnist | Test | [0.2, 0.0, 0.2, 0.0, 0.0, 0.0, 0.0, 0.2, 0.2, 0.0] | 0.011 |
| cifar10 | MV | [0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4] | 0.000 |
| cifar10 | Vali. | [0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2] | 0.000 |
| cifar10 | Test | [0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2] | 0.000 |
| mean | MV | – | 0.000 |
| mean | Vali. | – | 0.003 |
| mean | Test | – | 0.007 |
Table 4: Recommended hyperparameters across different runs. The third column shows the specific
recommended hyperparameters (we run the datasets 10 times); the last column shows the variance
(the average of the squared differences from the Mean) across different runs. MV is more stable than
CV and test accuracy in recommending hyperparameters.
0.200 for MV, but 4.000 for CV accuracy; when recommending dropout rate for CNN, the overall
variance is 0.000 for MV, 0.003 for validation accuracy, and 0.007 for test accuracy.
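The stability protocol behind Table 4 can be sketched as follows (illustrative; `recommend` stands for any tuning routine, such as the max-depth or dropout-rate search above, and the 80/20 split with per-run seeds is an assumption):

```python
import numpy as np
from sklearn.model_selection import train_test_split

def stability_of_recommendation(recommend, X, y, runs=10):
    """Repeat tuning with a different random split each run and report the variance."""
    recommendations = []
    for seed in range(runs):
        X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=seed)
        recommendations.append(recommend(X_tr, y_tr, X_val, y_val))
    recommendations = np.array(recommendations, dtype=float)
    # Variance = the average of the squared differences from the mean, as in Table 4.
    return recommendations, recommendations.var()
```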
4.4 RQ4: INFLUENCE OF TRAINING DATA SIZE ON MV
We expect that when a learner is over-complex
for the data, adding extra data will improve the
learner’s resilience to mutated labels, thus the
MV score ought also to increase when training
data size increases. Indeed, previously from Figure 3, we have observed that models trained on
large datasets (e.g., bank and adult) tend to be
more resilient to mutated labels for large depths.
Figure 6 shows the results for connect and moon
datasets, indicating that data size plays a role in
determining the trained model’s robustness to
incorrect training labels.
Figure 6: Influence of training data size on MV when increasing maximum depths (horizontal axis) of Decision Trees, for the connect (training sizes 1,000–9,000) and moon (training sizes 100–900) datasets. For a given depth, larger datasets tend to yield larger MV values.
These observations indicate another possible
value in the use of MV, as a complement to
CV and test accuracy. Specifically, where MV is low, the ML engineers have two potential actions
to improve the fitting between the learner and the training data: either to optimise the learner (e.g.,
search for smaller-capacity models with comparable test accuracy), or to optimise the data (e.g.,
increase data size to increase the learner’s resilience to incorrect training labels).
5 DISCUSSION
5.1 CONNECTION WITH RELATED WORK
MV is also connected with several key concepts in the literature. This section discusses the connections
as well as the differences between MV and these concepts.
**Noise injection in training data** has long been recognised as an approach to increase the complexity of training data and reduce overfitting (Holmstrom et al., 1992; Greff et al., 2016). There are three main differences between MV and such random noise injection. 1) Overfitting-prevention noise is often Gaussian noise (Greff et al., 2016), and is added to the training inputs (not labels); on the contrary, MV does not use random noise but uses label swapping, and the mutation is applied only to labels. 2) Conventional noise injection changes the training data, then directly uses the model trained on this noisy training data. MV keeps the original training data, but creates a separate mutated training set for measurement calculation. This mutated training data will not be used to yield any model for real prediction tasks. 3) Conventional noise injection aims to increase the complexity of the training data, to reduce overfitting. MV aims to calculate a model validation measurement score, to measure
the goodness of model fitting. Zhang et al. (2017) add noise to training data labels to study the generalisation of DNNs, while MV mutates labels to provide a model validation measurement.
**Noise injection in test data is adopted to evaluate the robustness of a model. For example, the**
generation of adversarial examples (Goodfellow et al., 2014) uses noise injection in the test inputs.
Compared to this technique, MV 1) mutates training data, not test data; 2) mutates labels, not features;
3) aims to validate model fitting, rather than model robustness to feature perturbations.
**Overfitting prevention refers to the techniques adopted in the training process to avoid overfitting,**
especially when training deep neural networks. The key techniques are regularisation, early stopping,
ensembling, dropout, and so on. However, as shown in this paper, conventional overfitting detection
techniques, such as CV or validation accuracy, are often less responsive to overfitting. In practice,
developers often conduct the prevention without knowing whether the overfitting happens or not. MV
is demonstrated to be discriminating and stable in detecting over-complex learners. It thus provides
signals for the adoption and configuration of these overfitting prevention techniques.
**Rademacher complexity. In statistical machine learning, Rademacher complexity has been used to**
measure the complexity of a learner’s hypothesis space. It also mutates training data. It measures how
well the learner correlates with randomly generated labels on the training data, but can be difficult to
compute (Rosenberg & Bartlett, 2007). Different from Rademacher complexity, MV is an applied
tool, not a theoretical tool. MV uses label mutations for a part of the data. Furthermore, Rademacher
complexity cares only about the training accuracy on the mutated data, but MV uses the accuracy
changes on the original and the mutated data.
5.2 THE USAGE OF MV IN PRACTICE
With our exploratory study, we find that MV is capable of complementing existing model validation
techniques. There are two main application scenarios of MV: 1) When out-of-sample validation
results are similar or unstable across different models or hyperparameters, MV can help to guide the
selection process (see more in Section 4.2 and Section 4.3). 2) When the training data size is limited,
MV is a better option than validation accuracy because it does not need to split out a validation set,
thus reserving more data for training.
Test accuracy, despite the limitations shown in the literature and in our experiments, is still a very important method for developers to assess the generalisation ability of a model, especially when there is a data distribution shift between the training data and unseen data. However, as shown in this paper, when test accuracy is high but MV is low, this means that there exists a simpler learner with comparable test accuracy.
MV deserves more attention from developers when simplicity, security (e.g., defending against training data attacks), and interpretability of the built models are required. Theoretically, MV can be applied to other ML tasks that rely on out-of-sample validation, such as feature selection. We will explore the effectiveness of MV in these tasks in future work.
6 CONCLUSION
We introduced an exploratory study on MV, a new approach to assessing how well a learner fits the given training data. MV validates a model by checking the learner’s sensitivity to training label changes (expressed as label mutants). This sensitivity is captured by metamorphic relations. We show that MV is more effective and stable than the currently adopted CV, validation accuracy, and test accuracy. It is also responsive to model capacity and training data characteristics. These results provide evidence that MV complements existing model validation practices. We hope the present paper will serve as a starting point for future contributions.
REFERENCES
Arthur Asuncion and David Newman. UCI machine learning repository, 2007.
Tsong Y Chen, Shing C Cheung, and Shiu Ming Yiu. Metamorphic testing: a new approach for
generating next test cases. arXiv preprint arXiv:2002.12543, 2020.
[Classifier Comparison. Classifier Comparison, 2019. URL https://scikit-learn.org/stable/auto_](https://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html)
[examples/classification/plot_classifier_comparison.html.](https://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html)
Vitaly Feldman, Roy Frostig, and Moritz Hardt. The advantages of multiple classes for reducing
overfitting from test set reuse. In International Conference on Machine Learning, pp. 1892–1900.
PMLR, 2019.
Aritra Ghosh, Himanshu Kumar, and PS Sastry. Robust loss functions under label noise for deep
neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 31,
2017.
Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial
examples. arXiv preprint arXiv:1412.6572, 2014.
Klaus Greff, Rupesh K Srivastava, Jan Koutník, Bas R Steunebrink, and Jürgen Schmidhuber. LSTM:
A search space odyssey. IEEE transactions on neural networks and learning systems, 28(10):
2222–2232, 2016.
Quentin F Gronau and Eric-Jan Wagenmakers. Limitations of bayesian leave-one-out cross-validation
for model selection. Computational brain & behavior, 2(1):1–11, 2019.
Douglas M Hawkins. The problem of overfitting. Journal of chemical information and computer
_sciences, 44(1):1–12, 2004._
Lasse Holmstrom, Petri Koistinen, et al. Using additive noise in back-propagation training. IEEE
_transactions on neural networks, 3(1):24–38, 1992._
Yue Jia and Mark Harman. An analysis and survey of the development of mutation testing. IEEE
_transactions on software engineering, 37(5):649–678, 2010._
Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning.
MIT press, 2018. Second edition.
Mike Papadakis, Marinos Kintis, Jie Zhang, Yue Jia, Yves Le Traon, and Mark Harman. Mutation
testing advances: an analysis and survey. In Advances in Computers, volume 112, pp. 275–378.
Elsevier, 2019.
F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer,
R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and
E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research,
12:2825–2830, 2011.
Hung Viet Pham, Shangshu Qian, Jiannan Wang, Thibaud Lutellier, Jonathan Rosenthal, Lin Tan,
Yaoliang Yu, and Nachiappan Nagappan. Problems and opportunities in training deep learning
software systems: An analysis of variance. In 2020 35th IEEE/ACM International Conference on
_Automated Software Engineering (ASE), pp. 771–783. IEEE, 2020._
Juho Piironen and Aki Vehtari. Comparison of bayesian predictive methods for model selection.
_Statistics and Computing, 27(3):711–735, 2017._
David S Rosenberg and Peter L Bartlett. The rademacher complexity of co-regularized kernel classes.
In Artificial Intelligence and Statistics, pp. 396–403, 2007.
[Scikit-learn:SVM. RBF SVM parameters. https://scikit-learn.org/stable/auto_examples/svm/plot_](https://scikit-learn.org/stable/auto_examples/svm/plot_rbf_parameters.html)
[rbf_parameters.html, 2020.](https://scikit-learn.org/stable/auto_examples/svm/plot_rbf_parameters.html)
Sergio Segura, Gordon Fraser, Ana B Sanchez, and Antonio Ruiz-Cortés. A survey on metamorphic
testing. IEEE Transactions on software engineering, 42(9):805–824, 2016.
Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep
learning requires rethinking generalization. International Conference on Learning Representations
_(ICLR), 2017._
A APPENDIX
A.1 THEORETICAL INSPIRATION
This part introduces the theory that leads to the metamorphic relation and formula that we use for conducting MV. The notation and terms we use follow the standard statistical machine learning literature (Mohri et al., 2018; Ghosh et al., 2017).
**Theoretical Metamorphic Relation for Model Validation:**

Let $\mathcal{X}$ be the feature space from which the data are drawn. Let $\mathcal{Y} = [k] = \{1, ..., k\}$ be the class labels. We are given a training set $S = \{(x_1, y_{x_1}), ..., (x_m, y_{x_m})\} \in (\mathcal{X} \times \mathcal{Y})^m$ which is independently and identically distributed according to some fixed, but unknown, distribution $\mathcal{D}$ over $\mathcal{X} \times \mathcal{Y}$. Let a classifier be $f: \mathcal{X} \to \mathcal{C}$, $\mathcal{C} \subseteq \mathbb{R}^k$.

A loss function $L$ is a map $L: \mathcal{C} \times \mathcal{Y} \to \mathbb{R}^+$. We use $\mathbb{E}$ to denote expectation. Given any loss function $L$ and a classifier $f$, we define the $L$-risk of $f$ by

$$R_L(f) = \mathbb{E}_{\mathcal{D}}[L(f(x), y_x)] = \mathbb{E}_{x, y_x}[L(f(x), y_x)]. \tag{2}$$

Let $(x, \hat{y}_x)$ be the mutated data, where

$$\hat{y}_x = \begin{cases} y_x, & \text{with probability } (1 - \eta_x) \\ j \in [k] \setminus y_x, & \text{with probability } \eta_{xj}. \end{cases}$$

For all $x$, conditioned on $y_x = i$, we have $\sum_{j \neq i} \eta_{xj} = \eta_x$. The noise is called symmetric or uniform if $\eta_x = \eta$. For uniform or symmetric noise, we also have $\eta_{xj} = \frac{\eta}{k-1}$. We consider a symmetric loss function $L$ that satisfies the following equation, where $C$ is a constant:

$$\sum_{i=1}^{k} L(f(x), i) = C, \quad \forall x \in \mathcal{X}, \forall f. \tag{3}$$

Let $S_\eta = \{(x_n, \hat{y}_{x_n}), n = 1, ..., m\}$ be the mutated training data. We call $\eta$ a mutation degree. Let $f$ be the model trained on $S$ and $f_\eta$ be the model trained on $S_\eta$.[1] $f$ and $f_\eta$ are from the same learner (with identical hypothesis space). Let $r(x) = (L(f_\eta(x), \hat{y}_x) - L(f(x), y_x))/\eta$ be the loss change rate between $f$ and $f_\eta$, $x \in \mathcal{X}$. For uniform noise, we have:

$$\begin{aligned}
\mathbb{E}_{\mathcal{D}}[r(x)]
&= \mathbb{E}_{\mathcal{D}}\!\left[\frac{L(f_\eta(x), \hat{y}_x) - L(f(x), y_x)}{\eta}\right] && (4)\\
&= \mathbb{E}_{\mathcal{D}}\!\left[\frac{(1-\eta)\,L(f_\eta(x), y_x) + \frac{\eta}{k-1}\sum_{i \neq y_x} L(f_\eta(x), i) - L(f(x), y_x)}{\eta}\right] && (5)\\
&= \mathbb{E}_{\mathcal{D}}\!\left[\frac{1-\eta}{\eta}\,L(f_\eta(x), y_x) - \frac{1}{\eta}\,L(f(x), y_x) + \frac{1}{k-1}\sum_{i \neq y_x} L(f_\eta(x), i)\right] && (6)\\
&= \frac{1-\eta}{\eta}\,R_L(f_\eta) - \frac{1}{\eta}\,R_L(f) + \frac{C - R_L(f_\eta)}{k-1} && (7)\\
&= \left(\frac{1}{\eta} - 1 - \frac{1}{k-1}\right) R_L(f_\eta) - \frac{1}{\eta}\,R_L(f) + \frac{C}{k-1} && (8)\\
&= \left(\frac{1}{\eta} - \frac{k}{k-1}\right) R_L(f_\eta) - \frac{1}{\eta}\,R_L(f) + \frac{C}{k-1}. && (9)
\end{aligned}$$
If we consider $L$ to be the error rate,[2] $C = 1$. Now consider the situation for multi-class classification problems: we mutate the labels by label swapping, i.e., each mutated label is replaced with the next label in the label list, and the final label in the list is replaced with the first. In this way, we have $\eta_{x_n j} = \eta$. Thus, Equation 9 becomes:

$$\mathbb{E}_{\mathcal{D}}[r(x)] = \left(\frac{1}{\eta} - 2\right) R_L(f_\eta) - \frac{1}{\eta}\,R_L(f) + 1. \tag{10}$$

Thus, we have:

$$R_L(f) = (1 - 2\eta)\,R_L(f_\eta) - \eta\,\mathbb{E}_{\mathcal{D}}[r(x)] + \eta. \tag{11}$$

Let $T(f)$ be the accuracy of $f$ over the distribution $\mathcal{D}$: $T(f) = 1 - R_L(f)$, $T(f_\eta) = 1 - R_L(f_\eta)$. Let $\widehat{T}(f)$ be the empirical accuracy of $f$ on training data of size $n$: $\widehat{T}(f) = \frac{1}{n}\sum_{i=1}^{n} \mathbb{1}_{f(x_i)=y_{x_i}}$, where $\mathbb{1}_w$ is the indicator function of event $w$. We have:

$$T(f) = (1 - 2\eta)\,T(f_\eta) + \eta\,\mathbb{E}_{\mathcal{D}}[r(x)] + \eta. \tag{12}$$

[1] $f$ is the same as $f(S)$ in Equation 15.
[2] The loss function $L$ does not have to be the error rate (0-1 loss). Any loss function that satisfies Equation 3 can be applied. In the present work, we use the error rate (also accuracy) considering its popularity.
**Empirical Model Validation Measurement**

Equation 12 specifies a theoretical metamorphic relation. The mutation degree $\eta$ defines the relationship between the inputs $S$ and $S_\eta$: under each class of $S$, a proportion $\eta$ of the labels is mutated with label swapping (see more in Section A.1), yielding $S_\eta$. Such input changes lead to the expected output changes reflected by $T(f)$, $T(f_\eta)$, and the loss change rate $r(x)$ in the equation.

The calculation of such a theoretical metamorphic relation, however, is impractical, because the data distribution $\mathcal{D}$ is unknown and the expectation calculation is also unrealistic. Inspired by Equation 12, and considering that our motivation is to empirically measure how well a learner fits the available training data, we change the expectations over the data distribution $\mathcal{D}$ into empirical observations on the available training data. This leads to the following measurement, $m$, to empirically assess a learner.[3]

$$\begin{aligned}
m &= (1 - 2\eta)\,\widehat{T}_S(f_\eta) + \eta \hat{r} + \eta && (13)\\
&= (1 - 2\eta)\,\widehat{T}_S(f_\eta) + \eta\,\frac{\widehat{T}_S(f) - \widehat{T}_{S_\eta}(f_\eta)}{\eta} + \eta && (14)\\
&= (1 - 2\eta)\,\widehat{T}_S(f_\eta) + \widehat{T}_S(f) - \widehat{T}_{S_\eta}(f_\eta) + \eta. && (15)
\end{aligned}$$

In Equation 15, $S$ is the original training data, $S_\eta$ is the mutated training data with mutation degree $\eta$ ($\eta \le 0.5$), $f$ is the model trained on the original training data, $f_\eta$ is the model trained on the mutated data, and $\widehat{T}_S(f)$, $\widehat{T}_S(f_\eta)$ are the accuracies of $f$ and $f_\eta$ based on the original training labels, respectively. $\widehat{T}_{S_\eta}(f_\eta)$ is the accuracy of $f_\eta$ based on the mutated training labels.
**Connection between $m$ and Our Intuition:**

Interestingly, Equation 15 matches well with our intuition introduced in Section 2.1. In particular, if the learner is less affected by the mutated labels, the predictive behaviour of the model trained with mutated labels should be closer to that of the model trained with the original labels. This leads to a larger $\widehat{T}_S(f_\eta)$ and a larger $\hat{r}$, as long as the mutation degree $\eta$ is fixed. The match between $m$ and our intuition provides extra support for the reliability of MV in validating machine learning models.
The calculation of $m$ can also be regarded as a type of mutation score (Jia & Harman, 2010; Papadakis et al., 2019) for model validation. As explained in Section 2.1, we expect that a good learner kills more mutants. However, the intuitive mutation score (i.e., the proportion of killed mutants) has a limitation in model validation: a poor learner that makes random guesses may also kill many mutants. Equation 15 covers the mutant killing results (i.e., by calculating the accuracy decrease rate $\hat{r}$), but also fixes this limitation by additionally considering the accuracy on the original correct labels (i.e., $\widehat{T}_S(f_\eta)$). Thus, it can be regarded as a mutation score calculation adapted to suit the model validation scenario.

[3] The purpose of $m$ is not to approximate $T(f)$, but to measure how well a learner fits the available training data. However, if the training data is sufficiently large, $m$ is expected to be close to $T(f)$.
The larger a learner's $m$ is, the better the learner fits the training data. Thus, we adapt the concept of a metamorphic relation to the scenario of model validation and extend it to a quantitative measurement, rather than a simple binary judgement. However, the value of $m$ can also reflect the metamorphic relation we introduced in Section 2.2: let us define that once a learner's $m$ is below a threshold (e.g., 0.8), the metamorphic relation is violated. The violation then indicates a fault in model fitting: the learner is either over-complex (with a large $\widehat{T}_S(f)$) or over-simple (with a small $\widehat{T}_S(f)$) for the training data.
A.2 INFLUENCE OF MUTATION DEGREE
From Equation 15, it can be seen that the calculation of MV involves a mutation degree, η, for
generating the mutated training data. The value of η needs to be fixed during the calculation. However,
if Equation 15 is reliable, the influence of η on model validation should be minor. This section
empirically explores whether this is true.
The first sub-figure in Figure 7 shows the results for UCI datasets. The second sub-figure shows the
results for the three large image datasets. It reveals that for most datasets (except for the three smallest
datasets, i.e., iris, wine, and heart), the values of MV remain almost identical with different mutation
degrees. This is because there is the constant term η at the end of Equation 15 when calculating MV,
which cancels out the decrease of the detected mutants. This observation provides further evidence
for the reliability of our calculation formula shown in Equation 15.
For the three very small datasets, i.e., iris, wine, and heart, with fewer than 300 data points, we
observe that a larger noise degree leads to a smaller MV. This may be because label mutations have
more influence on very small datasets. Nevertheless, with different mutation degrees, we observe that
the effectiveness of MV in model selection and hyperparameter configuration does not change, because
the relative rankings of models/hyperparameters remain unchanged. For example, as shown by the
third sub-figure in Figure 7, even for the smallest dataset iris, the recommended maximum depths for
Decision Tree are identical.
As with the choice of n in n-fold cross-validation, although we demonstrate that the choice of η does not affect model validation conclusions, there may be a best practice for selecting η under different application scenarios. We call for future work and practice to explore this.
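The sensitivity check of Figure 7 amounts to recomputing the MV score for several mutation degrees; the short sketch below reuses `mutation_validation_score` from Section 2.2, and the fixed maximum depth of 5 is an assumption.

```python
from sklearn.tree import DecisionTreeClassifier

for eta in (0.1, 0.2, 0.3, 0.4, 0.5):
    score = mutation_validation_score(DecisionTreeClassifier(max_depth=5), X, y, eta=eta)
    print(f"eta={eta:.1f}  MV={score:.3f}")
```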
Figure 7: Influence of mutation degree η on MV. The results show that different η lead to similar MV
values and identical model validation conclusions.
A.3 INFLUENCE OF TRAINING DATA SIZE ON MV
In addition to RQ4, we further investigate what would happen to MV should we deliberately use much more training data than normally expected. That is, we go beyond the assumption that there is no need to increase the data when the test accuracy becomes stable. As we can see from Figure 8, MV is
more responsive to changes in training data size than test accuracy. For learners that are over-complex
to the data, when adding more training data no longer increases test accuracy, MV continues to
increase, indicating that the model’s resilience to incorrect labels continues to increase. The extra
training data improve the robustness of the learner to training label noise. The pattern that MV no
longer changes when model complexity increases is also a signal to developers that the training data
is perhaps larger than necessary.
Figure 8: MV (first row) and test accuracy (second row) on training data of increasing size (1,000, 10,000, and 100,000 points) with different maximum depths (horizontal axis), for the moon, circle, and linear datasets. 1) MV is responsive to data size changes; 2) MV no longer decreases for large depths when the training data size is sufficiently large.
A.4 EFFICIENCY OF MV IN MODEL VALIDATION
In this part, we compare the efficiency of MV, CV (3-fold), and validation accuracy (for image
datasets), using the synthetic datasets on 7 classifiers (with the same setup in RQ1), the 8 UCI datasets
on Decision Tree (with a fixed maximum depth of 5), and the 3 image datasets (with a fixed dropout
rate of 0.5, and learning rate of 0.0001). The deep learning experiments with the three image datasets
were run on Tesla V100, with 16GB Memory and 61GB RAM.
Table 5 shows the results. For brevity, we show only the results for the moon synthetic datasets.
Overall, MV, CV, and validation accuracy all have good efficiency on these datasets. Note that in practice developers often use 5-fold or 10-fold CV, which have a larger time cost than 3-fold CV.
The cost of MV mainly comes from data mutation and model training. As observed from Table 5,
larger datasets take more time to get MV values. Our results demonstrate that the cost of MV is
manageable and comparable to 3-fold CV and validation accuracy in both classic learning and deep
learning. In particular, for the three large image datasets, MV costs only half the time of 3-fold CV.
What is more, as demonstrated by RQ1 and RQ2, the effectiveness and stability of MV help to
conduct model selection and hyperparameter configuration more quickly. Thus, it helps to save cost in
selecting the best learners for a given training set. On the other hand, MV is sensitive to over-complex
models, and can help to select the simplest model with reasonable test accuracy. This will also reduce
the model training and maintainability cost in the long run.
Overall, MV has comparable efficiency to 3-fold CV and validation accuracy. For the three deep
learning tasks, MV’s efficiency doubles that of 3-fold CV.
| Dataset | Learner | MV-time | CV-time |
|---|---|---|---|
| moon | Linear SVM | 0.002s | 0.003s |
| moon | RBF SVM | 0.003s | 0.004s |
| moon | Gaussian Process | 0.161s | 0.157s |
| moon | Decision Tree | 0.001s | 0.002s |
| moon | Random Forest | 0.046s | 0.064s |
| moon | AdaBoost | 0.124s | 0.185s |
| moon | Naive Bayes | 0.002s | 0.003s |
| mean | – | 0.048s | 0.060s |

| Dataset | Learner | MV-time | CV-time |
|---|---|---|---|
| iris | Decision Tree | 0.021s | 0.003s |
| wine | Decision Tree | 0.004s | 0.007s |
| cancer | Decision Tree | 0.023s | 0.017s |
| car | Decision Tree | 0.011s | 0.014s |
| heart | Decision Tree | 0.004s | 0.008s |
| adult | Decision Tree | 0.386s | 0.385s |
| bank | Decision Tree | 0.118s | 0.109s |
| connect | Decision Tree | 0.195s | 0.196s |
| mean | – | 0.095s | 0.092s |

| Dataset | Learner | epoch | MV-time | CV-time | Validation-time |
|---|---|---|---|---|---|
| mnist | convolutional neural network | 10 | 2.772min | 5.296min | 1.761min |
| fashion-mnist | convolutional neural network | 10 | 2.803min | 5.352min | 1.778min |
| cifar10 | convolutional neural network | 50 | 12.052min | 24.315min | 8.072min |
| **mean** | – | – | **5.876min** | **11.654min** | **3.870min** |
Table 5: Efficiency of MV, CV, and validation accuracy. The top/bottom sub-table shows the results
for classic/deep learning. We observe that the efficiency of MV is comparable to that of 3-fold CV
and validation accuracy.
A.5 MODEL SELECTION WITH UCI DATASETS (RQ1)
In this part, as an extension to RQ1, we present the results of model selection using UCI datasets.
Note that we do not use test accuracy as the ground truth for model selection considering the possible
limitations of test accuracy we discussed in the introduction. Instead, we present the results to
demonstrate how MV differs from CV and test accuracy, as well as how it complements CV and
test accuracy in model selection when developers observe similar accuracy results across different
learners.
Figure 9 shows the results. It is interesting that we do have observations on MV that are consistent with common machine learning knowledge. In particular, for the four smallest datasets (i.e., iris, wine, cancer, and heart), MV suggests the two simplest learners (i.e., Linear SVM and Naive Bayes).
Figure 9: Model selection results with UCI datasets (extended analysis for RQ1).
A.6 MODEL SELECTION WITH RANDOM LABEL REPLACEMENT
In the main body of this work, we present empirical results with MV calculated using label swapping.
In this part, we further explore the effectiveness of another label mutation approach: random label
replacement. That is, when conducting label mutation, we replace the original label with a label that
is randomly chosen from the label list. We compare the performance of these two label mutation
approaches in model selection with the ground truth provided by synthetic datasets. Figure 12 shows
the results. We observe that random label replacement is less accurate than label swapping in model
selection, but is still more accurate than CV and test accuracy.
Figure 10: Performance of MV in model selection with label swapping (MV-SL) and random label replacement (MV-RL) on the zero-noise synthetic datasets. The values shown are:

| Dataset | Linear SVM | RBF SVM | Gaussian Process | Decision Tree | Random Forest | AdaBoost | Naive Bayes |
|---|---|---|---|---|---|---|---|
| moon | 0.89 / 0.80 | 1.00 / 0.94 | 1.00 / 0.95 | 0.74 / 0.77 | 0.75 / 0.77 | 0.75 / 0.73 | 0.89 / 0.77 |
| circle | 0.53 / 0.51 | 1.00 / 0.87 | 0.88 / 0.91 | 0.67 / 0.73 | 0.69 / 0.78 | 0.72 / 0.74 | 1.00 / 0.93 |
| linear | 0.93 / 0.79 | 0.88 / 0.83 | 0.91 / 0.88 | 0.68 / 0.73 | 0.67 / 0.78 | 0.77 / 0.77 | 0.94 / 0.88 |

Each cell gives MV-SL / MV-RL.
A.7

Figure 11: Performance of MV in suggesting hyperparameters that follow Occam’s Razor on dataset Cancer (heat maps of test accuracy and MV over the number of trees and the depth). The training data has only 300 samples. The low values of MV for large depths and numbers of trees provide warnings to developers that the hyperparameters are over-complex and violate the rule of Occam’s Razor. The unnecessary complexity in the complex learner affects the interpretability of the learner, also making it vulnerable to training label attacks.
A.8

Figure 12: Correlation between MV, training accuracy changes, and the new training accuracy based on the original labels. The accuracy changes correlate with MV with r=0.91 (p=1.5e-46); the new accuracy on the original labels correlates with MV with r=0.84 (p=1.4e-32).