MODEL VALIDATION USING MUTATED TRAINING LABELS: AN EXPLORATORY STUDY

**Anonymous authors** Paper under double-blind review

ABSTRACT

We introduce an exploratory study on Mutation Validation (MV), a model validation method using mutated training labels for supervised learning. MV mutates training data labels, retrains the model against the mutated data, then uses the metamorphic relation that captures the consequent training performance changes to assess model fit. It does not use a validation set or test set. The intuition underpinning MV is that overfitting models tend to fit noise in the training data. We explore 8 different learning algorithms, 18 datasets, and 5 types of hyperparameter tuning tasks. Our results demonstrate that MV is accurate in model selection: the model recommendation hit rate is 92% for MV and less than 60% for out-of-sample validation. MV also provides more stable hyperparameter tuning results than out-of-sample validation across different runs.

1 INTRODUCTION

Out-of-sample validation (such as test accuracy) is arguably the most popular approach adopted by researchers and developers for empirically validating models in applied machine learning. It uses data different from the training data to approximate future unseen data. However, out-of-sample validation is widely acknowledged to have limitations: 1) the sample set may be too small to represent the data distribution; 2) the accuracy can have a large variance across different runs (Pham et al., 2020); 3) the samples are typically randomly selected from the collected data, and may therefore carry a similar bias to the training data, leading to an inflated validation score (Piironen & Vehtari, 2017; Gronau & Wagenmakers, 2019); 4) excessive reuse of a fixed set of samples can lead to overfitting even if the samples are held out and not used in the training process (Feldman et al., 2019).

The Mutation Validation (MV) approach we explore is a new approach to validating machine learning models that relies only upon training data. MV applies Mutation Testing and Metamorphic Testing, two software engineering techniques that validate code correctness. Mutation testing mutates the program by making synthetic changes (e.g., changing a + b into a − b, or removing a function call), then re-executes the tests to monitor the behaviour changes and check the power of a test suite (Papadakis et al., 2019; Jia & Harman, 2010). Metamorphic Testing detects program errors by checking metamorphic relations, i.e., the relationships between input changes and output changes (Chen et al., 2020; Segura et al., 2016). Combining these two techniques, MV mutates training data labels and retrains the model using the mutated data, then measures the training performance change. As shown by Figure 1, the key intuition is that a learner that fits the given training data properly is less likely to be 'fooled' by a small ratio of mutated labels. Consequently, the model trained with the mutated training data would "detect" the mutated labels and still exhibit high predictive performance on the original training data. By contrast, an overfitted learner violates Occam's Razor (Hawkins, 2004) and has extra capacity to fit incorrect noisy labels; it will thus yield a model that exhibits poor predictive performance on the original training data, but high performance on the mutated training data. Furthermore, an over-simple learner has poor learnability, and will have low performance on both the original training data and the mutated data.
There are a significant number of theories proposed to explain model validation and complexity, such as Rademacher complexity and VC dimension. Nevertheless, the prescriptive and descriptive value of these theories remains debated. Zhang et al. (2017) found that deep hypothesis spaces can be large enough to memorise random labels. They discussed the limitations of existing measurements in explaining the generalisation ability of large neural networks, and called for new measurements.

[Figure 1: The intuition underpinning MV: a better learner is less affected by the mutated labels. The three columns sketch a good learner, an over-complex learner, and an over-simple learner, showing each learner's fit to the original and mutated training data and the expected training performance as the noise degree increases.]

In this present work, we do not compare MV with these theories. Rather, similar to Zhang et al. (2017), we focus on empirical investigation. In Appendix A.1, we provide the theoretical foundations underpinning the metamorphic relation that MV uses.

We report on the performance of MV on 12 open datasets and 6 synthetic datasets with different known data distributions (see Table 1), using 8 widely-adopted classifiers (including both classic learning classifiers and deep learning classifiers). We investigate the effectiveness and stability of MV serving as a complementary measure to the existing practical model validation methods for model selection and parameter tuning. The experimental results provide evidence to support the following findings. First, MV captures well the degree of match between decision boundaries and data patterns in model selection. The model recommendation hit rate for MV is 92%, but is 53% for cross validation and 55% for test accuracy. Second, MV is more responsive to changes in capacity than conventional validation methods. When cross validation (CV) accuracy and test accuracy cannot distinguish among large-capacity hyperparameters, MV complements them in hyperparameter tuning. Third, MV is stable in model validation results: its dropout rate tuning result does not change across five runs; the variance is zero. For the validation set and test set, the average variance is 0.003 and 0.007, respectively. The paper also discusses the connections between MV and other noise injection work in the literature, as well as the usage scenarios of MV (Section 5).

In summary, we make the following primary contributions:

**1) An exploratory study on the performance of Mutation Validation.** We explore the effectiveness and stability of MV to validate machine learning (ML) models as a complement to the currently used empirical model validation methods. MV requires neither validation nor test sets, but uses the training performance sensitivity to the mutated training labels. We study 18 datasets, 8 different learning algorithms, and 5 hyperparameter tuning tasks.

**2) An application of software testing techniques in ML model validation.** MV is the first approach that applies mutation testing and metamorphic testing, two widely studied software testing techniques, to model validation tasks.

2 MUTATION VALIDATION

2.1 GENERAL INTUITION

Figure 1 illustrates the intuition that underpins MV.
For a 'good' learner that fits the training data well (first column of Figure 1), the learner is less likely to be 'fooled' by a small number of mutated labels, and would keep predicting the original labels for the mutated samples. As a result, the model trained on the mutated data will still have high predictive accuracy on the original training data, but will have decreased predictive accuracy on the mutated data. An overfitted learner tends to fit noise in the training data (second column of Figure 1). With mutated data, the learner will yield a model that makes predictions following the incorrect labels, leading to a high training accuracy on the mutated data, but poor accuracy on the original data. An over-simple learner has poor learnability. The model it yields has poor performance with or without the mutated training data. As a result, the model trained on the mutated data has low accuracy with both the original labels and the mutated labels.

2.2 MUTATION VALIDATION

MV uses mutation testing and metamorphic testing, two software validation techniques, to validate ML models. Mutation testing creates mutants by injecting faults in a program, then re-executes the program to check whether a test suite detects those faults. A metamorphic relation specifies how a change in the input should result in a change in the output. It is used to detect errors in software when there is no reliable oracle. Consider a program $f$ with inputs $x$ and $x'$. Let $f(x)$ and $f(x')$ be the execution outputs of $x$ and $x'$ against $f$. Let $R_i$ be the relationship between $x$ and $x'$, and $R_o$ the relationship between $f(x)$ and $f(x')$. A metamorphic relation can be represented as $R_i(x, x') \Rightarrow R_o(f(x), f(x'))$. If the relation is violated, the program under test contains a bug. For example, when validating the sin mathematical function, one metamorphic relation that a correct program should hold is $\sin(x + \pi) = -\sin(x)$.

Now consider the scenario of ML model validation. If we treat a learner as the program under test, the training data as the input, and the trained model's behaviours as the output, our intuition introduced in Section 2.1 can be well captured by mutation testing and metamorphic relations. In particular, the input changes are introduced by mutating training data, and each mutated data instance is called a mutant; the output changes are defined in terms of performance changes of the trained models. Based on this, we propose Mutation Validation, a new machine learning model validation method that validates ML models based on the relationship between training data changes and training performance changes. A good learner is expected to "detect" the mutants and exhibit a training performance change that corresponds to the number of input data changes.

There are different metamorphic relations that can be explored to conduct MV, depending on the data mutation method, the training performance measurement (e.g., accuracy, precision, loss), and the calculation of performance changes. Let $\eta$ be the mutation degree (i.e., the ratio of randomly mutated labels in the training data). $S$ is the original training data, $S_\eta$ is the mutated training data with mutation degree $\eta$ ($\eta \leq 0.5$), $f(S)$ is the model trained on $S$, and $f(S_\eta)$ is the model trained on $S_\eta$. $\hat{T}_S(f(S))$ and $\hat{T}_S(f(S_\eta))$ are the accuracy of $f(S)$ and $f(S_\eta)$ based on the original training labels, respectively. $\hat{T}_{S_\eta}(f(S_\eta))$ is the accuracy of $f(S_\eta)$ based on the mutated labels.
In this present work, as the first exploratory study on MV, we study the following MV measurement $m$ to conduct model validation:

$$m = (1 - 2\eta)\,\hat{T}_S(f(S_\eta)) + \hat{T}_S(f(S)) - \hat{T}_{S_\eta}(f(S_\eta)) + \eta. \quad (1)$$

The above measurement, although derived from theory (Appendix A.1), matches well with our intuition introduced in Section 2.1. In particular, if the learner is less affected by the mutated labels, the predictive behaviours of the model trained with mutated labels should be closer to those of the model trained with the original labels. This leads to a larger $\hat{T}_S(f(S_\eta))$ and a larger difference between $\hat{T}_S(f(S))$ and $\hat{T}_{S_\eta}(f(S_\eta))$, as long as the mutation degree $\eta$ is fixed. The larger $m$ is, the better the learner fits the training data. In the optimal case, the model trained on the mutated data has perfect training accuracy on the original training labels (i.e., $\hat{T}_S(f(S_\eta)) = 1$) and detects all the mutants (i.e., $\hat{T}_S(f(S)) - \hat{T}_{S_\eta}(f(S_\eta)) = \eta$); thus $m = 1$. The mutation degree $\eta$ ranges between 0 and 0.5, but needs to be a fixed value. The theoretical inspiration for this metric, its metamorphic relation, as well as the influence of $\eta$ on $m$, are provided in our appendix.

3 EXPERIMENTAL SETUP

The main body of this paper answers four research questions:

**RQ1: What is the effectiveness of MV in model selection?**
**RQ2: What is the effectiveness of MV in hyperparameter tuning?**
**RQ3: What is the stability of MV in validating machine learning models?**
**RQ4: How does training data size affect MV?**

We also explore the efficiency of MV in validating machine learning models, as well as the influence of the mutation degree $\eta$ on MV. The details are in the appendix.

To evaluate MV, we choose datasets that are diverse in category, size, class number, feature number, and class balance. Small datasets are particularly important to demonstrate the ability of MV to provide warnings for over-complex learners. Table 1 shows the details of each dataset used to evaluate MV. Column "#training" and Column "#test" show the sizes of the training data and test data, as provided by the dataset providers. Column "#class" shows the number of classes (or labels) for each dataset. Column "#feature" presents the number of features.

| Dataset | abbr. | #training | #test | #class | #feature |
|---|---|---|---|---|---|
| synthetic-moon | moon | 100 – 1e+6 | 2,000 | 2 | 2 |
| synthetic-moon (0.2 noise) | moon-0.2 | 100 – 1e+6 | 2,000 | 2 | 2 |
| synthetic-circle | circle | 100 – 1e+6 | 2,000 | 2 | 2 |
| synthetic-circle (0.2 noise) | circle-0.2 | 100 – 1e+6 | 2,000 | 2 | 2 |
| synthetic-linear | linear | 100 – 1e+6 | 2,000 | 2 | 2 |
| synthetic-linear (0.2 noise) | linear-0.2 | 100 – 1e+6 | 2,000 | 2 | 2 |
| Iris | iris | 150 | – | 3 | 4 |
| Wine | wine | 178 | – | 3 | 13 |
| Breast Cancer Wisconsin | cancer | 569 | – | 2 | 9 |
| Car Evaluation | car | 1,728 | – | 4 | 6 |
| Heart Disease | heart | 303 | – | 5 | 14 |
| Bank Marketing | bank | 45,211 | – | 2 | 17 |
| Adult | adult | 48,842 | 16,281 | 2 | 14 |
| Connect-4 | connect | 67,557 | – | 2 | 42 |
| MNIST | mnist | 60,000 | 10,000 | 10 | – |
| fashion MNIST | fashion | 60,000 | 10,000 | 10 | – |
| CIFAR-10 | cifar10 | 50,000 | 10,000 | 10 | – |
| CIFAR-100 | cifar100 | 50,000 | 10,000 | 100 | – |

Table 1: Details of the datasets.

To obtain datasets with known ground-truth decision boundaries for validating model fitting, we use synthetic datasets with three types of data distributions: moon, circle, and linearly-separable. These three data distributions are provided by the scikit-learn (Pedregosa et al., 2011) tutorial to demonstrate the decision boundaries of different classifiers.
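To make the measurement in Equation 1 concrete, the sketch below shows one way the MV score could be computed with scikit-learn. This is a minimal, hedged illustration rather than the authors' released code: the helper names `swap_labels` and `mv_score` are ours, the per-class label-swapping mutation and the default η = 0.2 follow the experimental setup described in this section and below, and the labels are assumed to be stored in a NumPy array.

```python
# Minimal sketch (not the authors' code) of the MV score in Equation 1.
import numpy as np
from sklearn.base import clone
from sklearn.metrics import accuracy_score

def swap_labels(y, eta=0.2, seed=0):
    """Mutate a fraction eta of the labels in each class by label swapping:
    a selected label is replaced with the next label in the sorted label list,
    and the last label wraps around to the first."""
    rng = np.random.default_rng(seed)
    labels = np.unique(y)
    next_label = {l: labels[(i + 1) % len(labels)] for i, l in enumerate(labels)}
    y_mut = y.copy()
    for l in labels:
        idx = np.where(y == l)[0]
        picked = rng.choice(idx, size=int(round(eta * len(idx))), replace=False)
        y_mut[picked] = next_label[l]
    return y_mut

def mv_score(learner, X, y, eta=0.2, seed=0):
    """MV = (1 - 2*eta) * T_S(f(S_eta)) + T_S(f(S)) - T_{S_eta}(f(S_eta)) + eta."""
    y_mut = swap_labels(y, eta, seed)
    f_orig = clone(learner).fit(X, y)      # model trained on the original labels
    f_mut = clone(learner).fit(X, y_mut)   # model trained on the mutated labels
    t_s_f = accuracy_score(y, f_orig.predict(X))            # T_S(f(S))
    t_s_feta = accuracy_score(y, f_mut.predict(X))          # T_S(f(S_eta))
    t_seta_feta = accuracy_score(y_mut, f_mut.predict(X))   # T_{S_eta}(f(S_eta))
    return (1 - 2 * eta) * t_s_feta + t_s_f - t_seta_feta + eta
```

For example, `mv_score(SVC(gamma=2, C=1), X, y)` would retrain the given learner on the original and the mutated labels and return the score defined in Equation 1; the closer the result is to 1, the better the fit.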
To study the influence of original noise in the training data on MV, for each type of distribution we create datasets with noise. We also generate training data of different sizes, ranging from 100 to 1 million data points, to study the influence of training data size. These synthetic datasets help check whether MV identifies the right model whose decision boundary matches the data distribution, with and without noise in the original dataset S. We do not expect such synthetic datasets to reflect real-world data, but the degree of control and interpretability they offer allows us to verify the behaviour of MV against a known ground truth for model selection. We also report results on 12 real-world, widely-adopted datasets with different sizes, numbers of features, and numbers of classes. Eight of them are from the UCI repository (Asuncion & Newman, 2007); the remaining four are the most widely used image datasets: MNIST, fashionMNIST, CIFAR-10, and CIFAR-100.

For each dataset, we calculate MV with η = 0.2 to answer the research questions. That is, under each class of the dataset, we randomly select 20% of the labels to mutate. This guarantees that MV is not affected by the problem of class imbalance. We mutate the labels by label swapping: each selected label is replaced with the next label in the label list, and the final label in the label list is replaced with the first. For the synthetic datasets and the 8 UCI datasets, we compare with 3-fold CV accuracy and test accuracy. We use 3-fold CV because it yields results almost identical to cross validation with more folds, but at a lower cost, making it a more conservative baseline when studying the efficiency of MV. For the image datasets with deep learning models, we compare with validation accuracy computed on 20% validation data (split out from the training data) and with test accuracy. The next section introduces more configuration details.

4 RESULTS

4.1 RQ1: EFFECTIVENESS OF MV IN MODEL SELECTION

To answer RQ1, we use synthetic datasets to explore whether MV recommends the learners whose decision boundaries best match the data patterns. We use synthetic datasets because they have ground-truth data patterns for model selection (more details in Section 3). The data distribution of real-world UCI datasets, however, is often unknown or difficult to visualise. To generate synthetic datasets, we use three scikit-learn (Pedregosa et al., 2011) synthetic dataset distributions, which are designed as a tutorial to illustrate the nature of the decision boundaries of different classifiers. This experiment uses the same settings as in the scikit-learn tutorial (Classifier Comparison, 2019). Each dataset has 100 training data points for model selection. We generate another 2,000 points as test sets. Figure 2 shows the training data points, the decision boundaries of each classifier, and the measurement values from MV, CV, and test accuracy (on the 2,000 test data points).
[Figure 2: Performance of MV in model selection. MV, CV, and Test denote the MV score, CV accuracy, and test accuracy, annotated on the decision-boundary plot of each classifier (Linear SVM, RBF SVM, Gaussian Process, Decision Tree, Random Forest, AdaBoost, Naive Bayes). Red and blue points are the original training data without noise (top three rows) and with 0.2 noise (bottom three rows). Areas with different colours show the decision boundaries. We observe that MV captures well the match between decision boundaries and data patterns.]

We have the following observations.
First, MV tends to have large values for cases where the decision boundaries match the data patterns well; it provides more discriminating scores for model selection under different datasets, making it easy to pick out the top-2 models that best match the data distributions based on MV. Second, the recommended models from MV are less affected by the noise in the training data. The figure also shows cases where CV and test accuracy have limitations in evaluating ML models. For example, in Figure 2, for the circle distribution, Decision Trees and Random Forests (with a maximum depth of 10) give obviously ill-fitted, rectangle-shaped decision boundaries, yet the cross validation accuracy and test accuracy remain high. In addition, from Table 2, it is more difficult to select proper models based on CV and test accuracy, because CV and test accuracy are often very similar across different models.

For ease of observation, we present Table 2 to show the models recommended by MV, CV, and test accuracy, based on their top-2 scores shown by Figure 2. The ground-truth models in bold are based on manual observation and widely-adopted ML knowledge. For example, the Scikit-learn documentation (Classifier Comparison, 2019) mentions that Naive Bayes and Linear SVM are more suitable for linearly-separable data.

| Dataset | Method | Noise | Recommended models based on top-2 scores |
|---|---|---|---|
| moon | MV | 0.0 | **RBF SVM**, **Gaussian Process** |
| moon | MV | 0.2 | **RBF SVM**, **Gaussian Process** |
| moon | CV | 0.0 | **RBF SVM**, **Gaussian Process** |
| moon | CV | 0.2 | **RBF SVM**, **Gaussian Process** |
| moon | Test | 0.0 | **RBF SVM**, **Gaussian Process** |
| moon | Test | 0.2 | **RBF SVM**, **Gaussian Process**, AdaBoost |
| circle | MV | 0.0 | **RBF SVM**, **Naive Bayes** |
| circle | MV | 0.2 | **RBF SVM**, **Naive Bayes** |
| circle | CV | 0.0 | **RBF SVM**, Gaussian Process, DT, RF, AdaBoost, **Naive Bayes** |
| circle | CV | 0.2 | Gaussian Process, **Naive Bayes** |
| circle | Test | 0.0 | **RBF SVM**, Gaussian Process, DT, RF, **Naive Bayes** |
| circle | Test | 0.2 | **RBF SVM**, Gaussian Process |
| linear | MV | 0.0 | **Linear SVM**, **Naive Bayes** |
| linear | MV | 0.2 | Gaussian Process, **Naive Bayes** |
| linear | CV | 0.0 | RBF SVM, Gaussian Process, **Naive Bayes** |
| linear | CV | 0.2 | Gaussian Process, **Naive Bayes** |
| linear | Test | 0.0 | **Linear SVM**, RBF SVM, Gaussian Process, RF, **Naive Bayes** |
| linear | Test | 0.2 | **Linear SVM**, Gaussian Process, **Naive Bayes** |

Table 2: Recommended models by MV, CV, and test accuracy. The models in bold are the ground-truth ones whose decision boundaries match the data patterns. The recommendation hit rate is 92% for MV, 53% for CV, and 55% for test accuracy.

Overall, we have the following conclusion: among all the model selection tasks we explored, the hit rate (i.e., the ratio of recommended models that match the ground-truth models) is 92% for MV, 53% for CV accuracy, and 55% for test accuracy.
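To illustrate how this model-selection comparison could be scripted, the sketch below ranks a set of scikit-learn classifiers on one synthetic moon dataset by their MV scores. It reuses the hypothetical `mv_score` helper sketched in Section 3; the classifier settings only loosely follow the scikit-learn classifier-comparison tutorial and are illustrative rather than the paper's exact configuration.

```python
# Sketch: rank candidate classifiers on a synthetic moon dataset by MV score.
# Assumes the hypothetical mv_score helper sketched in Section 3.
from sklearn.datasets import make_moons
from sklearn.svm import SVC
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB

X, y = make_moons(n_samples=100, noise=0.0, random_state=0)
candidates = {
    "Linear SVM": SVC(kernel="linear", C=0.025),
    "RBF SVM": SVC(gamma=2, C=1),
    "Gaussian Process": GaussianProcessClassifier(),
    "Decision Tree": DecisionTreeClassifier(max_depth=10),
    "Random Forest": RandomForestClassifier(max_depth=10, n_estimators=10),
    "AdaBoost": AdaBoostClassifier(),
    "Naive Bayes": GaussianNB(),
}
scores = {name: mv_score(clf, X, y, eta=0.2) for name, clf in candidates.items()}
for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name:18s} MV = {score:.2f}")  # top-scoring models are the recommended ones
```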
4.2 RQ2: EFFECTIVENESS OF MV IN HYPERPARAMETER CONFIGURATION

For models whose capacity increases along with their hyperparameters, we expect their goodness of model fit to first increase, then peak, before decreasing. The peak indicates the best model fit assessed according to MV. In RQ2, we assess whether MV matches this pattern. We study five capacity-related hyperparameters for several widely-adopted algorithms: the maximum depth for a Decision Tree, C and gamma for a Support Vector Machine (SVM), and the dropout rate and learning rate for Convolutional Neural Networks (CNNs).

**Maximum Depth for Decision Tree.** Figure 3 shows how MV responds to increases in the maximum depth of a Decision Tree for the eight real-world UCI datasets. For small datasets (smaller than 2,000), we do not split out test data, but take the whole data as training data. For the bank and connect datasets, we use 80% of the data as test sets. For adult, we use its original test set. We repeat the experiments 10 times. The yellow shadow in the figure indicates the variance across the 10 runs.

[Figure 3: Changes in MV, CV accuracy, and test accuracy when increasing the maximum depth for Decision Trees on the eight UCI datasets. The x-axis ticks for car and connect differ to capture MV's inflection point. We can observe that, while CV and test accuracy agree with MV on the key influence point in most cases, they are less responsive to large depths (which lead to overfitting).]

From Figure 3, we make the following observations. First, for 6 out of the 8 datasets, MV increases then decreases as the maximum depth increases, and exhibits a maximum in each curve. This is consistent with the pattern we expect a good measure to exhibit. For small datasets, large depths yield low MV scores, whereas large datasets do not. For adult and bank, MV values remain large when the maximum depth increases. We suspect that this is because, for these two datasets, the training data size is large enough for the model to obtain resilience against mutated labels. We explore the influence of training data size in the appendix. We also observe that MV, CV, and test accuracy have similar inflection points, yet MV is more responsive to depth changes, especially when the model overfits due to large depths. This observation indicates that MV provides further information to help tune the maximum depth when CV and test accuracy are less able to distinguish multiple parameters. In particular, if a developer uses grid search to select the best maximum depth in the range between 5 and 10, we find that grid search suggests depths of 8, 6, and 9 in three runs for the cancer dataset, which are over-complex and unstable. Similar results are observed for other small datasets. With MV, the decreasing trend in this range indicates that there is a simpler model with comparable predictive accuracy but better resilience to label mutation.

**C and gamma for SVM.** In SVM, the gamma parameter defines how far the influence of a single training example reaches; the C parameter decides the size of the decision boundary margin, behaving as a regularisation parameter. In a heat map of MV scores as a function of C and gamma, the expectation is that good models should be found close to the diagonal of C and gamma (Scikit-learn:SVM, 2020).
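The sketch below shows how such a grid of MV scores over (C, gamma) could be produced; it assumes the hypothetical `mv_score` helper from Section 3, and the grid ranges and the use of the scikit-learn breast-cancer data are our own illustrative choices rather than the paper's exact setup.

```python
# Sketch: an MV score for each (C, gamma) cell of an RBF-SVM grid,
# analogous to the MV heat maps discussed below.
# Assumes the hypothetical mv_score helper sketched in Section 3.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

data = load_breast_cancer()
X = StandardScaler().fit_transform(data.data)
y = data.target

C_range = np.logspace(-2, 4, 7)       # rows of the heat map
gamma_range = np.logspace(-5, 1, 7)   # columns of the heat map
mv_grid = np.array([[mv_score(SVC(C=C, gamma=g), X, y, eta=0.2)
                     for g in gamma_range] for C in C_range])
# Cells where both C and gamma are large should receive low MV scores,
# flagging over-complex models even when their CV accuracy stays high.
```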
Figure 4 presents the heat maps for cross validation and MV on two datasets (wine and cancer). We do not use a held-out test set, to ensure sufficient training data. The upper-left triangle in each sub-figure denotes small complexity; the bottom-right triangle denotes large complexity. In both cases, MV gives low scores. When comparing CV and MV, MV is more responsive to hyperparameter value changes. With MV scores, it is more obvious that good models can be found along the diagonal of C and gamma. When C and gamma are both large, the CV score is high but the MV score is low; this is an indication that there exists a simpler model with similar test accuracy. In practice, as stated by the Scikit-learn documentation, it is interesting to simplify the decision function with a lower value of C so as to favour models that use less memory and that are faster to predict (Scikit-learn:SVM, 2020).

[Figure 4: Influence of the SVM parameters on CV and MV for the wine and cancer datasets. The horizontal axis is gamma; the vertical axis is C. Good models are expected to be found close to the diagonal. As can be seen, CV has a broad high-valued (bright) region, while MV's high-valued (bright) region is narrower, showing that MV is more responsive to parameter changes.]

**Dropout Rate and Learning Rate for CNN.** We use CNN models from the Keras documentation. Validation accuracy is calculated with 80% training data and 20% validation data. Figure 5 shows the results. We observe that when tuning the dropout rate for mnist and fashion-mnist, validation and test accuracy are less discriminating and provide different tuning results across different runs (more details in Table 4), yet MV is more discriminating. For the dropout rate, MV and test accuracy have different key influence points on cifar10 (0.4 vs. 0.2). This is because there is a big capacity jump between 0.2 and 0.4. The result suggests that the optimal dropout rate with comparable validation/test accuracy is between 0.2 and 0.4.

[Figure 5: Influence of the CNN's dropout rate and learning rate on MV and validation/test accuracy, for mnist, fashion-mnist, cifar10, and cifar100.]

Overall, we observe that MV is responsive to hyperparameter changes on all the hyperparameter tuning tasks we explored. CV and test accuracy are less responsive, especially for large-capacity hyperparameters (i.e., large maximum depths for Decision Trees and small dropout rates for CNNs). This leads to the following issues. First, due to the large variance of their results across different runs, such low responsiveness often leads to unstable hyperparameter recommendation results. We further explore this in Section 4.3. Second, developers may easily choose an over-complex learner, making the learner: 1) easily biased by incorrect training labels; 2) vulnerable to training data attacks; 3) computationally and memory intensive; 4) more difficult to interpret.

4.3 RQ3: STABILITY OF MV IN MODEL VALIDATION

In the machine learning experimental designs in the literature, data is usually randomly split.
The values of MV, CV, and test accuracy may be easily affected by the randomness in model building (Pham et al., 2020), especially when the overall dataset is small or when building deep learning models. As a result, with different runs, developers may get completely different recommendations for the best hyperparameter configuration. A good model validation method is expected to be stable in its hyperparameter recommendation results, giving developers clear instructions on the best choice. To explore the stability of MV, CV, and test accuracy, we run the model building process multiple times, each time with a different split of training data and validation data (the size of the training/validation data is unchanged). We then record the recommended best hyperparameters during each run under the scenario of hyperparameter configuration. We use the maximum depth for Decision Tree with the 5 small UCI datasets as well as the dropout rate for CNN with the 3 image datasets. Table 4 shows the results, which lead to the following conclusion: MV has good stability in recommending hyperparameters. When recommending the maximum depth for Decision Tree, the overall variance is 0.200 for MV, but 4.000 for CV accuracy; when recommending the dropout rate for CNN, the overall variance is 0.000 for MV, 0.003 for validation accuracy, and 0.007 for test accuracy.

| Dataset | Method | Recommended maximum depth | Variance |
|---|---|---|---|
| iris | MV | [3, 3, 3, 2, 2, 2, 2, 3, 3, 2] | 0 |
| iris | CV | [7, 6, 6, 3, 3, 9, 3, 7, 5, 8] | 4 |
| wine | MV | [3, 3, 2, 3, 2, 3, 4, 3, 2, 2] | 0 |
| wine | CV | [3, 4, 7, 8, 3, 3, 4, 3, 5, 3] | 3 |
| cancer | MV | [3, 3, 3, 2, 3, 3, 3, 3, 3, 3] | 0 |
| cancer | CV | [5, 7, 3, 4, 4, 4, 7, 4, 3, 6] | 2 |
| car | MV | [7, 7, 7, 7, 7, 7, 7, 7, 7, 7] | 0 |
| car | CV | [11, 19, 19, 15, 17, 13, 11, 13, 13, 15] | 8 |
| heart | MV | [5, 6, 3, 3, 6, 5, 6, 4, 3, 5] | 1 |
| heart | CV | [4, 4, 3, 3, 4, 4, 3, 3, 9, 5] | 3 |
| mean | MV | – | 0.200 |
| mean | CV | – | 4.000 |

| Dataset | Method | Recommended dropout rate | Variance |
|---|---|---|---|
| mnist | MV | [0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2] | 0.000 |
| mnist | Vali. | [0.0, 0.0, 0.2, 0.2, 0.0, 0.0, 0.2, 0.0, 0.0, 0.2] | 0.011 |
| mnist | Test | [0.2, 0.0, 0.0, 0.2, 0.0, 0.0, 0.0, 0.0, 0.2, 0.0] | 0.009 |
| f-mnist | MV | [0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2] | 0.000 |
| f-mnist | Vali. | [0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2] | 0.000 |
| f-mnist | Test | [0.2, 0.0, 0.2, 0.0, 0.0, 0.0, 0.0, 0.2, 0.2, 0.0] | 0.011 |
| cifar10 | MV | [0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4] | 0.000 |
| cifar10 | Vali. | [0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2] | 0.000 |
| cifar10 | Test | [0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2] | 0.000 |
| mean | MV | – | 0.000 |
| mean | Vali. | – | 0.003 |
| mean | Test | – | 0.007 |

Table 4: Recommended hyperparameters across different runs. The third column shows the specific recommended hyperparameters (we run each dataset 10 times); the last column shows the variance (the average of the squared differences from the mean) across different runs. MV is more stable than CV and test accuracy in recommending hyperparameters.

4.4 RQ4: INFLUENCE OF TRAINING DATA SIZE ON MV

We expect that when a learner is over-complex for the data, adding extra data will improve the learner's resilience to mutated labels; thus the MV score ought also to increase when the training data size increases. Indeed, from Figure 3 we have already observed that models trained on large datasets (e.g., bank and adult) tend to be more resilient to mutated labels at large depths. Figure 6 shows the results for the connect and moon datasets, indicating that data size plays a role in determining the trained model's robustness to incorrect training labels.
[Figure 6: Influence of training data size on MV when increasing the maximum depth (horizontal axis) of Decision Trees, for connect (training sizes 1,000 to 9,000) and moon (training sizes 100 to 900). For a given depth, larger datasets tend to yield larger MV values.]

These observations indicate another possible value in the use of MV, as a complement to CV and test accuracy. Specifically, where MV is low, ML engineers have two potential actions to improve the fit between the learner and the training data: either optimise the learner (e.g., search for smaller-capacity models with comparable test accuracy), or optimise the data (e.g., increase the data size to increase the learner's resilience to incorrect training labels).

5 DISCUSSION

5.1 CONNECTION WITH RELATED WORK

MV is connected with several key concepts in the literature. This section discusses the connections as well as the differences between MV and these concepts.

**Noise injection in training data** has long been recognised as an approach to increasing the complexity of the training data and reducing overfitting (Holmstrom et al., 1992; Greff et al., 2016). There are three main differences between MV and this random noise injection. 1) Overfitting-prevention noise is often Gaussian noise (Greff et al., 2016), and is added to the training inputs (not the labels); on the contrary, MV does not use random noise, but uses label swapping, and the mutation is applied only to labels. 2) Conventional noise injection changes the training data, then directly uses the model trained on this noisy training data. MV keeps the original training data, but creates a separate mutated training set for the measurement calculation. This mutated training data is not used to yield any model for real prediction tasks. 3) Conventional noise injection aims to increase the complexity of the training data in order to reduce overfitting. MV aims to calculate a model validation measurement score to assess the goodness of model fit. Zhang et al. (2017) add noise to training data labels to study the generalisation of DNNs, while MV mutates labels to provide a model validation measurement.

**Noise injection in test data** is adopted to evaluate the robustness of a model. For example, the generation of adversarial examples (Goodfellow et al., 2014) uses noise injection in the test inputs. Compared to this technique, MV 1) mutates training data, not test data; 2) mutates labels, not features; 3) aims to validate model fitting, rather than model robustness to feature perturbations.

**Overfitting prevention** refers to the techniques adopted in the training process to avoid overfitting, especially when training deep neural networks. The key techniques are regularisation, early stopping, ensembling, dropout, and so on. However, as shown in this paper, conventional overfitting detection techniques, such as CV or validation accuracy, are often less responsive to overfitting. In practice, developers often conduct the prevention without knowing whether overfitting happens or not. MV is demonstrated to be discriminating and stable in detecting over-complex learners. It thus provides signals for the adoption and configuration of these overfitting prevention techniques.

**Rademacher complexity.**
In statistical machine learning, Rademacher complexity has been used to measure the complexity of a learner's hypothesis space. It also mutates training data: it measures how well the learner correlates with randomly generated labels on the training data, but it can be difficult to compute (Rosenberg & Bartlett, 2007). Different from Rademacher complexity, MV is an applied tool, not a theoretical tool. MV uses label mutations for only a part of the data. Furthermore, Rademacher complexity cares only about the training accuracy on the mutated data, whereas MV uses the accuracy changes on both the original and the mutated data.

5.2 THE USAGE OF MV IN PRACTICE

With our exploratory study, we find that MV is capable of complementing existing model validation techniques. There are two main application scenarios for MV: 1) When out-of-sample validation results are similar or unstable across different models or hyperparameters, MV can help to guide the selection process (see more in Section 4.2 and Section 4.3). 2) When the training data size is limited, MV is a better option than validation accuracy because it does not need to split out a validation set, thus reserving more data for training.

Test accuracy, despite the limitations shown in the literature and in our experiments, is still a very important method for developers to learn the generalisation ability of a model, especially when there is a data distribution shift between the training data and unseen data. However, as shown in this paper, when test accuracy is high but MV is low, there exists a simpler learner with comparable test accuracy. MV deserves more attention from developers when the simplicity, security (e.g., defending against training data attacks), and interpretability of the built models are required. Theoretically, MV can be applied in other ML tasks that rely on out-of-sample validation, such as feature selection. We will explore the effectiveness of MV in these tasks in future work.

6 CONCLUSION

We introduced an exploratory study on MV, a new approach to assessing how well a learner fits the given training data. MV validates by checking the learner's sensitivity to training label changes (expressed as label mutants). The sensitivity is captured by metamorphic relations. We show that MV is more effective and stable than the currently adopted CV, validation accuracy, and test accuracy. It is also responsive to model capacity and training data characteristics. These results provide evidence that MV complements existing model validation practices. We hope the present paper will serve as a starting point for future contributions.

REFERENCES

Arthur Asuncion and David Newman. UCI machine learning repository, 2007.

Tsong Y Chen, Shing C Cheung, and Shiu Ming Yiu. Metamorphic testing: a new approach for generating next test cases. arXiv preprint arXiv:2002.12543, 2020.

Classifier Comparison. Classifier Comparison, 2019. URL https://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html.

Vitaly Feldman, Roy Frostig, and Moritz Hardt. The advantages of multiple classes for reducing overfitting from test set reuse. In International Conference on Machine Learning, pp. 1892–1900. PMLR, 2019.

Aritra Ghosh, Himanshu Kumar, and PS Sastry. Robust loss functions under label noise for deep neural networks.
In Proceedings of the AAAI Conference on Artificial Intelligence, volume 31, 2017.

Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.

Klaus Greff, Rupesh K Srivastava, Jan Koutník, Bas R Steunebrink, and Jürgen Schmidhuber. LSTM: A search space odyssey. IEEE Transactions on Neural Networks and Learning Systems, 28(10):2222–2232, 2016.

Quentin F Gronau and Eric-Jan Wagenmakers. Limitations of Bayesian leave-one-out cross-validation for model selection. Computational Brain & Behavior, 2(1):1–11, 2019.

Douglas M Hawkins. The problem of overfitting. Journal of Chemical Information and Computer Sciences, 44(1):1–12, 2004.

Lasse Holmstrom, Petri Koistinen, et al. Using additive noise in back-propagation training. IEEE Transactions on Neural Networks, 3(1):24–38, 1992.

Yue Jia and Mark Harman. An analysis and survey of the development of mutation testing. IEEE Transactions on Software Engineering, 37(5):649–678, 2010.

Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. MIT Press, second edition, 2018.

Mike Papadakis, Marinos Kintis, Jie Zhang, Yue Jia, Yves Le Traon, and Mark Harman. Mutation testing advances: an analysis and survey. In Advances in Computers, volume 112, pp. 275–378. Elsevier, 2019.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

Hung Viet Pham, Shangshu Qian, Jiannan Wang, Thibaud Lutellier, Jonathan Rosenthal, Lin Tan, Yaoliang Yu, and Nachiappan Nagappan. Problems and opportunities in training deep learning software systems: An analysis of variance. In 2020 35th IEEE/ACM International Conference on Automated Software Engineering (ASE), pp. 771–783. IEEE, 2020.

Juho Piironen and Aki Vehtari. Comparison of Bayesian predictive methods for model selection. Statistics and Computing, 27(3):711–735, 2017.

David S Rosenberg and Peter L Bartlett. The Rademacher complexity of co-regularized kernel classes. In Artificial Intelligence and Statistics, pp. 396–403, 2007.

Scikit-learn:SVM. RBF SVM parameters, 2020. URL https://scikit-learn.org/stable/auto_examples/svm/plot_rbf_parameters.html.

Sergio Segura, Gordon Fraser, Ana B Sanchez, and Antonio Ruiz-Cortés. A survey on metamorphic testing. IEEE Transactions on Software Engineering, 42(9):805–824, 2016.

Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. International Conference on Learning Representations (ICLR), 2017.

A APPENDIX

A.1 THEORETICAL INSPIRATION

This part introduces the theory that leads to the metamorphic relation and formula that we use for conducting MV. The notations and terms we use follow the standard statistical machine learning literature (Mohri et al., 2018; Ghosh et al., 2017).

**Theoretical Metamorphic Relation for Model Validation:**

Let $\mathcal{X}$ be the feature space from which the data are drawn. Let $\mathcal{Y} = [k] = \{1, \ldots, k\}$ be the class labels.
Given a training set $S = \{(x_1, y_{x_1}), \ldots, (x_m, y_{x_m})\} \in (\mathcal{X} \times \mathcal{Y})^m$, drawn independently and identically distributed according to some fixed, but unknown, distribution $\mathcal{D}$ over $\mathcal{X} \times \mathcal{Y}$. Let a classifier be $f: \mathcal{X} \to \mathcal{C}$, $\mathcal{C} \subseteq \mathbb{R}^k$. A loss function $L$ is a map $L: \mathcal{C} \times \mathcal{Y} \to \mathbb{R}^+$. We use $\mathbb{E}$ to denote expectation. Given any loss function $L$ and a classifier $f$, we define the $L$-risk of $f$ by

$$R_L(f) = \mathbb{E}_{\mathcal{D}}[L(f(x), y_x)] = \mathbb{E}_{x, y_x}[L(f(x), y_x)]. \quad (2)$$

Let $\{(x, \hat{y}_x)\}$ be the mutated data, where

$$\hat{y}_x = \begin{cases} y_x, & \text{with probability } (1 - \eta_x) \\ j \in [k] \setminus y_x, & \text{with probability } \eta_{xj}. \end{cases}$$

For all $x$, conditioned on $y_x = i$, we have $\sum_{j \neq i} \eta_{xj} = \eta_x$. The noise is called symmetric or uniform if $\eta_x = \eta$. For uniform or symmetric noise, we also have $\eta_{xj} = \frac{\eta}{k-1}$. We consider a symmetric loss function $L$ that satisfies the following equation, where $C$ is a constant:

$$\sum_{i=1}^{k} L(f(x), i) = C, \quad \forall x \in \mathcal{X}, \ \forall f. \quad (3)$$

Let $S_\eta = \{(x_n, \hat{y}_{x_n}), n = 1, \ldots, m\}$ be the mutated training data. We call $\eta$ the mutation degree. Let $f$ be the model trained on $S$ and $f_\eta$ the model trained on $S_\eta$.[1] $f$ and $f_\eta$ come from the same learner (with an identical hypothesis space). Let $r(x) = (L(f_\eta(x), \hat{y}_x) - L(f(x), y_x))/\eta$ be the loss change rate between $f$ and $f_\eta$, $x \in \mathcal{X}$. For uniform noise, we have:

$$\mathbb{E}_{\mathcal{D}}[r(x)] = \mathbb{E}_{\mathcal{D}}\left[\frac{L(f_\eta(x), \hat{y}_x) - L(f(x), y_x)}{\eta}\right] \quad (4)$$

$$= \mathbb{E}_{\mathcal{D}}\left[\frac{(1-\eta)\,L(f_\eta(x), y_x) + \frac{\eta}{k-1}\sum_{i \neq y_x} L(f_\eta(x), i) - L(f(x), y_x)}{\eta}\right] \quad (5)$$

$$= \mathbb{E}_{\mathcal{D}}\left[\frac{1-\eta}{\eta}\,L(f_\eta(x), y_x) - \frac{1}{\eta}\,L(f(x), y_x) + \frac{1}{k-1}\sum_{i \neq y_x} L(f_\eta(x), i)\right] \quad (6)$$

$$= \frac{1-\eta}{\eta}\,R_L(f_\eta) - \frac{1}{\eta}\,R_L(f) + \frac{C - R_L(f_\eta)}{k-1} \quad (7)$$

$$= \left(\frac{1}{\eta} - 1 - \frac{1}{k-1}\right) R_L(f_\eta) - \frac{1}{\eta}\,R_L(f) + \frac{C}{k-1} \quad (8)$$

$$= \left(\frac{1}{\eta} - \frac{k}{k-1}\right) R_L(f_\eta) - \frac{1}{\eta}\,R_L(f) + \frac{C}{k-1}. \quad (9)$$

If we consider $L$ to be the error rate,[2] $C = 1$. Now consider the situation for multi-class classification problems: we mutate the labels by label swapping, replacing each selected label with the next label in the label list (the final label in the label list is replaced with the first). In this way, we have $\eta_{x_n j} = \eta$. Thus, Equation 9 becomes:

$$\mathbb{E}_{\mathcal{D}}[r(x)] = \left(\frac{1}{\eta} - 2\right) R_L(f_\eta) - \frac{1}{\eta}\,R_L(f) + 1. \quad (10)$$

Thus, we have:

$$R_L(f) = (1 - 2\eta)\,R_L(f_\eta) - \eta\,\mathbb{E}_{\mathcal{D}}[r(x)] + \eta. \quad (11)$$

Let $T(f)$ be the accuracy of $f$ over the distribution $\mathcal{D}$: $T(f) = 1 - R_L(f)$, $T(f_\eta) = 1 - R_L(f_\eta)$. Let $\hat{T}(f)$ be the empirical accuracy of $f$ on training data of size $n$: $\hat{T}(f) = \frac{1}{n}\sum_{i=1}^{n} \mathbb{1}_{f(x_i) = y_{x_i}}$, where $\mathbb{1}_w$ is the indicator function of the event $w$. We have:

$$T(f) = (1 - 2\eta)\,T(f_\eta) + \eta\,\mathbb{E}_{\mathcal{D}}[r(x)] + \eta. \quad (12)$$

[1] $f$ is the same as $f(S)$ in Equation 15.
[2] The loss function $L$ does not have to be the error rate (0-1 loss). Any loss function that satisfies Equation 3 can be applied. In the present work, we use the error rate (equivalently, accuracy) considering its popularity.

**Empirical Model Validation Measurement**

Equation 12 specifies a theoretical metamorphic relation: the mutation degree $\eta$ defines the relationship between the inputs $S$ and $S_\eta$ (under each class of $S$, an $\eta$ proportion of the labels is mutated with label swapping, see more in Section A.1, yielding $S_\eta$); such input changes lead to the expected output changes reflected by $T(f)$, $T(f_\eta)$, and the loss change rate $r(x)$ in the equation.
The calculation of such a theoretical metamorphic relation, however, is impractical, because the data distribution $\mathcal{D}$ is unknown and the expectation cannot be computed exactly. Inspired by Equation 12, and considering that our motivation is to empirically measure how well a learner fits the available training data, we replace the expectations over the data distribution $\mathcal{D}$ with empirical observations on the available training data. This leads to the following measurement, $m$, to empirically assess a learner:[3]

$$m = (1 - 2\eta)\,\hat{T}_S(f_\eta) + \eta\,\hat{r} + \eta \quad (13)$$

$$= (1 - 2\eta)\,\hat{T}_S(f_\eta) + \eta\,\frac{\hat{T}_S(f) - \hat{T}_{S_\eta}(f_\eta)}{\eta} + \eta \quad (14)$$

$$= (1 - 2\eta)\,\hat{T}_S(f_\eta) + \hat{T}_S(f) - \hat{T}_{S_\eta}(f_\eta) + \eta. \quad (15)$$

In Equation 15, $S$ is the original training data, $S_\eta$ is the mutated training data with mutation degree $\eta$ ($\eta \leq 0.5$[4]), $f$ is the model trained on the original training data, $f_\eta$ is the model trained on the mutated data, and $\hat{T}_S(f)$, $\hat{T}_S(f_\eta)$ are the accuracy of $f$ and $f_\eta$ based on the original training labels, respectively. $\hat{T}_{S_\eta}(f_\eta)$ is the accuracy of $f_\eta$ based on the mutated training labels.

[3] The purpose of $m$ is not to approximate $T(f)$, but to measure how well a learner fits the available training data. However, if the training data is sufficiently large, $m$ is expected to be close to $T(f)$.

**Connection between $m$ and Our Intuition:**

Interestingly, Equation 15 matches well with our intuition introduced in Section 2.1. In particular, if the learner is less affected by the mutated labels, the predictive behaviours of the model trained with mutated labels should be closer to those of the model trained with the original labels. This leads to a larger $\hat{T}_S(f_\eta)$ and a larger $\hat{r}$, as long as the mutation degree $\eta$ is fixed. The match between $m$ and our intuition provides extra support for the reliability of MV in validating machine learning models.

The calculation of $m$ can also be regarded as a type of mutation score (Jia & Harman, 2010; Papadakis et al., 2019) for model validation. As explained in Section 2.1, we expect that a good learner kills more mutants. However, the intuitive mutation score (i.e., the proportion of killed mutants) has a limitation in model validation: a poor learner that makes random guesses may also kill many mutants. Equation 15 covers the mutant killing results (i.e., by calculating the accuracy decrease rate $\hat{r}$), but also fixes this limitation by additionally considering the accuracy on the original correct labels (i.e., $\hat{T}_S(f_\eta)$). Thus, it can be regarded as a mutation score calculation adapted to suit the model validation scenario. The larger a learner's $m$ is, the better the learner fits the training data. Thus, we adapt the concept of a metamorphic relation to the scenario of model validation and extend it to a quantitative measurement, rather than a simple binary judgement. However, the value of $m$ can also reflect the metamorphic relation we introduced in Section 2.2: let us define that once a learner's $m$ is below a threshold (e.g., 0.8), the metamorphic relation is violated. The violation then indicates a fault in model fitting: the learner is either over-complex (with a large $\hat{T}_S(f)$) or over-simple (with a small $\hat{T}_S(f)$) for the training data.

A.2 INFLUENCE OF MUTATION DEGREE

From Equation 15, it can be seen that the calculation of MV involves a mutation degree, $\eta$, for generating the mutated training data. The value of $\eta$ needs to be fixed during the calculation. However, if Equation 15 is reliable, the influence of $\eta$ on model validation should be minor.
This section empirically explores whether this is true. The first sub-figure in Figure 7 shows the results for the UCI datasets. The second sub-figure shows the results for the three large image datasets. It reveals that for most datasets (except for the three smallest datasets, i.e., iris, wine, and heart), the values of MV remain almost identical under different mutation degrees. This is because of the constant term $\eta$ at the end of Equation 15 when calculating MV, which cancels out the decrease in the detected mutants. This observation provides further evidence for the reliability of our calculation formula shown in Equation 15. For the three very small datasets, i.e., iris, wine, and heart, with fewer than 300 data points, we observe that a larger noise degree leads to a smaller MV. This may be because label mutations have more influence on very small datasets. Nevertheless, with different mutation degrees, we observe that the effectiveness of MV in model selection and hyperparameter configuration does not change, because the relative rankings of models/hyperparameters remain unchanged. For example, as shown by the third sub-figure in Figure 7, even for the smallest dataset, iris, the recommended maximum depths for Decision Tree are identical. As with the choice of n in n-fold cross-validation, although we demonstrate that the choice of $\eta$ does not affect model validation conclusions, there may be a best practice for selecting $\eta$ under different application scenarios. We call for future work and practice to explore this.

[Figure 7: Influence of the mutation degree $\eta$ on MV, for the UCI datasets, the image datasets, and the iris dataset under different maximum depths. The results show that different $\eta$ lead to similar MV values and identical model validation conclusions.]

A.3 INFLUENCE OF TRAINING DATA SIZE ON MV

In addition to RQ4, we further investigate what happens to MV should we deliberately use much more training data than normally expected. That is, we go beyond the assumption that there is no need to increase data once the test accuracy becomes stable. As we can see from Figure 8, MV is more responsive to changes in training data size than test accuracy. For learners that are over-complex for the data, when adding more training data no longer increases test accuracy, MV continues to increase, indicating that the model's resilience to incorrect labels continues to increase. The extra training data improve the robustness of the learner to training label noise. The pattern that MV no longer changes when model complexity increases is also a signal to developers that the training data is perhaps larger than necessary.

[Figure 8: MV (first row) and test accuracy (second row) on training data of increasing size (1,000, 10,000, 100,000 points) for the moon, circle, and linear datasets, with different maximum depths (horizontal axis). 1) MV is responsive to data size changes; 2) MV no longer decreases for large depths when the training data size is sufficiently large.]
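For completeness, the sketch below shows how the data-size sweep behind Figures 6 and 8 could be scripted. It again assumes the hypothetical `mv_score` helper from Section 3; the data sizes, depths, and the use of scikit-learn's `make_moons` generator are illustrative choices, not the exact experimental configuration.

```python
# Sketch: MV as a function of training data size and Decision Tree depth
# (assumes the hypothetical mv_score helper sketched in Section 3).
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

for n in (1_000, 10_000, 100_000):
    X, y = make_moons(n_samples=n, noise=0.2, random_state=0)
    for depth in (3, 7, 11, 15, 19):
        mv = mv_score(DecisionTreeClassifier(max_depth=depth), X, y, eta=0.2)
        print(f"n={n:6d} depth={depth:2d} MV={mv:.2f}")
# Expected pattern from this appendix: for a fixed large depth, MV increases with n,
# reflecting the trained model's growing resilience to mutated labels.
```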
A.4 EFFICIENCY OF MV IN MODEL VALIDATION

In this part, we compare the efficiency of MV, CV (3-fold), and validation accuracy (for the image datasets), using the synthetic datasets with 7 classifiers (with the same setup as in RQ1), the 8 UCI datasets with a Decision Tree (with a fixed maximum depth of 5), and the 3 image datasets (with a fixed dropout rate of 0.5 and a learning rate of 0.0001). The deep learning experiments with the three image datasets were run on a Tesla V100, with 16GB memory and 61GB RAM. Table 5 shows the results. For brevity, we show only the results for the moon synthetic datasets. Overall, MV, CV, and validation accuracy all have good efficiency on these datasets. Note that in practice developers often use 5-fold or 10-fold CV, which have a larger time cost than 3-fold CV.

The cost of MV mainly comes from data mutation and model training. As observed from Table 5, obtaining MV values takes more time on larger datasets. Our results demonstrate that the cost of MV is manageable and comparable to 3-fold CV and validation accuracy in both classic learning and deep learning. In particular, for the three large image datasets, MV costs only half the time of 3-fold CV. What is more, as demonstrated by RQ1 and RQ2, the effectiveness and stability of MV help to conduct model selection and hyperparameter configuration more quickly. Thus, MV helps to save cost in selecting the best learners for a given training set. On the other hand, MV is sensitive to over-complex models, and can help to select the simplest model with reasonable test accuracy. This will also reduce the model training and maintenance cost in the long run. Overall, MV has efficiency comparable to 3-fold CV and validation accuracy. For the three deep learning tasks, MV's efficiency is double that of 3-fold CV.

| Dataset | Learner | MV-time | CV-time |
|---|---|---|---|
| moon | Linear SVM | 0.002s | 0.003s |
| moon | RBF SVM | 0.003s | 0.004s |
| moon | Gaussian Process | 0.161s | 0.157s |
| moon | Decision Tree | 0.001s | 0.002s |
| moon | Random Forest | 0.046s | 0.064s |
| moon | AdaBoost | 0.124s | 0.185s |
| moon | Naive Bayes | 0.002s | 0.003s |
| mean | – | 0.048s | 0.060s |

| Dataset | Learner | MV-time | CV-time |
|---|---|---|---|
| iris | Decision Tree | 0.021s | 0.003s |
| wine | Decision Tree | 0.004s | 0.007s |
| cancer | Decision Tree | 0.023s | 0.017s |
| car | Decision Tree | 0.011s | 0.014s |
| heart | Decision Tree | 0.004s | 0.008s |
| adult | Decision Tree | 0.386s | 0.385s |
| bank | Decision Tree | 0.118s | 0.109s |
| connect | Decision Tree | 0.195s | 0.196s |
| mean | – | 0.095s | 0.092s |

| Dataset | Learner | epoch | MV-time | CV-time | Validation-time |
|---|---|---|---|---|---|
| mnist | convolutional neural network | 10 | 2.772min | 5.296min | 1.761min |
| fashion-mnist | convolutional neural network | 10 | 2.803min | 5.352min | 1.778min |
| cifar10 | convolutional neural network | 50 | 12.052min | 24.315min | 8.072min |
| **mean** | – | – | **5.876min** | **11.654min** | **3.870min** |

Table 5: Efficiency of MV, CV, and validation accuracy. The top two sub-tables show the results for classic learning; the bottom sub-table shows the results for deep learning. We observe that the efficiency of MV is comparable to that of 3-fold CV and validation accuracy.

A.5 MODEL SELECTION WITH UCI DATASETS (RQ1)

In this part, as an extension to RQ1, we present the results of model selection using the UCI datasets. Note that we do not use test accuracy as the ground truth for model selection, considering the possible limitations of test accuracy discussed in the introduction. Instead, we present the results to demonstrate how MV differs from CV and test accuracy, as well as how it complements CV and test accuracy in model selection when developers observe similar accuracy results across different learners. Figure 9 shows the results.
It is interesting that we do have observations on MV that are consistent with common machine learning knowledge. In particular, for the four smallest datasets (i.e., iris, wine, cancer, and heart), MV suggests the two simplest learners (i.e., Linear SVM and Naive Bayes).

[Figure 9: Model selection results with UCI datasets (extended analysis for RQ1). Each panel reports MV, CV, and test accuracy (TA) for Linear SVM, RBF SVM, Decision Tree, Random Forest, AdaBoost, and Naive Bayes on one of the eight UCI datasets.]

A.6 MODEL SELECTION WITH RANDOM LABEL REPLACEMENT

In the main body of this work, we present empirical results with MV calculated using label swapping. In this part, we further explore the effectiveness of another label mutation approach: random label replacement. That is, when conducting label mutation, we replace the original label with a label that is randomly chosen from the label list. We compare the performance of these two label mutation approaches in model selection against the ground truth provided by the synthetic datasets. Figure 10 shows the results. We observe that random label replacement is less accurate than label swapping in model selection, but is still more accurate than CV and test accuracy.

[Figure 10: Performance of MV in model selection with label swapping (MV-SL) and random label replacement (MV-RL). Each panel shows the decision boundary of one classifier (Linear SVM, RBF SVM, Gaussian Process, Decision Tree, Random Forest, AdaBoost, Naive Bayes) on the zero-noise synthetic datasets, annotated with its MV-SL and MV-RL scores.]

A.7

[Figure 11: Performance of MV in suggesting hyperparameters that follow Occam's Razor on the Cancer dataset, showing test accuracy and MV as functions of the maximum depth and the number of trees. The training data has only 300 samples. The low values of MV at large depths and large numbers of trees provide warnings to developers that the hyperparameters are over-complex and violate the rule of Occam's Razor. The unnecessary complexity in the complex learner affects the interpretability of the learner, also making it vulnerable to training label attacks.]
A.8

[Figure 12: Correlation between MV, training accuracy changes, and the new training accuracy on the original labels (r = 0.91, p = 1.5e-46 for training accuracy changes; r = 0.84, p = 1.4e-32 for the new accuracy on the original labels).]