# UNDERSTANDING THE SUCCESS OF KNOWLEDGE DISTILLATION – A DATA AUGMENTATION PERSPECTIVE

**Anonymous authors** Paper under double-blind review

ABSTRACT

Knowledge distillation (KD) is a general neural network training approach that uses a teacher model to guide a student model. Many works have explored the rationale for its success. However, its interplay with data augmentation (DA) has not been well understood so far. In this paper, we are motivated by an interesting observation in classification: the KD loss can take more advantage of a DA method than the cross-entropy loss simply by training for more iterations. We present a generic framework to explain this interplay between KD and DA. Inspired by it, we enhance KD via stronger data augmentation schemes, named TLmixup and TLCutMix. Furthermore, an even stronger and more efficient DA approach is developed specifically for KD based on the idea of active learning. The findings and merits of our method are validated with extensive experiments on the CIFAR-100, Tiny ImageNet, and ImageNet datasets. We achieve new state-of-the-art accuracy by using the original KD loss armed with stronger augmentation schemes, compared to existing state-of-the-art methods that employ more advanced distillation losses. We also show that, by combining our approaches with the advanced distillation losses, we can advance the state-of-the-art even further. In addition to very promising performance, this paper importantly sheds light on explaining the success of knowledge distillation. The interaction of KD and DA methods we have discovered can inspire more powerful KD algorithms.

1 INTRODUCTION

Deep neural networks (DNNs) are the best performing machine learning method in many fields of interest (LeCun et al., 2015; Schmidhuber, 2015). How to effectively train a deep network for classification has been a central topic for decades. In the past several years, efforts have mainly focused on better architecture design (e.g., batch normalization (Ioffe & Szegedy, 2015), residual blocks (He et al., 2016), dense connections (Huang et al., 2017)) and better loss functions (e.g., label smoothing (Szegedy et al., 2016; Müller et al., 2019), contrastive loss (Hinton, 2002), large-margin softmax (Liu et al., 2016)) than the standard cross-entropy (CE) loss. Knowledge distillation (KD) (Hinton et al., 2014) is a training framework that falls into the second group. In KD, a stronger network – called the teacher – is introduced to guide the learning of the original network – called the student – by minimizing the discrepancy between the representations of the two networks,

$\mathcal{L}_{KD} = (1 - \alpha)\,\mathcal{L}_{CE}(y, p^{(s)}) + \alpha \tau^{2} D_{KL}(p^{(t)}/\tau,\ p^{(s)}/\tau)$,  (1)

where $D_{KL}$ represents the KL divergence (Kullback, 1997); $\alpha \in (0, 1)$ is a factor to balance the two loss terms; $\mathcal{L}_{CE}$ denotes the cross-entropy loss; $y$ is the one-hot label; $p^{(t)}$ and $p^{(s)}$ stand for the teacher's and the student's outputs, respectively (which are probability distributions over the classes); and $\tau$ is a temperature constant (Hinton et al., 2014) used to smooth the predicted probabilities. KD allows us to train smaller, more efficient neural networks without compromising accuracy, which facilitates deploying deep learning in resource-constrained environments (e.g., on mobile devices). The effectiveness of KD has been demonstrated in many tasks (Chen et al., 2017; Wang et al., 2020; Jiao et al., 2019; Wang & Yoon, 2021).
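To make Eq. (1) concrete, here is a minimal PyTorch sketch of the KD objective. The function and variable names are ours, chosen for illustration; this is not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, alpha=0.9, tau=4.0):
    """Vanilla KD objective of Eq. (1): (1 - alpha) * CE + alpha * tau^2 * KL."""
    # Hard-label cross-entropy term.
    ce = F.cross_entropy(student_logits, labels)
    # Temperature-smoothed distributions of the student (log-probs) and teacher (probs).
    log_p_s = F.log_softmax(student_logits / tau, dim=1)
    p_t = F.softmax(teacher_logits / tau, dim=1)
    # KL(teacher || student); tau^2 keeps the soft-target gradients on a comparable scale.
    kl = F.kl_div(log_p_s, p_t, reduction="batchmean")
    return (1.0 - alpha) * ce + alpha * (tau ** 2) * kl

# Toy usage with random tensors standing in for a batch of 100-class logits:
student_out = torch.randn(8, 100, requires_grad=True)
teacher_out = torch.randn(8, 100)
targets = torch.randint(0, 100, (8,))
kd_loss(student_out, teacher_out, targets).backward()
```

The defaults alpha=0.9 and tau=4 match the hyperparameter settings reported later in Sec. 4.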
Meanwhile, many works have investigated the reason behind its success, such as class similarity structure (Hinton et al., 2014) and regularization (Yuan et al., 2020). However, few works have paid attention to its interplay with input image data augmentation (DA), a technique to obtain more data through various transformations (Shorten & Khoshgoftaar, 2019). In this paper, we will show that data augmentation is also an important dimension to explain the success of KD. Moreover, our findings show that we can achieve much better performance simply by using the original KD loss equipped with a stronger data augmentation scheme.

Our proposed algorithms are inspired by the observations shown in Fig. 1, where we plot the student test error curves when the model is trained for different numbers of epochs using the KD loss vs. the CE loss[1]. Three data augmentation scenarios are examined: not using DA at all (Without DA), only using the horizontal flip (Flip), and using both the horizontal flip and random crop (Flip+Crop). We have the following observations. (1) Within each plot, the KD loss delivers lower test error than the CE loss. (2) When DA is used (comparing the middle or right plot to the left), both the CE and KD curves improve. (3) When DA is used (comparing the middle or right plot to the left), the optimal number of training epochs is postponed, and the postponement is greater for KD than for CE (the optimal number of epochs is postponed from 180 to 480 for KD versus from 60 to 120 for CE). (4) When a stronger DA is employed (comparing the right plot to the middle), the optimal number of epochs is further postponed with even lower test error.

Figure 1: Test error rate of ResNet20 on CIFAR-100 when trained for different numbers of epochs (the teacher is ResNet56 for KD). Three panels: Without DA, Flip, Flip+Crop; x-axis: number of total training epochs; y-axis: test error rate (%). Each result is obtained by averaging 3 random runs (shaded area indicates the std). "Flip" refers to horizontal flip; "Crop" refers to random crop. Both are standard data augmentation schemes in classification. The optimal number of training epochs and its test error are highlighted in red.

The first two observations are well recognized by existing works (they simply reiterate the effectiveness of KD and DA, respectively), while the last two observations are new discoveries of this work, concerning the _interaction between KD and DA_. In other words, KD and DA, as two common techniques to improve the performance of DNNs, are actually not independent. This paper explains the interplay between KD and DA and leverages it for stronger KD performance using only the standard KD loss.

Specifically, we explain why KD is able to exploit DA more than the CE loss. Owing to the random transformations in data augmentation, the input data are not fixed over different epochs: different _views_ of each image are presented over the training process. When the KD loss is used, the teacher maps these different views to different targets.
As illustrated in Fig. 2(a), these targets with different probability structures can reveal more information about the data, thus helping the student more. In contrast, when the CE loss is adopted, the target is fixed regardless of the different views of the input, and this extra information is lost. This observation inspired us to develop two stronger data augmentation techniques (TLmixup, TLCutMix) that are tailored for KD. We further tap into the idea of active learning to make TLCutMix even better, giving our TLCutMix+pick method. In summary, these are the main contributions of this work:

- We make novel observations of the interaction between KD and DA. We explain why DA methods are better suited to be exploited by KD, which uses teacher outputs as labels, instead of by the CE loss, which uses the ground-truth one-hot labels.
- Inspired by the above, we propose to enhance the original KD loss with stronger data augmentation schemes (by adapting mixup (Zhang et al., 2018) and CutMix (Yun et al., 2019) to KD). We show that these methods are more reasonably applied in the KD case than in the CE case.
- We further propose an even stronger data augmentation method specifically for knowledge distillation, using the idea of active learning (Settles, 2011).
- We show empirically better results simply by using the original KD loss combined with the proposed DA scheme, compared to state-of-the-art KD methods, which adopt more advanced distillation losses.

[1] Note, data points on one curve in Fig. 1 are from independent experiments with different total numbers of epochs; the learning rate schedule is proportionally scaled based on the total number of epochs.

Figure 2: Interplay between knowledge distillation (KD) and data augmentation (DA). (a) Illustration of the difference in supervised target between the KD loss and the cross-entropy (CE) loss. An input is transformed to different versions (called views in this paper) owing to data augmentation. The KD loss can provide extra information to the student by mapping these views to different targets (e.g., soft vectors such as [0.13, 0.02, 0.70, 0.15] over classes like sunset/dog/lawn), while the CE loss maps all views to the same one-hot target [0, 0, 1, 0]. (b) Illustration of KD with the proposed data augmentation framework: the raw input x0 is transformed by the standard DA (random crop and horizontal flip) into x, and by a stronger DA into x′; both are fed to the teacher and the student for the KD loss. The stronger DA refers to any data augmentation scheme more advanced than the standard one. In this paper, we propose three stronger DA schemes: TLmixup, TLCutMix, TLCutMix+pick (see Sec. 3.2 for details).

2 RELATED WORK

**Knowledge distillation:** The general idea of knowledge distillation is to guide the training of a student model through a stronger pretrained teacher model (or an ensemble of models). It was pioneered by Buciluǎ et al. (2006) and later refined by Hinton et al. (2014), who coined the term. Since its debut, knowledge distillation has seen extensive application in vision and language tasks (Chen et al., 2017; Wang et al., 2020; Jiao et al., 2019; Wang & Yoon, 2021). Many variants have been proposed regarding the central question in knowledge distillation, that is, how to define the knowledge that is supposed to be transferred from the teacher to the student.
Examples of such knowledge definitions include feature distance (Romero et al., 2015), feature map attention (Zagoruyko & Komodakis, 2017), feature distribution (Passalis & Tefas, 2018), activation boundaries (Heo et al., 2019), inter-sample distance relationships (Park et al., 2019; Peng et al., 2019; Liu et al., 2019; Tung & Mori, 2019), and mutual information (Tian et al., 2020). Over the past several years, progress has been made primarily at the output end (i.e., through a better loss function). In contrast to previous works, our goal in this paper is to improve KD performance at the input end with the help of data augmentation. We will show this path is just as effective and also has much potential for future work.

**Data augmentation:** Deep neural networks are prone to overfitting, i.e., building input-target connections using undesirable or irrelevant features (like noise) in the data. Data augmentation is a prevailing technique to curb overfitting (Shorten & Khoshgoftaar, 2019). In classification tasks, data augmentation aims to explicitly provide data with label-invariant transformations (such as random crop, horizontal flip, color jittering) during training, so that the model can learn representations robust to those nuisance factors. Recently, some advanced data augmentation methods were proposed which not only transform the input but also transform the target based on certain corresponding relations. For example, mixup (Zhang et al., 2018) linearly mixes two images, with the labels mixed by the same linear interpolation; manifold mixup (Verma et al., 2019) is similar to mixup but conducts the mix operation at the feature level instead of the pixel level; CutMix (Yun et al., 2019) pastes a patch cut from one image onto another image, with the label decided by the area ratio of the two parts (both label-mixing rules are sketched below). Since the input and target are transformed simultaneously, the key is to maintain a _semantic correspondence_ between the new input and the new target. Although these methods have been proven effective, one lingering concern is the reasoning behind them. Specifically, it is easy to come up with examples where the semantic correspondence is poorly kept (see Fig. 5 for examples on CutMix). Unlike these methods, which focus on general classification using the cross-entropy loss, our work investigates the interplay between data augmentation and the knowledge distillation loss, and the proposed new data augmentation is designed specifically for knowledge distillation.

One recent work (Das et al., 2020) also conducts an empirical study of the impact of data augmentation on knowledge distillation. However, their exploration is very different from ours: they first apply data augmentation (e.g., mixup/CutMix) to the teacher and then conduct the distillation step as usual (no extra data augmentation in this step); our investigation is the exact opposite of their setup: we train the teacher as usual (not applying mixup/CutMix), then in the distillation step we employ a more advanced data augmentation (e.g., mixup/CutMix). Interestingly, they conclude that a teacher trained with mixup/CutMix hurts the student's generalization ability, while we consistently see a performance boost by using a stronger DA in the distillation step during student training.
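For reference, here is a minimal sketch of the standard mixup and CutMix input/label mixing rules described above. This is our own illustrative implementation, not the official code of either paper.

```python
import numpy as np
import torch

def mixup(x, y_onehot, alpha=1.0):
    """mixup: convex combination of two images and of their one-hot labels."""
    lam = np.random.beta(alpha, alpha)
    perm = torch.randperm(x.size(0))
    x_mix = lam * x + (1 - lam) * x[perm]
    y_mix = lam * y_onehot + (1 - lam) * y_onehot[perm]
    return x_mix, y_mix

def cutmix(x, y_onehot, alpha=1.0):
    """CutMix: paste a random box from a shuffled image; mix labels by area ratio."""
    lam = np.random.beta(alpha, alpha)
    perm = torch.randperm(x.size(0))
    _, _, h, w = x.shape
    # Box side lengths follow sqrt(1 - lam) so the pasted area fraction is about (1 - lam).
    rh, rw = int(h * np.sqrt(1 - lam)), int(w * np.sqrt(1 - lam))
    cy, cx = np.random.randint(h), np.random.randint(w)
    y1, y2 = np.clip(cy - rh // 2, 0, h), np.clip(cy + rh // 2, 0, h)
    x1, x2 = np.clip(cx - rw // 2, 0, w), np.clip(cx + rw // 2, 0, w)
    x_mix = x.clone()
    x_mix[:, :, y1:y2, x1:x2] = x[perm][:, :, y1:y2, x1:x2]
    lam_adj = 1 - (y2 - y1) * (x2 - x1) / (h * w)  # actual area ratio after clipping
    y_mix = lam_adj * y_onehot + (1 - lam_adj) * y_onehot[perm]
    return x_mix, y_mix
```

These mixed hard labels (y_mix) are exactly what the TL variants in Sec. 3.2 discard in favor of teacher-assigned soft labels.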
3 PROPOSED METHOD

3.1 INTERPLAY OF KD, DA, AND TRAINING ITERATIONS

We first introduce a framework to explain the phenomenon that KD can exploit DA more than CE simply by training for more iterations. Training for more iterations means presenting more examples to the network. Over iterative training, the presented examples are not exactly the same among different epochs because of the random transformations of data augmentation. Different versions of an image produced by data augmentation can be regarded as multiple views of that image (Wu et al., 2018; Tian et al., 2020). We term this kind of data difference _input view diversity_.

As depicted in Fig. 2(a), when the CE loss is employed, different views of an image are mapped to a single point in the target space (the hard label). In contrast, when the KD loss is used, different views of the data are mapped to a group of points in the target space through the teacher, which can reveal richer information around that class. By richer, we specifically mean two sources. First is the class structure information provided by the soft labels instead of the hard labels, i.e., the well-known dark knowledge (Hinton et al., 2014). Second is the information provided by the different input views from data augmentation. Concretely, in Fig. 2(a), although the three views of the input share the same main class "dog", the target probability vectors are different: compared with the first view, the second one contains more lawn, so the teacher assigns a larger predicted probability to the "lawn" class; similarly, the third view has more sunshine, hence a larger probability for the "sunset" class. These subtle changes in class-wise probabilities are beneficial for the student to learn. However, if the CE loss is used, all three views are mapped to the same one-hot label, and the extra information is not put to good use.

More training iterations keep producing new data views for the student, allowing it to afford more training epochs without overfitting. In contrast, for CE, as the target is fixed for different views, little new information is added by longer training, so the student can only afford fewer extra epochs. Denote the optimal number of training epochs as N and the lowest test error as E. The synergistic interplay between KD and DA can be summarized in the following hypotheses:

$N_{KD}^{(DA)} - N_{KD}^{(w/o\ DA)} > N_{CE}^{(DA)} - N_{CE}^{(w/o\ DA)}$,  (2)
$N_{KD}^{(DA+)} > N_{KD}^{(DA)}$ and $E_{KD}^{(DA+)} < E_{KD}^{(DA)}$,

where "DA" refers to a data augmentation method, "DA+" refers to another data augmentation method stronger than "DA", and "w/o DA" means not using any data augmentation. These inequalities are empirically verified in our experiments (Fig. 1, Fig. 3). The first inequality suggests the advantage of KD over CE: given the same DA scheme, the KD loss can make more use of extra training iterations. The second suggests we can obtain better accuracy simply by training for more epochs with a stronger DA method. This leads us to investigate stronger DA methods for KD in Sec. 3.2.
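To make the notion of input view diversity concrete, below is a small sketch that generates several randomly augmented views of one image with the standard DA used in this paper and records the teacher's soft prediction for each view; the CE loss would instead reuse a single fixed one-hot label for all of them. The checkpoint and image names in the commented usage are placeholders.

```python
import torch
import torch.nn.functional as F
from torchvision import transforms

# Standard DA in this paper: random crop (with padding) + horizontal flip.
standard_da = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

@torch.no_grad()
def teacher_targets_for_views(teacher, pil_img, num_views=3, tau=4.0):
    """Each random view gets its own soft target from the teacher (the KD case)."""
    teacher.eval()
    targets = []
    for _ in range(num_views):
        view = standard_da(pil_img).unsqueeze(0)               # a different view each call
        targets.append(F.softmax(teacher(view) / tau, dim=1))  # per-view soft target
    return targets

# Hypothetical usage:
# teacher = torch.load("resnet56_cifar100_teacher.pt")
# soft_targets = teacher_targets_for_views(teacher, Image.open("dog.png").convert("RGB"))
```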
Figure 3: Test error rate of WRN-16-2 and VGG8 on CIFAR-100 when trained for different numbers of epochs, using the KD or cross-entropy (CE) loss, with or without data augmentation (DA). Panels: (a) WRN-16-2 (teacher: WRN-40-2) and (b) VGG8 (teacher: VGG13), each under the Without DA, Flip, and Flip+Crop settings; x-axis: number of total training epochs; y-axis: test error rate (%). Every error rate is averaged over 3 random runs (shaded area indicates the std). Consistent with Fig. 1, when DA is used, the optimal number of epochs is postponed, and postponed more for KD than for CE. When a stronger DA is used, the optimal number of epochs is postponed even further with a smaller optimal test error.

3.2 PROPOSED ALGORITHMS FOR IMPROVED KD

**(1) KD+TLmixup/TLCutMix.** We continue our exploration with two existing data augmentation techniques that are more advanced than the standard random crop and flip: mixup (Zhang et al., 2018) and CutMix (Yun et al., 2019). They were initially proposed for the CE case; here we upgrade them for KD, resulting in TLmixup and TLCutMix (TL is short for teacher-labeled). Specifically, let x0 denote the raw data and x the data transformed by the standard augmentation (random crop and flip). As illustrated in Fig. 2(b), we propose to apply mixup/CutMix on top of x to obtain x′. Unlike common data augmentation, where only the transformed input is fed into the network, we keep both x and x′ for training (as such, the number of input examples during training is increased). The consideration of keeping both inputs is to maintain the information path for the original input x, so that we can easily see how the added information path of x′ makes a difference. For x, the loss is still the original KD loss, consisting of the cross-entropy loss and the KL divergence (Eq. 1). Of special note is that, for x′, the loss is only the KL divergence, i.e., we do not use the labels assigned by mixup or CutMix, because they can be misleading and do not perform well, as we will show later (Tab. 1, Fig. 5). In fact, not using the hard label has another bonus. A data augmentation scheme that employs the CE loss has to provide corresponding labels as supervisory information; in order to maintain the semantic correspondence, it cannot admit very extreme transformations. In contrast, in the mixup/CutMix+KD setting described above, the data augmentation scheme need not worry about the labels, as they are assigned by the teacher. Therefore, it admits a broader set of transformations that expose the teacher's knowledge more completely. Between TLmixup and TLCutMix, we will empirically show that TLCutMix is more favorable (Tab. 1), and both are significantly better than the standard augmentation, random crop and flip. Therefore, we choose TLCutMix as the base to develop our next algorithm.
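Before turning to the data-picking variant, here is a minimal sketch of how a KD+TLCutMix training step could be assembled under our reading of the method. It reuses the cutmix and kd_loss helpers sketched earlier; combining the two loss terms by a plain sum is our assumption, and the CIFAR-100 class count is only a default.

```python
import torch
import torch.nn.functional as F

def tl_cutmix_step(student, teacher, x, y, num_classes=100, alpha=0.9, tau=4.0):
    """One KD+TLCutMix step: x is a standard-DA batch, y its hard labels.

    Relies on the cutmix() and kd_loss() helpers sketched earlier.
    """
    # Build the extra, more strongly augmented batch x'. Its CutMix-mixed hard
    # labels are discarded; only the teacher's soft output will supervise it.
    y_onehot = F.one_hot(y, num_classes).float()
    x_prime, _ = cutmix(x, y_onehot)

    with torch.no_grad():
        t_logits, t_logits_prime = teacher(x), teacher(x_prime)
    s_logits, s_logits_prime = student(x), student(x_prime)

    # Original input x: full KD loss (CE + KL), as in Eq. (1).
    loss_x = kd_loss(s_logits, t_logits, y, alpha=alpha, tau=tau)
    # Augmented input x': KL term only (alpha = 1), i.e., teacher-labeled, no hard-label CE.
    loss_x_prime = kd_loss(s_logits_prime, t_logits_prime, y, alpha=1.0, tau=tau)
    return loss_x + loss_x_prime
```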
**(2) KD+TLCutMix+pick.** Our next algorithm is an even stronger DA scheme tailored to KD, based on the idea of active learning (Settles, 2011). In active learning, the learner enjoys the freedom to query the data instances to be labeled for training by an oracle (i.e., the teacher in our case) (Settles, 2011). Since the augmented data can vary in quality, we can introduce a criterion to pick the more valuable data for the student. We tap into the idea of hard examples (Micaelli & Storkey, 2019) to define the criterion. Specifically, we measure the hardness by the KL divergence between the teacher's output and the student's output ($p^{(t)}$ and $p^{(s)}$ are the teacher and student output probabilities over classes),

$d = D_{KL}(p^{(t)}/\tau,\ p^{(s)}/\tau)$.  (3)

We sort the augmented samples by their d values and pick the subset with the largest d's. Notably, the criterion d has exactly the same form as the term the student is supposed to minimize in Eq. (1), while here we pick samples to maximize it. This design creates an adaptive competition: when the student is updated, the criterion made of the KL divergence is also updated, so each time the hardest samples for the current student are selected. Other common choices for the criterion include the teacher's entropy or the student's entropy (larger entropy implies more uncertainty, meaning the sample is harder). However, these only take into account one-sided information, either the teacher's or the student's. Conceivably, they are not as good as the KL divergence criterion, which considers information from both sides. This choice will be empirically justified in our experiments (Tab. 2).
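Operationally, the picking step can be sketched as follows: compute the per-sample KL divergence of Eq. (3) on the strongly augmented batch and keep the hardest half (the paper later notes N_p = N/2 in its experiments). The helper below is illustrative, not the authors' code.

```python
import torch
import torch.nn.functional as F

def pick_hardest(student_logits, teacher_logits, keep_ratio=0.5, tau=4.0):
    """Select augmented samples with the largest teacher-student KL divergence (Eq. (3))."""
    log_p_s = F.log_softmax(student_logits / tau, dim=1)
    p_t = F.softmax(teacher_logits / tau, dim=1)
    # Per-sample KL(teacher || student): sum over classes, no batch reduction.
    d = F.kl_div(log_p_s, p_t, reduction="none").sum(dim=1)
    num_keep = max(1, int(keep_ratio * d.size(0)))
    idx = torch.topk(d, num_keep).indices  # hardest samples for the current student
    return idx, d

# During training, only the picked subset x_prime[idx] (with its teacher logits)
# would enter the KL term of the x' branch.
```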
4 EXPERIMENTAL RESULTS

Table 1: KD test accuracy comparison on CIFAR-100 when using different DA schemes. Each experiment is run 3 times and the mean and (standard deviation) are reported. Default DA: random crop and flip. Note (1) KD+CutMix is much worse than KD alone on ResNet56/ResNet20; (2) KD+TLCutMix consistently outperforms KD+CutMix on all 3 pairs.

| Teacher | WRN-40-2 | ResNet56 | VGG13 |
|---|---|---|---|
| Student | WRN-16-2 | ResNet20 | VGG8 |
| Teacher Acc. | 75.61 | 72.34 | 74.64 |
| Student Acc. | 73.26 | 69.06 | 70.36 |
| KD (default DA) | 74.92 (0.28) | 70.66 (0.24) | 72.98 (0.19) |
| KD+TLmixup | 75.33 (0.07) | 71.00 (0.16) | 73.79 (0.18) |
| KD+TLCutMix | **75.34 (0.19)** | **70.77 (0.17)** | **74.16 (0.18)** |
| KD+CutMix | 75.25 (0.13) | 69.76 (0.19) | 73.75 (0.16) |

Table 2: KD test accuracy comparison on CIFAR-100 when using different data picking schemes for the proposed "KD+TLCutMix+pick" method. "T/S" refers to teacher/student, "ent." is short for entropy, and "kld" stands for KL divergence. The mean and (std) of 3 runs are reported for each entry.

| Teacher | WRN-40-2 | ResNet56 | VGG13 |
|---|---|---|---|
| Student | WRN-16-2 | ResNet20 | VGG8 |
| Teacher Acc. | 75.61 | 72.34 | 74.64 |
| Student Acc. | 73.26 | 69.06 | 70.36 |
| TLCutMix | 75.34 (0.19) | 70.77 (0.17) | 74.16 (0.18) |
| +Pick (T ent.) | 75.46 (0.07) | 70.88 (0.10) | 74.16 (0.18) |
| +Pick (S ent.) | 75.52 (0.06) | 70.84 (0.12) | 74.16 (0.48) |
| +Pick (T/S kld) | 75.59 (0.22) | 70.99 (0.20) | 74.43 (0.20) |

**Datasets and networks.** We evaluate our method on the CIFAR-100 (Krizhevsky, 2009), Tiny ImageNet[2], and ImageNet (Deng et al., 2009) object recognition datasets. CIFAR-100 has 100 object classes (32×32 RGB images); each class has 500 images for training and 100 images for testing. ImageNet is the standard large-scale benchmark dataset in image classification, with 1000 classes (224×224 RGB images) and over 1.2 million images in total. Tiny ImageNet is a small version of ImageNet with 200 classes (64×64 RGB images); each class has 500 images for training, 50 for validation, and 50 for testing. To thoroughly evaluate our methods, we benchmark them on various standard network architectures: VGG (Simonyan & Zisserman, 2015), ResNet (He et al., 2016), WRN (Wide-ResNet) (Zagoruyko & Komodakis, 2016), MobileNetV2 (Sandler et al., 2018), and ShuffleNetV2 (Ma et al., 2018). Our code and trained models will be made publicly available.

[2] https://tiny-imagenet.herokuapp.com/

**Evaluated methods.** In addition to the standard cross-entropy training and the original KD method (Hinton et al., 2014), we also compare with the state-of-the-art distillation approach Contrastive Representation Distillation (CRD) (Tian et al., 2020). It is important to note that our method focuses on improving KD by using better inputs, while CRD improves KD at the output end (i.e., a better loss function). Therefore, they are orthogonal, and we will show they can be combined to deliver even better results.

**Hyperparameter settings.** The temperature τ of knowledge distillation is set to 4. The loss weight α = 0.9 (Eq. 1). (1) For CIFAR-100 and Tiny ImageNet, the training batch size is 64; the original number of total training epochs is 240, with the learning rate decayed at epochs 150, 180, and 210 by a multiplier of 1/10. The initial learning rate is 0.05. (2) For ImageNet, the training batch size is 256; the original number of training epochs is 100, with the learning rate decayed at epochs 30, 60, and 90. The initial learning rate is 0.1. All these settings are the same as CRD (Tian et al., 2020) for a fair comparison. Note that in our experiments we will show the results of more training iterations: if the number of total epochs is scaled by a factor k, the epochs at which the learning rate is decayed are also scaled by k. For example, if we train a network on CIFAR-100 for 480 epochs (k = 2) in total, the learning rate will be decayed at epochs 300, 360, and 420. We use PyTorch (Paszke et al., 2019) for all our experiments. For CIFAR-100, we adopt the pretrained teacher models from CRD (https://github.com/HobbitLong/RepDistiller) for fair comparison with it. For Tiny ImageNet, we train our own teacher models. For ImageNet, we adopt the standard torchvision models as teachers.
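A small sketch of the epoch-scaling rule just described; the milestone values follow the settings above, while the helper itself and the optimizer/scheduler names in the comment are illustrative.

```python
def scaled_milestones(total_epochs, base_epochs=240, base_milestones=(150, 180, 210)):
    """Scale the LR-decay milestones proportionally with the total epoch budget."""
    k = total_epochs / base_epochs
    return [int(m * k) for m in base_milestones]

print(scaled_milestones(480))  # [300, 360, 420] for a 480-epoch CIFAR-100 run (k = 2)
# With PyTorch, this would typically feed a MultiStepLR scheduler, e.g.:
# scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=scaled_milestones(480), gamma=0.1)
```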
4.1 CIFAR-100

**Effect of more training iterations.** In Sec. 1, we presented Fig. 1 to show the advantage of the KD loss over the CE loss in exploiting extra epochs. Here we show more results in Fig. 3 on different network architectures to confirm that the finding is general. In line with the ResNet case (Fig. 1), extra training also brings more performance gains with the KD loss on WRN and VGG. The size of the gains depends on the particular pair, but the trends are consistent: when DA is used, the optimal number of training epochs is higher, and even more so for KD than for CE; when a stronger DA is employed, the optimal number of epochs is further significantly higher and produces lower test error. These results support the proposed hypotheses in Eq. (2).

Table 3: Student test accuracy comparison on CIFAR-100. Each result is obtained by 3 random runs, mean (std) accuracy reported. The best results are in bold. The subscript 960 means the total number of training epochs (default: 240).

| Teacher | WRN-40-2 | ResNet56 | ResNet32x4 | VGG13 | VGG13 | ResNet50 | ResNet32x4 |
|---|---|---|---|---|---|---|---|
| Student | WRN-16-2 | ResNet20 | ResNet8x4 | VGG8 | MobileNetV2 | VGG8 | ShuffleNetV2 |
| Teacher Acc. | 75.61 | 72.34 | 79.42 | 74.64 | 74.64 | 79.34 | 79.42 |
| Student Acc. | 73.26 | 69.06 | 72.50 | 70.36 | 64.60 | 70.36 | 71.82 |
| KD (Hinton et al., 2014) | 74.92 (0.28) | 70.66 (0.24) | 73.33 (0.25) | 72.98 (0.19) | 67.37 (0.32) | 73.81 (0.13) | 74.45 (0.27) |
| KD960 (Hinton et al., 2014) | 75.68 (0.12) | 71.79 (0.29) | 73.14 (0.06) | 74.00 (0.34) | 68.77 (0.05) | 74.04 (0.25) | 74.64 (0.30) |
| **KD+TLCutMix** | 75.34 (0.19) | 70.77 (0.17) | 74.91 (0.20) | 74.16 (0.18) | 68.79 (0.35) | 74.85 (0.23) | 76.61 (0.18) |
| **KD+TLCutMix+pick** | 75.59 (0.22) | 70.99 (0.20) | 74.78 (0.35) | 74.43 (0.20) | 69.49 (0.32) | 74.95 (0.18) | 76.90 (0.25) |
| **KD+TLCutMix+pick960** | **76.41 (0.10)** | **71.66 (0.15)** | **75.12 (0.18)** | **75.00 (0.17)** | **70.47 (0.12)** | **76.13 (0.16)** | **77.90 (0.30)** |
| CRD (Tian et al., 2020) | 75.64 (0.21) | 71.63 (0.15) | 75.46 (0.25) | 74.29 (0.12) | 69.94 (0.05) | 74.58 (0.27) | 76.05 (0.09) |
| CRD+TLCutMix+pick | 75.96 (0.27) | 71.41 (0.26) | 76.11 (0.53) | 74.65 (0.12) | 69.95 (0.22) | 75.35 (0.22) | 76.93 (0.11) |
| CRD+TLCutMix+pick960 | 76.61 (0.01) | 72.40 (0.20) | 75.96 (0.29) | 75.41 (0.10) | 70.84 (0.05) | 76.20 (0.22) | 78.51 (0.27) |

Table 4: Student test accuracy comparison on Tiny ImageNet. Each result is obtained by 3 random runs, mean (std) accuracy reported. The best results are in bold. The subscript 480 means the total number of training epochs (default: 240).

| Teacher | WRN-40-2 | ResNet56 | ResNet32x4 | VGG13 | VGG13 | ResNet50 | ResNet32x4 |
|---|---|---|---|---|---|---|---|
| Student | WRN-16-2 | ResNet20 | ResNet8x4 | VGG8 | MobileNetV2 | VGG8 | ShuffleNetV2 |
| Teacher Acc. | 61.28 | 58.37 | 64.41 | 62.59 | 62.59 | 68.20 | 64.41 |
| Student Acc. | 58.23 | 52.53 | 55.41 | 56.67 | 58.20 | 56.67 | 62.07 |
| KD (Hinton et al., 2014) | 58.65 (0.09) | 53.58 (0.18) | 55.67 (0.09) | 61.48 (0.36) | 59.28 (0.13) | 60.39 (0.16) | 66.34 (0.11) |
| KD480 (Hinton et al., 2014) | 59.20 (0.30) | 54.23 (0.24) | 55.49 (0.11) | 61.72 (0.10) | 59.27 (0.08) | 60.10 (0.30) | 65.81 (0.11) |
| **KD+TLCutMix** | 59.06 (0.18) | 53.77 (0.33) | 56.41 (0.04) | 62.17 (0.11) | 60.48 (0.30) | 61.12 (0.18) | 67.01 (0.30) |
| **KD+TLCutMix+pick** | 59.22 (0.05) | 53.66 (0.05) | 56.82 (0.23) | 62.32 (0.18) | 60.53 (0.18) | 61.40 (0.26) | 67.08 (0.13) |
| **KD+TLCutMix+pick480** | **60.07 (0.04)** | **54.25 (0.07)** | **57.54 (0.23)** | **62.60 (0.25)** | **60.66 (0.15)** | **61.95 (0.14)** | **67.35 (0.21)** |
| CRD (Tian et al., 2020) | 60.79 (0.24) | 55.34 (0.02) | 59.28 (0.13) | 62.92 (0.31) | 62.38 (0.19) | 62.03 (0.16) | 67.33 (0.13) |
| CRD+TLCutMix+pick | 60.72 (0.09) | 54.99 (0.16) | 59.65 (0.24) | 63.39 (0.10) | 62.54 (0.22) | 62.85 (0.18) | 67.64 (0.18) |
| CRD+TLCutMix+pick480 | 60.99 (0.33) | 55.68 (0.22) | 60.13 (0.13) | 63.60 (0.20) | 62.79 (0.03) | 62.60 (0.17) | 67.70 (0.35) |

**Exploring different data augmentation schemes.** In Tab. 1 we compare three different DA schemes on CIFAR-100: the default, TLmixup, and TLCutMix. It has been shown in the original papers of mixup and CutMix that they improve accuracy over the standard data augmentation _in the CE case_. However, this does not mean that naively combining KD with CutMix/mixup as-is always brings a performance improvement. As seen, CutMix is actually at odds with KD on the ResNet56/ResNet20 pair, while our TLCutMix consistently improves the performance on all three pairs. On the other pairs, the original CutMix is also not as effective as our adapted TLCutMix. These results confirm that using the teacher's output to supervise the augmented data is critical.
TLCutMix is better than TLmixup in general, so we choose it as the base to develop TLCutMix+pick.

**Exploring different data picking schemes.** In Tab. 2, we compare three potential schemes for selecting more informative data for the student: the entropy of the teacher's output ("T ent."), the entropy of the student's output ("S ent."), and the KL divergence between the teacher's and student's outputs ("T/S kld"). As shown, the KL divergence scheme performs best. This is expected, as either the teacher entropy or the student entropy alone does not reveal the whole picture.

**Benchmark on CIFAR-100.** The results are shown in Tab. 3. We have the following observations. (1) KD can be improved by training for more iterations (960 epochs vs. 240), owing to the effect of data augmentation (the only exception is ResNet32x4/ResNet8x4). This is not true for CE alone. This is a novel observation, showing that the optimal number of training epochs for KD with DA is significantly different from that of CE. (2) Comparing the row "KD+TLCutMix" to "KD", we see the proposed TLCutMix scheme improves the accuracy of all teacher-student pairs. On 5 out of the 7 pairs, the improvement is very significant (more than 1 percentage point). (3) Comparing the row "KD+TLCutMix+pick" to "KD+TLCutMix", 6 out of the 7 pairs are improved further, showing the proposed data picking scheme works in most cases. (4) Finally, the "KD+TLCutMix+pick" scheme can be combined with more training iterations, which delivers even higher accuracy. (5) Comparing our best results (KD+TLCutMix+pick960) with those of CRD (though this is not an apples-to-apples comparison, since the two methods focus on different aspects of improving KD), our approach outperforms CRD on 6 out of the 7 pairs. It is worth emphasizing that we achieve this simply using the original KD loss (Hinton et al., 2014), with no bells and whistles. This justifies one of our motivations in this paper: existing KD methods (Peng et al., 2019; Park et al., 2019; Tian et al., 2020) mainly improve KD at the output end through better loss functions, while we propose to improve KD at the input end and show this path is just as promising.

In the last two rows of Tab. 3, when CRD (Tian et al., 2020), the state-of-the-art KD algorithm, is armed with our proposed "TLCutMix+pick" and more training iterations, its results are further advanced consistently. This demonstrates that the proposed schemes are general and can readily work with methods focusing on better KD loss functions. In Appendix Tab. 9, we present the results of applying TLCutMix to another five KD methods. All of the evaluated pairs see accuracy gains; half of them are even improved by more than 1 percentage point.

**Further remarks.** Observation (1) above has another implication for the community beyond improving the performance of KD. It tells us that the number of training iterations can have a _big_ impact on the performance of a KD method. Unaware of this issue, if the authors of a KD paper compare their method to others by directly citing numbers from other papers and the training epochs happen to differ, then the comparison may well be unintentionally unfair from the beginning.

4.2 TINY IMAGENET

In this section we evaluate the proposed schemes on a more challenging dataset – Tiny ImageNet. Similar to the CIFAR-100 case, we report results on different teacher-student pairs in Tab. 4. For more training iterations, we train for 480 epochs instead of 960 to save time.
Most claims made on the CIFAR-100 dataset are also validated here: (1) "KD+TLCutMix" is better than KD, which is verified on all pairs. (2) "KD+TLCutMix+pick" is better than "KD+TLCutMix", verified on 6 pairs; the exception is ResNet56/ResNet20, where adding data picking decreases the accuracy slightly, by 0.11%. (3) When "KD+TLCutMix+pick" is equipped with more training iterations, we obtain the best performance.

The main difference from the CIFAR-100 results lies in the comparison between "KD480" and "KD". In the CIFAR-100 case with standard augmentation, more training iterations consistently improve the accuracy on 6 pairs, while here only 3 are improved. We believe this is because the standard DA scheme – random crop and horizontal flip – cannot produce diverse enough data on this more challenging dataset (Tiny ImageNet vs. CIFAR-100). In contrast, using the stronger DA scheme (TLCutMix+pick) and more training iterations does show significant improvement in all 7 cases.

We also evaluate the compatibility of our DA methods with the state-of-the-art CRD, shown in the last two rows of Tab. 4. Our "TLCutMix+pick" method further advances the prior state-of-the-art on 5 pairs. When CRD+TLCutMix+pick is trained for 480 epochs (instead of 240), further improvement is seen on 6 of the 7 pairs.

4.3 IMAGENET

We further evaluate our methods on the ImageNet dataset, shown in Tab. 5. "KD+TLCutMix" improves the original KD (from 70.66 to 71.05 in top-1 accuracy). Adding data picking does not help here; possible reasons will be analyzed later. When the student is trained for 200 epochs with KD+TLCutMix+pick, it delivers the new state-of-the-art top-1 performance.

We also present the result of the original KD trained for 200 epochs. Interestingly, it matches the previous state-of-the-art method CRD and beats many other KD methods _without any additional loss terms_. However, this is not an apples-to-apples comparison, as those methods are trained for 100 epochs. Yet it is a clear indication that the interplay between KD and DA is useful even on a large-scale dataset.

Table 5: Top-1 and Top-5 accuracy (%) of the student ResNet18 on the ImageNet validation set. The subscript 200 indicates the total number of training epochs is 200 (the original is 100).

| | Top-1 acc. | Top-5 acc. |
|---|---|---|
| Teacher (ResNet34) | 73.31 | 91.42 |
| Student (ResNet18) | 69.75 | 89.97 |
| KD (Hinton et al., 2014) | 70.66 | 89.88 |
| SP (Tung & Mori, 2019) | 70.62 | 89.80 |
| AT (Zagoruyko & Komodakis, 2017) | 70.70 | 90.00 |
| CRD (Tian et al., 2020) | 71.38 | 90.49 |
| KD200 (Hinton et al., 2014) | 71.38 | **90.59** |
| **KD+TLCutMix** | 71.05 | 90.36 |
| **KD+TLCutMix+pick** | 70.78 | 90.04 |
| **KD+TLCutMix+pick200** | **71.76** | 90.58 |

Figure 4: Mean KL divergence ratio r (Eq. (4)) over iterations on different datasets, for the pairs VGG13/VGG8, WRN-40-2/WRN-16-2, and Res56/Res20 on CIFAR-100 and Tiny ImageNet, and Res34/Res18 on ImageNet. The iterations are normalized into the range [0, 1] for easy comparison, since the total numbers of iterations differ across the 3 datasets.

**Cross-dataset analysis.** Here we investigate how the proposed KD+TLCutMix+pick method is affected by the size and nature of the dataset. The ResNet teacher-student pairs Res56/Res20 and Res34/Res18 are of particular interest, as the boost in performance for these pairs is lower than for other network architectures. The picking scheme is proposed based on the idea of active learning (Sec. 3.2).
Intuitively, it can work only if the picked data carry more information for the student network than randomly presented data. Since we adopt the KL divergence between the teacher's output and the student's output to measure the amount of information in the input data, we can compare this metric on two different sets of data, i.e., picked randomly vs. picked based on KL divergence. Specifically, we define the _average KL divergence ratio_

$r = \big(\tfrac{1}{N_p}\sum_{i \in \mathrm{picked}} d_i\big) \,\big/\, \big(\tfrac{1}{N}\sum_{j} d_j\big)$,  (4)

where $d_i$ stands for the KL divergence of the i-th sample defined in Eq. (3), $N$ denotes the total number of samples in a batch, and $N_p$ denotes the number of samples picked based on KL divergence ($N_p = N/2$ in our experiments); note that r > 1. A larger r means the picked samples carry more information than the average samples. We then compare r on different datasets over the training process of "KD+TLCutMix+pick". Results are shown in Fig. 4. As seen, in terms of r, CIFAR-100 > Tiny ImageNet > ImageNet on average; meanwhile, comparing the results on CIFAR-100 (Tab. 3), Tiny ImageNet (Tab. 4), and ImageNet (Tab. 5), we see the accuracy gains brought by data picking show the same trend, CIFAR-100 > Tiny ImageNet > ImageNet, in accordance with our expectation. This validates the soundness of the metric r we introduced. The r on ImageNet is clearly lower than on the other two datasets, meaning there is no significant information difference between the picked data and the average data, which may well explain the under-performance of the data picking scheme on ImageNet. Note that the root cause of this problem actually lies in the data augmentation part – since it cannot produce more informative samples, the subsequent data picking has no room to add value. How to obtain an even stronger scheme than TLCutMix remains elusive for now, and we will investigate it as part of our future work. Also note that Res56/Res20 delivers the lowest r among the three pairs, which likely explains why the picking scheme is especially ineffective on the original ResNet pairs (Res56/Res20, Res34/Res18).

5 CONCLUSION

We carefully investigate the interplay between knowledge distillation (KD) and data augmentation (DA) in this paper. Unlike the cross-entropy loss, KD can exploit DA further by training for more epochs. The proposed input view diversity framework explains this interplay well and inspires us to develop three new data augmentation methods specifically for KD. Extensive experiments demonstrate the merits of our methods across various networks on the CIFAR-100, Tiny ImageNet, and ImageNet datasets. Our method achieves the new state-of-the-art using only the vanilla KD loss, with no bells and whistles, showing the potential of improving KD from the input side rather than through a better KD loss function. Our paper can also help the community build a more standard benchmark of KD algorithms, paying particular attention to the DA schemes and the number of training epochs.

REFERENCES

Sungsoo Ahn, Shell Xu Hu, Andreas Damianou, Neil D Lawrence, and Zhenwen Dai. Variational information distillation for knowledge transfer. In CVPR, 2019.

Cristian Buciluǎ, Rich Caruana, and Alexandru Niculescu-Mizil. Model compression. In SIGKDD, 2006.

Guobin Chen, Wongun Choi, Xiang Yu, Tony Han, and Manmohan Chandraker. Learning efficient object detection models with knowledge distillation. In NeurIPS, 2017.

Deepan Das, Haley Massa, Abhimanyu Kulkarni, and Theodoros Rekatsinas. An empirical analysis of the impact of data augmentation on knowledge distillation. arXiv preprint arXiv:2006.03810, 2020.
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, 2009.

Terrance DeVries and Graham W Taylor. Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552, 2017.

K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.

Byeongho Heo, Minsik Lee, Sangdoo Yun, and Jin Young Choi. Knowledge transfer via distillation of activation boundaries formed by hidden neurons. In AAAI, 2019.

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. In NeurIPS Workshop, 2014.

Geoffrey E Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771–1800, 2002.

Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In CVPR, 2017.

Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.

Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu. Tinybert: Distilling bert for natural language understanding. arXiv preprint arXiv:1909.10351, 2019.

Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.

Solomon Kullback. Information theory and statistics. Courier Corporation, 1997.

Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436, 2015.

Weiyang Liu, Yandong Wen, Zhiding Yu, and Meng Yang. Large-margin softmax loss for convolutional neural networks. In ICML, 2016.

Yufan Liu, Jiajiong Cao, Bing Li, Chunfeng Yuan, Weiming Hu, Yangxi Li, and Yunqiang Duan. Knowledge distillation via instance relationship graph. In CVPR, 2019.

Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. Shufflenet v2: Practical guidelines for efficient cnn architecture design. In ECCV, 2018.

Paul Micaelli and Amos Storkey. Zero-shot knowledge transfer via adversarial belief matching. arXiv preprint arXiv:1905.09768, 2019.

Rafael Müller, Simon Kornblith, and Geoffrey E Hinton. When does label smoothing help? In NeurIPS, 2019.

Wonpyo Park, Dongju Kim, Yan Lu, and Minsu Cho. Relational knowledge distillation. In CVPR, 2019.

Nikolaos Passalis and Anastasios Tefas. Learning deep representations with probabilistic knowledge transfer. In ECCV, 2018.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. In NeurIPS, 2019.

Baoyun Peng, Xiao Jin, Jiaheng Liu, Dongsheng Li, Yichao Wu, Yu Liu, Shunfeng Zhou, and Zhaoning Zhang. Correlation congruence for knowledge distillation. In ICCV, 2019.

Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. Fitnets: Hints for thin deep nets. In ICLR, 2015.

Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In CVPR, 2018.

Jürgen Schmidhuber. Deep learning in neural networks: An overview. Neural Networks, 61:85–117, 2015.

Burr Settles. From theories to queries: Active learning in practice. In AISTATS Workshop on Active Learning and Experimental Design, 2011.
Connor Shorten and Taghi M Khoshgoftaar. A survey on image data augmentation for deep learning. Journal of Big Data, 6(1):60, 2019.

Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.

Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In CVPR, 2016.

Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive representation distillation. In ICLR, 2020.

Frederick Tung and Greg Mori. Similarity-preserving knowledge distillation. In CVPR, 2019.

Vikas Verma, Alex Lamb, Christopher Beckham, Amir Najafi, Ioannis Mitliagkas, David Lopez-Paz, and Yoshua Bengio. Manifold mixup: Better representations by interpolating hidden states. In ICML, 2019.

Huan Wang, Yijun Li, Yuehai Wang, Haoji Hu, and Ming-Hsuan Yang. Collaborative distillation for ultra-resolution universal style transfer. In CVPR, 2020.

Lin Wang and Kuk-Jin Yoon. Knowledge distillation and student-teacher learning for visual intelligence: A review and new outlooks. TPAMI, 2021.

Zhirong Wu, Yuanjun Xiong, Stella X Yu, and Dahua Lin. Unsupervised feature learning via non-parametric instance discrimination. In CVPR, 2018.

Li Yuan, Francis EH Tay, Guilin Li, Tao Wang, and Jiashi Feng. Revisiting knowledge distillation via label smoothing regularization. In CVPR, 2020.

Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, and Youngjoon Yoo. Cutmix: Regularization strategy to train strong classifiers with localizable features. In ICCV, 2019.

Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In BMVC, 2016.

Sergey Zagoruyko and Nikos Komodakis. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. In ICLR, 2017.

Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. In ICLR, 2018.

A NETWORK PARAMETERS AND FLOPS

The number of parameters and FLOPs (FLoating-point OPerations) of each model on the CIFAR-100, Tiny ImageNet, and ImageNet datasets are presented in Tab. 6, Tab. 7, and Tab. 8, respectively. For the number of parameters (or FLOPs), we only count those in the convolutional and fully-connected layers (BN layers not included), following common practice.

**Training cost of TLCutMix+pick.** The proposed picking scheme only increases the wall-clock training time by around 30%, which is reasonable considering the performance gains.
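As an illustration of the counting convention above, parameters restricted to convolutional and fully-connected layers can be tallied as in the sketch below (our own helper, not the script used to produce the tables).

```python
import torch.nn as nn
from torchvision.models import resnet18

def count_conv_fc_params(model: nn.Module) -> int:
    """Count parameters only in Conv2d and Linear layers (BN layers excluded)."""
    return sum(
        p.numel()
        for m in model.modules()
        if isinstance(m, (nn.Conv2d, nn.Linear))
        for p in m.parameters()
    )

print(count_conv_fc_params(resnet18()) / 1e6)  # about 11.68M for ResNet18, cf. Tab. 8
```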
Table 6: Model complexity statistics of each teacher-student pair on the CIFAR-100 dataset. The number of FLOPs is measured by the number of Multiply and Add operations in convolutional and fully-connected layers. Input image size: 32×32×3.

| Teacher | WRN-40-2 | ResNet56 | ResNet32x4 | VGG13 | VGG13 | ResNet50 | ResNet32x4 |
|---|---|---|---|---|---|---|---|
| Student | WRN-16-2 | ResNet20 | ResNet8x4 | VGG8 | MobileNetV2 | VGG8 | ShuffleNetV2 |
| Teacher #Params (10^6) | 2.2497 | 0.8574 | 7.4239 | 9.4563 | 9.4563 | 23.6521 | 7.4239 |
| Student #Params (10^6) | 0.7015 | 0.2768 | 1.2308 | 3.9621 | 0.7945 | 3.9621 | 1.3393 |
| Compression ratio | 3.2072× | 3.0979× | 6.0319× | 2.3867× | 11.9027× | 5.9696× | 5.5430× |
| Teacher #FLOPs (10^9) | 0.6552 | 0.2515 | 2.1661 | 0.5699 | 0.5699 | 2.5960 | 2.1661 |
| Student #FLOPs (10^9) | 0.2022 | 0.0816 | 0.3541 | 0.1924 | 0.0116 | 0.1924 | 0.0863 |
| Speedup ratio | 3.2399× | 3.0808× | 6.1164× | 2.9621× | 49.1399× | 13.4939× | 25.0915× |

Table 7: Model complexity statistics of each teacher-student pair on the Tiny ImageNet dataset. The number of FLOPs is measured by the number of Multiply and Add operations in convolutional and fully-connected layers. Input image size: 64×64×3.

| Teacher | WRN-40-2 | ResNet56 | ResNet32x4 | VGG13 | VGG13 | ResNet50 | ResNet32x4 |
|---|---|---|---|---|---|---|---|
| Student | WRN-16-2 | ResNet20 | ResNet8x4 | VGG8 | MobileNetV2 | VGG8 | ShuffleNetV2 |
| Teacher #Params (10^6) | 2.2626 | 0.8639 | 7.4496 | 9.5076 | 9.5076 | 23.8570 | 7.4496 |
| Student #Params (10^6) | 0.7144 | 0.2833 | 1.2565 | 4.0134 | 0.9226 | 4.0134 | 1.4418 |
| Compression ratio | 3.1674× | 3.0498× | 5.9289× | 2.3690× | 10.3056× | 5.9444× | 5.1667× |
| Teacher #FLOPs (10^9) | 2.6208 | 1.0060 | 8.6642 | 1.8263 | 1.8263 | 10.3833 | 8.6642 |
| Student #FLOPs (10^9) | 0.8089 | 0.3265 | 1.4165 | 0.5428 | 0.0459 | 0.5428 | 0.3449 |
| Speedup ratio | 3.2400× | 3.0809× | 6.1168× | 3.3643× | 39.8097× | 19.1276× | 25.1210× |

Table 8: Model complexity statistics of ResNet34 (teacher) / ResNet18 (student) on the ImageNet dataset. The number of FLOPs is measured by the number of Multiply and Add operations in convolutional and fully-connected layers. Input image size: 224×224×3.

| | ResNet34 (teacher) | ResNet18 (student) | Ratio |
|---|---|---|---|
| #Params (10^6) | 21.7806 | 11.6799 | 1.8648× (compression) |
| #FLOPs (10^9) | 7.3275 | 3.6281 | 2.0196× (speedup) |

B COMBINING TLCUTMIX WITH OTHER KD METHODS

In the main paper, we show that our proposed method can advance the previous state-of-the-art method (CRD) even further on most pairs (Tabs. 3 and 4). It is interesting to see whether this bonus translates to more KD methods. Therefore, here we combine TLCutMix with five more KD methods to see how it works. The five methods are AT (Zagoruyko & Komodakis, 2017), CC (Peng et al., 2019), SP (Tung & Mori, 2019), PKT (Passalis & Tefas, 2018), and VID (Ahn et al., 2019). Note, these five methods are the top-performing KD methods (besides CRD) based on the CIFAR-100 results in the CRD paper (it is easy to improve a mediocre KD method). Results are shown in Tab. 9. When equipped with our proposed TLCutMix, all these methods see accuracy gains, although some teacher-student pairs see more improvement than others. For example, SP+TLCutMix and VID+TLCutMix have only marginal gains on the ResNet110/ResNet20 pair. Apart from these, all the other pairs see a significant accuracy improvement. There are 42 results in total in Tab. 9. Half of them (21 pairs) are improved by more than 1 percentage point. Several (5 pairs) are even improved by over 2 percentage points.
Table 9: Student test accuracy (standard deviation) of different KD methods on CIFAR-100 when _equipped with the proposed TLCutMix_. The results of the different KD methods are directly cited from Table 7 of the CRD paper (Tian et al., 2020), where standard deviations were not reported except for the original KD method (Hinton et al., 2014), so they are missing here as well. AT, CC, SP, PKT, and VID include the original KD loss as part of their loss functions. Each of our results is obtained by 3 random runs, mean (std) accuracy reported. Accuracy gains are listed in the "Acc. gain" rows.

| Teacher | WRN-40-2 | ResNet110 | ResNet32x4 | VGG13 | VGG13 | ResNet50 | ResNet32x4 |
|---|---|---|---|---|---|---|---|
| Student | WRN-16-2 | ResNet20 | ResNet8x4 | VGG8 | MobileNetV2 | VGG8 | ShuffleNetV2 |
| Teacher Acc. | 75.61 | 74.31 | 79.42 | 74.64 | 74.64 | 79.34 | 79.42 |
| Student Acc. | 73.26 | 69.06 | 72.50 | 70.36 | 64.60 | 70.36 | 71.82 |
| KD (Hinton et al., 2014) | 74.92 (0.28) | 70.67 (0.27) | 73.33 (0.25) | 72.98 (0.19) | 67.37 (0.32) | 73.81 (0.13) | 74.45 (0.27) |
| **KD+TLCutMix (ours)** | 75.34 (0.19) | 71.19 (0.23) | 74.91 (0.20) | 74.16 (0.18) | 68.79 (0.35) | 74.85 (0.23) | 76.61 (0.18) |
| Acc. gain | +0.42 | +0.52 | +1.58 | +1.18 | +1.42 | +1.04 | +2.16 |
| AT (Zagoruyko & Komodakis, 2017) | 75.32 | 70.97 | 74.53 | 73.48 | 65.13 | 74.01 | 75.39 |
| **AT+TLCutMix (ours)** | 75.65 (0.27) | 71.66 (0.07) | 75.68 (0.13) | 74.02 (0.15) | 67.20 (0.24) | 74.67 (0.14) | 76.25 (0.13) |
| Acc. gain | +0.33 | +0.69 | +1.15 | +0.54 | +2.07 | +0.66 | +0.86 |
| CC (Peng et al., 2019) | 75.09 | 70.88 | 74.21 | 73.04 | 68.02 | 73.48 | 74.71 |
| **CC+TLCutMix (ours)** | 75.75 (0.27) | 71.41 (0.24) | 75.54 (0.28) | 74.35 (0.27) | 68.44 (0.46) | 74.76 (0.09) | 76.78 (0.18) |
| Acc. gain | +0.66 | +0.53 | +1.33 | +1.31 | +0.42 | +1.28 | +2.07 |
| SP (Tung & Mori, 2019) | 74.98 | 71.02 | 74.02 | 73.49 | 68.41 | 73.52 | 74.88 |
| **SP+TLCutMix (ours)** | 75.29 (0.39) | 71.10 (0.07) | 74.96 (0.13) | 74.10 (0.27) | 68.79 (0.24) | 74.77 (0.33) | 76.24 (0.14) |
| Acc. gain | +0.31 | +0.08 | +0.94 | +0.61 | +0.38 | +1.25 | +1.36 |
| PKT (Passalis & Tefas, 2018) | 75.33 | 70.72 | 74.23 | 73.25 | 68.13 | 73.61 | 74.66 |
| **PKT+TLCutMix (ours)** | 75.85 (0.42) | 71.33 (0.06) | 75.44 (0.08) | 74.30 (0.18) | 68.98 (0.60) | 74.70 (0.32) | 76.79 (0.10) |
| Acc. gain | +0.52 | +0.61 | +1.21 | +1.05 | +0.85 | +1.09 | +2.13 |
| VID (Ahn et al., 2019) | 75.14 | 71.10 | 74.56 | 73.19 | 68.27 | 73.46 | 74.85 |
| **VID+TLCutMix (ours)** | 75.66 (0.21) | 71.13 (0.27) | 75.40 (0.17) | 74.24 (0.12) | 69.70 (0.22) | 74.67 (0.17) | 76.90 (0.17) |
| Acc. gain | +0.52 | +0.03 | +0.84 | +1.05 | +1.43 | +1.21 | +2.05 |

As explained in the main paper, our method focuses on the input end to improve KD, while methods like AT, CC, SP, PKT, and VID focus on the output end (i.e., a better loss function). Therefore, they are complementary. The results here reiterate one of our contributions: most existing KD papers seek to improve KD through a better loss function, while we discover a new axis – improving KD through a stronger DA – which is just as promising.

C KD+CUTOUT ON CIFAR-100

In the main paper, we have shown that the KD loss can exploit more advanced data augmentation schemes (like mixup/CutMix) for improved performance. Another popular image data augmentation method, stronger than the common random crop and horizontal flip, is cutout (DeVries & Taylor, 2017). In Tab. 10 we show KD can also work with cutout to consistently deliver stronger performance. This further confirms our finding that we can boost the KD performance simply by using a stronger data augmentation.

Table 10: KD+Cutout (DeVries & Taylor, 2017) vs. KD on the CIFAR-100 dataset.

| Teacher | WRN-40-2 | ResNet56 | ResNet32x4 | VGG13 | VGG13 | ResNet50 | ResNet32x4 |
|---|---|---|---|---|---|---|---|
| Student | WRN-16-2 | ResNet20 | ResNet8x4 | VGG8 | MobileNetV2 | VGG8 | ShuffleNetV2 |
| Teacher Acc. | 75.61 | 72.34 | 79.42 | 74.64 | 74.64 | 79.34 | 79.42 |
| Student Acc. | 73.26 | 69.06 | 72.50 | 70.36 | 64.60 | 70.36 | 71.82 |
| KD | 74.92 (0.28) | 70.66 (0.24) | 73.33 (0.25) | 72.98 (0.19) | 67.37 (0.32) | 73.81 (0.13) | 74.45 (0.27) |
| KD+Cutout | 75.54 (0.16) | 70.86 (0.19) | 74.32 (0.30) | 74.03 (0.08) | 68.22 (0.06) | 73.98 (0.19) | 75.20 (0.10) |

D CUTMIX SAMPLE ANALYSIS

**CutMix sample analysis and why KD is naturally suited to exploit CutMix.**
During the KD** training of ResNet34/ResNet18 on ImageNet, we recorded the CutMix samples on which the teacher _disagrees with the CutMix scheme on the label. We call this label disagreement issue. As show in_ ----- |0 0.09 0.062|0 .145 0|0.3 .253| |---|---|---| |0.107 0.072 .055|0.22 0.188| |---|---| ground 0.514 airship 0.518 Arabian 0.587 quill 0.624 beetle camel cab 0.486 Tibetan 0.482 acoustic 0.413 Indian 0.376 terrier guitar elephant cab 0.943 Yorkshire terrier 0.300 shopping cart 0.220 coffeepot 0.187 ambulance 0.023 Lhasa 0.253 shopping basket 0.188 milk can 0.151 police van 0.020 silky terrier 0.145 solar dish 0.107 caldron 0.084 minivan 0.004 Norfolk terrier 0.090 guinea pig 0.072 water jug 0.055 minibus 0.002 Tibetan terrier 0.062 chainlink fence 0.055 bucket 0.028 (a) (b) (c) (d) Figure 5: ImageNet CutMix samples where the main object in one of the images is no longer visible after CutMix augmentation. Below each sample, the first is the target probability assigned by CutMix and the second is the top-5 predicted probabilities by the teacher. These examples can be misleading when cross-entropy loss is used, but not for KD, as explained in the text. Fig. 5. there exist cases where the image cut from one image covers the salient object in the other. For example, the cab in (a) completely covers the ground beetle. In this case, using the label by CutMix does not make sense anymore. A similar problem appears on (b). Note that these misleading labels by CutMix are rectified when the teacher is employed to guide the student. The teacher assigns the correct label “cab” to (a) and “Yorkshire terrier” to (b) (which is still not the true label “Tibetan terrier” but it is clearly more relevant and “Tibetan terrier” is also in the top-5 predictions). For (c) and (d), they pose a problem more than occlusion: the foreground cut in (c) is labeled as “acoustic guitar”, however, the cut is too small for us to make it out without knowing the label. Meanwhile, the background object “Arabian camel” is occluded. Then the grids in the picture turn out to be the most salient part. If we look at the predictions of the teacher, “shopping cart” and “shopping basket” clearly make more sense than either of the original two labels. A similar issue happens on (d), where the “Indian elephant” is largely occluded. The foreground cut is labeled “quill” but the bottle in the middle is more salient. Thus the teacher predicted it as “coffeepot”, “milk can”, etc. In order to see how severe the label disagreement issue is, we counted the number of these synthetic samples and found that on more than half of the samples (52.1%) produced by CutMix, the teacher model and CutMix hold a different view regarding the label. Many of these suffer from the problem shown in Fig. 5. The KD loss can rectify these label mistakes. This further shows the interplay between KD and DA: KD thrives on DA and in turn, some DA schemes are more reasonable for KD (than CE) where a teacher can supply more relevant labels. -----