# IF YOUR DATA DISTRIBUTION SHIFTS, USE SELF-LEARNING

**Anonymous authors**
Paper under double-blind review

ABSTRACT

We demonstrate that self-learning techniques like entropy minimization and pseudo-labeling are simple and effective at improving the performance of a deployed computer vision model under systematic domain shifts. We show consistent improvements irrespective of the model architecture, the pre-training technique or the type of distribution shift. At the same time, self-learning is simple to use in practice because it does not require knowledge of or access to the original training data or scheme, is robust to hyperparameter choices, is straightforward to implement and requires only a few adaptation epochs. This makes self-learning techniques highly attractive for any practitioner who applies machine learning algorithms in the real world. We present state-of-the-art adaptation results on CIFAR10-C (8.5% error), ImageNet-C (22.0% mCE), ImageNet-R (17.4% error) and ImageNet-A (14.8% error), theoretically study the dynamics of self-supervised adaptation methods and propose a new classification dataset (ImageNet-D) which is challenging even with adaptation.

1 INTRODUCTION

Deep Neural Networks (DNNs) can reach human-level performance in complex cognitive tasks (Brown et al., 2020; He et al., 2016a; Berner et al., 2019) if the distribution of the test data is sufficiently similar to the training data. However, DNNs are known to struggle if the distribution of the test data is shifted relative to the training data (Geirhos et al., 2018; Dodge & Karam, 2017).

Two largely distinct communities aim to increase the performance of models under test-time distribution shifts. The robustness community generally considers ImageNet-scale datasets and evaluates models in an ad-hoc scenario: models are trained on a clean source dataset like ImageNet, using heavy data augmentation (Hendrycks et al., 2020a; Rusak et al., 2020; Geirhos et al., 2019) and/or large-scale pre-training (Xie et al., 2020a; Mahajan et al., 2018), and are not adapted in any way to test-time distribution shifts. This evaluation scenario is relevant for applications in which very different distribution shifts are encountered in an unpredictable order, but it misses out on the gains of adaptation to unlabeled samples of the target distribution.

Figure 1: Robustness and adaptation to new datasets has traditionally been achieved by robust pre-training (with hand-selected/data-driven augmentation strategies, or additional data), unsupervised domain adaptation (with access to unlabeled samples from the test set), or, more recently, self-supervised learning methods. We show that on top of these different pre-training tasks, it is always possible (irrespective of architecture, model size or pre-training algorithm) to further adapt models to the target domain with simple self-learning techniques.

The unsupervised domain adaptation (UDA) community often considers smaller-scale datasets and assumes that both the source and the (unlabeled) target dataset are known. Models are trained on both datasets (e.g., with an adversarial domain objective, Ganin et al., 2016) before evaluation on the target domain data. This evaluation scenario provides optimal conditions for adaptation, but the reliance on the source dataset makes UDA more computationally expensive and more impractical, and prevents the use of pre-trained models for which the source dataset is unknown or simply too large.
In this work, we consider the source-free domain adaptation setting, a middle ground between the classical ad-hoc robustness setting and UDA in which models can adapt to the target distribution, but without using the source dataset (Kundu et al., 2020; Kim et al., 2021; Li et al., 2020; Liang et al., 2020). This evaluation scenario is interesting for many practitioners and applications as an extension of the ad-hoc robustness scenario. It evaluates the possible performance of a deployed model on a systematic, unseen distribution shift at inference time: an embedded computer vision system in an autonomous car should adapt to changes without being trained on all available training data; an image-based quality control software may not necessarily open-source the images it has been trained on, but still has to be adapted to the lighting conditions at the operation location; a computer vision system in a hospital should perform robustly when tested on a scanner different from the training images. Importantly, it might not be known at development time which scanner the system will be tested on, and it might be prohibited to share images from many hospitals to run UDA.

Can self-learning methods like pseudo-labeling and entropy minimization also be used in this source-free domain adaptation setting? To answer this question, we perform an extensive study of several self-learning variants, and find consistent and substantial gains in test-time performance across several robustness and out-of-domain benchmarks and a wide range of models and pre-training methods, including models trained with UDA methods that do not use self-learning. We also find that self-learning outperforms state-of-the-art source-free domain adaptation methods, namely Test-Time Training, which is based on a self-supervised auxiliary objective and continual training (Sun et al., 2019b), test-time entropy minimization (Wang et al., 2020) and (gradient-free) BatchNorm adaptation (Schneider et al., 2020; Nado et al., 2020). We perform a large number of ablations to study important design choices for self-learning methods in source-free domain adaptation. Furthermore, we show that a variant of pseudo-labeling with a robust loss function consistently outperforms entropy minimization on ImageNet-scale datasets. We theoretically analyze and empirically verify the influence of the temperature parameter in self-learning and provide guidelines on how this single parameter should be chosen. Our approach is visualized in Figure 1. We do not consider test-time adaptation in an online setting as studied, e.g., by Zhang et al. (2021), where the model is adapted to one example at a time and reset after each example.

**Related Work.** Variants of self-learning have been used for UDA (Berthelot et al., 2021), for example using auxiliary information (Xie et al., 2020b), consistency (Wei et al., 2020; Cai et al., 2021; Prabhu et al., 2021) or confidence (Zou et al., 2019) regularization. The main differences between these works and ours are that they 1) utilize both source and target data for self-learning whereas we only require access to unlabeled target data, 2) train their models from scratch whereas we merely fine-tune pre-trained checkpoints on the unlabeled target data, and 3) are generally more complicated than our approach due to using more than one term in the objective function.
Our work is conceptually most similar to virtual adversarial domain adaptation in the fine-tuning phase of DIRT-T (Shu et al., 2018) and test-time entropy minimization (TENT; Wang et al., 2020). In contrast to DIRT-T, our objective is simpler and we scale the approach to considerably larger datasets on ImageNet scale. TENT, on the other hand, only evaluated a single method (entropy minimization) on a single vanilla model (ResNet-50) on IN-C. We substantially expand this analysis to show that self-learning almost universally increases test-time performance under distribution shifts, regardless of the type of distribution shift, the model architecture or the pre-training method.

Self-learning has also been applied to UDA for semantic segmentation (Zou et al., 2018), for gradual domain adaptation (Kumar et al., 2020), for semi-supervised learning (Rizve et al., 2021; Mukherjee & Awadallah, 2020), for learning in biased datasets (Chen et al., 2020b) and for automated data annotation (De Sousa Ribeiro et al., 2020). Zoph et al. (2020) show that self-learning outperforms pre-training when stronger data augmentation is used and more labeled data is present. A more detailed discussion of related work, along with the main differences to our work, can be found in Appendix F. Our main contribution beyond these works is to show the effectiveness of self-learning on top of robust, large-scale, and domain-adapted models, at scale.

2 SELF-LEARNING FOR TEST-TIME ADAPTATION

Different variants of self-learning have been used in unsupervised domain adaptation (French et al., 2018; Shu et al., 2018), self-supervised representation learning (Caron et al., 2021), and semi-supervised learning (Xie et al., 2020a). In a typical self-learning setting, a teacher network $f^t$ trained on the source domain predicts labels on the target domain. Then, a student model $f^s$ is fine-tuned on the predicted labels.

In the following, let $f^t(x)$ denote the logits for sample $x$ and let $p^t(j\,|\,x) \equiv \sigma_j(f^t(x))$ denote the probability for class $j$ obtained from a softmax function $\sigma_j(\cdot)$. Similarly, $f^s(x)$ and $p^s(j\,|\,x)$ denote the logits and probabilities of the student model $f^s$. For all techniques, one can optionally only admit samples where the probability $\max_j p^t(j\,|\,x)$ exceeds some threshold. We consider three popular variants of self-learning: pseudo-labeling with hard or soft labels, as well as entropy minimization.

**Hard Pseudo-Labeling (Lee, 2013; Galstyan & Cohen, 2007).** We generate labels using the teacher and train the student on pseudo-labels $i$ using the standard cross-entropy loss,

$$\ell_H(x) := -\log p^s(i\,|\,x), \qquad i = \operatorname{argmax}_j\, p^t(j\,|\,x). \qquad (1)$$

Usually, only samples with a confidence above a certain threshold are considered for training the student. We test several thresholds but note that thresholding means discarding a potentially large portion of the data, which leads to a performance decrease in itself. The teacher is updated after each epoch.

**Soft Pseudo-Labeling (Lee, 2013; Galstyan & Cohen, 2007).** In contrast to the hard pseudo-labeling variant, we here train the student on the class probabilities predicted by the teacher,

$$\ell_S(x) := -\sum_j p^t(j\,|\,x)\,\log p^s(j\,|\,x). \qquad (2)$$

Soft pseudo-labeling is typically not used in conjunction with thresholding, since it already incorporates the certainty of the model. The teacher is updated after each epoch.

**Entropy Minimization (ENT; Grandvalet & Bengio, 2004).**
This variant is similar to soft pseudo-labeling, but we no longer differentiate between a teacher and a student network. It corresponds to an "instantaneous" update of the teacher. The training objective becomes

$$\ell_E(x) := -\sum_j p^s(j\,|\,x)\,\log p^s(j\,|\,x). \qquad (3)$$

Intuitively, self-training with entropy minimization leads to a sharpening of the output distribution for each sample, making the model more confident in its predictions.

**Robust Pseudo-Labeling (RPL).** Virtually all introduced self-training variants use the standard cross-entropy classification objective. However, the standard cross-entropy loss has been shown to be sensitive to label noise (Zhang & Sabuncu, 2018; Zhang et al., 2017). In the setting of domain adaptation, inaccuracies in the teacher predictions and, thus, in the labels for the student are inescapable, with severe repercussions for training stability and hyperparameter sensitivity, as we show in the results.

As a straightforward solution to this problem, we propose to replace the cross-entropy loss by a robust classification loss designed to withstand certain amounts of label noise (Ghosh et al., 2017; Song et al., 2020; Shu et al., 2020; Zhang & Sabuncu, 2018). A popular candidate is the Generalized Cross Entropy (GCE) loss, which combines the noise-tolerant Mean Absolute Error (MAE) loss (Ghosh et al., 2017) with the CE loss. We only consider the hard labels and use the robust GCE loss as the training loss for the student,

$$i = \operatorname{argmax}_j\, p^t(j\,|\,x), \qquad \ell_{GCE}(x, i) := q^{-1}\left(1 - p^s(i\,|\,x)^q\right), \qquad (4)$$

with $q \in (0, 1]$. In the limit $q \to 0$, the GCE loss approaches the CE loss, and for $q = 1$, the GCE loss is the MAE loss (Zhang & Sabuncu, 2018). We test updating the teacher both after every update step of the student (RPL) and once per epoch (RPL$^{ep}$).

3 EXPERIMENT DESIGN

**Datasets.** IN-C (Hendrycks & Dietterich, 2019) contains corrupted versions of the 50 000 images in the IN validation set. There are fifteen test and four hold-out corruptions, each at five severity levels. The established metric to report model performance on IN-C is the mean Corruption Error (mCE), where the error is normalized by the AlexNet error and averaged over all corruptions and severity levels; see Eq. 20, Appendix C.1. IN-R (Hendrycks et al., 2020a) contains 30 000 images with artistic renditions of 200 classes of the IN dataset. IN-A (Hendrycks et al., 2019) is composed of 7500 unmodified real-world images on which standard IN-trained ResNet50 (He et al., 2016b) models yield chance-level performance. CIFAR10 (Krizhevsky et al., 2009) and STL10 (Coates et al., 2011) are small-scale image recognition datasets with 10 classes each, and training sets of 50 000/5000 images and test sets of 10 000/8000 images, respectively. The digit datasets MNIST (Deng, 2012) and MNIST-M (Ganin et al., 2016) both have 60 000 training and 10 000 test images.

**Hyperparameters.** The different self-learning variants have a range of hyperparameters such as the learning rate or the stopping criterion. Our goal is to give a realistic estimate of the performance to be expected in practice. To this end, we optimize hyperparameters for each variant of pseudo-labeling on a hold-out set of IN-C that contains four types of image corruptions ("speckle noise", "Gaussian blur", "saturate" and "spatter") with five different strengths each, following the procedure suggested in Hendrycks & Dietterich (2019). We refer to the hold-out set of IN-C as our dev set.
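For concreteness, the four objectives from Section 2 can be written as short PyTorch losses. The sketch below is illustrative rather than our exact implementation; the function names, the default confidence threshold and the default value of q are assumptions made for this example.

```python
import torch
import torch.nn.functional as F

def hard_pseudo_label_loss(student_logits, teacher_logits, threshold=0.0):
    """Hard pseudo-labeling (Eq. 1): cross-entropy against the teacher's argmax,
    optionally restricted to samples whose teacher confidence exceeds `threshold`."""
    with torch.no_grad():
        teacher_probs = teacher_logits.softmax(dim=1)
        confidence, pseudo_labels = teacher_probs.max(dim=1)
        mask = confidence >= threshold
    # assumes at least one sample passes the threshold
    return F.cross_entropy(student_logits[mask], pseudo_labels[mask])

def soft_pseudo_label_loss(student_logits, teacher_logits):
    """Soft pseudo-labeling (Eq. 2): cross-entropy against the full teacher distribution."""
    teacher_probs = teacher_logits.detach().softmax(dim=1)
    return -(teacher_probs * student_logits.log_softmax(dim=1)).sum(dim=1).mean()

def entropy_minimization_loss(logits):
    """Entropy minimization (Eq. 3): teacher equals student, no stop gradient."""
    log_probs = logits.log_softmax(dim=1)
    return -(log_probs.exp() * log_probs).sum(dim=1).mean()

def robust_pseudo_label_loss(student_logits, teacher_logits, q=0.8):
    """Robust pseudo-labeling (Eq. 4): generalized cross-entropy on hard teacher labels;
    q in (0, 1] interpolates between CE (q -> 0) and MAE (q = 1)."""
    with torch.no_grad():
        pseudo_labels = teacher_logits.argmax(dim=1)
    p_label = student_logits.softmax(dim=1).gather(1, pseudo_labels[:, None]).squeeze(1)
    return ((1.0 - p_label.pow(q)) / q).mean()
```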
**Models for ImageNet-scale datasets.** We consider four popular model architectures: ResNet50 (He et al., 2016b), DenseNet161 (Huang et al., 2017), ResNeXt101 (Xie et al., 2017) and EfficientNet-L2 (Tan & Le, 2019); see Appendix B.1 for details on the used models. For ResNet50, DenseNet161 and ResNeXt101, we include a simple vanilla version trained on IN only. For ResNet50 and ResNeXt101, we additionally include a state-of-the-art robust version trained with DeepAugment and Augmix (DAug+AM, Hendrycks et al., 2020a)[1]. For the ResNeXt model, we also include a version that was trained on 3.5 billion weakly labeled images (IG-3.5B, Mahajan et al., 2018). Finally, for EfficientNet-L2 we select the current state of the art on IN-C, which was trained on 300 million images from JFT-300M (Chollet, 2017; Hinton et al., 2014) using a noisy student-teacher protocol (Xie et al., 2020a). We validate the IN and IN-C performance of all considered models and match the originally reported scores (Schneider et al., 2020). For EfficientNet-L2, we match IN top-1 accuracy up to 0.1% points and IN-C up to 0.6% mCE.

[1] See the leaderboard at github.com/hendrycks/robustness.

**Models for CIFAR10/MNIST-scale datasets.** For CIFAR10-C experiments, we use two WideResNets (WRN, Zagoruyko & Komodakis, 2016): the first is trained on CIFAR10 and has a depth of 28 and a width of 10; the second is trained with AugMix (Hendrycks et al., 2020b) and has a depth of 40 and a width of 2. The remaining small-scale models are trained with unsupervised domain adaptation (UDA) methods. We propose to regard any UDA method which requires joint training with source and target data as a pre-training step, similar to regular pre-training on IN, and use self-learning on top of the final checkpoint. We consider two popular UDA methods: self-supervised domain adaptation (UDA-SS; Sun et al., 2019a) and Domain-Adversarial Training of Neural Networks (DANN; Ganin et al., 2016). In UDA-SS, the authors seek to align the representations of both domains by performing an auxiliary self-supervised task on both domains simultaneously. In all UDA-SS experiments, we use a WideResNet with a depth of 26 and a width of 16. In DANN, the authors learn a domain-invariant embedding by optimizing a minimax objective. For all DANN experiments except for MNIST→MNIST-M, we use the same WRN architecture as above. For the MNIST→MNIST-M experiment, training with the larger model diverged, and we used a smaller WideResNet version with a width of 2. We note that DANN training involves optimizing a minimax objective and is generally harder to tune.

4 RESULTS: SELF-LEARNING UNIVERSALLY IMPROVES MODELS

Self-learning is a powerful learning scheme, and in the following section we show that it allows performing test-time adaptation on robustified models, models obtained with large-scale pre-training, as well as domain-adapted models across a wide range of datasets and distribution shifts. Our main results on large-scale and small-scale datasets are shown in Tables 1 and 2, respectively. These summary tables show final results, and all experiments use the hyperparameters we determined separately on the dev set.

**Table 1: Self-learning successfully adapts ImageNet-scale models across different model architectures on IN-C, IN-A and IN-R.** We adapt the vanilla ResNet50, ResNeXt101 and DenseNet161 models to IN-C and decrease the mCE by over 19 percent points in all models.
Further, self-learning works for models irrespective of their size: self-learning substantially improves the performance of the ResNet50 and the ResNeXt101 trained with DAug+AM on IN-C by 11.9 and 9.7 percent points, respectively. Finally, we further improve the current state-of-the-art model on IN-C—the EfficientNet-L2 Noisy Student model—and report a new state-of-the-art result of 22% mCE (which corresponds to a top1 error of 17.1%) on this benchmark with test-time adaptation (compared to 28% mCE without adaptation).

mCE [%] on IN-C test (↘), adaptation with RPL:

| Model | Number of parameters | w/o adapt | w/ adapt | ∆ |
|---|---|---|---|---|
| ResNet50 vanilla (He et al., 2016b) | 2.6 × 10^7 | 76.7 | 50.5 | (-26.2) |
| ResNet50 DAug+AM (Hendrycks et al., 2020a) | 2.6 × 10^7 | 53.6 | 41.7 | (-11.9) |
| DenseNet161 vanilla (Huang et al., 2017) | 2.8 × 10^7 | 66.4 | 47.0 | (-19.4) |
| ResNeXt101 32×8d vanilla (Xie et al., 2017) | 8.8 × 10^7 | 66.6 | 43.2 | (-23.4) |
| ResNeXt101 32×8d DAug+AM (Hendrycks et al., 2020a) | 8.8 × 10^7 | 44.5 | 34.8 | (-9.7) |
| ResNeXt101 32×8d IG-3.5B (Mahajan et al., 2018) | 8.8 × 10^7 | 51.7 | 40.9 | (-10.8) |
| EfficientNet-L2 Noisy Student (Xie et al., 2020a) | 4.8 × 10^8 | 28.3 | **22.0** | (-6.3) |

top1 error [%] on IN-R (↘):

| Model | Number of parameters | w/o adapt | w/ adapt | ∆ |
|---|---|---|---|---|
| ResNet50 vanilla (He et al., 2016b) | 2.6 × 10^7 | 63.8 | 54.1 | (-9.7) |
| EfficientNet-L2 Noisy Student (Xie et al., 2020a) | 4.8 × 10^8 | 23.5 | **17.4** | (-6.1) |

top1 error [%] on ImageNet-A (↘):

| Model | Number of parameters | w/o adapt | w/ adapt | ∆ |
|---|---|---|---|---|
| EfficientNet-L2 Noisy Student (Xie et al., 2020a) | 4.8 × 10^8 | 16.5 | **14.8** | (-1.7) |

Self-learning is not limited to the distribution shifts in IN-C like compression artefacts or blur. On IN-R, a dataset with renditions, self-learning improves both the vanilla ResNet50 and the EfficientNet-L2 model, the latter of which improves from 23.5% to a new state of the art of 17.4% top-1 error. For a vanilla ResNet50, we improve the top-1 error from 63.8% (Hendrycks et al., 2020a) to 54.1%. On IN-A, adapting the EfficientNet-L2 model using self-learning decreases the top-1 error from 16.5% (Xie et al., 2020a) to 14.8%, again constituting a new state of the art with test-time adaptation on this dataset.

**Table 2: Self-learning improves robustified and domain adapted models on small-scale datasets.** We test common domain adaptation techniques like DANN (Ganin et al., 2016) and UDA-SS (Sun et al., 2019a), and show that self-learning is effective at further tuning such models to the target domain. We suggest viewing unsupervised source/target domain adaptation as a step comparable to pre-training under corruptions, rather than an adaptation technique specifically tuned to the target set—indeed, we can achieve error rates using, e.g., DANN + target adaptation previously only possible with source/target based pseudo-labeling, across different common domain adaptation benchmarks. Self-learning also decreases the error on CIFAR10-C of the Wide ResNet model trained with AugMix (AM, Hendrycks et al., 2020b) and reaches a new state of the art on CIFAR10-C of 8.5% top1 error with test-time adaptation. † denotes preliminary results on CIFAR10-C dev only, due to instabilities in training the adversarial network in DANN.
top1 error [%] on CIFAR10-C (↘), adaptation with ENT:

| Model | Number of parameters | w/o adapt | w/ adapt | ∆ |
|---|---|---|---|---|
| WRN-28-10 vanilla (Zagoruyko & Komodakis, 2016) | 3.6 × 10^7 | 26.5 | 13.3 | (-13.2) |
| WRN-40-2 AM (Hendrycks et al., 2020b) | 2.2 × 10^6 | 11.2 | 8.5 | (-2.7) |
| WRN-26-16 UDA-SS (Sun et al., 2019a) | 9.3 × 10^7 | 27.7 | 16.7 | (-11.0) |
| WRN-26-16 DANN (Ganin et al., 2016) | 9.3 × 10^7 | †29.7 | †28.5 | (-1.2) |

UDA CIFAR10→STL10, top1 error on target [%] (↘):

| Model | Number of parameters | w/o adapt | w/ adapt | ∆ |
|---|---|---|---|---|
| WRN-26-16 UDA-SS (Sun et al., 2019a) | 9.3 × 10^7 | 28.7 | 21.8 | (-6.9) |
| WRN-26-16 DANN (Ganin et al., 2016) | 9.3 × 10^7 | 25.0 | 23.9 | (-1.1) |

UDA MNIST→MNIST-M, top1 error on target [%] (↘):

| Model | Number of parameters | w/o adapt | w/ adapt | ∆ |
|---|---|---|---|---|
| WRN-26-16 UDA-SS (Sun et al., 2019a) | 9.3 × 10^7 | 4.8 | 2.0 | (-2.8) |
| WRN-26-2 DANN (Ganin et al., 2016) | 1.5 × 10^6 | 11.4 | 5.1 | (-6.3) |

**Table 3: Self-learning also improves large pre-trained models.** Unlike BatchNorm adaptation (Schneider et al., 2020), we show that self-learning transfers well to models pre-trained on a large amount of unlabeled data: self-learning decreases the mCE on IN-C of the ResNeXt101 trained on 3.5 billion weakly labeled samples (IG-3.5B, Mahajan et al., 2018) from 51.7% to 40.9%.

| mCE on IN-C test [%] (↘) | no adaptation | BN adaptation | self-learning |
|---|---|---|---|
| ResNeXt101 32×8d vanilla | 66.6 | 56.8 | 43.2 |
| ResNeXt101 32×8d IG-3.5B | 51.7 | 51.8 | **40.9** |

**Table 4: Self-learning outperforms previously published test-time adaptation approaches on IN-C.** The robustness benchmark IN-C has so far mostly been regarded in the ad-hoc evaluation setting as discussed in our introduction. Thus, there are only a few published methods that report numbers for test-time adaptation: BatchNorm adaptation (Schneider et al., 2020), Test-Time Training (TTT, Sun et al., 2019b), and TENT (Wang et al., 2020). In particular, note that TTT requires a special loss function at training time, while our approach is agnostic to the pre-training phase. Our self-training results outperform all three baselines (also after tuning TENT with our full experimental protocol; the tuned TENT result is given in parentheses):

| mCE on IN-C test [%] (↘) | w/o adapt | BN adapt | TENT | self-learning (ours) |
|---|---|---|---|---|
| ResNet50 vanilla | 76.7 | 62.2 | 53.5 (51.6) | **50.5** |

| top1 error [%] on IN-C, sev. 5 (↘) | w/o adapt | BN adapt | TTT | self-learning (ours) |
|---|---|---|---|---|
| ResNet18 vanilla | 85.4 | 72.2 | 66.3 | **61.9** |

**Table 5: Self-supervised methods based on self-learning allow out-of-the-box test-time adaptation.** The recently published DINO method (Caron et al., 2021) is another variant of self-supervised learning that has proven to be effective for unsupervised representation learning. At its core, the method uses soft pseudo-labeling. Here, we test whether a model trained with DINO on the source dataset can be test-time adapted on IN-C using DINO to further improve out-of-distribution performance. Since the used model is a vision transformer, we test different choices of adaptation parameters and find considerable performance improvements in all cases, yielding an mCE of 43.5% at a parameter count comparable to a ResNet50 model. For adapting the affine layers, we follow Houlsby et al. (2019):

| mCE on IN-C test [%] (↘) | w/o adapt | w/ adapt: affine layers | w/ adapt: bottleneck layers | w/ adapt: lin. layers | w/ adapt: all weights |
|---|---|---|---|---|---|
| ViT-S/16 | 62.3 | 51.8 | 46.8 | 45.2 | **43.5** |

5 UNDERSTANDING TEST-TIME ADAPTATION WITH SELF-LEARNING

In the following section, we show ablations and interesting insights on using self-learning for test-time adaptation. If not specified otherwise, all ablations are run on the holdout corruptions of IN-C (our dev set) with a vanilla ResNet50.
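The design choices studied in the following ablations (loss function, confidence threshold, teacher update interval, and which parameters are adapted) all enter a short adaptation loop. The sketch below shows one possible way to combine them (robust pseudo-labeling with an instant teacher and affine-only updates); the function name, optimizer settings and hyperparameter values are illustrative assumptions, not our exact training code.

```python
import torch

def adapt(model, target_loader, epochs=1, lr=1e-3, q=0.8):
    """Sketch of source-free test-time adaptation with robust pseudo-labeling (RPL).
    `model` is assumed to be a pretrained classifier; only the affine (scale/shift)
    parameters of its normalization layers are updated."""
    model.train()  # normalization layers use target batch statistics during adaptation
    norm_types = (torch.nn.BatchNorm2d, torch.nn.GroupNorm, torch.nn.LayerNorm)
    affine_params = [p for m in model.modules() if isinstance(m, norm_types)
                     for p in m.parameters()]
    optimizer = torch.optim.SGD(affine_params, lr=lr, momentum=0.9)

    for _ in range(epochs):
        for x, _ in target_loader:             # target labels are never used
            logits = model(x)
            pseudo = logits.argmax(dim=1)      # "instant" teacher: current model, hard labels
            p = logits.softmax(dim=1).gather(1, pseudo[:, None]).squeeze(1)
            loss = ((1.0 - p.pow(q)) / q).mean()   # GCE loss (Eq. 4) on the pseudo-labels
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```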
**Table 6: Robust pseudo-labeling outperforms entropy minimization on large-scale datasets while the reverse is true on small-scale datasets.** We find that robust pseudo-labeling consistently improves over entropy minimization on IN-C, while entropy minimization performs better on smaller-scale data (CIFAR10, STL10, MNIST). This finding highlights the importance of testing both algorithms on new datasets. The improvement is typically on the order of one percent point:

| mCE, IN-C dev | ResNet50 | ResNeXt-101 | EfficientNet-L2 |
|---|---|---|---|
| ENT | 50.0 ± 0.04 | 43.0 | 22.2 |
| RPL | **48.9 ± 0.02** | **42.0** | **21.3** |

| top-1 err, CIFAR10-C | WRN-40 |
|---|---|
| ENT | **8.5** |
| RPL | 9.0 |

**Table 7: Robust pseudo-labeling allows usage of the full dataset without a threshold.** Classical hard labeling needs a confidence threshold (T) for best performance, thereby reducing the dataset size, while best performance for RPL is reached for full-dataset training with a threshold T of 0.0:

| mCE on IN-C dev [%] | no adapt | soft PL | hard PL, T=0.0 | hard PL, **T=0.5** | hard PL, T=0.9 | RPL, **T=0.0** | RPL, T=0.5 | RPL, T=0.9 |
|---|---|---|---|---|---|---|---|---|
| | 69.5 | 60.1 | 53.8 | 51.9 | 52.4 | **49.7** | 49.9 | 51.8 |

**Table 8: Short update intervals are crucial for fast adaptation.** Having established that RPL generally performs better than soft- and hard-labeling, we vary the update interval for the teacher. We find that instant updates are most effective. In entropy minimization, the update interval is instant by default.

| Update interval for RPL | w/o adapt | no update | epoch | instant |
|---|---|---|---|---|
| mCE on IN-C dev [%] | 69.5 | 54.0 | 49.7 | **49.2** |

**Table 9: Adaptation of only affine layers is important in CNNs.** On IN-C, adapting only the affine parameters after the normalization layers (i.e., the rescaling and shift parameters β and γ) works better on a ResNet50 architecture than adapting all parameters or only the last layer. We indicate the number of adapted parameters in brackets.

| Adaptation mechanism | w/o adapt | last layer | full model | affine |
|---|---|---|---|---|
| mCE on IN-C dev [%] [adapted parameters] | 69.5 [0] | 60.2 [2M] | 51.5 [22.6M] | **48.9 [5.3k]** |

Note that for Vision Transformers, full model adaptation works better than affine adaptation (see Table 5). We also noticed that on convolutional models with a smaller parameter count like ResNet18, full model adaptation is possible.

**Hyperparameters obtained on corruption datasets transfer well to real-world datasets.** When evaluating models, we select the hyperparameters discussed above (the learning rate and the epoch used for early stopping are the most critical ones) on the holdout set of IN-C. We note that this technique transfers well to IN-R, -A and -D, highlighting the practical value of corruption robustness datasets for adapting models to real distribution shifts. On IN-D, we performed a control experiment where we selected hyperparameters with leave-one-out cross-validation—this selection scheme actually performed worse than IN-C parameter selection (see Appendix D.1).

6 ADAPTING MODELS ON A WIDER RANGE OF DISTRIBUTION SHIFTS REVEALS LIMITATIONS OF ROBUSTIFICATION AND ADAPTATION METHODS

Robustness datasets at ImageNet scale have so far been limited to a few selected domains (image corruptions in IN-C, image renditions in IN-R, difficult images for ResNet50 classifiers in IN-A). In order to test our approach on a wider range of complex distribution shifts, we re-purpose the dataset from the Visual Domain Adaptation Challenge 2019 (DomainNet, Saenko et al., 2019) as an additional robustness benchmark. This dataset comes with six image styles: Clipart, Real, Infograph, Painting, Quickdraw and Sketch.
It has 345 classes in total, of which 164 overlap with IN. To benchmark the robustness of IN-trained models out of the box, we filter out the classes that cannot be mapped to IN and refer to this smaller version of DomainNet as ImageNet-D (IN-D). We map 463 IN classes to these 164 IN-D classes; e.g., for an image from the "bird" class in IN-D, we accept all 39 bird classes in IN as valid predictions. We show example images from IN-D in Table 10. The detailed evaluation protocol, along with justifications for our design choices and additional analysis, is outlined in Appendix D. The benefit of IN-D over DomainNet is the re-mapping to ImageNet classes, which allows robustness researchers to easily benchmark on this dataset without the need to re-train a model (as is common in UDA).

To test whether self-learning is helpful for more complex distribution shifts, we adapt a vanilla ResNet50, several robust IN-C models and the EfficientNet-L2 Noisy Student model on IN-D. We use the same hyperparameters we obtained on IN-C dev for all our IN-D experiments. We show our main results in Table 10.

Table 10: Self-learning decreases the top1 error on some IN-D domains but increases it on others (top1 error [%], without and with adaptation).

| model | Real w/o | Real w/ | Painting w/o | Painting w/ | Clipart w/o | Clipart w/ | Sketch w/o | Sketch w/ | Infograph w/o | Infograph w/ | Quickdraw w/o | Quickdraw w/ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| EffNet-L2 Noisy Student | 29.2 | **27.9** | 42.7 | **40.9** | 45.0 | **37.9** | 56.4 | **51.5** | **77.9** | 94.3 | **98.4** | 99.4 |
| ResNet50 DAug+AM | 39.2 | 36.5 | 58.7 | 53.4 | 68.4 | 57.0 | 75.2 | 61.3 | 88.1 | 83.2 | 98.2 | 99.1 |
| ResNet50 vanilla | 40.1 | 37.3 | 65.1 | 57.8 | 76.0 | 63.6 | 82.0 | 73.0 | 89.6 | 85.1 | 99.2 | 99.8 |

**More robust models perform better on IN-D.** Comparing the performance of the vanilla ResNet50 model to its robust DAug+AM variant, we find that the DAug+AM model performs better on all domains, with the most significant gains on the "Clipart", "Painting" and "Sketch" domains. We show detailed results for all domains and all tested models in Appendix D.2, along with results on IN-C and IN-R for comparison. We find that the best performing models on IN-D are also the strongest ones on IN-C and IN-R, which indicates good generalization capabilities of the techniques combined in these models, given the large differences between the three considered datasets. However, even the best models perform 20 to 30 percentage points worse on IN-D compared to their performance on IN-C or IN-R, indicating that IN-D might be a more challenging benchmark.

**All models struggle with some domains of IN-D.** The EfficientNet-L2 Noisy Student model obtains the best results on most domains. However, we note that the overall error rates are surprisingly high compared to the model's strong performance on the other considered datasets (IN-A: 14.8% top-1 error, IN-R: 17.4% top-1 error, IN-C: 22.0% mCE). Even on the "Real" domain, which is closest to clean IN (where the EfficientNet-L2 model has a top-1 error of 11.6%), the model only reaches a top-1 error of 29.2%. Self-learning decreases the top1 error on all domains except for "Infograph" and "Quickdraw". We note that both domains have very high error rates from the beginning and thus hypothesize that the produced pseudo-labels are of low quality.

**Error analysis on IN-D.** We investigate the errors a ResNet50 model makes on IN-D by analyzing the most frequently predicted classes for different domains to reveal systematic errors indicative of the encountered distribution shifts.
We find most errors interpretable: the classifier assigns the label "comic book" to images from the "Clipart" or "Painting" domains, "website" to images from the "Infograph" domain, and "envelope" to images from the "Sketch" domain. Thus, the classifier predicts the domain rather than the class. We find no systematic errors on the "Real" domain, which is expected since this domain should be similar to IN. Detailed results on the top-3 most frequently predicted classes for different domains can be found in Fig. 9, Appendix D.4.

**IN-D should be used as an additional robustness benchmark.** While the error rates on IN-C, -R and -A are at a well-acceptable level for our largest EfficientNet-L2 model after adaptation, IN-D performance is consistently worse for all models. We propose to move from isolated benchmark settings like IN-R (single domain) to benchmarks more common in domain adaptation (like DomainNet) and make IN-D publicly available as an easy-to-use dataset for this purpose.

**Additional experiments and limitations.** We discuss additional proof-of-concept implementations on the WILDS benchmark (Koh et al., 2021), BigTransfer (BiT; Chen et al., 2020a) models and self-learning based UDA models in Appendix E. On WILDS, self-learning is effective for the Camelyon17 task with a systematic shift between train, validation and test sets (each set is comprised of different hospitals), while self-learning fails to improve on tasks with mixed domains.

7 A SIMPLE MODEL OF STABILITY IN SELF-LEARNING

We observed that different self-learning schemes are optimal for small-scale vs. large-scale datasets and varying numbers of classes. We reconsider the used loss functions and unify them into

$$\ell(x) = -\sum_j \sigma_j\!\left(\frac{f^t(x)}{\tau_t}\right) \log \sigma_j\!\left(\frac{f^s(x)}{\tau_s}\right), \qquad f^t(x) = \begin{cases} f(x), & \text{entropy minimization,} \\ \mathrm{sg}(f(x)), & \text{pseudo-labeling.} \end{cases} \qquad (5)$$

Here we introduced the student and teacher temperatures $\tau_s$ and $\tau_t$ as parameters in the softmax function, as well as the stop gradient operation sg. Caron et al. (2021) fixed $\tau_s$ and varied $\tau_t$ during training, and empirically found an upper bound for $\tau_t$ above which the training was no longer stable. To better understand this behavior, we study the learning dynamics of the loss function in equation 5 theoretically in a simple two-datapoint, two-class model with linear student and teacher networks $f^s(x) = x^\top w^s$ and $f^t(x) = x^\top w^t$, defined in Appendix A.1. Gradient descent with stop gradient corresponds to hard pseudo-labeling in the limit $\tau_t \to 0$ and to soft pseudo-labeling when $\tau_s = \tau_t = 1$. Gradient descent without stop gradient, i.e., setting $w^s = w^t = w$, corresponds to entropy minimization. We obtain the following result:

**Proposition 1 (Collapse in the two-point model).** The student and teacher networks $w^s$ and $w^t$ trained with stop gradient do not collapse to the trivial representation $\forall x: x^\top w^s = 0,\ x^\top w^t = 0$ if $\tau_s > \tau_t$. The network $w$ trained without stop gradient does not collapse if $\tau_s > \tau_t/2$.

Proof: see § A.2.

We validate the proposition on a simulated two-datapoint toy dataset, as well as on the CIFAR10-C dataset, and outline the results in Figure 2. In general, the size and location of the region where collapse is observed in the simulated model also depend on the initial conditions, the learning rate and the optimization procedure. An in-depth discussion, as well as additional simulations, are given in the Appendix.
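For reference, equation 5 can be implemented in a few lines of PyTorch, where the choice between pseudo-labeling and entropy minimization reduces to detaching the teacher logits; the default temperature values below are placeholders only.

```python
import torch

def unified_self_learning_loss(logits, tau_s=1.0, tau_t=0.5, pseudo_labeling=True):
    """Unified self-learning objective (Eq. 5).

    pseudo_labeling=True:  teacher logits are detached (stop gradient sg), i.e. pseudo-labeling.
    pseudo_labeling=False: gradients flow through both factors, i.e. entropy minimization.
    The default temperatures are illustrative placeholders.
    """
    teacher_logits = logits.detach() if pseudo_labeling else logits
    teacher_probs = (teacher_logits / tau_t).softmax(dim=1)
    student_log_probs = (logits / tau_s).log_softmax(dim=1)
    return -(teacher_probs * student_log_probs).sum(dim=1).mean()
```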
In practice, the result suggests that student temperatures should exceed the teacher temperatures for pseudo-labeling, and student temperatures should exceed half the teacher temperature for entropy minimization.

Figure 2: For the two-point model, we show the error, and for the CIFAR10-C simulation, we show improvement (yellow) vs. degradation (purple) over the non-adapted baseline (BAS). An important convergence criterion for pseudo-labeling (top row) and entropy minimization (bottom row) is the ratio of student and teacher temperatures; it lies at $\tau_s = \tau_t$ for PL, and $2\tau_s = \tau_t$ for ENT. Despite the simplicity of the two-point model, the general convergence regions transfer to CIFAR10-C.

Entropy minimization with standard temperatures ($\tau_s = \tau_t = 1$) and hard pseudo-labeling ($\tau_t \to 0$) are hence stable. The two-point learning dynamics vanish for soft pseudo-labeling with $\tau_s = \tau_t$, suggesting that one would have to analyze a more complex model with more data points. While this does not directly imply that learning is unstable at this point, we empirically observe that both entropy minimization and hard labeling outperform soft labeling in practice.

8 CONCLUSION

We evaluated and analyzed how self-learning, an essential component in many unsupervised domain adaptation and self-supervised pre-training techniques, can be applied for adaptation to both small- and large-scale image recognition problems common in robustness research. We demonstrated new state-of-the-art adaptation results with the EfficientNet-L2 model on the benchmarks ImageNet-C, -R, and -A, and introduced a new benchmark dataset (ImageNet-D) which remains challenging even after adaptation. Our theoretical analysis shows the influence of the temperature parameter in the self-learning loss function on the training stability and provides guidelines on how to choose a suitable value. Self-learning universally improves test-time performance under diverse, but systematic distribution shifts, irrespective of the architecture or pre-training method. We hope that our work encourages both researchers and practitioners to use self-learning if their data distribution shifts.

**Reproducibility Statement** We attempted to make our work as reproducible as possible: we mostly used pre-trained models which are publicly available and denoted the URLs of all used checkpoints; for the checkpoints that had to be retrained, we report the GitHub repositories with the source code and used an official or verified reference implementation when available. We report all used hyperparameters in the Appendix and will release our code upon acceptance of the paper.

REFERENCES

Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. Tensorflow: A system for large-scale machine learning. In 12th {USENIX} Symposium on Operating Systems Design and Implementation ({OSDI} 16), pp. 265–283, 2016. 37

Dan Hendrycks, Kevin Zhao, Steven Basart, Jacob Steinhardt, and Dawn Song. Natural adversarial examples.
ArXiv preprint, abs/1907.07174, 2019. URL [https://arxiv.org/abs/1907.07174. 4](https://arxiv.org/abs/1907.07174) Christopher Berner, Greg Brockman, Brooke Chan, Vicki Cheung, Przemyslaw Debiak, Christy Dennison, David Farhi, Quirin Fischer, Shariq Hashme, Chris Hesse, et al. Dota 2 with large [scale deep reinforcement learning. ArXiv preprint, abs/1912.06680, 2019. URL https://](https://arxiv.org/abs/1912.06680) [arxiv.org/abs/1912.06680. 1](https://arxiv.org/abs/1912.06680) David Berthelot, Rebecca Roelofs, Kihyuk Sohn, Nicholas Carlini, and Alex Kurakin. Adamatch: A unified approach to semi-supervised learning and domain adaptation, 2021. 2, 35 Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are fewshot learners. In Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin (eds.), Advances in Neural Information Processing Systems 33: Annual _Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12,_ _[2020, virtual, 2020. URL https://proceedings.neurips.cc/paper/2020/hash/](https://proceedings.neurips.cc/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html)_ [1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html. 1](https://proceedings.neurips.cc/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html) Tianle Cai, Ruiqi Gao, Jason D Lee, and Qi Lei. A theory of label propagation for subpopulation shift. arXiv preprint arXiv:2102.11203, 2021. 2, 35 Mathilde Caron, Hugo Touvron, Ishan Misra, Herv´e J´egou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. _ArXiv preprint,_ [abs/2104.14294, 2021. URL https://arxiv.org/abs/2104.14294. 3, 6, 8, 21](https://arxiv.org/abs/2104.14294) Ting Chen, Simon Kornblith, Kevin Swersky, Mohammad Norouzi, and Geoffrey E. Hinton. Big self-supervised models are strong semi-supervised learners. In Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin (eds.), Advances in Neural Information Processing Systems 33: _Annual Conference_ _on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020,_ _virtual, 2020a._ [URL https://proceedings.neurips.cc/paper/2020/hash/](https://proceedings.neurips.cc/paper/2020/hash/fcbc95ccdd551da181207c0c1400c655-Abstract.html) [fcbc95ccdd551da181207c0c1400c655-Abstract.html. 8](https://proceedings.neurips.cc/paper/2020/hash/fcbc95ccdd551da181207c0c1400c655-Abstract.html) Yining Chen, Colin Wei, Ananya Kumar, and Tengyu Ma. Self-training avoids using spurious features under domain shift. In NeurIPS, 2020b. 2, 35 Franc¸ois Chollet. Xception: Deep learning with depthwise separable convolutions. In 2017 IEEE _Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July_ _21-26, 2017, pp. 1800–1807. IEEE Computer Society, 2017. doi: 10.1109/CVPR.2017.195. URL_ [https://doi.org/10.1109/CVPR.2017.195. 4, 20](https://doi.org/10.1109/CVPR.2017.195) Adam Coates, Andrew Ng, and Honglak Lee. An analysis of single-layer networks in unsupervised feature learning. 
In Proceedings of the Fourteenth International Conference on Artificial _Intelligence and Statistics, 2011. 4_ Francesco Croce, Maksym Andriushchenko, Vikash Sehwag, Edoardo Debenedetti, Nicolas Flammarion, Mung Chiang, Prateek Mittal, and Matthias Hein. Robustbench: a standardized adversarial robustness benchmark. _ArXiv preprint, abs/2010.09670, 2020._ [URL https:](https://arxiv.org/abs/2010.09670) [//arxiv.org/abs/2010.09670. 21](https://arxiv.org/abs/2010.09670) ----- Fabio De Sousa Ribeiro, Francesco Caliv´a, Mark Swainson, Kjartan Gudmundsson, Georgios Leontidis, and Stefanos Kollias. Deep bayesian self-training. _Neural Computing and_ _Applications, 32(9):4275–4291, 2020. 2, 36_ Li Deng. The mnist database of handwritten digit images for machine learning research. IEEE _Signal Processing Magazine, 29(6):141–142, 2012. 4_ Samuel F. Dodge and Lina J. Karam. A study and comparison of human and deep learning recognition performance under visual distortions. In International Conference on Computer _Communications and Networks, ICCCN 2017, 2017. 1_ Geoffrey French, Michal Mackiewicz, and Mark H. Fisher. Self-ensembling for visual domain adaptation. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, _BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net, 2018._ [URL https://openreview.net/forum?id=rkpoTaxA-. 3, 33, 37](https://openreview.net/forum?id=rkpoTaxA-) Aram Galstyan and Paul R. Cohen. Empirical comparison of hard and soft label propagation for relational classification. In 17th international conference on Inductive logic programming, 2007. 3 Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, Franc¸ois Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. The journal of machine learning research, 17(1):2096–2030, 2016. 2, 4, 5 Robert Geirhos, Carlos R. Medina Temme, Jonas Rauber, Heiko H. Sch¨utt, Matthias Bethge, and Felix A. Wichmann. Generalisation in humans and deep neural networks. In Samy Bengio, Hanna M. Wallach, Hugo Larochelle, Kristen Grauman, Nicol`o Cesa-Bianchi, and Roman Garnett (eds.), Advances in Neural Information Processing Systems 31: Annual Conference on _Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montr´eal,_ _Canada, pp. 7549–7561, 2018._ [URL https://proceedings.neurips.cc/paper/](https://proceedings.neurips.cc/paper/2018/hash/0937fb5864ed06ffb59ae5f9b5ed67a9-Abstract.html) [2018/hash/0937fb5864ed06ffb59ae5f9b5ed67a9-Abstract.html. 1](https://proceedings.neurips.cc/paper/2018/hash/0937fb5864ed06ffb59ae5f9b5ed67a9-Abstract.html) Robert Geirhos, Patricia Rubisch, Claudio Michaelis, Matthias Bethge, Felix A. Wichmann, and Wieland Brendel. Imagenet-trained cnns are biased towards texture; increasing shape bias improves accuracy and robustness. In 7th International Conference on Learning Representations, _ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019._ [URL https:](https://openreview.net/forum?id=Bygh9j09KX) [//openreview.net/forum?id=Bygh9j09KX. 1, 27](https://openreview.net/forum?id=Bygh9j09KX) Aritra Ghosh, Himanshu Kumar, and P. S. Sastry. Robust loss functions under label noise for deep neural networks. In Satinder P. Singh and Shaul Markovitch (eds.), Proceedings of the Thirty-First _AAAI Conference on Artificial Intelligence, February 4-9, 2017, San Francisco, California, USA,_ [pp. 1919–1925. AAAI Press, 2017. 
URL http://aaai.org/ocs/index.php/AAAI/](http://aaai.org/ocs/index.php/AAAI/AAAI17/paper/view/14759) [AAAI17/paper/view/14759. 3](http://aaai.org/ocs/index.php/AAAI/AAAI17/paper/view/14759) Yves Grandvalet and Yoshua Bengio. Semi-supervised learning by entropy minimization. In Advances in Neural Information Processing Systems 17 [Neural Information Processing _Systems, NIPS 2004, December 13-18, 2004, Vancouver, British Columbia, Canada], pp._ 529–536, 2004. [URL https://proceedings.neurips.cc/paper/2004/hash/](https://proceedings.neurips.cc/paper/2004/hash/96f2b50b5d3613adf9c27049b2a888c7-Abstract.html) [96f2b50b5d3613adf9c27049b2a888c7-Abstract.html. 3](https://proceedings.neurips.cc/paper/2004/hash/96f2b50b5d3613adf9c27049b2a888c7-Abstract.html) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR _2016, Las Vegas, NV, USA, June 27-30, 2016, pp. 770–778. IEEE Computer Society, 2016a._ [doi: 10.1109/CVPR.2016.90. URL https://doi.org/10.1109/CVPR.2016.90. 1](https://doi.org/10.1109/CVPR.2016.90) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR _2016, Las Vegas, NV, USA, June 27-30, 2016, pp. 770–778. IEEE Computer Society, 2016b._ [doi: 10.1109/CVPR.2016.90. URL https://doi.org/10.1109/CVPR.2016.90. 4, 5,](https://doi.org/10.1109/CVPR.2016.90) 21 ----- Dan Hendrycks and Thomas G. Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. In 7th International Conference on Learning Representations, _ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019._ [URL https:](https://openreview.net/forum?id=HJz6tiCqYm) [//openreview.net/forum?id=HJz6tiCqYm. 4, 27](https://openreview.net/forum?id=HJz6tiCqYm) Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, et al. The many faces of robustness: A critical analysis of out-of-distribution generalization. _ArXiv preprint, abs/2006.16241, 2020a._ URL [https://arxiv.org/abs/2006.16241. 1, 4, 5, 20, 21, 27](https://arxiv.org/abs/2006.16241) Dan Hendrycks, Norman Mu, Ekin Dogus Cubuk, Barret Zoph, Justin Gilmer, and Balaji Lakshminarayanan. Augmix: A simple data processing method to improve robustness and uncertainty. In 8th International Conference on Learning Representations, ICLR 2020, Addis _[Ababa, Ethiopia, April 26-30, 2020. OpenReview.net, 2020b. URL https://openreview.](https://openreview.net/forum?id=S1gmrxHFvB)_ [net/forum?id=S1gmrxHFvB. 4, 5, 21, 27](https://openreview.net/forum?id=S1gmrxHFvB) Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. In _NIPS Deep Learning Workshop, 2014. 4, 20_ Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for NLP. In Proceedings of the 36th International Conference on Machine Learning, 2019. 6 Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, _CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pp. 2261–2269. IEEE Computer Society,_ [2017. doi: 10.1109/CVPR.2017.243. 
URL https://doi.org/10.1109/CVPR.2017.](https://doi.org/10.1109/CVPR.2017.243) [243. 4, 5, 21, 32](https://doi.org/10.1109/CVPR.2017.243) Youngeun Kim, Donghyeon Cho, Kyeongtak Han, Priyadarshini Panda, and Sungeun Hong. Domain adaptation without source data. IEEE Transactions on Artificial Intelligence, 2021. 2 Pang Wei Koh, Shiori Sagawa, Henrik Marklund, Sang Michael Xie, Marvin Zhang, Akshay Balsubramani, Weihua Hu, Michihiro Yasunaga, Richard Lanas Phillips, Irena Gao, Tony Lee, Etienne David, Ian Stavness, Wei Guo, Berton A. Earnshaw, Imran S. Haque, Sara Beery, Jure Leskovec, Anshul Kundaje, Emma Pierson, Sergey Levine, Chelsea Finn, and Percy Liang. WILDS: A benchmark of in-the-wild distribution shifts. In International Conference on Machine _Learning (ICML), 2021. 8, 32, 37_ Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Joan Puigcerver, Jessica Yung, Sylvain Gelly, and Neil Houlsby. Big transfer (bit): General visual representation learning. In Computer Vision– _ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part_ _V 16, pp. 491–507. Springer, 2020. 33_ Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009. 4 Ananya Kumar, Tengyu Ma, and Percy Liang. Understanding self-training for gradual domain adaptation. In International Conference on Machine Learning, pp. 5468–5479. PMLR, 2020. 2, 35 Jogendra Nath Kundu, Naveen Venkat, R Venkatesh Babu, et al. Universal source-free domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern _Recognition, pp. 4544–4553, 2020. 2_ Dong-Hyun Lee. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In ICML Workshop : Challenges in Representation Learning (WREPL), 2013. 3 Rui Li, Qianfen Jiao, Wenming Cao, Hau-San Wong, and Si Wu. Model adaptation: Unsupervised domain adaptation without source data. In 2020 IEEE/CVF Conference on Computer Vision and _Pattern Recognition (CVPR), 2020. 2_ ----- Jian Liang, Dapeng Hu, and Jiashi Feng. Do we really need to access the source data? source hypothesis transfer for unsupervised domain adaptation. In International Conference on Machine _Learning, 2020. 2_ Dhruv Mahajan, Ross Girshick, Vignesh Ramanathan, Kaiming He, Manohar Paluri, Yixuan Li, Ashwin Bharambe, and Laurens van der Maaten. Exploring the limits of weakly supervised pretraining. In Proceedings of the European Conference on Computer Vision (ECCV), 2018. 1, 4, 5, 6, 20, 21 S´ebastien Marcel and Yann Rodriguez. Torchvision the machine-vision package of torch. In ACM _International Conference on Multimedia, 2010. 21, 37_ Dirk Merkel. Docker: Lightweight linux containers for consistent development and deployment. _Linux J., 2014(239), 2014. ISSN 1075-3583. 37_ Subhabrata Mukherjee and Ahmed Hassan Awadallah. Uncertainty-aware self-training for text classification with few labels. In NeurIPS, 2020. 2, 36 Zachary Nado, Shreyas Padhy, D Sculley, Alexander D’Amour, Balaji Lakshminarayanan, and Jasper Snoek. Evaluating prediction-time batch normalization for robustness under covariate shift. _[ArXiv preprint, abs/2006.10963, 2020. URL https://arxiv.org/abs/2006.10963. 2](https://arxiv.org/abs/2006.10963)_ Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NIPS Autodiff Workshop, 2017. 37 Viraj Prabhu, Shivam Khare, Deeksha Kartik, and Judy Hoffman. 
Sentry: Selective entropy optimization via committee consistency for unsupervised domain adaptation. In Proceedings _of the IEEE/CVF International Conference on Computer Vision, pp. 8558–8567, 2021. 2, 35_ Mamshad Nayeem Rizve, Kevin Duarte, Yogesh S Rawat, and Mubarak Shah. In defense of pseudolabeling: An uncertainty-aware pseudo-label selection framework for semi-supervised learning. In ICLR, 2021. 2, 36 Evgenia Rusak, Lukas Schott, Roland Zimmermann, Julian Bitterwolf, Oliver Bringmann, Matthias Bethge, and Wieland Brendel. Increasing the robustness of dnns against image corruptions by [playing the game of noise. ArXiv preprint, abs/2001.06057, 2020. URL https://arxiv.](https://arxiv.org/abs/2001.06057) [org/abs/2001.06057. 1, 27](https://arxiv.org/abs/2001.06057) Kate Saenko, Xingchao Peng, Ben Usman, Kuniaki Saito, and Ping Hu. Visual Domain Adaptation _[Challenge (VisDA-2019), 2019. URL http://ai.bu.edu/visda-2019/. 7](http://ai.bu.edu/visda-2019/)_ Steffen Schneider, Evgenia Rusak, Luisa Eck, Oliver Bringmann, Wieland Brendel, and Matthias Bethge. Improving robustness against common corruptions by covariate shift adaptation. In _Advances in neural information processing systems, 2020. 2, 4, 6, 20, 24_ Jun Shu, Qian Zhao, Keyu Chen, Zongben Xu, and Deyu Meng. Learning adaptive loss for robust [learning with noisy labels. ArXiv preprint, abs/2002.06482, 2020. URL https://arxiv.](https://arxiv.org/abs/2002.06482) [org/abs/2002.06482. 3](https://arxiv.org/abs/2002.06482) Rui Shu, Hung H. Bui, Hirokazu Narui, and Stefano Ermon. A DIRT-T approach to unsupervised domain adaptation. In 6th International Conference on Learning Representations, ICLR 2018, _Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net,_ [2018. URL https://openreview.net/forum?id=H1q-TM-AW. 2, 3, 34](https://openreview.net/forum?id=H1q-TM-AW) Kihyuk Sohn, David Berthelot, Chun-Liang Li, Zizhao Zhang, Nicholas Carlini, Ekin D Cubuk, Alex Kurakin, Han Zhang, and Colin Raffel. Fixmatch: Simplifying semi-supervised learning with consistency and confidence. In NeurIPS, 2020. 35 Hwanjun Song, Minseok Kim, Dongmin Park, and Jae-Gil Lee. Learning from noisy labels [with deep neural networks: A survey. ArXiv preprint, abs/2007.08199, 2020. URL https:](https://arxiv.org/abs/2007.08199) [//arxiv.org/abs/2007.08199. 3](https://arxiv.org/abs/2007.08199) ----- Yu Sun, Eric Tzeng, Trevor Darrell, and Alexei A Efros. Unsupervised domain adaptation through [self-supervision. ArXiv preprint, abs/1909.11825, 2019a. URL https://arxiv.org/abs/](https://arxiv.org/abs/1909.11825) [1909.11825. 4, 5](https://arxiv.org/abs/1909.11825) Yu Sun, Xiaolong Wang, Zhuang Liu, John Miller, Alexei A Efros, and Moritz Hardt. Test-time training for out-of-distribution generalization. _ArXiv preprint, abs/1909.13231, 2019b._ URL [https://arxiv.org/abs/1909.13231. 2, 6, 25](https://arxiv.org/abs/1909.13231) Mingxing Tan and Quoc V. Le. Efficientnet: Rethinking model scaling for convolutional neural networks. In Kamalika Chaudhuri and Ruslan Salakhutdinov (eds.), Proceedings of the 36th _International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach,_ _California, USA, volume 97 of Proceedings of Machine Learning Research, pp. 6105–6114._ [PMLR, 2019. URL http://proceedings.mlr.press/v97/tan19a.html. 4, 21](http://proceedings.mlr.press/v97/tan19a.html) O. Tange. Gnu parallel - the command-line power tool. ;login: The USENIX Magazine, 36(1): [42–47, 2011. URL http://www.gnu.org/s/parallel. 
37](http://www.gnu.org/s/parallel) Pauli Virtanen, Ralf Gommers, Travis E. Oliphant, Matt Haberland, Tyler Reddy, David Cournapeau, Evgeni Burovski, Pearu Peterson, Warren Weckesser, Jonathan Bright, St´efan J. van der Walt, Matthew Brett, Joshua Wilson, K. Jarrod Millman, Nikolay Mayorov, Andrew R. J. Nelson, Eric Jones, Robert Kern, Eric Larson, CJ Carey, [˙]Ilhan Polat, Yu Feng, Eric W. Moore, Jake Vand erPlas, Denis Laxalde, Josef Perktold, Robert Cimrman, Ian Henriksen, E. A. Quintero, Charles R Harris, Anne M. Archibald, Antˆonio H. Ribeiro, Fabian Pedregosa, Paul van Mulbregt, and SciPy 1. 0 Contributors. SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python. Nature Methods, 17:261–272, 2020. doi: https://doi.org/10.1038/s41592-019-0686-2. 37 Dequan Wang, Evan Shelhamer, Shaoteng Liu, Bruno Olshausen, and Trevor Darrell. Fully test[time adaptation by entropy minimization. ArXiv preprint, abs/2006.10726, 2020. URL https:](https://arxiv.org/abs/2006.10726) [//arxiv.org/abs/2006.10726. 2, 6](https://arxiv.org/abs/2006.10726) Colin Wei, Kendrick Shen, Yining Chen, and Tengyu Ma. Theoretical analysis of self-training with deep networks on unlabeled data. In ICLR, 2020. 2, 35 Ross Wightman. Pytorch image models. [https://github.com/rwightman/](https://github.com/rwightman/pytorch-image-models) [pytorch-image-models, 2019. 33, 37](https://github.com/rwightman/pytorch-image-models) Qizhe Xie, Minh-Thang Luong, Eduard H. Hovy, and Quoc V. Le. Self-training with noisy student improves imagenet classification. In 2020 IEEE/CVF Conference on Computer Vision and Pattern _Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020, pp. 10684–10695. IEEE, 2020a._ [doi: 10.1109/CVPR42600.2020.01070. URL https://doi.org/10.1109/CVPR42600.](https://doi.org/10.1109/CVPR42600.2020.01070) [2020.01070. 1, 3, 4, 5, 20, 21, 24](https://doi.org/10.1109/CVPR42600.2020.01070) Saining Xie, Ross B. Girshick, Piotr Doll´ar, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In 2017 IEEE Conference on Computer Vision and _Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pp. 5987–5995. IEEE_ Computer Society, 2017. doi: 10.1109/CVPR.2017.634. [URL https://doi.org/10.](https://doi.org/10.1109/CVPR.2017.634) [1109/CVPR.2017.634. 4, 5, 21](https://doi.org/10.1109/CVPR.2017.634) Sang Michael Xie, Ananya Kumar, Robbie Jones, Fereshte Khani, Tengyu Ma, and Percy Liang. Inn-out: Pre-training and self-training using auxiliary information for out-of-distribution robustness. _arXiv preprint arXiv:2012.04550, 2020b. 2, 35_ Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In Richard C. Wilson, Edwin R. Hancock, and William A. P. Smith (eds.), Proceedings of the British Machine Vision Conference _[2016, BMVC 2016, York, UK, September 19-22, 2016. BMVA Press, 2016. URL http://www.](http://www.bmva.org/bmvc/2016/papers/paper087/index.html)_ [bmva.org/bmvc/2016/papers/paper087/index.html. 4, 5](http://www.bmva.org/bmvc/2016/papers/paper087/index.html) Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In 5th International Conference on Learning _Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings._ [OpenReview.net, 2017. URL https://openreview.net/forum?id=Sy8gdB9xx. 3](https://openreview.net/forum?id=Sy8gdB9xx) ----- Marvin Zhang, Sergey Levine, and Chelsea Finn. 
MEMO: Test-time robustness via adaptation and augmentation. ArXiv preprint, abs/2110.09506, 2021. 2, 26

Zhilu Zhang and Mert R. Sabuncu. Generalized cross entropy loss for training deep neural networks with noisy labels. In Samy Bengio, Hanna M. Wallach, Hugo Larochelle, Kristen Grauman, Nicolò Cesa-Bianchi, and Roman Garnett (eds.), Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada, pp. 8792–8802, 2018. URL https://proceedings.neurips.cc/paper/2018/hash/f2925f97bc13ad2852a7a551802feea0-Abstract.html. 3

Barret Zoph, Golnaz Ghiasi, Tsung-Yi Lin, Yin Cui, Hanxiao Liu, Ekin D Cubuk, and Quoc V Le. Rethinking pre-training and self-training. In NeurIPS, 2020. 2, 35

Yang Zou, Zhiding Yu, BVK Kumar, and Jinsong Wang. Unsupervised domain adaptation for semantic segmentation via class-balanced self-training. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 289–305, 2018. 2, 35

Yang Zou, Zhiding Yu, Xiaofeng Liu, BVK Kumar, and Jinsong Wang. Confidence regularized self-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5982–5991, 2019. 2, 35

-----

A A TWO-POINT MODEL OF SELF-LEARNING

A.1 DEFINITION OF THE TWO-POINT MODEL

To understand the learning dynamics and properties of different loss functions and their hyperparameters, we propose a simple model of self-learning, both for entropy minimization and pseudo-labeling. A student network $w^s \in \mathbb{R}^d$ and a teacher network $w^t \in \mathbb{R}^d$ are trained on $N$ data points $\{x_i\}_{i=1}^{N}$ with the cross-entropy loss function $L$ defined as
$$L = \sum_{i=1}^{N} \ell(x_i), \qquad \ell(x_i) = -\sigma_t(x_i^\top w^t)\,\log \sigma_s(x_i^\top w^s) - \sigma_t(-x_i^\top w^t)\,\log \sigma_s(-x_i^\top w^s), \tag{6}$$
where $\sigma_t(z) = \frac{1}{1 + e^{-z/\tau_t}}$ and $\sigma_s(z) = \frac{1}{1 + e^{-z/\tau_s}}$. Here, $\tau_s$ and $\tau_t$ denote the student and teacher temperature parameters. With stop gradient, student and teacher evolve in time according to
$$\dot{w}^s = -\nabla_{w^s} L(w^s, w^t), \qquad \dot{w}^t = \alpha\,(w^s - w^t), \tag{7}$$
where $\alpha$ is the learning rate of the teacher. Without stop gradient, student and teacher are set equal to each other, and they evolve as
$$\dot{w} = -\nabla_{w} L(w), \qquad \text{where } w^s = w^t = w. \tag{8}$$
We restrict the theoretical analysis to the time evolution of the components of $w^{s,t}$ in the direction of two data points $x_k$ and $x_l$, $y_k^{s,t} \equiv x_k^\top w^{s,t}$ and $y_l^{s,t} \equiv x_l^\top w^{s,t}$. All other components $y_i^{s,t}$ with $i \neq k, l$ are neglected to reduce the dimensionality of the equation system. It turns out that the resulting model captures the neural network dynamics quite well despite the drastic simplification of taking only two data points into account (see Figure 2).
$$\begin{aligned}
\text{with stop gradient:} \quad & \dot{y}_k^s = -x_k^\top \nabla_{w^s}\big(\ell(x_k) + \ell(x_l)\big), \qquad \dot{y}_l^s = -x_l^\top \nabla_{w^s}\big(\ell(x_k) + \ell(x_l)\big), \\
& \dot{y}_k^t = \alpha\,(y_k^s - y_k^t), \qquad \dot{y}_l^t = \alpha\,(y_l^s - y_l^t), \\
\text{without stop gradient:} \quad & \dot{y}_k = -x_k^\top \nabla_{w}\big(\ell(x_k) + \ell(x_l)\big), \qquad \dot{y}_l = -x_l^\top \nabla_{w}\big(\ell(x_k) + \ell(x_l)\big).
\end{aligned} \tag{9}$$
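For readers who prefer code, the loss in equation 6 can be transcribed directly. The following PyTorch sketch is our own shorthand for the binary two-point setting (with logits $y^s = x^\top w^s$ and $y^t = x^\top w^t$); the function name is ours, and the teacher is detached to mimic the stop-gradient variant.

```python
import torch
import torch.nn.functional as F


def two_point_loss(y_s, y_t, tau_s=1.0, tau_t=1.0):
    """Soft pseudo-labeling cross-entropy of equation 6 for binary logits.

    y_s, y_t: student and teacher logits x^T w^s and x^T w^t (1-d tensors).
    The teacher is detached, corresponding to the stop-gradient variant."""
    y_t = y_t.detach()
    teacher_pos = torch.sigmoid(y_t / tau_t)        # sigma_t(y^t)
    teacher_neg = torch.sigmoid(-y_t / tau_t)       # sigma_t(-y^t)
    log_student_pos = F.logsigmoid(y_s / tau_s)     # log sigma_s(y^s)
    log_student_neg = F.logsigmoid(-y_s / tau_s)    # log sigma_s(-y^s)
    return -(teacher_pos * log_student_pos + teacher_neg * log_student_neg).sum()
```

In the limit $\tau_t \to 0$, the teacher weights become hard 0/1 labels, recovering the pseudo-labeling case.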
A.2 PROOF OF PROPOSITION 1

**Learning dynamics with stop gradient.** Computing the stop gradient evolution defined in equation 7 explicitly yields
$$\dot{w}^s = -\nabla_{w^s} L = \frac{1}{\tau_s} \sum_{i=1}^{N} \Big[ \sigma_t(x_i^\top w^t)\,\sigma_s(-x_i^\top w^s) - \sigma_t(-x_i^\top w^t)\,\sigma_s(x_i^\top w^s) \Big]\, x_i, \qquad \dot{w}^t = \alpha\,(w^s - w^t). \tag{10}$$
The second equality uses the well-known derivative of the sigmoid function, $\partial_z \sigma(z) = \sigma(z)\,\sigma(-z)$.

The equation system of $2d$ nonlinear, coupled ODEs for $w^s \in \mathbb{R}^d$ and $w^t \in \mathbb{R}^d$ in equation 10 is analytically difficult to analyze. Instead of studying the ODEs directly, we act on them with the data points $x_k^\top$, $k = 1, \ldots, N$, and investigate the dynamics of the components $x_k^\top w^{s,t} \equiv y_k^{s,t}$:
$$\dot{y}_k^s = \frac{1}{\tau_s} \sum_{i=1}^{N} (x_i^\top x_k) \Big[ \sigma_t(y_i^t)\,\sigma_s(-y_i^s) - \sigma_t(-y_i^t)\,\sigma_s(y_i^s) \Big], \qquad \dot{y}_k^t = \alpha\,(y_k^s - y_k^t). \tag{11}$$
The learning rate of each mode $y_k^s$ is scaled by $(x_k^\top x_i)$, which is much larger for $i = k$ than for $i \neq k$ in high-dimensional spaces. In the two-point approximation, we consider only the two (in absolute value) largest terms $i = k, l$ for a given $k$ in the sum in equation 11. Any changes that $y_k^{s,t}(t)$ and $y_l^{s,t}(t)$ might induce in other modes $y_i^{s,t}(t)$ are neglected, and so we are left with only four ODEs:
$$\begin{aligned}
\dot{y}_k^s &= \frac{1}{\tau_s}\,\|x_k\|^2 \Big[ \sigma_t(y_k^t)\sigma_s(-y_k^s) - \sigma_t(-y_k^t)\sigma_s(y_k^s) \Big] + \frac{1}{\tau_s}\,(x_k^\top x_l) \Big[ \sigma_t(y_l^t)\sigma_s(-y_l^s) - \sigma_t(-y_l^t)\sigma_s(y_l^s) \Big], \\
\dot{y}_l^s &= \frac{1}{\tau_s}\,\|x_l\|^2 \Big[ \sigma_t(y_l^t)\sigma_s(-y_l^s) - \sigma_t(-y_l^t)\sigma_s(y_l^s) \Big] + \frac{1}{\tau_s}\,(x_k^\top x_l) \Big[ \sigma_t(y_k^t)\sigma_s(-y_k^s) - \sigma_t(-y_k^t)\sigma_s(y_k^s) \Big], \\
\dot{y}_k^t &= \alpha\,(y_k^s - y_k^t), \qquad \dot{y}_l^t = \alpha\,(y_l^s - y_l^t).
\end{aligned} \tag{12}$$
The fixed points of equation 12 satisfy
$$\dot{y}_k^s = \dot{y}_l^s = \dot{y}_k^t = \dot{y}_l^t = 0. \tag{13}$$
For $\alpha > 0$, requiring $\dot{y}_k^t = \dot{y}_l^t = 0$ implies that $y_k^s = y_k^t$ and $y_l^s = y_l^t$. For $\tau_s = \tau_t$, the two remaining equations $\dot{y}_k^s = \dot{y}_l^s = 0$ then vanish automatically, so that there are no non-trivial two-point learning dynamics. For $\tau_s \neq \tau_t$, there is a fixed point at $y_k^{s,t} = y_l^{s,t} = 0$ since at this point, each bracket in equation 12 vanishes individually:
$$\Big[ \sigma_t(y_{k,l})\,\sigma_s(-y_{k,l}) - \sigma_t(-y_{k,l})\,\sigma_s(y_{k,l}) \Big]\Big|_{y_{k,l}=0} = \tfrac{1}{4} - \tfrac{1}{4} = 0. \tag{14}$$
At the fixed point $y_k^{s,t} = y_l^{s,t} = 0$, $w^s$ and $w^t$ are orthogonal to both $x_k$ and $x_l$ and hence classification fails. If this fixed point is stable, $w^s$ and $w^t$ will stay at the fixed point once they have reached it, i.e., the model collapses. The fixed point is stable when all eigenvalues of the Jacobian $J$ of the ODE system in equation 12, evaluated at $y_k^{s,t} = y_l^{s,t} = 0$, are negative.
This is the case whenever $\tau_s < \tau_t$:
$$J\Big|_{y_k^{s,t} = y_l^{s,t} = 0} = \begin{pmatrix}
\frac{\|x_k\|^2}{4}\big(\frac{1}{\tau_t} - \frac{1}{\tau_s}\big) & \frac{x_k^\top x_l}{4}\big(\frac{1}{\tau_t} - \frac{1}{\tau_s}\big) & 0 & 0 \\
\frac{x_k^\top x_l}{4}\big(\frac{1}{\tau_t} - \frac{1}{\tau_s}\big) & \frac{\|x_l\|^2}{4}\big(\frac{1}{\tau_t} - \frac{1}{\tau_s}\big) & 0 & 0 \\
\alpha & 0 & -\alpha & 0 \\
0 & \alpha & 0 & -\alpha
\end{pmatrix},$$
$$\text{eigenvalues:} \quad \lambda_1 = \lambda_2 = -\alpha < 0, \qquad \lambda_{3,4} = \frac{1}{8}\Big(\frac{1}{\tau_t} - \frac{1}{\tau_s}\Big)\Big( \|x_k\|^2 + \|x_l\|^2 \pm \sqrt{\|x_k\|^4 + \|x_l\|^4 - 2\|x_k\|^2\|x_l\|^2 + 4(x_k^\top x_l)^2} \Big). \tag{15}$$
The square root is non-negative and bounded from above by $\|x_k\|^2 + \|x_l\|^2$ (with equality if $x_k = \pm x_l$), so both $\lambda_3$ and $\lambda_4$ carry the sign of $\frac{1}{\tau_t} - \frac{1}{\tau_s}$ and are negative exactly when $\tau_s < \tau_t$. To sum up, training with stop gradient and $\tau_s > \tau_t$ avoids a collapse of the two-point model to the trivial representation $y_k^{s,t} = y_l^{s,t} = 0$ since the fixed point is not stable in this parameter regime.

**Learning dynamics without stop gradient.** Without stop gradient, we set $w^t = w^s \equiv w$, which leads to an additional term in the gradient:
$$\dot{w} = -\nabla_w L = \frac{1}{\tau_s} \sum_{i=1}^{N} \Big[ \sigma_t(x_i^\top w)\,\sigma_s(-x_i^\top w) - \sigma_t(-x_i^\top w)\,\sigma_s(x_i^\top w) \Big] x_i + \frac{1}{\tau_t} \sum_{i=1}^{N} \sigma_t(x_i^\top w)\,\sigma_t(-x_i^\top w)\, \underbrace{\Big[ \log \sigma_s(x_i^\top w) - \log \sigma_s(-x_i^\top w) \Big]}_{= \,\log\big((1+e^{y_i/\tau_s})/(1+e^{-y_i/\tau_s})\big)\, =\, y_i/\tau_s}\, x_i. \tag{16}$$
As before, we focus on the evolution of the two components $y_k = w^\top x_k$ and $y_l = w^\top x_l$:
$$\begin{aligned}
\dot{y}_k &= \|x_k\|^2 \Big[ \tfrac{1}{\tau_s}\big( \sigma_t(y_k)\sigma_s(-y_k) - \sigma_t(-y_k)\sigma_s(y_k) \big) + \tfrac{1}{\tau_s \tau_t}\, \sigma_t(y_k)\sigma_t(-y_k)\, y_k \Big] \\
&\quad + (x_k^\top x_l) \Big[ \tfrac{1}{\tau_s}\big( \sigma_t(y_l)\sigma_s(-y_l) - \sigma_t(-y_l)\sigma_s(y_l) \big) + \tfrac{1}{\tau_s \tau_t}\, \sigma_t(y_l)\sigma_t(-y_l)\, y_l \Big], \\
\dot{y}_l &= \|x_l\|^2 \Big[ \tfrac{1}{\tau_s}\big( \sigma_t(y_l)\sigma_s(-y_l) - \sigma_t(-y_l)\sigma_s(y_l) \big) + \tfrac{1}{\tau_s \tau_t}\, \sigma_t(y_l)\sigma_t(-y_l)\, y_l \Big] \\
&\quad + (x_k^\top x_l) \Big[ \tfrac{1}{\tau_s}\big( \sigma_t(y_k)\sigma_s(-y_k) - \sigma_t(-y_k)\sigma_s(y_k) \big) + \tfrac{1}{\tau_s \tau_t}\, \sigma_t(y_k)\sigma_t(-y_k)\, y_k \Big].
\end{aligned} \tag{17}$$
There is a fixed point at $y_k = y_l = 0$ where each bracket in equation 17 vanishes individually,
$$\Big[ \tfrac{1}{\tau_s}\big( \sigma_t(y_{k,l})\sigma_s(-y_{k,l}) - \sigma_t(-y_{k,l})\sigma_s(y_{k,l}) \big) + \tfrac{1}{\tau_s \tau_t}\, \sigma_t(y_{k,l})\sigma_t(-y_{k,l})\, y_{k,l} \Big]\Big|_{y_{k,l}=0} = 0. \tag{18}$$
The Jacobian of the ODE system in equation 17 and its eigenvalues evaluated at the fixed point are given by
$$J\Big|_{y_k = y_l = 0} = \frac{1}{4\tau_s}\Big(\frac{2}{\tau_t} - \frac{1}{\tau_s}\Big) \begin{pmatrix} \|x_k\|^2 & x_k^\top x_l \\ x_k^\top x_l & \|x_l\|^2 \end{pmatrix}, \qquad \lambda_{1,2} = \frac{1}{8\tau_s}\Big(\frac{2}{\tau_t} - \frac{1}{\tau_s}\Big)\Big( \|x_k\|^2 + \|x_l\|^2 \pm \sqrt{\|x_k\|^4 + \|x_l\|^4 - 2\|x_k\|^2\|x_l\|^2 + 4(x_k^\top x_l)^2} \Big). \tag{19}$$
Hence the fixed point is unstable when $\tau_s > \tau_t/2$, and thus the model without stop gradient does not collapse onto $y_k = y_l = 0$ in this regime.

A.3 SIMULATION OF THE TWO-POINT MODEL

For visualization purposes in the main paper, we set $w^s = w^t = [0.5, 0.5]^\top$ and train the model using instant gradient updates on the dataset with points $x_1 = [1, 0]$ and $x_2 = [0, -1]$ using SGD with learning rate 0.1 and momentum 0.9. We varied student and teacher temperatures on a log scale with 250 points from $10^{-3}$ to $10$. Qualitatively similar results can be obtained without momentum training at higher learning rates (most likely due to the implicit learning rate scaling introduced by the momentum term).
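The simulation described above is straightforward to reproduce. The following NumPy/SciPy sketch is our own re-implementation of the stop-gradient variant: the dataset, initialization, learning rate, and momentum follow the description above, while the teacher rate `alpha`, the number of steps, and all function names are illustrative choices rather than the exact script used for the figures.

```python
import numpy as np
from scipy.special import expit


def sigma(z, tau):
    """Temperature-scaled sigmoid from equation 6."""
    return expit(z / tau)


def student_grad(w_s, w_t, xs, tau_s, tau_t):
    """Gradient of the two-point loss w.r.t. the student weights (teacher detached)."""
    g = np.zeros_like(w_s)
    for x in xs:
        y_s, y_t = x @ w_s, x @ w_t
        # the bracket below equals -d(ell)/d(w_s) along x (cf. equation 10),
        # so subtracting it accumulates the gradient of the loss
        g -= (sigma(y_t, tau_t) * sigma(-y_s, tau_s)
              - sigma(-y_t, tau_t) * sigma(y_s, tau_s)) / tau_s * x
    return g


def run_two_point(tau_s, tau_t, lr=0.1, momentum=0.9, alpha=0.1, steps=5000):
    """Simulate the stop-gradient dynamics with SGD + momentum; return the final margins."""
    xs = [np.array([1.0, 0.0]), np.array([0.0, -1.0])]
    w_s = np.array([0.5, 0.5])
    w_t = w_s.copy()
    v = np.zeros_like(w_s)
    for _ in range(steps):
        g = student_grad(w_s, w_t, xs, tau_s, tau_t)
        v = momentum * v - lr * g         # heavy-ball update for the student
        w_s = w_s + v
        w_t = w_t + alpha * (w_s - w_t)   # discretized teacher dynamics (equation 7)
    return [x @ w_s for x in xs]          # margins y_1, y_2; a collapse drives both to ~0


# the first setting lies in the collapse region predicted by equation 15 (tau_s < tau_t),
# the second one does not (tau_s > tau_t):
print(run_two_point(tau_s=0.1, tau_t=1.0))
print(run_two_point(tau_s=1.0, tau_t=0.1))
```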
Note that the temperature scales for observing the collapse effect depend on the learning rate and the exact training strategy: lower learning rates can empirically prevent the model from collapsing and shift the convergence region. The result in Figure 2 will hence depend on the exact choice of learning rate (which is currently not considered in our continuous-time evolution theory), while the predicted region without collapse is robust to details of the optimization. To visualize the impact of different hyperparameters, we show variants of the two-point model with different learning rates using gradient descent with (Figure 3) and without momentum (Figure 4), and with different start conditions (Figure 5), which all influence the regions where the model degrades, but not the stable regions predicted by our theory.

-----

[Figures 3–5 show error heatmaps over the student and teacher temperatures on a log scale, one panel per learning rate (10, 0.1, 0.01, 0.001) and for PL, with the reference lines τt = τs and τt = 2τs overlaid; the color scale encodes the error from 0% to 100%.]

Figure 3: Entropy minimization (top). Training the two-point model with momentum 0.9 and different learning rates with initialization $w^s = w^t = [0.5, 0.5]^\top$.

Figure 4: Training the two-point model without momentum and different learning rates with initialization $w^s = w^t = [0.5, 0.5]^\top$. Note that especially for lower learning rates, longer training would increase the size of the collapsed region.

Figure 5: Training the two-point model with momentum 0.9 and different learning rates with initialization $w^s = w^t = [0.6, 0.3]^\top$.

-----

B ADDITIONAL INFORMATION ON USED MODELS

B.1 DETAILS ON ALL HYPERPARAMETERS WE TESTED FOR DIFFERENT MODELS

For all models except EfficientNet-L2, we adapt the batch norm statistics to the test domains following Schneider et al. (2020). We do not expect significant gains from combining EfficientNet-L2 with batch norm adaptation: as demonstrated in Schneider et al. (2020), models trained with large amounts of weakly labeled data do not seem to benefit from batch norm adaptation.

**ResNet50 models** We use a vanilla ResNet50 model and compare soft- and hard-labeling against entropy minimization and robust pseudo-labeling. To find optimal hyperparameters for all methods, we perform an extensive evaluation and test (i) three different adaptation mechanisms, (ii) the learning rates $1.0 \times 10^{-4}$, $1.0 \times 10^{-3}$, $1.0 \times 10^{-2}$ and $5.0 \times 10^{-2}$, (iii) the number of training epochs, and (iv) updating the teacher after each epoch or after each iteration. For all experiments, we use a batch size of 128. The hyperparameter search is performed on IN-C dev. We then use the optimal hyperparameters to evaluate the methods on the IN-C test set.

**ResNeXt101 models** The ResNeXt101 model is considerably larger than the ResNet50 model, and we therefore limit the number of ablation studies we perform for this architecture. Besides a baseline, we include a state-of-the-art robust version trained with DeepAugment+Augmix (DAug+AM, Hendrycks et al., 2020a) and a version that was trained on 3.5 billion weakly labeled images (IG-3.5B, Mahajan et al., 2018).
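Since entropy minimization (ENT) and robust pseudo-labeling (RPL) recur throughout the remainder of this appendix, we also give minimal PyTorch versions of the two objectives as we understand and use them in this comparison. The function names are ours, and the teacher forward pass that produces the hard pseudo-labels is assumed to happen outside these functions.

```python
import torch
import torch.nn.functional as F


def entropy_minimization_loss(logits):
    """ENT: mean entropy of the softmax predictions over the batch."""
    log_probs = F.log_softmax(logits, dim=1)
    probs = log_probs.exp()
    return -(probs * log_probs).sum(dim=1).mean()


def robust_pseudo_labeling_loss(logits, pseudo_labels, q=0.8):
    """RPL: generalized cross-entropy (1 - p_y^q) / q on hard pseudo-labels
    (Zhang & Sabuncu, 2018), which is less sensitive to incorrect pseudo-labels
    than the standard cross-entropy."""
    probs = F.softmax(logits, dim=1)
    p_y = probs.gather(1, pseudo_labels.view(-1, 1)).squeeze(1)
    return ((1.0 - p_y.pow(q)) / q).mean()


# one adaptation step with hard labels from a (possibly frozen) teacher:
# with torch.no_grad():
#     pseudo_labels = teacher(images).argmax(dim=1)
# loss = robust_pseudo_labeling_loss(student(images), pseudo_labels, q=0.8)
```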
We only test the two leading methods on the ResNeXt101 models (ENT and RPL). We vary the learning rate in same interval as for the ResNet50 model but scale it down linearly to account for the smaller batch size of 32. We only train the affine batch normalization parameters because adapting only these parameters leads to the best results on ResNet50 and is much more resource efficient than adapting all model parameters. Again, the hyperparameter search is performed only on the development corruptions of IN-C. We then use the optimal hyperparameters to evaluate the methods on the IN-C test set. **EfficientNet-L2 models** The current state of the art on IN, IN-C, IN-R and IN-A is an EfficientNet-L2 trained on 300 million images from JFT-300M (Chollet, 2017; Hinton et al., 2014) using a noisy student-teacher protocol (Xie et al., 2020a). We adapt this model for only one epoch due to resource constraints. During the hyperparameter search, we only evaluate three corruptions on the IN-C development set[2] and test the learning rates 4.6 × 10[−][2], 4.6 × 10[−][3], 4.6 × 10[−][4] and 4.6 × 10[−][5]. We use the optimal hyperparameters to evaluate ENT and RPL on the full IN-C test set (with all severity levels). **UDA-SS models** We trained the models using the scripts from the official code base at github. com/yueatsprograms/uda release. We used the provided scripts for the cases: (a) source: CIFAR10, target: STL10 and (b) source: MNIST, target: MNIST-M. For the case (c) source: CIFAR10, target: CIFAR10-C, we used the hyperparameters from case (a) since this case seemed to be the closest match to the new setting. We think that the baseline performance of the UDA-SS models can be further improved with hyperparameter tuning. **DANN models** To train models with the DANN-method, we used the PyTorch implementation [of this paper at https://github.com/fungtion/DANN py3. The code base only provides scripts and](https://github.com/fungtion/DANN_py3) hyperparameters for the case (b) source: MNIST, target: MNIST-M. For the cases (a) and (c), we used the same optimizer and trained the model for 100 epochs. We think that the baseline performance of the DANN models can be further improved with hyperparameter tuning. **Preprocessing** For IN, IN-R, IN-A and IN-D, we resize all images to 256 × 256 px and take the center 224 × 224 px crop. The IN-C images are already rescaled and cropped. We center and re-scale the color values with µRGB = [0.485, 0.456, 0.406] and σRGB = [0.229, 0.224, 0.225]. For the EfficientNet-L2, we follow the procedure in Xie et al. (2020a) and rescale all inputs to a resolution of 507 × 507 px and then center-crop them to 475 × 475 px. 2We compare the results of computing the dev set on the 1, 3 and 5 severities versus the 1, 2, 3, 4 and 5 severities on our ResNeXt101 model in the Supplementary material. ----- B.2 FULL LIST OF USED MODELS **ImageNet scale models** ImageNet trained models (ResNet50, DenseNet161, ResNeXt) are taken directly from torchvision (Marcel & Rodriguez, 2010). The model variants trained with [DeepAugment and AugMix augmentations (Hendrycks et al., 2020b;a) are taken from https:](https://github.com/hendrycks/imagenet-r) [//github.com/hendrycks/imagenet-r. The weakly-supervised ResNeXt101 model is taken from the](https://github.com/hendrycks/imagenet-r) PyTorch Hub. For EfficientNet (Tan & Le, 2019), we use the PyTorch re-implementation available [at https://github.com/rwightman/gen-efficientnet-pytorch. 
This is a verified re-implementation of](https://github.com/rwightman/gen-efficientnet-pytorch) the original work by Xie et al. (2020a). We verify the performance on ImageNet, yielding a 88.23% top-1 accuracy and 98.546% top-5 accuracy which is within 0.2% points of the originally reported result (Xie et al., 2020a). On ImageNet-C, our reproduced baseline achieves 28.9% mCE vs. 28.3% mCE originally reported by Xie et al. (2020a). As noted in the re-implementation, this offset is possible due to minor differences in the pre-processing. It is possible that our adaptation results would improve further when applied on the original codebase by Xie et al.. **Small scale models** We train the UDA-SS models using the original code base at github.com/ yueatsprograms/uda release, with the hyperparameters given in the provided bash scripts. For our DANN experiments, we use the PyTorch implementation at github.com/fungtion/DANN py3. We use the hyperparameters in the provided bash scripts. The following Table 11 contains all models we evaluated on various datasets with references and links to the corresponding source code. Table 11: Model checkpoints used for our experiments. Model Source WideResNet(28,10) (Croce et al., 2020) [https://github.com/RobustBench/robustbench/tree/master/robustbench](https://github.com/RobustBench/robustbench/tree/master/robustbench) WideResNet(40,2)+AugMix (Croce et al., 2020) [https://github.com/RobustBench/robustbench/tree/master/robustbench](https://github.com/RobustBench/robustbench/tree/master/robustbench) ResNet50 (He et al., 2016b) [https://github.com/pytorch/vision/tree/master/torchvision/models](https://github.com/pytorch/vision/tree/master/torchvision/models) ResNeXt101, 32×8d (He et al., 2016b) [https://github.com/pytorch/vision/tree/master/torchvision/models](https://github.com/pytorch/vision/tree/master/torchvision/models) DenseNet (Huang et al., 2017) [https://github.com/pytorch/vision/tree/master/torchvision/models](https://github.com/pytorch/vision/tree/master/torchvision/models) ResNeXt101, 32×8d (Xie et al., 2017) [https://pytorch.org/hub/facebookresearch WSL-Images resnext/](https://pytorch.org/hub/facebookresearch_WSL-Images_resnext/) ResNet50+DeepAugment+AugMix (Hendrycks et al., 2020a) [https://github.com/hendrycks/imagenet-r](https://github.com/hendrycks/imagenet-r) ResNext101 (Hendrycks et al., 2020a) [https://github.com/hendrycks/imagenet-r](https://github.com/hendrycks/imagenet-r) ResNext101 32×8d IG-3.5B (Mahajan et al., 2018) [https://github.com/facebookresearch/WSL-Images/blob/master/hubconf.py](https://github.com/facebookresearch/WSL-Images/blob/master/hubconf.py) Noisy Student EfficientNet-L2 (Xie et al., 2020a) [https://github.com/rwightman/gen-efficientnet-pytorch](https://github.com/rwightman/gen-efficientnet-pytorch) ViT-S/16 (Caron et al., 2021) [https://github.com/facebookresearch/dino](https://github.com/facebookresearch/dino) ----- C DETAILED AND ADDITIONAL RESULTS ON IN-C C.1 DEFINITION OF THE MEAN CORRUPTION ERROR (MCE) The established performance metric on IN-C is the mean Corruption Error (mCE), which is obtained by normalizing the model’s top-1 errors with the top-1 errors of AlexNet across the C=15 test corruptions and S=5 severities: _C_ _S_ _s=1_ [err]c,s[model] mCE(model) = C[1] _S_ _._ (20) _c=1_ Ps=1 [err]c,s[AlexNet] X The AlexNet errors used for normalization are shown in Table 12.P Category Corruption top1 error Gaussian Noise 0.886428 Noise Shot Noise 0.894468 Impulse Noise 0.922640 Defocus Blur 0.819880 Glass Blur 
0.826268 Motion Blur 0.785948 Zoom Blur 0.798360 Snow 0.866816 Frost 0.826572 Fog 0.819324 Brightness 0.564592 Contrast 0.853204 Elastic Transform 0.646056 Pixelate 0.717840 JPEG Compression 0.606500 Blur Weather Digital Hold-out Noise Speckle Noise 0.845388 Hold-out Digital Saturate 0.658248 Hold-out Blur Gaussian Blur 0.787108 Hold-out Weather Spatter 0.717512 Table 12: AlexNet top1 errors on ImageNet-C C.2 DETAILED RESULTS FOR TUNING EPOCHS AND LEARNING RATES We tune the learning rate for all models and the number of training epochs for all models except the EfficientNet-L2. In this section, we present detailed results for tuning these hyperparameters for all considered models. The best hyperparameters that we found in this analysis, are summarized in Table 17. Table 13: mCE in % on the IN-C dev set for ENT and RPL for different numbers of training epochs when adapting the affine batch norm parameters of a ResNet50 model. criterion ENT RPL lr 10[−][4] 10[−][3] 10[−][2] 10[−][4] 10[−][3] 10[−][2] epoch 0 60.2 60.2 60.2 60.2 60.2 60.2 1 54.3 **50.0** 72.5 57.4 51.1 52.5 2 52.4 50.9 96.5 55.8 49.6 57.4 3 51.5 51.0 112.9 54.6 49.2 64.2 4 51.0 52.4 124.1 53.7 49.0 71.0 5 50.7 53.5 131.2 52.9 **48.9** 76.3 6 50.7 53.5 131.2 52.9 48.9 76.3 Table 14: mCE (↘) in % on the IN-C dev set for different learning rates for EfficientNetL2. We favor q = 0.8 over q = 0.7 due to slightly improved robustness to changes in the learning rate in the worst case error setting. lr (4.6 ×) base 10[−][3] 10[−][4] 10[−][5] 10[−][6] ENT 25.5 87.8 25.3 **22.2** 24.1 RPLq=0.7 25.5 60.3 **21.3** 23.3 n/a RPLq=0.8 25.5 58.2 **21.4** 23.4 n/a ----- Table 17: The best hyperparameters for all models that we found on IN-C. For all models, we fine-tune only the affine batch normalization parameters and use q = 0.8 for RPL. The small batchsize for the EfficientNet model is due to hardware limitations. number of Model Method Learning rate batch size epochs vanilla ResNet50 ENT 1 × 10[−][3] 128 vanilla ResNet50 RPL 1 × 10[−][3] 128 vanilla ResNeXt101 ENT 2.5 × 10[−][4] 128 vanilla ResNeXt101 RPL 2.5 × 10[−][4] 128 IG-3.5B ResNeXt101 ENT 2.5 × 10[−][4] 128 IG-3.5B ResNeXt101 RPL 2.5 × 10[−][3] 128 DAug+AM ResNeXt101 ENT 2.5 × 10[−][4] 128 DAug+AM ResNeXt101 RPL 2.5 × 10[−][4] 128 EfficientNet-L2 ENT 4.6 × 10[−][5] 8 EfficientNet-L2 RPL 4.6 × 10[−][4] 8 Table 15: mCE in % on IN-C dev for entropy minimization for different learning rates and training epochs for ResNeXt101. (div.=diverged) ENT Baseline IG-3.5B DAug+AM lr 2.5 × 1e-4 1e-3 5e-3 1e-4 1e-3 5e-3 1e-4 1e-3 5e-3 epoch Table 16: mCE in % on IN-C dev for robust pseudolabeling for different learning rates and training epochs for ResNeXt101. (div.=diverged) RPL Baseline IG-3.5B DAug+AM lr 2.5× 1e-4 1e-3 5e-3 1e-4 1e-3 5e-3 1e-4 1e-3 5e-3 epoch BASE 53.6 53.6 53.6 47.4 47.4 47.4 37.4 37.4 37.4 1 **43.0** 92.2 div. 40.9 40.4 58.6 **35.4** 46.4 div. 2 44.8 118.4 div. 39.8 41.5 69.5 35.5 90.8 div. 3 45.4 131.9 div. 39.3 42.6 76.1 35.5 122.5 div. 4 46.7 div. div. **39.1** 44.2 84.3 35.6 133.8 div. BASE 53.6 53.6 53.6 47.4 47.4 47.4 37.4 37.4 37.4 1 43.4 51.3 div. 45.0 39.9 43.6 35.3 35.1 79.1 2 42.3 63.2 div. 43.4 **39.3** 48.2 34.9 35.6 121.2 3 42.0 72.6 div. 42.4 39.4 52.9 34.7 40.1 133.5 4 **42.0** 72.6 div. 42.4 39.4 52.9 **34.7** 40.1 133.5 C.3 DETAILED RESULTS FOR ALL IN-C CORRUPTIONS We outline detailed results for all corruptions and models in Table 18. Performance across the severities in the dataset is depicted in Figure 6. 
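As a reference for how the numbers in these tables are aggregated, the mCE of equation 20 can be computed with a few lines of plain Python. The sketch below assumes the per-corruption, per-severity top-1 errors of the evaluated model and of AlexNet are already available as nested dictionaries (the AlexNet values follow Table 12 and the usual per-severity reference numbers); the constant and function names are ours.

```python
# Minimal sketch of equation 20: err["gaussian_noise"][3] holds the top-1 error
# (in [0, 1]) for the given corruption at severity 3, for the model and for AlexNet.

TEST_CORRUPTIONS = [
    "gaussian_noise", "shot_noise", "impulse_noise", "defocus_blur", "glass_blur",
    "motion_blur", "zoom_blur", "snow", "frost", "fog", "brightness", "contrast",
    "elastic_transform", "pixelate", "jpeg_compression",
]
SEVERITIES = (1, 2, 3, 4, 5)


def corruption_error(model_err, alexnet_err, corruption):
    """CE_c: model errors summed over severities, normalized by the AlexNet errors."""
    num = sum(model_err[corruption][s] for s in SEVERITIES)
    den = sum(alexnet_err[corruption][s] for s in SEVERITIES)
    return num / den


def mean_corruption_error(model_err, alexnet_err):
    """mCE: average of the normalized corruption errors over the 15 test corruptions."""
    ces = [corruption_error(model_err, alexnet_err, c) for c in TEST_CORRUPTIONS]
    return sum(ces) / len(ces)
```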
All detailed results presented here are obtained by following the model selection protocol outlined in the main text. RN50 2 3 Severity RNx101 3 Severity RNx101 IG-3.5B 2 3 4 Severity RNx101 DeepAug+Augmix 1 2 3 4 5 Severity Noisy Student L2 2 3 4 ENT RPL Base Severity 80 60 40 20 Figure 6: Severity-wise mean corruption error (normalized using the average AlexNet baseline error for each corruption) for ResNet50 (RN50), ResNext101 (RNx101) variants and the Noisy Student L2 model. Especially for more robust models (DeepAugment+Augmix and Noisy Student L2), most gains are obtained across higher severities 4 and 5. For weaker models, the baseline variant (Base) is additionally substantially improved for smaller corruptions. ----- Table 18: Detailed results for each corruption along with mean corruption error (mCE) as reported in Table 2 in the main paper. We show (unnormalized) top-1 error rate averaged across 15 test corruptions along with the mean corruption error (mCE: which is normalized). Hyperparameter selection for both ENT and RPL was carried out on the dev corruptions as outlined in the main text. Mismatch in baseline mCE for EfficientNetL2 can be most likely attributed to pre-processing differences between the original tensorflow implementation Xie et al. (2020a) and the PyTorch reimplementation we employ. We start with slightly weaker baselines for ResNet50 and ResNext101 than Schneider et al. (2020): ResNet50 and ResNext101 results are slightly worse than previously reported results (typically 0.1% points) due to the smaller batch size of 128 and 32. Smaller batch sizes impact the quality of re-estimated batch norm statistics when computation is performed on the fly Schneider et al. (2020), which is of no concern here due to the large gains obtained by pseudo-labeling. 
gaussshot impulsedefocusglassmotionzoomsnow frost fog brightcontrastelasticpixelatejpeg **mCE** ResNet50 Baseline (Schneider et al., 2020) 62.2 Baseline (ours) 57.2 59.5 60.0 61.4 62.3 51.3 49.5 54.6 54.1 39.3 29.1 46.7 41.4 38.2 41.8 62.8 ENT 45.5 45.5 46.8 48.4 48.7 40.0 40.3 42.0 46.6 33.2 28.1 42.4 35.2 32.2 35.1 51.6 RPL 44.2 44.4 45.5 47.0 47.4 38.8 39.2 40.7 46.2 32.5 27.7 42.7 34.6 31.6 34.4 50.5 ResNeXt101 Baseline Baseline (Schneider et al., 2020) 56.7 Baseline (ours) 52.8 54.1 54.0 55.4 56.8 46.7 46.6 48.5 49.4 36.6 25.4 42.8 37.8 32.5 36.7 56.8 ENT 40.5 39.5 41.4 41.6 43.0 34.1 34.5 35.0 39.4 28.5 24.0 33.8 30.3 27.2 30.5 44.3 RPL 39.4 38.9 39.8 40.3 41.0 33.4 33.8 34.6 38.7 28.0 23.7 31.4 29.8 26.8 30.0 43.2 ResNeXt101 IG-3.5B Baseline (Schneider et al., 2020) 51.6 Baseline (ours) 50.7 51.5 53.1 54.2 55.5 45.5 44.7 41.7 42.0 28.1 20.1 33.8 35.4 27.8 33.9 51.8 ENT 38.6 38.3 40.4 41.4 41.5 33.8 33.6 32.2 34.6 24.1 19.7 26.3 27.6 24.2 27.9 40.8 RPL 39.1 39.2 40.8 42.1 42.4 33.7 33.5 31.8 34.7 23.9 19.6 26.1 27.5 23.8 27.5 40.9 ResNeXt101 DeepAug+Augmix Baseline (Schneider et al., 2020) 38.0 Baseline (ours) 30.0 30.0 30.2 32.9 35.5 28.9 31.9 33.3 32.8 29.5 22.6 28.4 31.2 23.0 26.5 38.1 ENT 28.7 28.5 29.0 29.8 30.9 26.9 28.0 29.3 30.5 26.2 23.2 26.3 28.5 23.7 26.0 35.5 RPL 28.1 27.8 28.3 29.1 30.1 26.3 27.4 28.8 29.8 25.9 22.7 25.6 27.9 23.2 25.4 34.8 Noisy Student L2 Baseline (Xie et al., 2020a) 28.3 Baseline (ours) 21.6 22.0 20.5 23.9 40.5 19.8 23.2 22.8 26.9 21.0 15.2 21.2 24.8 17.9 18.6 28.9 ENT 18.5 18.7 17.4 18.8 23.4 16.9 18.8 17.1 19.6 16.8 14.1 16.6 19.6 15.8 16.5 23.0 RPL 17.8 18.0 17.0 18.1 21.4 16.4 17.9 16.4 18.7 15.7 13.6 15.6 19.2 15.0 15.6 22.0 C.4 DETAILED RESULTS FOR THE CIFAR10-C AND UDA ADAPTATION Table 19: Detailed results for each corruption along with mean error on CIFAR10-C as reported in Table 2 in the main paper. WRN-28-10 vanilla Baseline 53.0 41.2 44.7 18.5 49.0 22.3 24.4 18.1 25.0 11.2 6.7 17.4 16.2 28.0 22.4 26.5 BN adapt 20.8 17.6 22.7 8.1 28.4 10.9 9.2 14.2 13.0 8.7 6.8 8.5 13.5 12.1 21.0 14.4 ENT 18.5 15.9 20.6 7.8 25.5 10.6 8.5 13.1 12.3 8.3 6.9 8.0 12.6 11.1 18.9 13.3 RPL 19.6 16.7 21.9 8.1 27.1 10.9 8.9 13.9 13.0 8.7 6.9 8.4 13.2 11.7 20.1 13.9 WRN-40-2 AM Baseline 19.1 14.0 13.3 6.3 17.1 7.9 7.0 10.4 10.6 8.5 5.9 9.7 9.2 16.8 11.9 11.2 BN adapt 14.1 11.9 13.9 7.2 17.6 8.7 7.9 10.8 10.6 9.0 6.8 9.0 10.9 10.1 14.0 10.8 TENT 10.8 9.1 10.9 6.0 13.4 7.2 6.3 8.4 7.8 7.1 5.7 7.1 9.2 7.4 11.2 8.5 RPL 12.4 10.5 12.4 6.5 15.6 7.8 6.9 9.5 9.1 8.2 6.2 8.3 9.9 8.8 12.8 9.7 WRN-26-16 UDA-SS Baseline 26.0 24.7 19.3 22.4 56.2 32.4 32.1 31.7 31.2 26.6 15.8 20.4 26.3 21.5 28.9 27.7 BN adapt 20.5 19.0 15.6 13.5 43.1 19.4 18.3 23.1 21.2 16.2 12.8 14.1 20.9 16.7 23.4 19.9 ENT 16.9 16.7 12.3 11.3 37.6 15.6 14.8 18.3 18.2 13.4 10.8 11.9 17.9 14.4 20.9 16.7 RPL 18.1 17.1 13.2 11.9 41.5 17.3 16.1 20.4 19.1 14.5 11.8 12.7 18.8 18.1 22.6 18.2 ----- Table 20: Detailed results for the UDA methods reported in Table 2 of the main paper. Baseline BN adapt RPL ENT UDA CIFAR10→STL10, top1 error on target [%](↘) WRN-26-16 UDA-SS 28.7 24.6 22.9 21.8 WRN-26-16 DANN 25.0 25.0 24.0 23.9 UDA MNIST→MNIST-M, top1 error on target [%](↘) WRN-26-16 UDA-SS 4.8 3.9 2.4 2.0 WRN-26-2 DANN 11.4 6.2 5.2 5.1 C.5 ABLATION OVER THE HYPERPARAMETER q FOR RPL For RPL, we must choose the hyperparameter q. We performed an ablation study over q and show results in Table 21, demonstrating that RPL is robust to the choice of q, with slight preference to higher values. 
Note: In the initial parameter sweep for this paper, we only compared q = 0.7 and _q = 0.8. Given the result in Table 21, it could be interesting to re-run the models in Table 1 of the_ main paper with q = 0.9, which could yield another (small) improvement in mCE. Table 21: ImageNet-C dev set mCE in %, vanilla ResNet50, batch size 96. We report the best score across a maximum of six adaptation epochs. q 0.5 0.6 0.7 0.8 0.9 mCE (dev) 49.5 49.3 49.2 49.2 49.1 C.6 SELF-TRAINING OUTPERFORMS CONTRASTIVE TEST-TIME TRAINING (SUN ET AL., 2019B) Sun et al. (2019b) use a ResNet18 for their experiments on ImageNet and only evaluate their method on severity 5 of IN-C. To enable a fair comparison, we trained a ResNet18 with both hard labeling and RPL and compare the efficacy of both methods to Test-Time Training in Table 22. For both hard labeling and RPL, we use the hyperparameters we found for the vanilla ResNet50 model and thus, we expect even better results for hyperparameters tuned on the vanilla ResNet18 model and following our general hyperparameter search protocol. While all methods (self-learning and TTT) improve the performance over a simple vanilla ResNet18, we note that even the very simple baseline using hard labeling already outperfoms Test-Time Training; further gains are possible with RPL. The result highlights the importance of simple baselines (like self-learning) when proposing new domain adaptation schemes. It is likely that many established DA techniques more complex than the basic self-learning techniques considered in this work will even further improve over TTT and other adaptation approaches developed exclusively in robustness settings. Table 22: Comparison of hard-pseudo labeling and robust pseudo-labeling to Test-Time Training Sun et al. (2019b): Top-1 error for a ResNet18 and severity 5 for all corruptions. Simple hard pseudo-labeling already outperforms TTT, robust pseudo labeling over multiple epochs yields additional gains. gauss shot impulsedefocusglass motionzoom snow frost fog bright contrastelastic pixelatejpeg **Avg** vanilla ResNet18 98.8 98.2 99.0 88.6 91.3 88.8 82.4 89.1 83.5 85.7 48.7 96.6 83.2 76.9 70.4 85.4 Test-Time Training 73.7 71.4 73.1 76.3 93.4 71.3 66.6 64.4 81.3 52.4 41.7 64.7 55.7 52.2 55.7 66.3 hard PL, (1 epoch) 73.2 70.8 73.6 76.5 75.6 63.9 56.1 59.0 65.9 48.4 39.7 85.2 50.4 47.0 51.5 62.5 RPL (4 epochs) **71.3 68.3 71.7 76.2 75.6 61.5 54.4 56.9 67.1 47.3 39.3 93.2 48.9 45.7 50.4 61.9** ----- C.7 EFFECT OF BATCH SIZE AND LINEAR LEARNING RATE SCALING How is self-learning performance affected by batch size constraints? We compare the effect of different batch sizes and linear learning rate scaling. In general, we found that affine adaptation experiments on ResNet50 scale can be run with batch size 128 on a Nvidia V100 GPU (16GB), while only batch size 96 experiments are possible on RTX 2080 GPUs. The results in Table 23 show that for a ResNet50 model, higher batch size yields a generally better performance. Table 23: ImageNet-C dev set mCE for various batch sizes with linear learning rate scaling. All results are computed for a vanilla ResNet50 model using RPL with q = 0.8, reporting the best score across a maximium of six adaptation epochs. 
batch size 16 32 64 80 96 128 learning rate (×10[−][3]) 0.125 0.250 0.500 0.625 0.750 1 dev mCE 53.8 51.0 49.7 49.3 49.2 48.9 C.8 PERFORMANCE OVER DIFFERENT SEEDS IN A RESNET50 ON IMAGENET-C To limit the amount of compute, we ran RPL and ENT for our vanilla ResNet50 model three times with the optimal hyperparameters. The averaged results, displayed as “mean (unbiased std)” are: Table 24: ImageNet-C performance for three seeds on a ResNet50 for ENT and RPL. ResNet50 + self-learning mCE on IN-C dev [%] mCE on IN-C test [%] ENT 50.0 (0.04) 51.6 (0.04) RPL 48.9 (0.02) 50.5 (0.03) C.9 SELF-LEARNING AS CONTINUOUS TEST-TEST ADAPTATION We test our method on continuous test-time adaptation where the model adapts to a continuous stream of data from the same domain. In Fig. 7, we display the error of the Noisy Student L2 model while it is being adapted to ImageNet-C and ImageNet-R. The model performance improves as the model sees more data from the new domain. We differentiate continuous test-time adaptation from the online test-time adaptation setting (Zhang et al., 2021) where the model is adapted to each test sample individually, and reset after each test sample. (i) ImageNet-C Baseline ENT RPL 23.0 22.0 2 3 Samples [×10[4]] 28.9 26 24 22 20 18 28 26 24 22 |(ii) ImageNet-R|Col2|Col3| |---|---|---| |23. 19. 17.||| |||| 1 2 3 Samples [×10[4]] Figure 7: Evolution of error during online adaptation for EfficientNet-L2. ----- D DETAILED AND ADDITIONAL RESULTS ON IN-D D.1 EVALUATION PROTOCOL ON IN-D The domains in IN-D differ in terms of their difficulty for the studied models. Therefore, to calculate an aggregate score, we propose normalizing the error rates by the error achieved by AlexNet on the respective domains to calculate the mean error, following the approach in Hendrycks & Dietterich (2019) for IN-C. This way, we obtain the aggregate score mean Domain Error (mDE) by calculating the mean over different domains, DE[f]d [=] _Ed[f]_ _,_ mDE = [1] _Ed[AlexNet]_ _D_ where Ed[f] [is the top-1 error of a classifier][ f][ on domain][ d][.] _Ed[f]_ _[,]_ (21) _d=1_ X **Leave-one-out-cross-validation** For all IN-D results we report in this paper, we chose the hyperparameters on the IN-C dev set. We tried a different model selection scheme on IN-D as a control experiment with “Leave one out cross-validation” (L1outCV): with a round-robin procedure, we choose the hyperparameters for the test domain on all other domains. We select the same hyperparameters as when tuning on the “dev” set: For the ResNet50 model, we select over the number of training epochs (with a maximum of 7 training epochs) and search for the optimal learning rate in the set [0.01, 0.001, 0.0001]. For the EfficientNet-L2 model, we train only for one epoch as before and select the optimal learning rate in the set [4.6 × 10[−][3], 4.6 × 10[−][4], 4.6 × 10[−][5], 4.6 × 10[−][6]]. This model selection leads to worse results both for the ResNet50 and the EfficientNetL2 models, highlighting the robustness of our model selection process, see Table 25. Table 25: mDE in % on IN-D for different model selection strategies. model model selection L1outCV IN-C dev ResNet50 RPLq=0.8 81.3 76.1 ResNet50 ENT 82.4 77.3 EfficientNet-L2 ENT 69.2 66.8 EfficientNet-L2 RPLq=0.8 69.1 67.2 D.2 DETAILED RESULTS FOR ROBUST RESNET50 MODELS ON IN-D We show detailed results for all models on IN-D for vanilla evaluation (Table 26) BN adaptation (Table 27), RPLq=0.8 (Table 28) and ENT(Table 29). 
For RPLq=0.8 and ENT, we use the same hyperparameters that we chose on our IN-C ‘dev’ set. This means we train the models for 5 epochs with RPLq=0.8 and for one epoch with ENT. We evaluate the pre-trained and public checkpoints of SIN (Geirhos et al., 2019), ANT (Rusak et al., 2020), ANT+SIN (Rusak et al., 2020), AugMix (Hendrycks et al., 2020b), DeepAugment (Hendrycks et al., 2020a) and DeepAug+Augmix (Hendrycks et al., 2020a) in the following tables. Table 26: Top-1 error on IN-D in % as obtained by robust ResNet50 models. For reference, we also show the mCE on IN-C and the top-1 error on IN-R. See main test for model references. Model Clipart Infograph Painting Quickdraw Real Sketch mDE IN-C IN-R vanilla 76.0 89.6 65.1 99.2 40.1 82.0 88.2 76.7 63.9 SIN 71.3 88.6 62.6 97.5 40.6 77.0 85.6 69.3 58.5 ANT 73.4 88.9 63.3 99.2 39.9 80.8 86.9 62.4 61.0 ANT+SIN 68.4 88.6 60.6 95.5 40.8 70.3 83.1 60.7 53.7 AugMix 70.8 88.6 62.1 99.1 39.0 78.5 85.4 65.3 58.9 DeepAugment 72.0 88.8 61.4 98.9 39.4 78.5 85.6 60.4 57.8 DeepAug+Augmix 68.4 88.1 58.7 98.2 39.2 75.2 83.4 53.6 53.2 ----- Table 29: Top-1 error on IN-D in % as obtained by state-of-the-art robust ResNet50 models and ENT. See main text for references to the used models. Model Clipart Infograph Painting Quickdraw Real Sketch mDE vanilla 65.1 85.8 59.2 98.5 38.4 75.8 77.3 SIN 62.1 87.0 57.3 99.1 39.0 68.6 75.5 ANT 64.2 86.9 58.7 97.1 38.8 72.8 76.5 ANT+SIN 62.2 86.8 57.7 95.8 40.1 68.7 75.2 AugMix 60.2 84.6 55.8 97.6 36.8 72.0 74.4 DeepAugment 59.5 85.7 54.4 98.0 37.1 66.4 73.3 DeepAug+Augmix 58.4 84.3 54.7 98.5 38.1 63.6 72.7 Table 30: mDE on IN-D in % as obtained by robust ResNet50 models with a baseline evaluation, batch norm adaptation, RPLq=0.8 and ENT. See main text for model references. mDE on IN-D (↘) Model Baseline BN adapt RPLq=0.8 ENT vanilla 88.2 80.2 76.1 77.3 SIN 85.6 79.6 76.8 75.5 ANT 86.9 80.7 78.1 76.5 ANT+SIN **83.1** 77.8 76.1 75.2 AugMix 85.4 78.4 74.6 74.4 DeepAugment 85.6 78.8 74.8 73.3 DeepAugment+Augmix 83.4 **74.9** **72.6** **72.7** Table 27: Top1 error on IN-D in % as obtained by state-of-the-art robust ResNet50 models and batch norm adaptation, with a batch size of 128. See main text for model references. Model Clipart Infograph Painting Quickdraw Real Sketch mDE vanilla 70.2 88.2 63.5 97.8 41.1 78.3 80.2 SIN 67.3 89.7 62.2 97.2 44.0 75.2 79.6 ANT 69.2 89.4 63.0 97.5 42.9 79.5 80.7 ANT+SIN 64.9 88.2 60.0 96.8 42.6 73.0 77.8 AugMix 66.9 88.1 61.2 97.1 40.4 75.0 78.4 DeepAugment 66.6 89.7 60.0 97.2 42.5 75.1 78.8 DeepAug+Augmix 61.9 85.7 57.5 95.3 40.2 69.2 74.9 Table 28: Top-1 error on IN-D in % as obtained by state-of-the-art robust ResNet50 models and RPLq=0.8. See main text for model references. Model Clipart Infograph Painting Quickdraw Real Sketch mDE vanilla 63.6 85.1 57.8 99.8 37.3 73.0 76.1 SIN 60.8 86.4 56.0 99.0 37.8 67.0 76.8 ANT 63.4 86.3 57.7 99.2 37.7 71.0 78.1 ANT+SIN 61.5 86.4 56.8 97.0 39.0 67.1 76.1 AugMix 59.7 83.4 54.1 98.2 35.6 70.1 74.6 DeepAugment 58.1 84.6 53.3 99.0 36.2 64.2 74.8 DeepAug+Augmix 57.0 83.2 53.4 99.1 36.5 61.3 72.6 The summary results for all models are shown in Table 30. We show the top-1 error for the different IN-D domains versus training epochs for a vanilla ResNet50 in Fig. 8. We indicate the epochs 1 and 5 at which we extract the errors with dashed black lines. 
----- clipart infograph painting 100 100 90 85 90 95 80 GCE 80 75 ENT 90 70 70 65 60 85 quickdraw real sketch 100.0 100 80 99.5 95 99.0 70 90 98.5 60 85 98.0 50 80 97.5 40 75 97.0 70 0 5 10 15 0 5 10 15 0 5 10 15 Epochs Epochs Epochs Figure 8: Top-1 error for the different IN-D domains for a ResNet50 and training with RPLq=0.8 and ENT. We indicate the epochs at which we extract the test errors by the dashed black lines (epoch 1 for ENTand epoch 5 for RPLq=0.8). D.3 DETAILED RESULTS FOR THE EFFICIENTNET-L2 NOISY STUDENT MODEL ON IN-D We show the detailed results for the EfficientNet-L2 Noisy Student model on IN-D in Table 31. Table 31: Top-1 error (↘) on IN-D in % for EfficientNet-L2 Domain Baseline ENT RPL Clipart 45.0 39.8 **37.9** Infograph **77.9** 91.3 94.3 Painting 42.7 41.7 **40.9** Quickdraw **98.4** 99.4 99.4 Real 29.2 28.7 **27.9** Sketch 56.4 **48.0** 51.5 mDE 67.2 **66.8** 67.2 D.4 DETAILED RESULTS ON THE ERROR ANALYSIS ON IN-D **Frequently predicted classes** We analyze the most frequently predicted classes on IN-D by a vanilla ResNet50 and show the results in Fig. 9. We make several interesting observations: First, we find most errors interpretable: it makes sense that a ResNet50 assigns the label “comic book” to images from the “clipart” or “painting” domains, or “website” to images from the “infograph” domain, or “envelope” to images from the “sketch” domain. Second, on the hard domain “quickdraw”, the ResNet50 mostly predicts non-sensical classes that are not in IN-D, mirroring its almost chance performance on this domain. Third, we find no systematic errors on the “real” domain which is expected since this domain should be similar to IN. **Filtering predictions on IN-D that cannot be mapped to ImageNet-D** We perform a second analysis: We filter the predicted labels according to whether they can be mapped to IN-D and report the filtered top-1 errors as well as the percentage of filtered out inputs in Table 32. We note that for the domains “infograph” and “quickdraw”, the ResNet50 predicts labels that cannot be mapped to IN-D in over 70% of all cases, highlighting the hardness of these two domains. ----- Table 32: top-1 error on IN and different IN-D domains for different settings: left column: default evaluation, middle column: predicted labels that cannot be mapped to IN-D are filtered out, right column: percentage of filtered out labels. Dataset top-1 error in % top-1 error on filtered labels in % percentage of rejected inputs IN val 12.1 13.4 52.7 IN-D real 40.2 17.2 27.6 IN-D clipart 76.1 59.0 59.0 IN-D infograph 89.7 59.3 74.6 IN-D painting 65.2 39.5 42.4 IN-D quickdraw 99.3 96.7 76.1 IN-D sketch 82.1 65.6 47.9 **Filtering labels and predictions on IN that cannot be mapped to ImageNet-D** To test for possible class-bias effects, we test the performance of a ResNet50 model on IN classes that can be mapped to IN-D and report the results in Table 32. First, we map IN labels to IN-D to make the setting as similar as possible to our experiments on IN-D and report the top-1 error (12.1%). This error is significantly lower compared to the top-1 error a ResNet50 obtains following the standard evaluation protocol (23.9%). This can be explained by the simplification of the task: While in IN there are 39 bird classes, these are all mapped to the same hierarchical class in IN-D. Therefore, the classes in IN-D are more dissimilar from each other than in IN. Additionally, there are only 164 IN-D classes compared to the 1000 IN classes, raising the chance level prediction. 
If we further only accept predictions that can be mapped to IN-D, the top-1 error is slightly increased to 13.4%. In total, about 52.7% of all images in the IN validation set cannot be mapped to IN-D. clipart infograph 80 150 60 125 100 40 75 50 Numbers of predictions 20 25 0 0 comic book envelope jigsaw puzzle website menu envelope painting quickdraw 100 25 20 80 15 60 10 40 Numbers of predictions 5 20 0 0 comic book book jacket jigsaw puzzle hook, claw chain labyrinth real sketch 100 3 80 2 60 40 1 Numbers of predictions 20 0 0 envelope comic book studio couch envelope labyrinth nematode Figure 9: Systematic predictions of a vanilla ResNet50 on IN-D for different domains. ----- D.5 TOP-1 ERROR ON IN-D FOR ALEXNET We report the top-1 error numbers on different IN-D as achieved by AlexNet in Table 33. We used these numbers for normalization when calculating mDE. Table 33: top-1 error on IN-D by AlexNet which was used for normalization. Dataset top-1 error in % IN-D real 54.887 IN-D clipart 84.010 IN-D infograph 95.072 IN-D painting 79.080 IN-D quickdraw 99.745 IN-D sketch 91.189 ----- E ADDITIONAL EXPERIMENTS E.1 BEYOND IMAGENET CLASSES: SELF-LEARNING ON WILDS The WILDS benchmark (Koh et al., 2021) is comprised of ten tasks to test domain generalization, subpopulation shift, and combinations thereof. In contrast to the setting considered here, many of the datasets in WILDS mix several 10s or 100s domains during test time. The Camelyon17 dataset in WILDS contains histopathological images, with the labels being binary indicators of whether the central 32×32 region contains any tumor tissue; the domain identifies the hospital that the patch was taken from. Camelyon17 contains three different test splits with different domains and varying difficulty levels. For evaluation, we took the pretrained checkpoint from worksheets.codalab.org/worksheets/0x00d14c55993548a1823a710642f6d608 (camelyon17 erm densenet121 seed0) for a DenseNet121 model (Huang et al., 2017) and verified the reported baseline performance numbers. We adapt the models using ENT or RPL for a maximum of 10 epochs using learning rates {3 _×_ 10[−][5], 3 _×_ 10[−][4], . . . 3 _×_ 10[−][1]}. The best hyperparameter is selected according to OOD Validation accuracy. The RxRx1 dataset in WILDS contains RGB images of cells obtained by fluorescent microscopy, with the labels indicating which of the 1,139 genetic treatments (including no treatment) the cells received; the domain identifies the batch in which the imaging experiment was run. The RxRx1 dataset contains three test splits, however, unlike Camelyon17, in all of the splits the domains are mixed. For evaluation, we took the pretrained checkpoint from worksheets.codalab.org/bundles/ 0x7d33860545b64acca5047396d42c0ea0 for a ResNet50 model and verified the reported baseline performance numbers. We adapt the models using ENT or RPL for a maximum of 10 epochs using base learning rates {6.25 × 10[−][6], 6.25 × 10[−][5], . . . 6.25 × 10[−][2]}, which are scaled to the admissible batch size for single GPU adaptation using linear scaling. The best hyperparameter is selected according to OOD Validation accuracy. Table 34: Self-learning can improve performance on WILDS if a systematic shift is present — on Camelyon17, the ood validation and test sets are different hospitals, for example. On datasets like RxRx1 and FMoW, we do not see an improvement, most likely because the ood domains are shuffled, and a limited amount of images exist for each test domain. 
Top-1 accuracy [%] Validation (ID) Validation (OOD) Test (OOD) Camelyon17 Baseline 81.4 88.7 63.1 BN adapt 97.8 (+16.4) 90.9 (+2.2) 88.0 (+24.9) ENT 97.6 (+16.2) 92.7 (+4.0) 91.6 (+28.5) RPL 97.6 (+16.2) 93.0 (+4.3) 91.0 (+27.9) RxRx1 Baseline 35.9 19.1 29.7 BN adapt 35.0 (-0.9) 19.1 (0.0) 29.4 (-0.3) ENT 34.8 (-1.1) 19.2 (+0.1) 29.4 (-0.3) RPL 34.8 (-1.1) 19.2 (+0.1) 29.4 (-0.3) FMoW Baseline 60.5 59.2 52.9 BN adapt 59.9 (-0.6) 57.6 (-1.6) 51.8 (-1.1) ENT 59.9 (-0.6) 58.5 (-0.7) 52.2 (-0.7) RPL 59.8 (-0.7) 58.6 (-0.6) 52.1 (-0.8) The FMoW dataset in WILDS contains RGB satellite images, with the labels being one of 62 building or land use categories; the domain specifies the year in which the image was taken and its geographical region (Africa, the Americas, Oceania, Asia, or Europe). The FMoW dataset contains four test splits for different time periods, for which all regions are mixed together. For evaluation, we took the pretrained checkpoint from //worksheets.codalab.org/ bundles/0x20182ee424504e4a916fe88c91afd5a2 for a DenseNet121 model and verified the reported baseline performance numbers. We adapt the models using ENT or RPL for a maximum of 10 epochs ----- using learning rates {5.0 × 10[−][6], 5.0 × 10[−][5], . . . 5.0 × 10[−][2]}. The best hyperparameter is selected according to OOD Validation accuracy. While we see improvements on Camelyon17, neither BN adaptation nor self-learning can improve performance on RxRx1 or FMoW. Initial experiments on PovertyMap and iWildsCam also do not show improvements with self-learning. We hypothesize that the reason lies in the mixing of the domains: Both BN adaptation and our self-learning methods work best on systematic domain shifts. These results support our claim that self-learning is effective, while showing the important limitation when applied to more diverse shifts. E.2 SMALL IMPROVEMENTS ON BIGTRANSFER MODELS WITH GROUP NORMALIZATION LAYERS We evaluated BigTransfer models (Kolesnikov et al., 2020) provided by the timm library (Wightman, 2019). A difference to the ResNet50, ResNeXt101 and EfficientNet models is the use of group normalization layers, which might influence the optimal method for adaptation—for this evaluation, we followed our typical protocol as performed on ResNet50 models, and used affine adaptation. For affine adaptation, a distilled BigTransfer ResNet50 model improves from 49.6 % to 48.4 % mCE on the ImageNet-C development set, and from 55.0 % to 54.4 % mCE on the ImageNet-C test set when using RPL (q = 0.8) for adaptation, at learning rate 7.5 × 10[−][4] at batch size 96 after a single adaptation epoch. Entropy minimization did not further improve results on the ImageNet-C test set. An ablation over learning rates and epochs on the dev set is shown in Table 35, the final results are summarized in Table 36. Table 35: mCE in % on the IN-C dev set for ENT and RPL for different numbers of training epochs when adapting the affine batch norm parameters of a ResNet50 model. 
criterion ENT RPL lr, 7.5 × 10[−][5] 10[−][4] 10[−][3] 10[−][5] 10[−][4] 10[−][3] epoch 0 49.63 49.63 49.63 49.63 49.63 49.63 1 49.44 50.42 52.59 49.54 48.89 48.95 2 49.26 50.27 56.47 49.47 **48.35** 50.77 3 49.08 52.18 60.06 49.39 48.93 51.45 4 48.91 52.03 60.50 49.31 50.01 51.53 5 **48.80** 51.97 62.91 49.24 49.96 51.34 6 48.83 52.10 62.96 49.16 49.71 51.19 7 48.83 52.10 62.96 49.16 49.71 51.19 Table 36: mCE in % on the INC dev set for ENT and RPL for different numbers of training epochs when adapting the affine batch norm parameters of a ResNet50 model. dev mCE test mCE Baseline 49.63 55.03 ENT 48.80 56.36 RPL **48.35** **54.41** E.3 CAN SELF-LEARNING IMPROVE OVER SELF-LEARNING BASED UDA? An interesting question is whether test-time adaptation with self-learning can improve upon selflearning based UDA methods. To investigate this question, we build upon French et al. (2018) and their released code base at github.com/Britefury/self-ensemble-visual-domain-adapt. We trained the Baseline models from scratch using the provided shell scripts with the default hyperparameters and verified the reported performance. For adaptation, we tested BN adaptation, ENT, RPL, as well as continuing to train in exactly the setup of French et al. (2018), but without the supervised loss. For the different losses, we adapt the models for a maximum of 10 epochs using learning rates _{1 × 10[−][5], 1 × 10[−][4], . . ., 1 × 10[−][1]}._ Note that for this experiment, in contrast to any other result in this paper, we purposefully do not **perform proper hyperparameter selection based on a validation dataset—instead we report the** best accuracy across all tested epochs and learning rates to give an upper bound on the achievable performance for test-time adaptation. As highlighted in Table 37, none of the four tested variants is able to meaningfully improve over the baseline, corroborating our initial hypothesis that self-learning within a full UDA setting is the optimal strategy, if dataset size and compute permits. On the other hand, results like the teacher ----- refinement step in DIRT-T (Shu et al., 2018) show that with additional modifications in the loss function, it might be possible to improve over standard UDA with additional adaptation at test time. Table 37: Test-time adaptation marginally improves over self-ensembling. Baseline BN adapt ENT RPL Self-ensembling loss MNIST→SVHN MT+TF 33.88 34.44 34.87 35.09 33.27 MT+CT* 32.62 34.11 34.25 34.21 33.36 MT+CT+TF 41.59 41.93 41.95 41.95 42.70 MT+CT+TFA 30.55 32.53 32.54 32.55 30.84 SVHN-specific aug. 97.05 96.82 96.91 96.87 97.12 MNIST→USPS MT+TF 98.01 97.91 97.96 97.91 98.16 MT+CT* 88.34 88.39 88.54 88.39 88.44 MT+CT+TF 98.36 98.41 98.41 98.41 98.50 MT+CT+TFA 98.45 98.45 98.45 98.45 98.61 SVHN→MNIST MT+TF 98.49 98.47 98.49 98.47 99.40 MT+CT* 88.34 88.36 88.36 88.36 89.36 MT+CT+TF 99.51 99.49 99.5 99.49 99.57 MT+CT+TFA 99.56 99.57 99.57 99.57 99.58 SVHN-specific aug. 99.52 99.49 99.5 99.49 99.65 USPS→MNIST MT+TF 92.79 92.62 92.62 92.66 93.08 MT+CT* 99.11 99.13 99.14 99.13 99.21 MT+CT+TF 99.41 99.42 99.45 99.42 99.52 MT+CT+TFA 99.48 99.54 99.57 99.54 99.54 ----- F DETAILED DISCUSSION OF RELATED WORK **Self-learning for domain adaptation** Xie et al. (2020b) introduce “In-N-Out” which uses auxiliary information to boost both in- and out-of-distribution performance. 
AdaMatch (Berthelot et al., 2021) builds upon FixMatch (Sohn et al., 2020) and can be used for the tasks of unsupervised domain adaptation, semi-supervised learning and semi-supervised domain adaptation as a generalpurpose algorithm. Prabhu et al. (2021) propose SENTRY, an algorithm based on judging the predictive consistency of samples from the target domain under different image transformations. Zou et al. (2019) show that different types of confidence regularization can improve the performance of self-learning. A theoretically motivated framework for self-learning in domain adaptation based on consistency regularization has been proposed by Wei et al. (2020) and then extended by Cai et al. (2021). Self-learning has also been used for semantic segmentation (Zou et al., 2018). The main difference from these works to ours is that they 1) utilize both source and target data during training (i.e., the classical UDA setup) whereas we only require access to unlabeled target data (source-free setup), and 2) train their models from scratch whereas we adapt pretrained checkpoints to the unlabeled target data, 3) are oftentimes more complicated (also in terms of the number of hyperparameters) than our approach due to using more than one term in the objective function. We would like to highlight that utilizing source data should always result in better performance compared to not using source data. Our contribution is to show that self-learning can still be very beneficial with a small compute budget and no access to source data. Our setup targets “deployed systems”, e.g., a self-driving car or a detection algorithm in a production line which adapts to the distribution shift “on-the-fly” and cannot (or should not) be retrained from scratch for every new domain shift. Kumar et al. (2020) study the setting of self-learning for gradual domain adaptation. They find that self-learning works better if the data distribution changes slowly. The gradual domain adaptation setting differs from ours; instead of a gradual shift over time, we focus on a fixed, systematic shift at test time dataset. Kumar et al. (2020) tested their method on a synthetic Gaussian dataset, MNIST and the Portraits datasets; building and evaluating ImageNet-scale datasets for a gradual domain adaptation perspective is a very interesting extension of our work, but left for future work, and would not only require changes/adaptations to the self-learning method, but also to the evaluation datasets. Chen et al. (2020b) prove that under certain conditions, self-learning can improve performance in biased datasets where spurious features correlate with the label in the source domain but are independent of the label in the target domain. While Chen et al. (2020b) also consider the setting of source-free domain adaptation (like we do), they limit their experiments to small scale models on MNIST and Celeb-A, while we conduct a large scale empirical study on all common robustness datasets with large scale models – some of the observed and studied effects in our paper (effectiveness of different loss functions at different problem scales, adaptation mechanisms, etc.) can be attributed to this large scale evaluation setting, and extending our insights over small scale experiments. Similar to us, Chen et al. (2020b) find that a strong source classifier is necessary for self-learning to work; however, in their case, a teacher accuracy of 72% (on CMNIST10) is already too low and leads to worse student accuracy. 
In contrast, in our experiments, self-learning still works for an mCE as high as 80% (cf. appendix Figure 3, severity 5) and teacher accuracies as low as 10.4% (on ImageNet-D “Infograph”), and breaks down at accuracies around 1-2% (on ImageNet-D “Quickdraw”). This discrepancy might be due to the spurious correlations that Chen et al. (2020b) introduced in their dataset leading to systematic biases, which are not present in the datasets we studied. **Self-learning in semi-supervised learning (SSL)** In a different line of work which is not related to domain adaptation directly, self-learning has been used in a semi-supervised setting. Zoph et al. (2020) show that self-learning outperforms pretraining when stronger data augmentation is used and more labeled data is present. They use human labels on the target task (e.g., object detection on COCO) and pseudo-labels on an unlabeled dataset (e.g. ImageNet), and optimize the loss on both datasets, with the aim to improve performance on the task where ground truth labels are known. The work of Zoph et al. (2020) is orthogonal to ours, in the sense that we could adapt their final ----- checkpoint to a new domain with our method, similar to how we adapted the Noisy Student model which was also trained using self-learning. Rizve et al. (2021) propose an uncertainty-aware pseudo-label selection (UPS) framework which outperforms other SSL methods in a few-label regime. UPS is helpful to reduce the impact of noisy pseudo-labels; in our case, we use the generalized cross-entropy loss for this purpose. Testing the UPS framework (and other means for improving the quality of pseudo-labels, or robustness against label noise) on robustness datasets would be an interesting direction for future work. De Sousa Ribeiro et al. (2020) propose Deep Bayesian Self-Training (DBST) for automatic data annotation. Mukherjee & Awadallah (2020) suggest using self-learning in a semi-supervised setting for text classification with few labels. ----- G SOFTWARE STACK We use different open source software packages for our experiments, most notably Docker (Merkel, 2014), scipy and numpy (Virtanen et al., 2020), GNU parallel (Tange, 2011), Tensorflow (Abadi et al., 2016), PyTorch (Paszke et al., 2017), timm (Wightman, 2019), Self-ensembling for visual domain adaptation (French et al., 2018), the WILDS benchmark (Koh et al., 2021), and torchvision (Marcel & Rodriguez, 2010). -----