# IF YOUR DATA DISTRIBUTION SHIFTS, USE SELF-LEARNING

**Anonymous authors**
Paper under double-blind review

ABSTRACT

We demonstrate that self-learning techniques like entropy minimization and pseudo-labeling are simple and effective at improving the performance of a deployed computer vision model under systematic domain shifts. We show consistent improvements irrespective of the model architecture, the pre-training technique or the type of distribution shift. At the same time, self-learning is simple to use in practice because it does not require knowledge of or access to the original training data or scheme, is robust to hyperparameter choices, is straightforward to implement and requires only a few adaptation epochs. This makes self-learning techniques highly attractive for any practitioner who applies machine learning algorithms in the real world. We present state-of-the-art adaptation results on CIFAR10-C (8.5% error), ImageNet-C (22.0% mCE), ImageNet-R (17.4% error) and ImageNet-A (14.8% error), theoretically study the dynamics of self-supervised adaptation methods and propose a new classification dataset (ImageNet-D) which is challenging even with adaptation.

1 INTRODUCTION

Deep Neural Networks (DNNs) can reach human-level performance in complex cognitive tasks (Brown et al., 2020; He et al., 2016a; Berner et al., 2019) if the distribution of the test data is sufficiently similar to the training data. However, DNNs are known to struggle if the distribution of the test data is shifted relative to the training data (Geirhos et al., 2018; Dodge & Karam, 2017).

Two largely distinct communities aim to increase the performance of models under test-time distribution shifts. The robustness community generally considers ImageNet-scale datasets and evaluates models in an ad-hoc scenario: models are trained on a clean source dataset like ImageNet, using heavy data augmentation (Hendrycks et al., 2020a; Rusak et al., 2020; Geirhos et al., 2019) and/or large-scale pre-training (Xie et al., 2020a; Mahajan et al., 2018), and are not adapted in any way to test-time distribution shifts. This evaluation scenario is relevant for applications in which very different distribution shifts are encountered in an unpredictable order, but it misses out on the gains of adaptation to unlabeled samples of the target distribution.

Figure 1: Robustness and adaptation to new datasets has traditionally been achieved by robust pre-training (with hand-selected/data-driven augmentation strategies, or additional data), unsupervised domain adaptation (with access to unlabeled samples from the test set), or, more recently, self-supervised learning methods. We show that on top of these different pre-training tasks, it is always possible (irrespective of architecture, model size or pre-training algorithm) to further adapt models to the target domain with simple self-learning techniques.

The unsupervised domain adaptation (UDA) community often considers smaller-scale datasets and assumes that both the source and the (unlabeled) target dataset are known. Models are trained on both datasets (e.g., with an adversarial domain objective, Ganin et al., 2016) before evaluation on the target domain data. This evaluation scenario provides optimal conditions for adaptation, but the reliance on the source dataset makes UDA more computationally expensive and more impractical, and prevents the use of pre-trained models for which the source dataset is unknown or simply too large.
In this work, we consider the source-free domain adaptation setting, a middle ground between the classical ad-hoc robustness setting and UDA in which models can adapt to the target distribution, but without using the source dataset (Kundu et al., 2020; Kim et al., 2021; Li et al., 2020; Liang et al., 2020). This evaluation scenario is interesting for many practitioners and applications as an extension of the ad-hoc robustness scenario. It evaluates the possible performance of a deployed model on a systematic, unseen distribution shift at inference time: an embedded computer vision system in an autonomous car should adapt to changes without being trained on all available training data; an image-based quality control software may not necessarily open-source the images it has been trained on, but still has to be adapted to the lighting conditions at the operation location; a computer vision system in a hospital should perform robustly when tested on a scanner different from the training images. Importantly, it might not be known at development time which scanner the system will be tested on, and it might be prohibited to share images from many hospitals to run UDA.

Can self-learning methods like pseudo-labeling and entropy minimization also be used in this source-free domain adaptation setting? To answer this question, we perform an extensive study of several self-learning variants, and find consistent and substantial gains in test-time performance across several robustness and out-of-domain benchmarks and a wide range of models and pre-training methods, including models trained with UDA methods that do not use self-learning. We also find that self-learning outperforms state-of-the-art source-free domain adaptation methods, namely Test-Time Training, which is based on a self-supervised auxiliary objective and continual training (Sun et al., 2019b), test-time entropy minimization (Wang et al., 2020) and (gradient-free) BatchNorm adaptation (Schneider et al., 2020; Nado et al., 2020). We perform a large number of ablations to study important design choices for self-learning methods in source-free domain adaptation. Furthermore, we show that a variant of pseudo-labeling with a robust loss function consistently outperforms entropy minimization on ImageNet-scale datasets. We theoretically analyze and empirically verify the influence of the temperature parameter in self-learning and provide guidelines on how this single parameter should be chosen. Our approach is visualized in Figure 1. We do not consider test-time adaptation in an online setting as studied, e.g., by Zhang et al. (2021), where the model is adapted to one example at a time and reset after each example.

**Related Work.** Variants of self-learning have been used for UDA (Berthelot et al., 2021), for example using auxiliary information (Xie et al., 2020b), consistency (Wei et al., 2020; Cai et al., 2021; Prabhu et al., 2021) or confidence (Zou et al., 2019) regularization. The main differences between these works and ours are that they 1) utilize both source and target data for self-learning whereas we only require access to unlabeled target data, 2) train their models from scratch whereas we merely fine-tune pre-trained checkpoints on the unlabeled target data, and 3) are generally more complicated than our approach due to using more than one term in the objective function.
Our work is conceptually most similar to virtual adversarial domain adaptation in the fine-tuning phase of DIRT-T (Shu et al., 2018) and test-time entropy minimization (TENT; Wang et al., 2020). In contrast to DIRT-T, our objective is simpler and we scale the approach to considerably larger datasets on ImageNet scale. TENT, on the other hand, only evaluated a single method (entropy minimization) on a single vanilla model (ResNet-50) on IN-C. We substantially expand this analysis to show that self-learning almost universally increases test-time performance under distribution shifts, regardless of the type of distribution shift, the model architecture or the pre-training method.

Self-learning has also been applied to UDA for semantic segmentation (Zou et al., 2018), for gradual domain adaptation (Kumar et al., 2020), for semi-supervised learning (Rizve et al., 2021; Mukherjee & Awadallah, 2020), for learning in biased datasets (Chen et al., 2020b) and for automated data annotation (De Sousa Ribeiro et al., 2020). Zoph et al. (2020) show that self-learning outperforms pre-training when stronger data augmentation is used and more labeled data is present. A more detailed discussion of related work, along with the main differences to our work, can be found in Appendix F. Our main contribution beyond these works is to show the effectiveness of self-learning on top of robust, large-scale, and domain-adapted models, at scale.

2 SELF-LEARNING FOR TEST-TIME ADAPTATION

Different variants of self-learning have been used in unsupervised domain adaptation (French et al., 2018; Shu et al., 2018), self-supervised representation learning (Caron et al., 2021), and semi-supervised learning (Xie et al., 2020a). In a typical self-learning setting, a teacher network $f^t$ trained on the source domain predicts labels on the target domain. Then, a student model $f^s$ is fine-tuned on the predicted labels.

In the following, let $f^t(x)$ denote the logits for sample $x$ and let $p^t(j\,|\,x) \equiv \sigma_j(f^t(x))$ denote the probability for class $j$ obtained from a softmax function $\sigma_j(\cdot)$. Similarly, $f^s(x)$ and $p^s(j\,|\,x)$ denote the logits and probabilities of the student model $f^s$. For all techniques, one can optionally only admit samples where the probability $\max_j p^t(j\,|\,x)$ exceeds some threshold. We consider three popular variants of self-learning: pseudo-labeling with hard or soft labels, as well as entropy minimization.

**Hard Pseudo-Labeling (Lee, 2013; Galstyan & Cohen, 2007).** We generate labels using the teacher and train the student on pseudo-labels $i$ using the standard cross-entropy loss,

$$\ell_H(x) := -\log p^s(i\,|\,x), \qquad i = \operatorname{argmax}_j\, p^t(j\,|\,x). \qquad (1)$$

Usually, only samples with a confidence above a certain threshold are considered for training the student. We test several thresholds but note that thresholding means discarding a potentially large portion of the data, which leads to a performance decrease in itself. The teacher is updated after each epoch.

**Soft Pseudo-Labeling (Lee, 2013; Galstyan & Cohen, 2007).** In contrast to the hard pseudo-labeling variant, we here train the student on the class probabilities predicted by the teacher,

$$\ell_S(x) := -\sum_j p^t(j\,|\,x)\,\log p^s(j\,|\,x). \qquad (2)$$

Soft pseudo-labeling is typically not used in conjunction with thresholding, since it already incorporates the certainty of the model. The teacher is updated after each epoch.

**Entropy Minimization (ENT; Grandvalet & Bengio, 2004).**
This variant is similar to soft pseudo-labeling, but we no longer differentiate between a teacher and a student network. It corresponds to an "instantaneous" update of the teacher. The training objective becomes

$$\ell_E(x) := -\sum_j p^s(j\,|\,x)\,\log p^s(j\,|\,x). \qquad (3)$$

Intuitively, self-training with entropy minimization leads to a sharpening of the output distribution for each sample, making the model more confident in its predictions.

**Robust Pseudo-Labeling (RPL).** Virtually all introduced self-training variants use the standard cross-entropy classification objective. However, the standard cross-entropy loss has been shown to be sensitive to label noise (Zhang & Sabuncu, 2018; Zhang et al., 2017). In the setting of domain adaptation, inaccuracies in the teacher predictions and, thus, in the labels for the student are inescapable, with severe repercussions for training stability and hyperparameter sensitivity, as we show in the results.

As a straightforward solution to this problem, we propose to replace the cross-entropy loss by a robust classification loss designed to withstand certain amounts of label noise (Ghosh et al., 2017; Song et al., 2020; Shu et al., 2020; Zhang & Sabuncu, 2018). A popular candidate is the Generalized Cross Entropy (GCE) loss, which combines the noise-tolerant Mean Absolute Error (MAE) loss (Ghosh et al., 2017) with the CE loss. We only consider the hard labels and use the robust GCE loss as the training loss for the student,

$$i = \operatorname{argmax}_j\, p^t(j\,|\,x), \qquad \ell_{GCE}(x, i) := q^{-1}\left(1 - p^s(i\,|\,x)^q\right), \qquad (4)$$

with $q \in (0, 1]$. In the limit $q \to 0$, the GCE loss approaches the CE loss, and for $q = 1$, the GCE loss is the MAE loss (Zhang & Sabuncu, 2018). We test updating the teacher both after every update step of the student (RPL) and once per epoch (RPL$^{ep}$).

3 EXPERIMENT DESIGN

**Datasets.** IN-C (Hendrycks & Dietterich, 2019) contains corrupted versions of the 50 000 images in the IN validation set. There are fifteen test and four hold-out corruptions, each at five severity levels. The established metric to report model performance on IN-C is the mean Corruption Error (mCE), where the error is normalized by the AlexNet error and averaged over all corruptions and severity levels; see Eq. 20, Appendix C.1. IN-R (Hendrycks et al., 2020a) contains 30 000 images with artistic renditions of 200 classes of the IN dataset. IN-A (Hendrycks et al., 2019) is composed of 7500 unmodified real-world images on which standard IN-trained ResNet50 (He et al., 2016b) models yield chance-level performance. CIFAR10 (Krizhevsky et al., 2009) and STL10 (Coates et al., 2011) are small-scale image recognition datasets with 10 classes each, and training sets of 50 000/5000 images and test sets of 10 000/8000 images, respectively. The digit datasets MNIST (Deng, 2012) and MNIST-M (Ganin et al., 2016) both have 60 000 training and 10 000 test images.

**Hyperparameters.** The different self-learning variants have a range of hyperparameters such as the learning rate or the stopping criterion. Our goal is to give a realistic estimate of the performance to be expected in practice. To this end, we optimize hyperparameters for each variant of pseudo-labeling on a hold-out set of IN-C that contains four types of image corruptions ("speckle noise", "Gaussian blur", "saturate" and "spatter") with five different strengths each, following the procedure suggested in Hendrycks & Dietterich (2019). We refer to the hold-out set of IN-C as our dev set.
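For concreteness, the four objectives from Section 2 can be written as short PyTorch losses. The sketch below is illustrative rather than our exact implementation; the function names, the default confidence threshold and the default value of q are assumptions made for this example.

```python
import torch
import torch.nn.functional as F

def hard_pseudo_label_loss(student_logits, teacher_logits, threshold=0.0):
    """Hard pseudo-labeling (Eq. 1): cross-entropy against the teacher's argmax,
    optionally restricted to samples whose teacher confidence exceeds `threshold`."""
    with torch.no_grad():
        teacher_probs = teacher_logits.softmax(dim=1)
        confidence, pseudo_labels = teacher_probs.max(dim=1)
        mask = confidence >= threshold
    # assumes at least one sample passes the threshold
    return F.cross_entropy(student_logits[mask], pseudo_labels[mask])

def soft_pseudo_label_loss(student_logits, teacher_logits):
    """Soft pseudo-labeling (Eq. 2): cross-entropy against the full teacher distribution."""
    teacher_probs = teacher_logits.detach().softmax(dim=1)
    return -(teacher_probs * student_logits.log_softmax(dim=1)).sum(dim=1).mean()

def entropy_minimization_loss(logits):
    """Entropy minimization (Eq. 3): teacher equals student, no stop gradient."""
    log_probs = logits.log_softmax(dim=1)
    return -(log_probs.exp() * log_probs).sum(dim=1).mean()

def robust_pseudo_label_loss(student_logits, teacher_logits, q=0.8):
    """Robust pseudo-labeling (Eq. 4): generalized cross-entropy on hard teacher labels;
    q in (0, 1] interpolates between CE (q -> 0) and MAE (q = 1)."""
    with torch.no_grad():
        pseudo_labels = teacher_logits.argmax(dim=1)
    p_label = student_logits.softmax(dim=1).gather(1, pseudo_labels[:, None]).squeeze(1)
    return ((1.0 - p_label.pow(q)) / q).mean()
```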
**Models for ImageNet-scale datasets.** We consider four popular model architectures: ResNet50 (He et al., 2016b), DenseNet161 (Huang et al., 2017), ResNeXt101 (Xie et al., 2017) and EfficientNet-L2 (Tan & Le, 2019); see Appendix B.1 for details on the used models. For ResNet50, DenseNet161 and ResNeXt101, we include a simple vanilla version trained on IN only. For ResNet50 and ResNeXt101, we additionally include a state-of-the-art robust version trained with DeepAugment and Augmix (DAug+AM, Hendrycks et al., 2020a)[1]. For the ResNeXt model, we also include a version that was trained on 3.5 billion weakly labeled images (IG-3.5B, Mahajan et al., 2018). Finally, for EfficientNet-L2 we select the current state of the art on IN-C, which was trained on 300 million images from JFT-300M (Chollet, 2017; Hinton et al., 2014) using a noisy student-teacher protocol (Xie et al., 2020a). We validate the IN and IN-C performance of all considered models and match the originally reported scores (Schneider et al., 2020). For EfficientNet-L2, we match IN top-1 accuracy up to 0.1% points and IN-C up to 0.6% mCE.

[1] See the leaderboard at github.com/hendrycks/robustness.

**Models for CIFAR10/MNIST-scale datasets.** For CIFAR10-C experiments, we use two WideResNets (WRN, Zagoruyko & Komodakis, 2016): the first is trained on CIFAR10 and has a depth of 28 and a width of 10; the second is trained with AugMix (Hendrycks et al., 2020b) and has a depth of 40 and a width of 2. The remaining small-scale models are trained with unsupervised domain adaptation (UDA) methods. We propose to regard any UDA method which requires joint training with source and target data as a pre-training step, similar to regular pre-training on IN, and use self-learning on top of the final checkpoint. We consider two popular UDA methods: self-supervised domain adaptation (UDA-SS; Sun et al., 2019a) and Domain-Adversarial Training of Neural Networks (DANN; Ganin et al., 2016). In UDA-SS, the authors seek to align the representations of both domains by performing an auxiliary self-supervised task on both domains simultaneously. In all UDA-SS experiments, we use a WideResNet with a depth of 26 and a width of 16. In DANN, the authors learn a domain-invariant embedding by optimizing a minimax objective. For all DANN experiments except for MNIST→MNIST-M, we use the same WRN architecture as above. For the MNIST→MNIST-M experiment, training with the larger model diverged, and we used a smaller WideResNet version with a width of 2. We note that DANN training involves optimizing a minimax objective and is generally harder to tune.

4 RESULTS: SELF-LEARNING UNIVERSALLY IMPROVES MODELS

Self-learning is a powerful learning scheme, and in the following section we show that it allows performing test-time adaptation on robustified models, models obtained with large-scale pre-training, as well as domain-adapted models across a wide range of datasets and distribution shifts. Our main results on large-scale and small-scale datasets are shown in Tables 1 and 2, respectively. These summary tables show final results, and all experiments use the hyperparameters we determined separately on the dev set.

**Table 1: Self-learning successfully adapts ImageNet-scale models across different model architectures on IN-C, IN-A and IN-R.** We adapt the vanilla ResNet50, ResNeXt101 and DenseNet161 models to IN-C and decrease the mCE by over 19 percent points in all models.
Further, self-learning works for models irrespective of their size: self-learning substantially improves the performance of the ResNet50 and the ResNeXt101 trained with DAug+AM on IN-C by 11.9 and 9.7 percent points, respectively. Finally, we further improve the current state-of-the-art model on IN-C—the EfficientNet-L2 Noisy Student model—and report a new state-of-the-art result of 22% mCE (which corresponds to a top1 error of 17.1%) on this benchmark with test-time adaptation (compared to 28% mCE without adaptation).

mCE [%] on IN-C test (↘), adaptation with RPL:

| Model | Number of parameters | w/o adapt | w/ adapt | ∆ |
|---|---|---|---|---|
| ResNet50 vanilla (He et al., 2016b) | 2.6 × 10^7 | 76.7 | 50.5 | (-26.2) |
| ResNet50 DAug+AM (Hendrycks et al., 2020a) | 2.6 × 10^7 | 53.6 | 41.7 | (-11.9) |
| DenseNet161 vanilla (Huang et al., 2017) | 2.8 × 10^7 | 66.4 | 47.0 | (-19.4) |
| ResNeXt101 32×8d vanilla (Xie et al., 2017) | 8.8 × 10^7 | 66.6 | 43.2 | (-23.4) |
| ResNeXt101 32×8d DAug+AM (Hendrycks et al., 2020a) | 8.8 × 10^7 | 44.5 | 34.8 | (-9.7) |
| ResNeXt101 32×8d IG-3.5B (Mahajan et al., 2018) | 8.8 × 10^7 | 51.7 | 40.9 | (-10.8) |
| EfficientNet-L2 Noisy Student (Xie et al., 2020a) | 4.8 × 10^8 | 28.3 | **22.0** | (-6.3) |

top1 error [%] on IN-R (↘):

| Model | Number of parameters | w/o adapt | w/ adapt | ∆ |
|---|---|---|---|---|
| ResNet50 vanilla (He et al., 2016b) | 2.6 × 10^7 | 63.8 | 54.1 | (-9.7) |
| EfficientNet-L2 Noisy Student (Xie et al., 2020a) | 4.8 × 10^8 | 23.5 | **17.4** | (-6.1) |

top1 error [%] on ImageNet-A (↘):

| Model | Number of parameters | w/o adapt | w/ adapt | ∆ |
|---|---|---|---|---|
| EfficientNet-L2 Noisy Student (Xie et al., 2020a) | 4.8 × 10^8 | 16.5 | **14.8** | (-1.7) |

Self-learning is not limited to the distribution shifts in IN-C like compression artefacts or blur. On IN-R, a dataset with renditions, self-learning improves both the vanilla ResNet50 and the EfficientNet-L2 model, the latter of which improves from 23.5% to a new state of the art of 17.4% top-1 error. For a vanilla ResNet50, we improve the top-1 error from 63.8% (Hendrycks et al., 2020a) to 54.1%. On IN-A, adapting the EfficientNet-L2 model using self-learning decreases the top-1 error from 16.5% (Xie et al., 2020a) to 14.8%, again constituting a new state of the art with test-time adaptation on this dataset.

**Table 2: Self-learning improves robustified and domain adapted models on small-scale datasets.** We test common domain adaptation techniques like DANN (Ganin et al., 2016) and UDA-SS (Sun et al., 2019a), and show that self-learning is effective at further tuning such models to the target domain. We suggest viewing unsupervised source/target domain adaptation as a step comparable to pre-training under corruptions, rather than an adaptation technique specifically tuned to the target set—indeed, we can achieve error rates using, e.g., DANN + target adaptation previously only possible with source/target based pseudo-labeling, across different common domain adaptation benchmarks. Self-learning also decreases the error on CIFAR10-C of the Wide ResNet model trained with AugMix (AM, Hendrycks et al., 2020b) and reaches a new state of the art on CIFAR10-C of 8.5% top1 error with test-time adaptation. † denotes preliminary results on CIFAR10-C dev only, due to instabilities in training the adversarial network in DANN.
top1 error [%] on CIFAR10-C (↘), adaptation with ENT:

| Model | Number of parameters | w/o adapt | w/ adapt | ∆ |
|---|---|---|---|---|
| WRN-28-10 vanilla (Zagoruyko & Komodakis, 2016) | 3.6 × 10^7 | 26.5 | 13.3 | (-13.2) |
| WRN-40-2 AM (Hendrycks et al., 2020b) | 2.2 × 10^6 | 11.2 | 8.5 | (-2.7) |
| WRN-26-16 UDA-SS (Sun et al., 2019a) | 9.3 × 10^7 | 27.7 | 16.7 | (-11.0) |
| WRN-26-16 DANN (Ganin et al., 2016) | 9.3 × 10^7 | †29.7 | †28.5 | (-1.2) |

UDA CIFAR10→STL10, top1 error on target [%] (↘):

| Model | Number of parameters | w/o adapt | w/ adapt | ∆ |
|---|---|---|---|---|
| WRN-26-16 UDA-SS (Sun et al., 2019a) | 9.3 × 10^7 | 28.7 | 21.8 | (-6.9) |
| WRN-26-16 DANN (Ganin et al., 2016) | 9.3 × 10^7 | 25.0 | 23.9 | (-1.1) |

UDA MNIST→MNIST-M, top1 error on target [%] (↘):

| Model | Number of parameters | w/o adapt | w/ adapt | ∆ |
|---|---|---|---|---|
| WRN-26-16 UDA-SS (Sun et al., 2019a) | 9.3 × 10^7 | 4.8 | 2.0 | (-2.8) |
| WRN-26-2 DANN (Ganin et al., 2016) | 1.5 × 10^6 | 11.4 | 5.1 | (-6.3) |

**Table 3: Self-learning also improves large pre-trained models.** Unlike BatchNorm adaptation (Schneider et al., 2020), we show that self-learning transfers well to models pre-trained on a large amount of unlabeled data: self-learning decreases the mCE on IN-C of the ResNeXt101 trained on 3.5 billion weakly labeled samples (IG-3.5B, Mahajan et al., 2018) from 51.7% to 40.9%.

| mCE on IN-C test [%] (↘) | no adaptation | BN adaptation | self-learning |
|---|---|---|---|
| ResNeXt101 32×8d vanilla | 66.6 | 56.8 | 43.2 |
| ResNeXt101 32×8d IG-3.5B | 51.7 | 51.8 | **40.9** |

**Table 4: Self-learning outperforms previously published test-time adaptation approaches on IN-C.** The robustness benchmark IN-C has so far mostly been regarded in the ad-hoc evaluation setting as discussed in our introduction. Thus, there are only a few published methods that report numbers for test-time adaptation: BatchNorm adaptation (Schneider et al., 2020), Test-Time Training (TTT, Sun et al., 2019b), and TENT (Wang et al., 2020). In particular, note that TTT requires a special loss function at training time, while our approach is agnostic to the pre-training phase. Our self-training results outperform all three baselines (also after tuning TENT with our full experimental protocol; the tuned TENT result is given in parentheses):

| mCE on IN-C test [%] (↘) | w/o adapt | BN adapt | TENT | self-learning (ours) |
|---|---|---|---|---|
| ResNet50 vanilla | 76.7 | 62.2 | 53.5 (51.6) | **50.5** |

| top1 error [%] on IN-C, sev. 5 (↘) | w/o adapt | BN adapt | TTT | self-learning (ours) |
|---|---|---|---|---|
| ResNet18 vanilla | 85.4 | 72.2 | 66.3 | **61.9** |

**Table 5: Self-supervised methods based on self-learning allow out-of-the-box test-time adaptation.** The recently published DINO method (Caron et al., 2021) is another variant of self-supervised learning that has proven to be effective for unsupervised representation learning. At its core, the method uses soft pseudo-labeling. Here, we test whether a model trained with DINO on the source dataset can be test-time adapted on IN-C using DINO to further improve out-of-distribution performance. Since the used model is a vision transformer, we test different choices of adaptation parameters and find considerable performance improvements in all cases, yielding an mCE of 43.5% at a parameter count comparable to a ResNet50 model. For adapting the affine layers, we follow Houlsby et al. (2019):

| mCE on IN-C test [%] (↘) | w/o adapt | w/ adapt: affine layers | w/ adapt: bottleneck layers | w/ adapt: lin. layers | w/ adapt: all weights |
|---|---|---|---|---|---|
| ViT-S/16 | 62.3 | 51.8 | 46.8 | 45.2 | **43.5** |

5 UNDERSTANDING TEST-TIME ADAPTATION WITH SELF-LEARNING

In the following section, we show ablations and interesting insights on using self-learning for test-time adaptation. If not specified otherwise, all ablations are run on the holdout corruptions of IN-C (our dev set) with a vanilla ResNet50.
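The design choices studied in the following ablations (loss function, confidence threshold, teacher update interval, and which parameters are adapted) all enter a short adaptation loop. The sketch below shows one possible way to combine them (robust pseudo-labeling with an instant teacher and affine-only updates); the function name, optimizer settings and hyperparameter values are illustrative assumptions, not our exact training code.

```python
import torch

def adapt(model, target_loader, epochs=1, lr=1e-3, q=0.8):
    """Sketch of source-free test-time adaptation with robust pseudo-labeling (RPL).
    `model` is assumed to be a pretrained classifier; only the affine (scale/shift)
    parameters of its normalization layers are updated."""
    model.train()  # normalization layers use target batch statistics during adaptation
    norm_types = (torch.nn.BatchNorm2d, torch.nn.GroupNorm, torch.nn.LayerNorm)
    affine_params = [p for m in model.modules() if isinstance(m, norm_types)
                     for p in m.parameters()]
    optimizer = torch.optim.SGD(affine_params, lr=lr, momentum=0.9)

    for _ in range(epochs):
        for x, _ in target_loader:             # target labels are never used
            logits = model(x)
            pseudo = logits.argmax(dim=1)      # "instant" teacher: current model, hard labels
            p = logits.softmax(dim=1).gather(1, pseudo[:, None]).squeeze(1)
            loss = ((1.0 - p.pow(q)) / q).mean()   # GCE loss (Eq. 4) on the pseudo-labels
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```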
**Table 6: Robust pseudo-labeling outperforms entropy minimization on large-scale datasets while the reverse is true on small-scale datasets.** We find that robust pseudo-labeling consistently improves over entropy minimization on IN-C, while entropy minimization performs better on smaller-scale data (CIFAR10, STL10, MNIST). This finding highlights the importance of testing both algorithms on new datasets. The improvement is typically on the order of one percent point:

| mCE, IN-C dev | ResNet50 | ResNeXt-101 | EfficientNet-L2 |
|---|---|---|---|
| ENT | 50.0 ± 0.04 | 43.0 | 22.2 |
| RPL | **48.9 ± 0.02** | **42.0** | **21.3** |

| top-1 err, CIFAR10-C | WRN-40 |
|---|---|
| ENT | **8.5** |
| RPL | 9.0 |

**Table 7: Robust pseudo-labeling allows usage of the full dataset without a threshold.** Classical hard labeling needs a confidence threshold (T) for best performance, thereby reducing the dataset size, while best performance for RPL is reached for full-dataset training with a threshold T of 0.0:

| mCE on IN-C dev [%] | no adapt | soft PL | hard PL, T=0.0 | hard PL, **T=0.5** | hard PL, T=0.9 | RPL, **T=0.0** | RPL, T=0.5 | RPL, T=0.9 |
|---|---|---|---|---|---|---|---|---|
| | 69.5 | 60.1 | 53.8 | 51.9 | 52.4 | **49.7** | 49.9 | 51.8 |

**Table 8: Short update intervals are crucial for fast adaptation.** Having established that RPL generally performs better than soft- and hard-labeling, we vary the update interval for the teacher. We find that instant updates are most effective. In entropy minimization, the update interval is instant by default.

| Update interval for RPL | w/o adapt | no update | epoch | instant |
|---|---|---|---|---|
| mCE on IN-C dev [%] | 69.5 | 54.0 | 49.7 | **49.2** |

**Table 9: Adaptation of only affine layers is important in CNNs.** On IN-C, adapting only the affine parameters after the normalization layers (i.e., the rescaling and shift parameters β and γ) works better on a ResNet50 architecture than adapting all parameters or only the last layer. We indicate the number of adapted parameters in brackets.

| Adaptation mechanism | w/o adapt | last layer | full model | affine |
|---|---|---|---|---|
| mCE on IN-C dev [%] [adapted parameters] | 69.5 [0] | 60.2 [2M] | 51.5 [22.6M] | **48.9 [5.3k]** |

Note that for Vision Transformers, full model adaptation works better than affine adaptation (see Table 5). We also noticed that on convolutional models with a smaller parameter count like ResNet18, full model adaptation is possible.

**Hyperparameters obtained on corruption datasets transfer well to real-world datasets.** When evaluating models, we select the hyperparameters discussed above (the learning rate and the epoch used for early stopping are the most critical ones) on the holdout set of IN-C. We note that this technique transfers well to IN-R, -A and -D, highlighting the practical value of corruption robustness datasets for adapting models to real distribution shifts. On IN-D, we performed a control experiment where we selected hyperparameters with leave-one-out cross-validation—this selection scheme actually performed worse than IN-C parameter selection (see Appendix D.1).

6 ADAPTING MODELS ON A WIDER RANGE OF DISTRIBUTION SHIFTS REVEALS LIMITATIONS OF ROBUSTIFICATION AND ADAPTATION METHODS

Robustness datasets at ImageNet scale have so far been limited to a few selected domains (image corruptions in IN-C, image renditions in IN-R, difficult images for ResNet50 classifiers in IN-A). In order to test our approach on a wider range of complex distribution shifts, we re-purpose the dataset from the Visual Domain Adaptation Challenge 2019 (DomainNet, Saenko et al., 2019) as an additional robustness benchmark. This dataset comes with six image styles: Clipart, Real, Infograph, Painting, Quickdraw and Sketch.
It has 345 classes in total, of which 164 overlap with IN. To benchmark the robustness of IN-trained models out of the box, we filter out the classes that cannot be mapped to IN and refer to this smaller version of DomainNet as ImageNet-D (IN-D). We map 463 IN classes to these 164 IN-D classes; e.g., for an image from the "bird" class in IN-D, we accept all 39 bird classes in IN as valid predictions. We show example images from IN-D in Table 10. The detailed evaluation protocol, along with justifications for our design choices and additional analysis, is outlined in Appendix D. The benefit of IN-D over DomainNet is the re-mapping to ImageNet classes, which allows robustness researchers to easily benchmark on this dataset without the need to re-train a model (as is common in UDA).

To test whether self-learning is helpful for more complex distribution shifts, we adapt a vanilla ResNet50, several robust IN-C models and the EfficientNet-L2 Noisy Student model on IN-D. We use the same hyperparameters we obtained on IN-C dev for all our IN-D experiments. We show our main results in Table 10.

Table 10: Self-learning decreases the top1 error on some IN-D domains but increases it on others (top1 error [%], without and with adaptation).

| model | Real w/o | Real w/ | Painting w/o | Painting w/ | Clipart w/o | Clipart w/ | Sketch w/o | Sketch w/ | Infograph w/o | Infograph w/ | Quickdraw w/o | Quickdraw w/ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| EffNet-L2 Noisy Student | 29.2 | **27.9** | 42.7 | **40.9** | 45.0 | **37.9** | 56.4 | **51.5** | **77.9** | 94.3 | **98.4** | 99.4 |
| ResNet50 DAug+AM | 39.2 | 36.5 | 58.7 | 53.4 | 68.4 | 57.0 | 75.2 | 61.3 | 88.1 | 83.2 | 98.2 | 99.1 |
| ResNet50 vanilla | 40.1 | 37.3 | 65.1 | 57.8 | 76.0 | 63.6 | 82.0 | 73.0 | 89.6 | 85.1 | 99.2 | 99.8 |

**More robust models perform better on IN-D.** Comparing the performance of the vanilla ResNet50 model to its robust DAug+AM variant, we find that the DAug+AM model performs better on all domains, with the most significant gains on the "Clipart", "Painting" and "Sketch" domains. We show detailed results for all domains and all tested models in Appendix D.2, along with results on IN-C and IN-R for comparison. We find that the best performing models on IN-D are also the strongest ones on IN-C and IN-R, which indicates good generalization capabilities of the techniques combined in these models, given the large differences between the three considered datasets. However, even the best models perform 20 to 30 percentage points worse on IN-D compared to their performance on IN-C or IN-R, indicating that IN-D might be a more challenging benchmark.

**All models struggle with some domains of IN-D.** The EfficientNet-L2 Noisy Student model obtains the best results on most domains. However, we note that the overall error rates are surprisingly high compared to the model's strong performance on the other considered datasets (IN-A: 14.8% top-1 error, IN-R: 17.4% top-1 error, IN-C: 22.0% mCE). Even on the "Real" domain, which is closest to clean IN (where the EfficientNet-L2 model has a top-1 error of 11.6%), the model only reaches a top-1 error of 29.2%. Self-learning decreases the top1 error on all domains except for "Infograph" and "Quickdraw". We note that both domains have very high error rates from the beginning and thus hypothesize that the produced pseudo-labels are of low quality.

**Error analysis on IN-D.** We investigate the errors a ResNet50 model makes on IN-D by analyzing the most frequently predicted classes for different domains to reveal systematic errors indicative of the encountered distribution shifts.
We find most errors interpretable: the classifier assigns the label "comic book" to images from the "Clipart" or "Painting" domains, "website" to images from the "Infograph" domain, and "envelope" to images from the "Sketch" domain. Thus, the classifier predicts the domain rather than the class. We find no systematic errors on the "Real" domain, which is expected since this domain should be similar to IN. Detailed results on the top-3 most frequently predicted classes for different domains can be found in Fig. 9, Appendix D.4.

**IN-D should be used as an additional robustness benchmark.** While the error rates on IN-C, -R and -A are at a well-acceptable level for our largest EfficientNet-L2 model after adaptation, IN-D performance is consistently worse for all models. We propose to move from isolated benchmark settings like IN-R (single domain) to benchmarks more common in domain adaptation (like DomainNet) and make IN-D publicly available as an easy-to-use dataset for this purpose.

**Additional experiments and limitations.** We discuss additional proof-of-concept implementations on the WILDS benchmark (Koh et al., 2021), BigTransfer (BiT; Chen et al., 2020a) models and self-learning based UDA models in Appendix E. On WILDS, self-learning is effective for the Camelyon17 task with a systematic shift between train, validation and test sets (each set is comprised of different hospitals), while self-learning fails to improve on tasks with mixed domains.

7 A SIMPLE MODEL OF STABILITY IN SELF-LEARNING

We observed that different self-learning schemes are optimal for small-scale vs. large-scale datasets and varying numbers of classes. We reconsider the used loss functions and unify them into

$$\ell(x) = -\sum_j \sigma_j\!\left(\frac{f^t(x)}{\tau_t}\right) \log \sigma_j\!\left(\frac{f^s(x)}{\tau_s}\right), \qquad f^t(x) = \begin{cases} f(x), & \text{entropy minimization,} \\ \mathrm{sg}(f(x)), & \text{pseudo-labeling.} \end{cases} \qquad (5)$$

Here we introduced the student and teacher temperatures $\tau_s$ and $\tau_t$ as parameters in the softmax function, as well as the stop gradient operation sg. Caron et al. (2021) fixed $\tau_s$ and varied $\tau_t$ during training, and empirically found an upper bound for $\tau_t$ above which the training was no longer stable. To better understand this behavior, we study the learning dynamics of the loss function in equation 5 theoretically in a simple two-datapoint, two-class model with linear student and teacher networks $f^s(x) = x^\top w^s$ and $f^t(x) = x^\top w^t$, defined in Appendix A.1. Gradient descent with stop gradient corresponds to hard pseudo-labeling in the limit $\tau_t \to 0$ and to soft pseudo-labeling when $\tau_s = \tau_t = 1$. Gradient descent without stop gradient, i.e., setting $w^s = w^t = w$, corresponds to entropy minimization. We obtain the following result:

**Proposition 1 (Collapse in the two-point model).** The student and teacher networks $w^s$ and $w^t$ trained with stop gradient do not collapse to the trivial representation $\forall x: x^\top w^s = 0,\ x^\top w^t = 0$ if $\tau_s > \tau_t$. The network $w$ trained without stop gradient does not collapse if $\tau_s > \tau_t/2$.

Proof: see § A.2.

We validate the proposition on a simulated two-datapoint toy dataset, as well as on the CIFAR10-C dataset, and outline the results in Figure 2. In general, the size and location of the region where collapse is observed in the simulated model also depend on the initial conditions, the learning rate and the optimization procedure. An in-depth discussion, as well as additional simulations, are given in the Appendix.
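For reference, equation 5 can be implemented in a few lines of PyTorch, where the choice between pseudo-labeling and entropy minimization reduces to detaching the teacher logits; the default temperature values below are placeholders only.

```python
import torch

def unified_self_learning_loss(logits, tau_s=1.0, tau_t=0.5, pseudo_labeling=True):
    """Unified self-learning objective (Eq. 5).

    pseudo_labeling=True:  teacher logits are detached (stop gradient sg), i.e. pseudo-labeling.
    pseudo_labeling=False: gradients flow through both factors, i.e. entropy minimization.
    The default temperatures are illustrative placeholders.
    """
    teacher_logits = logits.detach() if pseudo_labeling else logits
    teacher_probs = (teacher_logits / tau_t).softmax(dim=1)
    student_log_probs = (logits / tau_s).log_softmax(dim=1)
    return -(teacher_probs * student_log_probs).sum(dim=1).mean()
```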
In practice, the result suggests that student temperatures should exceed the teacher temperatures for pseudo-labeling, and student temperatures should exceed half the teacher temperature for entropy minimization.

Figure 2: For the two-point model, we show the error, and for the CIFAR10-C simulation, we show improvement (yellow) vs. degradation (purple) over the non-adapted baseline (BAS). An important convergence criterion for pseudo-labeling (top row) and entropy minimization (bottom row) is the ratio of student and teacher temperatures; it lies at $\tau_s = \tau_t$ for PL, and $2\tau_s = \tau_t$ for ENT. Despite the simplicity of the two-point model, the general convergence regions transfer to CIFAR10-C.

Entropy minimization with standard temperatures ($\tau_s = \tau_t = 1$) and hard pseudo-labeling ($\tau_t \to 0$) are hence stable. The two-point learning dynamics vanish for soft pseudo-labeling with $\tau_s = \tau_t$, suggesting that one would have to analyze a more complex model with more data points. While this does not directly imply that learning is unstable at this point, we empirically observe that both entropy minimization and hard labeling outperform soft labeling in practice.

8 CONCLUSION

We evaluated and analyzed how self-learning, an essential component in many unsupervised domain adaptation and self-supervised pre-training techniques, can be applied for adaptation to both small- and large-scale image recognition problems common in robustness research. We demonstrated new state-of-the-art adaptation results with the EfficientNet-L2 model on the benchmarks ImageNet-C, -R, and -A, and introduced a new benchmark dataset (ImageNet-D) which remains challenging even after adaptation. Our theoretical analysis shows the influence of the temperature parameter in the self-learning loss function on the training stability and provides guidelines on how to choose a suitable value. Self-learning universally improves test-time performance under diverse, but systematic distribution shifts, irrespective of the architecture or pre-training method. We hope that our work encourages both researchers and practitioners to use self-learning if their data distribution shifts.

**Reproducibility Statement** We attempted to make our work as reproducible as possible: we mostly used pre-trained models which are publicly available and denoted the URLs of all used checkpoints; for the checkpoints that had to be retrained, we report the GitHub repositories with the source code and used an official or verified reference implementation when available. We report all used hyperparameters in the Appendix and will release our code upon acceptance of the paper.

REFERENCES

Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. Tensorflow: A system for large-scale machine learning. In 12th {USENIX} Symposium on Operating Systems Design and Implementation ({OSDI} 16), pp. 265–283, 2016. 37

Dan Hendrycks, Kevin Zhao, Steven Basart, Jacob Steinhardt, and Dawn Song. Natural adversarial examples.
ArXiv preprint, abs/1907.07174, 2019. URL [https://arxiv.org/abs/1907.07174. 4](https://arxiv.org/abs/1907.07174) Christopher Berner, Greg Brockman, Brooke Chan, Vicki Cheung, Przemyslaw Debiak, Christy Dennison, David Farhi, Quirin Fischer, Shariq Hashme, Chris Hesse, et al. Dota 2 with large [scale deep reinforcement learning. ArXiv preprint, abs/1912.06680, 2019. URL https://](https://arxiv.org/abs/1912.06680) [arxiv.org/abs/1912.06680. 1](https://arxiv.org/abs/1912.06680) David Berthelot, Rebecca Roelofs, Kihyuk Sohn, Nicholas Carlini, and Alex Kurakin. Adamatch: A unified approach to semi-supervised learning and domain adaptation, 2021. 2, 35 Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are fewshot learners. In Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin (eds.), Advances in Neural Information Processing Systems 33: Annual _Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12,_ _[2020, virtual, 2020. URL https://proceedings.neurips.cc/paper/2020/hash/](https://proceedings.neurips.cc/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html)_ [1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html. 1](https://proceedings.neurips.cc/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html) Tianle Cai, Ruiqi Gao, Jason D Lee, and Qi Lei. A theory of label propagation for subpopulation shift. arXiv preprint arXiv:2102.11203, 2021. 2, 35 Mathilde Caron, Hugo Touvron, Ishan Misra, Herv´e J´egou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. _ArXiv preprint,_ [abs/2104.14294, 2021. URL https://arxiv.org/abs/2104.14294. 3, 6, 8, 21](https://arxiv.org/abs/2104.14294) Ting Chen, Simon Kornblith, Kevin Swersky, Mohammad Norouzi, and Geoffrey E. Hinton. Big self-supervised models are strong semi-supervised learners. In Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin (eds.), Advances in Neural Information Processing Systems 33: _Annual Conference_ _on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020,_ _virtual, 2020a._ [URL https://proceedings.neurips.cc/paper/2020/hash/](https://proceedings.neurips.cc/paper/2020/hash/fcbc95ccdd551da181207c0c1400c655-Abstract.html) [fcbc95ccdd551da181207c0c1400c655-Abstract.html. 8](https://proceedings.neurips.cc/paper/2020/hash/fcbc95ccdd551da181207c0c1400c655-Abstract.html) Yining Chen, Colin Wei, Ananya Kumar, and Tengyu Ma. Self-training avoids using spurious features under domain shift. In NeurIPS, 2020b. 2, 35 Franc¸ois Chollet. Xception: Deep learning with depthwise separable convolutions. In 2017 IEEE _Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July_ _21-26, 2017, pp. 1800–1807. IEEE Computer Society, 2017. doi: 10.1109/CVPR.2017.195. URL_ [https://doi.org/10.1109/CVPR.2017.195. 4, 20](https://doi.org/10.1109/CVPR.2017.195) Adam Coates, Andrew Ng, and Honglak Lee. An analysis of single-layer networks in unsupervised feature learning. 
In Proceedings of the Fourteenth International Conference on Artificial _Intelligence and Statistics, 2011. 4_ Francesco Croce, Maksym Andriushchenko, Vikash Sehwag, Edoardo Debenedetti, Nicolas Flammarion, Mung Chiang, Prateek Mittal, and Matthias Hein. Robustbench: a standardized adversarial robustness benchmark. _ArXiv preprint, abs/2010.09670, 2020._ [URL https:](https://arxiv.org/abs/2010.09670) [//arxiv.org/abs/2010.09670. 21](https://arxiv.org/abs/2010.09670) ----- Fabio De Sousa Ribeiro, Francesco Caliv´a, Mark Swainson, Kjartan Gudmundsson, Georgios Leontidis, and Stefanos Kollias. Deep bayesian self-training. _Neural Computing and_ _Applications, 32(9):4275–4291, 2020. 2, 36_ Li Deng. The mnist database of handwritten digit images for machine learning research. IEEE _Signal Processing Magazine, 29(6):141–142, 2012. 4_ Samuel F. Dodge and Lina J. Karam. A study and comparison of human and deep learning recognition performance under visual distortions. In International Conference on Computer _Communications and Networks, ICCCN 2017, 2017. 1_ Geoffrey French, Michal Mackiewicz, and Mark H. Fisher. Self-ensembling for visual domain adaptation. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, _BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net, 2018._ [URL https://openreview.net/forum?id=rkpoTaxA-. 3, 33, 37](https://openreview.net/forum?id=rkpoTaxA-) Aram Galstyan and Paul R. Cohen. Empirical comparison of hard and soft label propagation for relational classification. In 17th international conference on Inductive logic programming, 2007. 3 Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, Franc¸ois Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. The journal of machine learning research, 17(1):2096–2030, 2016. 2, 4, 5 Robert Geirhos, Carlos R. Medina Temme, Jonas Rauber, Heiko H. Sch¨utt, Matthias Bethge, and Felix A. Wichmann. Generalisation in humans and deep neural networks. In Samy Bengio, Hanna M. Wallach, Hugo Larochelle, Kristen Grauman, Nicol`o Cesa-Bianchi, and Roman Garnett (eds.), Advances in Neural Information Processing Systems 31: Annual Conference on _Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montr´eal,_ _Canada, pp. 7549–7561, 2018._ [URL https://proceedings.neurips.cc/paper/](https://proceedings.neurips.cc/paper/2018/hash/0937fb5864ed06ffb59ae5f9b5ed67a9-Abstract.html) [2018/hash/0937fb5864ed06ffb59ae5f9b5ed67a9-Abstract.html. 1](https://proceedings.neurips.cc/paper/2018/hash/0937fb5864ed06ffb59ae5f9b5ed67a9-Abstract.html) Robert Geirhos, Patricia Rubisch, Claudio Michaelis, Matthias Bethge, Felix A. Wichmann, and Wieland Brendel. Imagenet-trained cnns are biased towards texture; increasing shape bias improves accuracy and robustness. In 7th International Conference on Learning Representations, _ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019._ [URL https:](https://openreview.net/forum?id=Bygh9j09KX) [//openreview.net/forum?id=Bygh9j09KX. 1, 27](https://openreview.net/forum?id=Bygh9j09KX) Aritra Ghosh, Himanshu Kumar, and P. S. Sastry. Robust loss functions under label noise for deep neural networks. In Satinder P. Singh and Shaul Markovitch (eds.), Proceedings of the Thirty-First _AAAI Conference on Artificial Intelligence, February 4-9, 2017, San Francisco, California, USA,_ [pp. 1919–1925. AAAI Press, 2017. 
URL http://aaai.org/ocs/index.php/AAAI/](http://aaai.org/ocs/index.php/AAAI/AAAI17/paper/view/14759) [AAAI17/paper/view/14759. 3](http://aaai.org/ocs/index.php/AAAI/AAAI17/paper/view/14759) Yves Grandvalet and Yoshua Bengio. Semi-supervised learning by entropy minimization. In Advances in Neural Information Processing Systems 17 [Neural Information Processing _Systems, NIPS 2004, December 13-18, 2004, Vancouver, British Columbia, Canada], pp._ 529–536, 2004. [URL https://proceedings.neurips.cc/paper/2004/hash/](https://proceedings.neurips.cc/paper/2004/hash/96f2b50b5d3613adf9c27049b2a888c7-Abstract.html) [96f2b50b5d3613adf9c27049b2a888c7-Abstract.html. 3](https://proceedings.neurips.cc/paper/2004/hash/96f2b50b5d3613adf9c27049b2a888c7-Abstract.html) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR _2016, Las Vegas, NV, USA, June 27-30, 2016, pp. 770–778. IEEE Computer Society, 2016a._ [doi: 10.1109/CVPR.2016.90. URL https://doi.org/10.1109/CVPR.2016.90. 1](https://doi.org/10.1109/CVPR.2016.90) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR _2016, Las Vegas, NV, USA, June 27-30, 2016, pp. 770–778. IEEE Computer Society, 2016b._ [doi: 10.1109/CVPR.2016.90. URL https://doi.org/10.1109/CVPR.2016.90. 4, 5,](https://doi.org/10.1109/CVPR.2016.90) 21 ----- Dan Hendrycks and Thomas G. Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. In 7th International Conference on Learning Representations, _ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019._ [URL https:](https://openreview.net/forum?id=HJz6tiCqYm) [//openreview.net/forum?id=HJz6tiCqYm. 4, 27](https://openreview.net/forum?id=HJz6tiCqYm) Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, et al. The many faces of robustness: A critical analysis of out-of-distribution generalization. _ArXiv preprint, abs/2006.16241, 2020a._ URL [https://arxiv.org/abs/2006.16241. 1, 4, 5, 20, 21, 27](https://arxiv.org/abs/2006.16241) Dan Hendrycks, Norman Mu, Ekin Dogus Cubuk, Barret Zoph, Justin Gilmer, and Balaji Lakshminarayanan. Augmix: A simple data processing method to improve robustness and uncertainty. In 8th International Conference on Learning Representations, ICLR 2020, Addis _[Ababa, Ethiopia, April 26-30, 2020. OpenReview.net, 2020b. URL https://openreview.](https://openreview.net/forum?id=S1gmrxHFvB)_ [net/forum?id=S1gmrxHFvB. 4, 5, 21, 27](https://openreview.net/forum?id=S1gmrxHFvB) Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. In _NIPS Deep Learning Workshop, 2014. 4, 20_ Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for NLP. In Proceedings of the 36th International Conference on Machine Learning, 2019. 6 Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, _CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pp. 2261–2269. IEEE Computer Society,_ [2017. doi: 10.1109/CVPR.2017.243. 
URL https://doi.org/10.1109/CVPR.2017.](https://doi.org/10.1109/CVPR.2017.243) [243. 4, 5, 21, 32](https://doi.org/10.1109/CVPR.2017.243) Youngeun Kim, Donghyeon Cho, Kyeongtak Han, Priyadarshini Panda, and Sungeun Hong. Domain adaptation without source data. IEEE Transactions on Artificial Intelligence, 2021. 2 Pang Wei Koh, Shiori Sagawa, Henrik Marklund, Sang Michael Xie, Marvin Zhang, Akshay Balsubramani, Weihua Hu, Michihiro Yasunaga, Richard Lanas Phillips, Irena Gao, Tony Lee, Etienne David, Ian Stavness, Wei Guo, Berton A. Earnshaw, Imran S. Haque, Sara Beery, Jure Leskovec, Anshul Kundaje, Emma Pierson, Sergey Levine, Chelsea Finn, and Percy Liang. WILDS: A benchmark of in-the-wild distribution shifts. In International Conference on Machine _Learning (ICML), 2021. 8, 32, 37_ Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Joan Puigcerver, Jessica Yung, Sylvain Gelly, and Neil Houlsby. Big transfer (bit): General visual representation learning. In Computer Vision– _ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part_ _V 16, pp. 491–507. Springer, 2020. 33_ Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009. 4 Ananya Kumar, Tengyu Ma, and Percy Liang. Understanding self-training for gradual domain adaptation. In International Conference on Machine Learning, pp. 5468–5479. PMLR, 2020. 2, 35 Jogendra Nath Kundu, Naveen Venkat, R Venkatesh Babu, et al. Universal source-free domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern _Recognition, pp. 4544–4553, 2020. 2_ Dong-Hyun Lee. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In ICML Workshop : Challenges in Representation Learning (WREPL), 2013. 3 Rui Li, Qianfen Jiao, Wenming Cao, Hau-San Wong, and Si Wu. Model adaptation: Unsupervised domain adaptation without source data. In 2020 IEEE/CVF Conference on Computer Vision and _Pattern Recognition (CVPR), 2020. 2_ ----- Jian Liang, Dapeng Hu, and Jiashi Feng. Do we really need to access the source data? source hypothesis transfer for unsupervised domain adaptation. In International Conference on Machine _Learning, 2020. 2_ Dhruv Mahajan, Ross Girshick, Vignesh Ramanathan, Kaiming He, Manohar Paluri, Yixuan Li, Ashwin Bharambe, and Laurens van der Maaten. Exploring the limits of weakly supervised pretraining. In Proceedings of the European Conference on Computer Vision (ECCV), 2018. 1, 4, 5, 6, 20, 21 S´ebastien Marcel and Yann Rodriguez. Torchvision the machine-vision package of torch. In ACM _International Conference on Multimedia, 2010. 21, 37_ Dirk Merkel. Docker: Lightweight linux containers for consistent development and deployment. _Linux J., 2014(239), 2014. ISSN 1075-3583. 37_ Subhabrata Mukherjee and Ahmed Hassan Awadallah. Uncertainty-aware self-training for text classification with few labels. In NeurIPS, 2020. 2, 36 Zachary Nado, Shreyas Padhy, D Sculley, Alexander D’Amour, Balaji Lakshminarayanan, and Jasper Snoek. Evaluating prediction-time batch normalization for robustness under covariate shift. _[ArXiv preprint, abs/2006.10963, 2020. URL https://arxiv.org/abs/2006.10963. 2](https://arxiv.org/abs/2006.10963)_ Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NIPS Autodiff Workshop, 2017. 37 Viraj Prabhu, Shivam Khare, Deeksha Kartik, and Judy Hoffman. 
Sentry: Selective entropy optimization via committee consistency for unsupervised domain adaptation. In Proceedings _of the IEEE/CVF International Conference on Computer Vision, pp. 8558–8567, 2021. 2, 35_ Mamshad Nayeem Rizve, Kevin Duarte, Yogesh S Rawat, and Mubarak Shah. In defense of pseudolabeling: An uncertainty-aware pseudo-label selection framework for semi-supervised learning. In ICLR, 2021. 2, 36 Evgenia Rusak, Lukas Schott, Roland Zimmermann, Julian Bitterwolf, Oliver Bringmann, Matthias Bethge, and Wieland Brendel. Increasing the robustness of dnns against image corruptions by [playing the game of noise. ArXiv preprint, abs/2001.06057, 2020. URL https://arxiv.](https://arxiv.org/abs/2001.06057) [org/abs/2001.06057. 1, 27](https://arxiv.org/abs/2001.06057) Kate Saenko, Xingchao Peng, Ben Usman, Kuniaki Saito, and Ping Hu. Visual Domain Adaptation _[Challenge (VisDA-2019), 2019. URL http://ai.bu.edu/visda-2019/. 7](http://ai.bu.edu/visda-2019/)_ Steffen Schneider, Evgenia Rusak, Luisa Eck, Oliver Bringmann, Wieland Brendel, and Matthias Bethge. Improving robustness against common corruptions by covariate shift adaptation. In _Advances in neural information processing systems, 2020. 2, 4, 6, 20, 24_ Jun Shu, Qian Zhao, Keyu Chen, Zongben Xu, and Deyu Meng. Learning adaptive loss for robust [learning with noisy labels. ArXiv preprint, abs/2002.06482, 2020. URL https://arxiv.](https://arxiv.org/abs/2002.06482) [org/abs/2002.06482. 3](https://arxiv.org/abs/2002.06482) Rui Shu, Hung H. Bui, Hirokazu Narui, and Stefano Ermon. A DIRT-T approach to unsupervised domain adaptation. In 6th International Conference on Learning Representations, ICLR 2018, _Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net,_ [2018. URL https://openreview.net/forum?id=H1q-TM-AW. 2, 3, 34](https://openreview.net/forum?id=H1q-TM-AW) Kihyuk Sohn, David Berthelot, Chun-Liang Li, Zizhao Zhang, Nicholas Carlini, Ekin D Cubuk, Alex Kurakin, Han Zhang, and Colin Raffel. Fixmatch: Simplifying semi-supervised learning with consistency and confidence. In NeurIPS, 2020. 35 Hwanjun Song, Minseok Kim, Dongmin Park, and Jae-Gil Lee. Learning from noisy labels [with deep neural networks: A survey. ArXiv preprint, abs/2007.08199, 2020. URL https:](https://arxiv.org/abs/2007.08199) [//arxiv.org/abs/2007.08199. 3](https://arxiv.org/abs/2007.08199) ----- Yu Sun, Eric Tzeng, Trevor Darrell, and Alexei A Efros. Unsupervised domain adaptation through [self-supervision. ArXiv preprint, abs/1909.11825, 2019a. URL https://arxiv.org/abs/](https://arxiv.org/abs/1909.11825) [1909.11825. 4, 5](https://arxiv.org/abs/1909.11825) Yu Sun, Xiaolong Wang, Zhuang Liu, John Miller, Alexei A Efros, and Moritz Hardt. Test-time training for out-of-distribution generalization. _ArXiv preprint, abs/1909.13231, 2019b._ URL [https://arxiv.org/abs/1909.13231. 2, 6, 25](https://arxiv.org/abs/1909.13231) Mingxing Tan and Quoc V. Le. Efficientnet: Rethinking model scaling for convolutional neural networks. In Kamalika Chaudhuri and Ruslan Salakhutdinov (eds.), Proceedings of the 36th _International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach,_ _California, USA, volume 97 of Proceedings of Machine Learning Research, pp. 6105–6114._ [PMLR, 2019. URL http://proceedings.mlr.press/v97/tan19a.html. 4, 21](http://proceedings.mlr.press/v97/tan19a.html) O. Tange. Gnu parallel - the command-line power tool. ;login: The USENIX Magazine, 36(1): [42–47, 2011. URL http://www.gnu.org/s/parallel. 
37](http://www.gnu.org/s/parallel) Pauli Virtanen, Ralf Gommers, Travis E. Oliphant, Matt Haberland, Tyler Reddy, David Cournapeau, Evgeni Burovski, Pearu Peterson, Warren Weckesser, Jonathan Bright, St´efan J. van der Walt, Matthew Brett, Joshua Wilson, K. Jarrod Millman, Nikolay Mayorov, Andrew R. J. Nelson, Eric Jones, Robert Kern, Eric Larson, CJ Carey, [˙]Ilhan Polat, Yu Feng, Eric W. Moore, Jake Vand erPlas, Denis Laxalde, Josef Perktold, Robert Cimrman, Ian Henriksen, E. A. Quintero, Charles R Harris, Anne M. Archibald, Antˆonio H. Ribeiro, Fabian Pedregosa, Paul van Mulbregt, and SciPy 1. 0 Contributors. SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python. Nature Methods, 17:261–272, 2020. doi: https://doi.org/10.1038/s41592-019-0686-2. 37 Dequan Wang, Evan Shelhamer, Shaoteng Liu, Bruno Olshausen, and Trevor Darrell. Fully test[time adaptation by entropy minimization. ArXiv preprint, abs/2006.10726, 2020. URL https:](https://arxiv.org/abs/2006.10726) [//arxiv.org/abs/2006.10726. 2, 6](https://arxiv.org/abs/2006.10726) Colin Wei, Kendrick Shen, Yining Chen, and Tengyu Ma. Theoretical analysis of self-training with deep networks on unlabeled data. In ICLR, 2020. 2, 35 Ross Wightman. Pytorch image models. [https://github.com/rwightman/](https://github.com/rwightman/pytorch-image-models) [pytorch-image-models, 2019. 33, 37](https://github.com/rwightman/pytorch-image-models) Qizhe Xie, Minh-Thang Luong, Eduard H. Hovy, and Quoc V. Le. Self-training with noisy student improves imagenet classification. In 2020 IEEE/CVF Conference on Computer Vision and Pattern _Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020, pp. 10684–10695. IEEE, 2020a._ [doi: 10.1109/CVPR42600.2020.01070. URL https://doi.org/10.1109/CVPR42600.](https://doi.org/10.1109/CVPR42600.2020.01070) [2020.01070. 1, 3, 4, 5, 20, 21, 24](https://doi.org/10.1109/CVPR42600.2020.01070) Saining Xie, Ross B. Girshick, Piotr Doll´ar, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In 2017 IEEE Conference on Computer Vision and _Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pp. 5987–5995. IEEE_ Computer Society, 2017. doi: 10.1109/CVPR.2017.634. [URL https://doi.org/10.](https://doi.org/10.1109/CVPR.2017.634) [1109/CVPR.2017.634. 4, 5, 21](https://doi.org/10.1109/CVPR.2017.634) Sang Michael Xie, Ananya Kumar, Robbie Jones, Fereshte Khani, Tengyu Ma, and Percy Liang. Inn-out: Pre-training and self-training using auxiliary information for out-of-distribution robustness. _arXiv preprint arXiv:2012.04550, 2020b. 2, 35_ Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In Richard C. Wilson, Edwin R. Hancock, and William A. P. Smith (eds.), Proceedings of the British Machine Vision Conference _[2016, BMVC 2016, York, UK, September 19-22, 2016. BMVA Press, 2016. URL http://www.](http://www.bmva.org/bmvc/2016/papers/paper087/index.html)_ [bmva.org/bmvc/2016/papers/paper087/index.html. 4, 5](http://www.bmva.org/bmvc/2016/papers/paper087/index.html) Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In 5th International Conference on Learning _Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings._ [OpenReview.net, 2017. URL https://openreview.net/forum?id=Sy8gdB9xx. 3](https://openreview.net/forum?id=Sy8gdB9xx) ----- Marvin Zhang, Sergey Levine, and Chelsea Finn. 
MEMO: Test-time robustness via adaptation and augmentation. ArXiv preprint, abs/2110.09506, 2021. 2, 26

Zhilu Zhang and Mert R. Sabuncu. Generalized cross entropy loss for training deep neural networks with noisy labels. In Samy Bengio, Hanna M. Wallach, Hugo Larochelle, Kristen Grauman, Nicolò Cesa-Bianchi, and Roman Garnett (eds.), Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada, pp. 8792–8802, 2018. URL https://proceedings.neurips.cc/paper/2018/hash/f2925f97bc13ad2852a7a551802feea0-Abstract.html. 3

Barret Zoph, Golnaz Ghiasi, Tsung-Yi Lin, Yin Cui, Hanxiao Liu, Ekin D Cubuk, and Quoc V Le. Rethinking pre-training and self-training. In NeurIPS, 2020. 2, 35

Yang Zou, Zhiding Yu, BVK Kumar, and Jinsong Wang. Unsupervised domain adaptation for semantic segmentation via class-balanced self-training. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 289–305, 2018. 2, 35

Yang Zou, Zhiding Yu, Xiaofeng Liu, BVK Kumar, and Jinsong Wang. Confidence regularized self-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5982–5991, 2019. 2, 35

-----

A A TWO-POINT MODEL OF SELF-LEARNING

A.1 DEFINITION OF THE TWO-POINT MODEL

To understand the learning dynamics and properties of different loss functions and their hyperparameters, we propose a simple model of self-learning, both for entropy minimization and pseudo-labeling. A student network $w^s \in \mathbb{R}^d$ and a teacher network $w^t \in \mathbb{R}^d$ are trained on $N$ data points $\{x_i\}_{i=1}^{N}$ with the cross-entropy loss function $L$ defined as
$$L = \sum_{i=1}^{N} \ell(x_i), \qquad \ell(x_i) = -\sigma_t(x_i^\top w^t)\,\log \sigma_s(x_i^\top w^s) - \sigma_t(-x_i^\top w^t)\,\log \sigma_s(-x_i^\top w^s), \tag{6}$$
where $\sigma_t(z) = \frac{1}{1 + e^{-z/\tau_t}}$ and $\sigma_s(z) = \frac{1}{1 + e^{-z/\tau_s}}$. Here, $\tau_s$ and $\tau_t$ denote the student and teacher temperature parameters. With stop gradient, student and teacher evolve in time according to
$$\dot{w}^s = -\nabla_{w^s} L(w^s, w^t), \qquad \dot{w}^t = \alpha\,(w^s - w^t), \tag{7}$$
where $\alpha$ is the learning rate of the teacher. Without stop gradient, student and teacher are set equal to each other, and they evolve as
$$\dot{w} = -\nabla_{w} L(w), \qquad \text{where } w^s = w^t = w. \tag{8}$$
We restrict the theoretical analysis to the time evolution of the components of $w^{s,t}$ in the direction of two data points $x_k$ and $x_l$, $y_k^{s,t} \equiv x_k^\top w^{s,t}$ and $y_l^{s,t} \equiv x_l^\top w^{s,t}$. All other components $y_i^{s,t}$ with $i \neq k, l$ are neglected to reduce the dimensionality of the equation system. It turns out that the resulting model captures the neural network dynamics quite well despite the drastic simplification of taking only two data points into account (see Figure 2).
$$\begin{aligned}
\text{with stop gradient:} \quad & \dot{y}_k^s = -x_k^\top \nabla_{w^s}\big(\ell(x_k) + \ell(x_l)\big), \qquad \dot{y}_l^s = -x_l^\top \nabla_{w^s}\big(\ell(x_k) + \ell(x_l)\big), \\
& \dot{y}_k^t = \alpha\,(y_k^s - y_k^t), \qquad \dot{y}_l^t = \alpha\,(y_l^s - y_l^t), \\
\text{without stop gradient:} \quad & \dot{y}_k = -x_k^\top \nabla_{w}\big(\ell(x_k) + \ell(x_l)\big), \qquad \dot{y}_l = -x_l^\top \nabla_{w}\big(\ell(x_k) + \ell(x_l)\big).
\end{aligned} \tag{9}$$
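For readers who prefer code, the loss in equation 6 can be transcribed directly. The following PyTorch sketch is our own shorthand for the binary two-point setting (with logits $y^s = x^\top w^s$ and $y^t = x^\top w^t$); the function name is ours, and the teacher is detached to mimic the stop-gradient variant.

```python
import torch
import torch.nn.functional as F


def two_point_loss(y_s, y_t, tau_s=1.0, tau_t=1.0):
    """Soft pseudo-labeling cross-entropy of equation 6 for binary logits.

    y_s, y_t: student and teacher logits x^T w^s and x^T w^t (1-d tensors).
    The teacher is detached, corresponding to the stop-gradient variant."""
    y_t = y_t.detach()
    teacher_pos = torch.sigmoid(y_t / tau_t)        # sigma_t(y^t)
    teacher_neg = torch.sigmoid(-y_t / tau_t)       # sigma_t(-y^t)
    log_student_pos = F.logsigmoid(y_s / tau_s)     # log sigma_s(y^s)
    log_student_neg = F.logsigmoid(-y_s / tau_s)    # log sigma_s(-y^s)
    return -(teacher_pos * log_student_pos + teacher_neg * log_student_neg).sum()
```

In the limit $\tau_t \to 0$, the teacher weights become hard 0/1 labels, recovering the pseudo-labeling case.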
A.2 PROOF OF PROPOSITION 1

**Learning dynamics with stop gradient.** Computing the stop gradient evolution defined in equation 7 explicitly yields
$$\dot{w}^s = -\nabla_{w^s} L = \frac{1}{\tau_s} \sum_{i=1}^{N} \Big[ \sigma_t(x_i^\top w^t)\,\sigma_s(-x_i^\top w^s) - \sigma_t(-x_i^\top w^t)\,\sigma_s(x_i^\top w^s) \Big]\, x_i, \qquad \dot{w}^t = \alpha\,(w^s - w^t). \tag{10}$$
The second equality uses the well-known derivative of the sigmoid function, $\partial_z \sigma(z) = \sigma(z)\,\sigma(-z)$.

The equation system of $2d$ nonlinear, coupled ODEs for $w^s \in \mathbb{R}^d$ and $w^t \in \mathbb{R}^d$ in equation 10 is analytically difficult to analyze. Instead of studying the ODEs directly, we act on them with the data points $x_k^\top$, $k = 1, \ldots, N$, and investigate the dynamics of the components $x_k^\top w^{s,t} \equiv y_k^{s,t}$:
$$\dot{y}_k^s = \frac{1}{\tau_s} \sum_{i=1}^{N} (x_i^\top x_k) \Big[ \sigma_t(y_i^t)\,\sigma_s(-y_i^s) - \sigma_t(-y_i^t)\,\sigma_s(y_i^s) \Big], \qquad \dot{y}_k^t = \alpha\,(y_k^s - y_k^t). \tag{11}$$
The learning rate of each mode $y_k^s$ is scaled by $(x_k^\top x_i)$, which is much larger for $i = k$ than for $i \neq k$ in high-dimensional spaces. In the two-point approximation, we consider only the two (in absolute value) largest terms $i = k, l$ for a given $k$ in the sum in equation 11. Any changes that $y_k^{s,t}(t)$ and $y_l^{s,t}(t)$ might induce in other modes $y_i^{s,t}(t)$ are neglected, and so we are left with only four ODEs:
$$\begin{aligned}
\dot{y}_k^s &= \frac{1}{\tau_s}\,\|x_k\|^2 \Big[ \sigma_t(y_k^t)\sigma_s(-y_k^s) - \sigma_t(-y_k^t)\sigma_s(y_k^s) \Big] + \frac{1}{\tau_s}\,(x_k^\top x_l) \Big[ \sigma_t(y_l^t)\sigma_s(-y_l^s) - \sigma_t(-y_l^t)\sigma_s(y_l^s) \Big], \\
\dot{y}_l^s &= \frac{1}{\tau_s}\,\|x_l\|^2 \Big[ \sigma_t(y_l^t)\sigma_s(-y_l^s) - \sigma_t(-y_l^t)\sigma_s(y_l^s) \Big] + \frac{1}{\tau_s}\,(x_k^\top x_l) \Big[ \sigma_t(y_k^t)\sigma_s(-y_k^s) - \sigma_t(-y_k^t)\sigma_s(y_k^s) \Big], \\
\dot{y}_k^t &= \alpha\,(y_k^s - y_k^t), \qquad \dot{y}_l^t = \alpha\,(y_l^s - y_l^t).
\end{aligned} \tag{12}$$
The fixed points of equation 12 satisfy
$$\dot{y}_k^s = \dot{y}_l^s = \dot{y}_k^t = \dot{y}_l^t = 0. \tag{13}$$
For $\alpha > 0$, requiring $\dot{y}_k^t = \dot{y}_l^t = 0$ implies that $y_k^s = y_k^t$ and $y_l^s = y_l^t$. For $\tau_s = \tau_t$, the two remaining equations $\dot{y}_k^s = \dot{y}_l^s = 0$ then vanish automatically, so that there are no non-trivial two-point learning dynamics. For $\tau_s \neq \tau_t$, there is a fixed point at $y_k^{s,t} = y_l^{s,t} = 0$ since at this point, each bracket in equation 12 vanishes individually:
$$\Big[ \sigma_t(y_{k,l})\,\sigma_s(-y_{k,l}) - \sigma_t(-y_{k,l})\,\sigma_s(y_{k,l}) \Big]\Big|_{y_{k,l}=0} = \tfrac{1}{4} - \tfrac{1}{4} = 0. \tag{14}$$
At the fixed point $y_k^{s,t} = y_l^{s,t} = 0$, $w^s$ and $w^t$ are orthogonal to both $x_k$ and $x_l$ and hence classification fails. If this fixed point is stable, $w^s$ and $w^t$ will stay at the fixed point once they have reached it, i.e., the model collapses. The fixed point is stable when all eigenvalues of the Jacobian $J$ of the ODE system in equation 12, evaluated at $y_k^{s,t} = y_l^{s,t} = 0$, are negative.
This is the case whenever $\tau_s < \tau_t$:
$$J\Big|_{y_k^{s,t} = y_l^{s,t} = 0} = \begin{pmatrix}
\frac{\|x_k\|^2}{4}\big(\frac{1}{\tau_t} - \frac{1}{\tau_s}\big) & \frac{x_k^\top x_l}{4}\big(\frac{1}{\tau_t} - \frac{1}{\tau_s}\big) & 0 & 0 \\
\frac{x_k^\top x_l}{4}\big(\frac{1}{\tau_t} - \frac{1}{\tau_s}\big) & \frac{\|x_l\|^2}{4}\big(\frac{1}{\tau_t} - \frac{1}{\tau_s}\big) & 0 & 0 \\
\alpha & 0 & -\alpha & 0 \\
0 & \alpha & 0 & -\alpha
\end{pmatrix},$$
$$\text{eigenvalues:} \quad \lambda_1 = \lambda_2 = -\alpha < 0, \qquad \lambda_{3,4} = \frac{1}{8}\Big(\frac{1}{\tau_t} - \frac{1}{\tau_s}\Big)\Big( \|x_k\|^2 + \|x_l\|^2 \pm \sqrt{\|x_k\|^4 + \|x_l\|^4 - 2\|x_k\|^2\|x_l\|^2 + 4(x_k^\top x_l)^2} \Big). \tag{15}$$
The square root is non-negative and bounded from above by $\|x_k\|^2 + \|x_l\|^2$ (with equality if $x_k = \pm x_l$), so both $\lambda_3$ and $\lambda_4$ carry the sign of $\frac{1}{\tau_t} - \frac{1}{\tau_s}$ and are negative exactly when $\tau_s < \tau_t$. To sum up, training with stop gradient and $\tau_s > \tau_t$ avoids a collapse of the two-point model to the trivial representation $y_k^{s,t} = y_l^{s,t} = 0$ since the fixed point is not stable in this parameter regime.

**Learning dynamics without stop gradient.** Without stop gradient, we set $w^t = w^s \equiv w$, which leads to an additional term in the gradient:
$$\dot{w} = -\nabla_w L = \frac{1}{\tau_s} \sum_{i=1}^{N} \Big[ \sigma_t(x_i^\top w)\,\sigma_s(-x_i^\top w) - \sigma_t(-x_i^\top w)\,\sigma_s(x_i^\top w) \Big] x_i + \frac{1}{\tau_t} \sum_{i=1}^{N} \sigma_t(x_i^\top w)\,\sigma_t(-x_i^\top w)\, \underbrace{\Big[ \log \sigma_s(x_i^\top w) - \log \sigma_s(-x_i^\top w) \Big]}_{= \,\log\big((1+e^{y_i/\tau_s})/(1+e^{-y_i/\tau_s})\big)\, =\, y_i/\tau_s}\, x_i. \tag{16}$$
As before, we focus on the evolution of the two components $y_k = w^\top x_k$ and $y_l = w^\top x_l$:
$$\begin{aligned}
\dot{y}_k &= \|x_k\|^2 \Big[ \tfrac{1}{\tau_s}\big( \sigma_t(y_k)\sigma_s(-y_k) - \sigma_t(-y_k)\sigma_s(y_k) \big) + \tfrac{1}{\tau_s \tau_t}\, \sigma_t(y_k)\sigma_t(-y_k)\, y_k \Big] \\
&\quad + (x_k^\top x_l) \Big[ \tfrac{1}{\tau_s}\big( \sigma_t(y_l)\sigma_s(-y_l) - \sigma_t(-y_l)\sigma_s(y_l) \big) + \tfrac{1}{\tau_s \tau_t}\, \sigma_t(y_l)\sigma_t(-y_l)\, y_l \Big], \\
\dot{y}_l &= \|x_l\|^2 \Big[ \tfrac{1}{\tau_s}\big( \sigma_t(y_l)\sigma_s(-y_l) - \sigma_t(-y_l)\sigma_s(y_l) \big) + \tfrac{1}{\tau_s \tau_t}\, \sigma_t(y_l)\sigma_t(-y_l)\, y_l \Big] \\
&\quad + (x_k^\top x_l) \Big[ \tfrac{1}{\tau_s}\big( \sigma_t(y_k)\sigma_s(-y_k) - \sigma_t(-y_k)\sigma_s(y_k) \big) + \tfrac{1}{\tau_s \tau_t}\, \sigma_t(y_k)\sigma_t(-y_k)\, y_k \Big].
\end{aligned} \tag{17}$$
There is a fixed point at $y_k = y_l = 0$ where each bracket in equation 17 vanishes individually,
$$\Big[ \tfrac{1}{\tau_s}\big( \sigma_t(y_{k,l})\sigma_s(-y_{k,l}) - \sigma_t(-y_{k,l})\sigma_s(y_{k,l}) \big) + \tfrac{1}{\tau_s \tau_t}\, \sigma_t(y_{k,l})\sigma_t(-y_{k,l})\, y_{k,l} \Big]\Big|_{y_{k,l}=0} = 0. \tag{18}$$
The Jacobian of the ODE system in equation 17 and its eigenvalues evaluated at the fixed point are given by
$$J\Big|_{y_k = y_l = 0} = \frac{1}{4\tau_s}\Big(\frac{2}{\tau_t} - \frac{1}{\tau_s}\Big) \begin{pmatrix} \|x_k\|^2 & x_k^\top x_l \\ x_k^\top x_l & \|x_l\|^2 \end{pmatrix}, \qquad \lambda_{1,2} = \frac{1}{8\tau_s}\Big(\frac{2}{\tau_t} - \frac{1}{\tau_s}\Big)\Big( \|x_k\|^2 + \|x_l\|^2 \pm \sqrt{\|x_k\|^4 + \|x_l\|^4 - 2\|x_k\|^2\|x_l\|^2 + 4(x_k^\top x_l)^2} \Big). \tag{19}$$
Hence the fixed point is unstable when $\tau_s > \tau_t/2$, and thus the model without stop gradient does not collapse onto $y_k = y_l = 0$ in this regime.

A.3 SIMULATION OF THE TWO-POINT MODEL

For visualization purposes in the main paper, we set $w^s = w^t = [0.5, 0.5]^\top$ and train the model using instant gradient updates on the dataset with points $x_1 = [1, 0]$ and $x_2 = [0, -1]$ using SGD with learning rate 0.1 and momentum 0.9. We varied student and teacher temperatures on a log scale with 250 points from $10^{-3}$ to $10$. Qualitatively similar results can be obtained without momentum training at higher learning rates (most likely due to the implicit learning rate scaling introduced by the momentum term).
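The simulation described above is straightforward to reproduce. The following NumPy/SciPy sketch is our own re-implementation of the stop-gradient variant: the dataset, initialization, learning rate, and momentum follow the description above, while the teacher rate `alpha`, the number of steps, and all function names are illustrative choices rather than the exact script used for the figures.

```python
import numpy as np
from scipy.special import expit


def sigma(z, tau):
    """Temperature-scaled sigmoid from equation 6."""
    return expit(z / tau)


def student_grad(w_s, w_t, xs, tau_s, tau_t):
    """Gradient of the two-point loss w.r.t. the student weights (teacher detached)."""
    g = np.zeros_like(w_s)
    for x in xs:
        y_s, y_t = x @ w_s, x @ w_t
        # the bracket below equals -d(ell)/d(w_s) along x (cf. equation 10),
        # so subtracting it accumulates the gradient of the loss
        g -= (sigma(y_t, tau_t) * sigma(-y_s, tau_s)
              - sigma(-y_t, tau_t) * sigma(y_s, tau_s)) / tau_s * x
    return g


def run_two_point(tau_s, tau_t, lr=0.1, momentum=0.9, alpha=0.1, steps=5000):
    """Simulate the stop-gradient dynamics with SGD + momentum; return the final margins."""
    xs = [np.array([1.0, 0.0]), np.array([0.0, -1.0])]
    w_s = np.array([0.5, 0.5])
    w_t = w_s.copy()
    v = np.zeros_like(w_s)
    for _ in range(steps):
        g = student_grad(w_s, w_t, xs, tau_s, tau_t)
        v = momentum * v - lr * g         # heavy-ball update for the student
        w_s = w_s + v
        w_t = w_t + alpha * (w_s - w_t)   # discretized teacher dynamics (equation 7)
    return [x @ w_s for x in xs]          # margins y_1, y_2; a collapse drives both to ~0


# the first setting lies in the collapse region predicted by equation 15 (tau_s < tau_t),
# the second one does not (tau_s > tau_t):
print(run_two_point(tau_s=0.1, tau_t=1.0))
print(run_two_point(tau_s=1.0, tau_t=0.1))
```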
Note that the temperature scales for observing the collapse effect depend on the learning rate and the exact training strategy: lower learning rates can empirically prevent the model from collapsing and shift the convergence region. The result in Figure 2 will hence depend on the exact choice of learning rate (which is currently not considered in our continuous-time evolution theory), while the predicted region without collapse is robust to details of the optimization. To visualize the impact of different hyperparameters, we show variants of the two-point model with different learning rates using gradient descent with (Figure 3) and without momentum (Figure 4), and with different start conditions (Figure 5), which all influence the regions where the model degrades, but not the stable regions predicted by our theory.

-----

[Figures 3–5 show error heatmaps over the student and teacher temperatures on a log scale, one panel per learning rate (10, 0.1, 0.01, 0.001) and for PL, with the reference lines τt = τs and τt = 2τs overlaid; the color scale encodes the error from 0% to 100%.]

Figure 3: Entropy minimization (top). Training the two-point model with momentum 0.9 and different learning rates with initialization $w^s = w^t = [0.5, 0.5]^\top$.

Figure 4: Training the two-point model without momentum and different learning rates with initialization $w^s = w^t = [0.5, 0.5]^\top$. Note that especially for lower learning rates, longer training would increase the size of the collapsed region.

Figure 5: Training the two-point model with momentum 0.9 and different learning rates with initialization $w^s = w^t = [0.6, 0.3]^\top$.

-----

B ADDITIONAL INFORMATION ON USED MODELS

B.1 DETAILS ON ALL HYPERPARAMETERS WE TESTED FOR DIFFERENT MODELS

For all models except EfficientNet-L2, we adapt the batch norm statistics to the test domains following Schneider et al. (2020). We do not expect significant gains from combining EfficientNet-L2 with batch norm adaptation: as demonstrated in Schneider et al. (2020), models trained with large amounts of weakly labeled data do not seem to benefit from batch norm adaptation.

**ResNet50 models** We use a vanilla ResNet50 model and compare soft- and hard-labeling against entropy minimization and robust pseudo-labeling. To find optimal hyperparameters for all methods, we perform an extensive evaluation and test (i) three different adaptation mechanisms, (ii) the learning rates $1.0 \times 10^{-4}$, $1.0 \times 10^{-3}$, $1.0 \times 10^{-2}$ and $5.0 \times 10^{-2}$, (iii) the number of training epochs, and (iv) updating the teacher after each epoch or after each iteration. For all experiments, we use a batch size of 128. The hyperparameter search is performed on IN-C dev. We then use the optimal hyperparameters to evaluate the methods on the IN-C test set.

**ResNeXt101 models** The ResNeXt101 model is considerably larger than the ResNet50 model, and we therefore limit the number of ablation studies we perform for this architecture. Besides a baseline, we include a state-of-the-art robust version trained with DeepAugment+Augmix (DAug+AM, Hendrycks et al., 2020a) and a version that was trained on 3.5 billion weakly labeled images (IG-3.5B, Mahajan et al., 2018).
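Since entropy minimization (ENT) and robust pseudo-labeling (RPL) recur throughout the remainder of this appendix, we also give minimal PyTorch versions of the two objectives as we understand and use them in this comparison. The function names are ours, and the teacher forward pass that produces the hard pseudo-labels is assumed to happen outside these functions.

```python
import torch
import torch.nn.functional as F


def entropy_minimization_loss(logits):
    """ENT: mean entropy of the softmax predictions over the batch."""
    log_probs = F.log_softmax(logits, dim=1)
    probs = log_probs.exp()
    return -(probs * log_probs).sum(dim=1).mean()


def robust_pseudo_labeling_loss(logits, pseudo_labels, q=0.8):
    """RPL: generalized cross-entropy (1 - p_y^q) / q on hard pseudo-labels
    (Zhang & Sabuncu, 2018), which is less sensitive to incorrect pseudo-labels
    than the standard cross-entropy."""
    probs = F.softmax(logits, dim=1)
    p_y = probs.gather(1, pseudo_labels.view(-1, 1)).squeeze(1)
    return ((1.0 - p_y.pow(q)) / q).mean()


# one adaptation step with hard labels from a (possibly frozen) teacher:
# with torch.no_grad():
#     pseudo_labels = teacher(images).argmax(dim=1)
# loss = robust_pseudo_labeling_loss(student(images), pseudo_labels, q=0.8)
```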
We only test the two leading methods on the ResNeXt101 models (ENT and RPL). We vary the learning rate in same interval as for the ResNet50 model but scale it down linearly to account for the smaller batch size of 32. We only train the affine batch normalization parameters because adapting only these parameters leads to the best results on ResNet50 and is much more resource efficient than adapting all model parameters. Again, the hyperparameter search is performed only on the development corruptions of IN-C. We then use the optimal hyperparameters to evaluate the methods on the IN-C test set. **EfficientNet-L2 models** The current state of the art on IN, IN-C, IN-R and IN-A is an EfficientNet-L2 trained on 300 million images from JFT-300M (Chollet, 2017; Hinton et al., 2014) using a noisy student-teacher protocol (Xie et al., 2020a). We adapt this model for only one epoch due to resource constraints. During the hyperparameter search, we only evaluate three corruptions on the IN-C development set[2] and test the learning rates 4.6 × 10[−][2], 4.6 × 10[−][3], 4.6 × 10[−][4] and 4.6 × 10[−][5]. We use the optimal hyperparameters to evaluate ENT and RPL on the full IN-C test set (with all severity levels). **UDA-SS models** We trained the models using the scripts from the official code base at github. com/yueatsprograms/uda release. We used the provided scripts for the cases: (a) source: CIFAR10, target: STL10 and (b) source: MNIST, target: MNIST-M. For the case (c) source: CIFAR10, target: CIFAR10-C, we used the hyperparameters from case (a) since this case seemed to be the closest match to the new setting. We think that the baseline performance of the UDA-SS models can be further improved with hyperparameter tuning. **DANN models** To train models with the DANN-method, we used the PyTorch implementation [of this paper at https://github.com/fungtion/DANN py3. The code base only provides scripts and](https://github.com/fungtion/DANN_py3) hyperparameters for the case (b) source: MNIST, target: MNIST-M. For the cases (a) and (c), we used the same optimizer and trained the model for 100 epochs. We think that the baseline performance of the DANN models can be further improved with hyperparameter tuning. **Preprocessing** For IN, IN-R, IN-A and IN-D, we resize all images to 256 × 256 px and take the center 224 × 224 px crop. The IN-C images are already rescaled and cropped. We center and re-scale the color values with µRGB = [0.485, 0.456, 0.406] and σRGB = [0.229, 0.224, 0.225]. For the EfficientNet-L2, we follow the procedure in Xie et al. (2020a) and rescale all inputs to a resolution of 507 × 507 px and then center-crop them to 475 × 475 px. 2We compare the results of computing the dev set on the 1, 3 and 5 severities versus the 1, 2, 3, 4 and 5 severities on our ResNeXt101 model in the Supplementary material. ----- B.2 FULL LIST OF USED MODELS **ImageNet scale models** ImageNet trained models (ResNet50, DenseNet161, ResNeXt) are taken directly from torchvision (Marcel & Rodriguez, 2010). The model variants trained with [DeepAugment and AugMix augmentations (Hendrycks et al., 2020b;a) are taken from https:](https://github.com/hendrycks/imagenet-r) [//github.com/hendrycks/imagenet-r. The weakly-supervised ResNeXt101 model is taken from the](https://github.com/hendrycks/imagenet-r) PyTorch Hub. For EfficientNet (Tan & Le, 2019), we use the PyTorch re-implementation available [at https://github.com/rwightman/gen-efficientnet-pytorch. 
This is a verified re-implementation of](https://github.com/rwightman/gen-efficientnet-pytorch) the original work by Xie et al. (2020a). We verify the performance on ImageNet, yielding a 88.23% top-1 accuracy and 98.546% top-5 accuracy which is within 0.2% points of the originally reported result (Xie et al., 2020a). On ImageNet-C, our reproduced baseline achieves 28.9% mCE vs. 28.3% mCE originally reported by Xie et al. (2020a). As noted in the re-implementation, this offset is possible due to minor differences in the pre-processing. It is possible that our adaptation results would improve further when applied on the original codebase by Xie et al.. **Small scale models** We train the UDA-SS models using the original code base at github.com/ yueatsprograms/uda release, with the hyperparameters given in the provided bash scripts. For our DANN experiments, we use the PyTorch implementation at github.com/fungtion/DANN py3. We use the hyperparameters in the provided bash scripts. The following Table 11 contains all models we evaluated on various datasets with references and links to the corresponding source code. Table 11: Model checkpoints used for our experiments. Model Source WideResNet(28,10) (Croce et al., 2020) [https://github.com/RobustBench/robustbench/tree/master/robustbench](https://github.com/RobustBench/robustbench/tree/master/robustbench) WideResNet(40,2)+AugMix (Croce et al., 2020) [https://github.com/RobustBench/robustbench/tree/master/robustbench](https://github.com/RobustBench/robustbench/tree/master/robustbench) ResNet50 (He et al., 2016b) [https://github.com/pytorch/vision/tree/master/torchvision/models](https://github.com/pytorch/vision/tree/master/torchvision/models) ResNeXt101, 32×8d (He et al., 2016b) [https://github.com/pytorch/vision/tree/master/torchvision/models](https://github.com/pytorch/vision/tree/master/torchvision/models) DenseNet (Huang et al., 2017) [https://github.com/pytorch/vision/tree/master/torchvision/models](https://github.com/pytorch/vision/tree/master/torchvision/models) ResNeXt101, 32×8d (Xie et al., 2017) [https://pytorch.org/hub/facebookresearch WSL-Images resnext/](https://pytorch.org/hub/facebookresearch_WSL-Images_resnext/) ResNet50+DeepAugment+AugMix (Hendrycks et al., 2020a) [https://github.com/hendrycks/imagenet-r](https://github.com/hendrycks/imagenet-r) ResNext101 (Hendrycks et al., 2020a) [https://github.com/hendrycks/imagenet-r](https://github.com/hendrycks/imagenet-r) ResNext101 32×8d IG-3.5B (Mahajan et al., 2018) [https://github.com/facebookresearch/WSL-Images/blob/master/hubconf.py](https://github.com/facebookresearch/WSL-Images/blob/master/hubconf.py) Noisy Student EfficientNet-L2 (Xie et al., 2020a) [https://github.com/rwightman/gen-efficientnet-pytorch](https://github.com/rwightman/gen-efficientnet-pytorch) ViT-S/16 (Caron et al., 2021) [https://github.com/facebookresearch/dino](https://github.com/facebookresearch/dino) ----- C DETAILED AND ADDITIONAL RESULTS ON IN-C C.1 DEFINITION OF THE MEAN CORRUPTION ERROR (MCE) The established performance metric on IN-C is the mean Corruption Error (mCE), which is obtained by normalizing the model’s top-1 errors with the top-1 errors of AlexNet across the C=15 test corruptions and S=5 severities: _C_ _S_ _s=1_ [err]c,s[model] mCE(model) = C[1] _S_ _._ (20) _c=1_ Ps=1 [err]c,s[AlexNet] X The AlexNet errors used for normalization are shown in Table 12.P Category Corruption top1 error Gaussian Noise 0.886428 Noise Shot Noise 0.894468 Impulse Noise 0.922640 Defocus Blur 0.819880 Glass Blur 
0.826268 Motion Blur 0.785948 Zoom Blur 0.798360 Snow 0.866816 Frost 0.826572 Fog 0.819324 Brightness 0.564592 Contrast 0.853204 Elastic Transform 0.646056 Pixelate 0.717840 JPEG Compression 0.606500 Blur Weather Digital Hold-out Noise Speckle Noise 0.845388 Hold-out Digital Saturate 0.658248 Hold-out Blur Gaussian Blur 0.787108 Hold-out Weather Spatter 0.717512 Table 12: AlexNet top1 errors on ImageNet-C C.2 DETAILED RESULTS FOR TUNING EPOCHS AND LEARNING RATES We tune the learning rate for all models and the number of training epochs for all models except the EfficientNet-L2. In this section, we present detailed results for tuning these hyperparameters for all considered models. The best hyperparameters that we found in this analysis, are summarized in Table 17. Table 13: mCE in % on the IN-C dev set for ENT and RPL for different numbers of training epochs when adapting the affine batch norm parameters of a ResNet50 model. criterion ENT RPL lr 10[−][4] 10[−][3] 10[−][2] 10[−][4] 10[−][3] 10[−][2] epoch 0 60.2 60.2 60.2 60.2 60.2 60.2 1 54.3 **50.0** 72.5 57.4 51.1 52.5 2 52.4 50.9 96.5 55.8 49.6 57.4 3 51.5 51.0 112.9 54.6 49.2 64.2 4 51.0 52.4 124.1 53.7 49.0 71.0 5 50.7 53.5 131.2 52.9 **48.9** 76.3 6 50.7 53.5 131.2 52.9 48.9 76.3 Table 14: mCE (↘) in % on the IN-C dev set for different learning rates for EfficientNetL2. We favor q = 0.8 over q = 0.7 due to slightly improved robustness to changes in the learning rate in the worst case error setting. lr (4.6 ×) base 10[−][3] 10[−][4] 10[−][5] 10[−][6] ENT 25.5 87.8 25.3 **22.2** 24.1 RPLq=0.7 25.5 60.3 **21.3** 23.3 n/a RPLq=0.8 25.5 58.2 **21.4** 23.4 n/a ----- Table 17: The best hyperparameters for all models that we found on IN-C. For all models, we fine-tune only the affine batch normalization parameters and use q = 0.8 for RPL. The small batchsize for the EfficientNet model is due to hardware limitations. number of Model Method Learning rate batch size epochs vanilla ResNet50 ENT 1 × 10[−][3] 128 vanilla ResNet50 RPL 1 × 10[−][3] 128 vanilla ResNeXt101 ENT 2.5 × 10[−][4] 128 vanilla ResNeXt101 RPL 2.5 × 10[−][4] 128 IG-3.5B ResNeXt101 ENT 2.5 × 10[−][4] 128 IG-3.5B ResNeXt101 RPL 2.5 × 10[−][3] 128 DAug+AM ResNeXt101 ENT 2.5 × 10[−][4] 128 DAug+AM ResNeXt101 RPL 2.5 × 10[−][4] 128 EfficientNet-L2 ENT 4.6 × 10[−][5] 8 EfficientNet-L2 RPL 4.6 × 10[−][4] 8 Table 15: mCE in % on IN-C dev for entropy minimization for different learning rates and training epochs for ResNeXt101. (div.=diverged) ENT Baseline IG-3.5B DAug+AM lr 2.5 × 1e-4 1e-3 5e-3 1e-4 1e-3 5e-3 1e-4 1e-3 5e-3 epoch Table 16: mCE in % on IN-C dev for robust pseudolabeling for different learning rates and training epochs for ResNeXt101. (div.=diverged) RPL Baseline IG-3.5B DAug+AM lr 2.5× 1e-4 1e-3 5e-3 1e-4 1e-3 5e-3 1e-4 1e-3 5e-3 epoch BASE 53.6 53.6 53.6 47.4 47.4 47.4 37.4 37.4 37.4 1 **43.0** 92.2 div. 40.9 40.4 58.6 **35.4** 46.4 div. 2 44.8 118.4 div. 39.8 41.5 69.5 35.5 90.8 div. 3 45.4 131.9 div. 39.3 42.6 76.1 35.5 122.5 div. 4 46.7 div. div. **39.1** 44.2 84.3 35.6 133.8 div. BASE 53.6 53.6 53.6 47.4 47.4 47.4 37.4 37.4 37.4 1 43.4 51.3 div. 45.0 39.9 43.6 35.3 35.1 79.1 2 42.3 63.2 div. 43.4 **39.3** 48.2 34.9 35.6 121.2 3 42.0 72.6 div. 42.4 39.4 52.9 34.7 40.1 133.5 4 **42.0** 72.6 div. 42.4 39.4 52.9 **34.7** 40.1 133.5 C.3 DETAILED RESULTS FOR ALL IN-C CORRUPTIONS We outline detailed results for all corruptions and models in Table 18. Performance across the severities in the dataset is depicted in Figure 6. 
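As a reference for how the numbers in these tables are aggregated, the mCE of equation 20 can be computed with a few lines of plain Python. The sketch below assumes the per-corruption, per-severity top-1 errors of the evaluated model and of AlexNet are already available as nested dictionaries (the AlexNet values follow Table 12 and the usual per-severity reference numbers); the constant and function names are ours.

```python
# Minimal sketch of equation 20: err["gaussian_noise"][3] holds the top-1 error
# (in [0, 1]) for the given corruption at severity 3, for the model and for AlexNet.

TEST_CORRUPTIONS = [
    "gaussian_noise", "shot_noise", "impulse_noise", "defocus_blur", "glass_blur",
    "motion_blur", "zoom_blur", "snow", "frost", "fog", "brightness", "contrast",
    "elastic_transform", "pixelate", "jpeg_compression",
]
SEVERITIES = (1, 2, 3, 4, 5)


def corruption_error(model_err, alexnet_err, corruption):
    """CE_c: model errors summed over severities, normalized by the AlexNet errors."""
    num = sum(model_err[corruption][s] for s in SEVERITIES)
    den = sum(alexnet_err[corruption][s] for s in SEVERITIES)
    return num / den


def mean_corruption_error(model_err, alexnet_err):
    """mCE: average of the normalized corruption errors over the 15 test corruptions."""
    ces = [corruption_error(model_err, alexnet_err, c) for c in TEST_CORRUPTIONS]
    return sum(ces) / len(ces)
```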
All detailed results presented here are obtained by following the model selection protocol outlined in the main text. RN50 2 3 Severity RNx101 3 Severity RNx101 IG-3.5B 2 3 4 Severity RNx101 DeepAug+Augmix 1 2 3 4 5 Severity Noisy Student L2 2 3 4 ENT RPL Base Severity 80 60 40 20 Figure 6: Severity-wise mean corruption error (normalized using the average AlexNet baseline error for each corruption) for ResNet50 (RN50), ResNext101 (RNx101) variants and the Noisy Student L2 model. Especially for more robust models (DeepAugment+Augmix and Noisy Student L2), most gains are obtained across higher severities 4 and 5. For weaker models, the baseline variant (Base) is additionally substantially improved for smaller corruptions. ----- Table 18: Detailed results for each corruption along with mean corruption error (mCE) as reported in Table 2 in the main paper. We show (unnormalized) top-1 error rate averaged across 15 test corruptions along with the mean corruption error (mCE: which is normalized). Hyperparameter selection for both ENT and RPL was carried out on the dev corruptions as outlined in the main text. Mismatch in baseline mCE for EfficientNetL2 can be most likely attributed to pre-processing differences between the original tensorflow implementation Xie et al. (2020a) and the PyTorch reimplementation we employ. We start with slightly weaker baselines for ResNet50 and ResNext101 than Schneider et al. (2020): ResNet50 and ResNext101 results are slightly worse than previously reported results (typically 0.1% points) due to the smaller batch size of 128 and 32. Smaller batch sizes impact the quality of re-estimated batch norm statistics when computation is performed on the fly Schneider et al. (2020), which is of no concern here due to the large gains obtained by pseudo-labeling. 
gaussshot impulsedefocusglassmotionzoomsnow frost fog brightcontrastelasticpixelatejpeg **mCE** ResNet50 Baseline (Schneider et al., 2020) 62.2 Baseline (ours) 57.2 59.5 60.0 61.4 62.3 51.3 49.5 54.6 54.1 39.3 29.1 46.7 41.4 38.2 41.8 62.8 ENT 45.5 45.5 46.8 48.4 48.7 40.0 40.3 42.0 46.6 33.2 28.1 42.4 35.2 32.2 35.1 51.6 RPL 44.2 44.4 45.5 47.0 47.4 38.8 39.2 40.7 46.2 32.5 27.7 42.7 34.6 31.6 34.4 50.5 ResNeXt101 Baseline Baseline (Schneider et al., 2020) 56.7 Baseline (ours) 52.8 54.1 54.0 55.4 56.8 46.7 46.6 48.5 49.4 36.6 25.4 42.8 37.8 32.5 36.7 56.8 ENT 40.5 39.5 41.4 41.6 43.0 34.1 34.5 35.0 39.4 28.5 24.0 33.8 30.3 27.2 30.5 44.3 RPL 39.4 38.9 39.8 40.3 41.0 33.4 33.8 34.6 38.7 28.0 23.7 31.4 29.8 26.8 30.0 43.2 ResNeXt101 IG-3.5B Baseline (Schneider et al., 2020) 51.6 Baseline (ours) 50.7 51.5 53.1 54.2 55.5 45.5 44.7 41.7 42.0 28.1 20.1 33.8 35.4 27.8 33.9 51.8 ENT 38.6 38.3 40.4 41.4 41.5 33.8 33.6 32.2 34.6 24.1 19.7 26.3 27.6 24.2 27.9 40.8 RPL 39.1 39.2 40.8 42.1 42.4 33.7 33.5 31.8 34.7 23.9 19.6 26.1 27.5 23.8 27.5 40.9 ResNeXt101 DeepAug+Augmix Baseline (Schneider et al., 2020) 38.0 Baseline (ours) 30.0 30.0 30.2 32.9 35.5 28.9 31.9 33.3 32.8 29.5 22.6 28.4 31.2 23.0 26.5 38.1 ENT 28.7 28.5 29.0 29.8 30.9 26.9 28.0 29.3 30.5 26.2 23.2 26.3 28.5 23.7 26.0 35.5 RPL 28.1 27.8 28.3 29.1 30.1 26.3 27.4 28.8 29.8 25.9 22.7 25.6 27.9 23.2 25.4 34.8 Noisy Student L2 Baseline (Xie et al., 2020a) 28.3 Baseline (ours) 21.6 22.0 20.5 23.9 40.5 19.8 23.2 22.8 26.9 21.0 15.2 21.2 24.8 17.9 18.6 28.9 ENT 18.5 18.7 17.4 18.8 23.4 16.9 18.8 17.1 19.6 16.8 14.1 16.6 19.6 15.8 16.5 23.0 RPL 17.8 18.0 17.0 18.1 21.4 16.4 17.9 16.4 18.7 15.7 13.6 15.6 19.2 15.0 15.6 22.0 C.4 DETAILED RESULTS FOR THE CIFAR10-C AND UDA ADAPTATION Table 19: Detailed results for each corruption along with mean error on CIFAR10-C as reported in Table 2 in the main paper. WRN-28-10 vanilla Baseline 53.0 41.2 44.7 18.5 49.0 22.3 24.4 18.1 25.0 11.2 6.7 17.4 16.2 28.0 22.4 26.5 BN adapt 20.8 17.6 22.7 8.1 28.4 10.9 9.2 14.2 13.0 8.7 6.8 8.5 13.5 12.1 21.0 14.4 ENT 18.5 15.9 20.6 7.8 25.5 10.6 8.5 13.1 12.3 8.3 6.9 8.0 12.6 11.1 18.9 13.3 RPL 19.6 16.7 21.9 8.1 27.1 10.9 8.9 13.9 13.0 8.7 6.9 8.4 13.2 11.7 20.1 13.9 WRN-40-2 AM Baseline 19.1 14.0 13.3 6.3 17.1 7.9 7.0 10.4 10.6 8.5 5.9 9.7 9.2 16.8 11.9 11.2 BN adapt 14.1 11.9 13.9 7.2 17.6 8.7 7.9 10.8 10.6 9.0 6.8 9.0 10.9 10.1 14.0 10.8 TENT 10.8 9.1 10.9 6.0 13.4 7.2 6.3 8.4 7.8 7.1 5.7 7.1 9.2 7.4 11.2 8.5 RPL 12.4 10.5 12.4 6.5 15.6 7.8 6.9 9.5 9.1 8.2 6.2 8.3 9.9 8.8 12.8 9.7 WRN-26-16 UDA-SS Baseline 26.0 24.7 19.3 22.4 56.2 32.4 32.1 31.7 31.2 26.6 15.8 20.4 26.3 21.5 28.9 27.7 BN adapt 20.5 19.0 15.6 13.5 43.1 19.4 18.3 23.1 21.2 16.2 12.8 14.1 20.9 16.7 23.4 19.9 ENT 16.9 16.7 12.3 11.3 37.6 15.6 14.8 18.3 18.2 13.4 10.8 11.9 17.9 14.4 20.9 16.7 RPL 18.1 17.1 13.2 11.9 41.5 17.3 16.1 20.4 19.1 14.5 11.8 12.7 18.8 18.1 22.6 18.2 ----- Table 20: Detailed results for the UDA methods reported in Table 2 of the main paper. Baseline BN adapt RPL ENT UDA CIFAR10→STL10, top1 error on target [%](↘) WRN-26-16 UDA-SS 28.7 24.6 22.9 21.8 WRN-26-16 DANN 25.0 25.0 24.0 23.9 UDA MNIST→MNIST-M, top1 error on target [%](↘) WRN-26-16 UDA-SS 4.8 3.9 2.4 2.0 WRN-26-2 DANN 11.4 6.2 5.2 5.1 C.5 ABLATION OVER THE HYPERPARAMETER q FOR RPL For RPL, we must choose the hyperparameter q. We performed an ablation study over q and show results in Table 21, demonstrating that RPL is robust to the choice of q, with slight preference to higher values. 
Note: In the initial parameter sweep for this paper, we only compared q = 0.7 and _q = 0.8. Given the result in Table 21, it could be interesting to re-run the models in Table 1 of the_ main paper with q = 0.9, which could yield another (small) improvement in mCE. Table 21: ImageNet-C dev set mCE in %, vanilla ResNet50, batch size 96. We report the best score across a maximum of six adaptation epochs. q 0.5 0.6 0.7 0.8 0.9 mCE (dev) 49.5 49.3 49.2 49.2 49.1 C.6 SELF-TRAINING OUTPERFORMS CONTRASTIVE TEST-TIME TRAINING (SUN ET AL., 2019B) Sun et al. (2019b) use a ResNet18 for their experiments on ImageNet and only evaluate their method on severity 5 of IN-C. To enable a fair comparison, we trained a ResNet18 with both hard labeling and RPL and compare the efficacy of both methods to Test-Time Training in Table 22. For both hard labeling and RPL, we use the hyperparameters we found for the vanilla ResNet50 model and thus, we expect even better results for hyperparameters tuned on the vanilla ResNet18 model and following our general hyperparameter search protocol. While all methods (self-learning and TTT) improve the performance over a simple vanilla ResNet18, we note that even the very simple baseline using hard labeling already outperfoms Test-Time Training; further gains are possible with RPL. The result highlights the importance of simple baselines (like self-learning) when proposing new domain adaptation schemes. It is likely that many established DA techniques more complex than the basic self-learning techniques considered in this work will even further improve over TTT and other adaptation approaches developed exclusively in robustness settings. Table 22: Comparison of hard-pseudo labeling and robust pseudo-labeling to Test-Time Training Sun et al. (2019b): Top-1 error for a ResNet18 and severity 5 for all corruptions. Simple hard pseudo-labeling already outperforms TTT, robust pseudo labeling over multiple epochs yields additional gains. gauss shot impulsedefocusglass motionzoom snow frost fog bright contrastelastic pixelatejpeg **Avg** vanilla ResNet18 98.8 98.2 99.0 88.6 91.3 88.8 82.4 89.1 83.5 85.7 48.7 96.6 83.2 76.9 70.4 85.4 Test-Time Training 73.7 71.4 73.1 76.3 93.4 71.3 66.6 64.4 81.3 52.4 41.7 64.7 55.7 52.2 55.7 66.3 hard PL, (1 epoch) 73.2 70.8 73.6 76.5 75.6 63.9 56.1 59.0 65.9 48.4 39.7 85.2 50.4 47.0 51.5 62.5 RPL (4 epochs) **71.3 68.3 71.7 76.2 75.6 61.5 54.4 56.9 67.1 47.3 39.3 93.2 48.9 45.7 50.4 61.9** ----- C.7 EFFECT OF BATCH SIZE AND LINEAR LEARNING RATE SCALING How is self-learning performance affected by batch size constraints? We compare the effect of different batch sizes and linear learning rate scaling. In general, we found that affine adaptation experiments on ResNet50 scale can be run with batch size 128 on a Nvidia V100 GPU (16GB), while only batch size 96 experiments are possible on RTX 2080 GPUs. The results in Table 23 show that for a ResNet50 model, higher batch size yields a generally better performance. Table 23: ImageNet-C dev set mCE for various batch sizes with linear learning rate scaling. All results are computed for a vanilla ResNet50 model using RPL with q = 0.8, reporting the best score across a maximium of six adaptation epochs. 
batch size 16 32 64 80 96 128 learning rate (×10[−][3]) 0.125 0.250 0.500 0.625 0.750 1 dev mCE 53.8 51.0 49.7 49.3 49.2 48.9 C.8 PERFORMANCE OVER DIFFERENT SEEDS IN A RESNET50 ON IMAGENET-C To limit the amount of compute, we ran RPL and ENT for our vanilla ResNet50 model three times with the optimal hyperparameters. The averaged results, displayed as “mean (unbiased std)” are: Table 24: ImageNet-C performance for three seeds on a ResNet50 for ENT and RPL. ResNet50 + self-learning mCE on IN-C dev [%] mCE on IN-C test [%] ENT 50.0 (0.04) 51.6 (0.04) RPL 48.9 (0.02) 50.5 (0.03) C.9 SELF-LEARNING AS CONTINUOUS TEST-TEST ADAPTATION We test our method on continuous test-time adaptation where the model adapts to a continuous stream of data from the same domain. In Fig. 7, we display the error of the Noisy Student L2 model while it is being adapted to ImageNet-C and ImageNet-R. The model performance improves as the model sees more data from the new domain. We differentiate continuous test-time adaptation from the online test-time adaptation setting (Zhang et al., 2021) where the model is adapted to each test sample individually, and reset after each test sample. (i) ImageNet-C Baseline ENT RPL 23.0 22.0 2 3 Samples [×10[4]] 28.9 26 24 22 20 18 28 26 24 22 |(ii) ImageNet-R|Col2|Col3| |---|---|---| |23. 19. 17.||| |||| 1 2 3 Samples [×10[4]] Figure 7: Evolution of error during online adaptation for EfficientNet-L2. ----- D DETAILED AND ADDITIONAL RESULTS ON IN-D D.1 EVALUATION PROTOCOL ON IN-D The domains in IN-D differ in terms of their difficulty for the studied models. Therefore, to calculate an aggregate score, we propose normalizing the error rates by the error achieved by AlexNet on the respective domains to calculate the mean error, following the approach in Hendrycks & Dietterich (2019) for IN-C. This way, we obtain the aggregate score mean Domain Error (mDE) by calculating the mean over different domains, DE[f]d [=] _Ed[f]_ _,_ mDE = [1] _Ed[AlexNet]_ _D_ where Ed[f] [is the top-1 error of a classifier][ f][ on domain][ d][.] _Ed[f]_ _[,]_ (21) _d=1_ X **Leave-one-out-cross-validation** For all IN-D results we report in this paper, we chose the hyperparameters on the IN-C dev set. We tried a different model selection scheme on IN-D as a control experiment with “Leave one out cross-validation” (L1outCV): with a round-robin procedure, we choose the hyperparameters for the test domain on all other domains. We select the same hyperparameters as when tuning on the “dev” set: For the ResNet50 model, we select over the number of training epochs (with a maximum of 7 training epochs) and search for the optimal learning rate in the set [0.01, 0.001, 0.0001]. For the EfficientNet-L2 model, we train only for one epoch as before and select the optimal learning rate in the set [4.6 × 10[−][3], 4.6 × 10[−][4], 4.6 × 10[−][5], 4.6 × 10[−][6]]. This model selection leads to worse results both for the ResNet50 and the EfficientNetL2 models, highlighting the robustness of our model selection process, see Table 25. Table 25: mDE in % on IN-D for different model selection strategies. model model selection L1outCV IN-C dev ResNet50 RPLq=0.8 81.3 76.1 ResNet50 ENT 82.4 77.3 EfficientNet-L2 ENT 69.2 66.8 EfficientNet-L2 RPLq=0.8 69.1 67.2 D.2 DETAILED RESULTS FOR ROBUST RESNET50 MODELS ON IN-D We show detailed results for all models on IN-D for vanilla evaluation (Table 26) BN adaptation (Table 27), RPLq=0.8 (Table 28) and ENT(Table 29). 
For RPLq=0.8 and ENT, we use the same hyperparameters that we chose on our IN-C ‘dev’ set. This means we train the models for 5 epochs with RPLq=0.8 and for one epoch with ENT. We evaluate the pre-trained and public checkpoints of SIN (Geirhos et al., 2019), ANT (Rusak et al., 2020), ANT+SIN (Rusak et al., 2020), AugMix (Hendrycks et al., 2020b), DeepAugment (Hendrycks et al., 2020a) and DeepAug+Augmix (Hendrycks et al., 2020a) in the following tables. Table 26: Top-1 error on IN-D in % as obtained by robust ResNet50 models. For reference, we also show the mCE on IN-C and the top-1 error on IN-R. See main test for model references. Model Clipart Infograph Painting Quickdraw Real Sketch mDE IN-C IN-R vanilla 76.0 89.6 65.1 99.2 40.1 82.0 88.2 76.7 63.9 SIN 71.3 88.6 62.6 97.5 40.6 77.0 85.6 69.3 58.5 ANT 73.4 88.9 63.3 99.2 39.9 80.8 86.9 62.4 61.0 ANT+SIN 68.4 88.6 60.6 95.5 40.8 70.3 83.1 60.7 53.7 AugMix 70.8 88.6 62.1 99.1 39.0 78.5 85.4 65.3 58.9 DeepAugment 72.0 88.8 61.4 98.9 39.4 78.5 85.6 60.4 57.8 DeepAug+Augmix 68.4 88.1 58.7 98.2 39.2 75.2 83.4 53.6 53.2 ----- Table 29: Top-1 error on IN-D in % as obtained by state-of-the-art robust ResNet50 models and ENT. See main text for references to the used models. Model Clipart Infograph Painting Quickdraw Real Sketch mDE vanilla 65.1 85.8 59.2 98.5 38.4 75.8 77.3 SIN 62.1 87.0 57.3 99.1 39.0 68.6 75.5 ANT 64.2 86.9 58.7 97.1 38.8 72.8 76.5 ANT+SIN 62.2 86.8 57.7 95.8 40.1 68.7 75.2 AugMix 60.2 84.6 55.8 97.6 36.8 72.0 74.4 DeepAugment 59.5 85.7 54.4 98.0 37.1 66.4 73.3 DeepAug+Augmix 58.4 84.3 54.7 98.5 38.1 63.6 72.7 Table 30: mDE on IN-D in % as obtained by robust ResNet50 models with a baseline evaluation, batch norm adaptation, RPLq=0.8 and ENT. See main text for model references. mDE on IN-D (↘) Model Baseline BN adapt RPLq=0.8 ENT vanilla 88.2 80.2 76.1 77.3 SIN 85.6 79.6 76.8 75.5 ANT 86.9 80.7 78.1 76.5 ANT+SIN **83.1** 77.8 76.1 75.2 AugMix 85.4 78.4 74.6 74.4 DeepAugment 85.6 78.8 74.8 73.3 DeepAugment+Augmix 83.4 **74.9** **72.6** **72.7** Table 27: Top1 error on IN-D in % as obtained by state-of-the-art robust ResNet50 models and batch norm adaptation, with a batch size of 128. See main text for model references. Model Clipart Infograph Painting Quickdraw Real Sketch mDE vanilla 70.2 88.2 63.5 97.8 41.1 78.3 80.2 SIN 67.3 89.7 62.2 97.2 44.0 75.2 79.6 ANT 69.2 89.4 63.0 97.5 42.9 79.5 80.7 ANT+SIN 64.9 88.2 60.0 96.8 42.6 73.0 77.8 AugMix 66.9 88.1 61.2 97.1 40.4 75.0 78.4 DeepAugment 66.6 89.7 60.0 97.2 42.5 75.1 78.8 DeepAug+Augmix 61.9 85.7 57.5 95.3 40.2 69.2 74.9 Table 28: Top-1 error on IN-D in % as obtained by state-of-the-art robust ResNet50 models and RPLq=0.8. See main text for model references. Model Clipart Infograph Painting Quickdraw Real Sketch mDE vanilla 63.6 85.1 57.8 99.8 37.3 73.0 76.1 SIN 60.8 86.4 56.0 99.0 37.8 67.0 76.8 ANT 63.4 86.3 57.7 99.2 37.7 71.0 78.1 ANT+SIN 61.5 86.4 56.8 97.0 39.0 67.1 76.1 AugMix 59.7 83.4 54.1 98.2 35.6 70.1 74.6 DeepAugment 58.1 84.6 53.3 99.0 36.2 64.2 74.8 DeepAug+Augmix 57.0 83.2 53.4 99.1 36.5 61.3 72.6 The summary results for all models are shown in Table 30. We show the top-1 error for the different IN-D domains versus training epochs for a vanilla ResNet50 in Fig. 8. We indicate the epochs 1 and 5 at which we extract the errors with dashed black lines. 
----- clipart infograph painting 100 100 90 85 90 95 80 GCE 80 75 ENT 90 70 70 65 60 85 quickdraw real sketch 100.0 100 80 99.5 95 99.0 70 90 98.5 60 85 98.0 50 80 97.5 40 75 97.0 70 0 5 10 15 0 5 10 15 0 5 10 15 Epochs Epochs Epochs Figure 8: Top-1 error for the different IN-D domains for a ResNet50 and training with RPLq=0.8 and ENT. We indicate the epochs at which we extract the test errors by the dashed black lines (epoch 1 for ENTand epoch 5 for RPLq=0.8). D.3 DETAILED RESULTS FOR THE EFFICIENTNET-L2 NOISY STUDENT MODEL ON IN-D We show the detailed results for the EfficientNet-L2 Noisy Student model on IN-D in Table 31. Table 31: Top-1 error (↘) on IN-D in % for EfficientNet-L2 Domain Baseline ENT RPL Clipart 45.0 39.8 **37.9** Infograph **77.9** 91.3 94.3 Painting 42.7 41.7 **40.9** Quickdraw **98.4** 99.4 99.4 Real 29.2 28.7 **27.9** Sketch 56.4 **48.0** 51.5 mDE 67.2 **66.8** 67.2 D.4 DETAILED RESULTS ON THE ERROR ANALYSIS ON IN-D **Frequently predicted classes** We analyze the most frequently predicted classes on IN-D by a vanilla ResNet50 and show the results in Fig. 9. We make several interesting observations: First, we find most errors interpretable: it makes sense that a ResNet50 assigns the label “comic book” to images from the “clipart” or “painting” domains, or “website” to images from the “infograph” domain, or “envelope” to images from the “sketch” domain. Second, on the hard domain “quickdraw”, the ResNet50 mostly predicts non-sensical classes that are not in IN-D, mirroring its almost chance performance on this domain. Third, we find no systematic errors on the “real” domain which is expected since this domain should be similar to IN. **Filtering predictions on IN-D that cannot be mapped to ImageNet-D** We perform a second analysis: We filter the predicted labels according to whether they can be mapped to IN-D and report the filtered top-1 errors as well as the percentage of filtered out inputs in Table 32. We note that for the domains “infograph” and “quickdraw”, the ResNet50 predicts labels that cannot be mapped to IN-D in over 70% of all cases, highlighting the hardness of these two domains. ----- Table 32: top-1 error on IN and different IN-D domains for different settings: left column: default evaluation, middle column: predicted labels that cannot be mapped to IN-D are filtered out, right column: percentage of filtered out labels. Dataset top-1 error in % top-1 error on filtered labels in % percentage of rejected inputs IN val 12.1 13.4 52.7 IN-D real 40.2 17.2 27.6 IN-D clipart 76.1 59.0 59.0 IN-D infograph 89.7 59.3 74.6 IN-D painting 65.2 39.5 42.4 IN-D quickdraw 99.3 96.7 76.1 IN-D sketch 82.1 65.6 47.9 **Filtering labels and predictions on IN that cannot be mapped to ImageNet-D** To test for possible class-bias effects, we test the performance of a ResNet50 model on IN classes that can be mapped to IN-D and report the results in Table 32. First, we map IN labels to IN-D to make the setting as similar as possible to our experiments on IN-D and report the top-1 error (12.1%). This error is significantly lower compared to the top-1 error a ResNet50 obtains following the standard evaluation protocol (23.9%). This can be explained by the simplification of the task: While in IN there are 39 bird classes, these are all mapped to the same hierarchical class in IN-D. Therefore, the classes in IN-D are more dissimilar from each other than in IN. Additionally, there are only 164 IN-D classes compared to the 1000 IN classes, raising the chance level prediction. 
If we further only accept predictions that can be mapped to IN-D, the top-1 error is slightly increased to 13.4%. In total, about 52.7% of all images in the IN validation set cannot be mapped to IN-D. clipart infograph 80 150 60 125 100 40 75 50 Numbers of predictions 20 25 0 0 comic book envelope jigsaw puzzle website menu envelope painting quickdraw 100 25 20 80 15 60 10 40 Numbers of predictions 5 20 0 0 comic book book jacket jigsaw puzzle hook, claw chain labyrinth real sketch 100 3 80 2 60 40 1 Numbers of predictions 20 0 0 envelope comic book studio couch envelope labyrinth nematode Figure 9: Systematic predictions of a vanilla ResNet50 on IN-D for different domains. ----- D.5 TOP-1 ERROR ON IN-D FOR ALEXNET We report the top-1 error numbers on different IN-D as achieved by AlexNet in Table 33. We used these numbers for normalization when calculating mDE. Table 33: top-1 error on IN-D by AlexNet which was used for normalization. Dataset top-1 error in % IN-D real 54.887 IN-D clipart 84.010 IN-D infograph 95.072 IN-D painting 79.080 IN-D quickdraw 99.745 IN-D sketch 91.189 ----- E ADDITIONAL EXPERIMENTS E.1 BEYOND IMAGENET CLASSES: SELF-LEARNING ON WILDS The WILDS benchmark (Koh et al., 2021) is comprised of ten tasks to test domain generalization, subpopulation shift, and combinations thereof. In contrast to the setting considered here, many of the datasets in WILDS mix several 10s or 100s domains during test time. The Camelyon17 dataset in WILDS contains histopathological images, with the labels being binary indicators of whether the central 32×32 region contains any tumor tissue; the domain identifies the hospital that the patch was taken from. Camelyon17 contains three different test splits with different domains and varying difficulty levels. For evaluation, we took the pretrained checkpoint from worksheets.codalab.org/worksheets/0x00d14c55993548a1823a710642f6d608 (camelyon17 erm densenet121 seed0) for a DenseNet121 model (Huang et al., 2017) and verified the reported baseline performance numbers. We adapt the models using ENT or RPL for a maximum of 10 epochs using learning rates {3 _×_ 10[−][5], 3 _×_ 10[−][4], . . . 3 _×_ 10[−][1]}. The best hyperparameter is selected according to OOD Validation accuracy. The RxRx1 dataset in WILDS contains RGB images of cells obtained by fluorescent microscopy, with the labels indicating which of the 1,139 genetic treatments (including no treatment) the cells received; the domain identifies the batch in which the imaging experiment was run. The RxRx1 dataset contains three test splits, however, unlike Camelyon17, in all of the splits the domains are mixed. For evaluation, we took the pretrained checkpoint from worksheets.codalab.org/bundles/ 0x7d33860545b64acca5047396d42c0ea0 for a ResNet50 model and verified the reported baseline performance numbers. We adapt the models using ENT or RPL for a maximum of 10 epochs using base learning rates {6.25 × 10[−][6], 6.25 × 10[−][5], . . . 6.25 × 10[−][2]}, which are scaled to the admissible batch size for single GPU adaptation using linear scaling. The best hyperparameter is selected according to OOD Validation accuracy. Table 34: Self-learning can improve performance on WILDS if a systematic shift is present — on Camelyon17, the ood validation and test sets are different hospitals, for example. On datasets like RxRx1 and FMoW, we do not see an improvement, most likely because the ood domains are shuffled, and a limited amount of images exist for each test domain. 
Top-1 accuracy [%] Validation (ID) Validation (OOD) Test (OOD) Camelyon17 Baseline 81.4 88.7 63.1 BN adapt 97.8 (+16.4) 90.9 (+2.2) 88.0 (+24.9) ENT 97.6 (+16.2) 92.7 (+4.0) 91.6 (+28.5) RPL 97.6 (+16.2) 93.0 (+4.3) 91.0 (+27.9) RxRx1 Baseline 35.9 19.1 29.7 BN adapt 35.0 (-0.9) 19.1 (0.0) 29.4 (-0.3) ENT 34.8 (-1.1) 19.2 (+0.1) 29.4 (-0.3) RPL 34.8 (-1.1) 19.2 (+0.1) 29.4 (-0.3) FMoW Baseline 60.5 59.2 52.9 BN adapt 59.9 (-0.6) 57.6 (-1.6) 51.8 (-1.1) ENT 59.9 (-0.6) 58.5 (-0.7) 52.2 (-0.7) RPL 59.8 (-0.7) 58.6 (-0.6) 52.1 (-0.8) The FMoW dataset in WILDS contains RGB satellite images, with the labels being one of 62 building or land use categories; the domain specifies the year in which the image was taken and its geographical region (Africa, the Americas, Oceania, Asia, or Europe). The FMoW dataset contains four test splits for different time periods, for which all regions are mixed together. For evaluation, we took the pretrained checkpoint from //worksheets.codalab.org/ bundles/0x20182ee424504e4a916fe88c91afd5a2 for a DenseNet121 model and verified the reported baseline performance numbers. We adapt the models using ENT or RPL for a maximum of 10 epochs ----- using learning rates {5.0 × 10[−][6], 5.0 × 10[−][5], . . . 5.0 × 10[−][2]}. The best hyperparameter is selected according to OOD Validation accuracy. While we see improvements on Camelyon17, neither BN adaptation nor self-learning can improve performance on RxRx1 or FMoW. Initial experiments on PovertyMap and iWildsCam also do not show improvements with self-learning. We hypothesize that the reason lies in the mixing of the domains: Both BN adaptation and our self-learning methods work best on systematic domain shifts. These results support our claim that self-learning is effective, while showing the important limitation when applied to more diverse shifts. E.2 SMALL IMPROVEMENTS ON BIGTRANSFER MODELS WITH GROUP NORMALIZATION LAYERS We evaluated BigTransfer models (Kolesnikov et al., 2020) provided by the timm library (Wightman, 2019). A difference to the ResNet50, ResNeXt101 and EfficientNet models is the use of group normalization layers, which might influence the optimal method for adaptation—for this evaluation, we followed our typical protocol as performed on ResNet50 models, and used affine adaptation. For affine adaptation, a distilled BigTransfer ResNet50 model improves from 49.6 % to 48.4 % mCE on the ImageNet-C development set, and from 55.0 % to 54.4 % mCE on the ImageNet-C test set when using RPL (q = 0.8) for adaptation, at learning rate 7.5 × 10[−][4] at batch size 96 after a single adaptation epoch. Entropy minimization did not further improve results on the ImageNet-C test set. An ablation over learning rates and epochs on the dev set is shown in Table 35, the final results are summarized in Table 36. Table 35: mCE in % on the IN-C dev set for ENT and RPL for different numbers of training epochs when adapting the affine batch norm parameters of a ResNet50 model. 
criterion ENT RPL lr, 7.5 × 10[−][5] 10[−][4] 10[−][3] 10[−][5] 10[−][4] 10[−][3] epoch 0 49.63 49.63 49.63 49.63 49.63 49.63 1 49.44 50.42 52.59 49.54 48.89 48.95 2 49.26 50.27 56.47 49.47 **48.35** 50.77 3 49.08 52.18 60.06 49.39 48.93 51.45 4 48.91 52.03 60.50 49.31 50.01 51.53 5 **48.80** 51.97 62.91 49.24 49.96 51.34 6 48.83 52.10 62.96 49.16 49.71 51.19 7 48.83 52.10 62.96 49.16 49.71 51.19 Table 36: mCE in % on the INC dev set for ENT and RPL for different numbers of training epochs when adapting the affine batch norm parameters of a ResNet50 model. dev mCE test mCE Baseline 49.63 55.03 ENT 48.80 56.36 RPL **48.35** **54.41** E.3 CAN SELF-LEARNING IMPROVE OVER SELF-LEARNING BASED UDA? An interesting question is whether test-time adaptation with self-learning can improve upon selflearning based UDA methods. To investigate this question, we build upon French et al. (2018) and their released code base at github.com/Britefury/self-ensemble-visual-domain-adapt. We trained the Baseline models from scratch using the provided shell scripts with the default hyperparameters and verified the reported performance. For adaptation, we tested BN adaptation, ENT, RPL, as well as continuing to train in exactly the setup of French et al. (2018), but without the supervised loss. For the different losses, we adapt the models for a maximum of 10 epochs using learning rates _{1 × 10[−][5], 1 × 10[−][4], . . ., 1 × 10[−][1]}._ Note that for this experiment, in contrast to any other result in this paper, we purposefully do not **perform proper hyperparameter selection based on a validation dataset—instead we report the** best accuracy across all tested epochs and learning rates to give an upper bound on the achievable performance for test-time adaptation. As highlighted in Table 37, none of the four tested variants is able to meaningfully improve over the baseline, corroborating our initial hypothesis that self-learning within a full UDA setting is the optimal strategy, if dataset size and compute permits. On the other hand, results like the teacher ----- refinement step in DIRT-T (Shu et al., 2018) show that with additional modifications in the loss function, it might be possible to improve over standard UDA with additional adaptation at test time. Table 37: Test-time adaptation marginally improves over self-ensembling. Baseline BN adapt ENT RPL Self-ensembling loss MNIST→SVHN MT+TF 33.88 34.44 34.87 35.09 33.27 MT+CT* 32.62 34.11 34.25 34.21 33.36 MT+CT+TF 41.59 41.93 41.95 41.95 42.70 MT+CT+TFA 30.55 32.53 32.54 32.55 30.84 SVHN-specific aug. 97.05 96.82 96.91 96.87 97.12 MNIST→USPS MT+TF 98.01 97.91 97.96 97.91 98.16 MT+CT* 88.34 88.39 88.54 88.39 88.44 MT+CT+TF 98.36 98.41 98.41 98.41 98.50 MT+CT+TFA 98.45 98.45 98.45 98.45 98.61 SVHN→MNIST MT+TF 98.49 98.47 98.49 98.47 99.40 MT+CT* 88.34 88.36 88.36 88.36 89.36 MT+CT+TF 99.51 99.49 99.5 99.49 99.57 MT+CT+TFA 99.56 99.57 99.57 99.57 99.58 SVHN-specific aug. 99.52 99.49 99.5 99.49 99.65 USPS→MNIST MT+TF 92.79 92.62 92.62 92.66 93.08 MT+CT* 99.11 99.13 99.14 99.13 99.21 MT+CT+TF 99.41 99.42 99.45 99.42 99.52 MT+CT+TFA 99.48 99.54 99.57 99.54 99.54 ----- F DETAILED DISCUSSION OF RELATED WORK **Self-learning for domain adaptation** Xie et al. (2020b) introduce “In-N-Out” which uses auxiliary information to boost both in- and out-of-distribution performance. 
AdaMatch (Berthelot et al., 2021) builds upon FixMatch (Sohn et al., 2020) and can be used for the tasks of unsupervised domain adaptation, semi-supervised learning and semi-supervised domain adaptation as a generalpurpose algorithm. Prabhu et al. (2021) propose SENTRY, an algorithm based on judging the predictive consistency of samples from the target domain under different image transformations. Zou et al. (2019) show that different types of confidence regularization can improve the performance of self-learning. A theoretically motivated framework for self-learning in domain adaptation based on consistency regularization has been proposed by Wei et al. (2020) and then extended by Cai et al. (2021). Self-learning has also been used for semantic segmentation (Zou et al., 2018). The main difference from these works to ours is that they 1) utilize both source and target data during training (i.e., the classical UDA setup) whereas we only require access to unlabeled target data (source-free setup), and 2) train their models from scratch whereas we adapt pretrained checkpoints to the unlabeled target data, 3) are oftentimes more complicated (also in terms of the number of hyperparameters) than our approach due to using more than one term in the objective function. We would like to highlight that utilizing source data should always result in better performance compared to not using source data. Our contribution is to show that self-learning can still be very beneficial with a small compute budget and no access to source data. Our setup targets “deployed systems”, e.g., a self-driving car or a detection algorithm in a production line which adapts to the distribution shift “on-the-fly” and cannot (or should not) be retrained from scratch for every new domain shift. Kumar et al. (2020) study the setting of self-learning for gradual domain adaptation. They find that self-learning works better if the data distribution changes slowly. The gradual domain adaptation setting differs from ours; instead of a gradual shift over time, we focus on a fixed, systematic shift at test time dataset. Kumar et al. (2020) tested their method on a synthetic Gaussian dataset, MNIST and the Portraits datasets; building and evaluating ImageNet-scale datasets for a gradual domain adaptation perspective is a very interesting extension of our work, but left for future work, and would not only require changes/adaptations to the self-learning method, but also to the evaluation datasets. Chen et al. (2020b) prove that under certain conditions, self-learning can improve performance in biased datasets where spurious features correlate with the label in the source domain but are independent of the label in the target domain. While Chen et al. (2020b) also consider the setting of source-free domain adaptation (like we do), they limit their experiments to small scale models on MNIST and Celeb-A, while we conduct a large scale empirical study on all common robustness datasets with large scale models – some of the observed and studied effects in our paper (effectiveness of different loss functions at different problem scales, adaptation mechanisms, etc.) can be attributed to this large scale evaluation setting, and extending our insights over small scale experiments. Similar to us, Chen et al. (2020b) find that a strong source classifier is necessary for self-learning to work; however, in their case, a teacher accuracy of 72% (on CMNIST10) is already too low and leads to worse student accuracy. 
In contrast, in our experiments, self-learning still works for an mCE as high as 80% (cf. appendix Figure 3, severity 5) and teacher accuracies as low as 10.4% (on ImageNet-D “Infograph”), and breaks down at accuracies around 1-2% (on ImageNet-D “Quickdraw”). This discrepancy might be due to the spurious correlations that Chen et al. (2020b) introduced in their dataset leading to systematic biases, which are not present in the datasets we studied. **Self-learning in semi-supervised learning (SSL)** In a different line of work which is not related to domain adaptation directly, self-learning has been used in a semi-supervised setting. Zoph et al. (2020) show that self-learning outperforms pretraining when stronger data augmentation is used and more labeled data is present. They use human labels on the target task (e.g., object detection on COCO) and pseudo-labels on an unlabeled dataset (e.g. ImageNet), and optimize the loss on both datasets, with the aim to improve performance on the task where ground truth labels are known. The work of Zoph et al. (2020) is orthogonal to ours, in the sense that we could adapt their final ----- checkpoint to a new domain with our method, similar to how we adapted the Noisy Student model which was also trained using self-learning. Rizve et al. (2021) propose an uncertainty-aware pseudo-label selection (UPS) framework which outperforms other SSL methods in a few-label regime. UPS is helpful to reduce the impact of noisy pseudo-labels; in our case, we use the generalized cross-entropy loss for this purpose. Testing the UPS framework (and other means for improving the quality of pseudo-labels, or robustness against label noise) on robustness datasets would be an interesting direction for future work. De Sousa Ribeiro et al. (2020) propose Deep Bayesian Self-Training (DBST) for automatic data annotation. Mukherjee & Awadallah (2020) suggest using self-learning in a semi-supervised setting for text classification with few labels. ----- G SOFTWARE STACK We use different open source software packages for our experiments, most notably Docker (Merkel, 2014), scipy and numpy (Virtanen et al., 2020), GNU parallel (Tange, 2011), Tensorflow (Abadi et al., 2016), PyTorch (Paszke et al., 2017), timm (Wightman, 2019), Self-ensembling for visual domain adaptation (French et al., 2018), the WILDS benchmark (Koh et al., 2021), and torchvision (Marcel & Rodriguez, 2010). -----