|
# IF YOUR DATA DISTRIBUTION SHIFTS, USE SELF-LEARNING
|
|
|
**Anonymous authors** |
|
Paper under double-blind review |
|
|
|
ABSTRACT |
|
|
|
We demonstrate that self-learning techniques like entropy minimization and |
|
pseudo-labeling are simple and effective at improving performance of a deployed |
|
computer vision model under systematic domain shifts. We show consistent |
|
improvements irrespective of the model architecture, the pre-training technique |
|
or the type of distribution shift. At the same time, self-learning is simple to |
|
use in practice because it does not require knowledge or access to the original |
|
training data or scheme, is robust to hyperparameter choices, is straight-forward |
|
to implement and requires only a few adaptation epochs. This makes self-learning techniques highly attractive for any practitioner who applies machine
|
learning algorithms in the real world. We present state-of-the-art adaptation
|
results on CIFAR10-C (8.5% error), ImageNet-C (22.0% mCE), ImageNet-R |
|
(17.4% error) and ImageNet-A (14.8% error), theoretically study the dynamics |
|
of self-supervised adaptation methods and propose a new classification dataset |
|
(ImageNet-D) which is challenging even with adaptation. |
|
|
|
1 INTRODUCTION |
|
|
|
Deep Neural Networks (DNNs) can reach human-level performance in complex cognitive tasks |
|
(Brown et al., 2020; He et al., 2016a; Berner et al., 2019) if the distribution of the test data is |
|
sufficiently similar to the training data. However, DNNs are known to struggle if the distribution of |
|
the test data is shifted relative to the training data (Geirhos et al., 2018; Dodge & Karam, 2017).
|
|
|
Two largely distinct communities aim to increase the performance of models under test-time |
|
distribution shifts: The robustness community generally considers ImageNet-scale datasets and |
|
evaluates models in an ad-hoc scenario. Models are trained on a clean source dataset like ImageNet, |
|
using heavy data augmentation (Hendrycks et al., 2020a; Rusak et al., 2020; Geirhos et al., 2019) |
|
and/or large-scale pre-training (Xie et al., 2020a; Mahajan et al., 2018). The trained models are |
|
not adapted in any way to test-time distribution shifts. This evaluation scenario is relevant for |
|
applications in which very different distribution shifts are encountered in an unpredictable order, |
|
and hence misses out on the gains of adaptation to unlabeled samples of the target distribution. |
|
|
|
Figure 1: Robustness and adaptation to new datasets has traditionally been achieved by robust pre-training (with |
|
hand-selected/data-driven augmentation strategies, or additional data), unsupervised domain adaptation (with |
|
access to unlabeled samples from the test set), or, more recently, self-supervised learning methods. We show |
|
that on top of these different pre-training tasks, it is always possible (irrespective of architecture, model size or |
|
pre-training algorithm) to further adapt models to the target domain with simple self-learning techniques. |
|
|
|
|
|
----- |
|
|
|
The unsupervised domain adaptation (UDA) community often considers smaller-scale datasets and |
|
assumes that both the source and the (unlabeled) target dataset are known. Models are trained on |
|
both datasets (e.g., with an adversarial domain objective, Ganin et al., 2016) before evaluation on |
|
the target domain data. This evaluation scenario provides optimal conditions for adaptation, but the |
|
reliance on the source dataset makes UDA computationally more expensive, less practical, and
|
prevents the use of pre-trained models for which the source dataset is unknown or simply too large. |
|
|
|
In this work, we consider the source-free domain adaptation setting, a middle ground between the |
|
classical ad-hoc robustness setting and UDA in which models can adapt to the target distribution |
|
but without using the source dataset (Kundu et al., 2020; Kim et al., 2021; Li et al., 2020; Liang |
|
et al., 2020). This evaluation scenario is interesting for many practitioners and applications as an |
|
extension of the ad-hoc robustness scenario. It evaluates the possible performance of a deployed |
|
model on a systematic, unseen distribution shift at inference time: an embedded computer vision |
|
system in an autonomous car should adapt to changes without being trained on all available training |
|
data; an image-based quality control software may not necessarily open-source the images it has |
|
been trained on, but still has to be adapted to the lighting conditions at the operation location; a |
|
computer vision system in a hospital should perform robustly when tested on a scanner different |
|
from the training images—importantly, it might not be known at development time which scanner it |
|
will be tested on, and it might be prohibited to share images from many hospitals to run UDA. |
|
|
|
Can self-learning methods like pseudo-labeling and entropy-minimization also be used in this |
|
_source-free domain adaptation setting? To answer this question, we perform an extensive study_ |
|
of several self-learning variants, and find consistent and substantial gains in test-time performance |
|
across several robustness and out-of-domain benchmarks and a wide range of models and pre-training methods, including models trained with UDA methods that do not use self-learning. We
|
also find that self-learning outperforms state-of-the-art source-free domain adaptation methods, |
|
namely Test-Time Training which is based on a self-supervised auxiliary objective and continual |
|
training (Sun et al., 2019b), test-time entropy minimization (Wang et al., 2020) and (gradient-free) |
|
BatchNorm adaptation (Schneider et al., 2020; Nado et al., 2020). We perform a large number |
|
of ablations to study important design choices for self-learning methods in source-free domain |
|
adaptation. Furthermore, we show that a variant of pseudo-labeling with a robust loss function |
|
consistently outperforms entropy minimization on ImageNet-scale datasets. We theoretically |
|
analyze and empirically verify the influence of the temperature parameter in self-learning and |
|
provide guidelines on how this single parameter should be chosen. Our approach is visualized in
|
Figure 1. We do not consider test-time adaptation in an online setting as studied, e.g., by Zhang
|
et al. (2021), where the model is adapted to one example at a time, and reset after each example. |
|
|
|
**Related Work. Variants of self-learning have been used for UDA (Berthelot et al., 2021), for** |
|
example using auxiliary information (Xie et al., 2020b), consistency (Wei et al., 2020; Cai et al., |
|
2021; Prabhu et al., 2021) or confidence (Zou et al., 2019) regularization. The main differences between
|
these works and ours are that they 1) utilize both source and target data for self-learning whereas we
|
only require access to unlabeled target data, 2) train their models from scratch whereas we merely |
|
fine-tune pretrained checkpoints on the unlabeled target data, and 3) are generally more complicated |
|
than our approach due to using more than one term in the objective function. |
|
|
|
Our work is conceptually most similar to virtual adversarial domain adaptation in the fine-tuning |
|
phase of DIRT-T (Shu et al., 2018) and test-time entropy minimization (TENT; Wang et al., 2020).
|
In contrast to DIRT-T, our objective is simpler and we scale the approach to considerably larger |
|
datasets on ImageNet scale. TENT, on the other hand, only evaluated a single method (entropy |
|
minimization) on a single vanilla model (ResNet-50) on IN-C. We substantially expand this analysis |
|
to show that self-learning almost universally increases test-time performance under distribution |
|
shifts, regardless of the type of distribution shift, the model architecture or the pre-training method. |
|
|
|
Self-learning has also been applied to UDA for semantic segmentation (Zou et al., 2018), for gradual |
|
domain adaptation (Kumar et al., 2020), for semi-supervised learning (Rizve et al., 2021; Mukherjee |
|
& Awadallah, 2020), for learning in biased datasets (Chen et al., 2020b) and for automated data |
|
annotation (De Sousa Ribeiro et al., 2020). Zoph et al. (2020) show that self-learning outperforms |
|
pretraining when stronger data augmentation is used and more labeled data is present. A more |
|
detailed discussion of related work alongside with the main differences to our work can be found in |
|
Appendix F. Our main contribution beyond these works is to show the effectiveness of self-learning |
|
on top of robust, large-scale, and domain-adapted models, at scale.
|
|
|
|
|
----- |
|
|
|
2 SELF-LEARNING FOR TEST-TIME ADAPTATION |
|
|
|
Different variants of self-learning have been used in both unsupervised domain adaptation (French |
|
et al., 2018; Shu et al., 2018), self-supervised representation learning (Caron et al., 2021), and in |
|
semi-supervised learning (Xie et al., 2020a). In a typical self-learning setting a teacher network |
|
$f^t$ trained on the source domain predicts labels on the target domain. Then, a student model $f^s$ is
|
fine-tuned on the predicted labels. |
|
|
|
In the following, let $f^t(\mathbf{x})$ denote the logits for sample $\mathbf{x}$ and let $p^t(j|\mathbf{x}) \equiv \sigma_j(f^t(\mathbf{x}))$ denote
the probability for class $j$ obtained from a softmax function $\sigma_j(\cdot)$. Similarly, $f^s(\mathbf{x})$ and $p^s(j|\mathbf{x})$
denote the logits and probabilities for the student model $f^s$. For all techniques, one can optionally
only admit samples where the probability $\max_j p^t(j|\mathbf{x})$ exceeds some threshold. We consider
|
three popular variants of self-learning: Pseudo-labeling with hard or soft labels, as well as entropy |
|
minimization. |
|
|
|
**Hard Pseudo-Labeling (Lee, 2013; Galstyan & Cohen, 2007).** We generate labels using the
|
teacher and train the student on pseudo-labels i using the standard cross-entropy loss, |
|
|
|
$$\ell_H(\mathbf{x}) := -\log p^s(i|\mathbf{x}), \quad i = \operatorname{argmax}_j\, p^t(j|\mathbf{x}) \qquad (1)$$
|
|
|
Usually, only samples with a confidence above a certain threshold are considered for training the |
|
student. We test several thresholds but note that thresholding means discarding a potentially large |
|
portion of the data which leads to a performance decrease in itself. The teacher is updated after each |
|
epoch. |
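
As an illustration, the following is a minimal PyTorch-style sketch of the hard pseudo-labeling objective with an optional confidence threshold; the function name and the zero-loss fallback for batches without confident samples are our own illustrative choices, not part of the original training code.

```python
import torch
import torch.nn.functional as F

def hard_pseudo_label_loss(student_logits, teacher_logits, threshold=0.0):
    """Cross-entropy on hard teacher labels (Eq. 1); only samples whose teacher
    confidence max_j p^t(j|x) exceeds the threshold contribute to the loss."""
    with torch.no_grad():
        teacher_probs = teacher_logits.softmax(dim=1)
        confidence, pseudo_labels = teacher_probs.max(dim=1)
        mask = (confidence > threshold).float()
    per_sample = F.cross_entropy(student_logits, pseudo_labels, reduction="none")
    if mask.sum() == 0:
        return student_logits.sum() * 0.0  # no confident samples in this batch
    return (per_sample * mask).sum() / mask.sum()
```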
|
|
|
**Soft Pseudo-Labeling (Lee, 2013; Galstyan & Cohen, 2007).** In contrast to the hard pseudo-
|
labeling variant, we here train the student on class probabilities predicted by the teacher, |
|
|
|
|
|
$$\ell_S(\mathbf{x}) := -\sum_j p^t(j|\mathbf{x}) \log p^s(j|\mathbf{x}). \qquad (2)$$
|
|
|
|
|
Soft pseudo-labeling is typically not used in conjunction with thresholding, since it already |
|
incorporates the certainty of the model. The teacher is updated after each epoch. |
|
|
|
**Entropy Minimization (ENT; Grandvalet & Bengio, 2004).** This variant is similar to soft pseudo-
|
labeling, but we no longer differentiate between a teacher and student network. It corresponds to an |
|
“instantaneous” update of the teacher. The training objective becomes |
|
|
|
|
|
$$\ell_E(\mathbf{x}) := -\sum_j p^s(j|\mathbf{x}) \log p^s(j|\mathbf{x}). \qquad (3)$$
|
|
|
|
|
Intuitively, self-training with entropy minimization leads to a sharpening of the output distribution |
|
for each sample, making the model more confident in its predictions. |
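
For comparison, below is a minimal sketch of the soft pseudo-labeling and entropy minimization objectives (Eqs. 2 and 3), again with illustrative function names; the only difference between the two is whether the target distribution comes from a detached teacher forward pass or from the adapted model itself.

```python
import torch

def soft_pseudo_label_loss(student_logits, teacher_logits):
    """Cross-entropy between the teacher and student distributions (Eq. 2)."""
    teacher_probs = teacher_logits.detach().softmax(dim=1)  # teacher is fixed within an epoch
    student_log_probs = student_logits.log_softmax(dim=1)
    return -(teacher_probs * student_log_probs).sum(dim=1).mean()

def entropy_minimization_loss(logits):
    """Entropy of the model's own predictions (Eq. 3); teacher and student coincide."""
    probs = logits.softmax(dim=1)
    log_probs = logits.log_softmax(dim=1)
    return -(probs * log_probs).sum(dim=1).mean()
```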
|
|
|
**Robust Pseudo-Labeling (RPL).** Virtually all introduced self-training variants use the standard
|
cross-entropy classification objective. However, the standard cross-entropy loss has been shown |
|
to be sensitive to label noise (Zhang & Sabuncu, 2018; Zhang et al., 2017). In the setting of |
|
domain adaptation, inaccuracies in the teacher predictions and, thus, the labels for the student, are |
|
inescapable, with severe repercussions for training stability and hyperparameter sensitivity as we |
|
show in the results. |
|
|
|
As a straight-forward solution to this problem, we propose to replace the cross-entropy loss by a |
|
robust classification loss designed to withstand certain amounts of label noise (Ghosh et al., 2017; |
|
Song et al., 2020; Shu et al., 2020; Zhang & Sabuncu, 2018). A popular candidate is the Generalized |
|
_Cross Entropy (GCE) loss which combines the noise-tolerant Mean Absolute Error (MAE) loss_ |
|
(Ghosh et al., 2017) with the CE loss. We only consider the hard labels and use the robust GCE loss |
|
as the training loss for the student, |
|
|
|
$$i = \operatorname{argmax}_j\, p^t(j|\mathbf{x}), \qquad \ell_{\mathrm{GCE}}(\mathbf{x}, i) := q^{-1}\left(1 - p^s(i|\mathbf{x})^q\right), \qquad (4)$$

with $q \in (0, 1]$. For the limit case $q \to 0$, the GCE loss approaches the CE loss, and for $q = 1$, the
GCE loss is the MAE loss (Zhang & Sabuncu, 2018). We test updating the teacher both after every
update step of the student (RPL) and once per epoch (RPL$^{\mathrm{ep}}$).
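
A minimal sketch of the GCE objective from Eq. 4 as used for robust pseudo-labeling is given below; the default value of q shown here is purely illustrative (the hyperparameters we actually use are reported in the Appendix).

```python
import torch

def gce_loss(student_logits, teacher_logits, q=0.8):
    """Generalized Cross Entropy on hard teacher labels (Eq. 4); q -> 0 recovers
    cross-entropy, q = 1 recovers MAE (Zhang & Sabuncu, 2018)."""
    with torch.no_grad():
        pseudo_labels = teacher_logits.argmax(dim=1)  # i = argmax_j p^t(j|x)
    p_i = student_logits.softmax(dim=1).gather(1, pseudo_labels.unsqueeze(1)).squeeze(1)
    return ((1.0 - p_i ** q) / q).mean()
```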
|
|
|
|
|
----- |
|
|
|
3 EXPERIMENT DESIGN |
|
|
|
**Datasets.** IN-C (Hendrycks & Dietterich, 2019) contains corrupted versions of the 50 000 images in
|
the IN validation set. There are fifteen test and four hold-out corruptions, and there are five severity |
|
levels for each corruption. The established metric to report model performance on IN-C is the mean |
|
Corruption Error (mCE) where the error is normalized by the AlexNet error, and averaged over all |
|
corruptions and severity levels, see Eq. 20, Appendix C.1. IN-R (Hendrycks et al., 2020a) contains |
|
30 000 images with artistic renditions of 200 classes of the IN dataset. IN-A (Hendrycks et al., 2019) is composed
|
of 7500 unmodified real-world images on which standard IN-trained ResNet50 (He et al., 2016b) |
|
models yield chance level performance. CIFAR10 (Krizhevsky et al., 2009) and STL10 (Coates |
|
et al., 2011) are small-scale image recognition datasets with 10 classes each, and training sets of |
|
50 000/5000 images and test sets of 10 000/8000 images, respectively. The digit datasets MNIST |
|
(Deng, 2012) and MNIST-M (Ganin et al., 2016) both have 60 000 training and 10 000 test images. |
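
For completeness, a short sketch of the mCE computation described above (the exact definition is given in Eq. 20, Appendix C.1); it assumes per-corruption, per-severity top-1 errors for the evaluated model and for the AlexNet baseline are already available, and the function name is illustrative.

```python
def mean_corruption_error(errors, alexnet_errors):
    """errors / alexnet_errors: dict mapping each corruption to a list of top-1
    errors for severities 1..5; returns the AlexNet-normalized mCE in percent."""
    ratios = []
    for corruption, errs in errors.items():
        ratios.append(sum(errs) / sum(alexnet_errors[corruption]))  # normalize by AlexNet
    return 100.0 * sum(ratios) / len(ratios)                        # average over corruptions
```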
|
|
|
**Hyperparameters.** The different self-learning variants have a range of hyperparameters such as the
learning rate or the stopping criterion. Our goal is to give a realistic estimate of the performance
to be expected in practice. To this end, we optimize hyperparameters for each variant of pseudo-labeling on a hold-out set of IN-C that contains four types of image corruptions (“speckle noise”,
|
“Gaussian blur”, “saturate” and “spatter”) with five different strengths each, following the procedure |
|
suggested in Hendrycks & Dietterich (2019). We refer to the hold-out set of IN-C as our dev set. |
|
|
|
**Models for ImageNet-scale datasets.** We consider four popular model architectures: ResNet50 |
|
(He et al., 2016b), DenseNet161 (Huang et al., 2017), ResNeXt101 (Xie et al., 2017) and |
|
EfficientNet-L2 (Tan & Le, 2019) (see Appendix B.1 for details on the used models). For |
|
ResNet50, DenseNet and ResNeXt101, we include a simple vanilla version trained on IN only. For |
|
ResNet50 and ResNeXt101, we additionally include a state-of-the-art robust version trained with |
|
DeepAugment and Augmix (DAug+AM, Hendrycks et al., 2020a)[1]. For the ResNeXt model, we |
|
also include a version that was trained on 3.5 billion weakly labeled images (IG-3.5B, Mahajan et al., |
|
2018). Finally, for EfficientNet-L2 we select the current state of the art on IN-C which was trained |
|
on 300 million images from JFT-300M (Chollet, 2017; Hinton et al., 2014) using a noisy student-teacher protocol (Xie et al., 2020a). We validate the IN and IN-C performance of all considered
|
models and match the originally reported scores (Schneider et al., 2020). For EfficientNet-L2, we |
|
match IN top-1 accuracy up to 0.1% points, and IN-C up to 0.6% mCE. |
|
|
|
**Models for CIFAR10/MNIST-scale datasets.** For CIFAR10-C experiments, we use two |
|
WideResNets (WRN, Zagoruyko & Komodakis, 2016): the first one is trained on CIFAR10 and |
|
has a depth of 28 and a width of 10 and the second one is trained with AugMix (Hendrycks et al., |
|
2020b) and has a depth of 40 and a width of 2. The remaining small-scale models are trained with |
|
unsupervised domain adaptation (UDA) methods. We propose to regard any UDA method which |
|
requires joint training with source and target data as a pre-training step, similar to regular pre-training on IN, and use self-learning on top of the final checkpoint. We consider two popular UDA
|
methods: self-supervised domain adaptation (UDA-SS; Sun et al., 2019a) and Domain-Adversarial |
|
Training of Neural Networks (DANN; Ganin et al., 2016). In UDA-SS, the authors seek to align the |
|
representations of both domains by performing an auxiliary self-supervised task on both domains |
|
simultaneously. In all UDA-SS experiments, we use a WideResNet with a depth of 26 and a width of |
|
16. In DANN, the authors learn a domain-invariant embedding by optimizing a minimax objective. |
|
For all DANN experiments except for MNIST→MNIST-M, we use the same WRN architecture as |
|
above. For the MNIST→MNIST-M experiment, the training with the larger model diverged and |
|
we used a smaller WideResNet version with a width of 2. We note that DANN training involves |
|
optimizing a minimax objective and is generally harder to tune. |
|
|
|
4 RESULTS: SELF-LEARNING UNIVERSALLY IMPROVES MODELS |
|
|
|
Self-learning is a powerful learning scheme, and in the following section we show that it allows us to
perform test-time adaptation on robustified models, models obtained with large-scale pre-training,
|
as well as domain adapted models across a wide range of datasets and distribution shifts. Our main |
|
results on large-scale and small-scale datasets are shown in Tables 1 and 2, respectively. These |
|
|
|
summary tables show final results, and all experiments use the hyperparameters we determined
separately on the dev set.

[1] see leaderboard at github.com/hendrycks/robustness
|
|
|
**Table 1: Self-learning successfully adapts ImageNet-scale models across different model** |
|
**architectures on IN-C, IN-A and IN-R. We adapt the vanilla ResNet50, ResNeXt101 and** |
|
DenseNet161 models to IN-C and decrease the mCE by over 19 percentage points in all models. Further,
|
self-learning works for models irrespective of their size: Self-learning substantially improves the |
|
performance of the ResNet50 and the ResNeXt101 trained with DAug+AM on IN-C by 11.9 and
|
9.7 percent points, respectively. Finally, we further improve the current state-of-the-art model on |
|
IN-C—the EfficientNet-L2 Noisy Student model—and report a new state-of-the-art result of 22% |
|
mCE (which corresponds to a top1 error of 17.1%) on this benchmark with test-time adaptation |
|
(compared to 28% mCE without adaptation). |
|
|
|
mCE [%] on IN-C test (↘):

| Model | # parameters | w/o adapt | w/ adapt (RPL) | ∆ |
|---|---|---|---|---|
| ResNet50 vanilla (He et al., 2016b) | 2.6 × 10^7 | 76.7 | 50.5 | (-26.2) |
| ResNet50 DAug+AM (Hendrycks et al., 2020a) | 2.6 × 10^7 | 53.6 | 41.7 | (-11.9) |
| DenseNet161 vanilla (Huang et al., 2017) | 2.8 × 10^7 | 66.4 | 47.0 | (-19.4) |
| ResNeXt101 32×8d vanilla (Xie et al., 2017) | 8.8 × 10^7 | 66.6 | 43.2 | (-23.4) |
| ResNeXt101 32×8d DAug+AM (Hendrycks et al., 2020a) | 8.8 × 10^7 | 44.5 | 34.8 | (-9.7) |
| ResNeXt101 32×8d IG-3.5B (Mahajan et al., 2018) | 8.8 × 10^7 | 51.7 | 40.9 | (-10.8) |
| EfficientNet-L2 Noisy Student (Xie et al., 2020a) | 4.8 × 10^8 | 28.3 | **22.0** | (-6.3) |

top1 error [%] on IN-R (↘):

| Model | # parameters | w/o adapt | w/ adapt (RPL) | ∆ |
|---|---|---|---|---|
| ResNet50 vanilla (He et al., 2016b) | 2.6 × 10^7 | 63.8 | 54.1 | (-9.7) |
| EfficientNet-L2 Noisy Student (Xie et al., 2020a) | 4.8 × 10^8 | 23.5 | **17.4** | (-6.1) |

top1 error [%] on ImageNet-A (↘):

| Model | # parameters | w/o adapt | w/ adapt (RPL) | ∆ |
|---|---|---|---|---|
| EfficientNet-L2 Noisy Student (Xie et al., 2020a) | 4.8 × 10^8 | 16.5 | **14.8** | (-1.7) |
|
|
|
|
|
Self-learning is not limited to the distribution shifts in IN-C like compression artefacts or blur. |
|
On IN-R, a dataset with renditions, self-learning improves both the vanilla ResNet50 and the |
|
EfficientNet-L2 model, the latter of which improves from 23.5% to a new state-of-the art of 17.4% |
|
top-1 error. For a vanilla ResNet50, we improve the top-1 error from 63.8% (Hendrycks et al., |
|
2020a) to 54.1%. On IN-A, adapting the EfficientNet-L2 model using self-learning decreases the |
|
top-1 error from 16.5% (Xie et al., 2020a) to 14.8% top-1 error, again constituting a new state of the |
|
art with test-time adaptation on this dataset. |
|
|
|
**Table 2:** **Self-learning improves robustified and domain adapted models on small-scale** |
|
**datasets. We test common domain adaptation techniques like DANN (Ganin et al., 2016) and** |
|
UDA-SS (Sun et al., 2019a), and show that self-learning is effective at further tuning such models |
|
to the target domain. We suggest to view unsupervised source/target domain adaptation as a step |
|
comparable to pre-training under corruptions, rather than an adaptation technique specifically tuned |
|
to the target set—indeed, we can achieve error rates using, e.g., DANN + target adaptation previously |
|
only possible with source/target based pseudo-labeling, across different common domain adaptation |
|
benchmarks. Self-learning also decreases the error on CIFAR10-C of the Wide ResNet model |
|
trained with AugMix (AM, Hendrycks et al., 2020b) and reaches a new state of the art on CIFAR10-C of 8.5% top1 error with test-time adaptation. †denotes preliminary results on CIFAR10-C dev only,
|
due to instabilities in training the adversarial network in DANN. |
|
|
|
top1 error [%] on CIFAR10-C (↘):

| Model | # parameters | w/o adapt | w/ adapt (ENT) | ∆ |
|---|---|---|---|---|
| WRN-28-10 vanilla (Zagoruyko & Komodakis, 2016) | 3.6 × 10^7 | 26.5 | 13.3 | (-13.2) |
| WRN-40-2 AM (Hendrycks et al., 2020b) | 2.2 × 10^6 | 11.2 | 8.5 | (-2.7) |
| WRN-26-16 UDA-SS (Sun et al., 2019a) | 9.3 × 10^7 | 27.7 | 16.7 | (-11.0) |
| WRN-26-16 DANN (Ganin et al., 2016) | 9.3 × 10^7 | †29.7 | †28.5 | (-1.2) |

UDA CIFAR10→STL10, top1 error on target [%] (↘):

| Model | # parameters | w/o adapt | w/ adapt (ENT) | ∆ |
|---|---|---|---|---|
| WRN-26-16 UDA-SS (Sun et al., 2019a) | 9.3 × 10^7 | 28.7 | 21.8 | (-6.9) |
| WRN-26-16 DANN (Ganin et al., 2016) | 9.3 × 10^7 | 25.0 | 23.9 | (-1.1) |

UDA MNIST→MNIST-M, top1 error on target [%] (↘):

| Model | # parameters | w/o adapt | w/ adapt (ENT) | ∆ |
|---|---|---|---|---|
| WRN-26-16 UDA-SS (Sun et al., 2019a) | 9.3 × 10^7 | 4.8 | 2.0 | (-2.8) |
| WRN-26-2 DANN (Ganin et al., 2016) | 1.5 × 10^6 | 11.4 | 5.1 | (-6.3) |
|
|
|
|
|
----- |
|
|
|
**Table 3: Self-learning also improves large pre-trained models. Unlike BatchNorm adaptation** |
|
(Schneider et al., 2020), we show that self-learning transfers well to models pre-trained on a large |
|
amount of unlabeled data: self-learning decreases the mCE on IN-C of the ResNeXt101 trained on |
|
3.5 billion weakly labeled samples (IG-3.5B, Mahajan et al., 2018) from 51.7% to 40.9%. |
|
|
|
| mCE on IN-C test [%] (↘) | no adaptation | BN adaptation | self-learning |
|---|---|---|---|
| ResNeXt101 32×8d vanilla | 66.6 | 56.8 | 43.2 |
| ResNeXt101 32×8d IG-3.5B | 51.7 | 51.8 | **40.9** |
|
|
|
**Table 4: Self-learning outperforms previously published test-time adaptation approaches on** |
|
**IN-C. The robustness benchmark IN-C has so far mostly been regarded in the ad-hoc evaluation** |
|
setting as discussed in our introduction. Thus, there are only few published methods that report |
|
numbers for test-time adaptation: BatchNorm adaptation (Schneider et al., 2020), Test-Time |
|
Training (TTT, Sun et al., 2019b), and TENT (Wang et al., 2020). In particular, note that TTT |
|
requires a special loss function at training time, while our approach is agnostic to the pre-training |
|
phase. Our self-training results outperform all three baselines (also after tuning TENT with our full
|
experimental protocol): |
|
|
|
| mCE on IN-C test [%] (↘) | w/o adapt | BN adapt | TENT (ours) | self-learning |
|---|---|---|---|---|
| ResNet50 vanilla | 76.7 | 62.2 | 53.5 (51.6) | **50.5** |

| top1 error [%] on IN-C, sev. 5 (↘) | w/o adapt | BN adapt | TTT | self-learning |
|---|---|---|---|---|
| ResNet18 vanilla | 85.4 | 72.2 | 66.3 | **61.9** |
|
|
|
**Table 5:** **Self-supervised methods based on self-learning allow out-of-the-box test-time** |
|
**adaptation. The recently published DINO method (Caron et al., 2021) is another variant of self-** |
|
supervised learning that has proven to be effective for unsupervised representation learning. At the |
|
core, the method uses soft pseudo-labeling. Here, we test whether a model trained with DINO on the |
|
source dataset can be test-time adapted on IN-C using DINO to further improve out-of-distribution |
|
performance. Since the used model is a vision transformer model, we test different choices of |
|
adaptation parameters and find considerable performance improvements in all cases, yielding an |
|
mCE of 43.5% at a parameter count comparable to a ResNet50 model. For adapting the affine
|
layers, we follow Houlsby et al. (2019): |
|
|
|
| mCE on IN-C test [%] (↘) | w/o adapt | w/ adapt: affine layers | w/ adapt: bottleneck layers | w/ adapt: lin. layers | w/ adapt: all weights |
|---|---|---|---|---|---|
| ViT-S/16 | 62.3 | 51.8 | 46.8 | 45.2 | **43.5** |
|
|
|
5 UNDERSTANDING TEST-TIME ADAPTATION WITH SELF-LEARNING |
|
|
|
In the following section, we show ablations and interesting insights into using self-learning for test-time adaptation. If not specified otherwise, all ablations are run on the holdout corruptions of IN-C
|
(our dev set) with a vanilla ResNet50. |
|
|
|
**Table 6: Robust pseudo-labeling outperforms entropy minimization on large-scale datasets** |
|
**while the reverse is true on small-scale datasets. We find that robust pseudo-labeling consistently** |
|
improves over entropy minimization on IN-C, while entropy minimization performs better on |
|
smaller scale data (CIFAR10, STL10, MNIST). The finding highlights the importance of testing |
|
both algorithms on new datasets. The improvement is typically on the order of one percentage point:
|
|
|
|
|
| mCE, IN-C dev | ResNet50 | ResNeXt-101 | EfficientNet-L2 |
|---|---|---|---|
| ENT | 50.0 ± 0.04 | 43.0 | 22.2 |
| RPL | **48.9 ± 0.02** | **42.0** | **21.3** |

| top-1 err, CIFAR10-C | WRN-40 |
|---|---|
| ENT | **8.5** |
| RPL | 9.0 |
|
|
|
**Table 7: Robust pseudo-labeling allows usage of the full dataset without a threshold. Classical** |
|
hard labeling needs a confidence threshold (T) for best performance, thereby reducing the dataset |
|
size, while best performance for RPL is reached for full dataset training with a threshold T of 0.0: |
|
|
|
| diff. self-learning methods | no adapt | soft PL | hard PL (T=0.0) | hard PL (T=0.5) | hard PL (T=0.9) | RPL (T=0.0) | RPL (T=0.5) | RPL (T=0.9) |
|---|---|---|---|---|---|---|---|---|
| mCE on IN-C dev [%] | 69.5 | 60.1 | 53.8 | **51.9** | 52.4 | **49.7** | 49.9 | 51.8 |
|
|
|
|
|
----- |
|
|
|
**Table 8: Short update intervals are crucial for fast adaptation.** Having established that |
|
RPL generally performs better than soft- and hard-labeling, we vary the update interval for the |
|
teacher. We find that instant updates are most effective. In entropy minimization, the update interval |
|
is instant by default.
|
|
|
| Update interval for RPL | w/o adapt | no update | epoch | instant |
|---|---|---|---|---|
| mCE on IN-C dev [%] | 69.5 | 54.0 | 49.7 | **49.2** |
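
To make the update intervals concrete, a schematic adaptation loop is sketched below; the helper names and optimizer handling are illustrative and simplified (e.g., early stopping and thresholding are omitted), not a reproduction of our exact training code.

```python
import copy
import torch

def adapt(model, target_loader, optimizer, loss_fn, epochs=1, update="instant"):
    """Source-free adaptation with a teacher that is refreshed never,
    once per epoch, or after every student update ("instant")."""
    teacher = copy.deepcopy(model).eval()              # used for "no update" / "epoch"
    for _ in range(epochs):
        for x in target_loader:
            with torch.no_grad():
                teacher_logits = (model if update == "instant" else teacher)(x)
            loss = loss_fn(model(x), teacher_logits)   # e.g., the RPL/GCE loss above
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        if update == "epoch":
            teacher = copy.deepcopy(model).eval()      # refresh the teacher once per epoch
    return model
```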
|
|
|
**Table 9: Adaptation of only affine layers is important in CNNs. On IN-C, adapting only the** |
|
affine parameters after the normalization layers (i.e., the rescaling and shift parameters β and γ) |
|
works better on a ResNet50 architecture than adapting all parameters or only the last layer. We |
|
indicate the number of adapted parameters in brackets. |
|
|
|
| Adaptation mechanism | w/o adapt | last layer | full model | affine |
|---|---|---|---|---|
| mCE on IN-C dev [%] | 69.5 [0] | 60.2 [2M] | 51.5 [22.6M] | **48.9 [5.3k]** |
|
|
|
Note that for Vision Transformers, full model adaptation works better than affine adaptation (see |
|
Table 5). We also noticed that on convolutional models with a smaller parameter count like |
|
ResNet18, full model adaptation is possible. |
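
A minimal sketch of how adaptation can be restricted to the affine normalization parameters in a PyTorch model is shown below; the set of normalization module types is an assumption, and details such as the handling of BatchNorm statistics are omitted.

```python
import torch.nn as nn

def affine_adaptation_parameters(model):
    """Collect only the scale/shift parameters of normalization layers for the optimizer."""
    params = []
    for module in model.modules():
        if isinstance(module, (nn.BatchNorm2d, nn.GroupNorm, nn.LayerNorm)):
            params += [p for p in (module.weight, module.bias) if p is not None]
    return params

# e.g., optimizer = torch.optim.SGD(affine_adaptation_parameters(model), lr=1e-3)
```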
|
|
|
**Hyperparameters obtained on corruption datasets transfer well to real world datasets. When** |
|
evaluating models, we select the hyperparameters discussed above (the learning rate and the epoch |
|
used for early stopping are the most critical ones) on the holdout set of IN-C. We note that this |
|
technique transfers well to IN-R, -A and -D, highlighting the practical value of corruption robustness |
|
datasets for adapting models on real distribution shifts. |
|
|
|
On IN-D, we performed a control experiment where we selected hyperparameters with leave-one-out cross-validation; this selection scheme actually performed worse than IN-C parameter selection
|
(see Appendix D.1). |
|
|
|
6 ADAPTING MODELS ON A WIDER RANGE OF DISTRIBUTION SHIFTS |
|
REVEALS LIMITATIONS OF ROBUSTIFICATION AND ADAPTATION METHODS |
|
|
|
Robustness datasets on ImageNet-scale have so far been limited to a few selected domains (image |
|
corruptions in IN-C, image renditions in IN-R, difficult images for ResNet50 classifiers in IN-A). |
|
In order to test our approach on a wider range of complex distribution shifts, we re-purpose the |
|
dataset from the Visual Domain Adaptation Challenge 2019 (DomainNet, Saenko et al., 2019) as an |
|
additional robustness benchmark. This dataset comes with six image styles: Clipart, Real, Infograph, |
|
Painting, Quickdraw and Sketch. It has 345 classes in total, of which 164 overlap with IN. To |
|
benchmark robustness of IN trained models out of the box, we filter out the classes that cannot be |
|
mapped to IN and refer to the smaller version of DomainNet as ImageNet-D (IN-D). We map 463 |
|
classes in IN to these 164 IN-D classes, e.g., for an image from the “bird” class in IN-D, we accept |
|
all 39 bird classes in IN as valid predictions. We show example images from IN-D in Table 10. The |
|
detailed evaluation protocol along with justifications for our design choices and additional analysis |
|
are outlined in Appendix D. |
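
To illustrate the evaluation protocol, a hedged sketch of top-1 evaluation under a many-to-one class mapping is given below; the tensor-based mapping and variable names are illustrative only, and the actual mapping and protocol are described in Appendix D.

```python
import torch

def ind_top1_accuracy(imagenet_logits, ind_targets, in_to_ind):
    """in_to_ind: LongTensor of length 1000 mapping every IN class to its IN-D class
    (or -1 for IN classes without a counterpart). A prediction counts as correct
    whenever the predicted IN class maps to the IN-D target, so several IN classes
    can be valid for one IN-D class."""
    predicted_in_class = imagenet_logits.argmax(dim=1)
    predicted_ind_class = in_to_ind[predicted_in_class]
    return (predicted_ind_class == ind_targets).float().mean()
```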
|
|
|
The benefit of IN-D over DomainNet is the re-mapping to ImageNet classes which allows robustness |
|
researchers to easily benchmark on this dataset without the need to re-train a model (as is common
|
in UDA). To test whether self-learning is helpful for more complex distribution shifts, we adapt a |
|
vanilla ResNet50, several robust IN-C models and the EfficientNet-L2 Noisy Student model on IN-D. We use the same hyperparameters we obtained on IN-C dev for all our IN-D experiments. We
|
show our main results in Table 10. |
|
|
|
**More robust models perform better on IN-D. Comparing the performance of the vanilla ResNet50** |
|
model to its robust DAug+AM variant, we find that the DAug+AM model performs better on all |
|
domains, with the most significant gains on the “Clipart”, “Painting” and “Sketch” domains. We |
|
show detailed results for all domains and all tested models in Appendix D.2, along with results |
|
on IN-C and IN-R for comparison. We find that the best performing models on IN-D are also the |
|
|
|
|
|
|
|
|
Table 10: Self-learning decreases the top1 error on some IN-D domains but increases it on others. |
|
|
|
| model | Real (w/o / w/) | Painting (w/o / w/) | Clipart (w/o / w/) | Sketch (w/o / w/) | Infograph (w/o / w/) | Quickdraw (w/o / w/) |
|---|---|---|---|---|---|---|
| EffNet-L2 Noisy Student | 29.2 / **27.9** | 42.7 / **40.9** | 45.0 / **37.9** | 56.4 / **51.5** | **77.9** / 94.3 | **98.4** / 99.4 |
| ResNet50 DAug+AM | 39.2 / 36.5 | 58.7 / 53.4 | 68.4 / 57.0 | 75.2 / 61.3 | 88.1 / 83.2 | 98.2 / 99.1 |
| ResNet50 vanilla | 40.1 / 37.3 | 65.1 / 57.8 | 76.0 / 63.6 | 82.0 / 73.0 | 89.6 / 85.1 | 99.2 / 99.8 |
|
|
|
strongest ones on IN-C and IN-R which indicates good generalization capabilities of the techniques |
|
combined for these models, given the large differences between the three considered datasets. |
|
However, even the best models perform 20 to 30 percentage points worse on IN-D compared to |
|
their performance on IN-C or IN-R, indicating that IN-D might be a more challenging benchmark. |
|
|
|
**All models struggle with some domains of IN-D. The EfficientNet-L2 Noisy Student model** |
|
obtains the best results on most domains. However, we note that the overall error rates are |
|
surprisingly high compared to the model’s strong performance on the other considered datasets |
|
(IN-A: 14.8% top-1 error, IN-R: 17.4% top-1 error, IN-C: 22.0% mCE). Even on the “Real” domain |
|
closest to clean IN where the EfficientNet-L2 model has a top-1 error of 11.6%, the model only |
|
reaches a top-1 error of 29.2%. Self-learning decreases the top1 error on all domains except for |
|
“Infograph” and “Quickdraw”. We note that both domains have very high error rates from the |
|
beginning and thus hypothesize that the produced pseudo-labels are of low quality. |
|
|
|
**Error analysis on IN-D. We investigate the errors a ResNet50 model makes on IN-D by analyzing** |
|
the most frequently predicted classes for different domains to reveal systematic errors indicative |
|
of the encountered distribution shifts. We find most errors interpretable: the classifier assigns the |
|
label “comic book” to images from the “Clipart” or “Painting” domains, “website” to images from |
|
the “Infograph” domain, and “envelope” to images from the “Sketch” domain. Thus, the classifier |
|
predicts the domain rather than the class. We find no systematic errors on the “Real” domain which |
|
is expected since this domain should be similar to IN. Detailed results on the top-3 most frequently |
|
predicted classes for different domains can be found in Fig. 9, Appendix D.4. |
|
|
|
**IN-D should be used as an additional robustness benchmark. While the error rates on IN-C,** |
|
-R and -A are at an acceptable level for our largest EfficientNet-L2 model after adaptation,
|
IN-D performance is consistently worse for all models. We propose to move from isolated |
|
benchmark settings like IN-R (single domain) to benchmarks more common in domain adaptation |
|
(like DomainNet) and make IN-D publicly available as an easy to use dataset for this purpose. |
|
|
|
**Additional experiments and limitations. We discuss additional proof-of-concept implementations** |
|
on the WILDS benchmark (Koh et al., 2021), BigTransfer (BiT; Chen et al., 2020a) models and |
|
on self-learning based UDA models in Appendix E. On WILDS, self-learning is effective for the |
|
Camelyon17 task with a systematic shift between train, validation and test sets (each set is comprised |
|
of different hospitals), while self-learning fails to improve on tasks with mixed domains. |
|
|
|
7 A SIMPLE MODEL OF STABILITY IN SELF-LEARNING |
|
|
|
We observed that different self-learning schemes are optimal for small-scale vs. large-scale datasets |
|
and varying numbers of classes. We reconsider the loss functions used, and unify them into
|
|
|
$$\ell(\mathbf{x}) = -\sum_j \sigma_j\!\left(\frac{\mathbf{f}^t(\mathbf{x})}{\tau_t}\right) \log \sigma_j\!\left(\frac{\mathbf{f}^s(\mathbf{x})}{\tau_s}\right), \qquad
\mathbf{f}^t(\mathbf{x}) =
\begin{cases}
\mathbf{f}(\mathbf{x}), & \text{entropy minimization} \\
\operatorname{sg}(\mathbf{f}(\mathbf{x})), & \text{pseudo-labeling.}
\end{cases}
\qquad (5)$$
|
|
|
|
|
We introduced student and teacher temperatures $\tau_s$ and $\tau_t$ as parameters in the softmax function
and the stop-gradient operation $\mathrm{sg}$. Caron et al. (2021) fixed $\tau_s$ and varied $\tau_t$ during training,
and empirically found an upper bound for $\tau_t$ above which the training was no longer stable.
To better understand such behavior, we study the learning dynamics of the loss function in
equation 5 theoretically in a simple two-datapoints, two-classes model with linear student and
teacher networks $f^s(\mathbf{x}) = \mathbf{x}^\top\mathbf{w}^s$ and $f^t(\mathbf{x}) = \mathbf{x}^\top\mathbf{w}^t$ defined in Appendix A.1. Gradient
descent with stop gradient corresponds to hard pseudo-labeling in the limit $\tau_t \to 0$ and to
soft pseudo-labeling when $\tau_s = \tau_t = 1$. Gradient descent without stop gradient, i.e., setting
$\mathbf{w}^s = \mathbf{w}^t = \mathbf{w}$, corresponds to entropy minimization. We obtain the following result:
|
|
|
|
|
|
|
|
|
|
**Proposition 1 (Collapse in the two-point model).** _The student and teacher networks $\mathbf{w}^s$ and $\mathbf{w}^t$ trained with stop gradient do not collapse to the trivial representation $\forall \mathbf{x}: \mathbf{x}^\top\mathbf{w}^s = 0,\ \mathbf{x}^\top\mathbf{w}^t = 0$ if $\tau_s > \tau_t$. The network $\mathbf{w}$ trained without stop gradient does not collapse if $\tau_s > \tau_t/2$._ Proof: see § A.2.
|
|
|
We validate the proposition on a simulated two-datapoint toy dataset, as well as on the CIFAR10-C
dataset, and outline the results in Figure 2. In general, the size and location of the region where
collapse is observed in the simulated model also depend on the initial conditions, the learning rate
and the optimization procedure. An in-depth discussion, as well as additional simulations, are
given in the Appendix. In practice, the result suggests that *student temperatures should exceed the teacher temperatures for pseudo-labeling, and student temperatures should exceed half the teacher temperature for entropy minimization*.

Figure 2: For the two-point model, we show error, and for the CIFAR10-C simulation, we show
improvement (yellow) vs. degradation (purple) over the non-adapted baseline (BAS). An important
convergence criterion for pseudo-labeling (top row) and entropy minimization (bottom row) is the
ratio of student and teacher temperatures; it lies at $\tau_s = \tau_t$ for PL, and $2\tau_s = \tau_t$ for ENT. Despite
the simplicity of the two-point model, the general convergence regions transfer to CIFAR10-C.

Entropy minimization with standard temperatures ($\tau_s = \tau_t = 1$) and hard pseudo-labeling ($\tau_t \to 0$)
are hence stable. The two-point learning dynamics vanish for soft pseudo-labeling with $\tau_s = \tau_t$,
suggesting that one would have to analyze a more complex model with more data points. While
this does not directly imply that the learning is unstable at this point, we empirically observe that
both entropy minimization and hard labeling outperform soft labeling in practice.
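
To make the unified objective in Eq. 5 concrete, the sketch below implements it for a single network with explicit student and teacher temperatures and the stop-gradient operation; the function and argument names are illustrative.

```python
import torch

def unified_self_learning_loss(logits, tau_s=1.0, tau_t=1.0, stop_gradient=True):
    """Eq. 5 for a single network f: pseudo-labeling applies sg(.) to the teacher
    branch (detach), entropy minimization back-propagates through both branches."""
    teacher_logits = logits.detach() if stop_gradient else logits   # sg(f(x)) vs. f(x)
    teacher_probs = (teacher_logits / tau_t).softmax(dim=1)
    student_log_probs = (logits / tau_s).log_softmax(dim=1)
    return -(teacher_probs * student_log_probs).sum(dim=1).mean()
```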
|
|
|
|
|
8 CONCLUSION |
|
|
|
We evaluated and analysed how self-learning, an essential component in many unsupervised domain |
|
adaptation and self-supervised pre-training techniques, can be applied for adaptation to both small |
|
and large-scale image recognition problems common in robustness research. We demonstrated new |
|
state-of-the-art adaptation results with the EfficientNet-L2 model on the benchmarks ImageNet-C, |
|
-R, and -A, and introduced a new benchmark dataset (ImageNet-D) which remains challenging even |
|
after adaptation. Our theoretical analysis shows the influence of the temperature parameter in the |
|
self-learning loss function on the training stability and provides guidelines on how to choose a suitable
value. Self-learning universally improves test-time performance under diverse but systematic
distribution shifts irrespective of the architecture or pre-training method. We hope that our work
|
encourages both researchers and practitioners to use self-learning if their data distribution shifts. |
|
|
|
|
|
**Reproducibility Statement** We attempted to make our work as reproducible as possible: We |
|
mostly used pre-trained models which are publicly available and we provide the URLs
of all used checkpoints; for the checkpoints that had to be retrained, we report the GitHub
|
directories with the source code and used an official or verified reference implementation when |
|
available. We report all used hyperparameters in the Appendix and will release our code upon |
|
acceptance of the paper. |
|
|
|
|
|
----- |
|
|
|
REFERENCES |
|
|
|
Mart´ın Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu |
|
Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. Tensorflow: A system for |
|
large-scale machine learning. In 12th {USENIX} Symposium on Operating Systems Design and |
|
_Implementation ({OSDI} 16), pp. 265–283, 2016. 37_ |
|
|
|
Dan Hendrycks, Kevin Zhao, Steven Basart, Jacob Steinhardt, and Dawn Song. Natural adversarial examples. ArXiv preprint, abs/1907.07174, 2019. URL
|
|
|
[https://arxiv.org/abs/1907.07174. 4](https://arxiv.org/abs/1907.07174) |
|
|
|
Christopher Berner, Greg Brockman, Brooke Chan, Vicki Cheung, Przemyslaw Debiak, Christy |
|
Dennison, David Farhi, Quirin Fischer, Shariq Hashme, Chris Hesse, et al. Dota 2 with large |
|
[scale deep reinforcement learning. ArXiv preprint, abs/1912.06680, 2019. URL https://](https://arxiv.org/abs/1912.06680) |
|
[arxiv.org/abs/1912.06680. 1](https://arxiv.org/abs/1912.06680) |
|
|
|
David Berthelot, Rebecca Roelofs, Kihyuk Sohn, Nicholas Carlini, and Alex Kurakin. Adamatch: |
|
A unified approach to semi-supervised learning and domain adaptation, 2021. 2, 35 |
|
|
|
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla |
|
Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini |
|
Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, |
|
Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric |
|
Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam |
|
McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are fewshot learners. In Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, |
|
and Hsuan-Tien Lin (eds.), Advances in Neural Information Processing Systems 33: Annual |
|
_Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12,_ |
|
_[2020, virtual, 2020. URL https://proceedings.neurips.cc/paper/2020/hash/](https://proceedings.neurips.cc/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html)_ |
|
[1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html. 1](https://proceedings.neurips.cc/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html) |
|
|
|
Tianle Cai, Ruiqi Gao, Jason D Lee, and Qi Lei. A theory of label propagation for subpopulation |
|
shift. arXiv preprint arXiv:2102.11203, 2021. 2, 35 |
|
|
|
Mathilde Caron, Hugo Touvron, Ishan Misra, Herv´e J´egou, Julien Mairal, Piotr Bojanowski, and |
|
Armand Joulin. Emerging properties in self-supervised vision transformers. _ArXiv preprint,_ |
|
[abs/2104.14294, 2021. URL https://arxiv.org/abs/2104.14294. 3, 6, 8, 21](https://arxiv.org/abs/2104.14294) |
|
|
|
Ting Chen, Simon Kornblith, Kevin Swersky, Mohammad Norouzi, and Geoffrey E. |
|
Hinton. Big self-supervised models are strong semi-supervised learners. In Hugo |
|
Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien |
|
Lin (eds.), Advances in Neural Information Processing Systems 33: _Annual Conference_ |
|
_on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020,_ |
|
_virtual, 2020a._ [URL https://proceedings.neurips.cc/paper/2020/hash/](https://proceedings.neurips.cc/paper/2020/hash/fcbc95ccdd551da181207c0c1400c655-Abstract.html) |
|
[fcbc95ccdd551da181207c0c1400c655-Abstract.html. 8](https://proceedings.neurips.cc/paper/2020/hash/fcbc95ccdd551da181207c0c1400c655-Abstract.html) |
|
|
|
Yining Chen, Colin Wei, Ananya Kumar, and Tengyu Ma. Self-training avoids using spurious |
|
features under domain shift. In NeurIPS, 2020b. 2, 35 |
|
|
|
Franc¸ois Chollet. Xception: Deep learning with depthwise separable convolutions. In 2017 IEEE |
|
_Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July_ |
|
_21-26, 2017, pp. 1800–1807. IEEE Computer Society, 2017. doi: 10.1109/CVPR.2017.195. URL_ |
|
[https://doi.org/10.1109/CVPR.2017.195. 4, 20](https://doi.org/10.1109/CVPR.2017.195) |
|
|
|
Adam Coates, Andrew Ng, and Honglak Lee. An analysis of single-layer networks in unsupervised |
|
feature learning. In Proceedings of the Fourteenth International Conference on Artificial |
|
_Intelligence and Statistics, 2011. 4_ |
|
|
|
Francesco Croce, Maksym Andriushchenko, Vikash Sehwag, Edoardo Debenedetti, Nicolas |
|
Flammarion, Mung Chiang, Prateek Mittal, and Matthias Hein. Robustbench: a standardized |
|
adversarial robustness benchmark. _ArXiv preprint, abs/2010.09670, 2020._ [URL https:](https://arxiv.org/abs/2010.09670) |
|
[//arxiv.org/abs/2010.09670. 21](https://arxiv.org/abs/2010.09670) |
|
|
|
|
|
----- |
|
|
|
Fabio De Sousa Ribeiro, Francesco Caliv´a, Mark Swainson, Kjartan Gudmundsson, Georgios |
|
Leontidis, and Stefanos Kollias. Deep bayesian self-training. _Neural Computing and_ |
|
_Applications, 32(9):4275–4291, 2020. 2, 36_ |
|
|
|
Li Deng. The mnist database of handwritten digit images for machine learning research. IEEE |
|
_Signal Processing Magazine, 29(6):141–142, 2012. 4_ |
|
|
|
Samuel F. Dodge and Lina J. Karam. A study and comparison of human and deep learning |
|
recognition performance under visual distortions. In International Conference on Computer |
|
_Communications and Networks, ICCCN 2017, 2017. 1_ |
|
|
|
Geoffrey French, Michal Mackiewicz, and Mark H. Fisher. Self-ensembling for visual domain |
|
adaptation. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, |
|
_BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net, 2018._ |
|
[URL https://openreview.net/forum?id=rkpoTaxA-. 3, 33, 37](https://openreview.net/forum?id=rkpoTaxA-) |
|
|
|
Aram Galstyan and Paul R. Cohen. Empirical comparison of hard and soft label propagation for |
|
relational classification. In 17th international conference on Inductive logic programming, 2007. |
|
3 |
|
|
|
Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, Franc¸ois |
|
Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural |
|
networks. The journal of machine learning research, 17(1):2096–2030, 2016. 2, 4, 5 |
|
|
|
Robert Geirhos, Carlos R. Medina Temme, Jonas Rauber, Heiko H. Sch¨utt, Matthias Bethge, and |
|
Felix A. Wichmann. Generalisation in humans and deep neural networks. In Samy Bengio, |
|
Hanna M. Wallach, Hugo Larochelle, Kristen Grauman, Nicol`o Cesa-Bianchi, and Roman |
|
Garnett (eds.), Advances in Neural Information Processing Systems 31: Annual Conference on |
|
_Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montr´eal,_ |
|
_Canada, pp. 7549–7561, 2018._ [URL https://proceedings.neurips.cc/paper/](https://proceedings.neurips.cc/paper/2018/hash/0937fb5864ed06ffb59ae5f9b5ed67a9-Abstract.html) |
|
[2018/hash/0937fb5864ed06ffb59ae5f9b5ed67a9-Abstract.html. 1](https://proceedings.neurips.cc/paper/2018/hash/0937fb5864ed06ffb59ae5f9b5ed67a9-Abstract.html) |
|
|
|
Robert Geirhos, Patricia Rubisch, Claudio Michaelis, Matthias Bethge, Felix A. Wichmann, and |
|
Wieland Brendel. Imagenet-trained cnns are biased towards texture; increasing shape bias |
|
improves accuracy and robustness. In 7th International Conference on Learning Representations, |
|
_ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019._ [URL https:](https://openreview.net/forum?id=Bygh9j09KX) |
|
[//openreview.net/forum?id=Bygh9j09KX. 1, 27](https://openreview.net/forum?id=Bygh9j09KX) |
|
|
|
Aritra Ghosh, Himanshu Kumar, and P. S. Sastry. Robust loss functions under label noise for deep |
|
neural networks. In Satinder P. Singh and Shaul Markovitch (eds.), Proceedings of the Thirty-First |
|
_AAAI Conference on Artificial Intelligence, February 4-9, 2017, San Francisco, California, USA,_ |
|
[pp. 1919–1925. AAAI Press, 2017. URL http://aaai.org/ocs/index.php/AAAI/](http://aaai.org/ocs/index.php/AAAI/AAAI17/paper/view/14759) |
|
[AAAI17/paper/view/14759. 3](http://aaai.org/ocs/index.php/AAAI/AAAI17/paper/view/14759) |
|
|
|
Yves Grandvalet and Yoshua Bengio. Semi-supervised learning by entropy minimization. |
|
In Advances in Neural Information Processing Systems 17 [Neural Information Processing |
|
_Systems, NIPS 2004, December 13-18, 2004, Vancouver, British Columbia, Canada], pp._ |
|
529–536, 2004. [URL https://proceedings.neurips.cc/paper/2004/hash/](https://proceedings.neurips.cc/paper/2004/hash/96f2b50b5d3613adf9c27049b2a888c7-Abstract.html) |
|
[96f2b50b5d3613adf9c27049b2a888c7-Abstract.html. 3](https://proceedings.neurips.cc/paper/2004/hash/96f2b50b5d3613adf9c27049b2a888c7-Abstract.html) |
|
|
|
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image |
|
recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR |
|
_2016, Las Vegas, NV, USA, June 27-30, 2016, pp. 770–778. IEEE Computer Society, 2016a._ |
|
[doi: 10.1109/CVPR.2016.90. URL https://doi.org/10.1109/CVPR.2016.90. 1](https://doi.org/10.1109/CVPR.2016.90) |
|
|
|
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image |
|
recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR |
|
_2016, Las Vegas, NV, USA, June 27-30, 2016, pp. 770–778. IEEE Computer Society, 2016b._ |
|
[doi: 10.1109/CVPR.2016.90. URL https://doi.org/10.1109/CVPR.2016.90. 4, 5,](https://doi.org/10.1109/CVPR.2016.90) |
|
21 |
|
|
|
|
|
----- |
|
|
|
Dan Hendrycks and Thomas G. Dietterich. Benchmarking neural network robustness to common |
|
corruptions and perturbations. In 7th International Conference on Learning Representations, |
|
_ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019._ [URL https:](https://openreview.net/forum?id=HJz6tiCqYm) |
|
[//openreview.net/forum?id=HJz6tiCqYm. 4, 27](https://openreview.net/forum?id=HJz6tiCqYm) |
|
|
|
Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul |
|
Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, et al. The many faces of robustness: A critical |
|
analysis of out-of-distribution generalization. _ArXiv preprint, abs/2006.16241, 2020a._ URL |
|
|
|
[https://arxiv.org/abs/2006.16241. 1, 4, 5, 20, 21, 27](https://arxiv.org/abs/2006.16241) |
|
|
|
Dan Hendrycks, Norman Mu, Ekin Dogus Cubuk, Barret Zoph, Justin Gilmer, and Balaji |
|
Lakshminarayanan. Augmix: A simple data processing method to improve robustness and |
|
uncertainty. In 8th International Conference on Learning Representations, ICLR 2020, Addis |
|
_[Ababa, Ethiopia, April 26-30, 2020. OpenReview.net, 2020b. URL https://openreview.](https://openreview.net/forum?id=S1gmrxHFvB)_ |
|
[net/forum?id=S1gmrxHFvB. 4, 5, 21, 27](https://openreview.net/forum?id=S1gmrxHFvB) |
|
|
|
Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. In |
|
_NIPS Deep Learning Workshop, 2014. 4, 20_ |
|
|
|
Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, |
|
Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning |
|
for NLP. In Proceedings of the 36th International Conference on Machine Learning, 2019. 6 |
|
|
|
Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. Densely connected |
|
convolutional networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, |
|
_CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pp. 2261–2269. IEEE Computer Society,_ |
|
[2017. doi: 10.1109/CVPR.2017.243. URL https://doi.org/10.1109/CVPR.2017.](https://doi.org/10.1109/CVPR.2017.243) |
|
[243. 4, 5, 21, 32](https://doi.org/10.1109/CVPR.2017.243) |
|
|
|
Youngeun Kim, Donghyeon Cho, Kyeongtak Han, Priyadarshini Panda, and Sungeun Hong. |
|
Domain adaptation without source data. IEEE Transactions on Artificial Intelligence, 2021. 2 |
|
|
|
Pang Wei Koh, Shiori Sagawa, Henrik Marklund, Sang Michael Xie, Marvin Zhang, Akshay |
|
Balsubramani, Weihua Hu, Michihiro Yasunaga, Richard Lanas Phillips, Irena Gao, Tony Lee, |
|
Etienne David, Ian Stavness, Wei Guo, Berton A. Earnshaw, Imran S. Haque, Sara Beery, Jure |
|
Leskovec, Anshul Kundaje, Emma Pierson, Sergey Levine, Chelsea Finn, and Percy Liang. |
|
WILDS: A benchmark of in-the-wild distribution shifts. In International Conference on Machine |
|
_Learning (ICML), 2021. 8, 32, 37_ |
|
|
|
Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Joan Puigcerver, Jessica Yung, Sylvain Gelly, |
|
and Neil Houlsby. Big transfer (bit): General visual representation learning. In Computer Vision– |
|
_ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part_ |
|
_V 16, pp. 491–507. Springer, 2020. 33_ |
|
|
|
Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. |
|
2009. 4 |
|
|
|
Ananya Kumar, Tengyu Ma, and Percy Liang. Understanding self-training for gradual domain |
|
adaptation. In International Conference on Machine Learning, pp. 5468–5479. PMLR, 2020. 2, |
|
35 |
|
|
|
Jogendra Nath Kundu, Naveen Venkat, R Venkatesh Babu, et al. Universal source-free domain |
|
adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern |
|
_Recognition, pp. 4544–4553, 2020. 2_ |
|
|
|
Dong-Hyun Lee. Pseudo-label: The simple and efficient semi-supervised learning method for deep |
|
neural networks. In ICML Workshop : Challenges in Representation Learning (WREPL), 2013. 3 |
|
|
|
Rui Li, Qianfen Jiao, Wenming Cao, Hau-San Wong, and Si Wu. Model adaptation: Unsupervised |
|
domain adaptation without source data. In 2020 IEEE/CVF Conference on Computer Vision and |
|
_Pattern Recognition (CVPR), 2020. 2_ |
|
|
|
|
|
----- |
|
|
|
Jian Liang, Dapeng Hu, and Jiashi Feng. Do we really need to access the source data? Source hypothesis transfer for unsupervised domain adaptation. In International Conference on Machine Learning, 2020.

Dhruv Mahajan, Ross Girshick, Vignesh Ramanathan, Kaiming He, Manohar Paluri, Yixuan Li, Ashwin Bharambe, and Laurens van der Maaten. Exploring the limits of weakly supervised pretraining. In Proceedings of the European Conference on Computer Vision (ECCV), 2018.

Sébastien Marcel and Yann Rodriguez. Torchvision: the machine-vision package of Torch. In ACM International Conference on Multimedia, 2010.

Dirk Merkel. Docker: Lightweight Linux containers for consistent development and deployment. Linux Journal, 2014(239), 2014. ISSN 1075-3583.

Subhabrata Mukherjee and Ahmed Hassan Awadallah. Uncertainty-aware self-training for text classification with few labels. In NeurIPS, 2020.

Zachary Nado, Shreyas Padhy, D Sculley, Alexander D'Amour, Balaji Lakshminarayanan, and Jasper Snoek. Evaluating prediction-time batch normalization for robustness under covariate shift. ArXiv preprint, abs/2006.10963, 2020. URL https://arxiv.org/abs/2006.10963.

Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NIPS Autodiff Workshop, 2017.

Viraj Prabhu, Shivam Khare, Deeksha Kartik, and Judy Hoffman. SENTRY: Selective entropy optimization via committee consistency for unsupervised domain adaptation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 8558–8567, 2021.

Mamshad Nayeem Rizve, Kevin Duarte, Yogesh S Rawat, and Mubarak Shah. In defense of pseudo-labeling: An uncertainty-aware pseudo-label selection framework for semi-supervised learning. In ICLR, 2021.

Evgenia Rusak, Lukas Schott, Roland Zimmermann, Julian Bitterwolf, Oliver Bringmann, Matthias Bethge, and Wieland Brendel. Increasing the robustness of DNNs against image corruptions by playing the game of noise. ArXiv preprint, abs/2001.06057, 2020. URL https://arxiv.org/abs/2001.06057.

Kate Saenko, Xingchao Peng, Ben Usman, Kuniaki Saito, and Ping Hu. Visual Domain Adaptation Challenge (VisDA-2019), 2019. URL http://ai.bu.edu/visda-2019/.

Steffen Schneider, Evgenia Rusak, Luisa Eck, Oliver Bringmann, Wieland Brendel, and Matthias Bethge. Improving robustness against common corruptions by covariate shift adaptation. In Advances in Neural Information Processing Systems, 2020.

Jun Shu, Qian Zhao, Keyu Chen, Zongben Xu, and Deyu Meng. Learning adaptive loss for robust learning with noisy labels. ArXiv preprint, abs/2002.06482, 2020. URL https://arxiv.org/abs/2002.06482.

Rui Shu, Hung H. Bui, Hirokazu Narui, and Stefano Ermon. A DIRT-T approach to unsupervised domain adaptation. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net, 2018. URL https://openreview.net/forum?id=H1q-TM-AW.

Kihyuk Sohn, David Berthelot, Chun-Liang Li, Zizhao Zhang, Nicholas Carlini, Ekin D Cubuk, Alex Kurakin, Han Zhang, and Colin Raffel. FixMatch: Simplifying semi-supervised learning with consistency and confidence. In NeurIPS, 2020.

Hwanjun Song, Minseok Kim, Dongmin Park, and Jae-Gil Lee. Learning from noisy labels with deep neural networks: A survey. ArXiv preprint, abs/2007.08199, 2020. URL https://arxiv.org/abs/2007.08199.
|
|
|
Yu Sun, Eric Tzeng, Trevor Darrell, and Alexei A Efros. Unsupervised domain adaptation through self-supervision. ArXiv preprint, abs/1909.11825, 2019a. URL https://arxiv.org/abs/1909.11825.

Yu Sun, Xiaolong Wang, Zhuang Liu, John Miller, Alexei A Efros, and Moritz Hardt. Test-time training for out-of-distribution generalization. ArXiv preprint, abs/1909.13231, 2019b. URL https://arxiv.org/abs/1909.13231.

Mingxing Tan and Quoc V. Le. EfficientNet: Rethinking model scaling for convolutional neural networks. In Kamalika Chaudhuri and Ruslan Salakhutdinov (eds.), Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, pp. 6105–6114. PMLR, 2019. URL http://proceedings.mlr.press/v97/tan19a.html.

O. Tange. GNU Parallel - the command-line power tool. ;login: The USENIX Magazine, 36(1):42–47, 2011. URL http://www.gnu.org/s/parallel.

Pauli Virtanen, Ralf Gommers, Travis E. Oliphant, Matt Haberland, Tyler Reddy, David Cournapeau, Evgeni Burovski, Pearu Peterson, Warren Weckesser, Jonathan Bright, Stéfan J. van der Walt, Matthew Brett, Joshua Wilson, K. Jarrod Millman, Nikolay Mayorov, Andrew R. J. Nelson, Eric Jones, Robert Kern, Eric Larson, CJ Carey, İlhan Polat, Yu Feng, Eric W. Moore, Jake VanderPlas, Denis Laxalde, Josef Perktold, Robert Cimrman, Ian Henriksen, E. A. Quintero, Charles R Harris, Anne M. Archibald, Antônio H. Ribeiro, Fabian Pedregosa, Paul van Mulbregt, and SciPy 1.0 Contributors. SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python. Nature Methods, 17:261–272, 2020. doi: 10.1038/s41592-019-0686-2.

Dequan Wang, Evan Shelhamer, Shaoteng Liu, Bruno Olshausen, and Trevor Darrell. Fully test-time adaptation by entropy minimization. ArXiv preprint, abs/2006.10726, 2020. URL https://arxiv.org/abs/2006.10726.

Colin Wei, Kendrick Shen, Yining Chen, and Tengyu Ma. Theoretical analysis of self-training with deep networks on unlabeled data. In ICLR, 2020.

Ross Wightman. PyTorch image models. https://github.com/rwightman/pytorch-image-models, 2019.

Qizhe Xie, Minh-Thang Luong, Eduard H. Hovy, and Quoc V. Le. Self-training with Noisy Student improves ImageNet classification. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020, pp. 10684–10695. IEEE, 2020a. doi: 10.1109/CVPR42600.2020.01070. URL https://doi.org/10.1109/CVPR42600.2020.01070.

Saining Xie, Ross B. Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pp. 5987–5995. IEEE Computer Society, 2017. doi: 10.1109/CVPR.2017.634. URL https://doi.org/10.1109/CVPR.2017.634.

Sang Michael Xie, Ananya Kumar, Robbie Jones, Fereshte Khani, Tengyu Ma, and Percy Liang. In-N-Out: Pre-training and self-training using auxiliary information for out-of-distribution robustness. ArXiv preprint, abs/2012.04550, 2020b.

Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In Richard C. Wilson, Edwin R. Hancock, and William A. P. Smith (eds.), Proceedings of the British Machine Vision Conference 2016, BMVC 2016, York, UK, September 19-22, 2016. BMVA Press, 2016. URL http://www.bmva.org/bmvc/2016/papers/paper087/index.html.

Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net, 2017. URL https://openreview.net/forum?id=Sy8gdB9xx.
|
|
|
Marvin Zhang, Sergey Levine, and Chelsea Finn. MEMO: Test time robustness via adaptation and augmentation. ArXiv preprint, abs/2110.09506, 2021.

Zhilu Zhang and Mert R. Sabuncu. Generalized cross entropy loss for training deep neural networks with noisy labels. In Samy Bengio, Hanna M. Wallach, Hugo Larochelle, Kristen Grauman, Nicolò Cesa-Bianchi, and Roman Garnett (eds.), Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada, pp. 8792–8802, 2018. URL https://proceedings.neurips.cc/paper/2018/hash/f2925f97bc13ad2852a7a551802feea0-Abstract.html.

Barret Zoph, Golnaz Ghiasi, Tsung-Yi Lin, Yin Cui, Hanxiao Liu, Ekin D Cubuk, and Quoc V Le. Rethinking pre-training and self-training. In NeurIPS, 2020.

Yang Zou, Zhiding Yu, BVK Kumar, and Jinsong Wang. Unsupervised domain adaptation for semantic segmentation via class-balanced self-training. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 289–305, 2018.

Yang Zou, Zhiding Yu, Xiaofeng Liu, BVK Kumar, and Jinsong Wang. Confidence regularized self-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5982–5991, 2019.
|
|
|
A A TWO-POINT MODEL OF SELF-LEARNING |
|
|
|
A.1 DEFINITION OF THE TWO-POINT MODEL |
|
|
|
To understand the learning dynamics and properties of different loss functions and their hyperparameters, we propose a simple model of self-learning, both for entropy minimization and pseudo-labeling.
|
|
|
A student network w^s ∈ R^d and a teacher network w^t ∈ R^d are trained on N data points {x_i}_{i=1}^N with the cross-entropy loss function L defined as

    L = \sum_{i=1}^{N} \ell(x_i) = -\sum_{i=1}^{N} \left[ \sigma_t(x_i^\top w^t) \log \sigma_s(x_i^\top w^s) + \sigma_t(-x_i^\top w^t) \log \sigma_s(-x_i^\top w^s) \right],   (6)

where \sigma_t(z) = \frac{1}{1 + e^{-z/\tau_t}} and \sigma_s(z) = \frac{1}{1 + e^{-z/\tau_s}}.
|
|
|
|
|
Here τ_s and τ_t denote the student and teacher temperature parameters. With stop gradient, student and teacher evolve in time according to

    \dot{w}^s = -\nabla_{w^s} L(w^s, w^t), \qquad \dot{w}^t = \alpha (w^s - w^t),   (7)

where α is the learning rate of the teacher. Without stop gradient, student and teacher are set equal to each other, and they evolve as

    \dot{w} = -\nabla_{w} L(w), \quad \text{where } w^s = w^t = w.   (8)
|
|
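For concreteness, the dynamics in equations 6 to 8 can be simulated numerically. The following is a minimal PyTorch sketch, not code from the paper: it discretizes the stop-gradient flow of equation 7 with explicit Euler steps, and all dimensions, temperatures, learning rates, and step counts are illustrative placeholders.

```python
# Minimal sketch of the two-point-model dynamics (eqs. 6-7), discretized with Euler steps.
# Shapes, temperatures, and step sizes are illustrative choices, not values from the paper.
import torch

tau_s, tau_t, alpha, lr, steps = 0.1, 0.05, 0.1, 0.01, 1000

def loss(ws, wt, X):
    """Cross-entropy between temperature-scaled teacher and student sigmoids (eq. 6)."""
    zs, zt = X @ ws / tau_s, X @ wt / tau_t
    return -(torch.sigmoid(zt) * torch.nn.functional.logsigmoid(zs)
             + torch.sigmoid(-zt) * torch.nn.functional.logsigmoid(-zs)).sum()

d, N = 16, 32
X = torch.randn(N, d)                      # the N data points x_i
ws = torch.randn(d, requires_grad=True)    # student weights w^s
wt = ws.detach().clone()                   # teacher weights w^t (no gradient: stop gradient)

for _ in range(steps):
    L = loss(ws, wt, X)
    g, = torch.autograd.grad(L, ws)
    with torch.no_grad():
        ws -= lr * g                       # Euler step for  dw^s/dt = -grad_{w^s} L    (eq. 7)
        wt += lr * alpha * (ws - wt)       # Euler step for  dw^t/dt = alpha (w^s - w^t)
```

Dropping the stop gradient corresponds to replacing the two update lines by a single Euler step on a shared weight vector w with w^s = w^t = w, as in equation 8.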
|
We restrict the theoretical analysis to the time evolution of the components of w^{s,t} in the direction of two data points x_k and x_l, y_k^{s,t} ≡ x_k^⊤ w^{s,t} and y_l^{s,t} ≡ x_l^⊤ w^{s,t}. All other components y_i^{s,t} with i ≠ k, l are neglected to reduce the dimensionality of the equation system. It turns out that the resulting model captures the neural network dynamics quite well despite the drastic simplification of taking only two data points into account (see Figure 2).

    with stop gradient:    \dot{y}_k^s = -x_k^\top \nabla_{w^s} (\ell(x_k) + \ell(x_l)), \quad \dot{y}_l^s = -x_l^\top \nabla_{w^s} (\ell(x_k) + \ell(x_l)),
                           \dot{y}_k^t = \alpha (y_k^s - y_k^t), \quad \dot{y}_l^t = \alpha (y_l^s - y_l^t),
    without stop gradient: \dot{y}_k = -x_k^\top \nabla_{w} (\ell(x_k) + \ell(x_l)), \quad \dot{y}_l = -x_l^\top \nabla_{w} (\ell(x_k) + \ell(x_l)).   (9)
|
|
|
|
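The reduction in equation 9 can likewise be explored numerically. The sketch below is illustrative and not the paper's code: it keeps only the two data points x_k and x_l, integrates the stop-gradient dynamics, and records the projections y_k^s and y_l^s along the way; all constants and shapes are arbitrary choices.

```python
# Sketch of the two-point restriction in eq. 9: only x_k and x_l are retained, and we
# track the mode variables y_k^s = x_k^T w^s and y_l^s = x_l^T w^s over time.
import torch

tau_s, tau_t, alpha, dt, steps = 0.1, 0.05, 0.1, 0.01, 500

def ell(x, ws, wt):
    """Per-sample loss from eq. 6 for a single data point x."""
    zs, zt = x @ ws / tau_s, x @ wt / tau_t
    return -(torch.sigmoid(zt) * torch.nn.functional.logsigmoid(zs)
             + torch.sigmoid(-zt) * torch.nn.functional.logsigmoid(-zs))

d = 16
x_k, x_l = torch.randn(d), torch.randn(d)
ws = torch.randn(d, requires_grad=True)
wt = ws.detach().clone()

traj = []
for _ in range(steps):
    loss_two = ell(x_k, ws, wt) + ell(x_l, ws, wt)      # only the terms i = k, l are kept
    g, = torch.autograd.grad(loss_two, ws)
    with torch.no_grad():
        ws -= dt * g                                    # student gradient-flow step (eq. 9)
        wt += dt * alpha * (ws - wt)                    # teacher mode relaxes toward the student
        traj.append((float(x_k @ ws), float(x_l @ ws))) # y_k^s, y_l^s trajectories
```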
|
A.2 PROOF OF PROPOSITION 1 |
|
|
|
**Learning dynamics with stop gradient.** Computing the stop gradient evolution defined in equation 7 explicitly yields

    \dot{w}^s = -\nabla_{w^s} L = \frac{1}{\tau_s} \sum_{i=1}^{N} \left[ \sigma_t(x_i^\top w^t)\, \sigma_s(-x_i^\top w^s) - \sigma_t(-x_i^\top w^t)\, \sigma_s(x_i^\top w^s) \right] x_i,
    \dot{w}^t = \alpha (w^s - w^t).   (10)

The second equality uses the well-known derivative of the sigmoid function, \partial_z \sigma(z) = \sigma(z)\sigma(-z); for the temperature-scaled student sigmoid this becomes \partial_z \sigma_s(z) = \sigma_s(z)\sigma_s(-z)/\tau_s, which produces the prefactor 1/\tau_s.

The system of 2d nonlinear, coupled ODEs for w^s ∈ R^d and w^t ∈ R^d in equation 10 is difficult to analyze analytically. Instead of studying the ODEs directly, we act on them with the data points x_k^⊤, k = 1, ..., N, and investigate the dynamics of the components x_k^⊤ w^{s,t} ≡ y_k^{s,t}:
|
|
|
    \dot{y}_k^s = \frac{1}{\tau_s} \sum_{i=1}^{N} x_i^\top x_k \left[ \sigma_t(y_i^t)\, \sigma_s(-y_i^s) - \sigma_t(-y_i^t)\, \sigma_s(y_i^s) \right],
    \dot{y}_k^t = \alpha (y_k^s - y_k^t).   (11)
|
|
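As a sanity check of equation 11, the closed-form expression for \dot{y}_k^s can be compared against the autograd gradient of the loss projected onto x_k. A small sketch, with illustrative shapes and temperatures:

```python
# Numerical check of eq. 11 (a sketch; all shapes and constants are illustrative):
# the closed-form mode derivative should equal -x_k^T grad_{w^s} L, since both rest on
# the sigmoid derivative d/dz sigma_s(z) = sigma_s(z) sigma_s(-z) / tau_s.
import torch

tau_s, tau_t = 0.1, 0.05
d, N = 16, 8
X = torch.randn(N, d)
ws = torch.randn(d, requires_grad=True)
wt = torch.randn(d)

zs, zt = X @ ws / tau_s, X @ wt / tau_t
L = -(torch.sigmoid(zt) * torch.nn.functional.logsigmoid(zs)
      + torch.sigmoid(-zt) * torch.nn.functional.logsigmoid(-zs)).sum()
g, = torch.autograd.grad(L, ws)

ys, yt = X @ ws.detach(), X @ wt                     # y_i^s and y_i^t for all modes
bracket = (torch.sigmoid(yt / tau_t) * torch.sigmoid(-ys / tau_s)
           - torch.sigmoid(-yt / tau_t) * torch.sigmoid(ys / tau_s))
k = 0
closed_form = (X @ X[k] * bracket).sum() / tau_s     # right-hand side of eq. 11 for mode k
autograd_form = -(X[k] @ g)                          # -x_k^T grad_{w^s} L
assert torch.allclose(closed_form, autograd_form, atol=1e-4)
```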
|
The learning rate of each mode y_k^s is scaled by (x_k^\top x_i), which is much larger for i = k than for i ≠ k in high-dimensional spaces. In the two-point approximation, we consider only the two (in absolute value) largest terms i = k, l for a given k in the sum in equation 11. Any changes that y_k^{s,t}(t) and y_l^{s,t}(t) might induce in other modes y_i^{s,t}(t) are neglected, and so we are left with only four ODEs:

    \dot{y}_k^s = \frac{1}{\tau_s} \|x_k\|^2 \left[ \sigma_t(y_k^t)\, \sigma_s(-y_k^s) - \sigma_t(-y_k^t)\, \sigma_s(y_k^s) \right] + \frac{1}{\tau_s} (x_k^\top x_l) \left[ \sigma_t(y_l^t)\, \sigma_s(-y_l^s) - \sigma_t(-y_l^t)\, \sigma_s(y_l^s) \right],