|
# CALIBRATION REGULARIZED TRAINING OF DEEP NEURAL NETWORKS USING DIRICHLET KERNEL DENSITY ESTIMATION |
|
|
|
**Anonymous authors** |
|
Paper under double-blind review |
|
|
|
ABSTRACT |
|
|
|
Calibrated probabilistic classifiers are models whose predicted probabilities can directly be interpreted as uncertainty estimates. This property is particularly important in safety-critical applications such as medical diagnosis or autonomous driving. However, it has been shown recently that deep neural networks are poorly calibrated and tend to output overconfident predictions. As a remedy, we propose a trainable calibration error estimator based on Dirichlet kernel density estimates, which asymptotically converges to the true $L_p$ calibration error. This novel estimator enables us to achieve the strongest notion of multiclass calibration, called canonical calibration, while other common calibration methods only allow for top-label and marginal calibration. The empirical results show that our estimator is competitive with the state of the art, consistently yielding tradeoffs between calibration error and accuracy that are (near) Pareto optimal across a range of network architectures. The computational complexity of our estimator is $O(n^2)$, matching that of the kernel maximum mean discrepancy used in a previously considered trainable calibration estimator (Kumar et al., 2018). By contrast, the proposed method has a natural choice of kernel, and can be used to generate consistent estimates of other quantities based on conditional expectation, such as the sharpness of an estimator.
|
|
|
1 INTRODUCTION |
|
|
|
Deep neural networks have shown tremendous success in classification tasks, being regularly the best performing models in terms of accuracy. However, they are also known to make overconfident predictions (Guo et al., 2017), which is particularly problematic in safety-critical applications such as medical diagnosis or autonomous driving. Therefore, in many real-world applications we do not just care about the predictive performance, but also about the trustworthiness of that prediction; that is, we are interested in accurate predictions with robust uncertainty estimates. To this end, we want our models to be uncertainty calibrated, which means that, for instance, among all cells that have been predicted with a probability of 0.8 to be cancerous, in fact a fraction of 80% belong to a malignant tumor.

Being calibrated, however, does not imply that the classifier achieves good accuracy. For instance, a classifier that always predicts the marginal distribution of the target class is calibrated, but will not be very useful in practice. Likewise, good predictive performance does not ensure calibration. In particular, for a broad class of loss functions, risk minimization leads to asymptotically Bayes optimal classifiers (Bartlett et al., 2006). However, there is no guarantee that they are calibrated, even in the asymptotic limit. Therefore, we consider minimizing the risk plus a term that penalizes miscalibration, i.e., $\mathrm{Risk} + \lambda \cdot \mathrm{CalibrationError}$. For parameter values $\lambda > 0$, this will push the classifier towards a calibrated model, while maintaining similar accuracy. The existence of such a $\lambda > 0$ is suggested by the fact that there always exists at least one Bayes optimal classifier that is calibrated, namely $P(y \mid x)$.
|
|
|
To optimize the risk and the calibration error jointly, we propose a differentiable and consistent estimator of the expected $L_p$ calibration error based on kernel density estimates (KDEs). In particular, we use a Beta kernel in binary classification tasks and a Dirichlet kernel in the multiclass setting, as these kernels are the natural choices for density estimation over a probability simplex. Our Dirichlet kernel based estimator allows for the estimation of canonical calibration, which is the strongest notion of multiclass calibration as it implies the calibration of the whole probability vector (Bröcker, 2009; Appice et al., 2015; Vaicenavicius et al., 2019). By contrast, most other state-of-the-art methods only achieve weaker versions of multiclass calibration, namely top-label (Guo et al., 2017) and marginal or class-wise calibration (Kull et al., 2019). Top-label calibration only considers the scores for the predicted class, while for marginal calibration the multiclass problem is split up into $K$ one-vs-all binary ones, each of which is required to be calibrated according to the definition of binary calibration. In many applications marginal and canonical calibration are preferable to top-label calibration, since we often care about having reliable uncertainty estimates for more than just one class per prediction. For instance, in medical diagnosis we do not just care about the most likely disease a certain patient might have but also about the probabilities of other diseases.
|
|
|
Our contributions can be summarized as follows: |
|
|
|
1. We develop a trainable calibration error objective using Dirichlet kernel density estimates, which can be minimized alongside any loss function in the existing batch stochastic gradient descent framework.

2. We propose to use our estimator to evaluate canonical calibration. Due to the scaling properties of Dirichlet kernel density estimation, and the tendency for probabilities to be concentrated in a relatively small number of classes, this becomes feasible in cases that cannot be estimated using a binned estimator.

3. We show on a variety of network architectures and two datasets that DNNs trained alongside an estimator of the calibration error achieve competitive results both on existing metrics and on the proposed measure of canonical calibration.
|
|
|
|
|
2 RELATED WORK |
|
|
|
Calibration of probabilistic predictors has long been studied in many fields. This topic gained attention in the deep learning community following the observation in Guo et al. (2017) that modern neural networks are poorly calibrated and tend to give overconfident predictions due to overfitting on the NLL loss. The surge of interest resulted in many calibration strategies that can be split into two general categories, which we discuss subsequently. **Post-hoc calibration strategies** learn a calibration map of the predictions from a trained predictor in a post-hoc manner. For instance, Platt scaling (Platt, 1999) fits a logistic regression model on top of the logit outputs of the model. A special case of Platt scaling that fits a single scalar, called temperature, has been popularized by Guo et al. (2017) as an accuracy-preserving, easy to implement and effective method to improve calibration. However, it has the undesired consequence that it clamps the high confidence scores of accurate predictions (Kumar et al., 2018). Other approaches for post-hoc calibration include: histogram binning (Zadrozny & Elkan, 2001), isotonic regression (Zadrozny & Elkan, 2002), and Bayesian binning into quantiles (Naeini & Cooper, 2015). **Trainable calibration strategies** integrate a differentiable calibration measure into the training objective. One of the earliest approaches is regularization by penalizing low-entropy predictions (Pereyra et al., 2017). Similarly to temperature scaling, it has been shown that entropy regularization needlessly suppresses high confidence scores of correct predictions (Kumar et al., 2018). Another popular strategy is MMCE (Maximum Mean Calibration Error) (Kumar et al., 2018), where the entropy regularizer is replaced by a kernel-based surrogate for the calibration error that can be optimized alongside NLL. It has been shown that label smoothing (Szegedy et al., 2015; Müller et al., 2020), i.e., training models with a weighted mixture of the labels instead of one-hot vectors, also improves model calibration. Liang et al. (2020) propose to add the difference between predicted confidence and accuracy as an auxiliary term to the cross-entropy loss. Focal loss (Mukhoti et al., 2020; Lin et al., 2018) has recently been empirically shown to produce better calibrated models than many of the alternatives, but does not estimate a clear quantity related to calibration error.
|
|
|
**Kernel density estimation** (Parzen, 1962; Rosenblatt, 1956) is a non-parametric method to estimate a probability density function from a finite sample. Zhang et al. (2020) propose a KDE-based estimator of the calibration error for measuring calibration performance. However, they use the triweight kernel, which has a limited support interval and is therefore applicable to binary classification, but does not have a natural extension to higher dimensional simplices, in contrast to the Dirichlet kernel that we consider here. As a result, they consider an unnatural proxy to the marginal calibration error, which does not result in a consistent estimator.
|
|
|
3 METHODS |
|
|
|
The most commonly used loss functions are designed to achieve consistency in the sense of Bayes optimality under risk minimization; however, they do not guarantee calibration, neither for finite samples nor in the asymptotic limit. Since we are interested in models $f$ that are both accurate and calibrated, we consider the following optimization problem bounding the calibration error $\mathrm{CE}(f)$:

$$f = \operatorname*{arg\,min}_{f \in \mathcal{F}} \ \mathrm{Risk}(f), \quad \text{s.t.} \quad \mathrm{CE}(f) \leq B \tag{1}$$
|
|
|
for some B > 0, and its associated Lagrangian |
|
|
|
$$f = \operatorname*{arg\,min}_{f \in \mathcal{F}} \left[ \mathrm{Risk}(f) + \lambda \cdot \mathrm{CE}(f) \right]. \tag{2}$$
|
|
|
|
|
|
|
We measure the (mis-)calibration in terms of the $L_p$ calibration error. To this end, let $(\Omega, \mathcal{A}, \mathbb{P})$ be a probability space, and let $\mathcal{X} = \mathbb{R}^d$ and $\mathcal{Y} = \{0, 1, \ldots, K\}$. Let $x : \Omega \to \mathcal{X}$ and $y : \Omega \to \mathcal{Y}$ be random variables, while realizations are denoted with subscripts. Furthermore, let $f : \mathcal{X} \to \triangle^K$ be a decision function, where $\triangle^K$ denotes the $K$-dimensional simplex, as is obtained e.g. from the output of a final softmax layer in a neural network.
|
**Definition 3.1** (Calibration error (Naeini et al., 2015; Kumar et al., 2019; Wenger et al., 2020)). *The $L_p$ calibration error of $f$ is:*

$$\mathrm{CE}_p(f) = \left( \mathbb{E}\left[ \big\| \mathbb{E}[y \mid f(x)] - f(x) \big\|_p^p \right] \right)^{1/p}. \tag{3}$$
|
We note that we consider multiclass calibration, and that $f(x)$ and the conditional expectation in Equation 3 therefore map to points on a probability simplex. We say that a classifier $f$ is perfectly calibrated if $\mathrm{CE}_p(f) = 0$. Kumar et al. (2018) have also considered a minimization problem similar to Equation 2. Instead of using $\mathrm{CE}_p$, they use a metric called maximum mean calibration error (MMCE) that is zero if and only if $\mathrm{CE}_p = 0$. However, it is unclear how MMCE relates to the canonical multiclass setting or to the norm parameter $p$ for non-zero $\mathrm{CE}_p$.
|
|
|
In order to optimize Definition 3.1 directly, we need to perform density estimation over the probability simplex to empirically compute the conditional expectation. In a binary setting, this has traditionally been done with binned estimates (Naeini et al., 2015; Guo et al., 2017; Kumar et al., 2019). However, a binned estimate is not differentiable w.r.t. the function $f$, and cannot be incorporated into a gradient based training procedure. Furthermore, binned estimates suffer from the curse of dimensionality and do not have a practical extension to multiclass settings. A natural choice for a differentiable kernel density estimator in the binary case is a kernel based on the Beta distribution, and the extension to the multiclass case is given by the Dirichlet distribution. Hence, we consider an estimator for $\mathrm{CE}_p$ based on Beta and Dirichlet kernel density estimates in the binary and multiclass setting, respectively. We require that this estimator is consistent and differentiable such that we can train it according to Equation 2. This estimator is given by:
|
|
|
|
|
$$\widehat{\mathrm{CE}}_p(f)^p = \frac{1}{n} \sum_{h=1}^{n} \Big\| \widehat{\mathbb{E}}[y \mid f(x)] \big|_{f(x_h)} - f(x_h) \Big\|_p^p \tag{4}$$

where $\widehat{\mathbb{E}}[y \mid f(x)] \big|_{f(x_h)}$ denotes $\widehat{\mathbb{E}}[y \mid f(x)]$ evaluated at $f(x) = f(x_h)$. If $\mathbb{P}_{x,y}$ has a probability density $p_{x,y}$ with respect to the product of the Lebesgue and counting measure, we can define $p_{x,y}(x_i, y_i) = p_{y \mid x = x_i}(y_i)\, p_x(x_i)$. Then we define the estimator of the conditional expectation as follows:
|
|
|
$$\mathbb{E}[y \mid f(x)] = \sum_{y_k \in \mathcal{Y}} y_k\, p_{y \mid x = f(x)}(y_k) = \frac{\sum_{y_k \in \mathcal{Y}} y_k\, p_{x,y}(f(x), y_k)}{p_x(f(x))} \tag{5}$$

$$\approx \frac{\sum_{i=1}^{n} k(f(x); f(x_i))\, y_i}{\sum_{i=1}^{n} k(f(x); f(x_i))} =: \widehat{\mathbb{E}}[y \mid f(x)] \tag{6}$$

where $k$ is the kernel of a kernel density estimate evaluated at point $x_i$.
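For concreteness, the ratio in Equation 6 can be evaluated for a whole batch at once from a single kernel matrix; the following sketch (numpy; helper names are ours, not part of any released code) is one way to write it.

```python
import numpy as np

def cond_exp_estimate(kernel_matrix: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Kernel-ratio estimate of E[y | f(x)] from Equation 6, evaluated at each
    score in the batch. kernel_matrix[j, i] = k(f(x_j); f(x_i)); y holds the
    labels (shape (n,)) or one-hot vectors (shape (n, K))."""
    num = kernel_matrix @ y                      # sum_i k(f(x_j); f(x_i)) y_i
    den = kernel_matrix.sum(axis=1)              # sum_i k(f(x_j); f(x_i))
    if y.ndim > 1:
        den = den[:, None]                       # broadcast over classes
    return num / den
```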
|
|
|
|
|
|
|
|
**Proposition 3.2.** $\widehat{\mathbb{E}}[y \mid f(x)]$ *is a pointwise consistent estimator of* $\mathbb{E}[y \mid f(x)]$, *that is:*

$$\lim_{n \to \infty} \frac{\sum_{i=1}^{n} k(f(x); f(x_i))\, y_i}{\sum_{i=1}^{n} k(f(x); f(x_i))} = \frac{\sum_{y_k \in \mathcal{Y}} y_k\, p_{x,y}(f(x), y_k)}{p_x(f(x))}. \tag{7}$$

*Proof.* By the consistency of kernel density estimators (Silverman, 1986; Chen, 1999; Ouimet & Tolosana-Delgado, 2021), for all $f(x) \in (0, 1)$, $\frac{1}{n} \sum_{i=1}^{n} k(f(x); f(x_i))\, y_i \xrightarrow{n \to \infty} \sum_{y_k \in \mathcal{Y}} y_k\, p_{x,y}(f(x), y_k)$ and $\frac{1}{n} \sum_{i=1}^{n} k(f(x); f(x_i)) \xrightarrow{n \to \infty} p_x(f(x))$. The fact that the ratio of two convergent sequences converges to the ratio of their limits shows the result. $\square$
|
|
|
**Mean squared error in binary classification** As a first instantiation of our framework we consider a binary classification setting, with the mean squared error $\mathrm{MSE}(f) = \mathbb{E}[(f(x) - y)^2]$ as the risk function, jointly optimized with the $L_2$ calibration error $\mathrm{CE}_2$. Following Murphy (1973); Degroot & Fienberg (1983); Kuleshov & Liang (2015); Nguyen & O'Connor (2015), we decompose (full derivation in Appendix A) the MSE as:

$$\mathrm{MSE}(f) - \mathrm{CE}_2(f)^2 = \mathbb{E}\Big[ \big(1 - \mathbb{E}[y \mid f(x)]\big)\, \mathbb{E}[y \mid f(x)] \Big] \geq 0. \tag{8}$$
|
|
|
|
|
Similar to Equation 2, we consider the optimization problem for some $\lambda > 0$:

$$f = \operatorname*{arg\,min}_{f \in \mathcal{F}} \left[ \mathrm{MSE}(f) + \lambda\, \mathrm{CE}_2(f)^2 \right]. \tag{9}$$

Using Equation 8 we rewrite:

$$\mathrm{MSE}(f) + \lambda\, \mathrm{CE}_2(f)^2 = (1 + \lambda)\, \mathrm{MSE}(f) - \lambda \left[ \mathrm{MSE}(f) - \mathrm{CE}_2(f)^2 \right] \tag{10}$$

$$= (1 + \lambda)\, \mathrm{MSE}(f) - \lambda\, \mathbb{E}\Big[ \big(1 - \mathbb{E}[y \mid f(x)]\big)\, \mathbb{E}[y \mid f(x)] \Big]. \tag{11}$$

Rescaling Equation 11 by a factor of $(1 + \lambda)^{-1}$ and substituting $\gamma = \frac{\lambda}{1 + \lambda} \in [0, 1)$ gives

$$f = \operatorname*{arg\,min}_{f \in \mathcal{F}} \left[ \mathrm{MSE}(f) + \lambda\, \mathrm{CE}_2(f)^2 \right] = \operatorname*{arg\,min}_{f \in \mathcal{F}} \left[ \mathrm{MSE}(f) - \gamma\, \mathbb{E}\Big[ \big(1 - \mathbb{E}[y \mid f(x)]\big)\, \mathbb{E}[y \mid f(x)] \Big] \right] \tag{12}$$

$$= \operatorname*{arg\,min}_{f \in \mathcal{F}} \left[ \mathrm{MSE}(f) + \gamma\, \mathbb{E}\big[ \mathbb{E}[y \mid f(x)]^2 \big] \right], \tag{13}$$

where the last step uses that $\mathbb{E}\big[\mathbb{E}[y \mid f(x)]\big] = \mathbb{E}[y]$ by the law of total expectation, so this term does not depend on $f$ and can be dropped from the minimization.
|
|
|
For optimization we wish to find an estimator for $\mathbb{E}\big[\mathbb{E}[y \mid f(x)]^2\big]$. Building upon Equation 6, a partially debiased estimator can be written as:¹

$$\mathbb{E}\big[ \mathbb{E}[y \mid f(x)]^2 \big] \approx \frac{1}{n} \sum_{h=1}^{n} \frac{\Big( \sum_{i \neq h} k(f(x_h); f(x_i))\, y_i \Big)^2 - \sum_{i \neq h} \big( k(f(x_h); f(x_i))\, y_i \big)^2}{\Big( \sum_{i \neq h} k(f(x_h); f(x_i)) \Big)^2 - \sum_{i \neq h} \big( k(f(x_h); f(x_i)) \big)^2}. \tag{14}$$

In a binary setting, the kernels $k(\cdot, \cdot)$ are Beta distributions; denoting $z_i := f(x_i)$ for short:
|
|
|
$$k_{\mathrm{Beta}}(z, z_i) := z^{\alpha_i - 1} (1 - z)^{\beta_i - 1}\, \frac{\Gamma(\alpha_i + \beta_i)}{\Gamma(\alpha_i)\, \Gamma(\beta_i)}, \tag{15}$$

with $\alpha_i = \frac{z_i}{h} + 1$ and $\beta_i = \frac{1 - z_i}{h} + 1$ (Chen, 1999; Bouezmarni & Rolin, 2003; Zhang & Karunamuni, 2010), where $h$ is a bandwidth parameter in the kernel density estimate that goes to 0 as $n \to \infty$. We note that the computational complexity of this estimator is $O(n^2)$. Within the gradient descent training procedure, the density is estimated using a mini-batch, and therefore the $O(n^2)$ complexity is w.r.t. a mini-batch, not the entire dataset.
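As an illustration, a sketch of the Beta kernel in Equation 15 and the partially debiased estimator of Equation 14 follows (numpy/scipy; function names are ours). The kernel is evaluated in log-space, since the Gamma factors can overflow for small bandwidths.

```python
import numpy as np
from scipy.special import gammaln

def beta_log_kernel(z_eval: np.ndarray, z_data: np.ndarray, h: float) -> np.ndarray:
    """log k_Beta(z_eval[j], z_data[i]) (Equation 15), with alpha_i = z_i/h + 1
    and beta_i = (1 - z_i)/h + 1, computed in log-space for stability."""
    a = z_data / h + 1.0
    b = (1.0 - z_data) / h + 1.0
    log_norm = gammaln(a + b) - gammaln(a) - gammaln(b)
    z = np.clip(z_eval, 1e-12, 1.0 - 1e-12)[:, None]   # guard the log at 0 and 1
    return (a - 1.0) * np.log(z) + (b - 1.0) * np.log1p(-z) + log_norm

def second_moment_estimate(z: np.ndarray, y: np.ndarray, h: float) -> float:
    """Partially debiased estimate of E[E[y | f(x)]^2] from Equation 14."""
    K = np.exp(beta_log_kernel(z, z, h))
    np.fill_diagonal(K, 0.0)                           # drop the i = h terms
    ky = K * y[None, :]
    num = ky.sum(axis=1) ** 2 - (ky ** 2).sum(axis=1)
    den = K.sum(axis=1) ** 2 - (K ** 2).sum(axis=1)
    return float(np.mean(num / den))
```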
|
|
|
The estimator in Equation 14 is a ratio of two second order U-statistics that converge as $n^{-1/2}$ (Ferguson, 2005). Therefore, the overall convergence will be $n^{-1/2}$. Empirical convergence rates are calculated in Appendix D.3 and shown to be close to the theoretically expected value.
|
|
|
¹We have debiased the numerator and denominator individually (Ferguson, 2005, Section 2), but for simplicity have not corrected for the fact that we are estimating a ratio (Scott & Wu, 1981).
|
|
|
|
|
|
|
|
**Multiclass calibration with Dirichlet kernel density estimates** There are multiple definitions of multiclass calibration that differ in how strictly the probability vector $f(x)$ is required to be calibrated. The weakest notion is top-label calibration, which, as the name suggests, only requires calibrating the entry with the highest predicted probability, which reduces to a binary calibration problem again (Guo et al., 2017). Marginal or class-wise calibration (Kull et al., 2019) is the most commonly used definition of multiclass calibration and a stronger version of top-label calibration. Here, the problem is split into $K$ one-vs-all binary calibration settings, such that each class has to be calibrated against the other $K - 1$ classes:
|
|
|
$$\mathrm{MCE}_p(f)^p = \sum_{k=1}^{K} \mathbb{E}\Big[ \big| \mathbb{E}[y = k \mid f(x)_k] - f(x)_k \big|^p \Big]. \tag{16}$$

An estimator for this calibration error is:

$$\widehat{\mathrm{MCE}}_p(f)^p = \frac{1}{n} \sum_{j=1}^{n} \sum_{k=1}^{K} \left| \frac{\sum_{i \neq j} k_{\mathrm{Beta}}(f(x_j)_k; f(x_i)_k)\, [y_i]_k}{\sum_{i \neq j} k_{\mathrm{Beta}}(f(x_j)_k; f(x_i)_k)} - f(x_j)_k \right|^p. \tag{17}$$
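A sketch of Equation 17, reusing the `beta_log_kernel` helper from the binary sketch above (numpy; names are ours):

```python
import numpy as np

def marginal_ce_estimate(F: np.ndarray, Y: np.ndarray, h: float, p: float = 1.0) -> float:
    """Estimate MCE_p(f)^p from Equation 17: one Beta-kernel ratio per class.
    F: (n, K) probability scores; Y: (n, K) one-hot labels."""
    n, K = F.shape
    total = 0.0
    for k in range(K):
        W = np.exp(beta_log_kernel(F[:, k], F[:, k], h))
        np.fill_diagonal(W, 0.0)                 # leave-one-out: i != j
        cond = (W @ Y[:, k]) / W.sum(axis=1)     # estimate of E[y = k | f(x)_k]
        total += float(np.mean(np.abs(cond - F[:, k]) ** p))
    return total
```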
|
|
|
|
|
The strongest notion of multiclass calibration, and the one that we want to consider in this paper, is called canonical calibration (Bröcker, 2009; Appice et al., 2015; Vaicenavicius et al., 2019). Here it is required that the whole probability vector $f(x)$ is calibrated. The definition is exactly the one from Definition 3.1. Its estimator is:
|
|
|
|
|
$$\widehat{\mathrm{CE}}_p(f)^p = \frac{1}{n} \sum_{j=1}^{n} \left\| \frac{\sum_{i \neq j} k_{\mathrm{Dir}}(f(x_j); f(x_i))\, y_i}{\sum_{i \neq j} k_{\mathrm{Dir}}(f(x_j); f(x_i))} - f(x_j) \right\|_p^p \tag{18}$$

where $k_{\mathrm{Dir}}$ is a Dirichlet kernel defined as:

$$k_{\mathrm{Dir}}(z, z_i) := \frac{\Gamma\big( \sum_{j=1}^{K} \alpha_{ij} \big)}{\prod_{j=1}^{K} \Gamma(\alpha_{ij})} \prod_{j=1}^{K} z_j^{\alpha_{ij} - 1} \tag{19}$$

with $\alpha_i = z_i / h + 1$ applied componentwise, i.e., $\alpha_{ij} = (z_i)_j / h + 1$ (Ouimet & Tolosana-Delgado, 2021). As before, the computational complexity is $O(n^2)$, irrespective of $p$.
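The following sketch (numpy/scipy; names are ours) evaluates the Dirichlet log-kernel of Equation 19 and the canonical estimator of Equation 18 on a batch of softmax outputs.

```python
import numpy as np
from scipy.special import gammaln

def dirichlet_log_kernel(Z_eval: np.ndarray, Z_data: np.ndarray, h: float) -> np.ndarray:
    """log k_Dir(Z_eval[j]; Z_data[i]) (Equation 19), alpha_i = z_i/h + 1
    componentwise; entry [j, i] of the result is the log-kernel value."""
    A = Z_data / h + 1.0                                        # (n, K)
    log_norm = gammaln(A.sum(axis=1)) - gammaln(A).sum(axis=1)  # (n,)
    logZ = np.log(np.clip(Z_eval, 1e-12, None))                 # (m, K)
    return logZ @ (A - 1.0).T + log_norm[None, :]

def canonical_ce_estimate(F: np.ndarray, Y: np.ndarray, h: float, p: float = 2.0) -> float:
    """Estimate CE_p(f)^p from Equation 18 with a Dirichlet KDE over the simplex.
    F: (n, K) softmax outputs; Y: (n, K) one-hot labels."""
    W = np.exp(dirichlet_log_kernel(F, F, h))
    np.fill_diagonal(W, 0.0)                                    # i != j
    cond = (W @ Y) / W.sum(axis=1, keepdims=True)               # E^[y | f(x_j)]
    return float(np.mean((np.abs(cond - F) ** p).sum(axis=1)))
```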
|
|
|
This estimator is differentiable and, furthermore, the following proposition holds:

**Proposition 3.3.** *The Dirichlet kernel based CE estimator is consistent, that is*

$$\lim_{n \to \infty} \frac{1}{n} \sum_{j=1}^{n} \left\| \frac{\sum_{i \neq j} k_{\mathrm{Dir}}(f(x_j); f(x_i))\, y_i}{\sum_{i \neq j} k_{\mathrm{Dir}}(f(x_j); f(x_i))} - f(x_j) \right\|_p^p = \mathbb{E}\Big[ \big\| \mathbb{E}[y \mid f(x)] - f(x) \big\|_p^p \Big]. \tag{20}$$

*Proof.* Dirichlet kernel estimators are consistent (Ouimet & Tolosana-Delgado, 2021); consequently, by Proposition 3.2 the term inside the norm is consistent for any fixed $f(x_j)$ (note that summing over $i \neq j$ ensures that the ratio of the KDEs does not depend on the outer summation). Moreover, for any convergent sequence, the norm of that sequence also converges to the norm of its limit. Ultimately, the outer sum is merely the sample mean of consistent summands, which again is consistent. $\square$
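Since the estimator is differentiable, it can be added directly to the training loss of Equation 2. The sketch below (PyTorch; a minimal illustration with names of our choosing, not the authors' released code) uses cross-entropy as the risk; the defaults λ = 0.01 and h = 0.001 follow values reported in the experimental section.

```python
import torch
import torch.nn.functional as F

def kde_ce_loss(probs: torch.Tensor, y_onehot: torch.Tensor,
                h: float = 0.001, p: float = 2.0) -> torch.Tensor:
    """Differentiable mini-batch estimate of CE_p(f)^p (Equation 18)."""
    A = probs / h + 1.0
    log_norm = torch.lgamma(A.sum(dim=1)) - torch.lgamma(A).sum(dim=1)
    logW = torch.log(probs.clamp_min(1e-12)) @ (A - 1.0).T + log_norm[None, :]
    logW = logW - logW.max(dim=1, keepdim=True).values  # row shift; cancels in the ratio
    W = torch.exp(logW)
    W = W * (1.0 - torch.eye(W.shape[0], device=W.device))  # zero the i = j terms
    cond = (W @ y_onehot) / W.sum(dim=1, keepdim=True)
    return ((cond - probs).abs() ** p).sum(dim=1).mean()

def training_step(model, x, y, optimizer, lam=0.01, num_classes=10):
    """One step of Equation 2 with Risk = cross-entropy (the KDE-CRE objective)."""
    logits = model(x)
    probs = torch.softmax(logits, dim=1)
    loss = F.cross_entropy(logits, y) \
        + lam * kde_ce_loss(probs, F.one_hot(y, num_classes).float())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```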
|
|
|
4 EMPIRICAL SETUP |
|
|
|
We trained ResNet (He et al., 2015), ResNet with stochastic depth (SD) (Huang et al., 2016), DenseNet (Huang et al., 2018) and Wide-ResNet (Zagoruyko & Komodakis, 2016) networks on CIFAR-10 and CIFAR-100 (Krizhevsky, 2009). We use 45,000 images for training. The code will be released upon acceptance.
|
|
|
**Baselines** *Cross-entropy:* The first baseline model is trained using cross-entropy with the data preprocessing, training procedure and hyperparameters described in the corresponding paper for the architecture. *Trainable calibration strategies:* MMCE (Kumar et al., 2018) is a differentiable measure of calibration with the property that it is minimized at perfect calibration. It is used as a regularizer alongside NLL, with the strength of regularization parameterized by $\lambda$. Focal loss (Mukhoti et al., 2020) is an alternative to the popular cross-entropy loss, defined as $L_f = -(1 - f(y \mid x))^\gamma \log f(y \mid x)$, where $\gamma$ is a hyperparameter and $f(y \mid x)$ is the probability score that a neural network $f$ outputs for a class $y$ on an input $x$. Their best-performing approach is the sample-dependent FL-53, where $\gamma = 5$ for $f(y \mid x) \in [0, 0.2)$ and $\gamma = 3$ otherwise, followed by the method with fixed $\gamma = 3$. *Post-hoc calibration strategies:* Guo et al. (2017) investigated the performance of several post-hoc calibration methods and found temperature scaling to be a strong baseline, which we use as a representative of this group. It works by scaling the logits with a scalar $T > 0$, typically learned on a validation set by minimizing NLL. Following Kumar et al. (2018); Mukhoti et al. (2020), we also use temperature scaling as a post-processing step for our method.
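As a reference point, a sketch of the sample-dependent FL-53 variant just described (PyTorch; names are ours):

```python
import torch

def focal_loss_53(logits: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Sketch of FL-53: gamma = 5 where f(y|x) lies in [0, 0.2), gamma = 3 otherwise."""
    p = torch.softmax(logits, dim=1).gather(1, y[:, None]).squeeze(1)
    gamma = torch.where(p < 0.2, torch.full_like(p, 5.0), torch.full_like(p, 3.0))
    return -((1.0 - p) ** gamma * torch.log(p.clamp_min(1e-12))).mean()
```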
|
|
|
**Metrics** The most widely-used metric for expected calibration error (ECE) is a binned estimator (Naeini et al., 2015), which divides the interval $[0, 1]$ into bins of equal width and then calculates a weighted average of the absolute difference between accuracy and confidence in each bin. A better binning scheme determines the bin sizes so that an equal number of samples falls into each bin (Nguyen & O'Connor, 2015; Mukhoti et al., 2020). We report the ECE (%) with 15 bins calculated according to the latter, so-called adaptive binning procedure. We compute the 95% confidence intervals using 100 bootstrap samples as in Kumar et al. (2019). We consider multiple versions of the ECE metric based on the $L_p$ norm and the type of calibration (top-label, marginal, canonical). Top-label calibration error only considers the probability of the predicted class, marginal calibration requires per-class calibration, and canonical calibration is the strongest form, requiring the entire probability vector to be calibrated. We report $L_1$ and $L_2$ ECE in the marginal and canonical case. Additional experiments with top-label and marginal calibration on both CIFAR-10 and CIFAR-100 can be found in Appendix B.
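A sketch of the adaptive binned ECE just described (numpy; names are ours): sort by confidence, split into equal-mass bins, and take the weighted average of the per-bin gap between accuracy and mean confidence.

```python
import numpy as np

def adaptive_ece(confidences: np.ndarray, correct: np.ndarray, n_bins: int = 15) -> float:
    """Top-label ECE with adaptive (equal-mass) binning."""
    order = np.argsort(confidences)
    conf = confidences[order]
    corr = correct[order].astype(float)
    chunks = np.array_split(np.arange(len(conf)), n_bins)  # near-equal-size bins
    n = len(conf)
    return sum(len(c) / n * abs(corr[c].mean() - conf[c].mean())
               for c in chunks if len(c) > 0)
```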
|
|
|
**Hyperparameters** A crucial parameter for KDE is the bandwidth, a positive number that controls the smoothness of the density estimate. A poorly chosen bandwidth may lead to undersmoothing (small bandwidth) or oversmoothing (large bandwidth). A commonly used non-parametric bandwidth selector is maximum likelihood cross validation (Duin, 1976). For our experiments we choose the bandwidth from a list of possible values by maximizing the leave-one-out likelihood. The $\lambda$ parameter weighting the calibration error w.r.t. the loss is typically chosen via cross-validation or using a holdout validation set. The $p$ parameter is chosen depending on the desired $L_p$ calibration error and the corresponding theoretical guarantees.
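A sketch of this selection procedure (numpy; the candidate grid is illustrative, and `dirichlet_log_kernel` refers to the helper from the Section 3 sketch):

```python
import numpy as np

def select_bandwidth(F: np.ndarray, candidates=(1e-4, 1e-3, 1e-2, 1e-1)) -> float:
    """Pick the bandwidth maximizing the leave-one-out log-likelihood."""
    n = F.shape[0]
    best_h, best_ll = candidates[0], -np.inf
    for h in candidates:
        W = np.exp(dirichlet_log_kernel(F, F, h))
        np.fill_diagonal(W, 0.0)
        loo = W.sum(axis=1) / (n - 1)          # leave-one-out density at f(x_j)
        ll = float(np.sum(np.log(np.clip(loo, 1e-300, None))))
        if ll > best_ll:
            best_h, best_ll = h, ll
    return best_h
```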
|
|
|
5 RESULTS AND DISCUSSION |
|
|
|
|
|
5.1 BINARY CLASSIFICATION |
|
|
|
We construct a binary experiment by splitting the CIFAR-10 classes into 2 classes: vehicles (plane, |
|
automobile, ship, truck) and animals (bird, cat, deer, dog, frog, horse). Figure 1a shows how the |
|
choice of the bandwidth parameter influences the shape of the estimate. |
|
|
|
|
|
[Figure 1: (a) Effect of the bandwidth $b$ — KDE estimates with $b \in \{0.001, 0.01, 0.1\}$ compared against a histogram from samples; (b) Effect of $\gamma$ — KDE-MSE models for varying $\gamma$ plotted against the MSE baseline (x-axis: MSE).]

Figure 1: Calibration regularized training using MSE loss and $\mathrm{CE}_2$.
|
|
|
Figure 1b shows the effect of the regularization parameter $\gamma$ on the performance of a ResNet-110 model. The orange point represents a model trained with the MSE loss, and the blue points (KDE-MSE) correspond to models trained with the MSE loss regularized by an $L_2$ calibration error for different values of $\gamma$. As expected, the calibration regularized training decreases the $L_2$ calibration error at the cost of a slightly increased error.
|
|
|
|
|
|
|
|
5.2 EVALUATING CANONICAL CALIBRATION |
|
|
|
Accurately evaluating the calibration error is another crucial step towards designing trustworthy models that can be used in high-cost settings. In spite of its numerous flaws discussed in Vaicenavicius et al. (2019); Ding et al. (2020); Ashukha et al. (2021), such as its sensitivity to the binning scheme, the histogram-based estimator remains the most widely used metric for evaluating miscalibration. Another downside of the binned estimator is its inability to capture canonical calibration due to the curse of dimensionality, as the number of bins grows exponentially with the number of classes. Therefore, because of its favourable scaling properties, we propose using our Dirichlet kernel density estimate as an alternative metric (KDE-ECE) to measure calibration.

To investigate its relationship with the commonly used binned estimator, we first introduce an extension of the top-label binned estimator to the probability simplex in the three-class setting. We start by partitioning the probability simplex into equally-sized, triangle-shaped bins and assigning the probability scores to the corresponding bin, as shown in Figure 2a. Then, we define the binned estimate of the canonical calibration error as follows:
|
|
|
|
|
$$\mathrm{CE}_p(f)^p \approx \mathbb{E}\Big[ \| H(f(x)) - f(x) \|_p^p \Big] \approx \frac{1}{n} \sum_{i=1}^{n} \| H(f(x_i)) - f(x_i) \|_p^p \tag{21}$$
|
|
|
|
|
where $H(f(x_i))$ is the histogram estimate, shown in Figure 2b. The surface of the corresponding Dirichlet KDE is presented in Figure 2c. In Figure 3 we show that the KDE-ECE estimates of the three types of calibration closely correspond to their histogram-based approximations. Each point in the plot represents a ResNet-56 model trained on a different subset of three classes from CIFAR-10. See Appendix C for another example of the binned estimator and the Dirichlet KDE on CIFAR-10, and for an experiment with a varying number of points used for the density estimation.
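For $K = 3$, the triangular bins of Figure 2a can be indexed directly from the coordinates: with each side split into $m$ parts, the floor of the scaled barycentric coordinates identifies the cell and distinguishes upward from downward triangles, giving $m^2$ bins (16 for $m = 4$). A sketch follows (numpy; the bin-id scheme and the interpretation of $H$ as the per-bin label average are our assumptions):

```python
import numpy as np

def simplex_bin_index(P: np.ndarray, m: int = 4) -> np.ndarray:
    """Map points of the 2-simplex to the m^2 triangular bins of Figure 2a.
    P: (n, 3) rows of probability vectors. Ids are unique per triangle but not
    contiguous; exact lattice-boundary points are lumped with the downward id."""
    idx = np.minimum(np.floor(P * m).astype(int), m - 1)
    i, j = idx[:, 0], idx[:, 1]
    up = idx.sum(axis=1) == m - 1          # upward vs. downward triangles
    return np.where(up, i * m + j, m * m + i * m + j)

def binned_conditional_mean(P: np.ndarray, Y: np.ndarray, m: int = 4) -> np.ndarray:
    """H(f(x)) in Equation 21, taken as the per-bin average of the one-hot
    labels, i.e., the binned analogue of E[y | f(x)]."""
    ids = simplex_bin_index(P, m)
    H = np.zeros_like(P, dtype=float)
    for b in np.unique(ids):
        mask = ids == b
        H[mask] = Y[mask].mean(axis=0)
    return H
```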
|
|
|
|
|
|
|
|
(a) Splitting the simplex in 16 bins |
|
|
|
|
|
(b) Histogram (c) Dirichlet KDE |
|
|
|
|
|
Figure 2: Extension of the binned estimator to the probability simplex, compared with the KDE-ECE. The KDE-ECE achieves a better approximation to the finite sample, and accurately models the fact that samples tend to be concentrated near low-dimensional faces of the simplex.
|
|
|
|
|
[Figure 3: scatter plots of Binned ECE (y-axis) against KDE ECE (x-axis).]
|
|
|
|
|
(a) Canonical |
|
|
|
|
|
(b) Marginal |
|
|
|
|
|
(c) Top-label |
|
|
|
|
|
Figure 3: Relationship between the KDE-ECE estimates and their corresponding binned approximations for the three types of calibration. Each point represents a ResNet-56 model trained on a subset of three classes from CIFAR-10. The 3000 probability scores of the test set are assigned to 25 bins of adaptive width for the binned estimate. A bandwidth of 0.001 is used for KDE-ECE.
|
|
|
|
|
|
|
|
5.3 MULTICLASS CLASSIFICATION |
|
|
|
In this section we evaluate our proposed KDE-based ECE estimator, jointly trained with the cross-entropy loss (KDE-CRE), against the other baselines in a multiclass setting on CIFAR-10 and CIFAR-100. We found that for KDE-CRE, values of $\lambda \in [0.01, 0.1]$ provide a good trade-off between accuracy and calibration error. Table 1 summarizes the accuracy and marginal $L_1$ ECE (%) (computed using 15 bins), measured across multiple architectures. For MMCE, we report the results with $\lambda = 1$, and for KDE-CRE we use $\lambda = 0.01$. An analogous table measuring marginal $L_2$ ECE is given in Appendix B.
|
|
|
Table 1: Accuracy and marginal $L_1$ ECE (%) computed with 15 bins for different loss functions and architectures, both trained from scratch (Pre T) and after temperature scaling on a validation set (Post T). Best results are marked in bold.
|
|
|
The first four architecture columns report CIFAR-10 results, the last four CIFAR-100.

| Loss | Metric | ResNet | ResNet (SD) | Wide-ResNet | DenseNet | ResNet | ResNet (SD) | Wide-ResNet | DenseNet |
|---|---|---|---|---|---|---|---|---|---|
| CRE | ECE (Pre T) | 0.419 | 0.357 | **0.241** | 0.236 | 0.129 | 0.100 | **0.086** | **0.090** |
| CRE | ECE (Post T) | 0.282 | 0.250 | 0.278 | **0.165** | 0.114 | **0.089** | **0.105** | **0.078** |
| CRE | Acc (Pre T) | 0.925 | **0.926** | **0.957** | 0.947 | **0.700** | **0.728** | **0.803** | 0.756 |
| CRE | Acc (Post T) | **0.927** | 0.925 | **0.957** | 0.947 | **0.700** | **0.729** | **0.801** | 0.758 |
| MMCE | ECE (Pre T) | **0.250** | 0.390 | 0.265 | **0.193** | 0.143 | 0.100 | 0.120 | 0.123 |
| MMCE | ECE (Post T) | 0.361 | 0.308 | 0.291 | 0.235 | 0.121 | 0.093 | 0.109 | 0.124 |
| MMCE | Acc (Pre T) | **0.929** | 0.925 | 0.947 | 0.944 | 0.693 | 0.723 | 0.767 | 0.748 |
| MMCE | Acc (Post T) | 0.926 | **0.926** | 0.949 | 0.945 | 0.691 | 0.722 | 0.770 | 0.743 |
| FL-53 | ECE (Pre T) | 0.403 | 0.416 | 0.414 | 0.259 | 0.145 | 0.120 | 0.125 | 0.095 |
| FL-53 | ECE (Post T) | 0.272 | 0.267 | 0.437 | 0.220 | 0.124 | 0.107 | 0.106 | 0.081 |
| FL-53 | Acc (Pre T) | 0.922 | 0.920 | 0.936 | **0.948** | 0.695 | 0.711 | 0.760 | 0.752 |
| FL-53 | Acc (Post T) | 0.923 | 0.919 | 0.936 | **0.949** | 0.693 | 0.712 | 0.763 | 0.753 |
| L1 KDE-CRE | ECE (Pre T) | 0.363 | **0.338** | 0.289 | 0.296 | **0.128** | **0.096** | 0.092 | 0.099 |
| L1 KDE-CRE | ECE (Post T) | **0.182** | **0.220** | **0.226** | 0.248 | **0.104** | 0.095 | 0.108 | 0.085 |
| L1 KDE-CRE | Acc (Pre T) | 0.926 | 0.925 | 0.953 | 0.943 | 0.697 | 0.725 | 0.796 | **0.757** |
| L1 KDE-CRE | Acc (Post T) | **0.927** | 0.925 | 0.953 | 0.944 | 0.698 | 0.720 | 0.793 | **0.759** |
|
|
|
We notice that both before and after temperature scaling, KDE-CRE achieves very competitive ECE scores. Another encouraging observation is that the improvement in calibration error comes at almost no cost in accuracy. An important advantage of our KDE-based method is the ability to directly train and evaluate canonical calibration. In Figure 4 we show a scatter plot with confidence intervals of the $L_1$ and $L_2$ KDE-CRE models for canonical calibration and the other baselines on CIFAR-10. We measure the canonical calibration using our KDE-ECE metric from Section 5.2. In three of the architectures, both $L_1$ and $L_2$ KDE-CRE either dominate or are statistically tied with cross-entropy (CRE). Similarly, Figure 5 shows a scatter plot of $L_1$ and $L_2$ KDE-CRE models trained to minimize the marginal calibration error. In this case, we measure the marginal $L_2$ ECE with the standard binned estimator. In most cases, our methods Pareto dominate the other baselines. A general observation can be made, however, that the models trained with cross-entropy have a surprisingly low marginal calibration error, contrary to previous findings that show poor calibration when considering only the most confident prediction (top-label calibration). An additional experiment comparing the CRE baseline with KDE-CRE for canonical calibration on a benchmark dataset of histological images of human colorectal cancer is given in Appendix D.2, which clearly illustrates the superior performance of our method, both in terms of accuracy and calibration error, in this context.

To summarize, the experiments show that our estimator consistently produces calibration errors competitive with other state-of-the-art approaches, while maintaining accuracy and keeping the computational complexity at $O(n^2)$. We evaluate the computational overhead of CRE and KDE-CRE and summarize the results in a table in Appendix D.1, which shows that the added cost is less than a couple of percent. There are several limitations of the current work: larger-scale benchmarking would be beneficial for exploring the limits of canonical calibration using Dirichlet kernels. Furthermore, while we showed consistency of our estimator, we did not fully derive and implement its debiasing. Due to space constraints, this was not the focus of the paper and is left for future work.
|
|
|
6 CONCLUSION |
|
|
|
In this paper, we proposed a consistent and differentiable estimator of an $L_p$ calibration error using Dirichlet kernels. The KDE-based estimate can be directly optimized alongside any loss function in the existing batch stochastic gradient descent framework. Furthermore, we propose using it as a measure of the highest form of calibration, which requires the entire probability vector to be calibrated. We showed empirically on a range of neural architectures that the performance of our estimator in terms of accuracy and calibration error is competitive against the current state of the art, while having superior properties as a consistent estimator of the canonical calibration error.
|
|
|
|
|
[Figure 4: (a) ResNet-110, (b) ResNet-110 (SD), (c) Wide-ResNet-28-10, (d) DenseNet-40. Each panel plots canonical calibration error (y-axis) against accuracy (ACC, x-axis) for CRE, FL, MMCE, $L_1$ KDE-CRE, and $L_2$ KDE-CRE.]
|
|
|
|
|
Figure 4: Canonical calibration on CIFAR-10 |
|
|
|
|
|
[Figure 5: (a) ResNet-110, (b) ResNet-110 (SD), (c) Wide-ResNet-28-10, (d) DenseNet-40. Each panel plots marginal calibration error (y-axis, scale $\times 10^{-5}$) against accuracy (ACC, x-axis) for CRE, FL, MMCE, $L_1$ KDE-CRE, and $L_2$ KDE-CRE.]
|
|
|
|
|
Figure 5: Marginal calibration on CIFAR-100 |
|
|
|
|
|
|
|
|
REFERENCES |
|
|
|
A. Appice, P. Rodrigues, V. S. Costa, C. Soares, João Gama, and A. Jorge. Novel decompositions of proper scoring rules for classification: Score adjustment as precursor to calibration. 2015.

Arsenii Ashukha, Alexander Lyzhov, Dmitry Molchanov, and Dmitry Vetrov. Pitfalls of in-domain uncertainty estimation and ensembling in deep learning, 2021.

Peter L. Bartlett, Michael I. Jordan, and Jon D. McAuliffe. Convexity, classification, and risk bounds. *Journal of the American Statistical Association*, 101(473):138–156, 2006.

Taoufik Bouezmarni and Jean-Marie Rolin. Consistency of the beta kernel density function estimator. *The Canadian Journal of Statistics / La Revue Canadienne de Statistique*, 31(1):89–98, 2003.

Jochen Bröcker. Reliability, sufficiency, and the decomposition of proper scores. *Quarterly Journal of the Royal Meteorological Society*, 135(643):1512–1519, Jul 2009.

Song Xi Chen. Beta kernel estimators for density functions. *Computational Statistics & Data Analysis*, 31:131–145, 1999.

M. Degroot and S. Fienberg. The comparison and evaluation of forecasters. *The Statistician*, 32:12–22, 1983.

Yukun Ding, Jinglan Liu, Jinjun Xiong, and Yiyu Shi. Revisiting the evaluation of uncertainty estimation and its application to explore model complexity-uncertainty trade-off. arXiv:1903.02050, 2020.

Robert Duin. On the choice of smoothing parameters for Parzen estimators of probability density functions. *IEEE Transactions on Computers*, C-25(11):1175–1179, 1976.

Thomas S. Ferguson. U-statistics. In *Notes for Statistics 200C*. UCLA, 2005.

Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. On calibration of modern neural networks, 2017.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. arXiv:1512.03385, 2015.

Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Weinberger. Deep networks with stochastic depth. arXiv:1603.09382, 2016.

Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks, 2018.

Jakob Kather, Cleo-Aron Weis, Francesco Bianconi, Susanne Melchers, Lothar Schad, Timo Gaiser, Alexander Marx, and Frank Zöllner. Multi-class texture analysis in colorectal cancer histology. *Scientific Reports*, 6:27988, 06 2016. doi: 10.1038/srep27988.

Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.

Volodymyr Kuleshov and Percy S Liang. Calibrated structured prediction. In C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett (eds.), *Advances in Neural Information Processing Systems*, volume 28. Curran Associates, Inc., 2015.

Meelis Kull, Miquel Perello-Nieto, Markus Kängsepp, Telmo Silva Filho, Hao Song, and Peter Flach. Beyond temperature scaling: Obtaining well-calibrated multiclass probabilities with Dirichlet calibration. arXiv:1910.12656, 2019.

Ananya Kumar, Percy S Liang, and Tengyu Ma. Verified uncertainty calibration. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (eds.), *Advances in Neural Information Processing Systems 32*, pp. 3792–3803. 2019.

Aviral Kumar, Sunita Sarawagi, and Ujjwal Jain. Trainable calibration measures for neural networks from kernel mean embeddings. In *ICML*, 2018.

Gongbo Liang, Yu Zhang, Xiaoqin Wang, and Nathan Jacobs. Improved trainable calibration method for neural networks on medical imaging classification. In *British Machine Vision Conference*, 2020.

Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. arXiv:1708.02002, 2018.

Jishnu Mukhoti, Viveka Kulharia, Amartya Sanyal, Stuart Golodetz, Philip H. S. Torr, and Puneet K. Dokania. Calibrating deep neural networks using focal loss. arXiv:2002.09437, 2020.

A. Murphy. A new vector partition of the probability score. *Journal of Applied Meteorology*, 12:595–600, 1973.

Rafael Müller, Simon Kornblith, and Geoffrey Hinton. When does label smoothing help? arXiv:1906.02629, 2020.

Mahdi Pakdaman Naeini and Gregory F. Cooper. Binary classifier calibration using an ensemble of near isotonic regression models. arXiv:1511.05191, 2015.

Mahdi Pakdaman Naeini, Gregory F. Cooper, and Milos Hauskrecht. Obtaining well calibrated probabilities using Bayesian binning. In *Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence*, pp. 2901–2907, 2015.

Khanh Nguyen and Brendan O'Connor. Posterior calibration and exploratory analysis for natural language processing models. arXiv:1508.05154, 2015.

Frédéric Ouimet and Raimon Tolosana-Delgado. Asymptotic properties of Dirichlet kernel density estimators. arXiv:2002.06956, 2021.

Emanuel Parzen. On estimation of a probability density function and mode. *The Annals of Mathematical Statistics*, 33(3):1065–1076, 1962.

Gabriel Pereyra, George Tucker, Jan Chorowski, Łukasz Kaiser, and Geoffrey Hinton. Regularizing neural networks by penalizing confident output distributions. arXiv:1701.06548, 2017.

John C. Platt. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In *Advances in Large Margin Classifiers*, pp. 61–74. MIT Press, 1999.

Murray Rosenblatt. Remarks on some nonparametric estimates of a density function. *The Annals of Mathematical Statistics*, 27(3):832–837, 1956.

Alastair Scott and Chien-Fu Wu. On the asymptotic distribution of ratio and regression estimators. *Journal of the American Statistical Association*, 76(373):98–102, 1981.

B. W. Silverman. *Density Estimation for Statistics and Data Analysis*. Chapman & Hall, 1986.

Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. arXiv:1512.00567, 2015.

Juozas Vaicenavicius, David Widmann, Carl Andersson, Fredrik Lindsten, Jacob Roll, and Thomas B. Schön. Evaluating model calibration in classification. arXiv:1902.06977, 2019.

Jonathan Wenger, Hedvig Kjellström, and Rudolph Triebel. Non-parametric calibration for classification. In *International Conference on Artificial Intelligence and Statistics*, pp. 178–190, 2020.

B. Zadrozny and C. Elkan. Transforming classifier scores into accurate multiclass probability estimates. *Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining*, 2002.

Bianca Zadrozny and Charles Elkan. Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers. *ICML*, 1, 05 2001.

Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In *British Machine Vision Conference*, 2016.

Jize Zhang, Bhavya Kailkhura, and T. Yong-Jin Han. Mix-n-match: Ensemble and compositional methods for uncertainty calibration in deep learning. In *International Conference on Machine Learning*, 2020.

Shunpu Zhang and Rohana Karunamuni. Boundary performance of the beta kernel estimators. *Journal of Nonparametric Statistics*, 22:81–104, 01 2010.
|
|
|
A DERIVATION OF THE MSE DECOMPOSITION |
|
|
|
**Definition A.1** (Mean Squared Error (MSE)). *The mean squared error of an estimator is*

$$\mathrm{MSE}(f) := \mathbb{E}[(f(x) - y)^2]. \tag{22}$$

**Proposition A.2.** $\mathrm{MSE}(f) \geq \mathrm{CE}_2(f)^2$

*Proof.*

$$\mathrm{MSE}(f) := \mathbb{E}[(f(x) - y)^2] = \mathbb{E}\big[ \big( (f(x) - \mathbb{E}[y \mid f(x)]) + (\mathbb{E}[y \mid f(x)] - y) \big)^2 \big] \tag{23}$$

$$= \underbrace{\mathbb{E}[(f(x) - \mathbb{E}[y \mid f(x)])^2]}_{= \mathrm{CE}_2^2} + \mathbb{E}[(\mathbb{E}[y \mid f(x)] - y)^2] + 2\, \mathbb{E}[(f(x) - \mathbb{E}[y \mid f(x)])(\mathbb{E}[y \mid f(x)] - y)] \tag{24}$$

which implies

$$\mathrm{MSE}(f) - \mathrm{CE}_2(f)^2 = \mathbb{E}[(\mathbb{E}[y \mid f(x)] - y)^2] + 2\, \mathbb{E}[(f(x) - \mathbb{E}[y \mid f(x)])(\mathbb{E}[y \mid f(x)] - y)] \tag{25}$$

$$= \mathbb{E}[(\mathbb{E}[y \mid f(x)] - y)^2] + 2\, \mathbb{E}[f(x)\, \mathbb{E}[y \mid f(x)]] - 2\, \mathbb{E}[f(x)\, y] - 2\, \mathbb{E}[\mathbb{E}[y \mid f(x)]^2] + 2\, \mathbb{E}[\mathbb{E}[y \mid f(x)]\, y] \tag{26}$$

$$= \mathbb{E}[\mathbb{E}[y \mid f(x)]^2] + \mathbb{E}[y^2] - 2\, \mathbb{E}[\mathbb{E}[y \mid f(x)]\, y] + 2\, \mathbb{E}[f(x)\, \mathbb{E}[y \mid f(x)]] - 2\, \mathbb{E}[f(x)\, y] - 2\, \mathbb{E}[\mathbb{E}[y \mid f(x)]^2] + 2\, \mathbb{E}[\mathbb{E}[y \mid f(x)]\, y] \tag{27}$$

$$= \mathbb{E}[y^2] + 2\, \mathbb{E}[f(x)\, \mathbb{E}[y \mid f(x)]] - 2\, \mathbb{E}[f(x)\, y] - \mathbb{E}[\mathbb{E}[y \mid f(x)]^2] \tag{28}$$

$$= \mathbb{E}\big[ (2 f(x) - y - \mathbb{E}[y \mid f(x)])\, (\mathbb{E}[y \mid f(x)] - y) \big] \tag{29}$$

$$= \mathbb{E}\big[ (f(x) - y)(\mathbb{E}[y \mid f(x)] - y) \big] + \mathbb{E}\big[ (f(x) - \mathbb{E}[y \mid f(x)])(\mathbb{E}[y \mid f(x)] - y) \big]. \tag{30}$$

By the law of total expectation, we can write the above as

$$\mathrm{MSE}(f) - \mathrm{CE}_2(f)^2 = \mathbb{E}\Big[ \mathbb{E}\big[ (f(x) - y)(\mathbb{E}[y \mid f(x)] - y) + (f(x) - \mathbb{E}[y \mid f(x)])(\mathbb{E}[y \mid f(x)] - y) \,\big|\, f(x) \big] \Big]. \tag{31}$$

Focusing on the inner conditional expectation, we have that

$$\mathbb{E}\big[ (f(x) - y)(\mathbb{E}[y \mid f(x)] - y) + (f(x) - \mathbb{E}[y \mid f(x)])(\mathbb{E}[y \mid f(x)] - y) \,\big|\, f(x) \big]$$
$$= \mathbb{E}[y \mid f(x)]\, (f(x) - 1)(\mathbb{E}[y \mid f(x)] - 1) + (1 - \mathbb{E}[y \mid f(x)])\, f(x)\, \mathbb{E}[y \mid f(x)]$$
$$\quad + \mathbb{E}[y \mid f(x)]\, (f(x) - \mathbb{E}[y \mid f(x)])(\mathbb{E}[y \mid f(x)] - 1)$$
$$\quad + (1 - \mathbb{E}[y \mid f(x)])\, (f(x) - \mathbb{E}[y \mid f(x)])\, \mathbb{E}[y \mid f(x)] \tag{32}$$
$$= (1 - \mathbb{E}[y \mid f(x)])\, \mathbb{E}[y \mid f(x)] \geq 0 \quad \forall f(x), \tag{33}$$

and therefore

$$\mathrm{MSE}(f) - \mathrm{CE}_2(f)^2 = \mathbb{E}\big[ (1 - \mathbb{E}[y \mid f(x)])\, \mathbb{E}[y \mid f(x)] \big] \geq 0. \tag{34}$$

The expectation in Equation 34 is over the variances of Bernoulli random variables with probabilities $\mathbb{E}[y \mid f(x)]$. $\square$
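Equation 34 can also be checked numerically on synthetic data where the conditional expectation is known by construction; a minimal sketch (numpy):

```python
import numpy as np

# Sanity check of Equation 34: draw scores z, define a known miscalibrated
# conditional q(z) = E[y | f(x) = z], and compare MSE - CE_2^2 with E[(1 - q)q].
rng = np.random.default_rng(0)
n = 200_000
z = rng.uniform(0.05, 0.95, n)                 # predicted scores f(x)
q = z ** 2 / (z ** 2 + (1.0 - z) ** 2)         # true conditional, known here
y = (rng.uniform(size=n) < q).astype(float)    # y | z ~ Bernoulli(q(z))

mse = np.mean((z - y) ** 2)
ce2_sq = np.mean((q - z) ** 2)                 # exact CE_2^2, since q is known
sharpness = np.mean((1.0 - q) * q)             # right-hand side of Equation 34
print(mse - ce2_sq, sharpness)                 # the two agree up to sampling noise
```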
|
|
|
|
|
|
|
|
B RESULTS |
|
|
|
Table 2 summarizes the marginal $L_2$ ECE and accuracy for the two datasets across multiple architectures and training loss functions. The scatter plots in Figures 6 and 7 show the accuracy and both $L_1$ and $L_2$ ECE, for top-label and marginal calibration on CIFAR-10 and CIFAR-100, respectively. KDE-CRE is trained by directly minimizing the metric that is evaluated, e.g., in the first column we minimize the marginal $L_1$ calibration error and in the last column we optimize the $L_2$ top-label calibration error. Other methods do not have the flexibility of choosing the type of calibration and the $L_p$ norm.
|
|
|
Table 2: Accuracy and marginal $L_2$ ECE (%) computed with 15 bins for different approaches, trained from scratch (Pre T) and after temperature scaling (Post T).
|
|
|
The first four architecture columns report CIFAR-10 results, the last four CIFAR-100.

| Loss | Metric | ResNet | ResNet (SD) | Wide-ResNet | DenseNet | ResNet | ResNet (SD) | Wide-ResNet | DenseNet |
|---|---|---|---|---|---|---|---|---|---|
| CRE | ECE (Pre T) | 0.020 | 0.009 | 0.007 | 0.008 | 0.002 | 0.002 | 0.001 | 0.001 |
| CRE | ECE (Post T, NLL) | 0.007 | 0.005 | 0.008 | 0.004 | 0.002 | 0.001 | 0.001 | 0.001 |
| CRE | Acc (Pre T) | 0.925 | 0.926 | 0.950 | 0.947 | 0.700 | 0.728 | 0.797 | 0.756 |
| CRE | Acc (Post T, NLL) | 0.927 | 0.925 | 0.950 | 0.947 | 0.700 | 0.729 | 0.794 | 0.758 |
| MMCE | ECE (Pre T) | 0.009 | 0.015 | 0.009 | 0.004 | 0.003 | 0.001 | 0.003 | 0.003 |
| MMCE | ECE (Post T, NLL) | 0.013 | 0.009 | 0.009 | 0.005 | 0.002 | 0.001 | 0.002 | 0.003 |
| MMCE | Acc (Pre T) | 0.929 | 0.925 | 0.947 | 0.944 | 0.693 | 0.723 | 0.767 | 0.748 |
| MMCE | Acc (Post T, NLL) | 0.926 | 0.926 | 0.949 | 0.945 | 0.691 | 0.722 | 0.770 | 0.743 |
| FL-53 | ECE (Pre T) | 0.013 | 0.020 | 0.026 | 0.005 | 0.003 | 0.002 | 0.003 | 0.002 |
| FL-53 | ECE (Post T, NLL) | 0.008 | 0.009 | 0.022 | 0.004 | 0.002 | 0.002 | 0.002 | 0.001 |
| FL-53 | Acc (Pre T) | 0.922 | 0.920 | 0.936 | 0.948 | 0.695 | 0.711 | 0.760 | 0.752 |
| FL-53 | Acc (Post T, NLL) | 0.923 | 0.919 | 0.936 | 0.949 | 0.693 | 0.712 | 0.763 | 0.753 |
| L2 KDE-CRE | ECE (Pre T) | 0.010 | 0.015 | 0.007 | 0.008 | 0.002 | 0.002 | 0.001 | 0.001 |
| L2 KDE-CRE | ECE (Post T, NLL) | 0.004 | 0.012 | 0.008 | 0.009 | 0.002 | 0.002 | 0.001 | 0.001 |
| L2 KDE-CRE | Acc (Pre T) | 0.930 | 0.922 | 0.950 | 0.943 | 0.707 | 0.713 | 0.797 | 0.757 |
| L2 KDE-CRE | Acc (Post T, NLL) | 0.930 | 0.921 | 0.950 | 0.944 | 0.707 | 0.717 | 0.794 | 0.755 |
|
|
|
C RELATIONSHIP BETWEEN THE BINNED ESTIMATOR AND THE KERNEL DENSITY ESTIMATOR
|
|
|
Figure 8 shows an example of the binned estimator in a three-class setting on CIFAR-10. The points are mostly concentrated at the edges of the histogram, as can be seen from Figure 8b. The surface of the corresponding Dirichlet KDE is given in Figure 8c.

Figure 9 shows the relationship between the binned estimator and our KDE-ECE metric. The points represent a trained ResNet-56 model on a subset of three classes from CIFAR-10. In every row, a different number of points was used to estimate the KDE-ECE.
|
|
|
D EXPERIMENTS FOR REBUTTAL |
|
|
|
D.1 TRAINING TIME MEASUREMENTS |
|
|
|
In Table 3 we summarize the running time per epoch for training with (KDE-CRE) and without (CRE) regularization for the two datasets and four architectures. KDE-CRE does not add an overhead of more than a couple of percent over the CRE baseline.
|
|
|
D.2 CANONICAL CALIBRATION IN A MEDICAL APPLICATION |
|
|
|
An additional experiment with a medical application, where the canonical calibration is of particular |
|
interest, was performed on the publicly-available Kather dataset (Kather et al., 2016), which consists |
|
of 5000 histological images of human colorectal cancer. The data has eight different classes of tissue. |
|
Figure 10 shows a comparison in performance of the CRE baseline with our KDE-CRE method. The |
|
canonical L1 (left) and L2 (right) calibration is measured using our KDE-ECE metric. The results |
|
clearly illustrate that our method significantly outperforms the cross-entropy baseline, both in terms |
|
of accuracy and calibration error, for several choices of the regularization parameter. |
|
|
|
D.3 BIAS AND CONVERGENCE RATES |
|
|
|
Figure 11 shows a comparison of the ground truth, computed from 3000 test points with KDE-ECE, against KDE-ECE and binned ECE estimated with a varying number of points used for the estimation.
|
|
|
|
|
|
[Figure 6: panels plot calibration error against accuracy (ACC) for CRE, FL, MMCE, $L_1$ KDE-CRE, and $L_2$ KDE-CRE — marginal and top-label calibration ($L_1$ and $L_2$) on CIFAR-10 using DenseNet, ResNet, ResNet (SD), and Wide-ResNet.]
|
|
|
|
|
Figure 6: Top-label and marginal calibration on CIFAR-10. |
|
|
|
Table 3: Training time [sec] per epoch for the Cross-Entropy (CRE) and L1 KDE-CRE methods for different models and datasets.
|
|
|
|
|
| Dataset | Model | CRE | L1 KDE-CRE |
|---|---|---|---|
| CIFAR-10 | ResNet-110 | 51.8 | 53 |
| CIFAR-10 | ResNet-110 (SD) | 45 | 46 |
| CIFAR-10 | Wide-ResNet-28-10 | 152.9 | 154.9 |
| CIFAR-10 | DenseNet-40 | 103.2 | 106.8 |
| CIFAR-100 | ResNet-110 | 90 | 92.9 |
| CIFAR-100 | ResNet-110 (SD) | 78.2 | 80.7 |
| CIFAR-100 | Wide-ResNet-28-10 | 150.5 | 155.3 |
| CIFAR-100 | DenseNet-40 | 101 | 105.5 |
|
tion. The model used is a ResNet-56, trained on a subset of three classes from CIFAR-10. The figure shows that the two estimates are comparable and that both do a reasonable job.

Figure 12 shows the absolute difference between the ground-truth and estimated ECE, using our KDE estimator and a binned estimator with a varying number of points used for the estimation. The results are averaged over 120 ResNet-56 models trained on a subset of three classes from CIFAR-10. Both estimators are biased and have some variance, and the plot shows that the combination of the two is of the same order of magnitude for both estimators. The empirical convergence rates (the slopes of the log-log plots) are given in the legends and are close to the theoretically expected value of -0.5.
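For reference, the binned estimator entering these comparisons can be sketched as follows. This minimal NumPy version uses equal-width bins for simplicity (the figures use adaptive-width bins) and is an illustration rather than the exact evaluation code; the function name is ours.

```python
import numpy as np

def binned_top_label_ece(probs, labels, n_bins=25):
    """Binned top-label ECE: per-bin |mean accuracy - mean confidence|,
    weighted by the fraction of points falling into each bin.

    probs: (n, k) array of predicted probability vectors.
    labels: (n,) array of integer class labels.
    """
    conf = probs.max(axis=1)                                  # top-label confidence
    correct = (probs.argmax(axis=1) == labels).astype(float)  # 0/1 accuracy
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf > lo) & (conf <= hi)
        if in_bin.any():
            ece += in_bin.mean() * abs(correct[in_bin].mean() - conf[in_bin].mean())
    return ece
```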
|
|
|
|
|
|
|
|
[Figure 7 panels: scatter plots of calibration error vs. accuracy (ACC) on CIFAR-100 for DenseNet, ResNet, ResNet (SD), and Wide-ResNet, for top-label and marginal calibration; methods compared: CRE, FL, MMCE, L1 KDE-CRE, L2 KDE-CRE.]
|
|
|
|
|
Figure 7: Top-label and marginal calibration on CIFAR-100.
|
|
|
|
|
[Figure 8 panels: (a) Splitting the simplex in 16 bins; (b) Corresponding histogram; (c) Corresponding Dirichlet KDE.]
|
|
|
|
|
Figure 8: An example of a simplex binned estimator and kernel-density estimator for CIFAR-10.
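As an illustration of a Dirichlet KDE like the one in panel (c), the following minimal sketch evaluates a Dirichlet kernel density estimate on the simplex, assuming the common parameterization in which the kernel centered at a score vector z with bandwidth h is the Dirichlet density with concentration z/h + 1; the helper name and the example data are ours, for illustration only.

```python
import numpy as np
from scipy.stats import dirichlet

def dirichlet_kde(x, scores, h=0.001):
    """Dirichlet KDE on the simplex, evaluated at a point x strictly
    inside the simplex. Each kernel is the Dirichlet density with
    concentration z / h + 1, which concentrates around the score z
    as the bandwidth h shrinks."""
    return float(np.mean([dirichlet.pdf(x, z / h + 1.0) for z in scores]))

# Example (3 classes): density at the barycenter under 100 random scores.
scores = np.random.dirichlet(np.ones(3), size=100)
print(dirichlet_kde(np.array([1/3, 1/3, 1/3]), scores))
```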
|
|
|
|
|
|
|
|
D.4 CHOICE OF THE BATCH SIZE |
|
|
|
In Figure 13 we investigate the choice of the batch size on CIFAR-10. To this end, we use two differently shuffled dataloaders that draw random batches from the same training set. The first dataloader provides the batches for the loss term (CRE), while the second dataloader provides the batches for the regularization (KDE). The batch size for the loss term is fixed in all experiments, while the batch size for the regularization varies. The orange point is our normal experimental set-up with just one dataloader (i.e., the same points are used for the loss and the KDE-ECE computation), included as a comparison. The plot shows that our chosen batch size of 128 is appropriate for our purposes. A minimal sketch of this two-dataloader set-up is given below.
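In the sketch, the model, data, and the kde_ece stand-in are placeholders for illustration (the actual regularizer is the Dirichlet-KDE-based estimator proposed in this paper), so only the batching scheme is meant literally.

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

def kde_ece(probs, labels):
    # Stand-in for the differentiable calibration regularizer; the real
    # term is the Dirichlet-KDE-based ECE estimate.
    onehot = F.one_hot(labels, probs.shape[1]).float()
    return (probs - onehot).abs().mean()

train_set = TensorDataset(torch.randn(1024, 10), torch.randint(0, 3, (1024,)))
model = torch.nn.Linear(10, 3)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
lam, reg_batch_size = 0.2, 256  # the regularization batch size is the knob varied

# Two independently shuffled loaders over the same training set: one feeds
# the cross-entropy loss (fixed batch size 128), one feeds the regularizer.
loss_loader = DataLoader(train_set, batch_size=128, shuffle=True)
kde_loader = DataLoader(train_set, batch_size=reg_batch_size, shuffle=True)

for (x_l, y_l), (x_k, y_k) in zip(loss_loader, kde_loader):
    loss = F.cross_entropy(model(x_l), y_l) \
           + lam * kde_ece(model(x_k).softmax(dim=1), y_k)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```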
|
|
|
|
|
|
|
|
[Figure 9 panels: scatter plots of Binned ECE vs. KDE ECE for canonical, marginal, and top-label calibration, each using 100, 500, and 1000 points (25 bins, 0.001 bandwidth).]
|
|
|
|
|
Figure 9: Relationship between the ECE metric based on binning and kernel density estimation (KDE-ECE) for the three types of calibration: canonical, marginal, and top-label. In every row, a different number of points is used to approximate the KDE-ECE.
|
|
|
|
|
[Figure 10 panels: canonical calibration error vs. accuracy (ACC) on Kather; methods compared: CRE, L1 KDE-CRE, L2 KDE-CRE.]
|
|
|
|
|
Figure 10: Canonical calibration on Kather using a ResNet-50 model.
|
|
|
|
|
|
|
|
|
|
|
[Figure 11 panels: (a) Canonical, (b) Marginal, (c) Top-label; ECE (Ground truth, KDE-ECE, Binned ECE) vs. number of points.]
|
|
|
|
|
Figure 11: KDE-ECE estimates and their corresponding binned approximations for the three types of calibration, for a varying number of points used for the estimation. The ground truth is calculated using 3000 probability scores of the test set. For the binned estimate, the points are assigned to 25 bins with adaptive width. A bandwidth of 0.001 is used for KDE-ECE.
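One concrete implementation of such adaptive-width (equal-mass) binning places the bin edges at score quantiles, so that each of the 25 bins receives roughly the same number of points; the following minimal NumPy sketch illustrates this reading (the helper name is ours, not the paper's binning code).

```python
import numpy as np

def adaptive_bin_edges(scores, n_bins=25):
    """Equal-mass ("adaptive width") bin edges: edges are placed at
    quantiles of the scores, so each bin receives roughly the same
    number of points."""
    edges = np.quantile(scores, np.linspace(0.0, 1.0, n_bins + 1))
    edges[0], edges[-1] = 0.0, 1.0      # cover the full [0, 1] range
    return np.unique(edges)             # merge duplicate edges from ties

# Example: bin index for each confidence score.
scores = np.random.rand(1000)
bin_idx = np.digitize(scores, adaptive_bin_edges(scores)[1:-1])
```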
|
|
|
|
|
[Figure 12 panels: (a) Canonical, (b) Marginal, (c) Top-label; absolute estimation error of KDE-ECE and Binned ECE vs. number of points on log-log axes, with fitted slopes reported in the legends.]
|
|
|
|
|
Figure 12: Absolute difference between the ground truth and the estimated ECE for a varying number of points used for the estimation. The ground truth is calculated using 3000 probability scores of the test set. For the binned estimate, the points are assigned to 25 bins with adaptive width. A bandwidth of 0.001 is used for KDE-ECE. Note that the axes are on a log scale.
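The slopes quoted in the legends correspond to a least-squares fit in log-log space; a minimal sketch of such a fit is given below (the function name and example data are ours, for illustration).

```python
import numpy as np

def empirical_convergence_rate(n_points, abs_errors):
    """Least-squares slope of log(error) vs. log(n). A slope of roughly
    -0.5 matches the theoretically expected n**(-1/2) rate."""
    slope, _intercept = np.polyfit(np.log(n_points), np.log(abs_errors), deg=1)
    return slope

# Example with synthetic n**(-1/2) errors (slope should be close to -0.5).
n = np.array([100, 200, 400, 800, 1000])
print(empirical_convergence_rate(n, 1.0 / np.sqrt(n)))
```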
|
|
|
|
|
|
|
|
[Figure 13 panels: calibration error vs. accuracy (ACC) for regularization batch sizes 32, 64, 128, 256, and 512, comparing 2 KDE-CRE (two dataloaders) against KDE-CRE (one dataloader).]
|
|
|
Figure 13: Training with different batches for the loss and the regularization (2 KDE-CRE), where the batch size for the loss is fixed and the batch size for the regularization varies. The orange point shows our usual experimental set-up, where we train with only one batch (KDE-CRE). Upper row: marginal; lower row: top-label.
|
|
|
|
|
|
|
|
|