# CALIBRATION REGULARIZED TRAINING OF DEEP NEURAL NETWORKS USING DIRICHLET KERNEL DENSITY ESTIMATION

**Anonymous authors** Paper under double-blind review

ABSTRACT

Calibrated probabilistic classifiers are models whose predicted probabilities can directly be interpreted as uncertainty estimates. This property is particularly important in safety-critical applications such as medical diagnosis or autonomous driving. However, it has been shown recently that deep neural networks are poorly calibrated and tend to output overconfident predictions. As a remedy, we propose a trainable calibration error estimator based on Dirichlet kernel density estimates, which asymptotically converges to the true $L_p$ calibration error. This novel estimator enables us to achieve the strongest notion of multiclass calibration, called canonical calibration, while other common calibration methods only allow for top-label and marginal calibration. The empirical results show that our estimator is competitive with the state-of-the-art, consistently yielding tradeoffs between calibration error and accuracy that are (near) Pareto optimal across a range of network architectures. The computational complexity of our estimator is $O(n^2)$, matching that of the kernel maximum mean discrepancy, used in a previously considered trainable calibration estimator (Kumar et al., 2018). By contrast, the proposed method has a natural choice of kernel, and can be used to generate consistent estimates of other quantities based on conditional expectation, such as the sharpness of an estimator.

1 INTRODUCTION

Deep neural networks have shown tremendous success in classification tasks, being regularly the best performing models in terms of accuracy. However, they are also known to make overconfident predictions (Guo et al., 2017), which is particularly problematic in safety-critical applications such as medical diagnosis or autonomous driving. Therefore, in many real world applications we do not just care about the predictive performance, but also about the trustworthiness of that prediction; that is, we are interested in accurate predictions with robust uncertainty estimates. To this end, we want our models to be uncertainty calibrated, which means that, for instance, among all cells that have been predicted with a probability of 0.8 to be cancerous, in fact a fraction of 80% belong to a malignant tumor.

Being calibrated, however, does not imply that the classifier achieves good accuracy. For instance, a classifier that always predicts the marginal distribution of the target class is calibrated, but will not be very useful in practice. Likewise, good predictive performance does not ensure calibration. In particular, for a broad class of loss functions, risk minimization leads to asymptotically Bayes optimal classifiers (Bartlett et al., 2006). However, there is no guarantee that they are calibrated, even in the asymptotic limit. Therefore, we consider minimizing the risk plus a term that penalizes miscalibration, i.e., $\mathrm{Risk} + \lambda \cdot \mathrm{CalibrationError}$. For parameter values $\lambda > 0$, this will push the classifier towards a calibrated model, while maintaining similar accuracy. The existence of such a $\lambda > 0$ is suggested by the fact that there always exists at least one Bayes optimal classifier that is calibrated, namely $P(y \mid x)$. To optimize the risk and the calibration error jointly, we propose a differentiable and consistent estimator of the expected $L_p$ calibration error based on kernel density estimates (KDEs).
In particular, we use a Beta kernel in binary classification tasks and a Dirichlet kernel in the multiclass setting, as these kernels are the natural choices for density estimation over a probability simplex. Our Dirichlet kernel based estimator allows for the estimation of canonical calibration, which is the strongest notion of multiclass calibration as it implies the calibration of the whole probability vector (Bröcker, 2009; Appice et al., 2015; Vaicenavicius et al., 2019). By contrast, most other state-of-the-art methods only achieve weaker versions of multiclass calibration, namely top-label (Guo et al., 2017) and marginal or class-wise calibration (Kull et al., 2019). Top-label calibration only considers the scores for the predicted class, while for marginal calibration the multiclass problem is split up into $K$ one-vs-all binary ones, each of which is required to be calibrated according to the definition of binary calibration. In many applications marginal and canonical calibration are preferable to top-label calibration, since we often care about having reliable uncertainty estimates for more than just one class per prediction. For instance, in medical diagnosis we do not just care about the most likely disease a certain patient might have, but also about the probabilities of other diseases.

Our contributions can be summarized as follows:

1. We develop a trainable calibration error objective using Dirichlet kernel density estimates, which can be minimized alongside any loss function in the existing batch stochastic gradient descent framework.
2. We propose to use our estimator to evaluate canonical calibration. Due to the scaling properties of Dirichlet kernel density estimation, and the tendency for probabilities to be concentrated in a relatively small number of classes, this becomes feasible in cases that cannot be estimated using a binned estimator.
3. We show on a variety of network architectures and two datasets that DNNs trained alongside an estimator of the calibration error achieve competitive results both on existing metrics and on the proposed measure of canonical calibration.

2 RELATED WORK

Calibration of probabilistic predictors has long been studied in many fields. The topic gained attention in the deep learning community following the observation in Guo et al. (2017) that modern neural networks are poorly calibrated and tend to give overconfident predictions due to overfitting on the NLL loss. The surge of interest resulted in many calibration strategies that can be split into two general categories, which we discuss subsequently.

**Post-hoc calibration strategies** learn a calibration map of the predictions from a trained predictor in a post-hoc manner. For instance, Platt scaling (Platt, 1999) fits a logistic regression model on top of the logit outputs of the model. A special case of Platt scaling that fits a single scalar, called temperature, has been popularized by Guo et al. (2017) as an accuracy-preserving, easy to implement and effective method to improve calibration. However, it has the undesired consequence that it clamps the high confidence scores of accurate predictions (Kumar et al., 2018). Other approaches for post-hoc calibration include histogram binning (Zadrozny & Elkan, 2001), isotonic regression (Zadrozny & Elkan, 2002), and Bayesian binning into quantiles (Naeini & Cooper, 2015).

**Trainable calibration strategies** integrate a differentiable calibration measure into the training objective.
One of the earliest approaches is regularization by penalizing low entropy predictions (Pereyra et al., 2017). Similarly to temperature scaling, it has been shown that entropy regularization needlessly suppresses high confidence scores of correct predictions (Kumar et al., 2018). Another popular strategy is MMCE (Maximum Mean Calibration Error) (Kumar et al., 2018), where the entropy regularizer is replaced by a kernel-based surrogate for the calibration error that can be optimized alongside NLL. It has been shown that label smoothing (Szegedy et al., 2015; Müller et al., 2020), i.e. training models with a weighted mixture of the labels instead of one-hot vectors, also improves model calibration. Liang et al. (2020) propose to add the difference between predicted confidence and accuracy as an auxiliary term to the cross-entropy loss. Focal loss (Mukhoti et al., 2020; Lin et al., 2018) has recently been empirically shown to produce better calibrated models than many of the alternatives, but does not estimate a clear quantity related to calibration error.

**Kernel density estimation** (Parzen, 1962; Rosenblatt, 1956) is a non-parametric method to estimate a probability density function from a finite sample. Zhang et al. (2020) propose a KDE-based estimator of the calibration error for measuring calibration performance. However, they use the triweight kernel, which has a limited support interval and is therefore applicable to binary classification, but does not have a natural extension to higher dimensional simplexes, in contrast to the Dirichlet kernel that we consider here. As a result, they consider an unnatural proxy to the marginal calibration error, which does not result in a consistent estimator.

3 METHODS

The most commonly used loss functions are designed to achieve consistency in the sense of Bayes optimality under risk minimization; however, they do not guarantee calibration, neither for finite samples nor in the asymptotic limit. Since we are interested in models $f$ that are both accurate and calibrated, we consider the following optimization problem bounding the calibration error $\mathrm{CE}(f)$:

$$f = \arg\min_{f \in \mathcal{F}} \mathrm{Risk}(f), \quad \text{s.t. } \mathrm{CE}(f) \leq B \qquad (1)$$

for some $B > 0$, and its associated Lagrangian

$$f = \arg\min_{f \in \mathcal{F}} \big( \mathrm{Risk}(f) + \lambda \cdot \mathrm{CE}(f) \big). \qquad (2)$$

We measure the (mis-)calibration in terms of the $L_p$ calibration error. To this end, let $(\Omega, \mathcal{A}, P)$ be a probability space, let $\mathcal{X} = \mathbb{R}^d$ and $\mathcal{Y} = \{0, 1, \dots, K\}$. Let $x: \Omega \to \mathcal{X}$ and $y: \Omega \to \mathcal{Y}$ be random variables, while realizations are denoted with subscripts. Furthermore, let $f: \mathcal{X} \to \triangle^K$ be a decision function, where $\triangle^K$ denotes the $K$ dimensional simplex, as is obtained e.g. from the output of a final softmax layer in a neural network.

**Definition 3.1** (Calibration error (Naeini et al., 2015; Kumar et al., 2019; Wenger et al., 2020)). The $L_p$ calibration error of $f$ is:

$$\mathrm{CE}_p(f) = \Big( \mathbb{E}\big[ \big\| \mathbb{E}[y \mid f(x)] - f(x) \big\|_p^p \big] \Big)^{1/p}. \qquad (3)$$

We note that we consider multiclass calibration, and that $f(x)$ and the conditional expectation in Equation 3 therefore map to points on a probability simplex. We say that a classifier $f$ is perfectly calibrated if $\mathrm{CE}_p(f) = 0$. Kumar et al. (2018) have also considered a minimization problem similar to Equation 2. Instead of using $\mathrm{CE}_p$, they use a metric called maximum mean calibration error (MMCE) that is 0 if and only if $\mathrm{CE}_p = 0$. However, it is unclear how MMCE relates to the canonical multiclass setting or to the norm parameter $p$ for non-zero $\mathrm{CE}_p$.
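To fix ideas before developing the estimator, the following minimal PyTorch-style sketch (our own illustration, not from the paper; all names are ours) shows how the Lagrangian objective in Equation 2 would be evaluated on a mini-batch, with `cal_error_fn` standing in for the differentiable calibration-error estimator derived below:

```python
import torch
import torch.nn.functional as F

def calibration_regularized_loss(logits, targets, lam, cal_error_fn):
    """Empirical version of Eq. (2): Risk(f) + lambda * CE(f).

    cal_error_fn: any differentiable calibration-error estimator,
    e.g. the Dirichlet-KDE estimator sketched later in this section.
    """
    probs = torch.softmax(logits, dim=-1)    # f(x), a point on the simplex
    risk = F.cross_entropy(logits, targets)  # NLL as the risk term
    return risk + lam * cal_error_fn(probs, targets)
```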
To optimize Definition 3.1 directly, we need to perform density estimation over the probability simplex so as to empirically compute the conditional expectation. In a binary setting, this has traditionally been done with binned estimates (Naeini et al., 2015; Guo et al., 2017; Kumar et al., 2019). However, this is not differentiable w.r.t. the function $f$, and cannot be incorporated into a gradient based training procedure. Furthermore, binned estimates suffer from the curse of dimensionality and do not have a practical extension to multiclass settings. A natural choice for a differentiable kernel density estimator in the binary case is a kernel based on the Beta distribution, and the extension to the multiclass case is given by the Dirichlet distribution. Hence, we consider an estimator for $\mathrm{CE}_p$ based on Beta and Dirichlet kernel density estimates in the binary and multiclass setting, respectively. We require that this estimator is consistent and differentiable such that we can train it according to Equation 2. This estimator is given by:

$$\widehat{\mathrm{CE}}_p(f)^p = \frac{1}{n} \sum_{h=1}^{n} \Big\| \widehat{\mathbb{E}}[y \mid f(x)]\big|_{f(x_h)} - f(x_h) \Big\|_p^p \qquad (4)$$

where $\widehat{\mathbb{E}}[y \mid f(x)]\big|_{f(x_h)}$ denotes $\widehat{\mathbb{E}}[y \mid f(x)]$ evaluated at $f(x) = f(x_h)$. If $P_{x,y}$ has a probability density $p_{x,y}$ with respect to the product of the Lebesgue and counting measure, we can define $p_{x,y}(x_i, y_i) = p_{y|x=x_i}(y_i)\, p_x(x_i)$. Then we define the estimator of the conditional expectation as follows:

$$\mathbb{E}[y \mid f(x)] = \sum_{y_k \in \mathcal{Y}} y_k\, p_{y|x=f(x)}(y_k) = \frac{\sum_{y_k \in \mathcal{Y}} y_k\, p_{x,y}(f(x), y_k)}{p_x(f(x))} \qquad (5)$$

$$\approx \frac{\sum_{i=1}^{n} k(f(x); f(x_i))\, y_i}{\sum_{i=1}^{n} k(f(x); f(x_i))} =: \widehat{\mathbb{E}}[y \mid f(x)] \qquad (6)$$

where $k$ is the kernel of a kernel density estimate evaluated at point $x_i$.

**Proposition 3.2.** $\widehat{\mathbb{E}}[y \mid f(x)]$ is a pointwise consistent estimator of $\mathbb{E}[y \mid f(x)]$, that is:

$$\lim_{n\to\infty} \frac{\sum_{i=1}^{n} k(f(x); f(x_i))\, y_i}{\sum_{i=1}^{n} k(f(x); f(x_i))} = \frac{\sum_{y_k \in \mathcal{Y}} y_k\, p_{x,y}(f(x), y_k)}{p_x(f(x))}. \qquad (7)$$

*Proof.* By the consistency of kernel density estimators (Silverman, 1986; Chen, 1999; Ouimet & Tolosana-Delgado, 2021), for all $f(x) \in (0, 1)$, $\frac{1}{n}\sum_{i=1}^{n} k(f(x); f(x_i))\, y_i \to \sum_{y_k \in \mathcal{Y}} y_k\, p_{x,y}(f(x), y_k)$ and $\frac{1}{n}\sum_{i=1}^{n} k(f(x); f(x_i)) \to p_x(f(x))$ as $n \to \infty$. The fact that the ratio of two convergent sequences converges to the ratio of their limits shows the result.
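As an illustration of Equation 6, a minimal NumPy sketch (our own rendering; the names and structure are assumptions, not the authors' released code) of the ratio estimator for a generic kernel $k(\cdot;\cdot)$ is:

```python
import numpy as np

def conditional_expectation_kde(scores, labels, kernel):
    """Estimate E[y | f(x)] at each score f(x_h) via Eq. (6).

    scores: (n, K) predicted probability vectors f(x_i)
    labels: (n, K) one-hot labels y_i
    kernel: kernel(z, zi) -> scalar, a density kernel on the simplex
    """
    n = scores.shape[0]
    # Pairwise kernel weights k(f(x_h); f(x_i)); O(n^2), as noted in the text
    W = np.array([[kernel(scores[h], scores[i]) for i in range(n)]
                  for h in range(n)])
    num = W @ labels                       # sum_i k(.;.) * y_i
    den = W.sum(axis=1, keepdims=True)     # sum_i k(.;.)
    return num / den                       # hat{E}[y | f(x_h)], row h
```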
**Mean squared error in binary classification** As a first instantiation of our framework we consider a binary classification setting, with the mean squared error $\mathrm{MSE}(f) = \mathbb{E}[(f(x) - y)^2]$ as the risk function, jointly optimized with the $L_2$ calibration error $\mathrm{CE}_2$. Following Murphy (1973); Degroot & Fienberg (1983); Kuleshov & Liang (2015); Nguyen & O'Connor (2015), we decompose (full derivation in Appendix A) the MSE as:

$$\mathrm{MSE}(f) - \mathrm{CE}_2(f)^2 = \mathbb{E}\big[ \big(1 - \mathbb{E}[y \mid f(x)]\big)\, \mathbb{E}[y \mid f(x)] \big] \geq 0. \qquad (8)$$

Similar to Equation 2, we consider the optimization problem for some $\lambda > 0$:

$$f = \arg\min_{f \in \mathcal{F}} \big( \mathrm{MSE}(f) + \lambda\, \mathrm{CE}_2(f)^2 \big). \qquad (9)$$

Using Equation 8 we rewrite:

$$\mathrm{MSE}(f) + \lambda\, \mathrm{CE}_2(f)^2 = (1 + \lambda)\, \mathrm{MSE}(f) - \lambda \big( \mathrm{MSE}(f) - \mathrm{CE}_2(f)^2 \big) \qquad (10)$$

$$= (1 + \lambda)\, \mathrm{MSE}(f) - \lambda\, \mathbb{E}\big[ \big(1 - \mathbb{E}[y \mid f(x)]\big)\, \mathbb{E}[y \mid f(x)] \big]. \qquad (11)$$

Rescaling Equation 11 by a factor of $(1 + \lambda)^{-1}$ and applying the variable substitution $\gamma = \frac{\lambda}{1 + \lambda} \in [0, 1)$ yields

$$f = \arg\min_{f \in \mathcal{F}} \big( \mathrm{MSE}(f) + \lambda\, \mathrm{CE}_2(f)^2 \big) = \arg\min_{f \in \mathcal{F}} \Big( \mathrm{MSE}(f) - \gamma\, \mathbb{E}\big[ \big(1 - \mathbb{E}[y \mid f(x)]\big)\, \mathbb{E}[y \mid f(x)] \big] \Big) \qquad (12)$$

$$= \arg\min_{f \in \mathcal{F}} \Big( \mathrm{MSE}(f) + \gamma\, \mathbb{E}\big[ \mathbb{E}[y \mid f(x)]^2 \big] \Big). \qquad (13)$$

For optimization we wish to find an estimator for $\mathbb{E}\big[\mathbb{E}[y \mid f(x)]^2\big]$. Building upon Equation 6, a partially debiased estimator can be written as:¹

$$\mathbb{E}\big[ \mathbb{E}[y \mid f(x)]^2 \big] \approx \frac{1}{n} \sum_{h=1}^{n} \frac{ \Big( \sum_{i \neq h} k(f(x_h); f(x_i))\, y_i \Big)^2 - \sum_{i \neq h} \big( k(f(x_h); f(x_i))\, y_i \big)^2 }{ \Big( \sum_{i \neq h} k(f(x_h); f(x_i)) \Big)^2 - \sum_{i \neq h} \big( k(f(x_h); f(x_i)) \big)^2 }. \qquad (14)$$

In a binary setting, the kernels $k(\cdot, \cdot)$ are Beta distributions, i.e., denoting $z_i := f(x_i)$ for short:

$$k_{\mathrm{Beta}}(z, z_i) := z^{\alpha_i - 1} (1 - z)^{\beta_i - 1}\, \frac{\Gamma(\alpha_i + \beta_i)}{\Gamma(\alpha_i)\, \Gamma(\beta_i)}, \qquad (15)$$

with $\alpha_i = \frac{z_i}{h} + 1$ and $\beta_i = \frac{1 - z_i}{h} + 1$ (Chen, 1999; Bouezmarni & Rolin, 2003; Zhang & Karunamuni, 2010), where $h$ is a bandwidth parameter in the kernel density estimate that goes to 0 as $n \to \infty$. We note that the computational complexity of this estimator is $O(n^2)$. Within the gradient descent training procedure, the density is estimated using a mini-batch, and therefore the $O(n^2)$ complexity is w.r.t. a mini-batch, not the entire dataset. The estimator in Equation 14 is a ratio of two second order U-statistics that converge as $n^{-1/2}$ (Ferguson, 2005). Therefore, the overall convergence will be $n^{-1/2}$. Empirical convergence rates are calculated in Appendix D.3 and shown to be close to the theoretically expected value.

¹We have debiased the numerator and denominator individually (Ferguson, 2005, Section 2), but for simplicity have not corrected for the fact that we are estimating a ratio (Scott & Wu, 1981).
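For concreteness, a NumPy/SciPy sketch (our own illustration under the definitions above; the boundary clamp `1e-12` is an implementation assumption) of the Beta kernel in Equation 15 and of the leave-one-out estimator in Equation 14:

```python
import numpy as np
from scipy.special import gammaln

def beta_kernel(z, zi, h):
    """Beta kernel of Eq. (15): alpha_i = z_i/h + 1, beta_i = (1 - z_i)/h + 1."""
    z = np.clip(z, 1e-12, 1.0 - 1e-12)       # guard log(0) at the boundary
    a, b = zi / h + 1.0, (1.0 - zi) / h + 1.0
    log_norm = gammaln(a + b) - gammaln(a) - gammaln(b)
    return np.exp(log_norm + (a - 1.0) * np.log(z) + (b - 1.0) * np.log(1.0 - z))

def debiased_second_moment(scores, labels, h):
    """Partially debiased estimate of E[E[y | f(x)]^2], following Eq. (14).

    scores: (n,) binary scores f(x_i); labels: (n,) labels y_i in {0, 1}.
    """
    n = len(scores)
    total = 0.0
    for j in range(n):
        mask = np.arange(n) != j             # leave-one-out sums over i != j
        w = beta_kernel(scores[j], scores[mask], h)
        wy = w * labels[mask]
        num = wy.sum() ** 2 - np.sum(wy ** 2)
        den = w.sum() ** 2 - np.sum(w ** 2)
        total += num / den
    return total / n
```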
**Multiclass calibration with Dirichlet kernel density estimates** There are multiple definitions of multiclass calibration that differ in their strictness regarding the calibration of the probability vector $f(x)$. The weakest notion is top-label calibration, which, as the name suggests, only cares about calibrating the entry with the highest predicted probability, which reduces to a binary calibration problem again (Guo et al., 2017). Marginal or class-wise calibration (Kull et al., 2019) is the most commonly used definition of multiclass calibration and a stronger version of top-label calibration. Here, the problem is split into $K$ one-vs-all binary calibration settings, such that each class has to be calibrated against the other $K - 1$ classes:

$$\mathrm{MCE}_p(f)^p = \sum_{k=1}^{K} \mathbb{E}\Big[ \big( \mathbb{E}[y = k \mid f(x)_k] - f(x)_k \big)^p \Big]. \qquad (16)$$

An estimator for this calibration error is:

$$\widehat{\mathrm{MCE}}_p(f)^p = \frac{1}{n} \sum_{j=1}^{n} \sum_{k=1}^{K} \bigg( \frac{ \sum_{i \neq j} k_{\mathrm{Beta}}(f(x_j)_k; f(x_i)_k)\, [y_i]_k }{ \sum_{i \neq j} k_{\mathrm{Beta}}(f(x_j)_k; f(x_i)_k) } - f(x_j)_k \bigg)^p. \qquad (17)$$

The strongest notion of multiclass calibration, and the one that we want to consider in this paper, is called canonical calibration (Bröcker, 2009; Appice et al., 2015; Vaicenavicius et al., 2019). Here it is required that the whole probability vector $f(x)$ is calibrated. The definition is exactly the one from Definition 3.1. Its estimator is:

$$\widehat{\mathrm{CE}}_p(f)^p = \frac{1}{n} \sum_{j=1}^{n} \bigg\| \frac{ \sum_{i \neq j} k_{\mathrm{Dir}}(f(x_j); f(x_i))\, y_i }{ \sum_{i \neq j} k_{\mathrm{Dir}}(f(x_j); f(x_i)) } - f(x_j) \bigg\|_p^p \qquad (18)$$

where $k_{\mathrm{Dir}}$ is a Dirichlet kernel defined as:

$$k_{\mathrm{Dir}}(z, z_i) := \frac{\Gamma\big( \sum_{j=1}^{K} \alpha_{ij} \big)}{\prod_{j=1}^{K} \Gamma(\alpha_{ij})} \prod_{j=1}^{K} z_j^{\alpha_{ij} - 1} \qquad (19)$$

with $\alpha_i = z_i / h + 1$ (Ouimet & Tolosana-Delgado, 2021). As before, the computational complexity is $O(n^2)$ irrespective of $p$. This estimator is differentiable and, furthermore, the following proposition holds:

**Proposition 3.3.** The Dirichlet kernel based CE estimator is consistent, that is

$$\lim_{n\to\infty} \frac{1}{n} \sum_{j=1}^{n} \bigg\| \frac{ \sum_{i \neq j} k_{\mathrm{Dir}}(f(x_j); f(x_i))\, y_i }{ \sum_{i \neq j} k_{\mathrm{Dir}}(f(x_j); f(x_i)) } - f(x_j) \bigg\|_p^p = \mathbb{E}\big[ \big\| \mathbb{E}[y \mid f(x)] - f(x) \big\|_p^p \big]. \qquad (20)$$

*Proof.* Dirichlet kernel estimators are consistent (Ouimet & Tolosana-Delgado, 2021); consequently, by Proposition 3.2 the term inside the norm is consistent for any fixed $f(x_j)$ (note that summing over $i \neq j$ ensures that the ratio of the KDEs does not depend on the outer summation). Moreover, for any convergent sequence also the norm of that sequence converges to the norm of its limit. Ultimately, the outer sum is merely the sample mean of consistent summands, which again is consistent.
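A minimal differentiable sketch of Equations 18 and 19 (our own PyTorch rendering; the paper does not prescribe a framework, and the clamping constant is an implementation assumption):

```python
import torch

def log_dirichlet_kernel(z, zi, h):
    """log k_Dir(z; z_i) from Eq. (19), with alpha = z_i / h + 1."""
    z = z.clamp_min(1e-8)                    # guard log(0) at the simplex boundary
    alpha = zi / h + 1.0
    log_norm = torch.lgamma(alpha.sum(-1)) - torch.lgamma(alpha).sum(-1)
    return log_norm + ((alpha - 1.0) * z.log()).sum(-1)

def canonical_ce_kde(probs, onehot, h, p=2):
    """Leave-one-out Dirichlet-KDE estimate of CE_p(f)^p, Eq. (18).

    probs:  (n, K) softmax outputs f(x_i), rows on the simplex
    onehot: (n, K) one-hot labels y_i (float tensor)
    """
    n = probs.shape[0]
    # logK[j, i] = log k_Dir(f(x_j); f(x_i)) via broadcasting
    logK = log_dirichlet_kernel(probs.unsqueeze(1), probs.unsqueeze(0), h)
    eye = torch.eye(n, dtype=torch.bool, device=probs.device)
    K = logK.exp().masked_fill(eye, 0.0)     # drop i == j terms
    cond = (K @ onehot) / K.sum(dim=1, keepdim=True)   # hat{E}[y | f(x_j)]
    return ((cond - probs).abs() ** p).sum(dim=-1).mean()
```

During training this would be evaluated per mini-batch, so the $O(n^2)$ pairwise kernel matrix is over the batch size rather than the whole dataset, and the returned value can be added to the risk term as in Equation 2.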
4 EMPIRICAL SETUP

We trained ResNet (He et al., 2015), ResNet with stochastic depth (SD) (Huang et al., 2016), DenseNet (Huang et al., 2018) and WideResNet (Zagoruyko & Komodakis, 2016) networks on CIFAR-10 and CIFAR-100 (Krizhevsky, 2009). We use 45000 images for training. The code will be released upon acceptance.

**Baselines** *Cross-entropy:* The first baseline model is trained using cross-entropy with the data preprocessing, training procedure and hyperparameters described in the corresponding paper for the architecture. *Trainable calibration strategies:* MMCE (Kumar et al., 2018) is a differentiable measure of calibration with the property that it is minimized at perfect calibration. It is used as a regularizer alongside NLL, with the strength of regularization parameterized by $\lambda$. Focal loss (Mukhoti et al., 2020) is an alternative to the popular cross-entropy loss, defined as $L_f = -(1 - f(y|x))^{\gamma} \log f(y|x)$, where $\gamma$ is a hyperparameter and $f(y|x)$ is the probability score that a neural network $f$ outputs for a class $y$ on an input $x$. Their best-performing approach is the sample-dependent FL-53, where $\gamma = 5$ for $f(y|x) \in [0, 0.2)$ and $\gamma = 3$ otherwise, followed by the method with fixed $\gamma = 3$. *Post-hoc calibration strategies:* Guo et al. (2017) investigated the performance of several post-hoc calibration methods and found temperature scaling to be a strong baseline, which we use as a representative of this group. It works by scaling the logits with a scalar $T > 0$, typically learned on a validation set by minimizing NLL. Following Kumar et al. (2018); Mukhoti et al. (2020), we also use temperature scaling as a post-processing step for our method.

**Metrics** The most widely-used metric for expected calibration error (ECE) is a binned estimator (Naeini et al., 2015), which divides the interval $[0, 1]$ into bins of equal width and then calculates a weighted average of the absolute difference between accuracy and confidence for each bin. A better binning scheme involves determining the bin sizes so that an equal number of samples fall into each bin (Nguyen & O'Connor, 2015; Mukhoti et al., 2020). We report the ECE (%) with 15 bins calculated according to the latter, so-called adaptive binning procedure. We compute the 95% confidence intervals using 100 bootstrap samples as in Kumar et al. (2019). We consider multiple versions of the ECE metric based on the $L_p$ norm and the type of calibration (top-label, marginal, canonical). Top-label calibration error only considers the probability of the predicted class, marginal requires per-class calibration, and canonical is the strongest form of calibration, which requires the entire probability vector to be calibrated. We report $L_1$ and $L_2$ ECE in the marginal and canonical case. Additional experiments with top-label and marginal calibration on both CIFAR-10 and CIFAR-100 can be found in Appendix B.

**Hyperparameters** A crucial parameter for KDE is the bandwidth, a positive number that defines the smoothness of the density estimate. A poorly chosen bandwidth may lead to undersmoothing (small bandwidth) or oversmoothing (large bandwidth). A commonly used non-parametric bandwidth selector is maximum likelihood cross validation (Duin, 1976). For our experiments we choose the bandwidth from a list of possible values by maximizing the leave-one-out likelihood. The $\lambda$ parameter for weighting the calibration error w.r.t. the loss is typically chosen via cross-validation or using a holdout validation set. The $p$ parameter is chosen depending on the desired $L_p$ calibration error and the corresponding theoretical guarantees.
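To make the bandwidth selection concrete, a small sketch of maximum likelihood cross validation over a candidate list (our own illustration; the candidate grid and the guard constant are assumptions, and `kernel` can be e.g. the `beta_kernel` sketched above):

```python
import numpy as np

def loo_log_likelihood(scores, kernel, h):
    """Leave-one-out log-likelihood of a KDE with bandwidth h."""
    n = len(scores)
    ll = 0.0
    for j in range(n):
        mask = np.arange(n) != j
        dens = kernel(scores[j], scores[mask], h).mean()  # LOO density at scores[j]
        ll += np.log(max(dens, 1e-300))                   # guard against log(0)
    return ll

def select_bandwidth(scores, kernel, candidates=(1e-4, 1e-3, 1e-2, 1e-1)):
    """Pick the bandwidth maximizing the leave-one-out likelihood (Duin, 1976)."""
    return max(candidates, key=lambda h: loo_log_likelihood(scores, kernel, h))
```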
5 RESULTS AND DISCUSSION

5.1 BINARY CLASSIFICATION

We construct a binary experiment by splitting the CIFAR-10 classes into 2 classes: vehicles (plane, automobile, ship, truck) and animals (bird, cat, deer, dog, frog, horse). Figure 1a shows how the choice of the bandwidth parameter influences the shape of the estimate.

[Figure 1: Calibration regularized training using MSE loss and CE₂. (a) Effect of the bandwidth b: KDE fits for b = 0.001, 0.01, 0.1 against a histogram from samples. (b) Effect of γ: KDE-MSE models against the MSE baseline.]

Figure 1b shows the effect of the regularization parameter $\gamma$ on the performance of a ResNet-110 model. The orange point represents a model trained with MSE loss, and the blue points (KDE-MSE) correspond to models trained with MSE loss regularized by an $L_2$ calibration error for different values of $\gamma$. As expected, the calibration regularized training decreases the $L_2$ calibration error at the cost of a slightly increased error.

5.2 EVALUATING CANONICAL CALIBRATION

Accurately evaluating the calibration error is another crucial step towards designing trustworthy models that can be used in high-cost settings. In spite of its numerous flaws discussed in Vaicenavicius et al. (2019); Ding et al. (2020); Ashukha et al. (2021), such as its sensitivity to the binning scheme, the histogram-based estimator remains the most widely used metric for evaluating miscalibration. Another downside of the binned estimator is its inability to capture canonical calibration due to the curse of dimensionality, as the number of bins grows exponentially with the number of classes.

Therefore, because of its favourable scaling properties, we propose using our Dirichlet kernel density estimate as an alternative metric (KDE-ECE) to measure calibration. To investigate its relationship with the commonly used binned estimator, we first introduce an extension of the top-label binned estimator to the probability simplex in the three class setting. We start by partitioning the probability simplex into equally-sized, triangle-shaped bins and assign the probability scores to the corresponding bin, as shown in Figure 2a. Then, we define the binned estimate of the canonical calibration error as follows:

$$\mathrm{CE}_p(f)^p \approx \mathbb{E}\big[ \| H(f(x)) - f(x) \|_p^p \big] \approx \frac{1}{n} \sum_{i=1}^{n} \| H(f(x_i)) - f(x_i) \|_p^p \qquad (21)$$

where $H(f(x_i))$ is the histogram estimate, shown in Figure 2b. The surface of the corresponding Dirichlet KDE is presented in Figure 2c. In Figure 3 we show that the KDE-ECE estimates of the three types of calibration closely correspond to their histogram-based approximations. Each point in the plot represents a ResNet-56 model trained on a different subset of three classes from CIFAR-10. See Appendix C for another example of the binned estimator and Dirichlet KDE on CIFAR-10 and an experiment with a varying number of points used for the density estimation.

[Figure 2: Extension of the binned estimator to the probability simplex, compared with the KDE-ECE: (a) splitting the simplex into 16 bins, (b) histogram, (c) Dirichlet KDE. The KDE-ECE achieves a better approximation to the finite sample, and accurately models the fact that samples tend to be concentrated near low dimensional faces of the simplex.]

[Figure 3: Relationship between the KDE-ECE estimates and their corresponding binned approximations on the three types of calibration: (a) canonical, (b) marginal, (c) top-label. Each point represents a ResNet-56 model trained on a subset of three classes from CIFAR-10. The 3000 probability scores of the test set are assigned to 25 bins with adaptive width for the binned estimate. A bandwidth of 0.001 is used for KDE-ECE.]
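For reference, a simplified sketch of the binned estimate in Equation 21 (our own illustration; unlike the triangular partition used in the paper, it bins on a rectangular grid over the first $K-1$ simplex coordinates, which is easier to write down but serves the same purpose):

```python
import numpy as np

def binned_canonical_ce(probs, onehot, m=5, p=1):
    """Histogram estimate of Eq. (21) with a grid of width 1/m per coordinate.

    probs: (n, K) probability vectors; onehot: (n, K) one-hot labels.
    Note: rectangular bins here, not the paper's triangular simplex split.
    """
    keys = [tuple(k) for k in np.floor(probs[:, :-1] * m).astype(int)]
    bins = {}
    for key, y in zip(keys, onehot):
        bins.setdefault(key, []).append(y)
    H = {key: np.mean(ys, axis=0) for key, ys in bins.items()}  # E[y | bin]
    err = [np.sum(np.abs(H[key] - f) ** p) for key, f in zip(keys, probs)]
    return np.mean(err)
```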
5.3 MULTICLASS CLASSIFICATION

In this section we evaluate our proposed KDE-based ECE estimator, jointly trained with cross entropy loss (KDE-CRE), against other baselines in a multiclass setting on CIFAR-10 and CIFAR-100. We found that for KDE-CRE, values of $\lambda \in [0.01, 0.1]$ provide a good trade-off in terms of accuracy and calibration error. Table 1 summarizes the accuracy and marginal $L_1$ ECE (%) (computed using 15 bins), measured across multiple architectures. For MMCE, we report the results with $\lambda = 1$ and for KDE-CRE we use $\lambda = 0.01$. An analogous table measuring marginal $L_2$ ECE is given in Appendix B.

Table 1: Accuracy and marginal $L_1$ ECE (%) computed with 15 bins for different loss functions and architectures, both trained from scratch (Pre T) and after temperature scaling on a validation set (Post T). Best results are marked in bold. C10 = CIFAR-10, C100 = CIFAR-100.

| Loss | Metric | T | C10 ResNet | C10 ResNet (SD) | C10 Wide-ResNet | C10 DenseNet | C100 ResNet | C100 ResNet (SD) | C100 Wide-ResNet | C100 DenseNet |
|---|---|---|---|---|---|---|---|---|---|---|
| CRE | ECE | Pre T | 0.419 | 0.357 | **0.241** | 0.236 | 0.129 | 0.100 | **0.086** | **0.090** |
| CRE | ECE | Post T | 0.282 | 0.250 | 0.278 | **0.165** | 0.114 | **0.089** | **0.105** | **0.078** |
| CRE | Acc | Pre T | 0.925 | **0.926** | **0.957** | 0.947 | **0.700** | **0.728** | **0.803** | 0.756 |
| CRE | Acc | Post T | **0.927** | 0.925 | **0.957** | 0.947 | **0.700** | **0.729** | **0.801** | 0.758 |
| MMCE | ECE | Pre T | **0.250** | 0.390 | 0.265 | **0.193** | 0.143 | 0.100 | 0.120 | 0.123 |
| MMCE | ECE | Post T | 0.361 | 0.308 | 0.291 | 0.235 | 0.121 | 0.093 | 0.109 | 0.124 |
| MMCE | Acc | Pre T | **0.929** | 0.925 | 0.947 | 0.944 | 0.693 | 0.723 | 0.767 | 0.748 |
| MMCE | Acc | Post T | 0.926 | **0.926** | 0.949 | 0.945 | 0.691 | 0.722 | 0.770 | 0.743 |
| FL-53 | ECE | Pre T | 0.403 | 0.416 | 0.414 | 0.259 | 0.145 | 0.120 | 0.125 | 0.095 |
| FL-53 | ECE | Post T | 0.272 | 0.267 | 0.437 | 0.220 | 0.124 | 0.107 | 0.106 | 0.081 |
| FL-53 | Acc | Pre T | 0.922 | 0.920 | 0.936 | **0.948** | 0.695 | 0.711 | 0.760 | 0.752 |
| FL-53 | Acc | Post T | 0.923 | 0.919 | 0.936 | **0.949** | 0.693 | 0.712 | 0.763 | 0.753 |
| L1 KDE-CRE | ECE | Pre T | 0.363 | **0.338** | 0.289 | 0.296 | **0.128** | **0.096** | 0.092 | 0.099 |
| L1 KDE-CRE | ECE | Post T | **0.182** | **0.220** | **0.226** | 0.248 | **0.104** | 0.095 | 0.108 | 0.085 |
| L1 KDE-CRE | Acc | Pre T | 0.926 | 0.925 | 0.953 | 0.943 | 0.697 | 0.725 | 0.796 | **0.757** |
| L1 KDE-CRE | Acc | Post T | **0.927** | 0.925 | 0.953 | 0.944 | 0.698 | 0.720 | 0.793 | **0.759** |

We notice that both pre and post temperature scaling, KDE-CRE achieves very competitive ECE scores. Another encouraging observation is that the improvement in calibration error comes at almost no cost in accuracy. An important advantage of our KDE-based method is the ability to directly train and evaluate canonical calibration. In Figure 4 we show a scatter plot with confidence intervals of the $L_1$ and $L_2$ KDE-CRE models for canonical calibration and the other baselines on CIFAR-10. We measure the canonical calibration using our KDE-ECE metric from Section 5.2. In three of the architectures, both $L_1$ and $L_2$ KDE-CRE either dominate or are statistically tied with cross-entropy (CRE). Similarly, Figure 5 shows a scatter plot of $L_1$ and $L_2$ KDE-CRE models trained to minimize the marginal calibration error. In this case, we measure $L_2$ marginal ECE with the standard binned estimator. In most cases, our methods Pareto dominate the other baselines. A general observation can be made, however, that the models trained with cross-entropy have a surprisingly low marginal calibration error, contrary to previous findings that show poor calibration when considering only the most confident prediction (top-label calibration). An additional experiment comparing the CRE baseline with KDE-CRE for canonical calibration on a benchmark dataset of histological images of human colorectal cancer is given in Appendix D.2, which clearly illustrates the superior performance of our method, both in terms of accuracy and calibration error, in this context.

To summarize, the experiments show that our estimator consistently produces calibration errors competitive with other state-of-the-art approaches, while maintaining accuracy and keeping the computational complexity at $O(n^2)$. We evaluate the computational overhead of CRE and KDE-CRE and summarize the results in a table in Appendix D.1, which shows that the added cost is less than a couple of percent. There are several limitations of the current work: a larger scale benchmarking would be beneficial for exploring the limits of canonical calibration using Dirichlet kernels. Furthermore, while we showed consistency of our estimator, we did not fully derive and implement its debiasing. Due to space constraints, this was not the focus of the paper and is left for future work.
6 CONCLUSION

In this paper, we proposed a consistent and differentiable estimator of the $L_p$ calibration error using Dirichlet kernels. The KDE-based estimate can be directly optimized alongside any loss function in the existing batch stochastic gradient descent framework. Furthermore, we propose using it as a measure of the strongest form of calibration, which requires the entire probability vector to be calibrated. We showed empirically on a range of neural architectures that the performance of our estimator in terms of accuracy and calibration error is competitive with the current state-of-the-art, while having superior properties as a consistent estimator of the canonical calibration error.

[Figure 4: Canonical calibration on CIFAR-10. Panels: (a) ResNet-110, (b) ResNet-110 (SD), (c) Wide-ResNet-28-10, (d) DenseNet-40; each panel plots calibration error against accuracy for CRE, FL, MMCE, L1 KDE-CRE and L2 KDE-CRE.]

[Figure 5: Marginal calibration on CIFAR-100. Panels: (a) ResNet-110, (b) ResNet-110 (SD), (c) Wide-ResNet-28-10, (d) DenseNet-40.]

REFERENCES
A. Appice, P. Rodrigues, V. S. Costa, C. Soares, João Gama, and A. Jorge. Novel decompositions of proper scoring rules for classification: Score adjustment as precursor to calibration. 2015.

Arsenii Ashukha, Alexander Lyzhov, Dmitry Molchanov, and Dmitry Vetrov. Pitfalls of in-domain uncertainty estimation and ensembling in deep learning, 2021.

Peter L. Bartlett, Michael I. Jordan, and Jon D. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138–156, 2006.

Taoufik Bouezmarni and Jean-Marie Rolin. Consistency of the beta kernel density function estimator. The Canadian Journal of Statistics / La Revue Canadienne de Statistique, 31(1):89–98, 2003.

Jochen Bröcker. Reliability, sufficiency, and the decomposition of proper scores. Quarterly Journal of the Royal Meteorological Society, 135(643):1512–1519, Jul 2009.

Song Xi Chen. Beta kernel estimators for density functions. Computational Statistics & Data Analysis, 31:131–145, 1999.

M. Degroot and S. Fienberg. The comparison and evaluation of forecasters. The Statistician, 32:12–22, 1983.

Yukun Ding, Jinglan Liu, Jinjun Xiong, and Yiyu Shi. Revisiting the evaluation of uncertainty estimation and its application to explore model complexity-uncertainty trade-off. arXiv:1903.02050, 2020.

Robert Duin. On the choice of smoothing parameters for Parzen estimators of probability density functions. IEEE Transactions on Computers, C-25(11):1175–1179, 1976.

Thomas S. Ferguson. U-statistics. In Notes for Statistics 200C. UCLA, 2005.

Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. On calibration of modern neural networks, 2017.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. arXiv:1512.03385, 2015.

Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Weinberger. Deep networks with stochastic depth. arXiv:1603.09382, 2016.

Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks, 2018.

Jakob Kather, Cleo-Aron Weis, Francesco Bianconi, Susanne Melchers, Lothar Schad, Timo Gaiser, Alexander Marx, and Frank Zöllner. Multi-class texture analysis in colorectal cancer histology. Scientific Reports, 6:27988, 06 2016. doi: 10.1038/srep27988.

Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.

Volodymyr Kuleshov and Percy S Liang. Calibrated structured prediction. In C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc., 2015.

Meelis Kull, Miquel Perello-Nieto, Markus Kängsepp, Telmo Silva Filho, Hao Song, and Peter Flach. Beyond temperature scaling: Obtaining well-calibrated multiclass probabilities with Dirichlet calibration. arXiv:1910.12656, 2019.

Ananya Kumar, Percy S Liang, and Tengyu Ma. Verified uncertainty calibration. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (eds.), Advances in Neural Information Processing Systems 32, pp. 3792–3803. 2019.

Aviral Kumar, Sunita Sarawagi, and Ujjwal Jain. Trainable calibration measures for neural networks from kernel mean embeddings. In ICML, 2018.

Gongbo Liang, Yu Zhang, Xiaoqin Wang, and Nathan Jacobs. Improved trainable calibration method for neural networks on medical imaging classification. In British Machine Vision Conference, 2020.

Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. arXiv:1708.02002, 2018.
Jishnu Mukhoti, Viveka Kulharia, Amartya Sanyal, Stuart Golodetz, Philip H. S. Torr, and Puneet K. Dokania. Calibrating deep neural networks using focal loss. arXiv:2002.09437, 2020.

A. Murphy. A new vector partition of the probability score. Journal of Applied Meteorology, 12:595–600, 1973.

Rafael Müller, Simon Kornblith, and Geoffrey Hinton. When does label smoothing help? arXiv:1906.02629, 2020.

Mahdi Pakdaman Naeini and Gregory F. Cooper. Binary classifier calibration using an ensemble of near isotonic regression models. arXiv:1511.05191, 2015.

Mahdi Pakdaman Naeini, Gregory F. Cooper, and Milos Hauskrecht. Obtaining well calibrated probabilities using Bayesian binning. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, pp. 2901–2907, 2015.

Khanh Nguyen and Brendan O'Connor. Posterior calibration and exploratory analysis for natural language processing models. arXiv:1508.05154, 2015.

Frédéric Ouimet and Raimon Tolosana-Delgado. Asymptotic properties of Dirichlet kernel density estimators. arXiv:2002.06956, 2021.

Emanuel Parzen. On estimation of a probability density function and mode. The Annals of Mathematical Statistics, 33(3):1065–1076, 1962.

Gabriel Pereyra, George Tucker, Jan Chorowski, Łukasz Kaiser, and Geoffrey Hinton. Regularizing neural networks by penalizing confident output distributions. arXiv:1701.06548, 2017.

John C. Platt. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In Advances in Large Margin Classifiers, pp. 61–74. MIT Press, 1999.

Murray Rosenblatt. Remarks on some nonparametric estimates of a density function. The Annals of Mathematical Statistics, 27(3):832–837, 1956.

Alastair Scott and Chien-Fu Wu. On the asymptotic distribution of ratio and regression estimators. Journal of the American Statistical Association, 76(373):98–102, 1981.

B. W. Silverman. Density Estimation for Statistics and Data Analysis. Chapman & Hall, 1986.

Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. arXiv:1512.00567, 2015.

Juozas Vaicenavicius, David Widmann, Carl Andersson, Fredrik Lindsten, Jacob Roll, and Thomas B. Schön. Evaluating model calibration in classification. arXiv:1902.06977, 2019.

Jonathan Wenger, Hedvig Kjellström, and Rudolph Triebel. Non-parametric calibration for classification. In International Conference on Artificial Intelligence and Statistics, pp. 178–190, 2020.

B. Zadrozny and C. Elkan. Transforming classifier scores into accurate multiclass probability estimates. Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, 2002.

Bianca Zadrozny and Charles Elkan. Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers. ICML, 1, 05 2001.

Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In British Machine Vision Conference, 2016.

Jize Zhang, Bhavya Kailkhura, and T. Yong-Jin Han. Mix-n-match: Ensemble and compositional methods for uncertainty calibration in deep learning. In International Conference on Machine Learning, 2020.

Shunpu Zhang and Rohana Karunamuni. Boundary performance of the beta kernel estimators. Journal of Nonparametric Statistics, 22:81–104, 01 2010.

A DERIVATION OF THE MSE DECOMPOSITION

**Definition A.1** (Mean Squared Error (MSE)). The mean squared error of an estimator is

$$\mathrm{MSE}(f) := \mathbb{E}[(f(x) - y)^2]. \qquad (22)$$

**Proposition A.2.** $\mathrm{MSE}(f) \geq \mathrm{CE}_2(f)^2$.
*Proof.* Decomposing the squared error around $\mathbb{E}[y \mid f(x)]$ gives

$$\mathrm{MSE}(f) = \mathbb{E}\Big[ \big( (f(x) - \mathbb{E}[y \mid f(x)]) + (\mathbb{E}[y \mid f(x)] - y) \big)^2 \Big] \qquad (23)$$

$$= \underbrace{\mathbb{E}\big[ (f(x) - \mathbb{E}[y \mid f(x)])^2 \big]}_{=\,\mathrm{CE}_2(f)^2} + \mathbb{E}\big[ (\mathbb{E}[y \mid f(x)] - y)^2 \big] + 2\, \mathbb{E}\big[ (f(x) - \mathbb{E}[y \mid f(x)]) (\mathbb{E}[y \mid f(x)] - y) \big]. \qquad (24)$$

By the law of total expectation, conditioning the cross term on $f(x)$ yields

$$\mathbb{E}\big[ (f(x) - \mathbb{E}[y \mid f(x)]) (\mathbb{E}[y \mid f(x)] - y) \big] = \mathbb{E}\big[ (f(x) - \mathbb{E}[y \mid f(x)]) (\mathbb{E}[y \mid f(x)] - \mathbb{E}[y \mid f(x)]) \big] = 0, \qquad (25)$$

since both $f(x)$ and $\mathbb{E}[y \mid f(x)]$ are measurable with respect to $f(x)$. For the remaining term, $y$ is conditionally a Bernoulli random variable with success probability $\mathbb{E}[y \mid f(x)]$ in the binary setting, so

$$\mathbb{E}\big[ (\mathbb{E}[y \mid f(x)] - y)^2 \big] = \mathbb{E}\big[ \mathrm{Var}(y \mid f(x)) \big] = \mathbb{E}\big[ (1 - \mathbb{E}[y \mid f(x)])\, \mathbb{E}[y \mid f(x)] \big] \geq 0, \qquad (26)$$

and therefore

$$\mathrm{MSE}(f) - \mathrm{CE}_2(f)^2 = \mathbb{E}\big[ (1 - \mathbb{E}[y \mid f(x)])\, \mathbb{E}[y \mid f(x)] \big] \geq 0. \qquad (27)$$

The expectation in Equation 27 is over variances of Bernoulli random variables with probabilities $\mathbb{E}[y \mid f(x)]$.
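As a quick numerical sanity check of the decomposition (our own illustration, on synthetic data where $\mathbb{E}[y \mid f(x)]$ is known by construction), the two sides of Equation 27 can be compared directly:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
f = rng.uniform(0.1, 0.9, n)                      # scores f(x)
eta = np.clip(f + 0.1 * np.sin(6 * f), 0, 1)      # E[y | f(x)], deliberately miscalibrated
y = rng.binomial(1, eta)                          # labels drawn from the true conditional

mse = np.mean((f - y) ** 2)                       # MSE(f)
ce2_sq = np.mean((eta - f) ** 2)                  # CE_2(f)^2, using the known eta
refinement = np.mean(eta * (1 - eta))             # E[(1 - E[y|f]) E[y|f]]

print(mse - ce2_sq, refinement)                   # agree up to sampling noise
```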
**CIFAR-10** **CIFAR-100** **Loss** **Metric** ResNet ResNet (SD) Wide-ResNet DenseNet ResNet ResNet (SD) Wide-ResNet DenseNet Pre T 0.020 0.009 0.007 0.008 0.002 0.002 0.001 0.001 ECE Post T (NLL) 0.007 0.005 0.008 0.004 0.002 0.001 0.001 0.001 **CRE** Pre T 0.925 0.926 0.950 0.947 0.700 0.728 0.797 0.756 Acc Post T (NLL) 0.927 0.925 0.950 0.947 0.700 0.729 0.794 0.758 Pre T 0.009 0.015 0.009 0.004 0.003 0.001 0.003 0.003 ECE Post T (NLL) 0.013 0.009 0.009 0.005 0.002 0.001 0.002 0.003 **MMCE** Pre T 0.929 0.925 0.947 0.944 0.693 0.723 0.767 0.748 Acc Post T (NLL) 0.926 0.926 0.949 0.945 0.691 0.722 0.770 0.743 Pre T 0.013 0.020 0.026 0.005 0.003 0.002 0.003 0.002 ECE Post T (NLL) 0.008 0.009 0.022 0.004 0.002 0.002 0.002 0.001 **FL-53** Pre T 0.922 0.920 0.936 0.948 0.695 0.711 0.760 0.752 Acc Post T (NLL) 0.923 0.919 0.936 0.949 0.693 0.712 0.763 0.753 Pre T 0.010 0.015 0.007 0.008 0.002 0.002 0.001 0.001 ECE Post T (NLL) 0.004 0.012 0.008 0.009 0.002 0.002 0.001 0.001 _L2 KDE-CRE_ Pre T 0.930 0.922 0.950 0.943 0.707 0.713 0.797 0.757 Acc Post T (NLL) 0.930 0.921 0.950 0.944 0.707 0.717 0.794 0.755 C RELATIONSHIP BETWEEN THE BINNED ESTIMATOR AND THE KERNEL DENSITY ESTIMATOR Figure 8 shows an example of the binned estimator in a three-class setting on CIFAR-10. The points are mostly concentrated at the edges of the histogram, as can be seen from Figure 8b. The surface of the corresponding Dirichlet KDE is given in 8c. Figure 9 shows the relationship between the binned estimator and our KDE-ECE metric. The points represent a trained Resnet-56 model on a subset of three classes from CIFAR-10. In every row, a differnt number of points was used to estimate the KDE-ECE. D EXPERIMENTS FOR REBUTTAL D.1 TRAINING TIME MEASUREMENTS In Table 3 we summarize the running time per epoch for training with (KDE-CRE) and without (CRE) regularization for the two datasets and four architectures. KDE-CRE does not create an overhead of more than a couple percent over the CRE baseline. D.2 CANONICAL CALIBRATION IN A MEDICAL APPLICATION An additional experiment with a medical application, where the canonical calibration is of particular interest, was performed on the publicly-available Kather dataset (Kather et al., 2016), which consists of 5000 histological images of human colorectal cancer. The data has eight different classes of tissue. Figure 10 shows a comparison in performance of the CRE baseline with our KDE-CRE method. The canonical L1 (left) and L2 (right) calibration is measured using our KDE-ECE metric. The results clearly illustrate that our method significantly outperforms the cross-entropy baseline, both in terms of accuracy and calibration error, for several choices of the regularization parameter. 
D.3 BIAS AND CONVERGENCE RATES Figure 11 shows a comparison of the groud truth, computed from 3000 test points with KDE-ECE against KDE-ECE and binned ECE estimated with a varying number of points used for the estima ----- Marginal calibration on CIFAR10 using Densenet Top-label calibration on CIFAR10 using Densenet Marginal calibration on CIFAR10 using Densenet Top-label calibration on CIFAR10 using Densenet 0.035 0.030 0.025 0.020 0.015 0.010 0.005 0.000 0.0175 0.0150 0.0125 0.0100 0.0075 0.0050 0.0025 0.0000 0.10 0.08 0.06 0.04 0.02 0.00 0.06 0.05 0.04 0.03 0.02 0.01 0.0030 0.0025 0.0020 0.0015 0.0010 0.0005 0.0000 0.0012 0.0010 0.0008 0.0006 0.0004 0.0002 0.0000 0.020 0.015 0.010 0.005 0.000 0.012 0.010 0.008 0.006 0.004 0.002 |6|0.3|4 0.20.3|MMCE L L1 2 K KD DE E- -C CR RE E 2| |---|---|---|---| |6|0.3|4 0.20.3 0.2|MMCE L L1 2 K KD DE E- -C CR RE E 53 00.0.0 052.1 1 .053| |---|---|---|---| |Col1|10|4|MMCE L L1 2 K KD DE E- -C CR RE E| |---|---|---|---| |6|10|4|MMCE L L1 2 K KD DE E- -C CR RE E 53| |---|---|---|---| 10 CREFL MMCE 6 LL12 KDE-CRE KDE-CRE 4 0.3 0.2 0.3 2 0.2 0.050.10.10.0531.053 0.86 0.88 0.90ACC 0.92 0.94 Marginal calibration on CIFAR10 using Resnet 10 CREFL MMCE LL12 KDE-CRE KDE-CRE 6 4 0.3 0.2 0.3 0.2 0.050.10.10.052 3 53 1.0 0.86 0.88 0.90ACC 0.92 0.94 Top-label calibration on CIFAR10 using Resnet CRE 6 FLMMCE 10 LL12 KDE-CRE KDE-CRE 4 0.3 0.2 0.3 0.2 0.050.10.10.052 31.053 0.86 0.88 0.90ACC 0.92 0.94 Marginal calibration on CIFAR10 using Resnet CRE FL 10 MMCELL12 KDE-CRE KDE-CRE 6 53 4 0.3 0.2 0.3 0.2 0.050.10.10.052 31.0 0.86 0.88 0.90ACC 0.92 0.94 Top-label calibration on CIFAR10 using Resnet |Col1|Col2|10|CRE FL| |---|---|---|---| |0.3||6 0.2 304 0.3 .1|MMCE L L1 2 K KD DE E- -C CR RE E| |Col1|Col2|10|CRE FL| |---|---|---|---| |0.3||0.2 06 .3 0.1 0. 
02 34|MMCE L L1 2 K KD DE E- -C CR RE E 53 .050.050.1 0.5| |Col1|Col2|Col3|Col4|CRE FL| |---|---|---|---|---| ||||10 6|MMCE L L1 2 K KD DE E- -C CR RE E| |Col1|Col2|Col3|CRE FL| |---|---|---|---| |||10 6|MMCE L L1 2 K KD DE E- -C CR RE E 53| CRE 10 FLMMCE 6 LL12 KDE-CRE KDE-CRE 0.3 0.2 0.3 30.14 0.20.052 530.050.10.5 1.0 CRE 10 FLMMCE LL12 KDE-CRE KDE-CRE 0.3 0.2 0.36 0.1 0.20.05530.050.1 34 0.5 1.0 2 CRE FL 10 MMCE LL12 KDE-CRE KDE-CRE 6 0.3 0.2 0.3 30.14 0.20.052 530.050.11.00.5 CRE FL 10 MMCELL12 KDE-CRE KDE-CRE 6 53 0.3 0.2 0.3 0.14 0.20.050.050.1 3 2 1.00.5 0.87 0.88 0.89 0.90ACC 0.91 0.92 0.93 Marginal calibration on CIFAR10 using Resnet (SD) 0.87 0.88 0.89 0.90ACC 0.91 0.92 0.93 Top-label calibration on CIFAR10 using Resnet (SD) 0.87 0.88 0.89 0.90ACC 0.91 0.92 0.93 Marginal calibration on CIFAR10 using Resnet (SD) 0.87 0.88 0.89 0.90ACC 0.91 0.92 0.93 Marginal calibration on CIFAR10 using Resnet (SD) 0.05 0.04 0.03 0.02 0.01 0.00 0.014 0.012 0.010 0.008 0.006 0.004 0.002 0.0200 0.0175 0.0150 0.0125 0.0100 0.0075 0.0050 0.0025 0.0000 0.00175 0.00150 0.00125 0.00100 0.00075 0.00050 0.00025 0.00000 0.05 0.04 0.03 0.02 0.01 0.00 0.2 CREFL MMCE LL12 KDE-CRE KDE-CRE 0.2 106 0.30.3 3 0.1 0.050.15341.020.05 0.025 0.020 0.015 0.010 0.005 0.000 0.08 0.06 0.04 0.02 0.00 0.07 0.06 0.05 0.04 0.03 0.02 0.01 0.00 0.050.1531.020.05 |.2|Col2|CRE FL MMCE| |---|---|---| |||L L1 2 K KD DE E- -C CR RE E| |0.2||160 0.03.3 3 4| |Col1|Col2|Col3|CRE| |---|---|---|---| ||0.2||FL MMCE L L1 2 K KD DE E- -C CR R10E E 6| |0.2|||3 0.3 0.05 0054..3015 00.3.1 1.0| |0.2|Col2|CRE| |---|---|---| |||FL MMCE| |||L L1 2 K KD DE E- -C CR RE E| |||| |.2|Col2|CRE FL MMCE| |---|---|---| |||L L1 2 K KD DE E- -C CR RE E| |0.2||160 0.03.3 3 4| 0.2 CREFL MMCE LL12 KDE-CRE KDE-CRE 0.2 106 0.30.3 3 0.1 0.050.15341.020.05 CRE 0.2 FLMMCE LL12 KDE-CRE KDE-CRE10 36 0.2 0.3 0.05 0.050.1534 0.30.1 1.02 0.2 CREFL MMCE LL12 KDE-CRE KDE-CRE 0.2 0.30.10.3 31.00.05100.1534620.05 0.2 0.4 ACC 0.6 0.8 0.2 0.4 ACC 0.6 0.8 0.2 0.4 ACC 0.6 0.8 0.2 0.4 ACC 0.6 0.8 Marginal calibration on CIFAR10 using Wideresnet |Col1|Col2|Col3|CRE| |---|---|---|---| |||10 6|FL MMCE L L1 2 K KD DE E- -C CR RE E| |0.3||4 2 0.20.3 53|3| CRE 10 FL MMCE LL12 KDE-CRE KDE-CRE 6 4 0.3 2 0.20.3 530.2 0.10.131.00.050.01 0.89 0.90 0.91 0.92 ACC0.93 0.94 0.95 0.96 Top-label calibration on CIFAR10 using Wideresnet |Col1|Col2|Col3|Col4|CRE| |---|---|---|---|---| ||||10|FL MMCE L L1 2 K KD DE E- -C CR RE E| ||0.3||6 0.240.3 0.2 0 0.|1 .10.050.01 3| CRE 10 FLMMCE LL12 KDE-CRE KDE-CRE 6 0.3 0.240.3 0.2 0.10.13 0.050.01 2 53 1.0 0.89 0.90 0.91 0.92 ACC0.93 0.94 0.95 0.96 Marginal calibration on CIFAR10 using Wideresnet |Col1|Col2|Col3|CRE| |---|---|---|---| |||10|FL MMCE L L1 2 K KD DE E- -C CR RE E| ||6|4 253|| CRE 10 FLMMCE LL12 KDE-CRE KDE-CRE 6 4 0.3 0.20.3 2 530.2 0.10.131.00.050.01 0.89 0.90 0.91 0.92 ACC0.93 0.94 0.95 0.96 Top-label calibration on CIFAR10 using Wideresnet |Col1|Col2|Col3|CRE| |---|---|---|---| |||10|FL MMCE L L1 2 K KD DE E- -C CR RE E| |||6 0.|1| CRE FL 10 MMCE LL12 KDE-CRE KDE-CRE 6 0.3 0.240.3 2 530.2 0.10.131.00.050.01 0.89 0.90 0.91 0.92 ACC0.93 0.94 0.95 0.96 Figure 6: Top-label and marginal calibration on CIFAR-10. Table 3: Training time [sec] per epoch for Cross-Entropy and KDE-CE methods for different models and datasets. 
## Dataset Model CRE L1 KDE-CRE ## ResNet-110 51.8 53 ResNet-110 (SD) 45 46 Wide-ResNet-28-10 152.9 154.9 DenseNet-40 103.2 106.8 ResNet-110 90 92.9 ResNet-110 (SD) 78.2 80.7 Wide-ResNet-28-10 150.5 155.3 DenseNet-40 101 105.5 ## CIFAR-10 CIFAR-100 tion. The used model is a ResNet-56, trained on a subset of three classes from CIFAR-10. The figure shows that the two estimates are comparable and both are doing a reasonable job. Figure 12 shows the absolute difference between the ground truth and estimated ECE using our KDE estimator and a binned estimator with varying number of points used for estimation. The results are ----- Marginal calibration on CIFAR100 using Densenet |Col1|0.2 0.3 3 0.3|0.1530.10.01 0.01| |---|---|---| CRE 4 0.2 2 1.0 FLMMCELL12 KDE-CRE KDE-CRE 0.2 0.3 3 0.1 53 0.10.01 0.3 0.01 0.72 0.73 0.74ACC 0.75 0.76 Top-label calibration on CIFAR100 using Densenet |CRE FL MMC L1 K|0.3 0.2 2 4 E 3 DE-CRE|53 0.1| |---|---|---| 0.3 0.01 0.30.2 1.00.1 53 0.01 0.2 2 0.1 4 CRE FL MMCE 3 LL12 KDE-CRE KDE-CRE 0.72 0.73 0.74ACC 0.75 0.76 Marginal calibration on CIFAR100 using Densenet Top-label calibration on CIFAR100 using Densenet |Col1|0 4|0 .3.2 10.0.153 0.2 2 3|0.01 0.1 CRE FL MMCE L1 KDE-CRE| |---|---|---|---| 0.3 0.01 0.30.2 1.00.1 53 0.01 0.2 2 0.1 CRE 4 FLMMCE 3 LL12 KDE-CRE KDE-CRE 0.72 0.73 0.74ACC 0.75 0.76 0.0013 0.0012 0.0011 0.0010 0.0009 0.0008 0.0007 0.0006 0.0015 0.0014 0.0013 0.0012 0.0011 0.0010 0.0009 0.0175 0.0150 0.0125 0.0100 0.0075 0.0050 0.0025 0.030 0.025 0.020 0.015 0.010 0.005 0.000 3.0 2.5 2.0 1.5 1.0 0.5 3.5 3.0 2.5 2.0 1.5 1.0 0.10 0.08 0.06 0.04 0.14 0.12 0.10 0.08 0.06 0.04 |Col1|4 0.3 03.2 0.3|53 0.1 0.100 .0.0 11| |---|---|---| 1e 5 1.0 CREFL 0.2 2 MMCELL12 KDE-CRE KDE-CRE 4 0.3 0.30.23 0.1 53 0.10.010.01 0.72 0.73 0.74ACC 0.75 0.76 Marginal calibration on CIFAR100 using Resnet Marginal calibration on CIFAR100 using Resnet |Col1|Col2|0.3 3|0.2 0.01 0.3 0.10.01 0.1| |---|---|---|---| ||CRE FL MM|CE|| ||L L1 2 K K|DE-CRE DE-CRE|| 2 460.2 1.053 0.2 0.01 0.3 3 0.3 0.1 0.01 0.1 CRE FL MMCE LL12 KDE-CRE KDE-CRE 0.65 0.66 0.67 0.68ACC 0.69 0.70 0.71 Top-label calibration on CIFAR100 using Resnet |Col1|Col2|2 4|Col4| |---|---|---|---| |||2 60.20.3|0.1 0.3 0.2| ||CRE FL MMC|E|3| ||L L1 2 K K|DE-CRE DE-CRE|| 1.0 530.010.1 0.01 2 460.2 0.3 0.3 0.1 0.2 CRE FL MMCELL12 KDE-CRE KDE-CRE 3 0.65 0.66 0.67 0.68ACC 0.69 0.70 0.71 Top-label calibration on CIFAR100 using Resnet |Col1|L KD|DE-CRE|0.|.1 0.01| |---|---|---|---|---| ||L2 K|DE-CRE 2 4 0 6|.20.3 0.3 0.2|0.1| |||||| ||||3|| CRE FLMMCELL12 KDE-CRE KDE-CRE 1.0530.010.1 0.01 2 460.2 0.3 0.3 0.1 0.2 3 0.65 0.66 0.67 0.68ACC 0.69 0.70 0.71 |Col1|Col2|2|0.2 L1 KDE-CRE L KDE-CRE| |---|---|---|---| |||6 0.3 3|L2 KDE-CRE 1.0 0.3 0.010.1 0.01 0.1| ||||| ||||| 1e 5 40.2 53 CREFLMMCE 2 6 0.2 1.0 LL12 KDE-CRE KDE-CRE 0.3 3 0.3 0.01 0.1 0.01 0.1 0.65 0.66 0.67 0.68ACC 0.69 0.70 0.71 Marginal calibration on CIFAR100 using Resnet (SD) |0.3|0.23 0.3|530.01 0.1 2| |---|---|---| |CRE FL|0.2|0.1 1. 00 .01| |MM|CE|| |L L1 2 K K|DE-CRE DE-CRE|| 0.3 530.01 0.2 3 0.3 0.1 2 0.2 0.1 1.00.01 CRE FL MMCE LL12 KDE-CRE KDE-CRE 0.64 0.66 0.68ACC 0.70 0.72 0.74 Top-label calibration on CIFAR100 using Resnet (SD) |1e 5 M|Marginal calibration on CIFAR100|using Resnet (SD)| |---|---|---| |0.3|3 0.2|530.01 2| |CRE FL MMC|0.3 0.2 E|0.1 0.1 1. 00 .01| |L L1 2 K K|DE-CRE DE-CRE|| 0.64 0.66 0.68ACC 0.70 0.72 0.74 Marginal calibration on CIFAR100 using Resnet (SD) |0.3|0.3|5 0.23 0.|30.01 1 2| |---|---|---|---| |CRE FL||0 0.2|.1 1. 
00 .01| |MM|CE||| |L L1 2 K K|DE-CRE DE-CRE||| 0.64 0.66 0.68ACC 0.70 0.72 0.74 1e 5 2.5 2.0 1.5 1.0 0.5 4.0 3.5 3.0 2.5 2.0 1.5 1.0 0.5 0.12 0.10 0.08 0.06 0.04 0.02 0.00 0.10 0.08 0.06 0.04 0.02 0.0012 0.0011 0.0010 0.0009 0.0008 0.0007 0.0006 0.0012 0.0010 0.0008 0.0006 0.0012 0.0011 0.0010 0.0009 0.0008 0.0007 0.0006 0.016 0.014 0.012 0.010 0.008 0.006 0.004 0.002 0.000 |CRE FL MMC L L1 2 K K|E DE-CRE DE-CRE|12.00.01 0.1 0.01.01| |---|---|---| |0.3|0.2 0.2 0.3|53 3| |||| 0.64 0.66 0.68ACC 0.70 0.72 0.74 Marginal calibration on CIFAR100 using Wideresnet |Col1|Col2|4 532|CRE FL MMCE| |---|---|---|---| |||1.0 0.3 0.10.2 0.2|3L L1 2 K KD DE E- -C CR RE E 000.. 1.0011| 0.34 53 2 1.0 3CREFLMMCELL12 KDE-CRE KDE-CRE 0.10.2 0.2 0.10.010.01 0.3 0.74 0.75 0.76 0.77ACC 0.78 0.79 0.80 0.81 Top-label calibration on CIFAR100 using Wideresnet |Col1|Col2|0.2 0.1|0.1| |---|---|---|---| ||0|0.2 .3 0.3 1.0 4 532|0.01 0.01| 0.20.10.2 0.1 0.3 0.01 0.3 1.0 0.01 4 53 2 CRE FL MMCELL12 KDE-CRE KDE-CRE 3 0.74 0.75 0.76 0.77ACC 0.78 0.79 0.80 0.81 Top-label calibration on CIFAR100 using Wideresnet |Col1|CRE FL MMC|E|Col4|Col5| |---|---|---|---|---| ||L L1 2 K K 0|DE-CRE DE-CRE .3|0.2 0.1 0.2 1.0|0.1 0.01 0.01| CRE FL MMCELL12 KDE-CRE KDE-CRE 0.20.10.2 0.1 0.3 0.010.01 1.0 4 0.3 53 2 3 0.74 0.75 0.76 0.77ACC 0.78 0.79 0.80 0.81 |1e 5|Col2|Marginal calibration on CIFAR100|0 using Wideresnet| |---|---|---|---| |||4|CRE FL MMCE| |||5321.0 0.1|L L1 2 K KD DE E- -C CR RE E| 4 CREFL MMCE 53 2 1.0 LL12 KDE-CRE KDE-CRE 0.3 0.1 0.3 0.2 0.2 0.10.010.013 0.74 0.75 0.76 0.77ACC 0.78 0.79 0.80 0.81 Figure 7: Top-label and marginal calibration on CIFAR-100 0.0 1.0 0.2 0.8 0.4 0.6 0.6 0.4 0.8 0.2 1.0 0.0 0.0 0.2 0.4 0.6 0.8 1.0 (a) Splitting the simplex in 16 bins 0.00 0.05 0.10 0.15 0.20 0.25 0.30 (b) Corresponding histogram (c) Corresponding Dirichlet KDE Figure 8: An example of a simplex binned estimator and kernel-density estimator for CIFAR-10 averaged over 120 ResNet-56 models trained on a subset of three classes from CIFAR-10. Both estimators are biased and have some variance, and the plot shows that the combination of the two is in the same order of magnitude. The empirical convergence rates (slope of the log-log plot) is given in the legend and is shown to be close to the theoretically expected value of -0.5. D.4 CHOICE OF THE BATCH SIZE In Figure 13 we investigate the choice of the batch size on CIFAR-10. To this end, we use two differently shuffled dataloaders that draw random batches from the same training set. The first dataloader provides batches to the loss term (CRE) while the second dataloader provides the batches for the regularization (KDE). 
[Figure 9: Binned ECE vs. KDE ECE scatter plots for canonical, marginal, and top-label calibration, using 100, 500, and 1000 points, 25 bins, and a bandwidth of 0.001.]

Figure 9: Relationship between the ECE metric based on binning and kernel density estimation (KDE-ECE) for the three types of calibration: canonical, marginal and top-label. In every row, a different number of points is used to approximate the KDE-ECE.

[Figure 10: canonical calibration error vs. accuracy (ACC); legend: CRE, L1 KDE-CRE, L2 KDE-CRE.]

Figure 10: Canonical calibration on Kather using a ResNet-50 model

[Figure 11 panels: (a) Canonical; (b) Marginal; (c) Top-label; each plots the ground truth, KDE-ECE, and binned ECE against the number of points.]

Figure 11: KDE-ECE estimates and their corresponding binned approximations on the three types of calibration for a varying number of points used for the estimation. The ground truth is calculated using 3000 probability scores of the test set. For the binned estimate, the points are assigned to 25 bins with adaptive width. A bandwidth of 0.001 is used for KDE-ECE.
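For concreteness, the KDE-ECE quantity plotted in Figures 9, 11, and 12 can be estimated with an O(n²) leave-one-out Dirichlet-kernel estimator. The sketch below is our reading of the construction, not the exact implementation used for the experiments; it assumes the kernel centered at a prediction p_i is a Dirichlet density with parameters p_i / h + 1.

```python
# Minimal sketch of a leave-one-out L1 KDE-ECE with a Dirichlet kernel.
# probs: (n, K) predicted probability vectors; labels_onehot: (n, K)
# one-hot float targets; bandwidth h > 0. Cost is O(n^2), as in the paper.
import torch

def dirichlet_kde_ece(probs, labels_onehot, bandwidth=0.001, eps=1e-12):
    alphas = probs / bandwidth + 1.0                   # Dirichlet parameters per center
    log_probs = torch.log(probs + eps)
    # log Dir(p_j; alpha_i) for every pair (i, j)
    log_kern = (torch.lgamma(alphas.sum(dim=1, keepdim=True))
                - torch.lgamma(alphas).sum(dim=1, keepdim=True)
                + (alphas - 1.0) @ log_probs.T)        # (n, n)
    log_kern.fill_diagonal_(float("-inf"))             # leave-one-out
    weights = torch.softmax(log_kern, dim=0)           # normalize over centers i
    cond_exp = weights.T @ labels_onehot               # estimate of E[y | f(x) = p_j]
    return (cond_exp - probs).abs().sum(dim=1).mean()  # L1 calibration error
```

Evaluating this on held-out probability scores with a bandwidth of 0.001 matches the KDE-ECE configuration reported in the figure captions.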
[Figure 12 panels: (a) Canonical; (b) Marginal; (c) Top-label; log-log plots of the absolute error of KDE-ECE and binned ECE against the number of points, with the fitted slopes reported in the legend.]

Figure 12: Absolute difference between ground truth and estimated ECE for a varying number of points used for the estimation. The ground truth is calculated using 3000 probability scores of the test set. For the binned estimate, the points are assigned to 25 bins with adaptive width. A bandwidth of 0.001 is used for KDE-ECE. Note that the axes are on a log scale.

[Figure 13: calibration error vs. accuracy (ACC) for regularization batch sizes 32, 64, 128, 256, and 512; legend: L1 KDE-CRE (single batch) and L1 2 KDE-CRE (separate batches).]

Figure 13: Training with different batches for loss and regularization (2 KDE-CRE), where the batch size for the loss is fixed and the batch size for the regularization varies. The orange point shows our usual experimental set-up where we train with only one batch (KDE-CRE). Upper row: marginal, lower row: top-label.
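The binned baseline in Figures 11 and 12 uses 25 bins with adaptive width. One plausible reading, sketched below for the top-label case, is equal-mass binning via quantiles of the confidence scores; this interpretation, and every name in the sketch, is our assumption, since the binning scheme is not spelled out here.

```python
# Sketch of a binned top-label ECE with adaptive-width (equal-mass) bins.
# Assumption: "adaptive width" means quantile-based bin edges, so that
# each bin holds roughly the same number of points.
import numpy as np

def adaptive_binned_ece(confs, correct, n_bins=25):
    # confs: top-label confidences; correct: boolean array, was the top
    # prediction right?
    edges = np.quantile(confs, np.linspace(0.0, 1.0, n_bins + 1))
    bin_ids = np.digitize(confs, edges[1:-1])  # values in 0..n_bins-1
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(correct[mask].mean() - confs[mask].mean())
            ece += mask.mean() * gap           # weight by bin mass
    return ece
```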