# CALIBRATION REGULARIZED TRAINING OF DEEP NEURAL NETWORKS USING DIRICHLET KERNEL DENSITY ESTIMATION
**Anonymous authors**
Paper under double-blind review
ABSTRACT
Calibrated probabilistic classifiers are models whose predicted probabilities can
directly be interpreted as uncertainty estimates. This property is particularly important in safety-critical applications such as medical diagnosis or autonomous
driving. However, it has been shown recently that deep neural networks are poorly
calibrated and tend to output overconfident predictions. As a remedy, we propose
a trainable calibration error estimator based on Dirichlet kernel density estimates,
which asymptotically converges to the true Lp calibration error. This novel estimator enables us to achieve the strongest notion of multiclass calibration, called canonical calibration, while other common calibration methods only allow for top-label and marginal calibration. The empirical results show that our estimator is competitive with the state-of-the-art, consistently yielding tradeoffs between calibration error and accuracy that are (near) Pareto optimal across a range of network architectures. The computational complexity of our estimator is O(n²), matching
that of the kernel maximum mean discrepancy, used in a previously considered
trainable calibration estimator (Kumar et al., 2018). By contrast, the proposed
method has a natural choice of kernel, and can be used to generate consistent estimates of other quantities based on conditional expectation, such as the sharpness
of an estimator.
1 INTRODUCTION
Deep neural networks have shown tremendous success in classification tasks, being regularly the
best performing models in terms of accuracy. However, they are also known to make overconfident
predictions (Guo et al., 2017), which is particularly problematic in safety-critical applications such
as medical diagnosis or autonomous driving. Therefore, in many real world applications we do
not just care about the predictive performance, but also about the trustworthiness of that prediction,
that is, we are interested in accurate predictions with robust uncertainty estimates. To this end, we
want our models to be uncertainty calibrated which means that, for instance, among all cells that
have been predicted with a probability of 0.8 to be cancerous, in fact a fraction of 80 % belong to a
malignant tumor.
Being calibrated, however, does not imply that the classifier achieves good accuracy. For instance,
a classifier that always predicts the marginal distribution of the target class is calibrated, but will
not be very useful in practice. Likewise, a good predictive performance does not ensure calibration.
In particular, for a broad class of loss functions, risk minimization leads to asymptotically Bayes
optimal classifiers (Bartlett et al., 2006). However, there is no guarantee that they are calibrated,
even in the asymptotic limit. Therefore, we consider minimizing the risk plus a term that penalizes miscalibration, i.e., Risk + λ · CalibrationError. For parameter values λ > 0, this will push the classifier towards a calibrated model, while maintaining similar accuracy. The existence of such a λ > 0 is suggested by the fact that there always exists at least one Bayes optimal classifier that is calibrated, namely P(y|x).
To optimize the risk and the calibration error jointly, we propose a differentiable and consistent estimator of the expected Lp calibration error based on kernel density estimates (KDEs). In particular,
we use a Beta kernel in binary classification tasks and a Dirichlet kernel in the multiclass setting,
as these kernels are the natural choices to model density estimation over a probability simplex. Our
Dirichlet kernel based estimator allows for the estimation of canonical calibration, which is the
strongest notion of multiclass calibration as it implies the calibration of the whole probability vector
(Bröcker, 2009; Appice et al., 2015; Vaicenavicius et al., 2019). By contrast, most other state-of-the-art methods only achieve weaker versions of multiclass calibration, namely top-label (Guo et al., 2017) and marginal or class-wise calibration (Kull et al., 2019). Top-label calibration only considers the score for the predicted class, while for marginal calibration the multiclass problem is
split up into K one-vs-all binary ones, each of which is required to be calibrated according to the
definition of binary calibration. In many applications marginal and canonical calibration are preferable to top-label calibration, since we often care about having reliable uncertainty estimates for more
than just one class per prediction. For instance, in medical diagnosis we do not just care about the
most likely disease a certain patient might have but also about the probabilities of other diseases.
Our contributions can be summarized as follows:
1. We develop a trainable calibration error objective using Dirichlet kernel density estimates,
which can be minimized alongside any loss function in the existing batch stochastic gradient descent framework.
2. We propose to use our estimator to evaluate canonical calibration. Due to the scaling
properties of Dirichlet kernel density estimation, and the tendency for probabilities to be
concentrated in a relatively small number of classes, this becomes feasible in cases that
cannot be estimated using a binned estimator.
3. We show on a variety of network architectures and two datasets that DNNs trained alongside an estimator of the calibration error achieve competitive results both on existing metrics and on the proposed measure of canonical calibration.
2 RELATED WORK
Calibration of probabilistic predictors has long been studied in many fields. This topic gained attention in the deep learning community following the observation in Guo et al. (2017) that modern
neural networks are poorly calibrated and tend to give overconfident predictions due to overfitting
on the NLL loss. The surge of interest resulted in many calibration strategies that can be split in two
general categories, which we discuss subsequently. Post-hoc calibration strategies learn a calibration map of the predictions from a trained predictor in a post-hoc manner. For instance, Platt scaling
(Platt, 1999) fits a logistic regression model on top of the logit outputs of the model. A special
case of Platt scaling that fits a single scalar, called temperature, has been popularized by Guo et al.
(2017) as an accuracy-preserving, easy to implement and effective method to improve calibration.
However, it has the undesired consequence that it clamps the high confidence scores of accurate predictions (Kumar et al., 2018). Other approaches for post-hoc calibration include: histogram binning
(Zadrozny & Elkan, 2001), isotonic regression (Zadrozny & Elkan, 2002), and Bayesian binning
into quantiles (Naeini & Cooper, 2015). Trainable calibration strategies integrate a differentiable
calibration measure into the training objective. One of the earliest approaches is regularization by
penalizing low entropy predictions (Pereyra et al., 2017). Similarly to temperature scaling, it has
been shown that entropy regularization needlessly suppresses high confidence scores of correct predictions (Kumar et al., 2018). Another popular strategy is MMCE (Maximum Mean Calibration Error) (Kumar et al., 2018), where the entropy regularizer is replaced by a kernel-based surrogate for the calibration error that can be optimized alongside NLL. It has been shown that label smoothing (Szegedy et al., 2015; Müller et al., 2020), i.e. training models with a weighted mixture of the labels
instead of one-hot vectors, also improves model calibration. Liang et al. (2020) propose to add the
difference between predicted confidence and accuracy as auxiliary term to the cross-entropy loss.
Focal loss (Mukhoti et al., 2020; Lin et al., 2018) has recently been empirically shown to produce
better calibrated models than many of the alternatives, but does not estimate a clear quantity related
to calibration error.
**Kernel density estimation (Parzen, 1962; Rosenblatt, 1956) is a non-parametric method to estimate**
a probability density function from a finite sample. Zhang et al. (2020) propose a KDE-based estimator of the calibration error for measuring calibration performance. However, they use the triweight
kernel, which has a limited support interval and is therefore applicable to binary classification, but
does not have a natural extension to higher dimensional simplexes, in contrast to the Dirichlet kernel
that we consider here. As a result, they consider an unnatural proxy to marginal calibration error,
which does not result in a consistent estimator.
3 METHODS
The most commonly used loss functions are designed to achieve consistency in the sense of Bayes
optimality under risk minimization, however, they do not guarantee calibration - neither for finite
samples nor in the asymptotic limit. Since we are interested in models f that are both accurate and
calibrated, we consider the following optimization problem bounding the calibration error CE(f ):
f = arg min_{f∈F} Risk(f),  s.t.  CE(f) ≤ B,    (1)

for some B > 0, and its associated Lagrangian

f = arg min_{f∈F} ( Risk(f) + λ · CE(f) ).    (2)
We measure the (mis-)calibration in terms of the Lp calibration error. To this end, let (Ω, A, P)
be a probability space, let X = ℝ^d and Y = {0, 1, ..., K}. Let x : Ω → X and y : Ω → Y be random variables, with realizations denoted by subscripts. Furthermore, let f : X → △^K be a decision function, where △^K denotes the K-dimensional simplex, as is obtained, e.g., from the output of a final softmax layer in a neural network.
**Definition 3.1 (Calibration error, (Naeini et al., 2015; Kumar et al., 2019; Wenger et al., 2020)).**
_The Lp calibration error of f is:_
CE_p(f) = ( E[ ‖E[y | f(x)] − f(x)‖_p^p ] )^{1/p}.    (3)
We note that we consider multiclass calibration, and that f(x) and the conditional expectation in Equation 3 therefore map to points on a probability simplex. We say that a classifier f is perfectly calibrated if CE_p(f) = 0. Kumar et al. (2018) have also considered a minimization problem similar to Equation 2. Instead of using CE_p they use a metric called maximum mean calibration error (MMCE) that is 0 if and only if CE_p = 0. However, it is unclear how MMCE relates to the canonical multiclass setting or to the norm parameter p for non-zero CE_p.
In order to optimize Definition 3.1 directly, we need to perform density estimation over the probability simplex in order to empirically compute the conditional expectation. In a binary setting,
this has traditionally been done with binned estimates (Naeini et al., 2015; Guo et al., 2017; Kumar
et al., 2019). However, this is not differentiable w.r.t. the function f, and cannot be incorporated
into a gradient based training procedure. Furthermore, binned estimates suffer from the curse of
dimensionality and do not have a practical extension to multiclass settings. A natural choice for a
differentiable kernel density estimator in the binary case is a kernel based on the Beta distribution
and the extension to the multiclass case is given by the Dirichlet distribution. Hence, we consider an estimator of CE_p based on Beta and Dirichlet kernel density estimates in the binary and multiclass setting, respectively. We require that this estimator is consistent and differentiable such that we can train with it according to Equation 2. This estimator is given by:
ĈE_p(f)^p = (1/n) Σ_{h=1}^{n} ‖ Ê[y | f(x)]_{f(x_h)} − f(x_h) ‖_p^p,    (4)

where Ê[y | f(x)]_{f(x_h)} denotes Ê[y | f(x)] evaluated at f(x) = f(x_h). If P_{x,y} has a probability density p_{x,y} with respect to the product of the Lebesgue and counting measure, we can define p_{x,y}(x_i, y_i) = p_{y|x=x_i}(y_i) · p_x(x_i). Then we define the estimator of the conditional expectation as follows:

E[y | f(x)] = Σ_{y_k∈Y} y_k · p_{y|x=f(x)}(y_k) = ( Σ_{y_k∈Y} y_k · p_{x,y}(f(x), y_k) ) / p_x(f(x))    (5)

            ≈ ( Σ_{i=1}^{n} k(f(x); f(x_i)) y_i ) / ( Σ_{i=1}^{n} k(f(x); f(x_i)) ) =: Ê[y | f(x)],    (6)

where k is the kernel of a kernel density estimate evaluated at the point f(x_i).
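To make Equation 6 concrete: the conditional expectation is estimated as a kernel-weighted average of the labels, evaluated at every prediction of a batch. The following is a minimal PyTorch sketch of this ratio (our own naming, not the authors' released code); the Beta and Dirichlet kernels introduced below supply concrete choices for the `kernel` argument, with the bandwidth already bound (e.g. via `functools.partial`):

```python
import torch

def cond_expectation(probs, y_onehot, kernel):
    # E-hat[y | f(x)] from Eq. 6: a ratio of two kernel density estimates,
    # evaluated at every f(x_h) in the batch.
    # probs: (n, K) predicted probability vectors f(x_i)
    # y_onehot: (n, K) one-hot labels
    # kernel: callable k(z; z_i) mapping (n, 1, K) x (1, n, K) -> (n, n)
    K = kernel(probs.unsqueeze(1), probs.unsqueeze(0))  # K[h, i] = k(f(x_h); f(x_i))
    return (K @ y_onehot.float()) / K.sum(dim=1, keepdim=True)
```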
**Proposition 3.2.** Ê[y | f(x)] is a pointwise consistent estimator of E[y | f(x)], that is:

lim_{n→∞} ( Σ_{i=1}^{n} k(f(x); f(x_i)) y_i ) / ( Σ_{i=1}^{n} k(f(x); f(x_i)) ) = ( Σ_{y_k∈Y} y_k · p_{x,y}(f(x), y_k) ) / p_x(f(x)).    (7)

_Proof._ By the consistency of kernel density estimators (Silverman, 1986; Chen, 1999; Ouimet & Tolosana-Delgado, 2021), for all f(x) ∈ (0, 1), (1/n) Σ_{i=1}^{n} k(f(x); f(x_i)) y_i → Σ_{y_k∈Y} y_k · p_{x,y}(f(x), y_k) and (1/n) Σ_{i=1}^{n} k(f(x); f(x_i)) → p_x(f(x)) as n → ∞. The fact that the ratio of two convergent sequences converges to the ratio of their limits shows the result.
**Mean squared error in binary classification** As a first instantiation of our framework we consider a binary classification setting, with the mean squared error MSE(f) = E[(f(x) − y)²] as the risk function, jointly optimized with the L2 calibration error CE₂. Following Murphy (1973); Degroot & Fienberg (1983); Kuleshov & Liang (2015); Nguyen & O'Connor (2015) we decompose (full derivation in Appendix A) the MSE as:

MSE(f) − CE₂(f)² = E[ (1 − E[y | f(x)]) · E[y | f(x)] ] ≥ 0.    (8)
Similar to Equation 2, we consider the optimization problem for some λ > 0:

f = arg min_{f∈F} ( MSE(f) + λ · CE₂(f)² ).    (9)

Using Equation 8 we rewrite:

MSE(f) + λ CE₂(f)² = (1 + λ) MSE(f) − λ ( MSE(f) − CE₂(f)² )    (10)
                   = (1 + λ) MSE(f) − λ E[ (1 − E[y | f(x)]) E[y | f(x)] ].    (11)

Rescaling Equation 11 by a factor of (1 + λ)⁻¹ and substituting γ = λ/(1 + λ) ∈ [0, 1):

f = arg min_{f∈F} ( MSE(f) + λ CE₂(f)² ) = arg min_{f∈F} ( MSE(f) − γ E[ (1 − E[y | f(x)]) E[y | f(x)] ] )    (12)
  = arg min_{f∈F} ( MSE(f) + γ E[ E[y | f(x)]² ] ),    (13)

where the last step uses that E[E[y | f(x)]] = E[y] by the law of total expectation, so this term does not depend on f and can be dropped from the arg min.
For optimization we wish to find an estimator for E[E[y | f(x)]²]. Building upon Equation 6, a partially debiased estimator can be written as:¹

Ê[E[y | f(x)]²] ≈ (1/n) Σ_{h=1}^{n} [ ( Σ_{i≠h} k(f(x_h); f(x_i)) y_i )² − Σ_{i≠h} ( k(f(x_h); f(x_i)) y_i )² ] / [ ( Σ_{i≠h} k(f(x_h); f(x_i)) )² − Σ_{i≠h} ( k(f(x_h); f(x_i)) )² ].    (14)

In a binary setting, the kernels k(·, ·) are Beta distributions, i.e. denoting z_i := f(x_i) for short:

k_Beta(z, z_i) := z^{α_i − 1} (1 − z)^{β_i − 1} · Γ(α_i + β_i) / ( Γ(α_i) Γ(β_i) ),    (15)

with α_i = z_i/h + 1 and β_i = (1 − z_i)/h + 1 (Chen, 1999; Bouezmarni & Rolin, 2003; Zhang & Karunamuni, 2010), where h is a bandwidth parameter in the kernel density estimate that goes to 0 as n → ∞.
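To illustrate how Equations 13–15 combine into a trainable objective, here is a minimal PyTorch sketch; it is a direct transcription of the formulas above under our own naming (`beta_kernel`, `cond_sq_term`, and `kde_mse_loss` are not from the paper's code). Here `z` holds the predicted probabilities f(x_i) of a mini-batch and `y` the binary labels, both as float tensors of shape (n,):

```python
import torch

def beta_kernel(z, zi, h):
    # k_Beta(z, z_i): Beta density with alpha_i = z_i/h + 1 and
    # beta_i = (1 - z_i)/h + 1 (Eq. 15), evaluated at z; log-space for stability.
    alpha, beta = zi / h + 1.0, (1.0 - zi) / h + 1.0
    log_k = ((alpha - 1.0) * torch.log(z) + (beta - 1.0) * torch.log1p(-z)
             + torch.lgamma(alpha + beta) - torch.lgamma(alpha) - torch.lgamma(beta))
    return torch.exp(log_k)

def cond_sq_term(z, y, h):
    # Partially debiased estimator of E[E[y | f(x)]^2] (Eq. 14).
    n = z.shape[0]
    z = z.clamp(1e-6, 1.0 - 1e-6)
    K = beta_kernel(z.unsqueeze(1), z.unsqueeze(0), h)  # K[h', i] = k(f(x_h'); f(x_i))
    K = K * (1.0 - torch.eye(n, device=z.device))       # leave-one-out: drop i == h'
    Ky = K * y.unsqueeze(0)                             # column i weighted by y_i
    num = Ky.sum(dim=1) ** 2 - (Ky ** 2).sum(dim=1)
    den = K.sum(dim=1) ** 2 - (K ** 2).sum(dim=1)
    return (num / den).mean()

def kde_mse_loss(z, y, h, gamma):
    # Calibration regularized objective of Eq. 13: MSE + gamma * E[E[y | f(x)]^2].
    return ((z - y) ** 2).mean() + gamma * cond_sq_term(z, y, h)
```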
We note that the computational complexity of this estimator is O(n²). Within the gradient descent training procedure, the density is estimated using a mini-batch and therefore the O(n²) complexity is w.r.t. a mini-batch, not the entire dataset.

The estimator in Equation 14 is a ratio of two second order U-statistics that converge as n^{−1/2} (Ferguson, 2005). Therefore, the overall convergence rate will be n^{−1/2}. Empirical convergence rates are calculated in Appendix D.3 and shown to be close to the theoretically expected value.
¹ We have debiased the numerator and denominator individually (Ferguson, 2005, Section 2), but for simplicity have not corrected for the fact that we are estimating a ratio (Scott & Wu, 1981).
**Multiclass calibration with Dirichlet kernel density estimates** There are multiple definitions of multiclass calibration that differ in how strictly they require the probability vector f(x) to be calibrated. The weakest notion is top-label calibration, which, as the name suggests, only cares about calibrating the entry with the highest predicted probability, which reduces to a binary calibration problem again (Guo et al., 2017). Marginal or class-wise calibration (Kull et al., 2019) is the most commonly used definition of multiclass calibration and a stronger version of top-label calibration. Here, the problem is split into K one-vs-all binary calibration settings, such that each class has to be calibrated against the other K − 1 classes:
MCE_p(f)^p = Σ_{k=1}^{K} E[ | E[y = k | f(x)_k] − f(x)_k |^p ].    (16)

An estimator for this calibration error is:

M̂CE_p(f)^p = (1/n) Σ_{j=1}^{n} Σ_{k=1}^{K} | ( Σ_{i≠j} k_Beta(f(x_j)_k; f(x_i)_k) [y_i]_k ) / ( Σ_{i≠j} k_Beta(f(x_j)_k; f(x_i)_k) ) − f(x_j)_k |^p.    (17)
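A corresponding sketch of the marginal estimator in Equation 17, reusing `beta_kernel` (and the import) from the binary sketch above, again under our own naming:

```python
def marginal_ce(probs, y_onehot, h, p=1):
    # Estimate of MCE_p(f)^p (Eq. 17): one leave-one-out Beta-kernel
    # regression per class k, summed over classes and averaged over samples.
    n, num_classes = probs.shape
    total = 0.0
    for k in range(num_classes):
        z = probs[:, k].clamp(1e-6, 1.0 - 1e-6)
        K = beta_kernel(z.unsqueeze(1), z.unsqueeze(0), h)
        K = K * (1.0 - torch.eye(n, device=probs.device))
        e_yk = (K @ y_onehot[:, k].float()) / K.sum(dim=1)  # E-hat[y = k | f(x)_k]
        total = total + ((e_yk - z).abs() ** p).mean()
    return total
```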
The strongest notion of multiclass calibration, and the one that we want to consider in this paper, is called canonical calibration (Bröcker, 2009; Appice et al., 2015; Vaicenavicius et al., 2019). Here it is required that the whole probability vector f(x) is calibrated. The definition is exactly the one from Definition 3.1. Its estimator is:
ĈE_p(f)^p = (1/n) Σ_{j=1}^{n} ‖ ( Σ_{i≠j} k_Dir(f(x_j); f(x_i)) y_i ) / ( Σ_{i≠j} k_Dir(f(x_j); f(x_i)) ) − f(x_j) ‖_p^p,    (18)

where k_Dir is a Dirichlet kernel defined as:

k_Dir(z, z_i) := ( Γ(Σ_{j=1}^{K} α_{ij}) / Π_{j=1}^{K} Γ(α_{ij}) ) · Π_{j=1}^{K} z_j^{α_{ij} − 1},    (19)

with α_i = z_i/h + 1, applied coordinate-wise (Ouimet & Tolosana-Delgado, 2021). As before, the computational complexity is O(n²) irrespective of p.
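A minimal sketch of the canonical estimator in Equations 18–19 (a reading of the formulas under our own naming, not the authors' implementation):

```python
import torch

def dirichlet_kernel(Z, Zi, h):
    # k_Dir(z, z_i): Dirichlet density with concentration alpha_i = z_i/h + 1
    # (Eq. 19), evaluated at z; computed in log space.
    alpha = Zi / h + 1.0
    log_k = (torch.lgamma(alpha.sum(-1)) - torch.lgamma(alpha).sum(-1)
             + ((alpha - 1.0) * torch.log(Z)).sum(-1))
    return torch.exp(log_k)

def canonical_ce(probs, y_onehot, h, p=1):
    # Estimate of CE_p(f)^p (Eq. 18): leave-one-out Dirichlet-kernel regression
    # of the one-hot labels on the full probability vectors.
    n = probs.shape[0]
    Z = probs.clamp_min(1e-6)
    K = dirichlet_kernel(Z.unsqueeze(1), Z.unsqueeze(0), h)    # K[j, i]
    K = K * (1.0 - torch.eye(n, device=probs.device))          # sum over i != j
    e_y = (K @ y_onehot.float()) / K.sum(dim=1, keepdim=True)  # E-hat[y | f(x_j)]
    return ((e_y - probs).abs() ** p).sum(dim=1).mean()
```

During training, this term is added to the task loss of each mini-batch as in Equation 2, which is exactly where the O(n²) per-batch cost arises.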
This estimator is differentiable and furthermore, the following proposition holds:
**Proposition 3.3. The Dirichlet kernel based CE estimator is consistent, that is**
lim_{n→∞} (1/n) Σ_{j=1}^{n} ‖ ( Σ_{i≠j} k_Dir(f(x_j); f(x_i)) y_i ) / ( Σ_{i≠j} k_Dir(f(x_j); f(x_i)) ) − f(x_j) ‖_p^p = E[ ‖E[y | f(x)] − f(x)‖_p^p ].    (20)
_Proof._ Dirichlet kernel estimators are consistent (Ouimet & Tolosana-Delgado, 2021); consequently, by Proposition 3.2 the term inside the norm is consistent for any fixed f(x_j) (note that summing over i ≠ j ensures that the ratio of the KDEs does not depend on the outer summation). Moreover, for any convergent sequence the norm of that sequence also converges to the norm of its limit. Ultimately, the outer sum is merely the sample mean of consistent summands, which again is consistent.
4 EMPIRICAL SETUP
We trained ResNet (He et al., 2015), ResNet with stochastic depth (SD) (Huang et al., 2016),
DenseNet (Huang et al., 2018) and WideResNet (Zagoruyko & Komodakis, 2016) networks on
CIFAR-10 and CIFAR-100 (Krizhevsky, 2009). We use 45000 images for training. The code will
be released upon acceptance.
**Baselines** _Cross-entropy: The first baseline model is trained using cross-entropy with the data_
preprocessing, training procedure and hyperparameters described in the corresponding paper for
the architecture. _Trainable calibration strategies:_ MMCE (Kumar et al., 2018) is a differentiable measure of calibration with the property that it is minimized at perfect calibration. It is used as a regulariser alongside NLL, with the strength of regularization parameterized by λ. Focal loss (Mukhoti et al., 2020) is an alternative to the popular cross-entropy loss, defined as L_f = −(1 − f(y|x))^γ log(f(y|x)), where γ is a hyperparameter and f(y|x) is the probability score that a neural network f outputs for class y on input x. Their best-performing approach is the sample-dependent FL-53, where γ = 5 for f(y|x) ∈ [0, 0.2) and γ = 3 otherwise, followed by the method with fixed γ = 3. _Post-hoc calibration strategies:_ Guo et al. (2017) investigated the performance
of several post-hoc calibration methods and found temperature scaling to be a strong baseline,
which we use as a representative of this group. It works by scaling the logits with a scalar T > 0,
typically learned on a validation set by minimizing NLL. Following Kumar et al. (2018); Mukhoti
et al. (2020), we also use temperature scaling as a post-processing step for our method.
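For concreteness, a sketch of the sample-dependent FL-53 variant as described above (this follows the textual description of Mukhoti et al. (2020); the function name is ours):

```python
import torch
import torch.nn.functional as F

def focal_loss_53(logits, targets):
    # FL-53: L_f = -(1 - f(y|x))^gamma * log f(y|x), with gamma = 5
    # when f(y|x) < 0.2 and gamma = 3 otherwise.
    p_y = F.softmax(logits, dim=1).gather(1, targets.unsqueeze(1)).squeeze(1)
    gamma = torch.where(p_y < 0.2, torch.full_like(p_y, 5.0), torch.full_like(p_y, 3.0))
    return -((1.0 - p_y) ** gamma * torch.log(p_y.clamp_min(1e-12))).mean()
```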
**Metrics** The most widely-used metric for expected calibration error (ECE) is a binned estimator
(Naeini et al., 2015), which divides the interval [0, 1] into bins of equal width and then calculates
a weighted average of the absolute difference between accuracy and confidence for each bin. A
better binning scheme involves determining the bin sizes so that an equal number of samples fall
into each bin (Nguyen & O’Connor, 2015; Mukhoti et al., 2020). We report the ECE (%) with 15
bins calculated according to the latter, so-called adaptive binning procedure. We compute the 95%
confidence intervals using 100 bootstrap samples as in Kumar et al. (2019). We consider multiple
versions of the ECE metric based on the Lp norm and the type of calibration (top-label, marginal,
canonical). Top-label calibration error only considers the probability of the predicted class, marginal
requires per-class calibration and the canonical is the highest form of calibration which requires the
entire probability vector to be calibrated. We report L1 and L2 ECE in the marginal and canonical case. Additional experiments with top-label and marginal calibration on both CIFAR-10 and CIFAR-100 can be found in Appendix B.
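A minimal NumPy sketch of the adaptive-binning ECE described above (our naming; `conf` holds the top-label confidences and `correct` the 0/1 correctness indicators):

```python
import numpy as np

def adaptive_ece(conf, correct, n_bins=15):
    # Top-label ECE with adaptive binning: split the sorted confidences into
    # n_bins groups of (approximately) equal size, then take the weighted
    # average of |accuracy - mean confidence| over the bins.
    order, n = np.argsort(conf), len(conf)
    ece = 0.0
    for idx in np.array_split(order, n_bins):
        if idx.size == 0:
            continue
        ece += (idx.size / n) * abs(correct[idx].mean() - conf[idx].mean())
    return ece
```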
**Hyperparameters** A crucial parameter for KDE is the bandwidth, a positive number that defines
the smoothness of the density plot. Poorly chosen bandwidth may lead to undersmoothing (small
bandwidth) or oversmoothing (large bandwidth). A commonly used non-parametric bandwidth selector is maximum likelihood cross validation (Duin, 1976). For our experiments we choose the
bandwidth from a list of possible values by maximizing the leave-one-out likelihood. The λ parameter weighting the calibration error w.r.t. the loss is typically chosen via cross-validation or using
a holdout validation set. The p parameter is chosen depending on the desired Lp calibration error
and the corresponding theoretical guarantees.
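A sketch of the leave-one-out likelihood bandwidth selection, reusing the `dirichlet_kernel` helper from Section 3 (the candidate grid below is illustrative):

```python
import torch

def select_bandwidth(probs, candidates=(1e-4, 1e-3, 1e-2, 1e-1)):
    # Pick the bandwidth maximizing the leave-one-out log-likelihood of the
    # Dirichlet KDE fitted to the predicted probability vectors.
    n = probs.shape[0]
    Z = probs.clamp_min(1e-6)
    best_h, best_ll = None, -float("inf")
    for h in candidates:
        K = dirichlet_kernel(Z.unsqueeze(1), Z.unsqueeze(0), h)
        K = K * (1.0 - torch.eye(n, device=probs.device))    # leave one out
        ll = torch.log(K.sum(dim=1) / (n - 1)).sum().item()  # LOO log-likelihood
        if ll > best_ll:
            best_h, best_ll = h, ll
    return best_h
```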
5 RESULTS AND DISCUSSION
5.1 BINARY CLASSIFICATION
We construct a binary experiment by splitting the CIFAR-10 classes into 2 classes: vehicles (plane,
automobile, ship, truck) and animals (bird, cat, deer, dog, frog, horse). Figure 1a shows how the
choice of the bandwidth parameter influences the shape of the estimate.
[Figure: (a) Effect of the bandwidth b, comparing KDE fits with b = 0.001, 0.01, 0.1 against a histogram from samples; (b) Effect of γ, plotting the L2 calibration error against the MSE for KDE-MSE and MSE models.]

Figure 1: Calibration regularized training using MSE loss and CE₂
Figure 1b shows the effect of the regularization parameter γ on the performance of a ResNet-110
model. The orange point represents a model trained with the MSE loss, and the blue points (KDE-MSE) correspond to models trained with the MSE loss regularized by an L2 calibration error for different values of γ. As expected, the calibration regularized training decreases the L2 calibration error at
the cost of slightly increased error.
5.2 EVALUATING CANONICAL CALIBRATION
Accurately evaluating the calibration error is another crucial step towards designing trustworthy
models that can be used in high-cost settings. In spite of its numerous flaws discussed in Vaicenavicius et al. (2019); Ding et al. (2020); Ashukha et al. (2021), such as its sensitivity to the binning
scheme, the histogram-based estimator remains the most widely used metric for evaluating miscalibration. Another downside of the binned estimator is its inability to capture canonical calibration
due to the curse of dimensionality, as the number of bins grows exponentially with the number of
classes. Therefore, because of its favourable scaling properties, we propose using our Dirichlet
kernel density estimate as an alternative metric (KDE-ECE) to measure calibration.
To investigate its relationship with the commonly used binned estimator, we first introduce an extension of the top-label binned estimator to the probability simplex in the three-class setting. We start
by partitioning the probability simplex into equally-sized, triangle-shaped bins and assign the probability scores to the corresponding bin, as shown in Figure 2a. Then, we define the binned estimate
of canonical calibration error as follows:
CE_p(f)^p ≈ E[ ‖H(f(x)) − f(x)‖_p^p ] ≈ (1/n) Σ_{i=1}^{n} ‖ H(f(x_i)) − f(x_i) ‖_p^p,    (21)
where H(f(x_i)) is the histogram estimate, shown in Figure 2b. The surface of the corresponding
Dirichlet KDE is presented in Figure 2c. In Figure 3 we show that the KDE-ECE estimates of the three types of calibration closely correspond to their histogram-based approximations. Each point in the plot represents a ResNet-56 model trained on a different subset of three classes from CIFAR-10. See Appendix C for another example of the binned estimator and Dirichlet KDE on CIFAR-10 and an experiment with a varying number of points used for the density estimation.
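For the three-class case, the following NumPy sketch shows one way to implement the triangular binning and the binned estimate of Equation 21; the indexing convention is a plausible reading of Figure 2a (our own construction), not necessarily the exact scheme behind the figures:

```python
import numpy as np

def simplex_bin_ids(probs, M=4):
    # Assign 3-class probability vectors to the M*M congruent triangular bins
    # obtained by subdividing each simplex edge M times (16 bins for M = 4).
    a, b = probs[:, 0], probs[:, 1]
    i = np.minimum((a * M).astype(int), M - 1)
    j = np.minimum((b * M).astype(int), M - 1)
    upper = ((a * M - i) + (b * M - j)) > 1.0     # upward- vs. downward-pointing
    return i * 2 * M + 2 * j + upper.astype(int)  # unique id per triangle

def binned_canonical_ce(probs, y_onehot, M=4, p=1):
    # Binned canonical CE (Eq. 21): replace f(x_i) by the per-bin
    # mean label H(f(x_i)) and average the resulting L_p deviations.
    ids = simplex_bin_ids(probs, M)
    ce = 0.0
    for b in np.unique(ids):
        mask = ids == b
        H = y_onehot[mask].mean(axis=0)           # histogram estimate for the bin
        ce += (np.abs(H - probs[mask]) ** p).sum()
    return ce / len(probs)
```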
[Figure: (a) Splitting the simplex in 16 bins; (b) Histogram; (c) Dirichlet KDE.]
Figure 2: Extension of the binned estimator to the probability simplex, compared with the KDE-ECE. The KDE-ECE achieves a better approximation to the finite sample, and accurately models the fact that samples tend to be concentrated near low dimensional faces of the simplex.
[Figure: scatter plots of binned ECE against KDE ECE for (a) canonical, (b) marginal, and (c) top-label calibration.]
Figure 3: Relationship between the KDE-ECE estimates and their corresponding binned approximations on the three types of calibration. Each point represents a ResNet-56 model trained on a subset
of three classes from CIFAR-10. The 3000 probability scores of the test set are assigned to 25 bins
with adaptive width for the binned estimate. A bandwidth of 0.001 is used for KDE-ECE.
5.3 MULTICLASS CLASSIFICATION
In this section we evaluate models trained jointly with the cross-entropy loss and our proposed KDE-based ECE estimator (KDE-CRE) against other baselines in a multiclass setting on CIFAR-10 and
CIFAR-100. We found that for KDE-CRE, values of λ ∈ [0.01, 0.1] provide a good trade-off in
terms of accuracy and calibration error. Table 1 summarizes the accuracy and marginal L1 ECE%
(computed using 15 bins), measured across multiple architectures. For MMCE, we report the results
with λ = 1 and for KDE-CRE we use λ = 0.01. An analogous table measuring marginal L2 ECE
is given in Appendix B.
Table 1: Accuracy and marginal L1 ECE (%) computed with 15 bins for different loss functions
and architectures, both trained from scratch (Pre T) and after temperature scaling on a validation set
(Post T). Best results are marked in bold.
(Columns: CIFAR-10 for the first four architectures, CIFAR-100 for the last four.)

| Loss | Metric | | ResNet | ResNet (SD) | Wide-ResNet | DenseNet | ResNet | ResNet (SD) | Wide-ResNet | DenseNet |
|---|---|---|---|---|---|---|---|---|---|---|
| CRE | ECE | Pre T | 0.419 | 0.357 | **0.241** | 0.236 | 0.129 | 0.100 | **0.086** | **0.090** |
| | | Post T | 0.282 | 0.250 | 0.278 | **0.165** | 0.114 | **0.089** | **0.105** | **0.078** |
| | Acc | Pre T | 0.925 | **0.926** | **0.957** | 0.947 | **0.700** | **0.728** | **0.803** | 0.756 |
| | | Post T | **0.927** | 0.925 | **0.957** | 0.947 | **0.700** | **0.729** | **0.801** | 0.758 |
| MMCE | ECE | Pre T | **0.250** | 0.390 | 0.265 | **0.193** | 0.143 | 0.100 | 0.120 | 0.123 |
| | | Post T | 0.361 | 0.308 | 0.291 | 0.235 | 0.121 | 0.093 | 0.109 | 0.124 |
| | Acc | Pre T | **0.929** | 0.925 | 0.947 | 0.944 | 0.693 | 0.723 | 0.767 | 0.748 |
| | | Post T | 0.926 | **0.926** | 0.949 | 0.945 | 0.691 | 0.722 | 0.770 | 0.743 |
| FL-53 | ECE | Pre T | 0.403 | 0.416 | 0.414 | 0.259 | 0.145 | 0.120 | 0.125 | 0.095 |
| | | Post T | 0.272 | 0.267 | 0.437 | 0.220 | 0.124 | 0.107 | 0.106 | 0.081 |
| | Acc | Pre T | 0.922 | 0.920 | 0.936 | **0.948** | 0.695 | 0.711 | 0.760 | 0.752 |
| | | Post T | 0.923 | 0.919 | 0.936 | **0.949** | 0.693 | 0.712 | 0.763 | 0.753 |
| L1 KDE-CRE | ECE | Pre T | 0.363 | **0.338** | 0.289 | 0.296 | **0.128** | **0.096** | 0.092 | 0.099 |
| | | Post T | **0.182** | **0.220** | **0.226** | 0.248 | **0.104** | 0.095 | 0.108 | 0.085 |
| | Acc | Pre T | 0.926 | 0.925 | 0.953 | 0.943 | 0.697 | 0.725 | 0.796 | **0.757** |
| | | Post T | **0.927** | 0.925 | 0.953 | 0.944 | 0.698 | 0.720 | 0.793 | **0.759** |
We notice that for both pre and post temperature scaling, KDE-CRE achieves very competitive ECE
scores. Another encouraging observation is that the improvement of calibration error comes at almost no cost in accuracy. An important advantage of our KDE-based method is the ability to directly
train and evaluate canonical calibration. In Figure 4 we show a scatter plot with confidence intervals
of the L1 and L2 KDE-CRE models for canonical calibration and the other baselines on CIFAR-10.
We measure the canonical calibration using our KDE-ECE metric from Section 5.2. In three of the
architectures, both L1 and L2 KDE-CRE either dominate or are statistically tied with cross-entropy
(CRE). Similarly, Figure 5 shows a scatter plot of L1 and L2 KDE-CRE models trained to minimize
marginal calibration error. In this case, we measure L2 marginal ECE with the standard binned estimator. In most cases, our methods Pareto dominate the other baselines. A general observation can be
made, however, that the models trained with cross-entropy have a surprisingly low marginal calibration error, contrary to previous findings that show poor calibration when considering only the most
confident prediction (top-label calibration). An additional experiment comparing the CRE baseline
with KDE-CRE for canonical calibration on a benchmark dataset of histological images of human
colorectal cancer is given in Appendix D.2, which clearly illustrates the superior performance of our
method, both in terms of accuracy and calibration error in this context.
To summarize, the experiments show that our estimator consistently produces calibration errors competitive with other state-of-the-art approaches, while maintaining accuracy and keeping the computational complexity at O(n²). We evaluate the computational overhead of CRE and KDE-CRE and summarize the results in a table in Appendix D.1, which shows that the added cost is less than a couple of percent. There are several limitations in the current work: a larger scale benchmarking effort would be beneficial for exploring the limits of canonical calibration using Dirichlet kernels.
Furthermore, while we showed consistency of our estimator, we did not fully derive and implement
its debiasing. Due to space constraints, this was not the focus of the paper and is left for future work.
6 CONCLUSION
In this paper, we proposed a consistent and differentiable estimator of an Lp calibration error using
Dirichlet kernels. The KDE-based estimate can be directly optimized alongside any loss function in
the existing batch stochastic gradient descent framework. Furthermore, we propose using it as a measure of the highest form of calibration, which requires the entire probability vector to be calibrated.
We showed empirically on a range of neural architectures that the performance of our estimator
in terms of accuracy and calibration error is competitive against the current state-of-the-art, while
having superior properties as a consistent estimator of canonical calibration error.
[Figure: canonical calibration error against accuracy on CIFAR-10 for (a) ResNet-110, (b) ResNet-110 (SD), (c) Wide-ResNet-28-10, and (d) DenseNet-40, comparing CRE, FL, MMCE, L1 KDE-CRE, and L2 KDE-CRE.]
Figure 4: Canonical calibration on CIFAR-10
[Figure: marginal calibration error against accuracy on CIFAR-100 for (a) ResNet-110, (b) ResNet-110 (SD), (c) Wide-ResNet-28-10, and (d) DenseNet-40, comparing the same methods.]
Figure 5: Marginal calibration on CIFAR-100
REFERENCES
A. Appice, P. Rodrigues, V. S. Costa, C. Soares, João Gama, and A. Jorge. Novel decompositions of proper scoring rules for classification: Score adjustment as precursor to calibration. 2015.
Arsenii Ashukha, Alexander Lyzhov, Dmitry Molchanov, and Dmitry Vetrov. Pitfalls of in-domain
uncertainty estimation and ensembling in deep learning, 2021.
Peter L. Bartlett, Michael I. Jordan, and Jon D. Mcauliffe. Convexity, classification, and risk bounds.
_Journal of the American Statistical Association, 101(473):138–156, 2006._
Taoufik Bouezmarni and Jean-Marie Rolin. Consistency of the beta kernel density function estimator. The Canadian Journal of Statistics / La Revue Canadienne de Statistique, 31(1):89–98,
2003.
Jochen Bröcker. Reliability, sufficiency, and the decomposition of proper scores. Quarterly Journal of the Royal Meteorological Society, 135(643):1512–1519, Jul 2009.
Song Xi Chen. Beta kernel estimators for density functions. _Computational Statistics & Data_
_Analysis, 31:131–145, 1999._
M. Degroot and S. Fienberg. The comparison and evaluation of forecasters. The Statistician, 32:
12–22, 1983.
Yukun Ding, Jinglan Liu, Jinjun Xiong, and Yiyu Shi. Revisiting the evaluation of uncertainty estimation and its application to explore model complexity-uncertainty trade-off. arXiv:1903.02050,
2020.
Robert Duin. On the choice of smoothing parameters for parzen estimators of probability density
functions. IEEE Transactions on Computers, C-25(11):1175–1179, 1976.
Thomas S. Ferguson. U-statistics. In Notes for Statistics 200C. UCLA, 2005.
Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. On calibration of modern neural
networks, 2017.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. arXiv:1512.03385, 2015.
Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Weinberger. Deep networks with stochastic depth. arXiv:1603.09382, 2016.
Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. Densely connected
convolutional networks, 2018.
Jakob Kather, Cleo-Aron Weis, Francesco Bianconi, Susanne Melchers, Lothar Schad, Timo Gaiser, Alexander Marx, and Frank Zöllner. Multi-class texture analysis in colorectal cancer histology. Scientific Reports, 6:27988, 06 2016. doi: 10.1038/srep27988.
Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University
of Toronto, 2009.
Volodymyr Kuleshov and Percy S Liang. Calibrated structured prediction. In C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc., 2015.
Meelis Kull, Miquel Perello-Nieto, Markus Kängsepp, Telmo Silva Filho, Hao Song, and Peter Flach. Beyond temperature scaling: Obtaining well-calibrated multiclass probabilities with Dirichlet calibration. arXiv:1910.12656, 2019.
Ananya Kumar, Percy S Liang, and Tengyu Ma. Verified uncertainty calibration. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (eds.), Advances in Neural Information Processing Systems 32, pp. 3792–3803. 2019.
-----
Aviral Kumar, Sunita Sarawagi, and Ujjwal Jain. Trainable calibration measures for neural networks
from kernel mean embeddings. In ICML, 2018.
Gongbo Liang, Yu Zhang, Xiaoqin Wang, and Nathan Jacobs. Improved trainable calibration method
for neural networks on medical imaging classification. In British Machine Vision Conference,
2020.
Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. arXiv:1708.02002, 2018.
Jishnu Mukhoti, Viveka Kulharia, Amartya Sanyal, Stuart Golodetz, Philip H. S. Torr, and Puneet K.
Dokania. Calibrating deep neural networks using focal loss. arXiv:2002.09437, 2020.
A. Murphy. A new vector partition of the probability score. Journal of Applied Meteorology, 12:
595–600, 1973.
Rafael Müller, Simon Kornblith, and Geoffrey Hinton. When does label smoothing help? arXiv:1906.02629, 2020.
Mahdi Pakdaman Naeini and Gregory F. Cooper. Binary classifier calibration using an ensemble of
near isotonic regression models. arXiv:1511.05191, 2015.
Mahdi Pakdaman Naeini, Gregory F. Cooper, and Milos Hauskrecht. Obtaining well calibrated
probabilities using Bayesian binning. In Proceedings of the Twenty-Ninth AAAI Conference on
_Artificial Intelligence, pp. 2901–2907, 2015._
Khanh Nguyen and Brendan O’Connor. Posterior calibration and exploratory analysis for natural
language processing models. arXiv:1508.05154, 2015.
Frédéric Ouimet and Raimon Tolosana-Delgado. Asymptotic properties of Dirichlet kernel density estimators. arXiv:2002.06956, 2021.
Emanuel Parzen. On estimation of a probability density function and mode. The Annals of Mathematical Statistics, 33(3):1065–1076, 1962.
Gabriel Pereyra, George Tucker, Jan Chorowski, Łukasz Kaiser, and Geoffrey Hinton. Regularizing
neural networks by penalizing confident output distributions. arXiv:1701.06548, 2017.
John C. Platt. Probabilistic outputs for support vector machines and comparisons to regularized
likelihood methods. In Advances in Large Margin Classifiers, pp. 61–74. MIT Press, 1999.
Murray Rosenblatt. Remarks on some nonparametric estimates of a density function. The Annals of
_Mathematical Statistics, 27(3):832 – 837, 1956._
Alastair Scott and Chien-Fu Wu. On the asymptotic distribution of ratio and regression estimators.
_Journal of the American Statistical Association, 76(373):98–102, 1981._
B. W. Silverman. Density Estimation for Statistics and Data Analysis. Chapman & Hall, 1986.
Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. arXiv:1512.00567, 2015.
Juozas Vaicenavicius, David Widmann, Carl Andersson, Fredrik Lindsten, Jacob Roll, and Thomas B. Schön. Evaluating model calibration in classification. arXiv:1902.06977, 2019.
Jonathan Wenger, Hedvig Kjellström, and Rudolph Triebel. Non-parametric calibration for classification. In International Conference on Artificial Intelligence and Statistics, pp. 178–190, 2020.
B. Zadrozny and C. Elkan. Transforming classifier scores into accurate multiclass probability estimates. Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2002.
Bianca Zadrozny and Charles Elkan. Obtaining calibrated probability estimates from decision trees
and naive bayesian classifiers. ICML, 1, 05 2001.
Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In British Machine Vision
_Conference, 2016._
Jize Zhang, Bhavya Kailkhura, and T. Yong-Jin Han. Mix-n-match: Ensemble and compositional
methods for uncertainty calibration in deep learning. In International Conference on Machine
_Learning, 2020._
Shunpu Zhang and Rohana Karunamuni. Boundary performance of the beta kernel estimators.
_Journal of Nonparametric Statistics, 22:81–104, 01 2010._
A DERIVATION OF THE MSE DECOMPOSITION
**Definition A.1 (Mean Squared Error (MSE)). The mean squared error of an estimator is**
MSE(f) := E[(f(x) − y)²].    (22)

**Proposition A.2.** MSE(f) ≥ CE₂(f)²

_Proof._

MSE(f) := E[(f(x) − y)²] = E[((f(x) − E[y | f(x)]) + (E[y | f(x)] − y))²]    (23)

= E[(f(x) − E[y | f(x)])²] + E[(E[y | f(x)] − y)²] + 2E[(f(x) − E[y | f(x)])(E[y | f(x)] − y)],    (24)

where the first term is CE₂(f)², which implies

MSE(f) − CE₂(f)² = E[(E[y | f(x)] − y)²] + 2E[(f(x) − E[y | f(x)])(E[y | f(x)] − y)]    (25)

= E[(E[y | f(x)] − y)²] + 2E[f(x) E[y | f(x)]] − 2E[f(x) y] − 2E[E[y | f(x)]²] + 2E[E[y | f(x)] y]    (26)

= E[E[y | f(x)]²] + E[y²] − 2E[E[y | f(x)] y] + 2E[f(x) E[y | f(x)]] − 2E[f(x) y] − 2E[E[y | f(x)]²] + 2E[E[y | f(x)] y]    (27)

= E[y²] + 2E[f(x) E[y | f(x)]] − 2E[f(x) y] − E[E[y | f(x)]²]    (28)

= E[(2f(x) − y − E[y | f(x)])(E[y | f(x)] − y)]    (29)

= E[(f(x) − y)(E[y | f(x)] − y)] + E[(f(x) − E[y | f(x)])(E[y | f(x)] − y)].    (30)

By the law of total expectation, we can write the above as

MSE(f) − CE₂(f)² = E[ E[(f(x) − y)(E[y | f(x)] − y) + (f(x) − E[y | f(x)])(E[y | f(x)] − y) | f(x)] ].    (31)

Focusing on the inner conditional expectation, and using that y is Bernoulli given f(x), we have

E[(f(x) − y)(E[y | f(x)] − y) + (f(x) − E[y | f(x)])(E[y | f(x)] − y) | f(x)]
= E[y | f(x)] (f(x) − 1)(E[y | f(x)] − 1) + (1 − E[y | f(x)]) f(x) E[y | f(x)]
+ E[y | f(x)] (f(x) − E[y | f(x)])(E[y | f(x)] − 1)
+ (1 − E[y | f(x)]) (f(x) − E[y | f(x)]) E[y | f(x)]    (32)
= (1 − E[y | f(x)]) E[y | f(x)] ≥ 0 for all f(x),    (33)

and therefore

MSE(f) − CE₂(f)² = E[(1 − E[y | f(x)]) E[y | f(x)]] ≥ 0.    (34)

The expectation in Equation 34 is over variances of Bernoulli random variables with probabilities E[y | f(x)].
B RESULTS
Table 2 summarizes the marginal L2 ECE and accuracy for the two datasets across multiple architectures and training loss functions. The scatter plots in Figures 6 and 7 show the accuracy and both L1 and L2 ECE, for top-label and marginal calibration on CIFAR-10 and CIFAR-100, respectively. KDE-CRE is trained by directly minimizing the metric that is evaluated, e.g., in the first column we minimize marginal L1 calibration error and in the last column we optimize the L2 top-label calibration error. Other methods do not have the flexibility of choosing the type of calibration and the Lp norm.
Table 2: Accuracy and marginal L2 ECE (%) computed with 15 bins for different approaches,
trained from scratch (Pre T) and after temperature scaling (Post T).
(Columns: CIFAR-10 for the first four architectures, CIFAR-100 for the last four.)

| Loss | Metric | | ResNet | ResNet (SD) | Wide-ResNet | DenseNet | ResNet | ResNet (SD) | Wide-ResNet | DenseNet |
|---|---|---|---|---|---|---|---|---|---|---|
| CRE | ECE | Pre T | 0.020 | 0.009 | 0.007 | 0.008 | 0.002 | 0.002 | 0.001 | 0.001 |
| | | Post T (NLL) | 0.007 | 0.005 | 0.008 | 0.004 | 0.002 | 0.001 | 0.001 | 0.001 |
| | Acc | Pre T | 0.925 | 0.926 | 0.950 | 0.947 | 0.700 | 0.728 | 0.797 | 0.756 |
| | | Post T (NLL) | 0.927 | 0.925 | 0.950 | 0.947 | 0.700 | 0.729 | 0.794 | 0.758 |
| MMCE | ECE | Pre T | 0.009 | 0.015 | 0.009 | 0.004 | 0.003 | 0.001 | 0.003 | 0.003 |
| | | Post T (NLL) | 0.013 | 0.009 | 0.009 | 0.005 | 0.002 | 0.001 | 0.002 | 0.003 |
| | Acc | Pre T | 0.929 | 0.925 | 0.947 | 0.944 | 0.693 | 0.723 | 0.767 | 0.748 |
| | | Post T (NLL) | 0.926 | 0.926 | 0.949 | 0.945 | 0.691 | 0.722 | 0.770 | 0.743 |
| FL-53 | ECE | Pre T | 0.013 | 0.020 | 0.026 | 0.005 | 0.003 | 0.002 | 0.003 | 0.002 |
| | | Post T (NLL) | 0.008 | 0.009 | 0.022 | 0.004 | 0.002 | 0.002 | 0.002 | 0.001 |
| | Acc | Pre T | 0.922 | 0.920 | 0.936 | 0.948 | 0.695 | 0.711 | 0.760 | 0.752 |
| | | Post T (NLL) | 0.923 | 0.919 | 0.936 | 0.949 | 0.693 | 0.712 | 0.763 | 0.753 |
| L2 KDE-CRE | ECE | Pre T | 0.010 | 0.015 | 0.007 | 0.008 | 0.002 | 0.002 | 0.001 | 0.001 |
| | | Post T (NLL) | 0.004 | 0.012 | 0.008 | 0.009 | 0.002 | 0.002 | 0.001 | 0.001 |
| | Acc | Pre T | 0.930 | 0.922 | 0.950 | 0.943 | 0.707 | 0.713 | 0.797 | 0.757 |
| | | Post T (NLL) | 0.930 | 0.921 | 0.950 | 0.944 | 0.707 | 0.717 | 0.794 | 0.755 |
C RELATIONSHIP BETWEEN THE BINNED ESTIMATOR AND THE KERNEL DENSITY ESTIMATOR
Figure 8 shows an example of the binned estimator in a three-class setting on CIFAR-10. The points are mostly concentrated at the edges of the histogram, as can be seen from Figure 8b. The surface of the corresponding Dirichlet KDE is given in Figure 8c.

Figure 9 shows the relationship between the binned estimator and our KDE-ECE metric. The points represent ResNet-56 models trained on a subset of three classes from CIFAR-10. In every row, a different number of points was used to estimate the KDE-ECE.
D EXPERIMENTS FOR REBUTTAL
D.1 TRAINING TIME MEASUREMENTS
In Table 3 we summarize the running time per epoch for training with (KDE-CRE) and without
(CRE) regularization for the two datasets and four architectures. KDE-CRE does not create an
overhead of more than a couple percent over the CRE baseline.
D.2 CANONICAL CALIBRATION IN A MEDICAL APPLICATION
An additional experiment with a medical application, where the canonical calibration is of particular
interest, was performed on the publicly-available Kather dataset (Kather et al., 2016), which consists
of 5000 histological images of human colorectal cancer. The data has eight different classes of tissue.
Figure 10 shows a comparison in performance of the CRE baseline with our KDE-CRE method. The
canonical L1 (left) and L2 (right) calibration is measured using our KDE-ECE metric. The results
clearly illustrate that our method significantly outperforms the cross-entropy baseline, both in terms
of accuracy and calibration error, for several choices of the regularization parameter.
D.3 BIAS AND CONVERGENCE RATES
Figure 11 shows a comparison of the ground truth, computed from 3000 test points with KDE-ECE, against KDE-ECE and binned ECE estimated with a varying number of points used for the estimation.
[Figure: top-label and marginal calibration (ECE vs. accuracy) on CIFAR-10 for DenseNet, ResNet, ResNet (SD), and Wide-ResNet, comparing CRE, FL, MMCE, L1 KDE-CRE, and L2 KDE-CRE.]

Figure 6: Top-label and marginal calibration on CIFAR-10.
Table 3: Training time [sec] per epoch for Cross-Entropy and KDE-CE methods for different models
and datasets.
| Dataset | Model | CRE | L1 KDE-CRE |
|---|---|---|---|
| CIFAR-10 | ResNet-110 | 51.8 | 53 |
| | ResNet-110 (SD) | 45 | 46 |
| | Wide-ResNet-28-10 | 152.9 | 154.9 |
| | DenseNet-40 | 103.2 | 106.8 |
| CIFAR-100 | ResNet-110 | 90 | 92.9 |
| | ResNet-110 (SD) | 78.2 | 80.7 |
| | Wide-ResNet-28-10 | 150.5 | 155.3 |
| | DenseNet-40 | 101 | 105.5 |
The model used is a ResNet-56, trained on a subset of three classes from CIFAR-10. The figure shows that the two estimates are comparable and both do a reasonable job.

Figure 12 shows the absolute difference between the ground truth and the estimated ECE using our KDE estimator and a binned estimator with a varying number of points used for estimation. The results are averaged over 120 ResNet-56 models trained on a subset of three classes from CIFAR-10.
-----
[Figure: top-label and marginal calibration (ECE vs. accuracy) on CIFAR-100 for DenseNet, ResNet, ResNet (SD), and Wide-ResNet, comparing the same methods.]

Figure 7: Top-label and marginal calibration on CIFAR-100
(a) Splitting the simplex in 16 bins (b) Corresponding histogram (c) Corresponding Dirichlet KDE
Figure 8: An example of a simplex binned estimator and a kernel density estimator for CIFAR-10, averaged over 120 ResNet-56 models trained on a subset of three classes from CIFAR-10. Both estimators are biased and have some variance, and the plot shows that the combination of the two is of the same order of magnitude. The empirical convergence rates (slopes of the log-log plot) are given in the legend and are close to the theoretically expected value of -0.5.
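For concreteness, the Dirichlet KDE in panel (c) can be sketched as follows. This is a minimal NumPy illustration, not the exact experimental code: we assume a Dirichlet kernel Dir(u; p_i/h + 1) centered at each predicted probability vector p_i with bandwidth h, and the function names are ours.

```python
import numpy as np
from scipy.special import gammaln

def dirichlet_logpdf(u, alpha):
    """Log-density of Dir(alpha) evaluated at a point u on the simplex."""
    return (gammaln(alpha.sum()) - gammaln(alpha).sum()
            + ((alpha - 1.0) * np.log(u)).sum())

def dirichlet_kde(u, probs, h=0.1, eps=1e-12):
    """KDE on the simplex: average of Dirichlet kernels Dir(u; p_i/h + 1)
    centered at the predicted probability vectors `probs` (shape n x K)."""
    u = np.clip(u, eps, 1.0)
    u = u / u.sum()                       # keep the query strictly inside the simplex
    alphas = probs / h + 1.0              # one concentration vector per sample
    log_kernels = np.array([dirichlet_logpdf(u, a) for a in alphas])
    m = log_kernels.max()                 # log-mean-exp for numerical stability
    return np.exp(m) * np.exp(log_kernels - m).mean()

# Toy usage: density of 3-class predictions at the simplex barycenter.
rng = np.random.default_rng(0)
probs = rng.dirichlet(alpha=[2.0, 2.0, 2.0], size=500)
print(dirichlet_kde(np.array([1/3, 1/3, 1/3]), probs, h=0.1))
```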
D.4 CHOICE OF THE BATCH SIZE
In Figure 13 we investigate the choice of the batch size on CIFAR-10. To this end, we use two differently shuffled dataloaders that draw random batches from the same training set. The first dataloader provides the batches for the loss term (CRE), while the second dataloader provides the batches for the regularization (KDE). The batch size for the loss term is fixed in all experiments, while the batch size for the regularization varies. As a comparison, the orange point shows our usual experimental set-up with just one dataloader (i.e., the same points are used for the loss and the KDE-ECE computation). The plot shows that our chosen batch size of 128 is appropriate for our purposes.
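The following is a minimal PyTorch-style sketch of this two-dataloader set-up; `kde_ece` stands for a differentiable estimator of the calibration error computed from softmax outputs and labels (it is not spelled out here), and the function and argument names are ours rather than the exact experimental code.

```python
import torch
from torch.utils.data import DataLoader

def train_epoch(model, train_set, optimizer, kde_ece, lam=0.2,
                loss_batch=128, reg_batch=128):
    """One epoch with separate, differently shuffled batches for the
    risk term and the calibration regularizer."""
    loss_loader = DataLoader(train_set, batch_size=loss_batch, shuffle=True)
    reg_loader = DataLoader(train_set, batch_size=reg_batch, shuffle=True)
    for (x, y), (x_reg, y_reg) in zip(loss_loader, reg_loader):
        optimizer.zero_grad()
        risk = torch.nn.functional.cross_entropy(model(x), y)  # loss term (CRE)
        probs = torch.softmax(model(x_reg), dim=1)              # regularization batch
        loss = risk + lam * kde_ece(probs, y_reg)               # Risk + lambda * CalibrationError
        loss.backward()
        optimizer.step()
```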
[Figure 9 panels: Binned ECE versus KDE ECE for canonical, marginal, and top-label calibration, using 100, 500, and 1000 points, 25 bins, and a bandwidth of 0.001.]
Figure 9: Relationship between the ECE metric based on binning (Binned ECE) and on kernel density estimation (KDE-ECE) for the three types of calibration: canonical, marginal, and top-label. In every row, a different number of points is used to approximate the KDE-ECE.
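To make the compared quantity concrete, here is a minimal sketch of how such a KDE-ECE value can be computed: the canonical L1 variant, assuming a Dirichlet kernel with concentration p_i/h + 1 and a leave-one-out kernel-regression estimate of E[y | p]. Names are ours and this is a simplification, not the exact experimental code.

```python
import numpy as np
from scipy.special import gammaln

def kde_ece_l1(probs, labels_onehot, h=0.001):
    """Leave-one-out L1 KDE-ECE: for each prediction p_j, estimate
    E[y | p_j] by Dirichlet-kernel regression on the remaining points,
    then average the L1 distance to p_j.
    probs: (n, K) probability vectors, labels_onehot: (n, K)."""
    n, K = probs.shape
    alphas = probs / h + 1.0                                    # kernel centers -> concentrations
    logp = np.log(np.clip(probs, 1e-12, 1.0))                   # (n, K)
    # Log kernel matrix: entry (i, j) = log Dir(p_j; alpha_i).
    log_norm = gammaln(alphas.sum(1)) - gammaln(alphas).sum(1)  # (n,)
    log_kern = log_norm[:, None] + (alphas - 1.0) @ logp.T      # (n, n)
    np.fill_diagonal(log_kern, -np.inf)                         # leave one out
    w = np.exp(log_kern - log_kern.max(axis=0, keepdims=True))  # stabilize per column
    w = w / w.sum(axis=0, keepdims=True)                        # column j sums to 1
    cond_mean = w.T @ labels_onehot                             # (n, K): estimate of E[y | p_j]
    return np.abs(cond_mean - probs).sum(axis=1).mean()
```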
[Figure 10 panels: calibration error versus accuracy (ACC), comparing CRE, L1 KDE-CRE, and L2 KDE-CRE; points are annotated with the regularization weight.]
Figure 10: Canonical calibration on Kather using a Resnet-50 model
[Figure 11 panels: (a) Canonical, (b) Marginal, (c) Top-label; estimated ECE versus the number of points (# points), comparing the ground truth, KDE-ECE, and Binned ECE.]
Figure 11: KDE-ECE estimates and their corresponding binned approximations for the three types of calibration, for a varying number of points used for the estimation. The ground truth is calculated using 3000 probability scores from the test set. For the binned estimate, the points are assigned to 25 bins with adaptive width. A bandwidth of 0.001 is used for KDE-ECE.
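The "adaptive width" binning can be read as equal-mass binning of the confidences, i.e., bin edges placed at quantiles so that each bin holds roughly the same number of points. A minimal sketch under that assumption (top-label case; names are ours):

```python
import numpy as np

def adaptive_binned_ece(conf, correct, n_bins=25):
    """Top-label ECE with equal-mass ('adaptive width') bins: bin edges are
    quantiles of the confidences, so each bin holds ~n/n_bins points.
    conf: (n,) top-label confidences, correct: (n,) boolean correctness."""
    edges = np.quantile(conf, np.linspace(0.0, 1.0, n_bins + 1))
    idx = np.clip(np.searchsorted(edges, conf, side="right") - 1, 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            # weight each bin by its mass, compare accuracy to mean confidence
            ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return ece
```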
[Figure 12 panels: (a) Canonical, (b) Marginal, (c) Top-label; absolute estimation error versus the number of points on log-log axes, with the fitted slopes of KDE-ECE and Binned ECE given in the legend.]
Figure 12: Absolute difference between the ground truth and the estimated ECE for a varying number of points used for the estimation. The ground truth is calculated using 3000 probability scores from the test set. For the binned estimate, the points are assigned to 25 bins with adaptive width. A bandwidth of 0.001 is used for KDE-ECE. Note that the axes are on a log scale.
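The empirical convergence rates shown in such log-log plots can be obtained as the slope of a least-squares fit of log-error against log-sample-size; a small self-contained sketch:

```python
import numpy as np

def empirical_rate(n_points, abs_errors):
    """Slope of log|error| vs. log(n): the empirical convergence rate,
    e.g. about -0.5 for an n^(-1/2) rate."""
    slope, _ = np.polyfit(np.log(n_points), np.log(abs_errors), deg=1)
    return slope

# Toy usage: errors decaying like n^(-1/2) give a slope near -0.5.
n = np.array([100, 200, 400, 800, 1600])
print(empirical_rate(n, 1.0 / np.sqrt(n)))  # ~ -0.5
```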
[Figure 13 panels: calibration error versus accuracy (ACC), with points annotated by the regularization batch size (32, 64, 128, 256, 512), comparing L1 KDE-CRE and L2 KDE-CRE; upper row marginal, lower row top-label.]
Figure 13: Training with different batches for the loss and the regularization (2 KDE-CRE), where the batch size for the loss is fixed and the batch size for the regularization varies. The orange point shows our usual experimental set-up, where we train with only one batch (KDE-CRE). Upper row: marginal; lower row: top-label.