# CALIBRATION REGULARIZED TRAINING OF DEEP NEURAL NETWORKS USING DIRICHLET KERNEL DENSITY ESTIMATION

**Anonymous authors** Paper under double-blind review

ABSTRACT

Calibrated probabilistic classifiers are models whose predicted probabilities can directly be interpreted as uncertainty estimates. This property is particularly important in safety-critical applications such as medical diagnosis or autonomous driving. However, it has been shown recently that deep neural networks are poorly calibrated and tend to output overconfident predictions. As a remedy, we propose a trainable calibration error estimator based on Dirichlet kernel density estimates, which asymptotically converges to the true $L_p$ calibration error. This novel estimator enables us to achieve the strongest notion of multiclass calibration, called canonical calibration, while other common calibration methods only allow for top-label and marginal calibration. The empirical results show that our estimator is competitive with the state-of-the-art, consistently yielding tradeoffs between calibration error and accuracy that are (near) Pareto optimal across a range of network architectures. The computational complexity of our estimator is $O(n^2)$, matching that of the kernel maximum mean discrepancy, used in a previously considered trainable calibration estimator (Kumar et al., 2018). By contrast, the proposed method has a natural choice of kernel, and can be used to generate consistent estimates of other quantities based on conditional expectation, such as the sharpness of an estimator.

1 INTRODUCTION

Deep neural networks have shown tremendous success in classification tasks, being regularly the best performing models in terms of accuracy. However, they are also known to make overconfident predictions (Guo et al., 2017), which is particularly problematic in safety-critical applications such as medical diagnosis or autonomous driving. Therefore, in many real world applications we do not just care about the predictive performance, but also about the trustworthiness of that prediction; that is, we are interested in accurate predictions with robust uncertainty estimates. To this end, we want our models to be uncertainty calibrated, which means that, for instance, among all cells that have been predicted with a probability of 0.8 to be cancerous, in fact a fraction of 80% belong to a malignant tumor.

Being calibrated, however, does not imply that the classifier achieves good accuracy. For instance, a classifier that always predicts the marginal distribution of the target class is calibrated, but will not be very useful in practice. Likewise, good predictive performance does not ensure calibration. In particular, for a broad class of loss functions, risk minimization leads to asymptotically Bayes optimal classifiers (Bartlett et al., 2006). However, there is no guarantee that they are calibrated, even in the asymptotic limit. Therefore, we consider minimizing the risk plus a term that penalizes miscalibration, i.e., $\mathrm{Risk} + \lambda \cdot \mathrm{CalibrationError}$. For parameter values $\lambda > 0$, this will push the classifier towards a calibrated model, while maintaining similar accuracy. The existence of such a $\lambda > 0$ is suggested by the fact that there always exists at least one Bayes optimal classifier that is calibrated, namely $P(y \mid x)$. To optimize the risk and the calibration error jointly, we propose a differentiable and consistent estimator of the expected $L_p$ calibration error based on kernel density estimates (KDEs).
In particular, we use a Beta kernel in binary classification tasks and a Dirichlet kernel in the multiclass setting, as these kernels are the natural choices for density estimation over a probability simplex. Our Dirichlet kernel based estimator allows for the estimation of canonical calibration, which is the strongest notion of multiclass calibration as it implies the calibration of the whole probability vector (Bröcker, 2009; Appice et al., 2015; Vaicenavicius et al., 2019). By contrast, most other state-of-the-art methods only achieve weaker versions of multiclass calibration, namely top-label (Guo et al., 2017) and marginal or class-wise calibration (Kull et al., 2019). Top-label calibration only considers the scores for the predicted class, while for marginal calibration the multiclass problem is split up into $K$ one-vs-all binary ones, each of which is required to be calibrated according to the definition of binary calibration. In many applications marginal and canonical calibration are preferable to top-label calibration, since we often care about having reliable uncertainty estimates for more than just one class per prediction. For instance, in medical diagnosis we do not just care about the most likely disease a certain patient might have, but also about the probabilities of other diseases.

Our contributions can be summarized as follows:

1. We develop a trainable calibration error objective using Dirichlet kernel density estimates, which can be minimized alongside any loss function in the existing batch stochastic gradient descent framework.
2. We propose to use our estimator to evaluate canonical calibration. Due to the scaling properties of Dirichlet kernel density estimation, and the tendency for probabilities to be concentrated in a relatively small number of classes, this becomes feasible in cases that cannot be estimated using a binned estimator.
3. We show on a variety of network architectures and two datasets that DNNs trained alongside an estimator of the calibration error achieve competitive results both on existing metrics and on the proposed measure of canonical calibration.

2 RELATED WORK

Calibration of probabilistic predictors has long been studied in many fields. The topic gained attention in the deep learning community following the observation in Guo et al. (2017) that modern neural networks are poorly calibrated and tend to give overconfident predictions due to overfitting on the NLL loss. The surge of interest resulted in many calibration strategies that can be split into two general categories, which we discuss subsequently.

**Post-hoc calibration strategies** learn a calibration map of the predictions from a trained predictor in a post-hoc manner. For instance, Platt scaling (Platt, 1999) fits a logistic regression model on top of the logit outputs of the model. A special case of Platt scaling that fits a single scalar, called temperature, has been popularized by Guo et al. (2017) as an accuracy-preserving, easy to implement and effective method to improve calibration. However, it has the undesired consequence that it clamps the high confidence scores of accurate predictions (Kumar et al., 2018). Other approaches for post-hoc calibration include histogram binning (Zadrozny & Elkan, 2001), isotonic regression (Zadrozny & Elkan, 2002), and Bayesian binning into quantiles (Naeini & Cooper, 2015).

**Trainable calibration strategies** integrate a differentiable calibration measure into the training objective.
One of the earliest approaches is regularization by penalizing low entropy predictions (Pereyra et al., 2017). Similarly to temperature scaling, it has been shown that entropy regularization needlessly suppresses high confidence scores of correct predictions (Kumar et al., 2018). Another popular strategy is MMCE (Maximum Mean Calibration Error) (Kumar et al., 2018), where the entropy regularizer is replaced by a kernel-based surrogate for the calibration error that can be optimized alongside NLL. It has been shown that label smoothing (Szegedy et al., 2015; Müller et al., 2020), i.e. training models with a weighted mixture of the labels instead of one-hot vectors, also improves model calibration. Liang et al. (2020) propose to add the difference between predicted confidence and accuracy as an auxiliary term to the cross-entropy loss. Focal loss (Mukhoti et al., 2020; Lin et al., 2018) has recently been empirically shown to produce better calibrated models than many of the alternatives, but does not estimate a clear quantity related to calibration error.

**Kernel density estimation** (Parzen, 1962; Rosenblatt, 1956) is a non-parametric method to estimate a probability density function from a finite sample. Zhang et al. (2020) propose a KDE-based estimator of the calibration error for measuring calibration performance. However, they use the triweight kernel, which has a limited support interval and is therefore applicable to binary classification, but does not have a natural extension to higher dimensional simplexes, in contrast to the Dirichlet kernel that we consider here. As a result, they consider an unnatural proxy to the marginal calibration error, which does not result in a consistent estimator.

3 METHODS

The most commonly used loss functions are designed to achieve consistency in the sense of Bayes optimality under risk minimization; however, they do not guarantee calibration, neither for finite samples nor in the asymptotic limit. Since we are interested in models $f$ that are both accurate and calibrated, we consider the following optimization problem bounding the calibration error $\mathrm{CE}(f)$:

$$f = \arg\min_{f \in \mathcal{F}} \mathrm{Risk}(f), \quad \text{s.t. } \mathrm{CE}(f) \leq B \qquad (1)$$

for some $B > 0$, and its associated Lagrangian

$$f = \arg\min_{f \in \mathcal{F}} \big( \mathrm{Risk}(f) + \lambda \cdot \mathrm{CE}(f) \big). \qquad (2)$$

We measure the (mis-)calibration in terms of the $L_p$ calibration error. To this end, let $(\Omega, \mathcal{A}, P)$ be a probability space, let $\mathcal{X} = \mathbb{R}^d$ and $\mathcal{Y} = \{0, 1, \dots, K\}$. Let $x: \Omega \to \mathcal{X}$ and $y: \Omega \to \mathcal{Y}$ be random variables, while realizations are denoted with subscripts. Furthermore, let $f: \mathcal{X} \to \triangle^K$ be a decision function, where $\triangle^K$ denotes the $K$ dimensional simplex, as is obtained e.g. from the output of a final softmax layer in a neural network.

**Definition 3.1** (Calibration error (Naeini et al., 2015; Kumar et al., 2019; Wenger et al., 2020)). The $L_p$ calibration error of $f$ is:

$$\mathrm{CE}_p(f) = \Big( \mathbb{E}\big[ \big\| \mathbb{E}[y \mid f(x)] - f(x) \big\|_p^p \big] \Big)^{1/p}. \qquad (3)$$

We note that we consider multiclass calibration, and that $f(x)$ and the conditional expectation in Equation 3 therefore map to points on a probability simplex. We say that a classifier $f$ is perfectly calibrated if $\mathrm{CE}_p(f) = 0$. Kumar et al. (2018) have also considered a minimization problem similar to Equation 2. Instead of using $\mathrm{CE}_p$, they use a metric called maximum mean calibration error (MMCE) that is 0 if and only if $\mathrm{CE}_p = 0$. However, it is unclear how MMCE relates to the canonical multiclass setting or to the norm parameter $p$ for non-zero $\mathrm{CE}_p$.
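To fix ideas before developing the estimator, the following minimal PyTorch-style sketch (our own illustration, not from the paper; all names are ours) shows how the Lagrangian objective in Equation 2 would be evaluated on a mini-batch, with `cal_error_fn` standing in for the differentiable calibration-error estimator derived below:

```python
import torch
import torch.nn.functional as F

def calibration_regularized_loss(logits, targets, lam, cal_error_fn):
    """Empirical version of Eq. (2): Risk(f) + lambda * CE(f).

    cal_error_fn: any differentiable calibration-error estimator,
    e.g. the Dirichlet-KDE estimator sketched later in this section.
    """
    probs = torch.softmax(logits, dim=-1)    # f(x), a point on the simplex
    risk = F.cross_entropy(logits, targets)  # NLL as the risk term
    return risk + lam * cal_error_fn(probs, targets)
```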
To optimize Definition 3.1 directly, we need to perform density estimation over the probability simplex so as to empirically compute the conditional expectation. In a binary setting, this has traditionally been done with binned estimates (Naeini et al., 2015; Guo et al., 2017; Kumar et al., 2019). However, this is not differentiable w.r.t. the function $f$, and cannot be incorporated into a gradient based training procedure. Furthermore, binned estimates suffer from the curse of dimensionality and do not have a practical extension to multiclass settings. A natural choice for a differentiable kernel density estimator in the binary case is a kernel based on the Beta distribution, and the extension to the multiclass case is given by the Dirichlet distribution. Hence, we consider an estimator for $\mathrm{CE}_p$ based on Beta and Dirichlet kernel density estimates in the binary and multiclass setting, respectively. We require that this estimator is consistent and differentiable such that we can train it according to Equation 2. This estimator is given by:

$$\widehat{\mathrm{CE}}_p(f)^p = \frac{1}{n} \sum_{h=1}^{n} \Big\| \widehat{\mathbb{E}}[y \mid f(x)]\big|_{f(x_h)} - f(x_h) \Big\|_p^p \qquad (4)$$

where $\widehat{\mathbb{E}}[y \mid f(x)]\big|_{f(x_h)}$ denotes $\widehat{\mathbb{E}}[y \mid f(x)]$ evaluated at $f(x) = f(x_h)$. If $P_{x,y}$ has a probability density $p_{x,y}$ with respect to the product of the Lebesgue and counting measure, we can define $p_{x,y}(x_i, y_i) = p_{y|x=x_i}(y_i)\, p_x(x_i)$. Then we define the estimator of the conditional expectation as follows:

$$\mathbb{E}[y \mid f(x)] = \sum_{y_k \in \mathcal{Y}} y_k\, p_{y|x=f(x)}(y_k) = \frac{\sum_{y_k \in \mathcal{Y}} y_k\, p_{x,y}(f(x), y_k)}{p_x(f(x))} \qquad (5)$$

$$\approx \frac{\sum_{i=1}^{n} k(f(x); f(x_i))\, y_i}{\sum_{i=1}^{n} k(f(x); f(x_i))} =: \widehat{\mathbb{E}}[y \mid f(x)] \qquad (6)$$

where $k$ is the kernel of a kernel density estimate evaluated at point $x_i$.

**Proposition 3.2.** $\widehat{\mathbb{E}}[y \mid f(x)]$ is a pointwise consistent estimator of $\mathbb{E}[y \mid f(x)]$, that is:

$$\lim_{n\to\infty} \frac{\sum_{i=1}^{n} k(f(x); f(x_i))\, y_i}{\sum_{i=1}^{n} k(f(x); f(x_i))} = \frac{\sum_{y_k \in \mathcal{Y}} y_k\, p_{x,y}(f(x), y_k)}{p_x(f(x))}. \qquad (7)$$

*Proof.* By the consistency of kernel density estimators (Silverman, 1986; Chen, 1999; Ouimet & Tolosana-Delgado, 2021), for all $f(x) \in (0, 1)$, $\frac{1}{n}\sum_{i=1}^{n} k(f(x); f(x_i))\, y_i \to \sum_{y_k \in \mathcal{Y}} y_k\, p_{x,y}(f(x), y_k)$ and $\frac{1}{n}\sum_{i=1}^{n} k(f(x); f(x_i)) \to p_x(f(x))$ as $n \to \infty$. The fact that the ratio of two convergent sequences converges to the ratio of their limits shows the result.
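As an illustration of Equation 6, a minimal NumPy sketch (our own rendering; the names and structure are assumptions, not the authors' released code) of the ratio estimator for a generic kernel $k(\cdot;\cdot)$ is:

```python
import numpy as np

def conditional_expectation_kde(scores, labels, kernel):
    """Estimate E[y | f(x)] at each score f(x_h) via Eq. (6).

    scores: (n, K) predicted probability vectors f(x_i)
    labels: (n, K) one-hot labels y_i
    kernel: kernel(z, zi) -> scalar, a density kernel on the simplex
    """
    n = scores.shape[0]
    # Pairwise kernel weights k(f(x_h); f(x_i)); O(n^2), as noted in the text
    W = np.array([[kernel(scores[h], scores[i]) for i in range(n)]
                  for h in range(n)])
    num = W @ labels                       # sum_i k(.;.) * y_i
    den = W.sum(axis=1, keepdims=True)     # sum_i k(.;.)
    return num / den                       # hat{E}[y | f(x_h)], row h
```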
**Mean squared error in binary classification** As a first instantiation of our framework we consider a binary classification setting, with the mean squared error $\mathrm{MSE}(f) = \mathbb{E}[(f(x) - y)^2]$ as the risk function, jointly optimized with the $L_2$ calibration error $\mathrm{CE}_2$. Following Murphy (1973); Degroot & Fienberg (1983); Kuleshov & Liang (2015); Nguyen & O'Connor (2015), we decompose (full derivation in Appendix A) the MSE as:

$$\mathrm{MSE}(f) - \mathrm{CE}_2(f)^2 = \mathbb{E}\big[ \big(1 - \mathbb{E}[y \mid f(x)]\big)\, \mathbb{E}[y \mid f(x)] \big] \geq 0. \qquad (8)$$

Similar to Equation 2, we consider the optimization problem for some $\lambda > 0$:

$$f = \arg\min_{f \in \mathcal{F}} \big( \mathrm{MSE}(f) + \lambda\, \mathrm{CE}_2(f)^2 \big). \qquad (9)$$

Using Equation 8 we rewrite:

$$\mathrm{MSE}(f) + \lambda\, \mathrm{CE}_2(f)^2 = (1 + \lambda)\, \mathrm{MSE}(f) - \lambda \big( \mathrm{MSE}(f) - \mathrm{CE}_2(f)^2 \big) \qquad (10)$$

$$= (1 + \lambda)\, \mathrm{MSE}(f) - \lambda\, \mathbb{E}\big[ \big(1 - \mathbb{E}[y \mid f(x)]\big)\, \mathbb{E}[y \mid f(x)] \big]. \qquad (11)$$

Rescaling Equation 11 by a factor of $(1 + \lambda)^{-1}$ and applying the variable substitution $\gamma = \frac{\lambda}{1 + \lambda} \in [0, 1)$ yields

$$f = \arg\min_{f \in \mathcal{F}} \big( \mathrm{MSE}(f) + \lambda\, \mathrm{CE}_2(f)^2 \big) = \arg\min_{f \in \mathcal{F}} \Big( \mathrm{MSE}(f) - \gamma\, \mathbb{E}\big[ \big(1 - \mathbb{E}[y \mid f(x)]\big)\, \mathbb{E}[y \mid f(x)] \big] \Big) \qquad (12)$$

$$= \arg\min_{f \in \mathcal{F}} \Big( \mathrm{MSE}(f) + \gamma\, \mathbb{E}\big[ \mathbb{E}[y \mid f(x)]^2 \big] \Big). \qquad (13)$$

For optimization we wish to find an estimator for $\mathbb{E}\big[\mathbb{E}[y \mid f(x)]^2\big]$. Building upon Equation 6, a partially debiased estimator can be written as:¹

$$\mathbb{E}\big[ \mathbb{E}[y \mid f(x)]^2 \big] \approx \frac{1}{n} \sum_{h=1}^{n} \frac{ \Big( \sum_{i \neq h} k(f(x_h); f(x_i))\, y_i \Big)^2 - \sum_{i \neq h} \big( k(f(x_h); f(x_i))\, y_i \big)^2 }{ \Big( \sum_{i \neq h} k(f(x_h); f(x_i)) \Big)^2 - \sum_{i \neq h} \big( k(f(x_h); f(x_i)) \big)^2 }. \qquad (14)$$

In a binary setting, the kernels $k(\cdot, \cdot)$ are Beta distributions, i.e., denoting $z_i := f(x_i)$ for short:

$$k_{\mathrm{Beta}}(z, z_i) := z^{\alpha_i - 1} (1 - z)^{\beta_i - 1}\, \frac{\Gamma(\alpha_i + \beta_i)}{\Gamma(\alpha_i)\, \Gamma(\beta_i)}, \qquad (15)$$

with $\alpha_i = \frac{z_i}{h} + 1$ and $\beta_i = \frac{1 - z_i}{h} + 1$ (Chen, 1999; Bouezmarni & Rolin, 2003; Zhang & Karunamuni, 2010), where $h$ is a bandwidth parameter in the kernel density estimate that goes to 0 as $n \to \infty$. We note that the computational complexity of this estimator is $O(n^2)$. Within the gradient descent training procedure, the density is estimated using a mini-batch, and therefore the $O(n^2)$ complexity is w.r.t. a mini-batch, not the entire dataset. The estimator in Equation 14 is a ratio of two second order U-statistics that converge as $n^{-1/2}$ (Ferguson, 2005). Therefore, the overall convergence will be $n^{-1/2}$. Empirical convergence rates are calculated in Appendix D.3 and shown to be close to the theoretically expected value.

¹We have debiased the numerator and denominator individually (Ferguson, 2005, Section 2), but for simplicity have not corrected for the fact that we are estimating a ratio (Scott & Wu, 1981).
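For concreteness, a NumPy/SciPy sketch (our own illustration under the definitions above; the boundary clamp `1e-12` is an implementation assumption) of the Beta kernel in Equation 15 and of the leave-one-out estimator in Equation 14:

```python
import numpy as np
from scipy.special import gammaln

def beta_kernel(z, zi, h):
    """Beta kernel of Eq. (15): alpha_i = z_i/h + 1, beta_i = (1 - z_i)/h + 1."""
    z = np.clip(z, 1e-12, 1.0 - 1e-12)       # guard log(0) at the boundary
    a, b = zi / h + 1.0, (1.0 - zi) / h + 1.0
    log_norm = gammaln(a + b) - gammaln(a) - gammaln(b)
    return np.exp(log_norm + (a - 1.0) * np.log(z) + (b - 1.0) * np.log(1.0 - z))

def debiased_second_moment(scores, labels, h):
    """Partially debiased estimate of E[E[y | f(x)]^2], following Eq. (14).

    scores: (n,) binary scores f(x_i); labels: (n,) labels y_i in {0, 1}.
    """
    n = len(scores)
    total = 0.0
    for j in range(n):
        mask = np.arange(n) != j             # leave-one-out sums over i != j
        w = beta_kernel(scores[j], scores[mask], h)
        wy = w * labels[mask]
        num = wy.sum() ** 2 - np.sum(wy ** 2)
        den = w.sum() ** 2 - np.sum(w ** 2)
        total += num / den
    return total / n
```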
**Multiclass calibration with Dirichlet kernel density estimates** There are multiple definitions of multiclass calibration that differ in their strictness regarding the calibration of the probability vector $f(x)$. The weakest notion is top-label calibration, which, as the name suggests, only cares about calibrating the entry with the highest predicted probability, which reduces to a binary calibration problem again (Guo et al., 2017). Marginal or class-wise calibration (Kull et al., 2019) is the most commonly used definition of multiclass calibration and a stronger version of top-label calibration. Here, the problem is split into $K$ one-vs-all binary calibration settings, such that each class has to be calibrated against the other $K - 1$ classes:

$$\mathrm{MCE}_p(f)^p = \sum_{k=1}^{K} \mathbb{E}\Big[ \big( \mathbb{E}[y = k \mid f(x)_k] - f(x)_k \big)^p \Big]. \qquad (16)$$

An estimator for this calibration error is:

$$\widehat{\mathrm{MCE}}_p(f)^p = \frac{1}{n} \sum_{j=1}^{n} \sum_{k=1}^{K} \bigg( \frac{ \sum_{i \neq j} k_{\mathrm{Beta}}(f(x_j)_k; f(x_i)_k)\, [y_i]_k }{ \sum_{i \neq j} k_{\mathrm{Beta}}(f(x_j)_k; f(x_i)_k) } - f(x_j)_k \bigg)^p. \qquad (17)$$

The strongest notion of multiclass calibration, and the one that we want to consider in this paper, is called canonical calibration (Bröcker, 2009; Appice et al., 2015; Vaicenavicius et al., 2019). Here it is required that the whole probability vector $f(x)$ is calibrated. The definition is exactly the one from Definition 3.1. Its estimator is:

$$\widehat{\mathrm{CE}}_p(f)^p = \frac{1}{n} \sum_{j=1}^{n} \bigg\| \frac{ \sum_{i \neq j} k_{\mathrm{Dir}}(f(x_j); f(x_i))\, y_i }{ \sum_{i \neq j} k_{\mathrm{Dir}}(f(x_j); f(x_i)) } - f(x_j) \bigg\|_p^p \qquad (18)$$

where $k_{\mathrm{Dir}}$ is a Dirichlet kernel defined as:

$$k_{\mathrm{Dir}}(z, z_i) := \frac{\Gamma\big( \sum_{j=1}^{K} \alpha_{ij} \big)}{\prod_{j=1}^{K} \Gamma(\alpha_{ij})} \prod_{j=1}^{K} z_j^{\alpha_{ij} - 1} \qquad (19)$$

with $\alpha_i = z_i / h + 1$ (Ouimet & Tolosana-Delgado, 2021). As before, the computational complexity is $O(n^2)$ irrespective of $p$. This estimator is differentiable and, furthermore, the following proposition holds:

**Proposition 3.3.** The Dirichlet kernel based CE estimator is consistent, that is

$$\lim_{n\to\infty} \frac{1}{n} \sum_{j=1}^{n} \bigg\| \frac{ \sum_{i \neq j} k_{\mathrm{Dir}}(f(x_j); f(x_i))\, y_i }{ \sum_{i \neq j} k_{\mathrm{Dir}}(f(x_j); f(x_i)) } - f(x_j) \bigg\|_p^p = \mathbb{E}\big[ \big\| \mathbb{E}[y \mid f(x)] - f(x) \big\|_p^p \big]. \qquad (20)$$

*Proof.* Dirichlet kernel estimators are consistent (Ouimet & Tolosana-Delgado, 2021); consequently, by Proposition 3.2 the term inside the norm is consistent for any fixed $f(x_j)$ (note that summing over $i \neq j$ ensures that the ratio of the KDEs does not depend on the outer summation). Moreover, for any convergent sequence also the norm of that sequence converges to the norm of its limit. Ultimately, the outer sum is merely the sample mean of consistent summands, which again is consistent.
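A minimal differentiable sketch of Equations 18 and 19 (our own PyTorch rendering; the paper does not prescribe a framework, and the clamping constant is an implementation assumption):

```python
import torch

def log_dirichlet_kernel(z, zi, h):
    """log k_Dir(z; z_i) from Eq. (19), with alpha = z_i / h + 1."""
    z = z.clamp_min(1e-8)                    # guard log(0) at the simplex boundary
    alpha = zi / h + 1.0
    log_norm = torch.lgamma(alpha.sum(-1)) - torch.lgamma(alpha).sum(-1)
    return log_norm + ((alpha - 1.0) * z.log()).sum(-1)

def canonical_ce_kde(probs, onehot, h, p=2):
    """Leave-one-out Dirichlet-KDE estimate of CE_p(f)^p, Eq. (18).

    probs:  (n, K) softmax outputs f(x_i), rows on the simplex
    onehot: (n, K) one-hot labels y_i (float tensor)
    """
    n = probs.shape[0]
    # logK[j, i] = log k_Dir(f(x_j); f(x_i)) via broadcasting
    logK = log_dirichlet_kernel(probs.unsqueeze(1), probs.unsqueeze(0), h)
    eye = torch.eye(n, dtype=torch.bool, device=probs.device)
    K = logK.exp().masked_fill(eye, 0.0)     # drop i == j terms
    cond = (K @ onehot) / K.sum(dim=1, keepdim=True)   # hat{E}[y | f(x_j)]
    return ((cond - probs).abs() ** p).sum(dim=-1).mean()
```

During training this would be evaluated per mini-batch, so the $O(n^2)$ pairwise kernel matrix is over the batch size rather than the whole dataset, and the returned value can be added to the risk term as in Equation 2.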
4 EMPIRICAL SETUP

We trained ResNet (He et al., 2015), ResNet with stochastic depth (SD) (Huang et al., 2016), DenseNet (Huang et al., 2018) and WideResNet (Zagoruyko & Komodakis, 2016) networks on CIFAR-10 and CIFAR-100 (Krizhevsky, 2009). We use 45000 images for training. The code will be released upon acceptance.

**Baselines** *Cross-entropy:* The first baseline model is trained using cross-entropy with the data preprocessing, training procedure and hyperparameters described in the corresponding paper for the architecture. *Trainable calibration strategies:* MMCE (Kumar et al., 2018) is a differentiable measure of calibration with the property that it is minimized at perfect calibration. It is used as a regularizer alongside NLL, with the strength of regularization parameterized by $\lambda$. Focal loss (Mukhoti et al., 2020) is an alternative to the popular cross-entropy loss, defined as $L_f = -(1 - f(y|x))^{\gamma} \log f(y|x)$, where $\gamma$ is a hyperparameter and $f(y|x)$ is the probability score that a neural network $f$ outputs for a class $y$ on an input $x$. Their best-performing approach is the sample-dependent FL-53, where $\gamma = 5$ for $f(y|x) \in [0, 0.2)$ and $\gamma = 3$ otherwise, followed by the method with fixed $\gamma = 3$. *Post-hoc calibration strategies:* Guo et al. (2017) investigated the performance of several post-hoc calibration methods and found temperature scaling to be a strong baseline, which we use as a representative of this group. It works by scaling the logits with a scalar $T > 0$, typically learned on a validation set by minimizing NLL. Following Kumar et al. (2018); Mukhoti et al. (2020), we also use temperature scaling as a post-processing step for our method.

**Metrics** The most widely-used metric for expected calibration error (ECE) is a binned estimator (Naeini et al., 2015), which divides the interval $[0, 1]$ into bins of equal width and then calculates a weighted average of the absolute difference between accuracy and confidence for each bin. A better binning scheme involves determining the bin sizes so that an equal number of samples fall into each bin (Nguyen & O'Connor, 2015; Mukhoti et al., 2020). We report the ECE (%) with 15 bins calculated according to the latter, so-called adaptive binning procedure. We compute the 95% confidence intervals using 100 bootstrap samples as in Kumar et al. (2019). We consider multiple versions of the ECE metric based on the $L_p$ norm and the type of calibration (top-label, marginal, canonical). Top-label calibration error only considers the probability of the predicted class, marginal requires per-class calibration, and canonical is the strongest form of calibration, which requires the entire probability vector to be calibrated. We report $L_1$ and $L_2$ ECE in the marginal and canonical case. Additional experiments with top-label and marginal calibration on both CIFAR-10 and CIFAR-100 can be found in Appendix B.

**Hyperparameters** A crucial parameter for KDE is the bandwidth, a positive number that defines the smoothness of the density estimate. A poorly chosen bandwidth may lead to undersmoothing (small bandwidth) or oversmoothing (large bandwidth). A commonly used non-parametric bandwidth selector is maximum likelihood cross validation (Duin, 1976). For our experiments we choose the bandwidth from a list of possible values by maximizing the leave-one-out likelihood. The $\lambda$ parameter for weighting the calibration error w.r.t. the loss is typically chosen via cross-validation or using a holdout validation set. The $p$ parameter is chosen depending on the desired $L_p$ calibration error and the corresponding theoretical guarantees.
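To make the bandwidth selection concrete, a small sketch of maximum likelihood cross validation over a candidate list (our own illustration; the candidate grid and the guard constant are assumptions, and `kernel` can be e.g. the `beta_kernel` sketched above):

```python
import numpy as np

def loo_log_likelihood(scores, kernel, h):
    """Leave-one-out log-likelihood of a KDE with bandwidth h."""
    n = len(scores)
    ll = 0.0
    for j in range(n):
        mask = np.arange(n) != j
        dens = kernel(scores[j], scores[mask], h).mean()  # LOO density at scores[j]
        ll += np.log(max(dens, 1e-300))                   # guard against log(0)
    return ll

def select_bandwidth(scores, kernel, candidates=(1e-4, 1e-3, 1e-2, 1e-1)):
    """Pick the bandwidth maximizing the leave-one-out likelihood (Duin, 1976)."""
    return max(candidates, key=lambda h: loo_log_likelihood(scores, kernel, h))
```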
5 RESULTS AND DISCUSSION

5.1 BINARY CLASSIFICATION

We construct a binary experiment by splitting the CIFAR-10 classes into 2 classes: vehicles (plane, automobile, ship, truck) and animals (bird, cat, deer, dog, frog, horse). Figure 1a shows how the choice of the bandwidth parameter influences the shape of the estimate.

[Figure 1: Calibration regularized training using MSE loss and CE₂. (a) Effect of the bandwidth b: KDE fits for b = 0.001, 0.01, 0.1 against a histogram from samples. (b) Effect of γ: KDE-MSE models against the MSE baseline.]

Figure 1b shows the effect of the regularization parameter $\gamma$ on the performance of a ResNet-110 model. The orange point represents a model trained with MSE loss, and the blue points (KDE-MSE) correspond to models trained with MSE loss regularized by an $L_2$ calibration error for different values of $\gamma$. As expected, the calibration regularized training decreases the $L_2$ calibration error at the cost of a slightly increased error.

5.2 EVALUATING CANONICAL CALIBRATION

Accurately evaluating the calibration error is another crucial step towards designing trustworthy models that can be used in high-cost settings. In spite of its numerous flaws discussed in Vaicenavicius et al. (2019); Ding et al. (2020); Ashukha et al. (2021), such as its sensitivity to the binning scheme, the histogram-based estimator remains the most widely used metric for evaluating miscalibration. Another downside of the binned estimator is its inability to capture canonical calibration due to the curse of dimensionality, as the number of bins grows exponentially with the number of classes.

Therefore, because of its favourable scaling properties, we propose using our Dirichlet kernel density estimate as an alternative metric (KDE-ECE) to measure calibration. To investigate its relationship with the commonly used binned estimator, we first introduce an extension of the top-label binned estimator to the probability simplex in the three class setting. We start by partitioning the probability simplex into equally-sized, triangle-shaped bins and assign the probability scores to the corresponding bin, as shown in Figure 2a. Then, we define the binned estimate of the canonical calibration error as follows:

$$\mathrm{CE}_p(f)^p \approx \mathbb{E}\big[ \| H(f(x)) - f(x) \|_p^p \big] \approx \frac{1}{n} \sum_{i=1}^{n} \| H(f(x_i)) - f(x_i) \|_p^p \qquad (21)$$

where $H(f(x_i))$ is the histogram estimate, shown in Figure 2b. The surface of the corresponding Dirichlet KDE is presented in Figure 2c. In Figure 3 we show that the KDE-ECE estimates of the three types of calibration closely correspond to their histogram-based approximations. Each point in the plot represents a ResNet-56 model trained on a different subset of three classes from CIFAR-10. See Appendix C for another example of the binned estimator and Dirichlet KDE on CIFAR-10 and an experiment with a varying number of points used for the density estimation.

[Figure 2: Extension of the binned estimator to the probability simplex, compared with the KDE-ECE: (a) splitting the simplex into 16 bins, (b) histogram, (c) Dirichlet KDE. The KDE-ECE achieves a better approximation to the finite sample, and accurately models the fact that samples tend to be concentrated near low dimensional faces of the simplex.]

[Figure 3: Relationship between the KDE-ECE estimates and their corresponding binned approximations on the three types of calibration: (a) canonical, (b) marginal, (c) top-label. Each point represents a ResNet-56 model trained on a subset of three classes from CIFAR-10. The 3000 probability scores of the test set are assigned to 25 bins with adaptive width for the binned estimate. A bandwidth of 0.001 is used for KDE-ECE.]
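For reference, a simplified sketch of the binned estimate in Equation 21 (our own illustration; unlike the triangular partition used in the paper, it bins on a rectangular grid over the first $K-1$ simplex coordinates, which is easier to write down but serves the same purpose):

```python
import numpy as np

def binned_canonical_ce(probs, onehot, m=5, p=1):
    """Histogram estimate of Eq. (21) with a grid of width 1/m per coordinate.

    probs: (n, K) probability vectors; onehot: (n, K) one-hot labels.
    Note: rectangular bins here, not the paper's triangular simplex split.
    """
    keys = [tuple(k) for k in np.floor(probs[:, :-1] * m).astype(int)]
    bins = {}
    for key, y in zip(keys, onehot):
        bins.setdefault(key, []).append(y)
    H = {key: np.mean(ys, axis=0) for key, ys in bins.items()}  # E[y | bin]
    err = [np.sum(np.abs(H[key] - f) ** p) for key, f in zip(keys, probs)]
    return np.mean(err)
```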
5.3 MULTICLASS CLASSIFICATION

In this section we evaluate our proposed KDE-based ECE estimator, jointly trained with cross entropy loss (KDE-CRE), against other baselines in a multiclass setting on CIFAR-10 and CIFAR-100. We found that for KDE-CRE, values of $\lambda \in [0.01, 0.1]$ provide a good trade-off in terms of accuracy and calibration error. Table 1 summarizes the accuracy and marginal $L_1$ ECE (%) (computed using 15 bins), measured across multiple architectures. For MMCE, we report the results with $\lambda = 1$ and for KDE-CRE we use $\lambda = 0.01$. An analogous table measuring marginal $L_2$ ECE is given in Appendix B.

Table 1: Accuracy and marginal $L_1$ ECE (%) computed with 15 bins for different loss functions and architectures, both trained from scratch (Pre T) and after temperature scaling on a validation set (Post T). Best results are marked in bold. C10 = CIFAR-10, C100 = CIFAR-100.

| Loss | Metric | T | C10 ResNet | C10 ResNet (SD) | C10 Wide-ResNet | C10 DenseNet | C100 ResNet | C100 ResNet (SD) | C100 Wide-ResNet | C100 DenseNet |
|---|---|---|---|---|---|---|---|---|---|---|
| CRE | ECE | Pre T | 0.419 | 0.357 | **0.241** | 0.236 | 0.129 | 0.100 | **0.086** | **0.090** |
| CRE | ECE | Post T | 0.282 | 0.250 | 0.278 | **0.165** | 0.114 | **0.089** | **0.105** | **0.078** |
| CRE | Acc | Pre T | 0.925 | **0.926** | **0.957** | 0.947 | **0.700** | **0.728** | **0.803** | 0.756 |
| CRE | Acc | Post T | **0.927** | 0.925 | **0.957** | 0.947 | **0.700** | **0.729** | **0.801** | 0.758 |
| MMCE | ECE | Pre T | **0.250** | 0.390 | 0.265 | **0.193** | 0.143 | 0.100 | 0.120 | 0.123 |
| MMCE | ECE | Post T | 0.361 | 0.308 | 0.291 | 0.235 | 0.121 | 0.093 | 0.109 | 0.124 |
| MMCE | Acc | Pre T | **0.929** | 0.925 | 0.947 | 0.944 | 0.693 | 0.723 | 0.767 | 0.748 |
| MMCE | Acc | Post T | 0.926 | **0.926** | 0.949 | 0.945 | 0.691 | 0.722 | 0.770 | 0.743 |
| FL-53 | ECE | Pre T | 0.403 | 0.416 | 0.414 | 0.259 | 0.145 | 0.120 | 0.125 | 0.095 |
| FL-53 | ECE | Post T | 0.272 | 0.267 | 0.437 | 0.220 | 0.124 | 0.107 | 0.106 | 0.081 |
| FL-53 | Acc | Pre T | 0.922 | 0.920 | 0.936 | **0.948** | 0.695 | 0.711 | 0.760 | 0.752 |
| FL-53 | Acc | Post T | 0.923 | 0.919 | 0.936 | **0.949** | 0.693 | 0.712 | 0.763 | 0.753 |
| L1 KDE-CRE | ECE | Pre T | 0.363 | **0.338** | 0.289 | 0.296 | **0.128** | **0.096** | 0.092 | 0.099 |
| L1 KDE-CRE | ECE | Post T | **0.182** | **0.220** | **0.226** | 0.248 | **0.104** | 0.095 | 0.108 | 0.085 |
| L1 KDE-CRE | Acc | Pre T | 0.926 | 0.925 | 0.953 | 0.943 | 0.697 | 0.725 | 0.796 | **0.757** |
| L1 KDE-CRE | Acc | Post T | **0.927** | 0.925 | 0.953 | 0.944 | 0.698 | 0.720 | 0.793 | **0.759** |

We notice that both pre and post temperature scaling, KDE-CRE achieves very competitive ECE scores. Another encouraging observation is that the improvement in calibration error comes at almost no cost in accuracy. An important advantage of our KDE-based method is the ability to directly train and evaluate canonical calibration. In Figure 4 we show a scatter plot with confidence intervals of the $L_1$ and $L_2$ KDE-CRE models for canonical calibration and the other baselines on CIFAR-10. We measure the canonical calibration using our KDE-ECE metric from Section 5.2. In three of the architectures, both $L_1$ and $L_2$ KDE-CRE either dominate or are statistically tied with cross-entropy (CRE). Similarly, Figure 5 shows a scatter plot of $L_1$ and $L_2$ KDE-CRE models trained to minimize the marginal calibration error. In this case, we measure $L_2$ marginal ECE with the standard binned estimator. In most cases, our methods Pareto dominate the other baselines. A general observation can be made, however, that the models trained with cross-entropy have a surprisingly low marginal calibration error, contrary to previous findings that show poor calibration when considering only the most confident prediction (top-label calibration). An additional experiment comparing the CRE baseline with KDE-CRE for canonical calibration on a benchmark dataset of histological images of human colorectal cancer is given in Appendix D.2, which clearly illustrates the superior performance of our method, both in terms of accuracy and calibration error, in this context.

To summarize, the experiments show that our estimator consistently produces calibration errors competitive with other state-of-the-art approaches, while maintaining accuracy and keeping the computational complexity at $O(n^2)$. We evaluate the computational overhead of CRE and KDE-CRE and summarize the results in a table in Appendix D.1, which shows that the added cost is less than a couple of percent. There are several limitations of the current work: a larger scale benchmarking would be beneficial for exploring the limits of canonical calibration using Dirichlet kernels. Furthermore, while we showed consistency of our estimator, we did not fully derive and implement its debiasing. Due to space constraints, this was not the focus of the paper and is left for future work.
6 CONCLUSION

In this paper, we proposed a consistent and differentiable estimator of the $L_p$ calibration error using Dirichlet kernels. The KDE-based estimate can be directly optimized alongside any loss function in the existing batch stochastic gradient descent framework. Furthermore, we propose using it as a measure of the strongest form of calibration, which requires the entire probability vector to be calibrated. We showed empirically on a range of neural architectures that the performance of our estimator in terms of accuracy and calibration error is competitive with the current state-of-the-art, while having superior properties as a consistent estimator of the canonical calibration error.

[Figure 4: Canonical calibration on CIFAR-10. Panels: (a) ResNet-110, (b) ResNet-110 (SD), (c) Wide-ResNet-28-10, (d) DenseNet-40; each panel plots calibration error against accuracy for CRE, FL, MMCE, L1 KDE-CRE and L2 KDE-CRE.]

[Figure 5: Marginal calibration on CIFAR-100. Panels: (a) ResNet-110, (b) ResNet-110 (SD), (c) Wide-ResNet-28-10, (d) DenseNet-40.]

REFERENCES
A. Appice, P. Rodrigues, V. S. Costa, C. Soares, João Gama, and A. Jorge. Novel decompositions of proper scoring rules for classification: Score adjustment as precursor to calibration. 2015.

Arsenii Ashukha, Alexander Lyzhov, Dmitry Molchanov, and Dmitry Vetrov. Pitfalls of in-domain uncertainty estimation and ensembling in deep learning, 2021.

Peter L. Bartlett, Michael I. Jordan, and Jon D. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138–156, 2006.

Taoufik Bouezmarni and Jean-Marie Rolin. Consistency of the beta kernel density function estimator. The Canadian Journal of Statistics / La Revue Canadienne de Statistique, 31(1):89–98, 2003.

Jochen Bröcker. Reliability, sufficiency, and the decomposition of proper scores. Quarterly Journal of the Royal Meteorological Society, 135(643):1512–1519, Jul 2009.

Song Xi Chen. Beta kernel estimators for density functions. Computational Statistics & Data Analysis, 31:131–145, 1999.

M. Degroot and S. Fienberg. The comparison and evaluation of forecasters. The Statistician, 32:12–22, 1983.

Yukun Ding, Jinglan Liu, Jinjun Xiong, and Yiyu Shi. Revisiting the evaluation of uncertainty estimation and its application to explore model complexity-uncertainty trade-off. arXiv:1903.02050, 2020.

Robert Duin. On the choice of smoothing parameters for Parzen estimators of probability density functions. IEEE Transactions on Computers, C-25(11):1175–1179, 1976.

Thomas S. Ferguson. U-statistics. In Notes for Statistics 200C. UCLA, 2005.

Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. On calibration of modern neural networks, 2017.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. arXiv:1512.03385, 2015.

Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Weinberger. Deep networks with stochastic depth. arXiv:1603.09382, 2016.

Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks, 2018.

Jakob Kather, Cleo-Aron Weis, Francesco Bianconi, Susanne Melchers, Lothar Schad, Timo Gaiser, Alexander Marx, and Frank Zöllner. Multi-class texture analysis in colorectal cancer histology. Scientific Reports, 6:27988, 06 2016. doi: 10.1038/srep27988.

Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.

Volodymyr Kuleshov and Percy S Liang. Calibrated structured prediction. In C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc., 2015.

Meelis Kull, Miquel Perello-Nieto, Markus Kängsepp, Telmo Silva Filho, Hao Song, and Peter Flach. Beyond temperature scaling: Obtaining well-calibrated multiclass probabilities with Dirichlet calibration. arXiv:1910.12656, 2019.

Ananya Kumar, Percy S Liang, and Tengyu Ma. Verified uncertainty calibration. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (eds.), Advances in Neural Information Processing Systems 32, pp. 3792–3803. 2019.

Aviral Kumar, Sunita Sarawagi, and Ujjwal Jain. Trainable calibration measures for neural networks from kernel mean embeddings. In ICML, 2018.

Gongbo Liang, Yu Zhang, Xiaoqin Wang, and Nathan Jacobs. Improved trainable calibration method for neural networks on medical imaging classification. In British Machine Vision Conference, 2020.

Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. arXiv:1708.02002, 2018.
Jishnu Mukhoti, Viveka Kulharia, Amartya Sanyal, Stuart Golodetz, Philip H. S. Torr, and Puneet K. Dokania. Calibrating deep neural networks using focal loss. arXiv:2002.09437, 2020.

A. Murphy. A new vector partition of the probability score. Journal of Applied Meteorology, 12:595–600, 1973.

Rafael Müller, Simon Kornblith, and Geoffrey Hinton. When does label smoothing help? arXiv:1906.02629, 2020.

Mahdi Pakdaman Naeini and Gregory F. Cooper. Binary classifier calibration using an ensemble of near isotonic regression models. arXiv:1511.05191, 2015.

Mahdi Pakdaman Naeini, Gregory F. Cooper, and Milos Hauskrecht. Obtaining well calibrated probabilities using Bayesian binning. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, pp. 2901–2907, 2015.

Khanh Nguyen and Brendan O'Connor. Posterior calibration and exploratory analysis for natural language processing models. arXiv:1508.05154, 2015.

Frédéric Ouimet and Raimon Tolosana-Delgado. Asymptotic properties of Dirichlet kernel density estimators. arXiv:2002.06956, 2021.

Emanuel Parzen. On estimation of a probability density function and mode. The Annals of Mathematical Statistics, 33(3):1065–1076, 1962.

Gabriel Pereyra, George Tucker, Jan Chorowski, Łukasz Kaiser, and Geoffrey Hinton. Regularizing neural networks by penalizing confident output distributions. arXiv:1701.06548, 2017.

John C. Platt. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In Advances in Large Margin Classifiers, pp. 61–74. MIT Press, 1999.

Murray Rosenblatt. Remarks on some nonparametric estimates of a density function. The Annals of Mathematical Statistics, 27(3):832–837, 1956.

Alastair Scott and Chien-Fu Wu. On the asymptotic distribution of ratio and regression estimators. Journal of the American Statistical Association, 76(373):98–102, 1981.

B. W. Silverman. Density Estimation for Statistics and Data Analysis. Chapman & Hall, 1986.

Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. arXiv:1512.00567, 2015.

Juozas Vaicenavicius, David Widmann, Carl Andersson, Fredrik Lindsten, Jacob Roll, and Thomas B. Schön. Evaluating model calibration in classification. arXiv:1902.06977, 2019.

Jonathan Wenger, Hedvig Kjellström, and Rudolph Triebel. Non-parametric calibration for classification. In International Conference on Artificial Intelligence and Statistics, pp. 178–190, 2020.

B. Zadrozny and C. Elkan. Transforming classifier scores into accurate multiclass probability estimates. Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, 2002.

Bianca Zadrozny and Charles Elkan. Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers. ICML, 1, 05 2001.

Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In British Machine Vision Conference, 2016.

Jize Zhang, Bhavya Kailkhura, and T. Yong-Jin Han. Mix-n-match: Ensemble and compositional methods for uncertainty calibration in deep learning. In International Conference on Machine Learning, 2020.

Shunpu Zhang and Rohana Karunamuni. Boundary performance of the beta kernel estimators. Journal of Nonparametric Statistics, 22:81–104, 01 2010.

A DERIVATION OF THE MSE DECOMPOSITION

**Definition A.1** (Mean Squared Error (MSE)). The mean squared error of an estimator is

$$\mathrm{MSE}(f) := \mathbb{E}[(f(x) - y)^2]. \qquad (22)$$

**Proposition A.2.** $\mathrm{MSE}(f) \geq \mathrm{CE}_2(f)^2$.
*Proof.* Decomposing the squared error around $\mathbb{E}[y \mid f(x)]$ gives

$$\mathrm{MSE}(f) = \mathbb{E}\Big[ \big( (f(x) - \mathbb{E}[y \mid f(x)]) + (\mathbb{E}[y \mid f(x)] - y) \big)^2 \Big] \qquad (23)$$

$$= \underbrace{\mathbb{E}\big[ (f(x) - \mathbb{E}[y \mid f(x)])^2 \big]}_{=\,\mathrm{CE}_2(f)^2} + \mathbb{E}\big[ (\mathbb{E}[y \mid f(x)] - y)^2 \big] + 2\, \mathbb{E}\big[ (f(x) - \mathbb{E}[y \mid f(x)]) (\mathbb{E}[y \mid f(x)] - y) \big]. \qquad (24)$$

By the law of total expectation, conditioning the cross term on $f(x)$ yields

$$\mathbb{E}\big[ (f(x) - \mathbb{E}[y \mid f(x)]) (\mathbb{E}[y \mid f(x)] - y) \big] = \mathbb{E}\big[ (f(x) - \mathbb{E}[y \mid f(x)]) (\mathbb{E}[y \mid f(x)] - \mathbb{E}[y \mid f(x)]) \big] = 0, \qquad (25)$$

since both $f(x)$ and $\mathbb{E}[y \mid f(x)]$ are measurable with respect to $f(x)$. For the remaining term, $y$ is conditionally a Bernoulli random variable with success probability $\mathbb{E}[y \mid f(x)]$ in the binary setting, so

$$\mathbb{E}\big[ (\mathbb{E}[y \mid f(x)] - y)^2 \big] = \mathbb{E}\big[ \mathrm{Var}(y \mid f(x)) \big] = \mathbb{E}\big[ (1 - \mathbb{E}[y \mid f(x)])\, \mathbb{E}[y \mid f(x)] \big] \geq 0, \qquad (26)$$

and therefore

$$\mathrm{MSE}(f) - \mathrm{CE}_2(f)^2 = \mathbb{E}\big[ (1 - \mathbb{E}[y \mid f(x)])\, \mathbb{E}[y \mid f(x)] \big] \geq 0. \qquad (27)$$

The expectation in Equation 27 is over variances of Bernoulli random variables with probabilities $\mathbb{E}[y \mid f(x)]$.
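As a quick numerical sanity check of the decomposition (our own illustration, on synthetic data where $\mathbb{E}[y \mid f(x)]$ is known by construction), the two sides of Equation 27 can be compared directly:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
f = rng.uniform(0.1, 0.9, n)                      # scores f(x)
eta = np.clip(f + 0.1 * np.sin(6 * f), 0, 1)      # E[y | f(x)], deliberately miscalibrated
y = rng.binomial(1, eta)                          # labels drawn from the true conditional

mse = np.mean((f - y) ** 2)                       # MSE(f)
ce2_sq = np.mean((eta - f) ** 2)                  # CE_2(f)^2, using the known eta
refinement = np.mean(eta * (1 - eta))             # E[(1 - E[y|f]) E[y|f]]

print(mse - ce2_sq, refinement)                   # agree up to sampling noise
```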
**CIFAR-10** **CIFAR-100** **Loss** **Metric** ResNet ResNet (SD) Wide-ResNet DenseNet ResNet ResNet (SD) Wide-ResNet DenseNet Pre T 0.020 0.009 0.007 0.008 0.002 0.002 0.001 0.001 ECE Post T (NLL) 0.007 0.005 0.008 0.004 0.002 0.001 0.001 0.001 **CRE** Pre T 0.925 0.926 0.950 0.947 0.700 0.728 0.797 0.756 Acc Post T (NLL) 0.927 0.925 0.950 0.947 0.700 0.729 0.794 0.758 Pre T 0.009 0.015 0.009 0.004 0.003 0.001 0.003 0.003 ECE Post T (NLL) 0.013 0.009 0.009 0.005 0.002 0.001 0.002 0.003 **MMCE** Pre T 0.929 0.925 0.947 0.944 0.693 0.723 0.767 0.748 Acc Post T (NLL) 0.926 0.926 0.949 0.945 0.691 0.722 0.770 0.743 Pre T 0.013 0.020 0.026 0.005 0.003 0.002 0.003 0.002 ECE Post T (NLL) 0.008 0.009 0.022 0.004 0.002 0.002 0.002 0.001 **FL-53** Pre T 0.922 0.920 0.936 0.948 0.695 0.711 0.760 0.752 Acc Post T (NLL) 0.923 0.919 0.936 0.949 0.693 0.712 0.763 0.753 Pre T 0.010 0.015 0.007 0.008 0.002 0.002 0.001 0.001 ECE Post T (NLL) 0.004 0.012 0.008 0.009 0.002 0.002 0.001 0.001 _L2 KDE-CRE_ Pre T 0.930 0.922 0.950 0.943 0.707 0.713 0.797 0.757 Acc Post T (NLL) 0.930 0.921 0.950 0.944 0.707 0.717 0.794 0.755 C RELATIONSHIP BETWEEN THE BINNED ESTIMATOR AND THE KERNEL DENSITY ESTIMATOR Figure 8 shows an example of the binned estimator in a three-class setting on CIFAR-10. The points are mostly concentrated at the edges of the histogram, as can be seen from Figure 8b. The surface of the corresponding Dirichlet KDE is given in 8c. Figure 9 shows the relationship between the binned estimator and our KDE-ECE metric. The points represent a trained Resnet-56 model on a subset of three classes from CIFAR-10. In every row, a differnt number of points was used to estimate the KDE-ECE. D EXPERIMENTS FOR REBUTTAL D.1 TRAINING TIME MEASUREMENTS In Table 3 we summarize the running time per epoch for training with (KDE-CRE) and without (CRE) regularization for the two datasets and four architectures. KDE-CRE does not create an overhead of more than a couple percent over the CRE baseline. D.2 CANONICAL CALIBRATION IN A MEDICAL APPLICATION An additional experiment with a medical application, where the canonical calibration is of particular interest, was performed on the publicly-available Kather dataset (Kather et al., 2016), which consists of 5000 histological images of human colorectal cancer. The data has eight different classes of tissue. Figure 10 shows a comparison in performance of the CRE baseline with our KDE-CRE method. The canonical L1 (left) and L2 (right) calibration is measured using our KDE-ECE metric. The results clearly illustrate that our method significantly outperforms the cross-entropy baseline, both in terms of accuracy and calibration error, for several choices of the regularization parameter. 
D.3 BIAS AND CONVERGENCE RATES Figure 11 shows a comparison of the groud truth, computed from 3000 test points with KDE-ECE against KDE-ECE and binned ECE estimated with a varying number of points used for the estima ----- Marginal calibration on CIFAR10 using Densenet Top-label calibration on CIFAR10 using Densenet Marginal calibration on CIFAR10 using Densenet Top-label calibration on CIFAR10 using Densenet 0.035 0.030 0.025 0.020 0.015 0.010 0.005 0.000 0.0175 0.0150 0.0125 0.0100 0.0075 0.0050 0.0025 0.0000 0.10 0.08 0.06 0.04 0.02 0.00 0.06 0.05 0.04 0.03 0.02 0.01 0.0030 0.0025 0.0020 0.0015 0.0010 0.0005 0.0000 0.0012 0.0010 0.0008 0.0006 0.0004 0.0002 0.0000 0.020 0.015 0.010 0.005 0.000 0.012 0.010 0.008 0.006 0.004 0.002 |6|0.3|4 0.20.3|MMCE L L1 2 K KD DE E- -C CR RE E 2| |---|---|---|---| |6|0.3|4 0.20.3 0.2|MMCE L L1 2 K KD DE E- -C CR RE E 53 00.0.0 052.1 1 .053| |---|---|---|---| |Col1|10|4|MMCE L L1 2 K KD DE E- -C CR RE E| |---|---|---|---| |6|10|4|MMCE L L1 2 K KD DE E- -C CR RE E 53| |---|---|---|---| 10 CREFL MMCE 6 LL12 KDE-CRE KDE-CRE 4 0.3 0.2 0.3 2 0.2 0.050.10.10.0531.053 0.86 0.88 0.90ACC 0.92 0.94 Marginal calibration on CIFAR10 using Resnet 10 CREFL MMCE LL12 KDE-CRE KDE-CRE 6 4 0.3 0.2 0.3 0.2 0.050.10.10.052 3 53 1.0 0.86 0.88 0.90ACC 0.92 0.94 Top-label calibration on CIFAR10 using Resnet CRE 6 FLMMCE 10 LL12 KDE-CRE KDE-CRE 4 0.3 0.2 0.3 0.2 0.050.10.10.052 31.053 0.86 0.88 0.90ACC 0.92 0.94 Marginal calibration on CIFAR10 using Resnet CRE FL 10 MMCELL12 KDE-CRE KDE-CRE 6 53 4 0.3 0.2 0.3 0.2 0.050.10.10.052 31.0 0.86 0.88 0.90ACC 0.92 0.94 Top-label calibration on CIFAR10 using Resnet |Col1|Col2|10|CRE FL| |---|---|---|---| |0.3||6 0.2 304 0.3 .1|MMCE L L1 2 K KD DE E- -C CR RE E| |Col1|Col2|10|CRE FL| |---|---|---|---| |0.3||0.2 06 .3 0.1 0. 
02 34|MMCE L L1 2 K KD DE E- -C CR RE E 53 .050.050.1 0.5| |Col1|Col2|Col3|Col4|CRE FL| |---|---|---|---|---| ||||10 6|MMCE L L1 2 K KD DE E- -C CR RE E| |Col1|Col2|Col3|CRE FL| |---|---|---|---| |||10 6|MMCE L L1 2 K KD DE E- -C CR RE E 53| CRE 10 FLMMCE 6 LL12 KDE-CRE KDE-CRE 0.3 0.2 0.3 30.14 0.20.052 530.050.10.5 1.0 CRE 10 FLMMCE LL12 KDE-CRE KDE-CRE 0.3 0.2 0.36 0.1 0.20.05530.050.1 34 0.5 1.0 2 CRE FL 10 MMCE LL12 KDE-CRE KDE-CRE 6 0.3 0.2 0.3 30.14 0.20.052 530.050.11.00.5 CRE FL 10 MMCELL12 KDE-CRE KDE-CRE 6 53 0.3 0.2 0.3 0.14 0.20.050.050.1 3 2 1.00.5 0.87 0.88 0.89 0.90ACC 0.91 0.92 0.93 Marginal calibration on CIFAR10 using Resnet (SD) 0.87 0.88 0.89 0.90ACC 0.91 0.92 0.93 Top-label calibration on CIFAR10 using Resnet (SD) 0.87 0.88 0.89 0.90ACC 0.91 0.92 0.93 Marginal calibration on CIFAR10 using Resnet (SD) 0.87 0.88 0.89 0.90ACC 0.91 0.92 0.93 Marginal calibration on CIFAR10 using Resnet (SD) 0.05 0.04 0.03 0.02 0.01 0.00 0.014 0.012 0.010 0.008 0.006 0.004 0.002 0.0200 0.0175 0.0150 0.0125 0.0100 0.0075 0.0050 0.0025 0.0000 0.00175 0.00150 0.00125 0.00100 0.00075 0.00050 0.00025 0.00000 0.05 0.04 0.03 0.02 0.01 0.00 0.2 CREFL MMCE LL12 KDE-CRE KDE-CRE 0.2 106 0.30.3 3 0.1 0.050.15341.020.05 0.025 0.020 0.015 0.010 0.005 0.000 0.08 0.06 0.04 0.02 0.00 0.07 0.06 0.05 0.04 0.03 0.02 0.01 0.00 0.050.1531.020.05 |.2|Col2|CRE FL MMCE| |---|---|---| |||L L1 2 K KD DE E- -C CR RE E| |0.2||160 0.03.3 3 4| |Col1|Col2|Col3|CRE| |---|---|---|---| ||0.2||FL MMCE L L1 2 K KD DE E- -C CR R10E E 6| |0.2|||3 0.3 0.05 0054..3015 00.3.1 1.0| |0.2|Col2|CRE| |---|---|---| |||FL MMCE| |||L L1 2 K KD DE E- -C CR RE E| |||| |.2|Col2|CRE FL MMCE| |---|---|---| |||L L1 2 K KD DE E- -C CR RE E| |0.2||160 0.03.3 3 4| 0.2 CREFL MMCE LL12 KDE-CRE KDE-CRE 0.2 106 0.30.3 3 0.1 0.050.15341.020.05 CRE 0.2 FLMMCE LL12 KDE-CRE KDE-CRE10 36 0.2 0.3 0.05 0.050.1534 0.30.1 1.02 0.2 CREFL MMCE LL12 KDE-CRE KDE-CRE 0.2 0.30.10.3 31.00.05100.1534620.05 0.2 0.4 ACC 0.6 0.8 0.2 0.4 ACC 0.6 0.8 0.2 0.4 ACC 0.6 0.8 0.2 0.4 ACC 0.6 0.8 Marginal calibration on CIFAR10 using Wideresnet |Col1|Col2|Col3|CRE| |---|---|---|---| |||10 6|FL MMCE L L1 2 K KD DE E- -C CR RE E| |0.3||4 2 0.20.3 53|3| CRE 10 FL MMCE LL12 KDE-CRE KDE-CRE 6 4 0.3 2 0.20.3 530.2 0.10.131.00.050.01 0.89 0.90 0.91 0.92 ACC0.93 0.94 0.95 0.96 Top-label calibration on CIFAR10 using Wideresnet |Col1|Col2|Col3|Col4|CRE| |---|---|---|---|---| ||||10|FL MMCE L L1 2 K KD DE E- -C CR RE E| ||0.3||6 0.240.3 0.2 0 0.|1 .10.050.01 3| CRE 10 FLMMCE LL12 KDE-CRE KDE-CRE 6 0.3 0.240.3 0.2 0.10.13 0.050.01 2 53 1.0 0.89 0.90 0.91 0.92 ACC0.93 0.94 0.95 0.96 Marginal calibration on CIFAR10 using Wideresnet |Col1|Col2|Col3|CRE| |---|---|---|---| |||10|FL MMCE L L1 2 K KD DE E- -C CR RE E| ||6|4 253|| CRE 10 FLMMCE LL12 KDE-CRE KDE-CRE 6 4 0.3 0.20.3 2 530.2 0.10.131.00.050.01 0.89 0.90 0.91 0.92 ACC0.93 0.94 0.95 0.96 Top-label calibration on CIFAR10 using Wideresnet |Col1|Col2|Col3|CRE| |---|---|---|---| |||10|FL MMCE L L1 2 K KD DE E- -C CR RE E| |||6 0.|1| CRE FL 10 MMCE LL12 KDE-CRE KDE-CRE 6 0.3 0.240.3 2 530.2 0.10.131.00.050.01 0.89 0.90 0.91 0.92 ACC0.93 0.94 0.95 0.96 Figure 6: Top-label and marginal calibration on CIFAR-10. Table 3: Training time [sec] per epoch for Cross-Entropy and KDE-CE methods for different models and datasets. 
## Dataset Model CRE L1 KDE-CRE ## ResNet-110 51.8 53 ResNet-110 (SD) 45 46 Wide-ResNet-28-10 152.9 154.9 DenseNet-40 103.2 106.8 ResNet-110 90 92.9 ResNet-110 (SD) 78.2 80.7 Wide-ResNet-28-10 150.5 155.3 DenseNet-40 101 105.5 ## CIFAR-10 CIFAR-100 tion. The used model is a ResNet-56, trained on a subset of three classes from CIFAR-10. The figure shows that the two estimates are comparable and both are doing a reasonable job. Figure 12 shows the absolute difference between the ground truth and estimated ECE using our KDE estimator and a binned estimator with varying number of points used for estimation. The results are ----- Marginal calibration on CIFAR100 using Densenet |Col1|0.2 0.3 3 0.3|0.1530.10.01 0.01| |---|---|---| CRE 4 0.2 2 1.0 FLMMCELL12 KDE-CRE KDE-CRE 0.2 0.3 3 0.1 53 0.10.01 0.3 0.01 0.72 0.73 0.74ACC 0.75 0.76 Top-label calibration on CIFAR100 using Densenet |CRE FL MMC L1 K|0.3 0.2 2 4 E 3 DE-CRE|53 0.1| |---|---|---| 0.3 0.01 0.30.2 1.00.1 53 0.01 0.2 2 0.1 4 CRE FL MMCE 3 LL12 KDE-CRE KDE-CRE 0.72 0.73 0.74ACC 0.75 0.76 Marginal calibration on CIFAR100 using Densenet Top-label calibration on CIFAR100 using Densenet |Col1|0 4|0 .3.2 10.0.153 0.2 2 3|0.01 0.1 CRE FL MMCE L1 KDE-CRE| |---|---|---|---| 0.3 0.01 0.30.2 1.00.1 53 0.01 0.2 2 0.1 CRE 4 FLMMCE 3 LL12 KDE-CRE KDE-CRE 0.72 0.73 0.74ACC 0.75 0.76 0.0013 0.0012 0.0011 0.0010 0.0009 0.0008 0.0007 0.0006 0.0015 0.0014 0.0013 0.0012 0.0011 0.0010 0.0009 0.0175 0.0150 0.0125 0.0100 0.0075 0.0050 0.0025 0.030 0.025 0.020 0.015 0.010 0.005 0.000 3.0 2.5 2.0 1.5 1.0 0.5 3.5 3.0 2.5 2.0 1.5 1.0 0.10 0.08 0.06 0.04 0.14 0.12 0.10 0.08 0.06 0.04 |Col1|4 0.3 03.2 0.3|53 0.1 0.100 .0.0 11| |---|---|---| 1e 5 1.0 CREFL 0.2 2 MMCELL12 KDE-CRE KDE-CRE 4 0.3 0.30.23 0.1 53 0.10.010.01 0.72 0.73 0.74ACC 0.75 0.76 Marginal calibration on CIFAR100 using Resnet Marginal calibration on CIFAR100 using Resnet |Col1|Col2|0.3 3|0.2 0.01 0.3 0.10.01 0.1| |---|---|---|---| ||CRE FL MM|CE|| ||L L1 2 K K|DE-CRE DE-CRE|| 2 460.2 1.053 0.2 0.01 0.3 3 0.3 0.1 0.01 0.1 CRE FL MMCE LL12 KDE-CRE KDE-CRE 0.65 0.66 0.67 0.68ACC 0.69 0.70 0.71 Top-label calibration on CIFAR100 using Resnet |Col1|Col2|2 4|Col4| |---|---|---|---| |||2 60.20.3|0.1 0.3 0.2| ||CRE FL MMC|E|3| ||L L1 2 K K|DE-CRE DE-CRE|| 1.0 530.010.1 0.01 2 460.2 0.3 0.3 0.1 0.2 CRE FL MMCELL12 KDE-CRE KDE-CRE 3 0.65 0.66 0.67 0.68ACC 0.69 0.70 0.71 Top-label calibration on CIFAR100 using Resnet |Col1|L KD|DE-CRE|0.|.1 0.01| |---|---|---|---|---| ||L2 K|DE-CRE 2 4 0 6|.20.3 0.3 0.2|0.1| |||||| ||||3|| CRE FLMMCELL12 KDE-CRE KDE-CRE 1.0530.010.1 0.01 2 460.2 0.3 0.3 0.1 0.2 3 0.65 0.66 0.67 0.68ACC 0.69 0.70 0.71 |Col1|Col2|2|0.2 L1 KDE-CRE L KDE-CRE| |---|---|---|---| |||6 0.3 3|L2 KDE-CRE 1.0 0.3 0.010.1 0.01 0.1| ||||| ||||| 1e 5 40.2 53 CREFLMMCE 2 6 0.2 1.0 LL12 KDE-CRE KDE-CRE 0.3 3 0.3 0.01 0.1 0.01 0.1 0.65 0.66 0.67 0.68ACC 0.69 0.70 0.71 Marginal calibration on CIFAR100 using Resnet (SD) |0.3|0.23 0.3|530.01 0.1 2| |---|---|---| |CRE FL|0.2|0.1 1. 00 .01| |MM|CE|| |L L1 2 K K|DE-CRE DE-CRE|| 0.3 530.01 0.2 3 0.3 0.1 2 0.2 0.1 1.00.01 CRE FL MMCE LL12 KDE-CRE KDE-CRE 0.64 0.66 0.68ACC 0.70 0.72 0.74 Top-label calibration on CIFAR100 using Resnet (SD) |1e 5 M|Marginal calibration on CIFAR100|using Resnet (SD)| |---|---|---| |0.3|3 0.2|530.01 2| |CRE FL MMC|0.3 0.2 E|0.1 0.1 1. 00 .01| |L L1 2 K K|DE-CRE DE-CRE|| 0.64 0.66 0.68ACC 0.70 0.72 0.74 Marginal calibration on CIFAR100 using Resnet (SD) |0.3|0.3|5 0.23 0.|30.01 1 2| |---|---|---|---| |CRE FL||0 0.2|.1 1. 
00 .01| |MM|CE||| |L L1 2 K K|DE-CRE DE-CRE||| 0.64 0.66 0.68ACC 0.70 0.72 0.74 1e 5 2.5 2.0 1.5 1.0 0.5 4.0 3.5 3.0 2.5 2.0 1.5 1.0 0.5 0.12 0.10 0.08 0.06 0.04 0.02 0.00 0.10 0.08 0.06 0.04 0.02 0.0012 0.0011 0.0010 0.0009 0.0008 0.0007 0.0006 0.0012 0.0010 0.0008 0.0006 0.0012 0.0011 0.0010 0.0009 0.0008 0.0007 0.0006 0.016 0.014 0.012 0.010 0.008 0.006 0.004 0.002 0.000 |CRE FL MMC L L1 2 K K|E DE-CRE DE-CRE|12.00.01 0.1 0.01.01| |---|---|---| |0.3|0.2 0.2 0.3|53 3| |||| 0.64 0.66 0.68ACC 0.70 0.72 0.74 Marginal calibration on CIFAR100 using Wideresnet |Col1|Col2|4 532|CRE FL MMCE| |---|---|---|---| |||1.0 0.3 0.10.2 0.2|3L L1 2 K KD DE E- -C CR RE E 000.. 1.0011| 0.34 53 2 1.0 3CREFLMMCELL12 KDE-CRE KDE-CRE 0.10.2 0.2 0.10.010.01 0.3 0.74 0.75 0.76 0.77ACC 0.78 0.79 0.80 0.81 Top-label calibration on CIFAR100 using Wideresnet |Col1|Col2|0.2 0.1|0.1| |---|---|---|---| ||0|0.2 .3 0.3 1.0 4 532|0.01 0.01| 0.20.10.2 0.1 0.3 0.01 0.3 1.0 0.01 4 53 2 CRE FL MMCELL12 KDE-CRE KDE-CRE 3 0.74 0.75 0.76 0.77ACC 0.78 0.79 0.80 0.81 Top-label calibration on CIFAR100 using Wideresnet |Col1|CRE FL MMC|E|Col4|Col5| |---|---|---|---|---| ||L L1 2 K K 0|DE-CRE DE-CRE .3|0.2 0.1 0.2 1.0|0.1 0.01 0.01| CRE FL MMCELL12 KDE-CRE KDE-CRE 0.20.10.2 0.1 0.3 0.010.01 1.0 4 0.3 53 2 3 0.74 0.75 0.76 0.77ACC 0.78 0.79 0.80 0.81 |1e 5|Col2|Marginal calibration on CIFAR100|0 using Wideresnet| |---|---|---|---| |||4|CRE FL MMCE| |||5321.0 0.1|L L1 2 K KD DE E- -C CR RE E| 4 CREFL MMCE 53 2 1.0 LL12 KDE-CRE KDE-CRE 0.3 0.1 0.3 0.2 0.2 0.10.010.013 0.74 0.75 0.76 0.77ACC 0.78 0.79 0.80 0.81 Figure 7: Top-label and marginal calibration on CIFAR-100 0.0 1.0 0.2 0.8 0.4 0.6 0.6 0.4 0.8 0.2 1.0 0.0 0.0 0.2 0.4 0.6 0.8 1.0 (a) Splitting the simplex in 16 bins 0.00 0.05 0.10 0.15 0.20 0.25 0.30 (b) Corresponding histogram (c) Corresponding Dirichlet KDE Figure 8: An example of a simplex binned estimator and kernel-density estimator for CIFAR-10 averaged over 120 ResNet-56 models trained on a subset of three classes from CIFAR-10. Both estimators are biased and have some variance, and the plot shows that the combination of the two is in the same order of magnitude. The empirical convergence rates (slope of the log-log plot) is given in the legend and is shown to be close to the theoretically expected value of -0.5. D.4 CHOICE OF THE BATCH SIZE In Figure 13 we investigate the choice of the batch size on CIFAR-10. To this end, we use two differently shuffled dataloaders that draw random batches from the same training set. The first dataloader provides batches to the loss term (CRE) while the second dataloader provides the batches for the regularization (KDE). 
[Figure 9: Binned ECE vs. KDE ECE scatter plots for canonical, marginal, and top-label calibration, using 100, 500, and 1000 points, 25 bins, and a bandwidth of 0.001.]

Figure 9: Relationship between the ECE metric based on binning and kernel density estimation (KDE-ECE) for the three types of calibration: canonical, marginal and top-label. In every row, a different number of points is used to approximate the KDE-ECE.

[Figure 10: canonical calibration error vs. accuracy (ACC); legend: CRE, L1 KDE-CRE, L2 KDE-CRE.]

Figure 10: Canonical calibration on Kather using a ResNet-50 model

[Figure 11 panels: (a) Canonical; (b) Marginal; (c) Top-label; each plots the ground truth, KDE-ECE, and binned ECE against the number of points.]

Figure 11: KDE-ECE estimates and their corresponding binned approximations on the three types of calibration for a varying number of points used for the estimation. The ground truth is calculated using 3000 probability scores of the test set. For the binned estimate, the points are assigned to 25 bins with adaptive width. A bandwidth of 0.001 is used for KDE-ECE.
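For concreteness, the KDE-ECE quantity plotted in Figures 9, 11, and 12 can be estimated with an O(n²) leave-one-out Dirichlet-kernel estimator. The sketch below is our reading of the construction, not the exact implementation used for the experiments; it assumes the kernel centered at a prediction p_i is a Dirichlet density with parameters p_i / h + 1.

```python
# Minimal sketch of a leave-one-out L1 KDE-ECE with a Dirichlet kernel.
# probs: (n, K) predicted probability vectors; labels_onehot: (n, K)
# one-hot float targets; bandwidth h > 0. Cost is O(n^2), as in the paper.
import torch

def dirichlet_kde_ece(probs, labels_onehot, bandwidth=0.001, eps=1e-12):
    alphas = probs / bandwidth + 1.0                   # Dirichlet parameters per center
    log_probs = torch.log(probs + eps)
    # log Dir(p_j; alpha_i) for every pair (i, j)
    log_kern = (torch.lgamma(alphas.sum(dim=1, keepdim=True))
                - torch.lgamma(alphas).sum(dim=1, keepdim=True)
                + (alphas - 1.0) @ log_probs.T)        # (n, n)
    log_kern.fill_diagonal_(float("-inf"))             # leave-one-out
    weights = torch.softmax(log_kern, dim=0)           # normalize over centers i
    cond_exp = weights.T @ labels_onehot               # estimate of E[y | f(x) = p_j]
    return (cond_exp - probs).abs().sum(dim=1).mean()  # L1 calibration error
```

Evaluating this on held-out probability scores with a bandwidth of 0.001 matches the KDE-ECE configuration reported in the figure captions.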
[Figure 12 panels: (a) Canonical; (b) Marginal; (c) Top-label; log-log plots of the absolute error of KDE-ECE and binned ECE against the number of points, with the fitted slopes reported in the legend.]

Figure 12: Absolute difference between ground truth and estimated ECE for a varying number of points used for the estimation. The ground truth is calculated using 3000 probability scores of the test set. For the binned estimate, the points are assigned to 25 bins with adaptive width. A bandwidth of 0.001 is used for KDE-ECE. Note that the axes are on a log scale.

[Figure 13: calibration error vs. accuracy (ACC) for regularization batch sizes 32, 64, 128, 256, and 512; legend: L1 KDE-CRE (single batch) and L1 2 KDE-CRE (separate batches).]

Figure 13: Training with different batches for loss and regularization (2 KDE-CRE), where the batch size for the loss is fixed and the batch size for the regularization varies. The orange point shows our usual experimental set-up where we train with only one batch (KDE-CRE). Upper row: marginal, lower row: top-label.
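The binned baseline in Figures 11 and 12 uses 25 bins with adaptive width. One plausible reading, sketched below for the top-label case, is equal-mass binning via quantiles of the confidence scores; this interpretation, and every name in the sketch, is our assumption, since the binning scheme is not spelled out here.

```python
# Sketch of a binned top-label ECE with adaptive-width (equal-mass) bins.
# Assumption: "adaptive width" means quantile-based bin edges, so that
# each bin holds roughly the same number of points.
import numpy as np

def adaptive_binned_ece(confs, correct, n_bins=25):
    # confs: top-label confidences; correct: boolean array, was the top
    # prediction right?
    edges = np.quantile(confs, np.linspace(0.0, 1.0, n_bins + 1))
    bin_ids = np.digitize(confs, edges[1:-1])  # values in 0..n_bins-1
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(correct[mask].mean() - confs[mask].mean())
            ece += mask.mean() * gap           # weight by bin mass
    return ece
```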