# DENSITY-BASED CLUSTERING WITH KERNEL DIFFUSION

**Anonymous authors** Paper under double-blind review

ABSTRACT

Finding a suitable density function is essential for density-based clustering algorithms such as DBSCAN and DPC. A naive density corresponding to the indicator function of a unit d-dimensional Euclidean ball is commonly used in these algorithms. Such a density suffers from an inability to capture local features in complex datasets. To tackle this issue, we propose a new kernel diffusion density function, which is adaptive to data of varying local distributional characteristics and smoothness. Furthermore, we develop a surrogate that can be efficiently computed in linear time and space and prove that it is asymptotically equivalent to the kernel diffusion density function. Extensive empirical experiments on benchmark and large-scale face image datasets show that the proposed approach not only achieves a significant improvement over classic density-based clustering algorithms but also outperforms the state-of-the-art face clustering methods by a large margin.

1 INTRODUCTION

Density-based clustering algorithms are now widely used in a variety of applications, ranging from high energy physics (Tramacere & Vecchio, 2012; Rovere et al., 2020), material sciences (Marquis et al., 2019; Reza et al., 2007), and social network analysis (Shi et al., 2014; Khatoon & Banu, 2019) to molecular biology (Cao et al., 2017; Ziegler et al., 2020). In these algorithms, data points are partitioned into clusters that correspond to sufficiently or locally high-density areas with respect to an underlying probability density or a similar reference function. We call these density functions throughout this paper. These techniques are attractive to practitioners due to their nonparametric nature, which leads to flexibility in discovering clusters of arbitrary shape, whereas classic methods such as k-means and k-medoids (Friedman et al., 2001) can only detect convex (e.g., spherical) clusters. Seminal work in the context of density-based clustering includes DBSCAN (Ester et al., 1996) and DPC (Rodriguez & Laio, 2014), among many others (Ankerst et al., 1999; Cuevas et al., 2001; Comaniciu & Meer, 2002; Hinneburg & Gabriel, 2007; Stuetzle, 2003).

Most density-based clustering algorithms implicitly identify cluster centers and assign the remaining points to clusters by connecting them to nearby points of higher density. These methods require a density function, which is usually an estimate of the underlying true probability density or some variant of it. For example, a popular choice is the naive density function, obtained by simply counting the number of data points covered by the ε-neighborhood of each point x. Note that such densities are not adaptive to different regions of the distribution. One challenging scenario is when clusters in the data have varying local features, for example, size, height, spread, and smoothness. The resulting density function then has a tendency to flatten the peaks and valleys in the data distribution, which leads to underestimation of the number of clusters (see Figure 1). Many heuristic variations of DBSCAN and DPC have been proposed to magnify the local features and thereby make the clustering task easier (Campello et al., 2013; Chen et al., 2018; Ertöz et al., 2003; Zhu et al., 2016). Most of these methods can be viewed as performing clustering on certain transformations of the naive density function.
However, if the naive density function itself is problematic in the first place, these methods become less effective. Moreover, even if we apply adaptive alternatives to modify the classic density functions, there are other contentious issues with the generally used linear kernel density estimator (KDE). It often suffers from severe boundary bias (Marron & Ruppert, 1994) and is acknowledged to be computationally expensive. These phenomena prevent the classic density functions from being practically useful and reliable, especially for large-scale and complex clustering tasks.

Figure 1: (a) Data generated from a Gaussian mixture model with 3 components, each with differing variance and weight. (b) Naive density function in 2D (top) and 3D (bottom): only one peak can be identified. (c) Proposed kernel diffusion density function: 3 clusters can be easily discovered.

To overcome these problems in density-based clustering algorithms, in this paper we propose a general approach to build the so-called kernel diffusion density function to replace classic density functions. The key idea is to construct the density from a user-specified bivariate kernel that has the desired local adaptive properties. Instead of using the naive density function and its variants, we utilize the bivariate kernel to derive a transition probability. A diffusion process is induced by this transition probability, which admits a limiting and stationary distribution. This limiting distribution serves as a plausible density function for clustering with reduced error. Under this framework, we provide examples of symmetric and asymmetric bivariate kernels to construct the kernel diffusion density function, which can tackle clustering of complex and locally varying data. We apply the resulting adapted DBSCAN and DPC algorithms to widely different empirical datasets and show significant improvement in each of these analyses.

The main contributions of this paper are summarized below:

- We introduce new bivariate kernel functions and construct the associated kernel diffusion processes. Based on the diffusion process, we propose a kernel diffusion density function to adapt density-based clustering algorithms such as DBSCAN and DPC, which attains accuracy in the presence of varying local features.
- We derive a computationally much more efficient surrogate, and show analytically that it is asymptotically equivalent to the proposed kernel diffusion density function.
- Through extensive experiments, we demonstrate the superiority of the kernel diffusion density function over the naive density function and its variants when applied to DBSCAN and DPC, and show that it outperforms state-of-the-art GCN-based methods on face clustering tasks.

2 RELATED WORK

**Density-Based Clustering** There is a vast literature on adapting density-based clustering algorithms to handle large variations across clusters in the data. DPC itself is such a refinement of DBSCAN, as it determines cluster centers not only by highest density values but also by taking into account their distances from each other, and thus generally performs better in complex clustering tasks. Other attempts include rescaling the data to have relative reference measures instead of KDE (Zhu et al., 2016; Chen et al., 2018), and using the number of shared nearest neighbors between two points to replace the geometric distance (Ertöz et al., 2003).
**Diffusion Maps** The technique of diffusion maps (Coifman et al., 2005; Coifman & Lafon, 2006) gives a multi-scale organization of the data according to their underlying geometric structure. It uses a local similarity measure to create a diffusion process on the data which integrates local geometry at different scales along time t. Generally speaking, the diffusion segments the data into several smaller clusters for small t and groups the data into one cluster for large t. Applying the eigenfunctions at a carefully selected time t leads to good macroscopic representations of the data, which is useful in dimension reduction and spectral clustering (Nadler et al., 2005).

**Face Clustering** Face clustering has been extensively studied as an important application in machine learning. Traditional algorithms include k-means, hierarchical clustering (Sibson, 1973) and ARO (Otto et al., 2017). Many recent works take advantage of supervised information and GCN models, achieving impressive improvements compared to traditional algorithms. To name a few, CDP (Zhan et al., 2018) proposes to aggregate the features extracted by different models; L-GCN (Wang et al., 2019) predicts the linkage in an instance pivot subgraph; LTC (Yang et al., 2019) generates a series of subgraphs as proposals and detects face clusters thereon; and GCN(V+E) (Yang et al., 2020) learns both confidence and connectivity by GCN. In this paper we demonstrate that the proposed density-based clustering algorithm with kernel diffusion, as a general clustering approach, even outperforms these state-of-the-art methods that are specially designed for face clustering.

3 PRELIMINARIES

3.1 NOTATIONS

Let the dataset D = {x_1, . . ., x_n} ⊂ R^d be n i.i.d. samples drawn from a distribution F with density f on R^d. Let F_n denote the corresponding empirical distribution with respect to D, i.e., $F_n(A) = \frac{1}{n}\sum_{i=1}^{n}\mathbf{1}_A(x_i)$, where $\mathbf{1}_A(\cdot)$ denotes the indicator function of the set A. We write ||u|| for the Euclidean norm of a vector u. Let B(x, ε) and V_d denote the d-dimensional ε-ball centered at x and the volume of the unit ball B(0, 1), respectively. Let N_k(x) denote the set of k-nearest neighbors of point x within the dataset D.

3.2 DENSITY FUNCTION

Density-based algorithms perform clustering by specifying and segmenting high-value areas of a density function denoted by ρ. Usually, we calculate each ρ(x_i), and then identify cluster centers with (locally) highest values. Many popular algorithms such as DBSCAN and DPC employ the following naive density function:

$$\rho_{\mathrm{naive}}(x) = \frac{1}{n\varepsilon^d}\sum_{y\in D}\frac{\mathbf{1}_{B(x,\varepsilon)}(y)}{V_d}. \tag{1}$$

The naive density function ρ_naive is actually an empirical estimate of f for carefully chosen ε. It is easy to observe that for clustering purposes we only care about ρ_naive(x) up to a normalising constant, which makes it simply equivalent to counting the total number of data points in the ε-ball around x.

In practice, the data distribution may be very complex and contain varying local features that are difficult to detect. The naive density in (1), with the same radius ε for all x, usually suffers from unsatisfactory empirical performance, for example, failing to identify small clusters with fewer data points.
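To make equation (1) concrete, the following is a minimal NumPy sketch of the naive density; the function name `naive_density` and the brute-force O(n²) pairwise-distance computation are our own illustrative choices, not part of the paper.

```python
import numpy as np
from math import gamma, pi

def naive_density(X, eps):
    """Naive density of eq. (1): normalised count of points in the eps-ball around each x.

    X is an (n, d) array of samples; returns an array of length n.
    """
    n, d = X.shape
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    counts = (sq_dists <= eps ** 2).sum(axis=1)      # |B(x, eps) ∩ D|
    v_d = pi ** (d / 2) / gamma(d / 2 + 1)           # volume of the unit d-ball
    return counts / (n * eps ** d * v_d)
```

As noted above, only the relative ordering of ρ_naive matters for clustering, so the constant nε^d V_d could be dropped in practice.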
One possible way to alleviate this problem is through a transformation into the following local contrast (LC) function (Chen et al., 2018):

$$\rho_{\mathrm{LC}}(x) = \sum_{y\in N_k(x)} \mathbf{1}_{\rho_{\mathrm{naive}}(x) > \rho_{\mathrm{naive}}(y)}. \tag{2}$$

In this way, ρ_LC compares the density of each data point with its k-nearest neighbors. To see the benefit of LC, let us consider x to be a cluster center. After local contrasting, ρ_LC(x) is likely to reach the value of k regardless of the size of this cluster.

However, density functions like ρ_LC still depend heavily on the underlying performance of ρ_naive. This restricts their application to clustering data with challenging local features.

4 METHODOLOGY

In this section, we present a new type of density-based clustering algorithm, based on the notion of a kernel diffusion density function. Towards this end, we introduce a kernel diffusion density function, which takes account of local adaptivity and is well-suited for clustering purposes. We provide details on how to derive this density function from a diffusion process induced by bivariate kernels. We also provide a surrogate density function that is computationally more efficient.

4.1 DIFFUSION PROCESS AND KERNEL DIFFUSION DENSITY

Consider a bivariate kernel function k : D × D → R^+ such that:

- k(x, y) is non-negative, i.e., k(x, y) ≥ 0.
- k(x, y) is F_n-integrable with respect to both x and y.

We define $d(x) = n\int_D k(x, y)\,dF_n(y)$ as a local measure of the volume at x and define

$$p(x, y) = \frac{k(x, y)}{d(x)}. \tag{3}$$

It is easy to see that p(x, y) satisfies the conservation property $n\int_D p(x, y)\,dF_n(y) = 1$. As a result, p(x, y) can be viewed as the probability for a random walk on the dataset to move from point x to point y, which induces a Markov chain on D with n × n transition matrix P = [p(x_i, x_j)]. This technique is standard in various applications, known as the normalized graph Laplacian construction. For example, we can view D as a graph, L = I − P as the normalized graph Laplacian, and d(x) as a normalization factor.

For t ≥ 0, the probability of transiting from x to y in t time steps is given by P^t, the t-th power of P. Running the Markov chain forward in time, we observe the dataset at different scales, which is the diffusion process X_t on D. Let ρ(x, t) : D × R^+ → R^+ be the associated probability density, which is governed by the following second-order differential equation with initial condition:

$$\frac{\partial}{\partial t}\rho(x, t) = -L\rho(x, t), \qquad \rho(x, 0) = \phi_0(x), \tag{4}$$

where φ_0(x) is a probability density at time t = 0. In practice we can use any valid choice of φ_0(x), e.g. the uniform density.

To give an explicit example of the diffusion process, consider a sub-class of k, namely isotropic kernels, where k(x, y) = K(||x − y||²/h) for some function K : R → R^+. Here h has a dual interpretation: as a scale parameter from which local information is inferred, and as the time step h = Δt at which the random walk jumps. Then we can define the forward Chapman-Kolmogorov operator T_F as

$$T_F\,\phi_0(x) = n\int_D p(x, y)\,\phi_0(y)\,dF_n(y).$$

Note that T_F φ_0 is the data distribution at time t = h; T_F can thus be viewed as a continuous analogue of left multiplication by the transition matrix P. Letting h → 0, the random walk converges to a continuous diffusion process whose probability density evolves continuously in t.
In this case, we can explicitly write the second-order differential equation in (4) as:

$$\frac{\partial}{\partial t}\rho(x, t) = \lim_{h\to 0}\frac{\rho(x, t + h) - \rho(x, t)}{h} = \lim_{h\to 0}\frac{T_F - I}{h}\,\rho(x, t), \tag{5}$$

where $\lim_{h\to 0}(T_F - I)/h$ is the conventional infinitesimal generator of the process.

Now we are ready to introduce our kernel diffusion density function.

**Definition 1. (Kernel diffusion density function)** Suppose the Markov chain induced by P is ergodic. We define the kernel diffusion density function as the limiting probability density of the diffusion process X_t, i.e.,

$$\rho_{\mathrm{KD}}(x) = \lim_{t\to\infty}\rho(x, t). \tag{6}$$

Intuitively, on the one hand, with increased values of t we expect the diffusion process X_t to gradually reveal the geometric structure (such as high-density regions) of the data distribution F. To see this, note that the transition probability P reflects connectivity between data points. We can interpret a cluster as an underlying geometric structure in which the probability of staying within the region is high during a transition. In the diffusion process, the probability of following a path along a structure usually increases with t, as the involved data points are dense and highly connected. Such a path therefore consists of short, high-probability jumps, whilst paths that do not follow any structure consist of long, low-probability jumps, which lowers their overall probability. As a result, geometric structures of F are magnified during the diffusion. On the other hand, by taking certain sophisticated forms of k(x, y) that account for local adaptivity, we also slow down the diffusion to avoid trivial geometric structures such as one big cluster containing all the data points. In this way, we can eventually identify the correct geometric structure at the right scale.

4.2 LOCALLY ADAPTIVE KERNELS

To address local adaptability in the kernel diffusion density function, we propose the following two bivariate kernels. Both of them are very simple variations of the most commonly used classic kernels.

**Symmetric-Gaussian kernel:**

$$k(x, y) = \exp\left(-\frac{\|x - y\|^2}{h}\right)\mathbf{1}_{B(x,\varepsilon)}(y). \tag{7}$$

Here h and ε are both hyper-parameters. We call this kernel symmetric since k(x, y) = k(y, x).

**Asymmetric-Gaussian kernel:**

$$k(x, y) = \exp\left(-\frac{\|x - y\|^2}{h}\right)\mathbf{1}_{N_k(x)}(y). \tag{8}$$

Here h and k are hyper-parameters. Note that in this case k(x, y) is asymmetric, as y ∈ N_k(x) does not imply x ∈ N_k(y).

The bivariate kernels defined in (7) and (8) are simply combinations of the classic Gaussian kernel with the ε-neighbourhood or k-nearest-neighbour kernels, respectively. With these simple combinations, we truncate the Gaussian kernel to local areas, and the contribution of each point y to the construction of the density function ρ_KD(x) depends not only on the distance ||y − x|| but also on the local geometric structure around x. Hence, the new kernels are adaptive at different x, which is expected to lead to better clustering performance in the presence of local features. We remark that the Asymmetric-Gaussian kernel takes into account a varying neighborhood around each x, and is thus more adaptive compared to the Symmetric-Gaussian kernel. Although here we only provide two examples of locally adaptive kernels, other options can easily be created in a similar spirit under this framework, e.g., changing the Gaussian kernel to other kernels or changing the ε-neighbourhood (k-nearest-neighbour) kernels to other locally truncated functions.
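The following is a minimal NumPy sketch of this construction: it builds the transition matrix P of equation (3) from either of the two truncated kernels above and approximates ρ_KD of Definition 1 by iterating the chain from a uniform φ_0 until the density stops changing. The function names, the brute-force distance computation, and the power-iteration stopping rule are our own illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def transition_matrix(X, h=0.5, eps=None, k=None):
    """Row-stochastic matrix P with p(x, y) = k(x, y) / d(x), eq. (3).

    Pass eps for the Symmetric-Gaussian kernel (eq. 7) or k for the
    Asymmetric-Gaussian kernel (eq. 8).  X is an (n, d) array.
    """
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)   # ||x - y||^2
    K = np.exp(-sq / h)
    if eps is not None:                       # keep y inside B(x, eps)
        K = K * (sq <= eps ** 2)
    elif k is not None:                       # keep y among the k nearest neighbours of x
        idx = np.argsort(sq, axis=1)[:, :k]
        mask = np.zeros_like(K, dtype=bool)
        np.put_along_axis(mask, idx, True, axis=1)
        K = K * mask
    d = K.sum(axis=1, keepdims=True)          # local volume d(x); positive since x lies in its own neighbourhood
    return K / d

def kernel_diffusion_density(P, tol=1e-10, max_iter=10_000):
    """Approximate rho_KD (Definition 1) by running the chain from a uniform phi_0."""
    rho = np.full(P.shape[0], 1.0 / P.shape[0])
    for _ in range(max_iter):
        nxt = rho @ P                         # one diffusion step
        if np.abs(nxt - rho).max() < tol:
            break
        rho = nxt
    return rho
```

The vector returned by `kernel_diffusion_density` can then be plugged into DPC or DBSCAN in place of ρ_naive.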
Once k(x, y) is determined, we can derive the corresponding density function ρ_KD. Next, we simply apply any density-based clustering procedure such as DPC or DBSCAN with ρ_KD in place of the naive density function ρ_naive. In Section 5, we assess the empirical performance of the proposed kernel diffusion density function with the above two locally adaptive kernels. They outperform existing density-based algorithms and other state-of-the-art methods.

4.3 FAST KERNEL DIFFUSION DENSITY

The kernel diffusion density function ρ_KD can be calculated as the stationary distribution of the Markov chain induced by the transition matrix P. Numerically, we can solve for it by iteratively right-multiplying ρ(x, t) by P until convergence, or by applying a QR decomposition to P. These methods are expensive in terms of computational cost, especially when the sample size n is large. To tackle this problem, we propose the following surrogate of ρ_KD(x), which is computationally more efficient.

**Definition 2. (Fast kernel diffusion density function)** Let p(y, x) be the transition probability from point y to point x, as defined in equation (3). We define the fast kernel diffusion density function as

$$\rho_{\mathrm{FKD}}(x) = \int_D p(y, x)\,dF_n(y). \tag{9}$$

It is straightforward to see that ρ_FKD can be obtained in linear time and memory space, as we only need to compute the column averages of the matrix P. Here we show that ρ_FKD is not only computationally efficient but also suitable for detecting local features. This is illustrated through the following Theorem 1. Consider the special case k(x, y) = 1_{B(x,ε)}(y). Then it is easy to verify that

$$\rho_{\mathrm{FKD}}(x) = \frac{1}{C_d}\sum_{y\in B(x,\varepsilon)}\frac{1}{\rho_{\mathrm{naive}}(y)},$$

where C_d = nε^d V_d is a normalising constant. In this way, we build a connection between ρ_FKD and the naive density function ρ_naive in this special example.

**Theorem 1.** Consider the above special case k(x, y) = 1_{B(x,ε)}(y). In addition, assume the dataset D = {x_1, ..., x_n} can be split into m disjoint clusters, i.e., D = D_1 ∪ · · · ∪ D_m, and for each x ∈ D, B(x, ε) only contains data points that belong to the same cluster as x. Denote by $\bar\rho_j = \frac{1}{|D_j|}\sum_{x\in D_j}\rho_{\mathrm{FKD}}(x)$ the average density in cluster j. We have

$$\bar\rho_1 = \cdots = \bar\rho_m = 1.$$

Theorem 1 demonstrates that the averaged ρ_FKD in each cluster is the same regardless of cluster sizes and other local features. This shows that ρ_FKD elevates the density of small clusters, which is essential for finding the density peaks of small clusters.

Previously we claimed that ρ_FKD is a surrogate of the kernel diffusion density ρ_KD. Next, we want to reveal the relationship between these two density functions from an asymptotic viewpoint. To proceed, we will need the following assumption.

**Assumption 1.** There exists some positive constant c < 1 that is independent of n, such that ρ_FKD(x) ≤ c holds for every x ∈ D.

This is a very mild assumption, since it always holds that ρ_FKD(x) < 1, and the average of ρ_FKD(x) over the dataset is $\int_D \rho_{\mathrm{FKD}}(x)\,dF_n(x) = 1/n$, which vanishes as n → ∞. Now we are ready to present the following theorem, which characterises the closeness between ρ_FKD and ρ_KD.

**Theorem 2.** Suppose that Assumption 1 holds and the Markov chain induced by the kernel k(x, y) is ergodic. We have

$$\frac{\rho_{\mathrm{KD}}(x)}{\rho_{\mathrm{FKD}}(x)} \xrightarrow{a.s.} 1.$$

As shown in the Appendix, the almost sure convergence in Theorem 2 is of a fast rate at n^{-1}. Thus it is safe to use ρ_FKD in place of ρ_KD in finite-sample experiments.
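In code, the surrogate of Definition 2 reduces to a column average of P. The sketch below reuses the hypothetical `transition_matrix` and `kernel_diffusion_density` helpers from the Section 4.2 sketch and shows how one might sanity-check the ratio in Theorem 2 on synthetic data; the mixture parameters are arbitrary, and the printed ratio is only expected to concentrate around 1 as n grows.

```python
import numpy as np

def fast_kernel_diffusion_density(P):
    """Fast surrogate rho_FKD of Definition 2: rho_FKD(x_i) = (1/n) * sum_j p(x_j, x_i)."""
    return P.mean(axis=0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Two Gaussian blobs of very different sizes, loosely mimicking Figure 1.
    X = np.vstack([rng.normal(0.0, 1.0, size=(400, 2)),
                   rng.normal(3.0, 0.5, size=(60, 2))])
    P = transition_matrix(X, h=0.5, k=30)
    ratio = kernel_diffusion_density(P) / fast_kernel_diffusion_density(P)
    print(ratio.min(), ratio.max())   # per Theorem 2, the ratio should concentrate around 1 for large n
```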
This result is also verified by our numerical experiments in Section 5.

5 EXPERIMENTS

In this section, we empirically evaluate the proposed kernel diffusion density functions against ρ_naive and ρ_LC in density-based clustering algorithms, and also compare them with other state-of-the-art methods. We denote by ρ_KD^sym and ρ_FKD^sym the kernel diffusion density function and its fast surrogate with the symmetric-Gaussian kernel, respectively. Similarly, we denote by ρ_KD^asym and ρ_FKD^asym the two proposed density functions with the asymmetric-Gaussian kernel. We examine their performance on a wide range of datasets. The clustering results are measured by Pairwise F-score (Banerjee et al., 2005), BCubed F-score (Amigó et al., 2009) or NMI (Cover, 1999).

Table 1: Clustering performance on benchmark datasets with different density functions applied to DPC. Pairwise F-score (F_P) and BCubed F-score (F_B) under optimal parameter tuning are given. The best and second-best results in each dataset are bolded and underlined, respectively.

| Dataset | F_P: ρ_naive | ρ_LC | ρ_KD^sym | ρ_KD^asym | ρ_FKD^sym | ρ_FKD^asym | F_B: ρ_naive | ρ_LC | ρ_KD^sym | ρ_KD^asym | ρ_FKD^sym | ρ_FKD^asym |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Banknote | 54.3 | 31.6 | 67.2 | 83.9 | 67.2 | 93.6 | 57.7 | 31.8 | 67.2 | 85.1 | 67.2 | 93.6 |
| Breast-d | 55.9 | 51.8 | 78.0 | 69.1 | 67.4 | 72.6 | 59.0 | 58.7 | 76.0 | 69.7 | 69.4 | 72.2 |
| Breast-o | 57.6 | 70.7 | 82.8 | 92.9 | 82.7 | 92.9 | 52.2 | 74.1 | 75.9 | 92.2 | 75.8 | 92.2 |
| Control | 48.6 | 49.3 | 49.0 | 63.9 | 52.5 | 64.5 | 51.6 | 52.4 | 52.0 | 70.8 | 55.1 | 71.8 |
| Glass | 36.9 | 39.0 | 46.3 | 48.1 | 44.8 | 47.8 | 42.7 | 45.7 | 55.1 | 56.9 | 53.5 | 57.1 |
| Haberman | 66.9 | 64.1 | 74.5 | 75.7 | 75.8 | 75.7 | 66.9 | 63.3 | 74.5 | 75.8 | 75.9 | 75.8 |
| Ionosphere | 27.4 | 28.3 | 46.9 | 54.9 | 46.0 | 53.9 | 25.0 | 25.8 | 42.6 | 52.5 | 41.7 | 49.2 |
| Iris | 54.3 | 53.8 | 65.8 | 74.6 | 69.2 | 74.6 | 61.6 | 62.3 | 72.7 | 80.0 | 74.0 | 80.0 |
| Libras | 20.0 | 22.9 | 29.3 | 31.5 | 26.0 | 31.0 | 26.8 | 29.1 | 37.8 | 41.8 | 33.3 | 39.8 |
| Pageblocks | 92.9 | 93.0 | 90.5 | 90.2 | 89.7 | 90.2 | 89.9 | 90.0 | 89.8 | 89.7 | 89.6 | 89.7 |
| Seeds | 54.3 | 54.9 | 68.0 | 78.0 | 69.5 | 78.0 | 54.3 | 55.4 | 72.4 | 78.7 | 72.9 | 78.7 |
| Segment | 48.4 | 48.0 | 57.1 | 58.0 | 41.4 | 56.1 | 64.2 | 63.8 | 67.1 | 69.2 | 60.6 | 68.2 |
| Wine | 45.2 | 61.1 | 56.6 | 68.0 | 60.0 | 65.3 | 46.0 | 61.9 | 61.5 | 74.7 | 66.3 | 71.4 |

5.1 PERFORMANCE ON BENCHMARK DATASETS

We now discuss the performance on 13 benchmark datasets (∼100 to ∼5,000 data points) from the UCI repository (UCI). The metadata are summarised in the Appendix. As summarised in Table 1, both ρ_KD^sym and ρ_KD^asym uniformly outperform ρ_naive and ρ_LC in terms of F-scores; results based on NMI are deferred to the Appendix. The proposed kernel diffusion density function with the asymmetric Gaussian kernel, ρ_KD^asym, which enjoys better local adaptivity analytically, achieves the best results on most datasets and outperforms ρ_naive and ρ_LC by a large margin. It is worth noting that the two fast surrogates, ρ_FKD^sym and ρ_FKD^asym, achieve comparable results to their original counterparts, ρ_KD^sym and ρ_KD^asym. In the Appendix, similar results are observed for the same set of density functions applied to DBSCAN.

Figure 2: Precision-Recall curves of different approaches applied to DPC on the MS1M dataset, using (a) the Pairwise metric and (b) the BCubed metric.

5.2 PERFORMANCE ON FACE IMAGE DATASETS

Clustering face images according to their latent identity has become an important application in recent years. It is challenging in the sense that face image datasets usually contain thousands of identities, corresponding to thousands of clusters.
Meanwhile, the number of images per identity (cluster) varies considerably, corresponding to a wide variety of cluster sizes. We assess the performance of the proposed approach on two popular face image datasets: emore_200k (Zhan et al., 2018) and MS1M (Guo et al., 2016).

**emore_200k.** The dataset contains 2,577 identities with 200,000 images, following the protocol in Zhan et al. (2018). The proposed density-based approaches are compared with k-means, HAC (Sibson, 1973), ARO (Otto et al., 2017), and CDP (Zhan et al., 2018); results are reported in Table 2. Again, clustering with the proposed kernel diffusion density functions outperforms state-of-the-art approaches such as CDP by a large margin.

Table 2: Clustering performance on emore_200k. BCubed precision, recall and F-score are reported. "-" indicates not available.

| | Algorithm | # clusters | Precision | Recall | F_B |
|---|---|---|---|---|---|
| Baseline | k-means | 2,577 | 94.24 | 74.89 | 83.45 |
| | HAC | 2,577 | 97.74 | 88.02 | 92.62 |
| | ARO | 85,150 | 52.96 | 16.93 | 25.66 |
| | CDP | - | 89.35 | 88.98 | 89.16 |
| Density-based | ρ_naive | 7,928 | 92.36 | 78.14 | 84.65 |
| | ρ_LC | 3,485 | 96.15 | 86.58 | 91.11 |
| | ρ_KD^sym | 2,781 | 95.82 | 93.24 | 94.51 |
| | ρ_KD^asym | 2,546 | 95.48 | 93.82 | 94.64 |
| | ρ_FKD^sym | 3,622 | 95.27 | 92.54 | 93.89 |
| | ρ_FKD^asym | 2,569 | 96.37 | 93.93 | 95.13 |

**MS1M.** The dataset contains 8,573 identities with around 584,000 images, following the protocol in Yang et al. (2020). We set ε to 0.8, k to 200, and h to 0.5 for the density-based methods. We report the clustering performance in Table 3. Precision versus recall curves for the different density functions (applied to DPC) are plotted in Figure 2. In Table 3, the proposed kernel diffusion density functions outperform ρ_naive and ρ_LC. Note that GCN-based methods such as L-GCN (Wang et al., 2019), LTC (Yang et al., 2019) and GCN(V+E) (Yang et al., 2020) achieve generally better clustering performance than unsupervised methods due to their supervised nature. However, it is quite encouraging to see that the proposed kernel diffusion approaches, although they are also unsupervised clustering methods, considerably outperform the GCN-based methods.

Table 3: Clustering performance on MS1M. Pairwise F-score and BCubed F-score are reported. "-" indicates not available.

| | Algorithm | # clusters | F_P | F_B |
|---|---|---|---|---|
| Unsupervised | k-means | 8,573 | 79.21 | 81.23 |
| | HAC | 8,573 | 70.63 | 70.46 |
| | ARO | - | 13.60 | 17.00 |
| | CDP | - | 75.02 | 78.70 |
| Supervised | L-GCN | - | 78.68 | 84.37 |
| | LTC | - | 85.66 | 85.52 |
| | GCN(V+E) | - | 87.55 | 85.94 |
| Density-based | ρ_naive | 59,551 | 78.37 | 79.35 |
| | ρ_LC | 24,019 | 83.61 | 85.06 |
| | ρ_KD^sym | - | - | - |
| | ρ_KD^asym | 22,869 | 88.15 | 87.14 |
| | ρ_FKD^sym | 34,246 | 84.40 | 85.37 |
| | ρ_FKD^asym | 22,927 | 87.26 | 87.41 |

5.3 SENSITIVITY ANALYSIS

Next, we examine the sensitivity of the proposed kernel diffusion density functions to hyperparameters and compare it with ρ_naive and ρ_LC. The results are obtained via extensive experiments on emore_200k and MS1M, and are shown in Figure 3. We can see that the clustering performance of ρ_KD^sym is much more stable than that of ρ_naive and ρ_LC when we vary the value of ε, whilst ρ_KD^asym is robust to the parameter k, and both ρ_KD^sym and ρ_KD^asym are quite robust to the parameter h.

Figure 3: Sensitivity analysis on emore_200k and MS1M. We investigate the clustering performance by varying the following parameters: (a) radius ε of the ε-ball; (b) number k of nearest neighbors; (c) bandwidth h of the Gaussian kernel.

5.4 COMPUTATIONAL COST

We carried out a series of experiments on MS1M to demonstrate the computational efficiency of the fast surrogate ρ_FKD in terms of time and space.
With a collection of subsampled data from MS1M at different percentile levels, we run both the kernel diffusion density ρKD and the fast surrogate ρFKD. As we can observe from Figure 4, the running time and memory usage of ρKD increase dramatically with the sample size. Whilst ρFKD retains a very low level of computational cost. This suggests that ρFKD, which achieves an excellent computational efficiency, should be favored in practice. 6 CONCLUSION Figure 4: Running time and memory usage of the proposed methods at different sample sizes on MS1M. Density-based clustering has a profound impact on machine learning and data mining. However, the underpinning naive density function suffers from detecting varying local features, causing extra errors in the clustering. We propose a new set of density functions based on the kernel diffusion process to resolve this problem, which is adaptive to density regions of varying local distributional features. We demonstrate that DBSCAN and DPC adapted by the proposed approach have improved clustering performance comparing to their classic versions and other state-of-the-art methods. ----- REFERENCES Enrique Amigó, Julio Gonzalo, Javier Artiles, and Felisa Verdejo. A comparison of extrinsic clustering evaluation metrics based on formal constraints. Information Retrieval, 12(4):461–486, 2009. Mihael Ankerst, Markus M Breunig, Hans-Peter Kriegel, and Jörg Sander. Optics: Ordering points to identify the clustering structure. ACM Sigmod Record, 28(2):49–60, 1999. Arindam Banerjee, Chase Krumpelman, Joydeep Ghosh, Sugato Basu, and Raymond J Mooney. Model-based overlapping clustering. In Proceedings of the eleventh ACM SIGKDD International _Conference on Knowledge Discovery in Data Mining, pp. 532–537, 2005._ Ricardo JGB Campello, Davoud Moulavi, and Jörg Sander. Density-based clustering based on hierarchical density estimates. In Pacific-Asia Conference on Knowledge Discovery and Data _Mining, pp. 160–172. Springer, 2013._ Junyue Cao, Jonathan S Packer, Vijay Ramani, Darren A Cusanovich, Chau Huynh, Riza Daza, Xiaojie Qiu, Choli Lee, Scott N Furlan, Frank J Steemers, et al. Comprehensive single-cell transcriptional profiling of a multicellular organism. Science, 357(6352):661–667, 2017. Bo Chen, Kai Ming Ting, Takashi Washio, and Ye Zhu. Local contrast as an effective means to robust clustering against varying densities. Machine Learning, 107(8):1621–1645, 2018. Ronald R Coifman and Stéphane Lafon. Diffusion maps. Applied and Computational Harmonic _Analysis, 21(1):5–30, 2006._ Ronald R Coifman, Stephane Lafon, Ann B Lee, Mauro Maggioni, Boaz Nadler, Frederick Warner, and Steven W Zucker. Geometric diffusions as a tool for harmonic analysis and structure definition of data: Diffusion maps. Proceedings of the National Academy of Sciences, 102(21):7426–7431, 2005. Dorin Comaniciu and Peter Meer. Mean shift: A robust approach toward feature space analysis. _IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(5):603–619, 2002._ Thomas M Cover. Elements of information theory. John Wiley & Sons, 1999. Antonio Cuevas, Manuel Febrero, and Ricardo Fraiman. Cluster analysis: a further approach based on density estimation. Computational Statistics & Data Analysis, 36(4):441–459, 2001. Levent Ertöz, Michael Steinbach, and Vipin Kumar. Finding clusters of different sizes, shapes, and densities in noisy, high dimensional data. In Proceedings of the 2003 SIAM International _Conference on Data Mining, pp. 47–58. 
SIAM, 2003._ Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu, et al. A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD, volume 96, pp. 226–231, 1996. Jerome Friedman, Trevor Hastie, Robert Tibshirani, et al. The elements of statistical learning, volume 1. Springer Series in Statistics New York, 2001. Yandong Guo, Lei Zhang, Yuxiao Hu, Xiaodong He, and Jianfeng Gao. Ms-celeb-1m: A dataset and benchmark for large-scale face recognition. In European Conference on Computer Vision, pp. 87–102. Springer, 2016. Alexander Hinneburg and Hans-Henning Gabriel. Denclue 2.0: Fast clustering based on kernel density estimation. In International Symposium on Intelligent Data Analysis, pp. 70–80. Springer, 2007. Jeffrey J Hunter. Generalized inverses and their application to applied probability problems. Linear _Algebra and Its Applications, 45:157–198, 1982._ Mehjabin Khatoon and W Aisha Banu. An efficient method to detect communities in social networks using dbscan algorithm. Social Network Analysis and Mining, 9(1):1–12, 2019. ----- Emmanuelle A Marquis, Vicente Araullo-Peters, Yan Dong, Auriane Etienne, Svetlana Fedotova, Katsuhiko Fujii, Koji Fukuya, Evgenia Kuleshova, Anabelle Lopez, Andrew London, et al. On the use of density-based algorithms for the analysis of solute clustering in atom probe tomography data. In Proceedings of the 18th International Conference on Environmental Degradation of Materials _in Nuclear Power Systems–Water Reactors, pp. 2097–2113. Springer, 2019._ James S Marron and David Ruppert. Transformations to reduce boundary bias in kernel density estimation. Journal of the Royal Statistical Society: Series B (Methodological), 56(4):653–671, 1994. Boaz Nadler, Stephane Lafon, Ronald R Coifman, and Ioannis G Kevrekidis. Diffusion maps, spectral clustering and eigenfunctions of fokker-planck operators. arXiv preprint math/0506090, 2005. Charles Otto, Dayong Wang, and Anil K Jain. Clustering millions of faces by identity. IEEE _Transactions on Pattern Analysis and Machine Intelligence, 40(2):289–303, 2017._ Hasanzadeh PR Reza, AH Rezaie, SHH Sadeghi, MH Moradi, and M Ahmadi. A density-based fuzzy clustering technique for non-destructive detection of defects in materials. Ndt & E International, 40(4):337–346, 2007. Alex Rodriguez and Alessandro Laio. Clustering by fast search and find of density peaks. Science, 344(6191):1492–1496, 2014. Marco Rovere, Ziheng Chen, Antonio Di Pilato, Felice Pantaleo, and Chris Seez. Clue: A fast parallel clustering algorithm for high granularity calorimeters in high-energy physics. Frontiers in Big _Data, 3, 2020._ Jieming Shi, Nikos Mamoulis, Dingming Wu, and David W Cheung. Density-based place clustering in geo-social networks. In Proceedings of the 2014 ACM SIGMOD International Conference on _Management of Data, pp. 99–110, 2014._ Robin Sibson. Slink: an optimally efficient algorithm for the single-link cluster method. The _Computer Journal, 16(1):30–34, 1973._ Werner Stuetzle. Estimating the cluster tree of a density by analyzing the minimal spanning tree of a sample. Journal of Classification, 20(1):25–47, 2003. A Tramacere and C Vecchio. γ-ray dbscan: A clustering algorithm applied to fermi-lat γ-ray data. In _AIP Conference Proceedings, volume 1505, pp. 705–708. American Institute of Physics, 2012._ [UCI. UCI machine learning repository. 
http://archive.ics.uci.edu/ml/datasets.php.

Zhongdao Wang, Liang Zheng, Yali Li, and Shengjin Wang. Linkage based face clustering via graph convolution network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1117–1125, 2019.

Lei Yang, Xiaohang Zhan, Dapeng Chen, Junjie Yan, Chen Change Loy, and Dahua Lin. Learning to cluster faces on an affinity graph. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2298–2306, 2019.

Lei Yang, Dapeng Chen, Xiaohang Zhan, Rui Zhao, Chen Change Loy, and Dahua Lin. Learning to cluster faces via confidence and connectivity estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13369–13378, 2020.

Xiaohang Zhan, Ziwei Liu, Junjie Yan, Dahua Lin, and Chen Change Loy. Consensus-driven propagation in massive unlabeled data for face recognition. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 568–583, 2018.

Ye Zhu, Kai Ming Ting, and Mark J Carman. Density-ratio based clustering for discovering clusters with varying densities. Pattern Recognition, 60:983–997, 2016.

Carly GK Ziegler, Samuel J Allon, Sarah K Nyquist, Ian M Mbano, Vincent N Miao, Constantine N Tzouanas, Yuming Cao, Ashraf S Yousif, Julia Bals, Blake M Hauser, et al. Sars-cov-2 receptor ace2 is an interferon-stimulated gene in human airway epithelial cells and is detected in specific cell subsets across tissues. Cell, 181(5):1016–1035, 2020.

A APPENDIX

In this supplementary file, we provide technical proofs of the theoretical results in Section 4.3, and present extra empirical experiments regarding our kernel diffusion approach with symmetric and asymmetric Gaussian kernels applied to DBSCAN. All the numerical experiments are carried out on a standard workstation with an Intel 64-core CPU and two Nvidia P100 GPUs.

A.1 PSEUDO CODE FOR DBSCAN AND DPC

**Algorithm 1 DBSCAN**
1: Input: SetOfPoints X, Eps ε, MinPts k
2: H := {x ∈ X : |B(x, ε) ∩ X| ≥ k};
3: G := undirected graph with vertices H and an edge between x, x′ ∈ H if |x − x′| ≤ ε;
4: Output: connected components of G

The connected components of the graph are returned as clusters, and the remaining points are left unclustered and considered noise points.

**Algorithm 2 DPC**
1: Input: SetOfPoints, TruncDist d_c
2: Compute d_{i,j} for all i, j ∈ SetOfPoints;
3: For i from 1 to SetOfPoints.size:
4:   ρ_i := Σ_{j≠i} 1{d_{i,j} < d_c}
5:   δ_i := min_{j: ρ_j > ρ_i} (d_{i,j})
6: Plot the decision map M with ρ as the horizontal axis and δ as the vertical axis;
7: Mark point i with relatively higher ρ_i and δ_i as a cluster center;
8: Mark point i with relatively lower ρ_i but relatively higher δ_i as a noise point;
9: Assign each remaining point the same label as its nearest cluster center;
10: Output: SetOfPoints

A.2 PROOFS OF THEORETICAL RESULTS

**Proof of Theorem 1.** Since {D_1, · · ·, D_m} are disjoint, we have p(x, y) = 0 whenever x and y belong to different clusters. By the definition of the matrix P, for each x ∈ D_j we have

$$\int_D p(x, y)\,dF_n(y) = 1,$$

which implies that

$$\int_{x\in D_j}\int_{y\in D_j} p(x, y)\,dF_n(x)\,dF_n(y) = |D_j|.$$

Therefore,

$$\bar\rho_j\,|D_j| = \int_{x\in D_j}\int_{y\in D_j} p(x, y)\,dF_n(x)\,dF_n(y) = |D_j|,$$

which implies that ρ̄_j = 1 for any j = 1, . . ., m.
Before proceeding to the proof of Theorem 2, we need the following auxiliary lemma, which relates the stationary distribution of a Markov chain to an arbitrary vector g.

**Lemma A.1.** Let P be the transition probability matrix of a finite irreducible discrete-time Markov chain with n states, which admits a stationary distribution, denoted by the vector π. We write e = (1, . . ., 1)^T ∈ R^n for the column vector of ones. The following holds for any vector g such that g^T e ≠ 0:
(1) (I − P + eg^T) is non-singular.
(2) Let H = (I − P + eg^T)^{-1}; then π^T = g^T H.

Proof. Since π is the stationary distribution, we have π^T e = 1. Applying Theorem 3.3 in Hunter (1982) yields that the matrix (I − P + eg^T) is non-singular. Next, recall that π^T P = π^T; therefore we have

$$\pi^T(I - P + eg^T) = \pi^T - \pi^T P + \pi^T e\,g^T = \pi^T e\,g^T = g^T,$$

which implies π^T = g^T H.

**Proof of Theorem 2.** Note that for each x ∈ D, the fast kernel diffusion density ρ_FKD(x) = ∫_D p(y, x) dF_n(y) is the corresponding column average of the transition matrix P. We write the i-th column vector of P as

$$p_i = \big(p(x_1, x_i), \ldots, p(x_n, x_i)\big)^T.$$

Therefore ρ_FKD(x_i) = e^T p_i / n.

Since the Markov chain induced by the kernel k(x, y) is ergodic, the density ρ(x, t) of the diffusion process X_t converges to the limiting stationary distribution of the Markov chain, denoted by π. We can write the n-vectors g and π in the following form:

$$g = (g_1, \ldots, g_n)^T = n\big(\rho_{\mathrm{FKD}}(x_1), \ldots, \rho_{\mathrm{FKD}}(x_n)\big)^T \quad\text{and}\quad \pi = \big(\rho_{\mathrm{KD}}(x_1), \ldots, \rho_{\mathrm{KD}}(x_n)\big)^T,$$

where g_i = e^T p_i is the i-th column sum of the matrix P. As a result, we have

$$\int_D \rho_{\mathrm{FKD}}(x)\,dF_n(x) = \frac{1}{n}e^T g = 1 \quad\text{and}\quad \int_D \rho_{\mathrm{KD}}(x)\,dF_n(x) = e^T \pi = 1.$$

By the definition of g, we know

$$(eg^T)^2 = n\,eg^T \quad\text{and}\quad e^T P = g^T.$$

It follows from Lemma A.1 that (I − P + eg^T) is non-singular and π^T = g^T H, where H = (I − P + eg^T)^{-1}. We define B = I + eg^T. By simple algebraic calculation, we find that B is non-singular with

$$B^{-1} = I - \frac{eg^T}{n + 1}.$$

As a result, it is easy to see that

$$g^T B^{-1} = \frac{g^T}{n + 1} \quad\text{and}\quad H^{-1} = B - P = (I - PB^{-1})B.$$

Using the Neumann series, we have

$$H = B^{-1}(I - PB^{-1})^{-1} = B^{-1}\sum_{i=0}^{\infty}(PB^{-1})^i.$$

Thus

$$\pi^T - \frac{g^T}{n} = g^T\Big(H - \frac{I}{n}\Big) = g^T B^{-1}\sum_{i=0}^{\infty}(PB^{-1})^i - \frac{g^T}{n}.$$

Since we assume that ρ_FKD(x) ≤ c for every x ∈ D and some 0 < c < 1, this leads to

$$g^T p_j \le nc\,e^T p_j = nc\,g_j.$$

Therefore, letting κ_j be the j-th component of g^T PB^{-1}, it is straightforward that

$$\kappa_j \le \frac{nc}{n + 1}\,g_j \le c\,g_j.$$

This implies that for every x ∈ D,

$$|\rho_{\mathrm{KD}}(x) - \rho_{\mathrm{FKD}}(x)| \le \rho_{\mathrm{FKD}}(x)\left|\frac{1}{n + 1}\sum_{i=0}^{\infty}c^i - \frac{1}{n}\right| \le \rho_{\mathrm{FKD}}(x)\left|\frac{1}{(n + 1)(1 - c)} - \frac{1}{n}\right|.$$

Hence we have

$$\lim_{n\to\infty}\frac{\rho_{\mathrm{KD}}(x)}{\rho_{\mathrm{FKD}}(x)} = 1,$$

which completes the proof.

A.3 ADDITIONAL NUMERICAL EXPERIMENTS

**Naive density with different bandwidths.** To illustrate the failure of the naive density function in the scenario of Figure 1, we also plot it for a range of values of the hyperparameter ε below. We can observe that it is difficult to detect the three true underlying clusters in all cases.

Figure 5: Naive density function in 3D with different values of ε.
**Hyperparameters** The parameter ε (radius of the ball, used in ρ_naive, ρ_LC, ρ_KD^sym and ρ_FKD^sym) is tuned by searching within the range between 0.1 and 1 with an increment of 0.1; the parameter k (number of nearest neighbors, used in ρ_LC, ρ_KD^asym and ρ_FKD^asym) is tuned by searching within the range between 10% and 50% of the number of samples, with an increment of 10%.

**Metadata of benchmark datasets.** The number of samples n, the number of clusters c, and the feature dimension d for each benchmark dataset are listed in Table 4 below.

**Benchmark datasets with DBSCAN.** We provide the performance of the conventional density functions, ρ_naive and ρ_LC, and the proposed kernel diffusion density functions with symmetric and asymmetric Gaussian kernels, ρ_KD^* and ρ_FKD^* (* ∈ {sym, asym}), applied to DBSCAN on 13 benchmark datasets. The results are summarised in Table 5. Similar to DPC, we see that both ρ_KD^sym and ρ_KD^asym uniformly outperform ρ_naive and ρ_LC in terms of clustering quality. ρ_KD^asym, which has better local adaptivity analytically, achieves the best results on most datasets and outperforms the others by a significant margin on Breast-o, Control, Haberman and Seeds.

**NMI for benchmark datasets.** Below we present in Table 6 the clustering results for the benchmark datasets based on the NMI metric.

Table 4: Metadata of benchmark datasets, including sample size (n), number of clusters (c), and feature dimension (d).

| Dataset | n | c | d |
|---|---|---|---|
| Banknote | 1372 | 2 | 4 |
| Breast-d | 569 | 2 | 30 |
| Breast-o | 699 | 2 | 9 |
| Control | 600 | 6 | 60 |
| Glass | 214 | 7 | 9 |
| Haberman | 306 | 2 | 3 |
| Ionosphere | 351 | 2 | 34 |
| Iris | 150 | 3 | 4 |
| Libras | 360 | 15 | 90 |
| Pageblocks | 5473 | 5 | 10 |
| Seeds | 210 | 3 | 7 |
| Segment | 210 | 7 | 19 |
| Wine | 178 | 3 | 13 |

Table 5: Clustering performance on benchmark datasets with different density functions applied to DBSCAN. Pairwise F-score (F_P) and BCubed F-score (F_B) under optimal parameter tuning are given. The best and second-best results in each dataset are bolded and underlined, respectively.

| Dataset | F_P: ρ_naive | ρ_LC | ρ_KD^sym | ρ_KD^asym | ρ_FKD^sym | ρ_FKD^asym | F_B: ρ_naive | ρ_LC | ρ_KD^sym | ρ_KD^asym | ρ_FKD^sym | ρ_FKD^asym |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Banknote | 26.8 | 60.7 | 62.0 | 66.4 | 65.4 | 66.4 | 26.5 | 65.1 | 60.7 | 67.4 | 65.7 | 67.4 |
| Breast-d | 56.7 | 63.0 | 65.0 | 66.6 | 67.2 | 66.6 | 60.9 | 64.7 | 66.0 | 67.4 | 67.2 | 67.4 |
| Breast-o | 18.2 | 55.3 | 59.2 | 70.6 | 70.5 | 70.6 | 15.9 | 50.7 | 52.3 | 71.3 | 70.6 | 71.2 |
| Control | 32.5 | 37.1 | 51.0 | 60.3 | 48.9 | 59.1 | 34.2 | 50.1 | 53.7 | 66.9 | 51.6 | 65.7 |
| Glass | 22.0 | 29.8 | 29.8 | 42.5 | 42.0 | 42.5 | 25.8 | 36.9 | 36.9 | 45.2 | 43.5 | 45.2 |
| Haberman | 68.6 | 72.2 | 68.6 | 75.6 | 68.9 | 75.3 | 69.2 | 73.1 | 69.2 | 75.8 | 68.3 | 75.7 |
| Ionosphere | 25.9 | 68.4 | 68.0 | 74.2 | 74.2 | 74.2 | 23.8 | 64.1 | 63.7 | 72.1 | 72.1 | 72.1 |
| Iris | 66.2 | 69.8 | 66.2 | 57.2 | 73.7 | 73.3 | 67.2 | 76.6 | 67.2 | 67.0 | 79.4 | 79.0 |
| Libras | 18.1 | 12.0 | 15.6 | 13.8 | 20.2 | 13.5 | 31.1 | 16.5 | 42.1 | 45.5 | 32.9 | 37.7 |
| Pageblocks | 48.4 | 89.2 | 90.0 | 90.1 | 89.9 | 90.1 | 45.2 | 85.5 | 89.7 | 89.5 | 89.7 | 89.5 |
| Seeds | 57.8 | 47.6 | 57.8 | 63.2 | 22.4 | 62.4 | 59.2 | 53.0 | 59.2 | 70.0 | 24.4 | 69.2 |
| Segment | 18.5 | 47.9 | 54.8 | 30.8 | 41.4 | 30.8 | 22.5 | 55.2 | 66.6 | 53.6 | 59.9 | 53.6 |
| Wine | 40.5 | 40.5 | 40.5 | 49.5 | 50.0 | 49.5 | 45.7 | 45.7 | 45.7 | 52.3 | 51.3 | 52.3 |

**Number of clusters.** In Table 7, we present the number of clusters returned by the density-based methods for the benchmark datasets. It can be observed that clustering with the proposed diffusion density functions returns a significantly better estimate of the number of clusters, compared to clustering with classic density functions such as ρ_naive and ρ_LC.

Table 6: Clustering performance on benchmark datasets with different density functions applied to DPC. NMI under optimal parameter tuning is given. The best results in each dataset are bolded.

| Dataset | ρ_naive | ρ_LC | ρ_KD^sym | ρ_KD^asym | ρ_FKD^sym | ρ_FKD^asym | k-means | Spectral |
|---|---|---|---|---|---|---|---|---|
| Banknote | 27.5 | 33.0 | 21.7 | 64.8 | 53.2 | 80.2 | 34.2 | 17.3 |
| Breast-d | 43.7 | 49.1 | 46.8 | 57.4 | 55.7 | 46.1 | 62.3 | 52.6 |
| Breast-o | 30.2 | 32.7 | 37.2 | 79.1 | 36.4 | 78.4 | 74.8 | 14.0 |
| Control | 60.6 | 60.6 | 63.2 | 69.5 | 61.0 | 69.6 | 75.4 | 68.3 |
| Glass | 43.1 | 43.4 | 45.0 | 48.4 | 43.8 | 46.6 | 34.8 | 36.4 |
| Haberman | 9.5 | 5.7 | 9.5 | 3.2 | 16.9 | 3.2 | 7.8 | 6.6 |
| Ionosphere | 27.9 | 28.0 | 30.9 | 31.1 | 30.1 | 30.5 | 13.5 | 5.2 |
| Iris | 51.1 | 53.1 | 60.1 | 73.4 | 62.6 | 73.4 | 74.2 | 70.6 |
| Libras | 63.3 | 66.4 | 63.0 | 68.8 | 68.2 | 69.1 | 60.0 | 56.1 |
| Pageblocks | 8.6 | 13.0 | 11.8 | 28.7 | 14.4 | 29.1 | 13.2 | 12.1 |
| Seeds | 47.1 | 49.8 | 53.6 | 64.8 | 58.6 | 64.8 | 67.4 | 60.3 |
| Segment | 63.5 | 64.4 | 65.1 | 72.2 | 63.0 | 70.7 | 61.2 | 65.2 |
| Wine | 58.1 | 58.2 | 72.0 | 73.3 | 71.1 | 58.6 | 84.2 | 72.7 |

Table 7: Number of clusters returned by different density functions applied to DPC. The ground truth is listed in the last column.

| Dataset | ρ_naive | ρ_LC | ρ_KD^sym | ρ_KD^asym | ρ_FKD^sym | ρ_FKD^asym | Ground Truth |
|---|---|---|---|---|---|---|---|
| Banknote | 16 | 44 | 26 | 2 | 7 | 2 | 2 |
| Breast-d | 3 | 5 | 2 | 3 | 3 | 3 | 2 |
| Breast-o | 15 | 17 | 7 | 2 | 1 | 2 | 2 |
| Control | 32 | 32 | 23 | 23 | 27 | 25 | 6 |
| Glass | 4 | 21 | 4 | 8 | 3 | 7 | 7 |
| Haberman | 34 | 40 | 34 | 1 | 3 | 1 | 2 |
| Ionosphere | 12 | 11 | 7 | 6 | 7 | 7 | 2 |
| Iris | 8 | 12 | 4 | 2 | 3 | 2 | 3 |
| Libras | 12 | 60 | 2 | 61 | 100 | 55 | 15 |
| Pageblocks | 7 | 25 | 3 | 10 | 11 | 8 | 5 |
| Seeds | 10 | 10 | 2 | 6 | 3 | 6 | 3 |
| Segment | 21 | 27 | 14 | 8 | 6 | 9 | 7 |
| Wine | 3 | 7 | 3 | 5 | 3 | 2 | 3 |