# DENSITY-BASED CLUSTERING WITH KERNEL DIFFUSION

**Anonymous authors** Paper under double-blind review

ABSTRACT

Finding a suitable density function is essential for density-based clustering algorithms such as DBSCAN and DPC. A naive density corresponding to the indicator function of a unit d-dimensional Euclidean ball is commonly used in these algorithms. Such a density suffers from an inability to capture local features in complex datasets. To tackle this issue, we propose a new kernel diffusion density function, which is adaptive to data of varying local distributional characteristics and smoothness. Furthermore, we develop a surrogate that can be efficiently computed in linear time and space and prove that it is asymptotically equivalent to the kernel diffusion density function. Extensive empirical experiments on benchmark and large-scale face image datasets show that the proposed approach not only achieves a significant improvement over classic density-based clustering algorithms but also outperforms the state-of-the-art face clustering methods by a large margin.

1 INTRODUCTION

Density-based clustering algorithms are now widely used in a variety of applications, ranging from high energy physics (Tramacere & Vecchio, 2012; Rovere et al., 2020), material sciences (Marquis et al., 2019; Reza et al., 2007), and social network analysis (Shi et al., 2014; Khatoon & Banu, 2019) to molecular biology (Cao et al., 2017; Ziegler et al., 2020). In these algorithms, data points are partitioned into clusters that correspond to sufficiently or locally high-density areas with respect to an underlying probability density or a similar reference function. We call these density functions throughout this paper. These techniques are attractive to practitioners due to their nonparametric nature, which leads to flexibility in discovering clusters of arbitrary shape, whereas classic methods such as k-means and k-medoids (Friedman et al., 2001) can only detect convex (e.g., spherical) clusters. Seminal work in the context of density-based clustering includes DBSCAN (Ester et al., 1996) and DPC (Rodriguez & Laio, 2014), among many others (Ankerst et al., 1999; Cuevas et al., 2001; Comaniciu & Meer, 2002; Hinneburg & Gabriel, 2007; Stuetzle, 2003).

Most density-based clustering algorithms implicitly identify cluster centers and assign the remaining points to clusters by connecting them to nearby points of higher density. These methods require a density function, which is usually an estimate of the underlying true probability density or some variant of it. For example, a popular choice is the naive density function, obtained by simply counting the number of data points covered by the ε-neighborhood of each point x. Note that such densities are not adaptive to different regions of the distribution. One challenging scenario is when clusters in the data have varying local features, for example, size, height, spread, and smoothness. The resulting density function then has a tendency to flatten the peaks and valleys in the data distribution, which leads to underestimation of the number of clusters (see Figure 1). Many heuristic variations of DBSCAN and DPC have been proposed to magnify the local features and thereby make the clustering task easier (Campello et al., 2013; Chen et al., 2018; Ertöz et al., 2003; Zhu et al., 2016). Most of these methods can be viewed as performing clustering on certain transformations of the naive density function.
However, if the naive density function itself is problematic in the first place, these methods become less effective. Moreover, even if we apply adaptive alternatives to modify the classic density functions, there are other contentious issues with the generally used linear kernel density estimator (KDE). It often suffers from severe boundary bias (Marron & Ruppert, 1994) and is acknowledged to be computationally expensive. These phenomena prevent the classic density functions from being practically useful and reliable, especially for large-scale and complex clustering tasks.

Figure 1: (a) Data generated from a Gaussian mixture model with 3 components, each with differing variance and weight. (b) Naive density function in 2D (top) and 3D (bottom): only one peak can be identified. (c) Proposed kernel diffusion density function: 3 clusters can be easily discovered.

To overcome these problems in density-based clustering algorithms, in this paper we propose a general approach to build the so-called kernel diffusion density function to replace classic density functions. The key idea is to construct the density from a user-specified bivariate kernel that has the desired local adaptive properties. Instead of using the naive density function and its variants, we utilize the bivariate kernel to derive a transition probability. A diffusion process is induced by this transition probability, which admits a limiting and stationary distribution. This limiting distribution serves as a plausible density function for clustering with reduced error. Under this framework, we provide examples of symmetric and asymmetric bivariate kernels to construct the kernel diffusion density function, which can tackle clustering of complex and locally varying data. We apply the resulting adapted DBSCAN and DPC algorithms to widely different empirical datasets and show significant improvement in each of these analyses.

The main contributions of this paper are summarized below:

- We introduce new bivariate kernel functions and construct the associated kernel diffusion processes. Based on the diffusion process, we propose a kernel diffusion density function to adapt density-based clustering algorithms such as DBSCAN and DPC, which attains accuracy in the presence of varying local features.
- We derive a computationally much more efficient surrogate, and show analytically that it is asymptotically equivalent to the proposed kernel diffusion density function.
- Through extensive experiments, we demonstrate the superiority of the kernel diffusion density function over the naive density function and its variants when applied to DBSCAN and DPC, and show that it outperforms state-of-the-art GCN-based methods on face clustering tasks.

2 RELATED WORK

**Density-Based Clustering** There is a vast literature on adapting density-based clustering algorithms to handle large variations across clusters in the data. DPC itself is such a refinement of DBSCAN, as it determines cluster centers not only by highest density values but also by taking into account their distances from each other, and thus generally performs better in complex clustering tasks. Other attempts include rescaling the data to have relative reference measures instead of KDE (Zhu et al., 2016; Chen et al., 2018), and using the number of shared nearest neighbors between two points to replace the geometric distance (Ertöz et al., 2003).
**Diffusion Maps** The technique of diffusion maps (Coifman et al., 2005; Coifman & Lafon, 2006) gives a multi-scale organization of the data according to their underlying geometric structure. It uses a local similarity measure to create a diffusion process on the data which integrates local geometry at different scales along time t. Generally speaking, the diffusion segments the data into several smaller clusters for small t and groups the data into one cluster for large t. Applying the eigenfunctions at a carefully selected time t leads to good macroscopic representations of the data, which is useful in dimension reduction and spectral clustering (Nadler et al., 2005).

**Face Clustering** Face clustering has been extensively studied as an important application in machine learning. Traditional algorithms include k-means, hierarchical clustering (Sibson, 1973) and ARO (Otto et al., 2017). Many recent works take advantage of supervised information and GCN models, achieving impressive improvements compared to traditional algorithms. To name a few, CDP (Zhan et al., 2018) proposes to aggregate the features extracted by different models; L-GCN (Wang et al., 2019) predicts the linkage in an instance pivot subgraph; LTC (Yang et al., 2019) generates a series of subgraphs as proposals and detects face clusters thereon; and GCN(V+E) (Yang et al., 2020) learns both confidence and connectivity by GCN. In this paper we demonstrate that the proposed density-based clustering algorithm with kernel diffusion, as a general clustering approach, even outperforms these state-of-the-art methods that are specially designed for face clustering.

3 PRELIMINARIES

3.1 NOTATIONS

Let the dataset D = {x_1, . . ., x_n} ⊂ R^d be n i.i.d. samples drawn from a distribution F with density f on R^d. Let F_n denote the corresponding empirical distribution with respect to D, i.e., $F_n(A) = \frac{1}{n}\sum_{i=1}^{n}\mathbf{1}_A(x_i)$, where $\mathbf{1}_A(\cdot)$ denotes the indicator function of the set A. We write ||u|| for the Euclidean norm of a vector u. Let B(x, ε) and V_d denote the d-dimensional ε-ball centered at x and the volume of the unit ball B(0, 1), respectively. Let N_k(x) denote the set of k-nearest neighbors of point x within the dataset D.

3.2 DENSITY FUNCTION

Density-based algorithms perform clustering by specifying and segmenting high-value areas of a density function denoted by ρ. Usually, we calculate each ρ(x_i), and then identify cluster centers with (locally) highest values. Many popular algorithms such as DBSCAN and DPC employ the following naive density function:

$$\rho_{\mathrm{naive}}(x) = \frac{1}{n\varepsilon^d}\sum_{y\in D}\frac{\mathbf{1}_{B(x,\varepsilon)}(y)}{V_d}. \tag{1}$$

The naive density function ρ_naive is actually an empirical estimate of f for carefully chosen ε. It is easy to observe that for clustering purposes we only care about ρ_naive(x) up to a normalising constant, which makes it simply equivalent to counting the total number of data points in the ε-ball around x.

In practice, the data distribution may be very complex and contain varying local features that are difficult to detect. The naive density in (1), with the same radius ε for all x, usually suffers from unsatisfactory empirical performance, for example, failing to identify small clusters with fewer data points.
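To make equation (1) concrete, the following is a minimal NumPy sketch of the naive density; the function name `naive_density` and the brute-force O(n²) pairwise-distance computation are our own illustrative choices, not part of the paper.

```python
import numpy as np
from math import gamma, pi

def naive_density(X, eps):
    """Naive density of eq. (1): normalised count of points in the eps-ball around each x.

    X is an (n, d) array of samples; returns an array of length n.
    """
    n, d = X.shape
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    counts = (sq_dists <= eps ** 2).sum(axis=1)      # |B(x, eps) ∩ D|
    v_d = pi ** (d / 2) / gamma(d / 2 + 1)           # volume of the unit d-ball
    return counts / (n * eps ** d * v_d)
```

As noted above, only the relative ordering of ρ_naive matters for clustering, so the constant nε^d V_d could be dropped in practice.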
One possible way to alleviate this problem is through a transformation into the following local contrast (LC) function (Chen et al., 2018):

$$\rho_{\mathrm{LC}}(x) = \sum_{y\in N_k(x)} \mathbf{1}_{\rho_{\mathrm{naive}}(x) > \rho_{\mathrm{naive}}(y)}. \tag{2}$$

In this way, ρ_LC compares the density of each data point with its k-nearest neighbors. To see the benefit of LC, let us consider x to be a cluster center. After local contrasting, ρ_LC(x) is likely to reach the value of k regardless of the size of this cluster.

However, density functions like ρ_LC still depend heavily on the underlying performance of ρ_naive. This restricts their application to clustering data with challenging local features.

4 METHODOLOGY

In this section, we present a new type of density-based clustering algorithm, based on the notion of a kernel diffusion density function. Towards this end, we introduce a kernel diffusion density function, which takes account of local adaptivity and is well-suited for clustering purposes. We provide details on how to derive this density function from a diffusion process induced by bivariate kernels. We also provide a surrogate density function that is computationally more efficient.

4.1 DIFFUSION PROCESS AND KERNEL DIFFUSION DENSITY

Consider a bivariate kernel function k : D × D → R^+ such that:

- k(x, y) is non-negative, i.e., k(x, y) ≥ 0.
- k(x, y) is F_n-integrable with respect to both x and y.

We define $d(x) = n\int_D k(x, y)\,dF_n(y)$ as a local measure of the volume at x and define

$$p(x, y) = \frac{k(x, y)}{d(x)}. \tag{3}$$

It is easy to see that p(x, y) satisfies the conservation property $n\int_D p(x, y)\,dF_n(y) = 1$. As a result, p(x, y) can be viewed as the probability for a random walk on the dataset to move from point x to point y, which induces a Markov chain on D with n × n transition matrix P = [p(x_i, x_j)]. This technique is standard in various applications, known as the normalized graph Laplacian construction. For example, we can view D as a graph, L = I − P as the normalized graph Laplacian, and d(x) as a normalization factor.

For t ≥ 0, the probability of transiting from x to y in t time steps is given by P^t, the t-th power of P. Running the Markov chain forward in time, we observe the dataset at different scales, which is the diffusion process X_t on D. Let ρ(x, t) : D × R^+ → R^+ be the associated probability density, which is governed by the following second-order differential equation with initial condition:

$$\frac{\partial}{\partial t}\rho(x, t) = -L\rho(x, t), \qquad \rho(x, 0) = \phi_0(x), \tag{4}$$

where φ_0(x) is a probability density at time t = 0. In practice we can use any valid choice of φ_0(x), e.g. the uniform density.

To give an explicit example of the diffusion process, consider a sub-class of k, namely isotropic kernels, where k(x, y) = K(||x − y||²/h) for some function K : R → R^+. Here h has a dual interpretation: as a scale parameter from which local information is inferred, and as the time step h = Δt at which the random walk jumps. Then we can define the forward Chapman-Kolmogorov operator T_F as

$$T_F\,\phi_0(x) = n\int_D p(x, y)\,\phi_0(y)\,dF_n(y).$$

Note that T_F φ_0 is the data distribution at time t = h; T_F can thus be viewed as a continuous analogue of left multiplication by the transition matrix P. Letting h → 0, the random walk converges to a continuous diffusion process whose probability density evolves continuously in t.
In this case, we can explicitly write the second-order differential equation in (4) as:

$$\frac{\partial}{\partial t}\rho(x, t) = \lim_{h\to 0}\frac{\rho(x, t + h) - \rho(x, t)}{h} = \lim_{h\to 0}\frac{T_F - I}{h}\,\rho(x, t), \tag{5}$$

where $\lim_{h\to 0}(T_F - I)/h$ is the conventional infinitesimal generator of the process.

Now we are ready to introduce our kernel diffusion density function.

**Definition 1. (Kernel diffusion density function)** Suppose the Markov chain induced by P is ergodic. We define the kernel diffusion density function as the limiting probability density of the diffusion process X_t, i.e.,

$$\rho_{\mathrm{KD}}(x) = \lim_{t\to\infty}\rho(x, t). \tag{6}$$

Intuitively, on the one hand, with increased values of t we expect the diffusion process X_t to gradually reveal the geometric structure (such as high-density regions) of the data distribution F. To see this, note that the transition probability P reflects connectivity between data points. We can interpret a cluster as an underlying geometric structure in which the probability of staying within the region is high during a transition. In the diffusion process, the probability of following a path along a structure usually increases with t, as the involved data points are dense and highly connected. Such a path therefore consists of short, high-probability jumps, whilst paths that do not follow any structure consist of long, low-probability jumps, which lowers their overall probability. As a result, geometric structures of F are magnified during the diffusion. On the other hand, by taking certain sophisticated forms of k(x, y) that account for local adaptivity, we also slow down the diffusion to avoid trivial geometric structures such as one big cluster containing all the data points. In this way, we can eventually identify the correct geometric structure at the right scale.

4.2 LOCALLY ADAPTIVE KERNELS

To address local adaptability in the kernel diffusion density function, we propose the following two bivariate kernels. Both of them are very simple variations of the most commonly used classic kernels.

**Symmetric-Gaussian kernel:**

$$k(x, y) = \exp\left(-\frac{\|x - y\|^2}{h}\right)\mathbf{1}_{B(x,\varepsilon)}(y). \tag{7}$$

Here h and ε are both hyper-parameters. We call this kernel symmetric since k(x, y) = k(y, x).

**Asymmetric-Gaussian kernel:**

$$k(x, y) = \exp\left(-\frac{\|x - y\|^2}{h}\right)\mathbf{1}_{N_k(x)}(y). \tag{8}$$

Here h and k are hyper-parameters. Note that in this case k(x, y) is asymmetric, as y ∈ N_k(x) does not imply x ∈ N_k(y).

The bivariate kernels defined in (7) and (8) are simply combinations of the classic Gaussian kernel with the ε-neighbourhood or k-nearest-neighbour kernels, respectively. With these simple combinations, we truncate the Gaussian kernel to local areas, and the contribution of each point y to the construction of the density function ρ_KD(x) depends not only on the distance ||y − x|| but also on the local geometric structure around x. Hence, the new kernels are adaptive at different x, which is expected to lead to better clustering performance in the presence of local features. We remark that the Asymmetric-Gaussian kernel takes into account a varying neighborhood around each x, and is thus more adaptive compared to the Symmetric-Gaussian kernel. Although here we only provide two examples of locally adaptive kernels, other options can easily be created in a similar spirit under this framework, e.g., changing the Gaussian kernel to other kernels or changing the ε-neighbourhood (k-nearest-neighbour) kernels to other locally truncated functions.
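The following is a minimal NumPy sketch of this construction: it builds the transition matrix P of equation (3) from either of the two truncated kernels above and approximates ρ_KD of Definition 1 by iterating the chain from a uniform φ_0 until the density stops changing. The function names, the brute-force distance computation, and the power-iteration stopping rule are our own illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def transition_matrix(X, h=0.5, eps=None, k=None):
    """Row-stochastic matrix P with p(x, y) = k(x, y) / d(x), eq. (3).

    Pass eps for the Symmetric-Gaussian kernel (eq. 7) or k for the
    Asymmetric-Gaussian kernel (eq. 8).  X is an (n, d) array.
    """
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)   # ||x - y||^2
    K = np.exp(-sq / h)
    if eps is not None:                       # keep y inside B(x, eps)
        K = K * (sq <= eps ** 2)
    elif k is not None:                       # keep y among the k nearest neighbours of x
        idx = np.argsort(sq, axis=1)[:, :k]
        mask = np.zeros_like(K, dtype=bool)
        np.put_along_axis(mask, idx, True, axis=1)
        K = K * mask
    d = K.sum(axis=1, keepdims=True)          # local volume d(x); positive since x lies in its own neighbourhood
    return K / d

def kernel_diffusion_density(P, tol=1e-10, max_iter=10_000):
    """Approximate rho_KD (Definition 1) by running the chain from a uniform phi_0."""
    rho = np.full(P.shape[0], 1.0 / P.shape[0])
    for _ in range(max_iter):
        nxt = rho @ P                         # one diffusion step
        if np.abs(nxt - rho).max() < tol:
            break
        rho = nxt
    return rho
```

The vector returned by `kernel_diffusion_density` can then be plugged into DPC or DBSCAN in place of ρ_naive.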
Once k(x, y) is determined, we can derive the corresponding density function ρ_KD. Next, we simply apply any density-based clustering procedure such as DPC or DBSCAN with ρ_KD in place of the naive density function ρ_naive. In Section 5, we assess the empirical performance of the proposed kernel diffusion density function with the above two locally adaptive kernels. They outperform existing density-based algorithms and other state-of-the-art methods.

4.3 FAST KERNEL DIFFUSION DENSITY

The kernel diffusion density function ρ_KD can be calculated as the stationary distribution of the Markov chain induced by the transition matrix P. Numerically, we can solve for it by iteratively right-multiplying ρ(x, t) by P until convergence, or by applying a QR decomposition to P. These methods are expensive in terms of computational cost, especially when the sample size n is large. To tackle this problem, we propose the following surrogate of ρ_KD(x), which is computationally more efficient.

**Definition 2. (Fast kernel diffusion density function)** Let p(y, x) be the transition probability from point y to point x, as defined in equation (3). We define the fast kernel diffusion density function as

$$\rho_{\mathrm{FKD}}(x) = \int_D p(y, x)\,dF_n(y). \tag{9}$$

It is straightforward to see that ρ_FKD can be obtained in linear time and memory space, as we only need to compute the column averages of the matrix P. Here we show that ρ_FKD is not only computationally efficient but also suitable for detecting local features. This is illustrated through the following Theorem 1. Consider the special case k(x, y) = 1_{B(x,ε)}(y). Then it is easy to verify that

$$\rho_{\mathrm{FKD}}(x) = \frac{1}{C_d}\sum_{y\in B(x,\varepsilon)}\frac{1}{\rho_{\mathrm{naive}}(y)},$$

where C_d = nε^d V_d is a normalising constant. In this way, we build a connection between ρ_FKD and the naive density function ρ_naive in this special example.

**Theorem 1.** Consider the above special case k(x, y) = 1_{B(x,ε)}(y). In addition, assume the dataset D = {x_1, ..., x_n} can be split into m disjoint clusters, i.e., D = D_1 ∪ · · · ∪ D_m, and for each x ∈ D, B(x, ε) only contains data points that belong to the same cluster as x. Denote by $\bar\rho_j = \frac{1}{|D_j|}\sum_{x\in D_j}\rho_{\mathrm{FKD}}(x)$ the average density in cluster j. We have

$$\bar\rho_1 = \cdots = \bar\rho_m = 1.$$

Theorem 1 demonstrates that the averaged ρ_FKD in each cluster is the same regardless of cluster sizes and other local features. This shows that ρ_FKD elevates the density of small clusters, which is essential for finding the density peaks of small clusters.

Previously we claimed that ρ_FKD is a surrogate of the kernel diffusion density ρ_KD. Next, we want to reveal the relationship between these two density functions from an asymptotic viewpoint. To proceed, we will need the following assumption.

**Assumption 1.** There exists some positive constant c < 1 that is independent of n, such that ρ_FKD(x) ≤ c holds for every x ∈ D.

This is a very mild assumption, since it always holds that ρ_FKD(x) < 1, and the average of ρ_FKD(x) over the dataset is $\int_D \rho_{\mathrm{FKD}}(x)\,dF_n(x) = 1/n$, which vanishes as n → ∞. Now we are ready to present the following theorem, which characterises the closeness between ρ_FKD and ρ_KD.

**Theorem 2.** Suppose that Assumption 1 holds and the Markov chain induced by the kernel k(x, y) is ergodic. We have

$$\frac{\rho_{\mathrm{KD}}(x)}{\rho_{\mathrm{FKD}}(x)} \xrightarrow{a.s.} 1.$$

As shown in the Appendix, the almost sure convergence in Theorem 2 is of a fast rate at n^{-1}. Thus it is safe to use ρ_FKD in place of ρ_KD in finite-sample experiments.
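In code, the surrogate of Definition 2 reduces to a column average of P. The sketch below reuses the hypothetical `transition_matrix` and `kernel_diffusion_density` helpers from the Section 4.2 sketch and shows how one might sanity-check the ratio in Theorem 2 on synthetic data; the mixture parameters are arbitrary, and the printed ratio is only expected to concentrate around 1 as n grows.

```python
import numpy as np

def fast_kernel_diffusion_density(P):
    """Fast surrogate rho_FKD of Definition 2: rho_FKD(x_i) = (1/n) * sum_j p(x_j, x_i)."""
    return P.mean(axis=0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Two Gaussian blobs of very different sizes, loosely mimicking Figure 1.
    X = np.vstack([rng.normal(0.0, 1.0, size=(400, 2)),
                   rng.normal(3.0, 0.5, size=(60, 2))])
    P = transition_matrix(X, h=0.5, k=30)
    ratio = kernel_diffusion_density(P) / fast_kernel_diffusion_density(P)
    print(ratio.min(), ratio.max())   # per Theorem 2, the ratio should concentrate around 1 for large n
```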
This result is also verified by our numerical experiments in Section 5.

5 EXPERIMENTS

In this section, we empirically evaluate the proposed kernel diffusion density functions against ρ_naive and ρ_LC in density-based clustering algorithms, and also compare them with other state-of-the-art methods. We denote by ρ_KD^sym and ρ_FKD^sym the kernel diffusion density function and its fast surrogate with the symmetric-Gaussian kernel, respectively. Similarly, we denote by ρ_KD^asym and ρ_FKD^asym the two proposed density functions with the asymmetric-Gaussian kernel. We examine their performance on a wide range of datasets. The clustering results are measured by Pairwise F-score (Banerjee et al., 2005), BCubed F-score (Amigó et al., 2009) or NMI (Cover, 1999).

Table 1: Clustering performance on benchmark datasets with different density functions applied to DPC. Pairwise F-score (F_P) and BCubed F-score (F_B) under optimal parameter tuning are given. The best and second-best results in each dataset are bolded and underlined, respectively.

| Dataset | F_P: ρ_naive | ρ_LC | ρ_KD^sym | ρ_KD^asym | ρ_FKD^sym | ρ_FKD^asym | F_B: ρ_naive | ρ_LC | ρ_KD^sym | ρ_KD^asym | ρ_FKD^sym | ρ_FKD^asym |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Banknote | 54.3 | 31.6 | 67.2 | 83.9 | 67.2 | 93.6 | 57.7 | 31.8 | 67.2 | 85.1 | 67.2 | 93.6 |
| Breast-d | 55.9 | 51.8 | 78.0 | 69.1 | 67.4 | 72.6 | 59.0 | 58.7 | 76.0 | 69.7 | 69.4 | 72.2 |
| Breast-o | 57.6 | 70.7 | 82.8 | 92.9 | 82.7 | 92.9 | 52.2 | 74.1 | 75.9 | 92.2 | 75.8 | 92.2 |
| Control | 48.6 | 49.3 | 49.0 | 63.9 | 52.5 | 64.5 | 51.6 | 52.4 | 52.0 | 70.8 | 55.1 | 71.8 |
| Glass | 36.9 | 39.0 | 46.3 | 48.1 | 44.8 | 47.8 | 42.7 | 45.7 | 55.1 | 56.9 | 53.5 | 57.1 |
| Haberman | 66.9 | 64.1 | 74.5 | 75.7 | 75.8 | 75.7 | 66.9 | 63.3 | 74.5 | 75.8 | 75.9 | 75.8 |
| Ionosphere | 27.4 | 28.3 | 46.9 | 54.9 | 46.0 | 53.9 | 25.0 | 25.8 | 42.6 | 52.5 | 41.7 | 49.2 |
| Iris | 54.3 | 53.8 | 65.8 | 74.6 | 69.2 | 74.6 | 61.6 | 62.3 | 72.7 | 80.0 | 74.0 | 80.0 |
| Libras | 20.0 | 22.9 | 29.3 | 31.5 | 26.0 | 31.0 | 26.8 | 29.1 | 37.8 | 41.8 | 33.3 | 39.8 |
| Pageblocks | 92.9 | 93.0 | 90.5 | 90.2 | 89.7 | 90.2 | 89.9 | 90.0 | 89.8 | 89.7 | 89.6 | 89.7 |
| Seeds | 54.3 | 54.9 | 68.0 | 78.0 | 69.5 | 78.0 | 54.3 | 55.4 | 72.4 | 78.7 | 72.9 | 78.7 |
| Segment | 48.4 | 48.0 | 57.1 | 58.0 | 41.4 | 56.1 | 64.2 | 63.8 | 67.1 | 69.2 | 60.6 | 68.2 |
| Wine | 45.2 | 61.1 | 56.6 | 68.0 | 60.0 | 65.3 | 46.0 | 61.9 | 61.5 | 74.7 | 66.3 | 71.4 |

5.1 PERFORMANCE ON BENCHMARK DATASETS

We now discuss the performance on 13 benchmark datasets (∼100 to ∼5,000 data points) from the UCI repository (UCI). The metadata are summarised in the Appendix. As summarised in Table 1, both ρ_KD^sym and ρ_KD^asym uniformly outperform ρ_naive and ρ_LC in terms of F-scores; results based on NMI are deferred to the Appendix. The proposed kernel diffusion density function with the asymmetric Gaussian kernel, ρ_KD^asym, which enjoys better local adaptivity analytically, achieves the best results on most datasets and outperforms ρ_naive and ρ_LC by a large margin. It is worth noting that the two fast surrogates, ρ_FKD^sym and ρ_FKD^asym, achieve comparable results to their original counterparts, ρ_KD^sym and ρ_KD^asym. In the Appendix, similar results are observed for the same set of density functions applied to DBSCAN.

Figure 2: Precision-Recall curves of different approaches applied to DPC on the MS1M dataset, using (a) the Pairwise metric and (b) the BCubed metric.

5.2 PERFORMANCE ON FACE IMAGE DATASETS

Clustering face images according to their latent identity has become an important application in recent years. It is challenging in the sense that face image datasets usually contain thousands of identities, corresponding to thousands of clusters.
Meanwhile, the number of images per identity (cluster) varies considerably, corresponding to a wide variety of cluster sizes. We assess the performance of the proposed approach on two popular face image datasets: emore_200k (Zhan et al., 2018) and MS1M (Guo et al., 2016).

**emore_200k.** The dataset contains 2,577 identities with 200,000 images, following the protocol in Zhan et al. (2018). The proposed density-based approaches are compared with k-means, HAC (Sibson, 1973), ARO (Otto et al., 2017), and CDP (Zhan et al., 2018); results are reported in Table 2. Again, clustering with the proposed kernel diffusion density functions outperforms state-of-the-art approaches such as CDP by a large margin.

Table 2: Clustering performance on emore_200k. BCubed precision, recall and F-score are reported. "-" indicates not available.

| | Algorithm | # clusters | Precision | Recall | F_B |
|---|---|---|---|---|---|
| Baseline | k-means | 2,577 | 94.24 | 74.89 | 83.45 |
| | HAC | 2,577 | 97.74 | 88.02 | 92.62 |
| | ARO | 85,150 | 52.96 | 16.93 | 25.66 |
| | CDP | - | 89.35 | 88.98 | 89.16 |
| Density-based | ρ_naive | 7,928 | 92.36 | 78.14 | 84.65 |
| | ρ_LC | 3,485 | 96.15 | 86.58 | 91.11 |
| | ρ_KD^sym | 2,781 | 95.82 | 93.24 | 94.51 |
| | ρ_KD^asym | 2,546 | 95.48 | 93.82 | 94.64 |
| | ρ_FKD^sym | 3,622 | 95.27 | 92.54 | 93.89 |
| | ρ_FKD^asym | 2,569 | 96.37 | 93.93 | 95.13 |

**MS1M.** The dataset contains 8,573 identities with around 584,000 images, following the protocol in Yang et al. (2020). We set ε to 0.8, k to 200, and h to 0.5 for the density-based methods. We report the clustering performance in Table 3. Precision versus recall curves for the different density functions (applied to DPC) are plotted in Figure 2. In Table 3, the proposed kernel diffusion density functions outperform ρ_naive and ρ_LC. Note that GCN-based methods such as L-GCN (Wang et al., 2019), LTC (Yang et al., 2019) and GCN(V+E) (Yang et al., 2020) achieve generally better clustering performance than unsupervised methods due to their supervised nature. However, it is quite encouraging to see that the proposed kernel diffusion approaches, although they are also unsupervised clustering methods, considerably outperform the GCN-based methods.

Table 3: Clustering performance on MS1M. Pairwise F-score and BCubed F-score are reported. "-" indicates not available.

| | Algorithm | # clusters | F_P | F_B |
|---|---|---|---|---|
| Unsupervised | k-means | 8,573 | 79.21 | 81.23 |
| | HAC | 8,573 | 70.63 | 70.46 |
| | ARO | - | 13.60 | 17.00 |
| | CDP | - | 75.02 | 78.70 |
| Supervised | L-GCN | - | 78.68 | 84.37 |
| | LTC | - | 85.66 | 85.52 |
| | GCN(V+E) | - | 87.55 | 85.94 |
| Density-based | ρ_naive | 59,551 | 78.37 | 79.35 |
| | ρ_LC | 24,019 | 83.61 | 85.06 |
| | ρ_KD^sym | - | - | - |
| | ρ_KD^asym | 22,869 | 88.15 | 87.14 |
| | ρ_FKD^sym | 34,246 | 84.40 | 85.37 |
| | ρ_FKD^asym | 22,927 | 87.26 | 87.41 |

5.3 SENSITIVITY ANALYSIS

Next, we examine the sensitivity of the proposed kernel diffusion density functions to hyperparameters and compare it with ρ_naive and ρ_LC. The results are obtained via extensive experiments on emore_200k and MS1M, and are shown in Figure 3. We can see that the clustering performance of ρ_KD^sym is much more stable than that of ρ_naive and ρ_LC when we vary the value of ε, whilst ρ_KD^asym is robust to the parameter k, and both ρ_KD^sym and ρ_KD^asym are quite robust to the parameter h.

Figure 3: Sensitivity analysis on emore_200k and MS1M. We investigate the clustering performance by varying the following parameters: (a) radius ε of the ε-ball; (b) number k of nearest neighbors; (c) bandwidth h of the Gaussian kernel.

5.4 COMPUTATIONAL COST

We carried out a series of experiments on MS1M to demonstrate the computational efficiency of the fast surrogate ρ_FKD in terms of time and space.
With a collection of subsampled data from MS1M at different percentile levels, we run both the kernel diffusion density ρKD and the fast surrogate ρFKD. As we can observe from Figure 4, the running time and memory usage of ρKD increase dramatically with the sample size. Whilst ρFKD retains a very low level of computational cost. This suggests that ρFKD, which achieves an excellent computational efficiency, should be favored in practice. 6 CONCLUSION Figure 4: Running time and memory usage of the proposed methods at different sample sizes on MS1M. Density-based clustering has a profound impact on machine learning and data mining. However, the underpinning naive density function suffers from detecting varying local features, causing extra errors in the clustering. We propose a new set of density functions based on the kernel diffusion process to resolve this problem, which is adaptive to density regions of varying local distributional features. We demonstrate that DBSCAN and DPC adapted by the proposed approach have improved clustering performance comparing to their classic versions and other state-of-the-art methods. ----- REFERENCES Enrique Amigó, Julio Gonzalo, Javier Artiles, and Felisa Verdejo. A comparison of extrinsic clustering evaluation metrics based on formal constraints. Information Retrieval, 12(4):461–486, 2009. Mihael Ankerst, Markus M Breunig, Hans-Peter Kriegel, and Jörg Sander. Optics: Ordering points to identify the clustering structure. ACM Sigmod Record, 28(2):49–60, 1999. Arindam Banerjee, Chase Krumpelman, Joydeep Ghosh, Sugato Basu, and Raymond J Mooney. Model-based overlapping clustering. In Proceedings of the eleventh ACM SIGKDD International _Conference on Knowledge Discovery in Data Mining, pp. 532–537, 2005._ Ricardo JGB Campello, Davoud Moulavi, and Jörg Sander. Density-based clustering based on hierarchical density estimates. In Pacific-Asia Conference on Knowledge Discovery and Data _Mining, pp. 160–172. Springer, 2013._ Junyue Cao, Jonathan S Packer, Vijay Ramani, Darren A Cusanovich, Chau Huynh, Riza Daza, Xiaojie Qiu, Choli Lee, Scott N Furlan, Frank J Steemers, et al. Comprehensive single-cell transcriptional profiling of a multicellular organism. Science, 357(6352):661–667, 2017. Bo Chen, Kai Ming Ting, Takashi Washio, and Ye Zhu. Local contrast as an effective means to robust clustering against varying densities. Machine Learning, 107(8):1621–1645, 2018. Ronald R Coifman and Stéphane Lafon. Diffusion maps. Applied and Computational Harmonic _Analysis, 21(1):5–30, 2006._ Ronald R Coifman, Stephane Lafon, Ann B Lee, Mauro Maggioni, Boaz Nadler, Frederick Warner, and Steven W Zucker. Geometric diffusions as a tool for harmonic analysis and structure definition of data: Diffusion maps. Proceedings of the National Academy of Sciences, 102(21):7426–7431, 2005. Dorin Comaniciu and Peter Meer. Mean shift: A robust approach toward feature space analysis. _IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(5):603–619, 2002._ Thomas M Cover. Elements of information theory. John Wiley & Sons, 1999. Antonio Cuevas, Manuel Febrero, and Ricardo Fraiman. Cluster analysis: a further approach based on density estimation. Computational Statistics & Data Analysis, 36(4):441–459, 2001. Levent Ertöz, Michael Steinbach, and Vipin Kumar. Finding clusters of different sizes, shapes, and densities in noisy, high dimensional data. In Proceedings of the 2003 SIAM International _Conference on Data Mining, pp. 47–58. 
SIAM, 2003._ Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu, et al. A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD, volume 96, pp. 226–231, 1996. Jerome Friedman, Trevor Hastie, Robert Tibshirani, et al. The elements of statistical learning, volume 1. Springer Series in Statistics New York, 2001. Yandong Guo, Lei Zhang, Yuxiao Hu, Xiaodong He, and Jianfeng Gao. Ms-celeb-1m: A dataset and benchmark for large-scale face recognition. In European Conference on Computer Vision, pp. 87–102. Springer, 2016. Alexander Hinneburg and Hans-Henning Gabriel. Denclue 2.0: Fast clustering based on kernel density estimation. In International Symposium on Intelligent Data Analysis, pp. 70–80. Springer, 2007. Jeffrey J Hunter. Generalized inverses and their application to applied probability problems. Linear _Algebra and Its Applications, 45:157–198, 1982._ Mehjabin Khatoon and W Aisha Banu. An efficient method to detect communities in social networks using dbscan algorithm. Social Network Analysis and Mining, 9(1):1–12, 2019. ----- Emmanuelle A Marquis, Vicente Araullo-Peters, Yan Dong, Auriane Etienne, Svetlana Fedotova, Katsuhiko Fujii, Koji Fukuya, Evgenia Kuleshova, Anabelle Lopez, Andrew London, et al. On the use of density-based algorithms for the analysis of solute clustering in atom probe tomography data. In Proceedings of the 18th International Conference on Environmental Degradation of Materials _in Nuclear Power Systems–Water Reactors, pp. 2097–2113. Springer, 2019._ James S Marron and David Ruppert. Transformations to reduce boundary bias in kernel density estimation. Journal of the Royal Statistical Society: Series B (Methodological), 56(4):653–671, 1994. Boaz Nadler, Stephane Lafon, Ronald R Coifman, and Ioannis G Kevrekidis. Diffusion maps, spectral clustering and eigenfunctions of fokker-planck operators. arXiv preprint math/0506090, 2005. Charles Otto, Dayong Wang, and Anil K Jain. Clustering millions of faces by identity. IEEE _Transactions on Pattern Analysis and Machine Intelligence, 40(2):289–303, 2017._ Hasanzadeh PR Reza, AH Rezaie, SHH Sadeghi, MH Moradi, and M Ahmadi. A density-based fuzzy clustering technique for non-destructive detection of defects in materials. Ndt & E International, 40(4):337–346, 2007. Alex Rodriguez and Alessandro Laio. Clustering by fast search and find of density peaks. Science, 344(6191):1492–1496, 2014. Marco Rovere, Ziheng Chen, Antonio Di Pilato, Felice Pantaleo, and Chris Seez. Clue: A fast parallel clustering algorithm for high granularity calorimeters in high-energy physics. Frontiers in Big _Data, 3, 2020._ Jieming Shi, Nikos Mamoulis, Dingming Wu, and David W Cheung. Density-based place clustering in geo-social networks. In Proceedings of the 2014 ACM SIGMOD International Conference on _Management of Data, pp. 99–110, 2014._ Robin Sibson. Slink: an optimally efficient algorithm for the single-link cluster method. The _Computer Journal, 16(1):30–34, 1973._ Werner Stuetzle. Estimating the cluster tree of a density by analyzing the minimal spanning tree of a sample. Journal of Classification, 20(1):25–47, 2003. A Tramacere and C Vecchio. γ-ray dbscan: A clustering algorithm applied to fermi-lat γ-ray data. In _AIP Conference Proceedings, volume 1505, pp. 705–708. American Institute of Physics, 2012._ [UCI. UCI machine learning repository. 
http://archive.ics.uci.edu/ml/datasets.php.

Zhongdao Wang, Liang Zheng, Yali Li, and Shengjin Wang. Linkage based face clustering via graph convolution network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1117–1125, 2019.

Lei Yang, Xiaohang Zhan, Dapeng Chen, Junjie Yan, Chen Change Loy, and Dahua Lin. Learning to cluster faces on an affinity graph. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2298–2306, 2019.

Lei Yang, Dapeng Chen, Xiaohang Zhan, Rui Zhao, Chen Change Loy, and Dahua Lin. Learning to cluster faces via confidence and connectivity estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13369–13378, 2020.

Xiaohang Zhan, Ziwei Liu, Junjie Yan, Dahua Lin, and Chen Change Loy. Consensus-driven propagation in massive unlabeled data for face recognition. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 568–583, 2018.

Ye Zhu, Kai Ming Ting, and Mark J Carman. Density-ratio based clustering for discovering clusters with varying densities. Pattern Recognition, 60:983–997, 2016.

Carly GK Ziegler, Samuel J Allon, Sarah K Nyquist, Ian M Mbano, Vincent N Miao, Constantine N Tzouanas, Yuming Cao, Ashraf S Yousif, Julia Bals, Blake M Hauser, et al. Sars-cov-2 receptor ace2 is an interferon-stimulated gene in human airway epithelial cells and is detected in specific cell subsets across tissues. Cell, 181(5):1016–1035, 2020.

A APPENDIX

In this supplementary file, we provide technical proofs of the theoretical results in Section 4.3, and present extra empirical experiments regarding our kernel diffusion approach with symmetric and asymmetric Gaussian kernels applied to DBSCAN. All the numerical experiments are carried out on a standard workstation with an Intel 64-core CPU and two Nvidia P100 GPUs.

A.1 PSEUDO CODE FOR DBSCAN AND DPC

**Algorithm 1 DBSCAN**
1: Input: SetOfPoints X, Eps ε, MinPts k
2: H := {x ∈ X : |B(x, ε) ∩ X| ≥ k};
3: G := undirected graph with vertices H and an edge between x, x′ ∈ H if |x − x′| ≤ ε;
4: Output: connected components of G

The connected components of the graph are returned as clusters, and the remaining points are left unclustered and considered noise points.

**Algorithm 2 DPC**
1: Input: SetOfPoints, TruncDist d_c
2: Compute d_{i,j} for all i, j ∈ SetOfPoints;
3: For i from 1 to SetOfPoints.size:
4:   ρ_i := Σ_{j≠i} 1{d_{i,j} < d_c}
5:   δ_i := min_{j: ρ_j > ρ_i} (d_{i,j})
6: Plot the decision map M with ρ as the horizontal axis and δ as the vertical axis;
7: Mark point i with relatively higher ρ_i and δ_i as a cluster center;
8: Mark point i with relatively lower ρ_i but relatively higher δ_i as a noise point;
9: Assign each remaining point the same label as its nearest cluster center;
10: Output: SetOfPoints

A.2 PROOFS OF THEORETICAL RESULTS

**Proof of Theorem 1.** Since {D_1, · · ·, D_m} are disjoint, we have p(x, y) = 0 whenever x and y belong to different clusters. By the definition of the matrix P, for each x ∈ D_j we have

$$\int_D p(x, y)\,dF_n(y) = 1,$$

which implies that

$$\int_{x\in D_j}\int_{y\in D_j} p(x, y)\,dF_n(x)\,dF_n(y) = |D_j|.$$

Therefore,

$$\bar\rho_j\,|D_j| = \int_{x\in D_j}\int_{y\in D_j} p(x, y)\,dF_n(x)\,dF_n(y) = |D_j|,$$

which implies that ρ̄_j = 1 for any j = 1, . . ., m.
Before proceeding to the proof of Theorem 2, we need the following auxiliary lemma, which relates the stationary distribution of a Markov chain to an arbitrary vector g.

**Lemma A.1.** Let P be the transition probability matrix of a finite irreducible discrete-time Markov chain with n states, which admits a stationary distribution, denoted by the vector π. We write e = (1, . . ., 1)^T ∈ R^n for the column vector of ones. The following holds for any vector g such that g^T e ≠ 0:
(1) (I − P + eg^T) is non-singular.
(2) Let H = (I − P + eg^T)^{-1}; then π^T = g^T H.

Proof. Since π is the stationary distribution, we have π^T e = 1. Applying Theorem 3.3 in Hunter (1982) yields that the matrix (I − P + eg^T) is non-singular. Next, recall that π^T P = π^T; therefore we have

$$\pi^T(I - P + eg^T) = \pi^T - \pi^T P + \pi^T e\,g^T = \pi^T e\,g^T = g^T,$$

which implies π^T = g^T H.

**Proof of Theorem 2.** Note that for each x ∈ D, the fast kernel diffusion density ρ_FKD(x) = ∫_D p(y, x) dF_n(y) is the corresponding column average of the transition matrix P. We write the i-th column vector of P as

$$p_i = \big(p(x_1, x_i), \ldots, p(x_n, x_i)\big)^T.$$

Therefore ρ_FKD(x_i) = e^T p_i / n.

Since the Markov chain induced by the kernel k(x, y) is ergodic, the density ρ(x, t) of the diffusion process X_t converges to the limiting stationary distribution of the Markov chain, denoted by π. We can write the n-vectors g and π in the following form:

$$g = (g_1, \ldots, g_n)^T = n\big(\rho_{\mathrm{FKD}}(x_1), \ldots, \rho_{\mathrm{FKD}}(x_n)\big)^T \quad\text{and}\quad \pi = \big(\rho_{\mathrm{KD}}(x_1), \ldots, \rho_{\mathrm{KD}}(x_n)\big)^T,$$

where g_i = e^T p_i is the i-th column sum of the matrix P. As a result, we have

$$\int_D \rho_{\mathrm{FKD}}(x)\,dF_n(x) = \frac{1}{n}e^T g = 1 \quad\text{and}\quad \int_D \rho_{\mathrm{KD}}(x)\,dF_n(x) = e^T \pi = 1.$$

By the definition of g, we know

$$(eg^T)^2 = n\,eg^T \quad\text{and}\quad e^T P = g^T.$$

It follows from Lemma A.1 that (I − P + eg^T) is non-singular and π^T = g^T H, where H = (I − P + eg^T)^{-1}. We define B = I + eg^T. By simple algebraic calculation, we find that B is non-singular with

$$B^{-1} = I - \frac{eg^T}{n + 1}.$$

As a result, it is easy to see that

$$g^T B^{-1} = \frac{g^T}{n + 1} \quad\text{and}\quad H^{-1} = B - P = (I - PB^{-1})B.$$

Using the Neumann series, we have

$$H = B^{-1}(I - PB^{-1})^{-1} = B^{-1}\sum_{i=0}^{\infty}(PB^{-1})^i.$$

Thus

$$\pi^T - \frac{g^T}{n} = g^T\Big(H - \frac{I}{n}\Big) = g^T B^{-1}\sum_{i=0}^{\infty}(PB^{-1})^i - \frac{g^T}{n}.$$

Since we assume that ρ_FKD(x) ≤ c for every x ∈ D and some 0 < c < 1, this leads to

$$g^T p_j \le nc\,e^T p_j = nc\,g_j.$$

Therefore, letting κ_j be the j-th component of g^T PB^{-1}, it is straightforward that

$$\kappa_j \le \frac{nc}{n + 1}\,g_j \le c\,g_j.$$

This implies that for every x ∈ D,

$$|\rho_{\mathrm{KD}}(x) - \rho_{\mathrm{FKD}}(x)| \le \rho_{\mathrm{FKD}}(x)\left|\frac{1}{n + 1}\sum_{i=0}^{\infty}c^i - \frac{1}{n}\right| \le \rho_{\mathrm{FKD}}(x)\left|\frac{1}{(n + 1)(1 - c)} - \frac{1}{n}\right|.$$

Hence we have

$$\lim_{n\to\infty}\frac{\rho_{\mathrm{KD}}(x)}{\rho_{\mathrm{FKD}}(x)} = 1,$$

which completes the proof.

A.3 ADDITIONAL NUMERICAL EXPERIMENTS

**Naive density with different bandwidths.** To illustrate the failure of the naive density function in the scenario of Figure 1, we also plot it for a range of values of the hyperparameter ε below. We can observe that it is difficult to detect the three true underlying clusters in all cases.

Figure 5: Naive density function in 3D with different values of ε.
**Hyperparameters** The parameter ε (radius of the ball, used in ρ_naive, ρ_LC, ρ_KD^sym and ρ_FKD^sym) is tuned by searching within the range between 0.1 and 1 with an increment of 0.1; the parameter k (number of nearest neighbors, used in ρ_LC, ρ_KD^asym and ρ_FKD^asym) is tuned by searching within the range between 10% and 50% of the number of samples, with an increment of 10%.

**Metadata of benchmark datasets.** The number of samples n, the number of clusters c, and the feature dimension d for each benchmark dataset are listed in Table 4 below.

**Benchmark datasets with DBSCAN.** We provide the performance of the conventional density functions, ρ_naive and ρ_LC, and the proposed kernel diffusion density functions with symmetric and asymmetric Gaussian kernels, ρ_KD^* and ρ_FKD^* (* ∈ {sym, asym}), applied to DBSCAN on 13 benchmark datasets. The results are summarised in Table 5. Similar to DPC, we see that both ρ_KD^sym and ρ_KD^asym uniformly outperform ρ_naive and ρ_LC in terms of clustering quality. ρ_KD^asym, which has better local adaptivity analytically, achieves the best results on most datasets and outperforms the others by a significant margin on Breast-o, Control, Haberman and Seeds.

**NMI for benchmark datasets.** Below we present in Table 6 the clustering results for the benchmark datasets based on the NMI metric.

Table 4: Metadata of benchmark datasets, including sample size (n), number of clusters (c), and feature dimension (d).

| Dataset | n | c | d |
|---|---|---|---|
| Banknote | 1372 | 2 | 4 |
| Breast-d | 569 | 2 | 30 |
| Breast-o | 699 | 2 | 9 |
| Control | 600 | 6 | 60 |
| Glass | 214 | 7 | 9 |
| Haberman | 306 | 2 | 3 |
| Ionosphere | 351 | 2 | 34 |
| Iris | 150 | 3 | 4 |
| Libras | 360 | 15 | 90 |
| Pageblocks | 5473 | 5 | 10 |
| Seeds | 210 | 3 | 7 |
| Segment | 210 | 7 | 19 |
| Wine | 178 | 3 | 13 |

Table 5: Clustering performance on benchmark datasets with different density functions applied to DBSCAN. Pairwise F-score (F_P) and BCubed F-score (F_B) under optimal parameter tuning are given. The best and second-best results in each dataset are bolded and underlined, respectively.

| Dataset | F_P: ρ_naive | ρ_LC | ρ_KD^sym | ρ_KD^asym | ρ_FKD^sym | ρ_FKD^asym | F_B: ρ_naive | ρ_LC | ρ_KD^sym | ρ_KD^asym | ρ_FKD^sym | ρ_FKD^asym |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Banknote | 26.8 | 60.7 | 62.0 | 66.4 | 65.4 | 66.4 | 26.5 | 65.1 | 60.7 | 67.4 | 65.7 | 67.4 |
| Breast-d | 56.7 | 63.0 | 65.0 | 66.6 | 67.2 | 66.6 | 60.9 | 64.7 | 66.0 | 67.4 | 67.2 | 67.4 |
| Breast-o | 18.2 | 55.3 | 59.2 | 70.6 | 70.5 | 70.6 | 15.9 | 50.7 | 52.3 | 71.3 | 70.6 | 71.2 |
| Control | 32.5 | 37.1 | 51.0 | 60.3 | 48.9 | 59.1 | 34.2 | 50.1 | 53.7 | 66.9 | 51.6 | 65.7 |
| Glass | 22.0 | 29.8 | 29.8 | 42.5 | 42.0 | 42.5 | 25.8 | 36.9 | 36.9 | 45.2 | 43.5 | 45.2 |
| Haberman | 68.6 | 72.2 | 68.6 | 75.6 | 68.9 | 75.3 | 69.2 | 73.1 | 69.2 | 75.8 | 68.3 | 75.7 |
| Ionosphere | 25.9 | 68.4 | 68.0 | 74.2 | 74.2 | 74.2 | 23.8 | 64.1 | 63.7 | 72.1 | 72.1 | 72.1 |
| Iris | 66.2 | 69.8 | 66.2 | 57.2 | 73.7 | 73.3 | 67.2 | 76.6 | 67.2 | 67.0 | 79.4 | 79.0 |
| Libras | 18.1 | 12.0 | 15.6 | 13.8 | 20.2 | 13.5 | 31.1 | 16.5 | 42.1 | 45.5 | 32.9 | 37.7 |
| Pageblocks | 48.4 | 89.2 | 90.0 | 90.1 | 89.9 | 90.1 | 45.2 | 85.5 | 89.7 | 89.5 | 89.7 | 89.5 |
| Seeds | 57.8 | 47.6 | 57.8 | 63.2 | 22.4 | 62.4 | 59.2 | 53.0 | 59.2 | 70.0 | 24.4 | 69.2 |
| Segment | 18.5 | 47.9 | 54.8 | 30.8 | 41.4 | 30.8 | 22.5 | 55.2 | 66.6 | 53.6 | 59.9 | 53.6 |
| Wine | 40.5 | 40.5 | 40.5 | 49.5 | 50.0 | 49.5 | 45.7 | 45.7 | 45.7 | 52.3 | 51.3 | 52.3 |

**Number of clusters.** In Table 7, we present the number of clusters returned by the density-based methods for the benchmark datasets. It can be observed that clustering with the proposed diffusion density functions returns a significantly better estimate of the number of clusters, compared to clustering with classic density functions such as ρ_naive and ρ_LC.

Table 6: Clustering performance on benchmark datasets with different density functions applied to DPC. NMI under optimal parameter tuning is given. The best results in each dataset are bolded.

| Dataset | ρ_naive | ρ_LC | ρ_KD^sym | ρ_KD^asym | ρ_FKD^sym | ρ_FKD^asym | k-means | Spectral |
|---|---|---|---|---|---|---|---|---|
| Banknote | 27.5 | 33.0 | 21.7 | 64.8 | 53.2 | 80.2 | 34.2 | 17.3 |
| Breast-d | 43.7 | 49.1 | 46.8 | 57.4 | 55.7 | 46.1 | 62.3 | 52.6 |
| Breast-o | 30.2 | 32.7 | 37.2 | 79.1 | 36.4 | 78.4 | 74.8 | 14.0 |
| Control | 60.6 | 60.6 | 63.2 | 69.5 | 61.0 | 69.6 | 75.4 | 68.3 |
| Glass | 43.1 | 43.4 | 45.0 | 48.4 | 43.8 | 46.6 | 34.8 | 36.4 |
| Haberman | 9.5 | 5.7 | 9.5 | 3.2 | 16.9 | 3.2 | 7.8 | 6.6 |
| Ionosphere | 27.9 | 28.0 | 30.9 | 31.1 | 30.1 | 30.5 | 13.5 | 5.2 |
| Iris | 51.1 | 53.1 | 60.1 | 73.4 | 62.6 | 73.4 | 74.2 | 70.6 |
| Libras | 63.3 | 66.4 | 63.0 | 68.8 | 68.2 | 69.1 | 60.0 | 56.1 |
| Pageblocks | 8.6 | 13.0 | 11.8 | 28.7 | 14.4 | 29.1 | 13.2 | 12.1 |
| Seeds | 47.1 | 49.8 | 53.6 | 64.8 | 58.6 | 64.8 | 67.4 | 60.3 |
| Segment | 63.5 | 64.4 | 65.1 | 72.2 | 63.0 | 70.7 | 61.2 | 65.2 |
| Wine | 58.1 | 58.2 | 72.0 | 73.3 | 71.1 | 58.6 | 84.2 | 72.7 |

Table 7: Number of clusters returned by different density functions applied to DPC. The ground truth is listed in the last column.

| Dataset | ρ_naive | ρ_LC | ρ_KD^sym | ρ_KD^asym | ρ_FKD^sym | ρ_FKD^asym | Ground Truth |
|---|---|---|---|---|---|---|---|
| Banknote | 16 | 44 | 26 | 2 | 7 | 2 | 2 |
| Breast-d | 3 | 5 | 2 | 3 | 3 | 3 | 2 |
| Breast-o | 15 | 17 | 7 | 2 | 1 | 2 | 2 |
| Control | 32 | 32 | 23 | 23 | 27 | 25 | 6 |
| Glass | 4 | 21 | 4 | 8 | 3 | 7 | 7 |
| Haberman | 34 | 40 | 34 | 1 | 3 | 1 | 2 |
| Ionosphere | 12 | 11 | 7 | 6 | 7 | 7 | 2 |
| Iris | 8 | 12 | 4 | 2 | 3 | 2 | 3 |
| Libras | 12 | 60 | 2 | 61 | 100 | 55 | 15 |
| Pageblocks | 7 | 25 | 3 | 10 | 11 | 8 | 5 |
| Seeds | 10 | 10 | 2 | 6 | 3 | 6 | 3 |
| Segment | 21 | 27 | 14 | 8 | 6 | 9 | 7 |
| Wine | 3 | 7 | 3 | 5 | 3 | 2 | 3 |