|
# DENSITY-BASED CLUSTERING WITH KERNEL DIFFUSION
|
|
|
**Anonymous authors** |
|
Paper under double-blind review |
|
|
|
ABSTRACT |
|
|
|
Finding a suitable density function is essential for density-based clustering algorithms such as DBSCAN and DPC. A naive density corresponding to the indicator |
|
function of a unit d-dimensional Euclidean ball is commonly used in these algorithms. Such a density suffers from an inability to capture local features in |
|
complex datasets. To tackle this issue, we propose a new kernel diffusion density |
|
function, which is adaptive to data of varying local distributional characteristics and |
|
smoothness. Furthermore, we develop a surrogate that can be efficiently computed |
|
in linear time and space and prove that it is asymptotically equivalent to the kernel |
|
diffusion density function. Extensive empirical experiments on benchmark and |
|
large-scale face image datasets show that the proposed approach not only achieves |
|
a significant improvement over classic density-based clustering algorithms but also |
|
outperforms the state-of-the-art face clustering methods by a large margin. |
|
|
|
1 INTRODUCTION |
|
|
|
Density-based clustering algorithms are now widely used in a variety of applications, ranging from |
|
high energy physics (Tramacere & Vecchio, 2012; Rovere et al., 2020), material sciences (Marquis |
|
et al., 2019; Reza et al., 2007), social network analysis (Shi et al., 2014; Khatoon & Banu, 2019) |
|
to molecular biology (Cao et al., 2017; Ziegler et al., 2020). In these algorithms, data points are |
|
partitioned into clusters that are considered to be sufficiently or locally high-density areas with |
|
respect to an underlying probability density or a similar reference function. We call them density |
|
functions throughout this paper. These techniques are attractive to practitioners due to their nonparametric nature, which gives them the flexibility to discover clusters of arbitrary shape, whilst classic methods such as k-means and k-medoids (Friedman et al., 2001) can only detect convex (e.g.,
|
spherical) clusters. Seminal work in the context of density-based clustering includes DBSCAN (Ester |
|
et al., 1996) and DPC (Rodriguez & Laio, 2014), among many others (Ankerst et al., 1999; Cuevas |
|
et al., 2001; Comaniciu & Meer, 2002; Hinneburg & Gabriel, 2007; Stuetzle, 2003). |
|
|
|
Most density-based clustering algorithms implicitly identify cluster centers and assign the remaining points to clusters by connecting each point with a nearby point of higher density. These methods require a density function, which is usually an estimate of the underlying true probability density or some variant of it. For example, a popular choice is the naive density function, obtained by simply counting the number of data points covered by the ε-neighborhood of each x. Note that such densities are not adaptive to different distribution regions. One of the challenging
|
scenarios is when clusters in the data have varying local features, for example, size, height, spread, |
|
and smoothness. Therefore, the resulting density function has a tendency to flatten the peaks and |
|
valleys in the data distribution, which leads to underestimation of the number of clusters (see Figure |
|
1). Many heuristic variations of DBSCAN and DPC have been proposed to magnify the local
|
features, thus making the clustering task easier (Campello et al., 2013; Chen et al., 2018; Ertöz et al., |
|
2003; Zhu et al., 2016). Most of these methods can be viewed as performing clustering on certain |
|
transformations of the naive density function. However, if the naive density function itself is quite |
|
problematic in the first place, these methods will become less effective. |
|
|
|
Moreover, even if we apply adaptive alternatives to modify the classic density functions, there are other contentious issues with the commonly used linear kernel density estimator (KDE). It often suffers from severe boundary bias (Marron & Ruppert, 1994) and is acknowledged as computationally expensive. These phenomena prevent the classic density functions from being practically useful and reliable, especially for large-scale and complex clustering tasks.

Figure 1: (a) Data generated from a Gaussian mixture model with 3 components, each with differing variance and weight. (b) Naive density function in 2D (top) and 3D (bottom): only one peak can be identified. (c) Proposed kernel diffusion density function: 3 clusters can be easily discovered.
|
|
|
To overcome these problems in density-based clustering algorithms, in this paper we propose a |
|
general approach to build the so-called kernel diffusion density function to replace classic density |
|
functions. The key idea is to construct the density from a user-specified bivariate kernel that has |
|
desired local adaptive properties. Instead of using the naive density function and its variants, we |
|
utilize the bivariate kernel to derive a transition probability. A diffusion process is induced by this |
|
transition probability, which admits a limiting and stationary distribution. This limiting distribution |
|
serves as a plausible density function for clustering with reduced error. |
|
|
|
Under this framework, we provide examples of symmetric and asymmetric bivariate kernels to |
|
construct the kernel diffusion density function, which can tackle clustering complex and locally |
|
varying data. We apply the resulting adapted DBSCAN and DPC algorithms to widely different |
|
empirical datasets and show significant improvement in each of these analyses. The main contributions |
|
of this paper are summarized below: |
|
|
|
- We introduce new bivariate kernel functions and construct the associated kernel diffusion |
|
processes. Based on the diffusion process, we propose a kernel diffusion density function |
|
to adapt density-based clustering algorithms such as DBSCAN and DPC, which improves accuracy in the presence of varying local features.
|
|
|
- We derive a computationally much more efficient surrogate, and show analytically that it is asymptotically equivalent to the proposed kernel diffusion density function.
|
|
|
- Through extensive experiments, we demonstrate the superiority of the kernel diffusion density function over the naive density function and its variants when applied to DBSCAN and DPC, and show that it outperforms state-of-the-art GCN-based methods on face clustering tasks.
|
|
|
2 RELATED WORK |
|
|
|
**Density-Based Clustering** There is vast literature on adapting density-based clustering algorithms |
|
to tackle large variations across different clusters in the data. DPC itself is such a refinement of DBSCAN, as it determines cluster centers not only by the highest density values but also by taking into account their distances from each other, and thus generally performs better in complex clustering tasks.
|
Other attempts include rescaling the data to have relative reference measures instead of KDE (Zhu |
|
|
|
|
|
|
|
|
et al., 2016; Chen et al., 2018), and using the number of shared-nearest-neighbors between two points |
|
to replace the geometric distance (Ertöz et al., 2003). |
|
|
|
**Diffusion Maps** The technique of diffusion maps (Coifman et al., 2005; Coifman & Lafon, 2006) |
|
gives a multi-scale organization of the data according to their underlying geometric structure. It uses |
|
a local similarity measure to create a diffusion process on the data which integrates local geometry at |
|
different scales along time t. Generally speaking, the diffusion will segment the data into several smaller clusters for small t and group the data into one cluster for large t. Applying eigenfunctions at a
|
carefully selected time t leads to good macroscopic representations of the data, which is useful in |
|
dimension reduction and spectral clustering (Nadler et al., 2005). |
|
|
|
**Face Clustering** Face clustering has been extensively studied as an important application in |
|
machine learning. Traditional algorithms include k-means, hierarchical clustering (Sibson, 1973) |
|
and ARO (Otto et al., 2017). Many recent works take advantage of supervised information and GCN models, achieving impressive improvements compared to traditional algorithms. To name
|
a few, CDP (Zhan et al., 2018) proposes to aggregate the features extracted by different models; |
|
L-GCN (Wang et al., 2019) predicts the linkage in an instance pivot subgraph; LTC (Yang et al., 2019) |
|
generates a series of subgraphs as proposals and detects face clusters thereon; and GCN(V+E) (Yang |
|
et al., 2020) learns both the confidence and connectivity by GCN. In this paper we demonstrate |
|
that the proposed density-based clustering algorithm with kernel diffusion, as a general clustering |
|
approach, even outperforms these state-of-the-art methods that are specifically designed for face
|
clustering. |
|
|
|
3 PRELIMINARIES |
|
|
|
3.1 NOTATIONS |
|
|
|
Let the dataset D = {x_1, . . ., x_n} ⊂ R^d be n i.i.d. samples drawn from a distribution F with density f on R^d. Let F_n denote the corresponding empirical distribution measure with respect to D, i.e., $F_n(A) = \frac{1}{n}\sum_{i=1}^{n}\mathbf{1}_A(x_i)$, where 1_A(·) denotes the indicator function of the set A. We write ||u|| for the Euclidean norm of a vector u. Let B(x, ε) and V_d denote the d-dimensional ε-ball centered at x and the volume of the unit ball B(0, 1), respectively. Let N_k(x) denote the set of k-nearest neighbors of point x within the dataset D.
|
|
|
3.2 DENSITY FUNCTION |
|
|
|
Density-based algorithms perform clustering by specifying and segmenting high-value areas in a |
|
density function denoted by ρ. Usually, we calculate ρ(x_i) for each data point, and then identify cluster centers as points with (locally) highest values. Many popular algorithms such as DBSCAN and DPC employ the
|
following naive density function: |
|
|
|
$$\rho_{\mathrm{naive}}(x) = \frac{1}{n\varepsilon^d}\sum_{y\in D}\frac{\mathbf{1}_{B(x,\varepsilon)}(y)}{V_d}. \qquad (1)$$
|
|
|
The naive density function ρnaive is actually an empirical estimate of f for carefully chosen ε. It is easy to see that, for clustering purposes, we only care about ρnaive(x) up to a normalising constant, which makes it simply equivalent to counting the total number of data points in the ε-ball around x.
|
|
|
In practice, the data distribution may be very complex and contain varying local features that are difficult to detect. The naive density in (1) with the same radius ε for all x usually suffers from
|
unsatisfactory empirical performance, for example, failing to identify small clusters with fewer data |
|
points. One possible way to alleviate this problem is through a transformation into the following |
|
local contrast (LC) function (Chen et al., 2018): |
|
|
|
$$\rho_{\mathrm{LC}}(x) = \frac{1}{n}\sum_{y\in N_k(x)}\mathbf{1}_{\rho_{\mathrm{naive}}(x)>\rho_{\mathrm{naive}}(y)}. \qquad (2)$$
|
|
|
In this way, ρLC compares the density of each data point with its k-nearest neighbors. To see the |
|
benefit of LC, let us consider x to be a cluster center. After local contrasting, ρLC(x) is likely to reach |
|
the value of k regardless of the size of this cluster. |
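To make the two reference densities above concrete, here is a minimal numpy sketch of ρnaive and ρLC; the toy dataset, the radius ε and the neighborhood size k are illustrative choices, and constant factors such as the unit-ball volume and 1/n are dropped since clustering only uses densities up to a normalising constant.

```python
import numpy as np

def naive_density(X, eps):
    """Count the points inside the eps-ball around each x (eq. (1) up to a constant)."""
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise distances
    return (dists <= eps).sum(axis=1).astype(float)

def local_contrast(X, eps, k):
    """Local contrast (eq. (2) up to a constant): neighbors with strictly lower naive density."""
    rho = naive_density(X, eps)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    lc = np.empty(len(X))
    for i in range(len(X)):
        knn = np.argsort(dists[i])[1:k + 1]   # k nearest neighbors, excluding the point itself
        lc[i] = np.sum(rho[i] > rho[knn])     # compare densities with the neighbors
    return lc

# Toy example: two Gaussian blobs of very different sizes.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, size=(80, 2)), rng.normal(3, 0.3, size=(20, 2))])
print(naive_density(X, eps=0.5)[:5], local_contrast(X, eps=0.5, k=10)[:5])
```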
|
|
|
|
|
|
|
|
However, density functions like ρLC still depend heavily on the underlying performance of ρnaive. This restricts their application to clustering data with challenging local features.
|
|
|
4 METHODOLOGY |
|
|
|
In this section, we present a new type of density-based clustering algorithm, based on the notion |
|
of kernel diffusion density function. Towards this end, we will introduce a kernel diffusion density |
|
function, which takes into account local adaptivity and is well suited for clustering purposes. We
|
provide details on how to derive this density function from a diffusion process induced by bivariate |
|
kernels. We also provide a surrogate density function that is computationally more efficient. |
|
|
|
4.1 DIFFUSION PROCESS AND KERNEL DIFFUSION DENSITY |
|
|
|
Consider a bivariate kernel function $k : D \times D \to \mathbb{R}^+$ such that:
|
|
|
- k(x, y) is positive semi-definite, i.e., k(x, y) ≥ 0. |
|
|
|
- k(x, y) is Fn-integrable with respect to both x and y. |
|
|
|
|
|
We define $d(x) = n\int_D k(x, y)\,dF_n(y)$ as a local measure of the volume at x and define
$$p(x, y) = \frac{k(x, y)}{d(x)}. \qquad (3)$$
|
|
|
It is easy to see that p(x, y) satisfies the conservation property $n\int_D p(x, y)\,dF_n(y) = 1$. As a result, p(x, y) can be viewed as the probability of a random walk on the dataset jumping from point x to point y, which induces a Markov chain on D with n × n transition matrix P = [p(x, y)]. This technique is standard
|
in various applications, known as the normalized graph Laplacian construction. For example, we can |
|
view D as a graph, L = I − _P as the normalized graph Laplacian, and d(x) as a normalization factor._ |
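As a minimal illustration of this construction, the sketch below builds the transition matrix P of equation (3) by row-normalising a kernel matrix; the plain (non-adaptive) Gaussian kernel and its bandwidth are placeholders for the locally adaptive kernels introduced in Section 4.2, not the paper's implementation.

```python
import numpy as np

def transition_matrix(X, kernel):
    """Build P = [p(x, y)] with p(x, y) = k(x, y) / d(x), where d(x) sums k(x, .) over the data."""
    n = len(X)
    K = np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
    d = K.sum(axis=1, keepdims=True)   # d(x) = n * integral of k(x, .) dF_n = row sums of K
    return K / d                       # each row of P sums to one (conservation property)

# Placeholder kernel: a plain Gaussian kernel with bandwidth h.
gaussian = lambda x, y, h=0.5: np.exp(-np.sum((x - y) ** 2) / h)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
P = transition_matrix(X, gaussian)
assert np.allclose(P.sum(axis=1), 1.0)
```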
|
|
|
For t ≥ 0, the probability of transiting from x to y in t time steps is given by $P^t$, the t-th power of P. Running the Markov chain forward in time, we observe the dataset at different scales, which is the diffusion process $X_t$ on D. Let $\rho(x, t): D \times \mathbb{R}^+ \to \mathbb{R}^+$ be the associated probability density, which is governed by the following second-order differential equation with initial conditions:
$$\frac{\partial}{\partial t}\rho(x, t) = -L\rho(x, t), \qquad \rho(x, 0) = \phi_0(x), \qquad (4)$$
|
|
|
|
|
|
|
where ϕ0(x) is a probability density at time t = 0. In practice we can use any valid choice of ϕ0(x), |
|
e.g. the uniform density. |
|
|
|
To give an explicit example of the diffusion process, consider a sub-class of k, namely isotropic kernels, where $k(x, y) = K(\|x - y\|^2/h)$ for some function $K: \mathbb{R} \to \mathbb{R}^+$. Here h has a dual interpretation: as a scale parameter used to infer local information, and as the time step $h = \Delta t$ at which the random walk jumps. Then we can define the forward Chapman-Kolmogorov operator $T_F$ as
$$T_F\,\phi_0(x) = n\int_D p(x, y)\,\phi_0(y)\,dF_n(y).$$
|
|
|
|
|
Note that $T_F\,\phi_0$ is the data distribution at time t = h, so $T_F$ can be viewed as a continuous analogue of left multiplication by the transition matrix P. Letting h → 0, the random walk converges to a continuous diffusion process whose probability density evolves continuously in t. In this case, we can explicitly write the second-order differential equation in (4) as:
$$\frac{\partial}{\partial t}\rho(x, t) = \lim_{h\to 0}\frac{\rho(x, t+h) - \rho(x, t)}{h} = \lim_{h\to 0}\frac{T_F - I}{h}\,\rho(x, t), \qquad (5)$$
where $-L = \lim_{h\to 0}(T_F - I)/h$ is the conventional infinitesimal generator of the process.
|
|
|
Now we are ready to introduce our kernel diffusion density function. |
|
|
|
|
|
|
|
|
**Definition 1. (Kernel diffusion density function)** Suppose the Markov chain induced by P is ergodic. We define the kernel diffusion density function as the limiting probability density of the diffusion process $X_t$, i.e.,
$$\rho_{\mathrm{KD}}(x) = \lim_{t\to\infty}\rho(x, t). \qquad (6)$$
|
|
|
Intuitively, on the one hand, with increased values of t we expect the diffusion process $X_t$ gradually to reveal the geometric structure (such as high-density regions) of the data distribution F. To see this, note that the transition probability P reflects connectivity between data points. We can interpret a cluster as an underlying geometric structure in which the probability of staying within the region is high during a transition. In the diffusion process, the probability of following a path along a structure usually increases with t, as the involved data points are dense and highly connected: such a path consists of short, high-probability jumps, whilst paths that do not follow any structure consist of long, low-probability jumps, which lowers their overall probability. As a result, geometric structures of F are magnified during the diffusion.
|
|
|
On the other hand, by taking certain sophisticated forms of k(x, y) that take local adaptivity into account, we also slow down the diffusion to avoid trivial geometric structures, such as one big cluster containing all the data points. In this way, we can eventually identify the correct geometric structure at the right scale.
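Numerically, the limit in Definition 1 below can be approximated by running the diffusion until the density stops changing. The following minimal sketch assumes a precomputed transition matrix P (e.g., from the construction sketched above); the uniform initial density, tolerance, and iteration cap are arbitrary choices for illustration.

```python
import numpy as np

def kernel_diffusion_density(P, tol=1e-10, max_iter=10_000):
    """Approximate rho_KD as the limiting distribution of the Markov chain with transition matrix P."""
    n = P.shape[0]
    rho = np.full(n, 1.0 / n)          # uniform initial density phi_0
    for _ in range(max_iter):
        new_rho = rho @ P              # one diffusion step (left multiplication by P)
        if np.linalg.norm(new_rho - rho, 1) < tol:
            break
        rho = new_rho
    return new_rho
```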
|
|
|
4.2 LOCALLY ADAPTIVE KERNELS |
|
|
|
To address the local adaptability in kernel diffusion density function, we propose the following two |
|
bivariate kernels. Both of them are very simple variations of the most commonly used classic kernels. |
|
|
|
**Symmetric-Gaussian kernel:** |
|
|
|
|
|
$$k(x, y) = \exp\left(-\frac{\|x - y\|^2}{h}\right)\mathbf{1}_{B(x,\varepsilon)}(y). \qquad (7)$$

Here h and ε are both hyper-parameters. We call this kernel symmetric since k(x, y) = k(y, x).
|
|
|
**Asymmetric-Gaussian kernel:** |
|
|
|
|
|
$$k(x, y) = \exp\left(-\frac{\|x - y\|^2}{h}\right)\mathbf{1}_{N_k(x)}(y). \qquad (8)$$

Here h and k are hyper-parameters. Note that in this case k(x, y) is asymmetric, as y ∈ N_k(x) does not imply x ∈ N_k(y).
|
|
|
The bivariate kernels defined in (7) and (8) are simply combinations of the classic Gaussian kernel with the ε-neighbourhood and k-nearest-neighbour kernels, respectively. With these simple combinations, we truncate the Gaussian kernel to local areas, so the contribution of each point y to the construction of the density function ρKD(x) depends not only on the distance ||y − x|| but also on the local geometric structure around x. Hence, the new kernels are adaptive at different x, which is expected to lead to better clustering performance in the presence of varying local features. We remark that the asymmetric Gaussian kernel takes into account a varying neighborhood around each x, and is thus more adaptive than the symmetric Gaussian kernel.
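A minimal sketch of the two kernels in (7) and (8) as dense matrices over the dataset is given below; h, ε and k are the hyper-parameters from the text, and excluding each point from its own neighbour set is a choice made here purely for illustration.

```python
import numpy as np

def symmetric_gaussian(X, h, eps):
    """Eq. (7): Gaussian kernel truncated to the eps-ball around x."""
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / h) * (np.sqrt(d2) <= eps)

def asymmetric_gaussian(X, h, k):
    """Eq. (8): Gaussian kernel truncated to the k nearest neighbours of x (not symmetric in x, y)."""
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    mask = np.zeros_like(d2, dtype=bool)
    for i in range(len(X)):
        mask[i, np.argsort(d2[i])[1:k + 1]] = True   # y in N_k(x), excluding x itself
    return np.exp(-d2 / h) * mask
```

Either kernel matrix can be row-normalised to obtain the transition matrix P of equation (3).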
|
|
|
Although here we only provide two examples of locally adaptive kernels, other options can be easily |
|
created in a similar spirit under this framework, e.g., changing the Gaussian kernels to other kernels |
|
or changing the ε-neighbourhood (k-nearest neighbours) kernels to other locally truncated functions. |
|
Once k(x, y) is determined, we can derive the corresponding density function ρKD. We then simply apply any density-based clustering procedure such as DPC or DBSCAN based on ρKD instead of the naive density function ρnaive.
|
|
|
In Section 5, we assess the empirical performance of the proposed kernel diffusion density function |
|
with the above two locally adaptive kernels. They outperform existing density-based algorithms and |
|
other state-of-the-art methods. |
|
|
|
|
|
|
|
|
4.3 FAST KERNEL DIFFUSION DENSITY |
|
|
|
The kernel diffusion density function ρKD can be calculated as the stationary distribution of a Markov |
|
chain induced by the transition matrix P . Numerically, we can solve it by iteratively right multiplying |
|
_P with ρ(x, t) until convergence, or applying a QR decomposition on P_ . These methods are expensive |
|
in terms of computational cost, especially when the sample size n is large. |
|
|
|
To tackle this problem, we propose the following surrogate of ρKD(x) which is computationally more |
|
efficient. |
|
**Definition 2. (Fast kernel diffusion density function)** Let p(y, x) be the transition probability from point y to point x, as defined in equation (3). We define the fast kernel diffusion density function as
$$\rho_{\mathrm{FKD}}(x) = \int_D p(y, x)\,dF_n(y). \qquad (9)$$
|
|
|
|
|
It is straightforward that ρFKD can be obtained in linear time and memory space, as we only need to |
|
compute the column averages of matrix P . |
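Under the same conventions as the sketches above, the fast surrogate is simply the column average of P; with the locally truncated kernels of Section 4.2, P can be stored sparsely, so this reduces to a single pass over its non-zero entries rather than the repeated matrix products needed to approximate ρKD.

```python
import numpy as np

def fast_kernel_diffusion_density(P):
    """rho_FKD(x_i) = (1/n) * sum_j p(x_j, x_i), i.e. the column averages of P (Definition 2)."""
    return P.mean(axis=0)
```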
|
|
|
Here we show that ρFKD is not only computationally efficient but also suitable for detecting local features. This is illustrated through the following Theorem 1. Consider the special case that $k(x, y) = \mathbf{1}_{B(x,\varepsilon)}(y)$. Then it is easy to verify that
$$\rho_{\mathrm{FKD}}(x) = \frac{1}{C_d}\sum_{y\in B(x,\varepsilon)}\frac{1}{\rho_{\mathrm{naive}}(y)},$$
where $C_d = n\varepsilon^d V_d$ is a normalising constant. In this way, we build a connection between ρFKD and the naive density function ρnaive in this special example.
|
**Theorem 1.** Consider the above special case that $k(x, y) = \mathbf{1}_{B(x,\varepsilon)}(y)$. In addition, assume the dataset $D = \{x_1, \ldots, x_n\}$ can be split into m disjoint clusters, i.e., $D = D_1 \cup \cdots \cup D_m$, and for each $x \in D$, $B(x, \varepsilon)$ only contains data points that belong to the same cluster as x. Denote $\bar\rho_j = \frac{1}{|D_j|}\sum_{x\in D_j}\rho_{\mathrm{FKD}}(x)$ as the average density in cluster j. We have
$$\bar\rho_1 = \cdots = \bar\rho_m = 1.$$
|
|
|
Theorem 1 demonstrates that the average of ρFKD within each cluster is the same regardless of cluster sizes and other local features. This shows that ρFKD elevates the density of small clusters, which is essential for finding the density peaks of small clusters.
|
|
|
Previously we claimed that ρFKD is a surrogate of the kernel diffusion density ρKD. Next, we want to reveal the relationship between these two density functions from an asymptotic viewpoint. To proceed, we will need the following assumption.

**Assumption 1.** There exists some positive constant c < 1 that is independent of n, such that $\rho_{\mathrm{FKD}}(x) \leq c$ holds for every $x \in D$.

This is a very mild assumption, since it always holds that $\rho_{\mathrm{FKD}}(x) < 1$, and the average of $\rho_{\mathrm{FKD}}(x)$ over the dataset is $\int_D \rho_{\mathrm{FKD}}(x)\,dF_n(x) = 1/n$, which vanishes as n → ∞. Now we are ready to present the following theorem, which characterises the closeness between ρFKD and ρKD.
|
|
|
**Theorem 2.** Suppose that Assumption 1 holds and the Markov chain induced by the kernel k(x, y) is ergodic. We have
$$\frac{\rho_{\mathrm{KD}}(x)}{\rho_{\mathrm{FKD}}(x)} \xrightarrow{\ a.s.\ } 1.$$
|
|
|
As shown in the Appendix, the almost sure convergence in Theorem 2 occurs at a fast rate of order $n^{-1}$. Thus it is safe to use ρFKD in place of ρKD in finite-sample experiments. This result is also verified by our numerical experiments in Section 5.
|
|
|
5 EXPERIMENTS |
|
|
|
In this section, we empirically evaluate the proposed kernel diffusion density functions against ρnaive and ρLC in density-based clustering algorithms, and also compare them with other state-of-the-art methods.
|
|
|
|
|
|
|
|
Table 1: Clustering performance on benchmark datasets with different density functions applied to DPC. Pairwise F-score (F_P) and BCubed F-score (F_B) under optimal parameter tuning are given. The best and second-best results in each dataset are bolded and underlined, respectively.
|
|
|
| Dataset | F_P ρ_naive | F_P ρ_LC | F_P ρ_KD^sym | F_P ρ_KD^asym | F_P ρ_FKD^sym | F_P ρ_FKD^asym | F_B ρ_naive | F_B ρ_LC | F_B ρ_KD^sym | F_B ρ_KD^asym | F_B ρ_FKD^sym | F_B ρ_FKD^asym |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Banknote | 54.3 | 31.6 | 67.2 | 83.9 | 67.2 | 93.6 | 57.7 | 31.8 | 67.2 | 85.1 | 67.2 | 93.6 |
| Breast-d | 55.9 | 51.8 | 78.0 | 69.1 | 67.4 | 72.6 | 59.0 | 58.7 | 76.0 | 69.7 | 69.4 | 72.2 |
| Breast-o | 57.6 | 70.7 | 82.8 | 92.9 | 82.7 | 92.9 | 52.2 | 74.1 | 75.9 | 92.2 | 75.8 | 92.2 |
| Control | 48.6 | 49.3 | 49.0 | 63.9 | 52.5 | 64.5 | 51.6 | 52.4 | 52.0 | 70.8 | 55.1 | 71.8 |
| Glass | 36.9 | 39.0 | 46.3 | 48.1 | 44.8 | 47.8 | 42.7 | 45.7 | 55.1 | 56.9 | 53.5 | 57.1 |
| Haberman | 66.9 | 64.1 | 74.5 | 75.7 | 75.8 | 75.7 | 66.9 | 63.3 | 74.5 | 75.8 | 75.9 | 75.8 |
| Ionosphere | 27.4 | 28.3 | 46.9 | 54.9 | 46.0 | 53.9 | 25.0 | 25.8 | 42.6 | 52.5 | 41.7 | 49.2 |
| Iris | 54.3 | 53.8 | 65.8 | 74.6 | 69.2 | 74.6 | 61.6 | 62.3 | 72.7 | 80.0 | 74.0 | 80.0 |
| Libras | 20.0 | 22.9 | 29.3 | 31.5 | 26.0 | 31.0 | 26.8 | 29.1 | 37.8 | 41.8 | 33.3 | 39.8 |
| Pageblocks | 92.9 | 93.0 | 90.5 | 90.2 | 89.7 | 90.2 | 89.9 | 90.0 | 89.8 | 89.7 | 89.6 | 89.7 |
| Seeds | 54.3 | 54.9 | 68.0 | 78.0 | 69.5 | 78.0 | 54.3 | 55.4 | 72.4 | 78.7 | 72.9 | 78.7 |
| Segment | 48.4 | 48.0 | 57.1 | 58.0 | 41.4 | 56.1 | 64.2 | 63.8 | 67.1 | 69.2 | 60.6 | 68.2 |
| Wine | 45.2 | 61.1 | 56.6 | 68.0 | 60.0 | 65.3 | 46.0 | 61.9 | 61.5 | 74.7 | 66.3 | 71.4 |
|
|
|
|
|
|
|
methods. We denote ρ_KD^sym and ρ_FKD^sym as the kernel diffusion density function and its fast surrogate with the symmetric Gaussian kernel, respectively. Similarly, we denote ρ_KD^asym and ρ_FKD^asym as the two proposed density functions with the asymmetric Gaussian kernel. We examine their performance on a wide range of datasets. The clustering results are measured by Pairwise F-score (Banerjee et al., 2005), BCubed F-score (Amigó et al., 2009) or NMI (Cover, 1999).
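For reference, here is a minimal sketch of the pairwise precision, recall and F-score used below: every pair of samples is treated as a binary same-cluster decision. This is one standard way to compute the metric and is included only to fix notation, not as the evaluation code used in the paper.

```python
import numpy as np

def pairwise_f_score(pred, truth):
    """Pairwise F-score: treat every pair of points as a binary same-cluster decision."""
    pred, truth = np.asarray(pred), np.asarray(truth)
    i, j = np.triu_indices(len(pred), k=1)
    same_pred = pred[i] == pred[j]
    same_truth = truth[i] == truth[j]
    tp = np.sum(same_pred & same_truth)
    precision = tp / max(np.sum(same_pred), 1)
    recall = tp / max(np.sum(same_truth), 1)
    return 2 * precision * recall / max(precision + recall, 1e-12)

print(pairwise_f_score([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0: labels match up to renaming
```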
|
|
|
5.1 PERFORMANCE ON BENCHMARK DATASETS |
|
|
|
We now discuss the performance on 13 benchmark datasets (∼100 to ∼5,000 data points) from the UCI repository. The metadata is summarised in the Appendix.
|
|
|
As summarised in Table 1, both ρ_KD^sym and ρ_KD^asym uniformly outperform ρnaive and ρLC in terms of clustering accuracy as measured by the F-scores; results based on NMI are deferred to the Appendix. The proposed kernel diffusion density function with the asymmetric Gaussian kernel, ρ_KD^asym, which enjoys better local adaptivity analytically, achieves the best results on most datasets and outperforms ρnaive and ρLC by a large margin. It is worth noticing that the two fast surrogates, ρ_FKD^sym and ρ_FKD^asym, achieve results comparable with their original counterparts, ρ_KD^sym and ρ_KD^asym. In the Appendix, similar results are observed for the same set of density functions applied to DBSCAN.
|
|
|
Figure 2: Precision-Recall curves of different approaches applied to DPC on the MS1M dataset, using (a) the Pairwise metric, and (b) the BCubed metric.
|
|
|
|
|
|
|
|
5.2 PERFORMANCE ON FACE IMAGE DATASETS |
|
|
|
Clustering face images according to their latent identity has become an important application in recent
|
years. It is challenging in the sense that face image datasets usually contain thousands of identities, |
|
corresponding to thousands of clusters. Meanwhile, the number of images for each identity (cluster) |
|
is quite different, corresponding to the variety of cluster sizes. We assess the performance of |
|
the proposed approach on two popular face image datasets: emore_200k (Zhan et al., 2018) and |
|
MS1M (Guo et al., 2016). |
|
|
|
**emore_200k.** The dataset contains 2,577 identities with 200,000 images, following the protocol in Zhan et al. (2018). We compared with k-means, HAC (Sibson, 1973), ARO (Otto et al., 2017), and CDP (Zhan et al., 2018). Again, we see that clustering with the proposed kernel diffusion density functions also outperforms the state-of-the-art approaches such as CDP by a large margin.

Table 2: Clustering performance on emore_200k. BCubed precision, recall and F-score are reported.

| | Algorithm | # clusters | Precision | Recall | F_B |
|---|---|---|---|---|---|
| Baseline | k-means | 2,577 | 94.24 | 74.89 | 83.45 |
| | HAC | 2,577 | 97.74 | 88.02 | 92.62 |
| | ARO | 85,150 | 52.96 | 16.93 | 25.66 |
| | CDP | - | 89.35 | 88.98 | 89.16 |
| Density-based | ρ_naive | 7,928 | 92.36 | 78.14 | 84.65 |
| | ρ_LC | 3,485 | 96.15 | 86.58 | 91.11 |
| | ρ_KD^sym | 2,781 | 95.82 | 93.24 | 94.51 |
| | ρ_KD^asym | 2,546 | 95.48 | 93.82 | 94.64 |
| | ρ_FKD^sym | 3,622 | 95.27 | 92.54 | 93.89 |
| | ρ_FKD^asym | 2,569 | 96.37 | 93.93 | 95.13 |

- Not available
|
|
|
|
|
**MS1M.** The dataset contains 8,573 identities with around 584,000 images, following the protocols in Yang et al. (2020).
|
We set ε to 0.8, k to 200, and h to 0.5 |
|
for the density-based methods. We report the clustering results in
|
Table 3. Precision versus Recall curves |
|
for different density functions (applied |
|
to DPC) are plotted in Figure 2. In Table 3, the proposed kernel diffusion density functions outperform ρnaive and ρLC. |
|
Note that GCN-based methods such as |
|
L-GCN (Wang et al., 2019), LTC (Yang |
|
et al., 2019) and GCN (V+E) (Yang et al., |
|
2020) achieve generally better clustering |
|
performance than unsupervised methods |
|
due to their supervised nature. However, |
|
it is quite encouraging to see that the |
|
proposed kernel diffusion approaches, although are also unsupervised clustering |
|
methods, considerably outperform the |
|
GCN-based methods. |
|
|
|
Table 3: Clustering performance on MS1M. Pairwise F-score and BCubed F-score are reported.

| | Algorithm | # clusters | F_P | F_B |
|---|---|---|---|---|
| Unsupervised | k-means | 8,573 | 79.21 | 81.23 |
| | HAC | 8,573 | 70.63 | 70.46 |
| | ARO | - | 13.60 | 17.00 |
| | CDP | - | 75.02 | 78.70 |
| Supervised | L-GCN | - | 78.68 | 84.37 |
| | LTC | - | 85.66 | 85.52 |
| | GCN(V+E) | - | 87.55 | 85.94 |
| Density-based | ρ_naive | 59,551 | 78.37 | 79.35 |
| | ρ_LC | 24,019 | 83.61 | 85.06 |
| | ρ_KD^sym | - | - | - |
| | ρ_KD^asym | 22,869 | 88.15 | 87.14 |
| | ρ_FKD^sym | 34,246 | 84.40 | 85.37 |
| | ρ_FKD^asym | 22,927 | 87.26 | 87.41 |

- Not available

5.3 SENSITIVITY ANALYSIS
|
|
|
|
|
Next, we examine the sensitivity of the proposed kernel diffusion density functions to hyper-parameters and compare it with that of ρnaive and ρLC. The results are obtained via extensive experiments on emore_200k and MS1M, and are shown in Figure 3. We can see that the clustering performance of ρ_KD^sym is much more stable than that of ρnaive and ρLC when we vary the value of ε, whilst ρ_KD^asym is robust to the parameter k, and both ρ_KD^sym and ρ_KD^asym are quite robust to the parameter h.
|
|
|
|
|
|
|
|
Figure 3: Sensitivity analysis on emore_200k and MS1M. We investigate the clustering performance |
|
by varying the following parameters: (a) Radius of ε-ball; (b) Number k of nearest neighbors; (c) |
|
Bandwidth h of Gaussian kernel. |
|
|
|
5.4 COMPUTATIONAL COST |
|
|
|
|
|
We carried out a series of experiments on |
|
MS1M to demonstrate the computational |
|
efficiency of the fast surrogate ρFKD in |
|
terms of time and space. With a collection of subsampled data from MS1M at |
|
different percentile levels, we run both the |
|
kernel diffusion density ρKD and the fast |
|
surrogate ρFKD. As we can observe from |
|
Figure 4, the running time and memory |
|
usage of ρKD increase dramatically with |
|
the sample size. Whilst ρFKD retains a |
|
very low level of computational cost. This |
|
suggests that ρFKD, which achieves an excellent computational efficiency, should |
|
be favored in practice. |
|
|
|
6 CONCLUSION |
|
|
|
|
|
Figure 4: Running time and memory usage of the proposed methods at different sample sizes on MS1M. |
|
|
|
|
|
Density-based clustering has a profound impact on machine learning and data mining. However, the underpinning naive density function struggles to detect varying local features, causing extra errors in the clustering. We propose a new set of density functions based on a kernel diffusion process to resolve this problem, which is adaptive to density regions of varying local distributional features. We demonstrate that DBSCAN and DPC adapted by the proposed approach have improved clustering performance compared to their classic versions and other state-of-the-art methods.
|
|
|
|
|
|
|
|
REFERENCES |
|
|
|
Enrique Amigó, Julio Gonzalo, Javier Artiles, and Felisa Verdejo. A comparison of extrinsic |
|
clustering evaluation metrics based on formal constraints. Information Retrieval, 12(4):461–486, |
|
2009. |
|
|
|
Mihael Ankerst, Markus M Breunig, Hans-Peter Kriegel, and Jörg Sander. Optics: Ordering points to |
|
identify the clustering structure. ACM Sigmod Record, 28(2):49–60, 1999. |
|
|
|
Arindam Banerjee, Chase Krumpelman, Joydeep Ghosh, Sugato Basu, and Raymond J Mooney. |
|
Model-based overlapping clustering. In Proceedings of the eleventh ACM SIGKDD International |
|
_Conference on Knowledge Discovery in Data Mining, pp. 532–537, 2005._ |
|
|
|
Ricardo JGB Campello, Davoud Moulavi, and Jörg Sander. Density-based clustering based on |
|
hierarchical density estimates. In Pacific-Asia Conference on Knowledge Discovery and Data |
|
_Mining, pp. 160–172. Springer, 2013._ |
|
|
|
Junyue Cao, Jonathan S Packer, Vijay Ramani, Darren A Cusanovich, Chau Huynh, Riza Daza, |
|
Xiaojie Qiu, Choli Lee, Scott N Furlan, Frank J Steemers, et al. Comprehensive single-cell |
|
transcriptional profiling of a multicellular organism. Science, 357(6352):661–667, 2017. |
|
|
|
Bo Chen, Kai Ming Ting, Takashi Washio, and Ye Zhu. Local contrast as an effective means to robust |
|
clustering against varying densities. Machine Learning, 107(8):1621–1645, 2018. |
|
|
|
Ronald R Coifman and Stéphane Lafon. Diffusion maps. Applied and Computational Harmonic |
|
_Analysis, 21(1):5–30, 2006._ |
|
|
|
Ronald R Coifman, Stephane Lafon, Ann B Lee, Mauro Maggioni, Boaz Nadler, Frederick Warner, |
|
and Steven W Zucker. Geometric diffusions as a tool for harmonic analysis and structure definition |
|
of data: Diffusion maps. Proceedings of the National Academy of Sciences, 102(21):7426–7431, |
|
2005. |
|
|
|
Dorin Comaniciu and Peter Meer. Mean shift: A robust approach toward feature space analysis. |
|
_IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(5):603–619, 2002._ |
|
|
|
Thomas M Cover. Elements of information theory. John Wiley & Sons, 1999. |
|
|
|
Antonio Cuevas, Manuel Febrero, and Ricardo Fraiman. Cluster analysis: a further approach based |
|
on density estimation. Computational Statistics & Data Analysis, 36(4):441–459, 2001. |
|
|
|
Levent Ertöz, Michael Steinbach, and Vipin Kumar. Finding clusters of different sizes, shapes, |
|
and densities in noisy, high dimensional data. In Proceedings of the 2003 SIAM International |
|
_Conference on Data Mining, pp. 47–58. SIAM, 2003._ |
|
|
|
Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu, et al. A density-based algorithm for |
|
discovering clusters in large spatial databases with noise. In KDD, volume 96, pp. 226–231, 1996. |
|
|
|
Jerome Friedman, Trevor Hastie, Robert Tibshirani, et al. The elements of statistical learning, |
|
volume 1. Springer Series in Statistics New York, 2001. |
|
|
|
Yandong Guo, Lei Zhang, Yuxiao Hu, Xiaodong He, and Jianfeng Gao. Ms-celeb-1m: A dataset |
|
and benchmark for large-scale face recognition. In European Conference on Computer Vision, pp. |
|
87–102. Springer, 2016. |
|
|
|
Alexander Hinneburg and Hans-Henning Gabriel. Denclue 2.0: Fast clustering based on kernel |
|
density estimation. In International Symposium on Intelligent Data Analysis, pp. 70–80. Springer, |
|
2007. |
|
|
|
Jeffrey J Hunter. Generalized inverses and their application to applied probability problems. Linear |
|
_Algebra and Its Applications, 45:157–198, 1982._ |
|
|
|
Mehjabin Khatoon and W Aisha Banu. An efficient method to detect communities in social networks |
|
using dbscan algorithm. Social Network Analysis and Mining, 9(1):1–12, 2019. |
|
|
|
|
|
|
|
|
Emmanuelle A Marquis, Vicente Araullo-Peters, Yan Dong, Auriane Etienne, Svetlana Fedotova, |
|
Katsuhiko Fujii, Koji Fukuya, Evgenia Kuleshova, Anabelle Lopez, Andrew London, et al. On the |
|
use of density-based algorithms for the analysis of solute clustering in atom probe tomography data. |
|
In Proceedings of the 18th International Conference on Environmental Degradation of Materials |
|
_in Nuclear Power Systems–Water Reactors, pp. 2097–2113. Springer, 2019._ |
|
|
|
James S Marron and David Ruppert. Transformations to reduce boundary bias in kernel density |
|
estimation. Journal of the Royal Statistical Society: Series B (Methodological), 56(4):653–671, |
|
1994. |
|
|
|
Boaz Nadler, Stephane Lafon, Ronald R Coifman, and Ioannis G Kevrekidis. Diffusion maps, spectral |
|
clustering and eigenfunctions of fokker-planck operators. arXiv preprint math/0506090, 2005. |
|
|
|
Charles Otto, Dayong Wang, and Anil K Jain. Clustering millions of faces by identity. IEEE |
|
_Transactions on Pattern Analysis and Machine Intelligence, 40(2):289–303, 2017._ |
|
|
|
Hasanzadeh PR Reza, AH Rezaie, SHH Sadeghi, MH Moradi, and M Ahmadi. A density-based fuzzy |
|
clustering technique for non-destructive detection of defects in materials. Ndt & E International, |
|
40(4):337–346, 2007. |
|
|
|
Alex Rodriguez and Alessandro Laio. Clustering by fast search and find of density peaks. Science, |
|
344(6191):1492–1496, 2014. |
|
|
|
Marco Rovere, Ziheng Chen, Antonio Di Pilato, Felice Pantaleo, and Chris Seez. Clue: A fast parallel |
|
clustering algorithm for high granularity calorimeters in high-energy physics. Frontiers in Big |
|
_Data, 3, 2020._ |
|
|
|
Jieming Shi, Nikos Mamoulis, Dingming Wu, and David W Cheung. Density-based place clustering |
|
in geo-social networks. In Proceedings of the 2014 ACM SIGMOD International Conference on |
|
_Management of Data, pp. 99–110, 2014._ |
|
|
|
Robin Sibson. Slink: an optimally efficient algorithm for the single-link cluster method. The |
|
_Computer Journal, 16(1):30–34, 1973._ |
|
|
|
Werner Stuetzle. Estimating the cluster tree of a density by analyzing the minimal spanning tree of a |
|
sample. Journal of Classification, 20(1):25–47, 2003. |
|
|
|
A Tramacere and C Vecchio. γ-ray dbscan: A clustering algorithm applied to fermi-lat γ-ray data. In |
|
_AIP Conference Proceedings, volume 1505, pp. 705–708. American Institute of Physics, 2012._ |
|
|
|
UCI. UCI machine learning repository. [http://archive.ics.uci.edu/ml/datasets.php](http://archive.ics.uci.edu/ml/datasets.php).
|
|
|
Zhongdao Wang, Liang Zheng, Yali Li, and Shengjin Wang. Linkage based face clustering via |
|
graph convolution network. In Proceedings of the IEEE/CVF Conference on Computer Vision and |
|
_Pattern Recognition, pp. 1117–1125, 2019._ |
|
|
|
Lei Yang, Xiaohang Zhan, Dapeng Chen, Junjie Yan, Chen Change Loy, and Dahua Lin. Learning |
|
to cluster faces on an affinity graph. In Proceedings of the IEEE/CVF Conference on Computer |
|
_Vision and Pattern Recognition, pp. 2298–2306, 2019._ |
|
|
|
Lei Yang, Dapeng Chen, Xiaohang Zhan, Rui Zhao, Chen Change Loy, and Dahua Lin. Learning |
|
to cluster faces via confidence and connectivity estimation. In Proceedings of the IEEE/CVF |
|
_Conference on Computer Vision and Pattern Recognition, pp. 13369–13378, 2020._ |
|
|
|
Xiaohang Zhan, Ziwei Liu, Junjie Yan, Dahua Lin, and Chen Change Loy. Consensus-driven |
|
propagation in massive unlabeled data for face recognition. In Proceedings of the European |
|
_Conference on Computer Vision (ECCV), pp. 568–583, 2018._ |
|
|
|
Ye Zhu, Kai Ming Ting, and Mark J Carman. Density-ratio based clustering for discovering clusters |
|
with varying densities. Pattern Recognition, 60:983–997, 2016. |
|
|
|
Carly GK Ziegler, Samuel J Allon, Sarah K Nyquist, Ian M Mbano, Vincent N Miao, Constantine N |
|
Tzouanas, Yuming Cao, Ashraf S Yousif, Julia Bals, Blake M Hauser, et al. Sars-cov-2 receptor |
|
ace2 is an interferon-stimulated gene in human airway epithelial cells and is detected in specific |
|
cell subsets across tissues. Cell, 181(5):1016–1035, 2020. |
|
|
|
|
|
|
|
|
A APPENDIX |
|
|
|
In this supplementary file, we provide technical proofs of the theoretical results in Section 4.3, and |
|
present extra empirical experiments regarding our kernel diffusion approach with symmetric and |
|
asymmetric Gaussian kernels applied to DBSCAN. All the numerical experiments are carried out on a standard workstation with an Intel 64-core CPU and two Nvidia P100 GPUs.
|
|
|
A.1 PSEUDO CODE FOR DBSCAN AND DPC. |
|
|
|
**Algorithm 1 DBSCAN** |
|
|
|
1: Input: SetOfPoints X, Eps ε, MinPts k |
|
2: H:={x ∈ _X : |B(x, ε) ∩_ _X| ≥_ _k};_ |
|
3: G := undirected graph with vertices H and an edge between x, x′ ∈ H if ∥x − x′∥ ≤ ε;
|
4: Output: connected components of G |
|
|
|
The connected components of the graph are determined as clusters, and the remaining points are left unclustered and considered as noise points.
|
|
|
**Algorithm 2 DPC** |
|
|
|
1: Input: SetOfPoints, TruncDist d_c
2: Compute d_{i,j} for all i, j ∈ SetOfPoints;
3: For i from 1 to SetOfPoints.size:
4:   ρ_i := Σ_{j≠i} 1{d_{i,j} < d_c}
5:   δ_i := min_{j: ρ_j > ρ_i} d_{i,j}
6: Plot the decision map M with ρ as the horizontal axis and δ as the vertical axis;
7: Mark point i with relatively high ρ_i and δ_i as a cluster center;
8: Mark point i with relatively low ρ_i but relatively high δ_i as a noise point;
9: Assign each remaining point the same label as its nearest cluster center;
10: Output: SetOfPoints
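A minimal Python sketch of the ρ and δ computations in Algorithm 2 follows. Selecting centers from the decision map is a manual step in the original algorithm; the ρ·δ ranking used here to pick a fixed number of centers is an assumption made purely to obtain a runnable example.

```python
import numpy as np

def dpc(X, d_c, n_centers):
    """Density Peaks Clustering: compute rho and delta, pick centers, assign the remaining points."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    rho = (d < d_c).sum(axis=1) - 1                      # step 4, excluding the point itself
    order = np.argsort(-rho)                             # points from highest to lowest density
    delta = np.empty(len(X))
    nearest_denser = np.empty(len(X), dtype=int)
    delta[order[0]] = d[order[0]].max()                  # convention for the globally densest point
    nearest_denser[order[0]] = order[0]
    for pos in range(1, len(order)):
        i, denser = order[pos], order[:pos]
        j = denser[np.argmin(d[i, denser])]              # step 5: nearest point of higher density
        delta[i], nearest_denser[i] = d[i, j], j
    scores = rho * delta
    scores[order[0]] = np.inf                            # the densest point is always a center
    centers = np.argsort(-scores)[:n_centers]            # decision-map heuristic (an assumption)
    labels = np.full(len(X), -1)
    labels[centers] = np.arange(n_centers)
    for i in order:                                      # assign in decreasing density order
        if labels[i] == -1:
            labels[i] = labels[nearest_denser[i]]
    return labels

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.2, (30, 2)), rng.normal(2, 0.2, (30, 2))])
print(dpc(X, d_c=0.5, n_centers=2))
```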
|
|
|
|
|
A.2 PROOFS OF THEORETICAL RESULTS
|
|
|
**Proof of Theorem 1.** Since $\{D_1, \ldots, D_m\}$ are disjoint, we have $p(x, y) = 0$ if x and y belong to different clusters. By the definition of the matrix P, for each $x \in D_j$ we have
$$\int_D p(x, y)\,dF_n(y) = 1,$$
which implies that
$$\bar\rho_j\,|D_j| = \int_{x\in D_j}\int_{y\in D_j} p(x, y)\,dF_n(x)\,dF_n(y) = |D_j|.$$
Therefore $\bar\rho_j = 1$ for any $j = 1, \ldots, m$.
|
|
|
|
|
Before proceeding to the proof of Theorem 2, we need the following auxiliary lemma, which relates the stationary distribution of a Markov chain to an arbitrary vector g.

**Lemma A.1.** Let P be the transition probability matrix of a finite irreducible discrete-time Markov chain with n states, which admits a stationary distribution, denoted by the vector π. We write $e = (1, \ldots, 1)^T \in \mathbb{R}^n$ for the column vector of ones. The following holds for any vector g such that $g^T e \neq 0$:

(1) $(I - P + eg^T)$ is non-singular.

(2) Let $H = (I - P + eg^T)^{-1}$; then $\pi^T = g^T H$.
|
|
|
|
|
|
|
|
_Proof._ Since π is the stationary distribution, we have $\pi^T e = 1$. Applying Theorem 3.3 in Hunter (1982) yields that the matrix $(I - P + eg^T)$ is non-singular.

Next, recall that $\pi^T P = \pi^T$. Therefore we have
$$\pi^T (I - P + eg^T) = \pi^T - \pi^T P + \pi^T e g^T = \pi^T e\, g^T = g^T,$$
which implies $\pi^T = g^T H$.
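As a quick numerical sanity check of Lemma A.1 (a sketch; the chain size and the particular choice of g are arbitrary), one can build a random strictly positive row-stochastic matrix, take g to be its vector of column sums, and verify that $g^T(I - P + eg^T)^{-1}$ recovers the stationary distribution:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
P = rng.random((n, n))
P /= P.sum(axis=1, keepdims=True)      # row-stochastic, strictly positive => irreducible

# Stationary distribution from the leading left eigenvector of P.
w, v = np.linalg.eig(P.T)
pi = np.real(v[:, np.argmax(np.real(w))])
pi /= pi.sum()

e = np.ones(n)
g = P.sum(axis=0)                      # column sums; any g with g @ e != 0 would do
H = np.linalg.inv(np.eye(n) - P + np.outer(e, g))
print(np.allclose(g @ H, pi))          # True: pi^T = g^T H
```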
|
|
|
**Proof of Theorem 2.** Note that for each $x \in D$, the fast kernel diffusion density $\rho_{\mathrm{FKD}}(x) = \int_D p(y, x)\,dF_n(y)$ is the corresponding column average of the transition matrix P. We write the i-th column vector of P as
$$p_i = \big(p(x_1, x_i), \ldots, p(x_n, x_i)\big)^T.$$
Therefore $\rho_{\mathrm{FKD}}(x_i) = e^T p_i / n$.

Since the Markov chain induced by the kernel k(x, y) is ergodic, the density $\rho(x, t)$ of the diffusion process $X_t$ converges to the limiting stationary distribution of the Markov chain, denoted by π.

We can write the n-vectors g and π in the following form:
$$g = (g_1, \ldots, g_n)^T = n\big(\rho_{\mathrm{FKD}}(x_1), \ldots, \rho_{\mathrm{FKD}}(x_n)\big)^T \quad\text{and}\quad \pi = \big(\rho_{\mathrm{KD}}(x_1), \ldots, \rho_{\mathrm{KD}}(x_n)\big)^T,$$
where $g_i = e^T p_i$ is the i-th column sum of the matrix P. As a result, we have
$$\int_D \rho_{\mathrm{FKD}}(x)\,dF_n(x) = \frac{1}{n} \quad\text{and}\quad \frac{1}{n}\,e^T g = 1,$$
and, since π is a probability vector, $e^T \pi = 1$. By the definition of g, we know
$$(eg^T)^2 = n\,eg^T \quad\text{and}\quad e^T P = g^T.$$

It follows from Lemma A.1 that $(I - P + eg^T)$ is non-singular and $\pi^T = g^T H$, where $H = (I - P + eg^T)^{-1}$.

We define $B = I + eg^T$. By simple algebra, B is non-singular with
$$B^{-1} = I - \frac{eg^T}{n + 1}.$$
As a result, it is easy to see that $g^T B^{-1} = \frac{g^T}{n + 1}$ and
$$H^{-1} = B - P = (I - PB^{-1})B.$$
Using the Neumann series, we have
$$H = B^{-1}(I - PB^{-1})^{-1} = B^{-1}\sum_{i=0}^{\infty}(PB^{-1})^i.$$
Thus
$$\pi^T - g^T/n = g^T\Big(H - \frac{I}{n}\Big) = g^T B^{-1}\sum_{i=0}^{\infty}(PB^{-1})^i - \frac{g^T}{n}.$$

Since Assumption 1 gives $\rho_{\mathrm{FKD}}(x) \leq c$ for some $0 < c < 1$ and every $x \in D$, we have $g_i \leq nc$ for all i, and hence
$$g^T p_j \leq nc\, e^T p_j = nc\, g_j.$$
Therefore, letting $\kappa_j$ be the j-th component of $g^T PB^{-1}$, it is straightforward that
$$\kappa_j \leq \frac{nc}{n + 1}\, g_j \leq c\, g_j.$$

This implies that for every $x \in D$,
$$|\rho_{\mathrm{KD}}(x) - \rho_{\mathrm{FKD}}(x)| \leq \rho_{\mathrm{FKD}}(x)\left(\frac{\sum_{i=0}^{\infty} c^i}{n + 1} - \frac{1}{n}\right) \leq \rho_{\mathrm{FKD}}(x)\left(\frac{1}{(n + 1)(1 - c)} - \frac{1}{n}\right).$$
Hence we have
$$\lim_{n\to\infty}\frac{\rho_{\mathrm{KD}}(x)}{\rho_{\mathrm{FKD}}(x)} = 1,$$
which completes the proof.
|
|
|
|
|
A.3 ADDITIONAL NUMERICAL EXPERIMENTS |
|
|
|
**Naive density with different bandwidths.** To illustrate the failure of the naive density function in the scenario of Figure 1, we also plot it for a range of different values of the hyperparameter ε below. We can observe that it is difficult to detect the three true underlying clusters in all cases.
|
|
|
Figure 5: Naive density function in 3D with different values of ϵ. |
|
|
|
**Hyperparameters** The parameter ε (radius of the ball, used in ρnaive, ρLC, ρ_KD^sym and ρ_FKD^sym) is tuned by searching within the range between 0.1 and 1 with an increment of 0.1; the parameter k (number of nearest neighbors, used in ρLC, ρ_KD^asym and ρ_FKD^asym) is tuned by searching within the range between 10% and 50% of the number of samples, with an increment of 10%.
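This tuning amounts to a plain grid search; the sketch below is a hypothetical illustration in which `cluster_and_score` stands in for running DPC (or DBSCAN) with a given density function and returning the chosen F-score, and is not code from the paper.

```python
import numpy as np

def tune(X, cluster_and_score):
    """Grid search over the radius eps and the neighborhood fraction described above."""
    best = (-np.inf, None)
    for eps in np.arange(0.1, 1.01, 0.1):              # eps in {0.1, ..., 1.0}
        for frac in (0.1, 0.2, 0.3, 0.4, 0.5):          # k as a fraction of the sample size
            k = max(1, int(frac * len(X)))
            score = cluster_and_score(X, eps=eps, k=k)
            if score > best[0]:
                best = (score, {"eps": eps, "k": k})
    return best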
|
|
|
**Metadata of benchmark datasets.** The number of samples n, the number of clusters c, and feature |
|
dimension d for each benchmark dataset are listed in Table 4 below. |
|
|
|
**Benchmark datasets with DBSCAN.** We provide the performance of the conventional density functions, ρnaive and ρLC, and the proposed kernel diffusion density functions with symmetric and asymmetric Gaussian kernels, ρ_KD^∗ and ρ_FKD^∗ (∗ ∈ {sym, asym}), applied to DBSCAN on the 13 benchmark datasets. The results are summarised in Table 5. Similar to DPC, we see that both ρ_KD^sym and ρ_KD^asym uniformly outperform ρnaive and ρLC in terms of clustering quality. ρ_KD^asym, which has better local adaptivity analytically, achieves the best results on most datasets and outperforms the others by a significant margin on Breast-o, Control, Haberman and Seeds.
|
|
|
**NMI for benchmark datasets.** Below we present in Table 6 the clustering results for the benchmark datasets based on the NMI metric.
|
|
|
|
|
|
|
|
Table 4: Metadata of benchmark datasets, includes sample size (n), the number of clusters (c), and |
|
feature dimension d. |
|
|
|
| Dataset | n | c | d |
|---|---|---|---|
| Banknote | 1372 | 2 | 4 |
| Breast-d | 569 | 2 | 30 |
| Breast-o | 699 | 2 | 9 |
| Control | 600 | 6 | 60 |
| Glass | 214 | 7 | 9 |
| Haberman | 306 | 2 | 3 |
| Ionosphere | 351 | 2 | 34 |
| Iris | 150 | 3 | 4 |
| Libras | 360 | 15 | 90 |
| Pageblocks | 5473 | 5 | 10 |
| Seeds | 210 | 3 | 7 |
| Segment | 210 | 7 | 19 |
| Wine | 178 | 3 | 13 |
|
|
|
Table 5: Clustering performance on benchmark datasets with different density functions applied to |
|
DBSCAN. Pairwise F-score (FP ) and BCube F-score (FB) under optimal parameter tuning are given. |
|
The best and second-best results in each dataset are bolded and underlined, respectively. |
|
|
|
| Dataset | F_P ρ_naive | F_P ρ_LC | F_P ρ_KD^sym | F_P ρ_KD^asym | F_P ρ_FKD^sym | F_P ρ_FKD^asym | F_B ρ_naive | F_B ρ_LC | F_B ρ_KD^sym | F_B ρ_KD^asym | F_B ρ_FKD^sym | F_B ρ_FKD^asym |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Banknote | 26.8 | 60.7 | 62.0 | 66.4 | 65.4 | 66.4 | 26.5 | 65.1 | 60.7 | 67.4 | 65.7 | 67.4 |
| Breast-d | 56.7 | 63.0 | 65.0 | 66.6 | 67.2 | 66.6 | 60.9 | 64.7 | 66.0 | 67.4 | 67.2 | 67.4 |
| Breast-o | 18.2 | 55.3 | 59.2 | 70.6 | 70.5 | 70.6 | 15.9 | 50.7 | 52.3 | 71.3 | 70.6 | 71.2 |
| Control | 32.5 | 37.1 | 51.0 | 60.3 | 48.9 | 59.1 | 34.2 | 50.1 | 53.7 | 66.9 | 51.6 | 65.7 |
| Glass | 22.0 | 29.8 | 29.8 | 42.5 | 42.0 | 42.5 | 25.8 | 36.9 | 36.9 | 45.2 | 43.5 | 45.2 |
| Haberman | 68.6 | 72.2 | 68.6 | 75.6 | 68.9 | 75.3 | 69.2 | 73.1 | 69.2 | 75.8 | 68.3 | 75.7 |
| Ionosphere | 25.9 | 68.4 | 68.0 | 74.2 | 74.2 | 74.2 | 23.8 | 64.1 | 63.7 | 72.1 | 72.1 | 72.1 |
| Iris | 66.2 | 69.8 | 66.2 | 57.2 | 73.7 | 73.3 | 67.2 | 76.6 | 67.2 | 67.0 | 79.4 | 79.0 |
| Libras | 18.1 | 12.0 | 15.6 | 13.8 | 20.2 | 13.5 | 31.1 | 16.5 | 42.1 | 45.5 | 32.9 | 37.7 |
| Pageblocks | 48.4 | 89.2 | 90.0 | 90.1 | 89.9 | 90.1 | 45.2 | 85.5 | 89.7 | 89.5 | 89.7 | 89.5 |
| Seeds | 57.8 | 47.6 | 57.8 | 63.2 | 22.4 | 62.4 | 59.2 | 53.0 | 59.2 | 70.0 | 24.4 | 69.2 |
| Segment | 18.5 | 47.9 | 54.8 | 30.8 | 41.4 | 30.8 | 22.5 | 55.2 | 66.6 | 53.6 | 59.9 | 53.6 |
| Wine | 40.5 | 40.5 | 40.5 | 49.5 | 50.0 | 49.5 | 45.7 | 45.7 | 45.7 | 52.3 | 51.3 | 52.3 |
|
|
|
|
|
|
|
**Number of clusters.** In Table 7, we present the number of clusters returned by the density-based methods for the benchmark datasets. It can be observed that clustering with the proposed diffusion density functions returns a significantly better estimate of the number of clusters, compared to clustering with classic density functions such as ρnaive and ρLC.
|
|
|
|
|
|
|
|
Table 6: Clustering performance on benchmark datasets with different density functions applied to |
|
DPC. NMI under optimal parameter tuning are given. The best results in each dataset are bolded. |
|
|
|
| Dataset | ρ_naive | ρ_LC | ρ_KD^sym | ρ_KD^asym | ρ_FKD^sym | ρ_FKD^asym | k-means | Spectral |
|---|---|---|---|---|---|---|---|---|
| Banknote | 27.5 | 33.0 | 21.7 | 64.8 | 53.2 | 80.2 | 34.2 | 17.3 |
| Breast-d | 43.7 | 49.1 | 46.8 | 57.4 | 55.7 | 46.1 | 62.3 | 52.6 |
| Breast-o | 30.2 | 32.7 | 37.2 | 79.1 | 36.4 | 78.4 | 74.8 | 14.0 |
| Control | 60.6 | 60.6 | 63.2 | 69.5 | 61.0 | 69.6 | 75.4 | 68.3 |
| Glass | 43.1 | 43.4 | 45.0 | 48.4 | 43.8 | 46.6 | 34.8 | 36.4 |
| Haberman | 9.5 | 5.7 | 9.5 | 3.2 | 16.9 | 3.2 | 7.8 | 6.6 |
| Ionosphere | 27.9 | 28.0 | 30.9 | 31.1 | 30.1 | 30.5 | 13.5 | 5.2 |
| Iris | 51.1 | 53.1 | 60.1 | 73.4 | 62.6 | 73.4 | 74.2 | 70.6 |
| Libras | 63.3 | 66.4 | 63.0 | 68.8 | 68.2 | 69.1 | 60.0 | 56.1 |
| Pageblocks | 8.6 | 13.0 | 11.8 | 28.7 | 14.4 | 29.1 | 13.2 | 12.1 |
| Seeds | 47.1 | 49.8 | 53.6 | 64.8 | 58.6 | 64.8 | 67.4 | 60.3 |
| Segment | 63.5 | 64.4 | 65.1 | 72.2 | 63.0 | 70.7 | 61.2 | 65.2 |
| Wine | 58.1 | 58.2 | 72.0 | 73.3 | 71.1 | 58.6 | 84.2 | 72.7 |
|
|
|
|
|
Table 7: Number of clusters returned by different density functions applied to DPC. The ground truth |
|
is listed in the last column. |
|
|
|
| Dataset | ρ_naive | ρ_LC | ρ_KD^sym | ρ_KD^asym | ρ_FKD^sym | ρ_FKD^asym | Ground Truth |
|---|---|---|---|---|---|---|---|
| Banknote | 16 | 44 | 26 | 2 | 7 | 2 | 2 |
| Breast-d | 3 | 5 | 2 | 3 | 3 | 3 | 2 |
| Breast-o | 15 | 17 | 7 | 2 | 1 | 2 | 2 |
| Control | 32 | 32 | 23 | 23 | 27 | 25 | 6 |
| Glass | 4 | 21 | 4 | 8 | 3 | 7 | 7 |
| Haberman | 34 | 40 | 34 | 1 | 3 | 1 | 2 |
| Ionosphere | 12 | 11 | 7 | 6 | 7 | 7 | 2 |
| Iris | 8 | 12 | 4 | 2 | 3 | 2 | 3 |
| Libras | 12 | 60 | 2 | 61 | 100 | 55 | 15 |
| Pageblocks | 7 | 25 | 3 | 10 | 11 | 8 | 5 |
| Seeds | 10 | 10 | 2 | 6 | 3 | 6 | 3 |
| Segment | 21 | 27 | 14 | 8 | 6 | 9 | 7 |
| Wine | 3 | 7 | 3 | 5 | 3 | 2 | 3 |
|
|
|
|
|