An Efficient Approach to Clustering in Large Multimedia ... - CiteSeerX
An Efficient Approach to Clustering in Large Multimedia Databases with Noise

Alexander Hinneburg, Daniel A. Keim
Institute of Computer Science, University of Halle, Germany
{hinneburg, keim}@informatik.uni-halle.de
Abstract

Several clustering algorithms can be applied to clustering in large multimedia databases. The effectiveness and efficiency of the existing algorithms, however, is somewhat limited, since clustering in multimedia databases requires clustering high-dimensional feature vectors and since multimedia databases often contain large amounts of noise. In this paper, we therefore introduce a new algorithm to clustering in large multimedia databases called DENCLUE (DENsity-based CLUstEring). The basic idea of our new approach is to model the overall point density analytically as the sum of influence functions of the data points. Clusters can then be identified by determining density-attractors, and clusters of arbitrary shape can be easily described by a simple equation based on the overall density function. The advantages of our new approach are (1) it has a firm mathematical basis, (2) it has good clustering properties in data sets with large amounts of noise, (3) it allows a compact mathematical description of arbitrarily shaped clusters in high-dimensional data sets, and (4) it is significantly faster than existing algorithms. To demonstrate the effectiveness and efficiency of DENCLUE, we perform a series of experiments on a number of different data sets from CAD and molecular biology. A comparison with DBSCAN shows the superiority of our new approach.
Keywords: Clustering Algorithms, Density-based Clustering, Clustering of High-dimensional Data, Clustering in Multimedia Databases, Clustering in the Presence of Noise
1 Introduction
Because of the fast technological progress, the amount of data which is stored in databases increases very fast. The types of data which are stored in the computer become increasingly complex. In addition to numerical data, complex 2D and 3D multimedia data such as image, CAD, geographic, and molecular biology data are stored in databases. For an efficient retrieval, the complex data is usually transformed into high-dimensional feature vectors. Examples of feature vectors are color histograms [SH94], shape descriptors [Jag91, MG95], Fourier vectors [WW80], text descriptors [Kuk92], etc. In many of the mentioned applications, the databases are very large and consist of millions of data objects with several tens to a few hundreds of dimensions.
Automated knowledge discovery in large multimedia databases is an increasingly important research issue. Clustering and trend detection in such databases, however, is difficult since the databases often contain large amounts of noise and sometimes only a small portion of the large databases accounts for the clustering. In addition, most of the known algorithms do not work efficiently on high-dimensional data. The methods applicable to databases of high-dimensional feature vectors are those known from the area of spatial data mining. The most prominent representatives are partitioning algorithms such as CLARANS [NH94], hierarchical clustering algorithms, and locality-based clustering algorithms such as (G)DBSCAN [EKSX96, EKSX97] and DBCLASD [XEKS98]. The basic idea of partitioning algorithms is to partition the database into k clusters which are represented by the center of gravity of the cluster (k-means) or by one representative object of the cluster (k-medoid). Each object is assigned to the closest cluster. A well-known partitioning algorithm is CLARANS, which uses a randomized and bounded search strategy to improve the performance. Hierarchical clustering algorithms decompose the database into several levels of partitionings which are usually represented by a dendrogram, a tree which splits the database recursively into smaller subsets. The dendrogram can be created top-down (divisive) or bottom-up (agglomerative). Although hierarchical clustering algorithms can be very effective in knowledge discovery, the cost of creating the dendrograms is prohibitively expensive for large data sets since the algorithms are usually at least quadratic in the number of data objects. More efficient are locality-based clustering algorithms since they usually group neighboring data elements into clusters based on local conditions and therefore allow the clustering to be performed in one scan of the database. DBSCAN, for example, uses a density-based notion of clusters and allows the discovery of arbitrarily shaped clusters. The basic idea is that for each point of a cluster the density of data points in the neighborhood has to exceed some threshold. DBCLASD also works locality-based, but in contrast to DBSCAN assumes that the points inside of the clusters are randomly distributed, allowing DBCLASD to work without any input parameters. A performance comparison [XEKS98] shows that DBSCAN is slightly faster than DBCLASD, and both DBSCAN and DBCLASD are much faster than hierarchical clustering algorithms and partitioning algorithms such as CLARANS. To improve the efficiency, optimized clustering techniques have been proposed. Examples include R*-Tree-based Sampling [EKX95], Gridfile-based clustering [Sch96], BIRCH [ZRL96] which is based on the Cluster-Feature-tree, and STING which uses a quadtree-like structure containing additional statistical information [WYM97].

Copyright (c) 1998, American Association for Artificial Intelligence (www.aaai.org). All rights reserved.
A problem of the existing approaches in the context of clustering multimedia data is that most algorithms are not designed for clustering high-dimensional feature vectors and therefore, the performance of existing algorithms degenerates rapidly with increasing dimension. In addition, few algorithms can deal with databases containing large amounts of noise, which is quite common in multimedia databases where usually only a small portion of the database forms the interesting subset which accounts for the clustering. Our new approach solves these problems. It works efficiently for high-dimensional data sets and allows arbitrary noise levels while still guaranteeing to find the clustering. In addition, our approach can be seen as a generalization of different previous clustering approaches: partitioning-based, hierarchical, and locality-based clustering. Depending on the parameter settings, we may initialize our approach to provide similar (or even the same) results as different other clustering algorithms, and due to its generality, in some cases our approach even provides a better effectiveness. In addition, our algorithm works efficiently on very large amounts of high-dimensional data, outperforming one of the fastest existing methods (DBSCAN) by a factor of up to 45.
The rest of the paper is organized as follows. In section 2, we introduce the basic idea of our approach and formally define influence functions, density functions, density-attractors, clusters, and outliers. In section 3, we discuss the properties of our approach, namely its generality and its invariance with respect to noise. In section 4, we then introduce our algorithm including its theoretical foundations such as the locality-based density function and its error-bounds. In addition, we also discuss the complexity of our approach. In section 5, we provide an experimental evaluation comparing our approach to previous approaches such as DBSCAN. For the experiments, we use real data from CAD and molecular biology. To show the ability of our approach to deal with noisy data, we also use synthetic data sets with a variable amount of noise. Section 6 summarizes our results and discusses their impact as well as important issues for future work.
2 A General Approach to Clustering
Before introducing the formal definitions required for describing our new approach, in the following we first try to give a basic understanding of our approach.
2.1 Basic Idea
Our new approach is based on the idea that the influence of each data point can be modeled formally using a mathematical function which we call influence function. The influence function can be seen as a function which describes the impact of a data point within its neighborhood. Examples of influence functions are parabolic functions, the square wave function, or the Gaussian function. The influence function is applied to each data point. The overall density of the data space can be calculated as the sum of the influence functions of all data points. Clusters can then be determined mathematically by identifying density-attractors. Density-attractors are local maxima of the overall density function. If the overall density function is continuous and differentiable at any point, determining the density-attractors can be done efficiently by a hill-climbing procedure which is guided by the gradient of the overall density function. In addition, the mathematical form of the overall density function allows clusters of arbitrary shape to be described in a very compact mathematical form, namely by a simple equation of the overall density function. Theoretical results show that our approach is very general and allows different clusterings of the data (partition-based, hierarchical, and locality-based) to be found. We also show (theoretically and experimentally) that our algorithm is invariant against large amounts of noise and works well for high-dimensional data sets.
The algorithm DENCLUE is an efficient implementation of our idea. The overall density function requires summing up the influence functions of all data points. Most of the data points, however, do not actually contribute to the overall density function. Therefore, DENCLUE uses a local density function which considers only the data points which actually contribute to the overall density function. This can be done while still guaranteeing tight error bounds. An intelligent cell-based organization of the data allows our algorithm to work efficiently on very large amounts of high-dimensional data.
2.2 Definitions
We first have to introduce the general notion of influence and density functions. Informally, the influence functions are a mathematical description of the influence a data object has within its neighborhood. We denote the d-dimensional feature space by $F^d$. The density function at a point $x \in F^d$ is defined as the sum of the influence functions of all data objects at that point (cf. [Schn64] or [FH75] for a similar notion of density functions).
Def. 1 (Influence & Density Function)
The influence function of a data object $y \in F^d$ is a function $f_B^y : F^d \to \mathbb{R}_0^+$ which is defined in terms of a basic influence function $f_B$:
$$f_B^y(x) = f_B(x, y).$$
The density function is defined as the sum of the influence functions of all data points. Given $N$ data objects described by a set of feature vectors $D = \{x_1, \ldots, x_N\} \subset F^d$, the density function is defined as
$$f_B^D(x) = \sum_{i=1}^{N} f_B^{x_i}(x).$$
In principle, the influence function can be an arbitrary function. For the definition of specific influence functions, we need a distance function $d : F^d \times F^d \to \mathbb{R}_0^+$ which determines the distance of two d-dimensional feature vectors. The distance function has to be reflexive and symmetric. For simplicity, in the following we assume a Euclidean distance function. Note, however, that the definitions are independent from the choice of the distance function. Examples of basic influence functions are:
Figure 1: Example for Density Functions ((a) Data Set, (b) Square Wave, (c) Gaussian)
1. Square Wave Influence Function
$$f_{Square}(x, y) = \begin{cases} 0 & \text{if } d(x, y) > \sigma \\ 1 & \text{otherwise} \end{cases}$$
2. Gaussian Influence Function
$$f_{Gauss}(x, y) = e^{-\frac{d(x, y)^2}{2\sigma^2}}$$
The density function which results from a Gaussian influence function is
$$f_{Gauss}^D(x) = \sum_{i=1}^{N} e^{-\frac{d(x, x_i)^2}{2\sigma^2}}.$$
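For illustration, the Gaussian density function can be evaluated directly. The following minimal Python sketch (the function names and the toy data set are our own, not part of the original DENCLUE implementation) computes the overall density for a small 2D data set:

```python
import math

def gaussian_density(x, data, sigma):
    """Overall density at point x: sum of the Gaussian influences of all data points."""
    def dist(a, b):
        # Euclidean distance, as assumed in the paper
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    return sum(math.exp(-dist(x, xi) ** 2 / (2 * sigma ** 2)) for xi in data)

# Two well-separated groups of 2D points
data = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0)]
# Density is high near a group and near zero in the empty space between them
print(gaussian_density((0.0, 0.0), data, sigma=0.5))
print(gaussian_density((2.5, 2.5), data, sigma=0.5))
```

Note that points far from $x$ contribute almost nothing to the sum, which is the observation exploited by the local density function of section 4.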
Figure 1 shows an example of a set of data points in 2D space (cf. figure 1a) together with the corresponding overall density functions for a square wave (cf. figure 1b) and a Gaussian influence function (cf. figure 1c). In the following, we define two different notions of clusters: center-defined clusters (similar to k-means clusters) and arbitrary-shape clusters. For the definitions, we need the notion of density-attractors. Informally, density-attractors are local maxima of the overall density function and therefore, we also need to define the gradient of the density function.
Def. 2 (Gradient)
The gradient of a function $f_B^D(x)$ is defined as
$$\nabla f_B^D(x) = \sum_{i=1}^{N} (x_i - x) \cdot f_B^{x_i}(x).$$
In case of the Gaussian influence function, the gradient is defined as:
$$\nabla f_{Gauss}^D(x) = \sum_{i=1}^{N} (x_i - x) \cdot e^{-\frac{d(x, x_i)^2}{2\sigma^2}}.$$
In general, it is desirable that the influence function is a symmetric, continuous, and differentiable function. Note, however, that the definition of the gradient is independent of these properties. Now, we are able to define the notion of density-attractors.
Def. 3 (Density-Attractor)
A point $x^* \in F^d$ is called a density-attractor for a given influence function, iff $x^*$ is a local maximum of the density function $f_B^D$.
A point $x \in F^d$ is density-attracted to a density-attractor $x^*$, iff $\exists k \in \mathbb{N}: d(x^k, x^*) \leq \epsilon$ with
$$x^0 = x, \qquad x^i = x^{i-1} + \delta \cdot \frac{\nabla f_B^D(x^{i-1})}{\|\nabla f_B^D(x^{i-1})\|}.$$
Figure 2: Example of Density-Attractors
Figure 2 shows an example of density-attractors in a one-dimensional space. For a continuous and differentiable influence function, a simple hill-climbing algorithm can be used to determine the density-attractor for a data point $x \in D$. The hill-climbing procedure is guided by the gradient of $f_B^D$.
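The hill-climbing procedure of definition 3 can be sketched as follows. This is an illustrative Python version, not the paper's implementation; the step width `delta`, the convergence threshold `eps`, and the toy data are our own choices:

```python
import math

def density_attractor(x, data, sigma, delta=0.1, eps=1e-3, max_steps=1000):
    """Follow the normalized gradient of the Gaussian density function
    (definition 3) until the gradient (nearly) vanishes."""
    def gradient(p):
        # nabla f(p) = sum_i (x_i - p) * exp(-d(p, x_i)^2 / (2 sigma^2))
        g = [0.0] * len(p)
        for xi in data:
            w = math.exp(-sum((a - b) ** 2 for a, b in zip(p, xi)) / (2 * sigma ** 2))
            for j in range(len(p)):
                g[j] += (xi[j] - p[j]) * w
        return g

    for _ in range(max_steps):
        g = gradient(x)
        norm = math.sqrt(sum(v * v for v in g))
        if norm < eps:          # (near) zero gradient: local maximum reached
            break
        x = tuple(xj + delta * gj / norm for xj, gj in zip(x, g))
    return x

data = [(0.0,), (0.2,), (0.1,), (5.0,), (5.2,)]
a1 = density_attractor((0.0,), data, sigma=0.5)
a2 = density_attractor((5.2,), data, sigma=0.5)
# Points of the same group climb to (approximately) the same attractor
```

A fixed step width is used here for simplicity; in practice the step would be adapted near the maximum to guarantee convergence within $\epsilon$.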
We are now able to introduce our definitions of clusters and outliers. Outliers are points which are not influenced by "many" other data points. We need a bound $\xi$ to formalize the "many".
Def. 4 (Center-Defined Cluster)
A center-defined cluster (wrt $\sigma, \xi$) for a density-attractor $x^*$ is a subset $C \subseteq D$, with $x \in C$ being density-attracted by $x^*$ and $f_B^D(x^*) \geq \xi$. Points $x \in D$ are called outliers if they are density-attracted by a local maximum $x_o^*$ with $f_B^D(x_o^*) < \xi$.
Figure 3: Example of Center-Defined Clusters for different $\sigma$ ((a) $\sigma = 0.2$, (b) $\sigma = 0.6$, (d) $\sigma = 1.5$)

Figure 4: Example of Arbitrary-Shape Clusters for different $\xi$ ((a) $\xi = 2$, (b) $\xi = 2$, (c) $\xi = 1$, (d) $\xi = 1$)
3 Properties of Our Approach
In the following subsections we discuss some important properties of our approach including its generality, noise-invariance, and parameter settings.
3.1 Generality
As already mentioned, our approach generalizes other clustering methods, namely partition-based, hierarchical, and locality-based clustering methods. Due to the wide range of different clustering methods and the given space limitations, we cannot discuss the generality of our approach in detail. Instead, we provide the basic ideas of how our approach generalizes some other well-known clustering algorithms.
Let us first consider the locality-based clustering algorithm DBSCAN. Using a square wave influence function with $\sigma$ = EPS and an outlier-bound $\xi$ = MinPts, the arbitrary-shape clusters defined by our method (cf. definition 5) are the same as the clusters found by DBSCAN. The reason is that in case of the square wave influence function, the points $x \in D : f^D(x) > \xi$ satisfy the core-point condition of DBSCAN, and each non-core point $x \in D$ which is directly density-reachable from a core-point $x_c$ is attracted by the density-attractor of $x_c$. An example showing the identical clustering of DBSCAN and our approach is provided in figure 5. Note that the results are only identical for a very simple influence function, namely the square wave function.
Figure 5: Generality of our Approach ((a) DBSCAN, (b) DENCLUE)
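The correspondence can be made concrete with a small sketch (our own toy example, not from the paper): under the square wave influence function, the density at $x$ is simply the number of data points within distance $\sigma$ of $x$, which is the quantity DBSCAN compares against MinPts in its core-point test:

```python
def square_wave_density(x, data, sigma):
    """Density under the square wave influence function: the number of data
    points within distance sigma of x (1D example for brevity)."""
    return sum(1 for xi in data if abs(xi - x) <= sigma)

data = [0.0, 0.1, 0.2, 0.3, 5.0]
EPS, MIN_PTS = 0.5, 3    # illustrative values for sigma and the outlier bound xi
core = [x for x in data if square_wave_density(x, data, EPS) > MIN_PTS]
# The dense group qualifies as core points; the isolated point does not
```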
To show the generality of our approach with respect to partition-based clustering methods such as the k-means clustering, we have to use a Gaussian influence function. If $\epsilon$ of definition 3 equals $\sigma/2$, then there exists a $\sigma$ such that our approach will find $k$ density-attractors corresponding to the $k$ mean values of the k-means clustering, and the center-defined clusters (cf. definition 4) correspond to the k-means clusters. The idea of the proof of this property is based on the observation that by varying the $\sigma$ we are able to obtain between 1 and $N^*$ clusters ($N^*$ is the number of non-identical data points) and we therefore only have to find the appropriate $\sigma$ to obtain a corresponding k-means clustering. Note that the results of our approach represent a globally optimal clustering while most k-means clustering methods only provide a local optimum of partitioning the data set $D$ into $k$ clusters. The results of k-means clustering and our approach are therefore only the same if the k-means clustering provides a global optimum.
The third class of clustering methods are the hierarchical methods. By using different values for $\sigma$ and the notion of center-defined clusters according to definition 4, we are able to generate a hierarchy of clusters. If we start with a very small value for $\sigma$, we obtain $N^*$ clusters. By increasing $\sigma$, density-attractors start to merge and we get the next level of the hierarchy. If we further increase $\sigma$, more and more density-attractors merge and finally, we only have one density-attractor representing the root of our hierarchy.
3.2 Noise-Invariance
In this subsection, we show the ability of our approach to handle large amounts of noise. Let $D \subset F^d$ be a given data set and $DS^D$ (data space) the relevant portion of $F^d$. Since $D$ consists of clusters and noise, it can be partitioned $D = D_C \cup D_N$, where $D_C$ contains the clusters and $D_N$ contains the noise (e.g., points that are uniformly distributed in $DS^D$). Let $X = \{x_1^*, \ldots, x_k^*\}$ be the canonically ordered set of density-attractors belonging to $D$ (wrt $\sigma, \xi$) and $X_C = \{\hat{x}_1^*, \ldots, \hat{x}_{\hat{k}}^*\}$ the density-attractors for $D_C$ (wrt $\sigma, \xi_C$) with $\xi$ and $\xi_C$ being adequately chosen, and let $\|S\|$ denote the cardinality of $S$. Then, we are able to show the following lemma.

Lemma 1 (Noise-Invariance)
The number of density-attractors of $D$ and $D_C$ is the same and the probability that the density-attractors remain identical goes against 1 for $\|D_N\| \to \infty$. More formally,
$$\|X\| = \|X_C\| \quad \text{and} \quad \lim_{\|D_N\| \to \infty} \frac{1}{\|X\|} \sum_{i=1}^{\|X\|} P\big(d(x_i^*, \hat{x}_i^*) = 0\big) = 1.$$
Idea of the Proof: The proof is based on the fact that the normalized density $\tilde{f}^{D_N}$ of a uniformly distributed data set is nearly constant with $\tilde{f}^{D_N} = c$ ($c > 0$). So the density distribution of $D$ can be approximated by
$$f^D(y) = f^{D_C}(y) + \|D_N\| \cdot \left(\sqrt{2\pi}\,\sigma\right)^d \cdot c$$
for any $y \in DS^D$. The second portion of this formula is constant for a given $D$ and therefore, the density-attractors defined by $f^{D_C}$ do not change.
3.3 Parameter Discussion<br />
As <strong>in</strong> most other approaches, the quality of the result<strong>in</strong>g<br />
cluster<strong>in</strong>g depends on an adequate choice of the<br />
parameters. In our approach, we have two important<br />
parameters, namely and . The parameter determ<strong>in</strong>es<br />
the <strong>in</strong>uence of a po<strong>in</strong>t <strong>in</strong> its neighborhood and <br />
describes whether a density-attrac<strong>to</strong>r is signicant, allow<strong>in</strong>g<br />
a reduction of the number of density-attrac<strong>to</strong>rs<br />
and help<strong>in</strong>g <strong>to</strong> improve the performance. In the follow<strong>in</strong>g,<br />
we describe how the parameters should be chosen<br />
<strong>to</strong> obta<strong>in</strong> good results.<br />
Choosing a good $\sigma$ can be done by considering different $\sigma$ and determining the largest interval between $\sigma_{max}$ and $\sigma_{min}$ where the number of density-attractors $m(\sigma)$ remains constant. The clustering which results from this approach can be seen as naturally adapted to the data set. In figure 6, we provide an example for the number of density-attractors depending on $\sigma$.

Figure 6: Number of Density-Attractors $m(\sigma)$ depending on $\sigma$
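The $\sigma$-selection heuristic can be illustrated as follows. This is a simplified 1D Python sketch with our own toy data and step parameters (the hill climbing uses a sign-based step instead of the full normalized gradient), not the paper's implementation:

```python
import math

def count_attractors(data, sigma, delta=0.05, eps=1e-3, steps=500):
    """Estimate m(sigma): climb the gradient of the 1D Gaussian density from
    every data point and count the distinct end points (rounded)."""
    def grad(p):
        return sum((xi - p) * math.exp(-(xi - p) ** 2 / (2 * sigma ** 2))
                   for xi in data)
    ends = set()
    for x in data:
        for _ in range(steps):
            g = grad(x)
            if abs(g) < eps:          # local maximum reached
                break
            x += delta * (1 if g > 0 else -1)
        ends.add(round(x, 1))
    return len(ends)

data = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
# Very small sigma: every point is its own attractor; large sigma: attractors
# merge. The interval where m(sigma) stays constant suggests a natural clustering.
for sigma in (0.02, 0.5, 5.0):
    print(sigma, count_attractors(data, sigma))
```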
The parameter $\xi$ is the minimum density level for a density-attractor to be significant. If $\xi$ is set to zero, all density-attractors together with their density-attracted data points are reported as clusters. This, however, is often not desirable since, especially in regions with a low density, each point may become a cluster of its own. A good choice for $\xi$ helps the algorithm to focus on the densely populated regions and to save computational time. Note that only the density-attractors have to have a point-density $\geq \xi$. The points belonging to the resulting cluster, however, may have a lower point-density since the algorithm assigns all density-attracted points to the cluster (cf. definition 4). But what is a good choice for $\xi$? If we assume the database $D$ to be noise-free, all density-attractors of $D$ are significant and $\xi$ should be chosen in
$$0 \leq \xi \leq \min_{x^* \in X} \{f^D(x^*)\}.$$
In most cases the database will contain noise ($D = D_C \cup D_N$; cf. subsection 3.2). According to lemma 1, the noise level can be described by the constant $\|D_N\| \cdot \left(\sqrt{2\pi}\,\sigma\right)^d \cdot c$ and therefore, $\xi$ should be chosen in
$$\|D_N\| \cdot \left(\sqrt{2\pi}\,\sigma\right)^d \cdot c \;\leq\; \xi \;\leq\; \min_{x^* \in X_C} \{f^{D_C}(x^*)\}.$$
Note that the noise level also has an influence on the choice of $\sigma$. If the difference of the noise level and the density of small density-attractors is large, then there is a large interval for choosing $\xi$.
4 The Algorithm
In this section, we describe an algorithm which implements our ideas described in sections 2 and 3. For an efficient determination of the clusters, we have to find a possibility to efficiently calculate the density function and the density-attractors. An important observation is that to calculate the density for a point $x \in F^d$, only points of the data set which are close to $x$ actually contribute to the density. All other points may be neglected without making a substantial error. Before describing the details of our algorithm, in the following we first introduce a local variant of the density function together with its error bound.
4.1 Local Density Function<br />
The local density function is an approximation of the overall density function. The idea is to consider the influence of nearby points exactly, whereas the influence of far points is neglected. This introduces an error which can, however, be guaranteed to lie within tight bounds. To define the local density function, we need the function near(x) with x_1 ∈ near(x) ⟺ d(x_1, x) ≤ σ_near. Then, the local density is defined as follows.
Def. 6 (Local Density Function)
The local density f̂^D(x) is

f̂^D(x) = Σ_{x_1 ∈ near(x)} f_B^{x_1}(x).

The gradient of the local density function is defined analogously to Definition 2. In the following, we assume that σ_near = kσ. An upper bound for the error made by using the local density function f̂^D(x) instead of the real density function f^D(x) is given by the following lemma.
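For the Gaussian influence function, Definition 6 can be transcribed directly. A sketch assuming Euclidean distance and σ_near = kσ (the function names are ours):

```python
import math


def local_density(x, data, sigma, k=4.0):
    """Local density ^f_D(x) from Def. 6: sum of Gaussian influences
    of the data points within sigma_near = k*sigma of x; points
    farther away are neglected."""
    def dist(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

    return sum(
        math.exp(-dist(x1, x) ** 2 / (2 * sigma ** 2))
        for x1 in data
        if dist(x1, x) <= k * sigma
    )
```

With `sigma = 1.0` and `k = 4.0`, a point at distance 10 contributes nothing, so `local_density((0.0,), [(0.0,), (10.0,)], 1.0)` evaluates to `1.0`.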
Lemma 2 (Error-Bound)
If the points x_i ∈ D with d(x_i, x) > kσ are neglected, the error is bounded by

Error = Σ_{x_i ∈ D, d(x_i,x) > kσ} e^{−d(x_i,x)²/(2σ²)} ≤ ||{x_i ∈ D | d(x_i, x) > kσ}|| · e^{−k²/2}.

Idea of the Proof: The error-bound assumes that all neglected points lie on a hypersphere of radius kσ around x.
The error-bound given by Lemma 2 is only an upper bound for the error made by the local density function. In real data sets, the error will be much smaller, since many points are much further away from x than kσ.
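The bound of Lemma 2 is easy to check numerically: the neglected influence mass never exceeds the count of far points times e^{−k²/2}. A sketch (names are ours):

```python
import math


def truncation_error_and_bound(x, data, sigma, k=4.0):
    """Exact influence mass neglected by the local density function,
    together with the upper bound from Lemma 2."""
    def dist(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

    far = [x1 for x1 in data if dist(x1, x) > k * sigma]
    error = sum(math.exp(-dist(x1, x) ** 2 / (2 * sigma ** 2)) for x1 in far)
    bound = len(far) * math.exp(-k ** 2 / 2)
    return error, bound
```

Running this on any data set confirms `error <= bound`; when no point lies beyond kσ, both values are zero.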
4.2 The DENCLUE Algorithm<br />
The DENCLUE algorithm works in two steps. Step one is a preclustering step, in which a map of the relevant portion of the data space is constructed. The map is used to speed up the calculation of the density function, which requires efficient access to neighboring portions of the data space. The second step is the actual clustering step, in which the algorithm identifies the density-attractors and the corresponding density-attracted points.
Step 1: The map is constructed as follows. The minimal bounding (hyper-)rectangle of the data set is divided into d-dimensional hypercubes with an edge length of 2σ. Only hypercubes which actually contain data points are determined. The number of populated cubes (||C_p||) can range from 1 to N depending on the chosen σ, but does not depend on the dimensionality of the data space. The hypercubes are numbered according to their relative position from a given origin (cf. Figure 7 for a two-dimensional example). In this way, the populated hypercubes (containing d-dimensional data points) can be mapped to one-dimensional keys. The keys of the populated cubes can be efficiently stored in a randomized search tree or a B+-tree.
31 32 33 34 35 36
25 26 27 28 29 30
19 20 21 22 23 24
13 14 15 16 17 18
 7  8  9 10 11 12
 1  2  3  4  5  6

Figure 7: Map in a 2-dim. data space (the origin is at cube 1, lower left)
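The numbering of Figure 7 can be obtained by integer division of each coordinate by the edge length 2σ and flattening the resulting grid indices row by row. A sketch with hypothetical names (in practice the keys would go into a B+-tree or any sorted map, not shown here):

```python
def cube_key(point, origin, sigma, grid_dims):
    """Map a d-dimensional point to the one-dimensional key of its
    hypercube (edge length 2*sigma), numbering cubes row by row from
    the origin as in Figure 7. grid_dims[i] is the number of cubes
    along dimension i."""
    key, stride = 0, 1
    for p, o, n in zip(point, origin, grid_dims):
        idx = int((p - o) // (2 * sigma))  # grid index along this axis
        key += idx * stride
        stride *= n
    return key + 1  # Figure 7 starts counting at 1
```

For the 6x6 grid of Figure 7 with origin (0, 0) and σ = 0.5 (edge length 1), the point (2.5, 1.5) falls into cube 9, matching the figure.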
For each populated cube c ∈ C_p, in addition to the key, the number of points (N_c) which belong to c, pointers to those points, and the linear sum Σ_{x ∈ c} x are stored. This information is used in the clustering step for a fast calculation of the mean of a cube (mean(c)). Since clusters can spread over more than one cube, neighboring cubes which are also populated have to be accessed. To speed up this access, we connect neighboring populated cubes. More formally, two cubes c_1, c_2 ∈ C_p are connected if d(mean(c_1), mean(c_2)) ≤ 4σ. Doing that for all cubes would normally take O(||C_p||²) time. We can, however, use a second outlier-bound ξ_c to reduce the time needed for connecting the cubes. Let C_sp = {c ∈ C_p | N_c ≥ ξ_c} be the set of highly populated cubes. In general, the number of highly populated cubes ||C_sp|| is much smaller than ||C_p||, especially in high-dimensional space. The time needed for connecting the highly populated cubes with their neighbors is then O(||C_sp|| · ||C_p||) with ||C_sp|| ≪ ||C_p||.
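The connection step above can be sketched as follows: only the highly populated cubes are compared against all populated cubes, giving the O(||C_sp|| · ||C_p||) pair count instead of O(||C_p||²). Names and data layout are our own assumptions:

```python
import math


def connect_cubes(cube_means, highly_populated, sigma):
    """Connect each highly populated cube to every populated cube whose
    mean lies within 4*sigma. cube_means maps cube key -> mean point;
    highly_populated is an iterable of keys from C_sp."""
    def dist(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

    connections = set()
    for c1 in highly_populated:
        for c2, m2 in cube_means.items():
            if c1 != c2 and dist(cube_means[c1], m2) <= 4 * sigma:
                connections.add(frozenset((c1, c2)))
    return connections
```

With σ = 0.5 (so 4σ = 2), cube means at 0, 1, and 10 and only the first cube highly populated, the sketch connects exactly the first two cubes.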
Step 2: After determining the density-attractor x* for a point x via the hill-climbing procedure, and if f̂^D_Gauss(x*) ≥ ξ, the point x is classified and attached to the cluster belonging to x*. For efficiency reasons, the algorithm stores all points x' with d(x_i, x') ≤ σ/2 for any step 0 ≤ i ≤ k during the hill-climbing procedure and attaches these points to the cluster of x* as well. Using this heuristic, all points which are located close to the path from x to its density-attractor can be classified without applying the hill-climbing procedure to them.
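The hill-climbing can be sketched as follows. For the Gaussian influence function we use a mean-shift-style update (move to the influence-weighted mean of the data), which ascends the density function; this is a simplification of the paper's explicit gradient step, and all names and the stopping rule are our own:

```python
import math


def _dist(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def density(x, data, sigma):
    """Gaussian density at x induced by the data set."""
    return sum(math.exp(-_dist(p, x) ** 2 / (2 * sigma ** 2)) for p in data)


def hill_climb(x, data, sigma, max_iter=200, eps=1e-6):
    """Ascend the density from x: repeatedly move to the
    influence-weighted mean of the data until the density stops
    increasing; the end point approximates the density-attractor x*."""
    cur = tuple(x)
    for _ in range(max_iter):
        w = [math.exp(-_dist(p, cur) ** 2 / (2 * sigma ** 2)) for p in data]
        total = sum(w)
        nxt = tuple(
            sum(wi * p[j] for wi, p in zip(w, data)) / total
            for j in range(len(cur))
        )
        if density(nxt, data, sigma) <= density(cur, data, sigma):
            break  # no further ascent possible
        moved = _dist(nxt, cur)
        cur = nxt
        if moved < eps:
            break  # converged to (near) the attractor
    return cur
```

Starting from x = 1.0 with three data points clustered around 0, the procedure converges to an attractor near 0, where the summed Gaussian density is maximal.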
4.3 Time Complexity<br />
First, let us recall the main steps of our algorithm.

DENCLUE(D, σ, ξ, ξ_c)
1. MBR ← DetermineMBR(D)
2. C_p ← DetPopulatedCubes(D, MBR, σ)
   C_sp ← DetHighlyPopulatedCubes(C_p, ξ_c)
3. map, C_r ← ConnectMap(C_p, C_sp, σ)
4. clusters ← DetDensAttractors(map, C_r, σ, ξ)
Step one is a linear scan of the data set and takes O(||D||). Step two takes O(||D|| + ||C_p|| · log(||C_p||)) time, because a tree-based access structure (such as a randomized search tree or B+-tree) is used for storing the keys of the populated cubes. Note that ||C_p|| ≤ ||D|| and therefore, in the worst case, we have O(||D|| · log(||D||)). The complexity of step 3 depends on σ and ξ_c, which determine the cardinality of C_sp and C_p. Formally, the time complexity of step 3 is O(||C_sp|| · ||C_p||) with ||C_sp|| ≪ ||C_p||.
of the molecule in the conformation space. Due to the large amount of high-dimensional data, it is difficult to find, for example, the preferred conformations of the molecule. This, however, is important for applications in the pharmaceutical industry, since small and flexible peptides are the basis for many medicaments. The flexibility of the peptides, however, is also responsible for the fact that the peptide has many intermediate non-stable conformations, which causes the data to contain large amounts of noise (more than 50 percent).
(a) Folded State  (b) Unfolded State
Figure 10: Folded and Unfolded Conformation of the Peptide
In Figures 10a and 10b we show the most important conformations (corresponding to the largest clusters of conformations in the 19-dimensional dihedral angle space) found by DENCLUE. The two conformations correspond to a folded and an unfolded state of the peptide.
6 Conclusions<br />
In this paper, we propose a new approach to clustering in large multimedia databases. We formally introduce the notion of influence and density functions as well as different notions of clusters, which are both based on determining the density-attractors of the density function. For an efficient implementation of our approach, we introduce a local density function which can be efficiently determined using a map-oriented representation of the data. We show the generality of our approach, i.e. that we are able to simulate locality-based, partition-based, and hierarchical clustering. We also formally show the noise-invariance and error-bounds of our approach. An experimental evaluation shows the superior efficiency and effectiveness of DENCLUE. Due to the promising results of the experimental evaluation, we believe that our method will have a noticeable impact on the way clustering will be done in the future. Our plans for future work include an application and fine-tuning of our method for specific applications and an extensive comparison to other clustering methods (such as BIRCH [ZRL96]).
References<br />
[DJSGM97] Daura X., Jaun B., Seebach D., van Gunsteren W.F., Mark A.E.: 'Reversible peptide folding in solution by molecular dynamics simulation', submitted to Science, 1997.
[EKSX96] Ester M., Kriegel H.-P., Sander J., Xu X.: 'A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise', Proc. 2nd Int. Conf. on Knowledge Discovery and Data Mining, AAAI Press, 1996.
[EKSX97] Ester M., Kriegel H.-P., Sander J., Xu X.: 'Density-Connected Sets and their Application for Trend Detection in Spatial Databases', Proc. 3rd Int. Conf. on Knowledge Discovery and Data Mining, AAAI Press, 1997.
[EKX95] Ester M., Kriegel H.-P., Xu X.: 'Knowledge Discovery in Large Spatial Databases: Focusing Techniques for Efficient Class Identification', Proc. 4th Int. Symp. on Large Spatial Databases, in: Lecture Notes in Computer Science, Vol. 951, Springer, 1995, pp. 67-82.
[FH75] Fukunaga K., Hostetler L.D.: 'The Estimation of the Gradient of a Density Function, with Applications in Pattern Recognition', IEEE Trans. on Information Theory, Vol. IT-21, 1975, pp. 32-40.
[Jag91] Jagadish H.V.: 'A Retrieval Technique for Similar Shapes', Proc. ACM SIGMOD Int. Conf. on Management of Data, 1991, pp. 208-217.
[Kuk92] Kukich K.: 'Techniques for Automatically Correcting Words in Text', ACM Computing Surveys, Vol. 24, No. 4, 1992, pp. 377-440.
[MG95] Mehrotra R., Gary J.E.: 'Feature-Index-Based Similar Shape Retrieval', Proc. 3rd Working Conf. on Visual Database Systems, 1995.
[MN89] Mehlhorn K., Näher S.: 'LEDA, a Library of Efficient Data Types and Algorithms', University of Saarland, FB Informatik, TR A 04/89, 1989.
[Nad65] Nadaraya E.A.: 'On Nonparametric Estimates of Density Functions and Regression Curves', Theory Prob. Appl., Vol. 10, 1965, pp. 186-190.
[NH94] Ng R.T., Han J.: 'Efficient and Effective Clustering Methods for Spatial Data Mining', Proc. 20th Int. Conf. on Very Large Data Bases, Morgan Kaufmann, 1994, pp. 144-155.
[Sch96] Schikuta E.: 'Grid Clustering: An Efficient Hierarchical Clustering Method for Very Large Data Sets', Proc. 13th Int. Conf. on Pattern Recognition, Vol. 2, IEEE Computer Society Press, 1996, pp. 101-105.
[Schn64] Schnell P.: 'A Method to Find Point-Groups', Biometrika, 6, 1964, pp. 47-48.
[Schu70] Schuster E.F.: 'Note on the Uniform Convergence of Density Estimates', Ann. Math. Statist., Vol. 41, 1970, pp. 1347-1348.
[SH94] Sawhney H., Hafner J.: 'Efficient Color Histogram Indexing', Proc. Int. Conf. on Image Processing, 1994, pp. 66-70.
[WW80] Wallace T., Wintz P.: 'An Efficient Three-Dimensional Aircraft Recognition Algorithm Using Normalized Fourier Descriptors', Computer Graphics and Image Processing, Vol. 13, 1980.
[WYM97] Wang W., Yang J., Muntz R.: 'STING: A Statistical Information Grid Approach to Spatial Data Mining', Proc. 23rd Int. Conf. on Very Large Data Bases, Morgan Kaufmann, 1997.
[XEKS98] Xu X., Ester M., Kriegel H.-P., Sander J.: 'A Nonparametric Clustering Algorithm for Knowledge Discovery in Large Spatial Databases', to appear in Proc. IEEE Int. Conf. on Data Engineering, IEEE Computer Society Press, 1998.
[ZRL96] Zhang T., Ramakrishnan R., Livny M.: 'BIRCH: An Efficient Data Clustering Method for Very Large Databases', Proc. ACM SIGMOD Int. Conf. on Management of Data, ACM Press, 1996, pp. 103-114.