
An Efficient Approach to Clustering in Large Multimedia Databases with Noise

Alexander Hinneburg, Daniel A. Keim
Institute of Computer Science, University of Halle, Germany
{hinneburg, keim}@informatik.uni-halle.de

Copyright (c) 1998, American Association for Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

Several clustering algorithms can be applied to clustering in large multimedia databases. The effectiveness and efficiency of the existing algorithms, however, is somewhat limited, since clustering in multimedia databases requires clustering high-dimensional feature vectors and since multimedia databases often contain large amounts of noise. In this paper, we therefore introduce a new algorithm for clustering in large multimedia databases called DENCLUE (DENsity-based CLUstEring). The basic idea of our new approach is to model the overall point density analytically as the sum of influence functions of the data points. Clusters can then be identified by determining density-attractors, and clusters of arbitrary shape can be easily described by a simple equation based on the overall density function. The advantages of our new approach are that (1) it has a firm mathematical basis, (2) it has good clustering properties in data sets with large amounts of noise, (3) it allows a compact mathematical description of arbitrarily shaped clusters in high-dimensional data sets, and (4) it is significantly faster than existing algorithms. To demonstrate the effectiveness and efficiency of DENCLUE, we perform a series of experiments on a number of different data sets from CAD and molecular biology. A comparison with DBSCAN shows the superiority of our new approach.

Keywords: Clustering Algorithms, Density-based Clustering, Clustering of High-dimensional Data, Clustering in Multimedia Databases, Clustering in the Presence of Noise

1 Introduction

Because of the fast technological progress, the amount of data stored in databases increases rapidly. The types of data stored in computers are also becoming increasingly complex. In addition to numerical data, complex 2D and 3D multimedia data such as image, CAD, geographic, and molecular biology data are stored in databases. For efficient retrieval, the complex data is usually transformed into high-dimensional feature vectors. Examples of feature vectors are color histograms [SH94], shape descriptors [Jag91, MG95], Fourier vectors [WW80], text descriptors [Kuk92], etc. In many of the mentioned applications, the databases are very large and consist of millions of data objects with several tens to a few hundreds of dimensions.

Automated knowledge discovery in large multimedia databases is an increasingly important research issue. Clustering and trend detection in such databases, however, is difficult since the databases often contain large amounts of noise and sometimes only a small portion of the large database accounts for the clustering. In addition, most of the known algorithms do not work efficiently on high-dimensional data. The methods which


are applicable to databases of high-dimensional feature vectors are the methods known from the area of spatial data mining. The most prominent representatives are partitioning algorithms such as CLARANS [NH94], hierarchical clustering algorithms, and locality-based clustering algorithms such as (G)DBSCAN [EKSX96, EKSX97] and DBCLASD [XEKS98]. The basic idea of partitioning algorithms is to partition the database into k clusters, each represented by the center of gravity of the cluster (k-means) or by one representative object of the cluster (k-medoid). Each object is assigned to the closest cluster. A well-known partitioning algorithm is CLARANS, which uses a randomized and bounded search strategy to improve the performance. Hierarchical clustering algorithms decompose the database into several levels of partitionings, usually represented by a dendrogram: a tree which splits the database recursively into smaller subsets. The dendrogram can be created top-down (divisive) or bottom-up (agglomerative). Although hierarchical clustering algorithms can be very effective in knowledge discovery, the cost of creating the dendrograms is prohibitive for large data sets, since the algorithms are usually at least quadratic in the number of data objects. More efficient are locality-based clustering algorithms, which usually group neighboring data elements into clusters based on local conditions and therefore allow the clustering to be performed in one scan of the database. DBSCAN, for example, uses a density-based notion of clusters and allows the discovery of arbitrarily shaped clusters. The basic idea is that for each point of a cluster, the density of data points in the neighborhood has to exceed some threshold. DBCLASD also works locality-based, but in contrast to DBSCAN it assumes that the points inside the clusters are randomly distributed, allowing DBCLASD to work without any input parameters. A performance comparison [XEKS98] shows that DBSCAN is slightly faster than DBCLASD, and both DBSCAN and DBCLASD are much faster than hierarchical clustering algorithms and partitioning algorithms such as CLARANS. To improve the efficiency, optimized clustering techniques have been proposed. Examples include R*-Tree-based sampling [EKX95], Gridfile-based clustering [Sch96], BIRCH [ZRL96], which is based on the Cluster-Feature-tree, and STING, which uses a quadtree-like structure containing additional statistical information [WYM97].

A problem of the existing approaches in the context of clustering multimedia data is that most algorithms are not designed for clustering high-dimensional feature vectors, and therefore the performance of existing algorithms degenerates rapidly with increasing dimension. In addition, few algorithms can deal with databases containing large amounts of noise, which is quite common in multimedia databases, where usually only a small portion of the database forms the interesting subset that accounts for the clustering. Our new approach solves these problems. It works efficiently for high-dimensional data sets and allows arbitrary noise levels while still guaranteeing to find the clustering. In addition, our approach can be seen as a generalization of different previous clustering approaches: partitioning-based, hierarchical, and locality-based clustering. Depending on the parameter settings, we may initialize our approach to provide similar (or even the same) results as different other clustering algorithms, and due to its generality, in some cases our approach even provides a better effectiveness. In addition, our algorithm works efficiently on very large amounts of high-dimensional data, outperforming one of the fastest existing methods (DBSCAN) by a factor of up to 45.

The rest of the paper is organized as follows. In section 2, we introduce the basic idea of our approach and formally define influence functions, density functions, density-attractors, clusters, and outliers. In section 3, we discuss the properties of our approach, namely its generality and its invariance with respect to noise. In section 4, we then introduce our algorithm including its theoretical foundations, such as the locality-based density function and its error bounds. In addition, we also discuss the complexity of our approach. In section 5, we provide an experimental evaluation comparing our approach to previous approaches such as DBSCAN. For the experiments, we use real data from CAD and molecular biology. To show the ability of our approach to deal with noisy data, we also use synthetic data sets with a variable amount of noise. Section 6 summarizes our results and discusses their impact as well as important issues for future work.

2 A General Approach to Clustering

Before introducing the formal definitions required for describing our new approach, in the following we first try to give a basic understanding of our approach.

2.1 Basic Idea

Our new approach is based on the idea that the influence of each data point can be modeled formally using a mathematical function which we call the influence function. The influence function can be seen as a function which describes the impact of a data point within its neighborhood. Examples of influence functions are parabolic functions, square wave functions, or the Gaussian function. The influence function is applied to each data point. The overall density of the data space can then be calculated as the sum of the influence functions of all data points. Clusters can then be determined mathematically by identifying density-attractors. Density-attractors are local maxima of the overall density function. If the overall density function is continuous and differentiable at any point, determining the density-attractors can be done efficiently by a hill-climbing procedure which is guided by the gradient of the overall density function. In addition, the mathematical form of the overall density function allows clusters of arbitrary shape to be described in a very compact mathematical form, namely by a simple equation of the overall density function. Theoretical results show that our approach is very general and allows different clusterings of the data (partition-based, hierarchical, and locality-based) to be found. We also show (theoretically and experimentally) that our algorithm is invariant against large amounts of noise and works well for high-dimensional data sets.

The algorithm DENCLUE is an efficient implementation of our idea. The overall density function requires summing up the influence functions of all data points. Most of the data points, however, do not actually contribute to the overall density function. Therefore, DENCLUE uses a local density function which considers only the data points which actually contribute to the overall density function. This can be done while still guaranteeing tight error bounds. An intelligent cell-based organization of the data allows our algorithm to work efficiently on very large amounts of high-dimensional data.

2.2 Definitions

We first have to introduce the general notions of influence and density functions. Informally, the influence functions are a mathematical description of the influence a data object has within its neighborhood. We denote the d-dimensional feature space by $F^d$. The density function at a point $x \in F^d$ is defined as the sum of the influence functions of all data objects at that point (cf. [Schn64] or [FH75] for a similar notion of density functions).

Def. 1 (Influence & Density Function)
The influence function of a data object $y \in F^d$ is a function $f_B^y : F^d \to \mathbb{R}_0^+$ which is defined in terms of a basic influence function $f_B$:

$$f_B^y(x) = f_B(x, y).$$

The density function is defined as the sum of the influence functions of all data points. Given $N$ data objects described by a set of feature vectors $D = \{x_1, \ldots, x_N\} \subset F^d$, the density function is defined as

$$f_B^D(x) = \sum_{i=1}^{N} f_B^{x_i}(x).$$

In principle, the influence function can be an arbitrary function. For the definition of specific influence functions, we need a distance function $d : F^d \times F^d \to \mathbb{R}_0^+$ which determines the distance of two d-dimensional feature vectors. The distance function has to be reflexive and symmetric. For simplicity, in the following we assume a Euclidean distance function. Note, however, that the definitions are independent of the choice of the distance function. Examples of basic influence functions are:

1. Square Wave Influence Function

$$f_{Square}(x, y) = \begin{cases} 0 & \text{if } d(x, y) > \sigma \\ 1 & \text{otherwise} \end{cases}$$

2. Gaussian Influence Function

$$f_{Gauss}(x, y) = e^{-\frac{d(x, y)^2}{2\sigma^2}}$$

The density function which results from a Gaussian influence function is

$$f_{Gauss}^D(x) = \sum_{i=1}^{N} e^{-\frac{d(x, x_i)^2}{2\sigma^2}}.$$

Figure 1 shows an example of a set of data points in 2D space (cf. figure 1a) together with the corresponding overall density functions for a square wave (cf. figure 1b) and a Gaussian influence function (cf. figure 1c).

Figure 1: Example of Density Functions: (a) data set, (b) square wave, (c) Gaussian.
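To make these definitions concrete, the following is a minimal sketch in Python/NumPy (not part of the original paper; the function names are ours) that evaluates the two basic influence functions and the resulting overall density function for a small 2D data set:

```python
import numpy as np

def f_square(x, y, sigma):
    """Square wave influence: 1 if d(x, y) <= sigma, else 0."""
    return float(np.linalg.norm(x - y) <= sigma)

def f_gauss(x, y, sigma):
    """Gaussian influence: exp(-d(x, y)^2 / (2 * sigma^2))."""
    return np.exp(-np.linalg.norm(x - y) ** 2 / (2 * sigma ** 2))

def density(x, D, sigma, f_b=f_gauss):
    """Overall density f_B^D(x): sum of the influences of all data points."""
    return sum(f_b(x, xi, sigma) for xi in D)

# Example: density at a query point for a tiny 2D data set
D = np.array([[1.0, 1.0], [1.2, 0.9], [5.0, 5.0]])
print(density(np.array([1.1, 1.0]), D, sigma=0.5))
```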

In the following, we define two different notions of clusters: center-defined clusters (similar to k-means clusters) and arbitrary-shape clusters. For the definitions, we need the notion of density-attractors. Informally, density-attractors are local maxima of the overall density function, and therefore we also need to define the gradient of the density function.

Def. 2 (Gradient)
The gradient of a function $f_B^D(x)$ is defined as

$$\nabla f_B^D(x) = \sum_{i=1}^{N} (x_i - x) \cdot f_B^{x_i}(x).$$

In case of the Gaussian influence function, the gradient is defined as

$$\nabla f_{Gauss}^D(x) = \sum_{i=1}^{N} (x_i - x) \cdot e^{-\frac{d(x, x_i)^2}{2\sigma^2}}.$$

In general, it is desirable that the influence function be a symmetric, continuous, and differentiable function. Note, however, that the definition of the gradient is independent of these properties. Now we are able to define the notion of density-attractors.

Def. 3 (Density-Attractor)
A point $x^* \in F^d$ is called a density-attractor for a given influence function iff $x^*$ is a local maximum of the density function $f_B^D$.

A point $x \in F^d$ is density-attracted to a density-attractor $x^*$ iff $\exists k \in \mathbb{N} : d(x^k, x^*) \le \epsilon$ with

$$x^0 = x, \qquad x^i = x^{i-1} + \delta \cdot \frac{\nabla f_B^D(x^{i-1})}{\|\nabla f_B^D(x^{i-1})\|}.$$

Figure 2: Example of Density-Attractors (one-dimensional data space).

Figure 2 shows an example of density-attractors in a one-dimensional space. For a continuous and differentiable influence function, a simple hill-climbing algorithm can be used to determine the density-attractor for a data point $x \in D$. The hill-climbing procedure is guided by the gradient of $f_B^D$.
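A minimal sketch of such a hill-climbing procedure for the Gaussian influence function might look as follows (our own illustration, not the paper's implementation; the step size delta and the density-based stopping rule are simplifying assumptions, whereas definition 3 stops when $d(x^k, x^*) \le \epsilon$):

```python
import numpy as np

def density_gauss(x, D, sigma):
    """Gaussian density f_Gauss^D(x)."""
    return np.exp(-np.sum((D - x) ** 2, axis=1) / (2 * sigma ** 2)).sum()

def grad_gauss(x, D, sigma):
    """Gradient of the Gaussian density (definition 2):
    sum_i (x_i - x) * exp(-d(x, x_i)^2 / (2 * sigma^2))."""
    w = np.exp(-np.sum((D - x) ** 2, axis=1) / (2 * sigma ** 2))
    return ((D - x) * w[:, None]).sum(axis=0)

def density_attractor(x, D, sigma, delta=0.1, max_steps=1000):
    """Follow the normalized gradient until the density stops increasing."""
    for _ in range(max_steps):
        g = grad_gauss(x, D, sigma)
        norm = np.linalg.norm(g)
        if norm < 1e-9:
            break
        x_new = x + delta * g / norm
        if density_gauss(x_new, D, sigma) <= density_gauss(x, D, sigma):
            break
        x = x_new
    return x

D = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, -0.1], [4.0, 4.0]])
print(density_attractor(np.array([0.5, 0.5]), D, sigma=0.5))
```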

We are now able to introduce our definitions of clusters and outliers. Outliers are points which are not influenced by "many" other data points. We need a bound $\xi$ to formalize the "many".

Def. 4 (Center-Defined Cluster)
A center-defined cluster (wrt $\sigma, \xi$) for a density-attractor $x^*$ is a subset $C \subseteq D$, with $x \in C$ being density-attracted by $x^*$ and $f_B^D(x^*) \ge \xi$. Points $x \in D$ are called outliers if they are density-attracted by a local maximum $x_o^*$ with $f_B^D(x_o^*) < \xi$.

Def. 5 (Arbitrary-Shape Cluster)
An arbitrary-shape cluster (wrt $\sigma, \xi$) for the set of density-attractors $X$ is a subset $C \subseteq D$, where
1. $\forall x \in C$ $\exists x^* \in X$: $f_B^D(x^*) \ge \xi$ and $x$ is density-attracted to $x^*$, and
2. $\forall x_1^*, x_2^* \in X$: there exists a path $P \subset F^d$ from $x_1^*$ to $x_2^*$ with $f_B^D(p) \ge \xi$ for all points $p \in P$.


Figure 3: Example of Center-Defined Clusters for different $\sigma$: (a) $\sigma = 0.2$, (b) $\sigma = 0.6$, (c) $\sigma = 1.5$.

Figure 4: Example of Arbitrary-Shape Clusters for different $\xi$: (a) $\xi = 2$, (b) $\xi = 2$, (c) $\xi = 1$, (d) $\xi = 1$.

3 Properties of Our Approach

In the following subsections, we discuss some important properties of our approach, including its generality, noise-invariance, and parameter settings.

3.1 Generality

As already mentioned, our approach generalizes other clustering methods, namely partition-based, hierarchical, and locality-based clustering methods. Due to the wide range of different clustering methods and the given space limitations, we cannot discuss the generality of our approach in detail. Instead, we provide the basic ideas of how our approach generalizes some other well-known clustering algorithms.

Let us first consider the locality-based clustering algorithm DBSCAN. Using a square wave influence function with $\sigma = EPS$ and an outlier-bound $\xi = MinPts$, the arbitrary-shape clusters defined by our method (cf. definition 5) are the same as the clusters found by DBSCAN. The reason is that in case of the square wave influence function, the points $x \in D$ with $f^D(x) > \xi$ satisfy the core-point condition of DBSCAN, and each non-core point $x \in D$ which is directly density-reachable from a core point $x_c$ is attracted by the density-attractor of $x_c$. An example showing the identical clustering of DBSCAN and our approach is provided in figure 5. Note that the results are only identical for a very simple influence function, namely the square wave function.

Figure 5: Generality of our Approach: (a) DBSCAN, (b) DENCLUE.

To show the generality of our approach with respect to partition-based clustering methods such as k-means clustering, we have to use a Gaussian influence function. If the $\epsilon$ of definition 3 equals $\sigma/2$, then there exists a $\sigma$ such that our approach finds $k$ density-attractors corresponding to the $k$ mean values of the k-means clustering, and the center-defined clusters (cf. definition 4) correspond to the k-means clusters. The idea of the proof of this property is based on the observation that by varying $\sigma$ we are able to obtain between 1 and $N^*$ clusters ($N^*$ being the number of non-identical data points), and we therefore only have to find the appropriate $\sigma$ to obtain a corresponding k-means clustering. Note that the results of our approach represent a globally optimal clustering, while most k-means clustering methods only provide a local optimum of partitioning the data set D into k clusters. The results of k-means clustering and our approach are therefore only the same if the k-means clustering provides a global optimum.

The third class of clustering methods are the hierarchical methods. By using different values for $\sigma$ and the notion of center-defined clusters according to definition 4, we are able to generate a hierarchy of clusters. If we start with a very small value for $\sigma$, we obtain $N^*$ clusters. By increasing $\sigma$, density-attractors start to merge and we get the next level of the hierarchy. If we further increase $\sigma$, more and more density-attractors merge, and finally we have only one density-attractor representing the root of our hierarchy.

3.2 Noise-Invariance

In this subsection, we show the ability of our approach to handle large amounts of noise. Let $D \subset F^d$ be a given data set and $DS^D$ (data space) the relevant portion of $F^d$. Since D consists of clusters and noise, it can be partitioned as $D = D_C \cup D_N$, where $D_C$ contains the clusters and $D_N$ contains the noise (e.g., points that are uniformly distributed in $DS^D$). Let $X = \{x_1^*, \ldots, x_k^*\}$ be the canonically ordered set of density-attractors belonging to D (wrt $\sigma, \xi$) and $X_C = \{\hat{x}_1^*, \ldots, \hat{x}_{\hat{k}}^*\}$ the density-attractors for $D_C$ (wrt $\sigma, \xi_C$), with $\xi$ and $\xi_C$ being adequately chosen, and let $\|S\|$ denote the cardinality of S. Then we are able to show the following lemma.

Lemma 1 (Noise-Invariance)
The number of density-attractors of $D$ and $D_C$ is the same, and the probability that the density-attractors remain identical goes against 1 for $\|D_N\| \to \infty$. More formally,

$$\|X\| = \|X_C\| \quad \text{and} \quad \lim_{\|D_N\| \to \infty} P\left(\sum_{i=1}^{\|X\|} d(x_i^*, \hat{x}_i^*) = 0\right) = 1.$$

Idea of the Proof: The proof is based on the fact that the normalized density $\tilde{f}^{D_N}$ of a uniformly distributed data set is nearly constant, with $\tilde{f}^{D_N} \approx c$ for some $c > 0$. The density distribution of D can therefore be approximated by

$$f^D(y) = f^{D_C}(y) + \|D_N\| \cdot \sqrt{2\pi\sigma^2}^{\,d} \cdot c$$

for any $y \in DS^D$. The second portion of this formula is constant for a given D, and therefore the density-attractors defined by $f^{D_C}$ do not change. $\square$
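The constant-shift argument in the proof idea can be checked empirically. The following sketch (our own, with arbitrarily chosen test data, not from the paper) compares the Gaussian density of a clustered set $D_C$ with that of $D_C$ plus uniform noise at a few probe points; the differences come out roughly constant:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.3

def density(x, D):
    return np.exp(-np.sum((D - x) ** 2, axis=1) / (2 * sigma ** 2)).sum()

# D_C: two Gaussian clusters; D_N: uniform noise over the data space [0, 10]^2
D_C = np.vstack([rng.normal(2.0, 0.2, (200, 2)), rng.normal(7.0, 0.2, (200, 2))])
D_N = rng.uniform(0.0, 10.0, (5000, 2))
D = np.vstack([D_C, D_N])

# The difference f^D(y) - f^{D_C}(y) should be nearly constant in y
probes = rng.uniform(1.0, 9.0, (5, 2))
print([round(density(y, D) - density(y, D_C), 2) for y in probes])
```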

3.3 Parameter Discussion

As in most other approaches, the quality of the resulting clustering depends on an adequate choice of the parameters. In our approach, we have two important parameters, namely $\sigma$ and $\xi$. The parameter $\sigma$ determines the influence of a point in its neighborhood, and $\xi$ describes whether a density-attractor is significant, allowing a reduction of the number of density-attractors and helping to improve the performance. In the following, we describe how the parameters should be chosen to obtain good results.

Choosing a good $\sigma$ can be done by considering different values of $\sigma$ and determining the largest interval between $\sigma_{max}$ and $\sigma_{min}$ where the number of density-attractors $m(\sigma)$ remains constant. The clustering which results from this approach can be seen as naturally adapted to the data set. In figure 6, we provide an example of the number of density-attractors depending on $\sigma$.

Figure 6: Number of Density-Attractors $m(\sigma)$ depending on $\sigma$.
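One way to implement this heuristic, sketched below under the simplifying assumption of one-dimensional data (our own illustration, not the paper's procedure), is to approximate $m(\sigma)$ by counting local maxima of the density on a fine grid and then scanning a range of $\sigma$ values for the widest stable interval:

```python
import numpy as np

def m_of_sigma(D, sigma, grid):
    """Approximate the number of density-attractors m(sigma) by counting
    local maxima of the Gaussian density evaluated on a 1D grid."""
    dens = np.exp(-(grid[:, None] - D[None, :]) ** 2 / (2 * sigma ** 2)).sum(axis=1)
    interior = dens[1:-1]
    return int(np.sum((interior > dens[:-2]) & (interior > dens[2:])))

D = np.array([1.0, 1.1, 0.9, 4.0, 4.2, 3.9, 8.0])
grid = np.linspace(-1.0, 10.0, 2000)
for sigma in [0.05, 0.2, 0.5, 1.0, 2.0]:
    print(sigma, m_of_sigma(D, sigma, grid))   # pick sigma where m stays constant
```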

The parameter $\xi$ is the minimum density level for a density-attractor to be significant. If $\xi$ is set to zero, all density-attractors together with their density-attracted data points are reported as clusters. This, however, is often not desirable, since (especially in regions with a low density) each point may become a cluster of its own. A good choice for $\xi$ helps the algorithm to focus on the densely populated regions and to save computational time. Note that only the density-attractors have to have a point-density of at least $\xi$. The points belonging to the resulting cluster, however, may have a lower point-density, since the algorithm assigns all density-attracted points to the cluster (cf. definition 4). But what is a good choice for $\xi$? If we assume the database D to be noise-free, all density-attractors of D are significant, and $\xi$ should be chosen such that

$$0 \le \xi \le \min_{x^* \in X}\{f^D(x^*)\}.$$

In most cases the database will contain noise ($D = D_C \cup D_N$; cf. subsection 3.2). According to lemma 1, the noise level can be described by the constant $\|D_N\| \cdot \sqrt{2\pi\sigma^2}^{\,d} \cdot c$, and therefore $\xi$ should be chosen such that

$$\|D_N\| \cdot \sqrt{2\pi\sigma^2}^{\,d} \cdot c \le \xi \le \min_{x^* \in X_C}\{f^{D_C}(x^*)\}.$$

Note that the noise level also has an influence on the choice of $\sigma$. If the difference between the noise level and the density of small density-attractors is large, then there is a large interval for choosing $\xi$.

4 The Algorithm

In this section, we describe an algorithm which implements our ideas described in sections 2 and 3. For an efficient determination of the clusters, we have to find a way to efficiently calculate the density function and the density-attractors. An important observation is that to calculate the density for a point $x \in F^d$, only points of the data set which are close to x actually contribute to the density. All other points may be neglected without making a substantial error. Before describing the details of our algorithm, in the following we first introduce a local variant of the density function together with its error bound.

4.1 Local Density Function

The local density function is an approximation of the overall density function. The idea is to consider the influence of nearby points exactly, whereas the influence of far points is neglected. This introduces an error which can, however, be guaranteed to be within tight bounds. To define the local density function, we need the function $near(x)$ with $x_1 \in near(x) \Leftrightarrow d(x_1, x) \le \sigma_{near}$. Then the local density is defined as follows.

Def. 6 (Local Density Function)
The local density $\hat{f}^D(x)$ is

$$\hat{f}^D(x) = \sum_{x_1 \in near(x)} f_B^{x_1}(x).$$

The gradient of the local density function is defined analogously to definition 2. In the following, we assume that $\sigma_{near} = k\sigma$. An upper bound for the error made by using the local density function $\hat{f}^D(x)$ instead of the real density function $f^D(x)$ is given by the following lemma.

Lemma 2 (Error-Bound)
If the points $x_i \in D$ with $d(x_i, x) > k\sigma$ are neglected, the error is bounded by

$$Error = \sum_{x_i \in D,\; d(x_i, x) > k\sigma} e^{-\frac{d(x_i, x)^2}{2\sigma^2}} \le \|\{x_i \in D \mid d(x_i, x) > k\sigma\}\| \cdot e^{-k^2/2}.$$

Idea of the Proof: The error bound assumes that all neglected points lie on a hypersphere of radius $k\sigma$ around x. $\square$

The error bound given by lemma 2 is only an upper bound for the error made by the local density function. In real data sets, the error will be much smaller, since many points are much further away from x than $k\sigma$.
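The following sketch (our own illustration, not from the paper) contrasts the exact Gaussian density with the local density that neglects points beyond $k\sigma$ and checks the actual error against the upper bound of lemma 2:

```python
import numpy as np

def local_density(x, D, sigma, k=4.0):
    """Local density: only points with d(x_i, x) <= k*sigma contribute."""
    d = np.linalg.norm(D - x, axis=1)
    near = d <= k * sigma
    exact = np.exp(-d ** 2 / (2 * sigma ** 2)).sum()
    local = np.exp(-d[near] ** 2 / (2 * sigma ** 2)).sum()
    bound = np.sum(~near) * np.exp(-k ** 2 / 2)   # upper bound of lemma 2
    return exact, local, bound

rng = np.random.default_rng(1)
D = rng.uniform(0, 10, (10000, 2))
exact, local, bound = local_density(np.array([5.0, 5.0]), D, sigma=0.25)
print(exact - local <= bound)   # the actual error respects the bound
```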

4.2 The DENCLUE Algorithm

The DENCLUE algorithm works in two steps. Step one is a preclustering step, in which a map of the relevant portion of the data space is constructed. The map is used to speed up the calculation of the density function, which requires efficient access to neighboring portions of the data space. The second step is the actual clustering step, in which the algorithm identifies the density-attractors and the corresponding density-attracted points.

Step 1: The map is constructed as follows. The minimal bounding (hyper-)rectangle of the data set is divided into d-dimensional hypercubes with an edge length of $2\sigma$. Only hypercubes which actually contain data points are determined. The number of populated cubes ($\|C_p\|$) can range from 1 to N depending on the chosen $\sigma$, but does not depend on the dimensionality of the data space. The hypercubes are numbered depending on their relative position from a given origin (cf. figure 7 for a two-dimensional example). In this way, the populated hypercubes (containing d-dimensional data points) can be mapped to one-dimensional keys. The keys of the populated cubes can be efficiently stored in a randomized search-tree or a B$^+$-tree.
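A sketch of this preclustering step might look as follows (our own simplification, not the paper's implementation: we keep tuple-valued cube coordinates as keys in a hash map instead of folding them into one-dimensional keys for a B$^+$-tree):

```python
import numpy as np
from collections import defaultdict

def build_map(D, sigma):
    """Map points to hypercubes of edge length 2*sigma and collect, per
    populated cube: the point count N_c, the points, and their linear sum."""
    origin = D.min(axis=0)                      # corner of the MBR
    cubes = defaultdict(lambda: [0, [], None])  # key -> [N_c, points, sum]
    for x in D:
        key = tuple(((x - origin) // (2 * sigma)).astype(int))
        entry = cubes[key]
        entry[0] += 1
        entry[1].append(x)
        entry[2] = x if entry[2] is None else entry[2] + x
    return cubes                                # mean(c) = entry[2] / entry[0]

D = np.random.default_rng(2).uniform(0, 10, (1000, 2))
cubes = build_map(D, sigma=0.5)
print(len(cubes), "populated cubes")
```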

Figure 7: Map in a 2-dim. data space (populated hypercubes numbered by their relative position from the origin).

For each populated cube $c \in C_p$, in addition to the key, the number of points ($N_c$) which belong to c, pointers to those points, and the linear sum $\sum_{x \in c} x$ are stored. This information is used in the clustering step for a fast calculation of the mean of a cube ($mean(c)$). Since clusters can spread over more than one cube, neighboring cubes which are also populated have to be accessed. To speed up this access, we connect neighboring populated cubes. More formally, two cubes $c_1, c_2 \in C_p$ are connected if $d(mean(c_1), mean(c_2)) \le 4\sigma$. Doing that for all cubes would normally take $O(\|C_p\|^2)$ time. We can, however, use a second outlier-bound $\xi_c$ to reduce the time needed for connecting the cubes. Let $C_{sp} = \{c \in C_p \mid N_c \ge \xi_c\}$ be the set of highly populated cubes. In general, the number of highly populated cubes $\|C_{sp}\|$ is much smaller than $\|C_p\|$, especially in high-dimensional space. The time needed for connecting the highly populated cubes with their neighbors is then $O(\|C_{sp}\| \cdot \|C_p\|)$ with $\|C_{sp}\| \ll \|C_p\|$.


Step 2: The clustering step considers the populated cubes of the constructed map and determines the density-attractors by the hill-climbing procedure, based on the local density function $\hat{f}_{Gauss}^D$ and its gradient. After determining the density-attractor $x^*$ for a point x, if $\hat{f}_{Gauss}^D(x^*) \ge \xi$, the point x is classified and attached to the cluster belonging to $x^*$. For efficiency reasons, the algorithm stores all points $x'$ with $d(x_i, x') \le \sigma/2$ for any step $0 \le i \le k$ during the hill-climbing procedure and attaches these points to the cluster of $x^*$ as well. Using this heuristic, all points which are located close to the path from x to its density-attractor can be classified without applying the hill-climbing procedure to them.
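A minimal sketch of this shortcut heuristic (our own code, under the same simplifying stopping rule as the earlier hill-climbing sketch; it uses the full density rather than the cube-based local density):

```python
import numpy as np

def classify_with_path(x, D, sigma, delta=0.1, max_steps=1000):
    """Hill-climb from x and collect all data points within sigma/2 of any
    visited step; they join the same cluster as x without hill-climbing."""
    path = [x]
    for _ in range(max_steps):
        w = np.exp(-np.sum((D - x) ** 2, axis=1) / (2 * sigma ** 2))
        g = ((D - x) * w[:, None]).sum(axis=0)
        if np.linalg.norm(g) < 1e-9:
            break
        x_new = x + delta * g / np.linalg.norm(g)
        if np.exp(-np.sum((D - x_new) ** 2, axis=1) / (2 * sigma ** 2)).sum() <= w.sum():
            break
        x = x_new
        path.append(x)
    near_path = set()
    for p in path:   # indices of points within sigma/2 of the path
        near_path |= set(np.where(np.linalg.norm(D - p, axis=1) <= sigma / 2)[0])
    return x, sorted(near_path)

D = np.array([[0.0, 0.0], [0.1, 0.1], [0.2, 0.0], [3.0, 3.0]])
attractor, members = classify_with_path(np.array([0.5, 0.4]), D, sigma=0.4)
print(attractor, members)
```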

4.3 Time Complexity

First, let us recall the main steps of our algorithm.

DENCLUE($D, \sigma, \xi, \xi_c$)
1. $MBR \leftarrow DetermineMBR(D)$
2. $C_p \leftarrow DetPopulatedCubes(D, MBR, \sigma)$; $C_{sp} \leftarrow DetHighlyPopulatedCubes(C_p, \xi_c)$
3. $map, C_r \leftarrow ConnectMap(C_p, C_{sp}, \sigma)$
4. $clusters \leftarrow DetDensAttractors(map, C_r, \sigma, \xi)$

Step one is a linear scan of the data set and takes $O(\|D\|)$. Step two takes $O(\|D\| + \|C_p\| \cdot \log(\|C_p\|))$ time, because a tree-based access structure (such as a randomized search tree or B$^+$-tree) is used for storing the keys of the populated cubes. Note that $\|C_p\| \le \|D\|$, and therefore in the worst case we have $O(\|D\| \cdot \log(\|D\|))$. The complexity of step 3 depends on $\sigma$ and $\xi_c$, which determine the cardinalities of $C_{sp}$ and $C_p$. Formally, the time complexity of step 3 is $O(\|C_{sp}\| \cdot \|C_p\|)$ with $\|C_{sp}\| \ll \|C_p\|$.


5 Experimental Evaluation

For the experiments with molecular biology data [DJSGM97], each conformation of the peptide is described by a 19-dimensional vector of dihedral angles, corresponding to one point of the molecule in the conformation space. Due to the large amount of high-dimensional data, it is difficult to find, for example, the preferred conformations of the molecule. This, however, is important for applications in the pharmaceutical industry, since small and flexible peptides are the basis for many medicaments. The flexibility of the peptides, however, is also responsible for the fact that the peptide has many intermediate non-stable conformations, which causes the data to contain large amounts of noise (more than 50 percent).

Figure 10: Conformations of the Peptide: (a) folded state, (b) unfolded state.

In figure 10, we show the most important conformations (corresponding to the largest clusters of conformations in the 19-dimensional dihedral angle space) found by DENCLUE. The two conformations correspond to a folded and an unfolded state of the peptide.

6 Conclusions

In this paper, we propose a new approach to clustering in large multimedia databases. We formally introduce the notions of influence and density functions, as well as different notions of clusters which are both based on determining the density-attractors of the density function. For an efficient implementation of our approach, we introduce a local density function which can be efficiently determined using a map-oriented representation of the data. We show the generality of our approach, i.e., that we are able to simulate locality-based, partition-based, and hierarchical clustering. We also formally show the noise-invariance and error bounds of our approach. An experimental evaluation shows the superior efficiency and effectiveness of DENCLUE. Due to the promising results of the experimental evaluation, we believe that our method will have a noticeable impact on the way clustering will be done in the future. Our plans for future work include an application and fine-tuning of our method for specific applications and an extensive comparison to other clustering methods (such as BIRCH).

References

[DJSGM97] Daura X., Jaun B., Seebach D., van Gunsteren W.F., Mark A.E.: 'Reversible Peptide Folding in Solution by Molecular Dynamics Simulation', submitted to Science, 1997.

[EKSX96] Ester M., Kriegel H.-P., Sander J., Xu X.: 'A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise', Proc. 2nd Int. Conf. on Knowledge Discovery and Data Mining, AAAI Press, 1996.

[EKSX97] Ester M., Kriegel H.-P., Sander J., Xu X.: 'Density-Connected Sets and their Application for Trend Detection in Spatial Databases', Proc. 3rd Int. Conf. on Knowledge Discovery and Data Mining, AAAI Press, 1997.

[EKX95] Ester M., Kriegel H.-P., Xu X.: 'Knowledge Discovery in Large Spatial Databases: Focusing Techniques for Efficient Class Identification', Proc. 4th Int. Symp. on Large Spatial Databases, 1995, in: Lecture Notes in Computer Science, Vol. 951, Springer, 1995, pp. 67-82.

[FH75] Fukunaga K., Hostetler L.D.: 'The Estimation of the Gradient of a Density Function, with Applications in Pattern Recognition', IEEE Trans. on Information Theory, Vol. IT-21, 1975, pp. 32-40.

[Jag91] Jagadish H. V.: 'A Retrieval Technique for Similar Shapes', Proc. ACM SIGMOD Int. Conf. on Management of Data, 1991, pp. 208-217.

[Kuk92] Kukich K.: 'Techniques for Automatically Correcting Words in Text', ACM Computing Surveys, Vol. 24, No. 4, 1992, pp. 377-440.

[MG95] Mehrotra R., Gary J. E.: 'Feature-Index-Based Similar Shape Retrieval', Proc. of the 3rd Working Conf. on Visual Database Systems, 1995.

[MN89] Mehlhorn K., Näher S.: 'LEDA, a Library of Efficient Data Types and Algorithms', University of Saarland, FB Informatik, TR A 04/89, 1989.

[Nad65] Nadaraya E. A.: 'On Nonparametric Estimates of Density Functions and Regression Curves', Theory Prob. Appl. 10, 1965, pp. 186-190.

[NH94] Ng R. T., Han J.: 'Efficient and Effective Clustering Methods for Spatial Data Mining', Proc. 20th Int. Conf. on Very Large Data Bases, Morgan Kaufmann, 1994, pp. 144-155.

[Sch96] Schikuta E.: 'Grid Clustering: An Efficient Hierarchical Clustering Method for Very Large Data Sets', Proc. 13th Int. Conf. on Pattern Recognition, Vol. 2, IEEE Computer Society Press, 1996, pp. 101-105.

[Schn64] Schnell P.: 'A Method to Find Point-Groups', Biometrika, 6, 1964, pp. 47-48.

[Schu70] Schuster E. F.: 'Note on the Uniform Convergence of Density Estimates', Ann. Math. Statist. 41, 1970, pp. 1347-1348.

[SH94] Sawhney H., Hafner J.: 'Efficient Color Histogram Indexing', Proc. Int. Conf. on Image Processing, 1994, pp. 66-70.

[WW80] Wallace T., Wintz P.: 'An Efficient Three-Dimensional Aircraft Recognition Algorithm Using Normalized Fourier Descriptors', Computer Graphics and Image Processing, Vol. 13, 1980.

[WYM97] Wang W., Yang J., Muntz R.: 'STING: A Statistical Information Grid Approach to Spatial Data Mining', Proc. 23rd Int. Conf. on Very Large Data Bases, Morgan Kaufmann, 1997.

[XEKS98] Xu X., Ester M., Kriegel H.-P., Sander J.: 'A Nonparametric Clustering Algorithm for Knowledge Discovery in Large Spatial Databases', Proc. IEEE Int. Conf. on Data Engineering, IEEE Computer Society Press, 1998.

[ZRL96] Zhang T., Ramakrishnan R., Livny M.: 'BIRCH: An Efficient Data Clustering Method for Very Large Databases', Proc. ACM SIGMOD Int. Conf. on Management of Data, ACM Press, 1996, pp. 103-114.
