
IOSR Journal of Engineering
Apr. 2012, Vol. 2(4) pp: 719-725

AN OVERVIEW ON CLUSTERING METHODS

T. Soni Madhulatha
Associate Professor, Alluri Institute of Management Sciences, Warangal.

ABSTRACT

Clustering is a common technique for statistical data analysis, which is used in many fields, including machine learning, data mining, pattern recognition, image analysis and bioinformatics. Clustering is the process of grouping similar objects into different groups, or more precisely, the partitioning of a data set into subsets, so that the data in each subset are close to one another according to some defined distance measure. This paper covers clustering algorithms, their benefits and their applications. The paper concludes by discussing some limitations.

Keywords: Clustering, hierarchical algorithm, partitional algorithm, distance measure

I. INTRODUCTION

Clustering can be considered the most important unsupervised learning problem; so, as every other problem of this kind, it deals with finding a structure in a collection of unlabeled data. A cluster is therefore a collection of objects which are "similar" to one another and "dissimilar" to the objects belonging to other clusters. Besides the term data clustering, synonyms such as cluster analysis, automatic classification, numerical taxonomy, botryology and typological analysis are also used.

II. TYPES OF CLUSTERING

Data clustering algorithms can be hierarchical or partitional. Hierarchical algorithms find successive clusters using previously established clusters, whereas partitional algorithms determine all clusters at once. Hierarchical algorithms can be agglomerative (bottom-up) or divisive (top-down). Agglomerative algorithms begin with each element as a separate cluster and merge them into successively larger clusters. Divisive algorithms begin with the whole set and proceed to divide it into successively smaller clusters.

HIERARCHICAL CLUSTERING

A key step in hierarchical clustering is to select a distance measure. A simple measure is the Manhattan distance, equal to the sum of absolute differences for each variable. The name comes from the fact that in a two-variable case, the variables can be plotted on a grid that can be compared to city streets, and the distance between two points is the number of blocks a person would walk.

A more common measure is the Euclidean distance, computed by squaring the difference in each variable, summing the squares, and taking the square root of that sum. In the two-variable case, the distance is analogous to finding the length of the hypotenuse of a right triangle; that is, it is the distance "as the crow flies." A review of cluster analysis in health psychology research found that the most common distance measure in published studies in that research area is the Euclidean distance or the squared Euclidean distance.

The Manhattan distance function computes the distance that would be traveled to get from one data point to the other if a grid-like path is followed. The Manhattan distance between two items is the sum of the absolute differences of their corresponding components. The formula for this distance between a point X = (X1, X2, etc.) and a point Y = (Y1, Y2, etc.) is:

$$d = \sum_{i=1}^{n} |X_i - Y_i|$$

where n is the number of variables, and Xi and Yi are the values of the ith variable at points X and Y respectively.

The Euclidean distance function measures the "as-the-crow-flies" distance. The formula for this distance between a point X = (X1, X2, etc.) and a point Y = (Y1, Y2, etc.) is:

$$d = \sqrt{\sum_{j=1}^{n} (X_j - Y_j)^2}$$

Deriving the Euclidean distance between two data points involves computing the square root of the sum of the squares of the differences between corresponding values.
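As a minimal sketch of the two distance functions, the following Python code computes both measures for points given as equal-length tuples. The function names `manhattan` and `euclidean` are illustrative choices, not names taken from the paper.

```python
import math

def manhattan(x, y):
    """Sum of absolute differences of corresponding components."""
    return sum(abs(a - b) for a, b in zip(x, y))

def euclidean(x, y):
    """Square root of the sum of squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

p, q = (0, 0), (3, 4)
print(manhattan(p, q))   # 7   (3 blocks east plus 4 blocks north)
print(euclidean(p, q))   # 5.0 (hypotenuse of the 3-4-5 triangle)
```

For the same pair of points the Manhattan distance is never smaller than the Euclidean distance, which matches the street-grid versus straight-line intuition.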

The following figure illustrates the difference between Manhattan distance and Euclidean distance:

ISSN: 2250-3021 www.iosrjen.org 719 | P a g e


[Figure: Manhattan distance and Euclidean distance]

Agglomerative hierarchical clustering

For example, suppose these data are to be analyzed, where pixel Euclidean distance is the distance metric.

This method builds the hierarchy from the individual elements by progressively merging clusters. In the example, we have six elements {a} {b} {c} {d} {e} and {f}. The first step is to determine which elements to merge in a cluster. Usually, we want to take the two closest elements, therefore we must define a distance between elements. One can also construct a distance matrix at this stage.

Usually the distance between two clusters A and B is one of the following:

The maximum distance between elements of each cluster, also called complete linkage clustering:

$$\max \{\, d(x, y) : x \in A,\ y \in B \,\}$$

The minimum distance between elements of each cluster, also called single linkage clustering:

$$\min \{\, d(x, y) : x \in A,\ y \in B \,\}$$

The mean distance between elements of each cluster, also called average linkage clustering:

$$\frac{1}{\operatorname{card}(A)\,\operatorname{card}(B)} \sum_{x \in A} \sum_{y \in B} d(x, y)$$

The sum of all intra-cluster variance.

The increase in variance for the cluster being merged.

The probability that candidate clusters spawn from the same distribution function.

Each agglomeration occurs at a greater distance between clusters than the previous agglomeration, and one can decide to stop clustering either when the clusters are too far apart to be merged or when there is a sufficiently small number of clusters.
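The three linkage distances and the merging loop can be sketched in Python as follows. This is a naive illustration under stated assumptions: points are tuples, the element distance is Euclidean, clustering stops when a requested number of clusters remains, and the helper names (`agglomerate`, `single_linkage` and so on) are hypothetical. It recomputes all pairwise linkages at every step, so it is meant for small examples only.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def complete_linkage(A, B, d=euclidean):
    """Maximum distance between elements of the two clusters."""
    return max(d(x, y) for x in A for y in B)

def single_linkage(A, B, d=euclidean):
    """Minimum distance between elements of the two clusters."""
    return min(d(x, y) for x in A for y in B)

def average_linkage(A, B, d=euclidean):
    """Mean distance over all cross-cluster pairs."""
    return sum(d(x, y) for x in A for y in B) / (len(A) * len(B))

def agglomerate(points, n_clusters, linkage=single_linkage):
    """Start with singletons and merge the two closest clusters
    until only n_clusters remain (the 'number of clusters' stop rule)."""
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        pairs = ((i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        i, j = min(pairs, key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]   # merge cluster j into cluster i
        del clusters[j]
    return clusters

A = [(0, 0), (0, 1)]
B = [(3, 0), (4, 0)]
print(single_linkage(A, B), complete_linkage(A, B), average_linkage(A, B))
print(agglomerate(A + B + [(10, 10)], 3))
```

Swapping the `linkage` argument changes which pair is considered closest, which is the only place where the linkage choice enters the procedure.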


Divisive clustering

So far we have only looked at agglomerative clustering, but a cluster hierarchy can also be generated top-down. This variant of hierarchical clustering is called top-down clustering or divisive clustering. We start at the top with all documents in one cluster. The cluster is split using a flat clustering algorithm. This procedure is applied recursively until each document is in its own singleton cluster.

Top-down clustering is conceptually more complex than bottom-up clustering since we need a second, flat clustering algorithm as a "subroutine". It has the advantage of being more efficient if we do not generate a complete hierarchy all the way down to individual document leaves. For a fixed number of top levels, using an efficient flat algorithm like K-means, top-down algorithms are linear in the number of documents and clusters.


The hierarchical method suffers from the fact that once a merge or split is done, it can never be undone. This rigidity is useful in that it leads to smaller computation costs by not worrying about a combinatorial number of different choices.

However, there are two approaches to improve the quality of hierarchical clustering:

• Perform careful analysis of object linkages at each hierarchical partitioning, as in CURE and Chameleon.
• Integrate hierarchical agglomeration and then refine the result using iterative relocation, as in BIRCH.

PARTITIONAL CLUSTERING

Partitioning algorithms are based on specifying an initial number of groups, and iteratively reallocating objects among groups until convergence. This type of algorithm typically determines all clusters at once. Most applications adopt one of two popular heuristic methods:

• k-means algorithm
• k-medoids algorithm

K-means algorithm

The K-means algorithm assigns each point to the cluster whose center, also called the centroid, is nearest. The center is the average of all the points in the cluster; that is, its coordinates are the arithmetic mean for each dimension separately over all the points in the cluster.

The following pseudo code explains how the k-means algorithm works:

A. Choose K as the number of clusters.
B. Initialize the codebook vectors of the K clusters (randomly, for instance).
C. For every new sample vector:
   a. Compute the distance between the new vector and every cluster's codebook vector.
   b. Re-compute the closest codebook vector with the new vector, using a learning rate that decreases in time.
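Steps A to C above describe a sequential (online) form of k-means, which can be sketched in Python as follows. The learning rate 1/count, where count is the number of samples a codebook vector has won so far, is an assumption made here as one possible decreasing schedule; with it each codebook vector ends up as the running mean of the samples assigned to it. The function name `online_kmeans` and its `init` and `seed` parameters are hypothetical.

```python
import random

def online_kmeans(samples, k, init=None, seed=0):
    """Sequential k-means: each sample pulls its nearest codebook vector toward it."""
    rng = random.Random(seed)
    # B. initialise the codebook: k distinct random samples unless init is given
    start = init if init is not None else rng.sample(samples, k)
    codebook = [list(c) for c in start]
    counts = [0] * k
    for x in samples:                        # C. for every new sample vector
        # a. squared Euclidean distance to every codebook vector
        dists = [sum((xi - ci) ** 2 for xi, ci in zip(x, c)) for c in codebook]
        j = dists.index(min(dists))
        # b. move the closest codebook vector toward x with a decreasing rate
        counts[j] += 1
        rate = 1.0 / counts[j]
        codebook[j] = [ci + rate * (xi - ci) for ci, xi in zip(codebook[j], x)]
    return codebook

data = [(0, 0), (10, 10), (0, 2), (10, 12)]
print(online_kmeans(data, 2, init=[(0, 0), (10, 10)]))   # [[0.0, 1.0], [10.0, 11.0]]
```

Because samples are processed one at a time, the result of this sequential variant can depend on the presentation order and on the initial codebook.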

The k-means algorithm was chosen for study because of its popularity, which rests on the following reasons:

• Its time complexity is O(nkl), where n is the number of patterns, k is the number of clusters, and l is the number of iterations taken by the algorithm to converge.
• Its space complexity is O(k+n). It requires additional space to store the data matrix.
• In its batch form it is order-independent; for a given initial seed set of cluster centers, it generates the same partition of the data irrespective of the order in which the patterns are presented to the algorithm.

K-medoids algorithm

The basic strategy of the k-medoids algorithm is that each cluster is represented by one of the objects located near the center of the cluster. PAM (Partitioning Around Medoids) was one of the first k-medoids algorithms introduced.

The following pseudo code explains how the k-medoids algorithm works:

Arbitrarily choose k objects as the initial medoids
Repeat
   Assign each remaining object to the cluster with the nearest medoid
   Randomly select a non-medoid object O_random
   Compute the total cost, S, of swapping O_j with O_random
   If S
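A minimal Python sketch of a k-medoids swap search is given below. It rests on two assumptions that are not taken from the pseudo code above: a swap is kept only when it lowers the total cost (S < 0, the usual PAM acceptance rule), and every non-medoid object is tried as a candidate instead of a randomly selected one, which makes the example deterministic. The names `total_cost` and `k_medoids` are hypothetical, and objects must be hashable.

```python
import random

def total_cost(points, medoids, dist):
    """Sum of distances from every object to its nearest medoid."""
    return sum(min(dist(p, m) for m in medoids) for p in points)

def k_medoids(points, k, dist, max_iter=100, seed=0):
    rng = random.Random(seed)
    medoids = rng.sample(points, k)               # arbitrary initial medoids
    for _ in range(max_iter):
        improved = False
        for j in range(k):
            for candidate in points:
                if candidate in medoids:
                    continue
                trial = medoids[:j] + [candidate] + medoids[j + 1:]
                # S: change in total cost if medoid j is swapped with candidate
                S = total_cost(points, trial, dist) - total_cost(points, medoids, dist)
                if S < 0:                         # assumed rule: keep improving swaps only
                    medoids, improved = trial, True
        if not improved:                          # no swap helped: stop
            break
    clusters = {m: [] for m in medoids}
    for p in points:                              # assign objects to the nearest medoid
        clusters[min(medoids, key=lambda m: dist(p, m))].append(p)
    return clusters

values = [1, 2, 3, 20, 21, 22]
print(k_medoids(values, 2, dist=lambda a, b: abs(a - b)))
```

Unlike a centroid, each medoid returned here is always one of the input objects, which is the defining property of the k-medoids strategy.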



The clustering process is based on the classification of the points in the dataset as core points, border points and noise points, and on the use of density relations between points to form the clusters.

The following pseudo code explains how the DBSCAN algorithm works:

To cluster a dataset, our DBSCAN implementation starts by identifying the k nearest neighbours of each point and, for each point, the distance to the farthest of these k neighbours. The average of all these distances is then calculated.

For each point of the dataset, the algorithm identifies the directly density-reachable points using the Eps threshold provided by the user and classifies the points into core or border points.

It then loops through all points of the dataset and, for the core points, starts to construct a new cluster with the support of the GetDRPoints() procedure, which follows the definition of density-reachable points.

In this phase the value used as the Eps threshold is the average distance calculated previously. At the end, the composition of the clusters is verified in order to check whether there exist clusters that can be merged together. This can happen if two points of different clusters are at a distance less than Eps.

Note: DBSCAN does not deal very well with clusters of different densities.
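The core, border and noise classification can be illustrated with a compact Python sketch of the textbook DBSCAN procedure. This is not the implementation described above: Eps and the minimum neighbourhood size are passed in directly (the parameter names `eps` and `min_pts` are assumptions), the k-nearest-neighbour averaging step and the final merge check are omitted, and a point counts as its own neighbour. It uses `math.dist`, available from Python 3.8.

```python
import math

def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or -1 for noise."""
    def neighbours(i):
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbours(i)
        if len(nbrs) < min_pts:
            labels[i] = -1            # provisional noise, may become a border point
            continue
        cluster += 1                  # i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in nbrs if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster   # border point: reachable but not core
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_nbrs = neighbours(j)
            if len(j_nbrs) >= min_pts:    # j is also a core point: keep expanding
                queue.extend(j_nbrs)
    return labels

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
print(dbscan(pts, eps=1.5, min_pts=3))   # [0, 0, 0, 1, 1, 1, -1]
```

The isolated point at (50, 50) has no neighbours within Eps and is therefore reported as noise, while the two dense groups form separate clusters.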

SNN ALGORITHM

The SNN algorithm, like DBSCAN, is a density-based clustering algorithm. The main difference between this algorithm and DBSCAN is that it defines the similarity between points by looking at the number of nearest neighbours that two points share. Using this similarity measure in the SNN algorithm, the density is defined as the sum of the similarities of the nearest neighbours of a point. Points with high density become core points, while points with low density represent noise points. All remaining points that are strongly similar to a specific core point will represent a new cluster.

The SNN algorithm needs three input parameters:

- K, the neighbours' list size;
- Eps, the threshold density;
- MinPts, the threshold that defines the core points.

The following pseudo code explains how the SNN algorithm works:

Define the input parameters.

Find the K nearest neighbours of each point of the dataset. Then the similarity between pairs of points is calculated in terms of how many nearest neighbours the two points share. Using this similarity measure, the density of each point can be calculated as the number of neighbours with which the number of shared neighbours is equal to or greater than Eps.

The points are classified as core points if the density of the point is equal to or greater than MinPts. At this point, the algorithm has all the information needed to start building the clusters, which are constructed around the core points.

However, these clusters do not contain all points. They contain only points that come from regions of relatively uniform density. The points that are not classified into any cluster are classified as noise points.
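The neighbour-list, similarity and density steps can be sketched in Python as follows. One assumption goes beyond the description above: two points only count as similar when each appears in the other's neighbour list, a requirement found in common formulations of shared-nearest-neighbour similarity; without it an isolated point would inherit the density of the group it happens to be nearest to. The function names and the small data set are hypothetical, and cluster construction around the core points is left out.

```python
import math

def knn_lists(points, k):
    """For each point, the set of indices of its k nearest neighbours (itself excluded)."""
    result = []
    for i, p in enumerate(points):
        others = sorted((j for j in range(len(points)) if j != i),
                        key=lambda j: math.dist(p, points[j]))
        result.append(set(others[:k]))
    return result

def snn_density(points, k, eps):
    """Density of a point: number of neighbours sharing at least eps neighbours with it."""
    nn = knn_lists(points, k)
    density = []
    for i in range(len(points)):
        density.append(sum(
            1 for j in nn[i]
            if i in nn[j]                      # assumed: neighbours must be mutual
            and len(nn[i] & nn[j]) >= eps      # shared-neighbour similarity threshold
        ))
    return density

def core_points(points, k, eps, min_pts):
    """Indices of points whose density reaches MinPts."""
    return [i for i, d in enumerate(snn_density(points, k, eps)) if d >= min_pts]

pts = [(0, 0), (0, 1), (1, 0), (1, 1), (20, 20)]
print(snn_density(pts, k=3, eps=2))            # [3, 3, 3, 3, 0]
print(core_points(pts, k=3, eps=2, min_pts=2)) # [0, 1, 2, 3]
```

The four corner points are mutual neighbours that share two neighbours each, so they become core points, while the distant point has density zero and would be treated as noise.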

GRID-BASED CLUSTERING

The grid-based clustering approach uses a multiresolution grid data structure. It quantizes the space into a finite number of cells that form a grid structure on which all the operations for clustering are performed. Grid-based approaches include STING (STatistical INformation Grid) and CLIQUE.

Basic Grid-based Algorithm

1. Define a set of grid cells.
2. Assign objects to the appropriate grid cell and compute the density of each cell.
3. Eliminate cells whose density is below a certain threshold t.
4. Form clusters from contiguous (adjacent) groups of dense cells.
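The four steps of the basic grid-based algorithm can be sketched for two-dimensional points as follows. Several choices here are assumptions made for illustration: square cells of side `cell_size`, density measured as the number of points in a cell, and adjacency restricted to cells sharing an edge. The function name `grid_clusters` is hypothetical, and clusters are returned as lists of cell coordinates.

```python
from collections import defaultdict

def grid_clusters(points, cell_size, threshold):
    # Steps 1-2: map each point to its grid cell and count points per cell
    density = defaultdict(int)
    for x, y in points:
        density[(int(x // cell_size), int(y // cell_size))] += 1
    # Step 3: keep only cells whose density reaches the threshold
    dense = {cell for cell, n in density.items() if n >= threshold}
    # Step 4: group edge-adjacent dense cells with a flood fill
    clusters, seen = [], set()
    for start in sorted(dense):
        if start in seen:
            continue
        seen.add(start)
        stack, group = [start], []
        while stack:
            cx, cy = stack.pop()
            group.append((cx, cy))
            for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nb = (cx + dx, cy + dy)
                if nb in dense and nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        clusters.append(sorted(group))
    return clusters

pts = [(0.1, 0.1), (0.2, 0.3), (1.1, 0.2), (1.5, 0.5),
       (5.5, 5.5), (5.6, 5.7), (9.0, 0.5)]
print(grid_clusters(pts, cell_size=1, threshold=2))
# [[(0, 0), (1, 0)], [(5, 5)]]
```

After the points are binned, the work depends on the number of occupied cells and not on the number of points, which is the main attraction of grid-based methods.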

The following pseudo code explains how the STING algorithm works:

The spatial area is divided into rectangular cells.

There are several levels of cells corresponding to different levels of resolution.

Each cell is partitioned into a number of smaller cells in the next level. Statistical information about each cell is calculated and stored beforehand and is used to answer queries.

Parameters of higher-level cells can be easily calculated from the parameters of lower-level cells: count, mean, s, min, max, and type of distribution (normal, uniform, etc.).

Use a top-down approach to answer spatial data queries.

Start from a pre-selected layer, typically one with a small number of cells. From the pre-selected layer until you reach the bottom layer, do the following:

For each cell in the current level, compute the confidence interval indicating the cell's relevance to a given query;

1. If it is relevant, include the cell in a cluster


2. If it is irrelevant, remove the cell from further consideration
3. Otherwise, look for relevant cells at the next lower layer

Combine relevant cells into relevant regions (based on grid-neighborhood) and return the clusters so obtained as your answers.

Advantages:
Query-independent, easy to parallelize, incremental update; O(K), where K is the number of grid cells at the lowest level.

Disadvantages:
All the cluster boundaries are either horizontal or vertical, and no diagonal boundary is detected.

MODEL-BASED CLUSTERING

Model-based clustering methods attempt to optimize the fit between the given data and some mathematical model. Such methods are often based on the assumption that the data are generated by a mixture of underlying probability distributions. Model-based clustering methods follow two major approaches: the statistical approach and the neural network approach.

In the neural network approach, as in self-organizing maps (SOMs):

1. Clustering is performed by having several units competing for the current object.
2. The unit whose weight vector is closest to the current object wins.
3. The winner and its neighbors learn by having their weights adjusted.
4. SOMs are believed to resemble processing that can occur in the brain.
5. They are useful for visualizing high-dimensional data in 2-D or 3-D space.
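Points 1 to 3 above, competition, winning and weight adjustment, can be sketched as a single training step in Python. The setup is an assumption made for brevity: units are arranged on a one-dimensional chain, neighbours are the units within a fixed `radius` of the winner on that chain, and winner and neighbours use the same fixed learning `rate`. Practical self-organizing maps usually shrink both over time; the function name `som_step` is hypothetical.

```python
import math

def som_step(weights, x, rate=0.5, radius=1):
    """One competitive-learning step; updates weights in place and returns the winner index."""
    # The unit whose weight vector is closest to the current object wins
    winner = min(range(len(weights)), key=lambda u: math.dist(weights[u], x))
    # The winner and its chain neighbours move their weights toward the object
    for u in range(len(weights)):
        if abs(u - winner) <= radius:
            weights[u] = [w + rate * (xi - w) for w, xi in zip(weights[u], x)]
    return winner

units = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [9.0, 9.0]]
print(som_step(units, [1.0, 2.0]))   # 1
print(units)   # [[0.5, 1.0], [1.0, 1.5], [3.0, 3.5], [9.0, 9.0]]
```

Because neighbouring units are dragged along with the winner, units that are close on the chain end up with similar weight vectors, which is what makes the map usable for low-dimensional visualization.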

In model-based clustering, the data x are viewed as coming from a mixture density

$$f(x) = \sum_{k=1}^{G} \tau_k f_k(x)$$

where f_k is the probability density function of the observations in group k, and τ_k is the probability that an observation comes from the kth mixture component.

Each component is usually modeled by the normal or Gaussian distribution. Component distributions are characterized by the mean μ_k and the covariance matrix Σ_k, and have the probability density function

$$\phi_k(x_i; \mu_k, \Sigma_k) = \frac{\exp\left\{-\frac{1}{2}(x_i - \mu_k)^T \Sigma_k^{-1} (x_i - \mu_k)\right\}}{\sqrt{\det(2\pi\Sigma_k)}}$$

For univariate data, the covariance matrix reduces to a scalar variance. The likelihood for data consisting of n observations assuming a Gaussian mixture model with G multivariate mixture components is

$$\prod_{i=1}^{n} \sum_{k=1}^{G} \tau_k \, \phi_k(x_i; \mu_k, \Sigma_k).$$

MCLUST is probably the most well known model-based

This concludes the overview of the various clustering algorithms.

III. HOW TO DETERMINE THE NUMBER OF CLUSTERS

Many clustering algorithms require the specification of the number of clusters to produce in the input data set, prior to execution of the algorithm. Barring knowledge of the proper value beforehand, the appropriate value must be determined, a problem on its own for which a number of techniques have been developed.

- If the number of clusters is known, the termination condition is given.
- In general, set a distance threshold value (termination condition).
- The K-cluster lifetime is the range of threshold values on the dendrogram tree that leads to the identification of K clusters.
- Heuristic rule: cut the dendrogram tree at the maximum lifetime.

One simple rule of thumb sets the number to

$$k \approx \sqrt{n/2}$$

with n as the number of objects.

Elbow criterion

The elbow criterion is a common rule of thumb to determine what number of clusters should be chosen, for example for k-means and agglomerative hierarchical clustering. The elbow criterion says that you should choose a number of clusters so that adding another cluster doesn't add sufficient information. More precisely, if you graph the percentage of variance explained by the clusters against the number of clusters, the first clusters will add much information, but at some point the marginal gain will drop, giving an angle in the graph.
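The elbow criterion can be illustrated with a small, self-contained sketch. It runs a plain Lloyd-style k-means for k = 1..6 on synthetic data, records the fraction of variance explained, and picks the k after which the marginal gain drops the most. Several choices here are assumptions made for the illustration and do not come from the paper: the deterministic farthest-first seeding, the toy data of three well-separated blobs, and the "largest drop in gain" rule used to locate the angle.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))


def farthest_first(points, k):
    """Deterministic seeding: start at points[0], then repeatedly add the
    point farthest from the centres chosen so far (an assumed choice)."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    return centers


def kmeans_wss(points, k, iters=50):
    """Lloyd's k-means; returns the within-cluster sum of squares (WSS)."""
    centers = farthest_first(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda i: sq_dist(p, centers[i]))
            groups[j].append(p)
        # An empty cluster keeps its previous centre.
        centers = [centroid(g) if g else c for g, c in zip(groups, centers)]
    return sum(min(sq_dist(p, c) for c in centers) for p in points)


def explained_variance(points, max_k):
    """Fraction of total variance explained by k clusters, k = 1..max_k."""
    total = sum(sq_dist(p, centroid(points)) for p in points)
    return [1.0 - kmeans_wss(points, k) / total for k in range(1, max_k + 1)]


def elbow_k(curve):
    """Return the k at which the marginal gain drops the most."""
    gains = [curve[i] - curve[i - 1] for i in range(1, len(curve))]  # gains[0]: k=1 -> 2
    drops = [gains[i - 1] - gains[i] for i in range(1, len(gains))]  # drops[0]: at k=2
    return drops.index(max(drops)) + 2


# Toy data: three well-separated blobs of 30 points each.
rng = random.Random(42)
blob_centers = [(0.0, 0.0), (20.0, 0.0), (10.0, 18.0)]
points = [(cx + rng.gauss(0, 1), cy + rng.gauss(0, 1))
          for cx, cy in blob_centers for _ in range(30)]

curve = explained_variance(points, 6)
print([round(100 * v, 1) for v in curve])  # percentage of variance explained
print("elbow at k =", elbow_k(curve))
```

On data like this the curve rises steeply up to k = 3 and is almost flat afterwards. On real data the angle is often much less pronounced, so the automatic pick should be checked against the plot.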

Another set of methods for determining the number of clusters is based on information criteria, such as the Akaike information criterion (AIC), the Bayesian information criterion (BIC) and the Deviance information criterion (DIC).
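For a fitted model with maximized log-likelihood ln L, p free parameters and n observations, the first two criteria have the standard forms AIC = 2p - 2 ln L and BIC = p ln(n) - 2 ln L, and the model with the lowest value is preferred. The sketch below applies them to choose a number of clusters. The log-likelihood values and parameter counts are made up for illustration and do not come from a real fit. DIC is omitted because it requires samples from a Bayesian posterior.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion: 2p - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion: p ln(n) - 2 ln L (lower is better)."""
    return n_params * math.log(n_obs) - 2 * log_likelihood


# Hypothetical fits: number of clusters -> (log-likelihood, free parameters).
fits = {1: (-520.0, 2), 2: (-431.0, 5), 3: (-428.5, 8)}
n_obs = 200

best_aic = min(fits, key=lambda k: aic(*fits[k]))
best_bic = min(fits, key=lambda k: bic(*fits[k], n_obs))
print(best_aic, best_bic)  # 2 2
```

Here the third cluster improves the likelihood only slightly, so the penalty on the extra parameters makes both criteria prefer two clusters.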

IV. HOW ALGORITHMS ARE COMPARED

The above clustering algorithms are compared according to the following factors:

– the size of the dataset,
– the number of clusters,
– the type of dataset,
– the type of software.

Table 1 shows how the four algorithms are compared and records the conclusions.

Algorithm | Size of Dataset | Number of Clusters | Type of Dataset | Type of Software
--------- | --------------- | ------------------ | --------------- | ----------------
Partitional Algorithm | Huge Dataset & Small Dataset | Large number of clusters & Small number of clusters | Ideal Dataset & Random Dataset | LNKnet Package & Cluster and TreeView Package
Hierarchical Algorithm | Huge Dataset & Small Dataset | Large number of clusters & Small number of clusters | Ideal Dataset & Random Dataset | LNKnet Package & Cluster and TreeView Package
Grid-based Algorithm | Huge Dataset & Small Dataset | Large number of clusters & Small number of clusters | Ideal Dataset & Random Dataset | LNKnet Package & Cluster and TreeView Package
Model-based Algorithm | Huge Dataset & Small Dataset | Large number of clusters & Small number of clusters | Ideal Dataset & Random Dataset | LNKnet Package & Cluster and TreeView Package

V. POSSIBLE APPLICATIONS

Clustering algorithms can be applied in many fields, for instance:

‣ Marketing: finding groups of customers with similar behavior given a large database of customer data containing their properties and past buying records;

‣ Financial tasks: forecasting stock markets, currency exchange rates and bank bankruptcies, understanding and managing financial risk, trading futures, credit rating;

‣ Biology: classification of plants and animals given their features;

‣ Libraries: book ordering;

‣ Insurance: identifying groups of motor insurance policy holders with a high average claim cost; identifying frauds;

‣ City-planning: identifying groups of houses according to their house type, value and geographical location;

‣ Earthquake studies: clustering observed earthquake epicenters to identify dangerous zones;

‣ WWW: document classification; clustering web log data to discover groups of similar access patterns.

VI. CONCLUSION

Clustering is a descriptive technique. The solution is not unique and it strongly depends upon the analyst's choices. We described how it is possible to combine different results in order to obtain stable clusters that do not depend too much on the criteria selected to analyze the data. Clustering always provides groups, even if there is no group structure.

When applying a cluster analysis we are hypothesizing that the groups exist, but this assumption may be false or weak. Clustering results should not be generalized. Cases in the same cluster are similar only with respect to the information the cluster analysis was based on, i.e., the dimensions/variables inducing the dissimilarities.
