an overview on clustering methods - arXiv
an overview on clustering methods - arXiv
an overview on clustering methods - arXiv
Create successful ePaper yourself
Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.
IOSR Journal of Engineering<br />
Apr. 2012, Vol. 2(4) pp: 719-725<br />
AN OVERVIEW ON CLUSTERING METHODS<br />
T. S<strong>on</strong>i Madhulatha<br />
Associate Professor, Alluri Institute of M<str<strong>on</strong>g>an</str<strong>on</strong>g>agement Sciences, War<str<strong>on</strong>g>an</str<strong>on</strong>g>gal.<br />
ABSTRACT<br />
Clustering is a comm<strong>on</strong> technique for statistical data <str<strong>on</strong>g>an</str<strong>on</strong>g>alysis, which is used in m<str<strong>on</strong>g>an</str<strong>on</strong>g>y fields, including machine learning,<br />
data mining, pattern recogniti<strong>on</strong>, image <str<strong>on</strong>g>an</str<strong>on</strong>g>alysis <str<strong>on</strong>g>an</str<strong>on</strong>g>d bioinformatics. Clustering is the process of grouping similar objects<br />
into different groups, or more precisely, the partiti<strong>on</strong>ing of a data set into subsets, so that the data in each subset<br />
according to some defined dist<str<strong>on</strong>g>an</str<strong>on</strong>g>ce measure. This paper covers about <strong>clustering</strong> algorithms, benefits <str<strong>on</strong>g>an</str<strong>on</strong>g>d its applicati<strong>on</strong>s.<br />
Paper c<strong>on</strong>cludes by discussing some limitati<strong>on</strong>s.<br />
Keywords: Clustering, hierarchical algorithm, partiti<strong>on</strong>al algorithm, dist<str<strong>on</strong>g>an</str<strong>on</strong>g>ce measure,<br />
I. INTRODUCTION<br />
Clustering c<str<strong>on</strong>g>an</str<strong>on</strong>g> be c<strong>on</strong>sidered the most import<str<strong>on</strong>g>an</str<strong>on</strong>g>t<br />
unsupervised learning problem; so, as every other problem<br />
of this kind, it deals with finding a structure in a collecti<strong>on</strong><br />
of unlabeled data. A cluster is therefore a collecti<strong>on</strong> of<br />
objects which are “similar” between them <str<strong>on</strong>g>an</str<strong>on</strong>g>d are<br />
“dissimilar” to the objects bel<strong>on</strong>ging to other clusters.<br />
Besides the term data <strong>clustering</strong> as syn<strong>on</strong>yms like cluster<br />
<str<strong>on</strong>g>an</str<strong>on</strong>g>alysis, automatic classificati<strong>on</strong>, numerical tax<strong>on</strong>omy,<br />
botrology <str<strong>on</strong>g>an</str<strong>on</strong>g>d typological <str<strong>on</strong>g>an</str<strong>on</strong>g>alysis.<br />
II. TYPES OF CLUSTERING.<br />
Data <strong>clustering</strong> algorithms c<str<strong>on</strong>g>an</str<strong>on</strong>g> be hierarchical or partiti<strong>on</strong>al.<br />
Hierarchical algorithms find successive clusters using<br />
previously established clusters, whereas partiti<strong>on</strong>al<br />
algorithms determine all clusters at time. Hierarchical<br />
algorithms c<str<strong>on</strong>g>an</str<strong>on</strong>g> be agglomerative (bottom-up) or divisive<br />
(top-down). Agglomerative algorithms begin with each<br />
element as a separate cluster <str<strong>on</strong>g>an</str<strong>on</strong>g>d merge them in<br />
successively larger clusters. Divisive algorithms begin with<br />
the whole set <str<strong>on</strong>g>an</str<strong>on</strong>g>d proceed to divide it into successively<br />
smaller clusters.<br />
HIERARCHICAL CLUSTERING<br />
A key step in a hierarchical <strong>clustering</strong> is to select a dist<str<strong>on</strong>g>an</str<strong>on</strong>g>ce<br />
measure. A simple measure is m<str<strong>on</strong>g>an</str<strong>on</strong>g>hatt<str<strong>on</strong>g>an</str<strong>on</strong>g> dist<str<strong>on</strong>g>an</str<strong>on</strong>g>ce, equal to<br />
the sum of absolute dist<str<strong>on</strong>g>an</str<strong>on</strong>g>ces for each variable. The name<br />
comes from the fact that in a two-variable case, the<br />
variables c<str<strong>on</strong>g>an</str<strong>on</strong>g> be plotted <strong>on</strong> a grid that c<str<strong>on</strong>g>an</str<strong>on</strong>g> be compared to<br />
city streets, <str<strong>on</strong>g>an</str<strong>on</strong>g>d the dist<str<strong>on</strong>g>an</str<strong>on</strong>g>ce between two points is the<br />
number of blocks a pers<strong>on</strong> would walk.<br />
A more comm<strong>on</strong> measure is Euclide<str<strong>on</strong>g>an</str<strong>on</strong>g> dist<str<strong>on</strong>g>an</str<strong>on</strong>g>ce, computed<br />
by finding the square of the dist<str<strong>on</strong>g>an</str<strong>on</strong>g>ce between each variable,<br />
summing the squares, <str<strong>on</strong>g>an</str<strong>on</strong>g>d finding the square root of that<br />
sum. In the two-variable case, the dist<str<strong>on</strong>g>an</str<strong>on</strong>g>ce is <str<strong>on</strong>g>an</str<strong>on</strong>g>alogous to<br />
finding the length of the hypotenuse in a tri<str<strong>on</strong>g>an</str<strong>on</strong>g>gle; that is, it<br />
is the dist<str<strong>on</strong>g>an</str<strong>on</strong>g>ce "as the crow flies." A review of cluster<br />
<str<strong>on</strong>g>an</str<strong>on</strong>g>alysis in health psychology research found that the most<br />
comm<strong>on</strong> dist<str<strong>on</strong>g>an</str<strong>on</strong>g>ce measure in published studies in that<br />
research area is the Euclide<str<strong>on</strong>g>an</str<strong>on</strong>g> dist<str<strong>on</strong>g>an</str<strong>on</strong>g>ce or the squared<br />
Euclide<str<strong>on</strong>g>an</str<strong>on</strong>g> dist<str<strong>on</strong>g>an</str<strong>on</strong>g>ce.<br />
The M<str<strong>on</strong>g>an</str<strong>on</strong>g>hatt<str<strong>on</strong>g>an</str<strong>on</strong>g> dist<str<strong>on</strong>g>an</str<strong>on</strong>g>ce functi<strong>on</strong> computes the<br />
dist<str<strong>on</strong>g>an</str<strong>on</strong>g>ce that would be traveled to get from <strong>on</strong>e data point to<br />
the other if a grid-like path is followed. The M<str<strong>on</strong>g>an</str<strong>on</strong>g>hatt<str<strong>on</strong>g>an</str<strong>on</strong>g><br />
dist<str<strong>on</strong>g>an</str<strong>on</strong>g>ce between two items is the sum of the differences of<br />
their corresp<strong>on</strong>ding comp<strong>on</strong>ents. The formula for this<br />
dist<str<strong>on</strong>g>an</str<strong>on</strong>g>ce between a point X=(X1, X2, etc.) <str<strong>on</strong>g>an</str<strong>on</strong>g>d a point<br />
Y=(Y1, Y2, etc.) is:<br />
d <br />
n<br />
<br />
i1<br />
X i<br />
Y i<br />
Where n is the number of variables, <str<strong>on</strong>g>an</str<strong>on</strong>g>d Xi <str<strong>on</strong>g>an</str<strong>on</strong>g>d Yi are the<br />
values of the ith variable, at points X <str<strong>on</strong>g>an</str<strong>on</strong>g>d Y respectively.<br />
The Euclide<str<strong>on</strong>g>an</str<strong>on</strong>g> dist<str<strong>on</strong>g>an</str<strong>on</strong>g>ce functi<strong>on</strong> measures the „asthe-crow-flies<br />
dist<str<strong>on</strong>g>an</str<strong>on</strong>g>ce. The formula for this dist<str<strong>on</strong>g>an</str<strong>on</strong>g>ce<br />
between a point X (X1, X2, etc.) <str<strong>on</strong>g>an</str<strong>on</strong>g>d a point Y (Y1, Y2, etc.)<br />
is:<br />
d <br />
n<br />
<br />
j1<br />
( x j<br />
y j<br />
)<br />
2<br />
Deriving the Euclide<str<strong>on</strong>g>an</str<strong>on</strong>g> dist<str<strong>on</strong>g>an</str<strong>on</strong>g>ce between two data points<br />
involves computing the square root of the sum of the<br />
squares of the differences between corresp<strong>on</strong>ding values.<br />
The following figure illustrates the difference between<br />
M<str<strong>on</strong>g>an</str<strong>on</strong>g>hatt<str<strong>on</strong>g>an</str<strong>on</strong>g> dist<str<strong>on</strong>g>an</str<strong>on</strong>g>ce <str<strong>on</strong>g>an</str<strong>on</strong>g>d Euclide<str<strong>on</strong>g>an</str<strong>on</strong>g> dist<str<strong>on</strong>g>an</str<strong>on</strong>g>ce:<br />
ISSN: 2250-3021 www.iosrjen.org 719 | P a g e
IOSR Journal of Engineering<br />
Apr. 2012, Vol. 2(4) pp: 719-725<br />
1<br />
card ( A)<br />
card ( B)<br />
<br />
xA<br />
yB<br />
d(<br />
x,<br />
y)<br />
M<str<strong>on</strong>g>an</str<strong>on</strong>g>hatt<str<strong>on</strong>g>an</str<strong>on</strong>g> dist<str<strong>on</strong>g>an</str<strong>on</strong>g>ce<br />
Euclide<str<strong>on</strong>g>an</str<strong>on</strong>g> dist<str<strong>on</strong>g>an</str<strong>on</strong>g>ce<br />
This method builds the hierarchy from the individual<br />
elements by progressively merging clusters. Again, we have<br />
six elements {a} {b} {c} {d} {e} <str<strong>on</strong>g>an</str<strong>on</strong>g>d {f}. The first step is<br />
to determine which elements to merge in a cluster. Usually,<br />
we w<str<strong>on</strong>g>an</str<strong>on</strong>g>t to take the two closest elements, therefore we must<br />
define a dist<str<strong>on</strong>g>an</str<strong>on</strong>g>ce between elements. One c<str<strong>on</strong>g>an</str<strong>on</strong>g> also c<strong>on</strong>struct<br />
a dist<str<strong>on</strong>g>an</str<strong>on</strong>g>ce matrix at this stage.<br />
The sum of all intra-cluster vari<str<strong>on</strong>g>an</str<strong>on</strong>g>ce<br />
The increase in vari<str<strong>on</strong>g>an</str<strong>on</strong>g>ce for the cluster being merged<br />
The probability that c<str<strong>on</strong>g>an</str<strong>on</strong>g>didate clusters spawn from the<br />
same distributi<strong>on</strong> functi<strong>on</strong>.<br />
Each agglomerati<strong>on</strong> occurs at a greater dist<str<strong>on</strong>g>an</str<strong>on</strong>g>ce between<br />
clusters th<str<strong>on</strong>g>an</str<strong>on</strong>g> the previous agglomerati<strong>on</strong>, <str<strong>on</strong>g>an</str<strong>on</strong>g>d <strong>on</strong>e c<str<strong>on</strong>g>an</str<strong>on</strong>g><br />
decide to stop <strong>clustering</strong> either when the clusters are too far<br />
apart to be merged or when there is a sufficiently small<br />
number of clusters.<br />
Agglomerative hierarchical <strong>clustering</strong><br />
For example, suppose these data are to be <str<strong>on</strong>g>an</str<strong>on</strong>g>alyzed, where<br />
pixel euclide<str<strong>on</strong>g>an</str<strong>on</strong>g> dist<str<strong>on</strong>g>an</str<strong>on</strong>g>ce is the dist<str<strong>on</strong>g>an</str<strong>on</strong>g>ce metric.<br />
Usually the dist<str<strong>on</strong>g>an</str<strong>on</strong>g>ce between two clusters <str<strong>on</strong>g>an</str<strong>on</strong>g>d is <strong>on</strong>e of the<br />
following:<br />
The maximum dist<str<strong>on</strong>g>an</str<strong>on</strong>g>ce between elements of each cluster<br />
is also called complete linkage <strong>clustering</strong>.<br />
max<br />
<br />
d(x,<br />
y): x<br />
A,y B<br />
The minimum dist<str<strong>on</strong>g>an</str<strong>on</strong>g>ce between elements of each cluster<br />
is also called single linkage <strong>clustering</strong>.<br />
<br />
min d(x, y): x A,yB<br />
The me<str<strong>on</strong>g>an</str<strong>on</strong>g> dist<str<strong>on</strong>g>an</str<strong>on</strong>g>ce between elements of each cluster is<br />
also called average linkage <strong>clustering</strong>.<br />
<br />
<br />
Divisive <strong>clustering</strong><br />
So far we have <strong>on</strong>ly looked at agglomerative <strong>clustering</strong>, but<br />
a cluster hierarchy c<str<strong>on</strong>g>an</str<strong>on</strong>g> also be generated top-down. This<br />
vari<str<strong>on</strong>g>an</str<strong>on</strong>g>t of hierarchical <strong>clustering</strong> is called top-down<br />
<strong>clustering</strong> or divisive <strong>clustering</strong>. We start at the top with all<br />
documents in <strong>on</strong>e cluster. The cluster is split using a flat<br />
<strong>clustering</strong> algorithm. This procedure is applied recursively<br />
until each document is in its own singlet<strong>on</strong> cluster.<br />
Top-down <strong>clustering</strong> is c<strong>on</strong>ceptually more complex th<str<strong>on</strong>g>an</str<strong>on</strong>g><br />
bottom-up <strong>clustering</strong> since we need a sec<strong>on</strong>d, flat <strong>clustering</strong><br />
algorithm as a ``subroutine''. It has the adv<str<strong>on</strong>g>an</str<strong>on</strong>g>tage of being<br />
more efficient if we do not generate a complete hierarchy<br />
all the way down to individual document leaves. For a fixed<br />
number of top levels, using <str<strong>on</strong>g>an</str<strong>on</strong>g> efficient flat algorithm like<br />
K-me<str<strong>on</strong>g>an</str<strong>on</strong>g>s, top-down algorithms are linear in the number of<br />
documents <str<strong>on</strong>g>an</str<strong>on</strong>g>d clusters<br />
ISSN: 2250-3021 www.iosrjen.org 720 | P a g e
IOSR Journal of Engineering<br />
Apr. 2012, Vol. 2(4) pp: 719-725<br />
Hierarchal method suffers from the fact that <strong>on</strong>ce the<br />
merge/split is d<strong>on</strong>e, it c<str<strong>on</strong>g>an</str<strong>on</strong>g> never be und<strong>on</strong>e. This rigidity is<br />
useful in that is useful in that it leads to smaller<br />
computati<strong>on</strong> costs by not worrying about a combinatorial<br />
number of different choices.<br />
However there are two approaches to improve the quality of<br />
hierarchical <strong>clustering</strong><br />
Perform careful <str<strong>on</strong>g>an</str<strong>on</strong>g>alysis of object linkages at each<br />
hierarchical partiti<strong>on</strong>ing such as CURE <str<strong>on</strong>g>an</str<strong>on</strong>g>d Chamele<strong>on</strong>.<br />
Integrate hierarchical agglomerati<strong>on</strong> <str<strong>on</strong>g>an</str<strong>on</strong>g>d then redefine the<br />
result using iterative relocati<strong>on</strong> as in BRICH<br />
PARTITIONAL CLUSTERING:<br />
Partiti<strong>on</strong>ing algorithms are based <strong>on</strong> specifying <str<strong>on</strong>g>an</str<strong>on</strong>g> initial<br />
number of groups, <str<strong>on</strong>g>an</str<strong>on</strong>g>d iteratively reallocating objects<br />
am<strong>on</strong>g groups to c<strong>on</strong>vergence. This algorithm typically<br />
determines all clusters at <strong>on</strong>ce. Most applicati<strong>on</strong>s adopt <strong>on</strong>e<br />
of two popular heuristic <strong>methods</strong> like<br />
k-me<str<strong>on</strong>g>an</str<strong>on</strong>g> algorithm<br />
k-medoids algorithm<br />
K-me<str<strong>on</strong>g>an</str<strong>on</strong>g>s algorithm<br />
The K-me<str<strong>on</strong>g>an</str<strong>on</strong>g>s algorithm assigns each point to the cluster<br />
whose center also called centroid is nearest. The center is<br />
the average of all the points in the cluster that is, its<br />
coordinates are the arithmetic me<str<strong>on</strong>g>an</str<strong>on</strong>g> for each dimensi<strong>on</strong><br />
separately over all the points in the cluster.<br />
The pseudo code of the k-me<str<strong>on</strong>g>an</str<strong>on</strong>g>s algorithm is to explain<br />
how it works:<br />
A. Choose K as the number of clusters.<br />
B. Initialize the codebook vectors of the K clusters<br />
(r<str<strong>on</strong>g>an</str<strong>on</strong>g>domly, for inst<str<strong>on</strong>g>an</str<strong>on</strong>g>ce)<br />
C. For every new sample vector:<br />
a. Compute the dist<str<strong>on</strong>g>an</str<strong>on</strong>g>ce between the new vector <str<strong>on</strong>g>an</str<strong>on</strong>g>d<br />
every cluster's codebook vector.<br />
b. Re-compute the closest codebook vector with the<br />
new vector, using a learning rate that decreases in time.<br />
The reas<strong>on</strong> behind choosing the k-me<str<strong>on</strong>g>an</str<strong>on</strong>g>s algorithm to study<br />
is its popularity for the following reas<strong>on</strong>s:<br />
• Its time complexity is O (nkl), where n is the number<br />
of patterns, k is the number of clusters, <str<strong>on</strong>g>an</str<strong>on</strong>g>d l is the<br />
number of iterati<strong>on</strong>s taken by the algorithm to<br />
c<strong>on</strong>verge.<br />
• Its space complexity is O (k+n). It requires<br />
additi<strong>on</strong>al space to store the data matrix.<br />
• It is order-independent; for a given initial seed set of<br />
cluster centers, it generates the same partiti<strong>on</strong> of the data<br />
irrespective of the order in which the patterns are presented<br />
to the algorithm.<br />
K-medoids algorithm:<br />
The basic strategy of k-medoids algorithm is each cluster is<br />
represented by <strong>on</strong>e of the objects located near the center of<br />
the cluster. PAM (Partiti<strong>on</strong>ing around Medoids) was <strong>on</strong>e of<br />
the first k-medoids algorithm is introduced.<br />
The pseudo code of the k-medoids algorithm is to explain<br />
how it works:<br />
Arbitrarily choose k objects as the initial medoids Repeat<br />
Assign each remaining object to the cluster with the nearest<br />
medoids R<str<strong>on</strong>g>an</str<strong>on</strong>g>domly select a n<strong>on</strong>-medoid object O r<str<strong>on</strong>g>an</str<strong>on</strong>g>dom<br />
Compute the total cost, S, of swapping O j with O r<str<strong>on</strong>g>an</str<strong>on</strong>g>dom<br />
If S
IOSR Journal of Engineering<br />
Apr. 2012, Vol. 2(4) pp: 719-725<br />
The <strong>clustering</strong> process is based <strong>on</strong> the classificati<strong>on</strong> of the<br />
points in the dataset as core points, border points <str<strong>on</strong>g>an</str<strong>on</strong>g>d noise<br />
points, <str<strong>on</strong>g>an</str<strong>on</strong>g>d <strong>on</strong> the use of density relati<strong>on</strong>s between points to<br />
form the clusters.<br />
The pseudo code of the DBSCAN algorithm is to explain<br />
how it works:<br />
To clusters a dataset, our DBSCAN implementati<strong>on</strong> starts<br />
by identifying the k nearest neighbours of each point <str<strong>on</strong>g>an</str<strong>on</strong>g>d<br />
identify the farthest k nearest neighbour. The average of all<br />
this dist<str<strong>on</strong>g>an</str<strong>on</strong>g>ce is then calculated.<br />
For each point of the dataset the algorithm identifies the<br />
directly density-reachable points using the Eps threshold<br />
provided by the user <str<strong>on</strong>g>an</str<strong>on</strong>g>d classifies the points into core or<br />
border points.<br />
It then loop trough all points of the dataset <str<strong>on</strong>g>an</str<strong>on</strong>g>d for the core<br />
points it starts to c<strong>on</strong>struct a new cluster with the support of<br />
the GetDRPoints() procedure that follows the definiti<strong>on</strong> of<br />
density reachable points.<br />
In this phase the value used as Eps threshold is the<br />
average dist<str<strong>on</strong>g>an</str<strong>on</strong>g>ce calculated previously. At the end, the<br />
compositi<strong>on</strong> of the clusters is verified in order to check if<br />
there exist clusters that c<str<strong>on</strong>g>an</str<strong>on</strong>g> be merged together. This c<str<strong>on</strong>g>an</str<strong>on</strong>g><br />
append if two points of different clusters are at a dist<str<strong>on</strong>g>an</str<strong>on</strong>g>ce<br />
less th<str<strong>on</strong>g>an</str<strong>on</strong>g> Eps.<br />
Note: DBSCAN does not deal very well with clusters of<br />
different densities.<br />
SNN ALGORITHM<br />
The SNN algorithm, as DBSCAN, is a density-based<br />
<strong>clustering</strong> algorithm. The main difference between this<br />
algorithm <str<strong>on</strong>g>an</str<strong>on</strong>g>d DBSCAN is that it defines the similarity<br />
between points by looking at the number of nearest<br />
neighbours that two points share. Using this similarity<br />
measure in the SNN algorithm, the density is defined as the<br />
sum of the similarities of the nearest neighbours of a point.<br />
Points with high density become core points, while points<br />
with low density represent noise points. All remainder<br />
points that are str<strong>on</strong>gly similar to a specific core points will<br />
represent a new clusters.<br />
The SNN algorithm needs three inputs parameters:<br />
- K, the neighbours‟ list size;<br />
- Eps, the threshold density;<br />
- MinPts, the threshold that define the core points.<br />
The pseudo code of the SSN algorithm is to explain how<br />
it works:<br />
Define the input parameters.<br />
Find the K nearest neighbours of each point of the dataset.<br />
Then the similarity between pairs of points is calculated in<br />
terms of how m<str<strong>on</strong>g>an</str<strong>on</strong>g>y nearest neighbours the two points share.<br />
Using this similarity measure, the density of each point c<str<strong>on</strong>g>an</str<strong>on</strong>g><br />
be calculated as being the numbers of neighbours with<br />
which the number of shared neighbours is equal or greater<br />
th<str<strong>on</strong>g>an</str<strong>on</strong>g> Eps.<br />
The points are classified as being core points, if the density<br />
of the point is equal or greater th<str<strong>on</strong>g>an</str<strong>on</strong>g> MinPts. At this point,<br />
the algorithm has all the informati<strong>on</strong> needed to start to build<br />
the clusters. Those start to be c<strong>on</strong>structed around the core<br />
points.<br />
However, these clusters do not c<strong>on</strong>tain all points. They<br />
c<strong>on</strong>tain <strong>on</strong>ly points that come from regi<strong>on</strong>s of relatively<br />
uniform density. The points that are not classified into <str<strong>on</strong>g>an</str<strong>on</strong>g>y<br />
cluster are classified as noise points.<br />
GRID-BASED CLUSTERING<br />
The grid based <strong>clustering</strong> approach uses a multiresoluti<strong>on</strong><br />
grid data structure. It qu<str<strong>on</strong>g>an</str<strong>on</strong>g>tizes the space into a finite<br />
number of cells that form a grid structure <strong>on</strong> which all the<br />
operati<strong>on</strong>s for <strong>clustering</strong> are performed. Grid approach<br />
includes STING (STatistical INformati<strong>on</strong> Grid) approach<br />
<str<strong>on</strong>g>an</str<strong>on</strong>g>d CLIQUE<br />
Basic Grid-based Algorithm<br />
1. Define a set of grid-cells<br />
2. Assign objects to the appropriate grid cell <str<strong>on</strong>g>an</str<strong>on</strong>g>d compute<br />
the density of each cell.<br />
3. Eliminate cells, whose density is below a certain<br />
threshold t.<br />
4. Form clusters from c<strong>on</strong>tiguous (adjacent) groups of dense<br />
cells.<br />
The pseudo code of the STING algorithm is to explain how<br />
it works:<br />
The spatial area is divided into rect<str<strong>on</strong>g>an</str<strong>on</strong>g>gular cells<br />
There are several levels of cells corresp<strong>on</strong>ding to different<br />
levels of resoluti<strong>on</strong><br />
Each cell is partiti<strong>on</strong>ed into a number of smaller cells in the<br />
next level. Statistical info of each cell is calculated <str<strong>on</strong>g>an</str<strong>on</strong>g>d<br />
stored beforeh<str<strong>on</strong>g>an</str<strong>on</strong>g>d <str<strong>on</strong>g>an</str<strong>on</strong>g>d is used to <str<strong>on</strong>g>an</str<strong>on</strong>g>swer queries<br />
Parameters of higher level cells c<str<strong>on</strong>g>an</str<strong>on</strong>g> be easily calculated<br />
from parameters of lower level cell count, me<str<strong>on</strong>g>an</str<strong>on</strong>g>, s, min,<br />
max type of distributi<strong>on</strong>—normal, uniform, etc.<br />
Use a top-down approach to <str<strong>on</strong>g>an</str<strong>on</strong>g>swer spatial data queries<br />
Start from a pre-selected layer—typically with a small<br />
number of cells from the pre-selected layer until you reach<br />
the bottom layer do the following:<br />
For each cell in the current level compute the c<strong>on</strong>fidence<br />
interval indicating a cell‟s relev<str<strong>on</strong>g>an</str<strong>on</strong>g>ce to a given query;<br />
1. If it is relev<str<strong>on</strong>g>an</str<strong>on</strong>g>t, include the cell in a cluster<br />
ISSN: 2250-3021 www.iosrjen.org 722 | P a g e
IOSR Journal of Engineering<br />
Apr. 2012, Vol. 2(4) pp: 719-725<br />
2. If it irrelev<str<strong>on</strong>g>an</str<strong>on</strong>g>t, remove cell from further c<strong>on</strong>siderati<strong>on</strong><br />
3. otherwise, look for relev<str<strong>on</strong>g>an</str<strong>on</strong>g>t cells at the next lower layer<br />
1. Combine relev<str<strong>on</strong>g>an</str<strong>on</strong>g>t cells into relev<str<strong>on</strong>g>an</str<strong>on</strong>g>t regi<strong>on</strong>s (based <strong>on</strong><br />
grid-neighborhood) <str<strong>on</strong>g>an</str<strong>on</strong>g>d return the so obtained clusters<br />
as your <str<strong>on</strong>g>an</str<strong>on</strong>g>swers.<br />
Adv<str<strong>on</strong>g>an</str<strong>on</strong>g>tages:<br />
Query-independent, easy to parallelize,<br />
incremental update O(K), where K is the number of grid<br />
cells at the lowest level<br />
Disadv<str<strong>on</strong>g>an</str<strong>on</strong>g>tages:<br />
All the cluster boundaries are either horiz<strong>on</strong>tal or<br />
vertical, <str<strong>on</strong>g>an</str<strong>on</strong>g>d no diag<strong>on</strong>al boundary is detected<br />
MODEL-BASED CLUSTERING<br />
Model-Based Clustering <strong>methods</strong> attempt to optimize the fit<br />
between the given data <str<strong>on</strong>g>an</str<strong>on</strong>g>d some mathematical model.<br />
Such <strong>methods</strong> often based <strong>on</strong> the assumpti<strong>on</strong> that the data<br />
are generated by mixture of underlying probability<br />
distributi<strong>on</strong>s. Model-Based Clustering <strong>methods</strong> follow two<br />
major approaches: Statistical Approach or Neural network<br />
approach<br />
1. Clustering is also performed by having several units<br />
competing for the current object<br />
2. The unit whose weight vector is closest to the current<br />
object wins<br />
3. The winner <str<strong>on</strong>g>an</str<strong>on</strong>g>d its neighbors learn by having their<br />
weights adjusted<br />
4. SOMs are believed to resemble processing that c<str<strong>on</strong>g>an</str<strong>on</strong>g> occur<br />
in the brain<br />
5. Useful for visualizing high-dimensi<strong>on</strong>al data in 2- or 3-D<br />
space<br />
In model-based <strong>clustering</strong>, the data x are viewed as coming<br />
P from a mixture density<br />
f ( x)<br />
<br />
G<br />
<br />
k 1<br />
T f ( x)<br />
k<br />
k<br />
( x ; ,<br />
i<br />
k<br />
<br />
1<br />
T<br />
exp <br />
( xi<br />
k<br />
)<br />
<br />
2<br />
k<br />
)<br />
det(2<br />
<br />
k<br />
<br />
1<br />
k<br />
( x<br />
)<br />
i<br />
<br />
k<br />
) <br />
<br />
For univariate data, the covari<str<strong>on</strong>g>an</str<strong>on</strong>g>ce matrix reduces to a<br />
scalar vari<str<strong>on</strong>g>an</str<strong>on</strong>g>ce. The likelihood for data c<strong>on</strong>sisting of n<br />
observati<strong>on</strong>s assuming a Gaussi<str<strong>on</strong>g>an</str<strong>on</strong>g> mixture model with G<br />
multivariate mixture comp<strong>on</strong>ents is<br />
n<br />
G<br />
<br />
i1 k 1<br />
T k<br />
( x i<br />
; k<br />
,<br />
<br />
k<br />
).<br />
MCLUST is probably the most well known model-based<br />
This is all about various <strong>clustering</strong> algorithms.<br />
III. HOW TO DETERMINE THE NUMBER OF<br />
CLUSTERS<br />
M<str<strong>on</strong>g>an</str<strong>on</strong>g>y <strong>clustering</strong> algorithms require the specificati<strong>on</strong> of the<br />
number of clusters to produce in the input data set, prior to<br />
executi<strong>on</strong> of the algorithm. Barring knowledge of the<br />
proper value beforeh<str<strong>on</strong>g>an</str<strong>on</strong>g>d, the appropriate value must be<br />
determined, a problem <strong>on</strong> its own for which a number of<br />
techniques have been developed.<br />
– If the number of clusters known, terminati<strong>on</strong> c<strong>on</strong>diti<strong>on</strong><br />
is given!<br />
– In general, set a dist<str<strong>on</strong>g>an</str<strong>on</strong>g>ce threshold value (terminati<strong>on</strong><br />
c<strong>on</strong>diti<strong>on</strong>)<br />
– The K-cluster lifetime as the r<str<strong>on</strong>g>an</str<strong>on</strong>g>ge of threshold values<br />
<strong>on</strong> the dendrogram tree that leads to the identificati<strong>on</strong><br />
of K clusters<br />
– Heuristic rule: cut a dendrogram tree with maximum<br />
life time<br />
One simple rule of thumb sets the number to<br />
k n with n as the number of objects .<br />
2<br />
where fk is the probability density functi<strong>on</strong> of the<br />
observati<strong>on</strong>s in group k, <str<strong>on</strong>g>an</str<strong>on</strong>g>d T k is the probability that <str<strong>on</strong>g>an</str<strong>on</strong>g><br />
observati<strong>on</strong> comes from the kth mixture comp<strong>on</strong>ent<br />
Each comp<strong>on</strong>ent is usually modeled by the normal or<br />
Gaussi<str<strong>on</strong>g>an</str<strong>on</strong>g> distributi<strong>on</strong>. Comp<strong>on</strong>ent distributi<strong>on</strong>s are<br />
characterized by the me<str<strong>on</strong>g>an</str<strong>on</strong>g> μk <str<strong>on</strong>g>an</str<strong>on</strong>g>d the covari<str<strong>on</strong>g>an</str<strong>on</strong>g>ce matrix ∑k,<br />
<str<strong>on</strong>g>an</str<strong>on</strong>g>d have the probability density functi<strong>on</strong><br />
Elbow criteri<strong>on</strong><br />
The elbow criteri<strong>on</strong> is a comm<strong>on</strong> rule of thumb to<br />
determine what number of clusters should be chosen, for<br />
example for k-me<str<strong>on</strong>g>an</str<strong>on</strong>g>s <str<strong>on</strong>g>an</str<strong>on</strong>g>d agglomerative hierarchical<br />
<strong>clustering</strong>. The elbow criteri<strong>on</strong> says that you should choose<br />
a number of clusters so that adding <str<strong>on</strong>g>an</str<strong>on</strong>g>other cluster doesn't<br />
add sufficient informati<strong>on</strong>. More precisely, if you graph the<br />
percentage of vari<str<strong>on</strong>g>an</str<strong>on</strong>g>ce explained by the clusters against the<br />
number of clusters, the first clusters will add much<br />
informati<strong>on</strong>, but at some point the marginal gain will drop,<br />
giving <str<strong>on</strong>g>an</str<strong>on</strong>g> <str<strong>on</strong>g>an</str<strong>on</strong>g>gle in the graph.<br />
ISSN: 2250-3021 www.iosrjen.org 723 | P a g e
IOSR Journal of Engineering<br />
Apr. 2012, Vol. 2(4) pp: 719-725<br />
Another set of <strong>methods</strong> for determining the number of<br />
clusters are informati<strong>on</strong> criteria, such as :<br />
The Akaike informati<strong>on</strong> criteri<strong>on</strong> (AIC), The Bayesi<str<strong>on</strong>g>an</str<strong>on</strong>g><br />
informati<strong>on</strong> criteri<strong>on</strong> (BIC), The Devi<str<strong>on</strong>g>an</str<strong>on</strong>g>ce informati<strong>on</strong><br />
criteri<strong>on</strong> (DIC).<br />
IV. HOW ALGORITHMS ARE COMPARED<br />
The above <strong>clustering</strong> algorithms are compared according to<br />
the following factors:<br />
The size of the dataset,<br />
Number of the clusters,<br />
Type of dataset,<br />
Type of software<br />
Table 1 explains how the four algorithms are<br />
compared <str<strong>on</strong>g>an</str<strong>on</strong>g>d the c<strong>on</strong>clusi<strong>on</strong>s are written down.<br />
Parti<br />
ti<strong>on</strong>a<br />
l<br />
Alg<br />
orit<br />
hm<br />
Hie<br />
rarc<br />
hica<br />
l<br />
Alg<br />
orit<br />
hm<br />
Grid<br />
base<br />
d<br />
Alg<br />
orit<br />
hm<br />
Mo<br />
del-<br />
Size of<br />
Dataset<br />
Huge<br />
Dataset<br />
&<br />
Small<br />
Dataset<br />
Huge<br />
Dataset<br />
&<br />
Small<br />
Dataset<br />
Huge<br />
Dataset<br />
&<br />
Small<br />
Dataset<br />
Huge<br />
Dataset<br />
Number<br />
of<br />
Clusters<br />
Large<br />
number<br />
of<br />
cluster<br />
s<br />
&<br />
Small<br />
number<br />
of<br />
clusters<br />
Large<br />
number<br />
of<br />
cluster<br />
s<br />
&<br />
Small<br />
number<br />
of<br />
Clusters<br />
Large<br />
number<br />
of<br />
cluster<br />
s<br />
&<br />
Small<br />
number<br />
of<br />
clusters<br />
Large<br />
number<br />
Type<br />
of<br />
Dataset<br />
Ideal<br />
Dataset<br />
&<br />
R<str<strong>on</strong>g>an</str<strong>on</strong>g>dom<br />
Dataset<br />
Ideal<br />
Dataset<br />
&<br />
R<str<strong>on</strong>g>an</str<strong>on</strong>g>dom<br />
Dataset<br />
Ideal<br />
Dataset<br />
&<br />
R<str<strong>on</strong>g>an</str<strong>on</strong>g>dom<br />
Dataset<br />
Ideal<br />
Dataset<br />
Type of<br />
Software<br />
LNKnet<br />
Package<br />
&<br />
Cluster <str<strong>on</strong>g>an</str<strong>on</strong>g>d<br />
TreeView<br />
Package<br />
LNKnet<br />
Package<br />
&<br />
Cluster <str<strong>on</strong>g>an</str<strong>on</strong>g>d<br />
TreeView<br />
Package<br />
LNKnet<br />
Package<br />
&<br />
Cluster <str<strong>on</strong>g>an</str<strong>on</strong>g>d<br />
TreeView<br />
Package<br />
LNKnet<br />
Package<br />
bas<br />
ed<br />
Alg<br />
orit<br />
hm<br />
&<br />
Small<br />
Dataset<br />
of<br />
cluster<br />
s<br />
&<br />
Small<br />
number<br />
of<br />
clusters<br />
&<br />
R<str<strong>on</strong>g>an</str<strong>on</strong>g>dom<br />
Dataset<br />
&<br />
Cluster <str<strong>on</strong>g>an</str<strong>on</strong>g>d<br />
TreeView<br />
Package<br />
V. POSSIBLE APPLICATIONS<br />
Clustering algorithms c<str<strong>on</strong>g>an</str<strong>on</strong>g> be applied in m<str<strong>on</strong>g>an</str<strong>on</strong>g>y fields, for<br />
inst<str<strong>on</strong>g>an</str<strong>on</strong>g>ce:<br />
‣ Marketing: finding groups of customers with similar<br />
behavior given a large database of customer data<br />
c<strong>on</strong>taining their properties <str<strong>on</strong>g>an</str<strong>on</strong>g>d past buying records;<br />
‣ Fin<str<strong>on</strong>g>an</str<strong>on</strong>g>cial task: Forecasting stock market, currency<br />
exch<str<strong>on</strong>g>an</str<strong>on</strong>g>ge rate, b<str<strong>on</strong>g>an</str<strong>on</strong>g>k b<str<strong>on</strong>g>an</str<strong>on</strong>g>kruptcies, un-derst<str<strong>on</strong>g>an</str<strong>on</strong>g>ding <str<strong>on</strong>g>an</str<strong>on</strong>g>d<br />
m<str<strong>on</strong>g>an</str<strong>on</strong>g>aging fin<str<strong>on</strong>g>an</str<strong>on</strong>g>cial risk, trading futures, credit rating,<br />
‣ Biology: classificati<strong>on</strong> of pl<str<strong>on</strong>g>an</str<strong>on</strong>g>ts <str<strong>on</strong>g>an</str<strong>on</strong>g>d <str<strong>on</strong>g>an</str<strong>on</strong>g>imals given their<br />
features;<br />
‣ Libraries: book ordering;<br />
‣ Insur<str<strong>on</strong>g>an</str<strong>on</strong>g>ce: identifying groups of motor insur<str<strong>on</strong>g>an</str<strong>on</strong>g>ce policy<br />
holders with a high average claim cost; identifying<br />
frauds;<br />
‣ City-pl<str<strong>on</strong>g>an</str<strong>on</strong>g>ning: identifying groups of houses according to<br />
their house type, value <str<strong>on</strong>g>an</str<strong>on</strong>g>d geographical locati<strong>on</strong>;<br />
‣ Earthquake studies: <strong>clustering</strong> observed earthquake<br />
epicenters to identify d<str<strong>on</strong>g>an</str<strong>on</strong>g>gerous z<strong>on</strong>es;<br />
‣ WWW: document classificati<strong>on</strong>; <strong>clustering</strong> web log data<br />
to discover groups of similar access patterns<br />
VI. CONCLUSION<br />
Clustering is a descriptive technique. The soluti<strong>on</strong> is not<br />
unique <str<strong>on</strong>g>an</str<strong>on</strong>g>d it str<strong>on</strong>gly depends up<strong>on</strong> the <str<strong>on</strong>g>an</str<strong>on</strong>g>alyst‟s choices.<br />
We described how it is possible to combine different results<br />
in order to obtain stable clusters, not depending too much<br />
<strong>on</strong> the criteria selected to <str<strong>on</strong>g>an</str<strong>on</strong>g>alyze data. Clustering always<br />
provides groups, even if there is no group structure.<br />
When applying a cluster <str<strong>on</strong>g>an</str<strong>on</strong>g>alysis we are hypothesizing<br />
that the groups exist. But this assumpti<strong>on</strong> may be false or<br />
weak. Clustering results‟ should not be generalized. Cases<br />
in the same cluster are similar <strong>on</strong>ly with respect to the<br />
informati<strong>on</strong> cluster <str<strong>on</strong>g>an</str<strong>on</strong>g>alysis was based <strong>on</strong> i.e.,<br />
dimensi<strong>on</strong>s/variables inducing the dissimilarities.<br />
REFERENCES<br />
1. H<str<strong>on</strong>g>an</str<strong>on</strong>g>, J. <str<strong>on</strong>g>an</str<strong>on</strong>g>d Kamber, M. Data Mining: C<strong>on</strong>cepts <str<strong>on</strong>g>an</str<strong>on</strong>g>d<br />
Techniques, 2001 (Academic Press, S<str<strong>on</strong>g>an</str<strong>on</strong>g> Diego,<br />
California, USA).<br />
2. Comparisi<strong>on</strong> between <strong>clustering</strong> algorithms- Osama<br />
Abu Abbas.<br />
3. Pham, D.T. <str<strong>on</strong>g>an</str<strong>on</strong>g>d Afify, A.A. Clustering techniques <str<strong>on</strong>g>an</str<strong>on</strong>g>d<br />
their applicati<strong>on</strong>s in engineering. Submitted to<br />
Proceedings of the Instituti<strong>on</strong> of Mech<str<strong>on</strong>g>an</str<strong>on</strong>g>ical Engineers,<br />
ISSN: 2250-3021 www.iosrjen.org 724 | P a g e
IOSR Journal of Engineering<br />
Apr. 2012, Vol. 2(4) pp: 719-725<br />
Part C: Journal of Mech<str<strong>on</strong>g>an</str<strong>on</strong>g>ical Engineering Science,<br />
2006.<br />
4. Jain, A.K. <str<strong>on</strong>g>an</str<strong>on</strong>g>d Dubes, R.C. Algorithms for Clustering<br />
Data, 1988 (Prentice Hall, Englewood Cliffs, New<br />
Jersey, USA).<br />
5. Bottou, L. <str<strong>on</strong>g>an</str<strong>on</strong>g>d Bengio, Y. C<strong>on</strong>vergence properties of<br />
the k-me<str<strong>on</strong>g>an</str<strong>on</strong>g>s algorithm.<br />
6. Adv<str<strong>on</strong>g>an</str<strong>on</strong>g>ces in Neural Informati<strong>on</strong> Processing Systems,<br />
1995, 7, 585-592.<br />
7. Grabmeier, J. <str<strong>on</strong>g>an</str<strong>on</strong>g>d Rudolph, A. Techniques of cluster<br />
algorithms in data mining. Data Mining <str<strong>on</strong>g>an</str<strong>on</strong>g>d<br />
Knowledge Discovery, 2002, 6, 303-360.<br />
8. Data Clustering. A Review: A.K. Jain Michig<str<strong>on</strong>g>an</str<strong>on</strong>g> State<br />
University <str<strong>on</strong>g>an</str<strong>on</strong>g>d M.N. Murty Indi<str<strong>on</strong>g>an</str<strong>on</strong>g> Institute of Science<br />
<str<strong>on</strong>g>an</str<strong>on</strong>g>d P.J. Flynn The Ohio State University.<br />
9. R C T Lee Cluster Analysis <str<strong>on</strong>g>an</str<strong>on</strong>g>d Its Applicati<strong>on</strong>s In J.T.<br />
Tou, editor, Adv<str<strong>on</strong>g>an</str<strong>on</strong>g>ces in Informati<strong>on</strong> Systems Science.<br />
Plenum Press. New York.<br />
10. Model-based Methods of Classificati<strong>on</strong>: Using the<br />
mclust Software in Chemo metrics Chris Fraley<br />
University of Washingt<strong>on</strong> Adri<str<strong>on</strong>g>an</str<strong>on</strong>g> E. Raftery University<br />
of Washingt<strong>on</strong>.<br />
ISSN: 2250-3021 www.iosrjen.org 725 | P a g e