
clustering performance, such as BIRCH [Zhang et al. 1996], CURE [Guha et al. 1998], and ROCK [Guha et al. 2000].

5.2 Non-hierarchical Methods

The non-hierarchical clustering methods produce a single partition of the data set instead of a clustering structure, such as the dendrogram produced by a hierarchical technique. There are a number of such techniques, but two of the most prominent are K-means and K-medoid. The non-hierarchical clustering methods follow a similar sequence of steps:

(1) select an initial partition of the data with a fixed number of clusters and cluster centers;
(2) assign each XML data object in the data set to its closest cluster center and compute the new cluster centers; repeat this step until a predefined number of iterations is reached or no XML data object is reassigned to a new cluster; and
(3) merge and split clusters based on some heuristic information.

The K-means algorithm is popular because it is easy to implement and has a time complexity of O(NKd), where N is the number of data items, K is the number of clusters, and d is the number of iterations taken by the algorithm to converge. Since K and d are usually much smaller than N and are fixed in advance, the algorithm can be used to cluster large data sets. Several drawbacks of the K-means algorithm have been identified and well studied, such as its sensitivity to the selection of the initial partition and its sensitivity to outliers and noise. As a result, many variants have appeared that aim to overcome these drawbacks [Huang 1998; Ordonez and Omiecinski 2004].
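To make these steps concrete, the following is a minimal K-means sketch, not taken from any of the surveyed systems. It assumes each XML data object has already been reduced to a fixed-length numeric feature vector, for example a vector of element-tag frequencies; that preprocessing is outside the algorithm itself, and the merge/split heuristics of step (3) are omitted. The O(NKd) cost is visible in the loop structure: each of the d passes compares all N vectors against K centers.

import random

def dist2(a, b):
    # Squared Euclidean distance between two feature vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, k, max_iters=100):
    # Step (1): initial partition -- pick k objects as the initial cluster centers.
    centers = random.sample(vectors, k)
    assignment = [None] * len(vectors)
    for _ in range(max_iters):                      # at most d passes
        changed = False
        # Step (2a): assign each object to its closest cluster center (O(NK) per pass).
        for i, v in enumerate(vectors):
            closest = min(range(k), key=lambda c: dist2(v, centers[c]))
            if assignment[i] != closest:
                assignment[i] = closest
                changed = True
        if not changed:                             # no reassignment: converged
            break
        # Step (2b): recompute each center as the mean of its members.
        for c in range(k):
            members = [vectors[i] for i in range(len(vectors)) if assignment[i] == c]
            if members:
                centers[c] = [sum(xs) / len(members) for xs in zip(*members)]
    return assignment, centers

# Toy usage: six 2-D feature vectors clustered into k = 2 groups.
data = [[1.0, 2.0], [1.5, 1.8], [1.2, 2.1], [8.0, 8.0], [8.5, 7.9], [7.8, 8.2]]
labels, centers = kmeans(data, k=2)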

5.3 Other Clustering Methods

Despite their widespread use, the performance of both hierarchical and non-hierarchical clustering solutions decreases radically when they are used to cluster large-scale and/or a large number of XML data objects. This becomes more critical when dealing with large XML data because an XML data object is composed of many elements, and each element of one object must be compared individually with the elements of another object. Therefore, the need arises for another set of clustering algorithms. Among them are the incremental clustering algorithms [Can 1993; Charikar et al. 2004] and the constrained agglomerative algorithms [Zhao and Karypis 2002b].

Incremental clustering is based on the assumption that it is possible to consider XML data objects one at a time and assign them to existing clusters, using the following steps. (1) Assign the first XML data object to a cluster. (2) Consider the next XML data object and either assign it to one of the existing clusters or assign it to a new cluster; this assignment is based on the similarity between the object and the existing clusters. (3) Repeat step 2 until all the XML data objects are assigned to specific clusters. The incremental clustering algorithms are non-iterative, that is, they do not require computing the similarity between each pair of objects, so their time and space requirements are small. As a result, they can cluster large sets of XML data efficiently, but they may produce a lower-quality solution because the similarity between every pair of documents is never computed.
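The following is a minimal sketch of such a single-pass, leader-style incremental scheme; it is an illustrative assumption rather than a specific algorithm from the cited works. XML data objects are again assumed to be represented as feature vectors, and a similarity threshold (an arbitrary illustrative value) decides when to open a new cluster. Each object is compared only against one representative per existing cluster, which is why the full pairwise similarity matrix is never built.

def cosine(a, b):
    # Cosine similarity between two feature vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def incremental_cluster(vectors, similarity, threshold):
    clusters = []              # each cluster keeps a representative and its members
    for v in vectors:          # steps (1)-(2): consider one XML data object at a time
        best, best_sim = None, -1.0
        for c in clusters:     # compare only against existing cluster representatives
            s = similarity(v, c["rep"])
            if s > best_sim:
                best, best_sim = c, s
        if best is not None and best_sim >= threshold:
            best["members"].append(v)                      # join an existing cluster
        else:
            clusters.append({"rep": v, "members": [v]})    # open a new cluster
    return clusters            # step (3): stop once every object has been placed

# Toy usage: two natural groups emerge with an (illustrative) threshold of 0.9.
groups = incremental_cluster([[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]],
                             cosine, threshold=0.9)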

Constrained agglomerative clustering is a trade-off between hierarchical and non-hierarchical algorithms and exploits features from both techniques. The constrained agglomerative algorithm uses a partitional clustering algorithm to constrain the space over which agglomeration decisions are made, so that each XML data object is only allowed to merge with other objects placed in the same partitional cluster.
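As a rough illustration of this idea (an assumption-laden sketch, not the algorithm of Zhao and Karypis [2002b]), the code below takes a partition produced by any partitional method, for instance the K-means sketch shown earlier, and then performs agglomerative merging only within each partition; the single-link criterion and the merge threshold are illustrative choices.

def euclid(a, b):
    # Euclidean distance between two feature vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def constrained_agglomerative(vectors, partition_labels, distance, merge_threshold):
    # Group object indices by the cluster the partitional step assigned them to.
    partitions = {}
    for idx, label in enumerate(partition_labels):
        partitions.setdefault(label, []).append(idx)
    final_clusters = []
    for members in partitions.values():
        # Agglomerate only inside this partition: start from singletons and
        # repeatedly merge the closest pair until no pair is close enough.
        clusters = [[i] for i in members]
        while len(clusters) > 1:
            best_pair, best_d = None, float("inf")
            for a in range(len(clusters)):
                for b in range(a + 1, len(clusters)):
                    d = min(distance(vectors[i], vectors[j])     # single-link distance
                            for i in clusters[a] for j in clusters[b])
                    if d < best_d:
                        best_pair, best_d = (a, b), d
            if best_d > merge_threshold:
                break
            a, b = best_pair
            clusters[a].extend(clusters[b])
            del clusters[b]
        final_clusters.extend(clusters)
    return final_clusters

# Toy usage: the partition labels would normally come from the partitional step.
data = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [5.1, 5.0]]
labels = [0, 0, 0, 1, 1]
print(constrained_agglomerative(data, labels, euclid, merge_threshold=1.0))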
