PDF (1MB) - QUT ePrints
PDF (1MB) - QUT ePrints
PDF (1MB) - QUT ePrints
You also want an ePaper? Increase the reach of your titles
YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.
XML Data Clustering: An Overview · 23<br />
clustering performance, such as BIRCH [Zhang et al. 1996], CURE [Guha et al. 1998],<br />
and ROCK [Guha et al. 2000].<br />
5.2 Non-hierarchical Methods<br />
The non-hierarchical clustering methods produce a single partition of the data set instead<br />
of a clustering structure, such as the dendrogram produced by a hierarchical technique.<br />
There are a number of such techniques, but two of the most prominent are k-means and<br />
K-medoid. The non-hierarchical clustering methods also have similar steps as follows:<br />
(1) select an initial partition of the data with a fixed number of clusters and cluster centers;<br />
(2) assign each XML data in the data set to its closest cluster center and compute the<br />
new cluster centers. Repeat this step until a predefined number of iterations or no<br />
reassignment of XML data to new clusters; and<br />
(3) merge and split clusters based on some heuristic information.<br />
The K-means algorithm is popular because it is easy to implement, and it has a time<br />
complexity of O(NKd), where N is the number of data items, K is the number of clusters,<br />
and d is the number of iterations taken by the algorithm to converge. Since K and d are<br />
usually much less than N and they are fixed in advance, the algorithm can be used to<br />
cluster large data sets. Several drawbacks of the K-means algorithm have been identified<br />
and well studied, such as the sensitivity to the selection of the initial partition, and the<br />
sensitivity to outliers and noise. As a result, many variants have appeared to defeat these<br />
drawbacks [Huang 1998; Ordonez and Omiecinski 2004].<br />
5.3 Other Clustering Methods<br />
Despite the widespread use, the performance of both hierarchal and non-hierarchal clustering<br />
solutions decreases radically when they are used to cluster a large scale and/or a<br />
large number of XML data. This becomes more crucial when dealing with large XML<br />
data as an XML data object is composed of many elements and each element of the object<br />
individually needs to be compared with elements of another object. Therefore the need for<br />
developing another set of clustering algorithms arises. Among them are the incremental<br />
clustering algorithms [Can 1993; Charikar et al. 2004] and the constrained agglomerative<br />
algorithms [Zhao and Karypis 2002b].<br />
Incremental clustering is based on the assumption that it is possible to consider XML<br />
data one at a time and assign them to existing clusters with the following steps. (1) Assign<br />
the first XML data to a cluster. (2) Consider the next XML data object, either assign this<br />
object to one of the existing clusters or assign it to a new cluster. This assignment is based<br />
on the similarity between the object and the existing clusters. (3) Repeat step 2 until all the<br />
XML data are assigned to specific clusters. The incremental clustering algorithms are noniterative,<br />
i.e., they do not require to compute the similarity between each pair of objects.<br />
So, their time and space requirements are small. As a result, they allow to cluster large<br />
sets of XML data efficiently, but they may produce a lower quality solution due to the<br />
avoidance of computing the similarity between every pair of documents.<br />
Constrained agglomerative clustering is a trade-off between hierarchical and non- hierarchical<br />
algorithms. It exploits features from both techniques. The constrained agglomerative<br />
algorithm is based on using a partitional clustering algorithm to constrain the space<br />
over which agglomeration decisions are made, so that each XML data is only allowed to<br />
ACM Computing Surveys, Vol. , No. , 2009.