30.12.2014 Views

PDF (1MB) - QUT ePrints

PDF (1MB) - QUT ePrints

PDF (1MB) - QUT ePrints

SHOW MORE
SHOW LESS

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

22 · Alsayed Algergawy et al.<br />

Experimental evaluation reported in [Tran et al. 2008; Yang et al. 2009] shows that combining<br />

the structure and content features to compute the similarity between XML documents<br />

outperforms the other two methods, especially when documents belong to different structural<br />

definitions.<br />

5. CLUSTERING/GROUPING<br />

XML data that are similar in structures and semantics are grouped together to form a cluster<br />

using a suitable clustering algorithm [Jain et al. 1999; Berkhin 2002; Xu and Wunsch<br />

2005]. Clustering methods are generally divided into two broad categories: hierarchical<br />

and non-hierarchical (partitional) methods. The non-hierarchical methods group a data set<br />

into a number of clusters using a pairwise distance matrix that records the similarity between<br />

each pair of documents in the data set, while the hierarchical methods produce nested<br />

sets of data (hierarchies), in which pairs of elements or clusters are successively linked until<br />

every element in the data set becomes connected. Many other clustering techniques<br />

have been developed considering the more frequent requirement of tackling large-scale<br />

and high-dimensional data sets in many recent applications.<br />

5.1 Hierarchical Methods<br />

Hierarchical clustering builds a hierarchy of clusters, known as a dendrogram. The root<br />

node of the dendrogram represents the whole data set and each leaf node represents a<br />

data item. This representation gives a well informative description and visualization of<br />

the data clustering structure. Strategies for hierarchical clustering generally fall into two<br />

types: agglomerative and division. The divisive clustering algorithm builds the hierarchical<br />

solution from top toward the bottom by using a repeated cluster bisectioning approach.<br />

Given a set of N XML data to be clustered, and an N × N similarity matrix, the basic<br />

process of the agglomerative hierarchical clustering algorithm is:<br />

(1) treat each XML data of the data set to be clustered as a cluster, so that if you have N<br />

items, you now have N clusters, each containing just one XML data.<br />

(2) find the most similar pair of clusters and merge them into a single cluster.<br />

(3) compute similarities between the new cluster and each of the old clusters.<br />

(4) repeat steps 2 and 3 until all items are clustered into a single cluster of size N.<br />

Based on different methods used to compute the similarity between a pair of clusters (Step<br />

3), there are many agglomerative clustering algorithms. Among them are the single-link<br />

and complete-link algorithms [Jain et al. 1999; Manning et al. 2008]. In the single-link<br />

method, the similarity between one cluster and another cluster is equal to the maximum<br />

similarity from any member of one cluster to any member of the other cluster. In the<br />

complete-link method, the similarity between one cluster and another cluster has to be<br />

equal to the minimum similarity from any member of one cluster to any member of the<br />

other cluster. Even if the single-link algorithm has a better time complexity (O(N 2 )) than<br />

the time complexity (O(N 2 log N)) of the complete-link algorithm, it has been observed<br />

that the complete-link algorithm produces more useful hierarchies in many applications<br />

than the single-link algorithm. Further, the high cost for most of the hierarchical algorithms<br />

limit their application in large-scale data sets. With the requirement for handling largescale<br />

data sets, several new hierarchical algorithms have been proposed to improve the<br />

ACM Computing Surveys, Vol. , No. , 2009.

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!