PDF (1MB) - QUT ePrints
PDF (1MB) - QUT ePrints
PDF (1MB) - QUT ePrints
Create successful ePaper yourself
Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.
22 · Alsayed Algergawy et al.<br />
Experimental evaluation reported in [Tran et al. 2008; Yang et al. 2009] shows that combining<br />
the structure and content features to compute the similarity between XML documents<br />
outperforms the other two methods, especially when documents belong to different structural<br />
definitions.<br />
5. CLUSTERING/GROUPING<br />
XML data that are similar in structures and semantics are grouped together to form a cluster<br />
using a suitable clustering algorithm [Jain et al. 1999; Berkhin 2002; Xu and Wunsch<br />
2005]. Clustering methods are generally divided into two broad categories: hierarchical<br />
and non-hierarchical (partitional) methods. The non-hierarchical methods group a data set<br />
into a number of clusters using a pairwise distance matrix that records the similarity between<br />
each pair of documents in the data set, while the hierarchical methods produce nested<br />
sets of data (hierarchies), in which pairs of elements or clusters are successively linked until<br />
every element in the data set becomes connected. Many other clustering techniques<br />
have been developed considering the more frequent requirement of tackling large-scale<br />
and high-dimensional data sets in many recent applications.<br />
5.1 Hierarchical Methods<br />
Hierarchical clustering builds a hierarchy of clusters, known as a dendrogram. The root<br />
node of the dendrogram represents the whole data set and each leaf node represents a<br />
data item. This representation gives a well informative description and visualization of<br />
the data clustering structure. Strategies for hierarchical clustering generally fall into two<br />
types: agglomerative and division. The divisive clustering algorithm builds the hierarchical<br />
solution from top toward the bottom by using a repeated cluster bisectioning approach.<br />
Given a set of N XML data to be clustered, and an N × N similarity matrix, the basic<br />
process of the agglomerative hierarchical clustering algorithm is:<br />
(1) treat each XML data of the data set to be clustered as a cluster, so that if you have N<br />
items, you now have N clusters, each containing just one XML data.<br />
(2) find the most similar pair of clusters and merge them into a single cluster.<br />
(3) compute similarities between the new cluster and each of the old clusters.<br />
(4) repeat steps 2 and 3 until all items are clustered into a single cluster of size N.<br />
Based on different methods used to compute the similarity between a pair of clusters (Step<br />
3), there are many agglomerative clustering algorithms. Among them are the single-link<br />
and complete-link algorithms [Jain et al. 1999; Manning et al. 2008]. In the single-link<br />
method, the similarity between one cluster and another cluster is equal to the maximum<br />
similarity from any member of one cluster to any member of the other cluster. In the<br />
complete-link method, the similarity between one cluster and another cluster has to be<br />
equal to the minimum similarity from any member of one cluster to any member of the<br />
other cluster. Even if the single-link algorithm has a better time complexity (O(N 2 )) than<br />
the time complexity (O(N 2 log N)) of the complete-link algorithm, it has been observed<br />
that the complete-link algorithm produces more useful hierarchies in many applications<br />
than the single-link algorithm. Further, the high cost for most of the hierarchical algorithms<br />
limit their application in large-scale data sets. With the requirement for handling largescale<br />
data sets, several new hierarchical algorithms have been proposed to improve the<br />
ACM Computing Surveys, Vol. , No. , 2009.