PDF (1MB) - QUT ePrints
PDF (1MB) - QUT ePrints
PDF (1MB) - QUT ePrints
Create successful ePaper yourself
Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.
32 · Alsayed Algergawy et al.<br />
in SG. Experimental results using both synthetic and real-world data sets show that S-<br />
GRACE is effective in identifying clusters, however it does not scale well. The system<br />
needs 4500 seconds to clusters a set of XML documents with size of 200KB.<br />
6.1.7 SumXClust. The Clustering XML documents exploiting structural summaries<br />
(SumXClust) 7 method [Dalamagasa et al. 2006] represents a clustering algorithm for<br />
grouping structurally similar XML documents using the tree structural summaries to improve<br />
the performance of the edit distance calculation. To extract structural summaries,<br />
SumXClust performs both nesting reduction and repetition reduction.<br />
—Data Representation. XML documents are represented as ordered data trees. SumX-<br />
Clust proposes the usage of compact trees, called tree structural summaries that have<br />
minimal processing requirements instead of the original trees representing XML documents.<br />
It performs nesting reduction and repetition reduction to extract structural summaries<br />
for data trees. For nesting reduction, the system traverses the data tree using<br />
pre-order traversal to detect nodes which have an ancestor with the same label in order<br />
to move up their subtrees. For repetition reduction, it traverses the data tree using<br />
pre-order traversal ignoring already existing paths and keeping new ones using a hash<br />
table.<br />
—Similarity Computation. SumXClust proposes a structural distance S between two XML<br />
documents represented as structural summaries. The structural distance is defined as<br />
S(DT 1 , DT 2 ) = δ(DT 1, DT 2 )<br />
δ ′ (DT 1 , DT 2 )<br />
where δ ′ (DT 1 , DT 2 ) is the cost to delete all nodes from DT 1 and insert all nodes from<br />
DT 2 . To determine edit distances between structural summaries δ(DT 1 , DT 2 ), the system<br />
uses a dynamic programming algorithm which is close to [Chawathe 1999].<br />
—Clustering/Grouping. SumXClust implements a hierarchical clustering algorithm using<br />
Prim’s algorithm for computing the minimum spanning tree. It forms a fully connected<br />
graph with n nodes from n structural summaries. The weight of an edge corresponds to<br />
the structural distance between the nodes that this edge connects.<br />
—Evaluation Criteria. The performance as well as the quality of the clustering results is<br />
tested using synthetic and real data. To evaluate the quality of the clustering results, two<br />
external criteria, precision and recall, have been used, while the response time is used<br />
to evaluate the efficiency of the clustering algorithm. SumXClust is also compared with<br />
the Chawathe algorithm [Chawathe 1999]. Experimental results indicate that with or<br />
without summaries, the SumXClust algorithm shows excellent clustering quality, and<br />
improved performance compared to Chawathe’s.<br />
6.1.8 VectXClust. Clustering XML documents based on the vector representation for<br />
documents (VectXClust) 8 [Yang et al. 2005] proposes the transformation of tree-structured<br />
data into an approximate numerical multi-dimensional vector, which encodes the original<br />
structure information.<br />
—Data Representation. XML documents are modeled as ordered data trees, which are then<br />
transformed into full binary trees. A full binary tree is a binary tree in which each node<br />
7 We give the tool this name for easier reference<br />
8 We give the tool this name for easier reference<br />
ACM Computing Surveys, Vol. , No. , 2009.