30.12.2014 Views

PDF (1MB) - QUT ePrints

PDF (1MB) - QUT ePrints

PDF (1MB) - QUT ePrints

SHOW MORE
SHOW LESS

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

32 · Alsayed Algergawy et al.<br />

in SG. Experimental results using both synthetic and real-world data sets show that S-<br />

GRACE is effective in identifying clusters, however it does not scale well. The system<br />

needs 4500 seconds to clusters a set of XML documents with size of 200KB.<br />

6.1.7 SumXClust. The Clustering XML documents exploiting structural summaries<br />

(SumXClust) 7 method [Dalamagasa et al. 2006] represents a clustering algorithm for<br />

grouping structurally similar XML documents using the tree structural summaries to improve<br />

the performance of the edit distance calculation. To extract structural summaries,<br />

SumXClust performs both nesting reduction and repetition reduction.<br />

—Data Representation. XML documents are represented as ordered data trees. SumX-<br />

Clust proposes the usage of compact trees, called tree structural summaries that have<br />

minimal processing requirements instead of the original trees representing XML documents.<br />

It performs nesting reduction and repetition reduction to extract structural summaries<br />

for data trees. For nesting reduction, the system traverses the data tree using<br />

pre-order traversal to detect nodes which have an ancestor with the same label in order<br />

to move up their subtrees. For repetition reduction, it traverses the data tree using<br />

pre-order traversal ignoring already existing paths and keeping new ones using a hash<br />

table.<br />

—Similarity Computation. SumXClust proposes a structural distance S between two XML<br />

documents represented as structural summaries. The structural distance is defined as<br />

S(DT 1 , DT 2 ) = δ(DT 1, DT 2 )<br />

δ ′ (DT 1 , DT 2 )<br />

where δ ′ (DT 1 , DT 2 ) is the cost to delete all nodes from DT 1 and insert all nodes from<br />

DT 2 . To determine edit distances between structural summaries δ(DT 1 , DT 2 ), the system<br />

uses a dynamic programming algorithm which is close to [Chawathe 1999].<br />

—Clustering/Grouping. SumXClust implements a hierarchical clustering algorithm using<br />

Prim’s algorithm for computing the minimum spanning tree. It forms a fully connected<br />

graph with n nodes from n structural summaries. The weight of an edge corresponds to<br />

the structural distance between the nodes that this edge connects.<br />

—Evaluation Criteria. The performance as well as the quality of the clustering results is<br />

tested using synthetic and real data. To evaluate the quality of the clustering results, two<br />

external criteria, precision and recall, have been used, while the response time is used<br />

to evaluate the efficiency of the clustering algorithm. SumXClust is also compared with<br />

the Chawathe algorithm [Chawathe 1999]. Experimental results indicate that with or<br />

without summaries, the SumXClust algorithm shows excellent clustering quality, and<br />

improved performance compared to Chawathe’s.<br />

6.1.8 VectXClust. Clustering XML documents based on the vector representation for<br />

documents (VectXClust) 8 [Yang et al. 2005] proposes the transformation of tree-structured<br />

data into an approximate numerical multi-dimensional vector, which encodes the original<br />

structure information.<br />

—Data Representation. XML documents are modeled as ordered data trees, which are then<br />

transformed into full binary trees. A full binary tree is a binary tree in which each node<br />

7 We give the tool this name for easier reference<br />

8 We give the tool this name for easier reference<br />

ACM Computing Surveys, Vol. , No. , 2009.

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!