PDF (1MB) - QUT ePrints


XML Data Clustering: An Overview

—Clustering/Grouping. The path similarity matrix is represented as a weighted graph, named the path similarity graph, in which nodes represent the SP clusters and weighted edges connect them. The weight of each edge is the path similarity between the two nodes that the edge connects. A greedy algorithm is used to partition the path similarity graph.
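The text does not say which greedy strategy PSim applies. As one plausible reading, the following sketch visits edges in decreasing order of weight and merges the two endpoint groups whenever the merged group still fits within a size limit. The function name `greedy_partition`, the `capacity` parameter, and the sample weights are assumptions made for illustration; they are not taken from the PSim paper.

```python
def greedy_partition(nodes, weighted_edges, capacity):
    """Greedily merge groups along the heaviest edges, subject to a size cap.

    nodes:          iterable of node names (here, hypothetical SP clusters)
    weighted_edges: iterable of (u, v, weight) triples
    capacity:       assumed maximum number of nodes per partition
    """
    group_of = {n: frozenset([n]) for n in nodes}
    # Heaviest (most similar) edges are considered first.
    for u, v, _w in sorted(weighted_edges, key=lambda e: e[2], reverse=True):
        merged = group_of[u] | group_of[v]
        if group_of[u] != group_of[v] and len(merged) <= capacity:
            for n in merged:
                group_of[n] = merged
    # Return the distinct groups in a deterministic order.
    return sorted(sorted(g) for g in set(group_of.values()))


nodes = ["sp1", "sp2", "sp3", "sp4"]
edges = [("sp1", "sp2", 0.9), ("sp2", "sp3", 0.4),
         ("sp3", "sp4", 0.8), ("sp1", "sp4", 0.1)]
print(greedy_partition(nodes, edges, capacity=2))
# [['sp1', 'sp2'], ['sp3', 'sp4']]
```

The size cap stands in for whatever storage constraint (for example, a page size) bounds a partition in practice; without it, this strategy would merge every connected component into a single group.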

—Evaluation Criteria. To validate its performance, the PSim system is evaluated with 1000 randomly generated path queries over an XML document with 200,000 nodes. The total number of page I/Os required for query processing is used as the performance criterion. Experimental results show that the PSim clustering method performs well in query processing; however, the method assumes that document updates are infrequent. Further study of the clustering method is needed for the case where updates are frequent.

6.1.4 XProj. XProj [Aggarwal et al. 2007] introduces a clustering algorithm for XML documents that uses substructures of the documents in order to gain insight into the important underlying structures.

—Data Representation. XML documents are modeled as ordered data trees. XProj does not distinguish between attributes and elements of an XML document, since both are mapped to the label set. For efficient processing, the tree is traversed in pre-order, depth-first fashion, and each node is represented by the path from the root node to itself. The content within the nodes is ignored; only the structural information is used.
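The representation described above can be sketched with Python's standard `xml.etree.ElementTree` module. The function name `node_paths` and the sample document are illustrative assumptions; in particular, listing attributes in sorted order ahead of child elements is a choice made here for determinism, not something the source specifies.

```python
import xml.etree.ElementTree as ET


def node_paths(xml_text):
    """Return one root-to-node label path per node, in pre-order."""
    root = ET.fromstring(xml_text)
    paths = []

    def visit(elem, prefix):
        path = prefix + (elem.tag,)
        paths.append(path)
        # Attributes are treated as ordinary labels, like child elements.
        for name in sorted(elem.attrib):
            paths.append(path + (name,))
        # Text content is deliberately ignored: only structure is kept.
        for child in elem:
            visit(child, path)

    visit(root, ())
    return paths


doc = '<book id="1"><title>XML</title><author><name>A</name></author></book>'
for p in node_paths(doc):
    print("/".join(p))
# book
# book/id
# book/title
# book/author
# book/author/name
```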

—Similarity Computation. XProj uses frequent substructures to define similarity among documents. This is analogous to the concept of projected clustering in multi-dimensional data, hence the name XProj. In this projected clustering algorithm, instead of using individual XML documents as representatives for partitions, XProj uses a set of substructures of the documents. The similarity of a document to a set of structures in a collection is the fraction of nodes in the document that are covered by any structure in the collection. This similarity can be generalized to the similarity between a set of documents and a set of structures by averaging the structural similarity over the different documents. This provides the basis for computing the frequent sub-structural self-similarity of a set of XML documents. To reduce the cost of similarity computation, XProj selects the top-K most frequent substructures of size l.
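The coverage-based similarity can be illustrated with a small sketch under a strong simplifying assumption: each structure is a plain downward label path, and a node counts as covered when it lies on a parent-to-child chain whose labels equal that path. XProj itself mines more general frequent substructures, so the matching rule, the dictionary-based tree encoding, and all function names below are hypothetical.

```python
def covered_nodes(doc, structure):
    """Ids of nodes lying on a downward chain whose labels equal `structure`.

    doc maps node id -> (label, parent id or None).
    """
    covered = set()
    for node_id in doc:
        chain, current = [], node_id
        # Walk upward, collecting as many nodes as the structure has labels.
        while current is not None and len(chain) < len(structure):
            chain.append(current)
            current = doc[current][1]
        labels = tuple(doc[n][0] for n in reversed(chain))
        if labels == tuple(structure):
            covered.update(chain)
    return covered


def doc_similarity(doc, structures):
    """Fraction of the document's nodes covered by any structure."""
    covered = set()
    for s in structures:
        covered |= covered_nodes(doc, s)
    return len(covered) / len(doc)


def collection_similarity(docs, structures):
    """Average of the per-document similarities."""
    return sum(doc_similarity(d, structures) for d in docs) / len(docs)


book = {0: ("book", None), 1: ("title", 0), 2: ("author", 0), 3: ("name", 2)}
price_list = {0: ("book", None), 1: ("price", 0)}
structures = [("book", "title")]

print(doc_similarity(book, structures))                       # 0.5
print(collection_similarity([book, price_list], structures))  # 0.25
```

In `book`, the structure `("book", "title")` covers nodes 0 and 1, that is, two of the four nodes; it covers nothing in `price_list`, so the average over both documents is 0.25.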

—Clustering/Grouping. The primary approach is a sub-structural modification of a partition-based approach, in which the clusters of XML documents are built around groups of representative substructures. Initially, the set of XML documents is divided into K partitions of equal size, and the sets of substructure representatives are generated by mining frequent substructures from these partitions. These structure representatives are then used to cluster the XML documents using the self-similarity measure.
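The partition-based loop can be sketched as follows. This is a deliberately simplified stand-in, not the XProj implementation: documents are lists of root-to-node label paths, "substructures" are contiguous label runs of a fixed size, mining is a plain document-frequency count, and similarity is the fraction of node paths containing a representative. The names `windows`, `mine_representatives`, `similarity`, and `cluster`, together with the parameters `size`, `top_k`, and `rounds`, are assumptions for illustration.

```python
from collections import Counter


def windows(path, size):
    """All contiguous label runs of length `size` in one root-to-node path."""
    return {path[i:i + size] for i in range(len(path) - size + 1)}


def mine_representatives(docs, size, top_k):
    """Top-k label runs by document frequency (stand-in for substructure mining)."""
    counts = Counter()
    for doc in docs:
        counts.update({w for path in doc for w in windows(path, size)})
    # Sort by descending frequency, then by label for determinism.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {w for w, _ in ranked[:top_k]}


def similarity(doc, reps, size):
    """Fraction of the document's node paths that contain a representative."""
    return sum(1 for path in doc if windows(path, size) & reps) / len(doc)


def cluster(docs, k, size=2, top_k=2, rounds=10):
    # Initial assignment: k partitions of (nearly) equal size.
    assign = [i % k for i in range(len(docs))]
    for _ in range(rounds):
        reps = [mine_representatives(
                    [d for d, a in zip(docs, assign) if a == c], size, top_k)
                for c in range(k)]
        # Reassign each document to its most similar representative set.
        new_assign = [max(range(k), key=lambda c: similarity(d, reps[c], size))
                      for d in docs]
        if new_assign == assign:
            break
        assign = new_assign
    return assign


book = [("book",), ("book", "title"), ("book", "author")]
book_p = book + [("book", "price")]
cd = [("cd",), ("cd", "artist"), ("cd", "track")]
cd_l = cd + [("cd", "label")]

print(cluster([book, book_p, cd, cd, book, cd_l], k=2))
# [0, 0, 1, 1, 0, 1]
```

In this toy run the initial equal-size split mixes books and CDs, but the label runs that are frequent within each partition already favor one kind of document, so a single reassignment separates the two groups. A loop of this kind depends on its initialization and offers no guarantee of reaching a good clustering in general.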

—Evaluation Criteria. Two external criteria, precision and recall, are used to evaluate the quality of the XProj clustering method. XProj is also compared with the Chawathe algorithm [Chawathe 1999] and the structural algorithm [Dalamagas et al. 2006]. The comparison results indicate that both XProj and the structural algorithm work well and achieve higher precision and recall than Chawathe’s tree edit distance-based algorithm when dealing with heterogeneous XML documents. In the case of homogeneous XML documents, XProj outperforms the other two systems.
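The text does not spell out how precision and recall are computed for a clustering. One common convention, assumed here, compares a single cluster against the ground-truth class it is meant to capture; the function name and the document identifiers are illustrative only.

```python
def precision_recall(cluster_members, class_members):
    """Precision and recall of one cluster with respect to one true class."""
    cluster_members, class_members = set(cluster_members), set(class_members)
    true_positives = len(cluster_members & class_members)
    precision = true_positives / len(cluster_members)  # share of cluster that is correct
    recall = true_positives / len(class_members)       # share of class that was found
    return precision, recall


found = {"d1", "d2", "d3", "d4"}        # documents placed in the cluster
truth = {"d1", "d2", "d3", "d5", "d6"}  # documents that truly belong to the class
print(precision_recall(found, truth))   # (0.75, 0.6)
```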

ACM Computing Surveys, Vol. , No. , 2009.
