PDF (1MB) - QUT ePrints
PDF (1MB) - QUT ePrints
PDF (1MB) - QUT ePrints
You also want an ePaper? Increase the reach of your titles
YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.
XML Data Clustering: An Overview · 29<br />
—Clustering/Grouping. The path similarity matrix is represented as a weighted graph,<br />
named the path similarity graph, where nodes represent the set of SP clusters and edges<br />
connect them with weights. The weight of each edge is the path similarity between two<br />
nodes which are connected by that edge. The greedy algorithm is used to partition the<br />
path similarity graph.<br />
—Evaluation Criteria. To validate the performance of the PSim system, it is evaluated with<br />
1000 randomly generated path queries over an XML document with 200,000 nodes. The<br />
total sum of page I/Os required for query processing is used as the performance criterion.<br />
Experimental results show that the PSim clustering method has good performance<br />
in query processing, however, the proposed method assumes that document updates are<br />
infrequent. Further studies on the clustering method are needed when updates are frequent.<br />
6.1.4 XProj. XProj [Aggarwal et al. 2007] introduces a clustering algorithm for XML<br />
documents that uses substructures of the documents in order to gain insight into the important<br />
underlying structures.<br />
—Data Representation. XML documents are modeled as ordered data trees. XProj does<br />
not distinguish between attributes and elements of an XML document, since both are<br />
mapped to the label set. For efficient processing, the pre-order depth-first traversal of<br />
the tree structure is used, where each node is represented by the path from the root node<br />
to itself. The content within the nodes is ignored and only the structural information is<br />
being used.<br />
—Similarity Computation. XProj makes use of frequent substructures in order to define<br />
similarity among documents. This is analogous to the concept of projected clustering in<br />
multi-dimensional data, hence the name XProj. In the projected clustering algorithm, instead<br />
of using individual XML documents as representatives for partitions, XProj uses a<br />
set of substructures of the documents. The similarity of a document to a set of structures<br />
in a collection is the fraction of nodes in the document that are covered by any structure<br />
in the collection. This similarity can be generalized to similarity between a set of documents<br />
and a set of structures by averaging the structural similarity over the different<br />
documents. This gives the base for computing the frequent sub-structural self-similarity<br />
between a set of XML documents. To reduce the cost of similarity computation, XProj<br />
selects the top-K most frequent substructures of size l.<br />
—Clustering/Grouping. The primary approach is to use a sub-structural modification of<br />
a partition-based approach in which the clusters of XML documents are built around<br />
groups of representative substructures. Initially, the sets of XML documents are divided<br />
into K partitions with equal size, and the sets of substructure representatives are<br />
generated by mining frequent substructures from these partitions. These structure representatives<br />
are used to cluster the XML documents using the self-similarity measure.<br />
—Evaluation Criteria. The two external criteria, precision and recall, are used to evaluate<br />
the quality of the XProj clustering method. XProj is also compared with the Chawathe<br />
algorithm [Chawathe 1999] and the structural algorithm [Dalamagasa et al. 2006].<br />
The comparison results indicate that both the XProj and the structural algorithms work<br />
well and have a higher precision and recall than Chawathe’s tree edit distance-based<br />
algorithm when dealing with heterogeneous XML documents. In case of homogeneous<br />
XML documents, XProj outperforms the other two systems.<br />
ACM Computing Surveys, Vol. , No. , 2009.