PDF (1MB) - QUT ePrints
schema classes.<br />
XML Data Clustering: An Overview · 35<br />
—Evaluation Criteria. XMine is tested using real-world schemas collected from different<br />
domains. To validate the quality of XMine, the standard criteria including FScore,<br />
intra- and inter-clustering similarity have been used. The evaluation shows that XMine<br />
is effective in clustering a set of heterogeneous XML schemas. However, XMine has not been<br />
subjected to any scalability test, and a comparison with another clustering method is missing.<br />
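The survey names FScore as a quality criterion without defining it. As a minimal sketch, assuming the definition commonly used for clustering (for each true class, take the best F-measure over all clusters, then weight by class size), it could be computed as follows. The function name `clustering_fscore` and the input layout are illustrative assumptions, not part of XMine.

```python
from collections import Counter


def clustering_fscore(true_classes, cluster_ids):
    """Class-size-weighted best F-measure per class (assumed FScore definition)."""
    n = len(true_classes)
    class_sizes = Counter(true_classes)
    cluster_sizes = Counter(cluster_ids)
    # overlap[(cls, clu)] = number of objects of class cls placed in cluster clu
    overlap = Counter(zip(true_classes, cluster_ids))

    total = 0.0
    for cls, n_cls in class_sizes.items():
        best = 0.0
        for clu, n_clu in cluster_sizes.items():
            n_both = overlap[(cls, clu)]
            if n_both == 0:
                continue
            precision = n_both / n_clu
            recall = n_both / n_cls
            best = max(best, 2 * precision * recall / (precision + recall))
        total += (n_cls / n) * best
    return total
```

A perfect clustering scores 1.0; merging or splitting classes lowers the score.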
6.2.3 PCXSS. The Progressively Clustering XML by Semantic and Structural Similarity<br />
(PCXSS) [Nayak and Tran 2007] method introduces a fast clustering algorithm based<br />
on a global criterion function CPSim (Common Path Coefficient) that progressively measures<br />
the similarity between an XSD and existing clusters, avoiding the need to compute<br />
the similarity between two individual XSDs. CPSim is calculated by considering the<br />
structural and semantic similarity between objects.<br />
—Data Representation. Each schema is represented as an ordered data tree. A simplification<br />
analysis of the data (schema) trees is then performed in order to deal with the<br />
composite elements. The XML tree is then decomposed into path information called<br />
node paths. A node path is an ordered set of nodes from the root node to a leaf node.<br />
—Similarity Computation. PCXSS uses the CPSim to measure the degree of similarity<br />
of nodes (simple objects) between node paths. Each node in a node path of one data tree<br />
is matched with a node in a node path of another data tree, and the node similarities are then aggregated to<br />
form the node path similarity. The similarity between nodes of two paths is determined<br />
by measuring the similarity of their features, such as name, data type, and cardinality<br />
constraints.<br />
—Clustering/Grouping. PCXSS is motivated by incremental clustering algorithms. It<br />
starts with no clusters. When the first data tree arrives, it is assigned to a new<br />
cluster. Each subsequent data tree is matched against the existing clusters. Rather than<br />
computing pairwise similarities between individual trees, PCXSS progressively measures<br />
the similarity between a new XML data tree and the existing clusters by using the<br />
common path coefficient (CPSim), and so clusters XML schemas incrementally.<br />
—Evaluation Criteria. PCXSS is evaluated using the standard criteria. Both the quality<br />
and the efficiency of PCXSS are validated. Furthermore, the PCXSS method is compared<br />
with a pairwise clustering algorithm, namely wCluto [Zhao and Karypis 2002b].<br />
The results show that the running time of PCXSS grows more slowly with the size of the data set<br />
than that of the pairwise wCluto method.<br />
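The PCXSS steps described above (decompose each tree into node paths, compare a new tree with existing clusters, assign incrementally) can be sketched as follows. This is an illustration under stated assumptions, not the authors' algorithm: `path_set_similarity` is a simplified stand-in for CPSim that counts only exactly matching label paths and ignores data types, cardinality constraints, and semantic name similarity. The tree layout `(label, [children])`, the threshold value, and the choice to summarize a cluster by the union of its members' paths are likewise hypothetical.

```python
def node_paths(tree, prefix=()):
    """Yield every root-to-leaf label path of a tree given as (label, [children])."""
    label, children = tree
    path = prefix + (label,)
    if not children:
        yield path
    for child in children:
        yield from node_paths(child, path)


def path_set_similarity(paths_a, paths_b):
    """Simplified stand-in for CPSim: shared paths over the larger path count."""
    if not paths_a or not paths_b:
        return 0.0
    return len(paths_a & paths_b) / max(len(paths_a), len(paths_b))


def incremental_cluster(trees, threshold=0.5):
    """Assign each tree to the most similar existing cluster, or open a new one."""
    clusters = []  # each cluster: {"paths": set of node paths, "members": [tree indices]}
    for idx, tree in enumerate(trees):
        paths = set(node_paths(tree))
        best, best_sim = None, 0.0
        for cluster in clusters:
            sim = path_set_similarity(paths, cluster["paths"])
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim >= threshold:
            best["members"].append(idx)
            best["paths"] |= paths  # assumption: cluster summary is the union of paths
        else:
            clusters.append({"paths": paths, "members": [idx]})
    return [c["members"] for c in clusters]
```

Each incoming tree is compared once per existing cluster rather than once per previously seen tree, which is the property the comparison with pairwise clustering relies on.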
6.2.4 SeqXClust. The Sequence matching approach to cluster XML schema (SeqXClust)<br />
[Algergawy et al. 2008b] method introduces an efficient clustering algorithm based<br />
on the sequence representation of XML schema using the Prüfer encoding method.<br />
—Data Representation. Each schema is represented as an ordered data tree. A simplification<br />
analysis of the data (schema) trees is then performed in order to deal with the<br />
composite elements. Each data tree is then represented as a sequence using a modified<br />
Prüfer encoding method. The semantic information of the data tree is captured by Label<br />
Prüfer sequences (LPSs), while the structure information is captured by Number Prüfer<br />
sequences (NPSs).<br />
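The survey does not spell out how the Prüfer encoding is modified, so the following is only a sketch under assumptions: nodes are numbered in post-order, the leaf with the smallest number is deleted repeatedly until only the root remains, the NPS records the parent's number at each deletion, and the LPS records the label of each deleted node. With post-order numbering, the deletion order is simply 1, 2, ..., n-1. The function name `prufer_sequences` and the tree layout `(label, [children])` are hypothetical.

```python
def prufer_sequences(tree):
    """Return (NPS, LPS) for a tree given as (label, [children])."""
    labels = []  # labels[k - 1] is the label of the node with post-order number k
    parent = {}  # post-order number of a node -> post-order number of its parent

    def visit(node):
        label, children = node
        child_numbers = [visit(child) for child in children]
        labels.append(label)
        number = len(labels)  # post-order number of this node
        for child_number in child_numbers:
            parent[child_number] = number
        return number

    root = visit(tree)
    nps = [parent[k] for k in range(1, root)]  # structure: parent numbers
    lps = labels[:-1]  # semantics: labels of deleted nodes (root excluded)
    return nps, lps
```

For a small schema tree `("book", [("title", []), ("author", [("name", [])])])`, the post-order numbers are title=1, name=2, author=3, book=4, so the sketch yields the NPS `[4, 3, 4]` and the LPS `["title", "name", "author"]`.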
ACM Computing Surveys, Vol. , No. , 2009.