30.12.2014 Views

PDF (1MB) - QUT ePrints

PDF (1MB) - QUT ePrints

PDF (1MB) - QUT ePrints

SHOW MORE
SHOW LESS

You also want an ePaper? Increase the reach of your titles

YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.

schema classes.<br />

XML Data Clustering: An Overview · 35<br />

—Evaluation Criteria. XMine is tested using real-world schemas collected from different<br />

domains. To validate the quality of XMine, the standard criteria including FScore,<br />

intra- and inter-clustering similarity have been used. The evaluation shows that XMine<br />

is effective in clustering a set of heterogenous XML schemas. However, XMine is not<br />

subject to any scalability test and a comparison with another clustering method is missing.<br />

6.2.3 PCXSS. The Progressively Clustering XML by Semantic and Structural Similarity<br />

(PCXSS) [Nayak and Tran 2007] method introduces a fast clustering algorithm based<br />

on a global criterion function CPSim (Common Path Coefficient) that progressively measures<br />

the similarity between an XSD and existing clusters, ignoring the need to compute<br />

the similarity between two individual XSDs. The CPSim is calculated by considering the<br />

structural and semantic similarity between objects.<br />

—Data Representation. Each schema is represented as an ordered data tree. A simplification<br />

analysis of the data (schema) trees is then performed in order to deal with the<br />

composite elements. The XML tree is then decomposed into path information called<br />

node paths. A node path is an ordered set of nodes from the root node to a leaf node.<br />

—Similarity Computation. PCXSS uses the CPSim to measure the degree of similarity<br />

of nodes (simple objects) between node paths. Each node in a node path of a data tree<br />

is matched with the node in a node path of another data tree, and then aggregated to<br />

form the node path similarity. The node similarity between nodes of two paths is determined<br />

by measuring the similarity of its features such as name, data type and cardinality<br />

constraints.<br />

—Clustering/Grouping. PCXSS is motivated by the incremental clustering algorithms. It<br />

first starts with no clusters. When a new data tree comes in, it is assigned to a new<br />

cluster. When the next data tree comes in, it matches it with the existing cluster. PCXSS<br />

rather progressively measures the similarities between a new XML data tree and existing<br />

clusters by using the common path coefficient (CPSim) to cluster XML schemas<br />

incrementally.<br />

—Evaluation Criteria. PCXSS is evaluated using the standard criteria. Both the quality<br />

and the efficiency of PCXSS are validated. Furthermore, the PCXSS method is compared<br />

with a pairwise clustering algorithm, namely wCluto [Zhao and Karypis 2002b].<br />

The results show that the increment in time with the increase of the size of the data set<br />

is less with PCXSS in comparison to the wCluto pairwise method.<br />

6.2.4 SeqXClust. The Sequence matching approach to cluster XML schema (SeqX-<br />

Clust) [Algergawy et al. 2008b] method introduces an efficient clustering algorithm based<br />

on the sequence representation of XML schema using the Prüfer encoding method.<br />

—Data Representation. Each schema is represented as an ordered data tree. A simplification<br />

analysis of the data (schema) trees is then performed in order to deal with the<br />

composite elements. Each data tree is then represented as sequence using a modified<br />

Prüfer encoding method. The semantic information of the data tree is captured by Label<br />

Prüfer sequences (LPSs), while the structure information is captured by Number Prüfer<br />

sequences (NPSs).<br />

ACM Computing Surveys, Vol. , No. , 2009.

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!