30.12.2014 Views

PDF (1MB) - QUT ePrints


34 · Alsayed Algergawy et al.

6.2 Clustering XML Schema Prototypes

6.2.1 XClust. XClust [Lee et al. 2002] clusters the DTDs of XML data sources so that they can be integrated effectively. In terms of the phases of the generic framework, it works as follows.

—Data Representation. DTDs are analyzed and represented as unordered data trees. To simplify schema matching, a series of transformation rules converts auxiliary OR nodes (choices) into AND nodes (sequences), which can then be merged. XClust makes use of simple objects, their features, and the relationships between them.

—Similarity Computation. XClust computes the similarity of two DTDs from the similarity between simple objects in their data trees. To this end, the system computes both semantic similarity, which exploits the semantic features of objects, and structure and context similarity, which exploits the relationships between objects. The output of this phase is the DTD similarity matrix.

—Clustering/Grouping. A hierarchical clustering algorithm uses the DTD similarity matrix to group DTDs into clusters. Hierarchical clustering can guide and enhance the integration process because it starts with clusters of single DTDs and gradually adds highly similar DTDs to these clusters.

—Evaluation Criteria. Since the main objective of XClust is to develop an effective integration framework, it uses criteria that quantify the goodness of the integrated schema. No study of the scalability of the clustering has been done.
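The clustering phase described above can be illustrated with a minimal sketch. It is not XClust's implementation: the average-linkage rule, the stopping threshold, the function name `agglomerative_clusters`, and the toy similarity values are all assumptions made for illustration. The sketch only shows the general idea of starting from single-DTD clusters and repeatedly merging the most similar pair, driven by a precomputed similarity matrix.

```python
from itertools import combinations


def agglomerative_clusters(sim, threshold):
    """Group item indices by repeatedly merging the two most similar clusters.

    sim       -- symmetric matrix (list of lists) of pairwise similarities in [0, 1]
    threshold -- hypothetical cut-off: stop once no pair of clusters is this similar
    """
    clusters = [[i] for i in range(len(sim))]  # start from single-DTD clusters

    def linkage(a, b):
        # Average linkage: an assumption, the survey does not name the linkage rule.
        return sum(sim[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        # Find the pair of clusters with the highest linkage similarity.
        (x, y), best = max(
            (((p, q), linkage(clusters[p], clusters[q]))
             for p, q in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if best < threshold:
            break  # remaining clusters are too dissimilar to merge
        merged = sorted(clusters[x] + clusters[y])
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters.append(merged)
    return sorted(clusters)


# Toy similarity matrix for four hypothetical DTDs (values invented for illustration).
dtd_similarity = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.8],
    [0.1, 0.2, 0.8, 1.0],
]
print(agglomerative_clusters(dtd_similarity, threshold=0.5))  # [[0, 1], [2, 3]]
```

Recording the order of merges, rather than only the final groups, would give the cluster hierarchy that an integration process could follow step by step.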

6.2.2 XMine. XMine [Nayak and Iryadi 2007] introduces a clustering algorithm that measures the similarity between XML schemas by considering both the semantics and the hierarchical structural similarity of their elements.
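Two ingredients of such a measure, a string edit distance over element names and a comparison of the paths in two schema trees, can be sketched as follows. This is an illustrative simplification rather than XMine's method: the normalisation of the edit distance, the exact-match path overlap (a stand-in for the sequential pattern mining that XMine adapts), and the names `edit_distance`, `name_similarity`, and `common_path_ratio` are assumptions. The WordNet-based semantic comparison is omitted.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (insert, delete, substitute cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                # delete ca
                curr[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),   # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def name_similarity(a, b):
    """Map edit distance to [0, 1]; this normalisation is an illustrative choice."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def common_path_ratio(paths_a, paths_b):
    """Share of distinct root-to-leaf paths that two schemas have in common."""
    set_a, set_b = set(paths_a), set(paths_b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 1.0


# Hypothetical schemas, each given as a list of root-to-leaf element paths.
schema_a = [("book", "title"), ("book", "author")]
schema_b = [("book", "title"), ("book", "price")]

print(edit_distance("author", "authors"))      # 1
print(name_similarity("Author", "author"))     # 1.0
print(common_path_ratio(schema_a, schema_b))   # one shared path out of three
```

Scores of this kind, computed for every pair of schemas, are what would populate a schema similarity matrix.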

—Data Representation. Each schema is represented as an ordered data tree. The data (schema) trees are then simplified to deal with nesting and repetition, using a set of transformation rules similar to those in [Lee et al. 2002]. XMine handles both DTD and XSD schemas and, like XClust, makes use of simple objects, their features, and the relationships between objects.

—Similarity Computation. XMine determines the schema similarity matrix through three components. (1) The element analyzer determines linguistic similarity by comparing each pair of elements of two schemas, primarily on the basis of their names. It considers both the semantic relationship found in the WordNet thesaurus and the syntactic relationship given by the string edit distance function. (2) The maximally similar paths finder identifies paths and elements that are common and similar between each pair of schema trees, on the assumption that similar schemas have more paths in common. It adapts the sequential pattern mining algorithm [Srikant and Agrawal 1996] to infer the similarity between elements and paths. (3) The schema similarity matrix processor computes the similarity matrix between schemas from the criteria measured above. This matrix becomes the input to the next phase.

—Clustering/Grouping. The constrained hierarchical agglomerative clustering algorithm groups similar schemas using the schema similarity matrix. XMine makes use of the wCluto (footnote 10) web-enabled data clustering application to form a hierarchy of

10 http://cluto.ccgb.umn.edu/cgi-bin/wCluto<br />

ACM Computing Surveys, Vol. , No. , 2009.
