PDF (1MB) - QUT ePrints
34 · Alsayed Algergawy et al.<br />
6.2 Clustering XML Schema Prototypes<br />
6.2.1 XClust. XClust [Lee et al. 2002] proposes an approach for clustering the DTDs<br />
of XML data sources so that they can be integrated effectively. In terms of the phases of the<br />
generic framework, it can be described as follows.<br />
—Data Representation. DTDs are analyzed and represented as unordered data trees. To<br />
simplify schema matching, a series of transformation rules is used to transform auxiliary<br />
OR nodes (choices) into AND nodes (sequences), which can then be merged. XClust<br />
makes use of simple objects, their features, and the relationships between them.<br />
—Similarity Computation. To compute the similarity of two DTDs, XClust computes the<br />
similarity between the simple objects in their data trees. To this end, the system<br />
combines two measures: semantic similarity, which exploits the semantic features of<br />
objects, and structure and context similarity, which exploits the relationships<br />
between objects. The output of this phase is the DTD<br />
similarity matrix.<br />
—Clustering/Grouping. The DTD similarity matrix is exploited by a hierarchical clustering<br />
algorithm to group DTDs into clusters. The hierarchical clustering technique can<br />
guide and enhance the integration process, since it starts with<br />
clusters of single DTDs and gradually adds highly similar DTDs to these clusters.<br />
—Evaluation Criteria. Since the main objective of XClust is to develop an effective integration<br />
framework, it uses criteria that quantify the goodness of the integrated schema.<br />
No study of the scalability of the clustering has been carried out.<br />
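The Clustering/Grouping phase above can be illustrated with a minimal sketch of agglomerative clustering over a precomputed similarity matrix. This is not XClust's implementation: the average-linkage rule, the stopping threshold, the function names, and the example matrix are all assumptions made for illustration. The only part taken from the description is the overall behavior of starting from single-DTD clusters and repeatedly merging the most similar ones.

```python
from itertools import combinations


def average_similarity(cluster_a, cluster_b, sim):
    """Mean pairwise similarity between the members of two clusters."""
    total = sum(sim[i][j] for i in cluster_a for j in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))


def agglomerate(sim, threshold):
    """Merge the most similar pair of clusters until no pair reaches threshold.

    sim is a symmetric matrix (list of lists) of DTD similarities in [0, 1].
    Average linkage and the threshold stop are assumptions of this sketch.
    Returns the final clusters as sorted lists of DTD indices.
    """
    clusters = [[i] for i in range(len(sim))]  # start with single-DTD clusters
    while len(clusters) > 1:
        # Find the pair of clusters with the highest average similarity.
        (a, b), best = max(
            (((a, b), average_similarity(clusters[a], clusters[b], sim))
             for a, b in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if best < threshold:
            break  # remaining clusters are too dissimilar to merge
        merged = sorted(clusters[a] + clusters[b])
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return sorted(clusters)


# Hypothetical similarity matrix for four DTDs (values are illustrative only).
similarity = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.8],
    [0.1, 0.2, 0.8, 1.0],
]
clusters = agglomerate(similarity, threshold=0.5)
```

With this illustrative matrix and a threshold of 0.5, DTDs 0 and 1 end up in one cluster and DTDs 2 and 3 in another; lowering the threshold to 0 merges everything into a single cluster, which corresponds to the root of the hierarchy.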
6.2.2 XMine. XMine [Nayak and Iryadi 2007] introduces a clustering algorithm that<br />
measures the similarity between XML schemas by considering the semantics of elements<br />
as well as their hierarchical structural similarity.<br />
—Data Representation. Each schema is represented as an ordered data tree. A simplification<br />
analysis of the data (schema) trees is then performed in order to deal with the nesting<br />
and repetition problems using a set of transformation rules similar to those in [Lee et al.<br />
2002]. XMine handles both DTD and XSD schemas and, like XClust, makes use of<br />
simple objects, their features, and the relationships between objects.<br />
—Similarity Computation. XMine determines the schema similarity matrix through three<br />
components. (1) The element analyzer determines the linguistic similarity by comparing<br />
each pair of elements of two schemas, primarily based on their names. It considers<br />
both the semantic relationship, as found in the WordNet thesaurus, and the syntactic<br />
relationship, using the string edit distance function. (2) The maximally similar paths finder<br />
identifies paths and elements that are common and similar between each pair of tree<br />
schemas, based on the assumption that similar schemas have more common paths. Moreover,<br />
it adapts the sequential pattern mining algorithm [Srikant and Agrawal 1996] to<br />
infer the similarity between elements and paths. (3) The schema similarity matrix processor<br />
computes the similarity matrix between schemas based on the criteria measured<br />
above. This matrix becomes the input to the next phase.<br />
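As an illustration of the element analyzer described above, the following sketch scores a pair of element names using a semantic check and a syntactic check. It is not XMine's code. The small `SYNONYMS` table is a hypothetical stand-in for a WordNet lookup, and combining the two scores with `max` is an assumption, since the text does not say how XMine combines them. The edit distance is the standard Levenshtein distance.

```python
def edit_distance(a, b):
    """Levenshtein distance, computed with two rows of dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def syntactic_similarity(a, b):
    """Edit distance normalized to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


# Hypothetical synonym pairs standing in for a WordNet thesaurus lookup.
SYNONYMS = {
    frozenset({"author", "writer"}),
    frozenset({"cost", "price"}),
}


def semantic_similarity(a, b):
    """1.0 for identical or listed-synonym names, 0.0 otherwise."""
    return 1.0 if a == b or frozenset({a, b}) in SYNONYMS else 0.0


def name_similarity(a, b):
    """Linguistic similarity of two element names (max is an assumption)."""
    a, b = a.lower(), b.lower()
    return max(semantic_similarity(a, b), syntactic_similarity(a, b))
```

Calling `name_similarity` on every pair of element names from two schemas gives the kind of pairwise linguistic scores that the later components would aggregate into a schema-level similarity.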
—Clustering/Grouping. The constrained hierarchical agglomerative clustering algorithm<br />
is used to group similar schemas by exploiting the schema similarity matrix. XMine makes<br />
use of the wCluto web-enabled data clustering application (footnote 10) to form a hierarchy of<br />
Footnote 10: http://cluto.ccgb.umn.edu/cgi-bin/wCluto<br />
ACM Computing Surveys, Vol. , No. , 2009.