PDF (1MB) - QUT ePrints
14 · Alsayed Algergawy et al.
data type compatibility table is built, like the one used in [Madhavan et al. 2001], as
shown in Fig. 7(a). The figure illustrates that elements having the same data type
or belonging to the same data type category are likely to be similar, so
their type similarity (Typesim) is high. The type similarity between two objects
using the type compatibility table is:
Typesim(type1, type2) = TypeTable(type1, type2)
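The table lookup can be sketched in Python as follows. This is a minimal illustration, not the table of Fig. 7(a): the grouping of types into categories, the category names, and every score below are assumed placeholder values.

```python
# Hypothetical grouping of XSD built-in types into categories (illustrative only).
TYPE_CATEGORY = {
    "string": "text", "normalizedString": "text", "token": "text",
    "int": "numeric", "integer": "numeric", "decimal": "numeric", "float": "numeric",
    "date": "temporal", "dateTime": "temporal",
    "boolean": "boolean",
}

# Placeholder scores; the actual entries of Fig. 7(a) are not reproduced here.
SAME_TYPE = 1.0
SAME_CATEGORY = 0.8
CROSS_CATEGORY = {
    frozenset(("text", "numeric")): 0.2,
    frozenset(("text", "temporal")): 0.2,
}
DEFAULT = 0.0


def type_sim(type1, type2):
    """Return the (assumed) compatibility score of two data types."""
    if type1 == type2:
        return SAME_TYPE
    cat1 = TYPE_CATEGORY.get(type1)
    cat2 = TYPE_CATEGORY.get(type2)
    if cat1 is None or cat2 is None:
        return DEFAULT  # unknown type: no evidence of compatibility
    if cat1 == cat2:
        return SAME_CATEGORY
    # frozenset keys make the lookup independent of argument order
    return CROSS_CATEGORY.get(frozenset((cat1, cat2)), DEFAULT)
```

Deriving the scores from categories keeps the sketch short; an explicit type-by-type table, as in the figure, would be looked up the same way.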
(3) Constraint Similarity. Another feature of an object that makes a small contribution
in determining the simple similarity is its cardinality constraint. The authors
of XClust [Lee et al. 2002] have defined a cardinality table for DTD constraints,
as shown in Fig. 7(b). Also, the authors of PCXSS [Nayak and Tran 2007] have
adapted the cardinality constraint table for constraint matching of XSDs. The cardinality
similarity between two objects using the constraint cardinality table is:
Cardsim(card1, card2) = CardinalityTable(card1, card2)
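As a rough sketch, a symmetric cardinality table over the DTD occurrence indicators (`*`, `+`, `?`, and no indicator, written here as `"none"`) could be stored and queried as below. The scores are illustrative placeholders and should not be read as the entries of Fig. 7(b).

```python
# Placeholder scores for pairs of DTD occurrence indicators (illustrative only).
# Each unordered pair is stored once; frozenset keys make the table symmetric.
CARDINALITY_TABLE = {
    frozenset(("*",)): 1.0,
    frozenset(("+",)): 1.0,
    frozenset(("?",)): 1.0,
    frozenset(("none",)): 1.0,
    frozenset(("*", "+")): 0.9,
    frozenset(("*", "?")): 0.7,
    frozenset(("*", "none")): 0.7,
    frozenset(("+", "?")): 0.7,
    frozenset(("+", "none")): 0.7,
    frozenset(("?", "none")): 0.8,
}


def card_sim(card1, card2):
    """Look up the (assumed) similarity of two cardinality constraints.

    Raises KeyError for an indicator that is not in the table.
    """
    return CARDINALITY_TABLE[frozenset((card1, card2))]
```

For XSDs, `minOccurs`/`maxOccurs` pairs would first have to be mapped onto such categories; that mapping is not shown here.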
(4) Content Similarity. To measure the similarity between text nodes of XML documents,
the content information of these nodes should be captured. To this end, a
token-based similarity measure can be used. According to the comparison made
in [Cohen et al. 2003], the TFIDF ranking performed best among several token-based
similarity measures. TFIDF (Term Frequency Inverse Document Frequency)
is a statistical measure used to evaluate how important a word is to a document in a
corpus. The importance increases with the number of times a word
appears in the document but is offset by the frequency of the word in the corpus. It
is used here to assess the similarity between text nodes. In this case, the content of
these nodes can be considered as multisets (or bags) of words. Given the content of
two text nodes represented as texts $cont_1$ and $cont_2$, the TFIDF measure between
them is given by [Cohen et al. 2003]:
\[
\mathrm{Contsim}(cont_1, cont_2) = \sum_{w \in cont_1 \cap cont_2} V(w, cont_1) \cdot V(w, cont_2)
\]
where
\[
V(w, cont_1) = \frac{V'(w, cont_1)}{\sqrt{\sum_{w'} V'(w', cont_1)^2}}
\]
and
\[
V'(w, cont_1) = \log(TF_{w,cont_1} + 1) \cdot \log(IDF_w)
\]
where $TF_{w,cont_1}$ is the frequency of the word $w$ in $cont_1$, and $IDF_w$ is the inverse
of the fraction of names in the corpus that contain $w$.
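The TFIDF measure can be sketched in Python as below. This is a minimal illustration under stated assumptions: tokens are obtained by lowercasing and splitting on whitespace, the corpus is a plain list of text-node contents that includes both texts being compared, and the function names are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumed tokenization: lowercase, split on whitespace.
    return text.lower().split()


def idf(word, corpus_tokens):
    # Inverse of the fraction of corpus entries that contain the word.
    containing = sum(1 for tokens in corpus_tokens if word in tokens)
    if containing == 0:
        raise ValueError(f"word {word!r} does not occur in the corpus")
    return len(corpus_tokens) / containing


def weights(text, corpus_tokens):
    # V'(w, cont) = log(TF + 1) * log(IDF), then normalized to unit length.
    tf = Counter(tokenize(text))
    raw = {w: math.log(tf[w] + 1) * math.log(idf(w, corpus_tokens)) for w in tf}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    if norm == 0:
        return {w: 0.0 for w in raw}
    return {w: v / norm for w, v in raw.items()}


def cont_sim(cont1, cont2, corpus):
    # Sum of V(w, cont1) * V(w, cont2) over the words shared by both texts.
    corpus_tokens = [set(tokenize(t)) for t in corpus]
    v1 = weights(cont1, corpus_tokens)
    v2 = weights(cont2, corpus_tokens)
    return sum(v1[w] * v2[w] for w in v1.keys() & v2.keys())
```

Because each weight vector is normalized, the score is a cosine similarity: texts with no shared words score 0. A word that occurs in every corpus entry has $\log(IDF_w) = 0$ and so contributes nothing.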
These simple similarity measures are then aggregated using an aggregation function,
such as the weighted sum, to produce the simple similarity between two objects.
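A weighted-sum aggregation might look like the following sketch. The measure names used as dictionary keys and the weight values are assumptions chosen for illustration; the text does not prescribe particular weights.

```python
def simple_sim(scores, weights):
    """Weighted sum of individual similarity scores.

    `scores` and `weights` map a measure name to a value. The result is
    divided by the total weight, so the weights need not sum to 1.
    """
    total_weight = sum(weights[name] for name in scores)
    if total_weight == 0:
        raise ValueError("weights must not all be zero")
    return sum(weights[name] * scores[name] for name in scores) / total_weight


# Hypothetical scores and weights for one pair of objects.
example_scores = {"type": 1.0, "cardinality": 0.9, "content": 0.5}
example_weights = {"type": 0.5, "cardinality": 0.2, "content": 0.3}
example_result = simple_sim(example_scores, example_weights)
```

Dividing by the total weight keeps the aggregate in [0, 1] whenever each individual score is in [0, 1].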
ACM Computing Surveys, Vol. , No. , 2009.