PDF (1MB) - QUT ePrints

14 · Alsayed Algergawy et al.

data type compatibility table is built, similar to the one used in [Madhavan et al. 2001], as shown in Fig. 7(a). The figure illustrates that elements having the same data type, or data types belonging to the same category, are likely to be similar, so their type similarity (Typesim) is high. The type similarity between two objects using the type compatibility table is:

Typesim(type1, type2) = TypeTable(type1, type2)
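
The table lookup above can be sketched in a few lines of Python. This is a minimal illustration, not the table of Fig. 7(a): the type names, the grouping into categories, and the three scores (1.0, 0.8, 0.2) are assumptions chosen only to show that identical types score highest and types in the same category score higher than unrelated ones.

```python
# Hypothetical grouping of data types into categories (not taken from Fig. 7(a)).
TYPE_CATEGORY = {
    "string": "text", "token": "text",
    "int": "numeric", "integer": "numeric", "decimal": "numeric",
    "date": "temporal", "dateTime": "temporal",
}

# Hypothetical scores for the three cases described in the text.
SAME_TYPE, SAME_CATEGORY, DIFFERENT_CATEGORY = 1.0, 0.8, 0.2


def build_type_table(categories):
    """Precompute a symmetric compatibility table over all type pairs."""
    table = {}
    for t1, c1 in categories.items():
        for t2, c2 in categories.items():
            if t1 == t2:
                table[(t1, t2)] = SAME_TYPE
            elif c1 == c2:
                table[(t1, t2)] = SAME_CATEGORY
            else:
                table[(t1, t2)] = DIFFERENT_CATEGORY
    return table


TYPE_TABLE = build_type_table(TYPE_CATEGORY)


def type_sim(type1, type2):
    """Typesim(type1, type2) = TypeTable(type1, type2); unknown types score 0.0."""
    return TYPE_TABLE.get((type1, type2), 0.0)
```

For example, `type_sim("int", "decimal")` returns the same-category score, while `type_sim("string", "date")` returns the lowest score.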

(3) Constraint Similarity. Another feature of an object that makes a small contribution to determining the simple similarity is its cardinality constraint. The authors of XClust [Lee et al. 2002] defined a cardinality table for DTD constraints, as shown in Fig. 7(b). The authors of PCXSS [Nayak and Tran 2007] adapted this cardinality constraint table for constraint matching of XSDs. The cardinality similarity between two objects using the cardinality constraint table is:

Cardsim(card1, card2) = CardinalityTable(card1, card2)
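
As with the type table, the cardinality lookup can be sketched as a symmetric matrix indexed by DTD occurrence operators. The scores below are illustrative placeholders and should not be read as the values of Fig. 7(b); the key `"none"` is an assumed label for an element with no occurrence operator.

```python
# Illustrative cardinality table for DTD occurrence operators.
# "*" = zero or more, "+" = one or more, "?" = optional, "none" = exactly one.
# The numeric scores are placeholders, not the published table.
CARDINALITY_TABLE = {
    "*":    {"*": 1.0, "+": 0.9, "?": 0.7, "none": 0.7},
    "+":    {"*": 0.9, "+": 1.0, "?": 0.7, "none": 0.7},
    "?":    {"*": 0.7, "+": 0.7, "?": 1.0, "none": 0.8},
    "none": {"*": 0.7, "+": 0.7, "?": 0.8, "none": 1.0},
}


def card_sim(card1, card2):
    """Cardsim(card1, card2) = CardinalityTable(card1, card2)."""
    try:
        return CARDINALITY_TABLE[card1][card2]
    except KeyError as exc:
        raise ValueError(f"unknown cardinality constraint: {exc}") from None
```

An XSD variant would need a mapping from `minOccurs`/`maxOccurs` pairs to table keys, which is left out of this sketch.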

(4) Content Similarity. To measure the similarity between text nodes of XML documents, the content of these nodes should be captured. To this end, a token-based similarity measure can be used. According to the comparison in [Cohen et al. 2003], TFIDF ranking performed best among several token-based similarity measures. TFIDF (Term Frequency Inverse Document Frequency) is a statistical measure of how important a word is to a document in a corpus. The importance increases in proportion to the number of times the word appears in the document but is offset by the frequency of the word in the corpus. It is used here to assess the similarity between text nodes, whose content can be treated as multisets (or bags) of words. Given the content of two text nodes represented as texts cont1 and cont2, the TFIDF measure between them is given by [Cohen et al. 2003]:

\[
\mathrm{Contsim}(cont_1, cont_2) = \sum_{w \in cont_1 \cap cont_2} V(w, cont_1) \cdot V(w, cont_2)
\]

where,

\[
V(w, cont_1) = \frac{V'(w, cont_1)}{\sqrt{\sum_{w'} V'(w', cont_1)^2}}
\]

and,

\[
V'(w, cont_1) = \log(TF_{w,cont_1} + 1) \cdot \log(IDF_w)
\]

where $TF_{w,cont_1}$ is the frequency of the word $w$ in $cont_1$, and $IDF_w$ is the inverse of the fraction of names in the corpus that contain $w$.
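
The TFIDF measure can be sketched in Python as follows. This is a minimal illustration under stated assumptions: each text is already tokenized into a list of words, the corpus is a list of such token lists, and a word absent from the corpus is treated as occurring in one text to avoid division by zero. The function names are hypothetical.

```python
import math
from collections import Counter


def idf(word, corpus):
    """Inverse of the fraction of corpus texts that contain the word."""
    containing = sum(1 for text in corpus if word in text)
    # Assumption: a word unseen in the corpus is treated as occurring once.
    return len(corpus) / max(containing, 1)


def normalized_weights(tokens, corpus):
    """V(w, cont): V'(w, cont) scaled so that the weight vector has unit length."""
    tf = Counter(tokens)
    raw = {w: math.log(tf[w] + 1) * math.log(idf(w, corpus)) for w in tf}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    if norm == 0.0:
        # Every word occurs in every text, so no word is discriminative.
        return {w: 0.0 for w in raw}
    return {w: v / norm for w, v in raw.items()}


def cont_sim(cont1, cont2, corpus):
    """Sum of V(w, cont1) * V(w, cont2) over words shared by both texts."""
    v1 = normalized_weights(cont1, corpus)
    v2 = normalized_weights(cont2, corpus)
    return sum(v1[w] * v2[w] for w in v1.keys() & v2.keys())
```

Because each weight vector is normalized to unit length, the score is the cosine of the angle between the two vectors and lies between 0 and 1. A word that appears in every text of the corpus has log(IDF) = 0 and therefore contributes nothing.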

These simple similarity measures are then aggregated using an aggregation function, such as a weighted sum, to produce the simple similarity between two objects.
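
A weighted-sum aggregation can be sketched as below. The measure names, the example weights, and the normalization by the total weight are assumptions made for illustration; the text does not prescribe particular weights.

```python
def simple_sim(scores, weights):
    """Weighted sum of per-feature similarities, normalized by the total weight.

    Both arguments are dicts keyed by measure name (hypothetical names such as
    "name", "type", "card", "content").
    """
    if scores.keys() != weights.keys():
        raise ValueError("scores and weights must cover the same measures")
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(weights[k] * scores[k] for k in scores) / total


# Example with made-up scores and weights.
example_scores = {"name": 0.8, "type": 1.0, "card": 0.9, "content": 0.5}
example_weights = {"name": 0.4, "type": 0.2, "card": 0.1, "content": 0.3}
example_result = simple_sim(example_scores, example_weights)
```

Dividing by the total weight keeps the result in [0, 1] whenever every individual score is in [0, 1], even if the weights do not sum to one.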

ACM Computing Surveys, Vol. , No. , 2009.
