PDF (1MB) - QUT ePrints

More documents

Recommendations

Info

4.1 Tree-based similarity approaches XML Data Clustering: An Overview · 13 The computation of similarity among XML data represented as data trees depends on the exploited objects on which similarity functions to be applied. A function, Sim, is a similarity measure that exploits the features of objects as well as the relationships among them, in order to determine the proximity between objects. It is represented as Sim(OP 1 , OP 2 ), and its value ranges between 0 and 1, when the measure is normalized. The value 0 means strong dissimilarity between objects, while the value 1 means strong similarity. Based on the objects used to compute the similarity among XML data trees, we classify the similarity measures into: element-level measures and structure-level measures. 4.1.1 Element-level Measures. These measures, also known as schema matching-based methods, consider (simple) objects’ details such as name, data type as well as the relationships between objects. In element-level measures, the similarity between XML data is based on the computed similarities among their elements. The similarity between two simple objects in two document trees OP 1 ∈ DT 1 and OP 2 ∈ DT 2 can be determined using the following equation: Sim(OP 1 , OP 2 ) = w s × SSim(OP 1 , OP 2 ) + w x × CSim(OP 1 , OP 2 ) where SSim(OP 1 , OP 2 ) represents the simple similarity measure between two objects exploiting their features, such as name, data type, and constraint, while CSim(OP 1 , OP 2 ) represents the complex similarity measure between them exploiting the relationships of each object, and w s and w x are weights to quantify the importance of each measure, (w s , w x ≥ 0, w s + w x = 1). These computed similarities are then aggregated to determine the semantic similarities among paths and XML data trees themselves [Rahm and Bernstein 2001; Kade and Heuser 2008]. —Simple Similarity Measures: These measures determine the similarity between two objects OP1 ∈ DT1 and OP2 ∈ DT2 using their features. (1) Name Similarity. Object names can be semantically similar (e.g. person, people) or syntactically similar (e.g. name, fname). Hence, both semantic and syntactic measures are included to determine the degree of similarity between objects’ names. Semantic measures rely heavily on external or user-defined dictionaries such as WordNet. In the syntactic measures, each object names is decomposed into a set of tokens T 1 and T 2 using a customizable tokenizer using punctuation, upper case, special symbols, and digits, e.g. LastName → {Last, Name}. The name similarity between the two sets of name tokens T 1 and T 2 is determined as the average best similarity of each token with a token in the other set. It is computed as follow: Nsim(T 1 , T 2 ) = ∑ t 1∈T 1 [max t2∈T 2 sim(t 1 , t 2 )] + ∑ t 2∈T 2 [max t1∈T 1 sim(t 2 , t 1 )] |T1| + |T2| To measure the string similarity between a pair of tokens, sim(t 1 , t 2 ), several string similarity functions, such as the edit distance and n-grams, can be used [Cohen et al. 2003]. (2) Data Type Similarity. The Built-in XML data types hierarchy 6 can be used in order to compute data type similarity. Based on the XML schema data type hierarchy, a 6 http://www.w3.org/TR/xmlschema-2/ ACM Computing Surveys, Vol. , No. , 2009.
14 · Alsayed Algergawy et al. data type compatibility table is built, as the one used in [Madhavan et al. 2001], as shown in Fig. 7(a). The figure illustrates that elements having the same data types or belonging to the same data type category have the possibility to be similar and their type similarities (Typesim) are high. The type similarity between two objects using the type compatibility table is: Typesim(type1, type2) = TypeTable(type1, type2) (3) Constraint Similarity. Another feature of an object that makes a small contribution in determining the simple similarity is its cardinality constraint. The authors of XClust [Lee et al. 2002] have defined a cardinality table for DTD constraints, as shown in Fig. 7(b). Also, the authors of PCXSS [Nayak and Tran 2007] have adapted the cardinality constraint table for constraint matching of XSDs. The cardinality similarity between two objects using the constraint cardinality table is: Cardsim(card1, card2) = CardinalityTable(card1, card2) (4) Content Similarity. To measure the similarity between text nodes of XML documents, the content information of these nodes should be captured. To this end, a token-based similarity measure can be used. According to the comparison made in [Cohen et al. 2003], the TFIDF ranking performed best among several tokenbased similarity measures. TFIDF (Term Frequency Inverse Document Frequency) is a statistical measure used to evaluate how important a word is to a document or corpus. The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus. It is used here to assess the similarity between text nodes. In this case, the content of these nodes can be considered as multisets (or bags) of words. Given the content of two text nodes represented as texts cont 1 and cont 2 , The TFIDF measure between them is given by [Cohen et al. 2003]: where, and, Contsim(cont 1 , cont 2 ) = V (w, cont 1 ) = ∑ w∈cont 1∩cont 2 V (w, cont 1 ).V (w, cont 2 ) V ′ (w, cont 1 ) √∑ w ′ V ′ (w, cont 1 ) 2 V ′ (w, cont 1 ) = log (TF w,cont1 + 1). log (IDF w ) where, TF w,cont1 is the frequency of the word w in cont 1 , and IDF w is the inverse of the fraction of names in the corpus that contain w. These simple similarity measures are then aggregated using an aggregation function, such as the weighted sum, to produce the simple similarity between two objects. ACM Computing Surveys, Vol. , No. , 2009.
Page 1 and 2: This is the author’s version of a
Page 3 and 4: 2 · Alsayed Algergawy et al. been
Page 5 and 6: 4 · Alsayed Algergawy et al. featu
Page 7 and 8: 6 · Alsayed Algergawy et al. large
Page 9 and 10: 8 · Alsayed Algergawy et al. now a
Page 11 and 12: 10 · Alsayed Algergawy et al. —p
Page 13: 12 · Alsayed Algergawy et al. Fig.
Page 17 and 18: 16 · Alsayed Algergawy et al. Fig.
Page 19 and 20: 18 · Alsayed Algergawy et al. Tabl
Page 21 and 22: 20 · Alsayed Algergawy et al. as f
Page 23 and 24: 22 · Alsayed Algergawy et al. Expe
Page 25 and 26: 24 · Alsayed Algergawy et al. merg
Page 27 and 28: 26 · Alsayed Algergawy et al. The
Page 29 and 30: 28 · Alsayed Algergawy et al. (a)
Page 31 and 32: 30 · Alsayed Algergawy et al. (a)
Page 33 and 34: 32 · Alsayed Algergawy et al. in S
Page 35 and 36: 34 · Alsayed Algergawy et al. 6.2
Page 37 and 38: 36 · Alsayed Algergawy et al. —S
Page 39 and 40: 38 · Alsayed Algergawy et al. larg
Page 41 and 42: 40 · Alsayed Algergawy et al. mult
Page 43 and 44: 42 · Alsayed Algergawy et al. HUCK
Page 45: 44 · Alsayed Algergawy et al. WANG

PDF (1MB) - QUT ePrints

You also want an ePaper? Increase the reach of your titles

Delete template?

Save as template?