PDF (1MB) - QUT ePrints

More documents

Recommendations

Info

XML Data Clustering: An Overview · 11 (a) XML document instance of D1. (b) Feature vector (d x) and feature matrix (∆ x) of D1. ❳❳ ❳❳❳ XML doc. ❳ path ❳❳❳❳ D 1 D 2 D 3 D 4 D 5 D 6 P 1(Book/Title) 1 1 1 0 0 1 P 2(Book/Author/Name) 1 0 0 0 0 1 P 3 (Book/Publisher/Name) 1 0 0 0 0 1 P 4(Book/Author) 0 2 2 0 0 0 P 5(Book/Publisher) 0 1 0 0 0 0 P 6(Book/Year) 0 1 0 0 0 0 P 7(Conference/ConfTitle) 0 0 0 1 1 0 P 8(Conf/ConfAuthor) 0 0 0 1 1 0 P 9(Conf/ConfName) 0 0 0 1 1 0 P 10(Conf/ConfYear) 0 0 0 1 0 0 P 11(Conf/ConfLoc) 0 0 0 1 1 0 (c) A structure-based representation for Fig.1(b). ❳❳ ❳❳❳ XML doc. ❳ path ❳❳❳ ❳ D 1 D 2 D 3 D 4 D 5 D 6 P 1 1 1 1 0 0 1 P 2 1 0 0 0 0 1 P 3 1 0 0 0 0 1 P 4 0 1 1 0 0 0 P 5 0 1 0 0 0 0 P 6 0 1 0 0 0 0 P 7 0 0 0 1 1 0 P 8 0 0 0 1 1 0 P 9 0 0 0 1 1 0 P 10 0 0 0 1 0 0 P 11 0 0 0 1 1 0 (d) A bitmap index for Fig. 1(b). Fig. 5: Vector representation of an XML document. level of an XML element is a feature whose domain are the positive integers (0 for the root, 1 for the first level, and so on), while the name of the element is another feature whose domain is string. This representation model introduces a challenge to incorporate the structural relationships between elements. In the following, we elaborate on different methods used to model XML data based on their either content, structure, or content & structure. —Content-based representation. Each XML document, doc x , is represented using a vector, called the feature vector d x , that stores the frequency of (distinct) words in the document. Fig. 5(a,b) shows the content-based representation of D1 depicted in Fig. 1a. —Structure-based representation. In this case, the feature vector capture the structure information of the XML document instead of the content information. Several methods have proposed to achieve this task. Most of them are based on exploiting path information. The approach in [Tran et al. 2008] models each XML document as a vector {P 1 , P 2 , ..., P n }, where each element, P i , of the vector represents the frequency of the path that appears in the document, as shown in Fig. 5c. A path, P i , contains element names from the root element to a leaf element. The leaf element is an element that contains the textual content. The bitmap index technique is also used to represent XML data [Yoon et al. 2001]. A set of XML documents is represented using a 2-dimensional matrix. As illustrated in Fig. 5d, if a document has path, then the corresponding bit in the bitmap index is set to 1. Otherwise, all bits are set to 0. —Content and structure-based representation. Representing XML documents using either the content or the structure feature is not sufficient to effectively model them. To this, ACM Computing Surveys, Vol. , No. , 2009.
12 · Alsayed Algergawy et al. Fig. 6: Similarity measures. both features should be taken into account. An extended version of the vector space model called structural link vector model (SLVM) is used to capture syntactic structure tagged by XML elements [Yang et al. 2009]. SLVM represents an XML document doc x using a document feature matrix ∆ x ∈ R n×m , given as ∆ x = [∆ x(1) , ∆ x(2) , ..., ∆ x(m) ] where m is the number of distinct XML elements, ∆ x(i) ∈ R n is the TFIDF feature vector representing the ith XML element, (1 ≤ i ≤ m), given as ∆ x(i) = TF(ρ j , doc x .e i ).IDF(ρ j ) for all j = 1 to n, where TF(ρ j , doc x .e i ) is the frequency of the term w j in the element e i of doc x . The SLVM representation of an XML document instance for D1 depicted in Fig. 1a is reported in Fig. 5b, which illustrates, for example, that the term XML appears one time in D1 (from the document feature vector d x ) under the element title (from the document feature matrix ∆ x ). Another vector-based representation that captures both structure and content of the XML data is represented in [Yoon et al. 2001]. The bitmap indexing technique, shown in Fig. 5d is extended, where a set of XML documents is represented using a 3-dimensional matrix, called BitCube. Each document is defined as a set of (path, word), where path is a root-to-leaf path, and word denotes the word or content of the path. If a document has path, then the corresponding bit in the bitmap index is set to 1. Otherwise, all bits are set to 0 (and if path contains a word, the bit is set to 1, and 0 otherwise). 4. SIMILARITY MEASURES AND COMPUTATION Starting from the representation model of objects and their features, the similarity between XML data can be identified and determined by exploiting objects, objects’ features, and relationships among them. There are various aspects that allow the description and categorization of XML data similarity measures, such as the kind of methodology being used, the kind of XML data representation, and the planned application domain [Tekli et al. 2009]. In the following, for homogeneity of presentation, we survey several XML similarity measures based on the used data representation, as shown in Fig. 6. ACM Computing Surveys, Vol. , No. , 2009.
Page 1 and 2: This is the author’s version of a
Page 3 and 4: 2 · Alsayed Algergawy et al. been
Page 5 and 6: 4 · Alsayed Algergawy et al. featu
Page 7 and 8: 6 · Alsayed Algergawy et al. large
Page 9 and 10: 8 · Alsayed Algergawy et al. now a
Page 11: 10 · Alsayed Algergawy et al. —p
Page 15 and 16: 14 · Alsayed Algergawy et al. data
Page 17 and 18: 16 · Alsayed Algergawy et al. Fig.
Page 19 and 20: 18 · Alsayed Algergawy et al. Tabl
Page 21 and 22: 20 · Alsayed Algergawy et al. as f
Page 23 and 24: 22 · Alsayed Algergawy et al. Expe
Page 25 and 26: 24 · Alsayed Algergawy et al. merg
Page 27 and 28: 26 · Alsayed Algergawy et al. The
Page 29 and 30: 28 · Alsayed Algergawy et al. (a)
Page 31 and 32: 30 · Alsayed Algergawy et al. (a)
Page 33 and 34: 32 · Alsayed Algergawy et al. in S
Page 35 and 36: 34 · Alsayed Algergawy et al. 6.2
Page 37 and 38: 36 · Alsayed Algergawy et al. —S
Page 39 and 40: 38 · Alsayed Algergawy et al. larg
Page 41 and 42: 40 · Alsayed Algergawy et al. mult
Page 43 and 44: 42 · Alsayed Algergawy et al. HUCK
Page 45: 44 · Alsayed Algergawy et al. WANG

PDF (1MB) - QUT ePrints

Create successful ePaper yourself

Delete template?

Save as template?