
PDF (1MB) - QUT ePrints


12 · Alsayed Algergawy et al.

Fig. 6: Similarity measures.

both features should be taken into account. An extended version of the vector space model, called the structural link vector model (SLVM), is used to capture the syntactic structure tagged by XML elements [Yang et al. 2009]. SLVM represents an XML document $doc_x$ using a document feature matrix $\Delta_x \in \mathbb{R}^{n \times m}$, given as

$$\Delta_x = [\Delta_{x(1)}, \Delta_{x(2)}, \ldots, \Delta_{x(m)}]$$

where $m$ is the number of distinct XML elements and $\Delta_{x(i)} \in \mathbb{R}^{n}$ is the TFIDF feature vector representing the $i$th XML element ($1 \le i \le m$), given as $\Delta_{x(i)} = \big(\mathrm{TF}(\rho_j, doc_x.e_i) \cdot \mathrm{IDF}(\rho_j)\big)_{j=1}^{n}$, where $\mathrm{TF}(\rho_j, doc_x.e_i)$ is the frequency of the term $\rho_j$ in the element $e_i$ of $doc_x$. The SLVM representation of an XML document instance for D1, depicted in Fig. 1a, is reported in Fig. 5b, which illustrates, for example, that the term XML appears once in D1 (from the document feature vector $d_x$) under the element title (from the document feature matrix $\Delta_x$).
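The construction of the document feature matrix can be illustrated with a minimal sketch. It is not the implementation of [Yang et al. 2009]: the function name `slvm_matrices`, the whitespace tokenizer, the use of each element's direct text only, and the choice IDF = log(N / df) are all assumptions made for illustration.

```python
import math
import xml.etree.ElementTree as ET
from collections import Counter


def slvm_matrices(xml_docs):
    """Build one n x m TF-IDF matrix per document (rows: terms, columns: elements).

    Hypothetical helper for illustration; tokenization and IDF are assumptions.
    """
    # Per document: element tag -> Counter of the terms in that element's own text.
    counts = []
    for text in xml_docs:
        per_element = {}
        for node in ET.fromstring(text).iter():
            words = (node.text or "").lower().split()
            per_element.setdefault(node.tag, Counter()).update(words)
        counts.append(per_element)

    # Shared vocabularies: n distinct terms and m distinct element names.
    terms = sorted({t for c in counts for cnt in c.values() for t in cnt})
    elements = sorted({e for c in counts for e in c})

    # Assumed IDF: log(N / df), with df the number of documents containing the term.
    df = Counter()
    for c in counts:
        df.update({t for cnt in c.values() for t in cnt})
    idf = {t: math.log(len(xml_docs) / df[t]) for t in terms}

    # Entry (j, i) is TF(term j, element i of the document) * IDF(term j).
    matrices = [
        [[c.get(e, Counter())[t] * idf[t] for e in elements] for t in terms]
        for c in counts
    ]
    return terms, elements, matrices
```

Column i of each returned matrix plays the role of the feature vector of the ith element, so a term that occurs under title contributes only to the title column. With this assumed IDF, a term that occurs in every document receives weight 0.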

Another vector-based representation that captures both the structure and the content of XML data is presented in [Yoon et al. 2001]. The bitmap indexing technique, shown in Fig. 5d, is extended so that a set of XML documents is represented using a 3-dimensional matrix called BitCube. Each document is defined as a set of (path, word) pairs, where path is a root-to-leaf path and word denotes a word or content of the path. If a document has path, then the corresponding bit in the bitmap index is set to 1; otherwise, all of its bits for that path are set to 0. Similarly, if path contains word, the corresponding bit is set to 1, and to 0 otherwise.
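The (path, word) encoding can be sketched as follows. This is an illustrative reading of the description above, not the data structure of [Yoon et al. 2001]: the names `leaf_paths` and `build_bitcube`, the nested-list layout, and the whitespace tokenizer are assumptions, and text inside elements that have children is ignored.

```python
import xml.etree.ElementTree as ET


def leaf_paths(xml_text):
    """Map each root-to-leaf path of one document to the set of words in its leaf."""
    result = {}

    def walk(node, prefix):
        path = prefix + "/" + node.tag
        children = list(node)
        if children:
            for child in children:
                walk(child, path)
        else:
            # Leaf element: record its path and the words it contains.
            words = (node.text or "").lower().split()
            result.setdefault(path, set()).update(words)

    walk(ET.fromstring(xml_text), "")
    return result


def build_bitcube(xml_docs):
    """Return (paths, words, bitmap, cube) for a list of XML strings.

    bitmap[d][p] is 1 if document d has path p;
    cube[d][p][w] is 1 if path p of document d contains word w.
    """
    docs = [leaf_paths(text) for text in xml_docs]
    paths = sorted({p for d in docs for p in d})
    words = sorted({w for d in docs for ws in d.values() for w in ws})
    bitmap = [[1 if p in d else 0 for p in paths] for d in docs]
    cube = [
        [[1 if w in d.get(p, ()) else 0 for w in words] for p in paths]
        for d in docs
    ]
    return paths, words, bitmap, cube
```

A document that lacks a path gets 0 in the bitmap and an all-zero row along the word axis of the cube, which matches the rule stated above. A production version would more likely pack the bits into machine words so that bitwise operations can be applied.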

4. SIMILARITY MEASURES AND COMPUTATION

Starting from the representation model of objects and their features, the similarity between XML data can be identified and determined by exploiting objects, objects' features, and the relationships among them. Various aspects allow XML data similarity measures to be described and categorized, such as the kind of methodology used, the kind of XML data representation, and the intended application domain [Tekli et al. 2009]. In the following, for homogeneity of presentation, we survey several XML similarity measures based on the data representation they use, as shown in Fig. 6.

ACM Computing Surveys, Vol. , No. , 2009.
