PDF (1MB) - QUT ePrints
12 · Alsayed Algergawy et al.<br />
Fig. 6: Similarity measures.<br />
both features should be taken into account. An extended version of the vector space<br />
model called structural link vector model (SLVM) is used to capture syntactic structure<br />
tagged by XML elements [Yang et al. 2009]. SLVM represents an XML document $doc_x$<br />
using a document feature matrix $\Delta_x \in \mathbb{R}^{n \times m}$, given as<br />
$\Delta_x = [\Delta_{x(1)}, \Delta_{x(2)}, \ldots, \Delta_{x(m)}]$<br />
where $m$ is the number of distinct XML elements and $\Delta_{x(i)} \in \mathbb{R}^{n}$ is the TFIDF feature<br />
vector representing the $i$th XML element ($1 \le i \le m$), given as $\Delta_{x(i)} =$<br />
$TF(\rho_j, doc_x.e_i) \cdot IDF(\rho_j)$ for all $j = 1$ to $n$, where $TF(\rho_j, doc_x.e_i)$ is the frequency<br />
of the term $\rho_j$ in the element $e_i$ of $doc_x$. The SLVM representation of an XML document<br />
instance for D1 depicted in Fig. 1a is reported in Fig. 5b, which illustrates, for<br />
example, that the term XML appears once in D1 (from the document feature vector<br />
$d_x$) under the element title (from the document feature matrix $\Delta_x$).<br />
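The construction of the SLVM document feature matrix can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the implementation of [Yang et al. 2009]: the function name `slvm_matrix`, the representation of a document as a list of (element, text) pairs, whitespace tokenization, the IDF variant log(N/df), and the two toy documents are all hypothetical choices made for this example.

```python
import math
from collections import Counter

def slvm_matrix(doc, corpus, terms, elements):
    """Return an n x m matrix as a list of rows (terms) by columns (elements).

    doc and every entry of corpus are lists of (element_name, text) pairs.
    Entry [j][i] is TF(term_j, doc.element_i) * IDF(term_j).
    """
    n_docs = len(corpus)
    # Document frequency: number of documents containing the term anywhere.
    df = {
        t: sum(any(t in text.lower().split() for _, text in d) for d in corpus)
        for t in terms
    }
    # Assumed IDF variant: log(N / df); terms absent from the corpus get 0.
    idf = {t: math.log(n_docs / df[t]) if df[t] else 0.0 for t in terms}
    # Term frequencies counted separately for each element of the document.
    tf = {e: Counter() for e in elements}
    for name, text in doc:
        if name in tf:
            tf[name].update(text.lower().split())
    return [[tf[e][t] * idf[t] for e in elements] for t in terms]

# Hypothetical toy corpus (not the documents of Fig. 1).
d1 = [("title", "XML clustering"), ("author", "smith")]
d2 = [("title", "data mining"), ("author", "jones")]
terms = ["xml", "data"]
elements = ["title", "author"]
matrix = slvm_matrix(d1, [d1, d2], terms, elements)
```

Here the row for the term "xml" is non-zero only in the column for the element title, which mirrors how the matrix records where in the structure a term occurs.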
Another vector-based representation that captures both the structure and the content of XML<br />
data is presented in [Yoon et al. 2001]. The bitmap indexing technique shown in Fig.<br />
5d is extended so that a set of XML documents is represented using a 3-dimensional<br />
matrix, called BitCube. Each document is defined as a set of (path, word) pairs, where path<br />
is a root-to-leaf path and word denotes a word or content item of the path. If a document<br />
contains a path, the corresponding bit in the bitmap index is set to 1; otherwise, all bits<br />
for that path are set to 0. Likewise, if the path contains a word, the bit for that word is set to 1, and to 0 otherwise.<br />
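The BitCube idea can be illustrated with a small Python sketch. It is a simplified illustration under stated assumptions rather than the data structure of [Yoon et al. 2001]: the function name `build_bitcube`, the nested-list layout indexed as document, path, word, and the two toy documents are hypothetical.

```python
def build_bitcube(docs):
    """Build a 3-dimensional bit matrix indexed as cube[doc][path][word].

    docs is a list of sets of (path, word) pairs. A bit is 1 when the
    document contains the word under the path, and 0 otherwise.
    """
    paths = sorted({p for d in docs for p, _ in d})
    words = sorted({w for d in docs for _, w in d})
    p_idx = {p: i for i, p in enumerate(paths)}
    w_idx = {w: i for i, w in enumerate(words)}
    # Start with all bits cleared, then set the bits that are present.
    cube = [[[0] * len(words) for _ in paths] for _ in docs]
    for d_i, d in enumerate(docs):
        for p, w in d:
            cube[d_i][p_idx[p]][w_idx[w]] = 1
    return cube, paths, words

# Hypothetical toy documents given as (root-to-leaf path, word) pairs.
docs = [
    {("/book/title", "xml"), ("/book/author", "smith")},
    {("/book/title", "data")},
]
cube, paths, words = build_bitcube(docs)
```

In this sketch the second document has no /book/author path, so every bit in that slice stays 0, which corresponds to the "all bits are set to 0" case described above.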
4. SIMILARITY MEASURES AND COMPUTATION<br />
Starting from the representation model of objects and their features, the similarity between<br />
XML data can be identified and determined by exploiting objects, objects’ features, and<br />
relationships among them. There are various aspects that allow the description and categorization<br />
of XML data similarity measures, such as the kind of methodology being used, the<br />
kind of XML data representation, and the planned application domain [Tekli et al. 2009].<br />
In the following, for homogeneity of presentation, we survey several XML similarity measures<br />
based on the data representation used, as shown in Fig. 6.<br />
ACM Computing Surveys, Vol. , No. , 2009.