30.12.2014 Views

PDF (1MB) - QUT ePrints

PDF (1MB) - QUT ePrints

PDF (1MB) - QUT ePrints

SHOW MORE
SHOW LESS

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

10 · Alsayed Algergawy et al.<br />

—parent-child (induced) relationships, that is the relationship between each element node<br />

and its direct subelement/attribute/text node;<br />

—ancestor-descendant (embedded) relationships, that is the relationship between each element<br />

node and its direct or indirect subelement/attribute/text nodes;<br />

—order relationships among siblings.<br />

The tree-based representation of XML data introduces some limitations for the presentation<br />

of these relationships. For example, association relationships that are structural relationships<br />

specifying that two nodes are conceptually at the same level, are not included in<br />

the tree representation. Association relationships essentially model key/keyref and substitution<br />

group mechanisms. As a result, another tree like structure, such as directed acyclic<br />

graphs, is used [Boukottaya and Vanoirbeek 2005]. Fig. 4(a) represents a substitution<br />

group mechanism, which allows customer names to be in an English or a German style.<br />

This type of relationship can not be represented using the tree representation. It needs a<br />

bi-directional edge, as shown in Fig. 4(b).<br />

(a) Substitution group example.<br />

(b) Data graph.<br />

Fig. 4: Graph representation of an XML schema part.<br />

3.2 Vector-based representation<br />

The Vector Space Model (VSM) model [Salton et al. 1975] is a widely used data representation<br />

for text documents. In this model each document is represented as a feature vector<br />

of the words that appear in documents of the data set. The term weights (usually term<br />

frequencies) of the words are also contained in each feature vector. VSM represents a text<br />

document, doc x , using the document feature vector, d x , as [Yang et al. 2009]<br />

d x = [d x(1) , d x(2) , ..., d x(n) ] T , d x(i) = TF(ρ i , dox x ).IDF(ρ i )<br />

where TF(ρ i , doc x ) is the frequency of the term ρ i of doc x , IDF(ρ i ) = log(<br />

DF(ρ ) i)<br />

is the inverse document frequency of the term ρ i for discounting the importance of the<br />

frequently appearing terms, |D| is the total number of documents, DF(ρ i ) is the number<br />

of documents containing ρ i , and n is the number of distinct terms in the document set.<br />

Applying VSM directly to represent semi-structured documents is not desirable, as the<br />

document syntactic structure tagged by their XML elements will be ignored.<br />

XML data generally can be represented as vectors in an abstract n-dimensional feature<br />

space. A set of objects can be extracted from an XML data. Each object has a set of<br />

features, where each feature is associated with a domain to define its allowed values. The<br />

ACM Computing Surveys, Vol. , No. , 2009.<br />

|D|

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!