PDF (1MB) - QUT ePrints
PDF (1MB) - QUT ePrints
PDF (1MB) - QUT ePrints
Create successful ePaper yourself
Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.
10 · Alsayed Algergawy et al.<br />
—parent-child (induced) relationships, that is the relationship between each element node<br />
and its direct subelement/attribute/text node;<br />
—ancestor-descendant (embedded) relationships, that is the relationship between each element<br />
node and its direct or indirect subelement/attribute/text nodes;<br />
—order relationships among siblings.<br />
The tree-based representation of XML data introduces some limitations for the presentation<br />
of these relationships. For example, association relationships that are structural relationships<br />
specifying that two nodes are conceptually at the same level, are not included in<br />
the tree representation. Association relationships essentially model key/keyref and substitution<br />
group mechanisms. As a result, another tree like structure, such as directed acyclic<br />
graphs, is used [Boukottaya and Vanoirbeek 2005]. Fig. 4(a) represents a substitution<br />
group mechanism, which allows customer names to be in an English or a German style.<br />
This type of relationship can not be represented using the tree representation. It needs a<br />
bi-directional edge, as shown in Fig. 4(b).<br />
(a) Substitution group example.<br />
(b) Data graph.<br />
Fig. 4: Graph representation of an XML schema part.<br />
3.2 Vector-based representation<br />
The Vector Space Model (VSM) model [Salton et al. 1975] is a widely used data representation<br />
for text documents. In this model each document is represented as a feature vector<br />
of the words that appear in documents of the data set. The term weights (usually term<br />
frequencies) of the words are also contained in each feature vector. VSM represents a text<br />
document, doc x , using the document feature vector, d x , as [Yang et al. 2009]<br />
d x = [d x(1) , d x(2) , ..., d x(n) ] T , d x(i) = TF(ρ i , dox x ).IDF(ρ i )<br />
where TF(ρ i , doc x ) is the frequency of the term ρ i of doc x , IDF(ρ i ) = log(<br />
DF(ρ ) i)<br />
is the inverse document frequency of the term ρ i for discounting the importance of the<br />
frequently appearing terms, |D| is the total number of documents, DF(ρ i ) is the number<br />
of documents containing ρ i , and n is the number of distinct terms in the document set.<br />
Applying VSM directly to represent semi-structured documents is not desirable, as the<br />
document syntactic structure tagged by their XML elements will be ignored.<br />
XML data generally can be represented as vectors in an abstract n-dimensional feature<br />
space. A set of objects can be extracted from an XML data. Each object has a set of<br />
features, where each feature is associated with a domain to define its allowed values. The<br />
ACM Computing Surveys, Vol. , No. , 2009.<br />
|D|