30.12.2014 Views

PDF (1MB) - QUT ePrints


XML Data Clustering: An Overview

integer a > 0. The level similarity, $Sim_{L_1,L_2}$, between two level representations is:

\[
Sim_{L_1,L_2} = \frac{0.5 \times \sum_{i=0}^{M-1} c^1_i \times a^{M-i-1} + 0.5 \times \sum_{j=0}^{M-1} c^2_j \times a^{M-j-1}}{\sum_{k=0}^{M-1} t_k \times a^{M-k-1}}
\]

where $c^1_i$ denotes the number of common elements in level $i$ of $L_1$ and some level of $L_2$, $c^2_j$ denotes the number of common elements in level $j$ of $L_2$ and some level of $L_1$ during the matching process, $t_k$ denotes the total number of distinct elements in level $k$ of $L_1$, and $M$ is the number of levels in the first level structure. Since $t_k$ and $M$ are taken from the first level structure only, the level similarity measure is not symmetric. XCLS therefore determines the structural similarity from the first XML document to the second one and vice versa, and chooses the higher of the two values.
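
The level similarity formula can be illustrated with a minimal Python sketch. It assumes the matching process has already produced the per-level counts, so the sequences `c1`, `c2`, and `t` (each of length M, padded with zeros where needed) are inputs rather than computed here. The function names and the default base `a=2` are illustrative assumptions, not part of XCLS.

```python
from typing import Sequence, Tuple

Counts = Tuple[Sequence[int], Sequence[int], Sequence[int]]


def level_similarity(c1: Sequence[int], c2: Sequence[int],
                     t: Sequence[int], a: int = 2) -> float:
    """Evaluate Sim_{L1,L2} from precomputed per-level counts.

    c1[i]: common elements in level i of L1 and some level of L2
    c2[j]: common elements in level j of L2 and some level of L1
    t[k]:  distinct elements in level k of L1
    a:     integer base > 0 (the default 2 is an assumption)
    """
    M = len(t)
    if not (len(c1) == len(c2) == M):
        raise ValueError("c1, c2 and t must all have length M")
    if a <= 0:
        raise ValueError("a must be a positive integer")
    # Level i is weighted by a^(M-i-1), so upper levels count more when a > 1.
    weights = [a ** (M - i - 1) for i in range(M)]
    numerator = (0.5 * sum(c * w for c, w in zip(c1, weights))
                 + 0.5 * sum(c * w for c, w in zip(c2, weights)))
    denominator = sum(tk * w for tk, w in zip(t, weights))
    return numerator / denominator if denominator else 0.0


def two_way_similarity(forward: Counts, backward: Counts, a: int = 2) -> float:
    """Take the higher of the two directional similarities."""
    return max(level_similarity(*forward, a=a),
               level_similarity(*backward, a=a))
```

For example, with `c1 = [1, 1]`, `c2 = [1, 0]`, `t = [1, 3]`, and `a = 2`, the numerator is 0.5 × 3 + 0.5 × 2 = 2.5 and the denominator is 5, giving a similarity of 0.5.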

—Clustering/Grouping. In this phase, an incremental clustering algorithm groups the XML documents from the various XML sources according to their level similarity. The clustering algorithm processes the documents one at a time, placing each incoming XML document either into a new cluster or into the existing cluster that has the maximum level similarity with it.
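
The incremental grouping step might be sketched as follows. This is a simplified illustration under stated assumptions: the `threshold` that decides when a new cluster is opened, and the use of each cluster's first member as its representative, are choices made for this sketch and are not taken from the description above. The `similarity` callable stands in for the level similarity.

```python
from typing import Callable, Iterable, List, TypeVar

D = TypeVar("D")


def incremental_cluster(docs: Iterable[D],
                        similarity: Callable[[D, D], float],
                        threshold: float) -> List[List[D]]:
    """Assign each incoming document to its most similar cluster, or open a new one."""
    clusters: List[List[D]] = []
    for doc in docs:
        best_idx, best_sim = -1, -1.0
        for idx, members in enumerate(clusters):
            # Assumption: the first member represents the whole cluster.
            sim = similarity(doc, members[0])
            if sim > best_sim:
                best_idx, best_sim = idx, sim
        if best_idx >= 0 and best_sim >= threshold:
            clusters[best_idx].append(doc)
        else:
            clusters.append([doc])
    return clusters
```

Each document is compared only against the current clusters, never against every other document, which is what lets an incremental scheme avoid pairwise comparison.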

—Evaluation Criteria. The quality of XCLS is evaluated using standard criteria, such as intra- and inter-cluster similarity, purity, entropy, and FScore. The scalability of the XCLS system is also evaluated to validate its space and time complexity. Since XCLS uses an incremental clustering algorithm that avoids the need for pairwise comparison, it has a time complexity of $O(NKdc)$, where $c$ is the number of distinct elements in clusters. Experimental evaluation indicates that XCLS achieves clustering quality similar to that of pairwise clustering algorithms in much less time. However, XCLS lacks a unified clustering framework that can be applied efficiently to both homogeneous and heterogeneous collections of XML documents.
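
Two of the criteria named above, purity and entropy, can be illustrated with a short sketch. It assumes each cluster is given as a list of ground-truth class labels for its members, and uses the common textbook definitions with base-2 logarithms and no normalization; the exact variants used in the XCLS evaluation are not specified here.

```python
import math
from collections import Counter
from typing import Hashable, List, Sequence


def purity(clusters: Sequence[List[Hashable]]) -> float:
    """Fraction of documents that belong to the majority class of their cluster."""
    total = sum(len(labels) for labels in clusters)
    if total == 0:
        raise ValueError("no documents to evaluate")
    majority = sum(max(Counter(labels).values()) for labels in clusters if labels)
    return majority / total


def entropy(clusters: Sequence[List[Hashable]]) -> float:
    """Size-weighted average of the class entropy inside each cluster (lower is better)."""
    total = sum(len(labels) for labels in clusters)
    if total == 0:
        raise ValueError("no documents to evaluate")
    result = 0.0
    for labels in clusters:
        if not labels:
            continue
        n = len(labels)
        h = -sum((k / n) * math.log2(k / n) for k in Counter(labels).values())
        result += (n / total) * h
    return result
```

For the two clusters `["a", "a", "b"]` and `["b", "b"]`, four of the five documents sit in their cluster's majority class, so the purity is 0.8.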

6.1.2 XEdge. XML documents Clustering using Edge Level Summaries (XEdge) [Antonellis et al. 2008] is a unified clustering algorithm for both homogeneous and heterogeneous XML documents. Depending on the type of the XML documents, the method modifies its distance metric to adapt to the particular structural features of homogeneous and heterogeneous documents.

—Data Representation. Like XCLS, XEdge represents XML documents as ordered data trees in which leaf nodes represent content values. Instead of summarizing the distinct nodes as XCLS does, XEdge introduces the LevelEdge representation, which summarizes all distinct parent/child relationships (distinguished by a number associated with each edge, as shown in Fig. 11(a)) in each level of the XML document, as shown in Fig. 11(c). The distinct edges are first encoded as integers, and those integers are then used to construct the LevelEdge representation.
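
A LevelEdge-style summary might be built as in the sketch below. Several details are assumptions made for illustration only: an edge is identified by its (parent tag, child tag) pair, integer codes are assigned in order of first occurrence starting at 1, level 0 holds the edges leaving the root, and text content is ignored. The function name `level_edges` is hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Optional, Tuple

Edge = Tuple[str, str]


def level_edges(xml_text: str,
                edge_ids: Optional[Dict[Edge, int]] = None
                ) -> Tuple[List[List[int]], Dict[Edge, int]]:
    """Return the distinct integer-encoded edges found at each level of the tree."""
    if edge_ids is None:
        edge_ids = {}
    root = ET.fromstring(xml_text)
    levels: List[List[int]] = []
    frontier = [root]
    while frontier:
        codes: List[int] = []   # distinct edge codes at this level, in document order
        next_frontier = []
        for parent in frontier:
            for child in parent:  # iterating an Element yields its direct children
                edge = (parent.tag, child.tag)
                code = edge_ids.setdefault(edge, len(edge_ids) + 1)
                if code not in codes:
                    codes.append(code)
                next_frontier.append(child)
        if codes:
            levels.append(codes)
        frontier = next_frontier
    return levels, edge_ids
```

Passing the same `edge_ids` dictionary across several documents keeps the integer codes consistent, so that edges at the same level of two documents can be compared directly.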

—Similarity Computation. To compute the similarity among XML documents, XEdge proposes two similarity measures, one for homogeneous XML documents and one for heterogeneous ones. Assuming that homogeneous XML documents are derived from sub-DTDs of the same DTD, XEdge only searches for common edges in the same levels in both documents. To measure the similarity between two homogeneous XML documents represented as LevelEdge $L_1$ and $L_2$, it uses the following similarity function:

ACM Computing Surveys, Vol. , No. , 2009.
