PDF (1MB) - QUT ePrints
PDF (1MB) - QUT ePrints
PDF (1MB) - QUT ePrints
Create successful ePaper yourself
Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.
XML Data Clustering: An Overview · 27<br />
integer a > 0. The level similarity, Sim L1,L 2<br />
, between two level representations is:<br />
Sim L1,L 2<br />
= 0.5 × ∑ M−1<br />
i=0 c1 i × aM−i−1 + 0.5 × ∑ M−1<br />
∑ M−1<br />
k=0 t k × a M−k−1<br />
j=0 c2 j × aM−j−1<br />
where c 1 i denotes the number of common elements in level i of L 1 and some level of<br />
L 2 , where c 2 j denotes the number of common elements in level j of L 2 and some level<br />
of L 1 during the matching process, t k denotes the total number of distinct elements in<br />
the level k of L 1 , and M is the number of levels in the first level structure. Since the<br />
level similarity measure is not transitive, XCLS determines the structural similarity from<br />
the first XML document to the second one and viceversa and chooses the highest value<br />
between the two.<br />
—Clustering/Grouping. In this phase, an incremental clustering algorithm is used to group<br />
the XML documents within various XML sources considering the level similarity. The<br />
clustering algorithm progressively places each coming XML document into a new cluster<br />
or into an existing cluster that has the maximum level similarity with it.<br />
—Evaluation Criteria. The quality of XCLS is evaluated using the standard criteria, such<br />
as the intra- and inter-cluster similarity, purity, entropy, and FScore. The scalability<br />
of the XCLS system is also evaluated to validate its space and time complexity. Since<br />
XSLC uses an incremental clustering algorithm that avoids the need of pairwise comparison,<br />
it has a time complexity of O(NKdc), where c is the number of distinct elements<br />
in clusters. Experimental evaluation indicates that XCLS achieves a similar quality result<br />
as the pairwise clustering algorithms in much less time. However, XCLS lacks a<br />
unified clustering framework that can be applied efficiently both to homogeneous and<br />
heterogenous collections of XML documents.<br />
6.1.2 XEdge. XML documents Clustering using Edge Level Summaries (XEdge) [Antonellis<br />
et al. 2008] represents a unified clustering algorithm for both homogeneous and<br />
heterogenous XML documents. Based on the type of the XML documents, the proposed<br />
method modifies its distance metric in order to adapt to the special structure features of<br />
homogeneous and heterogeneous documents.<br />
—Data Representation. Like XCLS, XEdge represents XML documents as ordered data<br />
trees where leaf nodes represent content values. Instead of summarizing the distinct<br />
nodes as XCLS, XEdge introduces the LevelEdge representation, which summarizes all<br />
distinct parent/child relationships (distinguished by a number associated to each edge as<br />
shown in Fig. 11(a)) in each level of the XML document, as shown in Fig. 11(c). The<br />
distinct edges are first encoded as integers and those integers are used to construct the<br />
LevelEdge representation.<br />
—Similarity Computation. In order to compute the similarity among XML documents,<br />
XEdge proposes two similarity measures to distinguish between similarity computation<br />
among homogeneous XML documents and similarity computation among heterogeneous<br />
ones. Assuming that homogeneous XML documents are derived from sub-<br />
DTDs of the same DTD, XEdge only searches for common edges in the same levels in<br />
both documents. To measure the similarity between two homogeneous XML documents<br />
represented as LevelEdge L 1 and L 2 , it uses the following similarity function:<br />
ACM Computing Surveys, Vol. , No. , 2009.