PDF (1MB) - QUT ePrints

More documents

Recommendations

Info

Table II: Features of XML Data Clustering Approaches ACM Computing Surveys, Vol. , No. , 2009. Prototypes ❵❵❵❵❵❵❵❵❵❵❵❵ Criteria XClust XMine PCXSS SeqXClust XCLS XEdge PSim XProj SemXClust S-GRACE SumXClust VectXClust DFTXClust Clustered DTD DTD& XSD XSD XML XML XML XML XML XML XML XML XML data type schema XSD schema schema schema documents documents documents documents documents documents documents documents documents Data rooted unordered rooted ordered rooted ordered rooted ordered rooted ordered rooted ordered rooted rooted ordered rooted directed rooted ordered binary time representation labeled tree labeled tree labeled tree labeled tree labeled tree labeled tree labeled tree labeled tree labeled tree graph labeled tree branch vector series Exploited elements, attributes, elements, attributes, elements, attributes, level information subtrees whole XML tree whole XML tree whole XML whole XML node paths edge information paths substructures objects paths paths node context elements (tree tuple) (s-graph) (structural summaries) tree tree name, linguistic, name, name, Schema matching cardinality, hierarchical structure, data type, data type, — — — — — — — — — path context path similarity constraint node context Similarity level edge path frequent content & distance structural approach Tree matching — — — — similarity similarity similarity substructure structural similarity metric distance — — Vector binary branch DFT — — — — — — — — — — — based distance distance Similarity DTD similarity schema similarity schema similarity edge similarity path similarity distance fully connected — — — — — representation matrix matrix matrix matrix matrix array graph — Clustering hierarchical constrained graph partitioning partition-base partitional hierarchical single link hierarchical incremental incremental K-means algorithm agglomerative hierarchical agglomerative greedy algorithm approach XTrK-means algorithm ROCK approach hierarchical — — Cluster hierarchy of hierarchy of hierarchy of representation schema classes schema classes K-clusters schema classes K-clusters K-clusters K-clusters K-clusters K-clusters K-clusters K-clusters — — (dendrogram) (dendrogram) (dendrogram) quality of inter & intra-cluster inter & intra-cluster inter & intra-cluster external& internal measures Precision closeness precision Quality criteria integrated purity, entropy, FScore FScore — inter & intra-cluster standard deviation — — FScore FScore FScore recall recall schema inter & intra cluster outlier ratio Efficiency response time& space page I/O response response response response — — — — — evaluation time complexity time time time time — Application XML data general XML general XML general XML general XML general XML XML storage XML data XML data XML query hierarchical structural Query Storage & retrieval area integration management management management management management for query processing mining mining processing management processing of large documents XML Data Clustering: An Overview · 37
38 · Alsayed Algergawy et al. large amounts of data. However, no evaluation work has been done considering a tradingoff between the two performance aspects. Finally, clustering XML data is useful in numerous applications. Table II shows that some prototypes have been developed and implemented for specific XML applications, such as XClust for DTDs integration and S-GRACE for XML query processing, while other prototypes have been developed for generic applications, such as XMine and XCLS for XML management in general. 7. SUMMARY AND FUTURE DIRECTIONS Clustering XML data is a basic problem in many data application domains such as Bioinformatics, XML information retrieval, XML data integration, Web mining, and XML query processing. In this paper, we conducted a survey about XML data clustering methodologies and implementations. In particular, we proposed a classification scheme for these methodologies based on three major criteria: the type of clustered data (clustering either XML documents or XML schemas), the used similarity measure approach (either the tree similarity-based approach or the vector-based approach), and the used clustering algorithm. As well as, in order to base the comparison between their implementations, we proposed a generic clustering framework. In this paper, we devised the main current approaches for clustering XML data relying on similarity measures. The discussion also pointed out new research trends that are currently under investigation. In the remainder we sum up interesting research issues that still deserve attention from the research community classified according to four main directions: Semantics, performance, trade-off between clustering quality and clustering scalability, and application context directions. —Semantics. More semantic information should be integrated in clustering techniques in order to improve the quality of the generated clusters. Examples of semantic information are: Schema information that can be associated with documents or extracted from a set of documents; Ontologies and Thesauri, that can be exploited to identify similar concepts; Domain knowledge, that can used within a specific context to identify as similar concepts that in general are retained different (for example, usually papers in proceedings are considered different from journal papers, however a paper in the VLDB proceedings has the same relevance as a paper in an important journal). The semantics of the data to be clustered can also be exploited for the selection of the measure to be employed. Indeed, clustering approaches mainly rely on well-known fundamental measures (e.g., tree edit distance, cosine and Manhattan distances) that are general purpose and cannot be easily adapted to the characteristics of data. A top-down or bottom-up strategy can be followed for the specification of these kinds of functions. Following the top-down strategy, a user interface can be developed offering the user a set of buildingblocks similarity measures (including the fundamental ones) that can be applied on different parts of XML data, and users can compose these measures in order to obtain a new one that is specific for their needs. The user interface shall offer the possibility to compare the effectiveness of different measures and choose the one that best suits their needs. In this strategy, the user is in charge of specifying the measure to be used for clustering. In the context of approximate retrieval of XML documents, the multisimilarity system arHex [Sanz et al. 2006] has been proposed. Users can specify and compose their own similarity measures and compare the retrieved results depending on ACM Computing Surveys, Vol. , No. , 2009.
Page 1 and 2: This is the author’s version of a
Page 3 and 4: 2 · Alsayed Algergawy et al. been
Page 5 and 6: 4 · Alsayed Algergawy et al. featu
Page 7 and 8: 6 · Alsayed Algergawy et al. large
Page 9 and 10: 8 · Alsayed Algergawy et al. now a
Page 11 and 12: 10 · Alsayed Algergawy et al. —p
Page 13 and 14: 12 · Alsayed Algergawy et al. Fig.
Page 15 and 16: 14 · Alsayed Algergawy et al. data
Page 17 and 18: 16 · Alsayed Algergawy et al. Fig.
Page 19 and 20: 18 · Alsayed Algergawy et al. Tabl
Page 21 and 22: 20 · Alsayed Algergawy et al. as f
Page 23 and 24: 22 · Alsayed Algergawy et al. Expe
Page 25 and 26: 24 · Alsayed Algergawy et al. merg
Page 27 and 28: 26 · Alsayed Algergawy et al. The
Page 29 and 30: 28 · Alsayed Algergawy et al. (a)
Page 31 and 32: 30 · Alsayed Algergawy et al. (a)
Page 33 and 34: 32 · Alsayed Algergawy et al. in S
Page 35 and 36: 34 · Alsayed Algergawy et al. 6.2
Page 37: 36 · Alsayed Algergawy et al. —S
Page 41 and 42: 40 · Alsayed Algergawy et al. mult
Page 43 and 44: 42 · Alsayed Algergawy et al. HUCK
Page 45: 44 · Alsayed Algergawy et al. WANG

PDF (1MB) - QUT ePrints

You also want an ePaper? Increase the reach of your titles

Delete template?

Save as template?