PDF (1MB) - QUT ePrints
PDF (1MB) - QUT ePrints
PDF (1MB) - QUT ePrints
Create successful ePaper yourself
Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.
XML Data Clustering: An Overview<br />
Alsayed Algergawy<br />
Magdeburg University,<br />
Marco Mesiti<br />
University of Milano,<br />
Richi Nayak<br />
Queensland University of Technology<br />
and<br />
Gunter Saake<br />
Magdeburg University<br />
In the last few years we have observed a proliferation of approaches for clustering XML documents<br />
and schemas based on their structure and content. The presence of such a huge amount<br />
of approaches is due to the different applications requiring the XML data to be clustered. These<br />
applications need data in the form of similar contents, tags, paths, structures and semantics. In<br />
this paper, we first outline the application contexts in which clustering is useful, then we survey<br />
approaches so far proposed relying on the abstract representation of data (instances or schema),<br />
on the identified similarity measure, and on the clustering algorithm. This presentation leads to<br />
draw a taxonomy in which the current approaches can be classified and compared. We aim at<br />
introducing an integrated view that is useful when comparing XML data clustering approaches,<br />
when developing a new clustering algorithm, and when implementing an XML clustering component.<br />
Finally, the paper moves into the description of future trends and research issues that still<br />
need to be faced.<br />
Categories and Subject Descriptors: H.2.8 [Database Management]: Database Applications— Data mining;—<br />
documentation; H.3 [Information Systems]: Information Search and Retrieval, Online Information Services<br />
General Terms: Documentation, Algorithms, Performance<br />
Additional Key Words and Phrases: XML data, Clustering, Tree similarity, Schema matching,<br />
Semantic similarity, Structural similarity<br />
1. INTRODUCTION<br />
The eXtensible Markup Language (XML) has emerged as a standard for information representation<br />
and exchange on the Web and the Intranet [Wilde and Glushko 2008]. Consequently,<br />
a huge amount of information is represented in XML and several tools have<br />
Author’s address: A. Algergawy, Magdeburg University, Computer Science Department, 39106 Magdeburg,<br />
Germany, email:alshahat@ovgu.de; Marco Mesiti, DICO - Universita’ di Milano, via Comelico 39/41, 20135<br />
Milano (Italy) e-mail: mesiti@dico.unimi.it; Richi Nayak, Queensland University of Technology, Faculty of<br />
Science and Technology, GPO Box 2434 Brisbane, Australia, email: r.nayak@qut.edu.au; G. Saake, Magdeburg<br />
University, Computer Science Department, 39106 Magdeburg, Germany, email:saake@iti.cs.uni-magdeburg.de<br />
Permission to make digital/hard copy of all or part of this material without fee for personal or classroom use<br />
provided that the copies are not made or distributed for profit or commercial advantage, the ACM copyright/server<br />
notice, the title of the publication, and its date appear, and notice is given that copying is by permission of the<br />
ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists requires prior specific<br />
permission and/or a fee.<br />
c○ 2009 ACM 1529-3785/2009/0700-0001 $5.00<br />
ACM Computing Surveys, Vol. , No. , 2009, Pages 1–0.