30.12.2014 Views

PDF (1MB) - QUT ePrints

PDF (1MB) - QUT ePrints

PDF (1MB) - QUT ePrints

SHOW MORE
SHOW LESS

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

XML Data Clustering: An Overview<br />

Alsayed Algergawy<br />

Magdeburg University,<br />

Marco Mesiti<br />

University of Milano,<br />

Richi Nayak<br />

Queensland University of Technology<br />

and<br />

Gunter Saake<br />

Magdeburg University<br />

In the last few years we have observed a proliferation of approaches for clustering XML documents<br />

and schemas based on their structure and content. The presence of such a huge amount<br />

of approaches is due to the different applications requiring the XML data to be clustered. These<br />

applications need data in the form of similar contents, tags, paths, structures and semantics. In<br />

this paper, we first outline the application contexts in which clustering is useful, then we survey<br />

approaches so far proposed relying on the abstract representation of data (instances or schema),<br />

on the identified similarity measure, and on the clustering algorithm. This presentation leads to<br />

draw a taxonomy in which the current approaches can be classified and compared. We aim at<br />

introducing an integrated view that is useful when comparing XML data clustering approaches,<br />

when developing a new clustering algorithm, and when implementing an XML clustering component.<br />

Finally, the paper moves into the description of future trends and research issues that still<br />

need to be faced.<br />

Categories and Subject Descriptors: H.2.8 [Database Management]: Database Applications— Data mining;—<br />

documentation; H.3 [Information Systems]: Information Search and Retrieval, Online Information Services<br />

General Terms: Documentation, Algorithms, Performance<br />

Additional Key Words and Phrases: XML data, Clustering, Tree similarity, Schema matching,<br />

Semantic similarity, Structural similarity<br />

1. INTRODUCTION<br />

The eXtensible Markup Language (XML) has emerged as a standard for information representation<br />

and exchange on the Web and the Intranet [Wilde and Glushko 2008]. Consequently,<br />

a huge amount of information is represented in XML and several tools have<br />

Author’s address: A. Algergawy, Magdeburg University, Computer Science Department, 39106 Magdeburg,<br />

Germany, email:alshahat@ovgu.de; Marco Mesiti, DICO - Universita’ di Milano, via Comelico 39/41, 20135<br />

Milano (Italy) e-mail: mesiti@dico.unimi.it; Richi Nayak, Queensland University of Technology, Faculty of<br />

Science and Technology, GPO Box 2434 Brisbane, Australia, email: r.nayak@qut.edu.au; G. Saake, Magdeburg<br />

University, Computer Science Department, 39106 Magdeburg, Germany, email:saake@iti.cs.uni-magdeburg.de<br />

Permission to make digital/hard copy of all or part of this material without fee for personal or classroom use<br />

provided that the copies are not made or distributed for profit or commercial advantage, the ACM copyright/server<br />

notice, the title of the publication, and its date appear, and notice is given that copying is by permission of the<br />

ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists requires prior specific<br />

permission and/or a fee.<br />

c○ 2009 ACM 1529-3785/2009/0700-0001 $5.00<br />

ACM Computing Surveys, Vol. , No. , 2009, Pages 1–0.

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!