NUI Galway – UL Alliance First Annual ENGINEERING AND - ARAN ...
Provenance in the Web of Data:
a building block for user profiling and trust in online communities
Fabrizio Orlandi, Alexandre Passant
Digital Enterprise Research Institute
National University of Ireland, Galway
fabrizio.orlandi@deri.org - alexandre.passant@deri.org
Abstract
Online collaborative knowledge bases such as<br />
Wikipedia provide an extensive source of information,<br />
not only to their readers, but also to a wide range of<br />
applications and Web services. For example, DBpedia,<br />
one of the largest datasets on the Web of Data, is<br />
widely used as a reference for data interlinking and as<br />
a basis for applications employing Semantic Web<br />
technologies. Yet its dataset, directly derived from<br />
Wikipedia articles, could contain errors due to<br />
inexperience or anonymity of the contributors. By<br />
analysing the Wikipedia edit history and the users'<br />
contributions we provide detailed provenance<br />
information for DBpedia statements and we make this<br />
information publicly available on the Web of Data.<br />
The dataset we provide then serves as a basis for
analysing users' activities and interests and for computing
trust measures.
Collaborative websites such as Wikipedia have<br />
recently shown the benefit of being able to create and<br />
manage very large public knowledge bases. However,<br />
one of the most common concerns about these types of<br />
information sources is the trustworthiness of their<br />
content, which can be arbitrarily edited by anyone.
The DBpedia project¹, which aims at converting
Wikipedia content into structured knowledge, is
not exempt from this concern, especially since
one of its main objectives is to build a
dataset against which Semantic Web technologies can be
employed. Such a dataset makes it possible not only to
formulate sophisticated queries against Wikipedia, but
also to link it to other datasets on the Web, or to create
new applications or mashups. Thanks to its large
dataset and its cross-domain nature, DBpedia has
become one of the most important and interlinked
datasets on the Web of Data. Therefore, providing
information about where DBpedia data comes from and<br />
how it was extracted and processed is crucial. This type<br />
of information is called provenance and it describes the<br />
entire data life cycle, from its origin to its subsequent<br />
processing history.<br />
Having provenance information about Wikipedia<br />
data allows us to identify quality measures for<br />
Wikipedia articles and estimate the trustworthiness of<br />
their content. Then, since the DBpedia content is<br />
directly extracted from Wikipedia, the same trust and<br />
quality values can be propagated to the DBpedia<br />
dataset. We apply this process to DBpedia, but this is
just one particular use case; the same considerations
about provenance apply to any dataset on the
Web of Data. The benefits of using data provenance to
develop trust on the Web, and the Semantic Web in
particular, have already been widely described in the
literature. Provenance data provides useful
information such as timeliness and authorship of data.<br />
It can serve as a basis for various
applications and use cases, such as identifying trust
values for pages or page fragments, or measuring
users' expertise by analysing their contributions and
then personalizing trust metrics based on a person's
profile on a particular topic. Moreover,
also providing provenance metadata as RDF and making it
available on the Web of Data offers more interchange
possibilities and transparency. This would let people
link to provenance information from other sources and
gives them the opportunity to compare these sources
and choose the most appropriate one or the one of
highest quality. In our specific context of DBpedia, for
example, indicating by whom and when an RDF
triple was created (or contributed) could let any
application flag, reject or approve the statement based
on particular criteria.
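As an illustration of how an application might act on such provenance, the following minimal sketch decides whether to approve, flag or reject a statement from its contributor and creation time. The record layout, the `assess` function and the three criteria are hypothetical and chosen only for this example; they are not the model or the policy used in our framework.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional, Set, Tuple


@dataclass(frozen=True)
class StatementProvenance:
    """Hypothetical provenance record for a single RDF triple."""
    triple: Tuple[str, str, str]
    contributor: Optional[str]  # None models an anonymous contribution
    created: datetime


def assess(record: StatementProvenance,
           trusted: Set[str],
           cutoff: datetime) -> str:
    """Return 'approve', 'flag' or 'reject' under illustrative criteria."""
    # Assumed policy: statements without a known contributor are rejected.
    if record.contributor is None:
        return "reject"
    # Assumed policy: recent statements by trusted contributors are approved.
    if record.contributor in trusted and record.created >= cutoff:
        return "approve"
    # Everything else is kept but marked for review.
    return "flag"


if __name__ == "__main__":
    record = StatementProvenance(
        triple=("dbpedia:Galway", "dbpedia-owl:country", "dbpedia:Ireland"),
        contributor="alice",
        created=datetime(2010, 5, 1),
    )
    print(assess(record, trusted={"alice"}, cutoff=datetime(2010, 1, 1)))
```

A real consumer would replace these rules with its own criteria, for instance a trust score computed from the contributor's edit history.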
In our work [1][2] we propose a modelling solution<br />
to semantically represent information about provenance<br />
of data in DBpedia and an extraction framework<br />
capable of computing provenance for DBpedia<br />
statements using Wikipedia edits. The framework<br />
consists of: (i) a lightweight modelling solution to<br />
semantically represent provenance of both DBpedia<br />
resources and Wikipedia content, (ii) an information<br />
extraction process and a provenance-computation<br />
system combining Wikipedia articles' history with<br />
DBpedia information, (iii) a set of scripts to make<br />
provenance information about DBpedia statements<br />
directly available when browsing this source, (iv) a
publicly available web service that exposes our
provenance dataset in RDF as Linked Open Data, letting
software agents and developers consume it.
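To make the idea behind step (ii) more concrete, the sketch below attributes each property/value pair of an article's latest revision to the revision that introduced it, by walking the edit history from oldest to newest. It is a deliberately simplified illustration and not the extraction algorithm of our framework: the `Revision` structure, the flat property dictionary standing in for infobox content, and the `attribute_statements` function are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Revision:
    """Hypothetical, simplified view of one Wikipedia revision."""
    rev_id: int
    user: str
    timestamp: str
    properties: Dict[str, str]  # stand-in for extracted infobox values


def attribute_statements(
    history: List[Revision],
) -> Dict[Tuple[str, str], Revision]:
    """Map each (property, value) pair of the latest revision to the
    revision that started its current uninterrupted presence.

    `history` must be ordered from oldest to newest.
    """
    previous: Dict[Tuple[str, str], Revision] = {}
    for rev in history:
        current: Dict[Tuple[str, str], Revision] = {}
        for pair in rev.properties.items():
            # A pair that survived from the previous revision keeps its
            # earlier origin; a new or changed pair is credited to `rev`.
            current[pair] = previous.get(pair, rev)
        previous = current
    return previous


if __name__ == "__main__":
    history = [
        Revision(1, "alice", "2010-01-01", {"population": "70000"}),
        Revision(2, "bob", "2010-02-01",
                 {"population": "70000", "country": "Ireland"}),
        Revision(3, "carol", "2010-03-01",
                 {"population": "75000", "country": "Ireland"}),
    ]
    for (prop, value), rev in attribute_statements(history).items():
        print(prop, value, "introduced by", rev.user, "in revision", rev.rev_id)
```

Under this simplification, a value that is removed and later restored is credited to the contributor who restored it; a fuller treatment would need to decide how to handle reverts and vandalism.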
References<br />
[1] Orlandi F., Champin P-A., Passant A., “Semantic
Representation of Provenance in Wikipedia”, Semantic Web
Provenance Management workshop at ISWC2010,
CEUR-WS, Shanghai, 2010.
[2] Orlandi F., Passant A., “Modelling Provenance of<br />
DBpedia Resources Using Wikipedia Contributions”,<br />
Journal of Web Semantics, (to be published), 2011.<br />
* This work has been funded in part by Science Foundation Ireland under Grant No. SFI/08/CE/I1380 (Lion-2) and by an IRCSET Scholarship.<br />
1 http://dbpedia.org/<br />