NUI Galway – UL Alliance First Annual ENGINEERING AND - ARAN ...
1. INTRODUCTION
The Web now encompasses a substantial portion of human knowledge.
However, individual pieces of that knowledge are interleaved with
a lot of noise and scattered among many information resources of
varying relevance. It is therefore often difficult, or even
infeasible, to extract what one needs to know from the ballast of
largely irrelevant content. Our goal is to contribute to giving more
meaning to the content on the Web by identifying and interlinking the
relevant knowledge, so that machines can help humans use it more
efficiently. We are particularly interested in giving a clear
empirical grounding to the symbolic meaning being extracted from or
asserted on the Web. To this end, we have investigated a generalised
extension of the principles of distributional semantics (a sub-field
of computational linguistics). The extension is meant to allow for
the identification of general, reliable patterns and links among
loosely structured, heterogeneous and noisy (Semantic) Web data, and
to enable meaningful symbolic reasoning with such content.
2. SOLUTION
Our approach is partly motivated by a recently devised method for
universal distributional representation and analysis of natural
language corpora [1]. We generalise and extend the tensor-based
representation of weighted co-occurrence relations between natural
language expressions proposed in [1] to reflect the domain of Linked
Data and knowledge on the Semantic Web. Moreover, we provide novel
methods for smoothly combining the distributional and symbolic
semantic levels in order to allow for automated formal reasoning
about the empirically grounded knowledge emerging from the Web.
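To make the idea of a tensor-based representation of weighted co-occurrence relations more concrete, the following is a minimal sketch only. It assumes a sparse third-order tensor stored as a dictionary of (subject, link, object) tuples, unfolds it into one matrix "perspective" and compares two rows by cosine similarity. The tuples, the weights and the function names `matricise` and `cosine` are illustrative assumptions, not taken from [1] or from our implementation.

```python
from collections import defaultdict
from math import sqrt

# Toy sparse third-order tensor: (subject, link, object) -> weight.
# All tuples and weights are invented for illustration.
TUPLES = {
    ("protein", "is_a", "molecule"): 0.9,
    ("enzyme", "is_a", "molecule"): 0.8,
    ("enzyme", "is_a", "protein"): 0.7,
    ("protein", "part_of", "cell"): 0.6,
    ("enzyme", "part_of", "cell"): 0.5,
    ("cell", "part_of", "tissue"): 0.9,
}


def matricise(tensor):
    """Unfold the tensor into rows keyed by subject, with (link, object) columns."""
    rows = defaultdict(dict)
    for (subj, link, obj), weight in tensor.items():
        rows[subj][(link, obj)] = weight
    return dict(rows)


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dictionaries."""
    dot = sum(w * v[k] for k, w in u.items() if k in v)
    norm_u = sqrt(sum(w * w for w in u.values()))
    norm_v = sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


rows = matricise(TUPLES)
sim_protein_enzyme = cosine(rows["protein"], rows["enzyme"])
sim_protein_cell = cosine(rows["protein"], rows["cell"])
print(sim_protein_enzyme, sim_protein_cell)
```

In this toy data, "protein" and "enzyme" share two (link, object) columns and therefore come out as similar, whereas "protein" and "cell" share none. Other unfoldings of the same tensor (for instance, rows keyed by link) would give different perspectives on the same data.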
Empirically Grounded Linked Knowledge
Vít Nováček
Digital Enterprise Research Institute (DERI),
National University of Ireland Galway (NUIG)
IDA Business Park, Lower Dangan, Galway, Ireland
e-mail: vit.novacek@deri.org
So far, we have delivered EUREEKA [2] and CORAAL [3], two prototypes
that implement particular aspects of the general theoretical
principles of our approach. EUREEKA is a software library for the
extraction, representation, integration and processing of emergent
knowledge. Currently it focuses mainly on knowledge acquired from
natural language texts and existing ontologies. It implements a
specific multi-context perspective of the basic corpus data, in which
uncertain binary relations may be further specified by an arbitrary
number of additional context arguments (explicitly attached
provenance, time, space, etc.). It deals with a single
sub-perspective of entities and their corresponding relationships,
employing selective distance-based similarities for the execution of
queries, the evaluation of inference rules and simple analogical
reasoning. The entities are ranked by means of a generalised
information retrieval (IR) approach, and the ranking values support a
basic implementation of anytime versions of the reasoning and
querying algorithms. See [2] for more details. The EUREEKA library
can be downloaded at http://pypi.python.org/pypi/eureeka/0.1.
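The combination of uncertain relations with context arguments, distance-based similarity and rank-driven anytime querying can be illustrated roughly as follows. This is a hedged sketch and not the EUREEKA API: the `Statement` class, the certainty range of [-1, 1], the similarity 1 / (1 + Euclidean distance) restricted to the query features, the rank values and all entity names are assumptions made for the example; the actual algorithms are described in [2].

```python
from dataclasses import dataclass, field
from math import sqrt


@dataclass
class Statement:
    """Hypothetical uncertain binary relation with optional context arguments."""
    subject: str
    link: str
    obj: str
    certainty: float                               # assumed range: [-1, 1]
    context: dict = field(default_factory=dict)    # e.g. provenance, time


# Placeholder data, invented for illustration only.
STATEMENTS = [
    Statement("drug_a", "treats", "condition_x", 0.9, {"provenance": "doc-1"}),
    Statement("drug_a", "is_a", "drug", 1.0),
    Statement("drug_b", "treats", "condition_x", 0.7, {"provenance": "doc-2"}),
    Statement("drug_b", "is_a", "drug", 1.0),
    Statement("drug_c", "treats", "condition_x", -0.4, {"provenance": "doc-3"}),
    Statement("drug_c", "is_a", "drug", 0.3),
]

# Assumed entity ranks; the generalised IR ranking itself is not modelled here.
RANKS = {"drug_a": 0.9, "drug_b": 0.6, "drug_c": 0.2}


def entity_vectors(statements):
    """Represent each subject as a sparse vector over (link, object) features."""
    vectors = {}
    for s in statements:
        vectors.setdefault(s.subject, {})[(s.link, s.obj)] = s.certainty
    return vectors


def selective_similarity(query, vector):
    """Similarity from Euclidean distance over the query's features only."""
    dist = sqrt(sum((c - vector.get(f, 0.0)) ** 2 for f, c in query.items()))
    return 1.0 / (1.0 + dist)


def anytime_query(query, vectors, ranks, budget):
    """Inspect at most `budget` entities in descending rank order."""
    ordered = sorted(vectors, key=lambda e: ranks.get(e, 0.0), reverse=True)
    scored = [(selective_similarity(query, vectors[e]), e) for e in ordered[:budget]]
    return [entity for _, entity in sorted(scored, reverse=True)]


VECTORS = entity_vectors(STATEMENTS)
QUERY = {("treats", "condition_x"): 1.0, ("is_a", "drug"): 1.0}
print(anytime_query(QUERY, VECTORS, RANKS, budget=2))
```

The `budget` parameter stands in for the anytime behaviour: a small budget returns an answer based on the highest-ranked entities only, and a larger budget refines it by inspecting more of them.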
On top of EUREEKA, we built CORAAL [3], an intelligent publication
search engine deployed primarily in the life sciences domain. In a
nutshell, CORAAL extracts knowledge in the form of
argument-link-argument statements, each associated with a positive or
negative certainty value and with publication provenance information
(i.e., which publications were used for the extraction and/or
inference of the statement). The extracted knowledge can be
automatically integrated with existing domain resources (such as
machine-readable life science thesauri) and augmented or refined by
means of the EUREEKA reasoning services. The content of the CORAAL
knowledge base is served to users via a search interface that allows
for complex statement queries in addition to classical full-text
search. The search results can be browsed and filtered along multiple
facets, which enables users to quickly pinpoint the knowledge (i.e.,
the statements) that interests them, as well as the publications it
comes from. The CORAAL tool can be accessed at http://coraal.deri.ie.
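A statement query with faceted filtering of the kind described above might look roughly like the sketch below. It is only an illustration under stated assumptions: the `Extracted` record, the `match` and `facet_counts` helpers, the wildcard convention (`None` matches anything) and the placeholder statements are hypothetical and do not reflect the actual CORAAL interface or data.

```python
from collections import Counter
from typing import NamedTuple


class Extracted(NamedTuple):
    """Hypothetical argument-link-argument statement with certainty and provenance."""
    arg1: str
    link: str
    arg2: str
    certainty: float        # positive or negative
    sources: frozenset      # publications used for extraction or inference


# Placeholder knowledge base, invented for illustration only.
KB = [
    Extracted("gene_x", "regulates", "process_y", 0.8, frozenset({"paper-A", "paper-B"})),
    Extracted("gene_x", "is_a", "suppressor", 0.9, frozenset({"paper-A"})),
    Extracted("gene_x", "regulates", "process_z", 0.6, frozenset({"paper-C"})),
    Extracted("gene_w", "regulates", "gene_x", -0.5, frozenset({"paper-B"})),
]


def match(kb, arg1=None, link=None, arg2=None, min_certainty=0.0):
    """Return statements matching the pattern; None acts as a wildcard."""
    return [
        s for s in kb
        if (arg1 is None or s.arg1 == arg1)
        and (link is None or s.link == link)
        and (arg2 is None or s.arg2 == arg2)
        and s.certainty >= min_certainty
    ]


def facet_counts(results):
    """Count results per link type and per source publication."""
    return {
        "link": Counter(s.link for s in results),
        "source": Counter(src for s in results for src in s.sources),
    }


hits = match(KB, arg1="gene_x")
facets = facet_counts(hits)
# Narrow the results along the provenance facet.
from_paper_a = [s for s in hits if "paper-A" in s.sources]
print(facets["link"], len(from_paper_a))
```

The facet counts tell the user how the current result set is distributed before any filter is applied, and selecting a facet value (here the source "paper-A") narrows the results to the statements and publications of interest.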
3. CONCLUSION
We evaluated our approach via the CORAAL search engine. A sample of
actual users (experts in cancer research and clinical care) helped us
to assess various quantitative and qualitative aspects of the
emergent content processed within our framework. Our approach
outperformed the available baselines by rather large margins
(improvements in the range of 189% to 374%), which demonstrates its
promising potential. Note that this short paper is only a brief and
incomplete overview. Readers interested in learning more about our
research, its evaluation and the large volume of related work are
invited to have a look at [2].
Acknowledgements: This work has been supported by
Science Foundation Ireland under Grant No. SFI/08/CE/I1380.
REFERENCES
[1] M. Baroni and A. Lenci. Distributional memory: A general framework
for corpus-based semantics. Computational Linguistics, 2010.
[2] V. Nováček. EUREEKA! Towards a Practical Emergent Knowledge
Processing. PhD thesis, DERI, NUIG, 2010. Available at (Dec 2010):
http://goo.gl/M1WkO.
[3] V. Nováček, T. Groza, S. Handschuh, and S. Decker. Coraal – dive into
publications, bathe in the knowledge. Journal of Web Semantics,
8(2-3):176–181, 2010.