29.06.2013 Views

NUI Galway – UL Alliance First Annual ENGINEERING AND - ARAN ...


Empirically Grounded Linked Knowledge

Vít Nováček
Digital Enterprise Research Institute (DERI),
National University of Ireland Galway (NUIG)
IDA Business Park, Lower Dangan, Galway, Ireland
e-mail: vit.novacek@deri.org

1. INTRODUCTION

The vast realms of the Web now encompass a substantial portion of human knowledge. However, the individual pieces of knowledge are interleaved with a lot of noise and scattered among many information resources of varying relevance. It is therefore often difficult or even infeasible to extract what one needs to know from the ballast of largely irrelevant content. Our goal is to contribute to giving more meaning to the content on the Web by identifying and interlinking relevant knowledge, so that machines can help humans use it more efficiently. We are particularly interested in giving a clear empirical grounding to the symbolic meaning being extracted from or asserted on the Web. To do so, we have investigated a generalised extension of the principles of distributional semantics (a sub-field of computational linguistics). This allows for the identification of general, reliable patterns and links among the loosely structured, heterogeneous and noisy (Semantic) Web data, and enables meaningful symbolic reasoning with such relevant content.

2. SOLUTION

Our approach is partly motivated by the recently devised method of universal distributional representation and analysis of natural language corpora [1]. We generalise and extend the tensor-based representation of weighted co-occurrence relations between natural language expressions proposed in [1] to reflect the domain of Linked Data and knowledge on the Semantic Web. Moreover, we provide novel methods for smoothly combining the distributional and symbolic semantic levels in order to allow for automated formal reasoning about the empirically grounded knowledge emerging from the Web.

As of now, we have delivered EUREEKA [2] and CORAAL [3], two prototypes that implement the general theoretical principles of our approach. EUREEKA is a software library for extraction, representation, integration and processing of emergent knowledge. Currently it focuses mainly on knowledge acquired from natural language texts and existing ontologies. It implements a specific multi-context perspective of the basic corpus data, where uncertain binary relations may be further specified by an arbitrary number of additional context arguments (explicitly attached provenance, time, space, etc.). It deals with a single sub-perspective of entities and their corresponding relationships, employing selective distance-based similarities for the execution of queries, evaluation of inference rules and simple analogical reasoning. The entities are ranked by means of a generalised IR approach, and the ranking values serve for a basic implementation of anytime versions of the reasoning and querying algorithms. See [2] for more details. The EUREEKA library can be downloaded at http://pypi.python.org/pypi/eureeka/0.1.
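To make the representation described above more concrete, the following is a minimal Python sketch; it is not the EUREEKA API. It assumes that uncertain binary relations are stored as weighted (subject, link, object) statements with optional context arguments, that each entity is described by a sparse vector over (link, object) pairs, and that cosine similarity stands in for the distance-based similarities mentioned in the text. All names (`Statement`, `entity_vector`, `rank_similar`) and the toy data are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from math import sqrt


@dataclass(frozen=True)
class Statement:
    """An uncertain binary relation with optional context arguments."""
    subject: str
    link: str
    obj: str
    weight: float        # certainty weight (assumed scale, for illustration)
    context: tuple = ()  # e.g. (("provenance", "doc-1"), ("time", "2010"))


def entity_vector(statements, entity):
    """Sparse vector of an entity over (link, object) dimensions."""
    vec = defaultdict(float)
    for s in statements:
        if s.subject == entity:
            vec[(s.link, s.obj)] += s.weight
    return dict(vec)


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = sqrt(sum(w * w for w in u.values()))
    norm_v = sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_similar(statements, entity):
    """Rank all other subject entities by similarity to `entity`."""
    target = entity_vector(statements, entity)
    others = {s.subject for s in statements} - {entity}
    scored = [(cosine(target, entity_vector(statements, e)), e) for e in others]
    # Highest similarity first; ties broken alphabetically.
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))


# Toy statements, invented purely for illustration.
STATEMENTS = [
    Statement("aspirin", "treats", "headache", 0.9, (("provenance", "doc-1"),)),
    Statement("aspirin", "type_of", "drug", 1.0),
    Statement("ibuprofen", "treats", "headache", 0.8, (("provenance", "doc-2"),)),
    Statement("ibuprofen", "type_of", "drug", 1.0),
    Statement("migraine", "type_of", "disease", 1.0),
]

RANKING = rank_similar(STATEMENTS, "aspirin")
```

In this toy setting, `RANKING` lists `ibuprofen` ahead of `migraine` because it shares both (link, object) dimensions with `aspirin`. A ranked list of this kind is what an anytime algorithm could consume in order, stopping early when its time budget runs out.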

On top of EUREEKA, we delivered CORAAL [3], an intelligent publication search engine deployed primarily in the life sciences domain. In a nutshell, CORAAL is able to extract knowledge in the form of argument-link-argument statements associated with a positive or negative certainty value and publication provenance information (i.e., which publications were used for extraction and/or inference of the statements). The extracted knowledge can be automatically integrated with existing domain resources (such as machine-readable life science thesauri) and augmented or refined by means of the EUREEKA reasoning services. The content of the CORAAL knowledge base is served to users via a search interface that allows for complex statement queries in addition to classical full-text capabilities. The search results can be browsed and filtered along multiple facets, which enables users to quickly pinpoint the knowledge (i.e., statements) that interests them, as well as the publication sources pertinent to it. The CORAAL tool can be accessed at http://coraal.deri.ie.
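The statement model and statement queries described above can be sketched as follows. This is an illustrative sketch only and does not reflect the actual CORAAL implementation: the class `KBStatement`, the functions `query` and `sources`, the assumed certainty range of [-1, 1], and the toy knowledge base are all hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class KBStatement:
    """An argument-link-argument statement with certainty and provenance."""
    arg1: str
    link: str
    arg2: str
    certainty: float               # assumed range [-1, 1]; negative = negated
    publications: FrozenSet[str]   # IDs of the publications it came from


def query(kb, arg1=None, link=None, arg2=None, min_abs_certainty=0.0):
    """Return statements matching the pattern; None acts as a wildcard."""
    hits = [
        s for s in kb
        if (arg1 is None or s.arg1 == arg1)
        and (link is None or s.link == link)
        and (arg2 is None or s.arg2 == arg2)
        and abs(s.certainty) >= min_abs_certainty
    ]
    # Most certain statements (positive or negative) first.
    return sorted(hits, key=lambda s: -abs(s.certainty))


def sources(hits):
    """Collect the publications pertinent to a set of statements."""
    return sorted(set().union(*(s.publications for s in hits)))


# Toy knowledge base, invented purely for illustration.
KB = [
    KBStatement("protein-X", "inhibits", "process-Y", 0.9,
                frozenset({"pub-12", "pub-40"})),
    KBStatement("protein-X", "causes", "process-Y", -0.7,
                frozenset({"pub-12"})),
    KBStatement("factor-Z", "causes", "process-Y", 0.95,
                frozenset({"pub-7"})),
]

HITS = query(KB, arg1="protein-X")
HIT_SOURCES = sources(HITS)
```

Here the optional pattern arguments play the role of facets: narrowing by `link` or raising `min_abs_certainty` filters the result set, while `sources` recovers the publications behind whatever statements remain.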

3. CONCLUSION

We evaluated our approach via the CORAAL search engine. A sample of actual users (experts in the domain of cancer research and clinical care) helped us to assess various quantitative and qualitative aspects of the emergent content processed within our framework. Our approach outperformed the available baselines by rather large margins (improvements in the range of 189% to 374%), demonstrating its promising potential. Note that this short paper is only a brief and incomplete overview. Readers interested in learning more about our research, its evaluation and the large volume of related work are invited to have a look at [2].

Acknowledgements: This work has been supported by Science Foundation Ireland under Grant No. SFI/08/CE/I1380.

REFERENCES

[1] M. Baroni and A. Lenci. Distributional memory: A general framework for corpus-based semantics. Computational Linguistics, 2010.

[2] V. Nováček. EUREEKA! Towards a Practical Emergent Knowledge Processing. PhD thesis, DERI, NUIG, 2010. Available at (Dec 2010): http://goo.gl/M1WkO.

[3] V. Nováček, T. Groza, S. Handschuh, and S. Decker. Coraal – dive into publications, bathe in the knowledge. Journal of Web Semantics, 8(2-3):176–181, 2010.
