24.07.2013 Views

Onto.PT: Towards the Automatic Construction of a Lexical Ontology ...


Abstract

The existence of a broad-coverage lexical-semantic knowledge base has a positive impact on the computational processing of its target language. This is the case of Princeton WordNet, for English, which has been used in a wide range of natural language processing (NLP) tasks. WordNet is, however, created manually by experts. So, although this ensures highly reliable contents, its creation is expensive and time-consuming, with negative consequences for the coverage and growth of the resource.

For Portuguese, there are several lexical-semantic knowledge bases, but none of them is as successful as WordNet is for English. Moreover, all of them have limitations, which range from not handling ambiguity at the word level and having limited coverage (e.g. only nouns, or only synonymy relations) to availability restrictions.

With this in mind, we set the final goal of this research as the automatic construction of Onto.PT, a lexical ontology for Portuguese, structured in a similar fashion to WordNet. Onto.PT contains synsets (groups of synonymous words which are lexicalisations of a concept) and semantic relations, held between synsets. For this purpose, we took advantage of information extraction techniques and focused on the development of computational tools for the acquisition and organisation of lexical-semantic knowledge from text.
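To make the structure described above concrete, the following is a minimal sketch of how synsets and the relations held between them could be represented. The class names, the relation label `hyponym_of`, and the example words are illustrative assumptions, not the actual Onto.PT data model or contents.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Synset:
    """A group of synonymous words that lexicalise one concept."""
    words: frozenset


@dataclass(frozen=True)
class Relation:
    """A typed semantic relation held between two synsets (not two words)."""
    source: Synset
    name: str
    target: Synset


# Hypothetical example entries, chosen only for illustration.
cat = Synset(frozenset({"gato", "bichano"}))
animal = Synset(frozenset({"animal", "bicho"}))
rel = Relation(cat, "hyponym_of", animal)
```

Because the relation connects synsets rather than orthographical forms, every word in the source synset inherits it, which is what distinguishes this structure from a plain word-level network.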

Our work starts by exploring textual sources for the extraction of relations, connecting lexical items according to their possible senses. Dictionaries were our first choice, because they are structured in words and meanings, and cover a large part of the lexicon. But, as natural language is ambiguous, a lexical item, identified by its orthographical form, is sometimes not enough to denote a concept. Therefore, in a second step, we use a synset-based thesaurus for Portuguese as a starting point. The synsets of this thesaurus are augmented with new synonyms acquired in the first step, and new synsets are discovered from the remaining synonymy relations, after the identification of word clusters. In the last step, the whole set of extracted relations is exploited for attaching the arguments of the non-synonymy relations to the most suitable synsets available.
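The second and third steps above can be sketched as follows, taking the word-level relations from the first step as given. This is a toy illustration under strong assumptions: the function names are hypothetical, connected components stand in for the word-clustering method, and "most suitable synset" is reduced to "first synset containing the word". The abstract does not specify the actual algorithms, and the example words are invented.

```python
from collections import defaultdict


def augment_synsets(synsets, synonym_pairs):
    """Step 2a: add extracted synonyms to thesaurus synsets.

    Returns the augmented synsets and the pairs that matched no synset.
    """
    synsets = [set(s) for s in synsets]
    leftover = []
    for a, b in synonym_pairs:
        hosts = [s for s in synsets if a in s or b in s]
        if hosts:
            hosts[0].update((a, b))  # naive choice: first matching synset
        else:
            leftover.append((a, b))
    return synsets, leftover


def cluster_leftover(pairs):
    """Step 2b: discover new synsets from the remaining synonymy pairs.

    Connected components are a stand-in for the clustering step.
    """
    graph = defaultdict(set)
    for a, b in pairs:
        graph[a].add(b)
        graph[b].add(a)
    seen, clusters = set(), []
    for word in graph:
        if word in seen:
            continue
        stack, component = [word], set()
        while stack:
            w = stack.pop()
            if w in component:
                continue
            component.add(w)
            stack.extend(graph[w] - component)
        seen |= component
        clusters.append(component)
    return clusters


def attach(relations, synsets):
    """Step 3: map the word arguments of non-synonymy relations to synsets."""
    def find(word):
        for index, synset in enumerate(synsets):
            if word in synset:
                return index
        return None

    attached = []
    for a, name, b in relations:
        ia, ib = find(a), find(b)
        if ia is not None and ib is not None:
            attached.append((ia, name, ib))
    return attached


# Hypothetical inputs: one thesaurus synset plus word-level extractions.
thesaurus = [{"casa", "lar"}]
synonyms = [("casa", "moradia"), ("gato", "bichano"), ("animal", "bicho")]
relations = [("gato", "hyponym_of", "animal")]

augmented, leftover = augment_synsets(thesaurus, synonyms)
all_synsets = augmented + cluster_leftover(leftover)
attached = attach(relations, all_synsets)
```

Here `attached` refers to synsets by their index in `all_synsets`, so the word-level relation between "gato" and "animal" becomes a relation between the two newly discovered synsets.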

In this thesis, we describe each of the aforementioned steps and present the results they produce for Portuguese, together with their evaluation. Each step is a contribution to the automatic creation and enrichment of lexical-semantic knowledge bases, and results in a new resource, namely: a lexical network; a fuzzy and a simple thesaurus; and Onto.PT, a wordnet-like lexical ontology. An overview of the current version of Onto.PT is also provided, together with some scenarios where it may be useful. This resource, which can be further augmented, is freely available for download and can be used in a wide range of NLP tasks for Portuguese, as WordNet is for English. Despite the current limitations of an automatic creation approach, we believe that Onto.PT will contribute to advancing the state of the art in the computational processing of Portuguese.
