24.07.2013 Views

Onto.PT: Towards the Automatic Construction of a Lexical Ontology ...


Chapter 1. Introduction

by the conceptual base, they establish a smaller synonymy network. This network is finally exploited for the identification of word clusters, which can be seen as new synsets.

3. Ontologisation: the lexical items in the arguments of the non-synonymy relation instances are attached to suitable synsets. Once again, this is achieved by exploiting the network established by all extracted relations in order to select, for a given relation instance, the most similar pair of candidate synsets.
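The ontologisation step can be illustrated with a minimal sketch. It is not the procedure used in ECO: the similarity measure (Jaccard overlap between the network neighbourhoods of two synsets), the function names and the toy English data are all assumptions made for illustration. The sketch only shows the general idea of choosing, for one term-based relation instance, the pair of candidate synsets that is most similar according to the network of extracted relations.

```python
from itertools import product


def neighbours(term, edges):
    """Return the term plus every term linked to it by an extracted relation."""
    linked = {term}
    for a, _relation, b in edges:
        if a == term:
            linked.add(b)
        elif b == term:
            linked.add(a)
    return linked


def synset_context(synset, edges):
    """Union of the neighbourhoods of all words in a synset."""
    context = set()
    for word in synset:
        context |= neighbours(word, edges)
    return context


def jaccard(x, y):
    """Jaccard overlap of two sets (assumed similarity measure)."""
    union = x | y
    return len(x & y) / len(union) if union else 0.0


def ontologise(instance, synsets, edges):
    """Attach both arguments of a term-based instance to their best synset pair."""
    a, relation, b = instance
    candidates_a = [s for s in synsets if a in s]
    candidates_b = [s for s in synsets if b in s]
    if not candidates_a or not candidates_b:
        return None  # at least one argument has no candidate synset
    best_a, best_b = max(
        product(candidates_a, candidates_b),
        key=lambda pair: jaccard(
            synset_context(pair[0], edges), synset_context(pair[1], edges)
        ),
    )
    return best_a, relation, best_b


# Hypothetical toy data: "bank" is ambiguous between two synsets.
edges = [
    ("river", "has_part", "bank"),
    ("river", "has_part", "shore"),
    ("money", "found_in", "financial_institution"),
]
synsets = [
    frozenset({"river", "stream"}),
    frozenset({"bank", "shore"}),
    frozenset({"bank", "financial_institution"}),
]
result = ontologise(("river", "has_part", "bank"), synsets, edges)
```

In this toy network, the instance is attached to the synset containing "bank" and "shore", because its neighbourhood overlaps more with that of the "river" synset. Note that only the relation network is consulted, not the text from which the instance was extracted.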

As the resulting resource is structured in synsets and semantic relations between them, it can be seen as a wordnet. Given the three aforementioned steps, this approach for creating wordnets automatically was named ECO, which stands for Extraction, Clustering and Ontologisation.

1.3 Contributions

Given our main goal, Onto.PT can be seen as the main contribution of this research. Onto.PT is a wordnet-like lexical ontology for Portuguese, whose current version integrates lexical-semantic knowledge from five lexical resources, more precisely three dictionaries and two thesauri. After noticing that most of the Portuguese lexical resources were to some extent complementary (Santos et al., 2010; Teixeira et al., 2010), we integrated into Onto.PT those that were public.

The current version of Onto.PT contains more than 100,000 synsets and more than 170,000 labelled connections, which represent semantic relations. This new resource is a public alternative to existing Portuguese LKBs and can be used as a wordnet. This means that, for Portuguese, Onto.PT can be used in most NLP tasks that exploit the structure of a wordnet for achieving their goal, except for those that use the synset glosses, which are unavailable in Onto.PT.

But Onto.PT is not a static resource. It is created by a flexible three-step approach, ECO, briefly described in the previous section. ECO enables the integration of lexical-semantic knowledge from heterogeneous sources, and can be used to create different instances of the resource, using different parameters. Moreover, although it was applied only to the creation of Onto.PT, we propose ECO as an approach that may be adopted in the creation or enrichment of wordnets in other languages. It is thus another important contribution of this thesis.

Each step of ECO can also be seen individually as a contribution to the fields of information extraction and automatic creation of wordnets. These steps include procedures for:

1. Enriching an existing thesaurus with new synonyms.
2. Discovering synsets (or fuzzy synsets) from dictionary definitions.
3. Moving from term-based to synset-based semantic relations, without accessing the extraction context.
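To make the second procedure more concrete, the following sketch groups the words of a synonymy network into clusters that can be read as candidate synsets. It is a deliberately simplified stand-in, not the clustering procedure used in ECO: it takes the connected components of the network, which yields discrete clusters only, and the function name and the Portuguese word pairs are hypothetical.

```python
from collections import defaultdict


def cluster_synonymy_network(pairs):
    """Group words into clusters: the connected components of the synonymy graph."""
    graph = defaultdict(set)
    for a, b in pairs:
        graph[a].add(b)
        graph[b].add(a)

    seen = set()
    clusters = []
    for start in graph:
        if start in seen:
            continue
        # Depth-first traversal collecting every word reachable from `start`.
        component = set()
        stack = [start]
        while stack:
            word = stack.pop()
            if word in component:
                continue
            component.add(word)
            stack.extend(graph[word] - component)
        seen |= component
        clusters.append(frozenset(component))
    return clusters


# Hypothetical synonymy pairs, as might be extracted from dictionary definitions.
pairs = [
    ("carro", "automóvel"),
    ("automóvel", "viatura"),
    ("casa", "lar"),
]
clusters = cluster_synonymy_network(pairs)
```

With connected components, a polysemous word would merge unrelated senses into a single cluster. Handling that case calls for a finer-grained method, for instance one that lets a word belong to several clusters, as the mention of fuzzy synsets suggests.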

On the other hand, the procedure for extracting semantic relations from dictionaries cannot be seen as novel. Still, we have compared the structure and contents of different dictionaries of Portuguese, which led to the conclusion that many regularities are kept across the definitions of each dictionary. This comparison,
