24.07.2013 Views

Onto.PT: Towards the Automatic Construction of a Lexical Ontology ...


Chapter 1. Introduction

too much human effort, which may be seen as a bottleneck for the development of the resource. Moreover, language evolves with its users over time. As long as intensive labour is involved in manually encoding lexical resources, the lexical capabilities of NLP systems will be weak (Briscoe, 1991), and coverage limitations will always be present. The same happens for other kinds of knowledge base: handcrafting them is impractical and undesirable. We should therefore take advantage of available NLP tools in order to automate part of this task and reduce the need for manual input (Brewster and Wilks, 2004).

Having this in mind, especially before the establishment of WordNet, researchers studied how to automate the task of acquiring lexical-semantic knowledge from text, with relative success. For instance, MindNet (Richardson et al., 1998) showed that it is possible to develop a lexical(-semantic) knowledge base (LKB) by automatic means.

Another common alternative to the manual creation of wordnets is the translation of a target wordnet (usually Princeton WordNet) to other languages (de Melo and Weikum, 2008). However, another problem arises: because different languages represent different socio-cultural realities, they do not cover exactly the same part of the lexicon and, even where they seem to be common, several concepts are lexicalised differently (Hirst, 2004). Therefore, we believe that a wordnet for a language, whether created manually, semi-automatically or automatically, should be developed from scratch for that language.

As mentioned before, the manual creation of a knowledge base results in slow development and consequently in limited coverage, not only of lexical knowledge but, mostly, of world knowledge. This is why, after the establishment of WordNet, researchers using this resource as their only knowledge base soon had to cope with information sparsity issues. So, apart from the work on the automatic construction of LKBs from scratch, there have been automatic attempts to enrich wordnets (e.g. Hearst (1998)) and also to link them with other knowledge bases (e.g. Gurevych et al. (2012) or Hoffart et al. (2011)), in order to create broader resources.

In the work described in this thesis, we look at the Portuguese scenario, and tackle the limitations of the LKBs for this language. We developed an automatic approach for the acquisition of lexical-semantic information and for the creation of lexical-semantic computational resources, dubbed ECO (Extraction, Clustering, Ontologisation). The application of ECO to Portuguese resources resulted in a wordnet-like resource, dubbed Onto.PT. In the remainder of this chapter, we state the main goals of this research, briefly present the ECO approach, list the main contributions of this work, and describe the structure of the rest of this thesis.

1.1 Research Goals

For the Portuguese language, there have been some attempts to create a wordnet or a related resource (see more in section 3.1.2), but all of them have one or more of the following main limitations:

• They are proprietary and unavailable, or their utilisation is not free;

• They are handcrafted, and thus suffer from limited coverage;
