Onto.PT: Towards the Automatic Construction of a Lexical Ontology ...
4 Chapter 1. Introduction<br />
too much human effort, which may be seen as a bottleneck for the development of<br />
the resource. Not to mention that, over time, language evolves with its users.<br />
The truth is that as long as there is intensive labour involved in manually encoding<br />
lexical resources, the lexical capabilities of NLP systems will be weak (Briscoe,<br />
1991), and coverage limitations will always be present. The same happens for other<br />
kinds of knowledge base – handcrafting them is impractical and undesirable. We<br />
should therefore take advantage of available NLP tools in order to automate part of<br />
this task and reduce the need for manual input (Brewster and Wilks, 2004).<br />
With this in mind, especially before the establishment of WordNet, researchers<br />
studied how to automate the task of acquiring lexical-semantic knowledge from<br />
text, with relative success. For instance, MindNet (Richardson et al., 1998) showed<br />
that it is possible to develop a lexical(-semantic) knowledge base (LKB) by automatic<br />
means.<br />
Another common alternative to the manual creation of wordnets is the translation<br />
of a target wordnet (usually Princeton WordNet) to other languages (de Melo<br />
and Weikum, 2008). However, another problem arises: because different languages<br />
represent different socio-cultural realities, they do not cover exactly the same part<br />
of the lexicon and, even where they seem to be common, several concepts are lexicalised<br />
differently (Hirst, 2004). Therefore, we believe that a wordnet for a language,<br />
whether created manually, semi-automatically or automatically, should be developed<br />
from scratch for that language.<br />
As mentioned before, the manual creation of a knowledge base results in slow<br />
development and consequently in limited coverage, not only of lexical but, above all,<br />
of world knowledge. This is why, after the establishment of WordNet, researchers<br />
using this resource as their only knowledge base soon had to cope with information<br />
sparsity issues. So, apart from the work on the automatic construction of LKBs from<br />
scratch, there have been attempts to enrich wordnets automatically (e.g. Hearst (1998))<br />
and also to link them with other knowledge bases (e.g. Gurevych et al. (2012) or<br />
Hoffart et al. (2011)), in order to create broader resources.<br />
In the work described in this thesis, we look at the Portuguese scenario, and<br />
tackle the limitations of the LKBs for this language. We developed an automatic<br />
approach for the acquisition of lexical-semantic information and for the creation<br />
of lexical-semantic computational resources, dubbed ECO – Extraction, Clustering,<br />
Ontologisation. The application of ECO to Portuguese resources resulted in a<br />
wordnet-like resource, dubbed Onto.PT. In the remainder of this chapter, we state<br />
the main goals of this research, briefly present the ECO approach, list the main<br />
contributions of this work, and describe the structure of the rest of this thesis.<br />
1.1 Research Goals<br />
For the Portuguese language, there have been some attempts to create a wordnet<br />
or a related resource (see more in section 3.1.2), but all of them have one or more<br />
of the following main limitations:<br />
• They are proprietary and unavailable, or their utilisation is not free;<br />
• They are handcrafted, and thus suffer from limited coverage;