24.07.2013 Views

Onto.PT: Towards the Automatic Construction of a Lexical Ontology ...


Chapter 9. Final discussion

In any case, before growing in terms of new relations and exploited resources, Onto.PT will probably shrink, as we will try to minimise some problems, including incorrect extractions, that currently add noise to its contents. Because it is generated by an automatic approach, Onto.PT is still far from being highly reliable, even though it is very large, has broad coverage, and has been shown to be useful. Some steps should therefore be taken to improve its quality.

The manual evaluation of the extracted semantic relations is an important source for identifying specific problems, which might lead to future changes in the extraction grammars or in the filters applied after extraction. We can also exploit other sources of information to compute the confidence of the extracted semantic relations. A common approach to this task relies on similarity measures based on the occurrences of the relation arguments in corpora or on the Web (Downey et al., 2005; Costa et al., 2011). These approaches could also be combined with other kinds of information, including the frequency of the extracted relation, the confidence in the resource or method that led to its extraction,² and, as mentioned earlier, the occurrence of the relation in some resource in another language, after its translation. In a similar fashion to the work of Wandmacher et al. (2007), this could lead to an integrated confidence value.
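
As a rough illustration of how such evidence might be combined, the sketch below computes a pointwise mutual information (PMI) score from corpus counts and folds it into a weighted average with the other signals mentioned above. The function names, the weights, the logistic squashing of PMI, and all counts are assumptions made for illustration only; they are not taken from Onto.PT or from the cited works.

```python
import math

def pmi(count_xy, count_x, count_y, total):
    """Pointwise mutual information of two relation arguments, from corpus counts."""
    if min(count_xy, count_x, count_y) == 0:
        return 0.0  # assumption: no evidence is treated as neutral, not as -infinity
    return math.log2((count_xy * total) / (count_x * count_y))

def squash(score):
    """Map an unbounded score to (0, 1) with a logistic function (an assumption)."""
    return 1.0 / (1.0 + math.exp(-score))

def integrated_confidence(signals, weights):
    """Weighted average of the available signals, each expected to be in [0, 1]."""
    used = [name for name in weights if name in signals]
    total_weight = sum(weights[name] for name in used)
    if total_weight == 0:
        return 0.0
    return sum(weights[name] * signals[name] for name in used) / total_weight

# Hypothetical weights; in practice they would have to be tuned or learned.
WEIGHTS = {"cooccurrence": 0.4, "frequency": 0.2, "source": 0.3, "translation": 0.1}

# Hypothetical evidence for one extracted relation.
signals = {
    "cooccurrence": squash(pmi(count_xy=50, count_x=1000, count_y=2000, total=1_000_000)),
    "frequency": min(3 / 5, 1.0),  # extracted 3 times, saturating at 5 extractions
    "source": 0.9,                 # e.g. higher for a handcrafted thesaurus
    "translation": 1.0,            # relation also found in a resource in another language
}
confidence = integrated_confidence(signals, WEIGHTS)
print(round(confidence, 3))
```

A weighted average is only one of many possible combination schemes; it is used here because it keeps each source of evidence easy to inspect and to switch off.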

Besides other benefits, a confidence value would make it possible to integrate only those relations whose confidence is above a predefined threshold. This threshold could be an additional parameter to study in an extensive evaluation of ECO and Onto.PT. Such an evaluation would compare the impact of different parameters, including, but not limited to, the clustering thresholds and the similarity measures used in ontologisation. The idea would be to select different parameters to create different versions of the resource and then compare properties such as covered lexical items and semantic relations, sense granularity, or synset size. Another parameter could be a threshold on the corpus frequency of the integrated lexical items. Since much of the covered knowledge is extracted from dictionaries, Onto.PT contains some infrequent, and possibly less useful, words. Besides being of no use for several applications, those words might also act as additional sources of noise.
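
The kind of parameter study described above could be sketched as follows: for each combination of a confidence threshold and a minimum corpus frequency, keep only the relations that pass both filters and report how many relations and lexical items survive. The tuple layout, the sample triples, the frequencies, and the threshold values are all hypothetical and serve only to illustrate the comparison; they are not Onto.PT data.

```python
from itertools import product

# (argument 1, relation, argument 2, confidence): made-up examples
relations = [
    ("cão", "hyponym-of", "animal", 0.92),
    ("roda", "part-of", "carro", 0.80),
    ("alfarrábio", "synonym-of", "livro", 0.65),
    ("carro", "hyponym-of", "animal", 0.15),
]

# Made-up corpus frequencies of the lexical items
corpus_frequency = {
    "cão": 5200, "animal": 8100, "roda": 2300,
    "carro": 9400, "alfarrábio": 3, "livro": 12000,
}

def filter_relations(relations, corpus_frequency, min_confidence, min_frequency):
    """Keep relations above the confidence threshold whose arguments are frequent enough."""
    kept = []
    for arg1, rel, arg2, conf in relations:
        if conf < min_confidence:
            continue
        if min(corpus_frequency.get(arg1, 0), corpus_frequency.get(arg2, 0)) < min_frequency:
            continue
        kept.append((arg1, rel, arg2, conf))
    return kept

def summarise(kept):
    """Count the surviving relations and the distinct lexical items they cover."""
    items = {word for arg1, _, arg2, _ in kept for word in (arg1, arg2)}
    return {"relations": len(kept), "lexical_items": len(items)}

# Build one "version" of the resource per parameter combination and compare them.
for min_conf, min_freq in product((0.1, 0.5), (0, 10)):
    kept = filter_relations(relations, corpus_frequency, min_conf, min_freq)
    print(min_conf, min_freq, summarise(kept))
```

A fuller study would add the other properties mentioned above, such as sense granularity and synset size, which this sketch does not model.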

One final mention should be given to the adoption of the new spelling reform of Portuguese, which was agreed by the governments of the Portuguese-speaking countries in 1990 but only started to be implemented in 2009. This reform aims to unify the orthography of the European and Brazilian variants of Portuguese. In Onto.PT, however, we have not adopted this reform because:

• The resources we have exploited have not yet been converted. As some of them are written in the European variant of Portuguese, others in the Brazilian, and others in both, most of the written variations are covered, as well as some dropped forms;

• The transition period, during which the use of dropped written forms is tolerated, is still under way in most of the countries (in Portugal, it ends in 2015);

• There is still a huge ongoing debate about the adoption of this reform and about its real benefits for Portuguese.

Nevertheless, we believe that an eventual conversion of Onto.PT to the new spelling

² Given that a thesaurus such as TeP is created manually, there is certainly more confidence in a relation acquired directly from it than in one extracted from text by an automatic procedure.
