Onto.PT: Towards the Automatic Construction of a Lexical Ontology ...


Chapter 8. Onto.PT: a lexical ontology for Portuguese

To give an idea of the quality of each relation type, table 8.6 presents the results of the evaluation of the 300 non-hypernymy tb-triples. The number of evaluated sb-triples is not enough to draw strong conclusions, but the quantity of correct dizSeDoQue tb-triples stands out. It is higher than the accuracy of extracting these relations from dictionaries (adj property-of v, 71-77% correct in section 4.2.5). Given that most of the problems with these relations were due to incorrect arguments, we view this improvement as a consequence both of the new lemmatisation rules and of the removal, performed before ontologisation, of tb-triples with arguments not occurring in the corpus. On the other hand, the quality of the parteDe sb-triples is quite low. As with hypernymy, the main problem with this relation seems to be the ambiguity and underspecification of its arguments, which has a negative impact both on the establishment of correct synsets with these words and on the ontologisation of tb-triples.

8.3.3 Global coverage

Compared with Princeton WordNet 3.0 (see section 3.1.1), which was developed manually between 1985 and 2006, Onto.PT v.0.35 has more sb-triples, because all of its relations can be inverted: Onto.PT contains about 346,000 sb-triples against the 285,000 of WordNet 3.0. This number, which may soon increase if more resources are exploited, highlights the potential of an automatic approach.
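As a minimal sketch of why invertible relations raise the triple count, the following Python fragment adds the inverse of every triple whose relation has a known inverse. The synset identifiers, the `Triple` type, the `INVERSE_OF` mapping and the inverse relation name `temParte` are illustrative assumptions, not names taken from this chapter; only `parteDe` appears in the text.

```python
from typing import Iterable, NamedTuple, Set


class Triple(NamedTuple):
    """A relation instance between two synsets (sb-triple)."""
    source: str    # hypothetical synset identifier
    relation: str
    target: str    # hypothetical synset identifier


# Hypothetical mapping from a relation to its inverse.
# "temParte" is an assumed name, used only for illustration.
INVERSE_OF = {"parteDe": "temParte", "temParte": "parteDe"}


def with_inverses(triples: Iterable[Triple]) -> Set[Triple]:
    """Return the input triples plus the inverse of each invertible one."""
    result = set(triples)
    for t in list(result):
        inverse_relation = INVERSE_OF.get(t.relation)
        if inverse_relation is not None:
            # Swap the arguments and use the inverse relation name.
            result.add(Triple(t.target, inverse_relation, t.source))
    return result


direct = [Triple("synset_A", "parteDe", "synset_B")]
expanded = with_inverses(direct)
print(len(direct), "->", len(expanded))
```

Under these assumptions, every directly stored triple yields one additional inverted triple, which is how a resource with invertible relations can report a larger total than one that counts each relation in a single direction.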

As this number is insufficient to quantify the coverage of Onto.PT, we evaluated its coverage of base concepts, which should be represented in wordnets. The Global WordNet Association⁵ provides several lists of such concepts. One of them contains 164 base concepts, referred to as the “most important” in the wordnets of English, Spanish, Dutch and Italian⁶. The concepts are divided into 98 abstract and 66 concrete, and are represented as Princeton WordNet 1.5 synsets.

In order to evaluate the global coverage of Onto.PT, we manually made rough matches between the 164 base concepts and Onto.PT synsets. Given the WordNet synset denoting each of the 164 concepts, we selected the Onto.PT synset closest to its meaning. In the end, we concluded that Onto.PT roughly covers most of the concepts in the list, more precisely 92 abstract and 61 concrete synsets (93%).
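The coverage figure above can be checked with a few lines of Python. This is only a sketch of the arithmetic: the counts come from the text, while the dictionary layout and the `coverage` helper are illustrative assumptions.

```python
# Counts reported in the text; the data layout is an illustrative assumption.
base_concepts = {"abstract": 98, "concrete": 66}
covered = {"abstract": 92, "concrete": 61}


def coverage(covered_count: int, total_count: int) -> float:
    """Return the fraction of base concepts that have a matching synset."""
    return covered_count / total_count


total = sum(base_concepts.values())      # 98 + 66 = 164
total_covered = sum(covered.values())    # 92 + 61 = 153
overall = coverage(total_covered, total)

for kind in base_concepts:
    print(f"{kind}: {coverage(covered[kind], base_concepts[kind]):.1%}")
print(f"overall: {total_covered}/{total} = {overall:.1%}")
```

The overall ratio is 153/164, which rounds to the 93% reported; the per-type ratios are computed here only for comparison and are not stated in the text.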

All the defined matches are reported in appendix B of this thesis. More precisely, the concrete concepts are in table B.1, and the abstract in tables B.2. There, we can see that the Onto.PT synsets are, on average, larger than WordNet’s. On the one hand, this means that they are very rich, with many synonyms: most synsets include various levels of language (formal, informal, figurative, older forms...) and variants of Portuguese (Portugal, Brazil, Africa). This can be very useful for tasks ranging from information retrieval (see section 8.4.3) to creative writing (e.g. poetry). On the other hand, the matches show that there are synsets that go beyond including only synonyms: most noisy items are more like near-synonyms and some are closely related words.

Considering just the abstract concepts not covered by Onto.PT (e.g. change magnitude, definite quantity, visual property), they seem to have been created ar-

5 See website at http://www.globalwordnet.org/ (September 2012)

6 See more about this list in http://www.globalwordnet.org/gwa/ewn_to_bc/corebcs.html (September 2012)
