Onto.PT: Towards the Automatic Construction of a Lexical Ontology ...
144 Chapter 8. Onto.PT: a lexical ontology for Portuguese
To give an idea of the quality of each relation type, table 8.6 presents the results
of the evaluation of the 300 non-hypernymy tb-triples. The number of evaluated
sb-triples is not enough to draw strong conclusions, but the number of correct
dizSeDoQue tb-triples stands out. It is higher than the extraction accuracy of these
relations (adj property-of v, 71-77% correct in section 4.2.5) from dictionaries. Given
that most of the problems with these relations were due to incorrect arguments, we
view this improvement as a consequence both of the new lemmatisation rules and
of the removal of tb-triples with arguments not occurring in the corpus, performed
before ontologisation. On the other hand, the quality of the parteDe sb-triples is
quite low. As with hypernymy, the main problem with this
relation seems to be the ambiguity and underspecification of its arguments. This
has a negative impact both on the establishment of correct synsets with these words
and on the ontologisation of tb-triples.
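The corpus-based removal step mentioned above can be illustrated with a minimal sketch. It is not the actual implementation: the TbTriple type, the filter_by_corpus function and the toy data are hypothetical, and we assume that a tb-triple is discarded as soon as either of its arguments is missing from the set of corpus lemmas.

```python
from typing import Iterable, List, NamedTuple, Set


class TbTriple(NamedTuple):
    """A term-based triple: two lexical items connected by a relation name."""
    arg1: str
    relation: str
    arg2: str


def filter_by_corpus(triples: Iterable[TbTriple], corpus_lemmas: Set[str]) -> List[TbTriple]:
    """Keep only the tb-triples whose two arguments both occur in the corpus.

    Assumption: one missing argument is enough to discard the triple.
    """
    return [t for t in triples if t.arg1 in corpus_lemmas and t.arg2 in corpus_lemmas]


# Hypothetical toy data: the second triple has a badly lemmatised argument.
corpus_lemmas = {"roda", "carro"}
triples = [
    TbTriple("roda", "parteDe", "carro"),
    TbTriple("rod", "parteDe", "carro"),
]
kept = filter_by_corpus(triples, corpus_lemmas)
```

With this toy data, only the first triple is kept, since "rod" does not occur in the corpus lemmas.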
8.3.3 Global coverage
Compared with the number of sb-triples in Princeton WordNet 3.0 (see section 3.1.1),
developed manually between 1985 and 2006, Onto.PT v.0.35 is larger, because all
of its relations can be inverted. This means that Onto.PT contains about 346,000
sb-triples against the 285,000 of WordNet 3.0. This number, which may soon increase
if more resources are exploited, highlights the potential of an automatic
approach.
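As a rough illustration of how inverting relations affects the number of triples, consider the following sketch. The inverse relation name temParte, the synset identifiers and the with_inverses function are assumptions made only for this example. The last line simply relates the two totals reported above.

```python
# Hypothetical mapping from a relation name to the name of its inverse.
INVERSE = {"parteDe": "temParte"}


def with_inverses(triples):
    """Return the given triples plus one inverted triple per relation with a known inverse."""
    result = set(triples)
    for arg1, relation, arg2 in triples:
        if relation in INVERSE:
            result.add((arg2, INVERSE[relation], arg1))
    return result


# Hypothetical synset identifiers, used only for illustration.
direct = {("synset_roda", "parteDe", "synset_carro")}
all_triples = with_inverses(direct)

# Ratio between the approximate totals reported in the text.
ratio = 346_000 / 285_000
```

Under these assumptions, each direct triple yields a second, inverted one, and the reported totals correspond to Onto.PT having roughly 21% more sb-triples than WordNet 3.0.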
As this number is insufficient to quantify the coverage of Onto.PT, we evaluated
its coverage of base concepts, which should be represented in wordnets. The Global
WordNet Association 5 provides several lists of concepts of this kind. One of them
contains 164 base concepts, referred to as the “most important” in the wordnets
of English, Spanish, Dutch and Italian 6. The concepts are divided into 98 abstract
and 66 concrete, and are represented as Princeton WordNet 1.5 synsets.
In order to evaluate the global coverage of Onto.PT, we manually made rough
matches between the 164 base concepts and Onto.PT synsets. Given the
WordNet synset denoting each of the 164 concepts, we selected the Onto.PT synset
closest to its meaning. In the end, we concluded that Onto.PT roughly covers most
of the concepts in the list, more precisely 92 abstract and 61 concrete synsets (93%).
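The coverage figure can be checked with a few lines of arithmetic. The sketch below uses only the counts given in the text; the variable and function names are ours.

```python
# Counts taken from the text: size of the base concept list and matched concepts.
total = {"abstract": 98, "concrete": 66}
covered = {"abstract": 92, "concrete": 61}


def coverage(covered_counts, total_counts):
    """Fraction of base concepts for which a matching synset was found."""
    return sum(covered_counts.values()) / sum(total_counts.values())


overall = coverage(covered, total)  # 153 / 164
per_class = {kind: covered[kind] / total[kind] for kind in total}
```

Rounded to whole percentages, this gives 93% overall, with about 94% of the abstract and 92% of the concrete concepts covered.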
All the defined matches are reported in appendix B of this thesis. More
precisely, the concrete concepts are in table B.1, and the abstract in tables B.2.
There, we can see that the Onto.PT synsets are, on average, larger than WordNet’s.
On the one hand, this means that they are very rich, with many synonyms:
most synsets include various levels of language (formal, informal, figurative, older
forms...) and variants of Portuguese (Portugal, Brazil, Africa). This can be very
useful for tasks ranging from information retrieval (see section 8.4.3) to creative writing (e.g.
poetry). On the other hand, the matches show that there are synsets that go beyond
including only synonyms: most noisy items are more like near-synonyms, and some
are closely related words.
Considering just the abstract concepts not covered by Onto.PT (e.g. change
magnitude, definite quantity, visual property), they seem to have been created ar-
5 See website at http://www.globalwordnet.org/ (September 2012)<br />
6 See more about this list in http://www.globalwordnet.org/gwa/ewn_to_bc/corebcs.html<br />
(September 2012)