24.07.2013 Views

Onto.PT: Towards the Automatic Construction of a Lexical Ontology ...


Chapter 6. Thesaurus Enrichment

most ambiguous word (Max(senses)). For the synsets (Table 6.9), we present their quantity (Total), their average size in terms of words (Avg(size)), the number of synsets of size 2 (size = 2) and of size greater than 25 (size > 25) and, also, the size of the largest synset (max(size)).
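As a minimal sketch of how the word-level and synset-level measures named above could be computed, the following assumes a thesaurus is simply a list of word sets for a single part of speech. The function names, the toy words and the choice to average senses over all words (rather than only over ambiguous ones) are assumptions made for illustration, not details taken from the text.

```python
from collections import Counter


def word_stats(synsets):
    """Word-level measures: Total, Ambiguous, Avg(senses), Max(senses)."""
    # A word has one sense per synset it occurs in.
    senses = Counter(word for synset in synsets for word in set(synset))
    total = len(senses)
    return {
        "Total": total,
        "Ambiguous": sum(1 for n in senses.values() if n > 1),
        # Assumption: the average is taken over all words.
        "Avg(senses)": sum(senses.values()) / total if total else 0.0,
        "Max(senses)": max(senses.values(), default=0),
    }


def synset_stats(synsets):
    """Synset-level measures: Total, Avg(size), size = 2, size > 25, max(size)."""
    sizes = [len(set(synset)) for synset in synsets]
    total = len(sizes)
    return {
        "Total": total,
        "Avg(size)": sum(sizes) / total if total else 0.0,
        "size = 2": sum(1 for s in sizes if s == 2),
        "size > 25": sum(1 for s in sizes if s > 25),
        "max(size)": max(sizes, default=0),
    }


# Hypothetical toy thesaurus; the words are illustrative only.
toy = [
    {"carro", "viatura"},
    {"carro", "carruagem", "coche"},
    {"banco", "assento"},
    {"banco", "baixio"},
]

words = word_stats(toy)
sets_ = synset_stats(toy)
```

Here "carro" and "banco" each occur in two synsets, so they are the only ambiguous words of the toy thesaurus.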

                                           Words
Thesaurus       POS         Total    Ambiguous   Avg(senses)   Max(senses)
TeP 2.0         Noun        17,149   5,802       1.71          20
                Verb        8,280    4,680       2.69          50
                Adjective   14,568   3,730       1.46          19
                Adverb      1,095    227         1.30          11
1st iteration   Noun        28,693   11,794      1.98          22
                Verb        11,272   6,357       2.85          50
                Adjective   19,148   7,149       1.85          21
                Adverb      1,865    499         1.40          12
2nd iteration   Noun        29,223   11,988      1.99          22
                Verb        11,301   6,374       2.86          50
                Adjective   19,291   7,213       1.85          21
                Adverb      1,914    513         1.40          12
Clusters        Noun        21,126   2,196       1.14          5
                Verb        1,801    177         1.13          4
                Adjective   4,687    359         1.10          5
                Adverb      743      89          1.15          3
TRIP            Noun        45,457   15,392      1.80          22
                Verb        11,924   6,607       2.87          52
                Adjective   22,316   7,782       1.83          22
                Adverb      2,488    694         1.42          12

Table 6.8: Thesauri comparison in terms of words.

After the assignments, the number of words grows and the number of synsets becomes slightly lower. This might seem strange, but some synsets in TeP are very similar to each other, so after the assignments they become the same synset, and one of them is discarded. Furthermore, as expected, ambiguity becomes higher at this stage. As there is roughly the same number of synsets, but more words, some words are added to more than one synset. The synsets also become larger, as they are augmented.
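The drop in the number of synsets can be illustrated with a small sketch: once words are assigned to existing synsets, two synsets that were very similar may end up with identical content, and only one copy is kept. The function name `assign_and_merge`, the `(word, synset_index)` representation of assignments and the toy words are hypothetical; the actual assignment procedure is not reproduced here.

```python
def assign_and_merge(synsets, assignments):
    """Add assigned words to synsets, then discard exact duplicates.

    synsets: list of sets of words.
    assignments: iterable of (word, synset_index) pairs (hypothetical format).
    """
    enlarged = [set(s) for s in synsets]  # copy, so the input is not mutated
    for word, index in assignments:
        enlarged[index].add(word)

    merged, seen = [], set()
    for synset in enlarged:
        key = frozenset(synset)  # hashable view used to detect identical synsets
        if key not in seen:
            seen.add(key)
            merged.append(synset)
    return merged


# Two similar synsets plus an unrelated one; illustrative words only.
before = [{"casa", "lar"}, {"casa", "moradia"}, {"veloz", "ligeiro"}]
assignments = [("moradia", 0), ("lar", 1), ("abrigo", 0), ("abrigo", 1)]

after = assign_and_merge(before, assignments)
words_before = set().union(*before)
words_after = set().union(*after)
```

In this toy case the first two synsets both grow into the same four-word set, so the thesaurus ends with one synset fewer while covering one more word, and the surviving synset is larger than either original.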

The thesaurus obtained after clustering is smaller and much less ambiguous than the others. Besides the high threshold (θ = 0.5), this happens because the words not covered by TeP tend to be less frequent, and less frequent words are typically more specific and thus less ambiguous. Nevertheless, for nouns, there is still a synset with 31 words.

The words in TRIP are slightly more ambiguous than in TeP and the synsets of TRIP are also larger than TeP’s. It is clear that TRIP is much larger than TeP. It contains about two and a half times as many noun and adverb lexical items, about 3,500 more verbs and 8,000 more adjectives. The higher number of synsets means that the new thesaurus is also broader in terms of covered natural language concepts. On the other hand, the new thesaurus is more ambiguous and has larger synsets. For instance, it has almost 600 synsets with more than 25 words, which can be seen as too large to be practical (Borin and Forsberg, 2010). TeP has just 66 such synsets. Nevertheless, we have looked at the largest synsets of TRIP and noticed that most of them are well-formed, as they only contain synonymous words.
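The size comparison between TRIP and TeP can be checked directly against the word totals of Table 6.8. The short sketch below only redoes that arithmetic; the dictionary and variable names are illustrative.

```python
# Total number of words per part of speech, copied from Table 6.8.
tep = {"Noun": 17149, "Verb": 8280, "Adjective": 14568, "Adverb": 1095}
trip = {"Noun": 45457, "Verb": 11924, "Adjective": 22316, "Adverb": 2488}

# How many times larger TRIP is than TeP, and by how many words.
ratio = {pos: trip[pos] / tep[pos] for pos in tep}
difference = {pos: trip[pos] - tep[pos] for pos in tep}

for pos in tep:
    print(f"{pos}: x{ratio[pos]:.2f}, +{difference[pos]:,} words")
```

The noun and adverb ratios come out near 2.65 and 2.27, and the verb and adjective differences are 3,644 and 7,748, which matches the rounded figures given in the text.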
