24.07.2013 Views

Onto.PT: Towards the Automatic Construction of a Lexical Ontology ...


Chapter 5. Synset Discovery

cota and planta. Besides an Italian dish (not included), the word pasta might have the popular meaning of money, the figurative meaning of a mixture of things, or it might denote a file or a briefcase. As for the word cota, besides height (not included), it can be a quota or portion, or refer to an old and respectable person, and informally denote a father or a mother. The word planta might either denote a plan or some guidelines, or it might denote a vegetable. Besides some synonyms of plant/vegetable (e.g. planta, vegetal), the synset with the vegetable meaning contains many actual plants or vegetables (e.g. maruge, camélia). After analysing this problem, we noticed that the dictionary DA contains several definitions of plants where the first sentence is just planta, without any differentia. Therefore, even though the correct relation to extract would be hypernymy, our grammars see those definitions as denoting synonymy. Another limitation shown by these examples is that, sometimes, the fuzzy synsets contain words which are not synonyms, but have similar neighbourhoods.

5.3.3 Thesaurus data for different cut points<br />

After analysing the fuzzy synsets, we inspected the impact of applying different cut-points (θ) in the transformation of the fuzzy thesaurus into a simple thesaurus. Tables 5.3 and 5.4 present the properties of the different thesauri obtained with different values for θ. Considering just the words of the thesauri, Table 5.3 includes the number of words, how many of those are ambiguous, the average number of senses per word, and the number of senses of the most ambiguous word. As for synsets, Table 5.4 shows the total number of synsets, the average synset size in terms of words, the number of synsets of size 2 and of size larger than 25, which are less likely to be useful (Borin and Forsberg, 2010), as well as the largest synset. Neither table considers synsets of size 1.
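The cut-point transformation and the statistics reported in the two tables can be sketched as follows. This is a minimal illustration and not the thesis implementation: it assumes that a fuzzy synset is a mapping from words to membership degrees and that a word is kept when its membership is at least θ (the exact comparison is an assumption). The function names `apply_cut_point` and `thesaurus_stats`, and the membership values in the demo, are hypothetical.

```python
from statistics import mean


def apply_cut_point(fuzzy_synsets, theta):
    """Turn fuzzy synsets into simple synsets.

    Assumption: each fuzzy synset is a dict {word: membership} and a word
    is kept when its membership is >= theta.
    """
    simple = []
    for fuzzy in fuzzy_synsets:
        words = frozenset(w for w, mu in fuzzy.items() if mu >= theta)
        if len(words) > 1:  # synsets of size 1 are not considered
            simple.append(words)
    return simple


def thesaurus_stats(synsets):
    """Compute word-level and synset-level properties of a simple thesaurus."""
    senses = {}  # word -> number of synsets that contain it
    for synset in synsets:
        for word in synset:
            senses[word] = senses.get(word, 0) + 1
    sizes = [len(s) for s in synsets]
    return {
        "words": len(senses),
        "ambiguous": sum(1 for n in senses.values() if n > 1),
        "avg_senses": mean(senses.values()) if senses else 0.0,
        "max_senses": max(senses.values(), default=0),
        "synsets": len(synsets),
        "avg_size": mean(sizes) if sizes else 0.0,
        "size_2": sum(1 for n in sizes if n == 2),
        "size_gt_25": sum(1 for n in sizes if n > 25),
        "largest": max(sizes, default=0),
    }


if __name__ == "__main__":
    # Invented toy memberships, for illustration only (not data from the thesis).
    toy = [
        {"pasta": 0.9, "dinheiro": 0.8, "massa": 0.2},
        {"pasta": 0.6, "mistura": 0.7},
        {"planta": 0.9, "vegetal": 0.3},
    ]
    for theta in (0.5, 0.65):
        print(theta, thesaurus_stats(apply_cut_point(toy, theta)))
```

In this toy setting, raising θ removes low-membership words first, so a word that belonged to several synsets ends up in fewer of them, which is the mechanism behind the drop in ambiguity.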

Before collecting the data in Tables 5.3 and 5.4, we followed one of the clustering methods for word senses proposed for EuroWordNet, which suggests that synsets with three members in common can be merged (Peters et al., 1998). However, the design of our clustering algorithm and the configuration of our synonymy networks are prone to create synsets sharing more than one word. So, to minimise the possibility of merging synsets denoting different concepts, we made sure that merged synsets had at least 75% overlap, computed as follows, where |Synset| denotes the number of words of a synset:

\[
\mathrm{Overlap}(Synset_a, Synset_b) = \frac{|Synset_a \cap Synset_b|}{\min(|Synset_a|, |Synset_b|)} \tag{5.9}
\]
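The overlap measure of Equation 5.9 and the merging step can be sketched in a few lines. This is only an illustration under stated assumptions: the text does not say how merge candidates are ordered or whether merging is repeated, so the greedy pairwise loop below is an assumption, as is applying the three-shared-members condition together with the 75% threshold. The names `overlap`, `merge_overlapping`, `threshold` and `min_common` are hypothetical.

```python
def overlap(synset_a, synset_b):
    """Equation 5.9: shared words divided by the size of the smaller synset."""
    a, b = set(synset_a), set(synset_b)
    if not a or not b:
        return 0.0  # guard for empty synsets, which the text does not discuss
    return len(a & b) / min(len(a), len(b))


def merge_overlapping(synsets, threshold=0.75, min_common=3):
    """Greedily merge pairs of synsets until no pair qualifies.

    Assumption: a pair is merged when it shares at least `min_common` words
    and its overlap is at least `threshold`.
    """
    merged = [set(s) for s in synsets]
    changed = True
    while changed:
        changed = False
        for i in range(len(merged)):
            for j in range(i + 1, len(merged)):
                common = len(merged[i] & merged[j])
                if common >= min_common and overlap(merged[i], merged[j]) >= threshold:
                    merged[i] |= merged[j]  # absorb the second synset
                    del merged[j]
                    changed = True
                    break
            if changed:
                break  # restart the scan, since indices have shifted
    return merged
```

Dividing by the size of the smaller synset means that a small synset fully contained in a larger one always has an overlap of 1, whatever the size of the larger one.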

As expected, as θ grows, ambiguity drops. This is observed not only from the number of ambiguous words, but also from the average number of word senses and the number of synsets. For instance, with θ = 0.5, despite the establishment of 8,000 clusters, each word has only one sense, which means there is no ambiguity. Out of curiosity, the largest synset in CLIP, with θ = 0.075, denotes the concept of money. It contains the following 58 words:

• jimbo, pastel, bagarote, guines, baguines, parrolo, marcaureles, ouro, grana, arame, massaroca, tutu, metal, bagalho, níquel, bilhestres, milho, jan-da-cruz, china, cumquibus, mussuruco, cobre, numerário, pilim, bagaço, pasta, zerzulho, painço, finanças, chelpa, calique, posses, bagalhoça, pecuniária, boro, dieiro, pila, gaita,
