24.07.2013 Views

Onto.PT: Towards the Automatic Construction of a Lexical Ontology ...


Chapter 7. Moving from term-based to synset-based relations

Brazilian Portuguese and thus contains some unusual words or meanings for European Portuguese. On the other hand, OT.PT is smaller, but made for European Portuguese, and contains words and meanings not covered by TeP.

Therefore, we used TeP as a starting point for the creation of a new noun thesaurus³, TePOT, with the noun synsets from both TeP and OT.PT. The thesauri are merged according to the following automatic procedure:

1. The overlap between each synset in OT.PT, $O_i$, and each synset of TeP, $T_j$, is measured. For each $O_i \in$ OT.PT, a first set of candidates, $C_i = \{C_{i1}, C_{i2}, ..., C_{in}\} \subset$ TeP, will contain the TeP synsets that maximise the Overlap measure, $\mathrm{Overlap}(O_i, C_{ik}) = \max_j \mathrm{Overlap}(O_i, T_j)$:

\[ \mathrm{Overlap}(O_i, T_j) = \frac{|O_i \cap T_j|}{\min(|O_i|, |T_j|)} \]

If $\max_j \mathrm{Overlap}(O_i, T_j) = 0$, it means that the OT.PT synset contains only words that are not in TeP, and is thus added to TePOT as it is.

2. Otherwise, the candidate(s) in $C_i$ with the highest Jaccard coefficient are selected, $C_{il} \in C'_i \rightarrow \mathrm{Jaccard}(O_i, C_{il}) = \max_k \mathrm{Jaccard}(O_i, C_{ik})$:

\[ \mathrm{Jaccard}(O_i, C_{ik}) = \frac{|O_i \cap C_{ik}|}{|O_i \cup C_{ik}|} \]

Usually, $C'_i$ has just one synset but, if it has more than one, they are merged into the same synset. Then, the new synset is merged with $O_i$. A new TePOT synset $S_i$ will contain all words in $O_i$ and in the synsets in $C'_i$: $S_i = \{w_1, w_2, ..., w_m\} : \forall (w_j \in S_i) \rightarrow w_j \in O_i \lor w_j \in C_{il}, C_{il} \in C'_i$.

3. Synsets of TeP which have not been merged with any OT.PT synset are finally added to TePOT without any change.
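The three steps above can be sketched in Python as follows. This is only an illustrative reading of the procedure, not the original implementation: the function names (`overlap`, `jaccard`, `merge_thesauri`) are hypothetical, synsets are assumed to be non-empty sets of words, and a TeP synset selected by more than one OT.PT synset is simply merged into each of them, a case the text does not specify.

```python
from typing import FrozenSet, Iterable, List, Set

Synset = FrozenSet[str]


def overlap(a: Synset, b: Synset) -> float:
    """Shared words divided by the size of the smaller synset."""
    return len(a & b) / min(len(a), len(b))


def jaccard(a: Synset, b: Synset) -> float:
    """Shared words divided by the size of the union."""
    return len(a & b) / len(a | b)


def merge_thesauri(ot_pt: Iterable[Synset], tep: List[Synset]) -> List[Synset]:
    """Merge OT.PT synsets into TeP following steps 1 to 3 (assumed reading)."""
    merged: List[Synset] = []
    used: Set[int] = set()  # indices of TeP synsets merged with an OT.PT synset

    for o in ot_pt:
        scores = [overlap(o, t) for t in tep]
        best = max(scores, default=0.0)
        if best == 0.0:
            # Step 1: no word in common with TeP, keep the synset as it is.
            merged.append(o)
            continue
        # Step 1: candidates are the TeP synsets with maximal Overlap.
        candidates = [j for j, s in enumerate(scores) if s == best]
        # Step 2: among them, keep those with maximal Jaccard coefficient.
        best_jaccard = max(jaccard(o, tep[j]) for j in candidates)
        selected = [j for j in candidates if jaccard(o, tep[j]) == best_jaccard]
        new_synset = set(o)
        for j in selected:
            new_synset |= tep[j]
            used.add(j)
        merged.append(frozenset(new_synset))

    # Step 3: TeP synsets that were never merged are added unchanged.
    merged.extend(t for j, t in enumerate(tep) if j not in used)
    return merged


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    tep = [frozenset({"a", "x"}), frozenset({"a", "x", "y", "z"}), frozenset({"p", "q"})]
    ot_pt = [frozenset({"a", "b"}), frozenset({"m", "n"})]
    for synset in merge_thesauri(ot_pt, tep):
        print(sorted(synset))
```

In the toy data, both TeP synsets containing "a" tie on Overlap with {a, b}, because the measure normalises by the smaller synset. The Jaccard coefficient then prefers the smaller TeP synset, which shows why the second criterion is needed.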

In the end, TePOT contains 18,501 nouns, organised in 8,293 synsets – 6,237 of the nouns are ambiguous and, on average, one synset has 3.84 terms and one term is in 1.72 synsets.
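Figures of this kind can be computed from any thesaurus represented as a list of synsets. The sketch below is a hypothetical helper, not code from the original work; it assumes a non-empty thesaurus and takes "ambiguous" to mean that a term occurs in more than one synset.

```python
from collections import Counter
from typing import Dict, FrozenSet, Iterable


def thesaurus_stats(synsets: Iterable[FrozenSet[str]]) -> Dict[str, float]:
    """Size and ambiguity figures for a thesaurus given as a list of synsets."""
    synsets = list(synsets)
    # Number of synsets each term belongs to.
    senses = Counter(word for synset in synsets for word in synset)
    # Total number of (term, synset) memberships.
    memberships = sum(senses.values())
    return {
        "terms": len(senses),
        "synsets": len(synsets),
        # Assumption: a term is ambiguous if it is in more than one synset.
        "ambiguous_terms": sum(1 for n in senses.values() if n > 1),
        "avg_synset_size": memberships / len(synsets),
        "avg_senses_per_term": memberships / len(senses),
    }
```

Both averages share the same numerator, the total number of term-synset memberships, so the reported figures can be cross-checked: 8,293 × 3.84 and 18,501 × 1.72 both come to roughly 31,800.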

Tb-triples<br />

The algorithms were evaluated for ontologising tb-triples of three different types: hypernymy, part-of and purpose-of, all held between nouns. The tb-triples used were obtained from PAPEL 2.0, which was the most recent version of PAPEL when we started to create the gold reference. As a resource extracted automatically from dictionaries, PAPEL is not 100% reliable (see section 4.2.5 for evaluation details), but it was the largest freely available lexical-semantic resource of this kind. In order to minimise the noise of using incorrect tb-triples, we added further constraints to their selection, namely:

³ We only used nouns because the reported experiments only dealt with semantic relations between nouns, namely hypernymy, part-of, and purpose-of.
