24.07.2013 Views

Onto.PT: Towards the Automatic Construction of a Lexical Ontology ...


Chapter 6

Thesaurus Enrichment

General language dictionaries and language thesauri cover the same kind of knowledge, but represent it differently. While the former consist of lists of word senses and their respective natural language descriptions, the latter group synonymous words together, so that they can be seen as possible lexicalisations of concepts. WordNet (Fellbaum, 1998) can actually be seen as a resource that bridges the gap between both kinds of resources, because each synset contains a textual gloss. However, in previous chapters, we have shown that, even though they are intended to cover the same kind of knowledge, most of the information in public handcrafted Portuguese thesauri is complementary to the information extracted from dictionaries. Therefore, it should be more fruitful to integrate their information in Onto.PT instead of using them merely as a reference for comparison. Another aspect in favour of this option is that, besides its size, TeP was manually created by experts. This means that, more than integrating the information in TeP, we can take advantage of its structure to obtain more reliable synsets and more controlled sense granularity.

The work presented in this chapter can be seen either as an alternative or as a complement to the previous chapter, as we use the synsets of TeP as a starting point for the construction of a broader thesaurus. To this end, we follow a four-step approach for enriching an existing electronic thesaurus, structured in synsets, with information extracted from electronic dictionaries, represented as synonymy pairs (synpairs)¹:

1. Extraction of synpairs from dictionary definitions;
2. Assignment of synpairs to suitable synsets of the thesaurus;
3. Discovery of new synsets after clustering the remaining synpairs;
4. Integration of the new synsets in the thesaurus.
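The four steps above can be pictured with a minimal Python sketch. It is only an illustration under stated assumptions, not the algorithm presented in this chapter: the scoring rule used in step 2 and the use of connected components as the clustering procedure in step 3 are placeholders chosen for brevity, and the function names (`assign_synpairs`, `cluster_synpairs`, `enrich`) are hypothetical. Step 1 is assumed to have already produced the list of synpairs.

```python
from collections import defaultdict


def assign_synpairs(synsets, synpairs):
    """Step 2 (assumed rule): add each pair to the best matching synset.

    Returns the pairs for which no synset contains either word.
    """
    # Dictionary neighbourhood of every word, built from the synpairs.
    neighbours = defaultdict(set)
    for a, b in synpairs:
        neighbours[a].add(b)
        neighbours[b].add(a)

    remaining = []
    for a, b in synpairs:
        # Candidate synsets already contain at least one word of the pair.
        candidates = [s for s in synsets if a in s or b in s]
        if not candidates:
            remaining.append((a, b))
            continue
        # Placeholder score: share of the synset's words that the
        # dictionary lists as synonyms of a or b.
        related = neighbours[a] | neighbours[b]
        best = max(candidates, key=lambda s: len(s & related) / len(s))
        best.update((a, b))
    return remaining


def cluster_synpairs(pairs):
    """Step 3 (stand-in clustering): connected components of the pair graph."""
    graph = defaultdict(set)
    for a, b in pairs:
        graph[a].add(b)
        graph[b].add(a)

    seen, clusters = set(), []
    for start in graph:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            word = stack.pop()
            if word not in component:
                component.add(word)
                stack.extend(graph[word] - component)
        seen |= component
        clusters.append(component)
    return clusters


def enrich(thesaurus, synpairs):
    """Steps 2 to 4: return an enriched copy of the thesaurus."""
    synsets = [set(s) for s in thesaurus]  # copy, so the input is untouched
    remaining = assign_synpairs(synsets, synpairs)
    # Step 4: the discovered clusters become new synsets.
    return synsets + cluster_synpairs(remaining)


# Toy data, invented for illustration only.
toy_thesaurus = [{"carro", "viatura"}, {"casa", "lar"}]
toy_synpairs = [("casa", "moradia"), ("feliz", "contente"), ("contente", "alegre")]
enriched = enrich(toy_thesaurus, toy_synpairs)
```

In this toy run, the pair (casa, moradia) shares a word with an existing synset and is merged into it, while the two pairs with no known word are left over, clustered together, and integrated as one new synset.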

In step 1, any approach for the automatic acquisition of synpairs from dictionaries, such as the one described in chapter 4, may be followed. Therefore, we do not discuss this step further. We start this chapter by presenting its main contribution, which is the algorithm for the automatic assignment of synpairs to synsets. Then, we evaluate the algorithm against a gold standard and select the most adequate settings for using it in the enrichment of TeP. Any graph clustering procedure suits step 3 of our approach. We chose to follow an approach similar to the one introduced

¹ Synpairs are synonymy tb-triples. They can be extracted from several sources; however, as we are dealing with general language knowledge, dictionaries are the obvious targets.
