24.07.2013 Views

Onto.PT: Towards the Automatic Construction of a Lexical Ontology ...


Chapter 4. Acquisition of Semantic Relations

suggestions in Simões et al. (2010). However, to minimise the probability of generating invalid lemmas, the tb-triple is discarded if the lemmas exist in neither TeP nor PAPEL.

For handling the wikitext of the Wiktionary.PT dump, we developed a specific parser (Anton Pérez et al., 2011). Although there is an available API for processing Wiktionary, JWKTL¹² (Zesch et al., 2008a), it is only compatible with the English and German versions of the resource. The main problem is that different language editions of Wiktionary use distinct delimiter elements to represent the information of each entry, so every Wiktionary parser needs to be adapted to the language edition. Since the source code of JWKTL was not available, we could not adapt it for Wiktionary.PT.

In the dictionary conversion process, only definitions of open-category words were used, with their part-of-speech (POS) labels converted to one common notation: nome for nouns, verbo for verbs, adj for adjectives and adv for adverbs. The format adopted for representing the dictionaries contains one definition per line. Before the definition, we include the definiendum and its POS, as in the following definition for the word coco (coconut):

coco nome fruto gerado pelo coqueiro, muito usado para se fazer<br />

doces e para consumo de seu líquido<br />
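The one-definition-per-line format described above can be read back with a few lines of code. The following is only a minimal sketch, not the parser used in this work: it assumes that the three fields are separated by whitespace and that the definiendum is a single token, neither of which is stated in the text. The names `Definition` and `parse_definition_line` are hypothetical.

```python
from typing import NamedTuple

# POS labels named in the text: nome, verbo, adj, adv.
POS_TAGS = {"nome", "verbo", "adj", "adv"}


class Definition(NamedTuple):
    definiendum: str
    pos: str
    text: str


def parse_definition_line(line: str) -> Definition:
    """Split 'definiendum POS definition' on the first two whitespace runs.

    Assumption (not from the source): fields are whitespace-separated and
    the definiendum is a single token.
    """
    parts = line.strip().split(None, 2)
    if len(parts) != 3:
        raise ValueError(f"expected 3 fields, got {len(parts)}: {line!r}")
    definiendum, pos, text = parts
    if pos not in POS_TAGS:
        raise ValueError(f"unknown POS label: {pos!r}")
    return Definition(definiendum, pos, text)


entry = parse_definition_line(
    "coco nome fruto gerado pelo coqueiro, muito usado para se fazer "
    "doces e para consumo de seu líquido"
)
print(entry.definiendum, entry.pos)
```

Multiword definienda would need an unambiguous field delimiter (for instance a tab), which this sketch does not attempt to guess.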

In this format, words with more than one definition give rise to more than one line. Also, since Wiktionary provides synonymy lists for some of its entries, we transformed these lists into definitions with only one word, as in the following example for the synonyms of the word bravo (brave):

Sinónimos: corajoso, destemido ⇒<br />

bravo adj corajoso<br />

bravo adj destemido<br />
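The transformation of a synonymy list into single-word definitions can be sketched as follows. This is an illustration under assumptions, not the code used in this work: it takes the list as a string with an optional `Sinónimos:` label and comma-separated items, as in the example above, and the function name `expand_synonyms` is hypothetical.

```python
from typing import List


def expand_synonyms(word: str, pos: str, synonym_list: str) -> List[str]:
    """Turn 'Sinónimos: a, b' into one single-word definition line per synonym."""
    head, sep, tail = synonym_list.partition(":")
    # If there is no label, treat the whole string as the list of synonyms.
    items = tail if sep else head
    synonyms = [s.strip() for s in items.split(",") if s.strip()]
    return [f"{word} {pos} {syn}" for syn in synonyms]


lines = expand_synonyms("bravo", "adj", "Sinónimos: corajoso, destemido")
for line in lines:
    print(line)
```

Each resulting line has the same shape as an ordinary definition, so it can go through the same relation extraction step as the other definitions.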

After the conversion of DA and Wiktionary.PT, we obtained about 229,000 and 72,000 definitions, respectively. We do not have direct access to DLP, but we can say that it contains 176,000 definitions that each gave rise to at least one relation.

Wiktionary.PT is the smallest resource, which resulted in the lowest number of definitions among the three dictionaries. However, before collecting these definitions, we discarded: (i) definitions corresponding to words in other languages; (ii) definitions of closed-category and inflected words (including verbal forms); (iii) definitions in entries with alternative syntaxes, not recognised by our parser. As Wiktionaries are created by volunteers, often not experts, and because there is no standard syntax for representing Wiktionary entries in wikitext, the structure of the entries is fairly inconsistent. It is thus impossible to develop a parser that handles all syntax variations and is therefore 100% reliable. This problem seems to be common to other editions of Wiktionary, as reported by other authors (e.g. Navarro et al. (2009)).

4.2.3 Regularities in the Definitions

One of the main reasons for using dictionaries in the automatic acquisition of lexical-semantic relations is that they typically use simple and systematic vocabulary, suitable for being exploited in information extraction. Having this in mind, during the creation of PAPEL, we developed a set of grammars, with lexical-syntactic patterns

¹² See http://www.ukp.tu-darmstadt.de/software/jwktl/ (September 2012)
