Onto.PT: Towards the Automatic Construction of a Lexical Ontology ...
Chapter 4. Acquisition of Semantic Relations
suggestions in Simões et al. (2010). However, to minimise the probability of generating invalid lemmas, the tb-triple is discarded if they exist in neither TeP nor PAPEL.
For handling the wikitext of the Wiktionary.PT dump, we developed a specific parser (Anton Pérez et al., 2011). Although there is an available API for processing Wiktionary, JWKTL¹² (Zesch et al., 2008a), it is only compatible with the English and German versions of the resource. The main problem is that different language editions of Wiktionary use distinct delimiter elements to represent the information of each entry, so every Wiktionary parser needs to be adapted to the language edition. Since the source code of JWKTL was not available, we could not adapt it for Wiktionary.PT.
In the dictionary conversion process, only definitions of open-category words were used, and their part-of-speech (POS) tags were converted to one common notation: nome for nouns, verbo for verbs, adj for adjectives and adv for adverbs. The format adopted for representing the dictionaries contains one definition per line. Before the definition, we include the definiendum and its POS, as in the following definition for the word coco (coconut):
coco nome fruto gerado pelo coqueiro, muito usado para se fazer<br />
doces e para consumo de seu líquido<br />
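The line format described above can be read back with a few lines of code. The following is only a minimal sketch, not the parser used in this work: it assumes that fields are separated by whitespace and that the definiendum is a single token, which may not hold for multiword entries. The names `Definition`, `POS_TAGS` and `parse_definition_line` are hypothetical.

```python
from typing import NamedTuple

# The four POS labels of the common notation described in the text.
POS_TAGS = {"nome", "verbo", "adj", "adv"}


class Definition(NamedTuple):
    definiendum: str  # the word being defined
    pos: str          # one of POS_TAGS
    text: str         # the definition itself


def parse_definition_line(line: str) -> Definition:
    """Split one line into definiendum, POS and definition text.

    Assumption (not from the source): the definiendum is a single
    whitespace-free token, so the first two tokens are the word and its POS.
    """
    parts = line.strip().split(maxsplit=2)
    if len(parts) != 3 or parts[1] not in POS_TAGS:
        raise ValueError(f"not a valid definition line: {line!r}")
    return Definition(*parts)


example = ("coco nome fruto gerado pelo coqueiro, muito usado para se fazer "
           "doces e para consumo de seu líquido")
entry = parse_definition_line(example)
print(entry.definiendum, entry.pos)
```

Keeping the definition text as the unsplit remainder of the line means that commas and other punctuation inside the definition need no escaping.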
In this format, a word with more than one definition gives rise to more than one line. Also, since Wiktionary provides synonymy lists for some of its entries, we transformed these lists into definitions with only one word, as in the following example for the synonyms of the word bravo (brave):
Sinónimos: corajoso, destemido ⇒<br />
bravo adj corajoso<br />
bravo adj destemido<br />
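The transformation in this example can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the actual conversion code: it assumes that the synonym list is a single line in which a label is followed by a colon and comma-separated synonyms, and the function name `synonyms_to_definitions` is hypothetical.

```python
def synonyms_to_definitions(word: str, pos: str, synonym_line: str) -> list[str]:
    """Turn a line such as 'Sinónimos: corajoso, destemido' into
    one-word definition lines for the given word and POS.

    Assumption (not from the source): everything after the first colon
    is a comma-separated list of synonyms.
    """
    _label, _sep, listing = synonym_line.partition(":")
    synonyms = [s.strip() for s in listing.split(",")]
    # Skip empty items, e.g. those produced by a trailing comma.
    return [f"{word} {pos} {s}" for s in synonyms if s]


for line in synonyms_to_definitions("bravo", "adj", "Sinónimos: corajoso, destemido"):
    print(line)
```

Each generated line follows the same one-definition-per-line format as ordinary definitions, so the same downstream processing can be applied to both.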
After the conversion of DA and Wiktionary.PT, we obtained about 229,000 and 72,000 definitions, respectively. We do not have direct access to DLP, but we can say that it contains 176,000 definitions that gave rise to at least one relation.
Wiktionary.PT is the smallest resource, which resulted in the lowest number of definitions among the three dictionaries. However, before collecting these definitions, we discarded: (i) definitions corresponding to words in other languages; (ii) definitions of closed-category and inflected words (including verbal forms); (iii) definitions in entries with alternative syntaxes, not recognised by our parser. As Wiktionaries are created by volunteers, often not experts, and because there is no standard syntax for representing Wiktionary entries in wikitext, the structure of the entries is fairly inconsistent. It is thus impossible to develop a parser that handles all syntax variations and is therefore 100% reliable. This problem seems to be common to other editions of Wiktionary, as reported by other authors (e.g. Navarro et al., 2009).
4.2.3 Regularities in the Definitions
One of the main reasons for using dictionaries in the automatic acquisition of lexical-semantic relations is that they typically use simple and systematic vocabulary, suitable for exploitation in information extraction. With this in mind, during the creation of PAPEL, we developed a set of grammars, with lexical-syntactic patterns
¹² See http://www.ukp.tu-darmstadt.de/software/jwktl/ (September 2012)