Onto.PT: Towards the Automatic Construction of a Lexical Ontology ...


4.1. Semantic relations from definitions

where the grammars are created, and two automatic steps. Semantic relations, held between words in the definitions and the definiendum, are extracted after processing dictionary entries. Extracted relation instances are represented as term-based relational triples (hereafter, tb-triples) with the following structure:

arg1 RELATION_NAME arg2

A tb-triple indicates that one sense of the lexical item in the first argument (arg1) is related to one sense of the lexical item in the second argument (arg2) by means of a relation identified by RELATION_NAME. For instance:

animal HIPERONIMO_DE cão (animal HYPERNYM_OF dog)
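As an illustration of the structure above, a tb-triple can be held in a small record type. This is only a sketch: the names `TbTriple` and `parse_tb_triple` are hypothetical, and it assumes that both arguments are single tokens and that the relation name is written as one token (for instance, joined by an underscore).

```python
from typing import NamedTuple


class TbTriple(NamedTuple):
    """Term-based relational triple: arg1 RELATION_NAME arg2."""
    arg1: str      # lexical item in the first argument
    relation: str  # name of the relation
    arg2: str      # lexical item in the second argument


def parse_tb_triple(line: str) -> TbTriple:
    """Parse a whitespace-separated triple (assumes single-token fields)."""
    parts = line.split()
    if len(parts) != 3:
        raise ValueError(f"expected 3 fields, got {len(parts)}: {line!r}")
    return TbTriple(*parts)


triple = parse_tb_triple("animal HIPERONIMO_DE cão")
print(triple.arg1, triple.relation, triple.arg2)
```

A multiword argument would break the three-field assumption, so a real reader for such triples would need a delimiter other than whitespace.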

The extraction procedure is illustrated in figure 4.2 and encompasses the following steps:

1. Creation of the extraction grammars: After a careful analysis of the structure of the dictionary definitions, patterns that denote semantic relations are manually compiled into grammars. The rules of the grammars are made specifically for the extraction of relations between words in dictionary definitions and their definiendum.

2. Extraction of semantic relations: The grammars are used together with a parser¹ that processes the dictionary definitions. Only definitions of open category words (nouns, verbs, adjectives and adverbs) are processed. In the end, if definitions match the patterns, instances of semantic relations are extracted and represented as tb-triples t = {w1 R w2}, where w1 is a word in the definition, w2 is the definiendum, and R is the name of a relation established between one sense of w1 and one sense of w2.

3. Cleaning and lemmatisation: After extraction, some relations have invalid arguments, including punctuation marks or prepositions. Definitions are thus POS-tagged with the tagger provided by the OpenNLP toolkit², using the models for Portuguese³. Triples with invalid arguments are discarded⁴. Moreover, if the arguments of the triples are inflected and thus not defined in the dictionary, lemmatisation rules are applied⁵.
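The two automatic steps can be sketched roughly as follows. This is not the actual implementation: a single regular expression stands in for the hand-written grammars and the chart parser, a small preposition list stands in for POS tagging, and a lookup table stands in for the lemmatisation rules. The pattern, the word lists and the dictionary entries are all invented for illustration.

```python
import re
from typing import Iterable, Iterator, NamedTuple, Set, Tuple


class TbTriple(NamedTuple):
    arg1: str      # word found in the definition
    relation: str  # relation name
    arg2: str      # definiendum


# Hypothetical pattern standing in for one hand-written grammar rule.
HYPERNYM_PATTERN = re.compile(r"^(?:espécie|tipo) de (\S+)")
OPEN_CATEGORIES = {"noun", "verb", "adjective", "adverb"}
# Toy stand-ins for POS-based filtering and for the lemmatisation rules.
PREPOSITIONS = {"a", "com", "de", "em", "para", "por"}
TOY_LEMMAS = {"animais": "animal"}

Entry = Tuple[str, str, str]  # (definiendum, part of speech, definition)


def extract(entries: Iterable[Entry]) -> Iterator[TbTriple]:
    """Step 2: match definitions of open-category words against the pattern."""
    for definiendum, pos, definition in entries:
        if pos not in OPEN_CATEGORIES:
            continue
        match = HYPERNYM_PATTERN.search(definition.lower())
        if match:
            yield TbTriple(match.group(1), "HIPERONIMO_DE", definiendum)


def clean(triples: Iterable[TbTriple], headwords: Set[str]) -> Iterator[TbTriple]:
    """Step 3: drop invalid arguments, lemmatise words missing from the dictionary."""
    for t in triples:
        if not t.arg1.isalpha() or t.arg1 in PREPOSITIONS:
            continue  # punctuation or preposition: discard the triple
        arg1 = t.arg1 if t.arg1 in headwords else TOY_LEMMAS.get(t.arg1, t.arg1)
        yield t._replace(arg1=arg1)


# Invented toy entries, not taken from any real dictionary.
entries = [
    ("cão", "noun", "espécie de animal doméstico"),
    ("gado", "noun", "tipo de animais criados pelo homem"),
    ("coisa", "noun", "espécie de ... qualquer objecto"),
    ("de", "preposition", "tipo de ligação entre palavras"),
]
headwords = {"animal", "cão", "gado", "coisa"}
triples = list(clean(extract(entries), headwords))
for t in triples:
    print(*t)
```

In this toy run the third entry is dropped during cleaning because its first argument is punctuation, and the fourth is never processed because it does not define an open-category word.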

This procedure results in a set of tb-triples of different predefined types. The resulting set may be formally seen as a term-based directed lexical network (see section 2.2.3). In this view, each tb-triple t = {w1 R w2} denotes an edge with label R, connecting words w1 and w2, which are the nodes.
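A minimal sketch of this network view, assuming the tb-triples are available as plain (w1, R, w2) tuples: each word becomes a node and each triple a labelled, directed edge stored in an adjacency map. The function name `build_network` and the second example triple are illustrative only.

```python
from typing import Dict, Iterable, List, Tuple

Triple = Tuple[str, str, str]      # (w1, R, w2)
Network = Dict[str, List[Tuple[str, str]]]


def build_network(triples: Iterable[Triple]) -> Network:
    """Map each node w1 to its outgoing edges (R, w2)."""
    network: Network = {}
    for w1, relation, w2 in triples:
        network.setdefault(w1, []).append((relation, w2))
        network.setdefault(w2, [])  # make sure w2 exists as a node as well
    return network


network = build_network([
    ("animal", "HIPERONIMO_DE", "cão"),
    ("animal", "HIPERONIMO_DE", "gato"),  # invented second triple
])
for node, edges in network.items():
    print(node, edges)
```

Keeping the relation name on the edge, rather than one map per relation, preserves the idea of a single network with typed edges.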

¹ We used the chart parser PEN, available from https://code.google.com/p/pen/ (September 2012).

² Available from http://incubator.apache.org/opennlp/ (September 2012).

³ See http://opennlp.sourceforge.net/models-1.5/ (September 2012).

⁴ Definitions are not tagged before extraction because the tagger models were trained on corpus text and do not work as well as they should for dictionary definitions. Furthermore, the grammars of PAPEL do not consider tags. Tagging at this stage should only be seen as a complement to the information provided by the dictionary.

⁵ The lemmatisation rules were compiled by our colleague Ricardo Rodrigues, and take advantage of the annotation provided by the OpenNLP POS tagger.
