Onto.PT: Towards the Automatic Construction of a Lexical Ontology ...
4.1. Semantic relations from definitions<br />
where the grammars are created, and two automatic steps. Semantic relations, held<br />
between words in the definitions and the definiendum, are extracted after processing<br />
dictionary entries. Extracted relation instances are represented as term-based<br />
relational triples (hereafter, tb-triples) with the following structure:<br />
arg1 RELATION_NAME arg2<br />
A tb-triple indicates that one sense of the lexical item in the first argument (arg1)<br />
is related to one sense of the lexical item in the second argument (arg2) by means<br />
of a relation identified by RELATION_NAME. For instance:<br />
animal HIPERONIMO_DE cão (animal HYPERNYM_OF dog)<br />
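As an illustration only, a tb-triple can be held in a small record type. The sketch below is not part of the original system: the names TbTriple and parse_tb_triple are hypothetical, and it assumes that each argument is a single token and that the relation name is written as one token (for example, with an underscore).<br />

```python
from typing import NamedTuple


class TbTriple(NamedTuple):
    """A term-based triple: one sense of arg1 is related to one sense of arg2."""
    arg1: str
    relation: str
    arg2: str


def parse_tb_triple(line: str) -> TbTriple:
    """Parse the textual form 'arg1 RELATION_NAME arg2'.

    Assumption: exactly three whitespace-separated tokens; multiword
    arguments would need a different serialisation.
    """
    arg1, relation, arg2 = line.split()
    return TbTriple(arg1, relation, arg2)


triple = parse_tb_triple("animal HIPERONIMO_DE cão")
print(triple.arg1, triple.relation, triple.arg2)
```

A named tuple keeps the triple immutable and hashable, so a collection of triples can be deduplicated with a plain set.<br />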
The extraction procedure is illustrated in figure 4.2 and encompasses<br />
the following steps:<br />
1. Creation of the extraction grammars: After a careful analysis of the<br />
structure of the dictionary definitions, patterns that denote semantic relations<br />
are manually compiled into grammars. The rules of the grammars are made<br />
specifically for the extraction of relations between words in dictionary definitions<br />
and their definiendum.<br />
2. Extraction of semantic relations: The grammars are used together with<br />
a parser<sup>1</sup> that processes the dictionary definitions. Only definitions of<br />
open-category words (nouns, verbs, adjectives and adverbs) are processed. Whenever<br />
a definition matches a pattern, instances of semantic relations are extracted<br />
and represented as tb-triples t = {w1 R w2}, where w1 is a word in the<br />
definition, w2 is the definiendum, and R is the name of a relation that holds<br />
between one sense of w1 and one sense of w2.<br />
3. Cleaning and lemmatisation: After extraction, some triples have invalid<br />
arguments, such as punctuation marks or prepositions. Definitions are thus<br />
POS-tagged with the tagger provided by the OpenNLP toolkit<sup>2</sup>, using the<br />
models for Portuguese<sup>3</sup>. Triples with invalid arguments are discarded<sup>4</sup>. Moreover,<br />
if the arguments of the triples are inflected, and thus not defined in the<br />
dictionary, lemmatisation rules are applied<sup>5</sup>.<br />
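The two automatic steps can be sketched roughly as follows. This is a toy approximation under stated assumptions, not the actual pipeline: a single regular expression stands in for the hand-written grammars and the chart parser, a small stop-list stands in for POS-based filtering, and one plural-stripping rule stands in for the lemmatisation rules. The pattern, the word lists and the example definitions are all invented for illustration.<br />

```python
import re

# Hypothetical pattern: a definition starting with "espécie de X" ("kind of X")
# is taken to signal that X is a hypernym of the definiendum.
HYPERNYM_PATTERN = re.compile(r"^espécie de (\w+)", re.IGNORECASE)

# Toy stand-in for POS-based filtering: a few Portuguese function words.
INVALID_ARGS = {"de", "a", "o", "em", "para", "que"}


def extract(definiendum, definition):
    """Step 2: return tb-triples (w1, R, w2), where w2 is the definiendum."""
    match = HYPERNYM_PATTERN.match(definition)
    if match is None:
        return []
    return [(match.group(1).lower(), "HIPERONIMO_DE", definiendum)]


def lemmatise(word, lexicon):
    """Toy rule: strip a final 's' when only the shorter form is a dictionary entry."""
    if word not in lexicon and word.endswith("s") and word[:-1] in lexicon:
        return word[:-1]
    return word


def clean(triples, lexicon):
    """Step 3: drop triples with invalid first arguments, lemmatise the rest."""
    cleaned = []
    for w1, relation, w2 in triples:
        if w1 in INVALID_ARGS or not w1.isalpha():
            continue  # function word or punctuation: discard the triple
        cleaned.append((lemmatise(w1, lexicon), relation, w2))
    return cleaned


lexicon = {"animal", "cão", "planta", "rosa"}
raw = extract("cão", "espécie de animal doméstico")
raw += [("plantas", "HIPERONIMO_DE", "rosa"), (",", "HIPERONIMO_DE", "rosa")]
print(clean(raw, lexicon))
```

Filtering after extraction, rather than before it, mirrors the order described in step 3, where tagging only complements what the patterns already found.<br />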
This procedure results in a set of tb-triples of different predefined types. The<br />
resulting set may be formally seen as a term-based directed lexical network (see<br />
section 2.2.3). In this network, each tb-triple t = {w1 R w2} denotes an edge with<br />
label R, connecting words w1 and w2, which are the nodes.<br />
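One simple way to picture this network is as an adjacency map built from the triples. The sketch below is illustrative only: build_network is a hypothetical helper, the edge direction (from w1 to w2) is an assumption, and the second example triple is invented.<br />

```python
from collections import defaultdict


def build_network(triples):
    """Map each word to its labelled outgoing edges as (R, w2) pairs.

    Assumption: the edge for a triple (w1, R, w2) points from w1 to w2.
    """
    network = defaultdict(list)
    for w1, relation, w2 in triples:
        network[w1].append((relation, w2))
        network.setdefault(w2, [])  # make sure w2 is present as a node
    return dict(network)


triples = [
    ("animal", "HIPERONIMO_DE", "cão"),
    ("animal", "HIPERONIMO_DE", "gato"),
]
network = build_network(triples)
print(sorted(network))     # the nodes
print(network["animal"])   # labelled edges leaving "animal"
```

Because nodes are plain words rather than senses, all senses of a word share one node, which is what makes the network term-based.<br />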
<sup>1</sup> We used the chart parser PEN, available from https://code.google.com/p/pen/ (September 2012)<br />
<sup>2</sup> Available from http://incubator.apache.org/opennlp/ (September 2012)<br />
<sup>3</sup> See http://opennlp.sourceforge.net/models-1.5/ (September 2012)<br />
<sup>4</sup> Definitions are not tagged before extraction because the tagger models were trained on corpus<br />
text and do not work as well as they should for dictionary definitions. Furthermore, the grammars<br />
of PAPEL do not consider tags. Tagging at this stage should only be seen as a complement to the<br />
information provided by the dictionary.<br />
<sup>5</sup> The lemmatisation rules were compiled by our colleague Ricardo Rodrigues, and take advantage<br />
of the annotation provided by the OpenNLP POS tagger.