24.07.2013 Views

Onto.PT: Towards the Automatic Construction of a Lexical Ontology ...



Chapter 8. Onto.PT: a lexical ontology for Portuguese

Given a sentence to disambiguate, both algorithms take advantage of a given context W = {w_1, w_2, ..., w_n}, which includes all the content words of the sentence (nouns, verbs and, possibly, adjectives and adverbs). Before applying the algorithms, the sentence is POS-tagged and lemmatised. Then, for each word w_i ∈ W to be disambiguated, the set of candidate synsets, C_i = {S_i1, S_i2, ..., S_im}, is retrieved from Onto.PT. Each candidate synset S_j ∈ C_i must contain the word to be disambiguated, that is, w_i ∈ S_j. The goal of each algorithm is to select a suitable synset S_k ∈ C_i for the occurrence of the word w_i in the context W. The selected synset should convey the meaning of the word in the given context. The selection of the best candidate depends on the algorithm used:

Bag-of-Words: For each candidate S_j, a set R_j = {q_j1, q_j2, ..., q_jp} is built with all the words in S_j and in the synsets directly related to S_j in Onto.PT. The selected synset is the one that maximises the similarity with the context, S_k : sim(R_k, W) = max_j sim(R_j, W). Similarities may be computed with measures typically used for comparing sets, such as the Jaccard or the Overlap coefficient (both mentioned in section 6.1.2 and in other sections of this thesis).

This algorithm is actually an adaptation of the Lesk algorithm (Lesk, 1986; Banerjee and Pedersen, 2002), with two main differences. First, in the Lesk algorithm adapted for WordNet, the “context” of a sense consists not only of the words in the synset, but also of the words in its gloss and in example sentences. As Onto.PT does not contain synset glosses, we use all the words in related synsets. Second, in the Lesk algorithm, the similarity of contexts is given by the number of common terms, while we use a more complex similarity measure. This way, the selection of the most suitable synset is not biased towards synsets with larger “contexts”.
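The Bag-of-Words selection described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in the thesis: the synset identifiers, the toy inventory `SYNSETS`, the relation map `RELATIONS` and the function names are all hypothetical, and Onto.PT itself is not accessed.

```python
# Toy data, invented for illustration only (not taken from Onto.PT).
SYNSETS = {
    "fin": {"banco", "caixa"},         # financial institution
    "seat": {"banco", "assento"},      # bench
    "money": {"dinheiro", "capital"},
    "furniture": {"cadeira", "mesa"},
}
RELATIONS = {
    "fin": ["money"],
    "money": ["fin"],
    "seat": ["furniture"],
    "furniture": ["seat"],
}
CONTEXT = {"banco", "dinheiro", "depositar"}  # lemmatised content words (W)


def jaccard(a, b):
    """Jaccard coefficient: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def overlap(a, b):
    """Overlap coefficient: |A ∩ B| / min(|A|, |B|)."""
    a, b = set(a), set(b)
    smaller = min(len(a), len(b))
    return len(a & b) / smaller if smaller else 0.0


def expanded_words(synset_id, synsets, relations):
    """Build R_j: words of S_j plus words of directly related synsets."""
    words = set(synsets[synset_id])
    for neighbour in relations.get(synset_id, ()):
        words |= synsets[neighbour]
    return words


def bag_of_words_wsd(word, context, synsets, relations, sim=jaccard):
    """Return the candidate synset whose R_j is most similar to the context."""
    candidates = [sid for sid, words in synsets.items() if word in words]
    if not candidates:
        return None
    return max(
        candidates,
        key=lambda sid: sim(expanded_words(sid, synsets, relations), context),
    )


print(bag_of_words_wsd("banco", CONTEXT, SYNSETS, RELATIONS))
```

Because both coefficients divide the number of shared words by a set size, a candidate with many related synsets does not win simply by having a larger R_j, which is the bias the text attributes to plain term counting.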

Personalized PageRank: As mentioned in section 7.1, the PageRank algorithm (Brin and Page, 1998) ranks the nodes of a graph according to their structural importance. However, it has also been used to solve more specific problems, including WSD with a wordnet (Agirre and Soroa, 2009). Our implementation is based on the latter work and uses the whole of Onto.PT. For this purpose, we consider Onto.PT to be a graph G = (V, E), with |V| nodes, representing the synsets, and |E| undirected edges, one for each relation between synsets. For a given context W, only the synsets with words in the context have initial weights, which are uniformly distributed. The remaining synsets have no initial weight. After several iterations, the synsets that are more relevant for the given context are expected to be ranked higher. Therefore, for each word w_i, this algorithm selects the highest ranked candidate synset.
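As a rough illustration of the procedure above, the following Python sketch runs a power iteration over an undirected synset graph, with the initial (and teleport) weight spread uniformly over the synsets that contain context words. It is a minimal sketch under stated assumptions, not the thesis implementation: the toy graph, the function names, the damping factor of 0.85 and the 30 iterations are all assumed here, since the text does not specify them.

```python
def personalized_pagerank(edges, seeds, damping=0.85, iterations=30):
    """Rank nodes of an undirected graph, teleporting only to the seed nodes."""
    seeds = set(seeds)
    if not seeds:
        raise ValueError("at least one seed synset is required")
    neighbours = {}
    for u, v in edges:  # undirected: register both directions
        neighbours.setdefault(u, set()).add(v)
        neighbours.setdefault(v, set()).add(u)
    for s in seeds:     # seeds without relations are still nodes
        neighbours.setdefault(s, set())
    nodes = list(neighbours)
    # Uniform initial weight on seed synsets, none on the rest.
    teleport = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(teleport)
    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) * teleport[n] for n in nodes}
        for n in nodes:
            if neighbours[n]:
                share = damping * rank[n] / len(neighbours[n])
                for m in neighbours[n]:
                    new_rank[m] += share
            else:
                # Isolated node: hand its weight back to the seeds.
                for s in seeds:
                    new_rank[s] += damping * rank[n] / len(seeds)
        rank = new_rank
    return rank


def ppr_wsd(word, context, synsets, edges, **kwargs):
    """Select the highest ranked candidate synset for `word` in `context`."""
    context = set(context)
    seeds = [sid for sid, words in synsets.items() if words & context]
    candidates = [sid for sid, words in synsets.items() if word in words]
    if not seeds or not candidates:
        return None
    rank = personalized_pagerank(edges, seeds, **kwargs)
    return max(candidates, key=lambda sid: rank.get(sid, 0.0))


# Toy data, invented for illustration only (not taken from Onto.PT).
TOY_SYNSETS = {
    "fin": {"banco", "caixa"},
    "seat": {"banco", "assento"},
    "money": {"dinheiro", "capital"},
    "furniture": {"cadeira", "mesa"},
}
TOY_EDGES = [("fin", "money"), ("seat", "furniture")]
TOY_CONTEXT = {"banco", "dinheiro", "depositar"}

print(ppr_wsd("banco", TOY_CONTEXT, TOY_SYNSETS, TOY_EDGES))
```

In this toy graph, the financial sense of "banco" is connected to another synset that also receives initial weight, so it keeps more rank than the bench sense, whose only neighbour has no word in the context.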

WSD using Onto.PT is exemplified with the following real sentences, obtained from AC/DC (Santos and Bick, 2000). For each sentence, we used all nouns and verbs as context, and applied the Personalized PageRank algorithm to assign a suitable synset to the occurrence of each noun. Each sentence is presented with the nouns underlined. Then, for each noun, we show, in parentheses, the number of senses it has in Onto.PT, which is the number of alternative synsets that include it, and, of course, the selected synset.

(1) Vai estar, seguramente, colocado num local envergonhado e inacessível que obrigará o pobre cidadão que pretenda reclamar a sujeitar-se à censura de
