Onto.PT: Towards the Automatic Construction of a Lexical Ontology ...

Chapter 8. Onto.PT: a lexical ontology for Portuguese

Given a sentence to disambiguate, both algorithms take advantage of a context W = {w1, w2, ..., wn}, which includes all the content words of the sentence (nouns, verbs and, possibly, adjectives and adverbs). Before the algorithms are applied, the sentence is POS-tagged and lemmatised. Then, for each word wi ∈ W to be disambiguated, the set of candidate synsets, Ci = {Si1, Si2, ..., Sim}, is retrieved from Onto.PT. Each candidate synset must contain the word to be disambiguated: Sj ∈ Ci → wi ∈ Sj. The goal of each algorithm is to select a suitable synset Sk ∈ Ci for the occurrence of the word wi in the context W. The selected synset should convey the meaning of the word in the given context. How the best candidate is selected depends on the algorithm:

Bag-of-Words: For each candidate Sj, a set Rj = {qj1, qj2, ..., qjp} is built with all the words in Sj and in the synsets directly related to Sj in Onto.PT. The selected synset Sk is the one that maximises the similarity with the context: sim(Rk, W) = max_j sim(Rj, W). Similarities may be computed with measures typically used for comparing sets, such as the Jaccard or the Overlap coefficient (both mentioned in section 6.1.2 and in other sections of this thesis).

This algorithm is an adaptation of the Lesk algorithm (Lesk, 1986; Banerjee and Pedersen, 2002), with two main differences. First, in the Lesk algorithm adapted for WordNet, the “context” of a sense consists not only of the words in the synset, but also of the words in its gloss and in its example sentences. As Onto.PT does not contain synset glosses, we use all the words in related synsets instead. Second, in the Lesk algorithm, the similarity of contexts is given by the number of common terms, while we use a similarity measure that also accounts for the size of the compared sets. This way, the selection of the most suitable synset is not biased towards synsets with larger “contexts”.
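The Bag-of-Words selection can be illustrated with a minimal Python sketch. It is not the thesis implementation: the function names, the dictionary-based representation of synsets and relations, and the four toy synsets are assumptions made only for this example and are not taken from Onto.PT.

```python
def jaccard(a, b):
    """Jaccard coefficient: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def overlap(a, b):
    """Overlap coefficient: |A ∩ B| / min(|A|, |B|)."""
    a, b = set(a), set(b)
    smaller = min(len(a), len(b))
    return len(a & b) / smaller if smaller else 0.0


def select_synset_bow(word, context, synsets, relations, sim=jaccard):
    """Return the id of the candidate synset most similar to the context.

    synsets:   dict mapping a synset id to its set of words (hypothetical layout)
    relations: dict mapping a synset id to the ids of directly related synsets
    """
    best, best_score = None, -1.0
    for sid, words in synsets.items():
        if word not in words:
            continue  # only synsets containing the word are candidates
        # R_j: words of the candidate plus words of directly related synsets
        expanded = set(words)
        for neighbour in relations.get(sid, ()):
            expanded |= synsets[neighbour]
        score = sim(expanded, context)
        if score > best_score:
            best, best_score = sid, score
    return best


# Toy data, invented for illustration only.
synsets = {
    "s_level": {"nível", "plano"},
    "s_plan": {"plano", "projecto", "programa"},
    "s_height": {"altura", "elevação"},
    "s_idea": {"ideia", "intenção"},
}
relations = {"s_level": {"s_height"}, "s_plan": {"s_idea"}}
context = {"atingir", "plano", "altura"}

chosen = select_synset_bow("plano", context, synsets, relations)
```

In this toy setting, the candidate whose expanded word set shares more words with the context, relative to the size of the sets, is preferred. Passing `sim=overlap` swaps in the Overlap coefficient without changing the rest of the procedure.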
Personalized PageRank: As mentioned in section 7.1, the PageRank algorithm (Brin and Page, 1998) ranks the nodes of a graph according to their structural importance. It has also been used to solve more specific problems, including WSD with a wordnet (Agirre and Soroa, 2009). Our implementation is based on the latter work and uses the whole of Onto.PT. To this end, we consider Onto.PT to be a graph G = (V, E), with |V| nodes, representing the synsets, and |E| undirected edges, one for each relation between synsets. For a given context W, only the synsets containing words of the context receive an initial weight, which is uniformly distributed among them. The remaining synsets have no initial weight. After several iterations, the synsets that are more relevant to the given context are expected to be ranked higher. Therefore, for each word wi, this algorithm selects the highest-ranked candidate synset.

WSD using Onto.PT is exemplified with the following real sentences, obtained from AC/DC (Santos and Bick, 2000). For each sentence, we used all nouns and verbs as context, and applied the Personalized PageRank algorithm to assign a suitable synset to the occurrence of each noun. Each sentence is presented with the nouns underlined. Then, for each noun, we show, in parentheses, the number of senses it has in Onto.PT, which is the number of alternative synsets that include it, followed by the selected synset.

(1) Vai estar, seguramente, colocado num local envergonhado e inacessível que obrigará o pobre cidadão que pretenda reclamar a sujeitar-se à censura de todos os funcionários presentes.
(It will certainly be placed in some shy and inaccessible place, which will force the poor citizen who wishes to complain to submit themselves to the censure of all the workers present.)

local (2) → {situação, lado, lugar, local, sítio, localidade, arrozal, logo, sombral, loco, básis}
cidadão (1) → {homem, tipo, cidadão, indivíduo, cara, sujeito, camarada, cabra, gajo, freguês, caramelo, meco, sicrano, nego, tal, zinho, dito-cujo, ...}
censura (5) → {acusação, censura, exprobração, increpação, objurgação, objurgatória}
funcionário (1) → {trabalhador, funcionário, empregado, contratado}

(2) Ambos, na opinião do autor, atingiram um plano muito elevado, pelo que não é o facto de se terem retirado da vida política activa que os deixou igual a toda a gente.
(Both of them, in the author’s opinion, reached a very high level, so it is not the fact that they have withdrawn from active political life that left them like everyone else.)

opinião (12) → {opinião, voto, conselho, parecer, sugestão, arbítrio, alvitre}
autor (7) → {autor, produtor, artífice, perpetrador, fabricador, responsável}
plano (10) → {nível, plano}
facto (6) → {facto, coisa, negócio, realidade, espécie, passo, acto, fenómeno, mistério, cousa}
vida (16) → {vida, biografia}
gente (6) → {gente, ser humano}

(3) O marketing da convenção prevê a distribuição de 200 outdoors e anúncios comerciais em três emissoras de televisão e oito de rádio.
(The convention marketing foresees the distribution of 200 billboards and advertisements in three television and eight radio stations.)
marketing (1) → {marketing, mercadologia}
convenção (8) → {acordo, negócio, contrato, tratado, convenção, convénio, concórdia}
distribuição (7) → {distribuição, circulação}
outdoor (1) → {cartaz, ecrã, painel, retábulo, outdoor}
anúncio (8) → {anúncio, publicidade, propaganda, comercial, proclamação, cartel, utilitário, pregão, deixa, reclame, reclamo, papeleta, apostolado}
emissora (2) → {emissora, transmissora}
televisão (3) → {televisão, tv, tevê, televisora}
rádio (2) → {rádio, transmissão, radiodifusão, radiocomunicação, radiotransmissão, radiofonia}

Identifying the synset that holds the meaning of a word in context is important for handling ambiguity at the semantic level, and is the starting point for sense-aware NLP. Furthermore, it can be used to obtain related words that are not mentioned in the text, which is useful for several tasks, including IR, where queries can be expanded with related information (see section 8.4.3); writing aids; and text simplification (Woodsend and Lapata, 2011). In the latter, synonyms make it possible to rewrite a sentence with more frequent words while keeping a very similar meaning. If we replace the nouns of sentence (1) with their synonyms with the highest frequency in the AC/DC lists, we obtain the following sentence:

(4) Vai estar, seguramente, colocado numa situação envergonhada e inacessível que obrigará o pobre homem que pretenda reclamar a sujeitar-se à acusação de todos os trabalhadores presentes.
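The Personalized PageRank selection used for the example sentences in this section can be sketched roughly as follows. This is a simplified power iteration under stated assumptions, not the thesis implementation: the damping factor of 0.85, the fixed number of iterations, the function names and the toy synsets and edges are all hypothetical choices made for illustration.

```python
def personalized_pagerank(adj, seeds, damping=0.85, iterations=30):
    """Power iteration where teleportation goes only to the seed nodes.

    adj:   dict mapping each node to the set of its neighbours (undirected)
    seeds: non-empty set of nodes that share the initial weight uniformly
    """
    teleport = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in adj}
    rank = dict(teleport)
    for _ in range(iterations):
        new = {n: (1.0 - damping) * teleport[n] for n in adj}
        for n, neighbours in adj.items():
            if neighbours:
                # spread the damped weight of n evenly over its neighbours
                share = damping * rank[n] / len(neighbours)
                for m in neighbours:
                    new[m] += share
            else:
                # isolated node: hand its damped weight back to the seeds
                for s in seeds:
                    new[s] += damping * rank[n] / len(seeds)
        rank = new
    return rank


def disambiguate_ppr(context, synsets, edges):
    """Map each context word to its highest-ranked candidate synset."""
    adj = {sid: set() for sid in synsets}
    for a, b in edges:  # relations are treated as undirected edges
        adj[a].add(b)
        adj[b].add(a)
    # only synsets containing a context word receive an initial weight
    seeds = {sid for sid, words in synsets.items() if words & context}
    if not seeds:
        return {}
    rank = personalized_pagerank(adj, seeds)
    chosen = {}
    for word in context:
        candidates = [sid for sid in sorted(seeds) if word in synsets[sid]]
        if candidates:
            chosen[word] = max(candidates, key=rank.get)
    return chosen


# Toy graph, invented for illustration only.
synsets = {
    "s_level": {"nível", "plano"},
    "s_plan": {"plano", "projecto"},
    "s_height": {"altura", "elevação"},
    "s_idea": {"ideia", "intenção"},
}
edges = [("s_level", "s_height"), ("s_plan", "s_idea")]
context = {"plano", "altura"}

chosen = disambiguate_ppr(context, synsets, edges)
```

In the toy graph, the sense of "plano" connected to another synset activated by the context keeps more weight than the sense whose only neighbour is unrelated to the context, which is the behaviour the ranking is expected to show.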