Onto.PT: Towards the Automatic Construction of a Lexical Ontology ...

Chapter 8. Onto.PT: a lexical ontology for Portuguese

8.4.3 Query expansion

In IR, query expansion consists of refining a request for information (query) in order to improve retrieval performance. Expansion may, for instance, replace the terms of the query by their lemmas or stems, give different weights to different terms in the query, or add related terms, such as synonyms, which can be searched for as alternatives.

When it comes to adding related terms, LKBs have proved very useful. For instance, Navigli and Velardi (2003) used Princeton WordNet for this task. They performed several experiments in which they first disambiguated the query terms with respect to WordNet and then expanded the query with words in the same synsets, words in hypernym synsets, and words in the respective synset glosses. For Portuguese, Sarmento et al. (2008) analysed the benefits of using OpenThesaurus.PT and an automatically generated verb thesaurus for query expansion.

A previous version of Onto.PT (v.0.31) was recently used for query expansion in the system Rapportágico (Rodrigues et al., 2012). This system participated in Págico⁷ (Mota et al., 2012; Santos et al., 2012), an IR joint task for Portuguese. In Págico, given a list of 150 information requests (topics) written in natural language, the goal is to identify pages of the Portuguese Wikipedia that answer each topic. If the answer is not in the answer page, the page supporting the answer should also be given. All the topics were about the culture of Portuguese-speaking countries. The following are real examples of Págico topics:

(5) Grupos indígenas que habitavam o litoral do Brasil quando chegaram os europeus. (Indigenous groups who inhabited the coast of Brazil when the Europeans arrived.)

(6) Viajantes ou exploradores que escreveram sobre o Brasil do século XVI. (Travellers or explorers who wrote about sixteenth-century Brazil.)
(7) Sambistas negros que abordam o racismo em suas letras. (Black samba musicians who address racism in their lyrics.)

Rapportágico is based on a shallow analysis of the topic, which it converts into a query for retrieving relevant documents, indexed by the Apache Lucene search engine⁸. The baseline approach of Rapportágico uses the lemmas of the nouns and the verbs in the topic as search keywords. In all the runs submitted to Págico, the baseline approach adds two refinements:

• All occurrences of words related to Portuguese-speaking countries (e.g. lusófono) were expanded to the actual names of these countries (in Portuguese: Portugal, Brasil, Angola, Moçambique, Guiné Bissau, Cabo Verde, São Tomé e Príncipe, Timor);

• The first noun of the topic was considered to be the category of the topic, and was always searched for appended to a very common hypernymy pattern in Wikipedia pages: é um in Portuguese ("is a" in English).

⁷ See http://www.linguateca.pt/Pagico/ (September 2012)
⁸ Freely available from http://lucene.apache.org/ (September 2012)
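The baseline described above can be illustrated with a minimal sketch. It is not Rapportágico's actual code: the function names, the set of trigger lemmas and the layout of the query string (quoted phrases, alternatives joined by OR) are assumptions made for illustration, and the lemmas are assumed to have been extracted from the topic beforehand.

```python
# Illustrative sketch only: names and query layout are assumptions,
# not the implementation used by Rapportágico.

# Assumed trigger lemmas; the text gives only "lusófono" as an example.
LUSOPHONE_TRIGGERS = {"lusófono"}

# Country names as listed in the text.
COUNTRIES = ["Portugal", "Brasil", "Angola", "Moçambique", "Guiné Bissau",
             "Cabo Verde", "São Tomé e Príncipe", "Timor"]

# Hypernymy pattern to which the category is appended.
HYPERNYMY_PATTERN = "é um"


def quote(term):
    """Quote multi-word terms so that they are searched as phrases."""
    return f'"{term}"' if " " in term else term


def build_baseline_query(lemmas, category):
    """Build a keyword query from topic lemmas.

    lemmas:   lemmas of the nouns and verbs in the topic (given, not computed here)
    category: lemma of the first noun of the topic
    """
    # The category is searched for appended to the hypernymy pattern.
    clauses = [f'"{HYPERNYMY_PATTERN} {category}"']
    for lemma in lemmas:
        if lemma == category:
            continue  # already covered by the pattern clause
        if lemma in LUSOPHONE_TRIGGERS:
            # Expand to the names of the Portuguese-speaking countries.
            clauses.append("(" + " OR ".join(quote(c) for c in COUNTRIES) + ")")
        else:
            clauses.append(quote(lemma))
    return " ".join(clauses)


if __name__ == "__main__":
    print(build_baseline_query(["grupo", "litoral", "habitar"], "grupo"))
```

In this sketch the lemma lists passed to the function are hand-picked for the example; a real system would obtain them from a tagger and lemmatizer.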
In addition to the previous baseline, we had two official runs where Onto.PT was used to perform an additional expansion on the 67 topics containing verb phrases (VPs) with only one verb. There, the verbs were disambiguated, and their synonyms were used as search alternatives. The main idea behind this expansion was to improve the system's recall. Still, only alternatives with more than 20 occurrences in the corpora provided by AC/DC were used. The only difference between these two runs was that, in run number 2, disambiguation was performed using the Bag-of-Words algorithm, while run number 3 used the Personalized PageRank. Moreover, after the official evaluation, we sent additional unofficial runs where, besides other experiments, we had runs similar to 2 and 3, but this time the category of all the topics was disambiguated and expanded as well.

In order to illustrate how expansion worked, Figure 8.6 presents the expansions of the category and the VP of the previously shown topics, obtained with the Personalized PageRank. For the sake of simplicity, we omitted the hypernymy pattern from the category expansion.

Topic | Original category | Expanded category | Original VP | Expanded VP
5 | tribo | grupo OR tribo | habitavam | habitar OR colonizar OR povoar OR ocupar
6 | viajantes ou exploradores | viajante OR peregrino OR viageiro OR passageiro OR caminhante OR viandante OR explorador | escreveram | redigir OR escrever OR grafar
7 | sambistas | sambador OR sambista | abordam | tratar OR apalavrar OR abordar OR versar

Figure 8.6: Category and VP expansions in Rapportágico, using Onto.PT.

Given the simplistic approach followed by Rapportágico and the high complexity of Págico, we can say that the obtained results were interesting. Rapportágico's performance was below that of most of the human participants, but it was better than RENOIR (Cardoso, 2012), the other automatic participant.
Nevertheless, RENOIR also followed a simplistic approach, and was heavily penalised by the large number of given answers per topic (100). The most relevant conclusion for our research was that the runs where VPs were expanded into their synonyms performed better than the baseline approach. Among these two runs, Personalized PageRank performed better than the Bag-of-Words method.

The results of the official participation of Rapportágico in Págico are shown in Table 8.7, for each run. In the same table, we present the results of the best human participation (actually, a group of participants), ludIT (Veiga et al., 2012), which show that we are still very far from a human approach to this task, and we show the results of the best run of RENOIR. Performance is given by the following measures:

• Answered topics: number of topics with at least one given answer.

• Given answers: total number of given answers.