
• They are not built for Portuguese from scratch, and thus have to deal with translation issues, including problems such as lexical gaps;

• They do not handle word senses, which might lead to inconsistencies regarding lexical ambiguity.

Given this scenario, we set as our goal the development of computational tools for acquiring, structuring and integrating lexical-semantic knowledge from text. Although some of these tools can be used independently, they were developed with the exploitation of Portuguese resources in mind and with the aim of creating a new lexical ontology for Portuguese, in which the aforementioned limitations are minimised. Consequently, the resulting resource would be:

• Public domain, and thus free to be used by anyone, in both research and commercial settings. We believe this is the best way for the resource to play its role in helping to advance the state of the art of Portuguese NLP. Furthermore, a larger community of users tends to provide important feedback, useful for improving the resource.

• Created automatically, by exploiting textual resources and other public LKBs, all created from scratch for one or more variants of Portuguese. Automatic construction enables the creation of larger and broader resources, at the cost of lower reliability, which is nevertheless still acceptable for most tasks.

• Structured according to the wordnet model. This choice relied on the wide acceptance of this model and on the broad range of algorithms that operate over this kind of structure to perform various NLP tasks.
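
As a brief illustration of this kind of structure, the short Python sketch below encodes a few synsets connected by hypernymy links and walks a word's hypernymy chain, the sort of traversal behind many of the algorithms mentioned above. The Portuguese synsets and links are invented examples, not entries of the resource.

    # A minimal wordnet-like structure: synsets (sets of synonymous lexical
    # items) connected by semantic relations, here restricted to hypernymy.
    # All words and links below are illustrative assumptions.
    synsets = {
        "s1": {"cão", "cachorro"},   # dog
        "s2": {"mamífero"},          # mammal
        "s3": {"animal"},            # animal
    }
    hypernym_of = {"s1": "s2", "s2": "s3"}  # s1 is-a s2, s2 is-a s3

    def hypernym_path(synset_id):
        """Follow hypernymy links upwards, a typical operation over this structure."""
        path = [synset_id]
        while synset_id in hypernym_of:
            synset_id = hypernym_of[synset_id]
            path.append(synset_id)
        return path

    print(hypernym_path("s1"))  # ['s1', 's2', 's3']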

1.2 Approach

Our flexible approach to the acquisition, organisation and integration of lexical-semantic knowledge involves three main automatic steps. Each step is independent of the others and can be used on its own for simpler tasks. Alternatively, their combination enables the integration of lexical-semantic knowledge from different heterogeneous sources and results in a wordnet-like ontology. The three steps are briefly described as follows:

1. Extraction: instances of semantic relations held between lexical items are automatically extracted from text. As long as the extracted instances are represented as triples (two items connected by a predicate), the extraction techniques used in this step do not affect the following steps. In the specific case of our work, we followed a pattern-based extraction over dictionary definitions (see the extraction sketch below).

2. Thesaurus enrichment and clustering: if there is a conceptual base with synsets for the target language, its synsets are augmented with the extracted synonymy relations. For this purpose, the network established by all extracted synonymy instances (synpairs) is exploited to compute the similarity between each synset and each synpair, and both elements of a synpair are then added to their most similar synset (see the enrichment sketch below). As for synpairs whose two lexical items are not covered by the thesaurus, they are clustered in order to discover new synsets.
