Onto.PT: Towards the Automatic Construction of a Lexical Ontology ...

Chapter 4. Acquisition of Semantic Relations

added to synsets are the target of a clustering algorithm, similar to the one presented in chapter 5.

• Chapter 7 proposes several algorithms for moving from term-based semantic relations to relations held between synsets, using only the extracted term-based relations and the discovered synsets.

• After presenting all the steps, chapter 8 shows how they can be combined in the ECO approach in order to reach our final goal, Onto.PT, a lexical ontology for Portuguese. The same chapter provides an overview of the current version of Onto.PT.

Any kind of information, from any source, can be integrated in Onto.PT, as long as it is represented as term-based triples. Still, given the goal of creating a broad-coverage lexical ontology, and despite some experiments using Wikipedia (Gonçalo Oliveira et al., 2010a), electronic dictionaries were our main target for exploitation, as in the MindNet project (Richardson et al., 1998; Vanderwende et al., 2005). As mentioned in section 2, language dictionaries are the main source of general lexical information about a language. They are structured around words and senses, and are more exhaustive in this respect than other textual resources. At the same time, they are systematic and thus easier to parse.

This chapter describes the extraction of semantic relations from three Portuguese dictionaries, which resulted in the LKB named CARTÃO, a large lexical-semantic network for Portuguese. Part of the work presented here is also reported in Gonçalo Oliveira et al. (2011).

We start this chapter by introducing our approach to the acquisition of term-based relational triples from dictionary definitions. Then, we describe the work performed on the creation of CARTÃO, starting with a brief introduction to the dictionaries used, some issues regarding their parsing, and the structure of their definitions. After that, we present the contents of CARTÃO, compare the knowledge extracted from each of the three dictionaries, and evaluate it using different procedures. We end this chapter with a brief discussion of the utility of an LKB structured as CARTÃO.

4.1 Semantic relations from definitions

In our work, the extraction of semantic relations from dictionaries is based on a fixed set of handcrafted rules, as opposed to state-of-the-art bootstrapping algorithms that learn relations from a small set of seeds (see more in section 3.2.2). Our approach is more time-consuming, especially in the construction of the grammars, which have to be manually adapted to new situations, but this is not critical for dictionaries. As we will discuss in section 4.2.3, many regularities are preserved across definitions in the same dictionary, and even across different dictionaries. The vocabulary thus tends to be simple and easy to parse. Also, most bootstrapping algorithms rely heavily on redundancy in large collections of text, while dictionaries are smaller and much less redundant. Furthermore, our approach provides higher control over the discriminating patterns.

The extraction of semantic relations is inspired by the construction of PAPEL, reported in Gonçalo Oliveira et al. (2009, 2010b), and consists of one manual step, where the grammars are created, and two automatic steps. Semantic relations, held between words in the definitions and the definiendum, are extracted after processing dictionary entries. Extracted relation instances are represented as term-based relational triples (hereafter, tb-triples) with the following structure:

arg1 RELATION_NAME arg2

A tb-triple indicates that one sense of the lexical item in the first argument (arg1) is related to one sense of the lexical item in the second argument (arg2) by means of a relation identified by RELATION_NAME. For instance:

animal HIPERONIMO_DE cão (animal HYPERNYM_OF dog)

The extraction procedure is illustrated in figure 4.2 and encompasses the following steps:

1. Creation of the extraction grammars: After a careful analysis of the structure of the dictionary definitions, patterns that denote semantic relations are manually compiled into grammars. The rules of the grammars are made specifically for the extraction of relations between words in dictionary definitions and their definiendum.

2. Extraction of semantic relations: The grammars are used together with a parser¹ that processes the dictionary definitions. Only definitions of open-category words (nouns, verbs, adjectives and adverbs) are processed. In the end, if definitions match the patterns, instances of semantic relations are extracted and represented as tb-triples t = {w1 R w2}, where w1 is a word in the definition, w2 is the definiendum, and R is the name of a relation established between one sense of w1 and one sense of w2.

3. Cleaning and lemmatisation: After extraction, some relations have invalid arguments, including punctuation marks or prepositions. Definitions are thus POS-tagged with the tagger provided by the OpenNLP toolkit², using the models for Portuguese³. Triples with invalid arguments are discarded⁴. Moreover, if the arguments of the triples are inflected, and thus not defined in the dictionary, lemmatisation rules are applied⁵.

This procedure results in a set of tb-triples of different predefined types. The resulting set may be formally seen as a term-based directed lexical network (see section 2.2.3). To this end, each tb-triple t = {w1 R w2} denotes an edge with label R connecting words w1 and w2, which are the nodes.

¹ We used the chart parser PEN, available from https://code.google.com/p/pen/ (September 2012).
² Available from http://incubator.apache.org/opennlp/ (September 2012).
³ See http://opennlp.sourceforge.net/models-1.5/ (September 2012).
⁴ Definitions are not tagged before extraction because the tagger models were trained on corpus text and do not work as well as they should on dictionary definitions. Furthermore, the grammars of PAPEL do not consider tags. Tagging at this stage should only be seen as a complement to the information provided by the dictionary.
⁵ The lemmatisation rules were compiled by our colleague Ricardo Rodrigues, and take advantage of the annotation provided by the OpenNLP POS tagger.
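The two automatic steps and the resulting network can be sketched in Python as follows. This is only a minimal illustration under stated assumptions, not the implementation used in this work: the actual grammars are run by the PEN chart parser and the cleaning step relies on the OpenNLP tagger, whereas here a single hypothetical regular expression stands in for a grammar rule, a small hand-written stop list stands in for POS-based filtering, and lemmatisation is left out. The toy definitions, the pattern and all function names are invented for the example.

```python
import re
from collections import defaultdict
from typing import NamedTuple


class TbTriple(NamedTuple):
    """Term-based triple: one sense of arg1 is related to one sense of arg2."""
    arg1: str
    relation: str
    arg2: str


# Hypothetical stand-in for one grammar rule: a definition starting with
# "espécie de X" or "tipo de X" suggests that X is a hypernym of the definiendum.
HYPERNYM_PATTERN = re.compile(r"^(?:espécie|tipo) de (\w+)")

# Hypothetical stop list standing in for the POS-based filtering step.
INVALID_ARGUMENTS = {"de", "a", "o", "em", "que"}


def extract(definiendum: str, definition: str) -> list[TbTriple]:
    """Step 2: match the pattern and emit tb-triples {w1 R w2}."""
    match = HYPERNYM_PATTERN.match(definition.lower())
    if match is None:
        return []
    return [TbTriple(match.group(1), "HIPERONIMO_DE", definiendum)]


def is_valid(argument: str) -> bool:
    """An argument is kept if it is alphabetic and not in the stop list."""
    return argument.isalpha() and argument not in INVALID_ARGUMENTS


def clean(triples: list[TbTriple]) -> list[TbTriple]:
    """Step 3 (cleaning only): discard triples with invalid arguments."""
    return [t for t in triples if is_valid(t.arg1) and is_valid(t.arg2)]


def build_network(triples: list[TbTriple]) -> dict[str, list[tuple[str, str]]]:
    """Each triple becomes an edge labelled R from node w1 to node w2."""
    network = defaultdict(list)
    for t in triples:
        network[t.arg1].append((t.relation, t.arg2))
    return dict(network)


# Toy entries (definiendum -> definition), invented for illustration.
TOY_DEFINITIONS = {
    "cão": "espécie de animal doméstico",
    "gato": "tipo de animal felino",
    "mesa": "móvel com tampo",
}

triples = [
    triple
    for word, definition in TOY_DEFINITIONS.items()
    for triple in extract(word, definition)
]
network = build_network(clean(triples))
print(network)
```

Storing the network as an adjacency mapping from w1 to its labelled outgoing edges keeps the direction of each relation explicit, which matters for asymmetric relations such as hypernymy.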