Onto.PT: Towards the Automatic Construction of a Lexical Ontology ...


IE. The input of an OIE system is a corpus and the output is a set of facts, represented as relational triples t = {e1, relation phrase, e2}. There is no need for annotated data, nor for specifying the relations to extract.

For learning a classifier, an OIE system starts by identifying the noun phrases of several thousand sentences in the input corpus. The parsing structure of the words connecting noun phrases is also analysed. Each such sequence is labelled as a positive or negative example of a trustworthy relation, according to predefined heuristics. Positive and negative tuples are finally used to establish triples, where a pair of noun phrases is connected by a relation phrase. Triples are mapped into feature vectors, used as the input of a classifier.

For the extraction, only a single pass over the input corpus is needed. Each pair of noun phrases is used as the arguments of a triple, and the text connecting them is used as the relation phrase. Triples classified as trustworthy are extracted.

2.4 Remarks on this section

In this section, we bridge the theoretical work described in the previous sections with the work developed in the scope of this thesis. The first part targets the knowledge representation in our work and the second is about the information extraction techniques applied.

2.4.1 Knowledge representation in our work

In our work, instances of semantic relations are first extracted as relational triples t = {w1, R, w2}, which can be seen either as logical predicates or as the edges of a lexical network. The types of semantic relations are typical relations between word senses, including synonymy, hypernymy, several types of meronymy and most of the relations introduced in section 2.1.2. The arguments of these relations are lexical items, described by their orthographical form. Word senses are not handled.
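The triple representation described above can be sketched in a few lines of Python. This is only an illustration of the idea of reading triples t = {w1, R, w2} as labelled edges of a lexical network; the names `Triple` and `build_network`, the relation labels and the example words are assumptions made for this sketch and do not come from the thesis.

```python
from collections import defaultdict
from typing import NamedTuple


class Triple(NamedTuple):
    """A relation instance t = {w1, R, w2} between two lexical items."""
    w1: str
    relation: str
    w2: str


def build_network(triples):
    """Index triples as labelled, directed edges: w1 -> [(R, w2), ...]."""
    network = defaultdict(list)
    for t in triples:
        network[t.w1].append((t.relation, t.w2))
    return dict(network)


# Hypothetical triples; arguments are plain orthographic forms, not senses.
triples = [
    Triple("car", "synonym-of", "automobile"),
    Triple("vehicle", "hypernym-of", "car"),
    Triple("wheel", "part-of", "car"),
]
network = build_network(triples)
```

Because the nodes are plain strings, every sense of a word collapses into a single node, which is the sense in which word senses are not handled at this stage.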
On the other hand, the final resource of this work, Onto.PT, can be seen as a lexical ontology, as we have adopted a model inspired by Princeton WordNet (see more about this resource in section 3.1.1). In order to represent natural language concepts, Onto.PT groups words in synsets, which are groups of synonymous words. This part of the resource can thus be seen as a thesaurus. As for other semantic relations, Onto.PT includes several predefined types established between synsets. Given that the presence of a lexical item in a synset defines a new possible sense of this item, different senses of the same word are recognised.

2.4.2 Information Extraction techniques in our work

We have only exploited dictionaries for the extraction of semantic relations. For this purpose, we used symbolic techniques over the dictionary definitions (see section 4). We recall that dictionaries provide a wide coverage of the lexicon and that they are structured in words and meanings. Moreover, definitions tend to use simple vocabulary and follow regularities, which makes most of them easily predictable. Therefore, after careful observation, we manually encoded a set of semantic patterns, organised in grammars, for processing them.

Despite the manual labour involved in creating the grammars, we could take advantage of one of the pros of using handcrafted knowledge over machine learning techniques: we have more control over the obtained results. Moreover, most of this work was done in the scope of the project PAPEL (Gonçalo Oliveira et al., 2008, 2010b). Given that the grammars for extracting semantic relations from one dictionary were already available, the manual effort was minimised. Furthermore, as we will see in more detail in section 4, definitions follow similar regularities across different dictionaries, which gives some portability to the grammars.

Other reasons for discarding machine learning techniques for this task include:

• As far as we know, there is no Portuguese dictionary (or corpus) with annotated semantic relations between lexical items that would be suitable, for instance, for supervised extraction approaches. The ReRelEM collection (Freitas et al., 2009) only contains relations between named entities. The creation of such a resource is more time-consuming than the manual creation of grammars.

• When it comes to the same relation instance, dictionary text has limited redundancy. So, in a few experiments that we have performed, weakly supervised bootstrapping techniques only discovered a small set of relations.

• OIE is more suitable for extracting open-domain relations from corpora, and not for extracting predefined relations. If OIE were applied, we would later need to convert the discovered predicates into one of our relation types. Moreover, the discovered relational triples typically connect generic world concepts, and not word senses.

For discovering synsets (see sections 5 and 6), we have used graph clustering techniques (Schaeffer, 2007) over the synonymy graph extracted from the dictionaries. Lexical items are grouped in synsets according to the similarity of their adjacencies in the graph. Also, in order to represent ambiguity, the clusters might be overlapping.
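To make the notion of "similarity of adjacencies" more concrete, the following minimal sketch compares the neighbourhoods of words in a toy synonymy graph. The Jaccard coefficient, the helper names `adjacencies` and `jaccard`, and the example synonymy pairs are all assumptions chosen for illustration; the similarity measure and clustering procedure actually used are the ones described in sections 5 and 6.

```python
def adjacencies(synonym_pairs):
    """Map each word to the set of its neighbours in the synonymy graph.

    Each word is included in its own adjacency set, so that two direct
    synonyms always share at least two elements.
    """
    adj = {}
    for a, b in synonym_pairs:
        adj.setdefault(a, {a}).add(b)
        adj.setdefault(b, {b}).add(a)
    return adj


def jaccard(x, y):
    """Jaccard coefficient between two non-empty sets."""
    return len(x & y) / len(x | y)


# Hypothetical synonymy pairs, as if extracted from dictionary definitions.
pairs = [
    ("bank", "shore"), ("shore", "riverside"), ("bank", "riverside"),
    ("bank", "lender"), ("lender", "financier"), ("bank", "financier"),
]
adj = adjacencies(pairs)

sim_same_group = jaccard(adj["shore"], adj["riverside"])
sim_ambiguous = jaccard(adj["bank"], adj["shore"])
sim_other_group = jaccard(adj["shore"], adj["lender"])
```

In this toy graph, the ambiguous word "bank" is fairly similar to the members of both groups, while the two groups are barely similar to each other. This is the kind of situation where overlapping clusters are useful, since "bank" can then belong to two synsets.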
Finally, the integration of semantic relations in the thesaurus (see section 7) can also be seen as a kind of clustering, as each argument of a triple is attached to the most similar synset.
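The attachment step can be pictured with the sketch below, which moves a term-based triple to a synset-based one. It is a simplified illustration under stated assumptions: the similarity between an argument and a synset is taken to be the Jaccard coefficient between the synset and a set of context words supplied by the caller, and the names `most_similar_synset`, `attach` and `contexts` are hypothetical. The procedure actually followed is the one presented in section 7.

```python
def jaccard(x, y):
    """Jaccard coefficient between two sets (0.0 if both are empty)."""
    union = x | y
    return len(x & y) / len(union) if union else 0.0


def most_similar_synset(word, context, synsets):
    """Among the synsets containing `word`, pick the one closest to its context."""
    candidates = [s for s in synsets if word in s]
    if not candidates:
        return None  # the word is not covered by the thesaurus
    return max(candidates, key=lambda s: jaccard(s, context | {word}))


def attach(triple, contexts, synsets):
    """Replace both arguments of (w1, R, w2) by their most similar synsets."""
    w1, relation, w2 = triple
    return (
        most_similar_synset(w1, contexts.get(w1, set()), synsets),
        relation,
        most_similar_synset(w2, contexts.get(w2, set()), synsets),
    )


# Hypothetical thesaurus with overlapping synsets ("bank" has two senses).
synsets = [
    frozenset({"bank", "shore", "riverside"}),
    frozenset({"bank", "lender", "financier"}),
    frozenset({"river", "stream"}),
]
# Hypothetical context words for each argument of the triple.
contexts = {"bank": {"shore", "riverside"}, "river": {"stream"}}

attached = attach(("bank", "part-of", "river"), contexts, synsets)
```

Here the argument "bank" is attached to the synset that shares the most words with its context, so the resulting relation holds between the "shore" sense of "bank" and the synset of "river".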