Onto.PT: Towards the Automatic Construction of a Lexical Ontology ...

Chapter 7. Moving from term-based to synset-based relations

Brazilian Portuguese and thus contains some unusual words or meanings for European Portuguese. On the other hand, OT.PT is smaller, but it was made for European Portuguese and contains words and meanings not covered by TeP. Therefore, we used TeP as a starting point for the creation of a new noun thesaurus [3], TePOT, with the noun synsets from both TeP and OT.PT. The thesauri are merged according to the following automatic procedure:

1. The overlap between each synset in OT.PT, $O_i$, and each synset of TeP, $T_j$, is measured. For each $O_i \in$ OT.PT, a first set of candidates, $C_i = \{C_{i1}, C_{i2}, \ldots, C_{in}\} \subset$ TeP, contains the TeP synsets that maximise the Overlap measure, $\mathrm{Overlap}(O_i, C_{ik}) = \max_j \mathrm{Overlap}(O_i, T_j)$:

\[ \mathrm{Overlap}(O_i, T_j) = \frac{|O_i \cap T_j|}{\min(|O_i|, |T_j|)} \]

If $\max_j \mathrm{Overlap}(O_i, T_j) = 0$, the OT.PT synset contains only words that are not in TeP, and it is thus added to TePOT as it is.

2. Otherwise, the candidate(s) in $C_i$ with the highest Jaccard coefficient are selected, $C_{il} \in C'_i \rightarrow \mathrm{Jaccard}(O_i, C_{il}) = \max_k \mathrm{Jaccard}(O_i, C_{ik})$:

\[ \mathrm{Jaccard}(O_i, C_{ik}) = \frac{|O_i \cap C_{ik}|}{|O_i \cup C_{ik}|} \]

Usually, $C'_i$ has just one synset but, if it has more than one, they are merged into a single synset. Then, the new synset is merged with $O_i$. A new TePOT synset $S_i$ will contain all words in $O_i$ and in the synsets in $C'_i$:

\[ S_i = \{w_1, w_2, \ldots, w_m\} : \forall (w_j \in S_i) \rightarrow w_j \in O_i \lor w_j \in C_{il},\ C_{il} \in C'_i \]

3. Synsets of TeP that have not been merged with any OT.PT synset are finally added to TePOT without any change.

In the end, TePOT contains 18,501 nouns, organised in 8,293 synsets. Of these nouns, 6,237 are ambiguous and, on average, one synset has 3.84 terms and one term is in 1.72 synsets.

Tb-triples

The algorithms were evaluated for ontologising tb-triples of three different types: hypernymy, part-of and purpose-of, all held between nouns.
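The thesaurus merging procedure described earlier (steps 1 to 3) can be sketched in a few lines of Python. This is only an illustrative reading of the text, not the implementation used in the thesis: the function names (`overlap`, `jaccard`, `merge_thesauri`) are hypothetical, synsets are assumed to be non-empty sets of strings, and each OT.PT synset is assumed to be compared independently against the original TeP synsets, since the text does not say what happens when two OT.PT synsets select the same TeP synset.

```python
from fractions import Fraction
from typing import FrozenSet, Iterable, List, Set

Synset = FrozenSet[str]


def overlap(a: Synset, b: Synset) -> Fraction:
    # |a ∩ b| / min(|a|, |b|); exact fractions avoid float-equality issues.
    return Fraction(len(a & b), min(len(a), len(b)))


def jaccard(a: Synset, b: Synset) -> Fraction:
    # |a ∩ b| / |a ∪ b|
    return Fraction(len(a & b), len(a | b))


def merge_thesauri(ot_synsets: Iterable[Set[str]],
                   tep_synsets: Iterable[Set[str]]) -> List[Synset]:
    tep = [frozenset(t) for t in tep_synsets]
    merged_tep = set()          # indices of TeP synsets merged at least once
    result: List[Synset] = []

    for o in (frozenset(s) for s in ot_synsets):
        scores = [overlap(o, t) for t in tep]
        best = max(scores, default=Fraction(0))
        if best == 0:
            # Step 1: no word in common with any TeP synset, keep as it is.
            result.append(o)
            continue
        # Step 1: candidates are the TeP synsets with maximal overlap.
        candidates = [i for i, s in enumerate(scores) if s == best]
        # Step 2: among candidates, keep those with the highest Jaccard.
        best_j = max(jaccard(o, tep[i]) for i in candidates)
        chosen = [i for i in candidates if jaccard(o, tep[i]) == best_j]
        merged: Set[str] = set(o)
        for i in chosen:
            merged |= tep[i]
            merged_tep.add(i)
        result.append(frozenset(merged))

    # Step 3: TeP synsets never merged are added unchanged.
    result.extend(t for i, t in enumerate(tep) if i not in merged_tep)
    return result
```

For instance, with the toy input `merge_thesauri([{"a", "b"}], [{"a"}, {"a", "b", "c"}])`, both TeP synsets have maximal overlap with the OT.PT synset, but only the second has the highest Jaccard coefficient, so it is the one merged and `{"a"}` is carried over unchanged.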
The tb-triples used were obtained from PAPEL 2.0, which was the most recent version of PAPEL when we started to create the gold reference. As a resource extracted automatically from dictionaries, PAPEL is not 100% reliable (see section 4.2.5 for evaluation details), but it was the largest freely available lexical-semantic resource of this kind. In order to minimise the noise from incorrect tb-triples, we added constraints to their selection, namely:

[3] We only used nouns because the reported experiments only dealt with semantic relations between nouns, namely hypernymy, part-of, and purpose-of.
7.2. Ontologising performance

• Only tb-triples supported by CETEMPúblico (Santos and Rocha, 2001), a newspaper corpus of Portuguese, were used. This was done based on the results of the automatic validation, as reported in section 4.2.5. We thus had some confidence in the quality of the triples, as their arguments co-occurred at least once in the corpus, connected by discriminating textual patterns for their relation.

• Triples with the following frequent but abstract arguments were discarded: acto (act), efeito (effect), acção (action), estado (state), coisa (thing), qualidade (quality), as well as tb-triples with arguments with fewer than 25 occurrences in CETEMPúblico.

Since PAPEL 3.0, some of these frequent and abstract arguments have actually been treated as “empty heads” (see section 3.2.1 of this thesis). This means that, in the current version of PAPEL, there are no hypernymy tb-triples where these words are the hypernym.

Furthermore, we unified all meronymy relations (part-of, member-of, contained-in, material-of) into a single type, part-of. We took this option because the distinction between meronymy subtypes is sometimes too fine-grained and because, as happens for English (Ittoo and Bouma, 2010), for Portuguese there are textual patterns that might be used to denote more than one subtype.

Attachments

From the previous selection of tb-triples, we chose those held between words included in at least one TePOT synset, and whose attachment raised no doubts. It was possible to have tb-triples where all possible attachments were correct, as well as tb-triples without a plausible attachment, because the sense of one of the arguments was not covered by the thesaurus. For each tb-triple, the gold reference contained all plausible attachments, as in the examples of figure 7.5. In the end, the gold reference consisted of 452 tb-triples and their possible attachments, with those that were plausible marked.
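The selection constraints on tb-triples listed above (corpus support, abstract arguments, frequency threshold, unified meronymy) amount to a simple filter. The sketch below is an assumption-laden illustration rather than the code used for the gold reference: the function name `select_triples`, the triple layout `(arg1, relation, arg2)`, the relation labels, and the idea of passing corpus support as a precomputed set and frequencies as a dictionary are all hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]  # (arg1, relation, arg2), assumed layout

# Frequent but abstract arguments named in the text.
ABSTRACT_ARGS = {"acto", "efeito", "acção", "estado", "coisa", "qualidade"}
# Meronymy subtypes unified under part-of (labels are assumptions).
MERONYMY_TYPES = {"part-of", "member-of", "contained-in", "material-of"}
MIN_OCCURRENCES = 25


def select_triples(triples: Iterable[Triple],
                   corpus_supported: Set[Triple],
                   corpus_freq: Dict[str, int]) -> List[Triple]:
    selected: List[Triple] = []
    for arg1, rel, arg2 in triples:
        # Keep only triples supported by the corpus validation.
        if (arg1, rel, arg2) not in corpus_supported:
            continue
        # Discard triples with frequent but abstract arguments.
        if arg1 in ABSTRACT_ARGS or arg2 in ABSTRACT_ARGS:
            continue
        # Discard triples whose arguments are too rare in the corpus.
        if min(corpus_freq.get(arg1, 0), corpus_freq.get(arg2, 0)) < MIN_OCCURRENCES:
            continue
        # Unify all meronymy subtypes into part-of.
        if rel in MERONYMY_TYPES:
            rel = "part-of"
        selected.append((arg1, rel, arg2))
    return selected
```

The further restriction to words covered by at least one TePOT synset, and the manual marking of plausible attachments, are not modelled here.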
Table 7.1 shows the distribution of tb-triples per relation type, the average number of possible attachments, and the average number of plausible attachments. The proportion of plausible attachments per tb-triple can be seen as the random chance of selecting a plausible attachment from the possible alternatives. This number is between 40%, for hypernymy, and 50%, for purpose-of.

                            Attachments
Relation      tb-triples    Avg(possible)    Avg(plausible)
Hypernym-of   210           13.7             5.5 (40.2%)
Part-of       175           11.2             5.5 (49.5%)
Purpose-of    67            13.5             6.8 (50.1%)

Table 7.1: Matching possibilities in the gold resource.

7.2.2 Performance comparison

In order to compare the performance of the algorithms, we used them to ontologise the 452 tb-triples in the gold reference into the candidate synsets. However, instead