A Wordnet from the Ground Up

Chapter 1. Motivation, Goals, Early Decisions

1.2 The Goals of the plWordNet Project

According to our initial plans, an extraction algorithm should suggest both new synsets and instances of lexico-semantic relations. In the end, the WordNet Weaver generates only suggestions of attachment points (Section 4.5.3): synsets in which a given new LU can be included or to which it can be attached as a new hyponym/hypernym or even meronym. The accuracy of clustering-based methods of suggesting new synsets ended up too low for practical applications (Section 3.5). The use of support tools notwithstanding, we wanted to abide by the principle that the ultimate responsibility for every wordnet element rests with its authors in every phase of the wordnet development. It was tempting to speed up the development of our wordnet at the cost of slightly lower accuracy, but we are convinced that a smaller wordnet with excellent accuracy is more useful in applications than a larger but less reliable resource.

Despite the limited funds, we fully expected to build a wordnet of a size comparable to several much better established European wordnets. The introduction of the automated methods in the second phase of the project was meant to reduce the linguistic workload considerably [8]. Section 4.5.4 reports on the extent to which this succeeded.

[8] That is why we have allotted the funds approximately in the proportion 1:2 to manual work and to the software design and development work.

There are many methods of extracting lexico-semantic relations from corpora. We present an overview and a detailed discussion of selected methods throughout Chapters 3 and 4. They can be roughly divided into two main groups: methods based on distribution (Chapter 3) and methods based on patterns (Chapter 4). Pattern-based methods can achieve relatively good accuracy in extracting instances of hypernymy – pairs of LUs – but very rarely of other relations such as synonymy, meronymy or antonymy; their recall is low. Distributional methods achieve good recall, because they can generate a description for any pair of LUs, but their accuracy is quite low: they do not distinguish between different lexico-semantic relations and produce a vague measure of semantic relatedness.

A well-known weakness of distributional methods is in distinguishing different LUs for the given lemma. Henceforth, we will understand a lemma to be a basic morphological word form that represents the occurrences of one or a few particular LUs in language expressions. A lemma is monosemous if it represents one LU, and polysemous otherwise. The basic morphological word form, or base form, is a word form or language expression with conventional values of grammatical categories, such as the nominative case and singular number for nouns. A base form represents a set of word forms with the same meaning and different values of grammatical categories. We decided to operate on lemmas during the extraction of relation instances, because the number of different word forms is very high in the strongly inflected Polish language. Lemmatisation, or the mapping of word forms to lemmas, must be done automatically for large corpora; some error ratio is inevitable. We will discuss corpus preprocessing in Section 3.4.3.

That is why we assumed from the start that it would be necessary to construct hybrid solutions: combine several methods, at least one following the pattern-based paradigm and one based on Distributional Semantics, see Section 3.2. We had been sceptical – justifiably, as Section 3.5 shows – about the possibility of recognising different LUs represented by a lemma on the basis of semantic clustering of lemmas. We therefore also planned to develop sense extraction for lemmas by clustering documents or at least longer segments that include occurrences of particular lemmas. We assumed that polysemous lemmas would occur in several documents. This part of our initial plans was the least successful (Section 3.5), but the other hybrid methods, when combined in the WordNet Weaver, achieved a level sufficient for practical application in the linguists’ work.

1.3 Early Decisions

1.3.1 Models for wordnet development

PWN began as a psychological experiment and gradually morphed into a large ongoing lexical resource project. We naturally tried to explore the accumulated effects of long-term work on PWN, but the EWN project (Vossen, 2002) also attracted our attention. EWN aimed to develop a family of aligned wordnets (Section 1.1.4), and the scale of the enterprise required careful design. The EWN team also had an opportunity to analyse the previous PWN experience. All of this made the EWN project an important reference point for us.

There is a fundamental difference between the EWN and plWordNet projects: the former was oriented toward the development of aligned wordnets, while the present stage of plWordNet construction focusses on the appropriate description of Polish. We leave the question of mapping onto other wordnets for the upcoming continuation of the present plWordNet project [9].

[9] The budget of our project was too limited to investigate the problems of mapping (or, regrettably, to write glosses).

The question of the appropriate sense-relating two-way mapping of wordnets for pairs of languages influenced how EWN constructed the wordnets. The solution was to link by expressing, in particular wordnets, the same lexicalised concepts from a shared set using the Inter-Lingual Index (Section 1.1.4). Besides this strategy, which somehow imposed seeking out lexicalisation of the same concepts in each language considered, two basic models of wordnet development have been worked out in EWN (Vossen, 2002, p. 52):