A Wordnet from the Ground Up

3.5. Sense Discovery by Clustering

…in language data. The indirect evaluation defined in (Pantel, 2003; Lin and Pantel, 2002) will measure the level of resemblance between the division into senses made by the linguists constructing the wordnet and the division extracted via clustering.

We wanted to evaluate the algorithm's ability to reconstruct plWordNet synsets. That would confirm the applicability of the algorithm in the semi-automatic construction of wordnets. We put nouns from plWordNet on the input list of nouns (E in the algorithm). Because plWordNet is constructed bottom-up, the list consisted of the 13298 most frequent nouns in IPIC plus some most general nouns, see Section 3.4.5. The constraints were parameterised by 96142 features (41599 adjectives and participles, and 54543 nouns).

Several thresholds used in the CBC algorithm (plus a few more in the evaluation) are the major difficulty in its exact re-implementation. No method of optimising CBC with respect to the thresholds was proposed in (Pantel, 2003; Lin and Pantel, 2002),^27 and the values of all thresholds in (Pantel, 2003) were established experimentally. There was also no discussion of their dependence on the applied tools, the corpus and the characteristics of the given language.

Broda et al. (2008) performed such an analysis for Polish. Here we will outline only the most important conclusions. Experiments with using RWF instead of PMI showed that RWF gives higher precision (38.81% versus 22.37%), but leads to fewer words being assigned to groups (744 versus 2980). The value of σ, which controls when to stop assigning words to a committee (step 2b in Phase III of the algorithm), must be carefully selected for each type of MSR separately. As the value of σ increases, the precision increases too, but the number of words clustered drops significantly. When we make σ small and tune θ_ElCom (the threshold below which a word "is not similar" to any committee), we get relatively good precision but more words clustered. We found that, contrary to the statement and chart in (Pantel, 2003), tuning both thresholds was important in our case (cf. the sketches below).

The experiments confirmed our intuition that removing overlapping features in Phase III of CBC is too radical. The application of both proposed heuristics was tested experimentally and resulted in increased precision. The minimal-value heuristic increased the precision from 38.8% to 41.0% on 695 words clustered. The use of the ratio heuristic improves the result even further: the precision rises to 42.5% on 701 words clustered. A manual inspection of the results showed that the algorithm tends to produce too many overlapping senses when it uses the ratio heuristic.

Because of the indirect nature of the evaluation proposed in (Pantel, 2003), we wanted to evaluate CBC in a more direct and intuitive way.
We assumed that proper clustering …

^27 Automating this process is very difficult, because the whole process is computationally very expensive. A full iteration takes 5–7 hours on a 2.13 GHz PC with 6 GB of RAM, which makes, say, an application of Genetic Algorithms barely possible.
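The following is a minimal sketch, in Python, of the kind of feature weighting compared above: standard PMI over word-feature co-occurrence counts, plus a generic rank-based re-weighting loosely in the spirit of RWF. The rank_weights function and its k − rank scheme are our illustrative assumption only; the exact Rank Weight Function is the one defined by Broda et al. (2008).

# A sketch of feature weighting (ours, for illustration): PMI over word-feature
# counts, and a generic rank-based re-weighting loosely inspired by RWF.
import math
from collections import defaultdict

def pmi(cooc, word_totals, feat_totals, total):
    """cooc[(word, feature)] -> joint count; returns PMI weights, negatives clipped to 0."""
    weights = {}
    for (w, f), c in cooc.items():
        p_wf = c / total
        p_w = word_totals[w] / total
        p_f = feat_totals[f] / total
        weights[(w, f)] = max(0.0, math.log2(p_wf / (p_w * p_f)))
    return weights

def rank_weights(weights, k=100):
    """Replace raw weights by k - rank + 1 within each word's top-k features
    (a hypothetical rank scheme, not the exact RWF of Broda et al., 2008)."""
    by_word = defaultdict(list)
    for (w, f), v in weights.items():
        by_word[w].append((v, f))
    ranked = {}
    for w, feats in by_word.items():
        feats.sort(reverse=True)
        for rank, (_, f) in enumerate(feats[:k], start=1):
            ranked[(w, f)] = k - rank + 1
    return ranked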
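The role of σ and of the removal of overlapping features in Phase III can be summarised roughly as follows. This is a schematic sketch, not the original implementation: the names (Committee, assign_senses), the default threshold values and the cosine stand-in for the MSR are our assumptions, and the comment marks where the minimal-value and ratio heuristics would soften the feature removal.

# A schematic rendering of the word-to-committee assignment in Phase III of CBC
# (after Pantel, 2003); names, defaults and the cosine stand-in for the MSR are ours.
from dataclasses import dataclass
import math

@dataclass
class Committee:
    name: str
    centroid: dict            # feature -> weight

def cosine(u, v):
    num = sum(u[f] * v[f] for f in u.keys() & v.keys())
    den = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return num / den if den else 0.0

def assign_senses(word_vec, committees, sigma=0.2, max_senses=5):
    """Assign one word to committees (senses) until no committee is similar enough."""
    vec = dict(word_vec)
    senses = []
    for _ in range(max_senses):
        if not vec or not committees:
            break
        best = max(committees, key=lambda c: cosine(vec, c.centroid))
        if cosine(vec, best.centroid) < sigma:
            break                      # the word is "not similar" to any committee
        senses.append(best.name)
        # Original CBC: remove every feature shared with the committee centroid,
        # so that the next assignment can capture a different sense.  The
        # minimal-value and ratio heuristics discussed above would remove only
        # part of the weight here (their exact form follows Broda et al., 2008).
        for f in list(vec):
            if f in best.centroid:
                del vec[f]
    return senses

Removing every shared feature outright, as in the loop above, is exactly the step that the experiments reported above found too radical.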
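Finally, the synset-based evaluation can be conveyed by a deliberately simplified stand-in: an assignment of a word to a discovered sense counts as correct when the sense overlaps sufficiently with some plWordNet synset containing that word. The overlap criterion and the min_overlap parameter are our simplification; the metric actually applied follows Pantel (2003).

# A simplified stand-in (ours) for the synset-based evaluation: a word-sense
# assignment counts as correct when the induced sense shares at least
# min_overlap members with some gold synset containing the word.
def sense_precision(assignments, synsets, min_overlap=2):
    """assignments: word -> list of induced senses (sets of words);
    synsets: list of gold synsets (sets of words)."""
    correct = total = 0
    for word, senses in assignments.items():
        gold = [s for s in synsets if word in s]
        for sense in senses:
            total += 1
            if any(len(sense & s) >= min_overlap for s in gold):
                correct += 1
    return correct / total if total else 0.0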
