A Wordnet from the Ground Up


4.5. Hybrid Combinations

Figure 4.8: WNW’s suggestions for ‘pocałunek’ (kiss). Glosses (from left to right): ‘nagroda’ prize/goal, ‘gest’ gesture, ‘ruch’ movement – the fit, ‘dotyk’ touch, ‘nagroda’ prize/possession, ‘pieszczota’ caress

ending line that links the new lemma oval with the corresponding attachment centre, even in the case of an addition at a high hypernymy distance from the centre. The site of the addition is recorded as the description of the positive proposal. Graphs present on the screen can be selectively unfolded and traversed along hyponymy/hypernymy links (folding/unfolding is accessible via triangle-marked buttons in the top-right corner of a synset symbol), so adding is not limited to synsets marked as fitting the new lemma.

At any moment, the linguist can initiate the Algorithm of Activation-area Attachment [AAA] in order to redefine the attachment areas and centres. Changes affect all new lemmas, but all decisions made so far are kept on the screen. For example, synsets with the new lemma already added are shown as green octagons, together with their relation links.

The total set of new lemmas was automatically divided, by repeatedly running the k-means clustering algorithm from the Cluto package (Karypis, 2002), into groups that represented mostly quite coherent semantic subdomains. The linguist is shown only one lemma group at a time. She can concentrate on a part of the hypernymy structure. Moreover, such a work procedure is facilitated by the re-computation mechanism, which can improve the attachment proposals by using the information about the new lemmas introduced into the plWordNet structure up to this moment.
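The grouping step can be illustrated with a minimal sketch. It does not use Cluto or reproduce its settings: it is plain Lloyd's k-means in pure Python, and the strategy of repeatedly bisecting any group that exceeds a size limit is only an assumption about what "repeatedly running" k-means might look like. The function names, the `max_size` parameter and the toy two-dimensional lemma vectors are all hypothetical.

```python
import random


def kmeans(vectors, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns one cluster index per input vector."""
    rng = random.Random(seed)
    centres = [list(v) for v in rng.sample(vectors, k)]
    labels = [0] * len(vectors)
    for _ in range(iters):
        # Assignment step: nearest centre by squared Euclidean distance.
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centres[c])))
            for v in vectors
        ]
        # Update step: move each centre to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old centre if a cluster went empty
                centres[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def group_lemmas(lemma_vectors, max_size):
    """Bisect groups with 2-means until each has at most max_size lemmas.

    Assumed strategy for illustration only; not the book's procedure.
    """
    pending = [list(lemma_vectors)]  # groups of lemma names still to check
    done = []
    while pending:
        group = pending.pop()
        if len(group) <= max_size:
            done.append(group)
            continue
        labels = kmeans([lemma_vectors[lemma] for lemma in group], 2)
        parts = [[l for l, lab in zip(group, labels) if lab == c] for c in (0, 1)]
        if not parts[0] or not parts[1]:
            done.append(group)  # cannot be split, e.g. identical vectors
        else:
            pending.extend(parts)
    return done


# Made-up toy vectors: three animal lemmas and three food lemmas.
toy_vectors = {
    "kot": (0.0, 0.1), "pies": (0.1, 0.0), "krowa": (0.2, 0.1),
    "chleb": (10.0, 10.1), "ser": (10.1, 10.0), "mleko": (9.9, 10.2),
}
groups = group_lemmas(toy_vectors, max_size=3)
```

In this toy run the two well-separated subdomains end up in separate groups, which mirrors the idea of showing the linguist one coherent lemma group at a time.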
Chapter 4. Extracting Relation Instances

The whole attachment screen is embedded in the full plWordNetApp, so the linguist can change any element of the plWordNet database by switching to another panel. The AAA algorithm runs on the server; the client side is left mainly with visualisation. WNW is written in Java and can run unchanged on many platforms.

4.5.4 Benefits of weaving the expanded structure

WNW has been designed to facilitate the actual process of wordnet expansion. Its primary evaluation was based on the work of a linguist with rich experience in editing plWordNet, who was adding new nominal lemmas. The candidates came from the same set of 13285 nominal lemmas that was defined as a basis for expanding plWordNet during work on MSR extraction, cf. Section 3.4.5. The set includes lemmas from a small Polish-English dictionary (Piotrowski and Saloni, 1999), two-word lemmas from a general dictionary of Polish (PWN, 2007) and frequent nouns (frequency > 1000) from the joint corpus¹⁴ (≈ 581 million tokens, see Section 3.4.5).

For evaluation purposes, we used 1360 new lemmas divided into subdomains corresponding to animals (113 LUs), food (170), people (323), people 2 (269), plants (81), places (243), plus a sample of 161 LUs randomly drawn across all clusters (rand. in Table 4.5). Prior to the experiment, the linguist had used only the traditional means of her work: electronic dictionaries and corpus browsing. We assumed three types of evaluation:

1. subjective opinions and observations of the linguist collected during actual work over a longer period, 18 person-days,
2. monitoring and analysing the linguist’s decisions recorded in the database together with descriptions,
3. automatic evaluation following the general scheme of re-building the existing wordnet by applying the AAA algorithm autonomously.

The linguist’s observations

WNW has turned out to be useful in the inclusion of new lemmas given a narrow domain such as jedz¹⁵ (names of foodstuffs) or rsl (plant names). For such lemmas the accuracy was high, and it increased even more as the database grew and as the operation of recomputing the graphs became available. As an example, the program

¹⁴ As described in Section 3.4.5, the joint corpus consists of IPIC (≈ 254 million tokens) (Przepiórkowski, 2004), texts from the electronic edition of a Polish daily Rzeczpospolita (≈ 113 million tokens) (Rzeczpospolita, 2008) and a corpus of large Polish texts collected from the Internet (≈ 214 million tokens).

¹⁵ We cite here the original labels assigned to the domains in plWordNet.