A Wordnet from the Ground Up


5.1. Weaving the Full-fledged Structure

5. Selected groups of new lemmas were loaded into WNW and the Algorithm of Activation-area Attachment [AAA] was run to generate suggestions of attachment areas.
6. Linguists worked freely with the lemma groups; they browsed suggestions in any order and edited the wordnet structure.
7. At any moment of the process, linguists could re-run AAA to get perhaps better suggestions for those new lemmas that had not been edited yet.
8. Linguists notified the coordinator about finishing work with particular groups; the coordinator could then analyse the results using the same WNW system (accessing it via the Internet, just like the linguists).

The whole process of extracting data sets (the sources of evidence for AAA), performed in steps 1-2, took approximately 25 days on a standard PC (3GHz, 4GB RAM, one single-core processor). The time could be reduced to 2-4 days by applying a grid of at least several PCs. This one-time operation is computationally very intensive, but it prepares all data sets except classifiers at the beginning of a long-term expansion process. This is done once per list of new lemmas, regardless of the size of the list. Classifier training, to be repeated several times as the wordnet grows, is much less computationally demanding than the other tasks. AAA is performed on the server, not on the linguists' PCs. It takes 10-20 minutes on a PC-class server.

Clustering (step 4) is optional from the point of view of the WNW application, which can work efficiently with a list of several thousand new lemmas.
Clustering is necessary for people: a huge flat list is just too difficult to comprehend, and it is practically impossible to organise work lasting several weeks around it.

The idea behind clustering was to divide the initial list into lemma groups in such a way that each group consists of lemmas with senses belonging to one domain common to all of them (at least the intersection of the lemma senses should belong to one domain). There is no perfect clustering algorithm, but manual grouping would be too laborious to be feasible. We applied an off-the-shelf implementation of clustering algorithms in the Cluto package (Karypis, 2002). The input to the clustering algorithms was a set of values describing the semantic relatedness of lemma pairs, acquired from MSR_GRWF. We experimented with different algorithms. After a manual inspection of the results, we selected graph-based clustering. We did not evaluate the quality of clustering exhaustively: the mechanism played only a minor, supporting role. Due to the properties of the clustering algorithms, we repeated the process several times, each time getting some groups and a large set of ‘outliers’, which then became the input to another run. The obtained groups were loaded into WNW; all in all, 92 groups were constructed.
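The repeated clustering runs described above, in which each run yields some groups and passes its ‘outliers’ to the next run, can be illustrated with a minimal sketch. It is not the Cluto algorithm: as a deliberately simple stand-in, it links lemma pairs whose relatedness exceeds a threshold and treats connected components as groups. The function names, the thresholds, the minimum group size, and the toy relatedness scores are all assumptions made for illustration; they are not values from the project.

```python
from itertools import combinations


def cluster_round(lemmas, relatedness, threshold, min_size=3):
    """One run: link pairs scoring >= threshold, return (groups, outliers).

    Simplified stand-in for graph-based clustering (NOT Cluto's method):
    groups are connected components with at least `min_size` lemmas.
    """
    lemmas = list(lemmas)
    parent = {lemma: lemma for lemma in lemmas}  # union-find forest

    def find(x):
        # Follow parent links to the component root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(lemmas, 2):
        # Missing pairs are treated as unrelated (score 0.0).
        if relatedness.get(frozenset((a, b)), 0.0) >= threshold:
            parent[find(a)] = find(b)

    components = {}
    for lemma in lemmas:
        components.setdefault(find(lemma), []).append(lemma)

    groups = [sorted(c) for c in components.values() if len(c) >= min_size]
    outliers = sorted(l for c in components.values() if len(c) < min_size for l in c)
    return groups, outliers


def iterative_clustering(lemmas, relatedness, thresholds, min_size=3):
    """Repeat clustering; the outliers of each run are the input to the next."""
    all_groups, remaining = [], list(lemmas)
    for threshold in thresholds:
        if not remaining:
            break
        groups, remaining = cluster_round(remaining, relatedness, threshold, min_size)
        all_groups.extend(groups)
    return all_groups, remaining


# Toy relatedness scores, invented purely for illustration.
scores = {
    frozenset(("brzoza", "buk")): 0.8,
    frozenset(("buk", "sosna")): 0.7,
    frozenset(("ołówek", "kredka")): 0.5,
    frozenset(("kredka", "długopis")): 0.45,
}
lemmas = ["brzoza", "buk", "sosna", "ołówek", "kredka", "długopis", "hamak"]

groups, outliers = iterative_clustering(lemmas, scores, thresholds=[0.6, 0.4])
print(groups)
print(outliers)
```

In this toy run the first, stricter pass groups the tree names and the second, looser pass groups the writing implements, while hamak stays an outlier. Relaxing the threshold between runs is only one possible way to make later runs productive; the source does not say how the successive runs differed.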
akacja ‘black locust (false acacia)’, bez ‘lilac’, bluszcz ‘ivy’, brzoza ‘birch’, buk ‘beech’, busz ‘bush’, bylina ‘perennial’, cedr ‘cedar’, choinka ‘Christmas tree’, chrust ‘dry twigs’, chryzantema ‘chrysanthemum’, chwast ‘weed’, cis ‘yew’, cyprys ‘cypress’, darnia [a lemmatisation error; should be darń ‘sward’], drzewko ‘(small) tree’, drzewostan ‘forestation’, fiołek ‘violet’, gałązka ‘twig’, gęstwina ‘thicket’, girlanda ‘garland’, głóg ‘hawthorn’, goździk ‘carnation’, hiacynt ‘hyacinth’, irys ‘iris’, jabłoń ‘apple tree’, jawor ‘sycamore maple’, jemioła ‘mistletoe’, jeżyna ‘blackberry’, jodła ‘fir’, kaktus ‘cactus’, klon ‘maple’, koniczyna ‘clover’, konwalia ‘lily of the valley’, kora ‘bark’, korzenie ‘roots’, krokus ‘crocus’, kwiatek ‘(small) flower’, leszczyna ‘hazel’, lilia ‘lily’, listowie ‘foliage’, łyko ‘phloem’, mech ‘moss’, modrzew ‘larch’, narcyz ‘narcissus’, orchidea ‘orchid’, oset ‘thistle’, osika ‘aspen’, palma ‘palm tree’, papirus ‘papyrus’, paproć ‘fern’, platan ‘plane tree’, pnącz [a lemmatisation error; should be pnącze ‘creeper’], pnącze ‘creeper’, pokrzywa ‘nettle’, polano ‘log’, rododendron ‘rhododendron’, roślinność ‘vegetation’, sadzonka ‘seedling’, sitowie ‘rush’, słonecznik ‘sunflower’, sosna ‘pine’, stokrotka ‘daisy’, szałwia ‘sage’, szyszka ‘cone’, ściernisko ‘stubble field’, świerk ‘spruce’, topola ‘poplar’, trzcina ‘reed’, tulipan ‘tulip’, wiąz ‘elm’, wić ‘runner’, wieniec ‘wreath’, wierzba ‘willow’, winorośl ‘grape vine’, wodorost ‘alga, seaweed’, wrzos ‘heather’, zarośle ‘thicket’, źdźbło ‘blade (of grass)’, żonkil ‘daffodil’, żywopłot ‘hedge’

aktówka ‘briefcase’, atrament ‘ink’, bagaż ‘luggage’, bibuła ‘blotting paper’, bibułka ‘tissue paper’, bloczek ‘notepad’, cerata ‘oilcloth’, chlebak ‘haversack’, cyrkiel ‘compass (for drawing)’, długopis ‘ball-point pen’, dzianina ‘hosiery’, filc ‘felt’, grzechotka ‘rattle’, gumka ‘eraser’, hamak ‘hammock’, juk ‘saddle bag’, kabura ‘holster’, karton ‘carton’, klocek ‘(toy) block’, kojec ‘pen (for a child)’, kołyska ‘cradle’, koperta ‘envelope’, kredka ‘crayon’, leżak ‘deck chair’, łóżeczko ‘(small) bed’, markiza ‘awning’, mat ‘mate, matte’, mata ‘mat’, muślin ‘muslin’, namiot ‘tent’, nosze ‘stretchers’, nożyczki ‘scissors’, ołówek ‘pencil’, otomana ‘sofa’, paczuszka ‘(small) package’, pakunek ‘package’, pergamin ‘parchment’, perkal ‘gingham’, pędzel ‘brush’, pierzyna ‘duvet’, plastelina ‘plasticine’, poduszeczka ‘(small) pillow’, przybór ‘implement’, saszetka ‘sachet’, segregator ‘binder’, siodełko ‘seat’, skakanka ‘skip rope’, skoroszyt ‘folder’, spinacz ‘paper clip’, stalówka ‘nib’, stołek ‘stool’, szala ‘tray (in scales)’, sztaluga ‘easel’, tłumok ‘(large) bundle’, tobół ‘(large) bundle’, tornister ‘knapsack’, tusz ‘ink’, włóczka ‘yarn’, woreczek ‘(small) sack’, worek ‘sack’, wór ‘(large) sack’, wyściółka ‘lining, padding’, zawiniątko ‘bundle’, zwitek ‘scroll, wad, roll’, zwitka [a lemmatisation error; should be zwitek ‘scroll, wad, roll’]

Figure 5.1: Examples of groups of new lemmas created by automatic clustering

It was very hard to find a pure one-domain group, but most groups seem to fall into only two or three domains. Figure 5.1 shows two examples. This had a positive influence on the expansion process. Skimming a group usually sufficed to identify its main domains, so we could direct the expansion process first toward the missing parts in the hypernymy structure. The linguists could concentrate on a few domains and gradually expand the given hypernymy subgraphs while working with a given group. After adding some LUs to the given domain, AAA could be rerun to recompute suggestions for the still unedited lemmas; in narrow domains with deeper hypernymy structure, such as food or clothing, this increased the accuracy of suggestions and facilitated the linguists' work.
Later on, experienced linguists were able to decide for which group the slightly time-consuming recomputation was worth doing.

WNW was designed as a plug-in to the wordnet editor (Section 2.4). AAA-generated suggestions (step 6), presented as shown in Section 4.5.4, appear in a panel,