A Wordnet from the Ground Up

3.5. Sense Discovery by Clustering

After the creation of clusters, the weight $w$ is assigned using the $\chi^2$ test, which checks whether there is a bias in the occurrences of the lemma with the group. Within our approach, the weight for a lemma is calculated by combining the methods of Indyka-Piasecka (2004) and Matsuo and Ishizuka (2004):

$w_l = \alpha \cdot \mathrm{min\,tf.idf}_l + \beta \cdot cv_l + \gamma \cdot \chi^2_l$, (3.8)

where $\mathrm{min\,tf.idf}_l$ is the minimal tf.idf weight for the given term $l$ across the documents in a cluster, and $\alpha$, $\beta$, $\gamma$ are parameters controlling the impact of every measure on the final weight. Words which are assigned the highest weights are used as labels for the group of documents in the cluster tree.

3.5.2 Benefits of document clusters for constructing a wordnet

Our ultimate goal in document clustering was to obtain the basic structure for plWordNet. Document group labels could be used as synsets and the cluster tree as a hypernymy hierarchy.

We evaluated our approach on plWordNet. The automatically created thesaurus was compared with the plWordNet hypernymy hierarchy. This failed: only 86 hypernymic instances (word pairs) were present in the thesaurus, fewer than 1% of all relations. Clustering whole documents might be a reason for the low accuracy, but experiments with document segmentation decreased the quality of clustering (Broda, 2007, Broda and Piasecki, 2008a). On the other hand, keyword extraction methods developed primarily for information retrieval are not suitable for the discovery of relations between words that describe different groups of documents.

The extracted group labels are still quite descriptive. For example, a group of documents about “interventionist purchase of grain and harvest in the area of Małopolska” is labelled with zboże (grain), pszenica (wheat), tona (tonne), rolnik (farmer) and agencja (agency).
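As a rough illustration of Equation 3.8, the sketch below combines the three per-lemma scores into one weight and picks the highest-weighted lemmas as cluster labels. It assumes the three scores have already been computed elsewhere. The names LemmaStats, label_weight and top_labels, the default parameter values and the numbers in the example are all hypothetical and do not come from the experiments described here.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class LemmaStats:
    """Per-lemma scores for one cluster (all assumed to be precomputed)."""
    min_tfidf: float  # minimal tf.idf of the lemma across the cluster's documents
    cv: float         # the cv_l term of Eq. 3.8, taken as given
    chi2: float       # chi-squared statistic for the lemma/cluster bias


def label_weight(s: LemmaStats, alpha: float, beta: float, gamma: float) -> float:
    """Linear combination of the three scores, following the shape of Eq. 3.8."""
    return alpha * s.min_tfidf + beta * s.cv + gamma * s.chi2


def top_labels(stats: Dict[str, LemmaStats], n: int,
               alpha: float = 1.0, beta: float = 1.0, gamma: float = 1.0) -> List[str]:
    """Return the n lemmas with the highest combined weight."""
    ranked = sorted(
        stats,
        key=lambda lemma: label_weight(stats[lemma], alpha, beta, gamma),
        reverse=True,
    )
    return ranked[:n]


if __name__ == "__main__":
    # Made-up scores, for illustration only.
    example = {
        "zboże": LemmaStats(min_tfidf=0.4, cv=0.2, chi2=3.0),
        "pszenica": LemmaStats(min_tfidf=0.3, cv=0.3, chi2=2.0),
        "tona": LemmaStats(min_tfidf=0.1, cv=0.1, chi2=0.5),
    }
    print(top_labels(example, n=2))
```

Because the three scores live on different scales, the choice of α, β and γ effectively decides which measure dominates the ranking; the defaults of 1.0 here are placeholders only.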
Another possible use of extracted words is to measure the degree of polysemy, because different meanings of a word occur in different branches of the hierarchy.

3.5.3 Clustering by Committee as an example of word sense discovery

A good MSR can provide valuable information about word similarity during wordnet construction. For every word x, an MSR can produce a list of its k most similar words (denoted as MSRlist(x, k)). Because of the nature of MSRs, those lists do not consist only of words linked by a single lexico-semantic relation (Section 3.4). Some of the words on those similarity lists can even be unrelated to the target word. Choosing the right value for k can also be problematic. Not only does it depend on the MSR algorithm,
but also the training phase can influence it. Worse still, the value of a “good” k can change with the word x for the same MSR.

Clustering techniques may help create better lists or groups of words. We would like to find a method that identifies tightly interlinked groups of words representing near-synonymy and close hypernymy, which could be added to plWordNet with as little intervention from the linguists as possible.

Standard partitioning clustering methods are ill-suited to the task of clustering lemmas. They can assign one word to a single cluster only, which is problematic for polysemous lemmas. For lemmas that have one predominant meaning, only a cluster for that one sense will be created. For polysemous lemmas without a predominant meaning the situation may be even worse: such lemmas can lead to the creation of clusters that mix lemmas corresponding to more than one of the polysemous lemma's senses. That is why we need a specialized clustering method.

Several clustering algorithms for the task of grouping words have been discussed in the literature. Among them, Clustering by Committee [CBC] (Pantel, 2003, Lin and Pantel, 2002) has been reported to achieve especially good accuracy in an evaluation performed on the basis of PWN. It is often referred to in the literature as one of the most interesting clustering algorithms (Pedersen, 2006).

CBC relies only on a modestly advanced dependency parser and on an MSR based on Pointwise Mutual Information [PMI] extended with a discounting factor (Lin and Pantel, 2002). This MSR is a modification of Lin’s measure (Lin, 1998), analysed in Section 3.4 and in (Broda et al., 2008) in application to Polish. Both measures are close to the RWF measure (Piasecki et al., 2007a), which achieves good accuracy in synonymy tests generated from plWordNet (Section 3.3).

Applications of CBC to languages other than English are rarely reported in the literature. Tomuro et al.
(2007) mentioned briefly some experiments with Japanese, but gave no results. Differences between languages, and especially differences in resource availability for different languages, can affect the construction of the similarity function at the heart of CBC. CBC also crucially depends on several thresholds whose values were established experimentally. It is quite unclear to what extent they can be reused or re-discovered for different languages and language resources.

The CBC algorithm has been well described by its authors (Pantel, 2003, Lin and Pantel, 2002). We will therefore only outline its general organisation, following (Lin and Pantel, 2002) and emphasising selected key points. We have reformulated some steps in order to name consistently all thresholds present in the algorithm. Otherwise, we keep the original names.

I. Find most similar elements

1. For each word e in the input set E, select k most similar words consid-
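
The similarity lists MSRlist(x, k) from which step I.1 starts can be sketched as follows. This is only a minimal illustration: it assumes each word is already represented by a sparse feature vector, and it uses plain cosine similarity as a stand-in, not the PMI-based measure with a discounting factor that CBC relies on. The function names (cosine, msr_list, most_similar_for_all) and the data layout are hypothetical.

```python
import math
from typing import Callable, Dict, List, Tuple

Vector = Dict[str, float]  # sparse feature vector: feature name -> weight


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity of two sparse vectors (stand-in for a real MSR)."""
    dot = sum(w * v[f] for f, w in u.items() if f in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def msr_list(x: str, k: int, vectors: Dict[str, Vector],
             sim: Callable[[Vector, Vector], float] = cosine) -> List[Tuple[str, float]]:
    """Return the k words most similar to x, with their similarity scores."""
    scored = [(y, sim(vectors[x], vectors[y])) for y in vectors if y != x]
    # Highest similarity first; ties are broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:k]


def most_similar_for_all(vectors: Dict[str, Vector], k: int) -> Dict[str, List[Tuple[str, float]]]:
    """Build the similarity list for every word e in the input set E."""
    return {e: msr_list(e, k, vectors) for e in vectors}
```

Note that k is a single global parameter in this sketch, which is exactly the limitation raised above: a value of k that suits one word may be too large or too small for another.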