A Wordnet from the Ground Up

A Wordnet from the Ground Up - School of Information Technology ...


Chapter 3. Discovering Semantic Relatedness

drawback of ROCK is that it sometimes produces a very deep and unbalanced hierarchy. On the other hand, GHSOM assigned pairs of documents to one cluster which did not appear together in any manually created category more often than ROCK did.

We wanted to label document groups clustered in a hierarchical tree with representative words. Words describing groups of documents closer to the root of the tree should be more general than words used for the documents in the leaves. Ideally, we would obtain a basic hypernymy structure for plWordNet (or at least instances of the is-a relation) out of the assigned labels.

Keyword extraction can be supervised or unsupervised. Supervised algorithms require ample manually constructed resources. We applied only such unsupervised methods as try to capture statistical properties of word occurrences in order to identify the words which best describe a given document. The statistics can be counted locally, using data from a single document only, or estimated from a large body of text. To benefit from both local and global strategies, we extended the method proposed by Indyka-Piasecka (2004) with the algorithm of Matsuo and Ishizuka (2004) into a hybrid keyword extraction method.

Indyka-Piasecka (2004) assigns a weight w to every lemma l that occurs in each document of a group. Additionally, lemmas are filtered on the basis of their document frequency df_l, that is, the number of documents in which lemma l occurred. Neither rare nor frequent lemmas are good discriminators of document content (cf. Indyka-Piasecka, 2004).
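This document-frequency filter can be sketched as follows; the relative-frequency bounds `lo` and `hi` are illustrative assumptions, not values given in the text:

```python
def filter_by_df(df_counts, n_docs, lo=0.05, hi=0.5):
    """Discard lemmas that are too rare or too frequent to discriminate
    document content.  df_counts maps lemma -> document frequency df_l;
    the lo/hi relative-frequency bounds are illustrative assumptions."""
    return {l for l, df in df_counts.items() if lo <= df / n_docs <= hi}
```

For example, with 100 documents, a mid-frequency lemma is kept while a near-ubiquitous stopword and a one-off lemma are both discarded.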
The weight w is calculated using two weighting schemes, tf.idf and cue validity:

tf.idf_{l,d} = tf_{l,d} × log(N / df_l)    (3.6)

cv = tf_group / tf    (3.7)

where tf and df denote term frequency and document frequency, N is the number of documents, and tf_group is the lemma's term frequency within the document group.

Matsuo and Ishizuka (2004) used a three-step process to assign a weight to every lemma. First, all words in a document are reduced to their lemmas (basic morphological forms) and filtered on the basis of term frequency and a stoplist. Then lemmas in a document are clustered using two algorithms. If two lemmas have similar distributions, they belong to the same group. As a measure of the similarity of the probability distributions of two lemmas l_1 and l_2, Matsuo and Ishizuka (2004) used the Jensen–Shannon divergence.^22 Lemmas are also clustered when they follow a similar co-occurrence pattern with other lemmas. This can be measured using Mutual Information.

22 The Jensen–Shannon divergence is a symmetrised and smoothed version of the Kullback–Leibler divergence (Manning and Schütze, 2001).
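The two weighting schemes of Eqs. 3.6 and 3.7 can be sketched directly; the function and parameter names below are my own, with tf_total standing for the lemma's term frequency in the whole collection:

```python
import math

def tf_idf(tf_ld, df_l, n_docs):
    """tf.idf weight of lemma l in document d (Eq. 3.6):
    tf.idf_{l,d} = tf_{l,d} * log(N / df_l)."""
    return tf_ld * math.log(n_docs / df_l)

def cue_validity(tf_group, tf_total):
    """Cue validity (Eq. 3.7): the lemma's term frequency inside the
    document group relative to its frequency in the whole collection."""
    return tf_group / tf_total
```

Note that a lemma occurring in every document gets tf.idf = 0 (log(N/N) = 0), which is consistent with filtering out overly frequent lemmas.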
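The Jensen–Shannon divergence used in the distributional clustering step is, as the footnote says, a symmetrised and smoothed Kullback–Leibler divergence: each distribution is compared against their mixture. A minimal sketch over discrete distributions given as equal-length probability lists:

```python
import math

def kl(p, q):
    """Kullback–Leibler divergence of discrete distributions p and q
    (natural log); terms with p_i = 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen–Shannon divergence: average KL divergence of p and q
    against the mixture m = (p + q) / 2.  The mixture smooths away
    zero probabilities, so jsd is always finite and symmetric."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

Identical distributions give a divergence of 0, and disjoint distributions give the maximum value log 2 (with natural logarithms).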
