
Chapter 3. Discovering Semantic Relatedness

hand, the one-sense-per-discourse heuristic states that a lemma is used in only one of its senses within one discourse (Agirre and Edmonds, 2006). Combining both heuristics, we can assume that polysemous lemmas will be used in one dominant sense in one cluster.

One can hope to alleviate the ambiguity present in an MSR by incorporating knowledge of the domain of the documents that contain the given lemma. A document hierarchy could also be used as a base structure for a wordnet (or for parts of a wordnet). The approach based on document clustering in sense discovery is discussed in Section 3.5.1. Another remedy for the polysemy inherent in any MSR can be a specialized algorithm, for example Clustering by Committee [CBC] (Pantel, 2003). Section 3.5.3 describes an adaptation of CBC to Polish, and an extension.

3.5.1 Document clustering in sense discovery

Document clustering served two purposes in our work. First, we wanted to explore the possibility of extracting knowledge about the polysemy of lemmas from document groups. The one-sense-per-discourse heuristic suggests that a polysemous lemma will appear in a given domain in only one of its meanings. On the other hand, document clusters can be labelled with keywords – the most representative words for a document group. Arranging document clusters in a hierarchical tree could form the basic structure for a wordnet.

There are many clustering algorithms. Following a review of the possibilities (Jain et al., 1999; Forster, 2006; Broda, 2007), we chose two algorithms for further analysis and experiments.
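The labelling of document clusters with keywords can be sketched as follows. This is an illustrative toy example only, not the method used in this work: the relative-frequency scoring and the sample documents are assumptions made for the sketch.

```python
# Hypothetical sketch of cluster labelling: score each word by how much
# more frequent it is inside the cluster than in the whole collection,
# and keep the top-scoring words as the cluster's keywords.
from collections import Counter

def cluster_keywords(cluster_docs, all_docs, top_n=3):
    cluster_counts = Counter(w for doc in cluster_docs for w in doc)
    global_counts = Counter(w for doc in all_docs for w in doc)
    cluster_total = sum(cluster_counts.values())
    global_total = sum(global_counts.values())
    # relative-frequency ratio: > 1 means over-represented in the cluster
    score = {w: (cluster_counts[w] / cluster_total) /
                (global_counts[w] / global_total)
             for w in cluster_counts}
    ranked = sorted(score.items(), key=lambda kv: kv[1], reverse=True)
    return [w for w, _ in ranked[:top_n]]

docs = [["bank", "river", "water"], ["bank", "money", "loan"],
        ["river", "water", "fish"], ["money", "loan", "interest"]]
finance = [docs[1], docs[3]]
print(cluster_keywords(finance, docs))  # words over-represented in the cluster
```

On this toy collection the finance cluster is labelled with "money", "loan" and "interest", while "bank", which occurs equally often outside the cluster, is ranked last.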
We looked at the following properties of clustering algorithms: the ability to cluster high-dimensional data (such as documents represented by vectors), the ability to detect clusters of irregular shapes, and the possibility of building hierarchical trees.

There are many ways of representing documents for clustering (Forster, 2006; Broda, 2007). In this work we used the Vector Space Model, in which documents are represented as vectors in a high-dimensional space. Each dimension of the space corresponds to occurrences of a specific word, so the vectors store data describing the occurrences of words in documents.

RObust Clustering using linKs [ROCK] (Guha et al., 2000) follows the agglomerative clustering scheme. Initially, each document is in a one-element cluster. Pairs of the most similar clusters are merged iteratively. The algorithm differs from others in how the merging is decided: ROCK selects for merging the pair of clusters that maximises the number of links between documents. To avoid oversized clusters (or even putting all documents into one cluster), the algorithm imposes an expected number of links for a cluster of a given size.

The notion of links can be explained through common neighbours. Neighbourhood is defined by a similarity function: if two documents are similar enough, they are considered neighbours. If links replace similarity in clustering, global information about
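The neighbour and link definitions above can be sketched in a few lines. The Jaccard similarity over word sets, the threshold value and the sample documents below are illustrative assumptions for the sketch, not necessarily the choices made in the experiments:

```python
# Minimal sketch of ROCK-style neighbours and links.
from itertools import combinations

def jaccard(a, b):
    """Assumed similarity measure: overlap of two documents' word sets."""
    return len(a & b) / len(a | b)

def neighbours(docs, theta):
    """Two documents are neighbours if their similarity is at least theta."""
    nbr = {i: {i} for i in range(len(docs))}  # each document neighbours itself
    for i, j in combinations(range(len(docs)), 2):
        if jaccard(docs[i], docs[j]) >= theta:
            nbr[i].add(j)
            nbr[j].add(i)
    return nbr

def links(nbr, i, j):
    """link(i, j) = number of common neighbours of documents i and j."""
    return len(nbr[i] & nbr[j])

docs = [{"bank", "money", "loan"}, {"money", "loan", "credit"},
        {"river", "bank", "water"}, {"water", "river", "fish"}]
nbr = neighbours(docs, theta=0.4)
print(links(nbr, 0, 1))  # → 2: the two finance documents share neighbours
print(links(nbr, 0, 2))  # → 0: no common neighbours across topics
```

Here the two finance documents are linked through their shared neighbourhood, while a finance document and a river document have no common neighbours even though they share the word "bank"; this is the global information that links add over raw pairwise similarity.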
