
Chapter 3. Discovering Semantic Relatedness

hand, the one-sense-per-discourse heuristic states that a lemma is used in only one of its senses within one discourse (Agirre and Edmonds, 2006). Combining both heuristics, we can assume that polysemous lemmas will be used in one dominant sense in one cluster.

One can hope to alleviate the ambiguity present in an MSR by incorporating knowledge of the domain of the documents that contain the given lemma. A document hierarchy could also be used as a base structure for a wordnet (or for parts of a wordnet). The approach based on document clustering in sense discovery is discussed in Section 3.5.1. Another remedy for the polysemy inherent in any MSR can be a specialized algorithm, for example Clustering by Committee [CBC] (Pantel, 2003). Section 3.5.3 describes an adaptation of CBC to Polish, and an extension.

3.5.1 Document clustering in sense discovery

Document clustering served two purposes in our work. First, we wanted to explore the possibility of extracting knowledge about the polysemy of lemmas from document groups. The one-sense-per-discourse heuristic suggests that a polysemous lemma will appear in a given domain in only one of its meanings. On the other hand, document clusters can be labelled with keywords – the most representative words for a document group. Arranging document clusters in a hierarchical tree could form the basic structure for a wordnet.

There are many clustering algorithms. Following a review of the possibilities (Jain et al., 1999; Forster, 2006; Broda, 2007), we chose two algorithms for further analysis and experiments.
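The labelling of document clusters with keywords can be sketched as follows. This is an illustrative toy example only, not the method used in this work: the relative-frequency scoring and the sample documents are assumptions made for the sketch.

```python
# Hypothetical sketch of cluster labelling: score each word by how much
# more frequent it is inside the cluster than in the whole collection,
# and keep the top-scoring words as the cluster's keywords.
from collections import Counter

def cluster_keywords(cluster_docs, all_docs, top_n=3):
    cluster_counts = Counter(w for doc in cluster_docs for w in doc)
    global_counts = Counter(w for doc in all_docs for w in doc)
    cluster_total = sum(cluster_counts.values())
    global_total = sum(global_counts.values())
    # relative-frequency ratio: > 1 means over-represented in the cluster
    score = {w: (cluster_counts[w] / cluster_total) /
                (global_counts[w] / global_total)
             for w in cluster_counts}
    ranked = sorted(score.items(), key=lambda kv: kv[1], reverse=True)
    return [w for w, _ in ranked[:top_n]]

docs = [["bank", "river", "water"], ["bank", "money", "loan"],
        ["river", "water", "fish"], ["money", "loan", "interest"]]
finance = [docs[1], docs[3]]
print(cluster_keywords(finance, docs))  # words over-represented in the cluster
```

On this toy collection the finance cluster is labelled with "money", "loan" and "interest", while "bank", which occurs equally often outside the cluster, is ranked last.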
We looked at the following properties of clustering algorithms: the ability to cluster high-dimensional data (such as documents represented by vectors), the ability to detect clusters of irregular shapes, and the possibility of building hierarchical trees.

There are many ways of representing documents for clustering (Forster, 2006; Broda, 2007). In this work we used the Vector Space Model, in which documents are represented as vectors in a high-dimensional space. Each dimension of the space corresponds to occurrences of a specific word, so the vectors store data describing the occurrences of words in documents.

RObust Clustering using linKs [ROCK] (Guha et al., 2000) follows the agglomerative clustering scheme. Initially, each document is in a one-element cluster. Pairs of the most similar clusters are merged iteratively. The algorithm differs from others in how the merging is decided: ROCK selects for merging the pair of clusters that maximises the number of links between documents. To avoid oversized clusters (or even putting all documents into one cluster), the algorithm imposes an expected number of links for a cluster of a given size.

The notion of links can be explained through common neighbours. Neighbourhood is defined by a similarity function: if two documents are similar enough, they are considered neighbours. If links replace similarity in clustering, global information about
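The neighbour and link definitions above can be sketched in a few lines. The Jaccard similarity over word sets, the threshold value and the sample documents below are illustrative assumptions for the sketch, not necessarily the choices made in the experiments:

```python
# Minimal sketch of ROCK-style neighbours and links.
from itertools import combinations

def jaccard(a, b):
    """Assumed similarity measure: overlap of two documents' word sets."""
    return len(a & b) / len(a | b)

def neighbours(docs, theta):
    """Two documents are neighbours if their similarity is at least theta."""
    nbr = {i: {i} for i in range(len(docs))}  # each document neighbours itself
    for i, j in combinations(range(len(docs)), 2):
        if jaccard(docs[i], docs[j]) >= theta:
            nbr[i].add(j)
            nbr[j].add(i)
    return nbr

def links(nbr, i, j):
    """link(i, j) = number of common neighbours of documents i and j."""
    return len(nbr[i] & nbr[j])

docs = [{"bank", "money", "loan"}, {"money", "loan", "credit"},
        {"river", "bank", "water"}, {"water", "river", "fish"}]
nbr = neighbours(docs, theta=0.4)
print(links(nbr, 0, 1))  # → 2: the two finance documents share neighbours
print(links(nbr, 0, 2))  # → 0: no common neighbours across topics
```

Here the two finance documents are linked through their shared neighbourhood, while a finance document and a river document have no common neighbours even though they share the word "bank"; this is the global information that links add over raw pairwise similarity.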
