06.08.2015 Views

A Wordnet from the Ground Up

A Wordnet from the Ground Up - School of Information Technology ...


3.5. Sense Discovery by Clustering

documents can be used; for details see (Guha et al., 2000) and (Broda and Piasecki, 2008a). Even if documents are not very similar, they can form a cluster in ROCK if they have many common neighbours. This makes clusters of unusual shapes possible.

The other clustering algorithm we considered is called Growing Hierarchical Self-Organising Map [GHSOM] (Rauber et al., 2002), an extension of Self-Organising Map [SOM] (Kohonen et al., 2000). SOM is an artificial neural network. Every neuron consists of a weight vector and a vector of positions in the map²⁰. Training SOM is done in an unsupervised manner by applying a winner-takes-all strategy. Every document is delivered to the network several times. The neuron most similar to a given document is the winner. Weights of the winning neuron and neurons in its neighbourhood²¹ are updated to be even more similar to the input pattern. The learning algorithm is constructed so that the neighbourhood of a neuron and the rate of weight updating decrease over time.

The GHSOM algorithm addresses one of SOM’s most important drawbacks – the a priori definition of the map structure. Rauber et al. (2002) proposed an algorithm for growing SOM both in terms of the number of map neurons and the hierarchy.

Clustering results will be used in the extraction of polysemy information, labelling clusters with keywords and generation of a basic structure for a wordnet, so we wanted to be sure to select clustering algorithms that perform well on collections of Polish documents.

There exist a few approaches to the evaluation of clustering (Forster, 2006). For example, one can study the theoretical properties of an algorithm, or measure some mathematical properties of the resulting clusters. In some domains those methods can be appropriate, but we argue that for the domain of documents the most suitable evaluation method is by referring to external criteria, such as a comparison of the results with manually created pre-existing categories.

Our evaluation used parts of the Polish daily paper “Dziennik Polski”, included in the IPI PAN Corpus [IPIC] (Przepiórkowski, 2004). It has been manually partitioned into categories: Economy, Sport, Magazine, Home News, and so on. Both ROCK and GHSOM gave satisfactory results in comparison to the “Dziennik Polski” data (Broda and Piasecki, 2008a). A manual inspection of the produced clusters confirmed those results. We did not find any mixing of major topics in groups; for example, there was no document from Sport put into clusters with documents talking about Economy. The algorithms also found more categories than are actually present in the corpus. For example, different sport disciplines were partitioned into separate groups. An important

²⁰ For us, a map is a two-dimensional grid, with neurons placed in the nodes of the grid. This is not the only possible representation for SOM: a map can be hexagon-based or neurons can be placed in a three-dimensional space.

²¹ Note that this neighbourhood is different from neighbourhood in ROCK. In SOM it is defined simply as a certain number of neurons in the map that are close to the winning neuron.
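The remark that documents which are not very similar can still end up in one ROCK cluster when they have many common neighbours can be illustrated with a small sketch. It only counts common neighbours for each pair of documents and is not the full ROCK algorithm. Representing documents as sets of words, using Jaccard similarity, the threshold `theta`, and the function names are all assumptions made for illustration, as is the choice not to count a document as its own neighbour.

```python
def jaccard(a, b):
    """Jaccard similarity of two word sets (an assumed similarity measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def link_counts(docs, theta):
    """Count common neighbours for every pair of documents.

    Two documents are neighbours when their similarity is at least theta.
    A document is not treated as its own neighbour (an assumption).
    """
    n = len(docs)
    neighbours = [
        {j for j in range(n) if j != i and jaccard(docs[i], docs[j]) >= theta}
        for i in range(n)
    ]
    return {
        (i, j): len(neighbours[i] & neighbours[j])
        for i in range(n)
        for j in range(i + 1, n)
    }


# Documents 0 and 1 share only one word, but both are close to document 2.
docs = [{"a", "b", "c"}, {"c", "d", "e"}, {"b", "c", "d"}]
links = link_counts(docs, theta=0.4)
```

Here documents 0 and 1 fall below the threshold when compared directly, yet they share document 2 as a neighbour, so a link-based criterion can still bring them together.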
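The SOM training procedure described above (winner-takes-all selection, updating the winner and its neighbourhood, and shrinking both the neighbourhood and the update rate over time) can be sketched as follows. This is a minimal illustration and not the implementation used in the experiments: the linear decay schedule, Euclidean distance, the rectangular grid, and all function and parameter names are assumptions. The neighbourhood is taken here as all neurons within a grid radius of the winner.

```python
import math
import random


def best_matching_unit(weights, doc):
    """Index of the neuron whose weights are closest to doc (the winner)."""
    return min(range(len(weights)), key=lambda n: math.dist(weights[n], doc))


def train_som(docs, rows, cols, epochs=20, lr0=0.5, radius0=None, seed=0):
    """Train a small rectangular SOM on equal-length document vectors."""
    rng = random.Random(seed)
    dim = len(docs[0])
    if radius0 is None:
        radius0 = max(rows, cols) / 2.0
    # Each neuron has a fixed position in the map and a trainable weight vector.
    positions = [(r, c) for r in range(rows) for c in range(cols)]
    weights = [[rng.random() for _ in range(dim)] for _ in positions]

    for epoch in range(epochs):
        # Assumed schedule: update rate and neighbourhood radius decay linearly.
        decay = 1.0 - epoch / epochs
        lr = lr0 * decay
        radius = max(radius0 * decay, 0.5)
        # Every document is presented once per epoch, in random order.
        order = list(range(len(docs)))
        rng.shuffle(order)
        for i in order:
            doc = docs[i]
            win = best_matching_unit(weights, doc)
            for n, pos in enumerate(positions):
                if math.dist(pos, positions[win]) <= radius:
                    # Pull the neuron's weights towards the input pattern.
                    weights[n] = [w + lr * (x - w) for w, x in zip(weights[n], doc)]
    return positions, weights


# Toy usage: two groups of two-dimensional "documents" on a 1 x 2 map.
toy_docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
positions, weights = train_som(toy_docs, rows=1, cols=2)
winners = [best_matching_unit(weights, d) for d in toy_docs]
```

After training, documents that map to the same winning neuron can be read as one cluster. GHSOM's growing of the map and of the hierarchy is not covered by this sketch.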
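Evaluation by external criteria, as argued for above, compares cluster assignments with manually created categories. The text does not say which measure was used, so the sketch below uses purity purely as an assumed example of such a criterion: the fraction of documents that carry the majority category of their cluster. The function name and the toy labels are hypothetical.

```python
from collections import Counter, defaultdict


def purity(cluster_ids, gold_labels):
    """Fraction of documents that share their cluster's majority gold label."""
    if not gold_labels or len(cluster_ids) != len(gold_labels):
        raise ValueError("need one cluster id and one gold label per document")
    by_cluster = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, gold_labels):
        by_cluster[cluster][label] += 1
    # For each cluster, count the documents carrying its most frequent label.
    majority_total = sum(c.most_common(1)[0][1] for c in by_cluster.values())
    return majority_total / len(gold_labels)


# Sport is split into two clusters, but no cluster mixes Sport with Economy.
split_score = purity(
    [0, 0, 1, 1, 2, 2],
    ["Sport", "Sport", "Sport", "Sport", "Economy", "Economy"],
)

# One cluster mixes the two categories.
mixed_score = purity([0, 0, 0, 1], ["Sport", "Sport", "Economy", "Economy"])
```

Purity does not penalise splitting one manual category into several clusters, which matches the situation described above where different sport disciplines formed separate groups. It does drop when major topics are mixed within a cluster.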
