A Wordnet from the Ground Up

A Wordnet from the Ground Up - School of Information Technology ...

92 Chapter 3. Discovering Semantic Relatedness

but also the training phase can influence it. Worse still, the value of a "good" k can change with the word x for the same MSR.

Clustering techniques may help create better lists or groups of words. We would like to find a method that identifies lists of tightly interlinked word groups representing near-synonymy and close hypernymy, which could be added to plWordNet with as little intervention of the linguists as possible.

Standard partitioning clustering methods are ill-suited to the task of clustering lemmas. They can assign one word to only a single cluster, which is problematic for polysemous lemmas. For lemmas with one predominant meaning, only a cluster for that sense will be created. For polysemous lemmas without a predominant meaning the situation may be even less pleasant: such lemmas can lead to the creation of clusters that mix lemmas sharing more than one of the polysemous lemma's senses. That is why we need a specialized clustering method.

Several clustering algorithms for the task of grouping words have been discussed in the literature. Among them, Clustering by Committee [CBC] (Pantel, 2003; Lin and Pantel, 2002) has been reported to achieve especially good accuracy with respect to evaluation performed on the basis of PWN. It is often referred to in the literature as one of the most interesting clustering algorithms (Pedersen, 2006).

CBC relies only on a modestly advanced dependency parser and on an MSR based on Pointwise Mutual Information [PMI] extended with a discounting factor (Lin and Pantel, 2002). This MSR is a modification of Lin's measure (Lin, 1998), analysed in Section 3.4 and in (Broda et al., 2008) in application to Polish.
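The discounted PMI idea can be illustrated with a short sketch. This is our own illustrative reconstruction, not the book's code: raw PMI over (word, context-feature) counts is damped by a factor that shrinks scores based on low co-occurrence counts and rare marginals, following the form of the discounting factor in Lin and Pantel (2002); all names below are ours.

```python
import math
from collections import defaultdict

def discounted_pmi(freq):
    """PMI between words and context features, damped by a discounting
    factor to reduce PMI's bias towards rare events.
    freq: dict mapping (word, feature) -> co-occurrence count.
    Illustrative sketch only; names and structure are assumptions."""
    total = sum(freq.values())
    w_sum = defaultdict(float)   # marginal count per word
    f_sum = defaultdict(float)   # marginal count per feature
    for (w, f), c in freq.items():
        w_sum[w] += c
        f_sum[f] += c
    scores = {}
    for (w, f), c in freq.items():
        pmi = math.log((c * total) / (w_sum[w] * f_sum[f]))
        # Discount: small counts and rare marginals pull the score down.
        m = min(w_sum[w], f_sum[f])
        discount = (c / (c + 1)) * (m / (m + 1))
        scores[(w, f)] = pmi * discount
    return scores
```

A word-by-word similarity (for the k-most-similar step below) would then be computed over these feature vectors, e.g. with a cosine or Lin-style measure.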
Both measures are close to the RWF measure (Piasecki et al., 2007a), which achieves good accuracy in synonymy tests generated from plWordNet (Section 3.3).

Applications of CBC to languages other than English are rarely reported in the literature. Tomuro et al. (2007) briefly mentioned some experiments with Japanese, but gave no results. Differences between languages, and especially differences in resource availability for different languages, can affect the construction of the similarity function at the heart of CBC. CBC also crucially depends on several thresholds whose values were established experimentally. It is quite unclear to what extent they can be reused or re-discovered for different languages and language resources.

The CBC algorithm has been well described by its authors (Pantel, 2003; Lin and Pantel, 2002). We will therefore only outline its general organisation, following (Lin and Pantel, 2002) and emphasising selected key points. We have reformulated some steps in order to name consistently all thresholds present in the algorithm. Otherwise, we keep the original names.

I. Find most similar elements

1. For each word e in the input set E, select k most similar words consid-
