A Wordnet from the Ground Up


3.5. Sense Discovery by Clustering

After the creation of clusters, the weight w is assigned using the χ² test, which checks whether there is a bias in the occurrences of the lemma with the group. Within our approach, the weight for a lemma is calculated by combining the methods of Indyka-Piasecka (2004) and Matsuo and Ishizuka (2004):

$$w_l = \alpha \cdot \min \mathrm{tf.idf}_l + \beta \cdot cv_l + \gamma \cdot \chi^2_l \qquad (3.8)$$

where min tf.idf_l is the minimal tf.idf weight for the given term l across the documents in a cluster, and α, β, γ are parameters controlling the impact of each measure on the final weight. The words assigned the highest weights are used as labels for the group of documents in the cluster tree.

3.5.2 Benefits of document clusters for constructing a wordnet

Our ultimate goal in document clustering was to obtain the basic structure for plWordNet. Document group labels could be used as synsets and the cluster tree as a hypernymy hierarchy.

We evaluated our approach on plWordNet. The automatically created thesaurus was compared with the plWordNet hypernymy hierarchy. This failed: only 86 hypernymic instances (word pairs) were present in the thesaurus, fewer than 1% of all relations. Clustering whole documents might be one reason for the low accuracy, but experiments with document segmentation decreased the quality of clustering (Broda, 2007; Broda and Piasecki, 2008a). On the other hand, keyword extraction methods developed primarily for information retrieval are not suitable for the discovery of relations between words that describe different groups of documents.

The extracted group labels are still quite descriptive. For example, a group of documents about “interventionist purchase of grain and harvest in the area of Małopolska” is labelled with zboże (grain), pszenica (wheat), tona (tonne), rolnik (farmer) and agencja (agency). Another possible use of the extracted words is to measure the degree of polysemy, because different meanings of a word occur in different branches of the hierarchy.

3.5.3 Clustering by Committee as an example of word sense discovery

A good MSR can provide valuable information about word similarity during wordnet construction. For every word x, an MSR can produce a list of its k most similar words (denoted as MSRlist(x, k)). Because of the nature of MSRs, those lists do not consist only of words linked by a single lexico-semantic relation (Section 3.4). Some of the words on those similarity lists may even be unrelated to the target word. Choosing the right value for k can also be problematic. Not only does it depend on the MSR algorithm,
