06.08.2015 Views

A Wordnet from the Ground Up

A Wordnet from the Ground Up - School of Information Technology ...


3.5. Sense Discovery by Clustering

• the ratio of committee importance:

$$w(f_i) = w(f_i) - w(f_i)\,\frac{v_c(f_i)}{\sum v_c(\bullet)}$$

In the minimal value heuristic, we make quite a strong assumption that a feature is associated only with one sense on one of the sides: LU or committee. The lower value identifies the right side. The ratio heuristic is based on a weaker assumption: the feature corresponds to the committee description only to some extent.

3.5.4 Benefits of discovered senses for constructing a wordnet

During the reimplementation of CBC for Polish we stumbled upon two problems. There are significant typological differences between Polish and English, and the availability of language tools differs. For example, Polish, unlike English, is generally a free word-order language; much syntactic information is encoded by rich inflection. This makes the construction of even a shallow parser for Polish more difficult than for English, and CBC begins by running a dependency parser on the corpus. As shown in Section 3.4 and in (Piasecki et al., 2007a, Broda et al., 2008), a similar problem can be solved by applying several types of lexico-morphosyntactic constraints. This identifies a subset of structural dependencies mainly from morphosyntactic agreement among words in a sentence and a few positional features like noun-noun head/modifier pairs. The constructed MSR gave results comparable with the results achieved by humans in the same task (Piasecki et al., 2007b). We therefore assumed that the constructed MSR is at least comparable in quality to that used in (Pantel, 2003, Lin and Pantel, 2002).
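
The two feature-removal heuristics discussed before Section 3.5.4 can be illustrated with a minimal Python sketch. It is not the implementation used in the experiments. The function names, the dictionary representation of feature vectors and the sample weights are hypothetical. The ratio update follows the formula given above, with the sum assumed to run over all features of the committee vector. The minimal value update is an assumed reading of the prose (the lower of the two values is subtracted from the LU weight).

```python
from typing import Dict

FeatureVector = Dict[str, float]  # hypothetical representation: feature -> weight


def remove_min_value(lu: FeatureVector, committee: FeatureVector) -> FeatureVector:
    """Minimal value heuristic (assumed form): w(f) = w(f) - min(w(f), v_c(f))."""
    return {f: w - min(w, committee.get(f, 0.0)) for f, w in lu.items()}


def remove_ratio(lu: FeatureVector, committee: FeatureVector) -> FeatureVector:
    """Ratio heuristic: w(f) = w(f) - w(f) * v_c(f) / sum of all v_c values."""
    total = sum(committee.values())
    if total == 0:
        # An empty or all-zero committee vector removes nothing.
        return dict(lu)
    return {f: w - w * committee.get(f, 0.0) / total for f, w in lu.items()}


# Invented weights, for illustration only.
lu_vector = {"red": 4.0, "fast": 2.0, "sweet": 1.0}
committee_vector = {"red": 1.0, "fast": 3.0}

after_min = remove_min_value(lu_vector, committee_vector)
after_ratio = remove_ratio(lu_vector, committee_vector)
```

With these invented weights, the minimal value variant removes the feature "fast" from the LU entirely, because the committee weight exceeds the LU weight. The ratio variant only reduces it in proportion to the share of that feature in the committee description, which reflects the weaker assumption.
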
We adopted the constraint-based approach here, applying a subset of lexico-morphosyntactic constraints as in Section 3.4.3: noun modification by a specific adjective or a specific adjectival participle (AdjC), and noun co-ordination with a specific noun (NcC).

Evaluation of the extracted word senses proposed in (Lin and Pantel, 2002, Pantel, 2003) is based on comparing the extracted senses with those defined for the same words in PWN. It is assumed that a correct sense of word w is described by a word group c containing w if a PWN synset s containing w is sufficiently similar to c. The latter condition is represented by another threshold θ.

Similarity between wordnet synsets is central to the evaluation proposed in (Lin and Pantel, 2002, Pantel, 2003). Similarity was defined through probabilities assigned to synsets and derived from a corpus annotated with synsets. This kind of synset similarity is very difficult to estimate for languages for which there is no such corpus, as is the case for Polish. In order to avoid any kind of unsupervised estimation of synset
