3.4. Measures of Semantic Relatedness

… correspond to all context types taken into consideration. The initial values of features are the frequencies of occurrences of the given target LU in the corresponding lexico-syntactic contexts. This raw information, however, is very noisy and not reliable.

• Some features deliver little or no information. Consider, for example, very frequent adjectives with vague meaning, such as "nowy" (new, 627874 occurrences in corpora), "wielki" (large, great, 615785) or "mój" (mine, 592976), or very frequent verbs that occur with many subjects, such as "być" (be, 6944204), "mieć" (have, 2332773), and so on. They result in large values of the corresponding features (frequencies), occur with the majority of target LUs, and make every LU related to every other LU.

• Accidental feature values, caused by very infrequent, mostly singular occurrences of the corresponding lexical elements with the target LUs, have negligible influence on the well-described frequent target LUs with many non-zero features, but can relate some infrequent LUs to many others just because of a few accidental feature values, e.g. the association of "pies" (dog) with "żelbeton" (reinforced concrete) found by the noun-coordination constraint (NcC).

• Raw feature values can also be biased by corpora in two ways: the values of features from some subset can be increased (e.g., some specific modifiers repeatedly used across some set of documents), and for some subset of the target LUs the average level of the values of their features can be increased in comparison to the rest of the target LUs (e.g., because LUs from the given subset occur more frequently in the corpora).

Thus, most MSR extraction methods transform the initial raw frequencies before the final computation of the MSR value. Such a transformation is typically a combination of filtering and weighting. The quality and behaviour of an MSR depend to a large extent on the transformation applied. For example, in (Piasecki et al., 2007a), the increase from 82.72% accuracy on WBST+H to 86.09% was achieved only by changing the transformation.

Transformations proposed in the literature usually combine initial filtering, based on simple heuristics referring to frequencies, with weighting based on an analysis of the statistical association between the given target LU and its features. The filtering functions can be applied to both target LUs and features, in order to remove target LUs for which we do not have enough information, or to exclude from the description features that do not deliver enough information. Mostly, a filtering function is defined as a simple comparison with a threshold. In the case of target LUs, the LU frequency and the number of non-zero features are tested, e.g. Lin (1998) filtered out all target LUs occurring fewer than 100 times in a corpus of about 64 million words. For features, elimination …
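To make the filter-then-weight transformation concrete, the sketch below shows one possible realisation in Python: target LUs whose frequency or number of non-zero features falls below a threshold are dropped (cf. Lin's cutoff of 100 occurrences), and the surviving raw counts are reweighted with positive pointwise mutual information. The data layout, function names, thresholds, feature labels and the choice of PMI are illustrative assumptions, not the exact scheme evaluated in this work.

```python
from collections import defaultdict
from math import log

def filter_counts(counts, min_lu_freq=100, min_nonzero_features=5):
    """Drop target LUs that occur too rarely or have too few non-zero
    features (cf. Lin (1998), who removed LUs seen fewer than 100 times)."""
    filtered = {}
    for lu, feats in counts.items():
        if sum(feats.values()) >= min_lu_freq and len(feats) >= min_nonzero_features:
            filtered[lu] = feats
    return filtered

def pmi_weight(counts):
    """Replace raw frequencies with positive pointwise mutual information,
    one common way to down-weight ubiquitous, uninformative features."""
    lu_totals = {lu: sum(f.values()) for lu, f in counts.items()}
    feat_totals = defaultdict(float)
    for feats in counts.values():
        for feat, c in feats.items():
            feat_totals[feat] += c
    grand_total = sum(lu_totals.values())

    weighted = {}
    for lu, feats in counts.items():
        weighted[lu] = {}
        for feat, c in feats.items():
            pmi = log((c / grand_total) /
                      ((lu_totals[lu] / grand_total) * (feat_totals[feat] / grand_total)))
            weighted[lu][feat] = max(pmi, 0.0)  # keep only positive associations
    return weighted

# Toy input: raw frequencies of target LUs in lexico-syntactic contexts.
# Feature names (e.g. "NcC:kot" for a noun-coordination context) are
# invented here purely for illustration.
raw = {
    "pies":     {"NcC:kot": 42, "mod:nowy": 310, "mod:wierny": 17},
    "kot":      {"NcC:pies": 42, "mod:nowy": 95, "mod:czarny": 21},
    "żelbeton": {"NcC:pies": 1, "mod:zbrojony": 3},
}

transformed = pmi_weight(filter_counts(raw, min_lu_freq=20, min_nonzero_features=2))
print(sorted(transformed))             # 'żelbeton' is filtered out
print(transformed["pies"]["NcC:kot"])  # positive weight for an informative feature
```

In this toy input, "żelbeton" is removed by the frequency filter, so its single accidental co-occurrence with "pies" in a noun-coordination context never contributes to the relatedness computation, which is exactly the kind of spurious association the filtering step is meant to neutralise.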
