06.08.2015 Views

A Wordnet from the Ground Up


4.4. Benefits of Extracted Patterns

priori by a human evaluator [10]. It is worth considering that the limited size of plWordNet can influence precision negatively. Some LUs are either not yet present, or their synsets and hypernymic links are incomplete. This precision is computed for each element on the list of instances.

2. Precision based on human judgement is evaluated according to a sample randomly drawn from the list of instances. Due to its cost, this evaluation measure was used only for selected experiments, see below. The error level of the sample was 3% and the confidence level was 95% (Israel, 1992).

3. Recall based on plWordNet is evaluated on the set of word pairs generated from plWordNet. This measure does not describe the recall in relation to the corpus used. For many experiments, recall is also presented in the cut-off version.

The experiments were performed on exactly the same three corpora as for MSR extraction (Section 3.4.5):

1. IPIC (Przepiórkowski, 2004) with ≈ 254 million tokens,

2. the corpus of the electronic edition of a Polish newspaper Rzeczpospolita from January 1993 to March 2002 (Rzeczpospolita, 2008) with ≈ 113 million tokens,

3. a corpus of large texts in Polish collected from the Internet, with ≈ 214 million tokens.

In contrast with the experiments with the manually constructed patterns, there was no limit on the set of nominal LUs processed. Only the set of possible multiword LUs was predefined, to accommodate the lexicon-based method of recognising multiword LUs.

We tested several configurations:

ESP- – Espresso without generic patterns,

ESP-nm – Espresso without generic patterns, but with the extended reliability measure (4.6),

ESPmorf- – Espresso without generic patterns, but with additional morphological information encoding part of speech and values of selected grammatical categories: case (nouns), case and degree (adjectives), aspect (verbs),

[10] The list had resulted from the preliminary experiments and was subsequently kept in use because of the limited size of plWordNet.
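The evaluation measures described above can be illustrated with a small sketch. The text does not say which formula from Israel (1992) was used to size the human-judged sample, so the code below assumes Cochran's formula for a proportion with maximum variability (p = 0.5) and an optional finite population correction. The function names, the population size in the usage note, and the toy word pairs are hypothetical and serve only as illustration; precision and recall are computed here as plain set overlap against a reference set of pairs, which is a simplification of the plWordNet-based measures.

```python
import math
from statistics import NormalDist


def sample_size(margin_of_error, confidence, population=None, p=0.5):
    """Sample size for estimating a proportion (Cochran's formula, assumed).

    margin_of_error: e.g. 0.03 for a 3% error level
    confidence:      e.g. 0.95 for a 95% confidence level
    population:      size of the list sampled from; None means "very large"
    p:               assumed proportion; 0.5 is the most conservative choice
    """
    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    n0 = z ** 2 * p * (1 - p) / margin_of_error ** 2
    if population is not None:
        # Finite population correction.
        n0 = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n0)


def precision_recall(extracted, reference):
    """Precision and recall of extracted pairs against a reference set of pairs."""
    extracted, reference = set(extracted), set(reference)
    hits = extracted & reference
    precision = len(hits) / len(extracted) if extracted else 0.0
    recall = len(hits) / len(reference) if reference else 0.0
    return precision, recall


if __name__ == "__main__":
    # Error level 3%, confidence level 95%, as stated in the text.
    print(sample_size(0.03, 0.95))

    # Toy (hyponym, hypernym) pairs, invented for this example only.
    extracted = {("pies", "zwierzę"), ("kot", "zwierzę"), ("stół", "zwierzę")}
    reference = {("pies", "zwierzę"), ("kot", "zwierzę"),
                 ("dąb", "drzewo"), ("róża", "kwiat")}
    print(precision_recall(extracted, reference))
```

Under these assumptions, a 3% error level at 95% confidence calls for 1068 judged instances when the list is very large, and fewer when the list is short (for example, `sample_size(0.03, 0.95, population=10000)`). As the text notes, recall measured this way is relative to the pairs derivable from the wordnet, not to the corpus.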
