06.08.2015 Views

A Wordnet from the Ground Up

A Wordnet from the Ground Up - School of Information Technology ...



3.4. Measures of Semantic Relatedness

3.4.5 Benefits for wordnet construction

Our aim in MSR extraction was ultimately to use MSR as the basic knowledge source in semi-automatically expanding plWordNet with new synsets and relation instances. We planned to obtain material for new synsets by clustering target LUs by MSR-produced values. We discuss the experiments next, in Section 3.5. We also hoped to find on the MSRlist(x, k) lists new instances of wordnet relations between LUs present in plWordNet and new LUs, as well as among new LUs. We assumed manual verification of the results.

An MSR was constructed for a set of one-word and two-word nominal LUs¹⁶ including all LUs already present in the core plWordNet, as well as a set of LUs planned as the basis for expansion. The drawbacks of a pure corpus approach, discussed in Section 2.4, made us take a more dictionary-based approach in defining the lemma list for the expansion of plWordNet. In the end, 13285 nominal LUs were selected for extracting an MSR for nominals:

• 5340 nominal lemmas described in the core plWordNet,
• additional lemmas (further on referred to as new lemmas):
  – nominal lemmas acquired from a small Polish-English dictionary (Piotrowski and Saloni, 1999),
  – two-word LUs from a general dictionary of Polish (PWN, 2007),
  – the lemmas that occur over 1000 times in the largest available corpus of Polish, IPIC (Przepiórkowski, 2004).

The small Polish-English dictionary (Piotrowski and Saloni, 1999) was used as the main source, because its small size makes its entries close to the core of the Polish vocabulary.

First experiments on MSR extraction had been performed only on IPIC (Piasecki and Broda, 2007; Piasecki et al., 2007a,b). Later, when we collected other corpora, we observed a correlation between an increase in WBST+H and the increasing size of the overall corpus used. The final version of the nominal MSR has been extracted from three corpora:

• IPIC (about 254 million tokens) (Przepiórkowski, 2004); it is not balanced, but it covers a variety of genres: literature, poetry, newspapers, legal texts, stenographic parliamentary records and scientific texts;
• a corpus of the electronic edition of a Polish newspaper Rzeczpospolita from January 1993 to March 2002 (about 113 million tokens) (Rzeczpospolita, 2008);

¹⁶ We were limited to at most two-word LUs by the technology for extracting multiword expressions that we had developed (Broda et al., 2008).
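The way the lemma list for the nominal MSR is assembled (core plWordNet lemmas plus new lemmas drawn from dictionaries and from a corpus frequency threshold) can be illustrated with a minimal Python sketch. It is not the procedure actually used: the function name `build_lemma_list`, its parameters and the toy data are all hypothetical, and only the "over 1000 occurrences" threshold comes from the text.

```python
from collections import Counter


def build_lemma_list(core_lemmas, dictionary_lemmas, two_word_lus,
                     corpus_counts, min_count=1000):
    """Combine the lemma sources; return (all_lemmas, new_lemmas).

    Hypothetical helper: the inputs stand in for the core plWordNet
    lemmas, the bilingual-dictionary lemmas, the two-word LUs from a
    general dictionary, and lemma frequencies counted in a corpus.
    """
    core = set(core_lemmas)
    # Keep lemmas that occur strictly more than min_count times.
    frequent = {lemma for lemma, n in corpus_counts.items() if n > min_count}
    candidates = set(dictionary_lemmas) | set(two_word_lus) | frequent
    # "New lemmas" are the candidates not yet described in the core wordnet.
    new = candidates - core
    return core | new, new


# Toy data, invented purely for illustration.
core = {"pies", "kot"}
dictionary = {"kot", "dom"}
two_word = {"pies myśliwski"}
counts = Counter({"dom": 5000, "rzeka": 1500, "zamek": 1000, "kot": 20000})

all_lemmas, new_lemmas = build_lemma_list(core, dictionary, two_word, counts)
print(sorted(new_lemmas))
```

In this toy run, "zamek" is left out because it occurs exactly 1000 times, not over 1000 times, and "kot" is not counted as new because it is already in the core set.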
