A Wordnet from the Ground Up


Chapter 5. Polish WordNet Today and Tomorrow

The expansion process was therefore slightly biased towards a data-driven approach². Such a bias is not necessarily a drawback, however, because, assuming moderate influence of the linguists' decisions, the resulting resource represents a structure of lexico-semantic relations implicitly present in the language data.

The operation of pattern-based methods was also limited to the list of 15096 selected lemmas (in its late version, supplemented by the frequent lemmas from the joint corpus). Support for this design decision comes from preliminary experiments, in which we applied pattern-based methods to an unrestricted, full list of lemmas collected from the joint corpus. In the absence of any restrictions, there were very many extracted instances, at the cost of much lower precision.

The number of dictionary lemmas covered could naturally be improved with a larger and better balanced corpus that covers more domains and offers more examples per word sense. This is not a realistic wish, but there is still much room in semantic extraction for increased use of what is present in texts. For example, MSR construction could easily benefit from a deeper analysis of the lexico-syntactic structure afforded by a syntactic analyser.

We only applied the semi-automatic process to nominal lemmas. The present version of WNW strongly depends on the hypernymy structure, which is rich for nouns, but quite limited for verbs and very rare for adjectives. There simply was no basis for the identification of attachment areas. (In future versions of WNW, we plan to use more relation types in generating suggestions.) Also, few lexico-syntactic patterns work for verbs. It is a serious open problem to find an effective use of information coming from single occurrences of verbal and adjectival lemmas in the style of pattern-based approaches.

Here is how we have organised the process of expanding plWordNet around WNW.

1. We collected and morpho-syntactically preprocessed a large corpus (Section 3.4.3).
2. From the corpus, we extracted data sets describing lexico-semantic relations; we applied all constructed and experimentally verified automated extraction tools: the Measure of Semantic Relatedness MSR_GRWF, manual patterns and Estratto (Section 4.3).
3. A classifier for lexico-semantic relations (Section 4.5.1) was trained on the then-current state of the wordnet and the data sets from the previous step.
4. New lemmas were clustered using MSR_GRWF and an off-the-shelf clustering tool; we applied the Cluto package (Karypis, 2002).

² It is important to note that the ultimate decisions belonged to linguists, who could also freely add relation instances not suggested by WNW.
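
Step 4 of the process above, clustering new lemmas by semantic relatedness, can be illustrated with a small sketch. It is not the authors' setup: plain cosine similarity over toy co-occurrence counts stands in for MSR_GRWF, and a simple greedy average-link procedure stands in for the Cluto package. The names cosine, cluster_lemmas and toy_vectors, the feature labels and the threshold value are all assumptions made for illustration.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity of two sparse feature vectors stored as dicts.

    Assumption: this is only a stand-in for a real measure of semantic
    relatedness such as MSR_GRWF.
    """
    dot = sum(w * v[f] for f, w in u.items() if f in v)
    norm_u = sqrt(sum(w * w for w in u.values()))
    norm_v = sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def cluster_lemmas(vectors, threshold):
    """Greedy average-link agglomerative clustering (hypothetical helper).

    Starts with one cluster per lemma and repeatedly merges the two
    clusters with the highest average pairwise similarity, stopping
    when no pair of clusters reaches `threshold`.
    """
    clusters = [[lemma] for lemma in sorted(vectors)]

    def avg_sim(a, b):
        total = sum(cosine(vectors[x], vectors[y]) for x in a for y in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        # Find the most similar pair of clusters.
        best, i, j = max(
            (avg_sim(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if best < threshold:
            break
        clusters[i] = sorted(clusters[i] + clusters[j])
        del clusters[j]
    return clusters


# Invented co-occurrence counts for four Polish nouns:
# kot (cat), pies (dog), auto (car), autobus (bus).
toy_vectors = {
    "kot": {"purr": 3, "fur": 2, "eat": 1},
    "pies": {"bark": 3, "fur": 2, "eat": 1},
    "auto": {"drive": 4, "engine": 2},
    "autobus": {"drive": 3, "engine": 1, "passenger": 2},
}

clusters = cluster_lemmas(toy_vectors, threshold=0.3)
print(clusters)
```

On this toy data the two vehicle nouns end up in one cluster and the two animal nouns in another. Raising the threshold to 0.5 leaves kot and pies as separate singletons, which shows how the cut-off trades cluster size against cohesion.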
