06.08.2015 Views

A Wordnet from the Ground Up

A Wordnet from the Ground Up - School of Information Technology ...

1.2. The Goals of the plWordNet Project

According to our initial plans, an extraction algorithm should suggest both new synsets and instances of lexico-semantic relations. In the end, the WordNet Weaver generates only suggestions of attachment points (Section 4.5.3): synsets in which a given new LU can be included or to which it can be attached as a new hyponym/hypernym or even meronym. The accuracy of clustering-based methods of suggesting new synsets ended up too low for practical applications (Section 3.5). The use of support tools notwithstanding, we wanted to abide by the principle that the ultimate responsibility for every wordnet element rests with its authors in every phase of the wordnet development. It was tempting to speed up the development of our wordnet at the cost of slightly lower accuracy, but we are convinced that a smaller wordnet with excellent accuracy is more useful in applications than a larger but less reliable resource.

Despite the limited funds, we fully expected to build a wordnet of a size comparable to several much better established European wordnets. The introduction of the automated methods in the second phase of the project was meant to reduce the linguistic workload considerably [8]. Section 4.5.4 reports on the extent to which this succeeded.

There are many methods of extracting lexico-semantic relations from corpora. We present an overview and a detailed discussion of selected methods throughout Chapters 3 and 4. They can be roughly divided into two main groups of methods, based on distribution (Chapter 3) and on patterns (Chapter 4).
Pattern-based methods can achieve relatively good accuracy in extracting instances of hypernymy (pairs of LUs), but very rarely of other relations such as synonymy, meronymy or antonymy; their recall is low. Distributional methods achieve good recall, because they can generate a description for any pair of LUs, but their accuracy is quite low: they do not distinguish between different lexico-semantic relations and produce a vague measure of semantic relatedness.

A well-known weakness of distributional methods is in distinguishing different LUs for the given lemma. Henceforth, we will understand lemma to be a basic morphological word form that represents the occurrences of one or a few particular LUs in language expressions. A lemma is monosemous if it represents one LU, and polysemous otherwise. The basic morphological word form, or base form, is a word form or language expression with conventional values of grammatical categories, such as the nominative case and singular number for nouns. A base form represents a set of word forms with the same meaning and different values of grammatical categories. We decided to operate on lemmas during the extraction of relation instances, because the number of different word forms is very high in the strongly inflected Polish language. Lemmatisation, or the mapping of word forms to lemmas, must be done automatically

[8] That is why we have allotted the funds approximately in the proportion 1:2 to manual work and to the software design and development work.
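The terminology above (word form, lemma, LU, monosemous and polysemous lemmas) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the tiny dictionaries `FORM_TO_LEMMA` and `LEMMA_TO_LUS`, the sense labels, and the helper names are hypothetical toy data, not plWordNet content or the authors' lemmatiser. A practical system would rely on a morphological analyser rather than a hand-written lookup table.

```python
from collections import Counter

# Toy, hand-made data (an assumption for illustration, not plWordNet content).
# Maps inflected word forms to their lemma (base form).
FORM_TO_LEMMA = {
    "zamek": "zamek", "zamku": "zamek", "zamkiem": "zamek", "zamki": "zamek",
    "pies": "pies", "psa": "pies", "psem": "pies", "psy": "pies",
}

# Maps each lemma to the lexical units (LUs) it represents in this toy inventory.
LEMMA_TO_LUS = {
    "zamek": ["zamek 1 (castle)", "zamek 2 (lock)", "zamek 3 (zip fastener)"],
    "pies": ["pies 1 (dog)"],
}


def lemmatise(tokens):
    """Map each word form to its lemma; unknown forms are lower-cased and kept."""
    return [FORM_TO_LEMMA.get(t.lower(), t.lower()) for t in tokens]


def is_monosemous(lemma):
    """True if the lemma represents exactly one LU in the toy inventory.

    Lemmas missing from the inventory yield False, since nothing is known about them.
    """
    return len(LEMMA_TO_LUS.get(lemma, [])) == 1


def lemma_counts(tokens):
    """Count occurrences per lemma, collapsing all inflected forms."""
    return Counter(lemmatise(tokens))
```

Counting over lemmas instead of word forms collapses the many inflected variants into one entry, which is the motivation given above for operating on lemmas. The sketch also shows the weakness mentioned in the text: all occurrences of a polysemous lemma such as the toy entry `zamek` end up in a single count, regardless of which LU was meant.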
