06.08.2015 Views

A Wordnet from the Ground Up

A Wordnet from the Ground Up - School of Information Technology ...


Chapter 2. Building a Wordnet Core

2.4 The First 7000 Lexical Units

The plWordNet project depends crucially on an initial manually built small network. It has been our firm belief from the start that semi-automated construction of a wordnet from the ground up requires a completely trustworthy core to achieve acceptable accuracy. We assumed that automated methods would not perform well for more general LUs, and would be unable to extract the basic structure of the future plWordNet. The first assumption turned out to be too pessimistic. Measures of Semantic Relatedness (Section 3.4) produced results of lower accuracy only for some more general LUs, while for others the results were quite correct or even good. The second assumption, however, has been borne out by the experiments with state-of-the-art clustering algorithms for the extraction of synsets and possibly a hypernymy structure – see Section 3.5.

We planned a fully manual construction [2] of core plWordNet with approximately 7000 LUs. Those would be LUs with a general meaning, such as nouns located in the upper part of the hypernymy structure, including LUs which represent concepts such as rzecz ‘thing’ or substancja ‘substance’. We had settled upon not translating any existing wordnet, and no monolingual dictionary in an electronic form was available to be leveraged as a source of the plWordNet structure, especially the hypernymy structure. That is why we initially decided to rely only on a large enough corpus. The best choice for Polish was the IPI PAN Corpus [IPIC] (Przepiórkowski, 2004) – the largest available corpus of Polish at the time when the project began. IPIC, designed as a corpus of general Polish, consists of about 254 million tokens and contains a range of genres, including literature, poetry, newspapers, scientific texts, legal texts and stenographic parliamentary records. It is not balanced: the last category dominates (Przepiórkowski, 2006).

We first extracted the 10000 most frequent one-word lemmas [3] in IPIC 1.0, each tagged with a grammatical class [4] (in the technical sense, see page 17, Section 1.2). We collected more lemmas than the planned size of the core plWordNet, because we expected the list to shrink after manual revision. We divided them manually into 45 general semantic domains (26 nominal, 15 verbal, 4 adjectival) corresponding to the domains that label source files of Princeton WordNet [PWN] 1.5.

Simultaneously with the grouping, the linguists filtered out typos and rare lemmas whose high frequency was an artefact of errors in morphosyntactic tagging of IPIC 1.0. For example, the verb maić ‘≈ adorn with verdure’ normally occurs very rarely in rather old-fashioned constructions (the use of any finite form is hard to imagine). The

[2] There was logistical software support for the process, but all lexicographic decisions were to be made by linguists.
[3] A method of extracting two-word lemmas was developed later.
[4] In the IPIC tagset, word forms are divided into 32 grammatical classes, a division more fine-grained than the traditional parts of speech.
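The frequency-based selection described above can be sketched in a few lines. This is only an illustrative sketch, not the project's actual tooling: it assumes the corpus is already available as a stream of (lemma, grammatical class) pairs, and the function names, the class labels and the toy data are all hypothetical. In the project itself, the filtering of typos and tagging artefacts was a manual decision made by linguists; here it is represented only by a reviewer-supplied set of rejected pairs.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple

# Hypothetical token representation: (lemma, grammatical_class).
# The real IPIC file format is different; this is an assumption for the sketch.
Token = Tuple[str, str]


def most_frequent_lemmas(tokens: Iterable[Token], n: int) -> List[Tuple[Token, int]]:
    """Return the n most frequent one-word (lemma, class) pairs with their counts."""
    counts = Counter(
        (lemma, gram_class)
        for lemma, gram_class in tokens
        if " " not in lemma  # keep one-word lemmas only
    )
    return counts.most_common(n)


def drop_rejected(
    candidates: List[Tuple[Token, int]], rejected: Set[Token]
) -> List[Tuple[Token, int]]:
    """Remove pairs that a human reviewer marked as typos or tagging artefacts."""
    return [(pair, count) for pair, count in candidates if pair not in rejected]


# Toy data, invented for illustration only (labels are placeholders).
toy_tokens = [
    ("rzecz", "subst"), ("rzecz", "subst"), ("rzecz", "subst"),
    ("substancja", "subst"), ("substancja", "subst"),
    ("maić", "fin"), ("maić", "fin"),
    ("na przykład", "qub"),  # two-word lemma, skipped by the extractor
]

top = most_frequent_lemmas(toy_tokens, 3)
kept = drop_rejected(top, rejected={("maić", "fin")})
```

Collecting more candidates than the target size (10000 against roughly 7000) leaves room for the list to shrink during this manual revision step.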
