A Wordnet from the Ground Up

A Wordnet from the Ground Up - School of Information Technology ...


Chapter 4. Extracting Relation Instances

The whole attachment screen is embedded in the full plWordNetApp, so the linguist can change any element of the plWordNet database by switching to another panel. The AAA algorithm runs on the server; the client side handles mainly visualisation. WNW is written in Java and can run unchanged on many platforms.

4.5.4 Benefits of weaving the expanded structure

WNW has been designed to facilitate the actual process of wordnet expansion. Its primary evaluation was based on the work of a linguist with rich experience in editing plWordNet, who was adding new nominal lemmas. The candidates came from the same set of 13285 nominal lemmas, which has been defined as a basis for expanding plWordNet during work on MSR extraction, cf. Section 3.4.5. The set includes lemmas from a small Polish-English dictionary (Piotrowski and Saloni, 1999), two-word lemmas from a general dictionary of Polish (PWN, 2007) and frequent nouns (frequency > 1000) from the joint corpus [14] (≈ 581 million tokens, see Section 3.4.5).

For evaluation purposes, we used 1360 new lemmas divided into subdomains corresponding to animals (113 LUs), food (170), people (323), people 2 (269), plants (81), places (243), plus a sample of 161 LUs randomly drawn across all clusters (rand. in Table 4.5). Prior to the experiment, the linguist had used only traditional means in her work: electronic dictionaries and corpus browsing. We assumed three types of evaluation:

1. subjective opinions and observations of the linguist collected during actual work over a longer period, 18 person-days,
2. monitoring and analysing the linguist’s decisions recorded in the database together with descriptions,
3. automatic evaluation following the general scheme of re-building the existing wordnet by applying the AAA algorithm autonomously.

The linguist’s observations

WNW has turned out to be useful in the inclusion of new lemmas given a narrow domain such as jedz [15] (names of foodstuffs) or rsl (plant names). For such lemmas the accuracy was high, and it increased even more as the database grew and as the operation of recomputing the graphs became available. As an example, the program

[14] As described in Section 3.4.5, the joint corpus consists of IPIC (≈ 254 million tokens) (Przepiórkowski, 2004), texts from the electronic edition of a Polish daily Rzeczpospolita (≈ 113 million tokens) (Rzeczpospolita, 2008) and a corpus of large Polish texts collected from the Internet (≈ 214 million tokens).
[15] We cite here the original labels assigned to the domains in plWordNet.
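The figures quoted above can be cross-checked with a few lines of arithmetic. The following is a minimal sketch in Python, not part of WNW or plWordNetApp: the dictionary names and the helper functions `total` and `shares` are hypothetical, introduced only for illustration, and the numbers are copied from the text (subdomain sizes in LUs, corpus components in millions of tokens as given in footnote 14).

```python
# Subdomain sizes of the evaluation sample, in LUs, as reported in the text.
# The variable and function names are illustrative assumptions only.
SUBDOMAIN_SIZES = {
    "animals": 113,
    "food": 170,
    "people": 323,
    "people 2": 269,
    "plants": 81,
    "places": 243,
    "rand.": 161,
}

# Components of the joint corpus (footnote 14), in millions of tokens.
# These are the approximate figures quoted in the text.
CORPUS_PARTS_MTOKENS = {
    "IPIC": 254,
    "Rzeczpospolita": 113,
    "Internet texts": 214,
}


def total(parts):
    """Return the sum of all component sizes."""
    return sum(parts.values())


def shares(parts):
    """Return each component's share of the total, in percent (one decimal)."""
    whole = total(parts)
    return {name: round(100 * size / whole, 1) for name, size in parts.items()}


if __name__ == "__main__":
    # The subdomain sizes add up to the 1360 new lemmas used for evaluation.
    print("evaluation sample:", total(SUBDOMAIN_SIZES), "LUs")
    # The three corpus components add up to about 581 million tokens.
    print("joint corpus:", total(CORPUS_PARTS_MTOKENS), "million tokens")
    for name, pct in shares(SUBDOMAIN_SIZES).items():
        print(f"  {name}: {pct}% of the sample")
```

Both totals agree with the text: the seven subdomains sum to 1360 lemmas and the three corpus components to roughly 581 million tokens.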
