06.08.2015 Views

A Wordnet from the Ground Up

A Wordnet from the Ground Up - School of Information Technology ...

verified. The resulting patterns are run on a large validating corpus (for Espresso, the Web). A confidence measure is computed from the collected frequencies and compared with a threshold.

In Espresso, the Internet served as a validating corpus for instances extracted by generic patterns. We need other resources because of the paucity of Polish Web pages and the inherent difficulty of querying regardless of the inflectional variety. A second large corpus (Rzeczpospolita, 2008) (but still much smaller than the data from the Web and even IPIC used for the extraction of the patterns) served the purpose of validation in Estratto. The necessary condition for finding occurrences of the patterns extracted from the primary corpus seems to be that the validating corpus cover similar genres and domains.

The induction of patterns and the extraction of instances in Estratto are controlled by the following set of parameters:

1. the number of top k patterns not to be discarded (preserved for the next iterations),
2. the threshold measure of confidence for instances,
3. the minimum frequency and maximum frequency values for patterns,
4. the minimum size of a pattern: all patterns that consist of only matching locations and conjunctions are discarded by definition,
5. a filter on common words in instances and on instances with identical LUs on both positions,
6. the size of the validating corpus.

4.4 Benefits of Extracted Patterns for Wordnet Expansion

We investigated the use of algorithms like Espresso in order to find a method for extracting valuable instances of wordnet relations, at least hypernymy, with precision higher than afforded by the handwritten lexico-morphosyntactic patterns. We did not expect ready-to-add hypernymy instances. We only wanted to construct yet another source of knowledge that suggests hypernymy occurrences and the correct direction of the relation.

It is far from trivial to evaluate the extracted lexico-semantic resources properly (Section 3.3). It is much easier for lists of instances: we verify how many of them are correct. Only two comparisons were possible for Polish:

• with the existing structure of the core plWordNet,
• with an evaluation by one of the co-authors.
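
Verifying a list of instances can be illustrated with a minimal sketch. It is not the Estratto implementation: the `Instance` type, the function names, the confidence scores and the toy data are all assumptions made for illustration, and the confidence formula itself is not reproduced here. The sketch applies two of the parameters listed earlier (the confidence threshold and the filter on common words and identical LUs) and then counts how many of the remaining instances appear, in the same direction, in a reference set such as pairs read off an existing wordnet.

```python
from typing import NamedTuple


class Instance(NamedTuple):
    hyponym: str
    hypernym: str
    confidence: float  # hypothetical score; the formula is not given here


def filter_instances(instances, threshold, common_words=frozenset()):
    """Keep instances at or above the threshold, dropping pairs with identical
    LUs on both positions or with a common word on either position."""
    kept = []
    for inst in instances:
        if inst.confidence < threshold:
            continue
        if inst.hyponym == inst.hypernym:
            continue
        if inst.hyponym in common_words or inst.hypernym in common_words:
            continue
        kept.append(inst)
    return kept


def precision(instances, reference_pairs):
    """Fraction of instances whose (hyponym, hypernym) pair is in the reference."""
    if not instances:
        return 0.0
    correct = sum((i.hyponym, i.hypernym) in reference_pairs for i in instances)
    return correct / len(instances)


# Toy data, invented for this example only.
extracted = [
    Instance("pies", "zwierzę", 0.9),   # dog -> animal
    Instance("zwierzę", "pies", 0.8),   # same pair, wrong direction
    Instance("kot", "kot", 0.95),       # identical LUs on both positions
    Instance("rzecz", "dom", 0.7),      # contains a word treated as common
    Instance("róża", "kwiat", 0.2),     # below the confidence threshold
]
reference = {("pies", "zwierzę"), ("róża", "kwiat")}

kept = filter_instances(extracted, threshold=0.5, common_words={"rzecz"})
score = precision(kept, reference)
```

Because the reference pairs are ordered, an instance with the relation reversed counts as incorrect, which matches the concern with the correct direction of the relation. A pair absent from the reference is not necessarily wrong, so a figure obtained this way is at best a lower bound; that is one reason a manual evaluation is a useful second comparison.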
