06.08.2015 Views

A Wordnet from the Ground Up

A Wordnet from the Ground Up - School of Information Technology ...


4.3. Generic Patterns Verified Statistically

4. Instance extraction: the selected patterns are used to extract instances; the instances are next statistically evaluated using the patterns that match their occurrences; the m top instances are selected and kept for the next iteration. A possible expansion phase for extended retrieval of instances can take place before the selection.

The first step is performed once at the beginning. If the stop condition is not met (it depends on the number of extracted patterns and on the decrease in the average pattern score relative to the previous iteration), the next iteration starts from step 2.

Preprocessing used the Alembic Workbench part-of-speech tagger (Day et al., 1997) but no shallow parser. Multiword terms, if left unrecognised, would decrease the accuracy of the extraction algorithm, because instances would be generated from parts of the complex terms. We noticed this problem in the case of the manually constructed patterns discussed in Section 4.1. Instead of using a shallow parser, Espresso defined multiword terms by means of a regular expression (Pantel and Pennacchiotti, 2006, p. 115). This simple solution cannot be directly ported to languages typologically different from English, such as the morphology-rich, flexible word-order Polish.
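To make the idea of defining multiword terms by a regular expression more concrete, here is a minimal sketch in Python. It is not the expression given by Pantel and Pennacchiotti; it assumes a deliberately simplified definition (a run of adjectives or nouns ending in a noun, at least two tokens long), Penn-Treebank-style tags, and a hypothetical helper name `find_multiword_terms`. Each token is mapped to a one-letter tag class, so that match offsets in the class string coincide with token indices.

```python
import re
from typing import List, Tuple

# Assumed mapping from POS tags to coarse classes (illustrative only):
# A = adjective, N = noun; every other tag becomes O.
TAG_CLASS = {"JJ": "A", "NN": "N", "NNS": "N", "NNP": "N", "NNPS": "N"}

# Simplified term definition: one or more adjectives/nouns, then a head noun.
# This guarantees at least two tokens, i.e. a multiword term.
TERM_RE = re.compile(r"[AN]+N")


def find_multiword_terms(tagged: List[Tuple[str, str]]) -> List[Tuple[int, int]]:
    """Return (start, end) token spans of candidate multiword terms.

    `tagged` is a list of (word, pos_tag) pairs; `end` is exclusive.
    """
    # One character per token, so regex offsets are token offsets.
    classes = "".join(TAG_CLASS.get(tag, "O") for _, tag in tagged)
    return [(m.start(), m.end()) for m in TERM_RE.finditer(classes)]


sentence = [
    ("the", "DT"), ("operating", "NN"), ("system", "NN"),
    ("uses", "VBZ"), ("virtual", "JJ"), ("memory", "NN"),
]
spans = find_multiword_terms(sentence)
terms = [" ".join(word for word, _ in sentence[s:e]) for s, e in spans]
```

A definition of this kind relies on fixed word order inside the term, which is one reason it does not carry over directly to a language with flexible word order.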
Morphology and word-order flexibility will be discussed shortly in the context of a proposed modification of Espresso named Estratto.

Pattern induction is based on the algorithm of Ravichandran and Hovy (2002), discussed earlier, with only one modification: all recognised multiword terms are replaced with the label TR.

The statistical reliability measures, introduced in Espresso for ranking patterns and instances, follow the same basic scheme of recursive dependency of both measures: the reliability of instances depends on the patterns which extracted them and the other way around. This scheme can be traced back to Snowball (Agichtein and Gravano, 2000) and to Ravichandran and Hovy (2002), but there it was implemented in a less sophisticated form. Statistical evaluation is clearly the element that is missing when working with manually extracted patterns. Reliability calculation is Espresso's key element, so we present it in more detail. For the needs of ranking and selecting patterns, Espresso introduced a reliability measure:

$$ r_\pi(p) = \frac{\sum_{i \in I} \left( \frac{pmi(i,p)}{max_{pmi}} * r_\iota(i) \right)}{|I|} \qquad (4.1) $$

where p is a pattern, i is an instance, $r_\iota$ is a reliability measure for instances, pmi is the Pointwise Mutual Information [PMI] measure, explained below, and |I| is the size of the instance set.

The reliability of each seed delivered to Espresso is 1, so pattern reliability is proportional to the average strength of association between the pattern and the seeds
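The pattern reliability measure (4.1) can be illustrated with a short Python sketch. It takes the PMI scores as given (PMI itself is explained later in the text), and it assumes that the normalising maximum is the largest PMI value in the supplied table and that a missing (instance, pattern) pair contributes 0. The function name, the data layout and the numbers are hypothetical and serve only to show the arithmetic.

```python
from typing import Dict, Hashable, Iterable, Tuple


def pattern_reliability(
    pattern: Hashable,
    instances: Iterable[Hashable],
    pmi: Dict[Tuple[Hashable, Hashable], float],
    instance_reliability: Dict[Hashable, float],
) -> float:
    """Average of normalised PMI(i, p) weighted by instance reliability."""
    instances = list(instances)
    if not instances:
        raise ValueError("the instance set I must not be empty")
    # Assumption: normalise by the largest PMI value in the table.
    max_pmi = max(pmi.values())
    if max_pmi <= 0:
        raise ValueError("maximum PMI must be positive for normalisation")
    total = sum(
        pmi.get((i, pattern), 0.0) / max_pmi * instance_reliability[i]
        for i in instances
    )
    return total / len(instances)


# Made-up seeds and PMI scores; every seed has reliability 1.
seeds = [("wheel", "car"), ("leaf", "tree")]
seed_reliability = {seed: 1.0 for seed in seeds}
pmi_scores = {
    (seeds[0], "X is part of Y"): 2.0,
    (seeds[1], "X is part of Y"): 4.0,
    (seeds[0], "X and Y"): 1.0,
}
r_part_of = pattern_reliability("X is part of Y", seeds, pmi_scores, seed_reliability)
r_and = pattern_reliability("X and Y", seeds, pmi_scores, seed_reliability)
```

With all seed reliabilities equal to 1, each result is simply the mean normalised PMI between the pattern and the seeds, so the pattern more strongly associated with the seeds ranks higher.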
