06.08.2015 Views

A Wordnet from the Ground Up

A Wordnet from the Ground Up - School of Information Technology ...



116 Chapter 4. Extracting Relation Instances

[...] of small frequencies, even after discounting dependent on the number of occurrences) can cause a lower assessment of patterns with a balanced ratio of co-occurrence with matched instances versus the pattern occurrences and instance occurrences alone. Such situations result in artificially increased values of $max_{pmi}$. We would like to find a measure which would be less sensitive to the low frequency of pattern matches or of instances matched. Also, the value 1 of pattern reliability is not guaranteed even for a pattern which occurs only with a subset of seeds, because the $max_{pmi}$ value can be increased by some infrequent pattern. That is why the propagation of reliability to subsequent iterations causes new values (calculated for patterns from instances and vice versa) to become gradually lower for the respective set. We seek a measure of reliability which returns 1 as the value for the best patterns or instances in every iteration.

$$
r_\pi(p) = \frac{\sum_{i \in I} \bigl( pmi(i, p) \cdot r_t(i) \bigr) \cdot d(I, p)}{\max_p \Bigl( \sum_{i \in I} \bigl( pmi(i, p) \cdot r_t(i) \bigr) \Bigr) \cdot |I|} \tag{4.6}
$$

$d(I, p)$ defines how many unique instances the given pattern is associated with. PMI in formula (4.6) is usually also modified by a discounting factor.

The proposed modifications are intended to increase the reliability of the patterns which not only extract a lot of instances, but also occur with a large number of different instances.
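As a rough illustration of formula (4.6), the following minimal sketch computes the modified pattern reliability from precomputed PMI values. The function name `pattern_reliability`, the dictionary layout, and the toy numbers are assumptions made for this example only; they do not come from the original system, and no discounting factor is applied.

```python
def pattern_reliability(patterns, instances, pmi, r_inst):
    """Sketch of formula (4.6).

    pmi    : dict mapping (instance, pattern) -> PMI value, holding only
             pairs that actually co-occur (hypothetical input layout)
    r_inst : dict mapping instance -> instance reliability
    """
    sums = {}      # weighted PMI sum for each pattern (numerator, first factor)
    distinct = {}  # d(I, p): number of unique instances seen with the pattern
    for p in patterns:
        matched = [i for i in instances if (i, p) in pmi]
        sums[p] = sum(pmi[(i, p)] * r_inst[i] for i in matched)
        distinct[p] = len(matched)

    max_sum = max(sums.values())
    if max_sum <= 0:
        # The normalisation is undefined here; this guard is a choice of the sketch.
        raise ValueError("no pattern has a positive weighted PMI sum")

    return {p: (sums[p] * distinct[p]) / (max_sum * len(instances))
            for p in patterns}


# Toy data: p1 co-occurs with both instances, p2 with only one.
instances = ["a", "b"]
r_inst = {"a": 1.0, "b": 0.5}
pmi = {("a", "p1"): 2.0, ("b", "p1"): 2.0, ("a", "p2"): 4.0}
scores = pattern_reliability(["p1", "p2"], instances, pmi, r_inst)
print(scores)  # {'p1': 0.75, 'p2': 0.5}
```

In this toy case `p2` has the larger weighted sum, yet `p1` scores higher because it is associated with more distinct instances, which is the effect the factor $d(I, p)$ is meant to produce.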
The modified measure, when applied to instances, promotes those which occur often in the corpus in association with many different patterns.

The choice of the pattern structure is crucial for the patterns' expressiveness and their ability to capture those elements of language structure that express semantic relations between LUs: for example, the linear order of constituents in English, but the case-marking of noun phrases in Polish (where their linear order is mostly insignificant for the potential lexical semantic relation between them). Espresso roughly follows the scheme proposed by Hearst (1992): patterns are regular expressions in which the alphabet includes word forms, the label TR for any multiword term, and a set of variables for noun phrases matched as elements of an instance. The role of part-of-speech tags is unclear in the approach of Pantel and Pennacchiotti (2006), but they are present in the example of the generalisation of a sentence [p. 115]:

    Because/IN TR is/VBZ a/DT TR and/CC x is/VBZ a/DT y.

We assumed that patterns are simplified regular expressions, with the Kleene closure but without grouping. The alphabet for an inflectional language like Polish should include roots rather than (numerous) word forms. Espresso patterns rely to some extent on the positional, linear syntactic structure of an English sentence. Porting to a significantly different language may be problematic.
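To make the notion of a simplified pattern more concrete, here is a minimal sketch that compiles such a pattern into a Python regular expression: word forms are literals, `TR` stands for a term, `x` and `y` are variables, and a trailing `*` marks the Kleene closure of a single symbol (no grouping). Everything here is an assumption made for illustration: the function names, the pattern syntax, the approximation of `TR` by any run of one or more tokens, and the restriction of variables to single tokens. Part-of-speech tags are ignored, and this is neither Espresso's implementation nor the authors'.

```python
import re

VARIABLES = {"x", "y"}  # hypothetical names for the two instance elements
TERM_LABEL = "TR"       # label for a term, approximated below by any token run


def compile_pattern(pattern):
    """Translate a space-separated simplified pattern into a compiled regex."""
    parts = []
    for symbol in pattern.split():
        if symbol == TERM_LABEL:
            # One or more tokens, matched lazily (a crude stand-in for a term).
            parts.append(r"(?:\S+\s+)+?")
        elif symbol in VARIABLES:
            # A variable captures exactly one token in this sketch.
            parts.append(rf"(?P<{symbol}>\S+)\s+")
        elif symbol.endswith("*"):
            # Kleene closure over a single literal word form.
            parts.append(rf"(?:{re.escape(symbol[:-1])}\s+)*")
        else:
            parts.append(rf"(?:{re.escape(symbol)}\s+)")
    return re.compile("".join(parts))


def match_instance(pattern, sentence):
    """Return the variable bindings if the whole sentence matches, else None."""
    # A trailing space lets every symbol consume "token + whitespace".
    m = compile_pattern(pattern).fullmatch(sentence.strip() + " ")
    return m.groupdict() if m else None


bindings = match_instance(
    "Because TR is a TR and x is a y",
    "Because hydrogen fluoride is a weak acid and espresso is a drink",
)
print(bindings)  # {'x': 'espresso', 'y': 'drink'}
```

Matching on surface word forms, as this sketch does, is exactly what becomes costly for an inflectional language; replacing the literals with roots would require a lemmatisation step that is not shown here.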
