A Wordnet from the Ground Up

4.3. Generic Patterns Verified Statistically

Only instances with the confidence measure above a threshold are selected for the next iteration and used to induce and evaluate patterns. The implicit assumption is that the confidence measure can be calculated for the majority of instances: reliable patterns, which are less frequent but more specific, should occur at least a few times with the majority of correct instances on the Web. We discard all correct instances which are not covered by the Web data matching reliable patterns.

It is worth emphasizing that the evaluation of patterns in the next iteration is based on instances used to induce these patterns, not on instances which they extracted.

The intuition behind the measures of reliability and confidence is that patterns which describe the given relation well often occur with many confident instances of this relation, and the other way around. The difference is that instances extracted by generic patterns will get high confidence if they occur in contexts matched by the specific patterns of good reliability in the validating corpus.

The measures of reliability and confidence reduce the need for manual supervision once Espresso has started. In a way they define the degree to which the extracted instances express the target relation and the patterns describe contexts.
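
The mutual dependence between the two measures can be made concrete with a small Python sketch. It assumes the definitions published by Pantel and Pennacchiotti (2006), which are not reproduced in this passage: pattern reliability is the average, over instances, of PMI normalised by the maximum PMI and weighted by instance reliability; instance reliability is defined symmetrically; and the confidence of an instance is its PMI with the reliable patterns, weighted by their reliability. The function names, the toy instances and patterns, and all PMI values are invented for illustration only.

```python
from typing import Dict, Tuple

Instance = Tuple[str, str]
# (instance, pattern) -> PMI; pairs never seen together count as 0.
Pmi = Dict[Tuple[Instance, str], float]


def pattern_reliability(pattern: str, instance_rel: Dict[Instance, float],
                        pmi: Pmi, max_pmi: float) -> float:
    """Mean over instances of normalised PMI weighted by instance reliability."""
    total = sum(pmi.get((inst, pattern), 0.0) / max_pmi * rel
                for inst, rel in instance_rel.items())
    return total / len(instance_rel)


def instance_reliability(instance: Instance, pattern_rel: Dict[str, float],
                         pmi: Pmi, max_pmi: float) -> float:
    """Mean over patterns of normalised PMI weighted by pattern reliability."""
    total = sum(pmi.get((instance, pat), 0.0) / max_pmi * rel
                for pat, rel in pattern_rel.items())
    return total / len(pattern_rel)


def confidence(instance: Instance, reliable_patterns: Dict[str, float],
               pmi: Pmi) -> float:
    """PMI with each reliable pattern, weighted by that pattern's share of reliability."""
    total_rel = sum(reliable_patterns.values())
    return sum(pmi.get((instance, pat), 0.0) * rel
               for pat, rel in reliable_patterns.items()) / total_rel


# Toy data (invented): two seed instances with reliability 1.0, two patterns.
CAT, OAK = ("cat", "animal"), ("oak", "tree")
pmi: Pmi = {
    (CAT, "X is a Y"): 2.0,
    (CAT, "Y such as X"): 4.0,
    (OAK, "Y such as X"): 2.0,
}
max_pmi = max(pmi.values())
seeds = {CAT: 1.0, OAK: 1.0}

# Patterns are scored from the instances that induced them ...
pattern_rel = {p: pattern_reliability(p, seeds, pmi, max_pmi)
               for p in ("X is a Y", "Y such as X")}
# ... and instances are then scored from the patterns.
cat_rel = instance_reliability(CAT, pattern_rel, pmi, max_pmi)
cat_conf = confidence(CAT, pattern_rel, pmi)
oak_conf = confidence(OAK, pattern_rel, pmi)
```

In this toy setting the pattern that co-occurs with both seeds ends up more reliable than the one seen with a single seed, and the instance matched by both patterns receives the higher confidence. A threshold on `confidence` would then decide which instances enter the next iteration.
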
Quantifying how well the instances express the relation is an advantage over manual patterns, for which collected frequencies are mostly low and accidental, and say little about the quality of instances.

The system of measures, instances and pattern selection is universal and does not refer to any properties of any particular relation being extracted. Espresso can therefore be applied to a wide range of relations. It was applied to hypernymy, meronymy and antonymy, as well as to more specific relations such as person-company or person-job title (Pantel and Pennacchiotti, 2006).

To sum up, Pantel and Pennacchiotti (2006) list Espresso’s characteristics:
• high recall together with a small decrease in the precision of extracted instances,
• autonomy of work (weakly supervised algorithm): only several instances of the given relation must be defined at the beginning,
• independence of the size of the corpus or domain used,
• a wide range of relation types that can be extracted.

Estratto

Estratto (Kurc, 2008; Kurc and Piasecki, 2008) is a modification of Espresso developed mainly to cope with the significant differences between English and Polish: rich morphology, flexible word order and the much more limited size of, and access to, Web resources. Let us first present one language-independent adjustment.

Two issues are unclear in how the reliability and confidence measures work. First, reliability is sensitive to fluctuations in PMI values. Higher values (e.g., the effect
