
A Wordnet from the Ground Up

A Wordnet from the Ground Up - School of Information Technology ...


4.3. Generic Patterns Verified Statistically

and in the evaluation of the extracted patterns and instances (the evaluation is not present in some methods). Very often the process is recursive: the extracted instances are used to generate a new pattern set.

Brin (1999) proposed a method of extracting patterns that discover author-book pairs in Web pages. Pair occurrences are represented by the order of both names and by the prefix, middle and suffix character contexts. Patterns are generated by first grouping seed occurrences by the order and the middle. For each group, the longest matching prefix and suffix are identified, and one pattern per group is extracted. Patterns are evaluated by their specificity, defined as the product of the lengths of the prefix, middle and suffix. Patterns with specificity below a predefined threshold are rejected.⁷ Brin did not present any thorough evaluation or any accuracy data.

Agichtein and Gravano (2000) as well as Agichtein et al. (2001) follow Brin’s approach, extended with respect to the recognition of an unlimited set of relations between Named Entities, and to the generation and evaluation of patterns and extracted instances. Their system has been aptly named Snowball to reflect the iterative character of the algorithm. It starts with the seeds and an empty set of extraction patterns. In each round, it extracts new patterns and a new set of instances but keeps only those that have been evaluated as sufficiently reliable. The previous set of instances becomes the seed set for the next iteration.

The text is first processed by a named-entity tagger.⁸
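
Brin's pattern generation, as summarized above, can be illustrated with a minimal sketch. It is not Brin's implementation: the names `Occurrence`, `generate_patterns` and `min_specificity` are hypothetical, URL information is ignored (as in this summary), and we assume that the "longest matching prefix" is taken from the right end of the prefix contexts (the part adjacent to the first name) and the "longest matching suffix" from the left end of the suffix contexts.

```python
import os
from collections import defaultdict
from typing import List, NamedTuple


class Occurrence(NamedTuple):
    """One occurrence of a pair, or one generated pattern (hypothetical type)."""
    author_first: bool  # order of the two names in the text
    prefix: str         # characters before the first name
    middle: str         # characters between the two names
    suffix: str         # characters after the second name


def generate_patterns(occurrences: List[Occurrence],
                      min_specificity: int) -> List[Occurrence]:
    # Step 1: group seed occurrences by name order and middle context.
    groups = defaultdict(list)
    for occ in occurrences:
        groups[(occ.author_first, occ.middle)].append(occ)

    patterns = []
    for (author_first, middle), group in groups.items():
        # Step 2: prefix contexts end at the first name, so they are
        # compared from the right (assumption, see the lead-in).
        prefix = os.path.commonprefix([o.prefix[::-1] for o in group])[::-1]
        # Suffix contexts start right after the second name, so they are
        # compared from the left.
        suffix = os.path.commonprefix([o.suffix for o in group])
        # Step 3: specificity is the product of the three lengths;
        # patterns below the threshold are rejected.
        specificity = len(prefix) * len(middle) * len(suffix)
        if specificity >= min_specificity:
            patterns.append(Occurrence(author_first, prefix, middle, suffix))
    return patterns
```

Note that a product of lengths rejects any pattern with an empty prefix, middle or suffix, whatever the threshold (as long as it is positive), so very short contexts never yield a pattern in this sketch.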
During pattern generation, text fragments including pairs of named entities in focus are extracted. For each unique pair, the left, middle and right contexts are represented as vectors of weights in 〈0, 1〉. The weights, produced from term frequencies, are meant to express the importance of the term for the context. Weights for the middle context are higher to reflect its larger importance for the relation representation. The size of the left and right context is fixed to a specified text window.

During extraction, a text fragment around two named entities is transformed to a vector and compared with pattern vectors. It is accepted as an instance if the similarity of vectors (measured as a product) exceeds the threshold. Extracted patterns and instances are evaluated by confidence. The confidence of a pattern is the ratio of positive to negative matches, first measured for the initial instances. Pattern confidence for the next iteration is a combination of the new and old value. The confidence of an instance directly depends on the confidence of the patterns that select it and on the degree of matching between the instance and particular patterns. After each iteration, all instances with low confidence (below a threshold) are discarded.

Ravichandran and Hovy (2002) developed a weakly supervised algorithm for extraction of question-answer pairs of named entities, based only on seeds. They used

⁷ We have intentionally omitted the role of the URL addresses in pattern generation; we focus on a plain-text corpus as a source.
⁸ The MITRE Corporation’s Alembic Workbench (Day et al., 1997) was used in Snowball.
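
The Snowball matching and confidence computations described above can be sketched as follows. This is an illustration under stated assumptions, not the Snowball code: all function names, the context weights in `CONTEXT_WEIGHTS` and the `w_update` parameter are hypothetical. We assume unit-length term-frequency vectors, a match degree computed as a weighted sum of inner products over the three contexts, pattern confidence read as the fraction positive / (positive + negative) so that it stays in 〈0, 1〉, and a noisy-or combination for instance confidence.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

Vector = Dict[str, float]
Context = Tuple[Vector, Vector, Vector]  # left, middle, right

# Hypothetical context weights; the middle context counts most.
CONTEXT_WEIGHTS = (0.2, 0.6, 0.2)


def to_vector(tokens: List[str]) -> Vector:
    """Unit-length term-frequency vector; every weight lies in [0, 1]."""
    counts = Counter(tokens)
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {t: c / norm for t, c in counts.items()} if norm else {}


def dot(u: Vector, v: Vector) -> float:
    """Inner product of two sparse vectors."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def match(a: Context, b: Context) -> float:
    """Degree of match in [0, 1]: weighted sum of per-context products."""
    return sum(w * dot(x, y) for w, x, y in zip(CONTEXT_WEIGHTS, a, b))


def pattern_confidence(positive: int, negative: int) -> float:
    """Assumed reading: share of positive matches among all matches."""
    total = positive + negative
    return positive / total if total else 0.0


def updated_confidence(old: float, new: float, w_update: float = 0.5) -> float:
    """Combine old and new confidence; w_update is a hypothetical knob."""
    return w_update * new + (1.0 - w_update) * old


def instance_confidence(supports: List[Tuple[float, float]]) -> float:
    """Noisy-or over (pattern confidence, match degree) pairs (assumption)."""
    miss = 1.0
    for conf, degree in supports:
        miss *= 1.0 - conf * degree
    return 1.0 - miss
```

A text fragment would be accepted as an instance when `match(fragment, pattern)` exceeds the similarity threshold, and kept after the iteration only if its `instance_confidence` stays above the confidence threshold; both thresholds are left to the caller here.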
