06.08.2015 Views

A Wordnet from the Ground Up

A Wordnet from the Ground Up - School of Information Technology ...



Chapter 4. Extracting Relation Instances

a simple tokeniser and a simple sentence boundary recogniser, rather than advanced tools like the named-entity tagger in Snowball. This relaxation of assumptions made their algorithm more general. Patterns are extracted from sentences including seed occurrences by means of suffix trees for extracting substrings of optimal length. Pattern precision is calculated as the ratio of the correctly matched instance occurrences to all matches of the pattern. Instances are ordered by the precision of the patterns selecting them. The process is not iterative.

Pantel et al. (2004) proposed an algorithm for mining is-a relations from huge text corpora. Text is first processed by a part-of-speech tagger (Brill, 1995) and stored in a two-level format: surface word forms and part-of-speech tags. Next, all sentences including seeds are extracted. Patterns are learned from the sentences by calculating the minimal edit distance among sentences and registering the edit operations required. Patterns with relatively high occurrence and high precision are identified using the log-likelihood principle (Dunning, 1993) for scoring. Only the 15 highest-scoring patterns have been used to extract hypernymy instances.

Pantel and Pennacchiotti (2006) proposed a system called Espresso which seems to combine all interesting properties of its predecessors. It does not make any assumptions concerning the relation described by the patterns. It works on plain text, uses only a part-of-speech tagger and a simple chunker, and works iteratively during the subsequent phases of pattern and instance extraction and evaluation.
It is also claimed to be weakly supervised, requiring only the initial set of seeds. Taking into account the foregoing selective overview of the previous algorithms and the results of the evaluation of Espresso (Pantel and Pennacchiotti, 2006), we decided to use Espresso as the starting point for the development of an algorithm that supports the expansion of the core plWordNet.

Espresso

Espresso follows the bootstrapping paradigm in a version exemplified already in the Snowball system (Agichtein and Gravano, 2000). Seeds are used to extract the first set of patterns; the subsequent phases of instance and pattern extraction go on automatically. The following four main phases can be identified in Espresso.

1. Preprocessing: the input text is divided into tokens (some multiword expressions are identified) and run through a part-of-speech tagger.
2. Pattern induction: sentences including seeds are extracted and patterns are learned using the algorithm in (Ravichandran and Hovy, 2002).
3. Pattern selection: extracted patterns are statistically evaluated and ranked by instances inducing them; the top k patterns are selected.
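
The pattern induction and pattern selection phases listed above can be illustrated with a deliberately simplified sketch. It is not Espresso itself: the infix between two seed terms stands in for the Ravichandran and Hovy (2002) pattern learner, and a plain count of distinct inducing seeds stands in for the statistical evaluation. The function names `induce_patterns` and `select_patterns`, the `X`/`Y` slot notation, and the sample sentences are all assumptions made for illustration.

```python
from collections import defaultdict


def induce_patterns(sentences, seeds):
    """Map each candidate pattern to the set of seed pairs that induced it.

    Assumption: a pattern is just the token sequence found between the two
    seed terms, with the terms themselves replaced by the slots X and Y.
    """
    induced_by = defaultdict(set)
    for sentence in sentences:
        tokens = sentence.lower().split()  # stand-in for real preprocessing
        for x, y in seeds:
            if x in tokens and y in tokens:
                i, j = tokens.index(x), tokens.index(y)
                if i < j:
                    pattern = " ".join(["X"] + tokens[i + 1:j] + ["Y"])
                    induced_by[pattern].add((x, y))
    return induced_by


def select_patterns(induced_by, k):
    """Rank patterns by the number of distinct seeds inducing them; keep k.

    Assumption: a raw count replaces the statistical score; ties are broken
    alphabetically so that the result is deterministic.
    """
    ranked = sorted(induced_by.items(),
                    key=lambda item: (-len(item[1]), item[0]))
    return [pattern for pattern, _ in ranked[:k]]


# Hypothetical toy corpus and seed set.
sentences = [
    "a dog is a kind of animal",
    "a rose is a kind of plant",
    "the dog chased the animal",
]
seeds = [("dog", "animal"), ("rose", "plant")]

induced = induce_patterns(sentences, seeds)
print(select_patterns(induced, k=1))
```

In this toy setting a pattern supported by two different seed pairs outranks one supported by a single pair, which is the intuition behind ranking patterns by the instances that induce them.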
