A Wordnet from the Ground Up

4.3. Generic Patterns Verified Statistically

and in the evaluation of the extracted patterns and instances (the evaluation is not present in some methods). Very often the process is recursive: the extracted instances are used to generate a new pattern set.

Brin (1999) proposed a method of extracting patterns that discover author-book pairs in Web pages. Pair occurrences are represented by the order of both names and by prefix, middle and suffix character context. Patterns are generated by first grouping seed occurrences by the order and the middle. For each group, the longest matching prefix and suffix are identified, and one pattern per group is extracted. Patterns are evaluated by their specificity, defined as the product of the lengths of the prefix, middle and suffix. Patterns with specificity below a predefined threshold are rejected [7]. Brin did not present any thorough evaluation or any accuracy data.

Agichtein and Gravano (2000) as well as Agichtein et al. (2001) follow Brin’s approach, extended with respect to the recognition of an unlimited set of relations between Named Entities, and to the generation and evaluation of patterns and extracted instances. Their system has been aptly named Snowball to reflect the iterative character of the algorithm. It starts with the seeds and an empty set of extraction patterns. In each round, it extracts new patterns and a new set of instances but keeps only those that have been evaluated as sufficiently reliable. The previous set of instances becomes the seed set for the next iteration.

The text is first processed by a named-entity tagger [8]. During pattern generation, text fragments including pairs of named entities in focus are extracted. For each unique pair, the left, middle and right contexts are represented as vectors of weights in 〈0, 1〉. The weights, produced from term frequencies, are meant to express the importance of the term for the context.
Weights for the middle context are higher to reflect its larger importance for the relation representation. The size of the left and right context is fixed to a specified text window.

During extraction, a text fragment around two named entities is transformed to a vector and compared with pattern vectors. It is accepted as an instance if the similarity of vectors (measured as a product) exceeds the threshold. Extracted patterns and instances are evaluated by confidence. The confidence of a pattern is the ratio of positive to negative matches, first measured for the initial instances. Pattern confidence for the next iteration is a combination of the new and old value. The confidence of an instance directly depends on the confidence of the patterns that select it and on the degree of matching between the instance and particular patterns. After each iteration, all instances with low confidence (below a threshold) are discarded.

[7] We have intentionally omitted the role of the URL addresses in pattern generation; we focus on a plain-text corpus as a source.
[8] The MITRE Corporation’s Alembic Workbench (Day et al., 1997) was used in Snowball.

Ravichandran and Hovy (2002) developed a weakly supervised algorithm for extraction of question-answer pairs of named entities, based only on seeds. They used a simple tokeniser and a simple sentence boundary recogniser, rather than advanced tools like the named-entity tagger in Snowball. This relaxation of assumptions made their algorithm more general. Patterns are extracted from sentences including seed occurrences by means of suffix trees for extracting substrings of optimal length. Pattern precision is calculated as the ratio of the correctly matched instance occurrences to all matches of the pattern. Instances are ordered by the precision of the patterns selecting them. The process is not iterative.

Pantel et al. (2004) proposed an algorithm for mining is-a relations from huge text corpora. Text is first processed by a part-of-speech tagger (Brill, 1995) and stored in a two-level format: surface word forms and part-of-speech tags. Next, all sentences including seeds are extracted. Patterns are learned from the sentences by calculating the minimal edit distance among sentences and registering the edit operations required. Patterns with relatively high occurrence and high precision are identified using the log-likelihood principle (Dunning, 1993) for scoring. Only the 15 highest-scoring patterns have been used to extract hypernymy instances.

Pantel and Pennacchiotti (2006) proposed a system called Espresso which seems to combine all interesting properties of its predecessors. It does not make any assumptions concerning the relation described by the patterns. It works on plain text, uses only a part-of-speech tagger and a simple chunker, and works iteratively during the subsequent phases of pattern and instance extraction and evaluation. It is also claimed to be weakly supervised, requiring only the initial set of seeds.
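
The pattern precision that the overview attributes to Ravichandran and Hovy (2002), the ratio of correctly matched instance occurrences to all matches of a pattern, can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names and the input format (triples of pattern, instance and a correctness flag) are assumptions made for illustration, and scoring an instance by the best precision among the patterns selecting it is only one possible reading of "ordered by the precision of the patterns".

```python
from collections import defaultdict


def pattern_precision(matches):
    """Return {pattern: correct matches / all matches}.

    `matches` is an iterable of (pattern, instance, is_correct) triples;
    how a match is judged correct is assumed to be decided elsewhere.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for pattern, _instance, is_correct in matches:
        total[pattern] += 1
        if is_correct:
            correct[pattern] += 1
    return {p: correct[p] / total[p] for p in total}


def rank_instances(matches, precision):
    """Order instances by the best precision among the patterns selecting them.

    Taking the maximum is an assumption of this sketch; ties are broken
    by sorting the instances themselves.
    """
    best = {}
    for pattern, instance, _is_correct in matches:
        best[instance] = max(best.get(instance, 0.0), precision[pattern])
    return sorted(best, key=lambda inst: (-best[inst], inst))


# Toy data: pattern "p1" is always right, pattern "p2" is right half the time.
matches = [
    ("p1", ("Mozart", "1756"), True),
    ("p1", ("Gandhi", "1869"), True),
    ("p2", ("Newton", "1642"), True),
    ("p2", ("Newton", "1727"), False),
]
precision = pattern_precision(matches)
ranking = rank_instances(matches, precision)
```

Because the process is described as not iterative, the sketch computes the precisions once and ranks the instances in a single pass.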
Taking into account the foregoing selective overview of the previous algorithms and the results of the evaluation of Espresso (Pantel and Pennacchiotti, 2006), we decided to use Espresso as the starting point for the development of an algorithm that supports the expansion of the core plWordNet.

Espresso

Espresso follows the bootstrapping paradigm in a version exemplified already in the Snowball system (Agichtein and Gravano, 2000). Seeds are used to extract the first set of patterns; the subsequent phases of instance and pattern extraction go on automatically. The following four main phases can be identified in Espresso.

1. Preprocessing: the input text is divided into tokens (some multiword expressions are identified) and run through a part-of-speech tagger.
2. Pattern induction: sentences including seeds are extracted and patterns are learned using the algorithm in (Ravichandran and Hovy, 2002).
3. Pattern selection: extracted patterns are statistically evaluated and ranked by instances inducing them; k top patterns are selected.
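
The bootstrapping paradigm shared by Snowball and Espresso can be outlined as a loop. The Python sketch below is only a schematic illustration under stated assumptions, not Espresso itself. The four callables are hypothetical placeholders for pattern induction, pattern scoring, instance extraction and instance scoring. The instance-filtering step follows the Snowball description (keep only sufficiently reliable instances) and does not reproduce any Espresso phase beyond the three listed above. The toy corpus of (left term, middle context, right term) triples is invented for the example.

```python
def bootstrap(seeds, induce_patterns, score_pattern, extract_instances,
              score_instance, k=1, threshold=0.5, max_rounds=5):
    """Schematic bootstrapping loop; all callables are hypothetical."""
    instances = set(seeds)
    patterns = []
    for _ in range(max_rounds):
        # Pattern induction from the current instances.
        candidates = induce_patterns(instances)
        # Pattern selection: rank candidates and keep the top k.
        ranked = sorted(candidates,
                        key=lambda p: (-score_pattern(p, instances), p))
        patterns = ranked[:k]
        # Instance extraction, keeping only sufficiently reliable instances.
        extracted = extract_instances(patterns)
        reliable = {inst for inst in extracted
                    if score_instance(inst, patterns) >= threshold}
        if reliable <= instances:
            break  # nothing new was found, so stop early
        # Simplification: accumulate instances as seeds for the next round.
        instances |= reliable
    return patterns, instances


# Toy corpus, invented for illustration only.
CORPUS = [
    ("dog", "is a kind of", "animal"),
    ("cat", "is a kind of", "animal"),
    ("oak", "is a kind of", "tree"),
    ("oak", "such as", "tree"),
    ("dog", "chased", "cat"),
]


def induce(instances):
    return {m for x, m, y in CORPUS if (x, y) in instances}


def pattern_score(pattern, instances):
    return sum(1 for x, m, y in CORPUS
               if m == pattern and (x, y) in instances)


def extract(patterns):
    return {(x, y) for x, m, y in CORPUS if m in patterns}


def instance_score(instance, patterns):
    return 1.0  # placeholder: every extracted instance counts as reliable


patterns, instances = bootstrap({("dog", "animal")}, induce, pattern_score,
                                extract, instance_score)
```

With the single seed (dog, animal), the loop in this toy setting selects the middle context "is a kind of" and stops once a round adds no new instances.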