A Wordnet from the Ground Up

4.3. Generic Patterns Verified Statistically

4. Instance extraction: the selected patterns are used to extract instances; the instances are next statistically evaluated using the patterns that match their occurrences; the m top instances are selected and kept for the next iteration. A possible expansion phase for extended retrieval of instances can take place before the selection.

The first step is performed once at the beginning. Unless the stop condition is met (it depends on the number of extracted patterns and on the decrease of the average pattern score in relation to the previous iteration), the next iteration starts from step 2.

Preprocessing used the Alembic Workbench part-of-speech tagger (Day et al., 1997) but no shallow parser. Multiword terms, if left unrecognised, would decrease the accuracy of the extraction algorithm, because instances would be generated from parts of the complex terms. We noticed this problem in the case of the manually constructed patterns discussed in Section 4.1. Instead of using a shallow parser, Espresso applied a definition of multiword terms as a regular expression (Pantel and Pennacchiotti, 2006, p. 115). This simple solution cannot be directly ported to languages typologically different from English, such as Polish, with its rich morphology and flexible word order. Morphology and word-order flexibility will be discussed shortly in the context of a proposed modification of Espresso named Estratto.

Pattern induction is based on the algorithm of Ravichandran and Hovy (2002), discussed earlier, with only one modification: all recognised multiword terms are replaced with the label TR.

The statistical reliability measures, introduced in Espresso for ranking patterns and instances, follow the same basic scheme of recursive dependency of both measures: the reliability of instances depends on the patterns which extracted them, and the other way around.
This scheme can be traced back to Snowball (Agichtein and Gravano, 2000) and to Ravichandran and Hovy (2002), but there it was implemented in a less sophisticated form. Statistical evaluation is clearly the element that is missing when working with manually extracted patterns. Reliability calculation is Espresso's key element, so we present it in more detail. For the needs of ranking and selecting patterns, Espresso introduced a reliability measure:

$$r_\pi(p) = \frac{\sum_{i \in I} \left( \frac{pmi(i,p)}{max_{pmi}} \cdot r_t(i) \right)}{|I|} \tag{4.1}$$

Here $p$ is a pattern, $i$ is an instance, $r_t$ is a reliability measure for instances, $pmi$ is the Pointwise Mutual Information [PMI] measure, explained below, and $|I|$ is the size of the instance set.

The reliability of each seed delivered to Espresso is 1, so pattern reliability is proportional to the average strength of association between the pattern and the seeds, measured by PMI. Later, associating a pattern with a larger number of more reliable instances increases the pattern's reliability.

The reliability of instances is defined symmetrically: replace the reliability of instances with the reliability of patterns $r_\pi(p)$, and the set of instances with the set of patterns $P$:

$$r_t(i) = \frac{\sum_{p \in P} \left( \frac{pmi(i,p)}{max_{pmi}} \cdot r_\pi(p) \right)}{|P|} \tag{4.2}$$

PMI originates from Information Theory. It measures the strength of association between two events:

$$pmi(i,p) = \log \frac{|x,p,y| \; |*,*,*|}{|x,*,y| \; |*,p,*|} \tag{4.3}$$

$|x,p,y|$ is the number of occurrences of $x$ and $y$ in contexts matching the pattern $p$, $|x,*,y|$ is the number of co-occurrences of $x$ and $y$ in the corpus regardless of the pattern, and so on.

The definition of pmi presented by Pantel and Pennacchiotti (2006) does not include the constituent $|*,*,*|$ (the number of contexts). The PMI measure, however, should usually be greater than 0, while pmi as defined in (Pantel and Pennacchiotti, 2006) is not. The missing constituent is also suggested by the general definition of PMI:

$$pmi(i,p) = \log \frac{p(I,P)}{p(I)\,p(P)} \tag{4.4}$$

Because PMI is significantly higher when instances and patterns are not numerous (e.g., < 10), PMI is multiplied by a discounting factor proposed in (Pantel and Ravichandran, 2004) that decreases the bias towards infrequent events.

In Espresso, generic patterns are defined as generating 10 times more instances than previously accepted reliable patterns. They extract many instances but are characterised by lower reliability. Generic patterns are not excluded by definition. They increase recall (the number of correct instances extracted), but inevitably decrease precision. In order to prevent an excessive reduction of precision, an additional measure of confidence of instances has been introduced. It is based on the evaluation of instances against reliable patterns only, and on additional data acquired by searching the Web with queries generated from instances and patterns:

$$S(i) = \sum_{p \in P_R} S_p(i) \cdot \frac{r_\pi(p)}{T} \tag{4.5}$$

$P_R$ is the set of reliable patterns (given a threshold), $S_p(i)$ is the PMI between $i$ and $p$ measured on the data acquired from the Web (using Google queries), and $T$ is the sum of the reliabilities of the reliable patterns.
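The measures in (4.1) to (4.3) and (4.5) can be illustrated with a minimal Python sketch. It is not Espresso's implementation: the function names, the dictionary-based tables and the toy numbers are all hypothetical. It also makes three assumptions: max_pmi is the largest PMI value in the table, an instance-pattern pair absent from a table contributes 0, and the discounting factor of Pantel and Ravichandran (2004) is left out.

```python
import math


def pmi(n_xpy, n_xy, n_p, n_all):
    """Equation (4.3): log(|x,p,y| * |*,*,*| / (|x,*,y| * |*,p,*|))."""
    return math.log((n_xpy * n_all) / (n_xy * n_p))


def pattern_reliability(pattern, instances, pmi_table, r_inst, max_pmi):
    """Equation (4.1): mean over I of pmi(i, p) / max_pmi * r_t(i)."""
    total = sum(
        # Assumption: a pair missing from the table contributes 0.
        pmi_table.get((i, pattern), 0.0) / max_pmi * r_inst[i]
        for i in instances
    )
    return total / len(instances)


def instance_reliability(instance, patterns, pmi_table, r_pat, max_pmi):
    """Equation (4.2): mean over P of pmi(i, p) / max_pmi * r_pi(p)."""
    total = sum(
        pmi_table.get((instance, p), 0.0) / max_pmi * r_pat[p]
        for p in patterns
    )
    return total / len(patterns)


def confidence(instance, reliable_patterns, web_pmi, r_pat):
    """Equation (4.5): sum over P_R of S_p(i) * r_pi(p), divided by T."""
    t = sum(r_pat[p] for p in reliable_patterns)  # T in (4.5)
    if t == 0:
        raise ValueError("reliable patterns must have non-zero total reliability")
    weighted = sum(
        web_pmi.get((instance, p), 0.0) * r_pat[p] for p in reliable_patterns
    )
    return weighted / t


# Toy run: two seeds with reliability 1 and a single pattern "p".
seeds = ["i1", "i2"]
r_inst = {"i1": 1.0, "i2": 1.0}
pmi_table = {("i1", "p"): 2.0, ("i2", "p"): 1.0}
max_pmi = max(pmi_table.values())  # assumption: maximum over the whole table
r_p = pattern_reliability("p", seeds, pmi_table, r_inst, max_pmi)
print(r_p)  # (2.0/2.0 * 1 + 1.0/2.0 * 1) / 2 = 0.75
```

With the toy counts |x,p,y| = 2, |x,∗,y| = 4, |∗,p,∗| = 5 and |∗,∗,∗| = 100, `pmi(2, 4, 5, 100)` equals log 10 and is positive, whereas the same counts without the |∗,∗,∗| constituent give log 0.1, which is negative. This is the behaviour that motivates including the number of contexts in (4.3).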