A Wordnet from the Ground Up

4.3. Generic Patterns Verified Statistically

Only instances with the confidence measure above a threshold are selected for the next iteration and used to induce and evaluate patterns. The implicit assumption is that the confidence measure can be calculated for the majority of instances: reliable patterns, which are less frequent but more specific, should occur at least a few times with the majority of correct instances on the Web. We discard all correct instances which are not covered by the Web data matching reliable patterns.

It is worth emphasizing that the evaluation of patterns in the next iteration is based on the instances used to induce these patterns, not on the instances which they extracted.

The intuition behind the measures of reliability and confidence is that patterns which describe the given relation well often occur with many confident instances of this relation, and the other way around. The difference is that instances extracted by generic patterns will get high confidence if they occur in contexts matched by the specific patterns of good reliability in the validating corpus.

The measures of reliability and confidence reduce the need for manual supervision once Espresso has started. In a way they define the degree to which the extracted instances express the target relation and the patterns describe contexts. The former is an advantage over manual patterns, for which collected frequencies are mostly low and accidental, and say little about the quality of instances.

The system of measures and the selection of instances and patterns are universal and do not refer to any properties of any particular relation being extracted. Espresso can therefore be applied to a wide range of relations.
Espresso was applied to hypernymy, meronymy and antonymy, as well as to more specific relations such as person-company or person-job title (Pantel and Pennacchiotti, 2006).

To sum up, Pantel and Pennacchiotti (2006) list Espresso's characteristics:
• high recall together with a small decrease in the precision of extracted instances,
• autonomy of work (weakly supervised algorithm): only several instances of the given relation must be defined at the beginning,
• independence of the size of the corpus or domain used,
• a wide range of relation types that can be extracted.

Estratto

Estratto (Kurc, 2008; Kurc and Piasecki, 2008) is a modification of Espresso developed mainly to cope with the significant differences between English and Polish: rich morphology, flexible word order, and the much more limited size of, and access to, Web resources. Let us first present one language-independent adjustment.

Two issues are unclear in how the reliability and confidence measures work. First, reliability is sensitive to fluctuations in PMI values. Higher values (e.g., the effect of small frequencies, even after discounting dependent on the number of occurrences) can cause lower assessment of patterns with a balanced ratio of co-occurrence with matched instances versus the pattern occurrences and instance occurrences alone. Such situations result in artificially increased values of $\max_{pmi}$. We would like to look for a measure which would be less sensitive to the low frequency of pattern matches or instances matched. Second, the value 1 of pattern reliability is not guaranteed even for a pattern which occurs only with a subset of seeds, because the $\max_{pmi}$ value can be increased by some infrequent pattern. That is why the propagation of reliability to the subsequent iterations causes new values (calculated for patterns from instances and vice versa) to become gradually lower for the respective set. We seek a measure of reliability which returns 1 as the value for the best patterns or instances in every iteration.

\[
r_\pi(p) = \frac{\sum_{i \in I} \bigl(\mathrm{pmi}(i, p) \cdot r_\iota(i)\bigr) \cdot d(I, p)}{\max_{p}\Bigl(\sum_{i \in I} \bigl(\mathrm{pmi}(i, p) \cdot r_\iota(i)\bigr)\Bigr) \cdot |I|} \tag{4.6}
\]

Here $d(I, p)$ defines how many unique instances the given pattern is associated with. PMI in formula (4.6) is usually also modified by a discounting factor.

The proposed modifications are intended to increase the reliability of patterns which not only extract a lot of instances, but also occur with a large number of different instances. The modified measure, when applied to instances, promotes those which occur often in the corpus and are associated with many different patterns.

The choice of the pattern structure is crucial for the expressiveness of patterns and their ability to capture those elements of the language structures that express semantic relations between LUs: for example, the linear order of constituents in English, but the case-marking of noun phrases in Polish, where the linear order of noun phrases is mostly insignificant for the potential lexical semantic relation between them.
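Formula (4.6) can be sketched in code as follows. This is a minimal illustration, not the Estratto implementation: the function name `pattern_reliability`, the dictionary layout, and the toy PMI values are all assumptions made for the example. It also assumes that PMI values are positive and already discounted, and that a pair absent from the `pmi` dictionary never co-occurred.

```python
def pattern_reliability(pmi, instance_reliability):
    """Score patterns following formula (4.6).

    pmi: dict mapping (instance, pattern) -> PMI value (assumed positive).
    instance_reliability: dict mapping instance -> reliability of the instance.
    Returns a dict mapping pattern -> reliability.
    """
    instances = set(instance_reliability)
    weighted = {}   # sum over instances of pmi(i, p) * reliability(i)
    distinct = {}   # d(I, p): number of unique instances seen with p
    for (instance, pattern), value in pmi.items():
        if instance not in instances:
            continue
        weighted[pattern] = weighted.get(pattern, 0.0) + value * instance_reliability[instance]
        distinct[pattern] = distinct.get(pattern, 0) + 1

    best = max(weighted.values(), default=0.0)
    if best <= 0:
        raise ValueError("formula (4.6) needs at least one positive weighted sum")

    # Normalise by the best weighted sum and by the size of the instance set.
    return {
        pattern: weighted[pattern] * distinct[pattern] / (best * len(instances))
        for pattern in weighted
    }


# Hypothetical toy data: two seed instances, two candidate patterns.
seeds = {("dog", "animal"): 1.0, ("oak", "tree"): 1.0}
toy_pmi = {
    (("dog", "animal"), "x is a y"): 2.0,
    (("oak", "tree"), "x is a y"): 2.0,
    (("dog", "animal"), "x and y"): 1.0,
}
scores = pattern_reliability(toy_pmi, seeds)
```

In this toy setting the pattern with the largest weighted sum also co-occurs with every instance, so it receives the value 1, while the pattern seen with a single instance is penalised both by its lower weighted sum and by the factor d(I, p) / |I|.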
Espresso follows roughly the scheme proposed by Hearst (1992): patterns are regular expressions, in which the alphabet includes word forms, the label TR for any multiword term, and a set of variables for noun phrases matched as elements of an instance. The role of part-of-speech tags is unclear in the approach of Pantel and Pennacchiotti (2006), but they are present in the example of the generalisation of a sentence [p. 115]:

Because/IN TR is/VBZ a/DT TR and/CC x is/VBZ a/DT y.

We assumed that patterns are simplified regular expressions, with the Kleene closure but without grouping. The alphabet for an inflectional language like Polish should include roots rather than (numerous) word forms. Espresso patterns rely to some extent on the positional, linear syntactic structure of an English sentence. Porting to a significantly different language may be problematic.
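To make the notion of a pattern as a simplified regular expression more concrete, the sketch below compiles such a pattern into a Python regular expression over a tagged sentence. It is only an illustration under strong assumptions: the pattern syntax (space-separated symbols, a trailing `*` for the Kleene closure), the helper names `compile_pattern` and `extract`, and the reduction of every term or noun phrase to a single noun token are all invented for the example, as is the tagged sentence.

```python
import re

# Assumption: a term (TR) or a noun-phrase variable is one token tagged NN or NNS.
NOUN = r"\w+/NNS?"


def compile_pattern(pattern):
    """Translate a space-separated pattern into a compiled regular expression.

    Symbols: 'x' and 'y' are variables, 'TR' is any term, a trailing '*'
    marks the Kleene closure of a symbol, anything else is a literal token.
    """
    parts = [r"(?<!\S)"]  # start matching at a token boundary
    for symbol in pattern.split():
        starred = symbol.endswith("*")
        core = symbol[:-1] if starred else symbol
        if core in ("x", "y"):
            piece = rf"(?P<{core}>{NOUN})"
        elif core == "TR":
            piece = NOUN
        else:
            piece = re.escape(core)
        # Every token in the text is followed by exactly one space.
        parts.append(rf"(?:{piece} )*" if starred else piece + " ")
    return re.compile("".join(parts))


def extract(pattern, tagged_sentence):
    """Return the (x, y) word pair matched by the pattern, or None."""
    match = compile_pattern(pattern).search(tagged_sentence + " ")
    if match is None:
        return None
    return tuple(match.group(name).rsplit("/", 1)[0] for name in ("x", "y"))


generalised = "Because/IN TR is/VBZ a/DT TR and/CC x is/VBZ a/DT y"
sentence = "Because/IN gold/NN is/VBZ a/DT metal/NN and/CC oak/NN is/VBZ a/DT tree/NN"
pair = extract(generalised, sentence)
```

The named groups are an artefact of Python's `re` module, used here only to read off the variables; the pattern language itself stays free of grouping. For Polish, the literal symbols would more plausibly be roots or lemmas with case information than surface word forms, which this sketch does not attempt to model.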