A Wordnet from the Ground Up

verified. The resulting patterns are run on a large validating corpus (for Espresso, the Web). A confidence measure is computed from the collected frequencies and compared with a threshold.

In Espresso, the Internet served as a validating corpus for instances extracted by generic patterns. We needed other resources because of the paucity of Polish Web pages and the inherent difficulty of querying independently of inflectional variety. A second large corpus (Rzeczpospolita, 2008), still much smaller than the data from the Web and even than IPIC used for the extraction of the patterns, served the purpose of validation in Estratto. The necessary condition for finding occurrences of the patterns extracted from the primary corpus seems to be that the validating corpus cover similar genres and domains.

The induction of patterns and the extraction of instances in Estratto are controlled by the following set of parameters:

1. the number k of top patterns not to be discarded (preserved for the next iterations),
2. the threshold measure of confidence for instances,
3. the minimum frequency and maximum frequency values for patterns,
4. the minimum size of a pattern (all patterns that consist only of matching locations and conjunctions are discarded by definition),
5. a filter on common words in instances and on instances with identical LUs on both positions,
6. the size of the validating corpus.

4.4 Benefits of Extracted Patterns for Wordnet Expansion

We investigated the use of algorithms like Espresso in order to find a method for extracting valuable instances of wordnet relations, at least hypernymy, with precision higher than that afforded by the handwritten lexico-morphosyntactic patterns. We did not expect ready-to-add hypernymy instances.
We only wanted to construct yet another source of knowledge that suggests hypernymy occurrences and the correct direction of the relation.

It is far from trivial to evaluate the extracted lexico-semantic resources properly (Section 3.3). It is much easier for lists of instances: we verify how many of them are correct. Only two comparisons were possible for Polish:

• with the existing structure of the core plWordNet,
• with an evaluation by one of the co-authors.
Chapter 4. Extracting Relation Instances

The former introduces a bias (plWordNet is still relatively small) but enables testing the whole set of instances, while manual evaluation is always laborious and can be performed only on a sample. Yet the samples were chosen in the same way as for the manual patterns (Israel, 1992), so the results can be extended to the whole sets with 95% confidence.

In both types of comparison we applied the standard measures of precision and recall (Manning and Schütze, 2001)⁹.

Precision and recall are defined in the standard way: tp is the number of true positives (extracted pairs of LUs which are instances of the target relation), fp is the number of false positives (incorrect instances marked by algorithms as correct), and fn is the number of false negatives (correct instances in the text that were not extracted by the algorithm).

\[ P = \frac{tp}{tp + fp} \tag{4.7} \]

\[ R = \frac{tp}{tp + fn} \tag{4.8} \]

Note that the denominator in R accounts for correct patterns or instances that were either marked as incorrect or not extracted at all. We cannot treat the limited core plWordNet as the exhaustive description of relations. That is why recall in our approach only measures the ratio of rediscovery of the plWordNet structure. It is not a recall in terms of all correct instances in the corpus or patterns that the corpus supports.

Thus, following Pantel and Pennacchiotti (2006), we also use the relative recall measured in relation to the results of some other algorithm (Kurc, 2008, p. 72):

\[ R_{A|B} = \frac{R_A}{R_B} = \frac{tp_A / C}{tp_B / C} = \frac{tp_A}{tp_B} = \frac{P_A \times (tp_A + fp_A)}{P_B \times (tp_B + fp_B)} \tag{4.9} \]

where R_A and R_B denote the recall of the algorithms A and B, and C is the unknown number of instances occurring in the corpus.

We extracted a ranked list of possible instances which can be sorted in descending order by reliability.
The values are real numbers and there is no characteristic point below which we can cut off the rest of pairs according to some analytical properties. Thus, instead of pure precision and recall, we prefer to use cut-off precision and cut-off recall calculated only in relation to some n first positions on the sorted list of results (instances or patterns).

In the end, then, we used three evaluation measures.

1. Cut-off precision based on plWordNet marks as correct only those instances and patterns that were found both in plWordNet and on an additional list provided a

⁹ The F-measure could not be applied because of the limitations of recall based on plWordNet, to be discussed later.
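The measures of this section, precision (4.7), recall (4.8), relative recall (4.9) and cut-off precision over the first n ranked results, can be illustrated with a minimal Python sketch. It is not the evaluation code used for Estratto. The function names, the representation of an instance as a (hyponym, hypernym) pair, the single reference set standing in for plWordNet, and the toy Polish pairs are all assumptions made for illustration only.

```python
from typing import List, Set, Tuple

# Hypothetical representation of an extracted instance: (hyponym, hypernym).
Pair = Tuple[str, str]


def precision(tp: int, fp: int) -> float:
    """P = tp / (tp + fp), as in (4.7)."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """R = tp / (tp + fn), as in (4.8)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def relative_recall(p_a: float, extracted_a: int,
                    p_b: float, extracted_b: int) -> float:
    """R_{A|B} as in (4.9).

    extracted_x stands for tp_x + fp_x, the number of instances returned
    by algorithm x. The unknown corpus total C cancels out, so only the
    precision and the output size of each algorithm are needed.
    """
    return (p_a * extracted_a) / (p_b * extracted_b)


def cutoff_precision(ranked: List[Pair], reference: Set[Pair], n: int) -> float:
    """Precision over the first n positions of a list sorted by reliability.

    Assumption: an instance counts as correct if it occurs in the
    reference set (a stand-in for plWordNet or a manual judgement).
    """
    top = ranked[:n]
    if not top:
        return 0.0
    hits = sum(1 for pair in top if pair in reference)
    return hits / len(top)


# Toy data, invented for illustration; sorted by descending reliability.
ranked = [
    ("pies", "zwierzę"),
    ("kot", "zwierzę"),
    ("dom", "kot"),
    ("róża", "kwiat"),
]
reference = {("pies", "zwierzę"), ("róża", "kwiat")}

print(cutoff_precision(ranked, reference, 2))   # 1 hit among the top 2
print(relative_recall(0.5, 40, 0.25, 40))       # tp_A = 20, tp_B = 10
```

In the toy data only one of the two top-ranked pairs occurs in the reference set, so the cut-off precision at n = 2 is 0.5. The relative recall call shows why C is not needed: with equal output sizes, doubling the precision doubles the number of true positives and hence the recall relative to the other algorithm.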