Automatic Extraction of Examples for Word Sense Disambiguation


CHAPTER 2. BASIC APPROACHES TO WORD SENSE DISAMBIGUATION

Decision lists (Yarowsky, 1995; Martínez et al., 2002), as their name suggests, are simple ordered lists of rules of the form (condition, class, weight). Such rules are easier to understand if thought of as if-then-else rules: if the condition is satisfied, then the corresponding class is assigned. The third parameter, the weight, determines the order of the rules in the list: rules with higher weights are positioned higher in the list, while rules with lower weights appear further down. The order of a decision list matters during classification, since the rules are tested sequentially and the first rule that "succeeds" is used to assign the sense to the example. Usually the default rule in a list is the last one, which accepts all remaining cases.

Decision trees (Mooney, 1996) are very similar to decision lists but are not as often used for word sense disambiguation. They also rely on classification rules, but the rules are organized not in a list but in an n-ary branching tree structure that represents the training set. Every branch of the tree represents a rule that tests a conjunction of features, and the predicted class label is encoded in a terminal node (also called a leaf node: a node in a tree data structure that has no child nodes). Some of the problems with decision trees, which make them less suitable for WSD, are their computational cost and the data fragmentation they cause (breaking up the data into many small, dispersed subsets). The latter leads to an immense increase in computation if larger feature spaces are used.
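The classification procedure of a decision list described above can be sketched in a few lines. This is a minimal illustration, not an implementation from the thesis; the rules, weights, and sense labels for the ambiguous word "bank" are invented for the example.

```python
# Minimal sketch of a decision-list classifier for WSD.
# Rules are (condition, sense, weight); higher weight = higher in the list.

def classify(features, rules, default_sense):
    """Test rules in order of descending weight; the first match wins."""
    for condition, sense, _weight in sorted(rules, key=lambda r: -r[2]):
        if condition in features:
            return sense
    return default_sense  # the default rule accepts all remaining cases

# Toy rules for "bank" (hypothetical conditions and weights):
rules = [
    ("river", "bank/shore", 7.2),
    ("money", "bank/finance", 9.1),
    ("loan", "bank/finance", 5.4),
]

context = {"money", "deposit"}
print(classify(context, rules, "bank/finance"))  # highest-weight matching rule fires
```

Note that only the single best-matching rule decides the sense; unlike an ensemble, no votes are combined, which is what keeps decision lists so easy to inspect.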
The same effect is also triggered by the use of a large number of examples; however, if fewer training instances are provided, a relative decrease in the reliability of the predictions for the class label can be observed.

Rule combination for supervised word sense disambiguation means that a set of homogeneous classification rules is combined and learned by a single algorithm. AdaBoost (Schapire, 2003) is a frequently used rule combination algorithm. It combines multiple classification rules into a single classifier. The power of AdaBoost lies in the fact that the individual classification rules need not be very accurate; once combined, the resulting classifier can achieve an arbitrarily low training error rate.

Linear classifiers (also called binary classifiers) achieved considerably low results in the last few decades, and thus the highest interest in them was in the field of Information Retrieval. Classifiers of this kind decide on the classification label based on a linear combination of the features in their FVs. They aim to group the instances with the most similar feature values. The limited amount of work on linear classifiers has resulted in several articles, for example (Mooney, 1996; Escudero et al., 2000b; Bartlett et al., 2004; Abdi et al., 1996; Cohen and Singer, 1999). When a non-linear problem has to be decided, for which the expressivity of linear classifiers is not sufficient, the use of kernel functions (kernel methods) has been suggested.
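To make the rule-combination idea concrete, the following is a toy sketch of AdaBoost with one-dimensional threshold "stumps" as the weak rules. The data, thresholds, and labels are invented for illustration; real WSD systems boost rules over rich feature vectors, not a single number.

```python
import math

# Toy AdaBoost sketch: combine weak threshold rules into one classifier.
# Labels are +1/-1; data and stumps are illustrative only.

def stump(threshold, sign):
    return lambda x: sign if x > threshold else -sign

def adaboost(xs, ys, stumps, rounds=3):
    n = len(xs)
    w = [1.0 / n] * n          # per-example weights
    ensemble = []              # list of (alpha, weak rule)
    for _ in range(rounds):
        # pick the stump with the lowest weighted error
        best, best_err = None, float("inf")
        for h in stumps:
            err = sum(wi for wi, x, y in zip(w, xs, ys) if h(x) != y)
            if err < best_err:
                best, best_err = h, err
        best_err = max(best_err, 1e-10)            # avoid log(0)
        alpha = 0.5 * math.log((1 - best_err) / best_err)
        ensemble.append((alpha, best))
        # re-weight: misclassified examples gain weight
        w = [wi * math.exp(-alpha * y * best(x)) for wi, x, y in zip(w, xs, ys)]
        z = sum(w)
        w = [wi / z for wi in w]
    return ensemble

def predict(ensemble, x):
    return 1 if sum(a * h(x) for a, h in ensemble) > 0 else -1

xs = [1, 2, 3, 8, 9, 10]
ys = [-1, -1, -1, 1, 1, 1]
stumps = [stump(t, s) for t in (1.5, 5.5, 9.5) for s in (1, -1)]
model = adaboost(xs, ys, stumps)
print([predict(model, x) for x in xs])  # → [-1, -1, -1, 1, 1, 1]
```

Each weak rule here is barely better than chance on its own; the weighted vote is what drives the combined training error down, which is exactly the property the paragraph above attributes to AdaBoost.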

Kernel-based methods try to find more general (not only linear, as just noted) types of relations in the FVs. Their popularity has notably increased in the past few years, which can be seen from their growing participation in recent conferences such as Senseval-3 (see Section 4.5). Examples of applications of kernel methods in supervised approaches are the ones described by Murata et al. (2001); Boser et al. (1992); Lee and Ng (2002); Cristianini and Shawe-Taylor (2000); Carpuat et al. (2004); Wu et al. (2004); Popescu (2004); Ciaramita and Johnson (2004).

One of the most popular kernel methods is Support Vector Machines (SVMs), presented by Boser et al. (1992). As Màrquez et al. (2007) report, SVMs are built around the principle of Structural Risk Minimization from Statistical Learning Theory (Vapnik, 1998). In their basic form, SVMs are linear classifiers that view the input data as two sets of vectors in an n-dimensional space. They construct a separating hyperplane (in geometry, a hyperplane is a higher-dimensional generalization of a line) in that space, which is used to separate the two data sets. To calculate the margin between those data sets, two parallel hyperplanes are constructed, one on each side of the separating hyperplane, each pushed towards one of the two data sets. Naturally, a good separation is considered to be achieved by the hyperplane that has the largest distance to the neighboring data points. In cases where a non-linear classifier is desired, the SVM can be used with a kernel function.

Discourse properties are considered by the Yarowsky bootstrapping algorithm (Yarowsky, 1995). This algorithm is semi-supervised (see Section 2.4), which makes it hardly comparable with the other algorithms in this section, but it is considered (Màrquez et al., 2007) relatively important for the subsequent work on bootstrapping for WSD.
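The geometry behind the SVM decision rule can be sketched directly: a point is classified by which side of the hyperplane w·x + b = 0 it falls on, and its distance to that hyperplane is |w·x + b| / ||w||. The weight vector, bias, and points below are toy values chosen for illustration, and the RBF kernel is shown only to indicate how the same machinery becomes non-linear.

```python
import math

# Sketch of the SVM decision rule and margin computation (toy values).

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def decide(w, b, x):
    """Classify by the side of the hyperplane w.x + b = 0."""
    return 1 if dot(w, x) + b > 0 else -1

def margin(w, b, x):
    """Geometric distance of point x to the hyperplane."""
    return abs(dot(w, x) + b) / math.sqrt(dot(w, w))

def rbf_kernel(u, v, gamma=0.5):
    """A kernel function: implicit similarity in a non-linear feature space."""
    d2 = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * d2)

w, b = [1.0, -1.0], 0.0           # separating hyperplane x1 - x2 = 0
print(decide(w, b, [2.0, 0.5]))   # → 1 (above the hyperplane)
print(decide(w, b, [0.0, 3.0]))   # → -1 (below it)
```

Training an SVM amounts to choosing the w and b that maximize the smallest such margin over the training set; here they are simply given, since only the classification step is being illustrated.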
It uses either automatically or manually annotated training data that is supposed to be complete (to represent each of the senses in the set) but not necessarily large. This initially small set is used together with a supervised learning algorithm to annotate further examples. If the annotation of a given example is accomplished with a sufficiently high degree of confidence, the example is added to the "seed" set and the process continues.

Similarity-based methods are the family of methods most relevant to our thesis, and thus we provide some more in-depth information about them. However, our aim is still to give an overview of those methods so that our use of them can be better understood. Approaches of this kind are very often used in supervised WSD because they carry out the disambiguation process in a very simple way. They classify a new example via a similarity metric that compares it to previously seen examples and assigns a sense to it - usually the MFS in a pool of most similar examples. Over the years, probably because of its increased usage, the approach has gained a wide variety of names: instance-based, case-based, similarity-based, example-based, memory-based, exemplar-based, analogical. As a result of the fact that the data is stored in the memory without any restructuring or abstraction
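The similarity-based classification step just described can be sketched as a k-nearest-neighbour lookup: all training examples stay in memory, similarity is measured between feature sets, and the most frequent sense among the k closest examples is assigned. The stored examples, sense labels, and the overlap-based similarity below are invented for illustration.

```python
from collections import Counter

# Memory-based (exemplar-based) WSD sketch: store all training examples,
# rank them by feature overlap with the new context, and assign the most
# frequent sense among the k nearest ones. Data is illustrative only.

memory = [
    ({"money", "deposit", "account"}, "bank/finance"),
    ({"loan", "interest", "money"},   "bank/finance"),
    ({"river", "water", "fishing"},   "bank/shore"),
    ({"river", "grass", "sit"},       "bank/shore"),
]

def knn_sense(context, memory, k=3):
    ranked = sorted(memory, key=lambda ex: len(context & ex[0]), reverse=True)
    senses = [sense for _feats, sense in ranked[:k]]
    return Counter(senses).most_common(1)[0][0]   # MFS among the k neighbours

print(knn_sense({"money", "loan"}, memory))       # → bank/finance
```

Because no model is built at training time, all the computational work happens at classification time, which is why this family is also described as "lazy" learning.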
