CHAPTER 2. BASIC APPROACHES TO WORD SENSE DISAMBIGUATION 25

has in WordNet (Fellbaum, 1998). Refer to Section 6.4.1 for further information on the final FVs that we constructed for our system.

The test set is very similar in form to the training set. However, since it is used to evaluate the system, its feature vectors contain no class labels. System performance is normally believed to increase with the amount of training data, and thus the training set is usually a considerably larger portion of the whole data than the test set. It is possible, for example, to divide the data into 5 portions, of which 4 form the training set and 1 the test set, resulting in a ratio of 4:1. Other frequently used ratios are 2:1 and 9:1. According to Palmer et al. (2007), a division of 2:1 may provide a more realistic indication of a system’s performance, since a larger test set is considered. Still, labeled data is scarce, which is why it is preferably used for training rather than for testing.

Any particular split of the data, however, unquestionably introduces a bias. Consequently, a better measure of generalization accuracy has to be used in real experiments: n-fold cross-validation, and in particular 10-fold cross-validation (Weiss and Kulkowski, 1991). In n-fold cross-validation the data is divided into n folds, which should preferably be of equal size. Accordingly, n separate experiments are performed, and in each experiment (also called a fold) n-1 portions of the data are used for training and 1 for testing, in such a way that each portion is used as the test set exactly once. If n equals the sample size (the size of the data set), the process is called leave-one-out cross-validation.

2.3.5 Supervised WSD Algorithms

One of the main decisions that needs to be made when designing a supervised WSD system is the choice of the algorithm that is to be employed.
Table 2.5 on page 26 gives a basic overview of the most frequently used alternatives, together with literature where more information about them can be found. A short description of the algorithms is provided as well, in order to outline their usage and importance.
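Whichever algorithm is chosen, it would typically be evaluated with the n-fold cross-validation scheme described earlier. The following is only a minimal sketch of how the folds might be formed; the function name `n_fold_splits` is hypothetical, and the examples are assumed to be already shuffled and addressed by index.

```python
from typing import List, Tuple


def n_fold_splits(num_examples: int, n: int) -> List[Tuple[List[int], List[int]]]:
    """Return one (train_indices, test_indices) pair per fold."""
    if not 2 <= n <= num_examples:
        raise ValueError("n must be between 2 and the number of examples")
    # Fold sizes differ by at most one, so the folds are as equal as possible.
    base, extra = divmod(num_examples, n)
    splits = []
    start = 0
    for fold in range(n):
        size = base + (1 if fold < extra else 0)
        # This fold is the test portion; the remaining n-1 portions train.
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, num_examples))
        splits.append((train, test))
        start += size
    return splits


if __name__ == "__main__":
    # 10 examples in 5 folds: every experiment uses a 4:1 train/test ratio.
    for train, test in n_fold_splits(10, 5):
        print(len(train), len(test))  # prints "8 2" five times
```

Calling `n_fold_splits(k, k)` for a data set of `k` examples yields one test example per fold, which corresponds to leave-one-out cross-validation.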
Methods               Algorithms                            Literature
Probabilistic         Naïve Bayes                           (Duda et al., 2001)
                      Maximum Entropy                       (Berger et al., 1996)
Similarity-Based      Vector Space Model                    (Yarowsky et al., 2001)
                      k-Nearest Neighbor                    (Ng and Lee, 1996; Ng, 1997a)
                                                            (Daelemans et al., 1999)
Discriminating Rules  Decision Lists                        (Yarowsky, 1995; Martínez et al., 2002)
                      Decision Trees                        (Mooney, 1996)
Rule Combination      AdaBoost                              (Schapire, 2003)
                      LazyBoosting                          (Escudero et al., 2000a,b, 2001)
Linear Classifier     Perceptron                            (Mooney, 1996)
                      Winnow                                (Escudero et al., 2000b)
                      Exponentiated-Gradient                (Bartlett et al., 2004)
                      Widrow-Hoff                           (Abdi et al., 1996)
                      Sleeping Experts                      (Cohen and Singer, 1999)
                                                            (Murata et al., 2001)
Kernel-Based          Support Vector Machines               (Boser et al., 1992; Lee and Ng, 2002)
                      Kernel Principal Component Analysis   (Cristianini and Shawe-Taylor, 2000)
                                                            (Carpuat et al., 2004; Wu et al., 2004)
                      Regularized Least Squares             (Popescu, 2004)
                      Average Multiclass Perceptron         (Ciaramita and Johnson, 2004)
Discourse Properties  Yarowsky Bootstrapping                (Yarowsky, 1995)

Table 2.5: Supervised word sense disambiguation algorithms.

Probabilistic methods categorize each new example by using estimated probabilistic parameters. These parameters convey the probability distributions of the categories and of the contexts that are described by the features in the feature vectors.

Naïve Bayes (Duda et al., 2001) is one of the simplest representatives of the probabilistic methods; it presupposes the conditional independence of the features given the class label. The main idea is that an example is created by selecting the most probable sense for the instance and then selecting each of its features independently, according to their individual distributions. The algorithm uses the Bayes inversion rule (Fienberg, 2006).
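To make the Naïve Bayes idea concrete, the sketch below chooses the sense s that maximizes P(s) multiplied by the product of P(f|s) over the features f of an example, computed in log space. It is an illustration under stated assumptions, not the system described in this work: the class name `NaiveBayesWSD`, the add-alpha smoothing, the skipping of unseen features, and the toy "bank" data are all hypothetical choices made for the example.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Set, Tuple


class NaiveBayesWSD:
    """Toy Naive Bayes sense classifier over bag-of-features examples."""

    def __init__(self, alpha: float = 1.0) -> None:
        self.alpha = alpha  # add-alpha smoothing constant (an assumption)
        self.sense_counts: Counter = Counter()
        self.feature_counts: Dict[str, Counter] = defaultdict(Counter)
        self.vocabulary: Set[str] = set()

    def train(self, examples: Iterable[Tuple[List[str], str]]) -> None:
        """Count senses and per-sense feature occurrences."""
        for features, sense in examples:
            self.sense_counts[sense] += 1
            for feature in features:
                self.feature_counts[sense][feature] += 1
                self.vocabulary.add(feature)

    def log_score(self, features: List[str], sense: str) -> float:
        """log P(sense) + sum of log P(feature | sense), smoothed."""
        total = sum(self.sense_counts.values())
        score = math.log(self.sense_counts[sense] / total)
        denom = (sum(self.feature_counts[sense].values())
                 + self.alpha * len(self.vocabulary))
        for feature in features:
            if feature not in self.vocabulary:
                continue  # features never seen in training are ignored
            count = self.feature_counts[sense][feature]
            score += math.log((count + self.alpha) / denom)
        return score

    def classify(self, features: List[str]) -> str:
        """Return the sense with the highest posterior score."""
        return max(self.sense_counts, key=lambda s: self.log_score(features, s))


if __name__ == "__main__":
    model = NaiveBayesWSD()
    model.train([
        (["money", "deposit", "account"], "finance"),
        (["loan", "money", "interest"], "finance"),
        (["river", "water", "shore"], "river"),
    ])
    print(model.classify(["water", "river"]))  # prints "river"
    print(model.classify(["money"]))           # prints "finance"
```

Because every feature contributes its own independent factor, the score is a simple sum of logarithms; this is exactly where the conditional independence assumption enters.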
The independence assumption is often considered a problem for Naïve Bayes, and thus alternative algorithms such as the decomposable model of Bruce and Wiebe (1994) have been developed. Maximum entropy (Berger et al., 1996) is another quite robust probabilistic approach that combines stochastic evidence from multiple different sources without the need for any prior knowledge of the data.

Discriminating rules assign a sense to an example by selecting one or more predefined rules that are satisfied by the features in the example and hence selecting the sense that the predictions of those rules yield. Examples of such methods are decision lists and decision trees.