Automatic Extraction of Examples for Word Sense Disambiguation

CHAPTER 2. BASIC APPROACHES TO WORD SENSE DISAMBIGUATION

Table 2.3: Performance and short description of the supervised systems participating in the SENSEVAL-3 English lexical sample Word Sense Disambiguation task. Precision (P) and recall (R) figures (see Section 4.1) are provided for both fine-grained and coarse-grained scoring (Mihalcea et al., 2004a).

Figure 2.1: Supervised machine learning.

An attempt to solve a problem with a supervised machine learning method starts with identifying a data set that is suitable for the given task (see Section 2.3.2). This set is then pre-processed (Section 2.3.3) and divided into the required training and test sets (Section 2.3.4). Training, however, can only begin once a suitable algorithm has been selected, together with the parameters that are most likely to give good results (Section 2.3.5). Once this setup is ready, the test set is used to "assess" the future classifier. If the evaluation is satisfactory, the classifier is finished; if the evaluation yields unacceptable results, the process can be restarted at any of its previous stages.

The main idea behind this machinery is that, once provided with a training set, the selected algorithm induces a specific classifier. This classifier is normally described as a hypothesis of a function that maps examples (in our case the test examples) to the classes that have already been observed in the training cases. We already know that human annotators label the training data manually, but a good question at this point is how they know which set of senses they can use for classification. The answer is simple: they rely on so-called sense inventories.

2.3.1 Sense Inventories

Machine-readable dictionaries are most often the sources of the sense inventories that are used for the manual annotation of supervised WSD corpora and thus for the creation of the training and test sets. The most extensively used ones nowadays are WordNet (Miller, 1990; Fellbaum, 1998) for English, used for example in the DSO corpus (Ng and Lee, 1996), SemCor (Miller et al., 1993), and Open Mind Word Expert (OMWE) (Mihalcea and Chklovski, 2003), and EuroWordNet (Vossen, 1998) for languages such as Dutch, Italian, Spanish, German, French, Czech, and Estonian. In Section 2.3.2 we point out which sense inventories have been used for the corresponding sense-annotated corpora, but we do not explicitly keep track of their version (e.g. WordNet 1.6, 1.7, 1.7.1, 2.0) since

