Chapter 6. Automatic Extraction of Examples for Word Sense Disambiguation
(8) The alarm is activated by the lightest pressure.

So, having our POS-annotated examples, only one thing remains to be done before we acquire our training and test sets: the extraction of the FVs from the prepared instances.

6.4 Training and Test Sets

6.4.1 Feature Vectors

The context features that we finally chose for our FVs are selected out of the pool of features (see Appendix B) that we considered sensible for the task. Similar to the vectors shown in Table 2.4 on page 24, our feature vectors are built up from the collection of features shown in Table 6.3 on page 58. A few things can be noted here. First, our choice of the set of features ensures that the vectors are descriptive enough of the occurrence of the target word in a given context. This set of selected features led to good results for Romanian (Dinu and Kübler, 2007), as well as to the training of good and precise word-experts in our system. Another interesting point concerns features like VA and PB in our case, in other words the features that are not covered by the context of the word. Such features get "zero" or "placeholder" values, so that their position in the vector is kept but no actual value is stated. Finally, our FV looks like this:

The alarm is activated by the lightest DT NN VBZ VBN IN DT JJS pressure alarm - is by - 38201

and is ready to be included in our training set.
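The windowed part of such a vector (CT−3 … CT3 and CP−3 … CP3) can be sketched as follows. This is a minimal illustration, not the thesis's actual implementation; the function name and data layout are assumptions, and the "-" placeholder stands in for the "zero" values used when a position falls outside the sentence.

```python
# Illustrative sketch (not the thesis's code): collect the tokens and POS
# tags in a +/-3 window around the target word. Positions outside the
# sentence receive the "-" placeholder so every vector has the same length.

def window_features(tokens, tags, i, size=3):
    """Tokens and POS tags at offsets -size..+size around index i."""
    words, pos = [], []
    for off in range(-size, size + 1):
        j = i + off
        if 0 <= j < len(tokens):
            words.append(tokens[j])
            pos.append(tags[j])
        else:
            words.append("-")   # placeholder: position not covered by context
            pos.append("-")
    return words + pos

tokens = ["The", "alarm", "is", "activated", "by", "the", "lightest", "pressure"]
tags   = ["DT", "NN", "VBZ", "VBN", "IN", "DT", "JJS", "NN"]
fv = window_features(tokens, tags, tokens.index("activated"))
print(fv)
# ['The', 'alarm', 'is', 'activated', 'by', 'the', 'lightest',
#  'DT', 'NN', 'VBZ', 'VBN', 'IN', 'DT', 'JJS']
```

For a target word near the sentence boundary, e.g. index 0, the left positions come out as "-", which is exactly the placeholder behaviour described above.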
Feature   Description                                        In Our Toy Example
CT−3      token at position -3 from the TW                   The
CT−2      token at position -2 from the TW                   alarm
CT−1      token at position -1 from the TW                   is
CT0       the target word                                    activated
CT1       token at position 1 from the TW                    by
CT2       token at position 2 from the TW                    the
CT3       token at position 3 from the TW                    lightest
CP−3      the POS tag of the token at position -3 from the TW   DT
CP−2      the POS tag of the token at position -2 from the TW   NN
CP−1      the POS tag of the token at position -1 from the TW   VBZ
CP0       the POS tag of the target word                     VBN
CP1       the POS tag of the token at position 1 from the TW    IN
CP2       the POS tag of the token at position 2 from the TW    DT
CP3       the POS tag of the token at position 3 from the TW    JJS
NA        the first noun after the TW                        pressure
NB        the first noun before the TW                       alarm
VA        the first verb after the TW                        -
VB        the first verb before the TW                       is
PA        the first preposition after the TW                 by
PB        the first preposition before the TW                -
Answer    the answer (only in the training features)         38201

Table 6.3: Features included in the feature vectors of our system and their corresponding values from our toy example.

6.4.2 Training Set

The training set we have is a combination of manually and automatically annotated data from the sources we point out in Section 2.3.2. The manually annotated part of the corpus is the original Senseval-3 training set.11 The total number of instances in the training set is 7,860; a more detailed distribution of the examples can be seen in Appendix C at the end of the thesis.

6.4.3 Test Set

The test set12 that we use is the one provided by the Senseval-3 English lexical sample task. It has a total of 3,944 manually annotated examples. A comprehensive description of the number of examples per word is given in Appendix C.
We kept the complete size of the test set and the only

11 http://www.cse.unt.edu/~rada/senseval/senseval3/data/EnglishLS/EnglishLS.train.tar.gz
12 http://www.cse.unt.edu/~rada/senseval/senseval3/data/EnglishLS/EnglishLS.test.tar.gz