Automatic Extraction of Examples for Word Sense Disambiguation
CHAPTER 5. TIMBL: TILBURG MEMORY-BASED LEARNER

installation is fairly straightforward on the majority of UNIX-based systems. TiMBL was originally designed as a solution for linguistic classification tasks; however, it can be used for any other categorization task with appropriate (symbolic or numeric) features and discrete (non-continuous) classes for which training data is available. This requirement leads us back to the already discussed acute shortage of labeled data.

5.2 Application

As mentioned above, the training data for WSD, namely the data that TiMBL uses in the process of learning, is represented by feature vectors (FVs) whose exact structure is shown in Section 2.3.4. The format of the feature files is flexible, since TiMBL is able to guess the format in most cases. However, we will stick to the most commonly used format, in which each feature vector is written on a single line with its features separated by spaces.

As an example, consider a situation in which we have a training set (delivered to TiMBL as the file data.train) and a test set (data.test). We run the tool as follows:

> Timbl -f data.train -t data.test

TiMBL then produces a new file, data.test.IB1.O.gr.k1.out, which essentially contains the data from our test file data.test. However, the system adds a new feature to each FV, which represents the class that it has predicted for the vector. The experiment is conducted with the system's default parameters, and the results are sent to standard output (or, if needed, written to a separate data file). For more detailed information on the format and content of the results, refer to (Daelemans et al., 2007).

The name of the output file, data.test.IB1.O.gr.k1.out, encodes the most important information about the conducted experiment.
The first two parts represent the name of the test file that was used for the analysis (data.test) and, together with .out, identify the file as the output of the experiment; IB1 represents the memory-based learning algorithm that was employed, in this case the k-NN algorithm; O indicates that similarity was computed with weighted overlap; gr means that the relevance weights were computed with gain ratio; and finally k1 represents the number of most similar patterns in memory on which the output label was based.

If these default settings are the ones needed for the planned experiment, there is not much more to do. However, when we discussed supervised WSD methods we mentioned multiple algorithms that could be employed for the purpose, and TiMBL supports a good selection of them. A similarly wide range of choices exists for the distance metrics that TiMBL can use to determine the similarity between instances. All of these options can be specified directly on the command line when running an experiment with TiMBL:

> Timbl -k 3 -f data.train -t data.test
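The output conventions described above can be illustrated with a short Python sketch. It is not part of TiMBL: the names TimblRun, decode_output_name and accuracy are hypothetical helpers introduced here. The sketch assumes the five-part default file name layout discussed in the text, and it assumes that each test vector already carries its gold class as its last field, so that the predicted class added by TiMBL becomes the final field of each output line. Other option combinations or file formats may well produce different layouts.

```python
from typing import Iterable, NamedTuple


class TimblRun(NamedTuple):
    """Settings read off an output file name (hypothetical helper type)."""
    test_file: str   # name of the test file, e.g. "data.test"
    algorithm: str   # e.g. "IB1"
    metric: str      # e.g. "O" (weighted overlap)
    weighting: str   # e.g. "gr" (gain ratio)
    k: int           # number of nearest neighbors


def decode_output_name(name: str) -> TimblRun:
    """Split a name such as data.test.IB1.O.gr.k1.out into its parts.

    Assumption: the layout is
    <test file>.<algorithm>.<metric>.<weighting>.k<N>.out
    """
    if not name.endswith(".out"):
        raise ValueError(f"not an output file name: {name!r}")
    stem = name[: -len(".out")]
    # Split from the right so that dots inside the test file name survive.
    parts = stem.rsplit(".", 4)
    if len(parts) != 5:
        raise ValueError(f"unexpected layout: {name!r}")
    test_file, algorithm, metric, weighting, k_part = parts
    if not (k_part.startswith("k") and k_part[1:].isdigit()):
        raise ValueError(f"missing k<N> segment: {name!r}")
    return TimblRun(test_file, algorithm, metric, weighting, int(k_part[1:]))


def accuracy(lines: Iterable[str]) -> float:
    """Fraction of output lines whose predicted class matches the gold class.

    Assumption: fields are separated by spaces, the second-to-last field
    is the gold class and the last field is the predicted class.
    """
    total = 0
    correct = 0
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        total += 1
        if fields[-1] == fields[-2]:
            correct += 1
    return correct / total if total else 0.0


if __name__ == "__main__":
    print(decode_output_name("data.test.IB1.O.gr.k1.out"))
    # Toy output lines: two features, gold class, predicted class.
    print(accuracy(["a b X X", "a c Y X", "d e Y Y"]))
```

Splitting from the right keeps a test file name such as data.test intact even though it contains a dot itself.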
The -k 3 command above, for instance, repeats the previous experiment, but this time three nearest neighbors are used for extrapolation. The default value is 1, so any other value must be specified explicitly. An option that is very important for us, and which we use further in our work, is the +v n (verbosity) option. It allows us to output the nearest neighbors on which decisions are based. Daelemans et al. (1999) comprehensively describe all options and the value ranges that can be chosen for them.
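To make the role of the number of nearest neighbors more concrete, the following Python sketch classifies a symbolic feature vector by majority vote among its k nearest stored examples. It is a deliberately simplified illustration and not TiMBL's implementation: it uses a plain (unweighted) overlap distance, whereas the default setting described above weights features by gain ratio, and all names and toy data in it are hypothetical. The function also returns the neighbors it used, in the spirit of what the +v n option makes visible.

```python
from collections import Counter
from typing import List, Sequence, Tuple

# One stored example: (feature values, class label).
Instance = Tuple[Sequence[str], str]


def overlap_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Count the positions at which two symbolic feature vectors differ."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return sum(x != y for x, y in zip(a, b))


def knn_classify(
    memory: List[Instance], query: Sequence[str], k: int = 1
) -> Tuple[str, List[Instance]]:
    """Return the majority label of the k nearest examples and those examples.

    Simplification: unweighted overlap and the k nearest instances; ties
    in distance are broken by the order of the examples in memory.
    """
    if k < 1 or not memory:
        raise ValueError("need k >= 1 and a non-empty memory")
    ranked = sorted(memory, key=lambda inst: overlap_distance(inst[0], query))
    neighbors = ranked[:k]
    votes = Counter(label for _, label in neighbors)
    label = votes.most_common(1)[0][0]
    return label, neighbors


if __name__ == "__main__":
    # Toy memory: three context words per example, sense label as class.
    memory = [
        (("the", "bank", "of"), "river"),
        (("the", "bank", "loan"), "finance"),
        (("a", "bank", "loan"), "finance"),
        (("river", "bank", "of"), "river"),
    ]
    query = ("the", "bank", "of")
    for k in (1, 3):
        label, neighbors = knn_classify(memory, query, k=k)
        print(k, label, neighbors)
```

With k = 1 the decision rests on a single stored example, while larger values let several similar examples vote, which is why the value has to be chosen per experiment.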