A Wordnet from the Ground Up

A Wordnet from the Ground Up - School of Information Technology ...


Chapter 4. Extracting Relation Instances

for Machine Learning. We manually annotated randomly selected pairs of LUs which occurred on MSR lists (a, 20) for the LUs described by the MSR. From this selection, 1159 pairs classified as not relevant were collected into a set E. In some experiments, we added E to NH, see below.

We experimented with two training sets produced by combining our regular data sets. Test sets were excluded randomly from training sets during tenfold cross-validation. Training sets are named in Table 4.4 according to the following description scheme:

KH_1 + ... + KH_n, NH_1 + ... + NH_m

i.e. first the sets comprising KH are listed, next the sets from NH. The training set H+P2,P3+R includes only pairs extracted from plWordNet. It consists of 5027 KH pairs (H+P2) and 56531 NH pairs (P3+R). Tests on this set were done only on data already present in plWordNet. It is also more difficult than the sets used in (Snow et al., 2005), because the classifier is expected to distinguish between close hypernyms and more indirect hypernymic ancestors (P3 included in NH).

Because plWordNet (the June 2008 version) was still small, the second training set was extended with the set E of manually classified pairs. We added only negative pairs, assuming that positive examples are well represented by pairs from plWordNet, while more difficult negative examples are hidden in the huge number of negative examples automatically extracted from plWordNet.
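The composition scheme above can be sketched in code. This is a minimal illustration, not the authors' implementation: the function name, the toy LU pairs, and the `neg_ratio` parameter (which mimics the negative-example subsampling discussed later in the section) are all hypothetical.

```python
import random

def build_training_set(kh_sets, nh_sets, neg_ratio=None, seed=0):
    """Combine named KH (positive) and NH (negative) pair sets into one
    labelled training set, returning its scheme name "KH_1+...,NH_1+...".
    If neg_ratio is given, NH is randomly subsampled to len(KH) * neg_ratio,
    mimicking the subsampling experiments described in the text."""
    # Set names joined in insertion order, KH sets before NH sets.
    name = "+".join(kh_sets) + "," + "+".join(nh_sets)
    pos = [(pair, "KH") for pairs in kh_sets.values() for pair in pairs]
    neg = [(pair, "NH") for pairs in nh_sets.values() for pair in pairs]
    if neg_ratio is not None:
        random.Random(seed).shuffle(neg)
        neg = neg[: int(len(pos) * neg_ratio)]
    return name, pos + neg

# Toy usage with made-up LU pairs; the real H+P2,P3+R set has
# 5027 KH and 56531 NH pairs (a ratio of roughly 1:10).
name, data = build_training_set(
    {"H": [("pies", "zwierzę")], "P2": [("kot", "ssak")]},
    {"P3": [("kot", "pies")], "R": [("dom", "okno")]},
)
# name is "H+P2,P3+R"; data holds 2 KH and 2 NH labelled pairs.
```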
The second training set consists of 5027 KH (H+P2) and 57690 NH (P3+R+E) pairs.

In the experiments, we used Naïve Bayes (Mitchell, 1997) and two types of decision trees, C4.5 (Quinlan, 1986) and the Logistic Model Tree [LMT] (Landwehr et al., 2003), all in the versions implemented in the Weka system (Witten and Frank, 2005). Naïve Bayes classifiers are probabilistic, C4.5 is rule-based, and LMT combines the rule-based structure of a decision tree with logistic regression in the leaves. In order to facilitate a comparison of classifiers, we performed all experiments on the same training-test data set. Because we selected C4.5 as our primary classifier, and we generated examples from the same corpus (so the frequencies occurring as values of some attributes could be compared directly), we did not introduce any data normalisation or discretisation. The range of data variety was also limited by the corpus used. Applying the same data to the training of a Naïve Bayes classifier biased it towards more memory-based behaviour; given the clear distinctions in the main group of the applied data sets, however, the achieved result was positive, see Table 4.4.

All experiments were run in the Weka environment (Witten and Frank, 2005). In each case, we applied tenfold cross-validation; the average results appear in Table 4.4. Because some classifiers, for example C4.5, are known to be sensitive to a biased proportion of training examples for the different classes (here, only two), we also tested random subsampling of the negative examples (NH) in the training data.
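The evaluation procedure can be sketched as follows. This is a minimal stand-in for Weka's tenfold cross-validation, not the actual experimental code: a majority-class baseline replaces the real classifiers, and all names and the toy data are hypothetical. It also shows why the 1:10 class ratio matters: always predicting the negative class already scores about 0.91.

```python
import random
from collections import Counter

def tenfold_cv(examples, train_and_predict, folds=10, seed=1):
    """Shuffle the examples, split them into `folds` held-out folds,
    and return the accuracy averaged over the folds."""
    rng = random.Random(seed)
    data = examples[:]
    rng.shuffle(data)
    accuracies = []
    for i in range(folds):
        test = data[i::folds]  # every folds-th example is held out
        train = [e for j, e in enumerate(data) if j % folds != i]
        predict = train_and_predict(train)
        correct = sum(predict(x) == y for x, y in test)
        accuracies.append(correct / len(test))
    return sum(accuracies) / folds

def majority_baseline(train):
    """Trivial classifier: always predict the majority training label."""
    majority = Counter(y for _, y in train).most_common(1)[0][0]
    return lambda x: majority

# Toy data with a 1:10 positive-to-negative ratio, as in KH:NH.
examples = [((i,), 1) for i in range(10)] + [((i,), 0) for i in range(100)]
acc = tenfold_cv(examples, majority_baseline)  # about 0.91
```

With such skewed classes, accuracy alone is uninformative, which motivates both the subsampling experiments and reporting per-class results.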
The ratio KH:NH in <strong>the</strong> original sets is around 1:10. In some experiments <strong>the</strong>
