In practice there is usually only a single dataset of limited size. What can be done? We could split the data into (perhaps 10) subsets and perform a cross-validation on each. However, the overall result will only tell us whether a learning scheme is preferable for that particular size—perhaps one-tenth of the original dataset. Alternatively, the original dataset could be reused—for example, with different randomizations of the dataset for each cross-validation.² However, the resulting cross-validation estimates will not be independent because they are not based on independent datasets. In practice, this means that a difference may be judged to be significant when in fact it is not. In fact, just increasing the number of samples k, that is, the number of cross-validation runs, will eventually yield an apparently significant difference because the value of the t-statistic increases without bound.

Various modifications of the standard t-test have been proposed to circumvent this problem, all of them heuristic and lacking sound theoretical justification. One that appears to work well in practice is the corrected resampled t-test. Assume for the moment that the repeated holdout method is used instead of cross-validation, repeated k times on different random splits of the same dataset to obtain accuracy estimates for two learning methods. Each time, n_1 instances are used for training and n_2 for testing, and differences d_i are computed from performance on the test data. The corrected resampled t-test uses the modified statistic

t = \frac{\bar{d}}{\sqrt{\left(\frac{1}{k} + \frac{n_2}{n_1}\right)\sigma_d^2}}

in exactly the same way as the standard t-statistic. A closer look at the formula shows that its value cannot be increased simply by increasing k. The same modified statistic can be used with repeated cross-validation, which is just a special case of repeated holdout in which the individual test sets for one cross-validation do not overlap. For 10-fold cross-validation repeated 10 times, k = 100, n_2/n_1 = 0.1/0.9, and σ_d² is based on 100 differences; the computation is illustrated in the code sketch below.

² The method was advocated in the first edition of this book.

5.6 Predicting probabilities

Throughout this section we have tacitly assumed that the goal is to maximize the success rate of the predictions. The outcome for each test instance is either correct, if the prediction agrees with the actual value for that instance, or incorrect, if it does not. There are no grays: everything is black or white, correct or incorrect.
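To make the corrected resampled t-statistic concrete, here is a minimal Python sketch. The function name corrected_resampled_t and its argument layout are illustrative assumptions, not something the book prescribes; only the formula itself comes from the text above.

    import math

    def corrected_resampled_t(diffs, n1, n2):
        """Corrected resampled t-statistic for the per-run accuracy
        differences d_i, with n1 training and n2 test instances per run."""
        k = len(diffs)
        d_bar = sum(diffs) / k  # mean difference between the two schemes
        # Sample variance of the differences (sigma_d^2).
        var_d = sum((d - d_bar) ** 2 for d in diffs) / (k - 1)
        # The n2/n1 term corrects for the overlap between training sets,
        # so the statistic cannot be inflated simply by increasing k.
        return d_bar / math.sqrt((1.0 / k + n2 / n1) * var_d)

    # Example: 10-fold cross-validation repeated 10 times on 1000 instances,
    # so k = 100, n1 = 900, n2 = 100 (diffs is a hypothetical list of 100
    # accuracy differences collected from those runs):
    #   t = corrected_resampled_t(diffs, n1=900, n2=100)

As with the standard t-test, the resulting value is compared against a Student's t distribution with k - 1 degrees of freedom to decide whether the observed difference is significant.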
