
90% used in 10-fold cross-validation. To compensate for this, we combine the test-set error rate with the resubstitution error on the instances in the training set. The resubstitution figure, as we warned earlier, gives a very optimistic estimate of the true error and should certainly not be used as an error figure on its own. But the bootstrap procedure combines it with the test error rate to give a final estimate e as follows:

    e = 0.632 × e_test-instances + 0.368 × e_training-instances

Then, the whole bootstrap procedure is repeated several times, with different replacement samples for the training set, and the results averaged (the first sketch at the end of this excerpt spells the procedure out in code).

The bootstrap procedure may be the best way of estimating error for very small datasets. However, like leave-one-out cross-validation, it has disadvantages that can be illustrated by considering a special, artificial situation. In fact, the very dataset we considered previously will do: a completely random dataset with two classes. The true error rate is 50% for any prediction rule. But a scheme that memorized the training set would give a perfect resubstitution score of 100%, so that e_training-instances = 0, and the 0.632 bootstrap will mix this in with a weight of 0.368 to give an overall error rate of only 31.6% (0.632 × 50% + 0.368 × 0%), which is misleadingly optimistic; the second sketch below reproduces this effect.

5.5 Comparing data mining methods

We often need to compare two different learning methods on the same problem to see which is the better one to use. It seems simple: estimate the error using cross-validation (or any other suitable estimation procedure), perhaps repeated several times, and choose the scheme whose estimate is smaller (the third sketch below illustrates this routine). This is quite sufficient in many practical applications: if one method has a lower estimated error than another on a particular dataset, the best we can do is to use the former method's model. However, it may be that the difference is simply caused by estimation error, and in some circumstances it is important to determine whether one scheme is really better than another on a particular problem. This is a standard challenge for machine learning researchers. If a new learning algorithm is proposed, its proponents must show that it improves on the state of the art for the problem at hand and demonstrate that the observed improvement is not just a chance effect in the estimation process.

This is a job for a statistical test that gives confidence bounds, the kind we met previously when trying to predict true performance from a given test-set error rate (the final sketch below recalls these bounds). If there were unlimited data, we could use a large amount for training and evaluate performance on a large independent test set, obtaining confidence bounds just as before. However, if the difference turns out to be significant we must ensure that this is not just because of the particular dataset we
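To make the 0.632 bootstrap concrete, here is a minimal sketch in Python. It is not the book's own code: it assumes NumPy plus a caller-supplied factory function (the hypothetical make_model) that builds a fresh scikit-learn-style classifier, and it uses the instances left out of the replacement sample as the test set.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier  # used in the demo below

def bootstrap_632_error(X, y, make_model, n_repeats=50, seed=None):
    """Average 0.632 bootstrap error estimate over n_repeats resamples."""
    rng = np.random.default_rng(seed)
    n = len(y)
    estimates = []
    for _ in range(n_repeats):
        # Sample n instances with replacement as the training set; the
        # instances never picked (about 36.8% of them) form the test set.
        idx = rng.integers(0, n, size=n)
        test_mask = np.ones(n, dtype=bool)
        test_mask[idx] = False
        if not test_mask.any():
            continue  # degenerate resample with no left-out instances
        model = make_model().fit(X[idx], y[idx])
        e_test = np.mean(model.predict(X[test_mask]) != y[test_mask])
        e_train = np.mean(model.predict(X[idx]) != y[idx])  # resubstitution
        # Combine the two error rates exactly as in the formula above.
        estimates.append(0.632 * e_test + 0.368 * e_train)
    return float(np.mean(estimates))
```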
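Using that sketch, the pathological case described above can be reproduced: on completely random two-class data, a 1-nearest-neighbour classifier memorizes its training set, so the resubstitution term is essentially zero and the estimate settles near 0.632 × 50% = 31.6% instead of the true 50%.

```python
# Random two-class data: the features carry no information about the labels,
# so the true error rate of any prediction rule is 50%.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = rng.integers(0, 2, size=200)

# 1-NN memorizes the training set (resubstitution error ~0 on continuous data).
est = bootstrap_632_error(X, y, lambda: KNeighborsClassifier(n_neighbors=1))
print(f"0.632 bootstrap estimate: {est:.3f}")  # roughly 0.316, not 0.5
```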
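The naive comparison procedure of Section 5.5, estimating each scheme's error by repeated cross-validation and preferring the scheme with the smaller estimate, might be sketched as follows. The dataset and the two learners are arbitrary placeholders, and the cross-validation helpers are scikit-learn's; whether the reported gap is real rather than estimation noise is exactly the question the statistical tests discussed next are designed to answer.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# 10-fold cross-validation, repeated 10 times with different fold splits.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=1)

for name, model in [("naive Bayes", GaussianNB()),
                    ("decision tree", DecisionTreeClassifier(random_state=1))]:
    acc = cross_val_score(model, X, y, cv=cv)  # accuracy on each of 100 folds
    print(f"{name}: mean error {1 - acc.mean():.3f} (sd {acc.std():.3f})")
```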
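Finally, the confidence bounds "we met previously" are bounds on a true success or error rate derived from an observed test-set figure. As a reminder of the idea, here is a sketch of one standard such bound for a binomial proportion; that it matches the chapter's earlier derivation exactly is an assumption, and z = 1.96 (roughly 95% confidence) is a conventional choice.

```python
from math import sqrt

def error_rate_bounds(f, n, z=1.96):
    """Wilson score interval for a true error rate, given an observed
    error rate f on n independent test instances."""
    center = f + z * z / (2 * n)
    margin = z * sqrt(f * (1 - f) / n + z * z / (4 * n * n))
    denom = 1 + z * z / n
    return (center - margin) / denom, (center + margin) / denom

# For example, 75 errors observed on 1,000 test instances:
low, high = error_rate_bounds(75 / 1000, 1000)
print(f"true error rate in [{low:.3f}, {high:.3f}] with ~95% confidence")
```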
