for this purpose. In the words of David Wolpert, the inventor of stacking, it is reasonable that "relatively global, smooth" level-1 generalizers should perform well. Simple linear models or trees with linear models at the leaves usually work well.

Stacking can also be applied to numeric prediction. In that case, the level-0 models and the level-1 model all predict numeric values. The basic mechanism remains the same; the only difference lies in the nature of the level-1 data. In the numeric case, each level-1 attribute represents the numeric prediction made by one of the level-0 models, and instead of a class value the numeric target value is attached to level-1 training instances.

Error-correcting output codes

Error-correcting output codes are a technique for improving the performance of classification algorithms in multiclass learning problems. Recall from Chapter 6 that some learning algorithms (for example, standard support vector machines) only work with two-class problems. To apply such algorithms to multiclass datasets, the dataset is decomposed into several independent two-class problems, the algorithm is run on each one, and the outputs of the resulting classifiers are combined. Error-correcting output codes are a method for making the most of this transformation. In fact, the method works so well that it is often advantageous to apply it even when the learning algorithm can handle multiclass datasets directly.

In Section 4.6 (page 123) we learned how to transform a multiclass dataset into several two-class ones. For each class, a dataset is generated containing a copy of each instance in the original data, but with a modified class value. If the instance has the class associated with the corresponding dataset it is tagged yes; otherwise no. Then classifiers are built for each of these binary datasets, classifiers that output a confidence figure with their predictions (for example, the estimated probability that the class is yes). During classification, a test instance is fed into each binary classifier, and the final class is the one associated with the classifier that predicts yes most confidently. Of course, this method is sensitive to the accuracy of the confidence figures produced by the classifiers: if some classifiers have an exaggerated opinion of their own predictions, the overall result will suffer.

Consider a multiclass problem with the four classes a, b, c, and d. The transformation can be visualized as shown in Table 7.1(a), where yes and no are mapped to 1 and 0, respectively. Each of the original class values is converted into a 4-bit code word, 1 bit per class, and the four classifiers predict the bits independently. Interpreting the classification process in terms of these code words, errors occur when the wrong binary bit receives the highest confidence.
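The one-per-class decomposition and the "most confident yes" decision rule just described can be sketched in a few lines of code. The sketch below is purely illustrative, not the Weka-based implementation used in the book: it assumes scikit-learn's LogisticRegression as a stand-in for any binary learner that reports confidence figures, and the tiny dataset and the class names a, b, c, d are hypothetical.

import numpy as np
from sklearn.linear_model import LogisticRegression

def train_one_vs_rest(X, y, classes):
    """Build one binary classifier per class: instances of that class
    are relabelled yes (1), all other instances no (0)."""
    classifiers = {}
    for c in classes:
        binary_labels = (y == c).astype(int)          # yes/no relabelling
        classifiers[c] = LogisticRegression().fit(X, binary_labels)
    return classifiers

def predict(classifiers, x):
    """Feed the test instance to every binary classifier and return the
    class whose classifier predicts yes most confidently."""
    confidences = {c: clf.predict_proba([x])[0, 1]    # P(yes) per classifier
                   for c, clf in classifiers.items()}
    return max(confidences, key=confidences.get)

# Hypothetical toy data with the four classes a, b, c, and d.
X = np.array([[0.1, 0.2], [0.9, 0.8], [0.2, 0.9], [0.8, 0.1],
              [0.15, 0.25], [0.85, 0.75], [0.25, 0.85], [0.75, 0.15]])
y = np.array(["a", "b", "c", "d", "a", "b", "c", "d"])

classifiers = train_one_vs_rest(X, y, classes=["a", "b", "c", "d"])
print(predict(classifiers, [0.12, 0.22]))             # expected to print 'a'

In terms of Table 7.1(a), each dictionary of per-class confidences corresponds to one predicted code word, and a misclassification occurs exactly when the bit belonging to the wrong class ends up with the highest confidence.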
