To compensate for this, we could create many decision trees and then ask each to predict the class value. We could take a majority vote and use that answer as our overall prediction. Random Forests work on this principle.
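As a rough sketch of that principle (the individual tree predictions below are made up purely for illustration), taking the majority vote is just a matter of counting the predicted classes:

import numpy as np

# Hypothetical class predictions from five separate decision trees for one sample
tree_predictions = np.array([1, 0, 1, 1, 0])

# The most common prediction becomes the ensemble's answer (class 1 in this case)
overall_prediction = np.bincount(tree_predictions).argmax()
print(overall_prediction)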

There are two problems with the aforementioned procedure. The first problem is that building decision trees is largely deterministic: using the same input will result in the same output each time. We only have one training dataset, which means our input (and therefore the output) will be the same if we try to build multiple trees. We can address this by choosing a random subsample of our dataset, effectively creating new training sets. This process is called bagging.
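A minimal NumPy sketch of this resampling step (the dataset here is randomly generated purely for illustration; bagging normally samples with replacement):

import numpy as np

# Randomly generated dataset standing in for a real one
rng = np.random.RandomState(14)
X = rng.random_sample((100, 5))
y = rng.randint(2, size=100)

# Draw a bootstrap sample: the same number of rows, chosen with replacement,
# so each tree sees a slightly different training set
indices = rng.choice(len(X), size=len(X), replace=True)
X_bagged, y_bagged = X[indices], y[indices]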

The second problem is that the features that are used for the first few decision nodes in our tree will be quite good. Even if we choose random subsamples of our training data, it is still quite possible that the decision trees built will be largely the same. To compensate for this, we also choose a random subset of the features to perform our data splits on.
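A sketch of how one split might pick its candidate features (the numbers here are hypothetical; considering roughly the square root of the feature count is a common choice):

import numpy as np

rng = np.random.RandomState(14)
n_features = 16

# Consider only a random subset of the features at this split
n_candidates = int(np.sqrt(n_features))   # 4 of the 16 features
candidate_features = rng.choice(n_features, size=n_candidates, replace=False)
print(candidate_features)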

Then, we have randomly built trees using randomly chosen samples and (nearly) randomly chosen features. This is a Random Forest and, perhaps unintuitively, this algorithm is very effective on many datasets.
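scikit-learn implements this algorithm as RandomForestClassifier in the sklearn.ensemble module. A minimal sketch of its use, on a small randomly generated dataset rather than real data, looks like this:

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Randomly generated data, purely to keep the sketch self-contained
rng = np.random.RandomState(14)
X = rng.random_sample((200, 5))
y = (X[:, 0] + X[:, 1] > 1).astype(int)

# Build 100 trees, each trained on a bootstrap sample with random feature subsets
clf = RandomForestClassifier(n_estimators=100, random_state=14)
clf.fit(X[:150], y[:150])
print(clf.predict(X[150:]))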

How do ensembles work?

The randomness inherent in Random Forests may make it seem like we are leaving the results of the algorithm up to chance. However, we apply the benefits of averaging to nearly randomly built decision trees, resulting in an algorithm that reduces the variance of the result.
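One rough way to see this effect (not a formal measure of variance) is to compare the spread of cross-validation scores for a single decision tree and a forest on the same data. The dataset below is randomly generated for illustration, so your exact numbers will differ:

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(14)
X = rng.random_sample((200, 5))
y = (X[:, 0] + X[:, 1] > 1).astype(int)

for name, clf in [("Decision tree", DecisionTreeClassifier(random_state=14)),
                  ("Random Forest", RandomForestClassifier(n_estimators=100, random_state=14))]:
    scores = cross_val_score(clf, X, y, scoring='accuracy')
    print("{0}: mean={1:.3f} std={2:.3f}".format(name, scores.mean(), scores.std()))

On many datasets the forest's scores are both higher on average and less spread out than the single tree's, although this is only an informal check.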

Variance is the error introduced by the algorithm's sensitivity to variations in the training dataset. Algorithms with a high variance (such as decision trees) can be greatly affected by variations to the training dataset, which results in models that have the problem of overfitting.

In contrast, bias is the error introduced by assumptions in the algorithm rather than anything to do with the dataset. For example, if we had an algorithm that presumed that all features would be normally distributed, then our algorithm may have a high error if the features were not. Negative impacts from bias can be reduced by analyzing the data to see whether the classifier's data model matches that of the actual data.
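For instance, a quick check of whether each feature looks plausibly normal can warn us when such an assumption is a poor fit; this is a sketch using scipy.stats.normaltest on made-up data:

import numpy as np
from scipy.stats import normaltest

# Hypothetical feature matrix; substitute your own data here
rng = np.random.RandomState(14)
X = rng.random_sample((200, 3))

# A small p-value suggests the feature is unlikely to be normally distributed,
# so an algorithm that assumes normality may be a poor match for this data
for feature_index in range(X.shape[1]):
    statistic, p_value = normaltest(X[:, feature_index])
    print("Feature {0}: p={1:.4f}".format(feature_index, p_value))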
