Learning Data Mining with Python
Predicting Sports Winners with Decision Trees

There are many algorithms for creating decision trees. Many of these algorithms are iterative: they start at the root node and decide the best feature to use for the first decision, then go to each resulting node and choose the next best feature, and so on. This process is stopped at a certain point, when it is decided that nothing more can be gained from extending the tree further.

The scikit-learn package implements the CART (Classification and Regression Trees) algorithm as its default decision tree class, which can use both categorical and continuous features.
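As a minimal sketch of how this class is used (the toy feature matrix and class labels below are made-up placeholders, not data from this chapter), scikit-learn's DecisionTreeClassifier follows the standard estimator interface:

    from sklearn.tree import DecisionTreeClassifier

    # Made-up toy data: two binary features per sample and a class label
    X = [[0, 1], [1, 1], [1, 0], [0, 0]]
    y = [1, 1, 0, 0]

    # DecisionTreeClassifier is the CART-based decision tree class
    clf = DecisionTreeClassifier()
    clf.fit(X, y)
    print(clf.predict([[1, 1]]))  # predict the class of a new sample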

Parameters in decision trees

One of the most important parameters for a decision tree is the stopping criterion. As a tree is built, the final few decisions can often be somewhat arbitrary and rely on only a small number of samples. Using such specific nodes can result in trees that significantly overfit the training data. Instead, a stopping criterion can be used to ensure that the decision tree does not reach this exactness.

Instead of using a stopping criterion, the tree could be created in full and then trimmed. This trimming process removes nodes that do not provide much information to the overall process. This is known as pruning.
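The examples in this chapter rely on stopping criteria rather than pruning, but as a rough sketch of the idea: newer releases of scikit-learn (0.22 onwards) support cost-complexity pruning through the ccp_alpha parameter, which trims back branches that contribute little information:

    from sklearn.tree import DecisionTreeClassifier

    # Made-up toy data for illustration only
    X = [[0, 1], [1, 1], [1, 0], [0, 0]]
    y = [1, 1, 0, 0]

    # A non-zero ccp_alpha prunes subtrees whose contribution falls below
    # the threshold (available in scikit-learn 0.22 and later)
    pruned = DecisionTreeClassifier(ccp_alpha=0.01)
    pruned.fit(X, y)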

The decision tree implementation in scikit-learn provides a method to stop the building of a tree using the following options:

• min_samples_split: This specifies how many samples are needed in order to create a new node in the decision tree
• min_samples_leaf: This specifies how many samples must result from a node for it to stay

The first dictates whether a decision node will be created, while the second dictates whether a decision node will be kept.
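Both parameters are passed to the classifier's constructor; the values below are arbitrary examples rather than recommendations:

    from sklearn.tree import DecisionTreeClassifier

    # Example values only: a node needs at least 20 samples before it can
    # be split, and every resulting leaf must contain at least 5 samples
    clf = DecisionTreeClassifier(min_samples_split=20, min_samples_leaf=5)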

Another parameter for decision trees is the criterion used for creating a decision. Gini impurity and Information gain are two popular ones:

• Gini impurity: This is a measure of how often a decision node would incorrectly predict a sample's class
• Information gain: This uses information-theory-based entropy to indicate how much extra information is gained by the decision node
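In scikit-learn, this is chosen with the criterion parameter, where 'gini' (the default) selects Gini impurity and 'entropy' selects information gain:

    from sklearn.tree import DecisionTreeClassifier

    # Gini impurity is the default splitting criterion
    gini_tree = DecisionTreeClassifier(criterion='gini')

    # 'entropy' uses information gain based on entropy instead
    entropy_tree = DecisionTreeClassifier(criterion='entropy')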

