10.11.2016 Views

Learning Data Mining with Python


Chapter 3

Using decision trees

We can import the DecisionTreeClassifier class and create a decision tree using scikit-learn:

from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier(random_state=14)

We used 14 for our random_state again and will do so for most of the book. Using the same random seed makes experiments reproducible. In your own experiments, however, you should vary the random state to check that the algorithm's performance is not tied to one specific value.
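One way to act on this advice is to repeat the evaluation over several seeds and look at the spread. The sketch below is only an illustration: the data is synthetic, the names X_demo, y_demo, and seed_means are hypothetical, and shuffling the cross-validation folds with the same seed is an assumption added here so that the seed has a visible effect.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-in data: two binary features and a noisy binary target.
rng = np.random.RandomState(0)
X_demo = rng.randint(0, 2, size=(200, 2))
flip = rng.rand(200) < 0.3
y_demo = np.where(flip, 1 - X_demo[:, 0], X_demo[:, 0])

seed_means = {}
for seed in [0, 1, 14, 42, 99]:
    clf_demo = DecisionTreeClassifier(random_state=seed)
    # Assumption: also shuffle the folds with the same seed.
    folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    scores = cross_val_score(clf_demo, X_demo, y_demo,
                             scoring="accuracy", cv=folds)
    seed_means[seed] = float(np.mean(scores))

# A small spread suggests the result does not hinge on one seed.
spread = max(seed_means.values()) - min(seed_means.values())
for seed, mean in seed_means.items():
    print("seed {0:>2}: {1:.1f}%".format(seed, mean * 100))
print("Spread across seeds: {0:.1f} points".format(spread * 100))
```

If the spread is large compared with the improvement you are trying to measure, the improvement may be an artifact of the seed.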

We now need to extract the dataset from our pandas data frame in order to use it with our scikit-learn classifier. We do this by selecting the columns we wish to use and reading the values attribute of the resulting view of the data frame, which gives us a NumPy array. The following code creates a dataset using our last win values for both the home team and the visitor team:

X_previouswins = dataset[["HomeLastWin", "VisitorLastWin"]].values
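To make the shape of the result concrete, here is a minimal sketch on a hypothetical four-row frame. Only the two feature column names come from the text; the toy data, the HomeWin column, and the variable names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real dataset.
toy = pd.DataFrame({
    "HomeLastWin": [True, False, True, False],
    "VisitorLastWin": [False, False, True, True],
    "HomeWin": [True, False, False, True],  # hypothetical target column
})

# Selecting a list of columns yields a data frame; .values is a 2-D array.
X_toy = toy[["HomeLastWin", "VisitorLastWin"]].values
# Selecting a single column yields a series; .values is a 1-D array.
y_toy = toy["HomeWin"].values

print(type(X_toy).__name__, X_toy.shape)  # ndarray (4, 2)
print(y_toy.shape)                        # (4,)

# Newer pandas versions offer to_numpy() as an equivalent accessor.
same = np.array_equal(
    X_toy, toy[["HomeLastWin", "VisitorLastWin"]].to_numpy())
print(same)  # True
```

The rows of the array are samples and the columns are features, which is the layout scikit-learn estimators expect.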

Decision trees are estimators, as introduced in Chapter 2, Classifying with scikit-learn Estimators, and therefore have fit and predict methods. We can also use the cross_val_score function to get a score for each fold and then average them (as we did previously):

# In current scikit-learn releases, cross_val_score lives in model_selection
from sklearn.model_selection import cross_val_score
import numpy as np
scores = cross_val_score(clf, X_previouswins, y_true, scoring='accuracy')
print("Accuracy: {0:.1f}%".format(np.mean(scores) * 100))
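The two ways of using an estimator can be sketched side by side. This is a sketch on synthetic data, not the basketball dataset: X_demo, y_demo, and the 200/100 split are hypothetical, and cv=5 is passed explicitly as an assumption so that the number of folds does not depend on the installed scikit-learn version.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data: the target follows the first feature, with noise.
rng = np.random.RandomState(14)
X_demo = rng.randint(0, 2, size=(300, 2))
flip = rng.rand(300) < 0.25
y_demo = np.where(flip, 1 - X_demo[:, 0], X_demo[:, 0])

clf_demo = DecisionTreeClassifier(random_state=14)

# Estimator interface: fit learns from training rows,
# predict labels rows the model has not seen.
clf_demo.fit(X_demo[:200], y_demo[:200])
predicted = clf_demo.predict(X_demo[200:])
holdout_accuracy = np.mean(predicted == y_demo[200:])
print("Hold-out accuracy: {0:.1f}%".format(holdout_accuracy * 100))

# cross_val_score repeats fit and score on several splits
# and returns one score per fold.
fold_scores = cross_val_score(clf_demo, X_demo, y_demo,
                              scoring="accuracy", cv=5)
print("Accuracy: {0:.1f}%".format(np.mean(fold_scores) * 100))
```

A single hold-out split gives one estimate, while the cross-validated mean averages over several splits and is usually the steadier of the two.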

This scores 56.1 percent, which is better than choosing randomly, but we should be able to do better. Feature engineering is one of the most difficult tasks in data mining, and choosing good features is key to getting good outcomes, more so than choosing the right algorithm!

Sports outcome prediction<br />

We may be able to do better by trying other features. We already have a way to test how accurate our models are: the cross_val_score function lets us evaluate each new set of features in the same way and compare the results.
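A feature comparison of this kind might look like the sketch below. Everything in it is hypothetical apart from the two last-win column names: the mean_accuracy helper, the synthetic frame, the ExtraFeature and Target columns, and the choice of cv=5 are assumptions made purely to show the pattern.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier


def mean_accuracy(frame, feature_columns, target_column, seed=14):
    """Hypothetical helper: cross-validated accuracy for one feature set."""
    X = frame[feature_columns].values
    y = frame[target_column].values
    clf = DecisionTreeClassifier(random_state=seed)
    scores = cross_val_score(clf, X, y, scoring="accuracy", cv=5)
    return float(np.mean(scores))


# Synthetic frame: the target depends on ExtraFeature, with 20% noise.
rng = np.random.RandomState(14)
n = 400
demo = pd.DataFrame({
    "HomeLastWin": rng.randint(0, 2, size=n),
    "VisitorLastWin": rng.randint(0, 2, size=n),
    "ExtraFeature": rng.randint(0, 2, size=n),
})
noise = rng.rand(n) < 0.2
demo["Target"] = np.where(noise, 1 - demo["ExtraFeature"],
                          demo["ExtraFeature"])

candidates = {
    "last wins only": ["HomeLastWin", "VisitorLastWin"],
    "last wins + extra": ["HomeLastWin", "VisitorLastWin", "ExtraFeature"],
}
results = {name: mean_accuracy(demo, cols, "Target")
           for name, cols in candidates.items()}
for name, acc in results.items():
    print("{0}: {1:.1f}%".format(name, acc * 100))
```

Because every feature set goes through the same classifier, seed, and folds, any difference in the printed accuracies can be attributed to the features themselves.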
