10.11.2016 Views

Learning Data Mining with Python


Chapter 3

Now that we have our dataset, we can compute a baseline. A baseline is the accuracy that a simple, easy-to-implement strategy achieves. Any data mining solution should beat this.

In each match, we have two teams: a home team and a visitor team. An obvious baseline, called the chance rate, is 50 percent. Choosing randomly will (over time) result in an accuracy of 50 percent.
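The chance rate can be checked with a quick simulation. The following is only a sketch and does not use the chapter's dataset: the match outcomes, the 58 percent home-win rate, and the variable names are all invented for illustration. It generates random guesses and measures how often they agree with the outcomes.

```python
import numpy as np

rng = np.random.default_rng(seed=14)  # seed chosen arbitrarily for repeatability

n_matches = 100_000
# Hypothetical outcomes: True if the home team won. The 58% home-win rate is
# an invented figure, used only to show that it does not change the result.
outcomes = rng.random(n_matches) < 0.58
# Random guessing: pick the home or visitor team with equal probability.
guesses = rng.random(n_matches) < 0.5

# Fraction of matches where the random guess matched the outcome.
chance_accuracy = np.mean(guesses == outcomes)
print(f"Accuracy of random guessing: {chance_accuracy:.3f}")
```

Whatever the true home-win rate is, a coin-flip guess is right about half the time, which is why 50 percent is the figure to beat.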

Extracting new features

We can now extract our features from this dataset by combining and comparing the existing data. First up, we need to specify our class value, which will give our classification algorithm something to compare against to see if its prediction is correct or not. This could be encoded in a number of ways; however, for this application, we will specify our class as 1 if the home team wins and 0 if the visitor team wins. In basketball, the team with the most points wins. So, while the dataset doesn't specify who wins, we can compute it easily.

We can compute the class value with the following:

dataset["HomeWin"] = dataset["VisitorPts"] < dataset["HomePts"]

We then copy those values into a NumPy array to use later for our scikit-learn classifiers. There is not currently a clean integration between pandas and scikit-learn, but they work nicely together through the use of NumPy arrays. While we will use pandas to extract features, we will need to extract the values to use them with scikit-learn:

y_true = dataset["HomeWin"].values

The preceding array now holds our class values in a format that scikit-learn can read.<br />
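To see both steps on something small, here is a minimal sketch that uses a made-up three-match table in place of the real dataset. Only the column names VisitorPts, HomePts, and HomeWin come from the text; the scores are hypothetical.

```python
import pandas as pd

# Hypothetical scores for three matches; only the column names are taken
# from the chapter.
dataset = pd.DataFrame({
    "VisitorPts": [98, 110, 87],
    "HomePts": [104, 101, 95],
})

# True where the home team scored more points than the visitor.
dataset["HomeWin"] = dataset["VisitorPts"] < dataset["HomePts"]

# Extract the class values as a NumPy array for scikit-learn.
y_true = dataset["HomeWin"].values
print(y_true)
```

In this toy table the home team wins the first and third matches, so y_true holds True, False, True. The values are Booleans, which scikit-learn treats as the two classes in the same way as 1 and 0.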

We can also start creating some features to use in our data mining. While sometimes we just throw the raw data into our classifier, we often need to derive continuous numerical or categorical features.

The first two features we want to create to help us predict which team will win are whether each of the two teams won its last game. This would roughly approximate which team is playing well.

We will compute this feature by iterating through the rows in order and recording which team won. When we get to a new row, we look up whether each team won the last time we saw it.
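The row-by-row procedure just described might look something like the following sketch. It runs on a hypothetical four-match table, and the team-name columns ("Home Team", "Visitor Team"), the feature names (HomeLastWin, VisitorLastWin), and the won_last dictionary are assumptions made for this example, not names confirmed by the text above. A team that has not been seen before is treated as not having won its last game.

```python
from collections import defaultdict

import pandas as pd

# Hypothetical matches, already sorted in date order. The team-name columns
# and the feature names below are assumptions made for this sketch.
dataset = pd.DataFrame({
    "Visitor Team": ["A", "C", "B", "A"],
    "VisitorPts": [90, 101, 99, 88],
    "Home Team": ["B", "A", "C", "C"],
    "HomePts": [95, 97, 92, 91],
})
dataset["HomeWin"] = dataset["VisitorPts"] < dataset["HomePts"]

# Maps team name -> whether that team won its most recent game.
# Teams we have not seen yet default to False.
won_last = defaultdict(bool)
home_last_win = []
visitor_last_win = []

for _, row in dataset.iterrows():
    home, visitor = row["Home Team"], row["Visitor Team"]
    # Look up each team's previous result before recording this game.
    home_last_win.append(won_last[home])
    visitor_last_win.append(won_last[visitor])
    # Record this game's result for the next time we see these teams.
    home_won = bool(row["HomeWin"])
    won_last[home] = home_won
    won_last[visitor] = not home_won

dataset["HomeLastWin"] = home_last_win
dataset["VisitorLastWin"] = visitor_last_win
print(dataset[["Home Team", "Visitor Team", "HomeLastWin", "VisitorLastWin"]])
```

The lookup happens before the update on purpose: the feature for a match must only reflect games played earlier, otherwise the result being predicted would leak into its own features.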
