10.11.2016 Views

Learning Data Mining with Python


Predicting Sports Winners with Decision Trees

Research into predicting winners suggests that there may be an upper limit to the accuracy of sports outcome prediction which, depending on the sport, lies between 70 and 80 percent. A significant amount of research is being carried out on sports prediction, often using data mining or statistics-based methods.

Collecting the data

The data we will be using is the match history for the 2013-2014 NBA season. The website http://Basketball-Reference.com contains a significant number of resources and statistics collected from the NBA and other leagues. To download the dataset, perform the following steps:

1. Navigate to http://www.basketball-reference.com/leagues/NBA_2014_games.html in your web browser.
2. Click on the Export button next to the Regular Season heading.
3. Download the file to your data folder and make a note of the path.

This will download a CSV (short for Comma Separated Values) file containing the results of the 1,230 games in the NBA regular season.
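The figure of 1,230 can be sanity-checked with simple arithmetic. The sketch below assumes the league's usual format of 30 teams each playing 82 regular-season games; neither number is stated in the text above, so treat both as assumptions. Because every game involves two teams, the number of games is half the total number of team appearances.

```python
# Assumption (not stated in the text): 30 teams, 82 regular-season games each.
N_TEAMS = 30
GAMES_PER_TEAM = 82


def expected_game_count(n_teams, games_per_team):
    """Each game is shared by two teams, so halve the total team appearances."""
    total_team_appearances = n_teams * games_per_team
    return total_team_appearances // 2


expected_games = expected_game_count(N_TEAMS, GAMES_PER_TEAM)
print(expected_games)
```

Once the file is loaded, comparing its row count against this number is a quick way to spot a truncated or duplicated download.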

CSV files are simply text files in which each line is a row and the values within a row are separated by commas (hence the name). They can be created manually by typing into a text editor and saving with a .csv extension. They can be opened in any program that can read text files, and also in Excel as a spreadsheet.
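To make the format concrete, here is a minimal sketch that writes a few rows as CSV text and reads them back with Python's built-in csv module. The column names and values are hypothetical placeholders chosen for illustration; they are not taken from the NBA dataset.

```python
import csv
import io

# Hypothetical rows for illustration only; not real game results.
rows = [
    ["home_team", "visitor_team", "home_points", "visitor_points"],
    ["Team A", "Team B", "101", "97"],
    ["Team C", "Team A", "88", "92"],
]

# Write the rows as CSV text: one line per row, commas between values.
buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)

# Read the same text back. Every value comes back as a string.
parsed = list(csv.reader(io.StringIO(csv_text)))
```

An in-memory buffer is used here so the example does not touch the file system; passing an open file object instead works the same way.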

We will load the file with the pandas (short for Python Data Analysis) library, which is an incredibly useful library for manipulating data. Python also includes a built-in module called csv that supports reading and writing CSV files. However, we will use pandas, which provides more powerful functions that we will use later in the chapter for creating new features.
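As a rough illustration of why pandas is convenient here, the following sketch loads a small CSV sample and derives a new column from it. The sample data and the column names (including home_win) are hypothetical stand-ins, not the layout of the downloaded file, and the snippet assumes pandas is already installed.

```python
import io

import pandas as pd

# Hypothetical sample standing in for the downloaded file; the real
# export has different columns.
sample_csv = """home_team,visitor_team,home_points,visitor_points
Team A,Team B,101,97
Team C,Team A,88,92
"""

# pd.read_csv accepts a file path or a file-like object.
results = pd.read_csv(io.StringIO(sample_csv))

# pandas infers numeric column types on load, so a derived column
# (a new feature) can be computed with a single comparison.
results["home_win"] = results["home_points"] > results["visitor_points"]
print(results)
```

With the csv module, the point values would arrive as strings and would need converting before any comparison, which is part of what makes pandas the more comfortable choice for feature creation.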

For this chapter, you will need to install pandas. The easiest way to install it is to use pip3, as you did in Chapter 1, Getting Started with Data Mining, to install scikit-learn:

$ pip3 install pandas

If you have difficulty installing pandas, head to the project website at http://pandas.pydata.org/getpandas.html and read the installation instructions for your system.
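One way to confirm that the installation worked is to ask Python whether the package can be found. This is only a sketch; the helper name is_installed is made up for this example and is not part of pandas or pip.

```python
import importlib.util


def is_installed(package_name):
    """Return True if the named top-level package can be found for import."""
    return importlib.util.find_spec(package_name) is not None


if is_installed("pandas"):
    import pandas

    print("pandas version:", pandas.__version__)
else:
    print("pandas not found; try: pip3 install pandas")
```

If the check fails even after a successful pip3 run, a common cause is that pip3 and the interpreter you are running belong to different Python installations.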

