10.11.2016 Views

Learning Data Mining with Python


Classifying with scikit-learn Estimators

For each row in the dataset, there are 35 values. The first 34 are measurements taken from the 17 antennas (two values for each antenna). The last is either 'g' or 'b', standing for good and bad, respectively.
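As a minimal sketch of this row layout, the snippet below splits one row into its 34 measurements and its label. The row itself is made up for illustration and is not taken from the dataset, and the name is_good is our own.

```python
# A made-up row in the layout described above: 34 numeric strings, then a label.
row = ["1", "0"] + ["0.5"] * 32 + ["g"]

features = [float(value) for value in row[:-1]]  # the first 34 values
label = row[-1]                                  # 'g' (good) or 'b' (bad)
is_good = label == "g"

print(len(row), len(features), is_good)  # prints: 35 34 True
```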

Start the IPython Notebook server and create a new notebook called Ionosphere Nearest Neighbors for this chapter.

First, we load the NumPy and csv libraries that we will need for our code, along with os, which we use to build the path to the dataset:

import os
import csv
import numpy as np

To load the dataset, we first build its filename. The file is stored in the Ionosphere folder inside your data folder (data_folder):

data_filename = os.path.join(data_folder, "Ionosphere", "ionosphere.data")

We then create the X and y NumPy arrays to store the dataset in. The sizes of these arrays are known from the dataset. Don't worry if you don't know the size of future datasets; we will use other methods to load the dataset in future chapters and you won't need to know this size beforehand:

X = np.zeros((351, 34), dtype='float')
y = np.zeros((351,), dtype='bool')
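To see what these two calls produce, the short sketch below creates the same arrays and prints their shapes and types. The names n_samples and n_features are introduced here only for readability.

```python
import numpy as np

n_samples, n_features = 351, 34  # sizes given in the text above

X = np.zeros((n_samples, n_features), dtype='float')  # one row per sample
y = np.zeros((n_samples,), dtype='bool')              # one label per sample

print(X.shape, X.dtype)  # prints: (351, 34) float64
print(y.shape, y.dtype)  # prints: (351,) bool
```

Both arrays start out filled with zeros (False for y) and are overwritten row by row as the file is read.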

The dataset is in a Comma-Separated Values (CSV) format, which is a commonly used format for datasets. We are going to use the csv module, which we imported earlier, to load this file. Open the file and set up a csv reader object:

with open(data_filename, 'r') as input_file:
    reader = csv.reader(input_file)
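If you want to see what a csv reader yields before pointing it at the real file, the sketch below runs one over an in-memory stand-in. The two-line text is made up and is much shorter than a real row. Note that every value comes back as a string, which is why the values need converting to floats.

```python
import csv
import io

# Made-up CSV text standing in for an opened file (not real dataset rows).
fake_file = io.StringIO("0.5,-1,g\n0.25,0,b\n")

reader = csv.reader(fake_file)
rows = list(reader)  # each row is a list of strings

print(rows[0])  # prints: ['0.5', '-1', 'g']
```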

Next, we loop over the lines in the file. Each line represents a new set of measurements, which is a sample in this dataset. We use the enumerate function to get the line's index as well, so we can update the appropriate sample in the dataset (X):

    for i, row in enumerate(reader):

We take the first 34 values from this sample, turn each into a float, and save that to our dataset:

        data = [float(datum) for datum in row[:-1]]
        X[i] = data
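Putting the steps so far together, one possible sketch of the whole loading loop is shown below. The function name load_rows and the tiny in-memory sample are hypothetical, and the line that fills y is our assumption, based only on 'g' meaning good; it is not shown in the text above.

```python
import csv
import io

import numpy as np


def load_rows(input_file, n_samples, n_features):
    """Read CSV rows of n_features numbers followed by a 'g'/'b' label."""
    X = np.zeros((n_samples, n_features), dtype='float')
    y = np.zeros((n_samples,), dtype='bool')
    reader = csv.reader(input_file)
    for i, row in enumerate(reader):
        # Every value except the last is a measurement.
        X[i] = [float(datum) for datum in row[:-1]]
        # Assumption: store True for 'g' (good) and False for 'b' (bad).
        y[i] = row[-1] == 'g'
    return X, y


# Made-up data with 3 features per row instead of 34, to keep the example short.
sample = io.StringIO("1,0,0.5,g\n0,1,-0.25,b\n")
X, y = load_rows(sample, n_samples=2, n_features=3)
print(X)
print(y)
```

For the real file, the same function could be called with the opened input_file, n_samples=351, and n_features=34.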

