
Learning How to Classify with Real-world Examples

Nearest neighbor classification

With this dataset, even if we just try to separate two classes using the previous method, we do not get very good results. Let us therefore introduce a new classifier: the nearest neighbor classifier.

If we consider that each example is represented by its features (in mathematical terms, as a point in N-dimensional space), we can compute the distance between examples. We can choose different ways of computing the distance, for example:

import numpy as np

def distance(p0, p1):
    'Computes squared Euclidean distance'
    return np.sum((p0 - p1)**2)
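This is just one option. As a sketch of an alternative (the manhattan_distance name is ours, for illustration), the Manhattan (L1) distance can be written in the same style:

def manhattan_distance(p0, p1):
    'Computes Manhattan (L1) distance'
    return np.sum(np.abs(p0 - p1))

Note that for finding the nearest neighbor, the squared Euclidean distance ranks points in the same order as the true Euclidean distance, so the square root can be skipped.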

Now, when classifying, we adopt a simple rule: given a new example, we look in the dataset for the point that is closest to it (its nearest neighbor) and take its label:

def nn_classify(training_set, training_labels, new_example):
    # Distance from the new example to every training point
    dists = np.array([distance(t, new_example)
                      for t in training_set])
    # Index of the closest training point (the nearest neighbor)
    nearest = dists.argmin()
    return training_labels[nearest]
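As a minimal usage sketch (the feature values and labels here are invented for illustration, not taken from the dataset):

features = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
labels = np.array([0, 0, 1])
print(nn_classify(features, labels, np.array([4.5, 5.5])))
# Prints 1: the closest training point is [5.0, 6.0]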

In this case, our model involves saving all of the training data and labels and computing everything at classification time. A better implementation would be to index the training data at learning time to speed up classification, but doing so requires a more complex algorithm.
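One possible way to perform such indexing (not the approach used in this chapter) is to build a k-d tree at learning time, for example with SciPy; the IndexedNN class below is our own sketch:

from scipy.spatial import cKDTree

class IndexedNN:
    def fit(self, training_set, training_labels):
        # Build the spatial index once, at learning time
        self.tree = cKDTree(training_set)
        self.labels = training_labels

    def predict(self, new_example):
        # Query the index instead of scanning every training point
        _, nearest = self.tree.query(new_example)
        return self.labels[nearest]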

Now, note that this model performs perfectly on its training data! For each point, its closest neighbor is itself, and so its label matches perfectly (unless two examples have exactly the same features but different labels, which can happen). Therefore, it is essential to test using a cross-validation protocol.

Using ten folds for cross-validation of this dataset with this algorithm, we obtain 88 percent accuracy. As we discussed in the earlier section, the cross-validation accuracy is lower than the training accuracy, but it is a more credible estimate of the performance of the model.
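A minimal sketch of such a ten-fold protocol, assuming features and labels arrays as before (the cross-validation code used to obtain the figure above may differ):

def cross_validate_nn(features, labels, n_folds=10):
    'Estimates accuracy by holding out one fold at a time'
    # Shuffle first: many datasets are stored sorted by class
    order = np.random.RandomState(0).permutation(len(features))
    features, labels = features[order], labels[order]
    folds = np.array_split(np.arange(len(features)), n_folds)
    correct = 0
    for fold in folds:
        # Train on everything outside the held-out fold
        training = np.setdiff1d(np.arange(len(features)), fold)
        for i in fold:
            prediction = nn_classify(features[training],
                                     labels[training],
                                     features[i])
            if prediction == labels[i]:
                correct += 1
    return correct / float(len(features))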

We will now examine the decision boundary. For this, we will be forced to simplify and look at only two dimensions (just so that we can plot it on paper).
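One way such a plot could be produced (a sketch with matplotlib; the book's actual plotting code may differ) is to classify every point of a dense grid over the two chosen features:

import matplotlib.pyplot as plt

def plot_decision_boundary(features, labels, steps=100):
    # Grid covering the range of the two features
    x0 = np.linspace(features[:, 0].min(), features[:, 0].max(), steps)
    x1 = np.linspace(features[:, 1].min(), features[:, 1].max(), steps)
    X0, X1 = np.meshgrid(x0, x1)
    grid = np.c_[X0.ravel(), X1.ravel()]
    # Classify every grid point and color the resulting regions
    predictions = np.array([nn_classify(features, labels, g)
                            for g in grid]).reshape(X0.shape)
    plt.contourf(X0, X1, predictions, alpha=0.3)
    plt.scatter(features[:, 0], features[:, 1], c=labels)
    plt.show()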

