Building Machine Learning Systems with Python - Richert, Coelho
Learning How to Classify with Real-world Examples

Nearest neighbor classification
With this dataset, even if we just try to separate two classes using the previous
method, we do not get very good results. Let us, therefore, introduce a new
classifier: the nearest neighbor classifier.
If we consider that each example is represented by its features (in mathematical
terms, as a point in N-dimensional space), we can compute the distance between
examples. We can choose different ways of computing the distance, for example:
def distance(p0, p1):
    'Computes squared euclidean distance'
    return np.sum((p0 - p1)**2)
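To make the distance computation concrete, here is a quick check on two made-up feature vectors (the vectors are hypothetical, not taken from the dataset):

```python
import numpy as np

def distance(p0, p1):
    'Computes squared euclidean distance'
    return np.sum((p0 - p1)**2)

# Two hypothetical three-dimensional feature vectors:
p0 = np.array([1.0, 2.0, 3.0])
p1 = np.array([2.0, 4.0, 6.0])

# Differences are (1, 2, 3), so the squared distance is 1 + 4 + 9:
print(distance(p0, p1))  # 14.0
```

Note that we skip the final square root: for finding the *closest* point, squared distances give the same ordering as true Euclidean distances, and are cheaper to compute.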
Now, when classifying, we adopt a simple rule: given a new example, we look in
the dataset for the point that is closest to it (its nearest neighbor) and read off its label:
def nn_classify(training_set, training_labels, new_example):
    dists = np.array([distance(t, new_example)
                      for t in training_set])
    nearest = dists.argmin()
    return training_labels[nearest]
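As a sanity check, the classifier can be exercised on a tiny made-up two-dimensional dataset (the points and labels below are hypothetical, chosen only to form two well-separated clusters):

```python
import numpy as np

def distance(p0, p1):
    'Computes squared euclidean distance'
    return np.sum((p0 - p1)**2)

def nn_classify(training_set, training_labels, new_example):
    dists = np.array([distance(t, new_example)
                      for t in training_set])
    nearest = dists.argmin()
    return training_labels[nearest]

# Hypothetical training data: two clusters in 2-D
training_set = np.array([[0.0, 0.0], [0.1, 0.2],
                         [5.0, 5.0], [5.1, 4.9]])
training_labels = np.array(['a', 'a', 'b', 'b'])

# A query near the first cluster gets label 'a',
# one near the second cluster gets label 'b':
print(nn_classify(training_set, training_labels, np.array([0.2, 0.1])))  # a
print(nn_classify(training_set, training_labels, np.array([4.8, 5.2])))  # b
```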
In this case, our model consists of saving all of the training data and labels and
computing everything at classification time. A better implementation would
be to index the data at learning time to speed up classification, but that
requires a more complex algorithm.
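One way such an index could look is a space-partitioning tree. The sketch below uses SciPy's `cKDTree` for this; the class name `NNClassifier` and the use of SciPy here are our own assumptions for illustration, not something defined in the text:

```python
import numpy as np
from scipy.spatial import cKDTree  # assumed available; not used in the text

class NNClassifier:
    'Nearest neighbor classifier that indexes the training data once.'

    def fit(self, training_set, training_labels):
        # Build a k-d tree at learning time; each query then descends
        # the tree instead of scanning every training point.
        self.tree = cKDTree(training_set)
        self.labels = np.asarray(training_labels)
        return self

    def predict(self, new_example):
        # query() returns (distance, index) of the nearest neighbor
        _, nearest = self.tree.query(new_example)
        return self.labels[nearest]

# Hypothetical training data: two clusters in 2-D
training_set = np.array([[0.0, 0.0], [0.1, 0.2],
                         [5.0, 5.0], [5.1, 4.9]])
training_labels = np.array(['a', 'a', 'b', 'b'])

clf = NNClassifier().fit(training_set, training_labels)
print(clf.predict(np.array([0.2, 0.1])))  # a
```

The predictions are identical to the brute-force loop; only the lookup cost changes, which matters once the training set grows large.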
Now, note that this model performs perfectly on its training data! For each point,
its closest neighbor is itself, and so its label matches perfectly (unless two examples
have exactly the same features but different labels, which can happen). Therefore, it
is essential to test using a cross-validation protocol.
Using ten folds for cross-validation for this dataset with this algorithm, we obtain
88 percent accuracy. As we discussed in the earlier section, the cross-validation
accuracy is lower than the training accuracy, but this is a more credible estimate
of the performance of the model.
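A minimal sketch of such a cross-validation loop, using interleaved folds and the `nn_classify` function from above (the data here is a made-up toy set, so it will not reproduce the 88 percent figure):

```python
import numpy as np

def distance(p0, p1):
    'Computes squared euclidean distance'
    return np.sum((p0 - p1)**2)

def nn_classify(training_set, training_labels, new_example):
    dists = np.array([distance(t, new_example)
                      for t in training_set])
    return training_labels[dists.argmin()]

def cross_validate(features, labels, folds=10):
    'Leave-one-fold-out accuracy estimate for the nearest neighbor rule.'
    n = len(features)
    correct = 0
    for fold in range(folds):
        # Every folds-th example goes into the held-out test fold
        test_mask = np.arange(n) % folds == fold
        train_x = features[~test_mask]
        train_y = labels[~test_mask]
        for x, y in zip(features[test_mask], labels[test_mask]):
            # Classify each held-out point using only the training folds
            correct += (nn_classify(train_x, train_y, x) == y)
    return correct / n

# Hypothetical 1-D data: two tight, well-separated clusters
features = np.concatenate([np.arange(10) * 0.01,
                           5 + np.arange(10) * 0.01]).reshape(-1, 1)
labels = np.array(['a'] * 10 + ['b'] * 10)

print(cross_validate(features, labels, folds=5))  # 1.0
```

The key point is that each example is classified by a model that never saw it during training, which is what makes the resulting accuracy a credible estimate.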
We will now examine the decision boundary. For this, we will be forced to simplify
and look at only two dimensions (just so that we can plot it on paper).