Learning Data Mining with Python

Classifying with scikit-learn Estimators

Most scikit-learn estimators use NumPy arrays or a related format for input and output.

There are a large number of estimators in scikit-learn. These include support vector machines (SVM), random forests, and neural networks. Many of these algorithms will be used in later chapters. In this chapter, we will use a different estimator from scikit-learn: nearest neighbor.
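As a quick sketch of what this interface looks like in practice, every estimator is trained with fit and queried with predict on NumPy arrays. The tiny arrays below are made up purely for illustration and are not the book's dataset:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# A made-up dataset: four samples with two features each, and their classes
X_train = np.array([[0.0, 0.0],
                    [0.1, 0.2],
                    [1.0, 1.0],
                    [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])

# All scikit-learn estimators share this pattern: create, fit, then predict
estimator = KNeighborsClassifier(n_neighbors=3)
estimator.fit(X_train, y_train)

X_test = np.array([[0.2, 0.1],
                   [0.8, 0.9]])
print(estimator.predict(X_test))  # prints [0 1]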

For this chapter, you will need to install a new library called matplotlib. The easiest way to install it is to use pip3, as you did in Chapter 1, Getting Started with Data Mining, to install scikit-learn:

$ pip3 install matplotlib

If you have any difficulty installing matplotlib, see the official installation instructions at http://matplotlib.org/users/installing.html.

Nearest neighbors

Nearest neighbors is perhaps one of the most intuitive algorithms in the set of standard data mining algorithms. To predict the class of a new sample, we look through the training dataset for the samples that are most similar to our new sample. We take those most similar samples and predict the class that the majority of them have.

As an example, we wish to predict the class of the triangle, based on which class it is more similar to (represented here by having similar objects closer together). We seek the three nearest neighbors, which are two diamonds and one square. There are more diamonds than squares, and the predicted class for the triangle is, therefore, a diamond.
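To make that vote concrete, here is a minimal sketch of the three-neighbor vote in plain NumPy; the 2D coordinates are invented purely to mimic the layout described above, with two diamonds and one square closest to the triangle:

import numpy as np
from collections import Counter

# Invented coordinates: the first three points are diamonds, the last three squares
points = np.array([[1.0, 1.0], [1.2, 0.9], [3.5, 3.5],
                   [2.0, 1.5], [3.0, 3.0], [3.2, 2.8]])
labels = np.array(["diamond"] * 3 + ["square"] * 3)

triangle = np.array([1.5, 1.2])  # the new sample to classify

# Euclidean distance from the triangle to every training point
distances = np.sqrt(((points - triangle) ** 2).sum(axis=1))

# Take the three closest points and let them vote
nearest = distances.argsort()[:3]
print(labels[nearest])                                # ['diamond' 'diamond' 'square']
print(Counter(labels[nearest]).most_common(1)[0][0])  # diamond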

