
Classifying with scikit-learn Estimators

The Cosine distance is better suited to cases where some features are larger than others and when there are lots of zeros in the dataset. Intuitively, we draw a line from the origin to each of the samples, and measure the angle between those lines. This can be seen in the following diagram:

In this example, each of the grey circles is at the same distance from the white circle. In (a), the distances are Euclidean, and therefore, similar distances fit around a circle. This distance can be measured using a ruler. In (b), the distances are Manhattan, also called City Block. We compute the distance by moving across rows and columns, similar to how a Rook (Castle) in Chess moves. Finally, in (c), we have the Cosine distance, which is measured by computing the angle between the lines drawn from the origin to each sample, ignoring the actual length of those lines.
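As a quick sketch of these three metrics, the values below are computed with SciPy's distance functions (the sample vectors are made up for illustration; note that SciPy's `cosine` returns the cosine *distance*, that is, one minus the cosine of the angle):

```python
import numpy as np
from scipy.spatial.distance import euclidean, cityblock, cosine

a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 0.0, 1.0])

# Euclidean: straight-line ("ruler") distance
print(euclidean(a, b))  # sqrt(1 + 4 + 1) ≈ 2.449

# Manhattan / City Block: sum of moves across rows and columns
print(cityblock(a, b))  # 1 + 2 + 1 = 4

# Cosine distance: 1 - cos(angle between the vectors), length ignored
print(cosine(a, b))     # 1 - 2/5 = 0.6
```

Scaling either vector changes the Euclidean and Manhattan results but leaves the cosine distance untouched, which is exactly the length-insensitivity described above.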

The distance metric chosen can have a large impact on the final performance. For example, if you have many features, the Euclidean distance between random samples approaches the same value. This makes it hard to compare samples, as the distances are the same! Manhattan distance can be more stable in some circumstances, but if some features have very large values, this can overrule lots of similarities in other features. Finally, Cosine distance is a good metric for comparing items with a large number of features, but it discards some information about the length of the vector, which is useful in some circumstances.
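The claim that Euclidean distances between random samples become nearly identical in high dimensions can be checked with a small experiment (the point counts and dimensions below are arbitrary choices for illustration): the relative spread of pairwise distances shrinks as the number of features grows.

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.RandomState(42)
for n_features in (2, 10, 1000):
    # 50 random points with n_features features each
    X = rng.rand(50, n_features)
    distances = pdist(X, metric='euclidean')
    # relative spread of the pairwise distances
    spread = (distances.max() - distances.min()) / distances.mean()
    print(n_features, round(spread, 3))
```

With 1,000 features, the spread is a small fraction of what it is with 2 features: nearly every pair of points sits at almost the same distance, so a nearest-neighbour search based on Euclidean distance has little signal to work with.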

For this chapter, we will stay with Euclidean distance, and use other metrics in later chapters.
