10.11.2016 Views

Learning Data Mining with Python

You also want an ePaper? Increase the reach of your titles

YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.

Chapter 2<br />

Nearest neighbors can be used for nearly any dataset-however, it can be very<br />

computationally expensive to compute the distance between all pairs of samples.<br />

For example if there are 10 samples in the dataset, there are 45 unique distances<br />

to compute. However, if there are 1000 samples, there are nearly 500,000! Various<br />

methods exist for improving this speed dramatically; some of which are covered in<br />

the later chapters of this book.<br />

It can also do poorly in categorical-based datasets, and another algorithm should be<br />

used for these instead.<br />

Distance metrics<br />

A key underlying concept in data mining is that of distance. If we have two samples,<br />

we need to know how close they are to each other. Further more, we need to answer<br />

questions such as are these two samples more similar than the other two?<br />

Answering questions like these is important to the outcome of the case.<br />

The most common distance metric that the people are aware of is Euclidean<br />

distance, which is the real-world distance. If you were to plot the points on a graph<br />

and measure the distance <strong>with</strong> a straight ruler, the result would be the Euclidean<br />

distance. A little more formally, it is the square root of the sum of the squared<br />

distances for each feature.<br />

Euclidean distance is intuitive, but provides poor accuracy if some features have<br />

larger values than others. It also gives poor results when lots of features have a<br />

value of 0, known as a sparse matrix. There are other distance metrics in use; two<br />

commonly employed ones are the Manhattan and Cosine distance.<br />

The Manhattan distance is the sum of the absolute differences in each feature (<strong>with</strong><br />

no use of square distances). Intuitively, it can be thought of as the number of moves<br />

a rook piece (or castle) in chess would take to move between the points, if it were<br />

limited to moving one square at a time. While the Manhattan distance does suffer if<br />

some features have larger values than others, the effect is not as dramatic as in the<br />

case of Euclidean.<br />

[ 27 ]<br />

www.allitebooks.com

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!