
Looking at clustering similarity metrics

Clustering and classification methods are based on calculating the similarity or difference between two observations. If your dataset is numeric (composed of only numerical features) and can be portrayed on an n-dimensional plot, you can use various geometric metrics to measure the distances between observations in your multidimensional data.

An n-dimensional plot is a multidimensional scatter plot that you can use to plot n dimensions of data.

Some popular geometric metrics used for calculating distances between observations are simply different geometric functions that are useful for modeling distances between points:

Euclidean metric: A measure of the distance between points plotted on a Euclidean plane.

Manhattan metric: A measure of the distance between points, where distance is calculated as the sum of the absolute values of the differences between two points' Cartesian coordinates.

Minkowski distance metric: A generalization of the Euclidean and Manhattan distance metrics. Quite often, these metrics can be used interchangeably.

Cosine similarity metric: A measure of the similarity of two data points based on their orientation, as determined by taking the cosine of the angle between them.
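To make these metrics concrete, here is a minimal sketch that computes each one for a pair of sample points. It uses SciPy's scipy.spatial.distance module, which is my choice of tooling rather than something named in the text:

import numpy as np
from scipy.spatial import distance

# Two sample observations plotted in 3-dimensional space
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, 3.0])

# Euclidean metric: straight-line distance between the points
print(distance.euclidean(a, b))       # about 3.606

# Manhattan metric: sum of the absolute coordinate differences
print(distance.cityblock(a, b))       # |1-4| + |2-0| + |3-3| = 5

# Minkowski metric: generalizes both (p=2 is Euclidean, p=1 is Manhattan)
print(distance.minkowski(a, b, p=2))  # matches the Euclidean result

# Cosine similarity: SciPy returns the cosine *distance*, so the
# similarity is 1 minus that value
print(1 - distance.cosine(a, b))

Note that with p=2 the Minkowski call reproduces the Euclidean result, which is why these metrics can often be used interchangeably.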

Lastly, for non-numeric data, you can use metrics like the Jaccard distance metric, an index that compares the number of features that two observations have in common. For example, to illustrate a Jaccard distance, look at these two text strings:

Saint Louis de Ha-ha, Quebec

St-Louis de Ha!Ha!, QC

What features do these text strings have in common? And what features are different between them? The Jaccard metric generates a numerical index value that quantifies the similarity between text strings.
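As a rough sketch of the idea, the function below treats each lowercase word in a string as one of its features; that tokenization choice is an assumption on my part, since the text doesn't specify how features are extracted:

def jaccard_distance(s1, s2):
    # Treat each lowercase word as one feature of the string
    features1 = set(s1.lower().split())
    features2 = set(s2.lower().split())
    # Jaccard similarity = shared features / all features,
    # and Jaccard distance = 1 - similarity
    return 1 - len(features1 & features2) / len(features1 | features2)

print(jaccard_distance("Saint Louis de Ha-ha, Quebec",
                       "St-Louis de Ha!Ha!, QC"))  # 0.875

Under this word-level tokenization the two strings share only the word "de", so the distance comes out high (0.875); a character-level tokenization would judge the same pair far more similar.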

Identifying Clusters in Your Data

You can use many different algorithms for clustering, but the speed and robustness of the k-means algorithm make it a popular choice among experienced data scientists. As alternatives, kernel density estimation methods, hierarchical algorithms, and neighborhood algorithms are also available to help you identify clusters in your dataset.

Clustering with the k-means algorithm

The k-means clustering algorithm is a simple, fast, unsupervised learning algorithm that you can use to predict groupings within a dataset. The model makes its prediction based on the number of centroids you specify in advance, represented by k.
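To show how this works in practice, here is a minimal sketch using scikit-learn's KMeans class; the library choice and the toy dataset are my assumptions, not something prescribed by the text:

import numpy as np
from sklearn.cluster import KMeans

# Toy 2-D dataset with two visually obvious groupings
X = np.array([[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],
              [8.0, 8.2], [8.1, 7.9], [7.9, 8.0]])

# k is the number of centroids the model should look for; here we pick k=2
model = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = model.fit_predict(X)    # cluster assignment for each observation

print(labels)                    # e.g., [0 0 0 1 1 1] (label order may vary)
print(model.cluster_centers_)    # the two learned centroid coordinates

Because k-means assigns each observation to its nearest centroid, the first three points and the last three points end up in separate clusters.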
