08.06.2015 Views

Building Machine Learning Systems with Python - Richert, Coelho

Building Machine Learning Systems with Python - Richert, Coelho

Building Machine Learning Systems with Python - Richert, Coelho

SHOW MORE
SHOW LESS

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

Chapter 3<br />

Flat clustering divides the posts into a set of clusters <strong>with</strong>out relating the clusters to<br />

each other. The goal is simply to come up <strong>with</strong> a partitioning such that all posts in<br />

one cluster are most similar to each other while being dissimilar from the posts in all<br />

other clusters. Many flat clustering algorithms require the number of clusters to be<br />

specified up front.<br />

In hierarchical clustering, the number of clusters does not have to be specified.<br />

Instead, the hierarchical clustering creates a hierarchy of clusters. While similar posts<br />

are grouped into one cluster, similar clusters are again grouped into one uber-cluster.<br />

This is done recursively, until only one cluster is left, which contains everything. In<br />

this hierarchy, one can then choose the desired number of clusters. However, this<br />

comes at the cost of lower efficiency.<br />

Scikit provides a wide range of clustering approaches in the package sklearn.<br />

cluster. You can get a quick overview of the advantages and drawbacks of each<br />

of them at http://scikit-learn.org/dev/modules/clustering.html.<br />

In the following section, we will use the flat clustering method, KMeans, and play<br />

a bit <strong>with</strong> the desired number of clusters.<br />

KMeans<br />

KMeans is the most widely used flat clustering algorithm. After it is initialized <strong>with</strong><br />

the desired number of clusters, num_clusters, it maintains that number of so-called<br />

cluster centroids. Initially, it would pick any of the num_clusters posts and set the<br />

centroids to their feature vector. Then it would go through all other posts and assign<br />

them the nearest centroid as their current cluster. Then it will move each centroid<br />

into the middle of all the vectors of that particular class. This changes, of course, the<br />

cluster assignment. Some posts are now nearer to another cluster. So it will update<br />

the assignments for those changed posts. This is done as long as the centroids move<br />

a considerable amount. After some iterations, the movements will fall below a<br />

threshold and we consider clustering to be converged.<br />

Downloading the example code<br />

You can download the example code files for all Packt books you<br />

have purchased from your account at http://www.packtpub.<br />

com. If you purchased this book elsewhere, you can visit http://<br />

www.packtpub.com/support and register to have the files<br />

e-mailed directly to you.<br />

[ 63 ]

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!