Building Machine Learning Systems with Python - Richert, Coelho
Chapter 3
Flat clustering divides the posts into a set of clusters without relating the clusters to each other. The goal is simply to come up with a partitioning such that all posts in one cluster are most similar to each other while being dissimilar from the posts in all other clusters. Many flat clustering algorithms require the number of clusters to be specified up front.
In hierarchical clustering, the number of clusters does not have to be specified. Instead, hierarchical clustering creates a hierarchy of clusters. While similar posts are grouped into one cluster, similar clusters are again grouped into one uber-cluster. This is done recursively until only one cluster is left, which contains everything. From this hierarchy, one can then choose the desired number of clusters. However, this comes at the cost of lower efficiency.
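To make this concrete, here is a minimal sketch (not the book's own code) of the idea using SciPy, with made-up toy vectors standing in for real post feature vectors: the full hierarchy is built first, and only afterwards do we pick how many clusters to cut it into.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy 2-D feature vectors (invented for illustration); two obvious groups
X = np.array([[1.0, 1.0], [1.1, 0.9], [0.9, 1.2],
              [5.0, 5.0], [5.2, 4.8], [4.9, 5.1]])

# Build the full hierarchy bottom-up; 'ward' repeatedly merges the pair of
# clusters whose merge least increases total within-cluster variance
Z = linkage(X, method="ward")

# Only now choose the desired number of clusters from the hierarchy
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```

Cutting the same hierarchy at `t=3` or `t=4` needs no re-clustering, which is exactly the flexibility that flat methods lack.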
Scikit provides a wide range of clustering approaches in the sklearn.cluster package. You can get a quick overview of the advantages and drawbacks of each of them at http://scikit-learn.org/dev/modules/clustering.html.
In the following sections, we will use the flat clustering method KMeans and play a bit with the desired number of clusters.
KMeans
KMeans is the most widely used flat clustering algorithm. After it is initialized with the desired number of clusters, num_clusters, it maintains that number of so-called cluster centroids. Initially, it picks any num_clusters posts and sets the centroids to their feature vectors. Then it goes through all other posts and assigns each of them the nearest centroid as its current cluster. Next, it moves each centroid into the middle of all the vectors of that particular cluster. This, of course, changes the cluster assignments: some posts are now nearer to another centroid, so the assignments for those posts are updated. This is repeated as long as the centroids move a considerable amount. After some iterations, the movements fall below a threshold, and we consider the clustering to have converged.
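The loop just described can be sketched with scikit-learn's KMeans on toy data (the vectors and parameter values here are invented for illustration, not taken from the book's dataset):

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy 2-D feature vectors (invented); two well-separated groups
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.1, 7.9], [7.8, 8.2]])

# n_clusters plays the role of num_clusters from the text; n_init restarts
# the centroid initialization several times and keeps the best result
km = KMeans(n_clusters=2, n_init=10, random_state=0)
km.fit(X)

print(km.labels_)           # cluster assignment per post vector
print(km.cluster_centers_)  # final centroid positions
```

Internally, fit() alternates the two steps from the text (assign each vector to its nearest centroid, then move each centroid to the mean of its assigned vectors) until the centroids stop moving appreciably.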
Downloading the example code
You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.