Learning Data Mining with Python

Clustering News Articles

The labels variable now contains the cluster numbers for each sample. Samples with the same label are said to belong to the same cluster. It should be noted that the cluster labels themselves are meaningless: clusters 1 and 2 are no more similar than clusters 1 and 3.

We can see how many samples were placed in each cluster using the Counter class:

```python
from collections import Counter

c = Counter(labels)
for cluster_number in range(n_clusters):
    print("Cluster {} contains {} samples".format(cluster_number,
                                                  c[cluster_number]))
```

Many of the results (keeping in mind that your dataset will be quite different from mine) consist of one large cluster with the majority of instances, several medium-sized clusters, and some clusters with only one or two instances. This imbalance is quite normal in many clustering applications.

Evaluating the results

Clustering is mainly an exploratory analysis, and therefore it is difficult to evaluate a clustering algorithm's results effectively. A straightforward way is to evaluate the algorithm based on the criterion the algorithm itself tries to optimize.

If you have a test set, you can evaluate clustering against it. For more details, visit http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-clustering-1.html.
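As a brief sketch of this kind of evaluation, if ground-truth labels were available (for example, the original news categories), scikit-learn's adjusted_rand_score is one such measure. The labels below are made up purely for illustration; note how the score ignores the arbitrary cluster numbering discussed earlier:

```python
from sklearn.metrics import adjusted_rand_score

# Hypothetical ground-truth and predicted cluster labels for six samples.
true_labels = [0, 0, 0, 1, 1, 1]
predicted = [1, 1, 1, 0, 0, 0]  # same grouping, different label numbers

# The adjusted Rand index compares the partitions themselves, so two
# identical groupings score 1.0 even though the label numbers differ.
print(adjusted_rand_score(true_labels, predicted))  # 1.0
```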

In the case of the k-means algorithm, the criterion that it uses when developing the centroids is to minimize the distance from each sample to its nearest centroid. This is called the inertia of the algorithm and can be retrieved from any KMeans instance that has had fit called on it:

```python
pipeline.named_steps['clusterer'].inertia_
```

The result on my dataset was 343.94. Unfortunately, this value is quite meaningless by itself, but we can use it to determine how many clusters we should use. In the preceding example, we set n_clusters to 10, but is this the best value? The following code runs the k-means algorithm 10 times with each value of n_clusters from 2 to 20. For each run, it records the inertia of the result.
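A minimal sketch of such a loop is shown below. The random matrix X stands in for the vectorized news articles, and a plain KMeans is used in place of the full pipeline; both are assumptions for illustration, not the book's exact code:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(14)
X = rng.rand(100, 5)  # stand-in for the vectorized documents

inertia_scores = []
n_cluster_values = list(range(2, 20))
for n_clusters in n_cluster_values:
    cur_inertia_scores = []
    for i in range(10):  # 10 independent runs per value of n_clusters
        km = KMeans(n_clusters=n_clusters, n_init=1, random_state=i)
        km.fit(X)
        cur_inertia_scores.append(km.inertia_)  # record each run's inertia
    inertia_scores.append(cur_inertia_scores)
```

Plotting the recorded inertia values against n_clusters typically shows inertia falling as clusters are added, which is why the curve's shape, rather than any single value, guides the choice of cluster count.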

