08.06.2015

Building Machine Learning Systems with Python - Richert, Coelho


Clustering – Finding Related Posts

>>> num_samples, num_features = vectorized.shape
>>> print("#samples: %d, #features: %d" % (num_samples, num_features))
#samples: 3414, #features: 4331

We now have a pool of 3,414 posts, each represented by a feature vector of 4,331 dimensions. That is what KMeans takes as input. We fix the number of clusters to 50 for this chapter, as shown in the following code, and hope you are curious enough to try out different values as an exercise:

>>> num_clusters = 50
>>> from sklearn.cluster import KMeans
>>> km = KMeans(n_clusters=num_clusters, init='random', n_init=1,
...             verbose=1)
>>> km.fit(vectorized)

That's it. After fitting, we can read the clustering information from the attributes of km. For every vectorized post that has been fit, there is a corresponding integer label in km.labels_:

>>> km.labels_
array([33, 22, 17, ..., 14, 11, 39])
>>> km.labels_.shape
(3414,)
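To get a feel for what the labels tell us, we can count how many samples landed in each cluster. The following is only a minimal sketch on made-up data: the array toy and the model km_toy are hypothetical stand-ins for vectorized and km, and three clusters are used instead of 50 to keep the output small.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical stand-in for `vectorized`: 60 two-dimensional points
# scattered around three made-up centers.
rng = np.random.RandomState(0)
blob_centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
toy = np.vstack([c + rng.normal(scale=0.5, size=(20, 2))
                 for c in blob_centers])

km_toy = KMeans(n_clusters=3, init='random', n_init=1, random_state=0)
km_toy.fit(toy)

# labels_ holds one integer cluster index per fitted sample.
labels = km_toy.labels_

# Count how many samples were assigned to each cluster.
cluster_sizes = np.bincount(labels, minlength=3)

print(labels.shape)
print(cluster_sizes)
```

Because init='random' with n_init=1 runs KMeans only once from a random start, the sizes can differ between seeds. Raising n_init lets KMeans keep the best of several runs.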

The cluster centers can be accessed via km.cluster_centers_.

In the next section we will see how we can assign a cluster to a newly arriving post using km.predict.
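As a rough illustration of how the centers and predict relate, the sketch below fits KMeans on random numbers and checks that predict returns the index of the center closest to a new sample. The names data, km_demo, and new_point are hypothetical and chosen only for this example.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical data: 100 samples with 5 features each.
rng = np.random.RandomState(42)
data = rng.rand(100, 5)

km_demo = KMeans(n_clusters=4, init='random', n_init=1, random_state=42)
km_demo.fit(data)

# One row per cluster, one column per feature.
print(km_demo.cluster_centers_.shape)

# A new sample must have the same number of features as the training data.
new_point = rng.rand(1, 5)
predicted = km_demo.predict(new_point)[0]

# Recompute the assignment by hand: the nearest center by Euclidean distance.
distances = np.linalg.norm(km_demo.cluster_centers_ - new_point, axis=1)
nearest = int(np.argmin(distances))

print(predicted, nearest)
```

The two printed indices should agree, since predict assigns a sample to its closest cluster center.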

Solving our initial challenge<br />

We now put everything together and demonstrate our system for the following new post that we assign to the variable new_post:

Disk drive problems. Hi, I have a problem with my hard disk.
After 1 year it is working only sporadically now.
I tried to format it, but now it doesn't boot any more.
Any ideas? Thanks.
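Before running this on the real data, it may help to see the overall flow on a tiny example. The sketch below is only an illustration under several assumptions: the four posts are invented, a plain TfidfVectorizer stands in for whatever vectorizer produced vectorized, and two clusters are used instead of 50. The names posts, vectorizer, km_text, and similar are hypothetical.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented toy corpus standing in for the 3,414 real posts.
posts = [
    "My hard disk makes clicking noises and fails to boot",
    "How to format a hard disk drive safely",
    "Best graphics card for 3D rendering",
    "Graphics card driver crashes during rendering",
]

new_post = (
    "Disk drive problems. Hi, I have a problem with my hard disk.\n"
    "After 1 year it is working only sporadically now.\n"
    "I tried to format it, but now it doesn't boot any more.\n"
    "Any ideas? Thanks."
)

# Assumption: a plain TF-IDF vectorizer with English stop words.
vectorizer = TfidfVectorizer(stop_words='english')
vectorized_toy = vectorizer.fit_transform(posts)

km_text = KMeans(n_clusters=2, init='random', n_init=1, random_state=3)
km_text.fit(vectorized_toy)

# The new post must be transformed with the SAME fitted vectorizer,
# so that it ends up in the same feature space as the training posts.
new_post_vec = vectorizer.transform([new_post])
new_post_label = km_text.predict(new_post_vec)[0]

# Candidate related posts: those sharing the new post's cluster label.
similar = [p for p, label in zip(posts, km_text.labels_)
           if label == new_post_label]

print(new_post_label)
print(similar)
```

Restricting the search to posts in the same cluster means only a fraction of the corpus has to be compared against the new post in detail.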
