08.06.2015 Views

Building Machine Learning Systems with Python - Richert, Coelho


Because the cluster centers have moved, we have to reassign the cluster labels and recalculate the cluster centers. After iteration 2, we get the following clustering:


The arrows show the movements of the cluster centers. After five iterations in this example, the cluster centers no longer move noticeably (scikit-learn's tolerance threshold is 0.0001 by default).
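The reassign-and-recalculate loop can be sketched in a few lines of plain Python. This is only an illustration, not scikit-learn's implementation: the function names, the toy points, and the starting centers are made up here, and the stopping rule (largest center movement at or below `tol`) is a simplification of how scikit-learn applies its tolerance.

```python
import math

def assign_labels(points, centers):
    """Return, for each point, the index of its nearest center."""
    return [
        min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
        for p in points
    ]

def recompute_centers(points, labels, old_centers):
    """Move each center to the mean of the points currently assigned to it."""
    new_centers = []
    for k, old in enumerate(old_centers):
        members = [p for p, label in zip(points, labels) if label == k]
        if not members:
            # Assumption: an empty cluster simply keeps its previous center.
            new_centers.append(old)
            continue
        new_centers.append(
            tuple(sum(m[d] for m in members) / len(members) for d in range(len(old)))
        )
    return new_centers

def kmeans(points, centers, tol=1e-4, max_iter=100):
    """Repeat the two steps until no center moves by more than tol."""
    labels, iteration = [], 0
    for iteration in range(1, max_iter + 1):
        labels = assign_labels(points, centers)
        new_centers = recompute_centers(points, labels, centers)
        # Largest distance any single center moved in this iteration.
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift <= tol:
            break
    return centers, labels, iteration

# Hypothetical toy data: two obvious groups in two dimensions.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centers, labels, iterations = kmeans(points, [(0.0, 0.0), (1.0, 1.0)])
print(centers, labels, iterations)
```

On this toy data the centers stop moving in the second iteration, which is the same kind of settling the figure illustrates.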

After the clustering has settled, we just need to note down the cluster centers and their identity. When a new document comes in, we have to vectorize it and compare it with all the cluster centers. The new post is then assigned to the cluster whose center has the smallest distance to the new post vector.
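As a rough sketch of that assignment step, assume the new post has already been vectorized into the same feature space as the cluster centers. The function name `nearest_cluster`, the dictionary of centers, and the example vector are all hypothetical, and Euclidean distance is assumed as the distance measure.

```python
import math

def nearest_cluster(post_vector, centers):
    """Return (cluster_id, distance) for the center closest to post_vector."""
    distances = {
        cluster_id: math.dist(post_vector, center)
        for cluster_id, center in centers.items()
    }
    best = min(distances, key=distances.get)
    return best, distances[best]

# Hypothetical cluster centers noted down after the clustering has settled.
saved_centers = {0: (0.0, 0.5), 1: (10.0, 10.5)}

# Hypothetical vector of a newly arrived post.
new_post_vector = (9.0, 9.0)

cluster_id, distance = nearest_cluster(new_post_vector, saved_centers)
print(cluster_id, distance)
```

Only the centers and their identities need to be stored; the original training posts are not required to place a new post.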

Getting test data to evaluate our ideas on<br />

In order to test clustering, let us move away from the toy text examples and find a dataset that resembles the data we are expecting in the future so that we can test our approach. For our purpose, we need documents on technical topics that are already grouped together so that we can check whether our algorithm works as expected when we apply it later to the posts we hope to receive.
