08.06.2015 Views

Building Machine Learning Systems with Python - Richert, Coelho


As we have learned previously, we first have to vectorize this post before we can predict its label:

>>> new_post_vec = vectorizer.transform([new_post])
>>> new_post_label = km.predict(new_post_vec)[0]
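To make the prediction step less opaque, here is a minimal sketch of what assigning a vector to a cluster amounts to: choosing the nearest centroid by Euclidean distance. The helper `nearest_centroid` and the toy centroids are made up for illustration and are not part of scikit-learn; in the real pipeline, `km.predict` performs this assignment for you.

```python
import numpy as np

def nearest_centroid(vec, centroids):
    """Return the index of the centroid closest to vec (Euclidean distance)."""
    dists = np.linalg.norm(centroids - vec, axis=1)
    return int(np.argmin(dists))

# Hypothetical 2-D centroids standing in for the fitted cluster centers
toy_centroids = np.array([[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]])
toy_vec = np.array([4.0, 6.0])

toy_label = nearest_centroid(toy_vec, toy_centroids)
```

The returned index plays the same role as `new_post_label`: it names the cluster whose members are worth comparing against.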

Now that we have the clustering, we do not need to compare new_post_vec to all post vectors. Instead, we can focus only on the posts of the same cluster. Let us fetch their indices in the original dataset:

>>> similar_indices = (km.labels_ == new_post_label).nonzero()[0]

The comparison in the parentheses results in a Boolean array. nonzero() returns a tuple of index arrays, one per dimension, and [0] selects the array containing the indices of the True elements.
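The following small sketch shows the mask-then-nonzero idiom on its own. The label array `toy_labels` is invented for illustration and stands in for `km.labels_`.

```python
import numpy as np

toy_labels = np.array([2, 0, 1, 2, 2, 0])  # made-up stand-in for km.labels_
target = 2

# Element-wise comparison yields one Boolean per post
mask = (toy_labels == target)

# nonzero() returns a tuple with one index array per dimension;
# for a 1-D array we take the first (and only) entry
indices = mask.nonzero()[0]
```

Here `indices` holds the positions of all posts that share the target label, which can then be used to look up the matching rows in the vectorized data.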

Using similar_indices, we then simply have to build a list of posts together with their similarity scores as follows:

>>> similar = []
>>> for i in similar_indices:
...     dist = sp.linalg.norm((new_post_vec - vectorized[i]).toarray())
...     similar.append((dist, dataset.data[i]))
...
>>> similar = sorted(similar)
>>> print(len(similar))
44
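As a self-contained illustration of the ranking loop, the sketch below computes Euclidean distances on small dense vectors and sorts the resulting (distance, text) pairs. The function `rank_by_distance` and the toy data are assumptions made for this example; the real code works on sparse vectors, which is why it calls `.toarray()` before taking the norm.

```python
import numpy as np

def rank_by_distance(query, vectors, texts):
    """Return (distance, text) pairs sorted from nearest to farthest."""
    pairs = []
    for vec, text in zip(vectors, texts):
        dist = float(np.linalg.norm(query - vec))
        pairs.append((dist, text))
    # Tuples sort by their first element, so this orders by distance
    # (ties fall back to comparing the text)
    return sorted(pairs)

# Hypothetical dense vectors standing in for the posts of one cluster
toy_query = np.array([1.0, 0.0])
toy_vectors = np.array([[1.0, 1.0], [1.0, 0.0], [4.0, 4.0]])
toy_texts = ["post a", "post b", "post c"]

ranked = rank_by_distance(toy_query, toy_vectors, toy_texts)
```

Because a smaller distance means a more similar post, the first entry of `ranked` is the best match and the last entry is the weakest one within the cluster.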

We found 44 posts in the cluster of our post. To give the user a quick idea of what kind of similar posts are available, we can now present the most similar post (show_at_1), the least similar one (show_at_3), and an in-between post (show_at_2), all of which are from the same cluster:

>>> show_at_1 = similar[0]
>>> show_at_2 = similar[len(similar) // 2]
>>> show_at_3 = similar[-1]
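Picking the three representatives can be wrapped in a small helper, sketched here under the assumption that the list is already sorted by distance. The name `pick_representatives` and the toy list are hypothetical. Floor division keeps the middle index an integer, which list indexing requires in Python 3.

```python
def pick_representatives(ranked):
    """Return the first, middle, and last items of a non-empty sorted list."""
    if not ranked:
        raise ValueError("ranked must not be empty")
    return ranked[0], ranked[len(ranked) // 2], ranked[-1]

# Made-up (distance, text) pairs, already sorted by distance
toy_ranked = [
    (0.0, "closest"),
    (0.7, "near"),
    (1.1, "middle"),
    (1.4, "far"),
    (2.0, "farthest"),
]

best, middle, worst = pick_representatives(toy_ranked)
```

Guarding against an empty list matters in practice, since a new post could in principle land in a cluster for which no posts were retrieved.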
