08.06.2015 Views

Building Machine Learning Systems with Python - Richert, Coelho


Topic Modeling

...     for tj,v in t:
...         dense[ti,tj] = v
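
The loop fragment above fills a dense array from sparse per-document topic lists. A minimal sketch of that conversion follows, assuming `topics` holds one entry per document, each a list of `(topic_index, weight)` pairs. The toy data and the names `toy_topics`, `toy_dense`, and `num_topics` are hypothetical and not taken from the text:

```python
import numpy as np

# Hypothetical toy data: 3 documents, 4 topics.
# Each document lists only the topics with a non-zero weight.
toy_topics = [
    [(0, 0.7), (2, 0.3)],
    [(1, 1.0)],
    [(0, 0.1), (3, 0.9)],
]
num_topics = 4  # assumed to be known in advance

# One row per document, one column per topic; unlisted topics stay at zero.
toy_dense = np.zeros((len(toy_topics), num_topics), dtype=float)
for ti, t in enumerate(toy_topics):
    for tj, v in t:
        toy_dense[ti, tj] = v
```

Each row of the result can then be treated as a fixed-length vector, which is the form that distance functions expect.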

Now, dense is a matrix with one row per document and one column per topic. We can use the pdist function in SciPy to compute all pairwise distances. That is, with a single function call, we compute the Euclidean distance between every pair of rows, sqrt(sum((dense[ti] - dense[tj])**2)):

>>> from scipy.spatial import distance
>>> pairwise = distance.squareform(distance.pdist(dense))
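
To see what this call computes, the sketch below compares `pdist` plus `squareform` against an explicit double loop on a small random matrix. With its default metric, `pdist` returns Euclidean distances, so each entry is the square root of the sum of squared differences. The array `toy` is hypothetical stand-in data:

```python
import numpy as np
from scipy.spatial import distance

rng = np.random.RandomState(0)
toy = rng.rand(5, 3)  # hypothetical: 5 documents, 3 topics

# pdist returns a condensed vector with one value per unordered pair;
# squareform expands it into a symmetric square matrix.
condensed = distance.pdist(toy)
square = distance.squareform(condensed)

# The same values computed explicitly, pair by pair.
manual = np.zeros((5, 5))
for i in range(5):
    for j in range(5):
        manual[i, j] = np.sqrt(np.sum((toy[i] - toy[j]) ** 2))

same = np.allclose(square, manual)
```

For 5 rows, the condensed vector has 5 * 4 / 2 = 10 entries, and the diagonal of the square matrix is zero.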

Now we employ one last little trick; we set the diagonal elements of the distance matrix to a high value (it just needs to be larger than the other values in the matrix):

>>> largest = pairwise.max()
>>> for ti in range(len(topics)):
...     pairwise[ti,ti] = largest+1
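
As an alternative to the explicit loop, NumPy's `fill_diagonal` sets the diagonal in place with a single call. A small sketch on a hypothetical 3x3 distance matrix (`toy_pairwise` is illustrative, not data from the text):

```python
import numpy as np

# Hypothetical symmetric distance matrix with zeros on the diagonal.
toy_pairwise = np.array([
    [0.0, 0.2, 0.9],
    [0.2, 0.0, 0.5],
    [0.9, 0.5, 0.0],
])

largest = toy_pairwise.max()
# Modifies toy_pairwise in place; has the same effect as the loop over ti.
np.fill_diagonal(toy_pairwise, largest + 1)
```

Only the diagonal changes; the off-diagonal distances are left untouched.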

And we are done! For each document, we can look up the closest element easily:

>>> def closest_to(doc_id):
...     return pairwise[doc_id].argmin()

The previous code would not work if we had not set the diagonal elements to a large value; the function would always return the document itself, because every document is at distance zero from itself and is therefore its own nearest neighbor. (The only exception is the rare case where two documents have exactly the same topic distribution, which essentially happens only when the documents are identical.)
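
This behavior is easy to check on a small example. The sketch below uses three hypothetical topic vectors (`toy_dense` is illustrative data) and records the result of `argmin` for each row before and after the diagonal is raised:

```python
import numpy as np
from scipy.spatial import distance

# Hypothetical topic distributions for three documents.
toy_dense = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.0, 0.1, 0.9],
])
toy_pairwise = distance.squareform(distance.pdist(toy_dense))

# Without the trick, every document is closest to itself (distance 0).
before = [int(toy_pairwise[i].argmin()) for i in range(3)]

# Raise the diagonal above every other entry, as in the text.
largest = toy_pairwise.max()
for ti in range(len(toy_dense)):
    toy_pairwise[ti, ti] = largest + 1

# Now argmin returns the nearest *other* document.
after = [int(toy_pairwise[i].argmin()) for i in range(3)]
```

Here `before` is `[0, 1, 2]` (each document matches itself), while `after` is `[1, 0, 1]`: documents 0 and 1 are each other's nearest neighbors, and document 2 is nearest to document 1.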

For example, here is the second document in the collection (the first document is very uninteresting, as the system returns a post stating that it is the most similar):

From: geb@cs.pitt.edu (Gordon Banks)
Subject: Re: request for information on "essential tremor" and Indrol?

In article sundar@ai.mit.edu writes:

Essential tremor is a progressive hereditary tremor that gets worse
when the patient tries to use the effected member. All limbs, vocal
cords, and head can be involved. Inderal is a beta-blocker and is
usually effective in diminishing the tremor. Alcohol and mysoline are
also effective, but alcohol is too toxic to use as a treatment.

----------------------------------------------------------------
Gordon Banks N3JXP | "Skepticism is the chastity of the intellect, and
geb@cadre.dsl.pitt.edu | it is shameful to surrender it too soon."
----------------------------------------------------------------

If we ask for the most similar document, closest_to(1), we receive the following document:

From: geb@cs.pitt.edu (Gordon Banks)
