Building Machine Learning Systems with Python - Richert, Coelho

Chapter 4

Subject: Re: High Prolactin
In article JER4@psuvm.psu.edu (John E. Rodway) writes:
>Any comments on the use of the drug Parlodel for high prolactin in
the blood?
It can suppress secretion of prolactin. Is useful in cases of
galactorrhea. Some adenomas of the pituitary secret too much.
------------------------------------------------------------------
Gordon Banks N3JXP      | "Skepticism is the chastity of the intellect,
geb@cadre.dsl.pitt.edu  | and it is shameful to surrender it too soon."
----------------------------------------------------------------

We received a post by the same author discussing medications.

Modeling the whole of Wikipedia

While the initial LDA implementations could be slow, modern systems can work with very large collections of data. Following the gensim documentation, we are going to build a topic model for the whole of the English-language Wikipedia. This takes hours, but can be done even with a machine that is not too powerful. With a cluster of machines, we could make it go much faster, but we will look at that sort of processing in a later chapter.

First we download the whole Wikipedia dump from http://dumps.wikimedia.org. This is a large file (currently just over 9 GB), so it may take a while, unless your Internet connection is very fast. Then, we will index it with a gensim tool:

python -m gensim.scripts.make_wiki enwiki-latest-pages-articles.xml.bz2 wiki_en_output

Run the previous command on the command line, not in the Python shell. After a few hours, the indexing will be finished. Finally, we can build the final topic model. This step looks exactly like what we did for the small AP dataset. We first import a few packages:

>>> import logging, gensim
>>> logging.basicConfig(
    format='%(asctime)s : %(levelname)s : %(message)s',
    level=logging.INFO)

Now, we load the data that has been preprocessed:

>>> id2word = gensim.corpora.Dictionary.load_from_text(
    'wiki_en_output_wordids.txt')
>>> mm = gensim.corpora.MmCorpus('wiki_en_output_tfidf.mm')

