08.06.2015 Views

Building Machine Learning Systems with Python - Richert, Coelho


Chapter 5

Fetching the data

Luckily for us, the team behind Stack Overflow provides most of the data behind the StackExchange universe, to which Stack Overflow belongs, under a CC Wiki license. At the time of writing, the latest data dump can be found at http://www.clearbits.net/torrents/2076-aug-2012. Most likely, this page will contain a pointer to an updated dump when you read it.

After downloading and extracting it, we have around 37 GB of data in the XML format. This is illustrated in the following table:

File             Size (MB)  Description
badges.xml             309  Badges of users
comments.xml         3,225  Comments on questions or answers
posthistory.xml     18,370  Edit history
posts.xml           12,272  Questions and answers; this is what we need
users.xml              319  General information about users
votes.xml            2,200  Information on votes

As the files are more or less self-contained, we can delete all of them except posts.xml; it contains all the questions and answers as individual row tags within the root tag posts. Refer to the following code:
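As a minimal sketch of how such a file could be read, the snippet below streams the row tags inside a posts root tag one at a time, so the 12 GB file would not have to be loaded into memory at once. The tiny inline sample and the attribute names in it (Id, PostTypeId, ParentId, Title, Body) are assumptions made for illustration; check them against the actual dump before relying on them. The helper name iter_rows is hypothetical as well.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical miniature stand-in for posts.xml.
# The attribute names are assumptions, not taken from the real dump.
SAMPLE = b"""<?xml version="1.0" encoding="utf-8"?>
<posts>
  <row Id="1" PostTypeId="1" Title="How do I parse XML?" Body="question text" />
  <row Id="2" PostTypeId="2" ParentId="1" Body="answer text" />
</posts>
"""


def iter_rows(source):
    """Yield the attributes of every <row> element as a plain dict.

    `source` may be a filename or a binary file object.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "row":
            # Copy the attributes first, because clear() below wipes them.
            yield dict(elem.attrib)
            # Drop the element's attributes and children once consumed.
            elem.clear()


rows = list(iter_rows(io.BytesIO(SAMPLE)))
for row in rows:
    print(row["Id"], row["PostTypeId"])
```

For the real file, the same generator could be called with the path to posts.xml instead of the in-memory sample, consuming the rows in a loop rather than collecting them into a list.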
