10.11.2016 Views

Learning Data Mining with Python


Clustering News Articles

Our system will start with the popular link aggregation website reddit, which stores lists of links to other websites, as well as a comments section for discussion. Links on reddit are grouped into categories called subreddits. There are subreddits devoted to particular TV shows, funny images, and many other things. We are interested in the subreddits for news. We will use the /r/worldnews subreddit in this chapter, but the code should work with any other subreddit.

In this chapter, our goal is to download popular stories and then cluster them to see any major themes or concepts that occur. This will give us insight into the popular focus, without having to manually analyze hundreds of individual stories.

Using a Web API to get data

We have used web-based APIs to extract data in several of our previous chapters. For instance, in Chapter 7, Discovering Accounts to Follow Using Graph Mining, we used Twitter's API to extract data. Collecting data is a critical part of the data mining pipeline, and web-based APIs are a fantastic way to collect data on a variety of topics.

There are three things you need to consider when using a web-based API for collecting data: authorization methods, rate limiting, and API endpoints.

Authorization methods allow the data provider to know who is collecting the data, in order to ensure that they are being appropriately rate-limited and that data access can be tracked. For most websites, a personal account is enough to start collecting data, but some websites will ask you to create a formal developer account to get this access.
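As a rough illustration of how credentials travel with a request, the sketch below attaches an access token and an identifying user agent to an HTTP request using only the standard library. The bearer-token scheme, the endpoint URL, the token value, and the function name `build_authorized_request` are all assumptions made for this example; each provider documents its own authorization method.

```python
import urllib.request


def build_authorized_request(url, token, user_agent):
    """Return a Request carrying credentials; nothing is sent until it is opened."""
    headers = {
        # Assumed scheme: many (not all) APIs accept a bearer token header.
        "Authorization": "Bearer " + token,
        # An identifying user agent helps the provider track who is calling.
        "User-Agent": user_agent,
    }
    return urllib.request.Request(url, headers=headers)


request = build_authorized_request(
    "https://api.example.com/v1/stories",  # hypothetical endpoint
    token="YOUR-ACCESS-TOKEN",             # placeholder, not a real credential
    user_agent="my-data-mining-script/0.1",
)
```

Passing the resulting object to `urllib.request.urlopen` would actually send the request; the sketch stops short of that so it can be run without network access.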

Rate limiting is applied to data collection, particularly for free services. It is important to be aware of the rules when using APIs, as they can and do differ from website to website. Twitter's API limit is 180 requests per 15 minutes (depending on the particular API call). Reddit, as we will see later, allows 30 requests per minute. Some websites impose daily limits, while others limit on a per-second basis. Even within a single website, limits can differ drastically between API calls. For example, Google Maps has smaller limits and different API limits per-resource, with different allowances for the number of requests per hour.
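One simple way to stay inside limits like these is to space requests evenly on the client side: 30 requests per minute works out to one request every 2 seconds, and 180 requests per 15 minutes to one every 5 seconds. The following is a minimal sketch of that idea; the `RateLimiter` class and its parameter names are hypothetical and not part of any API's client library.

```python
import time


class RateLimiter:
    """Space calls so that at most `max_requests` happen per `period` seconds."""

    def __init__(self, max_requests, period, clock=time.monotonic, sleep=time.sleep):
        # Minimum number of seconds that must separate two consecutive calls.
        self.min_interval = period / max_requests
        self._clock = clock
        self._sleep = sleep
        self._last_call = None

    def wait(self):
        """Block until enough time has passed since the previous call."""
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now


# Limits quoted in the text: 30 per minute and 180 per 15 minutes.
reddit_limiter = RateLimiter(max_requests=30, period=60)
twitter_limiter = RateLimiter(max_requests=180, period=15 * 60)
```

Calling `reddit_limiter.wait()` immediately before each request keeps calls at least 2 seconds apart. Even spacing is more conservative than most providers require, since it forgoes short bursts, but it is easy to reason about.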

If you find you are creating an app or running an experiment that needs more requests and faster responses, most API providers have commercial plans that allow for more calls.
