10.11.2016 Views

Learning Data Mining with Python


Clustering News Articles

In most of the previous chapters, we performed data mining knowing what we were looking for. Our use of target classes allowed us to learn how our variables model those targets during the training phase. This type of learning, where we have targets to train against, is called supervised learning. In this chapter, we consider what we do without those targets. This is unsupervised learning and is much more of an exploratory task. Rather than wanting to classify with our model, the goal in unsupervised learning is more about exploring the data to find insights.

In this chapter, we look at clustering news articles to find trends and patterns in the data. We look at how we can extract data from different websites using a link aggregation website to show a variety of news stories.

The key concepts covered in this chapter include:

• Obtaining text from arbitrary websites
• Using the reddit API to collect interesting news stories
• Cluster analysis for unsupervised data mining
• Extracting topics from documents
• Online learning for updating a model without retraining it
• Cluster ensembling to combine different models

Obtaining news articles

In this chapter, we will build a system that takes a live feed of news articles and groups them together, where the groups have similar topics. You could run the system over several weeks (or longer) to see how trends change over that time.
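To make the idea of grouping articles by topic more concrete, here is a minimal sketch in plain Python. It is not the system built in this chapter: the headlines are made up for illustration, the helper names (`tokenize`, `vectorize`, `cosine`, `mean_vector`, `cluster`) are hypothetical, and a simple cosine-based k-means stands in for whatever clustering method and library you end up using. The point is only that no target classes are involved; documents are grouped purely by the words they share.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a string and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def vectorize(text):
    """Turn a document into a unit-length bag-of-words vector (a dict)."""
    counts = Counter(tokenize(text))
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {word: c / norm for word, c in counts.items()} if norm else {}


def cosine(a, b):
    """Cosine similarity of two unit-length sparse vectors."""
    if len(a) > len(b):
        a, b = b, a
    return sum(weight * b.get(word, 0.0) for word, weight in a.items())


def mean_vector(vectors):
    """Sum sparse vectors and rescale the result to unit length."""
    total = Counter()
    for vec in vectors:
        for word, weight in vec.items():
            total[word] += weight
    norm = math.sqrt(sum(w * w for w in total.values()))
    return {word: w / norm for word, w in total.items()} if norm else {}


def cluster(documents, k, max_iter=20):
    """Group documents into k clusters; returns one cluster label per document."""
    vectors = [vectorize(doc) for doc in documents]
    # Deterministic start (an assumption of this sketch): begin with the first
    # document, then repeatedly add the document least similar to the
    # centroids chosen so far.
    centroids = [vectors[0]]
    while len(centroids) < k:
        farthest = min(vectors, key=lambda v: max(cosine(v, c) for c in centroids))
        centroids.append(farthest)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each document joins its most similar centroid.
        new_labels = [
            max(range(k), key=lambda i: cosine(v, centroids[i])) for v in vectors
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so we have converged
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for i in range(k):
            members = [v for v, label in zip(vectors, labels) if label == i]
            if members:
                centroids[i] = mean_vector(members)
    return labels


# Made-up headlines, purely for illustration.
headlines = [
    "Election results: the president wins a second term",
    "Team wins the football cup final",
    "President signs new election law",
    "Football team loses cup final on penalties",
]
labels = cluster(headlines, k=2)
print(labels)  # → [0, 1, 0, 1]
```

The two politics headlines end up in one cluster and the two football headlines in the other, even though nothing told the code what "politics" or "football" means. Note that raw word counts let common words such as "the" contribute to similarity; a real pipeline would normally down-weight or remove them.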
