10.11.2016 Views

Learning Data Mining with Python

Clustering News Articles

Summary

In this chapter, we looked at clustering, which is an unsupervised learning approach. We use unsupervised learning to explore data, rather than for classification and prediction purposes. In the experiment here, we didn't have topics for the news items we found on reddit, so we were unable to perform classification. We used k-means clustering to group together these news stories to find common topics and trends in the data.
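
As a reminder of the overall shape of that experiment, here is a minimal sketch of clustering text with k-means in scikit-learn. The four toy documents, the choice of two clusters, and the `random_state` value are assumptions made for illustration only; they are not the reddit data or the settings used in the chapter.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Toy stand-ins for extracted news stories (hypothetical data)
documents = [
    "The election results were announced by the government today",
    "The government passed a new law after the election",
    "The team won the football match in the final minute",
    "A late goal decided the football match for the home team",
]

# Assumed number of topics; with real data this usually needs experimentation
n_clusters = 2

pipeline = Pipeline([
    # Turn each document into a TF-IDF weighted bag-of-words vector
    ("features", TfidfVectorizer(stop_words="english")),
    # Group the vectors into n_clusters clusters
    ("clusterer", KMeans(n_clusters=n_clusters, n_init=10, random_state=14)),
])

# One cluster label per document; the label numbers themselves are arbitrary
labels = pipeline.fit_predict(documents)

for label, document in zip(labels, documents):
    print(label, document)
```

There are no target classes here: the labels only say which documents were grouped together, and it is up to us to inspect each group and decide what topic it represents.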

In pulling data from reddit, we had to extract data from arbitrary websites. This was done by looking for large text segments, rather than by using a full-blown machine learning approach. There are some interesting machine learning approaches to this task that may improve upon these results. In the Appendix of this book, I've listed, for each chapter, avenues for going beyond the scope of the chapter and improving upon the results. This includes references to other sources of information and more difficult applications of the approaches in each chapter.
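
The "large text segments" heuristic can be sketched with nothing more than the standard library. This is an illustrative stand-in rather than the code from the chapter: the class name `LargeTextExtractor`, the function `extract_story_text`, the list of skipped tags, and the 100-character threshold are all assumptions.

```python
from html.parser import HTMLParser

# Tags whose contents are never part of the story text (assumed list)
SKIP_TAGS = {"script", "style", "head", "noscript"}


class LargeTextExtractor(HTMLParser):
    """Collects text chunks that are at least min_length characters long."""

    def __init__(self, min_length=100):
        super().__init__()
        self.min_length = min_length
        self.skip_depth = 0   # greater than 0 while inside a skipped tag
        self.segments = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Short chunks are usually menus, links, and captions, so drop them
        if self.skip_depth == 0 and len(text) >= self.min_length:
            self.segments.append(text)


def extract_story_text(html, min_length=100):
    parser = LargeTextExtractor(min_length)
    parser.feed(html)
    parser.close()
    return "\n".join(parser.segments)


# A tiny hypothetical page: a long script, a short link, and one long paragraph
story = ("word " * 30).strip()
sample_html = (
    "<html><head><title>Site</title>"
    "<script>var x = '" + "a" * 150 + "';</script></head>"
    "<body><a href='/'>Home</a><p>" + story + "</p></body></html>"
)
extracted = extract_story_text(sample_html)
```

One limitation of this sketch is that inline tags such as links inside a paragraph split the text into several shorter chunks, some of which may fall below the threshold. That is the kind of weakness a learned extractor could improve upon.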

We also looked at a straightforward ensemble algorithm, evidence accumulation clustering (EAC). An ensemble is often a good way to deal with variance in the results, especially if you don't know how to choose good parameters (which is especially difficult with clustering).
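
To make the ensemble idea concrete, here is a rough sketch of evidence accumulation: run k-means several times with different parameters, count how often each pair of samples lands in the same cluster, and then treat pairs that co-occur often enough as connected. The function names, the range of `k` values, the threshold of 0.5, and the toy data are assumptions for illustration. The sketch also thresholds the co-association matrix directly before finding connected components, which is a simplification and may differ in detail from the implementation in the chapter.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components
from sklearn.cluster import KMeans


def coassociation_matrix(X, n_runs=10, k_range=(2, 5), random_state=14):
    """Fraction of runs in which each pair of samples shared a cluster."""
    rng = np.random.RandomState(random_state)
    n_samples = X.shape[0]
    counts = np.zeros((n_samples, n_samples))
    for _ in range(n_runs):
        # Vary the number of clusters from run to run
        k = int(rng.randint(k_range[0], k_range[1] + 1))
        labels = KMeans(n_clusters=k, n_init=10, random_state=rng).fit_predict(X)
        # Entry (i, j) is True when samples i and j received the same label
        counts += labels[:, None] == labels[None, :]
    return counts / n_runs


def evidence_accumulation(X, threshold=0.5, **kwargs):
    """Final clusters are the connected components of the thresholded matrix."""
    C = coassociation_matrix(X, **kwargs)
    graph = csr_matrix((C >= threshold).astype(int))
    _, labels = connected_components(graph, directed=False)
    return labels


# Hypothetical data: two well-separated groups of 20 points each
data_rng = np.random.RandomState(0)
X = np.vstack([
    data_rng.normal(0, 1, (20, 2)),
    data_rng.normal(100, 1, (20, 2)),
])
labels = evidence_accumulation(X)
```

Because the final grouping depends on agreement across many runs, no single choice of `k` decides the result, which is why the ensemble is less sensitive to poor parameter choices.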

Finally, we introduced online learning. This is a gateway to larger learning exercises, including big data, which will be discussed in the final two chapters of this book. These final experiments are quite large and require managing the data as well as learning a model from it.
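
As a small illustration of online learning, the sketch below updates a clustering model one batch at a time using `MiniBatchKMeans.partial_fit` from scikit-learn, so the full dataset never has to be in memory at once. The synthetic two-group data, the batch sizes, and the `batches` generator are assumptions for illustration; with text, each batch would first need to be converted to vectors, for example with a stateless vectorizer such as `HashingVectorizer`.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.RandomState(14)


def batches(n_batches=20, batch_size=50):
    """Yield small batches, standing in for data arriving over time."""
    for _ in range(n_batches):
        # Hypothetical stream: half the points near (0, 0), half near (10, 10)
        a = rng.normal(0, 1, (batch_size // 2, 2))
        b = rng.normal(10, 1, (batch_size // 2, 2))
        yield np.vstack([a, b])


model = MiniBatchKMeans(n_clusters=2, random_state=14)

# Update the model incrementally instead of calling fit on all data at once
for batch in batches():
    model.partial_fit(batch)

centers = model.cluster_centers_
predicted = model.predict(np.array([[0.0, 0.0], [10.0, 10.0]]))
```

The model is usable after any number of batches, which is what makes this style of learning a natural fit for datasets that are too large to load in one go.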

In the next chapter, we step away from unsupervised learning and go back to classification. We will look at deep learning, which is a classification method built on complex neural networks.
