10.11.2016 Views

Learning Data Mining with Python


Next Steps…

Deeper networks

These techniques will probably fool our current implementation, so the method will need to be improved. Try some of the deeper networks we used in Chapter 11, Classifying Objects in Images Using Deep Learning. Larger networks need more data, though, so you will probably need to generate more than the few thousand samples we used in this chapter to get good performance. Generating these datasets is a good candidate for parallelization: it consists of many small tasks that can be performed independently.
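Because each sample can be built independently, the generation step maps naturally onto a pool of worker processes. The following is a minimal sketch using only the standard library, not the chapter's own code: `make_sample` is a hypothetical stand-in for whatever routine creates one training sample, and the 8x8 pixel grid and four-letter label are placeholder assumptions.

```python
import random
import string
from concurrent.futures import ProcessPoolExecutor


def make_sample(seed):
    """Hypothetical stand-in for the chapter's sample generator.

    Builds a (label, pixels) pair from `seed` alone, so every task is
    independent and reproducible no matter which worker runs it.
    """
    rng = random.Random(seed)
    label = "".join(rng.choice(string.ascii_uppercase) for _ in range(4))
    # Placeholder "image": an 8x8 grid of random intensities in [0, 1).
    pixels = [[rng.random() for _ in range(8)] for _ in range(8)]
    return label, pixels


def generate_dataset(n_samples, executor_cls=ProcessPoolExecutor, max_workers=None):
    """Generate n_samples samples in parallel, one seed per task."""
    with executor_cls(max_workers=max_workers) as executor:
        # map() preserves input order, so sample i always comes from seed i.
        # chunksize batches tasks to cut inter-process overhead.
        return list(executor.map(make_sample, range(n_samples), chunksize=64))
```

A call such as `generate_dataset(50000)` would then spread the work over the available cores. On platforms that start workers by spawning a fresh interpreter, that call should sit under an `if __name__ == "__main__":` guard.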

Reinforcement learning

http://pybrain.org/docs/tutorial/reinforcement-learning.html

Reinforcement learning is gaining traction as the next big thing in data mining, although it has been around for a long time. PyBrain has some reinforcement learning algorithms that are worth checking out with this dataset (and others).

Chapter 9 – Authorship Attribution

Increasing the sample size

The Enron application in this chapter used only a portion of the overall dataset; much more data is available. Increasing the number of authors will likely lead to a drop in accuracy, but similar methods can still boost accuracy beyond what was achieved in this chapter. Using a grid search, try different values for n-grams and different parameters for support vector machines to get better performance on a larger number of authors.
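One way to search over n-gram sizes and SVM parameters together is to put feature extraction and the classifier in a single scikit-learn pipeline and hand it to `GridSearchCV`. This is a sketch under assumptions, not the chapter's exact setup: it assumes a scikit-learn version that provides `sklearn.model_selection`, and the candidate values in `param_grid`, the step names, and the `f1_macro` scoring choice are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC


def build_search(cv=3):
    """Return an unfitted grid search over character n-grams and SVM settings."""
    pipeline = Pipeline([
        # Character n-grams; the range itself is tuned by the grid below.
        ("features", CountVectorizer(analyzer="char")),
        ("classifier", SVC()),
    ])
    # Illustrative candidate values (assumptions, not tuned recommendations).
    param_grid = {
        "features__ngram_range": [(2, 2), (3, 3), (4, 4)],
        "classifier__kernel": ["linear", "rbf"],
        "classifier__C": [0.1, 1, 10],
    }
    # Macro-averaged F1 weights every author equally, which matters when
    # some authors have far fewer documents than others.
    return GridSearchCV(pipeline, param_grid, scoring="f1_macro", cv=cv)
```

Calling `build_search().fit(documents, authors)` on a list of document strings and a matching list of author labels would then expose the best combination through `best_params_` and its cross-validated score through `best_score_`.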

Blogs dataset

The dataset used in Chapter 12, Working with Big Data, provides authorship-based classes (each blogger ID is a separate author), so it can be tested with this kind of method as well. It also includes other classes (gender, age, industry, and star sign) that can be tested: are authorship-based methods good for these classification tasks?
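To try each of these classes as a target, the labels first have to be pulled out per document. The sketch below assumes the blog files are named in the pattern `<blogger id>.<gender>.<age>.<industry>.<star sign>.xml`, as in the publicly distributed Blog Authorship Corpus; check this against your own copy of the data. The helper names `parse_blog_filename` and `targets_for` are hypothetical.

```python
from pathlib import Path

# Assumed field order in each filename (verify against your copy of the data).
FIELDS = ("blogger_id", "gender", "age", "industry", "star_sign")


def parse_blog_filename(filename):
    """Split an assumed '<id>.<gender>.<age>.<industry>.<sign>.xml' name into labels."""
    parts = Path(filename).name.split(".")
    if len(parts) != 6 or parts[-1] != "xml":
        raise ValueError("unexpected blog filename layout: %r" % filename)
    labels = dict(zip(FIELDS, parts[:5]))
    labels["age"] = int(labels["age"])  # age is the only numeric field
    return labels


def targets_for(filenames, field):
    """Return one target label per file for the chosen field, e.g. 'gender'."""
    return [parse_blog_filename(name)[field] for name in filenames]
```

The list returned by `targets_for(filenames, "gender")`, for example, can replace the author labels when fitting the same feature pipeline, which gives a direct comparison across tasks.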

