Beating CAPTCHAs with Neural Networks

Finally, we will create our dataset. This dataset array is three-dimensional, as it is an array of two-dimensional images. Our classifier will need a two-dimensional array, so we simply flatten the last two dimensions:

X = dataset.reshape((dataset.shape[0], dataset.shape[1] * dataset.shape[2]))
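As a quick sanity check (our own illustration, not from the book), the reshape turns the stack of 20 by 20 images into rows of 400 values:

print(dataset.shape)  # for example, (n_samples, 20, 20)
print(X.shape)        # becomes (n_samples, 400) after flattening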

Then, using the train_test_split function of scikit-learn, we split the data into a training set and a testing set. The code is as follows:

from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = \
    train_test_split(X, y, train_size=0.9)
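If you are using a more recent release of scikit-learn, note that the cross_validation module has since been removed; the same function is available from model_selection:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = \
    train_test_split(X, y, train_size=0.9)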

Training and classifying

We are now going to build a neural network that will take an image as input and try to predict which (single) letter is in the image.

We will use the training set of single letters we created earlier. The dataset itself is quite simple. We have a 20 by 20 pixel image, with each pixel either 1 (black) or 0 (white). These pixels make up the 400 features that we will use as inputs to the neural network. The outputs will be 26 values between 0 and 1, where a higher value indicates a higher likelihood that the associated letter (the first neuron is A, the second is B, and so on) is the letter represented by the input image.
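This excerpt assumes that the targets y are already encoded as a matrix with 26 columns, one per letter. As a rough sketch (using plain NumPy rather than the book's own code), one way to build such a one-hot matrix from hypothetical integer letter codes might look like this:

import numpy as np

labels = np.array([0, 1, 25])                # hypothetical integer codes: 0 = A, 1 = B, ..., 25 = Z
y = np.zeros((labels.shape[0], 26))          # one row per sample, one column per letter
y[np.arange(labels.shape[0]), labels] = 1    # mark the column of the correct letter with a 1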

We are going to use the PyBrain library for our neural network.

As with all the libraries we have seen so far, PyBrain can be installed with pip: pip install pybrain.

The PyBrain library uses its own dataset format, but luckily it isn't too difficult to create training and testing datasets using this format. The code is as follows:

from pybrain.datasets import SupervisedDataSet

First, we iterate over our training dataset and add each sample to a new SupervisedDataSet instance. The code is as follows:

training = SupervisedDataSet(X.shape[1], y.shape[1])
for i in range(X_train.shape[0]):
    training.addSample(X_train[i], y_train[i])
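A testing dataset in the same format can be built in exactly the same way; the following sketch (with our own variable name testing) simply mirrors the training code above using the held-out split:

testing = SupervisedDataSet(X.shape[1], y.shape[1])
for i in range(X_test.shape[0]):
    testing.addSample(X_test[i], y_test[i])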
