10.11.2016 Views

Learning Data Mining with Python


Chapter 8

Our targets are integer values from 0 to 25, each representing a letter of the alphabet. Neural networks don't usually support multiple values from a single neuron, instead preferring multiple outputs, each with a value of 0 or 1. We therefore perform one-hot encoding of the targets, giving us a target array with 26 outputs per sample: 1 in the position of the sample's letter and 0 everywhere else. The code is as follows:

from sklearn.preprocessing import OneHotEncoder
onehot = OneHotEncoder()
y = onehot.fit_transform(targets.reshape(targets.shape[0], 1))

The library we are going to use doesn't support sparse arrays, so we need to turn our sparse matrix into a dense NumPy array. The code is as follows:

# toarray() returns a plain NumPy array; todense() would return a numpy.matrix
y = y.toarray()
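To make the shape of the encoded targets concrete, here is a minimal NumPy-only sketch of the same idea. It is not the code used in this chapter: the small `targets` array is a made-up stand-in, and indexing an identity matrix is simply an alternative way to build a dense one-hot array. One difference worth noting is that this version always produces 26 columns, whereas an encoder fitted on the data produces one column per value it actually sees.

```python
import numpy as np

# Hypothetical stand-in for the chapter's `targets` array: letter indices 0..25.
targets = np.array([0, 2, 25, 2])

n_letters = 26
# Row i of the identity matrix is the one-hot vector for letter i,
# so indexing with the targets picks one row per sample.
y = np.eye(n_letters, dtype=np.float32)[targets]

print(y.shape)        # one row per sample, one column per letter
print(y[1].argmax())  # column holding the 1 for the second sample
```

Each row contains exactly one 1, in the column matching that sample's letter index.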

Adjusting our training dataset to our methodology

Our training dataset differs quite significantly from our final methodology. The dataset consists of cleanly generated individual letters, each fitting a 20-pixel by 20-pixel image. The methodology, however, extracts letters from words, which may squash them, move them away from the center, or create other problems.

Ideally, the data you train your classifier on should mimic the environment it will be used in. In practice, we make concessions, but aim to minimize the differences as much as possible.

For this experiment, we would ideally extract letters from actual CAPTCHAs and label those. In the interests of speeding up the process a bit, we will just run our segmentation function on the training dataset and return those letters instead.

We will need the resize function from scikit-image, as our sub-images won't always be 20 pixels by 20 pixels. The code is as follows:

from skimage.transform import resize

From here, we can run our segment_image function on each sample and then resize the first sub-image it returns to 20 pixels by 20 pixels. The code is as follows:

dataset = np.array([resize(segment_image(sample)[0], (20, 20))
                    for sample in dataset])
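The following sketch illustrates why the resize step is needed, using only NumPy so that it runs on its own. Both helpers are hypothetical simplifications: `crop_to_content` stands in for the chapter's segment_image function by cropping to the bounding box of the non-zero pixels, and `resize_nearest` is a plain nearest-neighbour substitute for scikit-image's resize, which interpolates rather than copying pixels.

```python
import numpy as np

def crop_to_content(image):
    """Hypothetical stand-in for segment_image: crop to the non-zero bounding box."""
    rows = np.any(image > 0, axis=1)
    cols = np.any(image > 0, axis=0)
    if not rows.any():
        return image  # blank image: nothing to crop
    r0, r1 = np.where(rows)[0][[0, -1]]
    c0, c1 = np.where(cols)[0][[0, -1]]
    return image[r0:r1 + 1, c0:c1 + 1]

def resize_nearest(image, shape):
    """Nearest-neighbour resize; a simplified substitute for skimage's resize."""
    row_idx = np.arange(shape[0]) * image.shape[0] // shape[0]
    col_idx = np.arange(shape[1]) * image.shape[1] // shape[1]
    return image[np.ix_(row_idx, col_idx)]

# Toy "letter": a 6x9 block of ink inside a 20x20 canvas.
sample = np.zeros((20, 20))
sample[5:11, 3:12] = 1.0

letter = crop_to_content(sample)             # smaller than 20x20 after cropping
resized = resize_nearest(letter, (20, 20))   # back to the size the network expects
print(letter.shape, resized.shape)
```

The cropped letter no longer matches the input size the classifier was built for, which is why every extracted sub-image is scaled back to 20 pixels by 20 pixels before training.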

