10.11.2016 Views

Learning Data Mining with Python

Beating CAPTCHAs with Neural Networks

The next step in our algorithm for beating these CAPTCHAs involves segmenting the word to discover each of the letters within it. To do this, we are going to create a function that finds contiguous sections of black pixels in the image and extracts them as sub-images. These are (or at least should be) our letters.

First, we import the label and regionprops functions, which we will use in this function:

from skimage.measure import label, regionprops

Our function will take an image and return a list of sub-images, where each sub-image is a letter from the original word in the image:

def segment_image(image):

The first thing we need to do is detect where each letter is. To do this, we will use the label function in scikit-image, which finds connected sets of pixels that have the same value. This is analogous to the connected component discovery in Chapter 7, Discovering Accounts to Follow Using Graph Mining.

The label function takes an image and returns an array of the same shape as the original. In the returned array, each connected region is marked with its own number, and pixels that are not in a connected region have the value 0. The code is as follows:

    labeled_image = label(image > 0)
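To see what label returns, here is a small standalone sketch, separate from the segment_image function being built here. The array toy_image and the variable names are made up for illustration; only label itself comes from scikit-image.

```python
import numpy as np
from skimage.measure import label

# Hypothetical toy "image": two separate blobs of non-zero pixels
# on a background of zeros.
toy_image = np.array([
    [0, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 0, 1],
    [0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 1],
])

# Same call pattern as in the text: threshold first, then label.
labeled_toy = label(toy_image > 0)

# The result has the same shape as the input. Background pixels are 0,
# and each connected blob carries its own positive integer.
n_regions = labeled_toy.max()
print(labeled_toy)
print("Number of regions found:", n_regions)
```

Printing labeled_toy shows the two blobs filled with different integers, which is what lets the later steps treat each letter separately.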

We will extract each of these sub-images and place them into a list:

    subimages = []

The scikit-image library also contains a function for extracting information about these regions: regionprops. We can iterate over these regions and work on each individually:

    for region in regionprops(labeled_image):

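From here, we can query the region object for information about the current region. For our algorithm, we need the starting and ending coordinates of the current region. These come from the region's bbox attribute, which scikit-image returns in (min_row, min_col, max_row, max_col) order, so the x variables below hold row indices and the y variables hold column indices: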

        start_x, start_y, end_x, end_y = region.bbox
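As a rough illustration of how bounding boxes from regionprops can be turned into sub-images, the sketch below crops each region out of a toy array. The cropping step is not shown in this passage, so treat it as an assumption about one reasonable way to use the coordinates. The names toy_image and extract_regions are hypothetical.

```python
import numpy as np
from skimage.measure import label, regionprops

# Hypothetical toy image with two rectangular blobs.
toy_image = np.zeros((5, 8), dtype=float)
toy_image[1:4, 1:3] = 1.0  # first blob: rows 1-3, columns 1-2
toy_image[0:5, 5:7] = 1.0  # second blob: rows 0-4, columns 5-6


def extract_regions(image):
    """Return one cropped sub-image per connected region (illustrative helper)."""
    labeled = label(image > 0)
    crops = []
    for region in regionprops(labeled):
        # bbox is (min_row, min_col, max_row, max_col); the max values are
        # exclusive, so they can be used directly as slice end points.
        min_row, min_col, max_row, max_col = region.bbox
        crops.append(image[min_row:max_row, min_col:max_col])
    return crops


crops = extract_regions(toy_image)
shapes = sorted(crop.shape for crop in crops)
print(shapes)
```

Because the upper bounds in bbox are exclusive, no off-by-one adjustment is needed when slicing.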
