10.11.2016 Views

Learning Data Mining with Python


Beating CAPTCHAs with Neural Networks

Interpreting information contained in images has long been a difficult problem in data mining, but it is one that is now starting to be addressed. Recent research has produced algorithms that detect and understand images well enough that major vendors now use automated commercial surveillance systems in real-world scenarios. These systems are capable of understanding and recognizing objects and people in video footage.

It is difficult to extract information from images. There is a lot of raw data in an image, and the standard method for encoding images, pixels, isn't that informative by itself. Images, particularly photos, can be blurry, too close to the targets, too dark, too light, scaled, cropped, skewed, or affected by any of a variety of other problems that cause havoc for a computer system trying to extract useful information.
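
To make the point about pixels concrete, here is a minimal sketch in plain Python. The tiny 5x5 image, the vertical bar standing in for a letter, and the helper names `shift_right` and `fraction_of_pixels_changed` are all assumptions made for illustration; they are not part of the chapter's code. The sketch shows that moving the same shape by a single pixel changes a large share of the raw pixel values, even though a human would say the image content is unchanged.

```python
# A tiny 5x5 grayscale "image" as nested lists: 1 is ink, 0 is background.
# The vertical bar in column 1 stands in for a letter such as "I".
# (Illustrative data only; real images are far larger and noisier.)
WIDTH, HEIGHT = 5, 5
image = [[1 if col == 1 else 0 for col in range(WIDTH)] for _ in range(HEIGHT)]


def shift_right(img):
    """Return a copy of img moved one pixel right, padding the left edge with background."""
    return [[0] + row[:-1] for row in img]


def fraction_of_pixels_changed(a, b):
    """Return the fraction of pixel positions whose values differ between a and b."""
    total = sum(len(row) for row in a)
    changed = sum(
        pa != pb
        for row_a, row_b in zip(a, b)
        for pa, pb in zip(row_a, row_b)
    )
    return changed / total


shifted = shift_right(image)

# Same shape, same amount of ink, yet 10 of the 25 pixel values differ.
print(fraction_of_pixels_changed(image, shifted))  # 0.4
```

A comparison based on raw pixel positions would treat these two images as quite different, which is one reason a learned model is used rather than direct pixel matching.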

In this chapter, we look at extracting text from images by using neural networks to predict each letter. The problem we are trying to solve is to automatically understand CAPTCHA messages. CAPTCHAs are images designed to be easy for humans to solve and hard for a computer to solve, as per the acronym: Completely Automated Public Turing test to tell Computers and Humans Apart. Many websites use them for registration and commenting systems to stop automated programs from flooding their site with fake accounts and spam comments.

The topics covered in this chapter include:<br />

• Neural networks<br />

• Creating our own dataset of CAPTCHAs and letters<br />

• The scikit-image library for working with image data

• The PyBrain library for neural networks<br />

