10.11.2016 Views

Learning Data Mining with Python


Chapter 8

Summary

In this chapter, we worked with images in order to use simple pixel values to
predict the letter being portrayed in a CAPTCHA. Our CAPTCHAs were a bit
simplified; we only used complete four-letter English words. In practice, the problem
is much harder, as it should be! With some improvements, it would be possible to
solve much harder CAPTCHAs with neural networks and a methodology similar to
what we discussed. The scikit-image library contains lots of useful functions for
extracting shapes from images, functions for improving contrast, and other image
tools that will help.

We took our larger problem of predicting words and created a smaller, simpler
problem of predicting letters. From here, we were able to create a feed-forward
neural network to accurately predict which letter was in the image. At this
stage, our results were very good, with 97 percent accuracy.

Neural networks are simply connected sets of neurons, which are basic computation
devices consisting of a single function. However, when you connect these together,
they can solve incredibly complex problems. Neural networks are the basis for deep
learning, which is one of the most effective areas of data mining at the moment.
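
To make the idea of a neuron as a single function concrete, here is a minimal sketch in plain Python. It is not the network built in this chapter: the sigmoid activation, the function names, and the hand-picked weights are all assumptions chosen for illustration.

```python
import math


def neuron(inputs, weights, bias):
    """One neuron: a weighted sum of the inputs passed through a sigmoid."""
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-total))


def layer(inputs, weight_rows, biases):
    """A layer is a list of neurons that all read the same inputs."""
    return [neuron(inputs, w, b) for w, b in zip(weight_rows, biases)]


def feed_forward(inputs, layers):
    """Pass the inputs through each layer in turn, with no loops back."""
    activations = inputs
    for weight_rows, biases in layers:
        activations = layer(activations, weight_rows, biases)
    return activations


# Hypothetical, hand-picked weights: 2 inputs -> 2 hidden neurons -> 1 output.
hidden = ([[0.5, -0.5], [1.0, 1.0]], [0.0, -1.0])
output = ([[1.0, -1.0]], [0.0])
result = feed_forward([1.0, 0.0], [hidden, output])
print(result)
```

Each neuron on its own does very little; the behavior of the network comes from how the outputs of one layer feed the next. In a real network the weights are learned from training data rather than written by hand.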

Despite our great per-letter accuracy, performance drops to just over 50 percent
when trying to predict a whole word. Several factors contributed to this,
reflecting the difficulty of taking a problem from an experiment to the real world.
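
As a rough sanity check, the short calculation below estimates what word-level accuracy would look like if each of the four letters were predicted independently at 97 percent. The independence assumption and the function name are ours, not something measured in the chapter.

```python
def word_accuracy_if_independent(letter_accuracy, word_length):
    """Chance that every letter is right, assuming independent letter errors."""
    return letter_accuracy ** word_length


estimate = word_accuracy_if_independent(0.97, 4)
print(round(estimate, 3))  # 0.885
```

Under that assumption we would expect roughly 88 percent of words to be correct. The observed figure of just over 50 percent is well below this, which is consistent with more than one factor being at work in the full word-prediction pipeline.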

We improved our accuracy using a dictionary, searching for the best matching
word. To do this, we considered the commonly used edit distance; however, we
simplified it because we were only concerned with individual mistakes on letters,
not insertions or deletions. This improvement netted some benefit, but there are
still many improvements you could try to further boost the accuracy.
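
The dictionary lookup described above can be sketched as follows. This is an illustrative version, not necessarily the code used in the chapter: the function names and the tiny word list are hypothetical, and the distance simply counts the positions at which two equal-length words differ.

```python
def substitution_distance(predicted, candidate):
    """Count the positions where two equal-length words differ.

    Insertions and deletions are ignored on purpose, so the words
    must have the same length.
    """
    if len(predicted) != len(candidate):
        raise ValueError("words must have the same length")
    return sum(a != b for a, b in zip(predicted, candidate))


def closest_word(predicted, dictionary):
    """Return the same-length dictionary word with the fewest mismatches."""
    candidates = [w for w in dictionary if len(w) == len(predicted)]
    if not candidates:
        return predicted  # nothing to compare against, keep the prediction
    return min(candidates, key=lambda w: substitution_distance(predicted, w))


# Hypothetical stand-in for a real list of four-letter words.
words = ["BARK", "BACK", "TREE", "WORD"]
print(closest_word("WURD", words))  # WORD
```

When two dictionary words are equally close, `min` returns whichever appears first in the list, so a real implementation may want a better tie-breaking rule, such as using the network's per-letter confidence.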

In the next chapter, we will continue with string comparisons. We will attempt
to determine which author (out of a set of authors) wrote a particular
document, using only the content and no other information!
