10.11.2016 Views

Learning Data Mining with Python

You also want an ePaper? Increase the reach of your titles

YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.

Chapter 8<br />

However, it isn't very good. We don't really expect letters to be moved around, just<br />

individual letter comparisons to be wrong. For this reason, we will create a different<br />

distance metric, which is simply the number of letters in the same positions that are<br />

incorrect. The code is as follows:<br />

def compute_distance(prediction, word):<br />

return len(prediction) - sum(prediction[i] == word[i] for i in<br />

range(len(prediction)))<br />

We subtract the value from the length of the prediction word (which is four) to make<br />

it a distance metric where lower values indicate more similarity between the words.<br />

Putting it all together<br />

We can now test our improved prediction function using similar code to before.<br />

First we define a prediction, which also takes our list of valid words:<br />

from operator import itemgetter<br />

def improved_prediction(word, net, dictionary, shear=0.2):<br />

captcha = create_captcha(word, shear=shear)<br />

prediction = predict_captcha(captcha, net)<br />

prediction = prediction[:4]<br />

Up to this point, the code is as before. We do our prediction and limit it to the first<br />

four characters. However, we now check if the word is in the dictionary or not. If it<br />

is, we return that as our prediction. If it is not, we find the next nearest word. The<br />

code is as follows:<br />

if prediction not in dictionary:<br />

We compute the distance between our predicted word and each other word in the<br />

dictionary, and sort it by distance (lowest first). The code is as follows:<br />

distances = sorted([(word, compute_distance(prediction, word))<br />

for word in dictionary],<br />

key=itemgetter(1))<br />

We then get the best matching word—that is, the one <strong>with</strong> the lowest distance—and<br />

predict that word:<br />

best_word = distances[0]<br />

prediction = best_word[0]<br />

We then return the correctness, word, and prediction as before:<br />

return word == prediction, word, prediction<br />

[ 181 ]

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!