13.07.2015 Views

Download - The Bastards Book of Regular Expressions



Cleaning up OCR Text (TODO)

Image scanners and optical-character-recognition software³³ ease the process of turning paper into digital text. And as we learned in previous chapters, data is just text with a certain structure. But the challenge of scanned text is that the conversion is messy.

Todo: Example image

Scenario

When using Tesseract (or an OCR program of your choice) on a scanned image, you’ll almost always have imperfect translations. A common problem might involve numbers and letters that look-alike, such as the lower-case-“L” and the number 1.

We’ll use regexes to quickly identify problematic character-groupings, such as numbers in the middle of words (e.g. he1l0).

Note: This is only a proof of concept. If you’re digitizing large batches of documents, you’ll be writing automated scripts with a variety of regex and parsing techniques. This chapter is not meant to imply that you can deal with this problem with your trusty text-editor alone.

Prerequisites

Todo:

Finding misplaced symbols inside letters.

• Scan a text
• Tesseract it
• highlight all words that fit: (\b[A-Za-z]+[^ A-Za-z'\-]+[A-Za-z]+)
• find all numbers that have a letter in the middle of them

³³http://en.wikipedia.org/wiki/Optical_character_recognition
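The last two steps in the list above can be sketched in a few lines of Python. This is only an illustration, not the chapter's finished method. The first pattern is the one given in the list. The second pattern, the function name find_suspects, and the sample sentence are assumptions made up for this example, since the chapter does not yet say how to find "numbers that have a letter in the middle of them".

```python
import re

# Pattern from the list above: a run of letters, then one or more characters
# that are not a space, letter, apostrophe, or hyphen, then letters again.
SYMBOL_IN_WORD = re.compile(r"(\b[A-Za-z]+[^ A-Za-z'\-]+[A-Za-z]+)")

# Assumed pattern (not from the chapter): digits, then letters, then digits,
# as one interpretation of "numbers that have a letter in the middle".
LETTER_IN_NUMBER = re.compile(r"\b\d+[A-Za-z]+\d+\b")


def find_suspects(text):
    """Return the suspicious tokens found by each pattern (hypothetical helper)."""
    return {
        "symbol_in_word": SYMBOL_IN_WORD.findall(text),
        "letter_in_number": LETTER_IN_NUMBER.findall(text),
    }


# Made-up OCR output for illustration.
sample = "He said he1lo to 10l7 people; it's a well-known problem."
print(find_suspects(sample))
# → {'symbol_in_word': ['he1lo'], 'letter_in_number': ['10l7']}
```

Two things to keep in mind with the first pattern. It ends at the last run of letters, so the example he1l0 is reported as he1l, without the trailing zero. Its negated character class also matches punctuation and line breaks, so text such as "end.Start" or a word followed by a newline and another word would be flagged too. Expect to review the matches by hand.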
