13.07.2015 Views

Download - The Bastards Book of Regular Expressions



Data Cleaning with the Stars

Anyone can do smart analyses when the data is nicely organized and delivered in a spreadsheet. But when it's just straight text, that's where regexes separate the resourceful from the poor chumps who try to clean up the text by hand, or give up altogether.

This chapter contains a few examples of dirty data and how regexes can get them ready for the spreadsheet.

Normalized alphabetical titles

Even alphabetical sorting can be a pain if you're dealing with titles. The definite and indefinite articles ("the" and "a" or "an", respectively) will give us a disproportionate number of titles in the "A" and "T" parts of the alphabet:

A BEAUTIFUL MIND
A FISTFUL OF DOLLARS
AN AMERICAN IN PARIS
STAR TREK II: THE WRATH OF KHAN
THE APARTMENT
THE BIG LEBOWSKI
THE GODFATHER
THE GOOD, THE BAD, AND THE UGLY
THE GREAT GATSBY
THE KING AND I
THE LORD OF THE RINGS: THE RETURN OF THE KING
THE SHAWSHANK REDEMPTION
THE WRESTLER

In order for us to sort these titles alphabetically, we need to remove definite and indefinite articles (http://en.wikipedia.org/wiki/Article_(grammar)) from the titles, e.g. "THE" and "A/AN".

A simple find-and-delete for the word THE won't work, because it also strips the articles from the middle of a title:

LORD OF RINGS: RETURN OF KING

We need to remove the articles only if they appear at the beginning of the title. And, moreover, we need to append them to the end of the title, with a comma:
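A minimal sketch of that transformation, assuming Python's `re` module rather than the find-and-replace dialog of a text editor. The names `LEADING_ARTICLE` and `normalize_title` are hypothetical, chosen only for this illustration, and the pattern assumes the titles are in upper case, as in the list above.

```python
import re

# Hypothetical pattern (not from the original text):
#   ^            anchor at the start of the title
#   (THE|AN|A)   capture a leading article
#   \s+          require whitespace, so AMERICAN BEAUTY is not treated as "A" + "MERICAN"
#   (.+)$        capture the rest of the title
LEADING_ARTICLE = re.compile(r"^(THE|AN|A)\s+(.+)$")


def normalize_title(title: str) -> str:
    """Move a leading article to the end of the title, after a comma."""
    # \2 is the rest of the title, \1 is the captured article.
    return LEADING_ARTICLE.sub(r"\2, \1", title)


titles = [
    "A BEAUTIFUL MIND",
    "A FISTFUL OF DOLLARS",
    "AN AMERICAN IN PARIS",
    "STAR TREK II: THE WRATH OF KHAN",
    "THE APARTMENT",
    "THE BIG LEBOWSKI",
    "THE LORD OF THE RINGS: THE RETURN OF THE KING",
]

# Normalize first, then sort, so titles are ordered by their first real word.
normalized = sorted(normalize_title(t) for t in titles)

for t in normalized:
    print(t)
```

Because the pattern is anchored with `^`, only an article at the very beginning of a title is moved; articles in the middle of a title, such as the ones in STAR TREK II: THE WRATH OF KHAN, are left alone.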
