13.07.2015 Views

Download - The Bastards Book of Regular Expressions


Given this list of cities and postal codes:

TODO

New York, NY 10006

Convert it to this CSV format:

TODO

"Hey," you might say, "there's already a comma in that data." True, but it's just typical punctuation. If we were to open this list in Excel, we would end up with:

TODO:

So we need to use a simple regex to at least separate the state from the zip code.

Answer

Find      (.+?), ([A-Z]{2}) (\d{5})
Replace   \1,\2,\3

Exercise: More complex addresses

Believe it or not, the easily-fixed scenario above is one that I've seen keep people from making perfectly usable, explorable data out of text.

However, for most kinds of text lists, the cleanup is a little more sophisticated than one extra comma. Here's an example in which we have to deal with street names and addresses:

50 Fifth Ave. New York, NY 10012
100 Ninth Ave. Brooklyn, NY 11416
9 Houston St. Juneau, AK 99999
2800 Springfield Rd. Omaha, NE 55555

Change to:

50,Fifth Ave.,New York,NY,10012
100,Ninth Ave.,Brooklyn,NY,11416
9,Houston St.,Juneau,AK,99999
2800,Springfield Rd.,Omaha,NE,55555

Answer

This is simply breaking each part of the line into its own separate pattern:

1. Street number: consecutive digits at the beginning of the line
2. Street name: A combination of word characters and spaces until a literal period is reached.
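
For readers who would rather test these substitutions in a script than in a text editor, here is a minimal Python sketch. The first pattern is the Find/Replace pair from the city exercise. The second pattern is only an assumption: it is one possible way to capture each part of the street-address lines, pieced together from the sample data, and it is not the book's own answer. The function names are made up for illustration.

```python
import re

# Find/Replace pair from the first exercise:
# city (lazy match), two-letter state, five-digit ZIP code.
CITY_STATE_ZIP = re.compile(r"(.+?), ([A-Z]{2}) (\d{5})")

# Assumed pattern for the street-address exercise (not taken from the book):
#   (\d+)        street number: digits at the start of the line
#   ([\w ]+\.)   street name: word characters and spaces up to a literal period
#   (.+?)        city, up to the comma
#   ([A-Z]{2})   state
#   (\d{5})      ZIP code
STREET_ADDRESS = re.compile(
    r"^(\d+) ([\w ]+\.) (.+?), ([A-Z]{2}) (\d{5})$",
    re.MULTILINE,  # let ^ and $ match at every line, not just the whole string
)


def city_to_csv(text):
    """Turn 'City, ST 12345' into 'City,ST,12345'."""
    return CITY_STATE_ZIP.sub(r"\1,\2,\3", text)


def address_to_csv(text):
    """Turn '50 Fifth Ave. City, ST 12345' lines into five CSV fields."""
    return STREET_ADDRESS.sub(r"\1,\2,\3,\4,\5", text)


addresses = """50 Fifth Ave. New York, NY 10012
100 Ninth Ave. Brooklyn, NY 11416
9 Houston St. Juneau, AK 99999
2800 Springfield Rd. Omaha, NE 55555"""

print(city_to_csv("New York, NY 10006"))  # New York,NY,10006
print(address_to_csv(addresses))
```

Lines that do not fit the assumed address shape (for example, a street name with no trailing period) are left unchanged by `address_to_csv`, which makes leftovers easy to spot after a pass.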
