13.07.2015 Views

Download - The Bastards Book of Regular Expressions



Data Cleaning with the Stars

[GMNPRUX][ACGR]?(?:-\d{2})?

Finally, we have the title. This seems like the trickiest part, since titles can be of variable length and contain any type of character, including punctuation. However, this should work for our purposes:

^.+?

Why does this work? The .+? regex seems like it would inadvertently scoop up non-title parts of the line. Luckily for us, the rest of the line's data points are more or less consistent. Whatever the dot-plus swallows, it has to leave a minimum of structured fields in the rightmost part of the pattern.

So even if we have a movie named "X X 9 1999", with a rating of "R", a user rating of "8", and a production year of "2000":

X X 9 1999 R 8 2000

The title-capturing part of the regex would extend through 1999 and stop there, because it has to leave at least one rating-type pattern (R), one user-rating-type pattern (8), and a four-digit number at the very end (2000).

All together now:

Find ^(.+?) ([GMNPRUX][ACGR]?(?:-\d{2})?) (\d\.?\d?) (\d{4})$
Replace \1\t\2\t\3\t\4

Or, if you prefer comma delimited style: "\1","\2","\3","\4"

"The Godfather: Part II","R","9","1974"
"Pulp Fiction","R","8.9","1994"
"8½","NR","8.1","1963"
"The Good, the Bad and the Ugly","R","8.9","1966"
"12 Angry Men","PG","8.9","1957"
"The Dark Knight","PG-13","8.9","2008"
"1984","R","7.1","1984"
"M","NR","8.5","1931"
"Nosferatu","U","8","1922"
"Schindler's List","R","8.9","1993"
"Midnight Cowboy","X","8","1969"
"Fight Club","R","8.8","1999"
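To see the lazy title match in action outside a text editor, here is a minimal Python sketch of the same find-and-replace. It assumes each input line is laid out as title, rating, user rating, and year, separated by single spaces, as in the "X X 9 1999 R 8 2000" example. The names MOVIE_LINE, split_fields, to_tsv, and to_csv are hypothetical helpers made up for this illustration, and the book itself works in a text editor rather than in Python.

```python
import re

# Same pattern as the Find expression: lazy title, rating, user rating, year.
# Assumption: fields are separated by single spaces.
MOVIE_LINE = re.compile(
    r'^(.+?) ([GMNPRUX][ACGR]?(?:-\d{2})?) (\d\.?\d?) (\d{4})$'
)

def split_fields(line):
    """Return (title, rating, user_rating, year), or None if no match."""
    match = MOVIE_LINE.match(line)
    return match.groups() if match else None

def to_tsv(line):
    """Tab-delimited replacement, one tab between each captured group."""
    return MOVIE_LINE.sub(r'\1\t\2\t\3\t\4', line)

def to_csv(line):
    """Comma-delimited replacement with each field wrapped in quotes."""
    return MOVIE_LINE.sub(r'"\1","\2","\3","\4"', line)

if __name__ == "__main__":
    # The lazy title has to grow until the last three fields fit on the right.
    print(split_fields("X X 9 1999 R 8 2000"))
    print(to_csv("The Dark Knight PG-13 8.9 2008"))
```

In the tricky example, the title cannot end at the first "X": although the second "X" and the "9" look like a rating and a user rating, "1999" is then not at the end of the line, so the engine backtracks and keeps extending the title until the remaining three fields line up with the end anchor.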
