Download - The Bastards Book of Regular Expressions

Switching visualizations (TODO)

If visualizations are made from data and data made from text, then regular expressions should be able to play a part in creating data visualizations.

This chapter demonstrates how regexes can be used to port your data from one visualization option to another, including Excel, Google Charts, and R.

A visualization in Excel

From Excel to Google Static Chart

From Google Static Charts to Google Interactive Charts
Cleaning up OCR Text (TODO)

Image scanners and optical-character-recognition software³³ ease the process of turning paper into digital text. And as we learned in previous chapters, data is just text with a certain structure.

But the challenge of scanned text is that the conversion is messy.

Todo: Example image

Scenario

When using Tesseract (or an OCR program of your choice) on a scanned image, you’ll almost always have imperfect translations. A common problem involves numbers and letters that look alike, such as the lowercase “L” and the number 1.

We’ll use regexes to quickly identify problematic character groupings, such as numbers in the middle of words (e.g. he1l0).

Note: This is only a proof of concept. If you’re digitizing large batches of documents, you’ll be writing automated scripts with a variety of regex and parsing techniques. This chapter is not meant to imply that you can deal with this problem with your trusty text editor alone.

Prerequisites

Todo:

Finding misplaced symbols inside letters

• Scan a text
• Run Tesseract on it
• Highlight all words that fit: (\b[A-Za-z]+[^ A-Za-z'\-]+[A-Za-z]+)
• Find all numbers that have a letter in the middle of them

³³ http://en.wikipedia.org/wiki/Optical_character_recognition
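The two highlighting steps listed for this chapter could be sketched in Python roughly as follows. This is only an illustration under assumptions: the first pattern is the one given in the text (with the caret written as a plain `^`), while the second pattern, the function name `find_suspects`, and the sample sentence are all made up here and are not from the book.

```python
import re

# Pattern from the chapter: letters, then one or more characters that are
# not a space, letter, apostrophe, or hyphen, then letters again.
SYMBOL_IN_WORD = re.compile(r"(\b[A-Za-z]+[^ A-Za-z'\-]+[A-Za-z]+)")

# Assumed pattern (not from the book) for "numbers that have a letter in
# the middle of them": digits, then letters, then digits.
LETTER_IN_NUMBER = re.compile(r"\b\d+[A-Za-z]+\d+\b")


def find_suspects(text):
    """Return words and numbers that look like OCR misreads (hypothetical helper)."""
    return {
        "symbol_in_word": SYMBOL_IN_WORD.findall(text),
        "letter_in_number": LETTER_IN_NUMBER.findall(text),
    }


# Made-up sample line standing in for OCR output.
sample = "He said he1lo to the wor1d in 19l4, and don't worry."
print(find_suspects(sample))
```

Because the negated set in the first pattern excludes only the space character, two words joined by a newline or by punctuation with no following space (such as "end.Next") would also be flagged, so real OCR output would likely need a stricter pattern or line-by-line processing.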