Lookarounds 117

Mixed commas and other delimiters

Again, just to hammer home the point: data is just text, with structure. Why does that structure have to be defined with commas? It doesn’t, so good for you for realizing that.

We can use basically any symbol to structure our data. Tab-separated values, a.k.a. TSV, is another popular format. In fact, when you copy and paste from an HTML table, such as this Wikipedia HTML chart, you’ll get:

TKTK

And most modern spreadsheet programs will automatically parse pasted TSV text into columns. Copying and pasting the above text will get you this in Google Docs:

TKTK

Heck, you can just copy and paste directly from the webpage into the spreadsheet:

TKTK

Collisions

The reason why most data providers don’t use just “any” symbol to delimit data, though, is a practical one. What happens if you use the letter a as a delimiter, but your data includes lots of a characters naturally?

You can do it, but it’s not pretty.

But we don’t have to dream up that scenario; we already have that problem with comma delimiters. Consider this example list:

6,300 Apples from New York, NY $15,230
4,200 Oranges from Miami, FL $20,112

There are commas in the actual data, because they’re used as a grammatical convention: 6300, for example, is written as 6,300.

In this case, we don’t want to use commas as a delimiter. The pipe character, |, is a good candidate because it doesn’t typically appear in this kind of list.

We can delimit this list by using this pattern:

Find ^([\d,]+) (\w+) from ([\w ]+), ([A-Z]{2}) (\$[\d,]+)
Replace \1|\2|\3|\4|\5

And we end up with:
6,300|Apples|New York|NY|$15,230
4,200|Oranges|Miami|FL|$20,112

Exercise: Someone else’s comma-mess

Let’s pretend that someone less enlightened than us tried to do the above exercise with comma delimiters. They would end up with:

6,300,Apples,New York,NY,$15,230
4,200,Oranges,Miami,FL,$20,112

Which, when you open it in Excel as CSV, looks predictably like nonsense:

[Image: The result of too many commas in this CSV file]

So we need to fix this mess by converting only the commas meant as delimiters into pipe symbols (or a delimiting character of your choice; the @ or tab character would work in this case).

Answer

Well, we obviously can’t just do a simple Find-and-Replace affecting all commas. We need to affect only some of the commas.

Which ones? In this exercise, it’s easier to look at the commas we don’t want to replace:

6,300
$15,230
4,200
$20,112

So if the comma is followed by a number, we don’t want to replace it. Put the other way around: every comma followed by a non-digit character is a delimiter.

There are multiple ways to do this; here’s how to do it with capturing groups and a negative character set:

Find ,([\D])
Replace |\1
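If you want to try both patterns outside a text editor, here is a minimal Python sketch using the standard `re` module. The function and variable names are made up for illustration, and the lookahead version at the end is an alternative that the answer above does not give; treat it as one possible approach rather than the book's method.

```python
import re

# The pipe-delimiting pattern from this section:
# quantity, item, city, state, dollar amount.
# re.MULTILINE makes ^ match at the start of every line.
LIST_PATTERN = re.compile(
    r"^([\d,]+) (\w+) from ([\w ]+), ([A-Z]{2}) (\$[\d,]+)",
    re.MULTILINE,
)


def delimit_with_pipes(text: str) -> str:
    """Rewrite each list line as five pipe-separated fields."""
    return LIST_PATTERN.sub(r"\1|\2|\3|\4|\5", text)


def fix_comma_mess(text: str) -> str:
    """Replace only commas followed by a non-digit, keeping that character."""
    return re.sub(r",([\D])", r"|\1", text)


def fix_comma_mess_lookahead(text: str) -> str:
    """Assumed alternative: a positive lookahead, so nothing needs re-inserting."""
    return re.sub(r",(?=\D)", "|", text)


raw_list = (
    "6,300 Apples from New York, NY $15,230\n"
    "4,200 Oranges from Miami, FL $20,112"
)
comma_mess = (
    "6,300,Apples,New York,NY,$15,230\n"
    "4,200,Oranges,Miami,FL,$20,112"
)

print(delimit_with_pipes(raw_list))
# 6,300|Apples|New York|NY|$15,230
# 4,200|Oranges|Miami|FL|$20,112

print(fix_comma_mess(comma_mess))
# 6,300|Apples|New York|NY|$15,230
# 4,200|Oranges|Miami|FL|$20,112
```

Both fixes rely on the same assumption as the answer: in this data, a comma inside a number is always followed by a digit, and a delimiter comma never is. A field that begins with a digit would break that assumption.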