04.08.2014 Views


CHAPTER 10 ■ BATTERIES INCLUDED 237

SPECIAL CHARACTERS IN CHARACTER SETS

In general, special characters such as dots, asterisks, and question marks have to be escaped with a backslash if you want them to appear as literal characters in the pattern, rather than function as regexp operators. Inside character sets, escaping these characters is generally not necessary (although perfectly legal). You should, however, keep in mind the following rules:

• You do have to escape the caret (^) if it appears at the beginning of the character set, unless you want it to function as a negation operator. (In other words, don’t place it at the beginning unless you mean it.)
• Similarly, the right bracket (]) and the dash (-) must be put either at the beginning of the character set or escaped with a backslash. (Actually, the dash may also be put at the end, if you wish.)
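The rules above can be checked with a short sketch. It assumes Python 3 and the standard re module; the sample strings and variable names are made up for illustration.

```python
import re

# Inside a character set, the dot, asterisk, and question mark are
# literal characters even without a backslash.
found = re.findall(r'[.*?]', 'a.b*c?d')

# A caret at the beginning negates the set...
negated = re.findall(r'[^abc]', 'abcd^')
# ...while an escaped caret is just another member of the set.
literal_caret = re.findall(r'[\^abc]', 'abcd^')

# The right bracket is escaped here; the dash is placed at the end,
# so it cannot be read as part of a range.
bracket_and_dash = re.findall(r'[\]a-]', 'a]-b')

print(found)             # ['.', '*', '?']
print(negated)           # ['d', '^']
print(literal_caret)     # ['a', 'b', 'c', '^']
print(bracket_and_dash)  # ['a', ']', '-']
```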

Alternatives and Subpatterns

Character sets are nice when you let each letter vary independently, but what if you want to match only the strings 'python' and 'perl'? You can’t specify such a specific pattern with character sets or wildcards. Instead, you use the special character for alternatives: the “pipe” character (|). So, your pattern would be 'python|perl'.

However, sometimes you don’t want to use the choice operator on the entire pattern, just on a part of it. To do that, you enclose the part, or subpattern, in parentheses. The previous example could be rewritten as 'p(ython|erl)'. (Note that the term subpattern can also refer to a single character.)
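As a small illustration, the two forms of the pattern can be compared side by side. This is a sketch assuming Python 3 (re.fullmatch requires version 3.4 or later); the word list is invented for the example.

```python
import re

# The choice operator applied to the entire pattern...
whole = re.compile(r'python|perl')
# ...and applied only to the parenthesized subpattern.
grouped = re.compile(r'p(ython|erl)')

words = ['python', 'perl', 'pearl', 'ruby']

# fullmatch succeeds only if the entire string matches the pattern.
whole_hits = [w for w in words if whole.fullmatch(w)]
grouped_hits = [w for w in words if grouped.fullmatch(w)]

# group(1) holds the text matched by the parenthesized subpattern.
captured = grouped.fullmatch('perl').group(1)

print(whole_hits)    # ['python', 'perl']
print(grouped_hits)  # ['python', 'perl']
print(captured)      # erl
```

Both patterns accept the same strings. The grouped form additionally records which alternative was chosen.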

Optional and Repeated Subpatterns

By adding a question mark after a subpattern, you make it optional. It may appear in the matched string, but it isn’t strictly required. So, for example, the (slightly unreadable) pattern

r'(http://)?(www\.)?python\.org'

would match all of the following strings (and nothing else):

'http://www.python.org'
'http://python.org'
'www.python.org'
'python.org'

A few things are worth noting here:

• I’ve escaped the dots, to prevent them from functioning as wildcards.
• I’ve used a raw string to reduce the number of backslashes needed.
• Each optional subpattern is enclosed in parentheses.
• The optional subpatterns may appear or not, independently of each other.
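The pattern above can be tried out with a brief sketch, again assuming Python 3 and the standard re module. The candidate strings beyond the four listed in the text are invented to show what gets rejected. Note that re.fullmatch is used to model "and nothing else"; re.match only anchors at the start, so it would also accept longer strings that merely begin with a match.

```python
import re

url = re.compile(r'(http://)?(www\.)?python\.org')

candidates = [
    'http://www.python.org',
    'http://python.org',
    'www.python.org',
    'python.org',
    'ftp://python.org',   # different prefix, rejected
    'pythonxorg',         # the escaped dot must be a literal dot
]

# Keep only the strings that the pattern matches in their entirety.
accepted = [s for s in candidates if url.fullmatch(s)]

# match() anchors only at the beginning, so trailing text is tolerated.
prefix_only = url.match('python.org/about') is not None

print(accepted)
print(prefix_only)  # True
```

Here accepted contains exactly the first four candidates, which are the four strings listed in the text.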
