03.02.2014 Views

php|architect's Guide to Web Scraping with PHP - Wind Business ...


PCRE Extension

• To use a literal ^ character in a character range, either escape it in the same manner in which other meta-characters are escaped or do not use it as the first or only character in the range.
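The rule above can be sketched with a few `preg_match` calls. This is a minimal illustration rather than code from the book; the variable names are made up for the example.

```php
<?php
// Escaped caret: \^ is a literal ^ wherever it appears in the range.
$escapedFirst = preg_match('/[\^a]/', '^');

// Unescaped caret that is not the first character: also a literal ^.
$caretLast = preg_match('/[a^]/', '^');

// Unescaped caret in first position negates the range instead:
// [^a] means "any character except a", so it does not match "a".
$negated = preg_match('/[^a]/', 'a');

var_dump($escapedFirst, $caretLast, $negated);
```

The first two calls find a match and the third does not, because the leading unescaped ^ turns the range into a negated one.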


ctype Extension

Some simple patterns have equivalent functions available in the ctype library. These generally perform better and should be used over PCRE when appropriate. See http://php.net/ctype for more information on the ctype extension and the functions it offers.
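As a rough sketch of what such an equivalence looks like, the check "this string consists only of digits" can be written either way. The sample inputs and variable names are assumptions chosen for illustration.

```php
<?php
$input = '12345';

// PCRE version: one or more ASCII digits and nothing else.
// \z anchors at the very end of the string.
$viaPcre = preg_match('/^[0-9]+\z/', $input) === 1;

// ctype version: the same check for string input, without a pattern.
$viaCtype = ctype_digit($input);

// Both approaches reject a string that contains a non-digit.
$rejectedByPcre = preg_match('/^[0-9]+\z/', '12a45') === 1;
$rejectedByCtype = ctype_digit('12a45');

var_dump($viaPcre, $viaCtype, $rejectedByPcre, $rejectedByCtype);
```

Note that `ctype_digit` is meant for string arguments, so values read as integers would need to be cast to strings first.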

Modifiers

Pattern delimiters denote the start and end of a pattern because the pattern is followed by modifiers, which affect the matching behavior of meta-characters. Here are a few modifiers that may prove useful in web scraping applications.

• i: Any letters in the pattern will match both uppercase and lowercase regardless of the case of the letter used in the pattern.

• m: ^ and $ will match the beginnings and ends of lines within the string (delimited by line feed characters) rather than the beginning and end of the entire string.

• s (lowercase): The . meta-character will match line feeds, which it does not by default.

• S (uppercase): Additional time will be spent to analyze the pattern in order to speed up subsequent matches with that pattern. Useful for patterns used multiple times.

• U: By default, the quantifiers * and + behave in a manner referred to as "greedy." That is, they match as many characters as possible rather than as few as possible. This modifier forces the latter behavior.
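The effect of each modifier listed above can be seen by running the same pattern with and without it. The following is a minimal sketch; the sample strings and variable names are invented for illustration and do not come from the book.

```php
<?php
// i: letters in the pattern match regardless of case.
$caseInsensitive = preg_match('/php/i', 'PHP');

// m: ^ and $ match at line boundaries inside the string.
$lines = "one\ntwo\nthree";
$withoutM = preg_match_all('/^\w+$/', $lines);
$withM = preg_match_all('/^\w+$/m', $lines);

// s: the . meta-character also matches line feeds.
$html = "<p>first\nsecond</p>";
$withoutS = preg_match('/<p>.*<\/p>/', $html);
$withS = preg_match('/<p>.*<\/p>/s', $html);

// U: quantifiers match as few characters as possible.
$tags = '<b>x</b><b>y</b>';
preg_match('/<b>.*<\/b>/', $tags, $greedy);
preg_match('/<b>.*<\/b>/U', $tags, $lazy);

// S: only affects performance, so the match result is unchanged.
// Assumption worth verifying against the PHP manual: newer PHP
// versions built on PCRE2 may treat this modifier as a no-op.
$studied = preg_match('/php/S', 'php');

var_dump($withoutM, $withM, $withoutS, $withS, $greedy[0], $lazy[0]);
```

In a scraping context, s and U tend to matter most, since markup often spans several lines and a greedy `.*` can swallow everything up to the last closing tag on the page.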
