03.02.2014 Views

php|architect's Guide to Web Scraping with PHP - Wind Business ...

Chapter 14

PCRE Extension

There are some instances where markup documents may be so hideously malformed that they’re simply not usable by an XML extension. Other times, you may want to have a way to check the data you’ve extracted to ensure that it’s what you expect. Changes to the structure of markup documents may be significant, to the point where your CSS or XPath queries return no results. They may also be small and subtle, such that while you do get query results, they contain less or different data than intended.
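As a rough illustration of checking extracted data, the sketch below validates a value before it is used. The function name `looksLikePrice`, the sample values, and the expected price format are assumptions made for this example, not something taken from a particular site.

```php
<?php
/**
 * Returns true when $value looks like a US-style price such as "$1,299.99".
 * The expected format is an assumption chosen for illustration.
 */
function looksLikePrice(string $value): bool
{
    // A dollar sign, one to three digits, optional groups of ",ddd",
    // then a decimal point and exactly two digits.
    $pattern = '/^\$\d{1,3}(?:,\d{3})*\.\d{2}$/';

    // preg_match() returns 1 on a match, 0 on no match, false on error.
    return preg_match($pattern, trim($value)) === 1;
}

// Hypothetical values, as they might come back from an XPath or CSS query.
$extracted = ' $1,299.99 ';
if (!looksLikePrice($extracted)) {
    // The markup has probably changed; stop rather than store bad data.
    throw new UnexpectedValueException('Unexpected price format: ' . $extracted);
}
```

A check like this turns a subtle change in the source markup into an immediate, visible failure instead of silently stored bad data.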

While either of these tasks could be done with basic string handling functions and comparison operators, in most cases the implementation would prove to be messy and unreliable. Regular expressions provide a syntax consisting of meta-characters whereby patterns within strings are expressed flexibly and concisely. This chapter will deal with regular expressions as they relate to the Perl-Compatible Regular Expression (PCRE) PHP extension in particular.
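To give a sense of that conciseness, here is a minimal sketch that locates a date inside a longer string with a single pattern. The function name `findIsoDate` and the sample text are hypothetical; doing the same with `strpos()` and `substr()` would require several position checks and digit tests.

```php
<?php
/**
 * Finds the first YYYY-MM-DD date in $text.
 * Returns the parts as integers, or null when no date is present.
 */
function findIsoDate(string $text): ?array
{
    // \b marks a word boundary, \d{n} matches exactly n digits,
    // and each pair of parentheses captures one part of the date.
    if (preg_match('/\b(\d{4})-(\d{2})-(\d{2})\b/', $text, $match) !== 1) {
        return null;
    }

    return [
        'year'  => (int) $match[1],
        'month' => (int) $match[2],
        'day'   => (int) $match[3],
    ];
}

// Hypothetical text taken from a scraped page.
$date = findIsoDate('Posted on 2014-02-03 by admin');
```

Note that the pattern only checks the shape of the date; it does not verify that the month or day is in a valid range.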

A common bad practice is to use only regular expressions to extract data from markup documents. While this may work for simple scripts that are only intended to be used once or very few times in a short time period, it is more difficult to maintain and less reliable in the long term. Regular expressions simply were not designed for this purpose, whereas other markup-specific extensions discussed in previous chapters are more suited for the task. It is a matter of using the best tool for the job, and to that end, this practice should be avoided.
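One way to combine the two tools is sketched below: a markup-aware extension locates the nodes, and a regular expression is applied only to the text of each node. The function name `extractPrices`, the `price` class name, and the sample markup are assumptions for illustration, and the DOM extension is used here simply as one example of a markup-specific extension.

```php
<?php
/**
 * Returns the numeric part of every well-formed price found in
 * <span class="price"> elements of $html, as strings.
 */
function extractPrices(string $html): array
{
    $doc = new DOMDocument();

    // Collect libxml warnings about malformed markup instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Let XPath find the elements...
    $xpath = new DOMXPath($doc);
    $prices = [];
    foreach ($xpath->query('//span[@class="price"]') as $node) {
        // ...and let the regular expression validate and pick apart the text.
        $text = trim($node->textContent);
        if (preg_match('/^\$(\d+\.\d{2})$/', $text, $match) === 1) {
            $prices[] = $match[1];
        }
    }

    return $prices;
}

// Hypothetical markup standing in for a retrieved page.
$html = '<ul><li><span class="price">$19.99</span></li>'
      . '<li><span class="price">TBD</span></li></ul>';
$prices = extractPrices($html);
```

Here the second item is skipped because its text does not match the expected pattern, while the structure of the document is handled entirely by the parser.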
