php|architect's Guide to Web Scraping with PHP - Wind Business ...
Chapter 14
PCRE Extension

There are some instances where markup documents may be so hideously malformed
that they’re simply not usable by an XML extension. Other times, you may want
to have a way to check the data you’ve extracted to ensure that it’s what you expect.
Changes to the structure of markup documents may be significant, to the point
where your CSS or XPath queries return no results. They may also be small and subtle,
such that while you do get query results, they contain less or different data than
intended.

While either of these tasks could be done with basic string handling functions and
comparison operators, in most cases the implementation would prove to be messy
and unreliable. Regular expressions provide a syntax consisting of meta-characters
whereby patterns within strings are expressed flexibly and concisely. This chapter
will deal with regular expressions as they relate to the Perl-Compatible Regular Expression
(PCRE) PHP extension in particular.
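
As a rough illustration of the validation use case described above, the following sketch uses preg_match() to check whether an extracted string has an expected shape. The function name isExpectedPrice(), the sample values, and the price format itself (a dollar sign, digits, and two decimal places) are assumptions made for the example and are not taken from any particular site.

```php
<?php
// Hypothetical helper: the name and the expected format are illustrative.
function isExpectedPrice(string $value): bool
{
    // ^ and $ anchor the pattern so that the whole string must match:
    // a literal dollar sign, one or more digits, a literal dot, two digits.
    $pattern = '/^\$\d+\.\d{2}$/';

    // preg_match() returns 1 on a match, 0 on no match, false on error.
    return preg_match($pattern, $value) === 1;
}

// Hypothetical values standing in for data returned by an earlier query.
$extracted = ['$19.99', '19.99 USD', ''];

foreach ($extracted as $value) {
    $status = isExpectedPrice($value) ? 'ok' : 'unexpected';
    echo var_export($value, true) . ': ' . $status . PHP_EOL;
}
```

Doing the same check with string functions alone would mean separate tests for the leading character, the position of the decimal point, and the digits on either side of it, which is the kind of messiness the pattern avoids.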

A common bad practice is to use only regular expressions to extract data from
markup documents. While this may work for simple scripts that are only intended to
be used once or very few times in a short time period, it is more difficult to maintain
and less reliable in the long term. Regular expressions simply were not designed for
this purpose, whereas other markup-specific extensions discussed in previous chapters
are more suited for the task. It is a matter of using the best tool for the job, and
to that end, this practice should be avoided.
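
One way to picture this division of labor is sketched below: a markup-specific extension locates the nodes, and a regular expression only checks the text each node contains. The choice of the DOM extension, the sample markup, the XPath query, and the date format are all assumptions made for illustration.

```php
<?php
// Hypothetical markup standing in for a retrieved document.
$html = '<html><body><ul>'
      . '<li class="date">2009-03-14</li>'
      . '<li class="date">March 14th</li>'
      . '</ul></body></html>';

// Collect parser warnings internally instead of emitting them.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$valid = [];
$suspect = [];

// The XPath query handles extraction; the pattern handles validation only.
foreach ($xpath->query('//li[@class="date"]') as $node) {
    $text = trim($node->textContent);

    // Assumed expected format: four digits, two digits, two digits.
    if (preg_match('/^\d{4}-\d{2}-\d{2}$/', $text) === 1) {
        $valid[] = $text;
    } else {
        $suspect[] = $text;
    }
}

echo 'Valid: ' . implode(', ', $valid) . PHP_EOL;
echo 'Suspect: ' . implode(', ', $suspect) . PHP_EOL;
```

If the document structure changes so that the query starts returning different content, the values land in $suspect rather than silently passing through, which is the kind of subtle change a validation step is meant to catch.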