php|architect's Guide to Web Scraping with PHP - Wind Business ...
CSS Selector Libraries
Libraries
At this point, CSS selectors have been covered to the extent that all or a subset of
those supported by a given library are explained. This section will review some library
implementations that are available, where to find them, what feature set they
support, and some advantages and disadvantages of using them.
PHP Simple HTML DOM Parser
The major distinguishing trait of this library is its requirements: PHP 5 and the PCRE
extension (which is pretty standard in most PHP distributions). It has no external dependencies
on or associations with other libraries or extensions, not even the standard
XML extensions in PHP.
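
The requirements above can be verified at runtime before a library is chosen. The following is a minimal sketch rather than anything shipped with the library: the helper name missingRequirements() and the exact minimum version string are assumptions made for illustration.

```php
<?php
/**
 * Hypothetical helper (not part of any library): lists which of the
 * given requirements the running PHP installation fails to meet.
 */
function missingRequirements($minimumVersion, $extensions)
{
    $missing = array();

    // Compare the running PHP version against the required minimum.
    if (version_compare(PHP_VERSION, $minimumVersion, '<')) {
        $missing[] = 'PHP >= ' . $minimumVersion;
    }

    // Check each required extension by name.
    foreach ($extensions as $extension) {
        if (!extension_loaded($extension)) {
            $missing[] = 'ext-' . $extension;
        }
    }

    return $missing;
}

// PHP 5 plus PCRE, as described above; '5.0.0' is an assumed version string.
$missingForParser = missingRequirements('5.0.0', array('pcre'));

// Whether the DOM extension is present, for comparison with DOM-based libraries.
$domAvailable = extension_loaded('dom');
```

An empty array in $missingForParser indicates that the stated requirements are met.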
The implication of this is that all parsing is handled in PHP itself, which makes it
likely that performance will not be as good as that of libraries that build on a PHP extension.
However, in environments where XML extensions (in particular the DOM extension)
are not available (which is rare), this library may be a good option. It offers basic
retrieval support using PHP's filesystem functions (which require the configuration
setting allow_url_fopen to be enabled to access remote documents).
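
The dependency on allow_url_fopen can be checked before a remote document is requested. The following is a minimal sketch using only core PHP functions. It is not the library's own retrieval code, and the function names remoteFetchAllowed(), isRemoteSource(), and loadMarkup() are hypothetical.

```php
<?php
/**
 * Returns true when allow_url_fopen is enabled, meaning PHP's filesystem
 * functions may open remote URLs.
 */
function remoteFetchAllowed()
{
    return filter_var(ini_get('allow_url_fopen'), FILTER_VALIDATE_BOOLEAN);
}

/**
 * Returns true when the source looks like a remote URL rather than a local path.
 */
function isRemoteSource($source)
{
    return preg_match('#^(?:https?|ftp)://#i', $source) === 1;
}

/**
 * Loads markup from a local path or a remote URL with file_get_contents(),
 * failing early when remote access is disabled.
 */
function loadMarkup($source)
{
    if (isRemoteSource($source) && !remoteFetchAllowed()) {
        throw new RuntimeException('allow_url_fopen is disabled; cannot fetch ' . $source);
    }

    $markup = @file_get_contents($source);
    if ($markup === false) {
        throw new RuntimeException('Unable to read ' . $source);
    }

    return $markup;
}
```

The string returned by loadMarkup() could then be handed to whichever parsing library is in use.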
The documentation for this library is fairly good and can be found
at http://simplehtmldom.sourceforge.net/manual.htm. Its main web
site, which includes a link to download the library, is available at
http://simplehtmldom.sourceforge.net. It is licensed under the MIT License.
Zend_Dom_Query
One of the components of Zend Framework, this library was originally created to
provide a means for integration testing of applications based on the framework.
However, it can function independently and apart from the framework and provides
the functionality needed in the analysis phase of web scraping. At the time of this
writing, Zend Framework 1.10.1 requires PHP 5.2.4 or higher.
Zend_Dom_Query makes extensive use of the DOM extension. It supports XPath
through use of the DOM extension's DOMXPath class and handles CSS expressions
by transforming them into equivalent XPath expressions. Note that only CSS 2 is
supported, which excludes non-attribute filters.
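
The CSS-to-XPath approach described above can be illustrated with the DOM extension alone. The following is a minimal sketch, not Zend_Dom_Query's own code: the sample markup, the CSS selector div.content a[href], and the hand-written XPath translation are all assumptions chosen for illustration, and the expression the library actually generates may differ.

```php
<?php
// Sample markup, assumed for illustration only.
$markup = '<html><body>'
    . '<div class="content main"><a href="/one">One</a><a>No link</a></div>'
    . '<div class="sidebar"><a href="/two">Two</a></div>'
    . '</body></html>';

// Collect parser warnings internally instead of emitting them.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($markup);

$xpath = new DOMXPath($doc);

// Hand-translated equivalent of the CSS selector "div.content a[href]".
// The class test pads the attribute with spaces so that "content" matches
// as a whole word among several class names.
$expression = "//div[contains(concat(' ', normalize-space(@class), ' '), ' content ')]"
    . "//a[@href]";

$hrefs = array();
foreach ($xpath->query($expression) as $anchor) {
    $hrefs[] = $anchor->getAttribute('href');
}

// $hrefs now holds the href of each matching anchor.
```

Only the anchor inside the div carrying the content class and having an href attribute is selected, which is the behavior the CSS selector describes.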