03.02.2014 Views

php|architect's Guide to Web Scraping with PHP - Wind Business ...


CSS Selector Libraries

Libraries

At this point, CSS selectors have been covered in enough detail to explain all, or at least a subset, of those supported by any given library. This section will review some of the available library implementations, where to find them, what feature set they support, and some advantages and disadvantages of using them.

PHP Simple HTML DOM Parser

The major distinguishing trait of this library is its requirements: PHP 5 and the PCRE extension (which is pretty standard in most PHP distributions). It has no external dependencies on or associations with other libraries or extensions, not even the standard XML extensions in PHP.

The implication of this is that all parsing is handled in PHP itself, which makes it likely that performance will not be as good as libraries that build on a PHP extension. However, in environments where XML extensions (in particular the DOM extension) may not be available (which is rare), this library may be a good option. It offers basic retrieval support using PHP's filesystem functions (which require the configuration setting allow_url_fopen to be enabled to access remote documents).
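The dependency on allow_url_fopen can be checked before any retrieval is attempted. The following is a minimal sketch that uses only core PHP functions rather than the library itself; the helper names isRemoteLocation() and fetchDocument() are hypothetical and exist purely for illustration.

```php
<?php
// Hypothetical helper: true when $location looks like a remote URL
// rather than a path on the local filesystem.
function isRemoteLocation($location)
{
    return (bool) preg_match('#^(https?|ftp)://#i', $location);
}

// Hypothetical helper: loads a document with PHP's filesystem functions,
// refusing remote locations when allow_url_fopen is disabled.
function fetchDocument($location)
{
    // ini_get() returns a string, so normalize it to a boolean first
    $allowed = filter_var(ini_get('allow_url_fopen'), FILTER_VALIDATE_BOOLEAN);

    if (isRemoteLocation($location) && !$allowed) {
        throw new RuntimeException(
            'allow_url_fopen is disabled; cannot fetch ' . $location
        );
    }

    $contents = file_get_contents($location);
    if ($contents === false) {
        throw new RuntimeException('Unable to read ' . $location);
    }

    return $contents;
}
```

Local files are unaffected by the setting, so a check like this only matters when the document lives on a remote server.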

The documentation for this library is fairly good and can be found at http://simplehtmldom.sourceforge.net/manual.htm. Its main web site, which includes a link to download the library, is available at http://simplehtmldom.sourceforge.net. It is licensed under the MIT License.

Zend_Dom_Query

One of the components of Zend Framework, this library was originally created to provide a means for integration testing of applications based on the framework. However, it can function independently and apart from the framework and provides the functionality needed in the analysis phase of web scraping. At the time of this writing, Zend Framework 1.10.1 requires PHP 5.2.4 or higher.

Zend_Dom_Query makes extensive use of the DOM extension. It supports XPath through use of the DOM extension's DOMXPath class and handles CSS expressions by transforming them into equivalent XPath expressions. Note that only CSS 2 is supported, which excludes non-attribute filters.
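To make the idea of a CSS-to-XPath transformation concrete, the sketch below runs a hand-written XPath counterpart of the CSS selector `div.article a` through DOMXPath directly. The markup and the XPath expression are assumptions chosen for illustration; the expression is a common way to express a class match in XPath and is not necessarily the exact string Zend_Dom_Query generates.

```php
<?php
// Sample markup, made up for this illustration
$html = '<html><body>'
      . '<div class="article featured"><a href="/one">One</a></div>'
      . '<div class="sidebar"><a href="/two">Two</a></div>'
      . '</body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);

$xpath = new DOMXPath($doc);

// Hand-written XPath counterpart of the CSS selector "div.article a".
// Padding the class attribute with spaces makes "article" match as a
// whole class name and not as part of a longer one.
$expression = "//div[contains(concat(' ', normalize-space(@class), ' '), "
            . "' article ')]//a";

// Collect the href of every matching anchor
$hrefs = array();
foreach ($xpath->query($expression) as $node) {
    $hrefs[] = $node->getAttribute('href');
}
```

Here only the link inside the div carrying the `article` class is selected. A library that accepts CSS selectors spares the developer from writing and maintaining expressions like this one by hand.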
