03.02.2014 Views

php|architect's Guide to Web Scraping with PHP - Wind Business ...

DOM Extension

Types of Parsers

Before going much further, you should be aware that there are two types of XML parsers: tree parsers and pull parsers. Tree parsers load the entire document into memory and allow you to access any part of it at any time as well as manipulate it. Pull parsers read the document a piece at a time and limit you to working with the current piece being read.

The two types of parsers share a relationship similar to that between the file_get_contents and fgets functions: the former lets you work with the entire document at once and uses as much memory as is needed to store it, while the latter allows you to work with a piece of the document at a time and uses less memory in the process.
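The analogy can be made concrete with a minimal sketch. The temporary file and its three lines are hypothetical and exist only for this illustration; the point is that file_get_contents holds the whole file in one string, while fgets holds only the current line.

```php
<?php
// Hypothetical sample file, created only for this illustration.
$path = tempnam(sys_get_temp_dir(), 'scrape');
file_put_contents($path, "line one\nline two\nline three\n");

// Whole-document approach: the entire file is held in memory as one string.
$contents = file_get_contents($path);
$wholeLength = strlen($contents);

// Piece-at-a-time approach: only the current line is held in memory.
$lineCount = 0;
$longestLine = 0;
$handle = fopen($path, 'r');
while (($line = fgets($handle)) !== false) {
    $lineCount++;
    $longestLine = max($longestLine, strlen($line));
}
fclose($handle);
unlink($path);

echo "Whole file: $wholeLength bytes; largest single piece: $longestLine bytes\n";
```

With a file this small the difference is trivial, but the gap between the two figures grows with the size of the document.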

When working with fairly large documents, lower memory usage is generally the preferable option. Attempting to load a huge document into memory all at once has the same effect on the local system as an unthrottled client does on a web server: in both cases, resources are consumed and system performance degrades until the system eventually locks up or crashes under the stress.

The DOM extension is a tree parser. In general, web scraping does not require the ability to access all parts of the document simultaneously. However, the type of data extraction involved in web scraping can be rather involved to implement using a pull parser. The appropriateness of one extension over the other depends on the size and complexity of the document.
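To illustrate the difference in access patterns, here is a rough sketch that reads the same small document with both kinds of parser. It assumes PHP's XMLReader extension as the pull parser, which this section does not name, and the sample XML is hypothetical.

```php
<?php
// Hypothetical sample document, used only for illustration.
$xml = '<items><item>alpha</item><item>beta</item><item>gamma</item></items>';

// Tree parser (DOM): the whole document is loaded, so nodes can be
// visited in any order, such as the last item first and then the first.
$doc = new DOMDocument();
$doc->loadXML($xml);
$items = $doc->getElementsByTagName('item');
$last = $items->item($items->length - 1)->nodeValue;
$first = $items->item(0)->nodeValue;

// Pull parser (XMLReader, an assumption here): nodes arrive one at a
// time in document order, and only the current node is available.
$reader = new XMLReader();
$reader->XML($xml);
$seen = [];
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'item') {
        $seen[] = $reader->readString();
    }
}
$reader->close();

echo "DOM, random access: $last, $first\n";
echo 'XMLReader, document order: ' . implode(', ', $seen) . "\n";
```

The DOM half can jump between nodes freely; the XMLReader half has to track any state it needs itself as the document streams past, which is where the extra implementation effort comes from.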

Loading Documents

The DOMDocument class is where use of the DOM extension begins. The first thing to do is instantiate it and then feed it the validated markup data. Note that the DOM extension will emit warnings when a document is loaded if that document is not valid or well-formed. To avoid this, see the previous chapter on using the tidy extension. If tidy does not eliminate the issue, errors can be controlled as shown in the example below. Note that errors are buffered until manually cleared, so make a point of clearing them after each load operation if they are not needed to avoid wasting memory.
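A minimal sketch of this kind of error control follows, assuming that libxml's internal error buffering (libxml_use_internal_errors, libxml_get_errors, libxml_clear_errors) is the mechanism meant. The markup string is a hypothetical stand-in for real page content.

```php
<?php
// Hypothetical malformed markup, used only for illustration.
$html = '<html><body><p>Unclosed paragraph<div>Stray</span></div></body></html>';

// Buffer libxml errors internally instead of emitting PHP warnings.
// The call returns the previous setting so it can be restored later.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$loaded = $doc->loadHTML($html);

// Inspect the buffered errors if they are of interest.
$errors = libxml_get_errors();
foreach ($errors as $error) {
    // Each entry is a LibXMLError object with level, code, line and message.
    printf("Line %d: %s\n", $error->line, trim($error->message));
}

// Clear the buffer after each load so stale errors do not accumulate.
libxml_clear_errors();

// Restore the earlier error-handling setting.
libxml_use_internal_errors($previous);
```

Which errors are reported, if any, depends on the installed libxml version, so the loop may print nothing for a given input.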
