php|architect's Guide to Web Scraping with PHP - Wind Business ...
DOM Extension
Types of Parsers
Before going much further, you should be aware that there are two types of XML parsers: tree parsers and pull parsers. Tree parsers load the entire document into memory and allow you to access any part of it at any time as well as manipulate it. Pull parsers read the document a piece at a time and limit you to working with the current piece being read.
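The difference can be sketched with two classes bundled with PHP: DOMDocument as the tree parser and XMLReader as a pull parser. The sample XML string and the variable names below are hypothetical and chosen only for illustration.

```php
<?php
// Hypothetical sample document; any well-formed XML would do.
$xml = '<books><book>Dune</book><book>Emma</book></books>';

// Tree parser: the whole document is in memory, so any node is reachable at any time.
$doc = new DOMDocument();
$doc->loadXML($xml);
$books = $doc->getElementsByTagName('book');
$treeLast = $books->item($books->length - 1)->textContent; // jump straight to the last node

// Pull parser: nodes arrive one at a time, in document order.
$reader = new XMLReader();
$reader->XML($xml);
$pullTitles = [];
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'book') {
        $pullTitles[] = $reader->readString(); // text content of the current element
    }
}
$reader->close();
```

With the tree parser the last book is fetched directly by index, whereas the pull parser has to walk past every earlier node to reach it.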
The two types of parsers share a relationship similar to that between the file_get_contents and fgets functions: the former lets you work with the entire document at once and uses as much memory as needed to store it, while the latter allows you to work with a piece of the document at a time and use less memory in the process.
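A minimal sketch of that analogy follows. It writes a small, hypothetical three-line file to the system temporary directory, reads it once in full and once line by line, and then removes it.

```php
<?php
// Hypothetical sample file, created only for this demonstration.
$path = tempnam(sys_get_temp_dir(), 'scrape');
file_put_contents($path, "first\nsecond\nthird\n");

// Whole-document access: the entire file is held in memory as one string.
$whole = file_get_contents($path);
$wholeLength = strlen($whole);

// Piecewise access: only the current line is held in memory at any moment.
$lineCount = 0;
$handle = fopen($path, 'r');
while (($line = fgets($handle)) !== false) {
    $lineCount++;
}
fclose($handle);
unlink($path); // clean up the temporary file
```

For a file this small the difference is negligible. It becomes relevant only as the document grows, since the string returned by file_get_contents grows with it while the fgets loop never holds more than one line.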
When working with fairly large documents, lower memory usage is generally the preferable option. Attempting to load a huge document into memory all at once has the same effect on the local system as a throttling client does on a web server: in both cases, resources are consumed and system performance is debilitated until the system eventually locks up or crashes under the stress.
The DOM extension is a tree parser. In general, web scraping does not require the ability to access all parts of the document simultaneously. However, the type of data extraction involved in web scraping can be rather extensive to implement using a pull parser. The appropriateness of one extension over the other depends on the size and complexity of the document.
Loading Documents<br />
The DOMDocument class is where use of the DOM extension begins. The first thing to do is instantiate it and then feed it the validated markup data. Note that the DOM extension will emit warnings when a document is loaded if that document is not valid or well-formed. To avoid this, see the previous chapter on using the tidy extension.
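As a rough sketch of the instantiate-then-load step, the snippet below feeds a short markup string to DOMDocument::loadHTML. The markup is a hypothetical stand-in for a page that has already been cleaned up, and the title lookup is included only to show that the document is usable afterwards.

```php
<?php
// Hypothetical markup standing in for an already validated page.
$html = '<html><head><title>Example</title></head><body><p>Hello</p></body></html>';

$doc = new DOMDocument();
$loaded = $doc->loadHTML($html); // returns true on success

// Once loaded, any part of the tree can be queried.
$title = $doc->getElementsByTagName('title')->item(0)->textContent;
```

DOMDocument also offers loadHTMLFile, load, and loadXML for reading from a file or for XML input; which one fits depends on where the markup comes from.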
If tidy does not eliminate the issue, errors can be controlled as shown in the example below. Note that errors are buffered until manually cleared, so make a point of clearing them after each load operation if they are not needed to avoid wasting memory.
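One way to do this, assuming the libxml error functions that ship with PHP are what is intended here, is to switch libxml to internal error handling around the load call. The markup below is hypothetical and contains an unknown tag that libxml may or may not report, depending on its version.

```php
<?php
// Hypothetical markup with a non-standard tag that the parser may complain about.
$html = '<html><body><p>Hello<bogus>tag</bogus></p></body></html>';

// Buffer parse errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);

// Inspect the buffered errors if they are of interest (array of LibXMLError objects).
$errors = libxml_get_errors();

// Clear the buffer after each load so that errors do not accumulate in memory.
libxml_clear_errors();

// Restore whatever error handling mode was active before.
libxml_use_internal_errors($previous);
```

Restoring the previous setting keeps the change local to this load operation, so other code that relies on the default warning behavior is unaffected.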