03.02.2014 Views

php|architect's Guide to Web Scraping with PHP - Wind Business ...

Chapter 12

XMLReader Extension

The previous two chapters covered two available XML extensions that implement tree parsers. This chapter focuses on the XMLReader extension, which implements a pull parser.

As mentioned in the chapter on the DOM extension, pull parsers differ from tree parsers in that they read documents in a piecewise fashion rather than loading them into memory all at once. A consequence of this is that pull parsers generally traverse a document only once, in one direction, and leave you to collect whatever data is relevant to you along the way.
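
The single forward pass described above can be sketched as follows. This is a minimal illustration rather than a definitive implementation: the sample document and its `books`, `book`, and `title` element names are assumptions made up for the example, and only title text is collected while everything else is skipped.

```php
<?php
// Hypothetical sample document; the element names are assumptions for illustration.
$xml = '<books>'
     . '<book><title>First</title></book>'
     . '<book><title>Second</title></book>'
     . '</books>';

$reader = new XMLReader();
$reader->XML($xml); // Parse from a string; XMLReader::open() would read from a URI instead.

$titles = [];

// read() advances the cursor one node at a time and returns false at the end.
while ($reader->read()) {
    // Keep only the opening tags of <title> elements and ignore all other nodes.
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'title') {
        $titles[] = $reader->readString(); // Text content of the current element.
    }
}

$reader->close();

print_r($titles);
```

Because the cursor never moves backward, anything needed later (here, the titles) has to be stored as it is encountered.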

Before getting started, note that XMLReader’s underlying library, libxml, uses UTF-8 encoding internally. As such, you can mitigate encoding issues by making sure that any document you import (particularly one that’s been cleaned using the tidy extension) is encoded appropriately, so that its encoding does not conflict with the one libxml expects.
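
One way to act on this, sketched under stated assumptions: if the source document is known to be ISO-8859-1, it can be converted to UTF-8 before it is handed to XMLReader. The input string and the choice of `iconv` are assumptions for illustration; the original text does not prescribe a particular conversion function.

```php
<?php
// Hypothetical input: a small document whose bytes are ISO-8859-1 (0xE9 is "é").
$latin1 = "<name>Caf\xE9</name>";

// Convert to UTF-8 up front so the bytes match the encoding libxml uses internally.
$utf8 = iconv('ISO-8859-1', 'UTF-8', $latin1);

$reader = new XMLReader();
$reader->XML($utf8);

$name = null;
while ($reader->read()) {
    // The text node inside <name> carries the converted value.
    if ($reader->nodeType === XMLReader::TEXT) {
        $name = $reader->value;
    }
}
$reader->close();

echo $name, PHP_EOL;
```

The source encoding has to be known or detected beforehand; converting from the wrong encoding would silently corrupt the text.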

XML Parser

The XML Parser extension, as it is referred to in the PHP manual, is a predecessor of XMLReader and an alternative for PHP 4 environments. Its API is oriented toward a more event-driven style of programming, as opposed to the iterative orientation of the XMLReader extension. For more information on the XML Parser extension, see http://php.net/manual/en/book.xml.php.
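
To make the contrast in styles concrete, here is a rough sketch of the event-driven approach using the XML Parser functions. The sample document, the variable names, and the use of closures as handlers are assumptions for illustration (closures require PHP 5.3 or later, so code targeting PHP 4 would pass named functions instead).

```php
<?php
// Hypothetical sample document; the element names are assumptions for illustration.
$xml = '<books>'
     . '<book><title>First</title></book>'
     . '<book><title>Second</title></book>'
     . '</books>';

$titles  = [];
$inTitle = false;

$parser = xml_parser_create();
// Element names are upper-cased by default; disable that to compare against "title".
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

// The parser calls these handlers as it encounters start and end tags.
xml_set_element_handler(
    $parser,
    function ($parser, $name, $attributes) use (&$inTitle, &$titles) {
        if ($name === 'title') {
            $inTitle  = true;
            $titles[] = ''; // Start collecting text for a new title.
        }
    },
    function ($parser, $name) use (&$inTitle) {
        if ($name === 'title') {
            $inTitle = false;
        }
    }
);

// Character data may arrive in several chunks, so append rather than assign.
xml_set_character_data_handler(
    $parser,
    function ($parser, $data) use (&$inTitle, &$titles) {
        if ($inTitle) {
            $titles[count($titles) - 1] .= $data;
        }
    }
);

xml_parse($parser, $xml, true); // true marks this as the final chunk of data.

print_r($titles);
```

Here the parser drives the flow and calls back into your code, whereas with XMLReader your own loop decides when to advance to the next node.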
