php|architect's Guide to Web Scraping with PHP - Wind Business ...


DOM Extension

// Buffer DOM errors rather than emitting them as warnings
$oldSetting = libxml_use_internal_errors(true);

// Instantiate a container for the document
$doc = new DOMDocument;

// Load markup already contained within a string
$doc->loadHTML($htmlString);

// Load markup saved to an external file
$doc->loadHTMLFile($htmlFilePath);

// Get all errors if needed
$errors = libxml_get_errors();

// Get only the last error
$error = libxml_get_last_error();

// Clear any existing errors from previous operations
libxml_clear_errors();

// Revert error buffering to its previous setting
libxml_use_internal_errors($oldSetting);

?>
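As a rough illustration of what the buffered errors look like, the following sketch loads a deliberately malformed string and formats whatever libxml reports. The markup in $htmlString is made up for this example, and the exact messages (or whether any are reported at all) depend on the installed libxml version.

```php
<?php
// Hypothetical malformed markup, used only for illustration
$htmlString = '<html><body><p>Unclosed paragraph<div></span></body></html>';

// Buffer errors so they can be inspected instead of emitted as warnings
$oldSetting = libxml_use_internal_errors(true);

$doc = new DOMDocument;
$doc->loadHTML($htmlString);

// Each buffered error is a LibXMLError object with public properties
// such as line, column and message
$messages = array();
foreach (libxml_get_errors() as $error) {
    $messages[] = sprintf(
        'line %d, column %d: %s',
        $error->line,
        $error->column,
        trim($error->message)
    );
}

// Empty the buffer and restore the previous setting
libxml_clear_errors();
$remaining = count(libxml_get_errors());
libxml_use_internal_errors($oldSetting);

print_r($messages);
```

Collecting the messages into an array before clearing the buffer keeps them available for logging after error handling has been restored.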

Tree Terminology

Once a document is loaded, the next natural step is to extract desired data from it. However, doing so requires a bit more knowledge about how the DOM is structured. Recall the earlier mention of tree parsers. If you have any computer science background, you will be glad to know that the term “tree” in the context of tree parsers does in fact refer to the data structure by the same name. If not, here is a brief rundown of related concepts.

A tree is a hierarchical structure (think family tree) composed of nodes, which exist in the DOM extension as the DOMNode class. Nodes are to trees what elements are to arrays: just items that exist within the data structure.
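To make the family-tree picture concrete, the sketch below walks a small document and prints its element nodes indented by depth. The markup in $htmlString and the describeNode() helper are made up for illustration; the sketch relies on the childNodes property of DOMNode to reach each node's children.

```php
<?php
// Hypothetical markup, used only for illustration
$htmlString = '<html><body><div><p>One</p><p>Two</p></div></body></html>';

$doc = new DOMDocument;
$doc->loadHTML($htmlString);

// Hypothetical helper: recursively collect element names,
// indented by two spaces per level of depth in the tree
function describeNode(DOMNode $node, $depth = 0)
{
    // Skip text, comment and other non-element nodes
    if ($node->nodeType !== XML_ELEMENT_NODE) {
        return array();
    }

    $lines = array(str_repeat('  ', $depth) . $node->nodeName);
    foreach ($node->childNodes as $child) {
        $lines = array_merge($lines, describeNode($child, $depth + 1));
    }
    return $lines;
}

// documentElement is the root element node of the loaded document
$outline = describeNode($doc->documentElement);
echo implode("\n", $outline), "\n";
```

Restricting the walk to element nodes keeps the outline readable, since text between tags is also stored in the tree as nodes of its own.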

Each individual node can have zero or more child nodes that are collectively represented by a childNodes property in the DOMNode class. The value of childNodes is an instance of the class DOMNodeList, which is exactly what it sounds like: a list of DOMNode objects. Other related proper-
