php|architect's Guide to Web Scraping with PHP - Wind Business ...
// Buffer DOM errors rather than emitting them as warnings
$oldSetting = libxml_use_internal_errors(true);
// Instantiate a container for the document
$doc = new DOMDocument;
// Load markup already contained within a string
$doc->loadHTML($htmlString);
// Load markup saved to an external file
$doc->loadHTMLFile($htmlFilePath);
// Get all errors if needed
$errors = libxml_get_errors();
// Get only the last error
$error = libxml_get_last_error();
// Clear any existing errors from previous operations
libxml_clear_errors();
// Revert error buffering to its previous setting
libxml_use_internal_errors($oldSetting);
?>
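The buffered errors returned by libxml_get_errors() are LibXMLError objects, so they can be inspected before the buffer is cleared. The following is a minimal sketch of that step rather than a definitive implementation; the sample markup in $htmlString and the $messages array are assumptions made for illustration, and whether any errors are reported at all depends on the markup and the libxml version in use.

```php
<?php
// Hypothetical sample markup, used only for illustration
$htmlString = '<html><body><p>Hello</p><bogus>oops</bogus></body></html>';

// Buffer DOM errors rather than emitting them as warnings
$oldSetting = libxml_use_internal_errors(true);

$doc = new DOMDocument;
$loaded = $doc->loadHTML($htmlString);

// Each buffered error is a LibXMLError object exposing, among
// other properties, a message, a line and a column
$messages = [];
foreach (libxml_get_errors() as $error) {
    $messages[] = sprintf(
        'line %d, column %d: %s',
        $error->line,
        $error->column,
        trim($error->message)
    );
}

// Clear the buffer and restore the previous setting
libxml_clear_errors();
libxml_use_internal_errors($oldSetting);

// Print whatever was collected (possibly nothing)
foreach ($messages as $message) {
    echo $message, PHP_EOL;
}
```

Collecting the messages into an array before calling libxml_clear_errors() keeps them available after the buffer has been emptied.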
Tree Terminology
Once a document is loaded, the next natural step is to extract desired data from it.
However, doing so requires a bit more knowledge about how the DOM is structured.
Recall the earlier mention of tree parsers. If you have any computer science background,
you will be glad to know that the term “tree” in the context of tree parsers
does in fact refer to the data structure by the same name. If not, here is a brief rundown
of related concepts.
A tree is a hierarchical structure (think family tree) composed of nodes, which exist
in the DOM extension as the DOMNode class. Nodes are to trees what elements are to
arrays: just items that exist within the data structure.
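To make the idea of nodes concrete, the short sketch below loads a small document and checks that both the document and its root element are DOMNode instances. The markup in $htmlString and the variable $root are assumptions introduced for illustration only.

```php
<?php
// Hypothetical sample markup, used only for illustration
$htmlString = '<html><body><p>First</p><p>Second</p></body></html>';

// Load the markup with error buffering, as in the earlier example
$oldSetting = libxml_use_internal_errors(true);
$doc = new DOMDocument;
$doc->loadHTML($htmlString);
libxml_clear_errors();
libxml_use_internal_errors($oldSetting);

// The document itself and its root element are both nodes
$root = $doc->documentElement;
var_dump($doc instanceof DOMNode);
var_dump($root instanceof DOMNode);

// Every node exposes a name and a numeric type constant
echo $root->nodeName, PHP_EOL;
var_dump($root->nodeType === XML_ELEMENT_NODE);
```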
Each individual node can have zero or more child nodes that are collectively represented
by a childNodes property in the DOMNode class. childNodes is an instance
of the class DOMNodeList, which is exactly what it sounds like. Other related proper-