php|architect's Guide to Web Scraping with PHP - Wind Business ...
XMLReader Extension
cate to subsequent if branch cases that the iterator points to a node that is within
the desired table.
The next if branch is entered when table row elements within the table are
encountered. A combination of checking the node name and the previously set
$inTable flag facilitates this. When the branch is entered, a new element in the
$tableData array is initialized to an empty array. This array will later store data from
cells in that row. The key associated with the row in $tableData is stored in the $row
variable.
Finally, the last if branch executes when table cell elements are encountered. Like
the row branch, this branch checks the node name and the $inTable flag. If the check
passes, it then stores the current node’s value in the array associated with the current
table row.
Here’s where the XMLReader::END_ELEMENT node type comes into play. Once the end
of the table is reached, no further data should be read into the array. So, if the ending
element has the name ’table’ and the $inTable flag currently indicates that the
iterator points to a node within the desired table, the flag is then set to false. Since,
in theory, no other table should have the same id attribute value, no if branches will
execute in subsequent while loop iterations.
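The branches described above can be sketched roughly as follows. This is a minimal illustration, not the original listing: the inline sample markup, the id value 'thetable', and the second table are assumptions added to show that cells outside the desired table are ignored.

```php
<?php
// Hypothetical sample markup; the id "thetable" is an assumption for illustration.
$xml = <<<XML
<html><body>
<table id="thetable">
<tr><td>a1</td><td>a2</td></tr>
<tr><td>b1</td><td>b2</td></tr>
</table>
<table id="other"><tr><td>ignored</td></tr></table>
</body></html>
XML;

$doc = new XMLReader();
$doc->XML($xml);

$tableData = array();
$inTable = false;
$row = null;

while ($doc->read()) {
    switch ($doc->nodeType) {
        case XMLReader::ELEMENT:
            if ($doc->localName == 'table'
                && $doc->getAttribute('id') == 'thetable') {
                // The iterator has entered the desired table.
                $inTable = true;
            } elseif ($doc->localName == 'tr' && $inTable) {
                // Start a new row and remember its key.
                $row = count($tableData);
                $tableData[$row] = array();
            } elseif ($doc->localName == 'td' && $inTable) {
                // Store the cell text in the current row.
                $tableData[$row][] = $doc->readString();
            }
            break;
        case XMLReader::END_ELEMENT:
            if ($doc->localName == 'table' && $inTable) {
                // End of the desired table: stop collecting data.
                $inTable = false;
            }
            break;
    }
}
$doc->close();
```

After the loop, $tableData should hold one array per row of the desired table, with the cell from the second table left out because $inTable is false by the time it is read.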
If this table were the only one of interest in the document, it would be prudent
to replace the $inTable = false; statement with a break 2; statement. This would
terminate the while loop used to read nodes from the document as soon as the end
of the table was encountered, preventing any further unnecessary read operations.
The argument of 2 is needed because the statement sits inside a switch case, which
itself counts as one level for break.
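To make the effect of break 2; concrete, here is a hedged sketch that runs the same loop with and without the early exit and counts the read() calls. The readTable() function, its $stopAtTableEnd parameter, and the sample markup are hypothetical names introduced only for this comparison.

```php
<?php
// Hypothetical helper for illustration: returns the collected rows and the
// number of read() calls that were made.
function readTable($xml, $stopAtTableEnd)
{
    $doc = new XMLReader();
    $doc->XML($xml);

    $tableData = array();
    $inTable = false;
    $row = null;
    $reads = 0;

    while ($doc->read()) {
        $reads++;
        switch ($doc->nodeType) {
            case XMLReader::ELEMENT:
                if ($doc->localName == 'table'
                    && $doc->getAttribute('id') == 'thetable') {
                    $inTable = true;
                } elseif ($doc->localName == 'tr' && $inTable) {
                    $row = count($tableData);
                    $tableData[$row] = array();
                } elseif ($doc->localName == 'td' && $inTable) {
                    $tableData[$row][] = $doc->readString();
                }
                break;
            case XMLReader::END_ELEMENT:
                if ($doc->localName == 'table' && $inTable) {
                    if ($stopAtTableEnd) {
                        // Leaves both the switch and the while loop.
                        break 2;
                    }
                    $inTable = false;
                }
                break;
        }
    }
    $doc->close();

    return array($tableData, $reads);
}

// Assumed sample document with content after the table of interest.
$xml = '<root><table id="thetable"><tr><td>a1</td></tr></table>'
     . '<p>one</p><p>two</p></root>';

list($allRows, $allReads) = readTable($xml, false);
list($earlyRows, $earlyReads) = readTable($xml, true);
```

Both calls should return the same rows, while the early-exit variant performs fewer read operations because the nodes after the closing table tag are never visited.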
readString() Availability
As its entry in the PHP manual notes, the readString() method used in the above example
is only present when the XMLReader extension is compiled against certain versions
of the underlying libxml library.
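One way to find out at runtime whether the method is present is a simple feature check. This is only a sketch: it assumes that a build compiled without the method does not expose it on the class, so method_exists() reflects its availability. The $hasReadString variable name is hypothetical.

```php
<?php
// Feature check: true when this build's XMLReader class exposes readString().
$hasReadString = method_exists('XMLReader', 'readString');

// LIBXML_DOTTED_VERSION reports the libxml version PHP was built against.
$libxmlVersion = LIBXML_DOTTED_VERSION;

if (!$hasReadString) {
    // Fall back to a flag-based approach for collecting cell text.
    echo "readString() is unavailable (libxml $libxmlVersion)\n";
}
```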
If this method is unavailable in your environment, an alternative in the example would
be to have opening and closing table cell checks that toggle their own flag ($inCell, for
example) and switch cases for the TEXT and CDATA node types that check this flag
and, when it is set to true, add the contents of the value property from the XMLReader
instance to the $tableData array.
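A rough sketch of that alternative follows. The $inCell flag comes from the text above; the $cell variable, the sample markup, and the isEmptyElement check (which guards against an empty cell such as <td/>, since it produces no END_ELEMENT node) are assumptions added for illustration.

```php
<?php
// Hypothetical sample markup containing a text cell, a CDATA cell, and an empty cell.
$xml = <<<XML
<table id="thetable">
<tr><td>a1</td><td><![CDATA[a2]]></td></tr>
<tr><td>b1</td><td/></tr>
</table>
XML;

$doc = new XMLReader();
$doc->XML($xml);

$tableData = array();
$inTable = false;
$inCell = false;
$row = null;
$cell = null;

while ($doc->read()) {
    switch ($doc->nodeType) {
        case XMLReader::ELEMENT:
            if ($doc->localName == 'table'
                && $doc->getAttribute('id') == 'thetable') {
                $inTable = true;
            } elseif ($doc->localName == 'tr' && $inTable) {
                $row = count($tableData);
                $tableData[$row] = array();
            } elseif ($doc->localName == 'td' && $inTable) {
                // Open a new, initially empty cell in the current row.
                $cell = count($tableData[$row]);
                $tableData[$row][$cell] = '';
                $inCell = !$doc->isEmptyElement;
            }
            break;
        case XMLReader::TEXT:
        case XMLReader::CDATA:
            if ($inCell) {
                // Append text content to the cell that is currently open.
                $tableData[$row][$cell] .= $doc->value;
            }
            break;
        case XMLReader::END_ELEMENT:
            if ($doc->localName == 'td' && $inCell) {
                $inCell = false;
            } elseif ($doc->localName == 'table' && $inTable) {
                $inTable = false;
            }
            break;
    }
}
$doc->close();
```

Because text is appended rather than assigned, a cell whose content is split across several text or CDATA nodes still ends up as a single string.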