03.02.2014 Views

php|architect's Guide to Web Scraping with PHP - Wind Business ...



XMLReader Extension

cate to subsequent if branch cases that the iterator points to a node that is
within the desired table.

The next if branch is entered when table row elements within the table are
encountered. A combination of checking the node name and the previously set
$inTable flag facilitates this. When the branch is entered, a new element in the
$tableData array is initialized to an empty array. This array will later store
data from cells in that row. The key associated with the row in $tableData is
stored in the $row variable.

Finally, the last if branch executes when table cell elements are encountered.
Like the row branch, this branch checks the node name and the $inTable flag. If
the check passes, it then stores the current node’s value in the array
associated with the current table row.

Here’s where the XMLReader::END_ELEMENT node type comes into play. Once the end
of the table is reached, no further data should be read into the array. So, if
the ending element has the name 'table' and the $inTable flag currently
indicates that the iterator points to a node within the desired table, the flag
is then set to false. Since no other tables should theoretically have the same
id attribute, no if branches will execute in subsequent while loop iterations.
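
The loop described above might look roughly like the following. This is a minimal sketch, not the book's listing: the sample markup, the table id 'data', and the use of the localName property and getAttribute() to identify nodes are assumptions made for illustration. It also assumes readString() is available in the environment.

```php
<?php
// Hypothetical input: a well-formed document holding the table of interest.
$xml = <<<XML
<html><body>
<table id="other"><tr><td>skip</td></tr></table>
<table id="data">
<tr><td>a1</td><td>a2</td></tr>
<tr><td>b1</td><td>b2</td></tr>
</table>
</body></html>
XML;

$doc = new XMLReader();
$doc->XML($xml);

$tableData = array();
$inTable = false;
$row = null;

while ($doc->read()) {
    switch ($doc->nodeType) {
        case XMLReader::ELEMENT:
            if ($doc->localName === 'table'
                && $doc->getAttribute('id') === 'data') {
                // The iterator now points inside the desired table.
                $inTable = true;
            } elseif ($doc->localName === 'tr' && $inTable) {
                // Start a new row and remember its key.
                $row = count($tableData);
                $tableData[$row] = array();
            } elseif ($doc->localName === 'td' && $inTable) {
                // Store the cell's text in the current row.
                $tableData[$row][] = $doc->readString();
            }
            break;

        case XMLReader::END_ELEMENT:
            if ($doc->localName === 'table' && $inTable) {
                // End of the desired table: stop collecting data.
                $inTable = false;
            }
            break;
    }
}

$doc->close();
```

With the sample markup, the cells of the table with id 'other' are skipped because $inTable is still false when they are read.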

If this table were the only one of interest in the document, it would be prudent
to replace the $inTable = false; statement with a break 2; statement (the
argument 2 is needed because the statement sits inside a switch that is itself
inside the while loop). This would terminate the while loop used to read nodes
from the document as soon as the end of the table is encountered, preventing any
further unnecessary read operations.
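
As a sketch of that early exit, the example below uses break 2 in the END_ELEMENT case. The sample markup and the table id 'data' are hypothetical, and the $stoppedAt variable exists only to show where the reader halted.

```php
<?php
// Hypothetical input: one table of interest followed by unrelated content.
$xml = '<root><table id="data"><tr><td>a1</td><td>a2</td></tr></table>'
     . '<p>content after the table that never needs to be read</p></root>';

$doc = new XMLReader();
$doc->XML($xml);

$tableData = array();
$inTable = false;
$row = null;

while ($doc->read()) {
    switch ($doc->nodeType) {
        case XMLReader::ELEMENT:
            if ($doc->localName === 'table'
                && $doc->getAttribute('id') === 'data') {
                $inTable = true;
            } elseif ($doc->localName === 'tr' && $inTable) {
                $row = count($tableData);
                $tableData[$row] = array();
            } elseif ($doc->localName === 'td' && $inTable) {
                $tableData[$row][] = $doc->readString();
            }
            break;

        case XMLReader::END_ELEMENT:
            if ($doc->localName === 'table' && $inTable) {
                // Leave both the switch and the enclosing while loop.
                break 2;
            }
            break;
    }
}

// Name of the node the reader was on when the loop ended.
$stoppedAt = $doc->localName;
$doc->close();
```

Because the loop exits on the closing table tag, the paragraph that follows the table is never passed to read().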

readString() Availability

As its entry in the PHP manual notes, the readString() method used in the above
example is only present when the XMLReader extension is compiled against certain
versions of the underlying libxml library.

If this method is unavailable in your environment, an alternative in the example
would be to have opening and closing table cell checks that toggle their own
flag ($inCell, for example) and switch cases for the TEXT and CDATA node types
that check this flag and, when it is set to true, add the contents of the value
property from the XMLReader instance to the $tableData array.
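
The alternative without readString() could be sketched as follows. The sample markup and the table id 'data' are hypothetical. Two details go beyond what the text describes and are assumptions of this sketch: each cell starts as an empty string to which text is appended, so a cell made up of several text or CDATA nodes still yields one entry, and isEmptyElement is checked because a self-closing cell produces no END_ELEMENT node.

```php
<?php
// Hypothetical input: one cell uses a CDATA section instead of plain text.
$xml = '<root><table id="data">'
     . '<tr><td>a1</td><td><![CDATA[a2 & more]]></td></tr>'
     . '<tr><td>b1</td><td>b2</td></tr>'
     . '</table></root>';

$doc = new XMLReader();
$doc->XML($xml);

$tableData = array();
$inTable = false;
$inCell = false;
$row = null;
$cell = null;

while ($doc->read()) {
    switch ($doc->nodeType) {
        case XMLReader::ELEMENT:
            if ($doc->localName === 'table'
                && $doc->getAttribute('id') === 'data') {
                $inTable = true;
            } elseif ($doc->localName === 'tr' && $inTable) {
                $row = count($tableData);
                $tableData[$row] = array();
            } elseif ($doc->localName === 'td' && $inTable) {
                // Open a new, initially empty cell in the current row.
                $cell = count($tableData[$row]);
                $tableData[$row][$cell] = '';
                // A self-closing <td/> has no END_ELEMENT to reset the flag.
                $inCell = !$doc->isEmptyElement;
            }
            break;

        case XMLReader::END_ELEMENT:
            if ($doc->localName === 'td') {
                $inCell = false;
            } elseif ($doc->localName === 'table' && $inTable) {
                $inTable = false;
            }
            break;

        case XMLReader::TEXT:
        case XMLReader::CDATA:
            if ($inCell) {
                // Append this node's text to the cell being read.
                $tableData[$row][$cell] .= $doc->value;
            }
            break;
    }
}

$doc->close();
```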
