
not be replicated exactly using other libraries, it is possible to run multiple requests on separate connections using processes that are executed in parallel.

Even if you are using a library that supports connection pooling, this technique is useful when multiple hosts are being scraped, since each host will require a separate connection anyway. By contrast, issuing those requests in a single process makes it possible for requests sent earlier to a host with a slower response rate to block those sent later to a more responsive host.

See Appendix B for a more detailed example of this.
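As a rough sketch of the idea (assuming the pcntl extension is available; the URLs below are placeholders and Appendix B contains a fuller treatment), each request can be issued from its own forked process:

<?php

// Placeholder URLs; each is fetched by a separate child process.
$urls = array(
    'http://example.com/a',
    'http://example.org/b',
);

foreach ($urls as $url) {
    $pid = pcntl_fork();
    if ($pid == -1) {
        die('Unable to fork');
    }
    if ($pid == 0) {
        // Child process: each child uses its own connection, so a slow
        // host cannot block requests sent to a more responsive one.
        $markup = file_get_contents($url);
        // ... analyze $markup ...
        exit(0);
    }
}

// Parent process: wait for all children to finish.
while (pcntl_waitpid(0, $status) != -1);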

Crawlers

Some web scraping applications are intended to serve as crawlers that index content from web sites. Like all other web scraping applications, the work they perform can be divided into two categories: retrieval and analysis. The parallel processing approach is applicable here because each category of work serves to populate the work queue of the other.

The retrieval process is given one or more initial documents to retrieve. Each time a document is retrieved, it becomes a job for the analysis process, which scrapes the markup searching for links (a elements) to other documents, possibly restricted by one or more relevancy factors. Once analysis of a document is complete, the addresses of any documents not yet retrieved are fed back to the retrieval process.
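A single-process version of this feedback loop might look something like the following sketch; the starting URL, the same-host relevancy check, and the use of file_get_contents and DOMDocument are assumptions made here for illustration:

<?php

$queue = array('http://example.com/'); // placeholder starting document
$seen  = array();

while (!empty($queue)) {
    $url = array_shift($queue);
    if (isset($seen[$url])) {
        continue;
    }
    $seen[$url] = true;

    // Retrieval: fetch the document.
    $markup = @file_get_contents($url);
    if ($markup === false) {
        continue;
    }

    // Analysis: scrape the markup for links (a elements).
    $doc = new DOMDocument();
    @$doc->loadHTML($markup);
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $href = $anchor->getAttribute('href');

        // Simple relevancy factor: absolute http(s) URLs on the same host.
        if (strpos($href, 'http') !== 0) {
            continue;
        }
        if (parse_url($href, PHP_URL_HOST) !== parse_url($url, PHP_URL_HOST)) {
            continue;
        }

        // Feed unretrieved documents back to the retrieval queue.
        if (!isset($seen[$href])) {
            $queue[] = $href;
        }
    }
}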

This cycle of mutual supply will, hypothetically, be sustained until no unindexed documents considered relevant remain. At that point, the process can be restarted, with the retrieval process using appropriate request headers to check for document updates and feeding documents to the analysis process only where updates are found.
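One way to perform such an update check is a conditional GET. The sketch below uses the cURL extension with an If-Modified-Since header; the URL and the stored timestamp are placeholders:

<?php

$url = 'http://example.com/page.html';
// Timestamp of the last successful retrieval (placeholder value).
$lastFetched = gmdate('D, d M Y H:i:s \G\M\T', strtotime('-1 day'));

$ch = curl_init($url);
curl_setopt_array($ch, array(
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_HTTPHEADER     => array('If-Modified-Since: ' . $lastFetched),
));
$body = curl_exec($ch);
$status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);

if ($status == 304) {
    // Not modified: nothing new to feed to the analysis process.
} elseif ($status == 200) {
    // Updated: hand $body back to the analysis process.
}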

Forms

Some web scraping applications must push data to the target application. This is generally accomplished using HTTP POST requests that simulate the submission of HTML forms. Before such requests can be sent, however, there are a few events that must first take place.
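As a minimal sketch of such a simulated form submission (the URL and field names below are placeholders, and the cURL extension is assumed rather than any particular client covered earlier):

<?php

// Placeholder form fields to submit.
$fields = array(
    'username' => 'example',
    'password' => 'secret',
);

$ch = curl_init('http://example.com/login');
curl_setopt_array($ch, array(
    CURLOPT_POST           => true,
    CURLOPT_POSTFIELDS     => http_build_query($fields),
    CURLOPT_RETURNTRANSFER => true,
));
$response = curl_exec($ch);
curl_close($ch);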
