php|architect's Guide to Web Scraping with PHP - Wind Business ...
not be replicated exactly using other libraries, it is possible to run multiple requests
on separate connections using processes that are executed in parallel.
Even if you are using a library supporting connection pooling, this technique is
useful for situations when multiple hosts are being scraped since each host will require
a separate connection anyway. By contrast, doing so in a single process means
it is possible for requests sent earlier to a host with a lower response rate to block
those sent later to another more responsive host.
See Appendix B for a more detailed example of this.
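
As a rough illustration of the idea above (not the Appendix B example), the following sketch starts one child PHP process per request with proc_open and then collects each child's output. The function names runInParallel and fetchCommand are hypothetical, and the sketch assumes the PHP CLI is in use (so PHP_BINARY is set) and that each child simply prints the response body.

```php
<?php
/**
 * Start every command as its own process, then collect each one's output.
 * All processes are launched before any output is read, so they run
 * concurrently. Keys of $commands are preserved in the result.
 */
function runInParallel(array $commands): array
{
    $processes = [];
    $pipes = [];
    foreach ($commands as $key => $command) {
        $pipes[$key] = [];
        // Capture only the child's standard output.
        $processes[$key] = proc_open($command, [1 => ['pipe', 'w']], $pipes[$key]);
    }

    $results = [];
    foreach ($processes as $key => $process) {
        if (!is_resource($process)) {
            $results[$key] = false; // the process could not be started
            continue;
        }
        // Blocks until this child closes its output; the others keep running.
        $results[$key] = stream_get_contents($pipes[$key][1]);
        fclose($pipes[$key][1]);
        proc_close($process);
    }
    return $results;
}

/**
 * Build a shell command that fetches one URL in a separate PHP process.
 * (Hypothetical helper; a real worker script would likely do more.)
 */
function fetchCommand(string $url): string
{
    $code = 'echo file_get_contents($argv[1]);';
    return escapeshellarg(PHP_BINARY) . ' -r ' . escapeshellarg($code)
        . ' -- ' . escapeshellarg($url);
}
```

A call such as runInParallel(['a' => fetchCommand($urlA), 'b' => fetchCommand($urlB)]) would then return both bodies keyed by 'a' and 'b'. Because each request lives in its own process, a slow host delays only its own entry.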
Crawlers
Some web scraping applications are intended to serve as crawlers to index content
from web sites. Like all other web scraping applications, the work they perform can
be divided into two categories: retrieval and analysis. The parallel processing approach
is applicable here because each category of work serves to populate the work
queue of the other.
The retrieval process is given one or more initial documents to retrieve. Each time
a document is retrieved, it becomes a job for the analysis process, which scrapes
the markup searching for links (a elements) to other documents, which may be restricted
by one or more relevancy factors. Once analysis of a document is complete,
addresses to any currently unretrieved documents are then fed back to the retrieval
process.
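
The retrieval and analysis cycle described above can be sketched roughly as follows. For brevity this version runs both steps in a single process, so it shows only how each step feeds the other, not the parallel arrangement. The names extractLinks and crawl, the $retrieve and $isRelevant callbacks, and the assumption that links are already absolute URLs are all illustrative choices, not taken from the book.

```php
<?php
/**
 * Analysis step: return the distinct href values of all a elements.
 */
function extractLinks(string $html): array
{
    if (trim($html) === '') {
        return []; // loadHTML rejects empty input
    }
    $doc = new DOMDocument();
    // Real-world markup is often malformed; collect parser errors quietly.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $href = trim($anchor->getAttribute('href'));
        if ($href !== '') {
            $links[] = $href;
        }
    }
    return array_values(array_unique($links));
}

/**
 * Alternate between retrieval and analysis until the queue is empty.
 * $retrieve maps a URL to its markup, or null on failure.
 * $isRelevant decides whether a discovered URL should be retrieved.
 * Returns the URLs that were retrieved, in order.
 */
function crawl(array $seeds, callable $retrieve, callable $isRelevant): array
{
    $queue = new SplQueue();
    $seen = [];
    foreach ($seeds as $url) {
        $queue->enqueue($url);
        $seen[$url] = true;
    }

    $indexed = [];
    while (!$queue->isEmpty()) {
        $url = $queue->dequeue();
        $html = $retrieve($url);          // retrieval step
        if ($html === null) {
            continue;
        }
        $indexed[] = $url;
        foreach (extractLinks($html) as $link) {   // analysis step
            if (!isset($seen[$link]) && $isRelevant($link)) {
                $seen[$link] = true;      // never queue the same address twice
                $queue->enqueue($link);
            }
        }
    }
    return $indexed;
}
```

Passing the retrieval step in as a callback keeps the loop independent of any particular HTTP client; in a parallel setup the queue would instead be shared storage that separate retrieval and analysis processes read from and write to.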
This situation of mutual supply will hypothetically be sustained until no more documents
are found that are both unindexed and considered to be relevant. At that point, the
process can be restarted with the retrieval process using appropriate request headers
to check for document updates and feeding documents to the analysis process
where updates are found.
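
The text does not name the request headers involved; one plausible reading is the standard HTTP conditional request headers, If-None-Match and If-Modified-Since, sent with validator values saved from the previous retrieval. The sketch below works on that assumption, and the function names conditionalHeaders and isNotModified are hypothetical.

```php
<?php
/**
 * Build conditional request headers from validators saved earlier.
 * $etag is the ETag response header value from the last retrieval;
 * $lastModified is its Last-Modified value. Either may be null.
 */
function conditionalHeaders(?string $etag, ?string $lastModified): array
{
    $headers = [];
    if ($etag !== null) {
        $headers[] = 'If-None-Match: ' . $etag;
    }
    if ($lastModified !== null) {
        $headers[] = 'If-Modified-Since: ' . $lastModified;
    }
    return $headers;
}

/**
 * Return true when a response status line reports 304 Not Modified,
 * meaning the document can skip the analysis process.
 */
function isNotModified(string $statusLine): bool
{
    return preg_match('#^HTTP/\S+\s+304\b#', $statusLine) === 1;
}
```

The returned lines can be supplied through the header option of an HTTP stream context or through CURLOPT_HTTPHEADER in the cURL extension. Servers that do not support these validators simply return the full document, so such documents would need another change check, such as comparing a hash of the body.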
Forms
Some web scraping applications must push data to the target application. This is
generally accomplished using HTTP POST requests that simulate the submission of
HTML forms. Before such requests can be sent, however, there are a few events that