03.02.2014 Views

php|architect's Guide to Web Scraping with PHP - Wind Business ...

Tips and Tricks

since the client essentially has to wait for two requests to complete for every one request that would normally be made to the target application.

The batch approach is based on synchronization. For read operations, data is updated on a regular interval. For write operations, changes are stored locally and then pushed out in batches (hence the name) to the target application, also on a regular interval. The pros and cons to this approach are the complement of those from the real-time approach: updates will not be real-time, but the web scraping application’s response time will not be increased. It is of course possible to use a batch approach with a relatively low interval in order to approximate real-time while gaining the benefits of the batch approach.
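The write side of the batch approach can be sketched roughly as follows. This is a minimal illustration, not code from the book: the `BatchQueue` class, its methods, and the `$push` callback are all hypothetical names, and the queue lives in memory only, whereas a real application would likely persist pending changes to a database or file.

```php
<?php
// Hypothetical sketch; class and method names are illustrative only.
final class BatchQueue
{
    /** @var array[] Changes stored locally until the next flush. */
    private $pending = [];

    /** @var callable Sends a batch to the target application; returns true on success. */
    private $push;

    public function __construct(callable $push)
    {
        $this->push = $push;
    }

    /** Record a write locally without contacting the target application. */
    public function add(array $change): void
    {
        $this->pending[] = $change;
    }

    /** Push all pending changes as one batch; keep them if the push fails. */
    public function flush(): bool
    {
        if ($this->pending === []) {
            return true; // nothing to synchronize
        }
        if (($this->push)($this->pending)) {
            $this->pending = [];
            return true;
        }
        return false; // changes stay queued for the next interval
    }

    /** Number of changes still waiting to be pushed. */
    public function count(): int
    {
        return count($this->pending);
    }
}
```

Under this sketch, user-facing code would only ever call `add()`, while something like a cron job would call `flush()` on the chosen interval. Shortening that interval is what approximates real-time behavior.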

The selection of an approach depends on the requirements of the web scraping application. In general, if real-time updates on either the web scraping application or target application are not required, the batch approach is preferred to maintain a high level of performance.

Availability

Regardless of whether a web scraping application takes a real-time or batch approach, it should treat the remote service as a potential point of failure and account for cases where it does not return a response. Once a tested web scraping application goes into production, common causes for this are either service downtime or modification. Symptoms of these include connection timeouts and responses with a status code above the 2xx range.
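One way to detect these symptoms with the cURL extension is sketched below. The function names `isServiceFailure` and `fetchStatus` and the five-second default timeout are assumptions made for illustration. The sketch also treats any status outside the 2xx range as a failure, which means redirects count as failures unless the client is configured to follow them.

```php
<?php
/**
 * Hypothetical helper: decide whether a request outcome counts as a failure.
 * $status is null when no response arrived at all (e.g. a connection timeout).
 */
function isServiceFailure(?int $status): bool
{
    return $status === null || $status < 200 || $status >= 300;
}

/**
 * Hypothetical helper: request $url and return the HTTP status code,
 * or null if cURL reports an error such as a timeout.
 */
function fetchStatus(string $url, int $timeoutSeconds = 5): ?int
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,            // capture the body instead of printing it
        CURLOPT_CONNECTTIMEOUT => $timeoutSeconds, // limit on establishing the connection
        CURLOPT_TIMEOUT        => $timeoutSeconds, // limit on the whole transfer
    ]);
    $body = curl_exec($ch);

    // curl_exec() returns false on transport-level errors, including timeouts.
    return $body === false ? null : (int) curl_getinfo($ch, CURLINFO_HTTP_CODE);
}
```

A caller might write `if (isServiceFailure(fetchStatus($url))) { ... }` and then log the failure, retry later, or fall back to local data.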

An advantage of the batch approach in this situation is that the web scraping application’s front-facing interface can remain unaffected. Cached data can be used or updates can be stored locally and synchronization can be initiated once the service becomes available again or the web scraping application has been fixed to account for changes in the remote service.
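The read side of this fallback might look like the following sketch. The `readWithFallback` function and the array used as a cache are hypothetical stand-ins. The sketch assumes the `$fetch` callback returns `null` when the remote service is unavailable, and a real application would more likely use a persistent cache.

```php
<?php
/**
 * Hypothetical helper: try the remote service, fall back to the cached copy.
 * $fetch receives the key and returns fresh data, or null when the
 * service is unavailable.
 */
function readWithFallback(callable $fetch, array &$cache, string $key)
{
    $fresh = $fetch($key);
    if ($fresh !== null) {
        $cache[$key] = $fresh;   // refresh the local copy
        return $fresh;
    }
    return $cache[$key] ?? null; // serve cached data, or null if none exists
}
```

With this shape, the front-facing interface keeps working from the last synchronized data while the service is down, and only a key that was never cached surfaces as missing.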

Parallel Processing

Two of the HTTP client libraries previously covered, cURL and pecl_http, support running requests in parallel using a single connection. While the same feature can-
