php|architect's Guide to Web Scraping with PHP - Wind Business ...
Tips and Tricks<br />
since the client essentially has to wait for two requests to complete for every one<br />
request that would normally be made to the target application.<br />
The batch approach is based on synchronization. For read operations, data is updated<br />
on a regular interval. For write operations, changes are stored locally and then<br />
pushed out in batches (hence the name) to the target application, also on a regular<br />
interval. The pros and cons to this approach are the complement of those from the<br />
real-time approach: updates will not be real-time, but the web scraping application’s<br />
response time will not be increased. It is of course possible to use a batch approach<br />
with a relatively low interval in order to approximate real-time while gaining the benefits<br />
of the batch approach.<br />
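The write side of the batch approach can be sketched as a small local queue. This is only an illustration under stated assumptions: the class name `BatchWriteQueue` and the `$send` callable are hypothetical, and the queue lives in memory, whereas a real application would likely persist pending changes so they survive a restart.

```php
<?php
/**
 * Hypothetical local queue for write operations under the batch approach.
 * Changes accumulate here instead of being sent to the target application
 * one request at a time.
 */
final class BatchWriteQueue
{
    /** @var array<int, array> Changes not yet pushed to the target. */
    private $pending = [];

    /** Record a change locally instead of sending it immediately. */
    public function push(array $change): void
    {
        $this->pending[] = $change;
    }

    /** Number of changes still waiting to be synchronized. */
    public function count(): int
    {
        return count($this->pending);
    }

    /**
     * Push all pending changes to the target application as one batch.
     * $send is an assumed callable that receives the batch and returns
     * true on success. On failure the changes stay queued so the next
     * interval can retry them.
     */
    public function flush(callable $send): bool
    {
        if ($this->pending === []) {
            return true; // nothing to synchronize
        }
        if ($send($this->pending) !== true) {
            return false; // keep the batch for the next run
        }
        $this->pending = [];
        return true;
    }
}
```

A scheduled job (cron, for example) could call `flush()` at the chosen interval; shortening that interval is what moves the behavior closer to real-time.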
The selection of an approach depends on the requirements of the web scraping<br />
application. In general, if real-time updates on either the web scraping application<br />
or target application are not required, the batch approach is preferred to maintain a<br />
high level of performance.<br />
Availability<br />
Regardless of whether a web scraping application takes a real-time or batch approach,<br />
it should treat the remote service as a potential point of failure and account<br />
for cases where it does not return a response. Once a tested web scraping application<br />
goes into production, common causes for this are either service downtime or<br />
modification. Symptoms of these include connection timeouts and responses with<br />
a status code above the 2xx range.<br />
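One way to detect these symptoms with the cURL extension is sketched below. The function names `classifyOutcome` and `fetchWithTimeout`, the outcome labels, and the default timeout values are assumptions chosen for illustration. Redirects are not followed here, so a 3xx response is reported as unexpected.

```php
<?php
/**
 * Classify the outcome of a request.
 * $errno is a cURL error number (0 means no transport-level error);
 * $status is the HTTP status code of the response.
 */
function classifyOutcome(int $errno, int $status): string
{
    if ($errno !== 0) {
        return 'unreachable'; // e.g. a connection timeout
    }
    if ($status >= 200 && $status < 300) {
        return 'ok';
    }
    return 'unexpected_status'; // anything outside the 2xx range
}

/**
 * Fetch a URL with explicit timeouts so an unavailable service cannot
 * block the application indefinitely. Timeout defaults are illustrative.
 */
function fetchWithTimeout(string $url, int $connectTimeout = 5, int $timeout = 15): array
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,            // return the body as a string
        CURLOPT_CONNECTTIMEOUT => $connectTimeout, // seconds allowed to connect
        CURLOPT_TIMEOUT        => $timeout,        // seconds allowed overall
    ]);
    $body   = curl_exec($ch);
    $errno  = curl_errno($ch);
    $status = (int) curl_getinfo($ch, CURLINFO_HTTP_CODE);

    return [
        'outcome' => classifyOutcome($errno, $status),
        'body'    => $body === false ? null : $body,
    ];
}
```

Keeping the classification separate from the transfer makes it possible to exercise the failure handling without a live remote service.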
An advantage of the batch approach in this situation is that the web scraping application’s<br />
front-facing interface can remain unaffected. Cached data can be used or<br />
updates can be stored locally and synchronization can be initiated once the service<br />
becomes available again or the web scraping application has been fixed to account<br />
for changes in the remote service.<br />
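Falling back to cached data on the read side might look like the following sketch. The function name `readWithFallback`, the `$fetch` callable, and the file-based cache are assumptions for illustration; any local store would serve the same purpose.

```php
<?php
/**
 * Return fresh data when the remote service responds, otherwise the
 * last good copy.
 *
 * $fetch is an assumed callable that returns the fresh data as a string,
 * or null when the service is unavailable or its response is unusable.
 * $cacheFile is a hypothetical path holding the last successful result.
 */
function readWithFallback(callable $fetch, string $cacheFile): ?string
{
    $fresh = $fetch();
    if ($fresh !== null) {
        // Remember the last good copy for future outages.
        file_put_contents($cacheFile, $fresh, LOCK_EX);
        return $fresh;
    }
    if (is_file($cacheFile)) {
        $cached = file_get_contents($cacheFile);
        return $cached === false ? null : $cached;
    }
    return null; // no fresh data and nothing cached yet
}
```

With this arrangement the front-facing interface keeps serving the cached copy during an outage, and the next successful synchronization refreshes it.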
Parallel Processing<br />
Two of the HTTP client libraries previously covered, cURL and pecl_http, support<br />
running requests in parallel using a single connection. While the same feature can-