03.02.2014 Views

php|architect's Guide to Web Scraping with PHP - Wind Business ...




disabling a primary site feature, such as an e-commerce checkout page that uses ActiveX (a technology specific to Windows and Internet Explorer).

One well-known application of this technique is the robots exclusion standard, which is used to explicitly instruct crawlers to avoid accessing individual resources or the entire web site. More information about this is available at http://www.robotstxt.org. The guidelines detailed there should definitely be accounted for when developing a web scraping application so as to prevent it from exhibiting behavior inconsistent with that of a normal user.
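
As a rough illustration of honoring those guidelines, the sketch below checks a path against the Disallow rules of a robots.txt body before a scraper requests it. The function name `isDisallowed`, the user agent strings, and the sample rules are hypothetical. The logic is deliberately simplified: it applies plain prefix matching, treats every group that matches the agent (including `*`) as applicable, and ignores Allow rules and wildcards.

```php
<?php
/**
 * Returns true if $path is disallowed for $userAgent by the given
 * robots.txt body. Simplified sketch: prefix matching on Disallow
 * lines only; Allow rules and wildcards are not handled.
 */
function isDisallowed(string $robotsTxt, string $userAgent, string $path): bool
{
    $applies = false;      // inside a group that matches our user agent?
    $inAgentLines = false; // was the previous record line a User-agent line?

    foreach (preg_split('/\r\n|\r|\n/', $robotsTxt) as $line) {
        // Strip comments and surrounding whitespace.
        $line = trim(preg_replace('/#.*$/', '', $line));
        if ($line === '' || strpos($line, ':') === false) {
            continue;
        }

        [$field, $value] = array_map('trim', explode(':', $line, 2));
        $field = strtolower($field);

        if ($field === 'user-agent') {
            // Consecutive User-agent lines belong to the same group.
            if (!$inAgentLines) {
                $applies = false;
            }
            $inAgentLines = true;
            if ($value === '*' || ($value !== '' && stripos($userAgent, $value) !== false)) {
                $applies = true;
            }
        } else {
            $inAgentLines = false;
            if ($field === 'disallow' && $applies && $value !== ''
                && strpos($path, $value) === 0) {
                return true;
            }
        }
    }

    return false;
}
```

A scraper would typically fetch `/robots.txt` once per host, cache the body, and call a check like this before each request.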

In some cases, user agent spoofing, a client practice in which a false user agent string is specified, is enough to circumvent user agent sniffing, but not always. An application may have platform-specific requirements that legitimately warrant denying access to certain user agents. In any case, spoofing the user agent is a practice that should be avoided to the fullest extent possible.

Ranges

The Range request header allows the client to specify that the body of the server’s response should be limited to one or more specific byte ranges of what it would normally be. Originally intended to allow failed retrieval attempts to resume from their stopping points, this feature can allow you to minimize data transfer between your application and the server to reduce bandwidth consumption and runtime of your web scraping application.
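
A minimal sketch of how such a request might be prepared with PHP's HTTP stream wrapper follows. The helper names `buildRangeHeader` and `buildRangeContext` are hypothetical, as is the URL in the commented usage. The sketch only covers a single byte range.

```php
<?php
/**
 * Builds a Range header line for a single byte range.
 * Offsets are zero-based and both bounds are inclusive, so
 * (0, 499) asks for the first 500 bytes.
 */
function buildRangeHeader(int $firstByte, int $lastByte): string
{
    if ($firstByte < 0 || $lastByte < $firstByte) {
        throw new InvalidArgumentException('Invalid byte range');
    }
    return sprintf('Range: bytes=%d-%d', $firstByte, $lastByte);
}

/**
 * Creates a stream context that sends the Range header with a GET request.
 */
function buildRangeContext(int $firstByte, int $lastByte)
{
    return stream_context_create([
        'http' => [
            'method' => 'GET',
            'header' => buildRangeHeader($firstByte, $lastByte) . "\r\n",
        ],
    ]);
}

// Hypothetical usage (performs a network request, so it is left commented out):
// $context = buildRangeContext(0, 499);
// $partial = file_get_contents('http://example.com/large-document.html', false, $context);
```

A server is free to ignore the header and return the full body with a 200 status instead of a 206 (Partial Content) response, so the calling code should not assume the returned data was actually truncated.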

This is applicable in cases where you have a good rough idea of where your target data is located within the document, especially if the document is fairly large and you only need a small subset of the data it contains. However, using it adds one more way for your application to break if the target site changes, and you should bear that in mind when electing to do so.

While the format of the header value is left open to allow for other range units, the only unit supported by HTTP/1.1 is bytes. The server may use the Accept-Ranges response header to indicate what units it supports. In a partial response, the server uses the Content-Range header to indicate, in a slightly different format, where the partial response body is located within the full response body.
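
To make the Content-Range format concrete, here is a hedged sketch of parsing a value such as `bytes 0-499/1234`, where the number after the slash is the length of the full body (or `*` when the server does not know it). The function name `parseContentRange` and the returned array keys are hypothetical, and the sketch does not handle the `bytes */1234` form a server sends when a range cannot be satisfied.

```php
<?php
/**
 * Parses a Content-Range header value for the bytes unit.
 * Returns an array with 'first', 'last', and 'total' keys
 * ('total' is null when the server reports "*"), or null if
 * the value does not match the expected format.
 */
function parseContentRange(string $headerValue): ?array
{
    if (!preg_match('~^bytes (\d+)-(\d+)/(\d+|\*)$~', trim($headerValue), $m)) {
        return null;
    }

    return [
        'first' => (int) $m[1],
        'last'  => (int) $m[2],
        'total' => $m[3] === '*' ? null : (int) $m[3],
    ];
}
```

Comparing the parsed `first` and `last` values against the requested range is a cheap way to confirm that the server returned the portion of the document the scraper asked for.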

In the case of bytes, the beginning of the document is represented by 0. Ranges use inclusive bounds. For example, the first 500 bytes of a document would be specified
