php|architect's Guide to Web Scraping with PHP - Wind Business ...
disabling a primary site feature, such as an e-commerce checkout page that uses
ActiveX (a technology specific to Windows and Internet Explorer).
One well-known application of this technique is the robots exclusion standard,
which is used to explicitly instruct crawlers to avoid accessing individual
resources or the entire web site. More information about this is available at
http://www.robotstxt.org. The guidelines detailed there should definitely be accounted
for when developing a web scraping application so as to prevent it from
exhibiting behavior inconsistent with that of a normal user.
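As a rough illustration of honoring those guidelines, the sketch below collects the Disallow rules that apply to all user agents and checks a path against them. The function names (disallowedPrefixes, isPathAllowed) are hypothetical, and the parsing is deliberately simplified: it assumes plain prefix matching and ignores Allow lines, wildcards, and groups addressed to specific agents.

```php
<?php
// Hypothetical helper: returns the Disallow path prefixes found in groups
// addressed to all user agents ("User-agent: *") in robots.txt text.
function disallowedPrefixes(string $robotsTxt): array
{
    $prefixes = [];
    $applies = false;      // true while inside a group that names "*"
    $inAgentLines = false; // true while reading consecutive User-agent lines

    foreach (preg_split('/\r\n|\r|\n/', $robotsTxt) as $line) {
        // Strip trailing comments and surrounding whitespace.
        $line = trim(preg_replace('/#.*$/', '', $line));
        if ($line === '' || strpos($line, ':') === false) {
            continue;
        }
        [$field, $value] = array_map('trim', explode(':', $line, 2));
        $field = strtolower($field);

        if ($field === 'user-agent') {
            // A User-agent line that follows a rule line starts a new group.
            if (!$inAgentLines) {
                $applies = false;
            }
            $inAgentLines = true;
            if ($value === '*') {
                $applies = true;
            }
        } else {
            $inAgentLines = false;
            // An empty Disallow value means nothing is disallowed.
            if ($field === 'disallow' && $applies && $value !== '') {
                $prefixes[] = $value;
            }
        }
    }
    return $prefixes;
}

// Hypothetical helper: simple prefix check (an assumption of this sketch).
function isPathAllowed(string $path, array $prefixes): bool
{
    foreach ($prefixes as $prefix) {
        if (strpos($path, $prefix) === 0) {
            return false;
        }
    }
    return true;
}
```

A scraper could fetch /robots.txt once per host, pass its contents to disallowedPrefixes(), and call isPathAllowed() before each request.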
In some cases, a client practice called user agent spoofing, which involves specifying
a false user agent string, is enough to circumvent user agent sniffing, but not
always. An application may have platform-specific requirements that legitimately
warrant it denying access to certain user agents. In any case, spoofing the user agent
is a practice that should be avoided to the fullest extent possible.
Ranges
The Range request header allows the client to specify that the body of the server’s
response should be limited to one or more specific byte ranges of what it would normally
be. Originally intended to allow failed retrieval attempts to resume from their
stopping points, this feature can allow you to minimize data transfer between your
application and the server to reduce bandwidth consumption and runtime of your
web scraping application.
This applies when you have a rough idea of where your target
data is located within the document, especially if the document is fairly large and
you only need a small subset of the data it contains. However, using it adds one
more way for your application to break if the target site changes,
and you should bear that in mind when electing to do so.
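A minimal sketch of requesting only part of a document follows. It assumes the byte-range syntax bytes=first-last with zero-based, inclusive offsets and uses PHP's HTTP stream context options. The helper names (buildRangeHeader, buildContextOptions) and the example URL are hypothetical.

```php
<?php
// Hypothetical helper: formats an inclusive, zero-based byte range as a
// Range request header line.
function buildRangeHeader(int $first, int $last): string
{
    if ($first < 0 || $last < $first) {
        throw new InvalidArgumentException('Invalid byte range');
    }
    return sprintf('Range: bytes=%d-%d', $first, $last);
}

// Hypothetical helper: wraps the header in options suitable for
// stream_context_create().
function buildContextOptions(string $rangeHeader): array
{
    return [
        'http' => [
            'method' => 'GET',
            'header' => $rangeHeader . "\r\n",
        ],
    ];
}

// Usage (commented out because it performs a network request; the URL is
// a placeholder):
// $context = stream_context_create(buildContextOptions(buildRangeHeader(0, 499)));
// $chunk = file_get_contents('http://example.com/large-page.html', false, $context);
```

A server that does not support ranges may ignore the header and return the full body, so the application should not assume the response is partial.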
While the format of the header value is left open to allow for other range
units, the only unit supported by HTTP/1.1 is bytes. The server may
use the Accept-Ranges response header to indicate which units it supports. In a partial
response, the server uses the Content-Range header to indicate (in a slightly different
format) where the returned bytes are located within the full response body.
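To make the Content-Range format concrete, the sketch below parses a value such as "bytes 0-499/1234" into its parts. The function name parseContentRange is hypothetical, and the sketch handles only the byte unit, returning null for anything else, including the "bytes */1234" form a server sends when a range cannot be satisfied.

```php
<?php
// Hypothetical helper: parses a Content-Range header value of the form
// "bytes first-last/total", where total may be "*" if it is unknown.
// Returns null when the value does not match that form.
function parseContentRange(string $value): ?array
{
    if (!preg_match('/^bytes (\d+)-(\d+)\/(\d+|\*)$/', trim($value), $m)) {
        return null;
    }
    return [
        'first' => (int) $m[1],
        'last'  => (int) $m[2],
        'total' => $m[3] === '*' ? null : (int) $m[3],
    ];
}
```

Because the bounds are inclusive, the number of bytes in the partial body is last - first + 1, which an application can compare against the length of what it actually received.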
In the case of bytes, the beginning of the document is represented by 0. Ranges use
inclusive bounds. For example, the first 500 bytes of a document would be specified