php|architect's Guide to Web Scraping with PHP - Wind Business ...
Headers
An all-purpose method of communicating a variety of information related to requests
and responses, headers are used by the client and server to accomplish a
number of things, including retention of state using cookies and identity verification
using HTTP authentication. This section deals with those that are particularly applicable
to web scraping applications. For more information, see section 14 of RFC
2616.
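As a rough illustration of what working with headers looks like in a scraper, the sketch below splits a raw response header block into a map of field names to values. The function name parseHeaders and the sample response are hypothetical, made up for this example. It lowercases field names because HTTP treats them as case-insensitive, and it keeps a list per name because a field such as Set-Cookie can appear more than once. It does not handle folded (multi-line) header values.

```php
<?php
// Hypothetical helper: turn a raw header block into an array of
// lowercase field name => list of values.
function parseHeaders(string $raw): array
{
    $headers = [];
    foreach (preg_split('/\r?\n/', trim($raw)) as $line) {
        // Skip the status line and any line that is not "name: value".
        if (strpos($line, 'HTTP/') === 0 || strpos($line, ':') === false) {
            continue;
        }
        [$name, $value] = explode(':', $line, 2);
        $headers[strtolower(trim($name))][] = trim($value);
    }
    return $headers;
}

// Made-up sample response headers.
$raw = "HTTP/1.1 200 OK\r\n"
     . "Content-Type: text/html\r\n"
     . "Set-Cookie: a=1\r\n"
     . "Set-Cookie: b=2\r\n";

$headers = parseHeaders($raw);
```

With the sample input, $headers['set-cookie'] holds both cookie values in the order they were received.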
Cookies<br />
HTTP is designed to be a stateless protocol. That is, once a server returns the response
for a request, it effectively “forgets” about the request. It may log information
about the request and the response it delivered, but it does not retain any sense of
state for the same client between requests. Cookies are a method of circumventing
this using headers. Here is how they work.
• The client issues a request.<br />
• In its response, the server includes a Set-Cookie header. The header value
consists of name-value pairs, each with optional associated attribute-value
pairs.
• In subsequent requests, the client will include a Cookie header that contains<br />
the data it received in the Set-Cookie response header.<br />
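The exchange described in the steps above can be sketched in PHP roughly as follows. The helpers parseSetCookie and buildCookieHeader and the sample header values are hypothetical, not part of any library. The sketch assumes one cookie per Set-Cookie header and ignores rules a real client must apply, such as domain and path matching and expiration.

```php
<?php
// Hypothetical helper: split one Set-Cookie value into its leading
// name=value pair and its attribute-value pairs.
function parseSetCookie(string $value): array
{
    $parts = explode(';', $value);
    [$name, $cookieValue] = array_pad(explode('=', trim($parts[0]), 2), 2, '');
    $attributes = [];
    foreach (array_slice($parts, 1) as $part) {
        [$attrName, $attrValue] = array_pad(explode('=', trim($part), 2), 2, '');
        $attributes[strtolower($attrName)] = $attrValue;
    }
    return ['name' => $name, 'value' => $cookieValue, 'attributes' => $attributes];
}

// Hypothetical helper: build the Cookie request header from parsed cookies.
// Only the name=value pairs are sent back; attributes stay on the client.
function buildCookieHeader(array $cookies): string
{
    $pairs = [];
    foreach ($cookies as $cookie) {
        $pairs[] = $cookie['name'] . '=' . $cookie['value'];
    }
    return 'Cookie: ' . implode('; ', $pairs);
}

// Made-up Set-Cookie values as they might arrive in a response.
$setCookieValues = [
    'session=abc123; path=/',
    'lang=en; expires=Wed, 09 Jun 2021 10:18:14 GMT',
];

$cookies = array_map('parseSetCookie', $setCookieValues);
$header = buildCookieHeader($cookies);
```

Here $header ends up as the line the client would add to its next request to the same server.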
Cookies are frequently used to restrict access to certain content, most often by requiring
some form of identity authentication before the target application will indicate
that a cookie should be set. Most client libraries can handle
parsing and resending cookie data as appropriate, though some require explicit instruction
before they will do so. For more information on cookies, see RFC 2109 or
its later (though less widely adopted) rendition, RFC 2965.
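PHP's cURL extension is one example of a client that needs explicit instruction before it will store and resend cookies. The sketch below assumes the extension is installed. The function name createCookieAwareHandle and the URL are placeholders chosen for illustration. No request is actually sent.

```php
<?php
// Hypothetical helper: create a cURL handle that keeps cookies between
// requests by reading them from and writing them to a file.
function createCookieAwareHandle(string $url, string $cookieFile)
{
    $ch = curl_init($url);
    // Return the response body from curl_exec() instead of printing it.
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    // Load stored cookies from this file; this also turns on cookie handling.
    curl_setopt($ch, CURLOPT_COOKIEFILE, $cookieFile);
    // Save cookies to this file when the handle is released.
    curl_setopt($ch, CURLOPT_COOKIEJAR, $cookieFile);
    return $ch;
}

$cookieFile = tempnam(sys_get_temp_dir(), 'cookies');
$ch = createCookieAwareHandle('http://example.com/', $cookieFile);
// Calling curl_exec($ch) here would send the request; it is left out so
// that the sketch makes no network call.
```

Using the same file for both options means cookies set by one response are available to later requests made with a handle configured the same way.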
One of the aforementioned attributes, “expires,” is used to indicate when the client
should dispose of the cookie and stop sending its data in subsequent requests. This
attribute is optional, and its presence or absence is the defining factor in whether
or not the cookie is what’s called a session cookie. If a cookie has no expiration value