03.02.2014 Views

php|architect's Guide to Web Scraping with PHP - Wind Business ...

Headers

An all-purpose method of communicating a variety of information related to requests and responses, headers are used by the client and server to accomplish a number of things, including retention of state using cookies and identity verification using HTTP authentication. This section deals with those that are particularly applicable to web scraping applications. For more information, see section 14 of RFC 2616.

Cookies

HTTP is designed to be a stateless protocol. That is, once a server returns the response for a request, it effectively "forgets" about the request. It may log information about the request and the response it delivered, but it does not retain any sense of state for the same client between requests. Cookies are a way of getting around this limitation using headers. Here is how they work:

• The client issues a request.

• In its response, the server includes a Set-Cookie header. The header value consists of name-value pairs, each with optional associated attribute-value pairs.

• In subsequent requests, the client includes a Cookie header that contains the name-value pairs it received in the Set-Cookie response header.
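The exchange above can be sketched in a few lines of PHP. This is a minimal illustration, not a complete cookie parser: the function names parseSetCookie() and buildCookieHeader() and the sample header values are hypothetical, and the sketch assumes one cookie per Set-Cookie header with attributes separated by semicolons.

```php
<?php
// Illustrative sketch only: the function names and sample values used here
// are hypothetical and do not come from any particular library.

/**
 * Split one Set-Cookie header value into the cookie's name, its value,
 * and any attributes (such as expires or path) that follow it.
 */
function parseSetCookie(string $headerValue): array
{
    $parts = array_map('trim', explode(';', $headerValue));

    // The first segment is the name-value pair itself.
    [$name, $value] = array_pad(explode('=', array_shift($parts), 2), 2, '');

    // Remaining segments are attributes; some (e.g. HttpOnly) carry no value
    // and are recorded as boolean true.
    $attributes = [];
    foreach ($parts as $part) {
        if ($part === '') {
            continue;
        }
        [$attrName, $attrValue] = array_pad(explode('=', $part, 2), 2, true);
        $attributes[strtolower($attrName)] = $attrValue;
    }

    return ['name' => $name, 'value' => $value, 'attributes' => $attributes];
}

/**
 * Build the Cookie request header from previously parsed cookies.
 * Only the name-value pairs are sent back; attributes stay with the client.
 */
function buildCookieHeader(array $cookies): string
{
    $pairs = [];
    foreach ($cookies as $cookie) {
        $pairs[] = $cookie['name'] . '=' . $cookie['value'];
    }

    return 'Cookie: ' . implode('; ', $pairs);
}
```

For example, passing the result of parseSetCookie('sessionid=abc123; Path=/; HttpOnly') to buildCookieHeader() inside an array yields the string "Cookie: sessionid=abc123". The attributes are kept on the client side so it can decide when and where the cookie applies.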

Cookies are frequently used to restrict access to certain content, most often by requiring some form of identity authentication before the target application will indicate that a cookie should be set. Most client libraries can handle parsing and resending cookie data as appropriate, though some require explicit instruction before they will do so. For more information on cookies, see RFC 2109 or its later (though less widely adopted) rendition RFC 2965, both since superseded by RFC 6265.
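As one example of a client that needs explicit instruction, PHP's cURL extension only stores and resends cookies once it is told where to keep them. The sketch below assumes the cURL extension is installed; the function name createCookieAwareHandle(), the URL, and the cookie file path are hypothetical.

```php
<?php
// Illustrative sketch: the function name, URL, and file path are hypothetical.
// Requires PHP's cURL extension.

/**
 * Create a cURL handle that reads cookies from, and saves cookies to,
 * the given file, so they persist across requests.
 */
function createCookieAwareHandle(string $url, string $cookieFile)
{
    $handle = curl_init($url);

    curl_setopt_array($handle, [
        // Return the response body from curl_exec() instead of printing it.
        CURLOPT_RETURNTRANSFER => true,
        // Read any previously stored cookies and enable cookie handling.
        CURLOPT_COOKIEFILE => $cookieFile,
        // Write received cookies to this file when the handle is closed.
        CURLOPT_COOKIEJAR => $cookieFile,
    ]);

    return $handle;
}

// Usage (performs a real request, so it is left commented out):
// $handle = createCookieAwareHandle('http://example.com/login', '/tmp/cookies.txt');
// $body = curl_exec($handle);
```

Pointing both options at the same file is a common choice: cookies set by one response are written out when the handle is closed and read back in by the next handle that uses that file.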

One of the aforementioned attributes, "expires," is used to indicate when the client should dispose of the cookie and not persist its data in subsequent requests. This attribute is optional and its presence or lack thereof is the defining factor in whether or not the cookie is what's called a session cookie. If a cookie has no expiration value
