03.02.2014 Views

php|architect's Guide to Web Scraping with PHP - Wind Business ...



URL Encoding

One trait of query strings is that parameter values are encoded using percent-encoding or, as it's more commonly known, URL encoding. The PHP functions urlencode and urldecode are a convenient way to handle string values encoded in this manner. Most HTTP client libraries handle encoding request parameters for you. Though it's called URL encoding, the technical details for it are actually more closely associated with the URI, as shown in section 2.1 of RFC 3986.
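As a minimal sketch of the functions mentioned above, the snippet below encodes and decodes a single parameter value. The value and the parameter names `q` and `page` are hypothetical, chosen only because they contain a space and reserved characters. Note that urlencode writes a space as `+` (form-style encoding), while rawurlencode writes it as `%20`, which is the form RFC 3986 describes.

```php
<?php
// Hypothetical parameter value containing a space and reserved characters.
$value = 'web scraping & PHP?';

// urlencode() uses form-style encoding: spaces become "+".
$formEncoded = urlencode($value);    // web+scraping+%26+PHP%3F

// rawurlencode() follows RFC 3986: spaces become "%20".
$rfcEncoded = rawurlencode($value);  // web%20scraping%20%26%20PHP%3F

// Each decoder reverses its matching encoder.
$fromForm = urldecode($formEncoded);
$fromRfc  = rawurldecode($rfcEncoded);

// http_build_query() encodes a whole parameter array in one call.
// The keys "q" and "page" are illustrative only.
$query = http_build_query(['q' => $value, 'page' => 2]);
// q=web+scraping+%26+PHP%3F&page=2
```

Which encoder to use depends on what the target application expects; when in doubt, comparing against the query string a browser produces for the same form is a reasonable check.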

HEAD Requests

Though not common when accessing target web applications, HEAD requests are useful in web scraping applications in several ways. They function in the same way as a GET request with one exception: when the server delivers its response, it will not deliver the resource representation that normally comprises the response body. The reason this is useful is that it allows a client to get at the data present in the response headers without having to download the entire response, which is liable to be significantly larger. Such data can include whether or not the resource is still available for access and, if it is, when it was last modified.

HEAD /wiki/Main_Page HTTP/1.1
Host: en.wikipedia.org
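The following is a rough sketch, not a definitive implementation, of how a scraper might build a request like the one above and read availability and modification data out of the response headers. The function names `buildHeadRequest` and `parseResponseHead` are hypothetical, as is the sample response text, which stands in for what a server might send back. The `Connection: close` header is an added assumption so that a server would close the connection after responding.

```php
<?php
// Build the raw text of a HEAD request (hypothetical helper).
function buildHeadRequest(string $host, string $path): string
{
    return "HEAD {$path} HTTP/1.1\r\n"
         . "Host: {$host}\r\n"
         . "Connection: close\r\n"
         . "\r\n"; // a blank line ends the header block
}

// Split a raw response head into a status code and a header map
// (hypothetical helper; header names are lowercased for lookup).
function parseResponseHead(string $raw): array
{
    $lines = preg_split('/\r\n|\n/', trim($raw));
    // The status line looks like "HTTP/1.1 200 OK".
    $statusParts = explode(' ', array_shift($lines), 3);
    $headers = [];
    foreach ($lines as $line) {
        if ($line === '') {
            break; // end of headers
        }
        if (strpos($line, ':') === false) {
            continue; // skip anything that is not "Name: value"
        }
        [$name, $value] = explode(':', $line, 2);
        $headers[strtolower(trim($name))] = trim($value);
    }
    return ['status' => (int) $statusParts[1], 'headers' => $headers];
}

$request = buildHeadRequest('en.wikipedia.org', '/wiki/Main_Page');

// Hypothetical response head, for illustration only.
$sample = "HTTP/1.1 200 OK\r\n"
        . "Last-Modified: Mon, 21 Jul 2008 02:32:52 GMT\r\n"
        . "\r\n";

$head = parseResponseHead($sample);
$available = $head['status'] === 200;
$lastModified = $head['headers']['last-modified'] ?? null;
```

To send the request for real, the string from `buildHeadRequest` could be written to a socket opened with fsockopen. Alternatively, PHP's built-in get_headers function can issue a HEAD request when the default stream context sets the HTTP method to HEAD.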

Speaking of responses, now would be a good time to investigate those in more detail.

Responses

Aside from the first response line, called the status line, responses are formatted very similarly to requests. While different headers are used in requests and responses, they are formatted the same way. A blank line separates the headers and the body in both requests and responses. The body may be absent in either depending on what the request operation is. Below is an example response.

HTTP/1.0 200 OK
Date: Mon, 21 Jul 2008 02:32:52 GMT
