php|architect's Guide to Web Scraping with PHP - Wind Business ...
URL Encoding<br />
One trait of query strings is that parameter values are encoded using percent-encoding<br />
or, as it’s more commonly known, URL encoding. The PHP functions urlencode and<br />
urldecode are a convenient way to handle string values encoded in this manner. Most<br />
HTTP client libraries handle encoding request parameters for you. Though it’s called<br />
URL encoding, the encoding is actually defined for URIs in general, as described in<br />
section 2.1 of RFC 3986.<br />
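As a minimal sketch of the functions mentioned above, the following encodes and decodes a single parameter value. The sample value and parameter names are illustrative assumptions, not taken from the text. Note that urlencode follows form-style encoding and writes a space as "+", while rawurlencode follows RFC 3986 and writes it as "%20".

```php
<?php
// Hypothetical parameter value containing characters that must be escaped.
$value = 'web scraping & PHP?';

// urlencode() applies form-style encoding: a space becomes "+".
$encoded = urlencode($value);

// rawurlencode() follows RFC 3986: a space becomes "%20".
$raw = rawurlencode($value);

// urldecode() reverses urlencode().
$decoded = urldecode($encoded);

// http_build_query() encodes a whole set of parameters in one call.
// The separator is passed explicitly so the result does not depend on
// the arg_separator.output ini setting.
$query = http_build_query(['q' => $value, 'page' => 2], '', '&');

echo $encoded, "\n", $raw, "\n", $decoded, "\n", $query, "\n";
```

In practice, http_build_query is usually the more convenient choice when a request has several parameters, since it encodes both names and values.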
HEAD Requests<br />
Though not common when accessing target web applications, HEAD requests are<br />
useful in web scraping applications in several ways. They function in the same way<br />
as a GET request with one exception: when the server delivers its response, it will<br />
not deliver the resource representation that normally comprises the response body.<br />
The reason this is useful is that it allows a client to get at the data present in the<br />
response headers without having to download the entire response, which is liable<br />
to be significantly larger. Such data can include whether or not the resource is still<br />
available for access and, if it is, when it was last modified.<br />
HEAD /wiki/Main_Page HTTP/1.1<br />
Host: en.wikipedia.org<br />
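One possible way to issue such a request from PHP is sketched below. The helper names head_request_text and fetch_head_headers are hypothetical; the second relies on the built-in get_headers function with a stream context whose method is set to HEAD (the context argument assumes PHP 7.1 or later). This is an illustration under those assumptions, not the only way to send a HEAD request.

```php
<?php
// Hypothetical helper: builds the raw text of a HEAD request like the one
// shown above. HTTP lines end in CRLF and a blank line ends the headers.
function head_request_text(string $host, string $path): string
{
    return "HEAD {$path} HTTP/1.1\r\n"
         . "Host: {$host}\r\n"
         . "\r\n";
}

// Hypothetical helper: asks PHP's HTTP stream wrapper to send a HEAD
// request and returns the response headers as an associative array,
// or false on failure. No response body is downloaded.
function fetch_head_headers(string $url)
{
    $context = stream_context_create([
        'http' => ['method' => 'HEAD'],
    ]);

    return get_headers($url, true, $context);
}

// Show the request text for the example resource.
echo head_request_text('en.wikipedia.org', '/wiki/Main_Page');
```

Calling fetch_head_headers('https://en.wikipedia.org/wiki/Main_Page') would perform the network request. Element 0 of the returned array holds the status line, and a header such as Last-Modified, if the server sends it, appears under its own key.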
Speaking of responses, now would be a good time to investigate those in more detail.<br />
Responses<br />
Aside from the first response line, called the status line, responses are formatted very<br />
similarly to requests. While different headers are used in requests and responses,<br />
they are formatted the same way. A blank line separates the headers and the body in<br />
both requests and responses. The body may be absent in either depending on what<br />
the request operation is. Below is an example response.<br />
HTTP/1.0 200 OK<br />
Date: Mon, 21 Jul 2008 02:32:52 GMT