php|architect's Guide to Web Scraping with PHP - Wind Business ...
http://en.wikipedia.org/wiki/Main_Page is the resulting full URL of the<br />
requested resource.<br />
URI vs URL<br />
URI is sometimes used interchangeably with URL, which frequently leads to confusion<br />
about the exact nature of either. A URI is used to uniquely identify a resource, indicate<br />
how to locate a resource, or both. A URL is the subset of URI that does both (as opposed<br />
to only one or the other), which is what makes it usable by humans. After all, what’s the use of being<br />
able to identify a resource if you can’t access it? See sections 1.1.3 and 1.2.2 of RFC 3986<br />
for more information.<br />
GET is by far the most commonly used operation in the HTTP protocol. According<br />
to the HTTP specification, the intent of GET is to request a representation of a<br />
resource, essentially to “read” it as you would a file on a file system. Common examples<br />
of formats for such representations include HTML and XML-based formats<br />
such as XHTML, RSS, and Atom.<br />
In principle, GET should not modify any existing data exposed by the application.<br />
For this reason, it is considered to be what is called a safe operation. It is worth<br />
noting that as you examine your target applications, you may encounter situations<br />
where GET operations are used incorrectly to modify data rather than simply returning<br />
it. This indicates poor application design and should be avoided when developing<br />
your own applications.<br />
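To make the shape of a GET request concrete, the following is a minimal sketch that builds the text of a request for a given URL without sending it. The function name buildGetRequest is hypothetical, and the choice of HTTP/1.1 with a Connection: close header is an assumption made for illustration; a real client would normally send additional headers.

```php
<?php
// Hypothetical helper (not from the book): builds the raw text of a
// minimal HTTP/1.1 GET request for an absolute URL. Nothing is sent.
function buildGetRequest(string $url): string
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['host'])) {
        throw new InvalidArgumentException("Not an absolute URL: $url");
    }

    // The request target is the path plus the optional query string.
    $target = $parts['path'] ?? '/';
    if (isset($parts['query'])) {
        $target .= '?' . $parts['query'];
    }

    // The Host header carries the host name and any explicit port.
    $host = $parts['host'];
    if (isset($parts['port'])) {
        $host .= ':' . $parts['port'];
    }

    // Header lines end with CRLF; an empty line terminates the headers.
    return "GET $target HTTP/1.1\r\n"
        . "Host: $host\r\n"
        . "Connection: close\r\n" // assumption: no keep-alive wanted
        . "\r\n";
}

echo buildGetRequest('http://en.wikipedia.org/wiki/Main_Page');
```

Because GET is meant to be safe, a request like this one carries no body: everything the server needs to locate the resource is in the request line and the Host header.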
Anatomy of a URL<br />
If you aren’t already familiar with all the components of a URL, this will likely be<br />
useful in later chapters.<br />
http://user:pass@www.domain.com:8080/path/to/file.ext?query=&var=value#anchor<br />
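As a quick way to see these components from code, the sketch below passes the example URL to PHP's built-in parse_url and parse_str functions. The variable names are illustrative only, and the comments show the values these functions return for this particular URL.

```php
<?php
// Split the example URL into its components with PHP's built-in parser.
$url = 'http://user:pass@www.domain.com:8080/path/to/file.ext?query=&var=value#anchor';
$parts = parse_url($url);

echo $parts['scheme'], PHP_EOL;   // http
echo $parts['user'], PHP_EOL;     // user
echo $parts['pass'], PHP_EOL;     // pass
echo $parts['host'], PHP_EOL;     // www.domain.com
echo $parts['port'], PHP_EOL;     // 8080 (returned as an integer)
echo $parts['path'], PHP_EOL;     // /path/to/file.ext
echo $parts['query'], PHP_EOL;    // query=&var=value
echo $parts['fragment'], PHP_EOL; // anchor

// The query string can be broken down further into name/value pairs.
parse_str($parts['query'], $params);
var_dump($params); // 'query' maps to an empty string, 'var' to 'value'
```

Each array key corresponds to one component of the URL; the list that follows describes them in turn.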
• http is the protocol used to interact with the resource. Another example is<br />
https, which is equivalent to http on a connection using an SSL certificate for<br />
encryption.<br />
• user:pass@ is an optional component used to instruct the client that Basic<br />
HTTP authentication is required to access the resource and that user and pass