03.02.2014 Views

php|architect's Guide to Web Scraping with PHP - Wind Business ...


http://en.wikipedia.org/wiki/Main_Page is the resulting full URL of the requested resource.

URI vs URL

URI is sometimes used interchangeably with URL, which frequently leads to confusion about the exact nature of either. A URI is used to uniquely identify a resource, indicate how to locate a resource, or both. URL is the subset of URI that does both (as opposed to either) and is what makes them usable by humans. After all, what’s the use of being able to identify a resource if you can’t access it? See sections 1.1.3 and 1.2.2 of RFC 3986 for more information.

GET is by far the most commonly used operation in the HTTP protocol. According to the HTTP specification, the intent of GET is to request a representation of a resource, essentially to “read” it as you would a file on a file system. Common examples of formats for such representations include HTML and XML-based formats such as XHTML, RSS, and Atom.
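As an illustration of what a GET request for the URL above looks like on the wire, the following minimal sketch composes the request text by hand. The function name buildGetRequest is hypothetical and is not part of PHP; the sketch assumes an absolute URL and, for simplicity, HTTP/1.1 with a Connection: close header.

```php
<?php
// Hypothetical helper (not a PHP built-in): compose the text of a minimal
// HTTP/1.1 GET request for an absolute URL.
function buildGetRequest(string $url): string
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['host'])) {
        throw new InvalidArgumentException("Not an absolute URL: $url");
    }

    // The request target is the path plus the query string, if any.
    // The fragment (#anchor) is never sent to the server.
    $target = $parts['path'] ?? '/';
    if (isset($parts['query'])) {
        $target .= '?' . $parts['query'];
    }

    // The Host header carries the host name and any explicit port.
    $host = $parts['host'];
    if (isset($parts['port'])) {
        $host .= ':' . $parts['port'];
    }

    // Each line ends with CRLF; an empty line terminates the headers.
    return "GET $target HTTP/1.1\r\n"
        . "Host: $host\r\n"
        . "Connection: close\r\n"
        . "\r\n";
}

echo buildGetRequest('http://en.wikipedia.org/wiki/Main_Page');
```

The request carries no body: everything the server needs in order to return a representation of the resource is in the request line and the headers.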

In principle, GET should not modify any existing data exposed by the application. For this reason, it is considered to be what is called a safe operation. It is worth noting that as you examine your target applications, you may encounter situations where GET operations are used incorrectly to modify data rather than simply returning it. This indicates poor application design and should be avoided when developing your own applications.

Anatomy of a URL

If you aren’t already familiar with all the components of a URL, this will likely be useful in later chapters.

http://user:pass@www.domain.com:8080/path/to/file.ext?query=&var=value#anchor
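For reference, PHP's built-in parse_url() function splits a URL into the components discussed in this section, and parse_str() decodes the query string into an array. The sketch below applies both to the example URL; the variable names are illustrative only.

```php
<?php
// The example URL from this section.
$url = 'http://user:pass@www.domain.com:8080/path/to/file.ext?query=&var=value#anchor';

// parse_url() returns an associative array of the components it finds.
$parts = parse_url($url);

// parse_str() decodes the query string into name/value pairs.
parse_str($parts['query'], $params);

// Print each URL component by name.
foreach ($parts as $name => $value) {
    printf("%-8s %s\n", $name, $value);
}

// Print the decoded query parameters.
foreach ($params as $name => $value) {
    printf("param %s = '%s'\n", $name, $value);
}
```

Note that parse_url() returns the port as an integer and the other components as strings, and that it omits any component missing from the URL, so real code should check with isset() before reading a key.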

• http is the protocol used to interact with the resource. Another example is https, which is equivalent to http on a connection using an SSL certificate for encryption.

• user:pass@ is an optional component used to instruct the client that Basic HTTP authentication is required to access the resource and that user and pass
