
php|architect's Guide to Web Scraping with PHP - Wind Business ...



HTTP Requests

The HTTP protocol is intended to give two parties a common method of communication: web clients and web servers. Clients are programs or scripts that send requests to servers. Examples of clients include web browsers, such as Internet Explorer and Mozilla Firefox, and crawlers, like those used by Yahoo! and Google to expand their search engine offerings. Servers are programs that run indefinitely and do nothing but receive client requests and send responses to them. Popular examples include Microsoft IIS and the Apache HTTP Server.
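
As a rough illustration of the exchange described above, the following sketch assembles the text of a minimal request, as a client might send it, and picks apart the first line of a response, as a server might return it. The helper names buildGetRequest and parseStatusLine, the host example.com, and the sample status line are assumptions made for illustration; they are not part of any library.

```php
<?php
// Hypothetical helper: assemble the text of a minimal HTTP/1.1 GET request.
function buildGetRequest(string $host, string $path = '/'): string
{
    return "GET {$path} HTTP/1.1\r\n"
         . "Host: {$host}\r\n"
         . "Connection: close\r\n"
         . "\r\n"; // a blank line ends the header block
}

// Hypothetical helper: split a response status line into its three parts.
function parseStatusLine(string $line): array
{
    // The reason phrase may be absent, so pad the result to three elements.
    [$version, $code, $reason] = array_pad(explode(' ', rtrim($line), 3), 3, '');
    return ['version' => $version, 'code' => (int) $code, 'reason' => $reason];
}

// Assumed sample values; no network connection is made here.
$request = buildGetRequest('example.com', '/index.html');
$status  = parseStatusLine("HTTP/1.1 200 OK\r\n");

echo $request;
echo $status['code'], ' ', $status['reason'], "\n"; // prints "200 OK"
```

Nothing is sent over the network in this sketch; it only shows the shape of the messages that a client and a server trade.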

You must be familiar enough with the anatomy and nuances of HTTP requests and responses to do two things. First, you must be able to configure and use your preferred client to view the requests and responses that pass between it and the server hosting the target application as you access it. This is essential to developing your web scraping application without expending an excessive amount of time and energy.

Second, you must be able to use most of the features offered by a PHP HTTP client library. Ideally, you would know HTTP and PHP well enough to build your own client library or fix issues with an existing one if necessary. In principle, however, you should first look for an adequate existing library and use it, building a reusable one of your own only as a last resort. We will examine some of these libraries in the next few chapters.

Supplemental References<br />

This book will cover HTTP in sufficient depth as it relates to web scraping, but should not in any respect be considered a comprehensive guide on the subject. Here are a few recommended references to supplement the material covered in this book.

• RFC 2616 Hypertext Transfer Protocol – HTTP/1.1 (http://www.ietf.org/rfc/rfc2616.txt)
• RFC 3986 Uniform Resource Identifier (URI): Generic Syntax (http://www.ietf.org/rfc/rfc3986.txt)
• “HTTP: The Definitive Guide” (ISBN 1565925092)
