php|architect's Guide to Web Scraping with PHP - Wind Business ...
HTTP Requests

The HTTP protocol is intended to give two parties, web clients and web servers,
a common method of communication. Clients are programs or scripts that send
requests to servers. Examples of clients include web browsers, such as Internet Explorer
and Mozilla Firefox, and crawlers, like those used by Yahoo! and Google to
expand their search engine offerings. Servers are programs that run indefinitely and
do nothing but receive client requests and send responses to them. Popular examples include
Microsoft IIS and the Apache HTTP Server.
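
To make the request and response exchange concrete, here is a minimal sketch in PHP of what a client sends and what it reads back. The helper names `buildGetRequest` and `parseStatusCode`, the host `example.com`, and the canned response text are all assumptions made for illustration; nothing is sent over the network, and a real client library handles far more than this.

```php
<?php
// Hypothetical helper: builds the text of a minimal HTTP/1.1 GET request.
function buildGetRequest(string $host, string $path = '/'): string
{
    $lines = [
        "GET {$path} HTTP/1.1",  // request line: method, path, protocol version
        "Host: {$host}",         // the Host header is required in HTTP/1.1
        "Connection: close",     // ask the server to close the connection afterwards
    ];
    // Lines are separated by CRLF; an empty line ends the header block.
    return implode("\r\n", $lines) . "\r\n\r\n";
}

// Hypothetical helper: extracts the numeric status code from a raw response.
function parseStatusCode(string $response): ?int
{
    // The status line is everything before the first CRLF.
    $statusLine = explode("\r\n", $response, 2)[0];
    if (preg_match('#^HTTP/\d\.\d (\d{3})#', $statusLine, $matches)) {
        return (int) $matches[1];
    }
    return null; // not a recognizable HTTP status line
}

// Assumed example values, not taken from the book.
$request = buildGetRequest('example.com', '/index.html');
$cannedResponse = "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html></html>";

echo $request;
echo parseStatusCode($cannedResponse), "\n";
```

The request is plain text: a request line, a few headers, and a blank line. The response starts with a status line in a similar shape, which is why a simple pattern match is enough to pull out the status code in this sketch.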

You must be familiar enough with the anatomy and nuances of HTTP requests
and responses to do two things. First, you must be able to configure and use your
preferred client to view the requests and responses that pass between it and the server
hosting the target application as you access it. This is essential to developing your
web scraping application without expending an excessive amount of time and energy.

Second, you must be able to use most of the features offered by a PHP HTTP client
library. Ideally, you would know HTTP and PHP well enough to build your own client
library or fix issues with an existing one if necessary. In principle, however, you
should first look for an adequate existing library and construct a reusable one of
your own only as a last resort. We will examine some of these libraries in the
next few chapters.

Supplemental References<br />
This book will cover HTTP in sufficient depth as it relates to web scraping, but should
not in any respect be considered a comprehensive guide on the subject. Here are a
few recommended references to supplement the material covered in this book.
• RFC 2616 HyperText Transfer Protocol – HTTP/1.1
(http://www.ietf.org/rfc/rfc2616.txt)
• RFC 3986 Uniform Resource Identifier (URI): Generic Syntax
(http://www.ietf.org/rfc/rfc3986.txt)
• “HTTP: The Definitive Guide” (ISBN 1565925092)