php|architect's Guide to Web Scraping with PHP - Wind Business ...
Chapter 2
HTTP

The first task that a web scraping application must be capable of performing is the
retrieval of documents containing the information to be extracted. If you have used
a web browser without becoming aware of all that it does “under the hood” to render
a page for your viewing pleasure, this may sound trivial to you. However, the complexity
of a web scraping application is generally proportional to the complexity of
the application it targets for retrieving and extracting data.

For targets consisting of multiple pages or requiring retention of session or authentication
information, some level of reverse-engineering is often required to develop
a corresponding web scraping application. Like a complex mathematics problem
with a very simple answer, the development of web scraping applications can
sometimes involve more analysis of the target than work to implement a script capable
of retrieving and extracting data from it.

This sort of reconnaissance requires a decent working knowledge of the HyperText
Transfer Protocol or HTTP, the protocol that powers the World Wide Web. The majority of
this chapter will focus on familiarization with that protocol. The goal is for you
to be able to research how a target application works well enough to write an
application that extracts the data you want.
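
As a small preview of what that research looks at, here is a minimal sketch in PHP of the kind of text a browser sends when it asks for a page. The function name buildGetRequest and the host example.com are hypothetical, chosen only for illustration. The sketch only builds the request as a string and does not contact any server.

```php
<?php
// Hypothetical helper (not part of PHP or any library): builds the text of a
// minimal HTTP/1.1 GET request for the given host and path.
function buildGetRequest(string $host, string $path = '/'): string
{
    // Each line of an HTTP request ends with CRLF ("\r\n"); an empty line
    // marks the end of the headers.
    return "GET {$path} HTTP/1.1\r\n"
         . "Host: {$host}\r\n"
         . "Connection: close\r\n"
         . "\r\n";
}

// example.com is a placeholder host used only for illustration.
echo buildGetRequest('example.com', '/index.html');
```

Running the script prints the request line, then the Host and Connection headers, then the blank line that ends the request. A real browser typically sends more headers than this, which is part of what analyzing a target involves.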