03.02.2014 Views

php|architect's Guide to Web Scraping with PHP - Wind Business ...


Chapter 2

HTTP

The first task that a web scraping application must be capable of performing is the retrieval of documents containing the information to be extracted. If you have used a web browser without becoming aware of all that it does “under the hood” to render a page for your viewing pleasure, this may sound trivial to you. However, the complexity of a web scraping application is generally proportional to the complexity of the application it targets for retrieving and extracting data.

For targets consisting of multiple pages or requiring retention of session or authentication information, some level of reverse-engineering is often required to develop a corresponding web scraping application. Like a complex mathematics problem with a very simple answer, the development of web scraping applications can sometimes involve more analysis of the target than work to implement a script capable of retrieving and extracting data from it.

This sort of reconnaissance requires a decent working knowledge of the HyperText Transfer Protocol or HTTP, the protocol that powers the web. The majority of this chapter will focus on familiarization with that protocol. The end goal is for you to be able to research how a target application works well enough to write an application that extracts the data you want.
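As a rough preview of what that familiarity involves, the sketch below shows the text a client might send to request a page and how a raw response can be split into its parts. It is only an illustration under stated assumptions: the helper names `buildGetRequest` and `parseResponse`, the host `example.com`, and the canned response string are hypothetical and not taken from the book, and no network connection is made.

```php
<?php
// Hypothetical helper: builds the text of a minimal HTTP/1.1 GET request.
function buildGetRequest(string $host, string $path = '/'): string
{
    return "GET {$path} HTTP/1.1\r\n"
         . "Host: {$host}\r\n"
         . "Connection: close\r\n"
         . "\r\n";
}

// Hypothetical helper: splits a raw response into status code, headers, and body.
function parseResponse(string $raw): array
{
    // A blank line separates the header section from the body.
    [$head, $body] = array_pad(explode("\r\n\r\n", $raw, 2), 2, '');
    $lines = explode("\r\n", $head);

    // The first line is the status line, e.g. "HTTP/1.1 200 OK".
    $statusParts = explode(' ', (string) array_shift($lines), 3);

    $headers = [];
    foreach ($lines as $line) {
        [$name, $value] = array_pad(explode(':', $line, 2), 2, '');
        // Header names are case-insensitive, so store them in lowercase.
        $headers[strtolower(trim($name))] = trim($value);
    }

    return [
        'status'  => (int) ($statusParts[1] ?? 0),
        'headers' => $headers,
        'body'    => $body,
    ];
}

// Canned response standing in for what a server might send back (an assumption).
$raw = "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html>Hello</html>";
$response = parseResponse($raw);

echo buildGetRequest('example.com', '/index.html');
echo $response['status'], ' ', $response['headers']['content-type'], "\n";
```

Reading a target's requests and responses at this level is the kind of analysis the chapter is preparing you for; in practice a client library would handle the formatting and parsing.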
