03.02.2014 Views

php|architect's Guide to Web Scraping with PHP - Wind Business ...


Rolling Your Own

Logic to separate individual headers must account for the ability of header values to span multiple lines as per RFC 2616 Section 2.2. As such, preg_match_all is used here to separate individual headers. See the later chapter on PCRE for more information on regular expressions. If a situation necessitates parsing data contained in URLs and query strings, check out the parse_url and parse_str functions. As with the request, it is generally desirable to parse response data into a data structure for ease of reference.
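As a rough illustration of the points above, the following sketch splits a raw header block into an associative array while tolerating values folded across lines, then breaks a URL and its query string apart. The function name parseHeaders, the sample header block, and the example.com URL are assumptions made for this example and are not taken from the book's own listing; the regular expression is one possible approach, not necessarily the one the author uses.

```php
<?php
// Hypothetical helper: turn a raw header block into [lowercase name => value].
function parseHeaders(string $raw): array
{
    $headers = [];
    // A header is "Name: value"; the value may continue on following
    // lines that begin with a space or tab (folded headers).
    $pattern = '/^([^:\s]+):[ \t]*(.*(?:\r?\n[ \t]+.*)*)/m';
    preg_match_all($pattern, $raw, $matches, PREG_SET_ORDER);
    foreach ($matches as $match) {
        $name = strtolower($match[1]);
        // Collapse each fold (line break plus leading whitespace) into one space.
        $value = trim(preg_replace('/\r?\n[ \t]+/', ' ', $match[2]));
        // Repeated headers are joined with a comma in this sketch.
        $headers[$name] = isset($headers[$name])
            ? $headers[$name] . ', ' . $value
            : $value;
    }
    return $headers;
}

// Sample data for illustration only; the Content-Type value is folded.
$rawHeaders = "HTTP/1.1 200 OK\r\n"
    . "Content-Type: text/html;\r\n"
    . " charset=UTF-8\r\n"
    . "Transfer-Encoding: chunked\r\n";
$headers = parseHeaders($rawHeaders);

// parse_url splits a URL into components; parse_str decodes a query string.
$parts = parse_url('http://example.com/search?q=php&page=2');
parse_str($parts['query'], $query);
```

Lowercasing the header names is a design choice made here so that lookups such as $headers['transfer-encoding'] do not depend on how the server capitalizes them.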

Transfer Encoding

Before parsing the body, the headers should be checked for a few things. If a Transfer-Encoding header is present and has a value of chunked, it means that the server is sending the response back in chunks rather than all at once. The advantage to this is that the server does not have to wait until the entire response is composed before starting to return it (in order to determine and include its length in the Content-Length header), which can increase overall server throughput.

When each chunk is sent, it is preceded by a hexadecimal number to indicate the size of the chunk followed by a CRLF sequence. The end of each chunk is also denoted by a CRLF sequence. The end of the body is denoted with a chunk size of 0, which is particularly important when using a persistent connection since the client must know where one response ends and the next begins.
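To make the wire format concrete, here is a minimal sketch that produces a chunked body from a plain string. It is written from the server's point of view purely for illustration; the function name encodeChunked and the chunk size used are assumptions for this example, and chunk extensions and trailers are left out.

```php
<?php
// Hypothetical helper: encode $data as a chunked body, using chunks of
// at most $size bytes.
function encodeChunked(string $data, int $size): string
{
    $encoded = '';
    // Guard the empty string, which older PHP versions split into [''].
    $chunks = $data === '' ? [] : str_split($data, $size);
    foreach ($chunks as $chunk) {
        // Hexadecimal chunk size, CRLF, chunk data, CRLF.
        $encoded .= dechex(strlen($chunk)) . "\r\n" . $chunk . "\r\n";
    }
    // A chunk size of 0 marks the end of the body; the final CRLF
    // closes the (empty) trailer section.
    return $encoded . "0\r\n\r\n";
}

$wire = encodeChunked('Hello, world!', 8);
```

With these inputs the 13-byte string is split into an 8-byte chunk and a 5-byte chunk, followed by the terminating zero-size chunk.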

The strstr function, called with its third parameter set to true, can be used to obtain characters in a string prior to a newline. To convert strings containing hexadecimal numbers to their decimal equivalents, see the hexdec function. An example of what these two might look like in action is included below. The example assumes that a response body has been written to a string.
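The book's own listing is not reproduced here, so what follows is only a sketch of how strstr and hexdec might be combined to decode a chunked body held in a string. The function name decodeChunked and the sample body are assumptions; chunk extensions are skipped, trailers are ignored, and malformed input simply ends the loop.

```php
<?php
// Hypothetical helper: decode a chunked response body held in a string.
function decodeChunked(string $body): string
{
    $decoded = '';
    while ($body !== '') {
        // Everything before the first CRLF is the chunk-size line.
        $sizeLine = strstr($body, "\r\n", true);
        if ($sizeLine === false) {
            break; // No CRLF left: treat as malformed and stop.
        }
        // Drop any chunk extension after ';' and convert hex to decimal.
        $size = hexdec(trim(explode(';', $sizeLine)[0]));
        if ($size == 0) {
            break; // A chunk size of 0 marks the end of the body.
        }
        // Chunk data starts after the size line and its CRLF.
        $offset = strlen($sizeLine) + 2;
        $decoded .= substr($body, $offset, $size);
        // Skip the chunk data and the CRLF that terminates it.
        $body = (string) substr($body, $offset + $size + 2);
    }
    return $decoded;
}

// Sample body for illustration: an 8-byte chunk, a 5-byte chunk, then the end marker.
$plain = decodeChunked("8\r\nHello, w\r\n5\r\norld!\r\n0\r\n\r\n");
```

In practice a check that the Transfer-Encoding header equals chunked would come before calling a function like this, since a body sent with Content-Length needs no decoding.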
