php|architect's Guide to Web Scraping with PHP - Wind Business ...
Rolling Your Own
Logic to separate individual headers must account for the ability of header values to
span multiple lines, as per RFC 2616 Section 2.2. As such, preg_match_all is used here
to separate individual headers. See the later chapter on PCRE for more information
on regular expressions. If a situation necessitates parsing data contained in URLs
and query strings, check out the parse_url and parse_str functions. As with the request,
it is generally desirable to parse response data into a data structure for ease of
reference.
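As a minimal sketch of the ideas above (not the approach used elsewhere in this chapter), the following separates headers with preg_match_all while allowing for values folded across lines, then breaks a URL apart with parse_url and parse_str. The header block, the URL, and the variable names are all hypothetical and chosen only for illustration.

```php
<?php
// Hypothetical raw header block; the X-Example value is folded onto a second line.
$rawHeaders = "Content-Type: text/html\r\n"
    . "X-Example: first part,\r\n"
    . "\tsecond part\r\n"
    . "Content-Length: 42\r\n";

// Capture a header name, then a value that may continue on following
// lines beginning with a space or tab (RFC 2616 line folding).
$pattern = '/^([^\s:]+):[ \t]*([^\r\n]*(?:\r\n[ \t]+[^\r\n]*)*)/m';
preg_match_all($pattern, $rawHeaders, $matches, PREG_SET_ORDER);

$headers = [];
foreach ($matches as $match) {
    // Collapse each fold (CRLF plus leading whitespace) into a single space.
    $value = preg_replace('/\r\n[ \t]+/', ' ', $match[2]);
    // Header names are case-insensitive, so keys are lowercased here.
    $headers[strtolower($match[1])] = trim($value);
}

// Split a URL into components, then its query string into an array.
$parts = parse_url('http://example.com/search?q=php&page=2');
parse_str($parts['query'], $query);
```

With this input, `$headers['x-example']` holds the unfolded value `first part, second part`, and `$query` holds the `q` and `page` parameters as strings. Lowercasing the keys is a design choice of this sketch that makes later lookups independent of how the server capitalized the header names.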
Transfer Encoding
Before parsing the body, the headers should be checked for a few things. If a
Transfer-Encoding header is present and has a value of chunked, it means that the
server is sending the response back in chunks rather than all at once. The advantage
to this is that the server does not have to wait until the entire response is composed
before starting to return it (in order to determine and include its length in the
Content-Length header), which can increase overall server throughput.
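The check described above might look something like the following. This is only a sketch: it assumes the response headers have already been parsed into an array keyed by lowercased header name, and the isChunked function name is hypothetical.

```php
<?php
// Hypothetical parsed headers, keyed by lowercased header name.
$headers = [
    'content-type'      => 'text/html',
    'transfer-encoding' => 'chunked',
];

/**
 * Returns true when the Transfer-Encoding header is present and has a
 * value of "chunked". The value is compared case-insensitively.
 */
function isChunked(array $headers)
{
    if (!isset($headers['transfer-encoding'])) {
        return false;
    }
    return strcasecmp(trim($headers['transfer-encoding']), 'chunked') === 0;
}

$chunked = isChunked($headers);
```

The sketch only handles the case where chunked is the sole value of the header; a response that lists several transfer codings would need additional handling.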
Each chunk is preceded by its size in bytes, written as a hexadecimal number and
followed by a CRLF sequence. The end of each chunk is also denoted by a CRLF
sequence. The end of the body is denoted with a chunk size of 0, which is
particularly important when using a persistent connection since the client
must know where one response ends and the next begins.
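To make the framing concrete, here is a small sketch that builds a chunked body from a list of strings. It is written from the server's point of view purely to illustrate the wire format; the encodeChunked function name and the sample data are hypothetical, and optional trailer headers are left out.

```php
<?php
/**
 * Frames each string as one chunk: hexadecimal size, CRLF, data, CRLF.
 * A zero-size chunk followed by a final CRLF marks the end of the body.
 */
function encodeChunked(array $chunks)
{
    $body = '';
    foreach ($chunks as $chunk) {
        if ($chunk === '') {
            continue; // a zero-size chunk would end the body prematurely
        }
        $body .= dechex(strlen($chunk)) . "\r\n" . $chunk . "\r\n";
    }
    return $body . "0\r\n\r\n";
}

// A 4-byte chunk followed by a 26-byte chunk (26 is 1a in hexadecimal).
$body = encodeChunked(['Wiki', 'abcdefghijklmnopqrstuvwxyz']);
```

Here `$body` contains `4`, CRLF, `Wiki`, CRLF, `1a`, CRLF, the 26 letters, CRLF, and finally `0` followed by two CRLF sequences. Note that the size is hexadecimal, so the second chunk is announced as `1a` rather than `26`.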
The strstr function can be used to obtain the characters in a string prior to a newline
(by passing true as its third parameter, before_needle, which is available as of PHP 5.3).
To convert strings containing hexadecimal numbers to their decimal equivalents,
see the hexdec function. An example of what these two might look like in action is
included below. The example assumes that a response body has been written to a
string.