
not be replicated exactly using other libraries, it is possible to run multiple requests on separate connections using processes that are executed in parallel.

Even if you are using a library that supports connection pooling, this technique is useful when multiple hosts are being scraped, since each host will require a separate connection anyway. By contrast, issuing those requests in a single process makes it possible for requests sent earlier to a host with a slower response rate to block those sent later to a more responsive host.

See Appendix B for a more detailed example of this.
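As a rough sketch of the idea (assuming the pcntl extension is available; the URLs below are placeholders and Appendix B contains a fuller treatment), each request can be issued from its own forked process:

<?php

// Placeholder URLs; each is fetched by a separate child process.
$urls = array(
    'http://example.com/a',
    'http://example.org/b',
);

foreach ($urls as $url) {
    $pid = pcntl_fork();
    if ($pid == -1) {
        die('Unable to fork');
    }
    if ($pid == 0) {
        // Child process: each child uses its own connection, so a slow
        // host cannot block requests sent to a more responsive one.
        $markup = file_get_contents($url);
        // ... analyze $markup ...
        exit(0);
    }
}

// Parent process: wait for all children to finish.
while (pcntl_waitpid(0, $status) != -1);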

Crawlers

Some web scraping applications are intended to serve as crawlers that index content from web sites. Like all other web scraping applications, the work they perform can be divided into two categories: retrieval and analysis. The parallel processing approach is applicable here because each category of work serves to populate the work queue of the other.

The retrieval process is given one or more initial documents to retrieve. Each time a document is retrieved, it becomes a job for the analysis process, which scrapes the markup searching for links (a elements) to other documents, possibly restricted by one or more relevancy factors. Once analysis of a document is complete, the addresses of any documents not yet retrieved are fed back to the retrieval process.
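A single-process version of this feedback loop might look something like the following sketch; the starting URL, the same-host relevancy check, and the use of file_get_contents and DOMDocument are assumptions made here for illustration:

<?php

$queue = array('http://example.com/'); // placeholder starting document
$seen  = array();

while (!empty($queue)) {
    $url = array_shift($queue);
    if (isset($seen[$url])) {
        continue;
    }
    $seen[$url] = true;

    // Retrieval: fetch the document.
    $markup = @file_get_contents($url);
    if ($markup === false) {
        continue;
    }

    // Analysis: scrape the markup for links (a elements).
    $doc = new DOMDocument();
    @$doc->loadHTML($markup);
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $href = $anchor->getAttribute('href');

        // Simple relevancy factor: absolute http(s) URLs on the same host.
        if (strpos($href, 'http') !== 0) {
            continue;
        }
        if (parse_url($href, PHP_URL_HOST) !== parse_url($url, PHP_URL_HOST)) {
            continue;
        }

        // Feed unretrieved documents back to the retrieval queue.
        if (!isset($seen[$href])) {
            $queue[] = $href;
        }
    }
}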

This cycle of mutual supply will, hypothetically, be sustained until no unindexed documents considered relevant remain. At that point, the process can be restarted, with the retrieval process using appropriate request headers to check for document updates and feeding documents to the analysis process only where updates are found.
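One way to perform such an update check is a conditional GET. The sketch below uses the cURL extension with an If-Modified-Since header; the URL and the stored timestamp are placeholders:

<?php

$url = 'http://example.com/page.html';
// Timestamp of the last successful retrieval (placeholder value).
$lastFetched = gmdate('D, d M Y H:i:s \G\M\T', strtotime('-1 day'));

$ch = curl_init($url);
curl_setopt_array($ch, array(
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_HTTPHEADER     => array('If-Modified-Since: ' . $lastFetched),
));
$body = curl_exec($ch);
$status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);

if ($status == 304) {
    // Not modified: nothing new to feed to the analysis process.
} elseif ($status == 200) {
    // Updated: hand $body back to the analysis process.
}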

Forms

Some web scraping applications must push data to the target application. This is generally accomplished using HTTP POST requests that simulate the submission of HTML forms. Before such requests can be sent, however, there are a few events that must first take place.
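As a minimal sketch of such a simulated form submission (the URL and field names below are placeholders, and the cURL extension is assumed rather than any particular client covered earlier):

<?php

// Placeholder form fields to submit.
$fields = array(
    'username' => 'example',
    'password' => 'secret',
);

$ch = curl_init('http://example.com/login');
curl_setopt_array($ch, array(
    CURLOPT_POST           => true,
    CURLOPT_POSTFIELDS     => http_build_query($fields),
    CURLOPT_RETURNTRANSFER => true,
));
$response = curl_exec($ch);
curl_close($ch);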
