php|architect's Guide to Web Scraping with PHP - Wind Business ...

Types of Parsers

Before going much further, you should be aware that there are two types of XML parsers: tree parsers and pull parsers. Tree parsers load the entire document into memory and allow you to access any part of it at any time as well as manipulate it. Pull parsers read the document a piece at a time and limit you to working with the current piece being read.

The two types of parsers share a relationship similar to that between the file_get_contents and fgets functions: the former lets you work with the entire document at once and uses as much memory as is needed to store it, while the latter allows you to work with a piece of the document at a time and use less memory in the process.

When working with fairly large documents, lower memory usage is generally the preferable option. Attempting to load a huge document into memory all at once has the same effect on the local system as a throttling client does on a web server: in both cases, resources are consumed and system performance is debilitated until the system eventually locks up or crashes under the stress.

The DOM extension is a tree parser. In general, web scraping does not require the ability to access all parts of the document simultaneously. However, the type of data extraction involved in web scraping can be rather extensive to implement using a pull parser. The appropriateness of one extension over the other depends on the size and complexity of the document.

Loading Documents

The DOMDocument class is where use of the DOM extension begins. The first thing to do is instantiate it and then feed it the validated markup data. Note that the DOM extension will emit warnings when a document is loaded if that document is not valid or well-formed. To avoid this, see the previous chapter on using the tidy extension. If tidy does not eliminate the issue, errors can be controlled as shown in the example below. Note that errors are buffered until manually cleared, so make a point of clearing them after each load operation if they are not needed to avoid wasting memory.
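The following is a minimal sketch of how buffered errors might be inspected before they are cleared. The helper name loadHtmlWithErrors and the sample markup are assumptions made for illustration and are not part of the DOM extension. The line, level, and message properties belong to the LibXMLError objects that libxml_get_errors returns.

```php
<?php
// Hypothetical helper (an assumption, not from the DOM extension): loads
// markup from a string and returns the document together with any libxml
// errors that were buffered while parsing it.
function loadHtmlWithErrors(string $html): array
{
    // Buffer errors instead of emitting warnings, remembering the old setting
    $oldSetting = libxml_use_internal_errors(true);

    $doc = new DOMDocument;
    $doc->loadHTML($html);

    // Copy the buffered errors, then clear the buffer to free memory
    $errors = libxml_get_errors();
    libxml_clear_errors();

    // Revert error buffering to its previous setting
    libxml_use_internal_errors($oldSetting);

    return [$doc, $errors];
}

// Deliberately malformed sample markup, invented for this sketch
[$doc, $errors] = loadHtmlWithErrors('<p>Unclosed <b>bold<p>text</i>');

foreach ($errors as $error) {
    // Each entry is a LibXMLError object
    printf(
        "Line %d, level %d: %s\n",
        $error->line,
        $error->level,
        trim($error->message)
    );
}
```

Returning the errors from the helper leaves the buffer empty between load operations, which is the habit recommended above. Which errors are reported for a given piece of markup depends on the installed libxml version.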
<?php
// Buffer DOM errors rather than emitting them as warnings
$oldSetting = libxml_use_internal_errors(true);

// Instantiate a container for the document
$doc = new DOMDocument;

// Load markup already contained within a string
$doc->loadHTML($htmlString);

// Load markup saved to an external file
$doc->loadHTMLFile($htmlFilePath);

// Get all errors if needed
$errors = libxml_get_errors();

// Get only the last error
$error = libxml_get_last_error();

// Clear any existing errors from previous operations
libxml_clear_errors();

// Revert error buffering to its previous setting
libxml_use_internal_errors($oldSetting);
?>

Tree Terminology

Once a document is loaded, the next natural step is to extract desired data from it. However, doing so requires a bit more knowledge about how the DOM is structured. Recall the earlier mention of tree parsers. If you have any computer science background, you will be glad to know that the term “tree” in the context of tree parsers does in fact refer to the data structure by the same name. If not, here is a brief rundown of related concepts.

A tree is a hierarchical structure (think family tree) composed of nodes, which exist in the DOM extension as the DOMNode class. Nodes are to trees what elements are to arrays: just items that exist within the data structure. Each individual node can have zero or more child nodes that are collectively represented by a childNodes property in the DOMNode class. childNodes is an instance of the class DOMNodeList, which is exactly what it sounds like. Other related properties include ...
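The node and childNodes relationship described above can be sketched as a short recursive walk. The describeTree helper and the sample markup are assumptions made for illustration; only DOMDocument, DOMNode, and the nodeName, childNodes, and documentElement properties and the hasChildNodes method come from the DOM extension itself.

```php
<?php
// Sample markup invented for this sketch
$doc = new DOMDocument;
$doc->loadHTML('<html><body><ul><li>One</li><li>Two</li></ul></body></html>');

// Hypothetical helper: returns one line per node, indented by tree depth
function describeTree(DOMNode $node, int $depth = 0): array
{
    $lines = [str_repeat('  ', $depth) . $node->nodeName];

    // Only descend when the node actually has children
    if ($node->hasChildNodes()) {
        // childNodes is a DOMNodeList, which can be iterated with foreach
        foreach ($node->childNodes as $child) {
            $lines = array_merge($lines, describeTree($child, $depth + 1));
        }
    }

    return $lines;
}

// documentElement is the root element node of the loaded document
$lines = describeTree($doc->documentElement);
echo implode("\n", $lines), "\n";
```

Text content appears in the tree as nodes of its own (named #text), which is why each li element in the sample has a child node even though it contains no nested elements.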
