php|architect's Guide to Web Scraping with PHP - Wind Business ...

Legality of Web Scraping

• “Amazon grants you a limited license to access and make personal use of this site ... This license does not include ... any use of data mining, robots, or similar data gathering and extraction tools.” – Amazon Conditions of Use, LICENSE AND SITE ACCESS section as of 2/14/10
• “You agree that you will not use any robot, spider, scraper or other automated means to access the Sites for any purpose without our express written permission.” – eBay User Agreement, Access and Interference section as of 2/14/10
• “... you agree not to: ... access, monitor or copy any content or information of this Website using any robot, spider, scraper or other automated means or any manual process for any purpose without our express written permission; ...” – Expedia, Inc. Web Site Terms, Conditions, and Notices, PROHIBITED ACTIVITIES section as of 2/14/10
• “The foregoing licenses do not include any rights to: ... use any robot, spider, data miner, scraper or other automated means to access the Barnes & Noble.com Site or its systems, the Content or any portion or derivative thereof for any purpose; ...” – Barnes & Noble Terms of Use, Section I LICENSES AND RESTRICTIONS as of 2/14/10

Determining whether or not the web site in question has a TOS document will be the first step. If you find one, look for clauses using language similar to that of the above examples. Also, look for any broad “blanket” clauses of prohibited activities under which web scraping may fall.

If you find a TOS document and it does not expressly forbid web scraping, the next step is to contact representatives who have authority to speak on behalf of the organization that owns the web site. Some organizations may allow web scraping assuming that you secure permission with appropriate authorities beforehand.
When obtaining this permission, it is best to obtain a document in writing and on official letterhead that clearly indicates that it originated from the organization in question. This has the greatest chance of mitigating any legal issues that may arise.

If intellectual property-related allegations are brought against an individual as a result of usage of an automated agent or information acquired by one, assuming the individual did not violate any TOS agreement imposed by its owner or related computer use laws, a court decision will likely boil down to whether or not the usage of said information is interpreted as “fair use” with respect to copyright laws in the geographical area in which the alleged offense took place.

Please note that these statements are very general and are not intended to replace the consultation of an attorney. If TOS agreements or lack thereof and communications with the web site owner prove inconclusive, it is highly advisable to seek legal counsel before launching an automated agent on a web site. This is another reason why web scraping is a less-than-ideal approach to solving the problem of data acquisition and why it should be considered only in the absence of alternatives.

Some sites actually use license agreements to grant open or mildly restricted usage rights for their content. Common licenses to this end include the GNU Free Documentation License and the Creative Commons licenses. In instances where the particular data source being used to acquire data is not relevant, sources that use licenses like these should be preferred over those that do not, as legalities are significantly less likely to become an issue.

The second point of inspection is the legitimacy of the web site as the originating source of the data to be harvested. Even large companies with substantial legal resources, such as Google, have run into issues when their automated agents acquired content from sites illegally syndicating other sites. In some cases, sites will attribute their sources, but in many cases they will not. For textual content, entering direct quotations that are likely to be unique from the site into major search engines is one method that can help to determine if the site in question originated the data. It may also provide some indication as to whether or not syndicating that data is legal.
For non-textual data, make educated guesses as to keywords that correspond to the subject and try using a search engine specific to that particular data format. Searches like this are not intended to be extensive or definitive indications, but merely a quick way of ruling out an obvious syndication of an original data source.
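
The quotation check described above for textual content can be sketched in PHP. This is only an illustrative sketch: the function name buildQuoteQueries, the rule of preferring the longest sentences, the 12-word cap, and the search.example base URL are all assumptions that do not come from the text, and the code assumes PHP 7 or later. It only builds exact-phrase queries to paste into or link to a search engine; it sends no requests itself.

```php
<?php
// Hypothetical helper: pick the longest sentences from a block of text and
// turn each into an exact-phrase (double-quoted) search query.
function buildQuoteQueries(string $text, int $count = 2, int $maxWords = 12): array
{
    // Split after sentence-ending punctuation that is followed by whitespace.
    $sentences = preg_split('/(?<=[.!?])\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY);

    // Assumption: longer sentences are more likely to be unique to one site.
    usort($sentences, function (string $a, string $b): int {
        return strlen($b) <=> strlen($a);
    });

    $queries = [];
    foreach (array_slice($sentences, 0, $count) as $sentence) {
        // Keep only the first $maxWords words so the phrase stays manageable.
        $words  = array_slice(preg_split('/\s+/', $sentence), 0, $maxWords);
        $phrase = rtrim(implode(' ', $words), '.!?');
        // Drop embedded double quotes so they cannot break the exact phrase.
        $phrase = str_replace('"', '', $phrase);
        $queries[] = '"' . $phrase . '"';
    }

    return $queries;
}

// Assumed placeholder; substitute the search engine you actually use.
$searchBase = 'https://search.example/?q=';

$sample = 'Short one. The quick brown fox jumps over the lazy dog near the old mill. Tiny.';
foreach (buildQuoteQueries($sample, 1) as $query) {
    echo $searchBase . urlencode($query), PHP_EOL;
}
```

Building the query instead of fetching the results keeps the check manual, which avoids sending automated traffic to a search engine whose own terms may restrict it.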