php|architect's Guide to Web Scraping with PHP - Wind Business ...
Legality of Web Scraping
of said information is interpreted as “fair use” with respect to copyright laws in the geographical area in which the alleged offense took place.
Please note that these statements are very general and are not intended to replace the consultation of an attorney. If TOS agreements (or the lack thereof) and communications with the web site owner prove inconclusive, it is highly advisable to seek legal counsel before launching an automated agent on a web site. This is another reason why web scraping is a less-than-ideal approach to solving the problem of data acquisition and why it should be considered only in the absence of alternatives.
Some sites actually use license agreements to grant open or mildly restricted usage rights for their content. Common licenses to this end include the GNU Free Documentation License and the Creative Commons licenses. In instances where the particular data source being used to acquire data is not relevant, sources that use licenses like these should be preferred over those that do not, as legalities are significantly less likely to become an issue.
The second point of inspection is the legitimacy of the web site as the originating source of the data to be harvested. Even large companies with substantial legal resources, such as Google, have run into issues when their automated agents acquired content from sites illegally syndicating other sites. In some cases, sites will attribute their sources, but in many cases they will not.
For textual content, entering direct quotations that are likely to be unique from the site into major search engines is one method that can help to determine if the site in question originated the data. It may also provide some indication as to whether or not syndicating that data is legal.
For non-textual data, make educated guesses as to keywords that correspond to the subject and try using a search engine specific to that particular data format.
Searches like this are not intended to be extensive or definitive, but merely a quick way of ruling out an obvious syndication of an original data source.