03.02.2014 Views

php|architect's Guide to Web Scraping with PHP - Wind Business ...


Legality of Web Scraping

of said information is interpreted as “fair use” with respect to copyright laws in the geographical area in which the alleged offense took place.

Please note that these statements are very general and are not intended to replace the consultation of an attorney. If the TOS agreement (or the lack of one) and communications with the web site owner prove inconclusive, it is highly advisable to seek legal counsel before launching an automated agent on a web site. This is another reason why web scraping is a less-than-ideal approach to solving the problem of data acquisition and why it should be considered only in the absence of alternatives.

Some sites actually use license agreements to grant open or mildly restricted usage rights for their content. Common licenses to this end include the GNU Free Documentation License and the Creative Commons licenses. In instances where the particular source used to acquire the data is not important, sources that use licenses like these should be preferred over those that do not, as legal issues are significantly less likely to arise.

The second point of inspection is the legitimacy of the web site as the originating source of the data to be harvested. Even large companies with substantial legal resources, such as Google, have run into issues when their automated agents acquired content from sites illegally syndicating other sites. In some cases, sites will attribute their sources, but in many cases they will not.

For textual content, entering direct quotations from the site that are likely to be unique into major search engines is one method that can help to determine whether the site in question originated the data. It may also provide some indication as to whether or not syndicating that data is legal.

For non-textual data, make educated guesses as to keywords that correspond to the subject and try using a search engine specific to that particular data format.

Searches like this are not intended to be exhaustive or to give definitive answers, but merely to provide a quick way of ruling out an obvious syndication of an original data source.
