02.11.2014 Views

untangling_the_web


DOCID: 4046925

UNCLASSIFIED//FOR OFFICIAL USE ONLY

respects these mechanisms. Password protection, firewalls, and other measures will generally keep spiders from crawling a website and indexing it.

The Web Robots Pages ...

Robots Exclusion

Sometimes people find they have been indexed by an indexing robot, or that a resource discovery robot has visited part of a site that for some reason shouldn't be visited by robots.

In recognition of this problem, many Web Robots offer facilities for Web site administrators and content providers to limit what the robot does. This is achieved through two mechanisms:

The Robots Exclusion Protocol
A Web site administrator can indicate which parts of the site should not be visited by a robot, by providing a specially formatted file on their site, in http://.../robots.txt.

The Robots META tag
A Web author can indicate if a page may or may not be indexed, or analysed for links, through the use of a special HTML META tag.

The remainder of this page provides full details on these facilities.

Note that these methods rely on cooperation from the Robot, and are by no means guaranteed to work for every Robot. If you need stronger protection from robots and other agents, you should use alternative methods such as password protection.
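The cooperation described in the note above can be illustrated with a minimal sketch of what a well-behaved robot might do before fetching a page. It uses Python's standard `urllib.robotparser` module. The robots.txt rules, the `ExampleBot` name, the `www.example.com` URLs, and the `can_crawl` helper are all hypothetical and chosen only for illustration; nothing here reflects how any particular search engine's spider is actually implemented.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real robot would download this file
# from the site it intends to crawl.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /tmp/
"""


def can_crawl(robots_txt, user_agent, url):
    """Return True if the given rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() accepts the file's lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "http://www.example.com/index.html",
        "http://www.example.com/private/report.html",
    ):
        verdict = "allowed" if can_crawl(ROBOTS_TXT, "ExampleBot", url) else "excluded"
        print(url, "->", verdict)
```

Nothing in this check is enforced by the Web server. A robot that never calls something like `can_crawl` can still request the excluded pages, which is why the note recommends password protection when stronger control is needed.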

Robots Exclusion Page
http://www.robotstxt.org/wc/exclusion.html
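The second mechanism, the Robots META tag, can be sketched in a similar way. The example below uses Python's standard `html.parser` module to read the directives from a page's `<meta name="robots">` tag and decide whether a cooperating robot should index the page or follow its links. The sample page and the names `RobotsMetaParser`, `robots_directives`, `may_index`, and `may_follow` are assumptions made for illustration, and the sketch handles only the common `noindex`, `nofollow`, and `none` values.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives found in <meta name="robots" ...> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        # Directives are comma separated and case insensitive.
        for token in content.split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def robots_directives(html):
    """Return the set of robots META directives declared in the page."""
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    return parser.directives


def may_index(html):
    """True unless the page asks robots not to index it."""
    return not ({"noindex", "none"} & robots_directives(html))


def may_follow(html):
    """True unless the page asks robots not to follow its links."""
    return not ({"nofollow", "none"} & robots_directives(html))


# Hypothetical page that asks not to be indexed but allows link following.
PAGE = '<html><head><meta name="ROBOTS" content="NOINDEX, FOLLOW"></head><body>Hello</body></html>'
```

For the sample `PAGE`, a cooperating robot would skip indexing but could still analyse the page for links. As with robots.txt, the tag is only a request, and a robot is free to ignore it.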

Not every search engine has its own proprietary search program; some rely instead upon another company's search service for their results. Most of these strategic alliances now involve Yahoo, Google, and Windows Live Search. All these partnerships are subject to change without notice; for more on these strategic alliances, see:

Search Engine Alliances
http://searchenginewatch.com/reports/alliances.html

Knowing that Yahoo, for example, is the search tool behind a search engine can save you time because you can be pretty sure that using AltaVista will get you similar (although not identical) results to the other search engines also powered by Yahoo. It is critical to remember that each service powered by a particular search engine produces different results even though they may all use the same core database. Why is this? Because the search interfaces have their own algorithms that decide how queries are run, how results are returned, or even if they query the entire database (most do not). In short, go to the primary search engine (Google, Yahoo, or Live Search) for best results.

