02.11.2014 Views

untangling_the_web

DID: 4046925

UNCLASSIFIED//FOR OFFICIAL USE ONLY

engine spiders. Indeed, without them, we would have little or no idea what is "out there" and available to us. The problem for webmasters is that it is their responsibility to keep the search engine spiders out of any parts of their websites they do not want to be accessed and indexed by a search engine. The spider is not smart; it simply knows that if a "door" is open, it can, and will, go in and crawl around. Webmasters must tell spiders "do not enter" (primarily) by the use of the Robots Exclusion Protocol.

Robots Exclusion [62] comes in two basic flavors: either a metatag that can be inserted into the HTML of a web page (usually used by an individual) or a Robots Exclusion Protocol (robots.txt) file, a specially formatted file placed on the site by the website administrator to tell the spider which parts of the website may and may not be indexed. If a robots exclusion is missing or improperly configured, the spider will index pages that the website owner may not have wished to be accessed.
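To make the two flavors concrete, here is a minimal sketch in Python using only the standard library. The sample page, the robots.txt rules, the paths, and the "ExampleBot" spider name are all hypothetical, chosen for illustration; the helper names page_allows_indexing and robots_txt_allows are likewise assumptions, not part of any standard tool. It shows how a well-behaved spider might read each kind of exclusion, and what happens when no exclusion is present.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

# Flavor 1: a robots metatag in the HTML of one page (hypothetical page).
PAGE_HTML = """<html><head>
<meta name="robots" content="noindex, nofollow">
<title>Internal memo</title>
</head><body>Not for the index.</body></html>"""

# Flavor 2: a robots.txt file for the whole site (hypothetical paths).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /drafts/
"""


class RobotsMetaFinder(HTMLParser):
    """Collects the directives from any <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            for directive in content.split(","):
                if directive.strip():
                    self.directives.add(directive.strip().lower())


def page_allows_indexing(html):
    """Return False if the page carries a noindex (or none) robots metatag."""
    finder = RobotsMetaFinder()
    finder.feed(html)
    return not ({"noindex", "none"} & finder.directives)


def robots_txt_allows(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(page_allows_indexing(PAGE_HTML))  # False: the metatag says noindex
    print(robots_txt_allows(ROBOTS_TXT, "ExampleBot",
                            "https://example.com/private/report.doc"))  # False
    print(robots_txt_allows(ROBOTS_TXT, "ExampleBot",
                            "https://example.com/index.html"))  # True
    # A missing (empty) robots.txt leaves every door open.
    print(robots_txt_allows("", "ExampleBot",
                            "https://example.com/private/report.doc"))  # True
```

The last call illustrates the failure case described above: with no exclusion in place, the same private document is fair game for the spider.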

The whole problem of keeping information on the Internet private dramatically worsened almost overnight a couple of years ago when Google quietly started indexing whole new types of data. Originally, most of what got spidered and indexed was HTML webpages and documents, with some plain text thrown in for good measure. However, the ever-innovative Google decided this wasn't good enough and started to index PDF, PostScript, and, most importantly, a whole range of Microsoft file types: Word, Excel, PowerPoint, and Access. Problem was, lots of folks had assumed these file types were "immune" to spidering, when in fact they had gone unindexed not because it couldn't be done but because no one had yet done it. As a result, many companies, organizations, and even governments had quite a lot of egg on their faces when sensitive documents began turning up in the Google database.

That was then, this is now. You might think people would have learned, but judging by the amount of "sensitive" information still available, many have not. Even though search engines now routinely index many non-HTML file types, many individuals and organizations still do not protect these files from the long reach of search engine spiders. Furthermore, there are many ways for sensitive information to end up in search engine databases. An improperly configured server, security holes, and unpatched software can give search engine spiders unintended access. Quite frankly, most of the problems boil down to one thing: human error, either through ignorance or neglect.
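As a rough illustration of how such an oversight might be caught, the sketch below checks a list of site paths against a robots.txt file and reports the non-HTML documents a spider would still be allowed to fetch. Everything here is an assumption made for the example: the function name exposed_documents, the sample rules and paths, and the choice of file extensions (one common extension for each of the PDF, PostScript, Word, Excel, PowerPoint, and Access types).

```python
from pathlib import PurePosixPath
from urllib.robotparser import RobotFileParser

# Assumed extensions for the non-HTML file types spiders now index.
DOCUMENT_EXTENSIONS = {".pdf", ".ps", ".doc", ".xls", ".ppt", ".mdb"}


def exposed_documents(robots_txt, paths, user_agent="*"):
    """Return the document paths that robots.txt does not close to spiders."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    exposed = []
    for path in paths:
        is_document = PurePosixPath(path).suffix.lower() in DOCUMENT_EXTENSIONS
        if is_document and parser.can_fetch(user_agent, path):
            exposed.append(path)
    return exposed


if __name__ == "__main__":
    # Hypothetical site: only /internal/ has been closed to spiders.
    robots_txt = "User-agent: *\nDisallow: /internal/\n"
    site_paths = [
        "/index.html",
        "/internal/plan.doc",
        "/reports/budget.xls",
        "/press/brochure.PDF",
    ]
    print(exposed_documents(robots_txt, site_paths))
    # ['/reports/budget.xls', '/press/brochure.PDF']
```

In this made-up case the administrator remembered the internal directory but forgot that the spreadsheet under /reports/ is just as reachable as any web page.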

What kinds of sensitive information can routinely be found using search engines?

The types of data most commonly discovered by Google hackers usually fall into one of these categories:

62 For additional information, see: (14 November 2006).<br />

