02.11.2014 Views

untangling_the_web


DOCID: 4046925

UNCLASSIFIED//FOR OFFICIAL USE ONLY

Uncovering the "Invisible" Internet

One of the most frustrating things about Internet search tools is the fact that even the best index only a portion of the web, much less the entire Internet. The deep (aka hidden or invisible) web continues to elude most search services and users seeking to plumb its depths. We are still, for the most part, dependent upon specialty tools and sites to help us find and exploit deep web resources. The challenge is how to access that part of the web that remains invisible to search engines. It is important to understand that search engines are generally designed to index a certain subset of the Internet: web pages and, in some cases, certain types of files, e.g., video, audio, PDF.[91] Furthermore, most search engines limit their web page and document indexing. For example, Google used to index approximately the first 100KB of HTML, and reportedly the first megabyte of PDF documents, but in October 2005, Google dramatically increased the size of its cache limit, although no one knows for sure what that limit is. Yahoo indexes at least the first 500KB of HTML and PDF documents. In any event, long documents usually are partially invisible to these and other search engines. You cannot rely upon a search engine spider to index long documents in their entirety.
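To get a rough feel for how much of a long document such a cutoff leaves invisible, the arithmetic can be sketched in a few lines of Python. This is only an illustration: the limits are the figures quoted above (which the search engines have never formally documented), 1KB is assumed to mean 1,024 bytes, and the names `ASSUMED_LIMITS`, `indexed_fraction`, and `unindexed_bytes` are hypothetical helpers, not part of any search engine's tools.

```python
KB = 1024  # assumption: 1KB = 1,024 bytes

# Hypothetical table of cutoffs, taken from the figures quoted in the text.
# Actual limits are undocumented and have changed over time.
ASSUMED_LIMITS = {
    "google_html_before_oct_2005": 100 * KB,
    "google_pdf_before_oct_2005": 1024 * KB,
    "yahoo_html_and_pdf": 500 * KB,
}


def indexed_fraction(doc_size_bytes: int, limit_bytes: int) -> float:
    """Return the share of a document that falls inside the indexing cutoff."""
    if doc_size_bytes <= 0:
        return 1.0  # nothing to miss in an empty document
    return min(doc_size_bytes, limit_bytes) / doc_size_bytes


def unindexed_bytes(doc_size_bytes: int, limit_bytes: int) -> int:
    """Return how many trailing bytes a spider would never see."""
    return max(0, doc_size_bytes - limit_bytes)


# Example: a 750KB report measured against each assumed cutoff.
report_size = 750 * KB
for name, limit in ASSUMED_LIMITS.items():
    share = indexed_fraction(report_size, limit)
    missed = unindexed_bytes(report_size, limit) // KB
    print(f"{name}: {share:.0%} indexed, {missed}KB invisible")
```

Under these assumptions, any search term that appears only in the trailing, unindexed portion of the file would never match, which is why a long document can seem to be missing from a search engine even though its first pages are indexed.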

A9 Search<br />

At the end of September 2006, A9, the Amazon.com-owned search property, made sweeping changes, some good and some bad. Contrary to what some search bloggers said, A9 is not "dead" (at least not yet). But some of A9's best features are gone. As I feared, not enough people used the wonderful "street view" map resource and now it is gone. As of September 29, 2006, A9 "discontinued A9 Maps and the A9 Yellow Pages (including BlockView) ... [and] discontinued the A9 Instant Reward program, and the A9 Toolbar and personalized services such as history, bookmarks, and diary." Other changes include "a new continuous scrolling feature, so you no longer have to bother with next and previous buttons to move from one page of results to the next. You can now also drag-and-drop the columns to change

[91] Google was the first major search engine to routinely index the content of many file types, including pdf, ps, xls, doc, ppt, and others. See "Google's Frequently Asked Questions - File Types," (14 November 2006).

