02.11.2014 Views

untangling_the_web


DOCID: 4046925

UNCLASSIFIED//FOR OFFICIAL USE ONLY

Google

Google first gained fame and widespread use because of its single-minded focus on
search, exemplified by its "clean" interface, and its PageRank "weighted link
popularity." In simple terms, Google gives each webpage a rank based on the
number of other pages linking to it and the "importance" of those pages, where
importance is derived from an overall link count. While PageRank is imperfect, it
works better than most other approaches to ranking search results and, indeed, is
one of the primary reasons for Google's success.
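The idea of ranking a page by both the number and the importance of the pages linking to it can be illustrated with a small sketch. This is not Google's implementation: the function name `pagerank`, the toy link graph, the damping factor of 0.85, and the fixed iteration count are all assumptions chosen for illustration, following the commonly published simplified form of the algorithm.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Simplified PageRank sketch (illustrative only, not Google's code).

    links maps each page to the list of pages it links to.
    damping and iterations are assumed values, not taken from the text.
    """
    # Collect every page that appears as a source or a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)

    # Start with every page equally important.
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page keeps a small baseline share of importance.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its importance to the pages it links to,
                # split evenly among its outgoing links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its importance
                # evenly over all pages so that no rank is lost.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical five-page web: "c" has three inbound links, "d" has two,
# and "a", "b", and "e" have none.
toy_web = {
    "a": ["c"],
    "b": ["c"],
    "c": ["d"],
    "d": ["c"],
    "e": ["d"],
}

ranks = pagerank(toy_web)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(score, 4))
```

In this toy graph, page "d" has fewer inbound links than "c" but still ranks far above the unlinked pages, because one of its inbound links comes from the highly ranked "c". That is the "importance of those pages" effect described above.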

Some of Google's features that helped to create this very successful and powerful
search tool are:

➢ cached versions of webpages; Google was the first search engine to offer
this option, which let users peek into its vast database.

➢ automatic conversion of non-HTML filetypes to HTML; Google was not the
first to do this, but certainly has been the most successful.

➢ backlinks (the link: syntax); unfortunately, Google now limits the number of
backlinks it shows, greatly reducing the utility of this option.

➢ Google seems to have increased its limits on the size of indexed pages. I
found an indexed PDF document over 764K, a text file over 1000K, and a
webpage over 366K. Very few webpages are larger than 500K. Google does
not offer HTML versions of very large PDF or Word documents, e.g., the
complete 9/11 Commission Report, but exactly what the cut-off size is, I do
not know.

➢ Google refreshes its index continuously, not on a schedule (this is a good
thing); Google's Matt Cutts explains Google's refresh rate: "It's true that when
an event happens on the web, our index can often pick it up in 1-2 days, and
usually even faster. But a typical page in Google's main web index is updated
every 2-3 weeks or faster; it's not the case that the entire main web index is
updated every 2-3 days."36

➢ Google stopped advertising the size of its database in 2005, but Google is
one of the largest if not the largest search database.

In determining the overall size of its index, Google also includes URLs of pages that
it has not crawled and for which it has not indexed the text. These "orphan"

36 Matt Cutts, "Google Update Speed," Google Blogoscoped, 26 July 2006 (14 November 2006).

