12.07.2015 Views

The PageRank Citation Ranking: Bringing Order to the ... - Dejan SEO

The PageRank Citation Ranking: Bringing Order to the ... - Dejan SEO

The PageRank Citation Ranking: Bringing Order to the ... - Dejan SEO

SHOW MORE
SHOW LESS

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

7 Applications7.1 Estimating Web TracBecause <strong>PageRank</strong> roughly corresponds <strong>to</strong> a random web surfer (see Section 2.5), it is interesting<strong>to</strong> see how <strong>PageRank</strong> corresponds <strong>to</strong> actual usage. We used <strong>the</strong> counts of web page accesses fromNLANR [NLA] proxy cache and compared <strong>the</strong>se <strong>to</strong> <strong>PageRank</strong>. <strong>The</strong> NLANR data was from severalnational proxy caches over <strong>the</strong> period of several months and consisted of 11,817,665 unique URLswith <strong>the</strong> highest hit count going <strong>to</strong> Altavista with 638,657 hits. <strong>The</strong>re were 2.6 million pages in <strong>the</strong>intersection of <strong>the</strong> cache data and our 75 million URL database. It is extremely dicult <strong>to</strong> compare<strong>the</strong>se datasets analytically for a number of dierent reasons. Many of <strong>the</strong> URLs in <strong>the</strong> cache accessdata are people reading <strong>the</strong>ir personal mail on free email services. Duplicate server names and pagenames are a serious problem. Incompleteness and bias a problem is both <strong>the</strong> <strong>PageRank</strong> data and<strong>the</strong> usage data. However, we did see some interesting trends in <strong>the</strong> data. <strong>The</strong>re seems <strong>to</strong> be a highusage of pornographic sites in <strong>the</strong> cache data, but <strong>the</strong>se sites generally had low <strong>PageRank</strong>s. Webelieve this is because people do not want <strong>to</strong> link <strong>to</strong> pornographic sites from <strong>the</strong>ir own web pages.Using this technique of looking for dierences between <strong>PageRank</strong> and usage, it may be possible <strong>to</strong>nd things that people like <strong>to</strong> look at, but do not want <strong>to</strong>mention on <strong>the</strong>ir web pages. <strong>The</strong>re aresome sites that have avery high usage, but low <strong>PageRank</strong> such as netscape.yahoo.com. We believe<strong>the</strong>re is probably an important backlink which simply is omitted from our database (we only havea partial link structure of <strong>the</strong> web). It may be possible <strong>to</strong> use usage data as a start vec<strong>to</strong>r for<strong>PageRank</strong>, and <strong>the</strong>n iterate <strong>PageRank</strong> a few times. This might allow lling in holes in <strong>the</strong> usagedata. In any case, <strong>the</strong>se types of comparisons are an interesting <strong>to</strong>pic for future study.7.2 <strong>PageRank</strong> as Backlink Predic<strong>to</strong>rOne justication for <strong>PageRank</strong> is that it is a predic<strong>to</strong>r for backlinks. In [CGMP98] we explore <strong>the</strong>issue of how <strong>to</strong> crawl <strong>the</strong> web eciently, trying <strong>to</strong> crawl better documents rst. We found on testsof <strong>the</strong> Stanford web that <strong>PageRank</strong> is a better predic<strong>to</strong>r of future citation counts than citationcounts <strong>the</strong>mselves.<strong>The</strong> experiment assumes that <strong>the</strong> system starts out with only a single URL and no o<strong>the</strong>rinformation, and <strong>the</strong> goal is <strong>to</strong> try <strong>to</strong> crawl <strong>the</strong> pages in as close <strong>to</strong> <strong>the</strong> optimal order as possible.<strong>The</strong> optimal order is <strong>to</strong> crawl pages in exactly <strong>the</strong> order of <strong>the</strong>ir rank according <strong>to</strong> an evaluationfunction. For <strong>the</strong> purposes here, <strong>the</strong> evaluation function is simply <strong>the</strong> number of citations, givencomplete information. <strong>The</strong> catch is that all <strong>the</strong> information <strong>to</strong> calculate <strong>the</strong> evaluation function isnot available until after all <strong>the</strong> documents have been crawled. It turns out using <strong>the</strong> incompletedata, <strong>PageRank</strong> is a more eective way <strong>to</strong> order <strong>the</strong> crawling than <strong>the</strong> number of known citations.In o<strong>the</strong>r words, <strong>PageRank</strong> is a better predic<strong>to</strong>r than citation counting even when <strong>the</strong> measure is<strong>the</strong> number of citations! <strong>The</strong> explanation for this seems <strong>to</strong> be that <strong>PageRank</strong> avoids <strong>the</strong> localmaxima that citation counting gets stuck in. For example, citation counting tends <strong>to</strong> get stuck inlocal collections like <strong>the</strong> Stanford CS web pages, taking a long time <strong>to</strong> branch out and nd highlycited pages in o<strong>the</strong>r areas. <strong>PageRank</strong> quickly nds <strong>the</strong> Stanford homepage is important, and givespreference <strong>to</strong> its children resulting in an ecient, broad search.This ability of<strong>PageRank</strong> <strong>to</strong> predict citation counts is a powerful justication for using <strong>PageRank</strong>.Since it is very dicult <strong>to</strong> map <strong>the</strong> citation structure of <strong>the</strong> web completely, <strong>PageRank</strong> mayeven be a better citation count approximation than citation counts <strong>the</strong>mselves.13

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!