10.07.2015 Views

Web Mining and Social Networking: Techniques and ... - tud.ttu.ee


Fig. 5.2. Paper i and j cite paper k

5.3 PageRank and HITS Algorithms

5.3.1 PageRank

In 1998, Sergey Brin and Larry Page invented PageRank and presented it at the Seventh International World Wide Web Conference (WWW7). PageRank is a measure of page quality used in the Google search engine. It uses hyperlinks as an indicator of the quality of a Web page. In essence, PageRank interprets a hyperlink from page x to page y as a conveyance of prestige, by page x, for page y. However, PageRank does not just consider the sheer number of links that a page receives. It also takes into account the importance or quality of the page that conveys the prestige. Hyperlinks from pages that are themselves important carry more weight and help to make other pages more important. This idea is based on the concept of rank prestige in social network analysis [250].

Algorithm: PageRank

PageRank [196, 43] is a query-independent ranking algorithm for Web pages. It is based on the measure of rank prestige in social network analysis. A PageRank value is calculated off-line for each indexed page. The main concepts underlying the PageRank algorithm can be described as follows.

1. A hyperlink from a page pointing to another page is an implicit conveyance of authority to the target page. Thus, a Web page is important if it receives a lot of in-links.
2. A hyperlink from a high prestige page is more important than a hyperlink from a low prestige page. Thus, a Web page is important if it is pointed to by other important pages.

To formulate the above concepts, we represent the Web as a directed graph G = (V, E), where V is the set of vertices or nodes, i.e. the set of all pages in the Web, and E is the set of directed edges between pairs of nodes, i.e. hyperlinks. Let the total number of pages on the Web be n (n = |V|). The PageRank score of page i, P(i), is defined by:

$$P(i) = (1 - d) + d \sum_{(j,i) \in E} \frac{P(j)}{\mathrm{outdegree}(j)} \qquad (5.3)$$

where d is a damping factor usually set between 0.8 and 0.9, and outdegree(j) is the number of hyperlinks on page j.

Alternatively, PageRank can be defined as the stationary distribution of the following infinite random walk $p_1, p_2, p_3, \ldots$, where each $p_i$ is a node in G. The walk starts at each node with equal probability. To determine node $p_{i+1}$, a biased coin is flipped: with probability 1 − d, node $p_{i+1}$ is chosen uniformly at random from all nodes in G; with probability d, it is chosen uniformly from all nodes q such that the edge $(p_i, q)$ exists in G.

Equation 5.3 can be rewritten in matrix notation. Let P be an n-dimensional column vector of PageRank values: $P = (P(1), P(2), \cdots, P(n))^T$. Let A be the adjacency matrix of G with
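
As an illustration of the PageRank score definition above, the following is a minimal sketch that repeatedly applies P(i) = (1 − d) + d Σ P(j)/outdegree(j) until the scores stop changing. The function name `pagerank`, the three-page graph `toy_web`, the choice d = 0.85, and the stopping tolerance are all assumptions made for this example; they do not come from the text. Pages without out-links are simply skipped here, which is one possible simplification, not a treatment the text prescribes.

```python
def pagerank(out_links, d=0.85, tol=1e-10, max_iter=1000):
    """Iterate P(i) = (1 - d) + d * sum over (j, i) in E of P(j) / outdegree(j).

    out_links maps each page to the list of pages it links to.
    Every link target must also appear as a key of out_links.
    """
    nodes = list(out_links)
    scores = {node: 1.0 for node in nodes}  # arbitrary starting values

    for _ in range(max_iter):
        # Every page starts each round with the (1 - d) term.
        new_scores = {node: 1.0 - d for node in nodes}

        # Each page j passes d * P(j) / outdegree(j) to every page it links to.
        for j, targets in out_links.items():
            if not targets:
                continue  # assumption: pages with no out-links pass on nothing
            share = d * scores[j] / len(targets)
            for i in targets:
                new_scores[i] += share

        # Stop once no score moved by more than the tolerance.
        delta = max(abs(new_scores[node] - scores[node]) for node in nodes)
        scores = new_scores
        if delta < tol:
            break

    return scores


# Hypothetical three-page Web: a links to b and c, b links to c, c links to a.
toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(toy_web)

for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{page}: {score:.4f}")
```

In this toy graph page c receives links from both a and b, so it ends up with the highest score, and a ranks above b because its single in-link comes from the highly ranked c. With this form of the equation and no pages lacking out-links, the scores sum to n rather than to 1.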
