10.07.2015 Views

Web Mining and Social Networking: Techniques and ... - tud.ttu.ee

5 Web Linkage Mining

$$
A_{ij} =
\begin{cases}
\dfrac{1}{\mathrm{outdegree}(i)}, & \text{if } (i, j) \in E \\
0, & \text{otherwise}
\end{cases}
\tag{5.4}
$$

We can write the system of n equations as

$$
P = (1 - d)\,e + d A^{T} P \tag{5.5}
$$

where e is a column vector of all 1's.

This is the characteristic equation of the eigensystem, where the solution P is an eigenvector with corresponding eigenvalue 1. Since this is a recursive definition, an iterative algorithm is used to solve it. It turns out that if the adjacency matrix A is a stochastic (transition) matrix that is irreducible and aperiodic, then 1 is the largest eigenvalue and the PageRank vector P is the principal eigenvector. The well-known power iteration method [103] can be used to find P. The power iteration algorithm for PageRank is given in Fig. 5.3. The algorithm can start with any initial assignment of PageRank values. The iteration ends when the PageRank values converge, that is, when they no longer change much between iterations, for example when the 1-norm of the residual vector is less than a threshold value ε. Note that the 1-norm of a vector is simply the sum of the absolute values of all its components. For a detailed discussion of the derivation of PageRank, interested readers can refer to [166, 151] and [150].

Fig. 5.3. The power iteration method for PageRank

The main advantage of PageRank is its ability to combat spam. Since it is not easy for a page owner to add in-links to his or her pages from other important pages, it is not easy to deliberately boost PageRank. Nevertheless, there are reported ways to boost PageRank.
Identifying and combating web spam is an important research issue in Web search.

Another strong point of PageRank is that it is a global rank measure and is query independent. In Google, the Web graph induced from the crawled pages is first used to compute the PageRank scores of all pages, and the computed scores are stored for later use. When a search query is submitted, a text index is first consulted to select candidate response pages. Then an undisclosed ranking scheme that combines PageRank with textual match is used to produce the final ranked list of response URLs. Because the PageRank scores are computed in advance rather than at query time, this makes Google fast in comparison with conventional text-based search engines.

The main shortcoming of PageRank stems from the same query-independent, global nature. PageRank cannot distinguish between pages that are authoritative in general and pages that are authoritative on the query topic. Another shortcoming is that PageRank does not consider the timeliness of authoritative sources.
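
The power iteration for Eq. (5.5) described above can be illustrated with a short Python sketch. It is not a reproduction of Fig. 5.3. The function name `pagerank`, the dictionary-based graph representation, and the default values for `d`, `eps` and `max_iter` are assumptions made for illustration only. The sketch also assumes that every page has at least one out-link and that links are not repeated, so that A in Eq. (5.4) is row-stochastic.

```python
def pagerank(out_links, d=0.85, eps=1e-8, max_iter=1000):
    """Power iteration for P = (1 - d) e + d A^T P (Eq. 5.5).

    out_links maps each page to the list of pages it links to
    (hypothetical input format). d=0.85 is an assumed default.
    """
    pages = sorted(out_links)
    for page, targets in out_links.items():
        if not targets:
            raise ValueError(f"page {page!r} has no out-links")
        for target in targets:
            if target not in out_links:
                raise ValueError(f"unknown link target {target!r}")

    # Any initial assignment works; start with 1 for every page.
    rank = {page: 1.0 for page in pages}

    for _ in range(max_iter):
        # The (1 - d) e term: every page receives 1 - d.
        new_rank = {page: 1.0 - d for page in pages}
        # The d A^T P term: page i passes rank[i] / outdegree(i)
        # to each page j it links to, scaled by d.
        for i in pages:
            share = d * rank[i] / len(out_links[i])
            for j in out_links[i]:
                new_rank[j] += share
        # 1-norm of the residual: sum of absolute differences.
        residual = sum(abs(new_rank[p] - rank[p]) for p in pages)
        rank = new_rank
        if residual < eps:
            break
    return rank


# Small made-up graph: a -> b, a -> c, b -> c, c -> a.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 4))
```

Because e is a vector of all 1's in Eq. (5.5), the scores in this sketch sum to the number of pages n rather than to 1. Dividing every score by n gives the probability-style normalization used in some other presentations.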
