
CS246: Mining Massive Datasets
Jure Leskovec, Stanford University
http://cs246.stanford.edu


Web pages are not equally “important”:
www.joe-schmoe.com vs. www.stanford.edu
We already know: since there is large diversity in the connectivity of the web graph, we can rank the pages by the link structure.

2/7/2012 Jure Leskovec, Stanford C246: Mining Massive Datasets


We will cover the following link analysis approaches to computing the importance of nodes in a graph:
PageRank
Hubs and Authorities (HITS)
Topic-Specific (Personalized) PageRank
Web spam detection algorithms


Idea: links as votes
A page is more important if it has more links.
In-coming links? Out-going links? Think of in-links as votes:
www.stanford.edu has 23,400 in-links
www.joe-schmoe.com has 1 in-link
Are all in-links equal? Links from important pages count more. Recursive question!


Each link’s vote is proportional to the importance of its source page.
If page p with importance x has n out-links, each link gets x/n votes.
Page p’s own importance is the sum of the votes on its in-links.


A “vote” from an important page is worth more.
A page is important if it is pointed to by other important pages.
Define a “rank” r_j for node j:

    r_j = Σ_{i→j} r_i / d_out(i)

[Figure: “The web in 1839”: three pages y, a, m; y links to itself and to a, a links to y and m, m links to a]

Flow equations:
    r_y = r_y/2 + r_a/2
    r_a = r_y/2 + r_m
    r_m = r_a/2


3 equations, 3 unknowns, no constants:
No unique solution; all solutions are equivalent up to a scale factor.
An additional constraint forces uniqueness:
    r_y + r_a + r_m = 1
Solution: r_y = 2/5, r_a = 2/5, r_m = 1/5
Gaussian elimination works for small examples, but we need a better method for large web-size graphs.
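The solution above can be checked numerically. A minimal NumPy sketch: the three flow equations are rank-deficient, so one of them is replaced by the normalization constraint before solving.

```python
import numpy as np

# Flow equations as a linear system A r = b, with the third flow
# equation replaced by the constraint r_y + r_a + r_m = 1.
A = np.array([[ 0.5, -0.5,  0.0],   # r_y - r_y/2 - r_a/2 = 0
              [-0.5,  1.0, -1.0],   # r_a - r_y/2 - r_m   = 0
              [ 1.0,  1.0,  1.0]])  # r_y + r_a + r_m     = 1
b = np.array([0.0, 0.0, 1.0])

r = np.linalg.solve(A, b)
# r ≈ [0.4, 0.4, 0.2], i.e. r_y = 2/5, r_a = 2/5, r_m = 1/5
```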


Stochastic adjacency matrix M
Let page j have d_j out-links.
If j → i, then M_ij = 1/d_j, else M_ij = 0.
M is a column-stochastic matrix: columns sum to 1.
Rank vector r: a vector with an entry per page, where r_i is the importance score of page i and Σ_i r_i = 1.
The flow equations can be written as r = M ∙ r.


Suppose page j links to 3 pages, including i.
[Figure: column j of M has the entry M_ij = 1/3; in the product M ∙ r = r, page j contributes r_j/3 to r_i]


The flow equations can be written as r = M ∙ r.
So the rank vector r is an eigenvector of the stochastic web matrix M.
In fact, it is M’s first, or principal, eigenvector, with corresponding eigenvalue 1.


Example graph on pages y, a, m:

Flow equations:
    r_y = r_y/2 + r_a/2
    r_a = r_y/2 + r_m
    r_m = r_a/2

        y    a    m
    y   ½    ½    0
    a   ½    0    1
    m   0    ½    0

r = M ∙ r:
    [r_y]   [½  ½  0] [r_y]
    [r_a] = [½  0  1] [r_a]
    [r_m]   [0  ½  0] [r_m]
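For this 3×3 example, the claim that r is the principal eigenvector of M with eigenvalue 1 can be verified directly; a sketch using NumPy's eigendecomposition:

```python
import numpy as np

M = np.array([[0.5, 0.5, 0.0],   # columns: y, a, m
              [0.5, 0.0, 1.0],
              [0.0, 0.5, 0.0]])

vals, vecs = np.linalg.eig(M)
k = np.argmin(np.abs(vals - 1.0))   # pick the eigenvalue closest to 1
r = np.real(vecs[:, k])
r = r / r.sum()                     # normalize so Σ r_i = 1 (also fixes the sign)
# r ≈ [0.4, 0.4, 0.2], matching r_y = 2/5, r_a = 2/5, r_m = 1/5
```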


Given a web graph with N nodes, where the nodes are pages and the edges are hyperlinks.
Power iteration: a simple iterative scheme.
Initialize: r(0) = [1/N, …, 1/N]^T
Iterate: r(t+1) = M ∙ r(t), i.e.

    r_j(t+1) = Σ_{i→j} r_i(t) / d_i      (d_i … out-degree of node i)

Stop when |r(t+1) − r(t)|_1 < ε
|x|_1 = Σ_{1≤i≤N} |x_i| is the L1 norm; any other vector norm (e.g., Euclidean) can also be used.
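The scheme above, sketched in Python (the stopping threshold and iteration cap are illustrative choices, not from the slides):

```python
import numpy as np

def power_iterate(M, eps=1e-10, max_iter=1000):
    """Iterate r(t+1) = M r(t) from the uniform vector, stopping on the L1 norm."""
    N = M.shape[0]
    r = np.full(N, 1.0 / N)                  # r(0) = [1/N, ..., 1/N]
    for _ in range(max_iter):
        r_next = M @ r
        if np.abs(r_next - r).sum() < eps:   # |r(t+1) - r(t)|_1 < eps
            break
        r = r_next
    return r_next

M = np.array([[0.5, 0.5, 0.0],               # the y/a/m example
              [0.5, 0.0, 1.0],
              [0.0, 0.5, 0.0]])
print(power_iterate(M))                      # ≈ [0.4, 0.4, 0.2]
```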




Imagine a random web surfer:
At any time t, the surfer is on some page u.
At time t+1, the surfer follows an out-link from u uniformly at random, ending up on some page v linked from u.
The process repeats indefinitely.
Let p(t) be the vector whose i-th coordinate is the probability that the surfer is at page i at time t.
p(t) is a probability distribution over pages.

[Figure: pages i_1, i_2, i_3 each link to page j;  r_j = Σ_{i→j} r_i / d_out(i)]


Where is the surfer at time t+1?
The surfer follows a link uniformly at random: p(t+1) = M · p(t)
Suppose the random walk reaches a state where p(t+1) = M · p(t) = p(t).
Then p(t) is a stationary distribution of the random walk.
Our rank vector r satisfies r = M · r, so it is a stationary distribution for the random walk.


    r_j(t+1) = Σ_{i→j} r_i(t) / d_i,  or equivalently  r = M ∙ r

Does this converge?
Does it converge to what we want?
Are the results reasonable?


Example (does it converge?): two pages a and b, with a → b and b → a.

Iterating r_j(t+1) = Σ_{i→j} r_i(t) / d_i over iterations 0, 1, 2, …:

    r_a: 1  0  1  0  …
    r_b: 0  1  0  1  …

The ranks oscillate and never converge.


Example (does it converge to what we want?): two pages a and b, with a → b and b a dead end.

Iterating r_j(t+1) = Σ_{i→j} r_i(t) / d_i over iterations 0, 1, 2, …:

    r_a: 1  0  0  0  …
    r_b: 0  1  0  0  …

All the importance “leaks out”: the ranks converge to zero.


2 problems:
(1) Some pages are “dead ends” (have no out-links). Such pages cause importance to “leak out.”
(2) Spider traps (all out-links are within the group). Eventually spider traps absorb all importance.




The Google solution for spider traps: at each time step, the random surfer has two options:
With probability β, follow a link at random.
With probability 1−β, jump to some page uniformly at random.
Common values for β are in the range 0.8 to 0.9.
The surfer will teleport out of a spider trap within a few time steps.




Teleports: follow random teleport links with probability 1.0 from dead ends.
Adjust the matrix accordingly (m is the dead end; its column becomes uniform):

        y    a    m                  y    a    m
    y   ½    ½    0              y   ½    ½    ⅓
    a   ½    0    0      →       a   ½    0    ⅓
    m   0    ½    0              m   0    ½    ⅓


    r(t+1) = M ∙ r(t)

Markov chains:
A set of states X.
A transition matrix P, where P_ij = P(X_t = i | X_{t−1} = j).
A distribution π specifying the probability of being at each state x ∈ X.
The goal is to find π such that π = P ∙ π.


Theory of Markov chains
Fact: for any start vector, the power method applied to a Markov transition matrix P will converge to a unique positive stationary vector as long as P is stochastic, irreducible, and aperiodic.


Stochastic: every column sums to 1.
A possible solution: add green links (teleport edges out of the dead end m):

    S = M + (1/n)·1·aᵀ

    a_i = 1 if node i has out-degree 0, else a_i = 0
    1 … vector of all 1s

        y    a    m
    y   ½    ½    ⅓
    a   ½    0    ⅓
    m   0    ½    ⅓

    r_y = r_y/2 + r_a/2 + r_m/3
    r_a = r_y/2 + r_m/3
    r_m = r_a/2 + r_m/3
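A small sketch of this fix in NumPy: detect all-zero (dead-end) columns and add the uniform column to each. The function name is illustrative.

```python
import numpy as np

def make_stochastic(M):
    """S = M + (1/n)·1·aᵀ: give each dead-end (all-zero) column uniform weight 1/n."""
    n = M.shape[0]
    a = (M.sum(axis=0) == 0).astype(float)   # a_i = 1 iff node i has out-degree 0
    return M + np.outer(np.ones(n), a) / n   # adds 1/n to every entry of dead-end columns

M = np.array([[0.5, 0.5, 0.0],   # columns: y, a, m; m is a dead end
              [0.5, 0.0, 0.0],
              [0.0, 0.5, 0.0]])
S = make_stochastic(M)
# column m is now [1/3, 1/3, 1/3], and every column of S sums to 1
```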


A chain is periodic if there exists k > 1 such that the interval between two visits to some state s is always a multiple of k.
A possible solution: add green links.


Irreducible: from any state, there is a non-zero probability of going from any one state to any other.
A possible solution: add green links.


Google’s solution that does it all: makes M stochastic, aperiodic, and irreducible.
At each step, the random surfer has two options:
With probability β, follow a link at random.
With probability 1−β, jump to some random page.

PageRank equation [Brin-Page, 98]:

    r_j = Σ_{i→j} β ∙ r_i/d_i + (1−β)/N


Example with β = 0.8 (m is now a spider trap):

    M:                [1/N]_{N×N}:
        ½  ½  0           ⅓  ⅓  ⅓
        ½  0  0           ⅓  ⅓  ⅓
        0  ½  1           ⅓  ⅓  ⅓

    A = 0.8·M + 0.2·[1/N]_{N×N}:

        y   7/15  7/15   1/15      (e.g., 0.8·½ + 0.2·⅓ = 7/15 and 0.8·1 + 0.2·⅓ = 13/15)
        a   7/15  1/15   1/15
        m   1/15  7/15  13/15

    Iterating r = A ∙ r from r = (⅓, ⅓, ⅓):

        y:  1/3   0.33  0.24  0.26  …   7/33
        a:  1/3   0.20  0.20  0.18  …   5/33
        m:  1/3   0.46  0.52  0.56  …  21/33


Suppose there are N pages.
Consider a page j with a set of out-links O(j).
We have M_ij = 1/|O(j)| when j → i, and M_ij = 0 otherwise.
The random teleport is equivalent to:
Adding a teleport link from j to every other page with probability (1−β)/N.
Reducing the probability of following each out-link from 1/|O(j)| to β/|O(j)|.
Equivalently: tax each page a fraction (1−β) of its score and redistribute it evenly.


Construct the N × N matrix A as follows:

    A_ij = β ∙ M_ij + (1−β)/N

Verify that A is a stochastic matrix.
The PageRank vector r is the principal eigenvector of this matrix A, satisfying r = A ∙ r.
Equivalently, r is the stationary distribution of the random walk with teleports.
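Putting the pieces together, a compact sketch of PageRank as power iteration on A = β·M + (1−β)/N. It assumes M is already column stochastic; the threshold ε and iteration cap are illustrative.

```python
import numpy as np

def pagerank(M, beta=0.8, eps=1e-12, max_iter=1000):
    """Power iteration on A = beta*M + (1-beta)/N (M must be column stochastic)."""
    N = M.shape[0]
    A = beta * M + (1.0 - beta) / N      # dense teleport matrix (fine for tiny N)
    r = np.full(N, 1.0 / N)
    for _ in range(max_iter):
        r_next = A @ r
        if np.abs(r_next - r).sum() < eps:
            break
        r = r_next
    return r_next

M = np.array([[0.5, 0.5, 0.0],           # y/a/m example; m is a spider trap
              [0.5, 0.0, 0.0],
              [0.0, 0.5, 1.0]])
print(pagerank(M))                       # ≈ [7/33, 5/33, 21/33] ≈ [0.212, 0.152, 0.636]
```

With teleports, the trap at m no longer absorbs all the importance, matching the slide's limit (7/33, 5/33, 21/33).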


The key step is the matrix-vector multiplication r_new = A ∙ r_old.
Easy if we have enough main memory to hold A, r_old, and r_new.
Say N = 1 billion pages, and we need 4 bytes for each entry (say):
2 billion entries for the two vectors, approx 8 GB.
But matrix A has N² entries, and 10^18 is a large number!

    A = β∙M + (1−β)·[1/N]_{N×N}

    A = 0.8 · ½ ½ 0   + 0.2 · ⅓ ⅓ ⅓   =   7/15  7/15   1/15
              ½ 0 0           ⅓ ⅓ ⅓        7/15  1/15   1/15
              0 ½ 1           ⅓ ⅓ ⅓        1/15  7/15  13/15




We can rearrange the PageRank equation:

    r = β ∙ M ∙ r + [(1−β)/N]_N

where [(1−β)/N]_N is a vector with all N entries equal to (1−β)/N.


Encode the sparse matrix using only its nonzero entries.
Space is roughly proportional to the number of links: say 10N, or 4·10·1 billion = 40 GB.
Still won't fit in memory, but will fit on disk.

    source node   degree   destination nodes
    0             3        1, 5, 7
    1             5        17, 64, 113, 117, 245
    2             2        13, 23


Assume enough RAM to fit r_new into memory; store r_old and matrix M on disk.
Then 1 step of power iteration is:
Initialize all entries of r_new to (1−β)/N
For each page p (of out-degree n):
    Read into memory: p, n, dest_1, …, dest_n, r_old(p)
    for j = 1…n: r_new(dest_j) += β ∙ r_old(p) / n

    src   degree   destination
    0     3        1, 5, 6
    1     4        17, 64, 113, 117
    2     2        13, 23
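One in-memory pass of this update, sketched in Python. In the disk-based scheme, `links` and `r_old` would be streamed from disk; here they are plain dicts/lists, and the toy graph is illustrative.

```python
def pagerank_pass(links, r_old, beta=0.8):
    """One power-iteration step over an adjacency-list (sparse) encoding."""
    N = len(r_old)
    r_new = [(1.0 - beta) / N] * N        # initialize all entries to (1-beta)/N
    for p, dests in links.items():
        n = len(dests)                    # out-degree of p
        for dest in dests:
            r_new[dest] += beta * r_old[p] / n
    return r_new

# Toy graph with no dead ends: 0 -> 1, 2 ; 1 -> 2 ; 2 -> 0
links = {0: [1, 2], 1: [2], 2: [0]}
r = [1/3, 1/3, 1/3]
for _ in range(60):
    r = pagerank_pass(links, r)
# r converges to about [0.384, 0.220, 0.396]
```

Because the toy graph has no dead ends, the (1−β)/N initialization keeps the entries of r summing to 1 from one iteration to the next.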


Assume enough RAM to fit r_new into memory; store r_old and matrix M on disk.
In each iteration, we have to:
Read r_old and M
Write r_new back to disk
IO cost = 2|r| + |M|
Question: what if we could not even fit r_new in memory?


    src   degree   destination
    0     4        0, 1, 3, 5
    1     2        0, 5
    2     2        3, 4

[Figure: r_new (nodes 0–5) is split into blocks that fit in memory; r_old spans all nodes 0–5]


Similar to a nested-loop join in databases:
Break r_new into k blocks that fit in memory.
Scan M and r_old once for each block.
Total cost of k scans of M and r_old: k(|M| + |r|) + |r| = k|M| + (k+1)|r|
Can we do better?
Hint: M is much bigger than r (approx 10-20x), so we must avoid reading it k times per iteration.


M broken into stripes, one per block of r_new (blocks {0, 1}, {2, 3}, {4, 5}):

    src   degree   destination
    0     4        0, 1
    1     3        0
    2     2        1

    0     4        3
    2     2        3

    0     4        5
    1     3        5
    2     2        4


Break M into stripes: each stripe contains only the destination nodes in the corresponding block of r_new.
There is some additional overhead per stripe, but it is usually worth it.
Cost per iteration: |M|(1+ε) + (k+1)|r|


PageRank measures the generic popularity of a page:
Biased against topic-specific authorities. Solution: Topic-Specific PageRank (next).
It uses a single measure of importance:
Other models exist, e.g., hubs-and-authorities. Solution: Hubs-and-Authorities (next).
It is susceptible to link spam:
Artificial link topologies created in order to boost PageRank. Solution: TrustRank (next).
