
CS246: Mining Massive Datasets

Jure Leskovec, Stanford University

http://cs246.stanford.edu

Web pages are not equally “important”

www.joe-schmoe.com vs. www.stanford.edu

We already know: since there is large diversity in the connectivity of the web graph, we can rank pages by the link structure.

2/7/2012 Jure Leskovec, **Stanford** C246: Mining Massive Datasets 2


We will cover the following Link Analysis approaches to computing the importance of nodes in a graph:

PageRank

Hubs and Authorities (HITS)

Topic-Specific (Personalized) PageRank

Web Spam Detection Algorithms


Idea: Links as votes

Page is more important if it has more links

In-coming links? Out-going links?

Think of in-links as votes:

www.stanford.edu has 23,400 inlinks

www.joe-schmoe.com has 1 inlink

Are all in-links equal?

Links from important pages count more

Recursive question!


Each link’s vote is proportional to the importance of its source page.

If page p with importance x has n out-links, each link gets x/n votes.

Page p’s own importance is the sum of the votes on its in-links.


A “vote” from an important page is worth more. A page is important if it is pointed to by other important pages.

Define a “rank” r_j for node j:

r_j = ∑_{i→j} r_i / d_out(i)


[Figure: “The web in 1839” — three pages y, a, m; y and a each have two out-links (carrying y/2 and a/2 apiece), m has a single out-link (carrying all of m).]

Flow equations:

r_y = r_y/2 + r_a/2
r_a = r_y/2 + r_m
r_m = r_a/2

3 equations, 3 unknowns, no constants: no unique solution. All solutions are equivalent up to a scale factor. An additional constraint forces uniqueness:

r_y + r_a + r_m = 1

Solution: r_y = 2/5, r_a = 2/5, r_m = 1/5


The Gaussian elimination method works for small examples, but we need a better method for large web-sized graphs.
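For a toy graph this small, direct elimination is indeed enough. A minimal sketch (using NumPy, an assumption of this illustration) that solves the flow equations together with the normalization constraint:

```python
import numpy as np

# Flow equations for the 3-page example (pages y, a, m):
#   r_y = r_y/2 + r_a/2
#   r_a = r_y/2 + r_m
#   r_m = r_a/2
# The system is rank-deficient, so we replace one redundant
# equation with the constraint r_y + r_a + r_m = 1.
A = np.array([
    [-0.5, 0.5, 0.0],   # r_y/2 + r_a/2 - r_y = 0
    [ 0.5, -1.0, 1.0],  # r_y/2 + r_m  - r_a = 0
    [ 1.0, 1.0, 1.0],   # r_y + r_a + r_m = 1
])
b = np.array([0.0, 0.0, 1.0])
r_y, r_a, r_m = np.linalg.solve(A, b)
print(r_y, r_a, r_m)  # 0.4, 0.4, 0.2  (i.e., 2/5, 2/5, 1/5)
```

Swapping one equation for the normalization makes the system uniquely solvable, matching the slide’s answer.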


Stochastic adjacency matrix M:

Suppose page j has d_j out-links. If j → i, then M_ij = 1/d_j; otherwise M_ij = 0.

M is a column-stochastic matrix: its columns sum to 1.

Rank vector r: a vector with one entry per page, where r_i is the importance score of page i and ∑_i r_i = 1.

The flow equations can be written as r = M · r.


Suppose page j links to 3 pages, one of which is i. Then column j of M has the entry M_ij = 1/3, so in the product M · r, page i receives one third of page j’s rank.

The flow equations can be written r = M · r. So the rank vector r is an eigenvector of the stochastic web matrix M; in fact, it is the first (principal) eigenvector, with corresponding eigenvalue 1.


For the y/a/m example:

r_y = r_y/2 + r_a/2
r_a = r_y/2 + r_m
r_m = r_a/2

r = M·r:

[ r_y ]   [ ½  ½  0 ] [ r_y ]
[ r_a ] = [ ½  0  1 ] [ r_a ]
[ r_m ]   [ 0  ½  0 ] [ r_m ]


Given a web graph with N nodes, where the nodes are pages and the edges are hyperlinks.

Power iteration: a simple iterative scheme

Initialize: r^(0) = [1/N, …, 1/N]^T
Iterate: r^(t+1) = M · r^(t)
Stop when |r^(t+1) − r^(t)|_1 < ε

|x|_1 = ∑_{1≤i≤N} |x_i| is the L1 norm; any other vector norm (e.g., Euclidean) can be used instead.

Elementwise, the update is

r_j^(t+1) = ∑_{i→j} r_i^(t) / d_i

where d_i is the out-degree of node i.
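The scheme can be sketched directly; this assumes the y/a/m example graph from earlier, with NumPy for the matrix–vector product:

```python
import numpy as np

# Column-stochastic M for the y/a/m example:
# y links to y and a; a links to y and m; m links to a.
M = np.array([
    [0.5, 0.5, 0.0],
    [0.5, 0.0, 1.0],
    [0.0, 0.5, 0.0],
])
N = 3
r = np.full(N, 1.0 / N)        # r^(0) = [1/N, ..., 1/N]
eps = 1e-10
while True:
    r_next = M @ r             # r^(t+1) = M · r^(t)
    if np.abs(r_next - r).sum() < eps:   # L1-norm stopping rule
        break
    r = r_next
print(r)  # converges to [0.4, 0.4, 0.2]
```

The result matches the flow-equation solution r = (2/5, 2/5, 1/5).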


Imagine a random web surfer:

At any time t, the surfer is on some page u
At time t+1, the surfer follows an out-link from u uniformly at random
The surfer ends up on some page v linked from u
The process repeats indefinitely

Let p(t) be the vector whose i-th coordinate is the probability that the surfer is at page i at time t. Then p(t) is a probability distribution over pages.


r_j = ∑_{i→j} r_i / d_out(i)

Where is the surfer at time t+1? The surfer follows a link uniformly at random, so

p(t+1) = M · p(t)

Suppose the random walk reaches a state where p(t+1) = M · p(t) = p(t). Then p(t) is a stationary distribution of the random walk.

Our rank vector r satisfies r = M · r, so it is a stationary distribution for the random walk.



The update r_j^(t+1) = ∑_{i→j} r_i^(t) / d_i, or equivalently r = M · r, raises three questions:

Does this converge?
Does it converge to what we want?
Are the results reasonable?

Example: two pages a and b, where a links to b and b links to a.

Iteration t:  0  1  2  3  …
r_a:          1  0  1  0
r_b:          0  1  0  1

The iteration does not converge; it oscillates forever.
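A quick way to see this non-convergence in code (a sketch of the two-page cycle a → b → a implied by the table above):

```python
import numpy as np

# Two-node cycle a -> b -> a: the walk is periodic, so power
# iteration oscillates between the two point masses instead of
# converging.
M = np.array([[0.0, 1.0],
              [1.0, 0.0]])
r = np.array([1.0, 0.0])   # start with all rank on a
history = []
for _ in range(4):
    r = M @ r
    history.append(r.copy())
print(history)  # alternates [0,1], [1,0], [0,1], [1,0]: never settles
```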

Example: two pages a and b, where a links to b and b has no out-links.

Iteration t:  0  1  2  3  …
r_a:          1  0  0  0
r_b:          0  1  0  0

All importance “leaks out”: the iteration converges, but to the zero vector.

Two problems:

Dead ends: some pages have no out-links. Such pages cause importance to “leak out.”

Spider traps: all out-links stay within a group of pages. Eventually spider traps absorb all importance.
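A two-page sketch of the spider-trap effect (the graph here, a → b with b linking only to itself, is a made-up minimal example):

```python
import numpy as np

# Page a links to b; page b links only to itself (a spider trap).
# Column a sends everything to b, and column b keeps everything.
M = np.array([
    [0.0, 0.0],   # nothing ever flows back to a
    [1.0, 1.0],   # a -> b, b -> b
])
r = np.array([0.5, 0.5])
for _ in range(50):
    r = M @ r
print(r)  # [0., 1.]: the trap b has absorbed all the importance
```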



The Google solution for spider traps: at each time step, the random surfer has two options:

With probability β, follow a link at random
With probability 1−β, jump to some page uniformly at random

Common values for β are in the range 0.8 to 0.9. The surfer will teleport out of a spider trap within a few time steps.


Teleports: from a dead end, follow a random teleport link with probability 1.0. Adjust the matrix accordingly.

Before (m is a dead end; its column is all zeros):

      y    a    m
y   [ ½    ½    0 ]
a   [ ½    0    0 ]
m   [ 0    ½    0 ]

After (teleport with probability 1 from m):

      y    a    m
y   [ ½    ½    ⅓ ]
a   [ ½    0    ⅓ ]
m   [ 0    ½    ⅓ ]



Markov Chains

A set of states X
A transition matrix P, where P_ij = P(X_t = i | X_t−1 = j)
A distribution π specifying the probability of being at each state x ∈ X

Goal: find π such that π = P · π


Theory of Markov chains

Fact: for any start vector, the power method applied to a Markov transition matrix P will converge to a unique positive stationary vector, as long as P is stochastic, irreducible, and aperiodic.


Stochastic: every column sums to 1.

A possible solution: add “green” teleport links from dead ends. In matrix form:

S = M + (1/n) · 1 · aᵀ

• a_i = 1 if node i has out-degree 0, and a_i = 0 otherwise
• 1 … vector of all 1s

For the y/a/m example (m is a dead end):

      y    a    m
y   [ ½    ½    ⅓ ]
a   [ ½    0    ⅓ ]
m   [ 0    ½    ⅓ ]

r_y = r_y/2 + r_a/2 + r_m/3
r_a = r_y/2 + r_m/3
r_m = r_a/2 + r_m/3
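The dead-end fix above can be sketched in a few lines; M is the y/a/m matrix with the all-zero column for the dead end m:

```python
import numpy as np

# Make M stochastic by replacing every dead-end (all-zero) column
# with 1/n in each entry: S = M + (1/n) * 1 * a^T,
# where a marks the dead-end nodes.
M = np.array([
    [0.5, 0.5, 0.0],
    [0.5, 0.0, 0.0],
    [0.0, 0.5, 0.0],   # column m is all zeros: dead end
])
n = 3
a = (M.sum(axis=0) == 0).astype(float)     # a_j = 1 iff node j is a dead end
S = M + np.outer(np.full(n, 1.0 / n), a)   # adds 1/n down each dead-end column
print(S)   # column m becomes [1/3, 1/3, 1/3]
```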


Aperiodic: a chain is periodic if there exists k > 1 such that the interval between two visits to some state s is always a multiple of k. A possible solution: add “green” (extra) links to break the periodicity.


Irreducible: from any state, there is a non-zero probability of going to any other state. A possible solution: add “green” (extra) links.


Google’s solution that does it all: make M stochastic, aperiodic, and irreducible.

At each step, the random surfer has two options:

With probability β, follow a link at random
With probability 1−β, jump to some random page

PageRank equation [Brin–Page, ’98]:

r_j = ∑_{i→j} β · r_i / d_i + (1−β) · 1/N

Example with β = 0.8 (pages y, a, m, where m links only to itself):

        [ ½  ½  0 ]        [ ⅓  ⅓  ⅓ ]   [ 7/15  7/15   1/15 ]
A = 0.8 [ ½  0  0 ] + 0.2  [ ⅓  ⅓  ⅓ ] = [ 7/15  1/15   1/15 ]
        [ 0  ½  1 ]        [ ⅓  ⅓  ⅓ ]   [ 1/15  7/15  13/15 ]

(e.g., A_yy = 0.8·½ + 0.2·⅓ = 7/15 and A_mm = 0.8·1 + 0.2·⅓ = 13/15)

Power iteration from r^(0) = [⅓, ⅓, ⅓]:

       y      a      m
t=0:  1/3    1/3    1/3
t=1:  0.33   0.20   0.46
t=2:  0.24   0.20   0.52
t=3:  0.26   0.18   0.56
…
→     7/33   5/33   21/33

In matrix notation: A = β·S + (1−β)·(1/n)·1·1ᵀ

Suppose there are N pages. Consider a page j with set of out-links O(j); we have M_ij = 1/|O(j)| when j → i, and M_ij = 0 otherwise.

The random teleport is equivalent to:

Adding a teleport link from j to every other page with probability (1−β)/N
Reducing the probability of following each out-link from 1/|O(j)| to β/|O(j)|

Equivalently: tax each page a fraction (1−β) of its score and redistribute it evenly.


Construct the N × N matrix A as follows:

A_ij = β·M_ij + (1−β)/N

Verify that A is a stochastic matrix. The PageRank vector r is the principal eigenvector of A, satisfying r = A · r. Equivalently, r is the stationary distribution of the random walk with teleports.
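Putting it together for the y/a/m spider-trap example with β = 0.8 (a sketch; the fixed point 7/33, 5/33, 21/33 is the one the example slide derives):

```python
import numpy as np

beta = 0.8
# Column-stochastic M for the y/a/m example where m links
# only to itself (the spider trap).
M = np.array([
    [0.5, 0.5, 0.0],
    [0.5, 0.0, 0.0],
    [0.0, 0.5, 1.0],
])
N = 3
A = beta * M + (1 - beta) / N           # A_ij = beta*M_ij + (1-beta)/N
assert np.allclose(A.sum(axis=0), 1.0)  # verify A is column stochastic

r = np.full(N, 1.0 / N)
for _ in range(100):                    # power iteration on A
    r = A @ r
print(r)  # approx [7/33, 5/33, 21/33]
```

The teleport term keeps the trap at m from absorbing everything: m ends up with 21/33 of the rank rather than all of it.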


The key step is the matrix–vector multiplication

r_new = A · r_old

This is easy if we have enough main memory to hold A, r_old, and r_new. But say N = 1 billion pages and 4 bytes per entry:

The two vectors alone have 2 billion entries: approx. 8 GB
Matrix A has N² = 10¹⁸ entries: far too many to store!

A = β·M + (1−β)·[1/N]_{N×N}

Example (β = 0.8):

        [ ½  ½  0 ]        [ ⅓  ⅓  ⅓ ]   [ 7/15  7/15   1/15 ]
A = 0.8 [ ½  0  0 ] + 0.2  [ ⅓  ⅓  ⅓ ] = [ 7/15  1/15   1/15 ]
        [ 0  ½  1 ]        [ ⅓  ⅓  ⅓ ]   [ 1/15  7/15  13/15 ]


We can rearrange the PageRank equation as

r = β·M·r + [(1−β)/N]_N

where [(1−β)/N]_N is a vector with all N entries equal to (1−β)/N. Each iteration then multiplies by the sparse matrix M and adds a constant.

Encode the sparse matrix using only its nonzero entries. Space is roughly proportional to the number of links: say 10N entries, or 4·10·1 billion = 40 GB. Still won’t fit in memory, but will fit on disk.

source node | degree | destination nodes
0 | 3 | 1, 5, 7
1 | 5 | 17, 64, 113, 117, 245
2 | 2 | 13, 23
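This encoding might be sketched as a list of (source, out-degree, destinations) records; the node ids are the ones from the table above:

```python
# Sparse, row-per-source encoding of the link matrix: each record
# stores a source node, its out-degree, and its destination list.
links = [
    (0, 3, [1, 5, 7]),
    (1, 5, [17, 64, 113, 117, 245]),
    (2, 2, [13, 23]),
]
# Space is proportional to the number of edges: with ~10 links per
# page and 4 bytes per id, that is about 40 bytes per page.
total_edges = sum(deg for _, deg, _ in links)
print(total_edges)  # 10
```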


Assume there is enough RAM to fit r_new in memory; store r_old and matrix M on disk. Then one step of power iteration is:

Initialize all entries of r_new to (1−β)/N
For each page p (of out-degree n):
  Read into memory: p, n, dest_1, …, dest_n, r_old(p)
  for j = 1…n: r_new(dest_j) += β · r_old(p) / n

src | degree | destination
0 | 3 | 1, 5, 6
1 | 4 | 17, 64, 113, 117
2 | 2 | 13, 23
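The update step above, sketched over an in-memory stand-in for the on-disk records (the 7-node graph here is made up for illustration; in the real setting the records would be streamed from disk):

```python
beta = 0.8
N = 7
# (page, out_degree, destinations) records, standing in for the
# on-disk edge file; every node here has at least one out-link.
graph = [
    (0, 3, [1, 5, 6]),
    (1, 4, [2, 3, 4, 5]),
    (2, 2, [0, 6]),
    (3, 1, [0]),
    (4, 2, [0, 1]),
    (5, 1, [6]),
    (6, 1, [0]),
]
r_old = [1.0 / N] * N
r_new = [(1 - beta) / N] * N        # initialize every entry to (1-beta)/N
for p, deg, dests in graph:         # stream records "from disk"
    for d in dests:
        r_new[d] += beta * r_old[p] / deg
print(r_new)
```

Since no node is a dead end, the total rank stays 1: the (1−β) teleport mass plus the β that flows along links.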


Assume there is enough RAM to fit r_new in memory; store r_old and matrix M on disk. In each iteration, we have to:

Read r_old and M
Write r_new back to disk

IO cost = 2|r| + |M|

Question: what if we could not even fit r_new in memory?


src | degree | destination
0 | 4 | 0, 1, 3, 5
1 | 2 | 0, 5
2 | 2 | 3, 4


This is similar to a nested-loop join in databases: break r_new into k blocks that fit in memory, and scan M and r_old once for each block.

Cost of k scans of M and r_old: k(|M| + |r|) + |r| = k|M| + (k+1)|r|

Can we do better? Hint: M is much bigger than r (approx. 10–20x), so we must avoid reading it k times per iteration.


The edge file, broken into stripes by destination block:

src | degree | destination
0 | 4 | 0, 1
1 | 3 | 0
2 | 2 | 1

src | degree | destination
0 | 4 | 3
2 | 2 | 3

src | degree | destination
0 | 4 | 5
1 | 3 | 5
2 | 2 | 4


Break M into stripes: each stripe contains only the destination nodes in the corresponding block of r_new. There is some additional overhead per stripe, but it is usually worth it.

Cost per iteration: |M|(1+ε) + (k+1)|r|


Problems with PageRank:

Measures generic popularity of a page: biased against topic-specific authorities. Solution: Topic-Specific PageRank (next).

Uses a single measure of importance; other models exist, e.g., hubs-and-authorities. Solution: Hubs-and-Authorities (next).

Susceptible to link spam: artificial link topologies created to boost PageRank. Solution: TrustRank (next).