# CS246: Mining Massive Datasets

Jure Leskovec, Stanford University

http://cs246.stanford.edu

What is the structure of the Web?

How is it organized?

2/7/2011 Jure Leskovec, Stanford C246: Mining Massive Datasets 2

Web as a directed graph: nodes are web pages, edges are hyperlinks.

Two types of directed graphs:

DAG (Directed Acyclic Graph): has no cycles; if u can reach v, then v cannot reach u

Strongly connected: any node can reach any node via a directed path

Any directed graph can be expressed in terms of these two types of graphs

A strongly connected component (SCC) is a set of nodes S such that:

Every pair of nodes in S can reach each other
There is no larger set containing S with this property

Any directed graph is a DAG on its SCCs:

Each SCC is a super-node
Super-node A links to super-node B if some node in A links to some node in B

Take a large snapshot of the Web and try to understand how its SCCs "fit" together as a DAG.

Computational issue: say we want to find the SCC containing a specific node v.

Observation:

Out(v) ... nodes reachable from v (via out-edges)
In(v) ... nodes that can reach v (via in-edges)

SCC containing v:

SCC(v) = Out(v, G) ∩ In(v, G) = Out(v, G) ∩ Out(v, G')

where G' is G with the directions of all edges flipped
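This observation translates directly into two graph searches: forward reachability in G and forward reachability in the reversed graph. A minimal Python sketch (the graph `g` below is a made-up toy example):

```python
from collections import deque

def reachable(adj, v):
    """Set of nodes reachable from v by BFS over directed edges."""
    seen, queue = {v}, deque([v])
    while queue:
        u = queue.popleft()
        for w in adj.get(u, []):
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return seen

def scc_of(adj, v):
    """SCC containing v = Out(v, G) ∩ Out(v, G'), where G' flips every edge."""
    rev = {}
    for u, outs in adj.items():
        for w in outs:
            rev.setdefault(w, []).append(u)
    return reachable(adj, v) & reachable(rev, v)

# Toy graph: 1 -> 2 -> 3 -> 1 is a cycle; 4 is reachable but cannot get back.
g = {1: [2], 2: [3], 3: [1, 4], 4: []}
print(sorted(scc_of(g, 1)))  # [1, 2, 3]
```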

[Broder et al., ‘00]

250 million webpages, 1.5 billion links [AltaVista]

Out-/in-degree distribution:

p_k: fraction of nodes with k out-/in-links
Histogram of p_k (normalized count) vs. k

Plotting the same data on log-log axes gives a straight line:

log p_k = log β - α log k,  i.e.,  p_k = β k^(-α)

[Broder et al., '00]
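The fit implied by the log-log plot is an ordinary least-squares line fit on (log k, log p_k). A sketch on synthetic, noiseless data; `alpha_true` and `beta_true` are made-up illustration constants, not Broder et al.'s measurements:

```python
import math

# p_k = beta * k^(-alpha)  =>  log p_k = log beta - alpha * log k
alpha_true, beta_true = 2.1, 0.5          # made-up parameters
ks = list(range(1, 50))
pk = [beta_true * k ** (-alpha_true) for k in ks]

# Least-squares line fit in log-log space:
xs = [math.log(k) for k in ks]
ys = [math.log(p) for p in pk]
n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n
slope = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) \
        / sum((x - xbar) ** 2 for x in xs)
intercept = ybar - slope * xbar

alpha_hat, beta_hat = -slope, math.exp(intercept)
print(round(alpha_hat, 3), round(beta_hat, 3))  # 2.1 0.5
```

On real degree data the points are noisy, so the recovered exponent is only approximate; here the fit is exact because the data lies on the line.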

Random network vs. power-law network:

Random network: degree distribution is binomial, i.e., all nodes have similar degree
Power-law network: degrees are heavily skewed

Web pages are not equally "important":

www.joe-schmoe.com vs. www.stanford.edu

Since there is large diversity in the connectivity of the web graph, we can rank the pages by the structure of their links.

We will cover the following link analysis approaches to computing the importance of nodes in a graph:

PageRank
Hubs and Authorities (HITS)
Topic-Specific (Personalized) PageRank
Spam detection algorithms

First try:

A page is more important if it has more links
Links from important pages count more

Recursive question!

Each link's vote is proportional to the importance of its source page:

If page P with importance x has n out-links, each link gets x/n votes
Page P's own importance is the sum of the votes on its in-links

The web in 1839: three pages, Yahoo (y), Amazon (a), and M'soft (m).

[Figure: Yahoo links to itself and to Amazon; Amazon links to Yahoo and to M'soft; M'soft links to Amazon. Each out-link carries half of its source's importance (y/2, a/2), except M'soft's single link, which carries all of m.]

Flow equations:

y = y/2 + a/2
a = y/2 + m
m = a/2

3 equations, 3 unknowns, no constants:

No unique solution; all solutions are equivalent modulo a scale factor
Adding the constraint y + a + m = 1 forces a unique solution:
y = 2/5, a = 2/5, m = 1/5

Gaussian elimination works for small examples, but we need a better method for large web-sized graphs.
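Since the slide mentions Gaussian elimination, here is a small exact-arithmetic sketch that solves the flow equations together with the y + a + m = 1 constraint; `gauss_solve` is our own helper, not a library routine:

```python
from fractions import Fraction as F

def gauss_solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with exact fractions."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[piv] = M[piv], M[col]
        for r in range(n):
            if r != col and M[r][col] != 0:
                f = M[r][col] / M[col][col]
                M[r] = [x - f * y for x, y in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

# The three flow equations are redundant (they sum to 0 = 0), so replace
# one of them with the normalization y + a + m = 1 to pin down the scale:
half = F(1, 2)
A = [[half - 1, half,  F(0)],  # y = y/2 + a/2  ->  (1/2 - 1)y + a/2 = 0
     [half,     F(-1), F(1)],  # a = y/2 + m    ->  y/2 - a + m = 0
     [F(1),     F(1),  F(1)]]  # y + a + m = 1
b = [F(0), F(0), F(1)]
y, a, m = gauss_solve(A, b)
print(y, a, m)  # 2/5 2/5 1/5
```

This works fine for 3 pages; for a billion pages we need the iterative method below.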

Matrix M has one row and one column for each web page.

Suppose page j has n out-links:
If j → i, then M_ij = 1/n, else M_ij = 0

M is a column stochastic matrix: its columns sum to 1.

Suppose r is a vector with one entry per web page:
r_i is the importance score of page i
Call it the rank vector
|r| = 1

Suppose page j links to 3 pages, including i. Then column j of M has the entry 1/3 in row i, and page j contributes r_j/3 to r_i.

[Figure: M · r = r, with the 1/3 entry in row i, column j]

The flow equations can be written as

r = M·r

So the rank vector r is an eigenvector of the stochastic web matrix M. In fact, it is the first (principal) eigenvector, with corresponding eigenvalue 1.

Example (Yahoo, Amazon, M'soft):

y = y/2 + a/2
a = y/2 + m
m = a/2

r = M·r:

        Y!   A   MS
  Y!    ½    ½    0
  A     ½    0    1
  MS    0    ½    0

  y     ½  ½  0     y
  a  =  ½  0  1  ·  a
  m     0  ½  0     m

Simple iterative scheme (power iteration):

Suppose there are N web pages
Initialize: r^(0) = [1/N, ..., 1/N]^T
Iterate: r^(k+1) = M·r^(k)
Stop when |r^(k+1) - r^(k)|_1 < ε

|x|_1 = Σ_{1≤i≤N} |x_i| is the L1 norm; any other vector norm (e.g., Euclidean) can be used
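A minimal sketch of this scheme in plain Python (`power_iterate` is our own helper name), applied to the Yahoo/Amazon/M'soft matrix:

```python
def power_iterate(M, r, eps=1e-12):
    """Repeatedly apply r <- M r until the L1 change drops below eps."""
    n = len(r)
    while True:
        r_new = [sum(M[i][j] * r[j] for j in range(n)) for i in range(n)]
        if sum(abs(x - y) for x, y in zip(r_new, r)) < eps:
            return r_new
        r = r_new

# Column-stochastic matrix for the Yahoo / Amazon / M'soft example:
M = [[0.5, 0.5, 0.0],
     [0.5, 0.0, 1.0],
     [0.0, 0.5, 0.0]]
r = power_iterate(M, [1/3, 1/3, 1/3])
print([round(x, 3) for x in r])  # [0.4, 0.4, 0.2]
```

The result matches the flow-equation solution y = a = 2/5, m = 1/5.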

Power iteration:

Set r_i = 1/N
r_i = Σ_j M_ij · r_j
And iterate

Example:

        Y!   A   MS
  Y!    ½    ½    0
  A     ½    0    1
  MS    0    ½    0

  y     1/3   1/3   5/12   3/8          2/5
  a  =  1/3   1/2   1/3    11/24  ...   2/5
  m     1/3   1/6   1/4    1/6          1/5

Imagine a random web surfer:

At any time t, the surfer is on some page P
At time t+1, the surfer follows an out-link from P uniformly at random
Ends up on some page Q linked from P
The process repeats indefinitely

Let p(t) be a vector whose i-th component is the probability that the surfer is at page i at time t

p(t) is a probability distribution over pages

Where is the surfer at time t+1?

Follows a link uniformly at random: p(t+1) = M·p(t)

Suppose the random walk reaches a state such that p(t+1) = M·p(t) = p(t)
Then p(t) is called a stationary distribution of the random walk

Our rank vector r satisfies r = M·r, so it is a stationary distribution for the random surfer

A central result from the theory of random walks (a.k.a. Markov processes):

For graphs that satisfy certain conditions (the chain is irreducible and aperiodic), the stationary distribution is unique and will eventually be reached no matter what the initial probability distribution at time t = 0 is.

Pages with no out-links ("dead ends") cause importance to leak out

Spider traps (all out-links are within the group):
A group of pages is a spider trap if there are no links from within the group to the outside of the group
The random surfer gets trapped, and eventually spider traps absorb all importance

Power iteration:

Set r_i = 1
r_i = Σ_j M_ij · r_j
And iterate

Example (M'soft links only to itself, forming a spider trap):

        Y!   A   MS
  Y!    ½    ½    0
  A     ½    0    0
  MS    0    ½    1

  y     1    1     ¾     5/8         0
  a  =  1    ½     ½     3/8   ...   0
  m     1    3/2   7/4   2           3

The Google solution for spider traps: teleports

At each time step, the random surfer has two options:

With probability β, follow an out-link chosen uniformly at random
With probability 1-β, jump to a random page

Common values for β are in the range 0.8 to 0.9

The surfer will teleport out of a spider trap within a few time steps

Example with β = 0.8. Each link's transition probability becomes 0.8·1/2, plus a teleport probability of 0.2·1/3 to every page (Yahoo, Amazon, M'soft):

            1/2  1/2   0          1/3  1/3  1/3      7/15  7/15   1/15
  A = 0.8 · 1/2   0    0  + 0.2 · 1/3  1/3  1/3  =   7/15  1/15   1/15
             0   1/2   1          1/3  1/3  1/3      1/15  7/15  13/15

Iterating r = A·r with the teleport matrix (β = 0.8) on the Yahoo, Amazon, M'soft example:

  y     1     1.00   0.84    0.776          7/11
  a  =  1     0.60   0.60    0.536   ...    5/11
  m     1     1.40   1.56    1.688         21/11

M'soft no longer absorbs all the importance.

Dead ends: such pages cause importance to leak out

Power iteration:

Set r_i = 1
r_i = Σ_j M_ij · r_j
And iterate

Example (M'soft is now a dead end, so its column of M is all 0):

        Y!   A   MS
  Y!    ½    ½    0
  A     ½    0    0
  MS    0    ½    0

  y     1    1    ¾    5/8         0
  a  =  1    ½    ½    3/8   ...   0
  m     1    ½    ¼    ¼           0

Teleports also fix dead ends: from a dead end, follow a random teleport link with probability 1.0 (and adjust the matrix accordingly)

Suppose there are N pages. Consider a page j with set of out-links O(j).

We have M_ij = 1/|O(j)| when j→i, and M_ij = 0 otherwise.

The random teleport is equivalent to:

Adding a teleport link from j to every other page, each with probability (1-β)/N
Reducing the probability of following each out-link from 1/|O(j)| to β/|O(j)|
Equivalently: tax each page a fraction (1-β) of its score and redistribute it evenly

Construct the N x N matrix A as follows:

A_ij = β·M_ij + (1-β)/N

Verify that A is a stochastic matrix (its columns sum to 1).

The PageRank vector r is the principal eigenvector of this matrix, satisfying r = A·r.
Equivalently, r is the stationary distribution of the random walk with teleports.
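A small numeric check of this construction on the 3-page spider-trap example with β = 0.8; the variable names are our own:

```python
N, beta = 3, 0.8

# Column-stochastic M with a spider trap at M'soft (last page):
M = [[0.5, 0.5, 0.0],
     [0.5, 0.0, 0.0],
     [0.0, 0.5, 1.0]]

# A_ij = beta * M_ij + (1 - beta) / N
A = [[beta * M[i][j] + (1 - beta) / N for j in range(N)] for i in range(N)]

# Every column of A sums to 1, so A is column stochastic:
for j in range(N):
    assert abs(sum(A[i][j] for i in range(N)) - 1.0) < 1e-12

# Power iteration on A reaches the stationary distribution:
r = [1.0 / N] * N
for _ in range(200):
    r = [sum(A[i][j] * r[j] for j in range(N)) for i in range(N)]
print([round(x, 4) for x in r])  # ~ [7/33, 5/33, 21/33]
```

Normalized to sum to 1, this matches the 7/11 : 5/11 : 21/11 ratios from the slide example.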

The key step is the matrix-vector multiplication r_new = A·r_old

Easy if we have enough main memory to hold A, r_old, r_new

Say N = 1 billion pages, and we need 4 bytes per entry (say):
2 billion entries for the two vectors: approx 8GB
Matrix A has N^2 = 10^18 entries: far too large!

r = A·r, where A_ij = β·M_ij + (1-β)/N

r_i = Σ_{1≤j≤N} A_ij · r_j
    = Σ_{1≤j≤N} [β·M_ij + (1-β)/N] · r_j
    = β · Σ_{1≤j≤N} M_ij · r_j + (1-β)/N · Σ_{1≤j≤N} r_j
    = β · Σ_{1≤j≤N} M_ij · r_j + (1-β)/N,   since |r| = 1

So r = β·M·r + [(1-β)/N]_N
where [x]_N is an N-vector with all entries x
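The last step of the derivation (dropping the Σ r_j factor because |r| = 1) can be checked numerically on a random column-stochastic matrix; this sketch assumes no dead ends, and all names are our own:

```python
import random
random.seed(1)

N, beta = 5, 0.85

# Random column-stochastic M (columns sum to 1, no dead ends):
cols = []
for _ in range(N):
    w = [random.random() for _ in range(N)]
    s = sum(w)
    cols.append([x / s for x in w])
M = [[cols[j][i] for j in range(N)] for i in range(N)]

r = [1.0 / N] * N  # any distribution with |r|_1 = 1

# Full form: r' = A r with A_ij = beta*M_ij + (1-beta)/N
full = [sum((beta * M[i][j] + (1 - beta) / N) * r[j] for j in range(N))
        for i in range(N)]
# Rearranged form: beta * (M r) + (1-beta)/N, valid because sum(r) == 1
short = [beta * sum(M[i][j] * r[j] for j in range(N)) + (1 - beta) / N
         for i in range(N)]

assert all(abs(x - y) < 1e-12 for x, y in zip(full, short))
```

The rearranged form never materializes A, which is the whole point of the next slide.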

We can rearrange the PageRank equation:

r = β·M·r + [(1-β)/N]_N

[(1-β)/N]_N is an N-vector with all entries (1-β)/N

M is a sparse matrix! With ~10 links per node: approx 10N entries

So in each iteration, we need to:

Compute r_new = β·M·r_old
Add a constant value (1-β)/N to each entry in r_new

Encode the sparse matrix using only its nonzero entries:

Space proportional roughly to the number of links: say 10N, or 4·10·1 billion = 40GB
Still won't fit in memory, but will fit on disk

| source node | degree | destination nodes     |
|-------------|--------|-----------------------|
| 0           | 3      | 1, 5, 7               |
| 1           | 5      | 17, 64, 113, 117, 245 |
| 2           | 2      | 13, 23                |

Assume there is enough RAM to fit r_new in memory; store r_old and matrix M on disk.

Initialize all entries of r_new to (1-β)/N
For each page p (of out-degree n):
    Read into memory: p, n, dest_1, ..., dest_n, r_old(p)
    for j = 1...n:
        r_new(dest_j) += β · r_old(p) / n

| src | degree | destination      |
|-----|--------|------------------|
| 0   | 3      | 1, 5, 6          |
| 1   | 4      | 17, 64, 113, 117 |
| 2   | 2      | 13, 23           |

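The pass above can be simulated in a few lines of Python; the `records` list is made-up toy data standing in for the on-disk file (the destinations in the table above belong to a much larger graph):

```python
beta, N = 0.8, 7

# Toy "disk file": one record (src, out-degree, destinations) per page.
records = [
    (0, 3, [1, 5, 6]),
    (1, 4, [2, 3, 4, 5]),
    (2, 2, [0, 5]),
    (3, 1, [2]),
    (4, 2, [2, 6]),
    (5, 1, [0]),
    (6, 1, [0]),
]
r_old = [1.0 / N] * N

r_new = [(1 - beta) / N] * N     # initialize every entry to (1-beta)/N
for p, n, dests in records:      # one sequential pass over M and r_old
    share = beta * r_old[p] / n
    for d in dests:
        r_new[d] += share

# No dead ends in this toy graph, so total importance is conserved:
assert abs(sum(r_new) - 1.0) < 1e-12
```

Only r_new lives in memory; records and r_old are read sequentially, which is exactly what makes the disk-based pass cheap.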

In each iteration, we have to:

Read r_old and M from disk
Write r_new back to disk
IO cost = 2|r| + |M|

Questions:

What if we had enough memory to fit both r_new and r_old?
What if we could not even fit r_new in memory?

Limitations of PageRank:

Measures generic popularity of a page
Biased against topic-specific authorities
Solution: Topic-Specific PageRank (next lecture)

Uses a single measure of importance
Other models exist, e.g., hubs-and-authorities
Solution: Hubs-and-Authorities (next)

Susceptible to link spam
Artificial link topologies created in order to boost PageRank
Solution: TrustRank (next lecture)

Interesting pages fall into two classes:

1. Authorities are pages containing useful information
2. Hubs are pages that link to authorities (e.g., a list of newspapers, a course bulletin, a list of US auto manufacturers)

[Figure: example authority scores: NYT: 10, CNN: 8, WSJ: 9, Ebay: 3, Yahoo: 3]


A good hub links to many good authorities

A good authority is linked from many good hubs

Model using two scores for each node:
Hub score and authority score, represented as vectors h and a

Adjacency matrix A: A[i, j] = 1 if page i links to page j, 0 otherwise

A^T, the transpose of A, is similar to the PageRank matrix M, but A^T has 1's where M has fractions

Example (Yahoo, Amazon, M'soft):

        y  a  m
    y   1  1  1
A = a   1  0  1
    m   0  1  0


Notation:

Vectors a = (a_1, ..., a_n), h = (h_1, ..., h_n)
Adjacency matrix (n x n): A_ij = 1 if i→j, else A_ij = 0

Then:

h_i = Σ_{i→j} a_j  ⇔  h_i = Σ_j A_ij · a_j,  so  h = A·a

Likewise:

a_i = Σ_{j→i} h_j  ⇔  a_i = Σ_j A_ji · h_j,  so  a = A^T·h

The hub score of page i is proportional to the sum of the authority scores of the pages it links to: h = λ·A·a
Constant λ is a scaling factor, λ = 1/Σ h_i

The authority score of page i is proportional to the sum of the hub scores of the pages it is linked from: a = μ·A^T·h
Constant μ is a scaling factor, μ = 1/Σ a_i

The HITS algorithm:

Initialize h, a to all 1's
Repeat:
  h = A·a
  Scale h so that it sums to 1.0
  a = A^T·h
  Scale a so that it sums to 1.0
Until h, a converge (i.e., change very little)
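A direct transcription of the algorithm (the `hits` function name is our own); on the 3-page example, h converges to the ratios 1 : 0.732 : 0.268, rescaled here to sum to 1:

```python
def hits(A, iters=100):
    """HITS iterations: h = A a, a = A^T h, each rescaled to sum to 1."""
    n = len(A)
    h, a = [1.0] * n, [1.0] * n
    for _ in range(iters):
        h = [sum(A[i][j] * a[j] for j in range(n)) for i in range(n)]
        s = sum(h)
        h = [x / s for x in h]
        a = [sum(A[i][j] * h[i] for i in range(n)) for j in range(n)]
        s = sum(a)
        a = [x / s for x in a]
    return h, a

# Adjacency matrix from the Yahoo / Amazon / M'soft example:
A = [[1, 1, 1],
     [1, 0, 1],
     [0, 1, 0]]
h, a = hits(A)
print([round(x, 3) for x in h])  # [0.5, 0.366, 0.134]
print([round(x, 3) for x in a])  # [0.366, 0.268, 0.366]
```

A fixed iteration count stands in for the convergence test to keep the sketch short; in practice you would stop when h and a change very little.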

Example (scaling each vector so its largest component is 1):

      1 1 1          1 1 0
  A = 1 0 1    A^T = 1 0 1
      0 1 0          1 1 0

  h(yahoo)  = 1    1     1      1      ...   1.000
  h(amazon) = 1    2/3   0.71   0.73   ...   0.732
  h(m'soft) = 1    1/3   0.29   0.27   ...   0.268

  a(yahoo)  = 1    1     1      ...   1
  a(amazon) = 1    4/5   0.75   ...   0.732
  a(m'soft) = 1    1     1      ...   1

Algorithm:

Set a = h = 1_n
Repeat: h = A·a, a = A^T·h
Normalize

So a is updated in two steps: a = A^T·(A·a) = (A^T A)·a
And h likewise: h = A·(A^T·h) = (A A^T)·h

This is repeated matrix powering.

Under reasonable assumptions about A, the HITS iterative algorithm converges to vectors h* and a*:

h* is the principal eigenvector of the matrix A·A^T
a* is the principal eigenvector of the matrix A^T·A

PageRank and HITS are two solutions to the same problem: what is the value of an in-link from u to v?

In the PageRank model, the value of the link depends on the links into u
In the HITS model, it depends on the value of the other links out of u