CS246: Mining Massive Datasets

Jure Leskovec, **Stanford** **University**

http://cs246.stanford.edu

What is the structure of the Web?

How is it organized?

2/7/2011 Jure Leskovec, **Stanford** C246: Mining Massive Datasets 2

What is the structure of the Web?

How is it organized?

Web as a

directed graph

2/7/2011 Jure Leskovec, **Stanford** C246: Mining Massive Datasets 3

Two types of directed graphs:

DAG – Directed Acyclic Graph:

Has no cycles: if u can reach v,

then v can not reach u

Strongly connected:

Any node can reach any node

via a directed path

Any directed graph can be

expressed in terms of these

two types of graphs

2/7/2011 Jure Leskovec, **Stanford** C246: Mining Massive Datasets 4

Strongly connected component (SCC) is a set of

nodes S:

Every pair of nodes in S can reach each other

There is no larger set containing S with this property

Any directed graph is a DAG on its SCCs:

Each SCC is a super-node

Super-node A links to super-node B if a node in A links to

node in B

2/7/2011 Jure Leskovec, **Stanford** C246: Mining Massive Datasets 5

Take a large snapshot of the Web and try to

understand how it’s SCCs “fit” as a DAG

Computational issues:

Say want to find SCC containing specific node v?

Observation:

Out(v) … nodes reachable from v (via out-edges)

In(v) … nodes reachable from v (via in-edges)

SCC containing v:

v

= Out(v, G) ∩ In(v, G)

= Out(v, G) ∩ Out(v, G)

where G is G with directions of edges flipped

2/7/2011 Jure Leskovec, **Stanford** C246: Mining Massive Datasets 6

[Broder et al., ‘00]

250 million webpages, 1.5 billion links [Altavista]

2/7/2011 Jure Leskovec, **Stanford** C246: Mining Massive Datasets 7

Out-/In- Degree Distribution:

p k: fraction of nodes with k out-/in-links

Histogram of p k vs. k

Normalized count, p k

2/7/2011 Jure Leskovec, **Stanford** C246: Mining Massive Datasets 8

Normalized count, p k

Plot the same data on log-log axes:

log = log β −α

log k

2/7/2011 Jure Leskovec, **Stanford** C246: Mining Massive Datasets 9

p k

p k

β

α −

= k

[Broder et al., ‘00]

2/7/2011 Jure Leskovec, **Stanford** C246: Mining Massive Datasets 10

Random network Power-law network

Degree distribution is Binomial, i.e.,

all nodes have similar degree

2/7/2011

Degrees are Power-law,

i.e., heavily skewed

Jure Leskovec, **Stanford** C246: Mining Massive Datasets 11

Web pages are not equally “important”

www.joe-schmoe.com vs. www.stanford.edu

Since there is large diversity in the

connectivity of the

webgraph we can

rank the pages by

the link structure

2/7/2011 Jure Leskovec, **Stanford** C246: Mining Massive Datasets 12

We will cover the following Link Analysis

approaches to computing importances of

nodes in a graph:

Page Rank

Hubs and Authorities (HITS)

Topic-Specific (Personalized) Page Rank

Spam Detection Algorithms

2/7/2011 Jure Leskovec, **Stanford** C246: Mining Massive Datasets

13

First try:

Page is more important if it has more links

In-coming links? Out-going links?

Think of in-links as votes:

www.stanford.edu has 23,400 inlinks

www.joe-schmoe.com has 1 inlink

Are all in-links are equal?

Links from important pages count more

Recursive question!

2/7/2011 Jure Leskovec, **Stanford** C246: Mining Massive Datasets 14

Each link’s vote is proportional to the

importance of its source page

If page P with importance x has n out-links,

each link gets x/n votes

Page P’s own importance is the sum of the

votes on its in-links

2/7/2011 Jure Leskovec, **Stanford** C246: Mining Massive Datasets

15

The web in 1839

a/2

Amazon

a

y

Yahoo

y/2

m

a/2

y/2

M’soft

m

y = y /2 + a /2

a = y /2 + m

m = a /2

2/7/2011 Jure Leskovec, **Stanford** C246: Mining Massive Datasets

16

3 equations, 3 unknowns, no constants

No unique solution

All solutions equivalent modulo scale factor

Additional constraint forces uniqueness

y+a+m = 1

y = 2/5, a = 2/5, m = 1/5

Gaussian elimination method works for small

examples, but we need a better method for

large web-size graphs

2/7/2011 Jure Leskovec, **Stanford** C246: Mining Massive Datasets

17

Matrix M has one row and one column for each

web page

Suppose page j has n out-links

If j → i, then M ij = 1/n

else M ij = 0

M is a column stochastic matrix

Columns sum to 1

Suppose r is a vector with one entry per web

page:

r i is the importance score of page i

Call it the rank vector

|r| = 1

2/7/2011 Jure Leskovec, **Stanford** C246: Mining Massive Datasets

18

Suppose page j links to 3 pages, including i

i

j

1/3

M r r

2/7/2011 Jure Leskovec, **Stanford** C246: Mining Massive Datasets 19

=

i

The flow equations can be written

r = M∙r

So the rank vector is an eigenvector of the

stochastic web matrix

In fact, its first or principal eigenvector, with

corresponding eigenvalue 1

2/7/2011 Jure Leskovec, **Stanford** C246: Mining Massive Datasets

20

Amazon

Yahoo

M’soft

y = y /2 + a /2

a = y /2 + m

m = a /2

Y! A MS

Y! ½ ½ 0

A ½ 0 1

MS 0 ½ 0

r = Mr

y ½ ½ 0 y

a = ½ 0 1 a

m 0 ½ 0 m

2/7/2011 Jure Leskovec, **Stanford** C246: Mining Massive Datasets

21

Simple iterative scheme

Suppose there are N web pages

Initialize: r 0 = [1/N,….,1/N] T

Iterate: r k+1 = M∙r k

Stop when |r k+1 - r k | 1 < ε

|x| 1 = ∑ 1≤i≤N|x i| is the L1 norm

Can use any other vector norm e.g., Euclidean

2/7/2011 Jure Leskovec, **Stanford** C246: Mining Massive Datasets

22

Power iteration:

Set r i=1/n

r i=∑ j M ij∙r j

And iterate

Example:

y

a =

m

1/3

1/3

1/3

1/3

1/2

1/6

Y!

A MS

5/12

1/3

1/4

3/8

11/24

1/6

2/5

2/5

1/5

2/7/2011 Jure Leskovec, **Stanford** C246: Mining Massive Datasets 23

. . .

Y! A MS

Y! ½ ½ 0

A ½ 0 1

MS 0 ½ 0

Imagine a random web surfer

At any time t, surfer is on some page P

At time t+1, the surfer follows an outlink from P

uniformly at random

Ends up on some page Q linked from P

Process repeats indefinitely

Let p(t) be a vector whose i th component is

the probability that the surfer is at page i at

time t

p(t) is a probability distribution on pages

2/7/2011 Jure Leskovec, **Stanford** C246: Mining Massive Datasets

24

Where is the surfer at time t+1?

Follows a link uniformly at random

p(t+1) = M∙p(t)

Suppose the random walk reaches a state

such that p(t+1) = M∙p(t) = p(t)

Then p(t) is called a stationary distribution

for the random walk

Our rank vector r satisfies r = M∙r

So it is a stationary distribution for the random

surfer

2/7/2011 Jure Leskovec, **Stanford** C246: Mining Massive Datasets

25

A central result from the theory of random walks

(aka Markov processes):

For graphs that satisfy certain conditions, the

stationary distribution is unique and eventually

will be reached no matter what the initial

probability distribution at time t = 0.

2/7/2011 Jure Leskovec, **Stanford** C246: Mining Massive Datasets

26

Some pages are “dead ends”

(have no out-links)

Such pages cause importance

to leak out

Spider traps (all out links are

within the group)

A group of pages is a spider trap if there are no links

from within the group to the outside of the group

Random surfer gets trapped

And eventually spider traps absorb all importance

2/7/2011 Jure Leskovec, **Stanford** C246: Mining Massive Datasets 27

Power iteration:

Set r i=1

r i=∑ j M ij∙r j

And iterate

Example:

y 1 1 ¾ 5/8 0

a = 1 ½ ½ 3/8 … 0

m 1 3/2 7/4 2 3

2/7/2011 Jure Leskovec, **Stanford** C246: Mining Massive Datasets

Y!

A MS

Y! A MS

Y! ½ ½ 0

A ½ 0 0

MS 0 ½ 1

28

The Google solution for spider traps

At each time step, the random surfer has two

options:

With probability β, follow a link at random

With probability 1-β, jump to some page uniformly

at random

Common values for β are in the range 0.8 to 0.9

Surfer will teleport out of spider trap within a

few time steps

2/7/2011 Jure Leskovec, **Stanford** C246: Mining Massive Datasets

29

1/2

0.8*1/2

Amazon

0.2*1/3

Yahoo

0.2*1/3

1/2

0.8*1/2

0.2*1/3

M’soft

0.8

y

y 1/2

a 1/2

m 0

1/2 1/2 0

1/2 0 0

0 1/2 1

0.8*

y

1/2

1/2

0

+ 0.2

+ 0.2*

y 7/15 7/15 1/15

a 7/15 1/15 1/15

m 1/15 7/15 13/15

y

1/3

1/3

1/3

1/3 1/3 1/3

1/3 1/3 1/3

1/3 1/3 1/3

2/7/2011 Jure Leskovec, **Stanford** C246: Mining Massive Datasets

30

Amazon

y

a =

m

Yahoo

1

1

1

M’soft

1.00

0.60

1.40

0.8

0.84

0.60

1.56

1/2 1/2 0

1/2 0 0

0 1/2 1

+ 0.2

y 7/15 7/15 1/15

a 7/15 1/15 1/15

m 1/15 7/15 13/15

0.776

0.536

1.688

. . .

1/3 1/3 1/3

1/3 1/3 1/3

1/3 1/3 1/3

7/11

5/11

21/11

2/7/2011 Jure Leskovec, **Stanford** C246: Mining Massive Datasets

31

Some pages are “dead ends”

(have no out-links)

Such pages cause importance

to leak out

Power iteration:

Set r i=1

r i=∑ j M ij∙r j

And iterate

Example:

y 1 1 ¾ 5/8 0

a = 1 ½ ½ 3/8 … 0

m 1 ½ ¼ ¼ 0

2/7/2011 Jure Leskovec, **Stanford** C246: Mining Massive Datasets

Y!

A MS

Y! A MS

Y! ½ ½ 0

A ½ 0 0

MS 0 ½ 0

32

Teleports

Follow random teleport links with probability 1.0

from dead-ends

Adjust matrix accordingly

2/7/2011 Jure Leskovec, **Stanford** C246: Mining Massive Datasets

33

Suppose there are N pages

Consider a page j, with set of outlinks O(j)

We have M ij = 1/|O(j)| when j→i and M ij = 0

otherwise

The random teleport is equivalent to

adding a teleport link from j to every other page with

probability (1-β)/N

reducing the probability of following each outlink from

1/|O(j)| to β/|O(j)|

Equivalent: tax each page a fraction (1-β) of its score and

redistribute evenly

2/7/2011 Jure Leskovec, **Stanford** C246: Mining Massive Datasets

34

Construct the N x N matrix A as follows

A ij = β∙M ij + (1-β)/N

Verify that A is a stochastic matrix

The page rank vector r is the principal

eigenvector of this matrix

satisfying r = A∙r

Equivalently, r is the stationary distribution of

the random walk with teleports

2/7/2011 Jure Leskovec, **Stanford** C246: Mining Massive Datasets

35

Key step is matrix-vector multiplication

r new = A∙r old

Easy if we have enough main memory to

hold A, r old , r new

Say N = 1 billion pages

We need 4 bytes for each entry (say)

2 billion entries for vectors, approx 8GB

Matrix A has N2 entries

10 18 is a large number!

2/7/2011 Jure Leskovec, **Stanford** C246: Mining Massive Datasets

36

= A∙r, where

A ij = β M ij + (1-β)/N

r i = ∑ 1≤j≤N A ij r j

r i = ∑ 1≤j≤N [β M ij + (1-β)/N] r j

= β ∑ 1≤j≤N M ij r j + (1-β)/N ∑ 1≤j≤N r j

= β ∑ 1≤j≤N M ij r j + (1-β)/N, since |r| = 1

r = βM∙r + [(1-β)/N] N

where [x] N is an N-vector with all entries x

2/7/2011 Jure Leskovec, **Stanford** C246: Mining Massive Datasets

37

We can rearrange the PageRank equation

r = β M∙r + [(1-β)/N] N

[(1-β)/N] N is an N-vector with all entries (1-β)/N

M is a sparse matrix!

10 links per node, approx 10N entries

So in each iteration, we need to:

Compute r new = β M∙r old

Add a constant value (1-β)/N to each entry in r new

2/7/2011 Jure Leskovec, **Stanford** C246: Mining Massive Datasets

38

Encode sparse matrix using only nonzero

entries

Space proportional roughly to number of links

say 10N, or 4*10*1 billion = 40GB

still won’t fit in memory, but will fit on disk

source

node

degree destination nodes

0 3 1, 5, 7

1 5 17, 64, 113, 117, 245

2 2 13, 23

2/7/2011 Jure Leskovec, **Stanford** C246: Mining Massive Datasets

39

Assume enough RAM to fit r new into memory

Store r old and matrix M on disk

Initialize all entries of rnew to (1-β)/N

For each page p (of out-degree n):

Read into memory: p, n, dest1,…,destn, rold (p)

for j = 1…n:

rnew (destj) += β rold (p) / n

0

1

2

3

4

5

6

r

src degree destination

new rold 0 3 1, 5, 6

1 4 17, 64, 113, 117

2 2 13, 23

2/7/2011 Jure Leskovec, **Stanford** C246: Mining Massive Datasets 40

0

1

2

3

4

5

6

In each iteration, we have to:

Read r old and M

Write r new back to disk

IO Cost = 2|r| + |M|

Questions:

What if we had enough memory to fit both r new

and r old ?

What if we could not even fit r new in memory?

See reading: http://i.stanford.edu/~ullman/mmds/ch5.pdf

2/7/2011 Jure Leskovec, **Stanford** C246: Mining Massive Datasets

41

Measures generic popularity of a page

Biased against topic-specific authorities

Solution: Topic-Specific PageRank (next lecture)

Uses a single measure of importance

Other models e.g., hubs-and-authorities

Solution: Hubs-and-Authorities (next)

Susceptible to Link spam

Artificial link topographies created in order to boost

page rank

Solution: TrustRank (next lecture)

2/7/2011 Jure Leskovec, **Stanford** C246: Mining Massive Datasets

46

Interesting pages fall into two classes:

1. Authorities are pages containing useful

information

Newspaper home pages

Course home pages

Home pages of auto manufacturers

2. Hubs are pages that link to authorities

List of newspapers

Course bulletin

List of US auto manufacturers

NYT: 10

Ebay: 3

Yahoo: 3

CNN: 8

WSJ: 9

2/7/2011 Jure Leskovec, **Stanford** C246: Mining Massive Datasets

47

2/7/2011 Jure Leskovec, **Stanford** C246: Mining Massive Datasets 48

2/7/2011 Jure Leskovec, **Stanford** C246: Mining Massive Datasets 49

2/7/2011 Jure Leskovec, **Stanford** C246: Mining Massive Datasets 50

A good hub links to many good authorities

A good authority is linked from many good

hubs

Model using two scores for each node:

Hub score and Authority score

Represented as vectors h and a

2/7/2011 Jure Leskovec, **Stanford** C246: Mining Massive Datasets 51

HITS uses adjacency matrix

A[i, j] = 1 if page i links to page j,

0 else

A T , the transpose of A, is similar to the PageRank

matrix M but A T has 1’s where M has fractions

Amazon

Yahoo

M’soft

A =

y a m

y 1 1 1

a 1 0 1

m 0 1 0

2/7/2011 Jure Leskovec, **Stanford** C246: Mining Massive Datasets

52

Amazon

Yahoo

M’soft

A =

y a m

y 1 1 1

a 1 0 1

m 0 1 0

2/7/2011 Jure Leskovec, **Stanford** C246: Mining Massive Datasets

53

Notation:

Vector a=(a 1…,a n), h=(h 1…,h n)

Adjacency matrix (n x n): A ij=1 if ij else A ij=0

Then:

h

i

∑

So: h = A⋅

a

Likewise:

∑

= a j ⇔ hi

=

i→

j

j

a

=

T

A ⋅h

2/7/2011 Jure Leskovec, **Stanford** C246: Mining Massive Datasets

A

ij

a

j

54

The hub score of page i is proportional to the

sum of the authority scores of the pages it

links to: h = λ A a

Constant λ is a scaling factor, λ = 1/∑h i

The authority score of page i is proportional

to the sum of the hub scores of the pages it is

linked from: a = μ A T h

Constant μ is scaling factor, μ = 1/∑a i

2/7/2011 Jure Leskovec, **Stanford** C246: Mining Massive Datasets

55

The HITS algorithm:

Initialize h, a to all 1’s

Repeat:

h = A a

Scale h so that its sums to 1.0

a = A T h

Scale a so that its sums to 1.0

Until h, a converge (i.e., change very little)

2/7/2011 Jure Leskovec, **Stanford** C246: Mining Massive Datasets

56

1 1 1

A = 1 0 1

0 1 0

a(yahoo)

a(amazon)

a(m’soft)

1 1 0

A T = 1 0 1

1 1 0

=

=

=

1

1

1

h(yahoo) = 1

h(amazon) = 1

h(m’soft) = 1

1

1

1

1

2/3

1/3

1

4/5

1

1

0.71

0.29

1

0.75

1

1

0.73

0.27

Amazon

. . .

. . .

. . .

. . .

. . .

. . .

1

0.732

1

1.000

0.732

0.268

2/7/2011 Jure Leskovec, **Stanford** C246: Mining Massive Datasets

57

Yahoo

M’soft

Algorithm:

Set: a = h = 1 n

Repeat:

h=M a, a=M T h

Normalize

Then: a=M T (M a)

new h

new a

Thus:

a=(M T M) a

h=(M M T ) h

2/7/2011 Jure Leskovec, **Stanford** C246: Mining Massive Datasets

a is being updated (in 2 steps):

M T (M a) = (M T M) a

h is updated (in 2 steps):

M (M T h) = (M M T ) h

Repeated matrix powering

58

Under reasonable assumptions about A, the

HITS iterative algorithm converges to vectors

h* and a*:

h* is the principal eigenvector of matrix AA T

a* is the principal eigenvector of matrix A T A

2/7/2011 Jure Leskovec, **Stanford** C246: Mining Massive Datasets

59

PageRank and HITS are two solutions to the

same problem:

What is the value of an in-link from u to v?

In the PageRank model, the value of the link

depends on the links into u

In the HITS model, it depends on the value of the

other links out of u

The destinies of PageRank and HITS

post-1998 were very different

2/7/2011 Jure Leskovec, **Stanford** C246: Mining Massive Datasets

60