r - SNAP - Stanford University
CS246: Mining Massive Datasets<br />
Jure Leskovec, <strong>Stanford</strong> <strong>University</strong><br />
http://cs246.stanford.edu
Web pages are not equally “important”<br />
www.joe-schmoe.com vs. www.stanford.edu<br />
We already know: since there is large diversity<br />
in the connectivity of the web graph,<br />
we can rank the pages by their link structure<br />
2/7/2012 Jure Leskovec, <strong>Stanford</strong> C246: Mining Massive Datasets 2<br />
We will cover the following Link Analysis<br />
approaches to computing the importance of<br />
nodes in a graph:<br />
PageRank<br />
Hubs and Authorities (HITS)<br />
Topic-Specific (Personalized) PageRank<br />
Web Spam Detection Algorithms<br />
Idea: Links as votes<br />
A page is more important if it has more links<br />
In-coming links? Out-going links?<br />
Think of in-links as votes:<br />
www.stanford.edu has 23,400 in-links<br />
www.joe-schmoe.com has 1 in-link<br />
Are all in-links equal?<br />
Links from important pages count more<br />
Recursive question!<br />
Each link’s vote is proportional to the<br />
importance of its source page<br />
If page p with importance x has n out-links,<br />
each link gets x/n votes<br />
Page p’s own importance is the sum of the<br />
votes on its in-links<br />
A “vote” from an important<br />
page is worth more<br />
A page is important if it is<br />
pointed to by other important<br />
pages<br />
Define a “rank” r_j for node j:<br />
r_j = ∑_{i→j} r_i / d_out(i)<br />
The web in 1839: y links to y and a<br />
(contributing y/2 each), a links to y and m<br />
(contributing a/2 each), m links only to a<br />
Flow equations:<br />
r_y = r_y/2 + r_a/2<br />
r_a = r_y/2 + r_m<br />
r_m = r_a/2<br />
3 equations, 3 unknowns,<br />
no constants<br />
No unique solution<br />
All solutions equivalent modulo a scale factor<br />
An additional constraint forces uniqueness:<br />
r_y + r_a + r_m = 1<br />
Solution: r_y = 2/5, r_a = 2/5, r_m = 1/5<br />
Flow equations:<br />
r_y = r_y/2 + r_a/2<br />
r_a = r_y/2 + r_m<br />
r_m = r_a/2<br />
Gaussian elimination works for small<br />
examples, but we need a better method for<br />
large web-sized graphs<br />
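For a graph this small, the constrained system can be solved directly. A minimal sketch (the choice of NumPy is ours, not the lecture's):

```python
import numpy as np

# Flow equations for the 1839 web (unknowns r_y, r_a, r_m),
# rewritten as "... = 0", plus the normalization as a 4th row.
A = np.array([
    [ 0.5, -0.5,  0.0],   # r_y - r_y/2 - r_a/2 = 0
    [-0.5,  1.0, -1.0],   # r_a - r_y/2 - r_m   = 0
    [ 0.0, -0.5,  1.0],   # r_m - r_a/2         = 0
    [ 1.0,  1.0,  1.0],   # r_y + r_a + r_m     = 1
])
b = np.array([0.0, 0.0, 0.0, 1.0])

# 4 equations, 3 unknowns, but the system is consistent,
# so least squares recovers the exact solution
r, *_ = np.linalg.lstsq(A, b, rcond=None)
print(r)  # approximately [0.4, 0.4, 0.2]
```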
Stochastic adjacency matrix M<br />
Let page j have d_j out-links<br />
If j → i, then M_ij = 1/d_j, else M_ij = 0<br />
M is a column-stochastic matrix:<br />
columns sum to 1<br />
Rank vector r: a vector with one entry per page<br />
r_i is the importance score of page i<br />
∑_i r_i = 1<br />
The flow equations can be written<br />
r = M · r<br />
Suppose page j links to 3 pages, including i.<br />
Then column j of M has the value 1/3 in row i,<br />
and the i-th entry of M · r receives a 1/3-share of r_j<br />
The flow equations can be written<br />
r = M · r<br />
So the rank vector r is an eigenvector of the<br />
stochastic web matrix M,<br />
in fact its principal eigenvector, with<br />
corresponding eigenvalue 1<br />
Example graph on pages y, a, m:<br />
Flow equations:<br />
r_y = r_y/2 + r_a/2<br />
r_a = r_y/2 + r_m<br />
r_m = r_a/2<br />
Stochastic matrix M:<br />
     y   a   m<br />
y    ½   ½   0<br />
a    ½   0   1<br />
m    0   ½   0<br />
In matrix form, r = M · r:<br />
[r_y]   [½ ½ 0] [r_y]<br />
[r_a] = [½ 0 1] [r_a]<br />
[r_m]   [0 ½ 0] [r_m]<br />
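As a quick sanity check of r = M · r on this example, one can verify numerically that the flow solution (2/5, 2/5, 1/5) is a fixed point and that eigenvalue 1 is attained (a sketch; NumPy assumed):

```python
import numpy as np

# Column-stochastic matrix M for the y/a/m example
# (columns = source pages y, a, m; entries are 1/out-degree)
M = np.array([
    [0.5, 0.5, 0.0],  # row y
    [0.5, 0.0, 1.0],  # row a
    [0.0, 0.5, 0.0],  # row m
])

r = np.array([0.4, 0.4, 0.2])  # the flow solution from the slides
print(M @ r)  # approximately [0.4, 0.4, 0.2]: r is a fixed point

# Eigen-decomposition confirms the principal eigenvalue is 1
vals, vecs = np.linalg.eig(M)
print(np.max(vals.real))  # approximately 1.0
```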
Given a web graph with N nodes, where the<br />
nodes are pages and the edges are hyperlinks<br />
Power iteration: a simple iterative scheme<br />
Initialize: r^(0) = [1/N, …, 1/N]^T<br />
Iterate: r^(t+1) = M · r^(t)<br />
Stop when |r^(t+1) – r^(t)|_1 < ε<br />
|x|_1 = ∑_{1≤i≤N} |x_i| is the L1 norm<br />
Any other vector norm (e.g., Euclidean) also works<br />
Per entry: r_j^(t+1) = ∑_{i→j} r_i^(t) / d_i,<br />
where d_i is the out-degree of node i<br />
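A minimal sketch of this scheme (NumPy assumed; the helper name `power_iteration` and the tolerance are ours):

```python
import numpy as np

def power_iteration(M, eps=1e-10, max_iter=1000):
    """Iterate r <- M r from the uniform vector until the
    L1 change drops below eps."""
    N = M.shape[0]
    r = np.full(N, 1.0 / N)
    for _ in range(max_iter):
        r_next = M @ r
        if np.abs(r_next - r).sum() < eps:
            return r_next
        r = r_next
    return r

# y/a/m example from the slides: converges to (2/5, 2/5, 1/5)
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 1.0],
              [0.0, 0.5, 0.0]])
print(power_iteration(M))  # approximately [0.4, 0.4, 0.2]
```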
Power Iteration:<br />
Set r^(0) = [1/N, …, 1/N]^T and iterate r^(t+1) = M · r^(t)<br />
Imagine a random web surfer:<br />
At any time t, the surfer is on some page u<br />
At time t+1, the surfer follows an<br />
out-link from u uniformly at random<br />
Ends up on some page v linked from u<br />
The process repeats indefinitely<br />
Let:<br />
p(t) … vector whose i-th coordinate is the<br />
probability that the surfer is at page i at time t<br />
p(t) is a probability distribution over pages<br />
For pages i_1, i_2, i_3 linking to page j:<br />
r_j = ∑_{i→j} r_i / d_out(i)<br />
Where is the surfer at time t+1?<br />
Follows a link uniformly at random<br />
p(t+1) = M · p(t)<br />
Suppose the random walk reaches a state where<br />
p(t+1) = M · p(t) = p(t)<br />
Then p(t) is a stationary distribution of the random walk<br />
Our rank vector r satisfies r = M · r<br />
So it is a stationary distribution for<br />
the random walk<br />
r_j^(t+1) = ∑_{i→j} r_i^(t) / d_i<br />
or equivalently r = M · r<br />
Does this converge?<br />
Does it converge to what we want?<br />
Are the results reasonable?<br />
Example (a and b link only to each other):<br />
Iteration:  0   1   2   3 …<br />
r_a:        1   0   1   0<br />
r_b:        0   1   0   1<br />
Updates follow r_j^(t+1) = ∑_{i→j} r_i^(t) / d_i<br />
Example (a links to b; b has no out-links):<br />
Iteration:  0   1   2   3 …<br />
r_a:        1   0   0   0<br />
r_b:        0   1   0   0<br />
Updates follow r_j^(t+1) = ∑_{i→j} r_i^(t) / d_i<br />
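The leak can be seen directly. A NumPy sketch of the example above (a links to b, b is a dead end):

```python
import numpy as np

# Dead-end example: a -> b, b has no out-links.
# Column b of M is all zeros, so M is not column-stochastic
# and importance "leaks out" of the system.
M = np.array([
    [0.0, 0.0],  # nothing links to a
    [1.0, 0.0],  # a links to b; b's column is empty
])

r = np.array([1.0, 0.0])
for t in range(3):
    r = M @ r
    print(t + 1, r, "total =", r.sum())
# total mass goes 1.0 -> 1.0 -> 0.0 -> 0.0
```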
2 problems:<br />
Some pages are “dead ends”<br />
(have no out-links)<br />
Such pages cause<br />
importance to “leak out”<br />
Spider traps (all out-links are<br />
within the group)<br />
Eventually spider traps absorb all importance<br />
Power Iteration:<br />
Set r^(0) = [1/N, …, 1/N]^T and iterate r^(t+1) = M · r^(t)<br />
The Google solution for spider traps: At each<br />
time step, the random surfer has two options:<br />
With probability β, follow a link at random<br />
With probability 1-β, jump to some page uniformly<br />
at random<br />
Common values for β are in the range 0.8 to 0.9<br />
Surfer will teleport out of spider trap within a<br />
few time steps<br />
Power Iteration:<br />
Set r^(0) = [1/N, …, 1/N]^T and iterate r^(t+1) = M · r^(t)<br />
Teleports: from dead ends, follow a random<br />
teleport link with probability 1.0<br />
Adjust the matrix accordingly<br />
(m is a dead end, so its column becomes all ⅓s):<br />
Before:            After:<br />
     y   a   m          y   a   m<br />
y    ½   ½   0     y    ½   ½   ⅓<br />
a    ½   0   0     a    ½   0   ⅓<br />
m    0   ½   0     m    0   ½   ⅓<br />
r^(t+1) = M · r^(t)<br />
Markov chains:<br />
Set of states X<br />
Transition matrix P where P_ij = P(X_t = i | X_t−1 = j)<br />
π specifying the probability of being at each<br />
state x ∈ X<br />
Goal is to find π such that π = P · π<br />
Theory of Markov chains<br />
Fact: For any start vector, the power method<br />
applied to a Markov transition matrix P will<br />
converge to a unique positive stationary<br />
vector as long as P is stochastic, irreducible<br />
and aperiodic.<br />
Stochastic: every column sums to 1<br />
A possible solution: add teleport (green) links from dead ends<br />
S = M + (1/n) · 1 · a^T<br />
where a_i = 1 if node i has out-degree 0, else a_i = 0,<br />
and 1 is the vector of all 1s<br />
For the y/a/m example (m is the dead end):<br />
     y   a   m<br />
y    ½   ½   ⅓<br />
a    ½   0   ⅓<br />
m    0   ½   ⅓<br />
Flow equations:<br />
r_y = r_y/2 + r_a/2 + r_m/3<br />
r_a = r_y/2 + r_m/3<br />
r_m = r_a/2 + r_m/3<br />
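The dead-end fix can be sketched in a few lines of NumPy (the helper name `make_stochastic` is ours):

```python
import numpy as np

# Make a column-substochastic M stochastic by replacing
# dead-end columns with uniform 1/n columns:
# S = M + (1/n) * 1 * a^T, where a_j = 1 iff column j sums to 0.
def make_stochastic(M):
    n = M.shape[0]
    a = (M.sum(axis=0) == 0).astype(float)  # dead-end indicator
    return M + np.outer(np.ones(n) / n, a)

M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 0.0],
              [0.0, 0.5, 0.0]])  # m (last column) is a dead end
S = make_stochastic(M)
print(S[:, 2])        # approximately [1/3, 1/3, 1/3]
print(S.sum(axis=0))  # every column now sums to 1
```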
A chain is periodic if there exists k > 1 such<br />
that the interval between two visits to some<br />
state s is always a multiple of k.<br />
A possible solution: Add green links<br />
Irreducible: from any state, there is a<br />
non-zero probability of going to<br />
any other state<br />
A possible solution: Add green links<br />
Google’s solution that does it all:<br />
makes M stochastic, aperiodic, and irreducible<br />
At each step, the random surfer has two options:<br />
With probability β, follow a link at random<br />
With probability 1-β, jump to some random page<br />
PageRank equation [Brin-Page, 98]:<br />
r_j = ∑_{i→j} β · r_i / d_i + (1-β) · 1/N<br />
Example (pages y, a, m; m is a spider trap), β = 0.8:<br />
M:              [1/N]_{N×N}:<br />
1/2 1/2 0       1/3 1/3 1/3<br />
1/2 0   0       1/3 1/3 1/3<br />
0   1/2 1       1/3 1/3 1/3<br />
A = 0.8·M + 0.2·[1/N]_{N×N}:<br />
      y     a     m<br />
y   7/15  7/15  1/15<br />
a   7/15  1/15  1/15<br />
m   1/15  7/15  13/15<br />
(e.g., the y,y entry is 0.8·½ + 0.2·⅓ = 7/15,<br />
and the m,m entry is 0.8 + 0.2·⅓ = 13/15)<br />
Power iteration:<br />
y:  1/3  0.33  0.24  0.26  …  7/33<br />
a:  1/3  0.20  0.20  0.18  …  5/33<br />
m:  1/3  0.46  0.52  0.56  …  21/33<br />
Suppose there are N pages<br />
Consider a page j with set of out-links O(j)<br />
We have M_ij = 1/|O(j)| when j→i, and M_ij = 0<br />
otherwise<br />
The random teleport is equivalent to:<br />
Adding a teleport link from j to every other page<br />
with probability (1-β)/N<br />
Reducing the probability of following each out-link<br />
from 1/|O(j)| to β/|O(j)|<br />
Equivalently: tax each page a fraction (1-β) of its<br />
score and redistribute it evenly<br />
Construct the N×N matrix A as follows:<br />
A_ij = β · M_ij + (1-β)/N<br />
Verify that A is a stochastic matrix<br />
The PageRank vector r is the principal<br />
eigenvector of this matrix A,<br />
satisfying r = A · r<br />
Equivalently, r is the stationary distribution of<br />
the random walk with teleports<br />
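Putting the pieces together, here is a dense sketch of PageRank with teleports (fine for small graphs only; it assumes M has no dead-end columns, and the function name `pagerank` is ours):

```python
import numpy as np

# PageRank via the matrix A = beta*M + (1-beta)/N
def pagerank(M, beta=0.8, eps=1e-12):
    N = M.shape[0]
    A = beta * M + (1.0 - beta) / N  # dense teleport matrix
    r = np.full(N, 1.0 / N)
    while True:
        r_next = A @ r
        if np.abs(r_next - r).sum() < eps:
            return r_next
        r = r_next

# Spider-trap example from the slides: m links only to itself
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 0.0],
              [0.0, 0.5, 1.0]])
print(pagerank(M))  # approximately [7/33, 5/33, 21/33]
```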
Key step is the matrix-vector multiplication<br />
r_new = A · r_old<br />
Easy if we have enough main memory to hold<br />
A, r_old, r_new<br />
Say N = 1 billion pages<br />
We need 4 bytes for<br />
each entry (say)<br />
2 billion entries for the two<br />
vectors, approx 8GB<br />
Matrix A has N² entries:<br />
10^18 is a large number!<br />
A = β·M + (1-β)·[1/N]_{N×N}<br />
Example (β = 0.8):<br />
        1/2 1/2 0         1/3 1/3 1/3     7/15 7/15 1/15<br />
A = 0.8·1/2 0   0  + 0.2· 1/3 1/3 1/3  =  7/15 1/15 1/15<br />
        0   1/2 1         1/3 1/3 1/3     1/15 7/15 13/15<br />
We can rearrange the PageRank equation:<br />
r = β·M·r + [(1-β)/N]_N<br />
where [(1-β)/N]_N is a vector with all N entries equal to (1-β)/N<br />
Encode the sparse matrix using only its nonzero<br />
entries<br />
Space proportional roughly to number of links<br />
Say 10N, or 4·10·1 billion = 40GB<br />
Still won’t fit in memory, but will fit on disk<br />
source node | degree | destination nodes<br />
0           | 3      | 1, 5, 7<br />
1           | 5      | 17, 64, 113, 117, 245<br />
2           | 2      | 13, 23<br />
Assume enough RAM to fit r_new into memory<br />
Store r_old and matrix M on disk<br />
Then 1 step of power iteration is:<br />
Initialize all entries of r_new to (1-β)/N<br />
For each page p (of out-degree n):<br />
Read into memory: p, n, dest_1, …, dest_n, r_old(p)<br />
For j = 1…n: r_new(dest_j) += β · r_old(p) / n<br />
src | degree | destination<br />
0   | 3      | 1, 5, 6<br />
1   | 4      | 17, 64, 113, 117<br />
2   | 2      | 13, 23<br />
(r_new and r_old each have one entry per page, 0…6)<br />
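The update step above can be sketched directly on the sparse (src, degree, destinations) encoding. A minimal in-memory version (no dead ends assumed; the names `sparse_step` and `edges` are ours):

```python
# One power-iteration step with teleport, using the sparse
# encoding from the slides: each source page carries its
# destination list, and out-degree = len(destinations).
def sparse_step(edges, r_old, beta, N):
    """edges: dict mapping src page -> list of destination pages."""
    r_new = [(1.0 - beta) / N] * N   # teleport contribution
    for src, dests in edges.items():
        share = beta * r_old[src] / len(dests)
        for d in dests:
            r_new[d] += share
    return r_new

# Tiny example: 0 -> 1,2 ; 1 -> 2 ; 2 -> 0
edges = {0: [1, 2], 1: [2], 2: [0]}
r = [1 / 3] * 3
r = sparse_step(edges, r, beta=0.8, N=3)
print(r, sum(r))  # mass is conserved: sum is 1.0
```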
Assume enough RAM to fit r_new into memory<br />
Store r_old and matrix M on disk<br />
In each iteration, we have to:<br />
Read r_old and M<br />
Write r_new back to disk<br />
IO cost = 2|r| + |M|<br />
Question:<br />
What if we could not even fit r_new in memory?<br />
Block update example (r_new split into blocks {0,1}, {2,3}, {4,5}):<br />
src | degree | destination<br />
0   | 4      | 0, 1, 3, 5<br />
1   | 2      | 0, 5<br />
2   | 2      | 3, 4<br />
This is similar to a nested-loop join in databases:<br />
Break r_new into k blocks that fit in memory<br />
Scan M and r_old once for each block<br />
k scans of M and r_old<br />
Cost per iteration: k(|M| + |r|) + |r| = k|M| + (k+1)|r|<br />
Can we do better?<br />
Hint: M is much bigger than r (approx 10-20x), so<br />
we must avoid reading it k times per iteration<br />
Block-stripe example (blocks {0,1}, {2,3}, {4,5}):<br />
Stripe for block {0,1}:<br />
src | degree | destination<br />
0   | 4      | 0, 1<br />
1   | 3      | 0<br />
2   | 2      | 1<br />
Stripe for block {2,3}:<br />
0   | 4      | 3<br />
2   | 2      | 3<br />
Stripe for block {4,5}:<br />
0   | 4      | 5<br />
1   | 3      | 5<br />
2   | 2      | 4<br />
Break M into stripes<br />
Each stripe contains only the destination nodes in the<br />
corresponding block of r_new<br />
Some additional overhead per stripe<br />
But it is usually worth it<br />
Cost per iteration:<br />
|M|(1+ε) + (k+1)|r|<br />
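The block-stripe update can be sketched as follows (an in-memory illustration only; in practice the stripes live on disk, and the names `stripe_step`, `stripes`, `blocks` are ours):

```python
# Block-stripe update: M is pre-partitioned into stripes, one per
# block of r_new, so each stripe is scanned once and only one
# block of r_new needs to be in memory at a time.
def stripe_step(stripes, blocks, out_deg, r_old, beta, N):
    """stripes[b]: list of (src, destinations-in-block-b) pairs."""
    r_new = [0.0] * N
    for b, block in enumerate(blocks):
        for j in block:                      # init only this block
            r_new[j] = (1.0 - beta) / N
        for src, dests in stripes[b]:        # scan stripe b once
            for d in dests:
                r_new[d] += beta * r_old[src] / out_deg[src]
    return r_new

# Tiny example: 0 -> 1,2 ; 1 -> 2 ; 2 -> 0, blocks {0,1} and {2}
blocks = [[0, 1], [2]]
stripes = [
    [(0, [1]), (2, [0])],  # destinations falling in block {0,1}
    [(0, [2]), (1, [2])],  # destinations falling in block {2}
]
out_deg = {0: 2, 1: 1, 2: 1}
r = stripe_step(stripes, blocks, out_deg, [1/3]*3, beta=0.8, N=3)
print(r, sum(r))  # sum is 1.0 when there are no dead ends
```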
Measures generic popularity of a page<br />
Biased against topic-specific authorities<br />
Solution: Topic-Specific PageRank (next)<br />
Uses a single measure of importance<br />
Other models exist, e.g., hubs and authorities<br />
Solution: Hubs-and-Authorities (next)<br />
Susceptible to link spam:<br />
artificial link topologies created in order to<br />
boost PageRank<br />
Solution: TrustRank (next)<br />