31.07.2013 Views

Jure Leskovec, Stanford University - SNAP - Stanford University



Hash columns of signature matrix M: similar columns likely hash to the same bucket.

Columns x and y are a candidate pair if M(i, x) = M(i, y) for at least a fraction s of the values of i.
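The candidate-pair definition above can be checked directly on a small signature matrix. The following is a minimal sketch, assuming M is stored as a list of rows; the function names `agreement_fraction` and `is_candidate_pair` and the toy matrix are illustrative and not from the slides.

```python
def agreement_fraction(M, x, y):
    """Fraction of rows i for which M[i][x] == M[i][y]."""
    n_rows = len(M)
    matches = sum(1 for i in range(n_rows) if M[i][x] == M[i][y])
    return matches / n_rows


def is_candidate_pair(M, x, y, s):
    """True if columns x and y agree in at least a fraction s of the rows."""
    return agreement_fraction(M, x, y) >= s


# Toy signature matrix (hypothetical values): 4 rows, 3 columns.
M = [
    [1, 1, 4],
    [2, 2, 5],
    [3, 7, 6],
    [0, 0, 9],
]

print(agreement_fraction(M, 0, 1))      # columns 0 and 1 agree in 3 of 4 rows
print(is_candidate_pair(M, 0, 1, 0.7))
print(is_candidate_pair(M, 0, 2, 0.1))
```

Comparing every pair of columns this way is quadratic in the number of columns, which is what hashing into buckets is meant to avoid.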

Divide matrix M into b bands of r rows each.

[Figure: matrix M split into bands of r rows, with each band hashed to buckets; plot axis: probability of sharing a bucket.]

For columns with Sim(C_1, C_2) = s, the probability that at least one band is identical is $1 - (1 - s^r)^b$.
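The banding step can be sketched as follows. This is a simplified illustration, not the slides' implementation: it assumes the signature matrix has exactly b * r rows, and it uses the band's tuple of values as the bucket key, so two columns share a bucket only when the band is identical (a real system would hash to a fixed number of buckets and tolerate rare collisions). The function name `lsh_candidate_pairs` is hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def lsh_candidate_pairs(M, b, r):
    """Return the set of column pairs that are identical in at least one band."""
    if len(M) != b * r:
        raise ValueError("signature matrix must have exactly b * r rows")
    n_cols = len(M[0])
    candidates = set()
    for band in range(b):
        buckets = defaultdict(list)  # fresh buckets for every band
        for col in range(n_cols):
            # The band's r signature values for this column act as the bucket key.
            key = tuple(M[row][col] for row in range(band * r, (band + 1) * r))
            buckets[key].append(col)
        for cols in buckets.values():
            # Every pair of columns in the same bucket becomes a candidate pair.
            candidates.update(combinations(cols, 2))
    return candidates


# Toy signature matrix (hypothetical values): 4 rows, 3 columns.
M = [
    [1, 1, 4],
    [2, 2, 5],
    [3, 7, 6],
    [0, 0, 9],
]

print(lsh_candidate_pairs(M, b=2, r=2))  # columns 0 and 1 match in the first band
print(lsh_candidate_pairs(M, b=1, r=4))  # one band of all 4 rows: no identical columns
```

Using separate buckets per band matters: columns that agree in different bands on different values should not be merged into one bucket.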

Given s, tune r and b to get almost all pairs with similar signatures, but eliminate most pairs that do not have similar signatures.

[Plot axis: similarity threshold s]
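One way to make the tuning concrete is to enumerate the ways of splitting a fixed signature length into b bands of r rows and keep those that meet two targets. The sketch below assumes the formula $1 - (1 - s^r)^b$ for the probability of becoming a candidate pair; the function names and the parameters `s_high`, `s_low`, `min_recall`, and `max_false` are illustrative choices, not from the slides.

```python
def band_match_prob(s, r, b):
    """Probability that at least one of b bands of r rows is identical."""
    return 1 - (1 - s ** r) ** b


def choose_bands(n_rows, s_high, s_low, min_recall=0.95, max_false=0.05):
    """List (b, r) with b * r == n_rows that keep pairs at similarity s_high
    with probability >= min_recall and keep pairs at similarity s_low
    with probability <= max_false."""
    choices = []
    for r in range(1, n_rows + 1):
        if n_rows % r != 0:
            continue  # r must divide the signature length evenly
        b = n_rows // r
        keeps_similar = band_match_prob(s_high, r, b) >= min_recall
        drops_dissimilar = band_match_prob(s_low, r, b) <= max_false
        if keeps_similar and drops_dissimilar:
            choices.append((b, r))
    return choices


# Signatures of length 100: which splits catch pairs at similarity 0.8
# while mostly discarding pairs at similarity 0.3?
print(choose_bands(100, s_high=0.8, s_low=0.3))
```

Increasing r makes each band harder to match, which removes dissimilar pairs; increasing b gives similar pairs more chances to match in some band.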

b = 20, r = 5

s      $1 - (1 - s^r)^b$
0.2    0.006
0.3    0.047
0.4    0.186
0.5    0.470
0.6    0.802
0.7    0.975
0.8    0.9996
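The table values can be recomputed from the formula. A short sketch, with `band_match_prob` as an illustrative helper name:

```python
def band_match_prob(s, r, b):
    """Probability that at least one of b bands of r rows is identical."""
    return 1 - (1 - s ** r) ** b


b, r = 20, 5
for s in (0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8):
    # Print the similarity and the probability of becoming a candidate pair.
    print(f"{s:.1f}  {band_match_prob(s, r, b):.4f}")
```

The probability rises steeply between s = 0.4 and s = 0.6, which is the S-curve behavior that makes the banding scheme act like a similarity threshold.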

3/9/2011 Jure Leskovec, Stanford C246: Mining Massive Datasets 11

