Jure Leskovec, Stanford University - SNAP - Stanford University
Shingling: convert docs to sets of items<br />
Shingle: sequence of k consecutive tokens that appears in a doc<br />
Example: k=2; D_1 = abcab, 2-shingles: S(D_1) = {ab, bc, ca}<br />
Represent a doc by the set of hashes of its shingles<br />
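A minimal sketch of the shingling step, assuming tokens are single characters (as in the abcab example) and using CRC32 as a stand-in hash. The function names `shingles` and `hashed_shingles` are illustrative and do not come from the slides.

```python
import zlib


def shingles(doc, k):
    """Return the set of all length-k character substrings (k-shingles) of doc."""
    return {doc[i:i + k] for i in range(len(doc) - k + 1)}


def hashed_shingles(doc, k):
    """Represent doc by the set of 32-bit hashes of its k-shingles.

    CRC32 is an assumption made for this sketch; any hash function
    with few collisions would serve the same purpose.
    """
    return {zlib.crc32(s.encode("utf-8")) for s in shingles(doc, k)}


# The slide's example: the 2-shingles of "abcab" are ab, bc and ca
# ("ab" occurs twice but is stored once, since the result is a set).
example = shingles("abcab", 2)
print(sorted(example))
```

A set is used because a repeated shingle carries no extra information for Jaccard similarity.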
MinHashing: convert large sets to short<br />
signatures, while preserving similarity<br />
Similarity-preserving hash function h() such that:<br />
Pr[h_π(S(D_1)) = h_π(S(D_2))] = Sim(S(D_1), S(D_2))<br />
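For Jaccard similarity: pick a random permutation π of the shingle universe; h_π(S) is the first element of S in permuted order (the index of the first 1 in the set's characteristic vector).<br />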
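The property above can be checked empirically. The following is a sketch under stated assumptions: it builds explicit random permutations of a small shingle universe and compares the fraction of agreeing min-hash values with the exact Jaccard similarity. The helper names, the seed and the count of 1000 permutations are illustrative choices, not part of the slides.

```python
import random


def jaccard(a, b):
    """Exact Jaccard similarity |a ∩ b| / |a ∪ b| of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def random_permutations(universe, n, seed=0):
    """Build n random permutations of the universe.

    Each permutation is a dict mapping an element to its position
    in the permuted order.
    """
    rng = random.Random(seed)
    items = sorted(universe)
    perms = []
    for _ in range(n):
        order = items[:]
        rng.shuffle(order)
        perms.append({x: pos for pos, x in enumerate(order)})
    return perms


def minhash_signature(s, perms):
    """For each permutation, record the smallest permuted position of any
    element of s. The set s must be non-empty."""
    return [min(perm[x] for x in s) for perm in perms]


def estimate_similarity(sig_a, sig_b):
    """Fraction of signature positions on which the two sets agree."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


s1 = {"ab", "bc", "ca"}
s2 = {"ab", "bc", "cd"}
perms = random_permutations(s1 | s2, 1000)
sig1 = minhash_signature(s1, perms)
sig2 = minhash_signature(s2, perms)
print(jaccard(s1, s2), estimate_similarity(sig1, sig2))
```

Explicit permutations are only practical for tiny universes like this one. At scale they are commonly approximated by random hash functions, a substitution this sketch does not make.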
3/9/2011 Jure Leskovec, Stanford CS246: Mining Massive Datasets