
Jure Leskovec, Stanford University - SNAP - Stanford University



Shingling: convert docs to sets of items

Shingle: a sequence of k consecutive tokens that appears in the doc

Example: k = 2; $D_1$ = abcab, 2-shingles: $S(D_1)$ = {ab, bc, ca}

Represent a doc by the set of hashes of its shingles
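
The shingling step can be sketched as follows. This is a minimal illustration, assuming characters are the tokens (as in the abcab example) and using CRC32 as a stand-in hash; the function names and the choice of hash are assumptions, not taken from the slide.

```python
import zlib


def shingles(doc: str, k: int) -> set[str]:
    """Return the set of k-shingles of doc, treating each character as a token."""
    # Slide a window of width k over the doc; the set drops repeated shingles.
    return {doc[i:i + k] for i in range(len(doc) - k + 1)}


def hashed_shingles(doc: str, k: int) -> set[int]:
    """Represent a doc by the set of hashes of its shingles.

    CRC32 is an assumed choice of hash, used here only for illustration.
    """
    return {zlib.crc32(s.encode("utf-8")) for s in shingles(doc, k)}


print(sorted(shingles("abcab", 2)))  # ['ab', 'bc', 'ca']
```

The shingle ab occurs twice in abcab but appears once in the result, because a doc is represented as a set.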

MinHashing: convert large sets to short signatures, while preserving similarity

Similarity-preserving hash function $h_\pi()$ such that:

$\Pr[h_\pi(S(D_1)) = h_\pi(S(D_2))] = \mathrm{Sim}(S(D_1), S(D_2))$

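For Jaccard similarity: apply a random permutation π to the rows of the characteristic matrix (one column per set) and take the index of the first row in which the set's column has a 1.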
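
The MinHash property can be checked on a toy example. The sketch below enumerates every permutation of a small universe and counts how often two sets receive the same min-hash value, then builds short signatures from random permutations. The second doc abcd, the function names, and the seed parameter are hypothetical, introduced only for illustration; both sets are assumed to be non-empty.

```python
import random
from fractions import Fraction
from itertools import permutations


def jaccard(a: set, b: set) -> Fraction:
    """Jaccard similarity |a ∩ b| / |a ∪ b| of two non-empty sets."""
    return Fraction(len(a & b), len(a | b))


def minhash(s: set, perm: dict) -> int:
    """Index of the first element of s in the permuted order.

    perm maps each element of the universe to its position after permutation.
    """
    return min(perm[x] for x in s)


def collision_probability(a: set, b: set, universe: set) -> Fraction:
    """Exact fraction of permutations of universe under which a and b collide."""
    items = sorted(universe)
    agree = total = 0
    for order in permutations(range(len(items))):
        perm = dict(zip(items, order))
        agree += minhash(a, perm) == minhash(b, perm)
        total += 1
    return Fraction(agree, total)


def signature(s: set, universe: set, num_perms: int, seed: int = 0) -> list:
    """Short signature: one min-hash value per random permutation.

    The same seed must be used for every set so that they share permutations.
    """
    rng = random.Random(seed)
    items = sorted(universe)
    sig = []
    for _ in range(num_perms):
        order = list(range(len(items)))
        rng.shuffle(order)
        sig.append(minhash(s, dict(zip(items, order))))
    return sig


s1 = {"ab", "bc", "ca"}  # 2-shingles of abcab, as in the example above
s2 = {"ab", "bc", "cd"}  # 2-shingles of a hypothetical second doc, abcd
universe = s1 | s2

print(jaccard(s1, s2))                          # 1/2
print(collision_probability(s1, s2, universe))  # 1/2
```

Comparing `signature(s1, universe, n)` and `signature(s2, universe, n)` position by position gives an estimate of the Jaccard similarity; the fraction of agreeing positions approaches the exact value as n grows. Enumerating all permutations is feasible only for a tiny universe and is done here purely to verify the property.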

3/9/2011 Jure Leskovec, Stanford CS246: Mining Massive Datasets
