Jure Leskovec, Stanford University - SNAP - Stanford University
Document
The set of strings of length k that appear in the document
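As a minimal sketch of how such a set could be built, the function below collects every length-k substring of a document. The name `shingles`, the default `k = 5`, and the choice of character-level (rather than word-level) strings are assumptions made for illustration; the slide only says "strings of length k".

```python
def shingles(text: str, k: int = 5) -> set[str]:
    """Return the set of all length-k substrings (k-shingles) of text.

    Character-level shingles are an assumption; the slide does not say
    whether the strings are made of characters or words.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    # A text shorter than k yields an empty range, hence an empty set.
    return {text[i:i + k] for i in range(len(text) - k + 1)}
```

For example, `shingles("abcab", 2)` returns the set containing `ab`, `bc`, and `ca`; the repeated `ab` is stored only once because the result is a set.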
Signatures: short integer vectors that represent the sets and reflect their similarity
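One way to compute such signatures is sketched below, assuming that the similarity being preserved is the Jaccard similarity of the sets (the slide does not name the measure). Each signature entry is the minimum of a random linear hash over the set's elements, and the fraction of positions where two signatures agree serves as the similarity estimate. The helper names, the use of `zlib.crc32` to map strings to integers, and the modulus are all illustrative choices, not the lecture's own implementation.

```python
import random
import zlib

# Modulus for the hash family: a Mersenne prime larger than any CRC-32 value.
_PRIME = (1 << 61) - 1


def make_hash_params(num_hashes: int, seed: int = 0) -> list[tuple[int, int]]:
    """Draw (a, b) coefficients for hash functions h(x) = (a*x + b) mod _PRIME."""
    rng = random.Random(seed)  # seeded so signatures are reproducible
    return [(rng.randrange(1, _PRIME), rng.randrange(0, _PRIME))
            for _ in range(num_hashes)]


def minhash_signature(items: set[str], params: list[tuple[int, int]]) -> list[int]:
    """Return one minimum hash value per hash function."""
    if not items:
        raise ValueError("cannot minhash an empty set")
    # crc32 gives a stable integer per string (unlike built-in hash()).
    xs = [zlib.crc32(s.encode("utf-8")) for s in items]
    return [min((a * x + b) % _PRIME for x in xs) for a, b in params]


def estimate_similarity(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of signature positions on which the two signatures agree."""
    if len(sig_a) != len(sig_b) or not sig_a:
        raise ValueError("signatures must be non-empty and of equal length")
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)
```

Both signatures must be built with the same `params`, otherwise their positions are not comparable. Longer signatures give a less noisy estimate at the cost of more storage per document.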
Locality-sensitive hashing
1. Shingling: convert documents to sets
2. Minhashing: convert large sets to short signatures, while preserving similarity
3. Locality-sensitive hashing: focus on pairs of signatures likely to be similar
Candidate pairs: those pairs of signatures that we need to test for similarity.
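The slide does not say how candidate pairs are selected. A common approach is banding, sketched here as an assumption: each signature is split into `bands` bands of `rows` integers, and two documents become a candidate pair if they agree on every row of at least one band. The function name, parameters, and the use of a Python dict as the bucket table are hypothetical choices for illustration.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(signatures: dict[str, list[int]],
                    bands: int, rows: int) -> set[tuple[str, str]]:
    """Return pairs of document ids whose signatures match in at least one band."""
    buckets: dict[tuple[int, tuple[int, ...]], list[str]] = defaultdict(list)
    for doc_id, sig in signatures.items():
        if len(sig) != bands * rows:
            raise ValueError("signature length must equal bands * rows")
        for b in range(bands):
            band = tuple(sig[b * rows:(b + 1) * rows])
            # The band index is part of the key, so equal values in
            # different bands never share a bucket.
            buckets[(b, band)].append(doc_id)

    pairs: set[tuple[str, str]] = set()
    for ids in buckets.values():
        # Every two documents in the same bucket form a candidate pair.
        for x, y in combinations(sorted(ids), 2):
            pairs.add((x, y))
    return pairs
```

Only the returned pairs then need a full similarity test. With the signature length fixed, more bands of fewer rows make it easier for a pair to become a candidate, while fewer bands of more rows make it harder.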
3/9/2011 Jure Leskovec, Stanford CS246: Mining Massive Datasets