Theory of Locality Sensitive Hashing - SNAP - Stanford University
Goal: Given a large number (N in the millions or billions) of text documents, find pairs that are “near duplicates”
Application:
  Detect mirror and approximate mirror sites/pages:
    Don’t want to show both in a web search
Problems:
  Many small pieces of one doc can appear out of order in another
  Too many docs to compare all pairs
  Docs are so large and so many that they cannot fit in main memory
1/20/2012 Jure Leskovec, Stanford CS246: Mining Massive Datasets
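To make the "too many docs to compare all pairs" problem concrete, here is a minimal Python sketch of the naive baseline. It is an illustration under assumptions, not the method from the lecture: the slide does not define "near duplicate", so the sketch assumes Jaccard similarity over sets of k-word shingles, and the names `shingles`, `jaccard`, `near_duplicate_pairs`, `pair_count`, and the parameters `k` and `threshold` are all hypothetical.

```python
from itertools import combinations
from math import comb


def shingles(text, k=3):
    """Return the set of k-word shingles (contiguous word runs) in text.

    Assumption: word-level shingles; the slide does not fix a representation.
    """
    words = text.lower().split()
    if len(words) < k:
        # Short documents become a single shingle (or none if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_duplicate_pairs(docs, threshold=0.5, k=3):
    """Naive all-pairs search: compares every pair, so it does O(N^2) work."""
    sets = [shingles(d, k) for d in docs]
    return [
        (i, j)
        for i, j in combinations(range(len(docs)), 2)
        if jaccard(sets[i], sets[j]) >= threshold
    ]


def pair_count(n):
    """Number of unordered pairs among n documents: n * (n - 1) / 2."""
    return comb(n, 2)
```

For N = 1,000,000 documents, `pair_count` gives 499,999,500,000 comparisons, roughly 5 × 10^11, which is why the all-pairs loop above only works on toy inputs. Using sets of shingles rather than whole-text comparison also tolerates pieces of one document appearing out of order in another, since set membership ignores position.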