Theory of Locality Sensitive Hashing - SNAP - Stanford University

Goal: Given a large number (N in the millions or billions) of text documents, find pairs that are “near duplicates”

Application:

Detect mirror and approximate mirror sites/pages: Don’t want to show both in a web search

Problems:

Many small pieces of one doc can appear out of order in another

Too many docs to compare all pairs

Docs are so large and so many that they cannot fit in main memory
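
To make the "too many docs to compare all pairs" problem concrete, here is a minimal sketch of the arithmetic. The function names and the rate of one million comparisons per second are assumptions chosen for illustration; they do not come from the slides.

```python
def num_pairs(n: int) -> int:
    """Number of unordered document pairs among n documents: n * (n - 1) / 2."""
    return n * (n - 1) // 2


def days_to_compare_all(n: int, comparisons_per_second: float) -> float:
    """Days needed to compare every pair at a fixed (assumed) comparison rate."""
    seconds = num_pairs(n) / comparisons_per_second
    return seconds / 86_400  # 86,400 seconds in a day


if __name__ == "__main__":
    assumed_rate = 1e6  # hypothetical: one million pair comparisons per second
    for n in (1_000_000, 10_000_000):
        pairs = num_pairs(n)
        days = days_to_compare_all(n, assumed_rate)
        print(f"N = {n:>10,}: {pairs:,} pairs, about {days:,.1f} days")
```

With N = 1 million there are already about 5 × 10^11 pairs, and the count grows quadratically in N, which is why comparing all pairs directly does not scale to millions or billions of documents.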

1/20/2012 Jure Leskovec, Stanford CS246: Mining Massive Datasets
