Jure Leskovec, Stanford University - SNAP - Stanford University



Document

The set of strings of length k that appear in the document
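As a minimal sketch of this step, the set of length-k strings can be built with a sliding window over the text. The function names `shingles` and `jaccard`, the whitespace normalization, and the use of Jaccard similarity as the set-similarity measure are assumptions made for illustration; the slide itself does not name them.

```python
from __future__ import annotations


def shingles(text: str, k: int = 5) -> set[str]:
    """Return the set of all length-k substrings (character shingles) of text."""
    # Assumption: collapse runs of whitespace so layout differences
    # do not produce different shingles.
    normalized = " ".join(text.split())
    # A text shorter than k yields an empty range and therefore an empty set.
    return {normalized[i:i + k] for i in range(len(normalized) - k + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity |a & b| / |a | b|; taken to be 1.0 for two empty sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# "ab" occurs twice in "abcab" but appears only once in the set.
set_a = shingles("abcab", k=2)   # {"ab", "bc", "ca"}
set_b = shingles("abcd", k=2)    # {"ab", "bc", "cd"}
similarity = jaccard(set_a, set_b)  # 2 shared shingles out of 4 distinct ones
```

The choice of k trades off sensitivity against noise: a small k makes most documents look alike, while a large k makes near-duplicates look different.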

Signatures: short integer vectors that represent the sets, and reflect their similarity
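One way to obtain such signatures is sketched below, assuming random linear hash functions of the form (a·x + b) mod p as a stand-in for random permutations. The helper names, the CRC32 mapping from shingles to integers, and the prime `PRIME` are illustrative assumptions, not details taken from the slide.

```python
from __future__ import annotations

import random
import zlib

PRIME = (1 << 61) - 1  # a Mersenne prime, larger than any 32-bit shingle ID


def shingle_id(shingle: str) -> int:
    """Map a shingle to a stable 32-bit integer (CRC32 is an illustrative choice)."""
    return zlib.crc32(shingle.encode("utf-8"))


def make_hash_params(n: int, seed: int = 0) -> list[tuple[int, int]]:
    """Draw n (a, b) pairs, one per hash function h(x) = (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(PRIME)) for _ in range(n)]


def minhash_signature(shingle_set: set[str], params: list[tuple[int, int]]) -> list[int]:
    """Signature entry i is the minimum of hash function i over the set."""
    if not shingle_set:
        raise ValueError("minhash is undefined for an empty set")
    ids = [shingle_id(s) for s in shingle_set]
    return [min((a * x + b) % PRIME for x in ids) for a, b in params]


def estimated_similarity(sig1: list[int], sig2: list[int]) -> float:
    """Fraction of signature positions on which the two signatures agree."""
    if not sig1 or len(sig1) != len(sig2):
        raise ValueError("signatures must be non-empty and of equal length")
    return sum(1 for x, y in zip(sig1, sig2) if x == y) / len(sig1)


params = make_hash_params(100)
sig_a = minhash_signature({"ab", "bc", "ca"}, params)
sig_b = minhash_signature({"ab", "bc", "cd"}, params)
estimate = estimated_similarity(sig_a, sig_b)  # approximates the sets' Jaccard similarity
```

Under these assumptions, a longer signature gives a less noisy estimate at the cost of more storage per document.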

Locality-sensitive hashing

1. Shingling: convert docs to sets

2. Minhashing: convert large sets to short signatures, while preserving similarity.

3. Locality-sensitive hashing: focus on pairs of signatures likely to be similar

Candidate pairs: those pairs of signatures that we need to test for similarity.
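A common way to generate candidate pairs is the banding technique, sketched here as an assumption since this slide does not say how the pairs are chosen. Each signature is split into `bands` bands of `rows` values, and two documents become a candidate pair if they agree on every value in at least one band. The function name `candidate_pairs` and the toy signatures are hypothetical.

```python
from __future__ import annotations

from collections import defaultdict
from itertools import combinations


def candidate_pairs(
    signatures: dict[str, list[int]], bands: int, rows: int
) -> set[tuple[str, str]]:
    """Return pairs of document IDs whose signatures match in at least one band."""
    for doc_id, sig in signatures.items():
        if len(sig) != bands * rows:
            raise ValueError(f"signature of {doc_id!r} must have bands * rows entries")

    candidates: set[tuple[str, str]] = set()
    for band in range(bands):
        # Documents with the same values in this band land in the same bucket.
        buckets: dict[tuple[int, ...], list[str]] = defaultdict(list)
        for doc_id, sig in signatures.items():
            key = tuple(sig[band * rows:(band + 1) * rows])
            buckets[key].append(doc_id)
        # Every pair sharing a bucket becomes a candidate; IDs are sorted
        # so each pair is stored once as (smaller_id, larger_id).
        for doc_ids in buckets.values():
            candidates.update(combinations(sorted(doc_ids), 2))
    return candidates


# Toy signatures: d1 and d2 agree on the first band only; d3 matches nobody.
toy_signatures = {
    "d1": [1, 2, 3, 4],
    "d2": [1, 2, 9, 9],
    "d3": [7, 8, 5, 6],
}
pairs = candidate_pairs(toy_signatures, bands=2, rows=2)  # {("d1", "d2")}
```

Only the returned pairs are then compared in full, which avoids testing every pair of signatures. More bands with fewer rows each make the filter more permissive; fewer bands with more rows make it stricter.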

3/9/2011 Jure Leskovec, Stanford C246: Mining Massive Datasets 8
