Cracking the Coding Interview, 4th Edition - 150 Programming Interview Questions and Solutions


Solutions to Chapter 11 | System Design and Memory Limits

11.6 You have a billion urls, where each is a huge page. How do you detect the duplicate documents?

SOLUTION

pg 72

Observations:

1. Pages are huge, so bringing all of them into memory is a costly affair. We need a shorter representation of the pages in memory. A hash is an obvious choice for this.

2. Billions of urls exist, so we don't want to compare every page with every other page (that would be O(n^2)).

Based on the above two observations, we can derive an algorithm as follows:

1. Iterate through the pages and compute the hash of each one.

2. Check if the hash value is in the hash table. If it is, throw out the url as a duplicate. If it is not, then keep the url and insert it into the hash table.
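As a minimal single-machine sketch of these two steps (assuming, purely for illustration, that the page contents have already been loaded into a url-to-content map, and using Java's String.hashCode as the page hash):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DuplicateDetector {

    // Given url -> page content, keep only the first url seen for each distinct page.
    public static List<String> uniqueUrls(Map<String, String> pages) {
        Map<Integer, String> seen = new HashMap<>(); // page hash -> first url with that hash
        List<String> unique = new ArrayList<>();
        for (Map.Entry<String, String> e : pages.entrySet()) {
            int hash = e.getValue().hashCode();      // short (4-byte) representation of a huge page
            if (!seen.containsKey(hash)) {           // not seen before: keep the url
                seen.put(hash, e.getKey());
                unique.add(e.getKey());
            }
            // if the hash is already present, the url is thrown out as a duplicate
        }
        return unique;
    }

    public static void main(String[] args) {
        Map<String, String> pages = new HashMap<>();
        pages.put("http://a.com/1", "same content");
        pages.put("http://a.com/2", "same content");      // duplicate page
        pages.put("http://b.com/x", "different content");
        System.out.println(uniqueUrls(pages));            // only two urls survive
    }
}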

This algorithm will provide us with a list of unique urls. But wait, can this fit on one computer?

» How much space does each page take up in the hash table?

» Each page hashes to a four byte value.

» Each url is an average of 30 characters, so that's another 30 bytes at least.

» Each url takes up roughly 34 bytes.

» 34 bytes * 1 billion = 31.6 gigabytes. We're going to have trouble holding that all in memory!
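As a quick sanity check of that estimate (taking a gigabyte to be 2^30 bytes):

public class MemoryEstimate {
    public static void main(String[] args) {
        long bytesPerUrl = 4 + 30;                  // 4-byte hash plus a ~30-byte url
        long urlCount = 1_000_000_000L;             // one billion urls
        double gigabytes = bytesPerUrl * urlCount / Math.pow(2, 30);
        System.out.printf("%.2f GB%n", gigabytes);  // prints 31.66 GB
    }
}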

What do we do?<br />

» We could split this up into files. We'll have to deal with the file loading / unloading. Ugh.

» We could hash to disk. Size wouldn't be a problem, but access time might. A hash table on disk would require a random access read for each check and a write to store a viewed url. This could take milliseconds waiting for seek and rotational latencies. Elevator algorithms could eliminate random bouncing from track to track.

» Or, we could split this up across machines, and deal with network latency. Let's go with this solution, and assume we have n machines.

» First, we hash the document to get a hash value v.

» v % n tells us which machine this document's hash table can be found on.

» v / n is the key within the hash table located on that machine.
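A small sketch of that partitioning arithmetic (locate is a name invented here; how a request actually reaches the chosen machine, whether by RPC, message queue, or something else, is out of scope):

public class DocumentRouter {

    // Where a document's entry lives in the cluster.
    static class Location {
        final int machine;  // which machine holds the entry (v % n)
        final int key;      // key inside that machine's hash table (v / n)

        Location(int machine, int key) {
            this.machine = machine;
            this.key = key;
        }

        @Override
        public String toString() {
            return "machine " + machine + ", key " + key;
        }
    }

    // Maps a document's hash value v onto one of n machines.
    static Location locate(int v, int n) {
        int hash = v & 0x7fffffff;  // drop the sign bit so % and / stay non-negative
        return new Location(hash % n, hash / n);
    }

    public static void main(String[] args) {
        int n = 16;                                      // assume 16 machines
        String document = "contents of some huge page";  // stand-in for a real page
        int v = document.hashCode();                     // the document's hash value
        System.out.println(locate(v, n));
    }
}

As long as the hash is roughly uniform, v % n spreads the documents evenly across the machines, and v / n gives a compact key within the chosen machine's table.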
