25.03.2013 Views

Cracking the Coding Interview - Fooo

Solutions to Chapter 11 | System Design and Memory Limits

11.5 If you were designing a web crawler, how would you avoid getting into infinite loops?

SOLUTION

CareerCup.com

pg 72

First, how does the crawler get into a loop? The answer is very simple: when we re-parse an already parsed page. This would mean that we revisit all the links found in that page, and this would continue in a circular fashion.

Be careful about what the interviewer considers the “same” page. Is it the URL or the content? One could easily get redirected to a previously crawled page.
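The two notions of “same” can be made concrete with a short sketch. The helper names `normalize_url` and `content_fingerprint` are hypothetical, and the normalization rules (lowercase host, drop default port and fragment, ignore any user info) are assumptions for illustration; a real crawler would likely need more rules and a fuzzier content comparison.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Return a canonical spelling of url (hypothetical helper)."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Keep the port only when it is not the default for the scheme.
    default_ports = {"http": 80, "https": 443}
    if parts.port is not None and parts.port != default_ports.get(scheme):
        host = f"{host}:{parts.port}"
    # Treat an empty path as the root path.
    path = parts.path or "/"
    # Drop the fragment: it is never sent to the server.
    return urlunsplit((scheme, host, path, parts.query, ""))


def content_fingerprint(body: bytes) -> str:
    """Hash the page body so identical content under different URLs matches."""
    return hashlib.sha256(body).hexdigest()
```

Comparing normalized URLs catches different spellings of one address, while comparing fingerprints catches a redirect or mirror that serves a page already crawled under another URL.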

So how do we stop visiting an already visited page? The web is a graph-based structure, and we commonly use DFS (depth-first search) and BFS (breadth-first search) for traversing graphs. We can mark already visited pages the same way that we would in a BFS/DFS.

We can easily prove that this algorithm will terminate, provided the number of reachable pages is finite. We know that each step of the algorithm will parse only new pages, not already visited pages. So, if we assume that we have N unvisited pages, then every step reduces the number of unvisited pages by 1. That proves that our algorithm will stop after at most N steps.
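A minimal sketch of the visited-set idea, using BFS over a tiny in-memory graph that stands in for the web. The function `crawl`, the `get_links` callback, and the `web` dictionary are all hypothetical names introduced for illustration; a real crawler would fetch and parse pages instead.

```python
from collections import deque
from typing import Callable, Iterable, List


def crawl(seed: str, get_links: Callable[[str], Iterable[str]]) -> List[str]:
    """Visit every page reachable from seed exactly once, in BFS order."""
    visited = {seed}          # pages already discovered
    queue = deque([seed])     # pages discovered but not yet parsed
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)
        for link in get_links(url):
            # Mark on discovery so a page is never enqueued twice.
            if link not in visited:
                visited.add(link)
                queue.append(link)
    return order


# A cyclic graph: a -> b -> a, and c links back to both.
web = {"a": ["b", "c"], "b": ["a"], "c": ["b", "a"]}
pages = crawl("a", lambda url: web.get(url, []))
print(pages)  # ['a', 'b', 'c']
```

Despite the cycles, each page is parsed once, so the loop body runs exactly three times for this three-page graph.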

SUGGESTIONS AND OBSERVATIONS

» This question has a lot of ambiguity. Ask clarifying questions!

» Be prepared to answer questions about coverage.

» What kind of pages will you hit with a DFS versus a BFS?

» What will you do when your crawler runs into a honey pot that generates an infinite subgraph for you to wander about?
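One possible answer to the honey pot question is to bound the crawl, since a visited set alone cannot help when every generated page is new. The sketch below assumes two limits, a maximum link depth and a per-host page budget; the name `bounded_crawl`, both parameters, and their default values are hypothetical choices for illustration, not a prescribed solution.

```python
from collections import Counter, deque
from typing import Callable, Iterable, List
from urllib.parse import urlsplit


def bounded_crawl(
    seed: str,
    get_links: Callable[[str], Iterable[str]],
    max_depth: int = 3,
    max_pages_per_host: int = 100,
) -> List[str]:
    """BFS crawl that stops descending past max_depth and caps pages per host."""
    visited = {seed}
    queue = deque([(seed, 0)])
    per_host = Counter()      # pages parsed so far, keyed by hostname
    order = []
    while queue:
        url, depth = queue.popleft()
        host = urlsplit(url).hostname or ""
        if per_host[host] >= max_pages_per_host:
            continue          # budget for this host is spent
        per_host[host] += 1
        order.append(url)
        if depth == max_depth:
            continue          # parse the page but do not follow its links
        for link in get_links(url):
            if link not in visited:
                visited.add(link)
                queue.append((link, depth + 1))
    return order


# A trap: every page links to a brand-new page on the same host.
trap = lambda url: [url + "/x"]
print(len(bounded_crawl("http://trap.example/p", trap)))  # 4
```

With the default depth limit of 3, the crawl parses the seed plus three generated pages and then stops, even though the trap never repeats a URL.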
