08.10.2016 Views

Foundations of Data Science


Figure 4.8: A graph (left) and the breadth first search of the graph (right). At vertex 1 the algorithm queried all edges. The solid edges are real edges, the dashed edges are edges that were queried but do not exist. At vertex 2 the algorithm queried all possible edges to vertices not yet discovered. The algorithm does not query whether the edge (2,3) exists since vertex 3 has already been discovered when the algorithm is at vertex 2. Potential edges not queried are illustrated with dotted edges.
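The search described in the caption can be sketched in code. The following is a minimal illustration, not the book's own implementation: it assumes a random graph in which each potential edge exists independently with probability d/n, and the function name `explore_component` and its parameters are hypothetical. An edge is sampled only at the moment it is queried, and edges to already discovered vertices are never queried.

```python
import random
from collections import deque


def explore_component(n, d, start=0, seed=None):
    """Breadth first search from `start`, sampling edges only when queried.

    Assumption: each potential edge exists independently with
    probability d/n. Returns the number of discovered vertices and
    the frontier size recorded after each step.
    """
    rng = random.Random(seed)
    p = d / n
    undiscovered = [v for v in range(n) if v != start]
    frontier = deque([start])
    frontier_sizes = []

    while frontier:
        # The vertex being searched leaves the frontier.
        frontier.popleft()
        still_undiscovered = []
        # Query only edges to vertices that are not yet discovered.
        for v in undiscovered:
            if rng.random() < p:
                frontier.append(v)            # edge exists: v is discovered
            else:
                still_undiscovered.append(v)  # queried, but no edge
        undiscovered = still_undiscovered
        frontier_sizes.append(len(frontier))

    discovered = n - len(undiscovered)
    return discovered, frontier_sizes


# Example run with d > 1; the result depends on the seed.
size, sizes = explore_component(n=2000, d=2.0, seed=1)
print("component size:", size, "largest frontier:", max(sizes))
```

Each discovered vertex is searched exactly once, so the number of steps taken before the frontier empties equals the size of the connected component containing the start vertex.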

Thus the probability that a vertex is discovered after $i$ steps is approximately $i\frac{d}{n}$. The expected number of discovered vertices grows as $id$ and the expected size of the frontier grows as $(d-1)i$. As the fraction of discovered vertices increases, the expected rate of growth of newly discovered vertices decreases, since many of the vertices adjacent to the vertex currently being searched have already been discovered. For $d > 1$, once $\frac{d-1}{d}n$ vertices have been discovered, the growth of newly discovered vertices slows to one at each step. Eventually, for $d > 1$, the growth of discovering new vertices drops below one per step and the frontier starts to shrink. For $d < 1$, the expected size of the frontier, $(d-1)i$, is negative: the expected rate of growth is less than one even at the start.

For $d > 1$, the expected size of the frontier grows as $(d-1)i$ for small $i$. The actual size
of the frontier is a random variable. What is the probability that the actual size of the frontier will differ from the expected size of the frontier by a sufficient amount so that the actual size of the frontier is zero? To answer this, we need to understand the distribution of the number of discovered vertices after $i$ steps. For small $i$, the probability that a vertex has been discovered is $1 - (1 - d/n)^i \approx id/n$, and the binomial distribution for the number of discovered vertices, $\text{binomial}(n, \frac{id}{n})$, is well approximated by the Poisson distribution with the same mean $id$. The probability that a total of $k$ vertices have been discovered in $i$ steps is approximately $e^{-di}\frac{(di)^k}{k!}$. For a connected component to have exactly $i$ vertices, the frontier must drop to zero for the first time at step $i$. A necessary condition is that exactly $i$ vertices must have been discovered in the first $i$ steps. Using Stirling's approximation $i! \approx i^i / e^i$, the probability of this is approximately

$$e^{-di}\frac{(di)^i}{i!} = e^{-di}\frac{d^i i^i}{i^i}e^i = e^{-(d-1)i}d^i = e^{-(d-1-\ln d)i}.$$
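As a rough numerical check of the chain of equalities above, the sketch below compares the Poisson probability $e^{-di}(di)^i/i!$ with the closed form $e^{-(d-1-\ln d)i}$. The helper names `poisson_pmf` and `closed_form` are illustrative, not from the text. Because the derivation uses the crude form of Stirling's approximation, which drops the factor $\sqrt{2\pi i}$, the two values should differ by roughly that factor while sharing the same exponential rate in $i$.

```python
import math


def poisson_pmf(k, mean):
    """Poisson probability of k events, computed in log space for stability."""
    return math.exp(-mean + k * math.log(mean) - math.lgamma(k + 1))


def closed_form(i, d):
    """The closed form exp(-(d - 1 - ln d) * i) from the derivation."""
    return math.exp(-(d - 1 - math.log(d)) * i)


d = 2.0
for i in (5, 20, 80):
    exact = poisson_pmf(i, d * i)   # e^{-di} (di)^i / i!
    approx = closed_form(i, d)      # e^{-(d-1-ln d) i}
    # The ratio should be close to sqrt(2*pi*i), the dropped Stirling factor.
    print(i, exact, approx, approx / exact, math.sqrt(2 * math.pi * i))
```

The ratio of the two values does not depend on $d$, since it equals $i!\,e^i / i^i$; only the polynomial factor is lost, so the exponential dependence on $i$ is unaffected.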
