08.10.2016 Views

Foundations of Data Science


estimate of m. To obtain a coin that comes down heads with probability $1/2^k$, flip a fair coin, one that comes down heads with probability $1/2$, $k$ times and report heads if the fair coin comes down heads in all $k$ flips.
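The coin construction just described can be sketched in a few lines of Python. This is only an illustration: the function name `biased_coin` and the optional `rng` parameter are hypothetical choices, not part of the text.

```python
import random

def biased_coin(k, rng=random):
    """Return True (heads) with probability 1/2**k, using fair flips only.

    `biased_coin` and `rng` are illustrative names, not from the text.
    """
    # Each rng.random() < 0.5 is one fair flip; heads is reported only if
    # all k fair flips are heads. all() stops at the first tails, which
    # does not change the probability of the final answer.
    return all(rng.random() < 0.5 for _ in range(k))
```

With k = 0 no flips are made and the coin always reports heads, which matches probability $1/2^0 = 1$.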

Given $k$, on average it will take $2^k$ ones before $k$ is incremented. Thus, the expected number of 1's to produce the current value of $k$ is $1 + 2 + 4 + \cdots + 2^{k-1} = 2^k - 1$.
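The start of this algorithm's description falls outside the excerpt, so the following sketch rests on assumptions inferred from the two paragraphs above: the counter $k$ starts at zero, each 1 in the stream increments $k$ with probability $1/2^k$, and $2^k - 1$ is reported as the estimate. The name `approximate_count` is hypothetical.

```python
import random

def approximate_count(bits, rng=random):
    """Approximately count the ones in `bits` while storing only k.

    Assumed rule (inferred, not quoted from the text): start with k = 0
    and, on each 1, increment k with probability 1/2**k.
    Returns (k, 2**k - 1).
    """
    k = 0
    for b in bits:
        # k fair flips that all come up heads happen with probability 1/2**k.
        if b == 1 and all(rng.random() < 0.5 for _ in range(k)):
            k += 1
    return k, 2**k - 1
```

The first 1 always moves $k$ from 0 to 1, because a coin with probability $1/2^0$ always reports heads. Only $k$ is stored, so the memory used grows with the logarithm of $k$ rather than with the count itself.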

7.2.3 Counting Frequent Elements

The Majority and Frequent Algorithms

First consider the very simple problem of $n$ people voting. There are $m$ candidates, $\{1, 2, \ldots, m\}$. We want to determine if one candidate gets a majority vote and if so who. Formally, we are given a stream of integers $a_1, a_2, \ldots, a_n$, each $a_i$ belonging to $\{1, 2, \ldots, m\}$, and want to determine whether there is some $s \in \{1, 2, \ldots, m\}$ which occurs more than $n/2$ times and if so which $s$. It is easy to see that solving the problem exactly on read-once streaming data with a deterministic algorithm requires $\Omega(\min(n, m))$ space. Suppose $n$ is even and the last $n/2$ items are identical. Suppose also that after reading the first $n/2$ items, there are two different sets of elements that result in the same content of our memory. In that case, a mistake would occur if the second half of the stream consists solely of an element that is in one set, but not in the other. If $n/2 \ge m$ then there are at least $2^m - 1$ possible subsets of the first $n/2$ elements. If $n/2 \le m$ then there are $\sum_{i=1}^{n/2} \binom{m}{i}$ subsets. By the above argument, the number of bits of memory must be at least the base 2 logarithm of the number of subsets, which is $\Omega(\min(m, n))$.
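The final step can be spelled out as follows. This is one possible way to fill in the arithmetic, not necessarily the authors' own derivation. It uses the fact that $\binom{m}{i} \ge \binom{n/2}{i}$ whenever $m \ge n/2$, and that $2^t - 1 \ge 2^{t-1}$ for every integer $t \ge 1$.

```latex
% Case n/2 <= m: compare with the subsets of an (n/2)-element set.
\[
  \sum_{i=1}^{n/2} \binom{m}{i}
  \;\ge\; \sum_{i=1}^{n/2} \binom{n/2}{i}
  \;=\; 2^{n/2} - 1,
  \qquad
  \log_2\!\left(2^{n/2} - 1\right) \;\ge\; \frac{n}{2} - 1 .
\]
% Case n/2 >= m: at least 2^m - 1 subsets.
\[
  \log_2\!\left(2^{m} - 1\right) \;\ge\; m - 1 .
\]
% In either case the memory is at least min(m, n/2) - 1 bits.
\[
  \text{bits of memory} \;\ge\; \min\!\left(m, \tfrac{n}{2}\right) - 1
  \;=\; \Omega(\min(m, n)) .
\]
```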

Surprisingly, we can bypass the above lower bound by slightly weakening our goal. Again let’s require that if some element appears more than $n/2$ times, then we must output it. But now, let us say that if no element appears more than $n/2$ times, then our algorithm may output whatever it wants, rather than requiring that it output “no”. That is, there may be “false positives”, but no “false negatives”.

Majority Algorithm

    Store $a_1$ and initialize a counter to one. For each subsequent $a_i$, if $a_i$ is the same as the currently stored item, increment the counter by one. If it differs, decrement the counter by one provided the counter is nonzero. If the counter is zero, then store $a_i$ and set the counter to one.
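The steps above translate almost line for line into code. The following is a minimal Python sketch. The function name `majority_candidate` and the choice to return `None` for an empty stream are assumptions made for illustration.

```python
def majority_candidate(stream):
    """One-pass majority vote using a single stored item and a counter.

    Returns the stored item at the end of the stream, or None if the
    stream is empty (a convention chosen for this sketch).
    """
    it = iter(stream)
    try:
        stored = next(it)      # store a_1 ...
    except StopIteration:
        return None
    counter = 1                # ... and initialize the counter to one
    for a in it:
        if counter == 0:
            # Counter is zero: store the new item, set the counter to one.
            stored, counter = a, 1
        elif a == stored:
            counter += 1       # same as the stored item: increment
        else:
            counter -= 1       # differs and counter is nonzero: decrement
    return stored
```

For example, `majority_candidate([1, 2, 1, 3, 1, 1, 2])` returns 1, which occurs four times out of seven. On `[1, 2, 3]`, which has no majority element, it returns 3: the kind of false positive the weakened goal allows.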

To analyze the algorithm, it is convenient to view the decrement counter step as “eliminating” two items, the new one and the one that caused the last increment in the counter. It is easy to see that if there is a majority element $s$, it must be stored at the end. If not, each occurrence of $s$ was eliminated; but each such elimination also causes another item to be eliminated. Thus for a majority item not to be stored at the end, more than $n$ items must have been eliminated, a contradiction.
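The elimination argument can be checked mechanically by instrumenting the algorithm. The sketch below is an illustrative variant with a hypothetical name, `majority_with_eliminations`. It starts with an empty store and a counter of zero, which behaves the same as storing the first item with a counter of one. Every decrement is recorded as two eliminated items.

```python
def majority_with_eliminations(stream):
    """Run the majority algorithm and count the eliminated items.

    Returns (stored, counter, eliminated). After n items,
    eliminated + counter == n, so eliminated can never exceed n.
    """
    stored, counter, eliminated = None, 0, 0
    for a in stream:
        if counter == 0:
            stored, counter = a, 1
        elif a == stored:
            counter += 1
        else:
            counter -= 1
            eliminated += 2    # the new item and one stored copy
    return stored, counter, eliminated
```

Each item is either eliminated in a pair or still accounted for in the counter, so the number of eliminated items is at most $n$. This is why the more than $n/2$ occurrences of a majority element cannot all be eliminated.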
