11.07.2015 Views

A Deterministic Algorithm for Summarizing Asynchronous Streams ...


Contributions. We present the first deterministic algorithms for summarizing asynchronous data streams over a sliding window. We first consider a fundamental aggregate called the basic count, which is simply the number of elements within the sliding window. We present a data structure that can summarize an asynchronous stream in small space and provide a provably accurate estimate of the basic count. More precisely, let W denote an upper bound on the window size and B denote an upper bound on the basic count. For any ε ∈ (0, 1), we present a summary of the stream that uses space O(log W · log B · (log W + log B)/ε) bits. For any window size w ≤ W presented at the time of query, the summary can return an ε-approximation to the number of elements whose timestamps are in the range [c − w, c] and that arrive at the aggregator no later than c, where c denotes the current time. The time taken to process a new stream element is O(log W · log B), and the time taken to answer a query for the basic count is O(log B + (log W)/ε).

We next consider a generalization of basic counting, the sum problem. In a stream whose observations {v_i} are positive integer values, the sum problem is to maintain the sum of all observations within the sliding window, ∑_{(v,t)∈R | t∈[c−w,c]} v. Our summary for the sum provides guarantees similar to those for basic counting. For any ε ∈ (0, 1), the summary for the sum uses space O(log W · log B · (log W + log B)/ε) bits, where W is an upper bound on the window size and B is an upper bound on the value of the sum. For any window size w ≤ W, the summary can return an ε-approximation to the sum of all element values within the sliding window [c − w, c].
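
To make the two aggregates concrete, the following is a minimal brute-force reference for the quantities being approximated. It is not the summary described in this paper: it stores every observation and so uses linear space. The tuple layout (value, timestamp, arrival), the function names, and the relative-error reading of "ε-approximation" are assumptions made for illustration, since this excerpt does not spell them out.

```python
from typing import Iterable, Tuple

# Hypothetical record layout: (value, timestamp, arrival), where `arrival`
# is the time at which the observation reached the aggregator.
Observation = Tuple[int, int, int]


def in_window(t: int, arrival: int, c: int, w: int) -> bool:
    """True if timestamp t lies in [c - w, c] and the element arrived by time c."""
    return c - w <= t <= c and arrival <= c


def basic_count(stream: Iterable[Observation], c: int, w: int) -> int:
    """Exact number of elements in the sliding window [c - w, c]."""
    return sum(1 for (_, t, a) in stream if in_window(t, a, c, w))


def window_sum(stream: Iterable[Observation], c: int, w: int) -> int:
    """Exact sum of the values of the elements in the sliding window [c - w, c]."""
    return sum(v for (v, t, a) in stream if in_window(t, a, c, w))


def is_eps_approximation(estimate: float, exact: float, eps: float) -> bool:
    """Assumed definition: relative error of at most eps."""
    return abs(estimate - exact) <= eps * exact


# Elements may arrive out of timestamp order (asynchronous stream).
stream = [(5, 10, 12), (3, 7, 8), (2, 11, 11), (4, 2, 13), (6, 12, 15)]
```

With c = 13 and w = 5, only the elements with timestamps 10 and 11 qualify: the element with timestamp 12 arrives at time 15, after the query, and is therefore excluded. A summary would be compared against these exact values using a check such as `is_eps_approximation`.
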
The time taken to process a new stream element is O(log W · log B), and the time taken to answer a query for the sum is O(log B + (log W)/ε).

It is easy to verify that even on a synchronous data stream, a stream summary that can return the exact value of the basic count within the sliding window must use Ω(W) space in the worst case. The reason is that using such a summary one can reconstruct the number of elements arriving at each instant within the sliding window. Hence, to achieve space efficiency it is necessary to introduce approximations. Datar et al. [5] show lower bounds for the space complexity of approximate basic counting on a synchronous stream. They show that if a summary has to return an ε-approximation for the basic count on elements with distinct timestamps, then it must use space at least Ω(log² W/ε). Since a synchronous stream is a special case of an asynchronous stream, the lower bound of Ω(log² W/ε) applies to approximate basic counting over asynchronous streams too. To compare our results for basic counting with this lower bound, consider the case when the timestamps of the elements are unique. In such a case, log B = O(log W), since the value of the basic count cannot exceed W, and thus the space required by our summary is O(log³ W/ε).

Techniques. Our algorithm for basic counting is based on a novel data structure that we call a splittable histogram. The data structure consists of a small number of histograms that summarize the elements within the sliding window at various granularities. Within each histogram, the elements in the sliding window are grouped into buckets, each responsible for a certain range of timestamps.
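
The Ω(W) argument for exact counting given earlier can be illustrated with a short sketch. Assuming a hypothetical oracle that returns the exact basic count for any window size w ≤ W, differencing consecutive answers recovers the number of elements at every instant in the window, so the oracle must encode all of those values. The names `exact_basic_count` and `reconstruct_per_instant` are illustrative, and arrival times are ignored because the argument concerns a synchronous stream.

```python
from typing import Callable, Dict, List


def exact_basic_count(timestamps: List[int], c: int, w: int) -> int:
    """Stand-in for an exact summary: count timestamps in [c - w, c]."""
    return sum(1 for t in timestamps if c - w <= t <= c)


def reconstruct_per_instant(
    oracle: Callable[[int, int], int], c: int, W: int
) -> Dict[int, int]:
    """Recover the number of elements at each instant c, c - 1, ..., c - W.

    The count at instant c - w is the difference between the exact
    answers for window sizes w and w - 1.
    """
    counts: Dict[int, int] = {}
    previous = 0
    for w in range(W + 1):
        current = oracle(c, w)
        counts[c - w] = current - previous
        previous = current
    return counts


timestamps = [10, 10, 9, 7, 7, 7, 4]
per_instant = reconstruct_per_instant(
    lambda c, w: exact_basic_count(timestamps, c, w), c=10, W=5
)
```

Here `per_instant` maps each of the instants 5 through 10 to its exact element count, using only answers to window queries. An approximate summary gives up this ability, which is why it can use far less space.
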
