A Deterministic Algorithm for Summarizing Asynchronous Streams over a Sliding Window

Contributions. We present the first deterministic algorithms for summarizing asynchronous data streams over a sliding window. We first consider a fundamental aggregate called the basic count, which is simply the number of elements within the sliding window. We present a data structure that can summarize an asynchronous stream in a small space and is able to provide a provably accurate estimate of the basic count. More precisely, let W denote an upper bound on the window size and B denote an upper bound on the basic count. For any ɛ ∈ (0, 1), we present a summary of the stream that uses space O(log W · log B · (log W + log B)/ɛ) bits. For any window size w ≤ W presented at the time of query, the summary can return an ɛ-approximation to the number of elements whose timestamps are in the range [c − w, c] and which arrive at the aggregator no later than c, where c denotes the current time. The time taken to process a new stream element is O(log W · log B) and the time taken to answer a query for basic count is O(log B + (log W)/ɛ).

We next consider a generalization of basic counting, the sum problem. In a stream whose observations {v_i} are positive integer values, the sum problem is to maintain the sum of all observations within the sliding window, Σ_{(v,t) ∈ R | t ∈ [c−w,c]} v. Our summary for the sum provides similar guarantees as for basic counting. For any ɛ ∈ (0, 1) the summary for the sum uses space O(log W · log B · (log W + log B)/ɛ) bits, where W is an upper bound on the window size, and B is an upper bound on the value of the sum. For any window size w ≤ W, the summary can return an ɛ-approximation to the sum of all element values within the sliding window [c − w, c]. The time taken to process a new stream element is O(log W · log B) and the time taken to answer a query for the sum is O(log B + (log W)/ɛ).

It is easy to verify that even on a synchronous data stream, a stream summary that can return the exact value of the basic count within the sliding window must use Ω(W) space in the worst case. The reason is that such a summary could be used to reconstruct the number of elements arriving at each instant within the sliding window. Hence, to achieve space efficiency it is necessary to introduce approximations. Datar et al. [5] show lower bounds for the space complexity of approximate basic counting on a synchronous stream. They show that if a summary has to return an ɛ-approximation of the basic count over elements with distinct timestamps, then it must use at least Ω((log^2 W)/ɛ) space. Since a synchronous stream is a special case of an asynchronous stream, this lower bound of Ω((log^2 W)/ɛ) also applies to approximate basic counting over asynchronous streams. To compare our results for basic counting with this lower bound, consider the case when the timestamps of the elements are unique. In such a case, log B = O(log W), since the value of the basic count cannot exceed W, and thus the space required by our summary is O((log^3 W)/ɛ).

Techniques. Our algorithm for basic counting is based on a novel data structure that we call a splittable histogram. The data structure consists of a small number of histograms that summarize the elements within the sliding window at various granularities. Within each histogram, the elements in the sliding window are grouped into buckets, each of which is responsible for a certain range of timestamps.


The summary maintains levels S_0, . . . , S_M of buckets, where M = ⌈log B⌉. A bucket b = 〈w(b), l(b), r(b)〉 records a weight w(b) together with the range of timestamps [l(b), r(b)] that it is responsible for, and T_i records the right endpoint of the most recently discarded bucket of S_i. When the weight of a bucket at level i reaches 2^{i+1}, the bucket is split into two buckets of weight 2^i each (Algorithm 2). The range of timestamps that a bucket is responsible for thus decreases by a factor of 2 during every split, and the initial bucket is responsible for a timestamp range of length W. A bucket that is responsible for a single timestamp is treated as a special case, and is not split further, even if its weight increases beyond 2^{i+1}.

Due to the splits, the number of buckets within S_i may increase beyond α, in which case we only maintain the α buckets that are responsible for the most recent timestamps. Given a query for the basic count in the window [c − w, c], the different S_i are examined in increasing order of i. For smaller values of i, S_i may have already discarded some buckets that are responsible for timestamps in [c − w, c]. But there will always be a level l ≤ M that has all buckets intersecting the range [c − w, c] (this is formally proved in Lemma 2). The algorithm selects the earliest such level to answer the basic counting query, and we show that the resulting relative error is within ɛ.

The algorithm for basic counting is given below. Algorithm 1 describes the initialization of the data structure, Algorithm 2 describes the steps taken to process a new element with a timestamp t, and Algorithm 3 describes the procedure for answering a query for basic count.

Algorithm 1: Basic Counting: Initialization
    α ← ⌈(1 + log W) · (2 + ɛ)/ɛ⌉, where ɛ is the desired relative error;
    S_0 ← ∅; T_0 ← −1;
    for i = 1, . . . , M do
        S_i ← a set with a single bucket 〈0, 0, W − 1〉;
        T_i ← −1;
    end


Algorithm 2: Basic Counting: When an element with timestamp t arrives
    // level 0
    if there is a bucket 〈w(b), t, t〉 ∈ S_0 then
        Increment w(b);
    else
        Insert bucket 〈1, t, t〉 into S_0;
    end
    // level i, i > 0
    for i = 1, . . . , M do
        if there is a bucket b = 〈w(b), l(b), r(b)〉 ∈ S_i with t ∈ [l(b), r(b)] then
            Increment w(b);
            if w(b) = 2^{i+1} and l(b) ≠ r(b) then
                // bucket too heavy, split
                // note that a bucket is not split
                // if it is responsible for only a single timestamp
                New bucket b_1 = 〈2^i, l(b), (l(b) + r(b) + 1)/2 − 1〉;
                New bucket b_2 = 〈2^i, (l(b) + r(b) + 1)/2, r(b)〉;
                Delete b from S_i;
                Insert b_1 and b_2 into S_i;
            end
        end
    end
    // handle overflow
    for i = 0, . . . , M do
        if |S_i| > α then
            // overflow
            Discard the bucket b* ∈ S_i such that r(b*) = min_{b ∈ S_i} r(b);
            T_i ← r(b*);
        end
    end

Algorithm 3: Basic Counting: Query(w)
    Input: w, the width of the query window, where w ≤ W
    Output: An estimate of the number of elements with timestamps in [c − w, c], where c is the current time
    Let l ∈ [0, . . . , M] be the smallest integer such that T_l < c − w;
    return Σ_{b ∈ S_l | l(b) ≥ c − w} w(b);
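For concreteness, the following is a minimal Python sketch of the splittable-histogram summary described by Algorithms 1-3. The class and method names (Bucket, BasicCountSummary, insert, query), the explicit current-time argument c passed to the query, and the list-based representation of each S_i are illustrative assumptions rather than anything prescribed by the paper; the sketch favors clarity over the stated update and query times.

```python
import math
from dataclasses import dataclass


@dataclass
class Bucket:
    weight: int  # w(b): number of stream elements counted in this bucket
    left: int    # l(b): smallest timestamp the bucket is responsible for
    right: int   # r(b): largest timestamp the bucket is responsible for


class BasicCountSummary:
    """Sketch of the splittable histogram for basic counting (Algorithms 1-3)."""

    def __init__(self, W: int, B: int, eps: float):
        # Algorithm 1: initialization.
        self.W = W
        self.M = math.ceil(math.log2(B))  # levels 0 .. M, with M = ceil(log B)
        self.alpha = math.ceil((1 + math.log2(W)) * (2 + eps) / eps)
        self.S = [[] for _ in range(self.M + 1)]  # S_i: list of buckets (S_0 starts empty)
        self.T = [-1] * (self.M + 1)              # T_i: right end of the last discarded bucket
        for i in range(1, self.M + 1):
            self.S[i].append(Bucket(0, 0, W - 1))  # initial bucket <0, 0, W-1>

    def insert(self, t: int) -> None:
        # Algorithm 2: process an element with timestamp t (0 <= t <= W-1).
        # Level 0: one bucket per distinct timestamp; never split.
        b0 = next((b for b in self.S[0] if b.left == t), None)
        if b0 is not None:
            b0.weight += 1
        else:
            self.S[0].append(Bucket(1, t, t))
        # Levels i > 0: increment the bucket responsible for t; split if too heavy.
        for i in range(1, self.M + 1):
            b = next((bb for bb in self.S[i] if bb.left <= t <= bb.right), None)
            if b is None:
                continue  # the bucket responsible for t was already discarded at this level
            b.weight += 1
            if b.weight == 2 ** (i + 1) and b.left != b.right:
                # Bucket too heavy: split its timestamp range in half.
                mid = (b.left + b.right + 1) // 2
                self.S[i].remove(b)
                self.S[i].append(Bucket(2 ** i, b.left, mid - 1))
                self.S[i].append(Bucket(2 ** i, mid, b.right))
        # Handle overflow: keep only the alpha buckets with the most recent ranges.
        for i in range(self.M + 1):
            if len(self.S[i]) > self.alpha:
                oldest = min(self.S[i], key=lambda x: x.right)
                self.S[i].remove(oldest)
                self.T[i] = oldest.right

    def query(self, c: int, w: int) -> int:
        # Algorithm 3: estimate the number of elements with timestamps in [c-w, c].
        for l in range(self.M + 1):
            if self.T[l] < c - w:
                return sum(b.weight for b in self.S[l] if b.left >= c - w)
        raise ValueError("no usable level; requires w <= W (see Lemma 2)")
```

The linear scans over each level are deliberate simplifications: achieving the O(log W · log B) processing time of Theorem 2 would require organizing the buckets of each level for fast lookup of the bucket responsible for a given timestamp, which this sketch does not attempt.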


2.2 Proof of Correctness

Let c denote the current time. We consider the contents of the sets S_i and the values of T_i at time c. For any time t, 0 ≤ t ≤ c, let s_t denote the number of elements with timestamps in the range [t, W − 1] which arrive until time c. For each level i, 0 ≤ i ≤ M, e^i_t is defined as follows.

Definition 2.

    e^i_t = Σ_{b ∈ S_i | l(b) ≥ t} w(b)

Lemma 1. For any level i ∈ [0, M] and any t such that T_i < t ≤ c,

    |s_t − e^i_t| ≤ 2^i · (1 + log W).

Proof. For level i = 0 we have s_t = e^0_t, since each element x with timestamp t′, where t ≤ t′ ≤ W − 1, is counted in the bucket b = 〈w(b), t′, t′〉, which is a member of S_0 at time c. Thus, |s_t − e^0_t| = 0.

Consider now some level i > 0. We can construct a binary tree A whose nodes are all the buckets that appeared in S_i up to the current time c. Let b_0 = 〈0, 0, W − 1〉 be the initial bucket which is inserted into S_i during initialization (Algorithm 1). The root of A is b_0. For any bucket b ∈ A, if b is split into two buckets b_l and b_r, then b_l and b_r appear as the respective left and right children of b in A. Note that in A a node is either a leaf or has exactly two children. Tree A has depth at most log W (the root is at depth 0), since every time a bucket splits its time period is divided in half, and the smallest time period is a single discrete time step. For any node b ∈ A, let A(b) denote the subtree with root b; we will also refer to this as the subtree of b.

Consider now the tree A at time c. The buckets in S_i appear as the |S_i| rightmost leaves of A. Let S′_i denote the set of buckets in S_i with l(b) ≥ t. Clearly, e^i_t = Σ_{b ∈ S′_i} w(b). The buckets in S′_i are the |S′_i| rightmost leaves of A. Suppose that S′_i ≠ ∅ (the case S′_i = ∅ is discussed below). Let b′ be the leftmost leaf in A among the buckets in S′_i. Let p denote the path in A from the root to b′. The number of nodes |p| of p satisfies |p| ≤ 1 + log W. Let H_1 (respectively H_2) be the set consisting of the right (respectively left) children of the nodes in p, such that these children are not themselves members of the path p. Note that b′ ∉ H_1 ∪ H_2. The union of b′ and the leaves in the subtrees of H_1 constitutes the buckets in S′_i. Further, each bucket b ∉ S′_i is a leaf in a subtree of H_2 (∪_{b ∈ H_2} A(b)).

Consider some element x with timestamp t′. When x arrives, it is initially assigned to the bucket b whose timestamp range contains t′. If b splits into two (children) buckets b_1 and b_2, then we assume that x is assigned to one of the two new buckets arbitrarily. Even though x's timestamp may belong to b_1, x may be assigned to b_2, and vice versa. If the new bucket splits again, x is assigned to one of its children, and so on. Note that x is always assigned to a leaf of A.

At time c, we can write

    e^i_t = s_t + |X_1| − |X_2 ∪ X_3|,    (1)

such that: X_1 is the set of elements with timestamps in [0, t − 1] which are assigned to buckets in S′_i; X_2 is the set of elements with timestamps in [l(b′), W − 1] which are assigned to buckets outside of S′_i; and, for t < l(b′), X_3 is the set of elements with timestamps in [t, l(b′) − 1] which are assigned to buckets outside of S′_i, while for t = l(b′), X_3 = ∅. Note that the sets X_1, X_2, X_3 are disjoint.

First, we bound |X_1|. Consider some element x ∈ X_1 with timestamp in [0, t − 1] which at time c is assigned to a leaf bucket b_l ∈ S′_i. Since b_l ∈ S′_i, t cannot be a member of the time range of b_l, that is, t ∉ [l(b_l), r(b_l)]. Thus, x could not have been initially assigned to b_l. Suppose that b_l ≠ b′. Then, there is a node b̂ ∈ H_1 such that b_l is a leaf of the subtree A(b̂). None of the nodes in A(b̂) contain t in their time range, since all the leaves of A(b̂) are members of S′_i. Therefore, x could not have been initially assigned to a bucket in A(b̂). Thus, x is initially assigned to a node b_p ∈ p′ = p − {b′}, since x could not have been assigned to any node in the subtrees of H_2, which would certainly place x outside of S′_i. Similarly, if b_l = b′, x is initially assigned to a node b_p ∈ p′. Since at most 2^{i+1} elements are initially assigned to the root, and at most 2^i elements are initially assigned to each of the subsequent nodes of p′, we get:


    |X_1| ≤ 2^i · (|p′| − 1) + 2^{i+1} = 2^i · |p| ≤ 2^i · (1 + log W).    (2)

With a similar analysis (the details are omitted due to space constraints) it can be shown that:

    |X_2 ∪ X_3| ≤ 2^i · (1 + log W).    (3)

Combining Equations 1, 2, and 3 we can bound s_t − e^i_t:

    −2^i · (1 + log W) ≤ −|X_1| ≤ s_t − e^i_t ≤ |X_2 ∪ X_3| ≤ 2^i · (1 + log W).

Therefore, |s_t − e^i_t| ≤ 2^i · (1 + log W). In the case S′_i = ∅, e^i_t = s_t − |X_3| = 0, and the same bound follows immediately. ⊓⊔

Lemma 2. When asked for an estimate of the number of timestamps in [c − w, c]:
(1) there exists a level i ∈ [0, M] such that T_i < c − w, and
(2) Algorithm 3 returns e^l_{c−w}, where l ∈ [0, M] is the smallest level such that T_l < c − w.

The proof of Lemma 2 is omitted due to space constraints, and can be found in the full version [3]. Let l denote the level used by Algorithm 3 to answer a query for the number of timestamps in [c − w, c]. From Lemma 2 we know that l always exists.

Lemma 3. If l > 0, then s_{c−w} ≥ (1 + log W) · 2^l / ɛ.

Proof. If l > 0, it must be true that T_{l−1} ≥ c − w, since otherwise level l − 1 would have been chosen. Let t = T_{l−1} + 1. Then t > c − w, and thus s_{c−w} ≥ s_t. From Lemma 1, we know s_t ≥ e^{l−1}_t − (1 + log W) · 2^{l−1}. Thus we have:

    s_{c−w} ≥ e^{l−1}_t − (1 + log W) · 2^{l−1}.    (4)

We know that for each bucket b ∈ S_{l−1}, l(b) ≥ t. Further, we know that each bucket in S_{l−1} has a weight of at least 2^{l−1} (only the initial bucket in S_{l−1} may have a smaller weight, but this bucket must have split, since otherwise T_{l−1} would still be −1). Since there are α buckets in S_{l−1}, we have:

    e^{l−1}_t ≥ α · 2^{l−1} ≥ (1 + log W) · ((2 + ɛ)/ɛ) · 2^{l−1}.    (5)

The lemma follows from Equations 4 and 5. ⊓⊔

Theorem 1. The answer returned by Algorithm 3 is within an ɛ relative error of s_{c−w}.

Proof. Let X denote the value returned by Algorithm 3. If l = 0, it can be verified that Algorithm 3 returns exactly s_{c−w} (proof omitted due to space constraints). If l > 0, from Lemmas 1 and 2, we have |X − s_{c−w}| ≤ (1 + log W) · 2^l. Using Lemma 3, we get |X − s_{c−w}| ≤ ɛ · s_{c−w}, as needed. ⊓⊔
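To make the choice of α concrete, here is a small worked check with sample values (W = 2^20, ɛ = 1/10, and a query answered at level l = 5; these are illustrative numbers, not from the paper), using exact rational arithmetic:

```python
from fractions import Fraction
from math import ceil, log2

# Illustrative parameters (sample values, not from the paper): W = 2^20, eps = 1/10.
W, eps = 2 ** 20, Fraction(1, 10)
log_W = int(log2(W))                              # 20, so 1 + log W = 21
alpha = ceil((1 + log_W) * (2 + eps) / eps)       # Algorithm 1: ceil(21 * 2.1 / 0.1) = 441
l = 5                                             # level used by Algorithm 3 to answer the query
additive_error = (1 + log_W) * 2 ** l             # Lemmas 1 and 2: |X - s_{c-w}| <= 21 * 32 = 672
count_lower_bound = additive_error / eps          # Lemma 3: s_{c-w} >= 6720
print(alpha, additive_error / count_lower_bound)  # prints: 441 1/10
```

The printed relative error 1/10 equals ɛ, matching Theorem 1: each level keeps at most α = 441 buckets, the additive error at level l = 5 is at most (1 + log W) · 2^l = 672 by Lemmas 1 and 2, and Lemma 3 guarantees s_{c−w} ≥ 672/ɛ = 6720.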


Theorem 2. The worst case space required by the data structure for basic counting is O((log W · log B) · (log W + log B)/ɛ) bits, where B is an upper bound on the value of the basic count, W is an upper bound on the window size w, and ɛ is the desired upper bound on the relative error. The worst case time taken by Algorithm 2 to process a new element is O(log W · log B), and the worst case time taken by Algorithm 3 to answer a query for basic counting is O(log B + (log W)/ɛ).

The proof is omitted due to space constraints, and can be found in the full version [3].

3 Sum of Positive Integers

We now consider the maintenance of a sketch for the sum, which is a generalization of basic counting. The stream is a sequence of tuples R = d_1, d_2, . . . , d_n, where d_i = (v_i, t_i), the v_i are positive integers corresponding to the observations, and the t_i are the timestamps of the observations. Let c denote the current time. The goal is to maintain a sketch of R which will provide an answer for the following query: for a user-provided w ≤ W that is given at the time of the query, return the sum of the values of the stream elements that are within the current timestamp window [c − w, c]. Clearly, basic counting is the special case where all v_i are equal to 1.

An arriving element (v, t) is treated as v different elements, each of value 1 and timestamp t, and these v elements are inserted into the data structure for basic counting. When asked for an estimate of the sum, the query algorithm for basic counting (Algorithm 3) is used. The correctness of this algorithm for the sum follows from the correctness of the basic counting algorithm (Theorem 1). The space complexity of this algorithm is the same as the space complexity of basic counting, the only difference being that the number of levels in the algorithm for the sum is M = ⌈log B⌉, where B is an upper bound on the value of the sum within the sliding window (in the case of basic counting, B was an upper bound on the number of elements within the window).

If naively executed, the time complexity of the above procedure for processing an element (v, t) could be large, since v could be large. The time complexity of processing an element can be reduced by directly computing the final state of the basic counting data structure after inserting all v elements. The intuition behind the faster processing is as follows. The element (v, t) is inserted into each of the M + 1 levels. In each level i, i = 0, . . . , M, the v elements are inserted into S_i in batches of unit elements (1, t) taken from (v, t). A batch contains enough elements to cause the current bucket containing timestamp t to split. The next batch contains enough elements from v to cause the new bucket containing timestamp t to split, too, and so on. The process repeats until a bucket containing timestamp t cannot split further. This occurs after at most O(max(v/2^i, log W)) batches are processed (and a similar number of new buckets is created), since at most O(2^i) elements from v are processed in each batch, and a bucket can be recursively split at most log W times until it is responsible for only one timestamp, at which point no further splitting can occur (and any remaining elements are directly inserted into this bucket). The complete algorithm for processing (v, t) and its analysis can be found in the full version of the paper [3], where it is proved that upon receiving element (v, t), the algorithm for the sum simulates the behavior of Algorithm 2 upon receiving v elements, each with timestamp t.
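The naive reduction in the paragraph above can be phrased directly in terms of the basic-counting structure. The snippet below is an illustrative sketch that reuses the hypothetical BasicCountSummary class from the earlier example; it implements only the naive O(v)-time insertion, not the batched split simulation whose details are deferred to the full version [3].

```python
class SumSummary(BasicCountSummary):
    """Sliding-window sum via the naive reduction to basic counting.

    Here B (passed to the constructor) must upper-bound the value of the
    sum within the window, so the number of levels is M = ceil(log B).
    """

    def insert_sum(self, v: int, t: int) -> None:
        # Treat (v, t) as v unit elements, each with timestamp t, and insert
        # them one by one. Correct, but takes O(v) time per element; the
        # paper's batched variant computes the resulting bucket splits
        # directly and achieves the O(log W * log B) time of Theorem 3.
        for _ in range(v):
            self.insert(t)

    # query(c, w) is inherited unchanged; by Theorem 1 it returns an
    # eps-approximation of the sum of values with timestamps in [c - w, c].
```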


Theorem 3. The worst case space required by the data structure for the sum is O((log W · log B) · (log W + log B)/ɛ) bits, where B is an upper bound on the value of the sum, W is an upper bound on the window size w, and ɛ is the desired upper bound on the relative error. The worst case time taken by the algorithm for the sum to process a new element is O(log W · log B), and the time taken to answer a query for the sum is O(log B + (log W)/ɛ).

References

1. A. Arasu and G. Manku. Approximate counts and quantiles over sliding windows. In Proc. ACM Symposium on Principles of Database Systems (PODS), pages 286–296, 2004.
2. B. Babcock, M. Datar, R. Motwani, and L. O'Callaghan. Maintaining variance and k-medians over data stream windows. In Proc. 22nd ACM Symposium on Principles of Database Systems (PODS), pages 234–243, June 2003.
3. C. Busch and S. Tirthapura. A deterministic algorithm for summarizing asynchronous streams over a sliding window. Technical report, Iowa State University, 2006. Available at http://archives.ece.iastate.edu/view/year/2006.html.
4. G. Cormode, F. Korn, S. Muthukrishnan, and D. Srivastava. Space- and time-efficient deterministic algorithms for biased quantiles over data streams. In Proc. ACM Symposium on Principles of Database Systems (PODS), pages 263–272, 2006.
5. M. Datar, A. Gionis, P. Indyk, and R. Motwani. Maintaining stream statistics over sliding windows. SIAM Journal on Computing, 31(6):1794–1813, 2002.
6. J. Feigenbaum, S. Kannan, and J. Zhang. Computing diameter in the streaming and sliding-window models. Algorithmica, 41:25–41, 2005.
7. P. Gibbons and S. Tirthapura. Distributed streams algorithms for sliding windows. Theory of Computing Systems, 37:457–478, 2004.
8. S. Guha, D. Gunopulos, and N. Koudas. Correlating synchronous and asynchronous data streams. In Proc. 9th ACM International Conference on Knowledge Discovery and Data Mining (KDD), pages 529–534, 2003.
9. A. Manjhi, V. Shkapenyuk, K. Dhamdhere, and C. Olston. Finding (recently) frequent items in distributed data streams. In Proc. IEEE International Conference on Data Engineering (ICDE), pages 767–778, 2005.
10. S. Muthukrishnan. Data Streams: Algorithms and Applications. Foundations and Trends in Theoretical Computer Science. Now Publishers, August 2005.
11. U. Srivastava and J. Widom. Flexible time management in data stream systems. In Proc. 23rd ACM Symposium on Principles of Database Systems (PODS), pages 263–274, 2004.
12. S. Tirthapura, B. Xu, and C. Busch. Sketching asynchronous streams over a sliding window. In Proc. 25th Annual ACM Symposium on Principles of Distributed Computing (PODC), pages 82–91, 2006.
