11.07.2015 Views

A Data Streaming Algorithm for Estimating Entropies of OD Flows

A Data Streaming Algorithm for Estimating Entropies of OD Flows

A Data Streaming Algorithm for Estimating Entropies of OD Flows

SHOW MORE
SHOW LESS
  • No tags were found...

You also want an ePaper? Increase the reach of your titles

YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.

A = O − D = flows that enter at ingress but do not exitthrough the egressB = O ∩ D = flows that enter at ingress and exit throughthe egressC = D − O = flows that do not enter at ingress but exitthrough the egressWe know that Λ( ⃗ O) is an estimator <strong>for</strong> ||O|| p, so Λ( ⃗ O) pis an estimator <strong>for</strong> ||O|| p p = ||A ∪ B|| p p = ||A|| p p + ||B|| p p.Similarly Λ( ⃗ D) p is an estimator <strong>for</strong> ||B|| p p + ||C|| p p.The sketch ⃗ O + ⃗ D holds the contributions <strong>of</strong> all the flowsin A, B and C, but with every packet from B contributingtwice. We use B (2) to denote all the flows in B with packetcounts doubled. Then Λ( ⃗ O + ⃗ D) p is an estimator <strong>for</strong> ||A ∪B (2) ∪C|| p p = ||A|| p p+2 p ||B|| p p+||C|| p p. It is important to pointout that the reasoning here (and in the next paragraph)depends on the fact that the ingress and egress nodes areusing the same sketch settings as noted at the beginning <strong>of</strong>this section.The sketch ⃗ O − ⃗ D exactly cancels out the contributionsfrom all the flows in B, and leaves us with the contributions<strong>of</strong> flows from A and the negative <strong>of</strong> the contributions <strong>of</strong>flows from C. We use C (−1) to denote all the flows in Cwith packet counts multiplied by −1. Then Λ( ⃗O − ⃗D) p is anestimator <strong>of</strong> ||A ∪ C (−1) || p p = ||A|| p p + ||C (−1) || p p = ||A|| p p +||C | | p p .To sum up, we get the following:Λ( ⃗ O) p estimates ||A|| p p + ||B|| p pΛ( ⃗ D) p estimates ||B|| p p + ||C|| p pΛ( ⃗O + ⃗D) p estimates ||A|| p p + 2 p ||B|| p p + ||C|| p pΛ( ⃗ O − ⃗ D) p estimates ||A|| p p + ||C|| p p.It is easy to see from the above <strong>for</strong>mulae that both Formula(5) and Formula (6) are reasonable estimators <strong>for</strong> ||B|| p,i.e. ||O ∩ D|| p.Proposition 8. Suppose we have chosen proper l suchthat (1) is roughly an (ɛ, δ) estimator. Suppose ||O ∩ D|| p p =r 1||O|| p p, and ||O ∩ D|| p p = r 2||D|| p p. Then (5) raised to thepower p estimates ||O ∩ D|| p p roughly within relative errorbound ( 1r 1+ 1r 2−1)ɛ with probability at least 1−3δ. Similarly,(6) raised to the power p estimates ||O ∩ D|| p p roughly withinrelative error bound 2 1−p ( 1r 1+ 1r 2+2 p−1 −2)ɛ ≈ ( 1r 1+ 1r 2−1)ɛwith probability at least 1 − 2δ.We omit the pro<strong>of</strong> here since it is similar to the previouspro<strong>of</strong>s. This gives us a very loose rough upper bound on therelative error. The ratios r 1 and r 2 are related to, but notidentical to, the ratio <strong>of</strong> <strong>OD</strong> flow traffic against the totaltraffic at the ingress and egress points. We want to pointout that we cannot pursue a more aggressive claim similarto Proposition 5, because we cannot claim independence <strong>of</strong>Λ( ⃗ O) and Λ( ⃗ O + ⃗ D), etc.Now, to calculate <strong>OD</strong> flow entropy, we just need to replaceΛ( ⃗ Y ) in (1) and (4) with Λ( ⃗ O, ⃗ D) where ⃗ O and ⃗ D are L 1+αsketches, and similarly <strong>for</strong> Λ( ⃗ Z).6. USING BUCKETSOur earlier examples showed that l, the number <strong>of</strong> countersin the L p sketch, need to be in the order <strong>of</strong> many thousandsto achieve a high estimation accuracy. Recall thateach incoming packet will trigger increments to all l counters<strong>for</strong> estimating one L p norm, and our algorithm requiresthat two different L p norms be computed. Such a large l isunacceptable to networking applications, however, since <strong>for</strong>high-speed links, where each packet has tens <strong>of</strong> nanosecondsto process, it is impossible to increment that many countersper packet, even if they are all in fast SRAM.We resolve this problem by adopting the standard methodology<strong>of</strong> bucketing [9], shown in <strong>Algorithm</strong> 2. With bucketing,the sketch data structure becomes a two-dimensionalarray M[1..k][1..l]. For each incoming packet, we hash itsflow label using a uni<strong>for</strong>m hash, function uh, and the resultuh(pkt.id) is the index <strong>of</strong> the bucket at which the packetshould be processed. Then we increment the countersM[uh(pkt.id)][1 . . . l] like in <strong>Algorithm</strong> 1. Finally, we addup the L p estimations from all these buckets to obtain ourfinal estimate.In the following, we will show that, roughly speaking,bucketing (i.e., with k buckets instead <strong>of</strong> 1) will reduce thestandard deviation <strong>of</strong> our estimation <strong>of</strong> L p norms by a factorslightly less than √ k, provided that the number <strong>of</strong> flows ismuch larger than the number <strong>of</strong> buckets k. Lemma 9 showsthat the standard deviation <strong>for</strong> using l counters is in theorder <strong>of</strong> O(1/ √ l). There<strong>for</strong>e, a decrease in l can be compensatedby an increase in k by a slightly larger factor. Inour proposed implementation (described in Section 7), thenumber <strong>of</strong> buckets k is typically on the order <strong>of</strong> 10,000. Sucha large bucket size allows l to shrink to a small number suchas 20 to achieve the same (or even better) estimation accuracy.We will show shortly that, even on very high-speedlinks (say 10M packets per second), a few tens <strong>of</strong> memoryaccesses per packet can be accommodated.We use B i to denote all the flows hashed to the ith bucket,and ||B i|| p to denote the L p norm <strong>of</strong> the flows in the ithbucket, similar to how we defined ||S|| p be<strong>for</strong>e. Obviously||S|| p p = P i ||Bi||p p. Let M i be the i-th row vector <strong>of</strong> thesketch. We know that Λ(M i) as defined in (1) is an estimator<strong>of</strong> ||B i|| p, so naturally the estimator <strong>for</strong> ||S||p is:" kX# 1/pΛ(M) ≡ (Λ(M i)) p . (7)i=1In the ideal case <strong>of</strong> even distribution <strong>of</strong> flows into buckets,and all ‖B i‖ p p are the same, then ||S|| p p = k||B i|| p p. Let’sconsider p = 1 first. Lemma 9 tells us that the estimatorΛ(M i) is roughly Gaussian with mean ||B i|| 1 and standarddeviation (1/2mf(m) √ l)‖B i‖ 1. By the Central Limit Theorem,Λ(M) = P ki=1Λ(Mi) is asymptotically Gaussian, andits standard deviation is roughly √ k(1/2mf(m) √ l)‖B 1‖ 1 =(1/2mf(m) √ lk)||S|| 1. If we didn’t use any buckets and simplyused estimator (1), then the standard deviation wouldbe (1/2mf(m) √ l)||S|| 1. So lk is per<strong>for</strong>ming the same roleas l in the previous analysis, or in other words, k bucketsreduces standard error roughly by a factor <strong>of</strong> √ k.When p = 1 ± α in a small neighborhood <strong>of</strong> 1, we canreach the same conclusion by using the same handwavingargument in pro<strong>of</strong> <strong>of</strong> Proposition 5 that a Gaussian raisedto power p is still roughly Gaussian.In reality, we will not have even distribution <strong>of</strong> flows intovarious buckets. However, when the number <strong>of</strong> flows is farlarger than the number <strong>of</strong> buckets, which will be the casewith our parameter settings and intended workload, we can

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!