A Data Streaming Algorithm for Estimating Entropies of OD Flows

Haiquan (Chuck) Zhao, Georgia Inst. of Technology
Oliver Spatscheck, AT&T Labs – Research
Ashwin Lall, University of Rochester
Jia Wang, AT&T Labs – Research
Mitsunori Ogihara, University of Rochester
Jun (Jim) Xu*, Georgia Inst. of Technology

ABSTRACT

Entropy has recently gained considerable significance as an important metric for network measurement. Previous research has shown its utility in clustering traffic and detecting traffic anomalies. While measuring the entropy of the traffic observed at a single point has already been studied, an interesting open problem is to measure the entropy of the traffic between every origin-destination pair. In this paper, we propose the first solution to this challenging problem. Our sketch builds upon and extends the L_p sketch of Indyk with significant additional innovations. We present calculations showing that our data streaming algorithm is feasible for high link speeds using commodity CPU/memory at a reasonable cost. Our algorithm is shown to be very accurate in practice via simulations, using traffic traces collected at a tier-1 ISP backbone link.

Categories and Subject Descriptors

C.2.3 [Computer-Communication Networks]: Network Operations—Network Monitoring

General Terms

Algorithms, Measurement, Theory

Keywords

Network Measurement, Entropy Estimation, Data Streaming, Traffic Matrix, Stable Distributions

1.
INTRODUCTION

The (empirical) entropy of the network traffic has recently been proposed, in many different contexts, as an effective and reliable metric for anomaly detection and network diagnosis [3, 11, 15, 24, 25]. Measuring this quantity exactly in real time (say by maintaining per-flow states), however, is not possible on high-speed links due to its prohibitively high computational and memory requirements. For this reason, various data streaming algorithms [16] have been proposed to approximate this quantity. Data streaming [19] is concerned with processing a long stream of data items in one pass using a small working memory (called a sketch) in order to answer a class of queries regarding the stream.

* Supported in part by NSF grants CNS 0716423, CNS 0626979, CNS 0519745, and NSF CAREER award CNS 0238315.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
IMC'07, October 24-26, 2007, San Diego, California, USA.
Copyright 2007 ACM 978-1-59593-908-1/07/0010 ...$5.00.
The challenge is to use this sketch to "remember" as much information pertinent to our problem as possible.

We observe that it is often important to know the entropies of origin-destination (OD) flows, where an OD flow is defined as all the traffic that enters at an ingress point (called origin) and exits at an egress point (called destination). Knowing these quantities will give us significantly more insight into the dynamics of traffic inside an ISP network, which we will elaborate upon shortly. However, we found that none of the existing entropy estimation algorithms [1, 2, 5, 16] can be extended to solve this new problem. In particular, the traffic matrix estimation literature [18, 21, 22, 23, 27] has ruled out the following naive approach: separate the traffic at each ingress point into various OD flows in real time and feed them to existing entropy estimation algorithms. In this paper, we propose the first solution to this new problem.

1.1 Motivation

The ability to estimate entropy for all OD flows in a network can be highly beneficial. On today's Internet, network performance degradation and service disruption can be caused by a wide range of events, including anomalies (e.g., DDoS attacks, network failures, and flash crowds) and scheduled network maintenance tasks (e.g., router IOS updates and customer migration). Many of these events occur in a distributed manner (intentionally or unintentionally) in terms of their signatures and impacts. Detecting these events and evaluating their impact on network services thus often requires monitoring of the traffic from a number of locations across the network.
More importantly, these changes in traffic distribution may be totally invisible in the traditional traffic volume matrix. However, by examining the entropy of every OD flow in the network, one can hope to capture these events in real time.

To illustrate how beneficial OD flow entropy estimation is, consider a situation in which a link fails (e.g., due to hardware problems) and is cost out for maintenance. The traffic flows that used to be carried on that link will most likely be redistributed to alternative paths. In the very rare case where no alternative path exists or the network experiences long convergence delay, the flows may abort completely. In either case, traffic distribution on a number of links across the network will change. It is possible that the traffic volume changes on those impacted links are small, and sometimes too small to be detected on each link using volume-based detection algorithms. However, the overall change in the traffic distribution across the network can be quite large. If so, the examination of traffic entropy along with the traffic matrix for all OD flows is likely to enable us to capture such distributed and dynamically occurring events.

Another example is a DoS attack. When a distributed DoS attack is launched from outside the network, its impact may not be immediately visible in the traffic matrix in terms of traffic volume change. However, it is very important for operators to be able to detect and mitigate the attack at the earliest time to minimize the possible negative impact on their services. That is, it may be too late for certain types of events to be detected when their impact becomes visible in the traffic volume matrix. In addition, certain types of DoS attacks are difficult to detect due to the low-rate nature of attack flows [14, 26].
In such cases, we hope to be able to achieve real-time detection based on traffic entropy estimation.

1.2 Our solution and contributions

In this work, we propose the first solution to the challenging problem of estimating the entropies of all OD flows. Our key innovation is a sketch that allows us to approximately estimate not only (a) the entropy of a data stream, but also (b) the entropy of the intersection of two data streams A and B from the sketches of A and B. We refer to property (b) as the "intersection measurable property" (IMP) in the sequel. Note that all existing entropy estimation algorithms (their sketches) have property (a), but none of them has IMP.

The intersection measurable property (IMP) of our sketch leads to the following very elegant solution to our problem. Each participating ingress and egress node in the network maintains a sketch of the traffic that flows in/out of it during a measurement interval. When we would like to measure the entropy of an OD flow stream between an ingress point i and an egress point j, we simply summon from node i the sketch of its ingress stream O_i, and from node j the sketch of its egress stream D_j, and then infer the entropy of O_i ∩ D_j using the sketches' IMP. This solution is extremely cost-effective in that we are able to estimate O(k^2) quantities (the entropies of a k by k OD flow stream matrix) using only O(k) sketches. However, sketches that have IMP are much harder to design than those without it, and herein lies the significant challenge of this work.

Our sketch builds upon and extends the L_p sketch of Indyk [12] with significant additional innovations.
Indyk's sketch was designed for the estimation of the L_p norm of a stream (defined later). We observe that Indyk's L_p sketch has the desirable IMP, and therefore allows for the estimation of the L_p norm of an OD flow O_i ∩ D_j using the L_p sketches at O_i and D_j. However, we have to develop the IMP theory of the L_p sketch ourselves, since IMP was never even claimed in [13] (probably because there was no application in sight for IMP at that time).

Our most important contribution is the discovery that the entropy of an OD flow can be approximated by a function of just two L_p norms (L_{p_1} and L_{p_2}, p_1 ≠ p_2) of the OD flow. With this insight, our sketch at each ingress and egress point is simply one L_{p_1} sketch and one L_{p_2} sketch. In this way, we not only solve this difficult problem, but also find a nice application for the L_p sketch. From a theoretical point of view, our solution resolves one of the open questions of Cormode [7].

Our contributions can be summarized as follows. First, our discovery builds a mathematical connection between our problem and Indyk's L_p norm estimation problem, leading to a highly cost-effective solution. Second, we extend Indyk's (ε, δ) analysis of L_p sketches [13] using the asymptotic normality of order statistics, leading to much tighter (yet slightly less rigorous) accuracy bounds. We also modify Indyk's algorithm to make it run fast enough for processing traffic at very high-speed links (e.g., 10 million packets per second). Finally, we thoroughly evaluate the accuracy of our solution by simulating it on real-world packet traces collected at a tier-1 ISP backbone.
We show that our algorithm delivers very accurate estimations of the OD flow entropies.

The remainder of this paper is organized as follows. In Section 2 we give a high-level overview of how the different components of our entire scheme work. In Sections 3 through 6 we describe the major components of our algorithm.

• In Section 3 we introduce stable distributions and describe Indyk's algorithm for estimating the L_p norm using them.
• In Section 4 we show how to estimate entropy from L_p norms.
• In Section 5 we show the intersection measurable property (IMP) of the L_p norm algorithm.
• In Section 6 we modify the L_p norm algorithm to improve its accuracy under tight resource constraints.

In Section 7 we discuss the hardware implementation of our algorithm. In Section 8 we demonstrate the accuracy of our algorithm via simulations run on real-world data. In Section 9 we briefly survey previous work that is related to ours. Finally, we conclude in Section 10. The Appendix contains the proofs of some of the theorems.

2. PROBLEM STATEMENT AND OVERVIEW

In this section, we describe precisely the problem of estimating the entropies of OD flows and offer an overview of our solution approach.
As defined before in [16], given a stream S of packets that contains n (transport-layer) flows with sizes (number of packets) a_1, a_2, ..., a_n respectively, its empirical entropy H(S) is defined as −Σ_{i=1}^n (a_i/s) log_2(a_i/s), where s = Σ_{i=1}^n a_i is the total number of packets in S.¹

We have shown before [16] that to estimate the empirical entropy of a stream S, it suffices to estimate a related quantity called the entropy norm ||S||_H, defined as Σ_i a_i ln(a_i), since H(S) can be rewritten as

    H(S) = −Σ_i (a_i/s) log_2(a_i/s) = log_2(e) [ln(s) − (1/s) Σ_i a_i ln(a_i)],

and s is usually a known quantity.

The entropy of an OD flow stream OD_ij between an ingress point i and an egress point j is defined as the entropy of their intersection, i.e., H(OD_ij) ≡ H(O_i ∩ D_j). To derive this entropy value from the above formula, however, we need to measure both the entropy norm ||O_i ∩ D_j||_H and s in order to compute H(O_i ∩ D_j). Note that in our case, s is the volume of the OD flow and is an unknown quantity that needs to be estimated/inferred separately. As described in Section 1.2, our solution is to invent a sketch for estimating the entropy norm of a stream that has the intersection measurable property (IMP).

Our algorithm and sketch build upon Indyk's classical results on estimating the L_p norm of a stream using the theory of stable distributions, for values of p in (0, 2]. In [13], Indyk presents an algorithm for computing the L_p norm of a stream S. For a stream S that contains n flows of sizes a_1, ..., a_n, its L_p norm ||S||_p is defined as (Σ_i |a_i|^p)^{1/p}, so ||S||_p^p = Σ_i |a_i|^p. We discover that the data streaming solution of Indyk has the aforementioned IMP. Interestingly, this nice intersection property was never claimed in [13] or anywhere else.

¹ Throughout this paper logarithms will be base e except in the definition of empirical entropy here. Since the logarithm base e and the logarithm base 2 differ by a constant multiplicative factor, this does not affect our analysis of relative errors.
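The identity relating the empirical entropy H(S) to the entropy norm ||S||_H can be checked numerically. The snippet below is only an illustrative sanity check of the algebra (with made-up flow sizes), not the paper's streaming algorithm:

```python
import math

def empirical_entropy(sizes):
    """H(S) = -sum (a_i/s) * log2(a_i/s), computed directly."""
    s = sum(sizes)
    return -sum(a / s * math.log2(a / s) for a in sizes)

def entropy_from_norm(sizes):
    """H(S) recovered from the entropy norm ||S||_H = sum a_i * ln(a_i)."""
    s = sum(sizes)
    norm_h = sum(a * math.log(a) for a in sizes)
    return math.log2(math.e) * (math.log(s) - norm_h / s)

sizes = [1, 3, 7, 120, 42, 9]  # hypothetical flow sizes
# The two formulas agree up to floating-point rounding.
```

This is why estimating ||S||_H (plus s) suffices for estimating H(S).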
We will develop an entropy estimation technique that fully takes advantage of this property and offer its rigorous analysis.

While the work of Indyk is very influential, the practical importance of being able to estimate the L_p norm for values of p other than 1 and 2 was never clear. In our work, the L_p norms for p values slightly above or below 1 play a crucial role, as follows. On realizing that Indyk's algorithm can be extended to estimate the L_p norms of an OD flow, we came up with the wild conjecture that it is possible to approximate the function x ln(x) using a linear combination of a small number of functions in the family {x^p | p ∈ (0, 2]}. In other words, we conjectured that we can find parameters c_1, ..., c_k ∈ R and p_1, ..., p_k ∈ (0, 2] such that x ln(x) ≈ Σ_{j=1}^k c_j · x^{p_j}. If this conjecture is true, then we will be able to estimate the entropy norm of an OD flow stream S as Σ_{j=1}^k c_j · ||S||_{p_j}^{p_j}, where ||S||_{p_j}, j = 1, ..., k, can be estimated using our extension of Indyk's algorithm for stream intersection. We emphasize that it took a leap of faith for us to think in this direction, as most of the approximation schemes we encountered in the mathematical literature are by linear combination of terms like x^j (approximation by polynomial) or sin(x) (Fourier expansion).

Our wild conjecture has been proven correct! To our amazement we found that by using a linear combination of only two functions in the family, in the form of (1/(2α))(x^{1+α} − x^{1−α}), we can approximate x ln(x) very closely for all x values in a large interval, e.g., [1, 1000] or [1, 5000].
Here α is a tunable parameter that takes small values. For example, in Figure 1, we show how closely we can approximate x ln(x) using 10(x^{1.05} − x^{0.95}) within the interval [1, 1000]. In other words, if all transport-layer flows in an OD flow S have fewer than 1000 packets, we can estimate the entropy norm of the OD flow as 10(||S||_{1.05}^{1.05} − ||S||_{0.95}^{0.95}).

[Figure 1: Comparison of the entropy function y = x ln(x) and the approximation 10(x^{1.05} − x^{0.95}) on [1, 1000].]

It turns out that this "symmetry" of the exponents (1 + α and 1 − α) around 1 serves another very important purpose that we will describe shortly.

We found that the approximation becomes gradually worse when x grows beyond 1000 (packets), but the relative error level is generally acceptable up until 5000. Therefore, if there are some very large transport-layer flows (say with tens or hundreds of thousands of packets) inside an OD flow S, an estimation formula such as 10(||S||_{1.05}^{1.05} − ||S||_{0.95}^{0.95}) may deviate significantly from its entropy norm. Fortunately, identifying such very large flows is a well-studied problem in both computer networking (e.g., [10]) and theory (e.g., [17]). We will adopt the "sample and hold" algorithm proposed in [10] to identify all flows that are much larger than a certain threshold (say 1000 packets) and compute their contributions to the OD flow entropy separately.

Now the last piece of the puzzle is that when we compute the entropy H from the entropy norm, we need to know s, the total volume of the OD flow. Estimating this volume is a task known as traffic matrix estimation [18, 21, 22, 23, 27].
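The closeness shown in Figure 1 can be probed numerically; here is a quick check of the relative error of 10(x^{1.05} − x^{0.95}) against x ln(x), with an arbitrarily chosen grid of points in [1, 1000]:

```python
import math

# alpha = 0.05, c = 1/(2*alpha) = 10, as in Figure 1.
def rel_err(x):
    approx = 10 * (x ** 1.05 - x ** 0.95)
    return approx / (x * math.log(x)) - 1

# The error grows monotonically with x, so the worst case on this
# grid is at x = 1000 (about 2%).
worst = max(rel_err(x) for x in [2, 5, 10, 50, 100, 500, 1000])
```

The error is always positive (the approximation slightly overshoots), consistent with the Taylor-expansion analysis in Section 4.1.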
Various techniques for estimating the traffic matrix have been proposed that are based on statistical inference or direct measurement (including data streaming). In fact, this quantity is exactly the L_1 norm, which can be estimated using Indyk's L_1 norm estimation algorithm with IMP extensions. It turns out that we need none of these. Recall that our approximation formula is of the form (1/(2α))(x^{1+α} − x^{1−α}), where α is a small number such as 0.05. We observe that, in this case, the function x can indeed be approximated by (x^{1+α} + x^{1−α})/2, and therefore the OD flow volume (i.e., the L_1 norm) can be approximated by the average of ||S||_{1+α}^{1+α} and ||S||_{1−α}^{1−α} calculated from the OD flow. Therefore, our sketch data structure allows us to kill two birds (the L_1 norm and the entropy norm) with one stone!

3. PRELIMINARIES

For the purposes of this paper we define a flow to be all the packets with the same five-tuple in their headers: source address, destination address, source port, destination port, and protocol.

Clearly, we will not be able to compute the entropy of each distribution exactly, so we use the following type of approximation scheme. An (ε, δ) approximation scheme is one that returns an approximation θ̂ for a value θ such that, with probability at least 1 − δ, we have (1 − ε)θ ≤ θ̂ ≤ (1 + ε)θ.
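The observation that the same two exponents also recover the volume can be checked numerically (a minimal probe, with α = 0.05 as in the running example):

```python
# (x**(1+a) + x**(1-a)) / 2 tracks x itself; with a = 0.05 the relative
# error stays small on [1, 1000], worst at x = 1000 (about 6%).
def vol_rel_err(x):
    return (x ** 1.05 + x ** 0.95) / (2 * x) - 1

worst = max(vol_rel_err(x) for x in range(1, 1001))
```

The error again grows with x, which is why the same elephant threshold N used for the entropy norm also bounds the volume approximation.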


has distribution S(p). Note that, due to the symmetry of S(p), DMed_p is exactly the three-quarter quantile of S(p). Although there is no closed form for DMed_p for most values of p, we can calculate it numerically by simulation, or we can use a program like [20].

Intuitively, the correctness of this estimator can be justified as follows. Since Y_1/||S||_p, ..., Y_l/||S||_p are i.i.d. random variables with distribution S(p), taking absolute values gives us i.i.d. draws from S^+(p). For large enough l, their median should be close to the distribution median of S^+(p). Therefore, we simply divide median(|Y_1|, ..., |Y_l|) by the distribution median of S^+(p) to get an estimator of ||S||_p. We have to take absolute values because the distribution median of S(p) is 0 due to its symmetry. In the next section we will analyze the relative error of this estimator.

Indyk's estimator for the L_p norm is based on a property of the median. We find, however, that it is possible to construct estimators based on other quantiles, and they may even outperform the median estimator in terms of estimation accuracy. However, since the improvement is marginal for our parameter settings, we stick to the median estimator.

3.3 Error analysis for the L_p norm estimator

In this section we analyze the performance of the estimator for ||S||_p in (1).

3.3.1 (ε, δ) bound for p = 1

Here we basically restate Lemma 2 and Theorem 3 from Indyk [13] with the constants spelled out. We arrived at this by using Chernoff bounds to derive the constant in his Claim 2.

Theorem 1. (Indyk, [13]) Let X⃗ = (X_1, ..., X_l) be i.i.d. samples from S(1), l = 8(ln 2 + ln(1/δ))/ε², and ε < 0.2. Then DMed_1 = 1, and Pr[median(|X_1|, ..., |X_l|) ∈ [1 − ε, 1 + ε]] > 1 − δ. Thus (1) gives an (ε, δ) estimator for p = 1.

Example: for p = 1, δ = 0.05, ε = 0.1, we get l = 2951. This is a very loose bound in the sense that we need a very large l. This motivates us to resort to the asymptotic normality of the median for some approximate analysis.

3.3.2 Asymptotic (ε, δ) bound for p ∈ (0, 2]

Theorem 2. Let f = f_{S^+(p)}, m = DMed_p, and l = (z_{δ/2} / (2 m f(m) ε))². Then (1) gives an estimator with an asymptotic (ε, δ) bound. Here z_a is the number such that for the standard normal distribution Z we have Pr[Z > z_a] = a.

The proof is in the Appendix. This result is of the same order O(1/ε²) as the Chernoff result, but the coefficient is much smaller, as shown below.

Example: For p = 1, δ = 0.05, ε = 0.1, we get m = 1, f(m) = 1/π, z_{δ/2} ≈ 2, and l = 986. Compare with l = 2951 from the previous section. For p = 1.05, we get m = 0.9938, f(m) = 0.3324, l = 916. For p = 0.95, we get m = 1.0078, f(m) = 0.3030, l = 1072. Our simulations show that these are quite accurate bounds.

We can see that m f(m) does not change much in a small neighborhood of p = 1. Since we are only interested in p in a small neighborhood of 1, for rough arguments we may use m f(m) at p = 1, which is 1/π.

4. SINGLE NODE ALGORITHM

In this section, we show how our sketch works for estimating the entropy of the traffic stream on a single link; estimation of OD flow entropy based on its intersection measurable property (IMP) is the topic of the next section.
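The median estimator for p = 1 is easy to simulate, since the 1-stable distribution is the standard Cauchy and DMed_1 = 1. The sketch below uses fresh i.i.d. Cauchy draws per counter rather than the paper's stable hash functions (equivalent for a single stream), with hypothetical flow sizes:

```python
import math
import random

random.seed(7)

def cauchy():
    # Standard Cauchy = 1-stable distribution, via inverse-CDF sampling.
    return math.tan(math.pi * (random.random() - 0.5))

def l1_sketch(sizes, l):
    # Y_j = sum_i a_i * X_ij with X_ij i.i.d. 1-stable.
    return [sum(a * cauchy() for a in sizes) for _ in range(l)]

def estimate_l1(sk):
    ys = sorted(abs(y) for y in sk)
    # Divide by DMed_1 = 1: the median of |Cauchy| is tan(pi/4) = 1.
    return ys[len(ys) // 2]

sizes = [5, 17, 100, 3, 250, 42]   # hypothetical flow sizes; ||S||_1 = 417
est = estimate_l1(l1_sketch(sizes, 2000))
```

With l = 2000, Theorem 2 suggests a relative standard deviation of about π/(2·√2000) ≈ 3.5%, so the estimate lands close to the true norm 417.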
We first show how to approximate the function x ln(x) by a linear combination of at most two functions of the form x^p, p ∈ (0, 2]. After that we analyze the combined error of this approximation and Indyk's algorithm.

4.1 Approximating x ln x

Our algorithm computes the entropy of a stream of flows by approximating the entropy function x ln x with a linear combination of expressions x^p, p ∈ (0, 2]. In this section we demonstrate how to do this approximation up to an arbitrary relative error ε. To make the formulas simpler we use the natural logarithm ln x instead of log_2 x, noting that changing the base is simply a matter of multiplying by the appropriate constant, and thus has no effect on relative error.

Theorem 3. For any N > 1 and ε > 0, there exists α ∈ (0, 1), c = 1/(2α) ∈ O(ln N / √ε), such that f(x) = c(x^{1+α} − x^{1−α}) approximates the entropy function x ln x for x ∈ (1, N] within relative error bound ε, i.e., |(f(x) − x ln x)/(x ln x)| ≤ ε.

Proof. Using the Taylor expansion

    x^α = e^{α ln x} = 1 + α ln x + (α ln x)²/2! + (α ln x)³/3! + ···,

we get that

    f(x) = x ln x + α² x ln³ x / 3! + α⁴ x ln⁵ x / 5! + ···.

Rewriting in terms of the relative error, we get

    r(x, α) ≡ f(x)/(x ln x) − 1 = (α ln x)²/3! + (α ln x)⁴/5! + ··· = Σ_{k=1}^∞ (α ln x)^{2k} / (2k + 1)!.

Since every term is positive, we have r(x, α) ≥ 0. We assume that α < 1/ln N. This gives us

    r(x, α) ≤ (1/6) Σ_{k=1}^∞ (α ln x)^{2k} = (1/6) · (α ln x)² / (1 − (α ln x)²).   (2)

The bound takes its maximum value at x = N. Solving (1/6) · (α ln N)² / (1 − (α ln N)²) = ε gives us α = √(6ε/(1 + 6ε)) / ln N, and c = 1/(2α) = ln N / (2√(6ε/(1 + 6ε))) ∈ O(ln N / √ε). Therefore f(x) = (1/(2α))(x^{1+α} − x^{1−α}) approximates x ln x within the relative error bound ε.

A plot of this approximation for the range [1, 1000] and f(x) = 10(x^{1.05} − x^{0.95}) is given in Figure 1.
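The constructive step of Theorem 3 (pick α from N and the target error, then set c = 1/(2α)) can be verified directly; a minimal numeric check:

```python
import math

def alpha_c(N, eps):
    """Parameters from Theorem 3: alpha = sqrt(6e/(1+6e))/ln N, c = 1/(2a)."""
    alpha = math.sqrt(6 * eps / (1 + 6 * eps)) / math.log(N)
    return alpha, 1 / (2 * alpha)

N, eps = 1000, 0.02
alpha, c = alpha_c(N, eps)

def rel_err(x):
    return c * (x ** (1 + alpha) - x ** (1 - alpha)) / (x * math.log(x)) - 1

# By the theorem, 0 <= rel_err(x) <= eps on (1, N].
worst = max(rel_err(x) for x in range(2, N + 1))
```

With N = 1000 and ε = 0.02 this yields α ≈ 0.047, close to the α = 0.05 used throughout the examples.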
The relative error guarantee of the approximation only holds for values less than some constant N. As we have mentioned, we will use an elephant detection mechanism to circumvent this shortcoming.

4.2 Estimating the entropy norm ||S||_H

Now we combine our approximation formula and Indyk's algorithm to get an estimator for the entropy norm ||S||_H. Suppose we have chosen α and c in Theorem 3 to get relative error bound ε_0 on [1, N], and we have chosen l from Theorem 2 for p = 1 ± α to achieve an asymptotic (ε, δ) error bound. For p = 1 + α we have sketches Y⃗, and for p = 1 − α we have sketches Z⃗. Our estimator for ||S||_H is

    (1/(2α)) (Λ(Y⃗)^{1+α} − Λ(Z⃗)^{1−α}).   (3)

We now study the error of this estimator. We will use some approximations to get rough but simple error estimates. The proofs are in the Appendix.

Proposition 4. Assume a_i ≤ N. Then (3) estimates ||S||_H within relative error roughly 2λcε + λ_0 ε_0 with probability roughly 1 − 2δ, where c = 1/(2α), λ_0 = |c(||S||_{1+α}^{1+α} − ||S||_{1−α}^{1−α})/||S||_H − 1| / ε_0 ≤ 1, and λ = ||S||_{1+α}^{1+α} / ||S||_H; typically λ < 1.

Example: For N = 1024, α = 0.05, ε = 0.001, δ = 0.05, we get ε_0 = 0.023 and l ≈ 10^5. If we only assume λ = λ_0 = 1, then 2cε + ε_0 ≈ 0.04, i.e., we can approximate ||S||_H within 4% error with 90% probability using 10^5 samples. If we assume λ = 0.5, then we can afford to increase ε to 0.002, and thus decrease l to ≈ 2.5 × 10^4, and still achieve the 4% error.

Proposition 5. (More aggressive): Under the same assumptions as above, (3) estimates ||S||_H within relative error roughly √2·cλε + λ_0 ε_0 with probability roughly 1 − δ.

Example: Under the same assumptions as the example above, we only need l ≈ 1.25 × 10^4 to achieve the 4% error with probability 95%.

4.3 Estimating the L_1 norm s

Recall that to compute the actual entropy H = log_2 s − (1/s) Σ_i a_i log_2 a_i, we need to know the total volume of the traffic s, i.e., ||S||_1. For a single stream this is trivial to do with a single counter. But our ultimate goal is to calculate the entropy of every OD pair, so we would need to compute the entire traffic matrix for the network. There has been considerable previous work on this problem, but any additional method would bring the corresponding overhead associated with it.
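Setting the sketching noise aside (i.e., plugging the exact norms into estimator (3) in place of Λ(Y⃗)^{1+α} and Λ(Z⃗)^{1−α}), only the approximation error ε_0 of Theorem 3 remains. A deterministic check with made-up flow sizes:

```python
import math

alpha, c = 0.05, 10.0                 # c = 1/(2*alpha)
sizes = [1, 4, 9, 250, 1000, 33, 7]   # hypothetical; all sizes <= N

def norm_p(p):
    return sum(a ** p for a in sizes)  # exact ||S||_p^p

entropy_norm = sum(a * math.log(a) for a in sizes)

# Estimator (3) with exact norms substituted for the sketch estimates.
est_h = c * (norm_p(1 + alpha) - norm_p(1 - alpha))
rel = abs(est_h / entropy_norm - 1)   # bounded by eps_0 (~2.3% here)
```

The residual error here is exactly the λ_0 ε_0 term of Proposition 4; the 2λcε term would come on top once the norms are themselves estimated from sketches.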
In this section we show how to use the same sketch data structure that we have been maintaining so far to approximate this value for a single stream; this will be naturally extended to the distributed case later.

Similarly to the proof of Theorem 3, we can easily show the following theorem:

Theorem 6. Let α, N, and ε be as in Theorem 3. Then the approximation (x^{1+α} + x^{1−α})/2 approximates the function f(x) = x with relative error at most 3ε.

This approximation holds for all counts in the range [1, N], and we use the elephant sketch to capture all flows with size strictly greater than N. So our estimator for ||S||_1 is

    (1/2) (Λ(Y⃗)^{1+α} + Λ(Z⃗)^{1−α}).   (4)

Using Theorem 6 and proofs similar to those of Propositions 4 and 5, we have the following:

Proposition 7. (4) estimates ||S||_1 roughly within relative error bound ε + 3λ'_0 ε_0 with probability 1 − 2δ, where λ'_0 = |(||S||_{1+α}^{1+α} + ||S||_{1−α}^{1−α}) / (2||S||_1) − 1| / (3ε_0) ≤ 1. Or, more aggressively, the error bound is roughly (1/√2)ε + 3λ'_0 ε_0 with probability 1 − δ.

4.4 Separating elephants

Recall that we need to keep track of the elephant flows (say those that have more than 1000 packets) and estimate their contributions to the total entropy separately. In our scheme, we adopt the "sample and hold" algorithm proposed in [10] due to its low computational complexity and ease of analysis. The "sample and hold" algorithm produces a list of flows that includes all elephant flows with very high probability. Then, for each elephant flow in the list, we subtract the increments it caused to the sketches, and compute its contribution to the entropy separately. The algorithm also has the nice property that the larger the size of a flow, the smaller (actually exponentially smaller) the probability of its missing from the list.
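The core of "sample and hold" [10] is simple: once a flow's packet is sampled, the flow is "held" and every subsequent packet of that flow is counted exactly. A minimal sketch of the idea (the trace and flow names are made up; this is not the paper's tuned implementation):

```python
import random

random.seed(1)

def sample_and_hold(packets, p):
    # Per packet: if the flow is already held, count it exactly;
    # otherwise start holding it with probability p.
    table = {}
    for flow in packets:
        if flow in table:
            table[flow] += 1
        elif random.random() < p:
            table[flow] = 1
    return table

# One elephant ("E", 1000 packets) among 200 single-packet mice.
packets = ["E"] * 1000 + [f"m{i}" for i in range(200)]
random.shuffle(packets)
held = sample_and_hold(packets, p=0.05)
# P[elephant missed entirely] = (1 - 0.05)**1000, vanishingly small.
```

The exponential miss probability in the flow size is visible directly: a flow of size n is missed only if all n of its packets fail the coin flip, i.e., with probability (1 − p)^n.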
This property works very well with the fact that the accuracy of our approximation of x ln(x) degrades only gradually beyond the target threshold (say 1000 packets).

5. DISTRIBUTED ALGORITHM

In this section we show the IMP property of the L_p sketch, i.e., that we can use the sketches at ingress and egress nodes to estimate the L_p norms of the OD flows.

For a given OD pair, we require only the sketches at that ingress and egress node. Hence, we fix one such pair and do all the analysis for it. We call the ingress stream O and the egress stream D. We are interested in the flows in the set O ∩ D. Note that if we can estimate the pair of L_p norms for O ∩ D, p = 1 ± α, then we can use the formula for approximating the entropy function as before.

The sketch data structures will be the same at every ingress and egress node; that is, they will use the same number of counters l, and they will use the same set of p-stable hash functions as defined in Section 3.2. After we introduce bucketing in Section 6, they will also use the same number of buckets k and the same uniform hash function.

5.1 Computing the OD flow L_p norm ||O ∩ D||_p

With a slight abuse of notation, we will use O⃗ to denote the sketch at the ingress node, and D⃗ to denote the sketch at the egress node. O⃗ + D⃗ and O⃗ − D⃗ are the componentwise sum and difference of O⃗ and D⃗, respectively.
(This is possible because all nodes use the same value of l.) Our estimator for ||O ∩ D||_p, denoted ||O ∩ D||̂_p, can be either

  Λ(⃗O, ⃗D) ≡ [ (Λ(⃗O)^p + Λ(⃗D)^p − Λ(⃗O − ⃗D)^p) / 2 ]^{1/p}   (5)

or

  Λ′(⃗O, ⃗D) ≡ [ (Λ(⃗O + ⃗D)^p − Λ(⃗O − ⃗D)^p) / 2^p ]^{1/p}.   (6)

5.2 Correctness

In this section we show that the two formulae above are good estimators of ||O ∩ D||_p. Hence, by running two copies of the above algorithm, one for p_1 = 1 − α and one for p_2 = 1 + α, together with our x ln x approximation formula, we can obtain an approximation of the entropy between every pair of ingress and egress nodes.

We partition the flows that enter through the ingress node or exit through the egress node as follows:

A = O − D = flows that enter at the ingress but do not exit through the egress;
B = O ∩ D = flows that enter at the ingress and exit through the egress;
C = D − O = flows that do not enter at the ingress but exit through the egress.

We know that Λ(⃗O) is an estimator of ||O||_p, so Λ(⃗O)^p is an estimator of ||O||_p^p = ||A ∪ B||_p^p = ||A||_p^p + ||B||_p^p. Similarly, Λ(⃗D)^p is an estimator of ||B||_p^p + ||C||_p^p.

The sketch ⃗O + ⃗D holds the contributions of all the flows in A, B, and C, but with every packet from B contributing twice. We use B^(2) to denote the flows in B with packet counts doubled. Then Λ(⃗O + ⃗D)^p is an estimator of ||A ∪ B^(2) ∪ C||_p^p = ||A||_p^p + 2^p ||B||_p^p + ||C||_p^p. It is important to point out that the reasoning here (and in the next paragraph) depends on the ingress and egress nodes using the same sketch settings, as noted at the beginning of this section.

The sketch ⃗O − ⃗D exactly cancels the contributions of the flows in B, leaving the contributions of the flows in A and the negatives of the contributions of the flows in C. We use C^(−1) to denote the flows in C with packet counts multiplied by −1. Then Λ(⃗O − ⃗D)^p is an estimator of ||A ∪ C^(−1)||_p^p = ||A||_p^p + ||C^(−1)||_p^p = ||A||_p^p + ||C||_p^p.

To sum up:

  Λ(⃗O)^p      estimates  ||A||_p^p + ||B||_p^p
  Λ(⃗D)^p      estimates  ||B||_p^p + ||C||_p^p
  Λ(⃗O + ⃗D)^p  estimates  ||A||_p^p + 2^p ||B||_p^p + ||C||_p^p
  Λ(⃗O − ⃗D)^p  estimates  ||A||_p^p + ||C||_p^p.

It is easy to see from these formulae that both Formula (5) and Formula (6) are reasonable estimators of ||B||_p, i.e., ||O ∩ D||_p.

Proposition 8. Suppose we have chosen l so that (1) is roughly an (ɛ, δ) estimator.
Suppose ||O ∩ D||_p^p = r_1 ||O||_p^p and ||O ∩ D||_p^p = r_2 ||D||_p^p. Then (5) raised to the power p estimates ||O ∩ D||_p^p roughly within relative error bound (1/r_1 + 1/r_2 − 1)ɛ with probability at least 1 − 3δ. Similarly, (6) raised to the power p estimates ||O ∩ D||_p^p roughly within relative error bound 2^{1−p}(1/r_1 + 1/r_2 + 2^{p−1} − 2)ɛ ≈ (1/r_1 + 1/r_2 − 1)ɛ with probability at least 1 − 2δ.

We omit the proof, since it is similar to the previous proofs. This gives a loose upper bound on the relative error. The ratios r_1 and r_2 are related to, but not identical to, the ratios of the OD flow traffic to the total traffic at the ingress and egress points. We point out that we cannot pursue a more aggressive claim similar to Proposition 5, because we cannot claim independence of Λ(⃗O) and Λ(⃗O + ⃗D), etc.

Now, to calculate the OD flow entropy, we just need to replace Λ(⃗Y) in (1) and (4) with Λ(⃗O, ⃗D), where ⃗O and ⃗D are L_{1+α} sketches, and similarly for Λ(⃗Z).

6. USING BUCKETS

Our earlier examples showed that l, the number of counters in the L_p sketch, needs to be on the order of many thousands to achieve high estimation accuracy. Recall that each incoming packet triggers increments to all l counters for estimating one L_p norm, and our algorithm requires that two different L_p norms be computed. Such a large l is unacceptable for networking applications, however: on high-speed links, where each packet leaves only tens of nanoseconds of processing time, it is impossible to increment that many counters per packet, even if they are all in fast SRAM.

We resolve this problem by adopting the standard methodology of bucketing [9], shown in Algorithm 2. With bucketing, the sketch data structure becomes a two-dimensional array M[1..k][1..l].
For each incoming packet, we hash its flow label using a uniform hash function uh; the result uh(pkt.id) is the index of the bucket at which the packet is processed. We then increment the counters M[uh(pkt.id)][1...l] as in Algorithm 1. Finally, we add up the L_p estimates from all the buckets to obtain our final estimate.

Algorithm 2: Computing the L_p norm with bucketing
 1: Pre-processing stage
 2:   Initialize a sketch M[1...k][1...l] to all zeroes
 3:   Fix l p-stable hash functions sh_1 through sh_l
 4:   Let the hash function uh map flow labels to {1, ..., k}
 5:   Calculate the expected sample median EMed_{p,l} by simulation
 6: Online stage
 7:   for each incoming packet pkt do
 8:     for j := 1 to l do
 9:       M[uh(pkt.id)][j] += sh_j(pkt.id)
10: Offline stage
11:   Return [ Σ_{i=1}^{k} ( median(|M[i][1]|, ..., |M[i][l]|) / EMed_{p,l} )^p ]^{1/p}

In the following, we show that, roughly speaking, bucketing (i.e., using k buckets instead of 1) reduces the standard deviation of our L_p norm estimates by a factor slightly less than √k, provided that the number of flows is much larger than the number of buckets k. Lemma 9 shows that the standard deviation when using l counters is on the order of O(1/√l); therefore, a decrease in l can be compensated by an increase in k by a slightly larger factor. In our proposed implementation (described in Section 7), the number of buckets k is typically on the order of 10,000. Such a large number of buckets allows l to shrink to a small number such as 20 while achieving the same (or even better) estimation accuracy. We will show shortly that, even on very high-speed links (say, 10M packets per second), a few tens of memory accesses per packet can be accommodated.

We use B_i to denote the set of flows hashed to the i-th bucket, and ||B_i||_p the L_p norm of the flows in the i-th bucket, defined just as ||S||_p was before. Obviously ||S||_p^p = Σ_i ||B_i||_p^p. Let M_i be the i-th row vector of the sketch. We know that Λ(M_i) as defined in (1) is an estimator of ||B_i||_p, so the natural estimator for ||S||_p is

  Λ(M) ≡ [ Σ_{i=1}^{k} Λ(M_i)^p ]^{1/p}.   (7)

Consider first the ideal case in which the flows are evenly distributed into the buckets and all the ||B_i||_p^p are equal, so that ||S||_p^p = k ||B_i||_p^p, and take p = 1. Lemma 9 tells us that the estimator Λ(M_i) is roughly Gaussian with mean ||B_i||_1 and standard deviation (1/(2mf(m)√l)) ||B_i||_1. By the Central Limit Theorem, Λ(M) = Σ_{i=1}^{k} Λ(M_i) is asymptotically Gaussian, and its standard deviation is roughly √k (1/(2mf(m)√l)) ||B_1||_1 = (1/(2mf(m)√(lk))) ||S||_1. If we did not use buckets and simply used estimator (1), the standard deviation would be (1/(2mf(m)√l)) ||S||_1. So lk plays the same role that l played in the previous analysis; in other words, k buckets reduce the standard error roughly by a factor of √k.

When p = 1 ± α lies in a small neighborhood of 1, we reach the same conclusion by the same informal argument used in the proof of Proposition 5: a Gaussian raised to the power p is still roughly Gaussian.

In reality, the flows will not be evenly distributed across the buckets. However, when the number of flows is far larger than the number of buckets, which will be the case for our parameter settings and intended workload, we can prove that the factor of error reduction is only slightly less than √k. We omit the proof here due to lack of space.^4 When k is increased to the same order as the number of flows, however, the factor of error reduction no longer scales as √k, since (a) many empty buckets contribute nothing to the reduction of the estimation error, and (b) the distribution of flows into buckets becomes more and more uneven. Therefore, when l is fixed, the estimation error cannot be driven arbitrarily close to 0 by increasing k arbitrarily.

Another issue is the bias of the median estimator (1): the expected value of the sample median of l samples is not equal to the distribution median DMed_p. When we are not using buckets, asymptotic normality implies that the bias is much smaller than the standard error, so we could ignore the issue. Now that we use k buckets to reduce the standard error by a factor of √k, the bias becomes significant. Let EMed_{p,l} denote the expected value of the median of l samples from the distribution S^+(p). We redefine our estimator for ||B_i||_p as

  Λ̃(M_i) ≡ median(|M[i][1]|, ..., |M[i][l]|) / EMed_{p,l}.   (8)

This replaces Λ(M_i) in (7) and gives our estimator with buckets. Note that (8) is an unbiased estimator, but (7) may still be biased.

Let M and N be bucketed L_p sketches at ingress node O and egress node D respectively, with both sketches using the same settings. We can replace ⃗O and ⃗D in (5) and (6) with M and N, and it is easy to repeat the arguments there to show that these are reasonable estimators of the OD-flow L_p norm.

EMed_{p,l} can be calculated numerically using the p.d.f. formula for order statistics when f_{S(p)} has a closed form, or it can be derived via simulation. We can similarly compute VMed_{p,l}, the variance of the sample median of l samples.

Example: For p = 1 and l = 20, we get EMed_{1,20} = 1.069 and VMed_{1,20} = 0.149, so the standard deviation is 0.386. The distribution median is DMed_1 = 1, so the bias 0.069 is much smaller than the standard deviation 0.386. Also, the asymptotic standard deviation given by Lemma 9 is 0.351, which is close to the actual value of 0.386.

^4 In fact, even stating rigorously the theorem we would like to prove requires more space than we have here.

7. ALGORITHM IMPLEMENTATION

Our data streaming algorithm is designed to work at high link speeds of up to 10 million packets per second using commodity CPU/memory at a reasonable cost. In this section, we explain how the various components of the algorithm are implemented to achieve this design objective.

Recall that our algorithm keeps two sub-sketches, for estimating the L_{1+α} and L_{1−α} norms respectively, each consisting of k × l real-valued counters. We set the number of counters per bucket l to 20 in our proposed implementation.
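Algorithm 2 can be prototyped in a few lines. The sketch below is a toy simulation, not the paper's implementation: it fixes p = 1 (the Cauchy case, where a 1-stable variable is simply tan of a uniform angle), uses small bucket/counter counts, and builds the hash functions uh and sh_j from seeded PRNGs rather than the paper's lookup tables; the flow sizes are synthetic.

```python
import math
import random
import statistics

K, L = 256, 20   # buckets x counters per bucket (toy sizes; the paper uses k ~ 50,000, l = 20)
EMED = 1.069     # EMed_{1,20}: expected sample median of 20 |Cauchy| values (paper's example)

def uh(flow):    # uniform hash of a flow id into a bucket (multiplier decorrelates it from sh)
    return random.Random(flow * 2654435761 % (1 << 31)).randrange(K)

def sh(j, flow): # 1-stable "hash": the same (j, flow) always yields the same Cauchy value
    rng = random.Random(j * 1_000_003 + flow)
    return math.tan(rng.uniform(-math.pi / 2, math.pi / 2))

# online stage: fold every flow's packet count into its bucket's L counters
M = [[0.0] * L for _ in range(K)]
flows = {f: 1 + (f * 37) % 50 for f in range(5000)}   # synthetic sizes; #flows >> K
for f, cnt in flows.items():
    b = uh(f)
    for j in range(L):
        M[b][j] += cnt * sh(j, f)

# offline stage: estimator (7) with the bias-corrected per-bucket estimator (8), p = 1
est = sum(statistics.median(abs(c) for c in row) / EMED for row in M)
true_l1 = sum(flows.values())
print(est / true_l1)   # close to 1
```

Dividing each per-bucket median by EMED is exactly the bias correction of (8); without it, summing k biased medians would inflate the estimate by about 7% in this p = 1, l = 20 configuration.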
Since the sketches are implemented using inexpensive high-throughput DRAM (explained next), we can allow the number of buckets k to be very large (say, up to millions).

As shown in Algorithm 2, for each incoming packet we need to increment l = 20 counters per sketch, and we need to do this for two sub-sketches. We use single-precision floating-point counters (4 bytes each) to minimize memory I/O (space is not an issue), as 7 decimal digits of precision is accurate enough for our computations. This amounts to 320 bytes of memory reads and writes, since each counter increment involves one memory read and one memory write. We will show next that generating realizations of p-stable distributions from two precomputed tables, in order to compute sh, involves another 320 bytes of memory reads. In total, each incoming packet triggers 640 bytes of memory reads/writes. Nevertheless, we will show that, implemented using commodity RDRAM DIMM 6400, our sketch can deliver a throughput of 10 million packets per second.

Our sketches can be implemented using RDRAM DIMM 6400 (named after its 6400 MB/s sustained throughput for burst accesses), except for the elephant detection module, which is implemented using a small amount of SRAM in the same way as suggested in [10]. RDRAM delivers very high read/write throughput in burst mode (a series of accesses to consecutive memory locations). Since the 640 bytes of memory accesses triggered by each incoming packet consist of 4 large contiguous blocks of 160 bytes each, we can fully take advantage of the 6400 MB/s throughput provided by RDRAM DIMM 6400.
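The memory-bandwidth arithmetic above can be checked directly (here MB means 10^6 bytes, matching the DIMM's 6400 MB/s rating; variable names are ours):

```python
l = 20            # counters touched per packet, per sub-sketch
counter_bytes = 4 # single-precision counters
sub_sketches = 2  # one sketch each for the L_{1+alpha} and L_{1-alpha} norms

# each of the 2*l counter increments is one read followed by one write
update_bytes = sub_sketches * l * counter_bytes * 2   # 320 bytes
# each sh_j value needs one 4-byte read from T1 and one from T2
table_bytes = sub_sketches * l * 2 * counter_bytes    # 320 bytes
per_packet = update_bytes + table_bytes               # 640 bytes

pps = 10_000_000
print(per_packet, per_packet * pps / 1e6, "MB/s")     # 640 6400.0 MB/s
```

At 10M packets per second the design sits exactly at the DIMM's rated burst throughput, which is why the accesses must be laid out as contiguous 160-byte blocks.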
Implementing our sketch in DRAM, we never need to worry about memory consumption: even if we need millions of buckets in the future (we use tens of thousands right now), we consume only hundreds of MB; for comparison, the retail price of a commodity 2 GB RDRAM DIMM 6400 module is about $300.

Next we describe how to implement the "magic" stable hash functions sh_1, ..., sh_l used in Algorithm 2. The standard methodology for generating random variables with stable distribution S(p) is the following simulation formula [6]:

  X = [ sin(pθ) / cos^{1/p}θ · (cos(θ(1 − p)))^{1/p − 1} ] · [ 1/(−ln r) ]^{1/p − 1},   (9)

where θ is chosen uniformly in [−π/2, π/2] and r is chosen uniformly in [0, 1].

One possible way to implement these stable hash functions sh_j, j = 1, ..., l, is as follows. To implement each sh_j, we fix two uniform hash functions uh_{j1} and uh_{j2} that map a flow identifier pkt.id to a θ value uniformly distributed in [−π/2, π/2] and an r value uniformly distributed in [0, 1], respectively, and then plug these two values into the formula above. However, computing Formula (9) takes thousands of CPU cycles, and it is not possible to perform 40 such computations for each incoming packet.

Our solution for speeding up the computation of these stable hash functions sh_j is to perform memory lookups into precomputed tables (also in RDRAM DIMM 6400), as follows. Note that on the right-hand side of (9), the term in the first bracket is a function of θ only, and the term in the second bracket is a function of r only. To implement each sh_j, we now fix two uniform hash functions uh_{j1} and uh_{j2} that map a flow identifier pkt.id to two index values uniformly distributed in [1..N_1] and [1..N_2], respectively. We allocate two lookup tables T_1 and T_2 containing N_1 and N_2 entries respectively, where each entry (in both T_1 and T_2) consists of l = 20 blocks of 4 bytes each. We then precompute N_1 × 20 i.i.d. random variables distributed as the term in the first bracket and fill them into T_1, and N_2 × 20 i.i.d. random variables distributed as the term in the second bracket and fill them into T_2. For each incoming packet, we simply return the l = 20 values T_1[uh_{j1}(pkt.id)][j] × T_2[uh_{j2}(pkt.id)][j], j = 1, ..., 20, as the computation results for sh_1(pkt.id), sh_2(pkt.id), ..., sh_l(pkt.id). Since each sub-sketch requires two tables, we need four tables in total. In our implementation, both N_1 and N_2 are set to fairly large values such as 1M. Our simulation shows that stable distribution values generated this way are indistinguishable from genuine stable distribution values. Note that this implementation is very fast: two memory reads (4 bytes each) and one floating-point multiplication per sh_j(pkt.id).
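As an illustration, the two-table factorization of (9) can be prototyped as follows. The table sizes, hash construction (seeded PRNGs standing in for uh_{j1} and uh_{j2}), and the choice p = 1.05 are toy values for the example, not the production settings.

```python
import math
import random

p = 1.05
N1 = N2 = 1 << 14   # toy table sizes (the paper uses ~1M entries)

def bracket1(theta):  # first bracket of (9): a function of theta only
    return (math.sin(p * theta) / math.cos(theta) ** (1.0 / p)) * \
           math.cos(theta * (1.0 - p)) ** (1.0 / p - 1.0)

def bracket2(r):      # second bracket of (9): a function of r only
    return (1.0 / -math.log(r)) ** (1.0 / p - 1.0)

rng = random.Random(0)
T1 = [bracket1(rng.uniform(-math.pi / 2, math.pi / 2)) for _ in range(N1)]
T2 = [bracket2(max(rng.random(), 1e-12)) for _ in range(N2)]  # keep r > 0

def sh(j, flow):
    # product of one T1 entry and one T2 entry is (approximately) S(p)-distributed,
    # because (9) factors into independent theta-only and r-only terms
    i1 = random.Random(2 * j * 1_000_003 + flow).randrange(N1)        # stands in for uh_j1
    i2 = random.Random((2 * j + 1) * 1_000_003 + flow).randrange(N2)  # stands in for uh_j2
    return T1[i1] * T2[i2]

print(sh(0, 12345))   # deterministic: the same (j, flow) always maps to the same value
```

The per-packet cost is exactly what the paper claims: two table lookups and one multiply per sh_j, with all the transcendental work pushed into the precomputation.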
Note also that the index values uh_{j1}(pkt.id) and uh_{j2}(pkt.id) generated for estimating the L_{1+α} norm can be reused for the lookup operations performed in estimating the L_{1−α} norm, since all entries in these four tables are mutually independent.

For our distributed algorithm in Section 5 to work, all nodes must use identical lookup tables and identical uniform hash functions for mapping into those tables. One way to ensure identical lookup tables is to distribute a random value to every ingress and egress node and use it as the seed of each of their (identical) pseudorandom number generators.

8. EVALUATION

In this section, we evaluate the performance of our algorithm using real packet traces obtained from a tier-1 ISP.

8.1 Data Gathering

We deployed a packet monitor on a 1 Gbit/second ingress link from a data center into a large tier-1 ISP backbone network. The data center hosts tens of thousands of Web sites as well as servers providing a wide range of services such as multimedia, mail, etc. The monitored link is one of multiple links connecting this data center to the Internet. All the traffic carried by this link enters the ISP backbone network at the same ingress router. For each observed packet, the monitor captured its IP header fields as well as the UDP/TCP and ICMP information required for the flow definition we considered.

We collected a number of five-minute traces and a one-hour trace in April 2007. We use the routing table dumped at the ingress router to determine the egress router for each packet, thus determining the OD flows to each possible egress router.
Because we do not have packet monitoring capability at the egress routers, we generate synthetic traffic traces at the egress routers containing the corresponding OD flows observed at the ingress router. We can further dictate the flow size distribution at the egress routers; in most cases, we make it match the flow size distribution of the ingress trace.

In the rest of the paper, we mostly use the following two traces to illustrate our results.^5

• Trace 1: A one-hour trace collected at 9:41pm on April 25, 2007. It contains 0.4 billion packets belonging to 1.8 million flows. We chose one egress router such that the traffic between the origin and destination comprised 5% of the total traffic arriving at the ingress router.

• Trace 2: A five-minute trace collected at 10:06pm on April 25, 2007. As in Trace 1, the traffic between the origin and our chosen egress router comprised 5% of the total traffic arriving at the ingress router. The traffic in this trace is purposely chosen to be a subset of the traffic of Trace 1, so that we can directly compare the performance of our algorithm over five-minute and one-hour intervals.

8.2 Experimental Setup

For each trace, we repeat each experiment 1000 times with independently generated sketch data structures and compute the cumulative density function of the relative error. Unless stated otherwise, the parameters used in each experiment were: number of buckets k = 50,000, number of registers in each bucket l = 20, sample-and-hold sampling rate P = 0.001, and one million entries in each lookup table. In our experiments, we also define an elephant flow (whose contribution to the entropy is computed separately) to be any flow with at least N = 1000 packets. We use α = 0.05, which satisfies the constraint α < 1/ln N. Hence, at every ingress and egress point we have a pair of sketches computing the L_{1.05} and L_{0.95} norms of the traffic.

8.3 Experimental Results

Formulae 1 and 2: Recall from Section 5 that we have two formulae for estimating the L_p norms:

• Formula 1: [ (Λ(⃗O)^p + Λ(⃗D)^p − Λ(⃗O − ⃗D)^p) / 2 ]^{1/p}

• Formula 2: [ (Λ(⃗O + ⃗D)^p − Λ(⃗O − ⃗D)^p) / 2^p ]^{1/p}

We first compare the experimental results for these two formulae. The cumulative density plots of the error of our algorithm using the two formulae on Trace 1 are given in Figure 2. We observe that both formulae have reasonably small and comparable error values. This observation also holds on all five-minute traces, and hence we fix and use Formula 1 for the rest of our evaluation.

Varying the number of buckets: We study the effect of varying the number of buckets on the performance of our algorithm. Keeping all other parameters fixed at reasonable values, we varied the number of buckets between k = 5000 and k = 80000, in increasing factors of two. Figure 3 shows the results on Trace 2. We observe that increasing the number of buckets increases the accuracy (as expected), but with diminishing returns.

^5 The results on the other five-minute traces are very similar; we omit them for the sake of brevity.
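To make Formula 1 concrete, here is a toy end-to-end simulation at p = 1 (Cauchy projections, where the distribution median is 1, so no EMed correction is needed). The flow ids, packet counts, counter budget, and seeded-PRNG hash are all invented for the example; they are not the evaluation setup.

```python
import math
import random
import statistics

L = 5001   # counters; deliberately large so the plain median estimator is tight

def sh(j, flow):   # shared 1-stable hash: identical (j, flow) values at ingress and egress
    rng = random.Random(j * 1_000_003 + flow)
    return math.tan(rng.uniform(-math.pi / 2, math.pi / 2))

def sketch(counts):
    return [sum(c * sh(j, f) for f, c in counts.items()) for j in range(L)]

def lam(vec):      # median estimator of the L1 norm (DMed_1 = 1 for the Cauchy case)
    return statistics.median(abs(x) for x in vec)

O = {1: 50, 2: 30, 3: 20}   # ingress flows; flow 1 exits elsewhere
D = {2: 30, 3: 20, 4: 40}   # egress flows; flow 4 entered elsewhere; ||O ∩ D||_1 = 50
sO, sD = sketch(O), sketch(D)
diff = [a - b for a, b in zip(sO, sD)]       # flows in O ∩ D cancel exactly
est = (lam(sO) + lam(sD) - lam(diff)) / 2    # Formula 1 with p = 1
print(est)   # close to 50
```

Because both nodes use the same sh, the O ∩ D contributions cancel exactly in the difference sketch, which is the IMP property the distributed algorithm relies on.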


Figure 2: Comparing Formulae 1 and 2 (Trace 1)
Figure 3: Varying numbers of buckets (Trace 2)
Figure 4: Varying sampling rates (Trace 2)
Figure 5: Comparing Traces 1 and 2
Figure 6: Varying fraction of traffic from ingress
Figure 7: Varying fraction of traffic from egress

Varying the sampling rate: Recall from Section 4 that we separate out the computation for the large (elephant) flows by means of sample and hold. By varying the sampling rate, we can increase or decrease the probability of sampling flows that are above our elephant threshold (i.e., flows of size greater than 1000).
We found that the sampling rate did not affect the performance of our algorithm significantly. Figure 4 shows that, even with a small sampling rate (e.g., 1 in 1000), the elephant detection mechanism allows good overall performance.

Varying the trace length: Figure 5 compares the cumulative density plots of the error for the five-minute Trace 2 and the one-hour Trace 1, which have the same origin and destination nodes and similar traffic distributions. We observe that, even though there is an order-of-magnitude difference in the sizes of these traces, not only does the error remain comparable, but so does its distribution. Our experiments on different trace sizes show that the algorithm is robust to changes in the size of the trace as long as the fraction of cross-traffic is held constant, as discussed next.

Varying cross traffic: We study how the accuracy of our algorithm varies with the fraction of the total traffic (from the source) that the particular OD flow comprises. For OD flows that are very small in comparison to the traffic volumes at the origin (and destination), we expect the performance of our algorithm to degrade, since the variation of the cross traffic begins to dominate the error of our estimator. This is demonstrated in Figures 6 and 7 for various fractions of the ingress and egress traffic (using the average of 100 runs), respectively. The complete c.d.f. for the ingress traffic is given for reference in Figure 8.

Computing the actual entropy: We evaluate the performance of our algorithm in computing the actual entropy (as opposed to the entropy norm) of the OD flows. This computation has additional error, since we need to use our sketch to estimate the total volume of traffic between the source and destination.

Figure 8: Varying fraction of traffic from ingress (curves for 35.5%, 14.5%, 5.3%, 3.3%, 2.5%)
Figure 9: Error distribution for actual entropy

In Figure 9 we observe that the error plot for the entropy is comparable to that for the entropy norm. This confirms that our algorithm is a robust estimator of the entropy of OD flows.

9. RELATED WORK

There has been considerable previous work on computing the traffic matrix of a network [18, 21, 22, 23, 27]. The traffic matrix is simply the matrix of packet (byte) counts between each pair of ingress and egress nodes in the network over some measurement interval. For fine-grained network measurement, we are sometimes interested in the flow matrix [28], which quantifies the volume of traffic in individual OD flows. In this paper we propose computing the entropy of every OD flow, which gives more information than a simple traffic matrix and has considerably less overhead than maintaining the entire flow matrix.

In the last few years, entropy has been suggested as a useful measure for various network-monitoring applications. Lakhina et al. [15] use the entropy measure to perform anomaly detection and network diagnosis. Information measures such as entropy have been suggested for tracking malicious network activity [11, 24]. Entropy has been used by Xu et al. [25] to infer patterns of interesting activity by using it to cluster traffic.
For detecting specific types <strong>of</strong> attacks,researchers have suggested the use <strong>of</strong> entropy <strong>of</strong> differenttraffic features <strong>for</strong> worm [24] and DDoS detection [11]. Recently,it has been shown that entropy-based techniques <strong>for</strong>anomaly detection are much more resistant to the effects <strong>of</strong>sampling [3] than other, volume-based methods.The use <strong>of</strong> stable distributions to produce a sketch wasfirst proposed in [12] to measure the L 1 distance betweentwo streams. This result was generalized to the L p distance<strong>for</strong> all 0 < p ≤ 2 in [8, 13]. This sketch data structure isthe starting point <strong>for</strong> the one that we propose in this paper.The main difference is that we have to make several keymodifications to make it work in practice. In particular, wehave to ensure that the number <strong>of</strong> updates per packet issmall enough to feasibly per<strong>for</strong>m in realtime.Other than <strong>for</strong> p = 1, 2, there is no known closed <strong>for</strong>m<strong>for</strong> the p-stable distribution. To independently draw valuesfrom an arbitrary p-stable distribution, we make use <strong>of</strong> the<strong>for</strong>mula proposed by [6]. Since this <strong>for</strong>mula is expensiveto compute, we create a lookup table to hold several precomputedvalues.In [7] it is suggested that the stable distribution sketchcan be used as a building block to compute empirical entropy,but no methods <strong>for</strong> doing this are suggested. Moreimportantly, there are already known to be algorithms thatwork well <strong>for</strong> the single stream case [16] and we believe thatit is <strong>for</strong> this distributed (i.e., traffic matrix) setting that thestable sketch really shines.10. 
CONCLUSIONIn this paper we motivate the problem <strong>of</strong> estimating theentropy between origin destination pairs in a network andpresent an algorithm <strong>for</strong> solving this problem. Along theway, we present a completely novel scheme <strong>for</strong> estimatingentropy, introduce applications <strong>for</strong> non-standard L p norms,present an extension <strong>of</strong> Indyk’s algorithm, and show howit can be used in our distributed setting. Via simulationon real-world data, collected at a tier-1 ISP, we are able todemonstrate that our algorithm is practically viable.11. REFERENCES[1] A. Chakrabarti, K. Do Ba, and S. Muthukrishnan.<strong>Estimating</strong> entropy and entropy norm on data streams. InSTACS, 2006.[2] L. Bhuvanagiri and S. Ganguly. <strong>Estimating</strong> entropy overdata streams. In ESA, 2006.[3] D. Brauckh<strong>of</strong>f, B. Tellenbach, A. Wagner, M. May, andA. Lakhina. Impact <strong>of</strong> packet sampling on anomalydetection metrics. In IMC, 2006.[4] G. Casella and R. L. Berger. Statistical Inference. Duxbury,2nd edition, 2002.[5] A. Chakrabarti and G. Cormode. A near-optimal algorithm<strong>for</strong> computing the entropy <strong>of</strong> a stream. In S<strong>OD</strong>A, 2007.[6] J. M. Chambers, C. L. Mallows, and B. W. Stuck. Amethod <strong>for</strong> simulating stable random variables. Journal <strong>of</strong>the American Statistical Association, 71(354), 1976.[7] G. Cormode. Stable distributions <strong>for</strong> stream computations:It’s as easy as 0,1,2. In Workshop on Management andProcessing <strong>of</strong> <strong>Data</strong> Streams, 2003.[8] G. Cormode, P. Indyk, N. Koudas, and S. Muthukrishnan.Fast mining <strong>of</strong> massive tabular data via approximatedistance computations. In ICDE, 2002.[9] M. Durand and P. Flajolet. Loglog counting <strong>of</strong> largecardinalities. In ESA, 2003.[10] C. Estan and G. Varghese. 
New Directions in Traffic Measurement and Accounting. In SIGCOMM, Aug. 2002.
[11] L. Feinstein, D. Schnackenberg, R. Balupari, and D. Kindred. Statistical approaches to DDoS attack detection and response. In Proceedings of the DARPA Information Survivability Conference and Exposition, 2003.
[12] P. Indyk. Stable distributions, pseudorandom generators, embeddings and data stream computation. In FOCS, 2000.
[13] P. Indyk. Stable distributions, pseudorandom generators, embeddings, and data stream computation. J. ACM, 53(3):307-323, 2006.


[14] A. Kuzmanovic and E. W. Knightly. Low-rate TCP targeted denial of service attacks (the shrew vs. the mice and elephants). In SIGCOMM, 2003.
[15] A. Lakhina, M. Crovella, and C. Diot. Mining anomalies using traffic feature distributions. In SIGCOMM, 2005.
[16] A. Lall, V. Sekar, M. Ogihara, J. Xu, and H. Zhang. Data streaming algorithms for estimating entropy of network traffic. In SIGMETRICS, 2006.
[17] G. S. Manku and R. Motwani. Approximate frequency counts over data streams. In Proceedings of the 28th International Conference on Very Large Data Bases, 2002.
[18] A. Medina, N. Taft, K. Salamatian, S. Bhattacharyya, and C. Diot. Traffic matrix estimation: existing techniques and new directions. In SIGCOMM, Aug. 2002.
[19] S. Muthukrishnan. Data streams: algorithms and applications. Available at http://athos.rutgers.edu/~muthu/.
[20] J. Nolan. STABLE program. Online at http://academic2.american.edu/~jpnolan/stable/stable.html.
[21] A. Soule, A. Nucci, R. Cruz, E. Leonardi, and N. Taft. How to identify and estimate the largest traffic matrix elements in a dynamic environment. In SIGMETRICS, 2004.

Proof of Theorem 2. From Lemma 9, $\Lambda(\vec{Y})$ is approximately Gaussian with mean $\|S\|_p$ and standard deviation $\frac{1}{2mf(m)\sqrt{l}}\|S\|_p$, so for a standard normal variable $Z$,
$$\Pr\!\left[\bigl|\Lambda(\vec{Y})/\|S\|_p - 1\bigr| < \epsilon\right] \approx \Pr\!\left[\Bigl|\tfrac{1}{2mf(m)\sqrt{l}}\,Z\Bigr| < \epsilon\right] = \Pr\!\left[\Bigl|\tfrac{\epsilon}{z_{\delta/2}}\,Z\Bigr| < \epsilon\right] = \Pr\bigl[|Z| < z_{\delta/2}\bigr] = 1 - \delta.$$

A.2 Proofs of Propositions 4 and 5

Lemma 10. $x^{\alpha}/\ln x$ is a decreasing function on $(1, N]$ if $\alpha < 1/\ln N$. In fact, if $\alpha < 0.085$, then $x^{\alpha}/\ln x < 1$ on $[3, N]$.

Proof. The derivative is negative on $(1, N]$, and $3^{\alpha} < \ln 3$ for $\alpha < 0.085$.

Lemma 11. If $a$ approximates $b$ within relative error bound $\epsilon$, then $a^{1+\alpha}$ approximates $b^{1+\alpha}$ roughly within relative error bound $\epsilon$, for small $\alpha$ and $\epsilon$. Similarly for $1-\alpha$.

Proof. $1 - \epsilon < a/b < 1 + \epsilon$, so $(1-\epsilon)^{1+\alpha} < a^{1+\alpha}/b^{1+\alpha} < (1+\epsilon)^{1+\alpha}$, and for small $\alpha$ and $\epsilon$ we have $(1 \pm \epsilon)^{1+\alpha} \approx 1 \pm \epsilon$.
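The related-work discussion above notes that, outside p = 1, 2, values from a p-stable distribution must be drawn with the simulation formula of Chambers, Mallows, and Stuck [6]. A minimal sketch of that sampling step for the symmetric case (the paper's lookup-table optimization is omitted, and the function name is ours):

```python
import math
import random

def stable_sample(p: float) -> float:
    """Draw one value from a symmetric p-stable distribution (0 < p <= 2)
    via the Chambers-Mallows-Stuck method: combine a uniform angle on
    (-pi/2, pi/2) with an independent unit-rate exponential variable."""
    theta = random.uniform(-math.pi / 2, math.pi / 2)
    w = random.expovariate(1.0)
    return (math.sin(p * theta) / math.cos(theta) ** (1.0 / p)) * \
        (math.cos((1.0 - p) * theta) / w) ** ((1.0 - p) / p)
```

For p = 1 this reduces to tan(theta), i.e., the Cauchy distribution, and for p = 2 it yields a Gaussian. Since every packet triggers many such draws, precomputing a table of values, as the paper does, trades a modest amount of memory for per-packet computation.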
