A Data Streaming Algorithm for Estimating Entropies of OD Flows

has distribution S(p). Note that, due to the symmetry of S(p), DMed_p is exactly the three-quarter quantile of S(p). Although there is no closed form for DMed_p for most values of p, we can calculate it numerically by simulation, or use a program like [20].

Intuitively, the correctness of this estimator can be justified as follows. Since Y_1/||S||_p, ..., Y_l/||S||_p are i.i.d. random variables with distribution S(p), taking absolute values gives us i.i.d. draws from S^+(p). For large enough l, their median should be close to the distribution median of S^+(p). Therefore, we simply divide median(|Y_1|, ..., |Y_l|) by the distribution median of S^+(p) to get an estimator of ||S||_p. We have to take absolute values because the distribution median of S(p) is 0 due to its symmetry. In the next section we analyze the relative error of this estimator.

Indyk's estimator for the L_p norm is based on a property of the median. We find, however, that it is possible to construct estimators based on other quantiles, and they may even outperform the median estimator in terms of estimation accuracy. Since the improvement is marginal for our parameter settings, however, we stick with the median estimator.

3.3 Error analysis for L_p norm estimator

In this section we analyze the performance of the estimator for ||S||_p in (1).

3.3.1 (ε, δ) bound for p = 1

Here we essentially restate Lemma 2 and Theorem 3 from Indyk [13] with the constants spelled out. We arrived at this by using Chernoff bounds to derive the constant in his Claim 2.

Theorem 1. (Indyk [13]) Let X = (X_1, ..., X_l) be i.i.d. samples from S(1), with l = 8(ln 2 + ln(1/δ))/ε² and ε < 0.2. Then DMed_1 = 1, and

    Pr[median(|X_1|, ..., |X_l|) ∈ [1 − ε, 1 + ε]] > 1 − δ.

Thus (1) gives an (ε, δ) estimator for p = 1.

Example: for p = 1, δ = 0.05, ε = 0.1, we get l = 2951. This bound is very loose in the sense that it demands a very large l, which motivates us to resort to the asymptotic normality of the median for some approximate analysis.

3.3.2 Asymptotic (ε, δ) bound for p ∈ (0, 2]

Theorem 2. Let f = f_{S^+(p)}, m = DMed_p, and

    l = (z_{δ/2} / (2 m f(m) ε))².

Then (1) gives an estimator with an asymptotic (ε, δ) bound. Here z_a is the number such that for the standard normal distribution Z we have Pr[Z > z_a] = a.

The proof is in the Appendix. This result is of the same order O(1/ε²) as the Chernoff result, but the coefficient is much smaller, as shown below.

Example: for p = 1, δ = 0.05, ε = 0.1, we get m = 1, f(m) = 1/π, z_{δ/2} = 2, and l = 986. Compare this with l = 2951 from the previous section. For p = 1.05, we get m = 0.9938, f(m) = 0.3324, l = 916. For p = 0.95, we get m = 1.0078, f(m) = 0.3030, l = 1072. Our simulations show that these bounds are quite accurate.

We can see that m·f(m) does not change much in a small neighborhood of p = 1. Since we are only interested in p in a small neighborhood of 1, for rough arguments we may use the value of m·f(m) at p = 1, which is 1/π.

4. SINGLE NODE ALGORITHM

In this section, we show how our sketch works for estimating the entropy of the traffic stream on a single link; estimation of OD flow entropy based on its intersection measurable property (IMP) is the topic of the next section.
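As a concrete illustration of the median estimator and the sample sizes of Theorem 2, the following is a minimal simulation sketch for p = 1, where S(1) is the standard Cauchy distribution and DMed_1 = 1. The flow identifiers and sizes are made-up demo data, and drawing fresh Cauchy variates per flow is an illustrative simplification: a production sketch would derive the variates for each counter from a hash of the flow identifier, so that repeated updates for the same flow reuse the same variates.

```python
import numpy as np

rng = np.random.default_rng(0)
eps, delta = 0.1, 0.05

# Theorem 2 sample size for p = 1: m = DMed_1 = 1, f(m) = 1/pi,
# and z_{delta/2} = 2 (as in the paper's example), giving l = 986.
z = 2.0
l = int((z / (2 * 1.0 * (1 / np.pi) * eps)) ** 2)

# Hypothetical demo traffic: 500 flows with random sizes.
flows = list(enumerate(rng.integers(1, 1000, size=500)))
true_norm = sum(sz for _, sz in flows)          # exact ||S||_1

# Each of the l counters Y_i accumulates sum_j a_j * X_ij with X_ij ~ S(1).
Y = np.zeros(l)
for fid, sz in flows:
    Y += sz * rng.standard_cauchy(l)

# Since DMed_1 = 1, median(|Y_1|, ..., |Y_l|) directly estimates ||S||_1.
est = np.median(np.abs(Y))
print(f"true ||S||_1 = {true_norm}, estimate = {est:.0f}")
```

With l = 986 the asymptotic standard deviation of the relative error is about 1/(2 f(m) sqrt(l)) = pi/(2 sqrt(986)), i.e., roughly 5%, consistent with the (ε, δ) = (0.1, 0.05) target.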
We first show how to approximate the function x ln x by a linear combination of at most two functions of the form x^p, p ∈ (0, 2]. After that we analyze the combined error of this approximation and Indyk's algorithm.

4.1 Approximating x ln x

Our algorithm computes the entropy of a stream of flows by approximating the entropy function x ln x by a linear combination of expressions x^p, p ∈ (0, 2]. In this section we demonstrate how to do this approximation up to an arbitrary relative error ε. To make the formulas simpler we use the natural logarithm ln x instead of log_2 x, noting that changing the base is simply a matter of multiplying by the appropriate constant, thus having no effect on relative error.

Theorem 3. For any N > 1 and ε > 0, there exist α ∈ (0, 1) and c = 1/(2α) ∈ O(ln N / √ε) such that f(x) = c(x^{1+α} − x^{1−α}) approximates the entropy function x ln x for x ∈ (1, N] within relative error bound ε, i.e.,

    |(f(x) − x ln x) / (x ln x)| ≤ ε.

Proof. Using the Taylor expansion

    x^α = e^{α ln x} = 1 + α ln x + (α ln x)²/2! + (α ln x)³/3! + ···,

we get that

    f(x) = x ln x + α² x ln³ x / 3! + α⁴ x ln⁵ x / 5! + ···.

Rewriting in terms of the relative error, we get

    r(x, α) ≡ f(x)/(x ln x) − 1 = (α ln x)²/3! + (α ln x)⁴/5! + ··· = Σ_{k=1}^∞ (α ln x)^{2k} / (2k + 1)!.

Since every term is positive, we have r(x, α) ≥ 0. We assume that α < 1/ln N. This gives us

    r(x, α) ≤ (1/6) Σ_{k=1}^∞ (α ln x)^{2k} = (1/6) · (α ln x)² / (1 − (α ln x)²).    (2)

The bound takes its maximum value at x = N. Solving

    (1/6) · (α ln N)² / (1 − (α ln N)²) = ε

gives us α = √(6ε / (1 + 6ε)) / ln N, and c = 1/(2α) = ln N / (2√(6ε / (1 + 6ε))) ∈ O(ln N / √ε). Therefore f(x) = (1/(2α))(x^{1+α} − x^{1−α}) approximates x ln x within the relative error bound ε.

A plot of this approximation with f(x) = 10(x^{1.05} − x^{0.95}) over the range [1, 1000] is given in Figure 1.
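The closed forms for α and c obtained in the proof above are easy to check numerically. The sketch below (with illustrative values N = 1000 and ε = 0.1) computes α and c and verifies the relative error bound of Theorem 3 on a dense grid over (1, N]:

```python
import numpy as np

# Closed forms from the proof of Theorem 3:
#   alpha = sqrt(6*eps / (1 + 6*eps)) / ln N,   c = 1 / (2 * alpha).
N, eps = 1000.0, 0.1
alpha = np.sqrt(6 * eps / (1 + 6 * eps)) / np.log(N)
c = 1.0 / (2 * alpha)

# Check |f(x) - x ln x| / (x ln x) <= eps on a grid over (1, N].
# We start slightly above 1 to avoid the 0/0 form at x = 1.
x = np.linspace(1.001, N, 200_000)
f = c * (x ** (1 + alpha) - x ** (1 - alpha))
rel_err = np.abs(f - x * np.log(x)) / (x * np.log(x))

print(f"alpha = {alpha:.4f}, c = {c:.4f}, "
      f"max relative error = {rel_err.max():.4f}")
```

Since r(x, α) is increasing in x, the maximum relative error occurs at x = N; the bound (2) is not tight, so the observed maximum comes in well under ε.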
The relative error guarantee of the approximation only holds for values less than some constant N. As we have mentioned, we will use an elephant detection mechanism to circumvent this shortcoming.

4.2 Estimating entropy norm ||S||_H

Now we combine our approximation formula and Indyk's algorithm to get an estimator for the entropy norm ||S||_H. Suppose we have chosen α and c in Theorem 3 to get relative error bound ε_0 on [1, N], and we have chosen l
