A Data Streaming Algorithm for Estimating Entropies of OD Flows

has distribution S(p). Note that, due to the symmetry of S(p), DMed_p is exactly the three-quarter quantile of S(p). Although there is no closed form for DMed_p for most values of p, we can calculate it numerically by simulation, or use a program like [20].

Intuitively, the correctness of this estimator can be justified as follows. Since Y_1/||S||_p, ..., Y_l/||S||_p are i.i.d. random variables with distribution S(p), taking absolute values gives us i.i.d. draws from S^+(p). For large enough l, their median should be close to the distribution median of S^+(p). Therefore, we simply divide median(|Y_1|, ..., |Y_l|) by the distribution median of S^+(p) to get an estimator of ||S||_p. We have to take absolute values because the distribution median of S(p) is 0 due to its symmetry. In the next section we analyze the relative error of this estimator.

Indyk's estimator for the L_p norm is based on a property of the median. We find, however, that it is possible to construct estimators based on other quantiles, and they may even outperform the median estimator in terms of estimation accuracy. Since the improvement is marginal for our parameter settings, however, we stick with the median estimator.

3.3 Error analysis for L_p norm estimator

In this section we analyze the performance of the estimator for ||S||_p in (1).

3.3.1 (ε, δ) bound for p = 1

Here we essentially restate Lemma 2 and Theorem 3 from Indyk [13] with the constants spelled out. We arrived at this by using Chernoff bounds to derive the constant in his Claim 2.

Theorem 1. (Indyk [13]) Let X = (X_1, ..., X_l) be i.i.d. samples from S(1), with l = 8(ln 2 + ln(1/δ))/ε² and ε < 0.2. Then DMed_1 = 1, and

    Pr[median(|X_1|, ..., |X_l|) ∈ [1 − ε, 1 + ε]] > 1 − δ.

Thus (1) gives an (ε, δ) estimator for p = 1.

Example: for p = 1, δ = 0.05, ε = 0.1, we get l = 2951. This bound is very loose in the sense that it demands a very large l, which motivates us to resort to the asymptotic normality of the median for some approximate analysis.

3.3.2 Asymptotic (ε, δ) bound for p ∈ (0, 2]

Theorem 2. Let f = f_{S^+(p)}, m = DMed_p, and

    l = (z_{δ/2} / (2 m f(m) ε))².

Then (1) gives an estimator with an asymptotic (ε, δ) bound. Here z_a is the number such that for the standard normal distribution Z we have Pr[Z > z_a] = a.

The proof is in the Appendix. This result is of the same order O(1/ε²) as the Chernoff result, but the coefficient is much smaller, as shown below.

Example: for p = 1, δ = 0.05, ε = 0.1, we get m = 1, f(m) = 1/π, z_{δ/2} = 2, and l = 986. Compare this with l = 2951 from the previous section. For p = 1.05, we get m = 0.9938, f(m) = 0.3324, l = 916. For p = 0.95, we get m = 1.0078, f(m) = 0.3030, l = 1072. Our simulations show that these bounds are quite accurate.

We can see that m·f(m) does not change much in a small neighborhood of p = 1. Since we are only interested in p in a small neighborhood of 1, for rough arguments we may use the value of m·f(m) at p = 1, which is 1/π.

4. SINGLE NODE ALGORITHM

In this section, we show how our sketch works for estimating the entropy of the traffic stream on a single link; estimation of OD flow entropy based on its intersection measurable property (IMP) is the topic of the next section.
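As a concrete illustration of the median estimator and the sample sizes of Theorem 2, the following is a minimal simulation sketch for p = 1, where S(1) is the standard Cauchy distribution and DMed_1 = 1. The flow identifiers and sizes are made-up demo data, and drawing fresh Cauchy variates per flow is an illustrative simplification: a production sketch would derive the variates for each counter from a hash of the flow identifier, so that repeated updates for the same flow reuse the same variates.

```python
import numpy as np

rng = np.random.default_rng(0)
eps, delta = 0.1, 0.05

# Theorem 2 sample size for p = 1: m = DMed_1 = 1, f(m) = 1/pi,
# and z_{delta/2} = 2 (as in the paper's example), giving l = 986.
z = 2.0
l = int((z / (2 * 1.0 * (1 / np.pi) * eps)) ** 2)

# Hypothetical demo traffic: 500 flows with random sizes.
flows = list(enumerate(rng.integers(1, 1000, size=500)))
true_norm = sum(sz for _, sz in flows)          # exact ||S||_1

# Each of the l counters Y_i accumulates sum_j a_j * X_ij with X_ij ~ S(1).
Y = np.zeros(l)
for fid, sz in flows:
    Y += sz * rng.standard_cauchy(l)

# Since DMed_1 = 1, median(|Y_1|, ..., |Y_l|) directly estimates ||S||_1.
est = np.median(np.abs(Y))
print(f"true ||S||_1 = {true_norm}, estimate = {est:.0f}")
```

With l = 986 the asymptotic standard deviation of the relative error is about 1/(2 f(m) sqrt(l)) = pi/(2 sqrt(986)), i.e., roughly 5%, consistent with the (ε, δ) = (0.1, 0.05) target.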
We first show how to approximate the function x ln x by a linear combination of at most two functions of the form x^p, p ∈ (0, 2]. After that we analyze the combined error of this approximation and Indyk's algorithm.

4.1 Approximating x ln x

Our algorithm computes the entropy of a stream of flows by approximating the entropy function x ln x by a linear combination of expressions x^p, p ∈ (0, 2]. In this section we demonstrate how to do this approximation up to an arbitrary relative error ε. To make the formulas simpler we use the natural logarithm ln x instead of log_2 x, noting that changing the base is simply a matter of multiplying by the appropriate constant, thus having no effect on relative error.

Theorem 3. For any N > 1 and ε > 0, there exist α ∈ (0, 1) and c = 1/(2α) ∈ O(ln N / √ε) such that f(x) = c(x^{1+α} − x^{1−α}) approximates the entropy function x ln x for x ∈ (1, N] within relative error bound ε, i.e.,

    |(f(x) − x ln x) / (x ln x)| ≤ ε.

Proof. Using the Taylor expansion

    x^α = e^{α ln x} = 1 + α ln x + (α ln x)²/2! + (α ln x)³/3! + ···,

we get that

    f(x) = x ln x + α² x ln³ x / 3! + α⁴ x ln⁵ x / 5! + ···.

Rewriting in terms of the relative error, we get

    r(x, α) ≡ f(x)/(x ln x) − 1 = (α ln x)²/3! + (α ln x)⁴/5! + ··· = Σ_{k=1}^∞ (α ln x)^{2k} / (2k + 1)!.

Since every term is positive, we have r(x, α) ≥ 0. We assume that α < 1/ln N. This gives us

    r(x, α) ≤ (1/6) Σ_{k=1}^∞ (α ln x)^{2k} = (1/6) · (α ln x)² / (1 − (α ln x)²).    (2)

The bound takes its maximum value at x = N. Solving

    (1/6) · (α ln N)² / (1 − (α ln N)²) = ε

gives us α = √(6ε / (1 + 6ε)) / ln N, and c = 1/(2α) = ln N / (2√(6ε / (1 + 6ε))) ∈ O(ln N / √ε). Therefore f(x) = (1/(2α))(x^{1+α} − x^{1−α}) approximates x ln x within the relative error bound ε.

A plot of this approximation with f(x) = 10(x^{1.05} − x^{0.95}) over the range [1, 1000] is given in Figure 1.
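The closed forms for α and c obtained in the proof above are easy to check numerically. The sketch below (with illustrative values N = 1000 and ε = 0.1) computes α and c and verifies the relative error bound of Theorem 3 on a dense grid over (1, N]:

```python
import numpy as np

# Closed forms from the proof of Theorem 3:
#   alpha = sqrt(6*eps / (1 + 6*eps)) / ln N,   c = 1 / (2 * alpha).
N, eps = 1000.0, 0.1
alpha = np.sqrt(6 * eps / (1 + 6 * eps)) / np.log(N)
c = 1.0 / (2 * alpha)

# Check |f(x) - x ln x| / (x ln x) <= eps on a grid over (1, N].
# We start slightly above 1 to avoid the 0/0 form at x = 1.
x = np.linspace(1.001, N, 200_000)
f = c * (x ** (1 + alpha) - x ** (1 - alpha))
rel_err = np.abs(f - x * np.log(x)) / (x * np.log(x))

print(f"alpha = {alpha:.4f}, c = {c:.4f}, "
      f"max relative error = {rel_err.max():.4f}")
```

Since r(x, α) is increasing in x, the maximum relative error occurs at x = N; the bound (2) is not tight, so the observed maximum comes in well under ε.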
The relative error guarantee of the approximation only holds for values less than some constant N. As we have mentioned, we will use an elephant detection mechanism to circumvent this shortcoming.

4.2 Estimating entropy norm ||S||_H

Now we combine our approximation formula and Indyk's algorithm to get an estimator for the entropy norm ||S||_H. Suppose we have chosen α and c in Theorem 3 to get relative error bound ε_0 on [1, N], and we have chosen l
