
R. Srikant’s Notes on Modeling and Control of High-Speed Networks

1 Large deviations

Consider a sequence of i.i.d. random variables $\{X_i\}$. The central limit theorem provides an estimate of the probability
$$P\left(\frac{\sum_{i=1}^n X_i - n\mu}{\sigma\sqrt{n}} > x\right),$$
where $\mu = E(X_1)$ and $\sigma^2 = \mathrm{Var}(X_1)$. Thus, the CLT estimates the probability of $O(\sqrt{n})$ deviations from the mean of the sum of the random variables $\{X_i\}_{i=1}^n$. These deviations are small compared to the mean of $\sum_{i=1}^n X_i$, which is an $O(n)$ quantity. On the other hand, large deviations of the order of the mean itself, i.e., $O(n)$ deviations, are the subject of this section.

The simplest large deviations result is the Chernoff bound, which we review here. First, recall the Markov inequality: for a positive random variable $X$,
$$P(X \ge \epsilon) \le \frac{E(X)}{\epsilon}.$$
For $\theta \ge 0$,
$$P\left(\sum_{i=1}^n X_i \ge nx\right) \le P\left(e^{\theta \sum_{i=1}^n X_i} \ge e^{\theta nx}\right) \le \frac{E\left(e^{\theta \sum_{i=1}^n X_i}\right)}{e^{\theta nx}},$$
where the first inequality becomes an equality if $\theta > 0$ and the second inequality follows from the Markov inequality. Since this is true for all $\theta \ge 0$,

$$P\left(\sum_{i=1}^n X_i \ge nx\right) \le \inf_{\theta \ge 0} \frac{E\left(e^{\theta \sum_{i=1}^n X_i}\right)}{e^{\theta nx}} = e^{-n \sup_{\theta \ge 0}\{\theta x - \log M(\theta)\}}, \qquad (1)$$
where $M(\theta) := E(e^{\theta X_1})$ is the moment generating function of $X_1$. The above result is called the Chernoff bound; a quick numerical check of the bound is sketched below. The following theorem quantifies the tightness of this bound.
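As that numerical check, the following Python sketch compares the Chernoff bound $e^{-n\sup_{\theta \ge 0}\{\theta x - \log M(\theta)\}}$ against a Monte Carlo estimate of $P(\sum_{i=1}^n X_i \ge nx)$. The Bernoulli setting and the parameter values ($p = 1/2$, $x = 0.75$, $n = 50$) are illustrative assumptions, not part of the notes.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)

p, x, n = 0.5, 0.75, 50        # illustrative: X_i ~ Bernoulli(p), threshold x > p

# Exponent of the Chernoff bound: sup_theta { theta*x - log M(theta) },
# where M(theta) = 1 - p + p*e^theta for a Bernoulli(p) random variable.
def neg_exponent(theta):
    return -(theta * x - np.log(1 - p + p * np.exp(theta)))

I_x = -minimize_scalar(neg_exponent).fun   # the optimum has theta* > 0 since x > p

# Monte Carlo estimate of P(sum_{i=1}^n X_i >= n*x).
trials = 500_000
sums = rng.binomial(n, p, size=trials)
p_hat = np.mean(sums >= n * x)

print(f"exponent I(x)        = {I_x:.4f}")
print(f"Chernoff bound       = {np.exp(-n * I_x):.3e}")
print(f"Monte Carlo estimate = {p_hat:.3e}")   # lies below the bound
```

The estimate sits below $e^{-nI(x)}$ by a subexponential factor; the theorem that follows makes precise the sense in which the exponent itself is tight.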

Theorem 1 (Cramér-Chernoff Theorem) Let $X_1, X_2, \ldots$ be i.i.d. and suppose that their common moment generating function satisfies $M(\theta) < \infty$ for all $\theta$ in some neighborhood $B_0$ of $\theta = 0$. Further suppose that the supremum in the following definition of the rate function $I(x)$ is attained at some interior point of this neighborhood:
$$I(x) = \sup_{\theta} \{\theta x - \Lambda(\theta)\}, \qquad (2)$$
where $\Lambda(\theta) := \log M(\theta)$ is called the log moment generating function or the cumulant generating function. In other words, we assume that there exists $\theta^* \in \mathrm{Interior}(B_0)$ such that
$$I(x) = \theta^* x - \Lambda(\theta^*).$$
Fix any $x > E(X_1)$. Then, for each $\epsilon > 0$, there exists $N$ such that, for all $n \ge N$,
$$e^{-n(I(x)+\epsilon)} \le P\left(\sum_{i=1}^n X_i \ge nx\right) \le e^{-nI(x)}. \qquad (3)$$



(Note: another way to state the result of the above theorem is
$$\lim_{n \to \infty} \frac{1}{n} \log P\left(\sum_{i=1}^n X_i \ge nx\right) = -I(x), \qquad (4)$$
which expresses logarithmic equivalence.)
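As a concrete illustration (the Gaussian choice is ours, not from the notes), take $X_i \sim N(\mu, \sigma^2)$. Then
$$\Lambda(\theta) = \theta\mu + \frac{\sigma^2\theta^2}{2}, \qquad \theta^* = \frac{x - \mu}{\sigma^2}, \qquad I(x) = \sup_{\theta}\{\theta x - \Lambda(\theta)\} = \frac{(x-\mu)^2}{2\sigma^2},$$
so (4) states that $P(\sum_{i=1}^n X_i \ge nx)$ decays like $e^{-n(x-\mu)^2/(2\sigma^2)}$ up to subexponential factors.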

Proof: We first prove the upper bound and then the lower bound.

The upper bound in (3): This follows from the Chernoff bound if we show that the value of the supremum in (1) does not change if we relax the condition $\theta \ge 0$ and allow $\theta$ to take negative values. Since $e^{\theta x}$ is a convex function of $x$, by Jensen’s inequality, we have
$$M(\theta) \ge e^{\theta\mu}.$$
If $\theta < 0$, since $x - \mu > 0$, then $e^{-\theta(x-\mu)} > 1$. Thus,
$$M(\theta)e^{-\theta(x-\mu)} \ge e^{\theta\mu}.$$
Taking the logarithm of both sides yields $\Lambda(\theta) - \theta(x-\mu) \ge \theta\mu$, i.e.,
$$\theta x - \Lambda(\theta) \le 0.$$
Noting that $\theta x - \Lambda(\theta) = 0$ when $\theta = 0$, we have
$$\sup_{\theta} \{\theta x - \Lambda(\theta)\} = \sup_{\theta \ge 0} \{\theta x - \Lambda(\theta)\}.$$

The lower bound in (3): Let $p(x)$ be the pdf of $X_1$. Then, for any $\delta > 0$,
$$\begin{aligned}
P\left(\sum_{i=1}^n X_i \ge nx\right) &= \int_{\sum_{i=1}^n x_i \ge nx} \prod_{i=1}^n p(x_i)\,dx_i \\
&\ge \int_{nx \le \sum_{i=1}^n x_i \le n(x+\delta)} \prod_{i=1}^n p(x_i)\,dx_i \\
&= \frac{M^n(\theta^*)}{e^{n\theta^*(x+\delta)}} \int_{nx \le \sum_{i=1}^n x_i \le n(x+\delta)} \frac{e^{n\theta^*(x+\delta)}}{M^n(\theta^*)} \prod_{i=1}^n p(x_i)\,dx_i \\
&\ge \frac{M^n(\theta^*)}{e^{n\theta^*(x+\delta)}} \int_{nx \le \sum_{i=1}^n x_i \le n(x+\delta)} \prod_{i=1}^n \frac{e^{\theta^* x_i} p(x_i)}{M(\theta^*)}\,dx_i \\
&= \frac{M^n(\theta^*)}{e^{n\theta^*(x+\delta)}} \int_{nx \le \sum_{i=1}^n x_i \le n(x+\delta)} \prod_{i=1}^n q(x_i)\,dx_i, \qquad (5)
\end{aligned}$$
where the second inequality uses $e^{\theta^* \sum_i x_i} \le e^{n\theta^*(x+\delta)}$ on the region of integration (note that $\theta^* > 0$ since $x > \mu$), and
$$q(y) := \frac{e^{\theta^* y} p(y)}{M(\theta^*)}.$$



Note that $\int_{-\infty}^{\infty} q(y)\,dy = 1$. Thus, $q(y)$ is a pdf. Let $Y$ be a random variable with $q(y)$ as its pdf. The moment generating function of $Y$ is given by
$$M_Y(\theta) = \int_{-\infty}^{\infty} e^{\theta y} q(y)\,dy = \frac{M(\theta + \theta^*)}{M(\theta^*)}.$$
Thus,
$$E(Y) = \left.\frac{dM_Y(\theta)}{d\theta}\right|_{\theta=0} = \frac{M'(\theta^*)}{M(\theta^*)}.$$
From the assumptions of the theorem, $\theta^*$ achieves the supremum in (2). Thus,
$$\frac{d}{d\theta}\left(\theta x - \log M(\theta)\right) = 0 \quad \text{at } \theta = \theta^*.$$
From this, we obtain $x = \frac{M'(\theta^*)}{M(\theta^*)}$. Therefore, $E(Y) = x$. In other words, the pdf $q(y)$ defines a set of i.i.d. random variables $Y_i$, each with mean $x$. Thus, from (5), noting that $M^n(\theta^*)e^{-n\theta^*(x+\delta)} = e^{-n(\theta^* x - \Lambda(\theta^*))}e^{-n\theta^*\delta} = e^{-nI(x)}e^{-n\theta^*\delta}$, the probability of a large deviation of the sum $\sum_{i=1}^n X_i$ can be lower bounded by the probability that $\sum_{i=1}^n Y_i$ is near its mean $nx$ as follows:
$$P\left(\sum_{i=1}^n X_i \ge nx\right) \ge e^{-nI(x)}\, e^{-n\theta^*\delta}\, P\left(nx \le \sum_{i=1}^n Y_i \le n(x+\delta)\right).$$
By the central limit theorem,
$$P\left(nx \le \sum_{i=1}^n Y_i \le n(x+\delta)\right) = P\left(0 \le \frac{\sum_{i=1}^n (Y_i - x)}{\sqrt{n}} \le \sqrt{n}\,\delta\right) \xrightarrow{n \to \infty} \frac{1}{2}.$$
Given $\epsilon > 0$, first choose $\delta > 0$ small enough that $\theta^*\delta < \epsilon$, and then choose $N$ (dependent on $\delta$) such that, for all $n \ge N$,
$$P\left(nx \le \sum_{i=1}^n Y_i \le n(x+\delta)\right) \ge \frac{1}{4}$$
and
$$e^{-n\theta^*\delta} \cdot \frac{1}{4} \ge e^{-n\epsilon}.$$
Combining these with the lower bound above gives $P(\sum_{i=1}^n X_i \ge nx) \ge e^{-n(I(x)+\epsilon)}$. Thus, the theorem is proved. $\diamond$

Remark 1 The key idea in the proof of the above theorem is the definition of a new pdf $q(y)$ under which the random variable has its mean at $x$, instead of at $\mu$. This changes the nature of the deviation from the mean to a small deviation, as opposed to a large deviation, and thus allows the use of the central limit theorem to complete the proof. Changing the pdf from $p(y)$ to $q(y)$ is called a change of measure. The new distribution $\int_{-\infty}^{x} q(y)\,dy$ is called the twisted distribution or exponentially tilted distribution. $\diamond$
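The change of measure in the proof is also the standard tool for rare-event simulation by importance sampling: sampling from the tilted density $q$ makes the event $\{\sum_i Y_i \ge nx\}$ typical, and reweighting by the likelihood ratio $\prod_i p(Y_i)/q(Y_i) = M(\theta^*)^n e^{-\theta^* \sum_i Y_i}$ recovers an unbiased estimate under $p$. Below is a minimal Python sketch, assuming $X_i \sim N(0,1)$ (for which the tilted density is $N(\theta^*, 1)$ with $\theta^* = x$) and illustrative parameters $n = 100$, $x = 0.5$.

```python
import numpy as np

rng = np.random.default_rng(1)

n, x = 100, 0.5          # illustrative: X_i ~ N(0,1), estimate P(sum X_i >= n*x)
theta_star = x           # for N(0,1), Lambda(theta) = theta^2/2, so Lambda'(theta*) = x
I_x = x**2 / 2           # rate function (x - mu)^2 / (2 sigma^2) with mu = 0, sigma = 1

# Sample from the tilted density q = N(theta*, 1), under which E(Y_i) = x.
trials = 50_000
Y = rng.normal(loc=theta_star, scale=1.0, size=(trials, n))
S = Y.sum(axis=1)

# Likelihood ratio prod_i p(Y_i)/q(Y_i) = M(theta*)^n * exp(-theta* * sum_i Y_i).
log_lr = n * theta_star**2 / 2 - theta_star * S
estimate = np.mean((S >= n * x) * np.exp(log_lr))

print(f"importance-sampling estimate = {estimate:.3e}")   # exact value is Q(5) ~ 2.9e-7
print(f"Chernoff bound e^(-n I(x))   = {np.exp(-n * I_x):.3e}")
```

Naive Monte Carlo would need on the order of $10^8$ samples to observe this event even once; under the tilted measure, roughly half the samples hit it.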

Remark 2 The theorem is also applicable when $x < E(X_1)$. To see this, define $Y_i = -X_i$, and consider
$$P\left(\sum_{i=1}^n Y_i \ge -nx\right).$$



Note that $M(-\theta)$ is the moment generating function of $Y_1$. Further,
$$I(x) = \sup_{\theta} \{\theta x - \Lambda(\theta)\} = \sup_{\theta} \{-\theta x - \Lambda(-\theta)\}.$$
Thus, the rate function of $Y_1$, evaluated at $-x$, equals $I(x)$. Since $-x > E(Y_1)$, the theorem applies to $\{Y_i\}$. $\diamond$

Remark 3 It should be noted that the proof of the theorem can be easily modified to yield the following result: for any $\delta > 0$,
$$\lim_{n \to \infty} \frac{1}{n} \log P\left(nx \le \sum_{i=1}^n X_i \le n(x+\delta)\right) = -I(x).$$
Noting that, for small $\delta$, $P(nx \le \sum_{i=1}^n X_i \le n(x+\delta))$ can be interpreted as $P(\sum_{i=1}^n X_i \approx nx)$, this result states that the probability that the sum of the random variables exceeds $nx$ is approximately equal (up to logarithmic equivalence) to the probability that the sum is “equal” to $nx$. $\diamond$

Properties of the rate function:

Lemma 1 $I(x)$ is a convex function.

Proof: For any $\alpha \in [0,1]$,
$$\begin{aligned}
I(\alpha x_1 + (1-\alpha)x_2) &= \sup_{\theta} \{\theta(\alpha x_1 + (1-\alpha)x_2) - \Lambda(\theta)\} \\
&= \sup_{\theta} \{\alpha(\theta x_1 - \Lambda(\theta)) + (1-\alpha)(\theta x_2 - \Lambda(\theta))\} \\
&\le \alpha \sup_{\theta} \{\theta x_1 - \Lambda(\theta)\} + (1-\alpha) \sup_{\theta} \{\theta x_2 - \Lambda(\theta)\} \\
&= \alpha I(x_1) + (1-\alpha) I(x_2). \qquad \diamond
\end{aligned}$$

Lemma 2 Let $I(x)$ be the rate function of a random variable $X$ with mean $\mu$. Then,
$$I(x) \ge I(\mu) = 0.$$
Proof: Recall that
$$I(x) = \sup_{\theta} \{\theta x - \Lambda(\theta)\}.$$
Thus,
$$I(x) \ge 0 \cdot x - \Lambda(0) = 0.$$
By Jensen’s inequality,
$$M(\theta) = E(e^{\theta X}) \ge e^{\theta\mu}.$$
Thus,
$$\Lambda(\theta) = \log M(\theta) \ge \theta\mu \;\Rightarrow\; \theta\mu - \Lambda(\theta) \le 0 \;\Rightarrow\; I(\mu) = \sup_{\theta} \{\theta\mu - \Lambda(\theta)\} \le 0.$$
Since we have shown that $I(x) \ge 0$ for all $x$, we have the desired result. $\diamond$
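As a concrete example (ours, not from the notes): for $X \sim$ Bernoulli$(p)$, $\Lambda(\theta) = \log(1 - p + pe^{\theta})$, and carrying out the supremum in (2) gives, for $0 < x < 1$,
$$I(x) = x \log\frac{x}{p} + (1-x)\log\frac{1-x}{1-p},$$
the Kullback-Leibler divergence between Bernoulli$(x)$ and Bernoulli$(p)$. It is nonnegative and vanishes exactly at $x = \mu = p$, consistent with Lemma 2.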



Lemma 3
$$\Lambda(\theta) = \sup_{x} \{\theta x - I(x)\}.$$
Proof: We will prove this under the assumption that all functions of interest are differentiable. From the definition of $I(x)$ and the convexity of $\Lambda(\theta)$,
$$I(x) = \theta^* x - \Lambda(\theta^*), \qquad (6)$$
where $\theta^*$ solves
$$\Lambda'(\theta^*) = x. \qquad (7)$$
Since $I(x)$ is convex, it is enough to show that, for each $\theta$, there exists an $x^*$ such that
$$\Lambda(\theta) = \theta x^* - I(x^*) \qquad (8)$$
and $\theta = I'(x^*)$. We claim that such an $x^*$ is given by $x^* = \Lambda'(\theta)$. To see this, we note from (6)-(7) that
$$I(x^*) = \theta x^* - \Lambda(\theta),$$
which verifies (8). $\diamond$
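To see Lemma 3 in action (an illustration, using the assumed Gaussian example from earlier): with $I(x) = (x-\mu)^2/(2\sigma^2)$, the supremum of $\theta x - I(x)$ is attained at $x^* = \mu + \sigma^2\theta = \Lambda'(\theta)$, and
$$\theta x^* - I(x^*) = \theta(\mu + \sigma^2\theta) - \frac{\sigma^2\theta^2}{2} = \theta\mu + \frac{\sigma^2\theta^2}{2} = \Lambda(\theta),$$
recovering the log moment generating function, as the lemma asserts.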
