Chernoff bound, Cramér's Theorem
R. Srikant's Notes on Modeling and Control of High-Speed Networks

1 Large deviations
Consider a sequence of i.i.d. random variables $\{X_i\}$. The central limit theorem (CLT) provides an estimate of the probability
\[
P\left( \frac{\sum_{i=1}^n X_i - n\mu}{\sigma \sqrt{n}} > x \right),
\]
where $\mu = E(X_1)$ and $\sigma^2 = \mathrm{Var}(X_1)$. Thus, the CLT estimates the probability of $O(\sqrt{n})$ deviations from the mean of the sum of the random variables $\{X_i\}_{i=1}^n$. These deviations are small compared to the mean of $\sum_{i=1}^n X_i$, which is an $O(n)$ quantity. On the other hand, large deviations of the order of the mean itself, i.e., $O(n)$ deviations, are the subject of this section.
The simplest large deviations result is the Chernoff bound, which we review here. First, recall the Markov inequality: for a nonnegative random variable $X$ and any $\epsilon > 0$,
\[
P(X \geq \epsilon) \leq \frac{E(X)}{\epsilon}.
\]
For $\theta \geq 0$,
\[
P\left( \sum_{i=1}^n X_i \geq nx \right) \leq P\left( e^{\theta \sum_{i=1}^n X_i} \geq e^{\theta n x} \right) \leq \frac{E\left( e^{\theta \sum_{i=1}^n X_i} \right)}{e^{\theta n x}},
\]
where the first inequality becomes an equality if $\theta > 0$ and the second inequality follows from the Markov inequality. Since the $X_i$ are i.i.d., $E(e^{\theta \sum_{i=1}^n X_i}) = M(\theta)^n$, where $M(\theta) := E(e^{\theta X_1})$ is the moment generating function of $X_1$. Since the bound above holds for all $\theta \geq 0$,
\[
P\left( \sum_{i=1}^n X_i \geq nx \right) \leq \inf_{\theta \geq 0} \frac{E\left( e^{\theta \sum_{i=1}^n X_i} \right)}{e^{\theta n x}} = e^{-n \sup_{\theta \geq 0} \{\theta x - \log M(\theta)\}}. \tag{1}
\]
The above result is called the Chernoff bound. The following theorem quantifies the tightness of this bound.
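As a concrete illustration (not part of the original notes), the supremum in the exponent of (1) can be evaluated numerically. The sketch below assumes a Bernoulli($p$) distribution, for which $M(\theta) = 1 - p + p e^{\theta}$, and compares a grid search over $\theta$ with the well-known closed form $I(x) = x \log(x/p) + (1-x)\log((1-x)/(1-p))$; the parameter values are illustrative.

```python
import math

def rate_bernoulli_numeric(x, p):
    """Approximate sup_{theta >= 0} {theta*x - log M(theta)} on a theta grid."""
    thetas = [k * 0.001 for k in range(0, 10000)]  # theta in [0, 10)
    return max(t * x - math.log(1 - p + p * math.exp(t)) for t in thetas)

def rate_bernoulli_closed_form(x, p):
    """For Bernoulli(p), the supremum has the closed form
    I(x) = x log(x/p) + (1-x) log((1-x)/(1-p)) (a KL divergence)."""
    return x * math.log(x / p) + (1 - x) * math.log((1 - x) / (1 - p))

p, x = 0.5, 0.7
print(rate_bernoulli_numeric(x, p))      # ~0.0823
print(rate_bernoulli_closed_form(x, p))  # ~0.0823, matching the grid search
```

The grid search and the closed form agree to several decimal places, since the optimizing $\theta^* = \log\big(x(1-p)/(p(1-x))\big)$ lies well inside the grid.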
Theorem 1 (Cramér-Chernoff Theorem) Let $X_1, X_2, \ldots$ be i.i.d. and suppose that their common moment generating function satisfies $M(\theta) < \infty$ for all $\theta$ in some neighborhood $B_0$ of $\theta = 0$. Further suppose that the supremum in the following definition of the rate function $I(x)$ is attained at some interior point of this neighborhood:
\[
I(x) = \sup_{\theta} \{\theta x - \Lambda(\theta)\}, \tag{2}
\]
where $\Lambda(\theta) := \log M(\theta)$ is called the log moment generating function or the cumulant generating function. In other words, we assume that there exists $\theta^* \in \mathrm{Interior}(B_0)$ such that
\[
I(x) = \theta^* x - \Lambda(\theta^*).
\]
Fix any $x > E(X_1)$. Then, for each $\epsilon > 0$, there exists $N$ such that, for all $n \geq N$,
\[
e^{-n(I(x)+\epsilon)} \leq P\left( \sum_{i=1}^n X_i \geq nx \right) \leq e^{-n I(x)}. \tag{3}
\]
Note: Another way to state the result of the above theorem is
\[
\lim_{n \to \infty} \frac{1}{n} \log P\left( \sum_{i=1}^n X_i \geq nx \right) = -I(x). \tag{4}
\]
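The limit (4) can be checked numerically in a simple case. The sketch below (illustrative, not from the notes) evaluates the exact binomial tail for fair coin flips, $X_i \sim \text{Bernoulli}(1/2)$, and compares $-\frac{1}{n}\log P(\sum_i X_i \geq nx)$ with $I(x)$:

```python
import math

def log_tail(n, x, p=0.5):
    """log P(Binomial(n, p) >= n*x), computed exactly via log-sum-exp."""
    k0 = math.ceil(n * x - 1e-12)  # guard against floating-point error in n*x
    log_terms = [
        math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
        + k * math.log(p) + (n - k) * math.log(1 - p)
        for k in range(k0, n + 1)
    ]
    m = max(log_terms)
    return m + math.log(sum(math.exp(t - m) for t in log_terms))

def rate(x, p=0.5):
    # Rate function of Bernoulli(p): I(x) = x log(x/p) + (1-x) log((1-x)/(1-p))
    return x * math.log(x / p) + (1 - x) * math.log((1 - x) / (1 - p))

x = 0.7
for n in (100, 400, 1600):
    print(n, -log_tail(n, x) / n)
# -(1/n) log P stays above I(0.7) ~ 0.0823, as the Chernoff bound requires,
# and decreases toward it as n grows
```

The printed values stay above $I(0.7)$, as the upper bound in (3) requires, and shrink toward it as $n$ grows, matching (4); the $O(\log n / n)$ gap is the usual prefactor correction.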
Proof: We first prove the upper bound and then the lower bound.

The upper bound in (3): This follows from the Chernoff bound if we show that the value of the supremum in (1) does not change if we relax the condition $\theta \geq 0$ and allow $\theta$ to take negative values. Since $e^{\theta x}$ is a convex function of $x$, by Jensen's inequality we have
\[
M(\theta) \geq e^{\theta \mu}.
\]
Taking the logarithm of both sides yields $\Lambda(\theta) \geq \theta \mu$, so that
\[
\theta x - \Lambda(\theta) \leq \theta x - \theta \mu = \theta(x - \mu).
\]
If $\theta < 0$, then, since $x - \mu > 0$, the right-hand side is negative, and hence $\theta x - \Lambda(\theta) < 0$. Noting that $\theta x - \Lambda(\theta) = 0$ when $\theta = 0$, we have
\[
\sup_{\theta} \{\theta x - \Lambda(\theta)\} = \sup_{\theta \geq 0} \{\theta x - \Lambda(\theta)\}.
\]
The lower bound in (3): Let $p(x)$ be the pdf of $X_1$. Then, for any $\delta > 0$,
\[
\begin{aligned}
P\left( \sum_{i=1}^n X_i \geq nx \right)
&= \int_{\sum_{i=1}^n x_i \geq nx} \prod_{i=1}^n p(x_i)\,dx_i \\
&\geq \int_{nx \leq \sum_{i=1}^n x_i \leq n(x+\delta)} \prod_{i=1}^n p(x_i)\,dx_i \\
&= \frac{M^n(\theta^*)}{e^{n\theta^*(x+\delta)}} \int_{nx \leq \sum_{i=1}^n x_i \leq n(x+\delta)} \frac{e^{n\theta^*(x+\delta)}}{M^n(\theta^*)} \prod_{i=1}^n p(x_i)\,dx_i \\
&\geq \frac{M^n(\theta^*)}{e^{n\theta^*(x+\delta)}} \int_{nx \leq \sum_{i=1}^n x_i \leq n(x+\delta)} \prod_{i=1}^n \frac{e^{\theta^* x_i}\, p(x_i)}{M(\theta^*)}\,dx_i \\
&= \frac{M^n(\theta^*)}{e^{n\theta^*(x+\delta)}} \int_{nx \leq \sum_{i=1}^n x_i \leq n(x+\delta)} \prod_{i=1}^n q(x_i)\,dx_i,
\end{aligned} \tag{5}
\]
where the second inequality uses $e^{n\theta^*(x+\delta)} \geq e^{\theta^* \sum_{i=1}^n x_i}$ on the region of integration (recall that $\theta^* \geq 0$ since $x > \mu$), and
\[
q(y) := \frac{e^{\theta^* y}\, p(y)}{M(\theta^*)}.
\]
Note that $\int_{-\infty}^{\infty} q(y)\,dy = 1$. Thus, $q(y)$ is a pdf. Let $Y$ be a random variable with $q(y)$ as its pdf. The moment generating function of $Y$ is given by
\[
M_Y(\theta) = \int_{-\infty}^{\infty} e^{\theta y} q(y)\,dy = \frac{M(\theta + \theta^*)}{M(\theta^*)}.
\]
Thus,
\[
E(Y) = \left. \frac{dM_Y(\theta)}{d\theta} \right|_{\theta = 0} = \frac{M'(\theta^*)}{M(\theta^*)}.
\]
From the assumptions of the theorem, $\theta^*$ achieves the supremum in (2). Thus,
\[
\frac{d}{d\theta} \left( \theta x - \log M(\theta) \right) = 0
\]
at $\theta = \theta^*$. From this, we obtain $x = \frac{M'(\theta^*)}{M(\theta^*)}$. Therefore, $E(Y) = x$. In other words, the pdf $q(y)$ defines a set of i.i.d. random variables $Y_i$, each with mean $x$. Thus, from (5), the probability of a large deviation of the sum $\sum_{i=1}^n X_i$ can be lower bounded by the probability that $\sum_{i=1}^n Y_i$ is near its mean $nx$ as follows:
\[
P\left( \sum_{i=1}^n X_i \geq nx \right) \geq e^{-nI(x)}\, e^{-n\theta^* \delta}\, P\left( nx \leq \sum_{i=1}^n Y_i \leq n(x+\delta) \right).
\]
By the central limit theorem,
\[
P\left( nx \leq \sum_{i=1}^n Y_i \leq n(x+\delta) \right) = P\left( 0 \leq \frac{\sum_{i=1}^n (Y_i - x)}{\sqrt{n}} \leq \sqrt{n}\,\delta \right) \stackrel{n \to \infty}{\longrightarrow} \frac{1}{2}.
\]
Given $\epsilon > 0$, first choose $\delta$ small enough that $\theta^* \delta < \epsilon$, and then choose $N$ (dependent on $\delta$) such that, for all $n \geq N$,
\[
P\left( nx \leq \sum_{i=1}^n Y_i \leq n(x+\delta) \right) \geq \frac{1}{4}
\]
and
\[
e^{-n\theta^* \delta}\, \frac{1}{4} \geq e^{-n\epsilon}.
\]
Combining these bounds gives $P\left( \sum_{i=1}^n X_i \geq nx \right) \geq e^{-n(I(x)+\epsilon)}$. Thus, the theorem is proved. ⋄
Remark 1 The key idea in the proof of the above theorem is the definition of a new pdf $q(y)$ under which the random variable has mean $x$ instead of $\mu$. This changes the nature of the deviation from a large deviation to a small deviation from the mean, and thus allows the use of the central limit theorem to complete the proof. Changing the pdf from $p(y)$ to $q(y)$ is called a change of measure. The new distribution $\int_{-\infty}^{x} q(y)\,dy$ is called the twisted distribution or exponentially tilted distribution. ⋄
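The tilting step can be made concrete in a discrete example. The sketch below (illustrative, not from the notes) tilts a Bernoulli($p$) pmf by $\theta^*$, where $\theta^*$ solves $\Lambda'(\theta^*) = x$ (available in closed form for the Bernoulli case), and checks that the tilted mean is exactly $x$:

```python
import math

def tilt_bernoulli(p, x):
    """Exponentially tilt a Bernoulli(p) pmf so that its mean becomes x."""
    # For Bernoulli(p), Lambda'(theta) = p e^theta / (1 - p + p e^theta) = x
    # has the closed-form solution theta* = log(x(1-p) / (p(1-x))).
    theta_star = math.log(x * (1 - p) / (p * (1 - x)))
    M = 1 - p + p * math.exp(theta_star)  # M(theta*)
    # q(y) = e^{theta* y} p(y) / M(theta*) for y in {0, 1}
    q0 = (1 - p) / M
    q1 = math.exp(theta_star) * p / M
    return q0, q1

q0, q1 = tilt_bernoulli(p=0.5, x=0.7)
print(q0 + q1)          # ~1: q is a pmf
print(0 * q0 + 1 * q1)  # ~0.7: the tilted mean is x
```

For $p = 0.5$ and $x = 0.7$ the tilted pmf works out to $(q_0, q_1) = (0.3, 0.7)$, confirming that the tilted mean sits exactly at the target $x$.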
Remark 2 The theorem is also applicable when $x < E(X_1)$. To see this, define $Y_i = -X_i$ and consider
\[
P\left( \sum_{i=1}^n Y_i > -nx \right).
\]
Note that $M(-\theta)$ is the moment generating function of $Y_1$. Further,
\[
I_Y(-x) = \sup_{\theta} \{-\theta x - \Lambda(-\theta)\} = \sup_{\theta} \{\theta x - \Lambda(\theta)\} = I(x).
\]
Thus, the rate function of $Y$ evaluated at $-x$ equals the rate function of $X$ at $x$. Since $-x > E(Y_1)$, the theorem applies to $\{Y_i\}$. ⋄
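The identity in Remark 2 can be checked numerically. In the sketch below (illustrative, not from the notes), the rate function of $X \sim \text{Bernoulli}(1/2)$ at $0.7$ is compared with the rate function of $Y = -X$ at $-0.7$, using $\Lambda_Y(\theta) = \Lambda_X(-\theta)$:

```python
import math

def rate_numeric(x, p=0.5, sign=1):
    """Grid-search sup_theta {theta*x - Lambda(sign*theta)} for Bernoulli(p).

    sign=+1 gives the rate function of X; sign=-1 uses
    Lambda_Y(theta) = Lambda_X(-theta), the log-MGF of Y = -X.
    """
    thetas = [k * 0.001 - 10.0 for k in range(20001)]  # theta in [-10, 10]
    return max(t * x - math.log(1 - p + p * math.exp(sign * t)) for t in thetas)

# Rate of X = Bernoulli(1/2) at 0.7 vs. rate of Y = -X at -0.7:
print(rate_numeric(0.7), rate_numeric(-0.7, sign=-1))  # the two agree
```

The two values coincide (up to grid resolution), since substituting $\theta \mapsto -\theta$ maps one supremum onto the other.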
Remark 3 It should be noted that the proof of the theorem can be easily modified to yield the following result: for any $\delta > 0$,
\[
\lim_{n \to \infty} \frac{1}{n} \log P\left( nx \leq \sum_{i=1}^n X_i \leq n(x+\delta) \right) = -I(x).
\]
Noting that, for small $\delta$, $P\left( nx \leq \sum_{i=1}^n X_i \leq n(x+\delta) \right)$ can be interpreted as $P\left( \sum_{i=1}^n X_i \approx nx \right)$, this result states that the probability that the sum of the random variables exceeds $nx$ is approximately equal (up to logarithmic equivalence) to the probability that the sum is "equal" to $nx$. ⋄
Properties of the rate function:

Lemma 1 $I(x)$ is a convex function.

Proof: For any $\alpha \in [0, 1]$,
\[
\begin{aligned}
I(\alpha x_1 + (1-\alpha) x_2)
&= \sup_{\theta} \{\theta(\alpha x_1 + (1-\alpha) x_2) - \Lambda(\theta)\} \\
&= \sup_{\theta} \{\alpha(\theta x_1 - \Lambda(\theta)) + (1-\alpha)(\theta x_2 - \Lambda(\theta))\} \\
&\leq \alpha \sup_{\theta} \{\theta x_1 - \Lambda(\theta)\} + (1-\alpha) \sup_{\theta} \{\theta x_2 - \Lambda(\theta)\} \\
&= \alpha I(x_1) + (1-\alpha) I(x_2). 
\end{aligned}
\]
⋄
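Lemma 1 can be illustrated numerically. The sketch below (illustrative, not from the notes) evaluates the Bernoulli($1/2$) rate function by a grid search over $\theta$ and checks the convexity inequality at a midpoint:

```python
import math

def rate_numeric(x, p=0.5):
    """Grid-search sup_theta {theta*x - Lambda(theta)} for Bernoulli(p)."""
    thetas = [k * 0.001 - 10.0 for k in range(20001)]  # theta in [-10, 10]
    return max(t * x - math.log(1 - p + p * math.exp(t)) for t in thetas)

x1, x2, a = 0.6, 0.9, 0.5
lhs = rate_numeric(a * x1 + (1 - a) * x2)
rhs = a * rate_numeric(x1) + (1 - a) * rate_numeric(x2)
print(lhs, rhs)  # lhs <= rhs, as the lemma asserts
```

Here $I(0.75) \approx 0.131$ while the average of $I(0.6)$ and $I(0.9)$ is about $0.194$, so the convexity inequality holds with a comfortable margin.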
Lemma 2 Let $I(x)$ be the rate function of a random variable $X$ with mean $\mu$. Then,
\[
I(x) \geq I(\mu) = 0.
\]
Proof: Recall that
\[
I(x) = \sup_{\theta} \{\theta x - \Lambda(\theta)\}.
\]
Thus,
\[
I(x) \geq 0 \cdot x - \Lambda(0) = 0.
\]
By Jensen's inequality,
\[
M(\theta) = E(e^{\theta X}) \geq e^{\theta \mu}.
\]
Thus,
\[
\Lambda(\theta) = \log M(\theta) \geq \theta \mu
\quad \Rightarrow \quad
\theta \mu - \Lambda(\theta) \leq 0
\quad \Rightarrow \quad
I(\mu) = \sup_{\theta} \{\theta \mu - \Lambda(\theta)\} \leq 0.
\]
Since we have shown that $I(x) \geq 0$ for all $x$, we have the desired result. ⋄
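Both parts of Lemma 2 can be seen numerically. The sketch below (illustrative, not from the notes) evaluates the Bernoulli($p$) rate function on a grid of $x$ values and checks that it is nonnegative everywhere and (approximately) zero at the mean $\mu = p$:

```python
import math

def rate_numeric(x, p):
    """Grid-search sup_theta {theta*x - Lambda(theta)} for Bernoulli(p)."""
    thetas = [k * 0.001 - 10.0 for k in range(20001)]  # theta in [-10, 10]
    return max(t * x - math.log(1 - p + p * math.exp(t)) for t in thetas)

p = 0.3  # mean of Bernoulli(0.3) is mu = 0.3
values = [rate_numeric(0.1 * k, p) for k in range(1, 10)]  # x in 0.1..0.9
print(min(values))         # >= 0 (up to grid resolution): I(x) >= 0
print(rate_numeric(p, p))  # ~0: I(mu) = 0
```

The grid value at $\theta = 0$ already gives $I(x) \geq -\Lambda(0) = 0$, and at $x = \mu$ the supremum is attained at $\theta = 0$, matching the lemma.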
Lemma 3
\[
\Lambda(\theta) = \sup_{x} \{\theta x - I(x)\}.
\]
Proof: We will prove this under the assumption that all functions of interest are differentiable. From the definition of $I(x)$ and the convexity of $\Lambda(\theta)$,
\[
I(x) = \theta^* x - \Lambda(\theta^*), \tag{6}
\]
where $\theta^*$ solves
\[
\Lambda'(\theta^*) = x. \tag{7}
\]
Since $I(x)$ is convex, it is enough to show that, for each $\theta$, there exists an $x^*$ such that
\[
\Lambda(\theta) = \theta x^* - I(x^*) \tag{8}
\]
and $\theta = I'(x^*)$. We claim that such an $x^*$ is given by $x^* = \Lambda'(\theta)$. To see this, we note from (6)–(7) that
\[
I(x^*) = \theta x^* - \Lambda(\theta),
\]
which verifies (8). ⋄
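The duality in Lemma 3 can also be checked numerically. The sketch below (illustrative, not from the notes) recovers $\Lambda(\theta)$ for a Bernoulli($1/2$) random variable as $\sup_x \{\theta x - I(x)\}$ over an $x$ grid, using the closed-form rate function:

```python
import math

def Lambda(t, p=0.5):
    """Log moment generating function of Bernoulli(p)."""
    return math.log(1 - p + p * math.exp(t))

def rate(x, p=0.5):
    """Closed-form rate function I(x) of Bernoulli(p)."""
    return x * math.log(x / p) + (1 - x) * math.log((1 - x) / (1 - p))

theta = 1.0
xs = [k * 0.001 for k in range(1, 1000)]  # x in (0, 1)
sup_val = max(theta * x - rate(x) for x in xs)
print(sup_val, Lambda(theta))  # the two agree, verifying the duality
```

The grid supremum is attained near $x^* = \Lambda'(1) = e/(1+e) \approx 0.73$, exactly as the proof predicts.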