ECE566: Information Theory - Fall 2011
Lecture 6+7 - Date: 10/24/2011
Scribe: Duong Nguyen-Huu

LONG MARKOV CHAIN

For a Markov chain X_1 → X_2 → X_3 → · · · → X_n, the mutual information with X_1 can only decrease as you move further down the chain (equivalently, it increases as you get closer to X_1):

I(X_1; X_2) ≥ I(X_1; X_3) ≥ I(X_1; X_4) ≥ · · · ≥ I(X_1; X_n)

Proof:

Given a Markov chain X_1 → X_2 → X_3 → X_4 (1), we first show that X_1 → X_3 → X_4 (2) also forms a Markov chain:

P(X_1, X_4 | X_3) = P(X_1 | X_3) · P(X_4 | X_1, X_3) = P(X_1 | X_3) · P(X_4 | X_3)

where the second step uses the Markov property of chain (1). Applying the Data Processing Theorem to chains (1) and (2) then gives:

I(X_1; X_2) ≥ I(X_1; X_3) ≥ I(X_1; X_4)
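
As a quick numerical sanity check (not part of the original notes), the sketch below builds a short binary chain in which every arrow is a binary symmetric channel and evaluates I(X_1; X_k); the helper `mutual_information`, the crossover probability, and the chain length are all my own choices.

```python
import numpy as np

def mutual_information(pxy):
    """I(X;Y) in bits, computed from a joint distribution table pxy[x, y]."""
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    mask = pxy > 0
    return float((pxy[mask] * np.log2(pxy[mask] / (px @ py)[mask])).sum())

# X1 ~ Bernoulli(0.5); every arrow in the chain is a binary symmetric channel.
eps = 0.1
bsc = np.array([[1 - eps, eps], [eps, 1 - eps]])     # P(next state | current state)
p1 = np.array([0.5, 0.5])

# Joint of (X1, Xk): P(x1, xk) = P(x1) * P(xk | x1), with k-1 channel uses.
for k in range(2, 5):
    channel = np.linalg.matrix_power(bsc, k - 1)
    joint = np.diag(p1) @ channel
    print(f"I(X1; X{k}) = {mutual_information(joint):.4f} bits")
# The printed values decrease in k, matching I(X1;X2) >= I(X1;X3) >= I(X1;X4).
```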

SUFFICIENT STATISTICS

• If the probability density function (pdf) of X depends on an unknown parameter θ and you extract a statistic T(X) from your observation, then θ → X → T(X) forms a Markov chain, so I(θ; T(X)) ≤ I(θ; X).

• T(X) is sufficient for θ if the stronger condition θ → X → T(X) → θ holds:

⇔ I(θ; T(X)) = I(θ; X) (1) (T(X) contains as much information about θ as X does)

⇔ θ → T(X) → X → θ (2) (reverse chain)

⇔ P(X | T(X), θ) = P(X | T(X)) (3) (Markov property of (2))

Proof of (1): Using the chain and the reverse chain with the data processing theorem, we show that:

I(θ; T(X)) ≤ I(θ; X) and I(θ; T(X)) ≥ I(θ; X) ⇒ equality
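
A small numerical illustration of equivalence (1), not from the notes: put a two-point prior on θ, let X be n i.i.d. Bernoulli(θ) observations, and compare I(θ; X) with I(θ; T(X)) for T(X) = ∑ X_i. The prior, the values of θ, n, and the helper `mi` are all my own assumptions.

```python
import itertools
import numpy as np

# Assumed setup: theta in {0.3, 0.7} with a uniform prior, X = (X1,...,Xn) i.i.d.
# Bernoulli(theta), and T(X) = X1 + ... + Xn.
thetas, prior, n = [0.3, 0.7], [0.5, 0.5], 4

def mi(joint):
    """Mutual information in bits of a joint probability table."""
    joint = np.asarray(joint)
    pa = joint.sum(axis=1, keepdims=True)
    pb = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float((joint[mask] * np.log2(joint[mask] / (pa @ pb)[mask])).sum())

# Joint P(theta, x) over all 2^n sequences, and P(theta, T) over the n+1 sums.
seqs = list(itertools.product([0, 1], repeat=n))
joint_x = np.array([[w * th**sum(s) * (1 - th)**(n - sum(s)) for s in seqs]
                    for th, w in zip(thetas, prior)])
joint_t = np.zeros((len(thetas), n + 1))
for j, s in enumerate(seqs):
    joint_t[:, sum(s)] += joint_x[:, j]

print(f"I(theta; X)    = {mi(joint_x):.6f} bits")
print(f"I(theta; T(X)) = {mi(joint_t):.6f} bits")    # equal, since T is sufficient
```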

EXAMPLES OF SUFFICIENT STATISTICS

Example 1: Consider X = (X_1, . . . , X_n) with the X_i i.i.d. Bernoulli(θ), and the statistic

T(X) = ∑_{i=1}^{n} x_i with x_i ∈ {0, 1}

We have P(x | θ, T(X) = k) = 1 / C(n, k) (one over the binomial coefficient), independent of θ,

⇒ P(x | θ, T(X) = k) = P(x | T(X) = k), so T(X) is sufficient for θ.
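
The uniformity of P(x | θ, T(X) = k) can be checked directly by enumeration; in the sketch below (my own, not from the notes), n, k, and the values of θ are arbitrary illustrative choices.

```python
import itertools
from math import comb

# For X1..Xn i.i.d. Bernoulli(theta), the conditional law of the sequence given
# sum = k should be uniform over the C(n, k) arrangements, whatever theta is.
n, k = 5, 2
for theta in (0.2, 0.5, 0.8):
    probs = {s: theta**sum(s) * (1 - theta)**(n - sum(s))
             for s in itertools.product([0, 1], repeat=n) if sum(s) == k}
    total = sum(probs.values())
    cond = sorted(p / total for p in probs.values())
    print(f"theta={theta}: P(x | T=k) in [{cond[0]:.4f}, {cond[-1]:.4f}], "
          f"1/C(n,k) = {1 / comb(n, k):.4f}")
```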

Note: Fisher-Neyman factorization theorem

If the probability density function is f_θ(x), then T is sufficient for θ if and only if functions g and h can be found such that:

f_θ(x) = h(x) · g_θ(T(x))



i.e., the density f can be factored into a product such that one factor, h, does not depend on θ, and the other factor, which does depend on θ, depends on x only through T(x).

Example 2: If X_1, . . . , X_n are independent and uniformly distributed on the interval [0, θ], then T(X) = max(X_1, . . . , X_n) is sufficient for θ: the sample maximum is a sufficient statistic for the population maximum.

To see this, consider the joint probability density function of X = (X_1, . . . , X_n). Because the observations are independent, the pdf can be written as a product of individual densities:

f_X(x_1, . . . , x_n) = (1/θ) 1{0 ≤ x_1 ≤ θ} · · · (1/θ) 1{0 ≤ x_n ≤ θ} (1)

= (1/θ^n) 1{0 ≤ min{x_i}} 1{max{x_i} ≤ θ} (2)

where 1{...} is the indicator function. Thus the density takes the form required by the Fisher-Neyman factorization theorem, where h(x) = 1{min{x_i} ≥ 0}, and the rest of the expression is a function of only θ and T(x) = max{x_i}.
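
A minimal numeric sketch of this factorization (not from the notes; θ, n, and the random seed are arbitrary assumptions): the joint density evaluated at sampled points matches h(x) · g_θ(max(x)).

```python
import numpy as np

rng = np.random.default_rng(0)
theta, n = 2.5, 4                # assumed values for the illustration

def f_joint(x, theta):
    """Joint density of n i.i.d. Uniform[0, theta] observations."""
    x = np.asarray(x)
    return (1 / theta**n) * float(np.all((x >= 0) & (x <= theta)))

def h(x):                        # factor that does not involve theta
    return float(np.min(x) >= 0)

def g(t, theta):                 # factor that sees the data only through T(x) = max
    return (1 / theta**n) * float(t <= theta)

for _ in range(3):
    x = rng.uniform(0, theta, size=n)
    assert np.isclose(f_joint(x, theta), h(x) * g(np.max(x), theta))
print("f_theta(x) == h(x) * g_theta(max(x)) at all sampled points")
```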

FANO’S INEQUALITY

This scenario is viewed as a communication link between a source and a receiver. X is the data transmitted from the source, and Y is the data received at the receiver. The channel is noisy. If we form an estimate X̂ of X from Y, what is P(X ≠ X̂)?

P_e ≡ P(X ≠ X̂) ≥ [H(X|Y) − H(P_e)] / log(N − 1) ≥ [H(X|Y) − 1] / log(N − 1)

N is the size of the outcome set of X. This inequality gives a lower bound on the probability of error when estimating X from Y.

Proof: Let E = 1{X ≠ X̂} be the indicator of an error, so E is a binary random variable with:

P(E = 1) = P_e, P(E = 0) = 1 − P_e, and H(E) = H(P_e)

We have: H(E, X | Y) = H(X | Y) + H(E | X, Y) = H(X | Y) (1) (because knowing X and Y, we know E, so H(E | X, Y) = 0).

Also, we have: H(E, X | Y) = H(E | Y) + H(X | E, Y)

≤ H(E) + H(X | Y, E = 0) · (1 − P_e) + H(X | Y, E = 1) · P_e

= H(P_e) + H(X | Y, E = 1) · P_e (since H(X | Y, E = 0) = 0)

≤ H(P_e) + log(N − 1) · P_e (2) (since H(X | Y, E = 1) is maximized when X is equally likely among the other N − 1 choices)

From (1) and (2), we have: H(X | Y) ≤ H(P_e) + log(N − 1) · P_e

⇒ P_e ≥ [H(X|Y) − H(P_e)] / log(N − 1) ≥ [H(X|Y) − 1] / log(N − 1)

The last inequality holds since H(P_e) ≤ 1.

FANO’S INEQUALITY EXAMPLE

Consider a source X = 1 : 5 with P(X) = [0.35, 0.35, 0.1, 0.1, 0.1]^T, and let Y ∈ {1, 2}: if X ≤ 2 then Y = X with probability 6/7 (and Y equals the other value with probability 1/7), while if X > 2 then Y = 1 or 2 with equal probability.

Our best strategy is to guess X̂ = Y. We now calculate the actual error probability and the bound given by Fano's inequality.



• Actual error of the guess X̂ = Y: P_e = 1 − ∑_{i=1}^{5} P(X = i, Y = i) = 1 − (0.35 · (6/7) + 0.35 · (6/7)) = 0.4

• The lower bound given by Fano's inequality:

We have P(X | Y = 1) = P(X, Y = 1) / P(Y = 1) = P(Y = 1 | X) · P(X) / P(Y = 1)

From this, we have P(X | Y = 1) = [0.6, 0.1, 0.1, 0.1, 0.1]^T and P(X | Y = 2) = [0.1, 0.6, 0.1, 0.1, 0.1]^T.

Therefore,

H(X|Y) = H(X | Y = 1) · P(Y = 1) + H(X | Y = 2) · P(Y = 2)

= 0.5 · (2 · 0.6 · log(1/0.6) + 8 · 0.1 · log(1/0.1)) = 1.771 bits.

Hence, P_e = 0.4 > [H(X|Y) − 1] / log(N − 1) = (1.771 − 1) / log(5 − 1) = 0.3855
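
The numbers above are easy to reproduce; the sketch below is my own code (with my own indexing convention for X and Y) recomputing P_e, H(X|Y), and the Fano bound.

```python
import numpy as np

px = np.array([0.35, 0.35, 0.1, 0.1, 0.1])           # P(X = 1..5)
p_y_given_x = np.zeros((5, 2))                       # columns correspond to Y = 1, 2
p_y_given_x[0] = [6/7, 1/7]
p_y_given_x[1] = [1/7, 6/7]
p_y_given_x[2:] = [0.5, 0.5]

joint = px[:, None] * p_y_given_x                    # P(X = i, Y = j)
py = joint.sum(axis=0)

# Error probability of the guess Xhat = Y: everything off the X == Y "diagonal".
pe = 1 - joint[0, 0] - joint[1, 1]

# H(X|Y) = -sum_{x,y} P(x, y) log P(x | y), in bits.
px_given_y = joint / py
h_x_given_y = -(joint * np.log2(px_given_y)).sum()

bound = (h_x_given_y - 1) / np.log2(5 - 1)
print(f"P_e = {pe:.4f}, H(X|Y) = {h_x_given_y:.4f} bits, Fano bound = {bound:.4f}")
```

Running this prints P_e = 0.4, H(X|Y) ≈ 1.771 bits, and a bound of ≈ 0.3855, matching the calculation above.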

SUMMARY

• Markov X → Y → Z

P(X, Z | Y) = P(X | Y) P(Z | Y)

• Data Processing Theorem

I(X; Y) ≥ I(X; Z)

I(X; Y) ≥ I(X; Y | Z)

• Fano's Inequality

P_e ≥ [H(X|Y) − 1] / log(N − 1)

STRONG AND WEAK TYPICALITY

Consider a random variable X with P(X = A, B, C) = [0.5, 0.25, 0.25]

• Strongly typical sequence: ABAACABCAABC

Correct proportions: P(A) = 6/12; P(B) = 3/12; P(C) = 3/12

H(X) = −0.5 · log(0.5) − 2 · 0.25 · log(0.25) = 1.5 bits

• Weakly typical sequence: BBBBBBBBBBAA

Incorrect proportions: P(A) = 2/12; P(B) = 10/12; P(C) = 0
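
One way to see the difference quantitatively (a sketch of my own, not from the notes) is to compare the sample entropy −(1/n) log p(x^n) of each sequence with H(X) = 1.5 bits; this gap is exactly the quantity used in the typical-set definition later in these notes.

```python
import numpy as np

p = {"A": 0.5, "B": 0.25, "C": 0.25}
h_x = -sum(q * np.log2(q) for q in p.values())       # H(X) = 1.5 bits

for seq in ("ABAACABCAABC", "BBBBBBBBBBAA"):
    sample_entropy = -np.mean([np.log2(p[s]) for s in seq])
    print(f"{seq}: -(1/n) log p(x^n) = {sample_entropy:.3f} bits "
          f"(H(X) = {h_x:.3f}, gap = {abs(sample_entropy - h_x):.3f})")
```

The first sequence has sample entropy exactly 1.5 bits (gap 0), while the second is off by about 0.33 bits.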

CONVERGENCE OF RANDOM VARIABLES

• Almost Sure Convergence:

Pr(lim_{n→∞} X_n = X) = 1

Example: Let X_n = ±2^{−n} with p = [0.5, 0.5]. Then

Pr(lim_{n→∞} X_n = 0) = 1

• Convergence in Probability:

lim_{n→∞} Pr(|X_n − X| ≥ ε) = 0, ∀ε > 0

Example: Consider X_n ∈ {1, 0} with p = [1/2^n, 1 − 1/2^n]. Since Pr(|X_n − 0| ≥ ε) = 1/2^n for any 0 < ε ≤ 1, choosing n > −log(ε) makes this probability smaller than ε, and

lim_{n→∞} Pr(|X_n − 0| ≥ ε) = 0
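
A short simulation of the second example (not from the notes; the value of ε, the random seed, and the number of trials are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
eps, trials = 0.1, 200_000
for n in (1, 3, 5, 8):
    # X_n = 1 with probability 2^-n, otherwise 0.
    xn = (rng.random(trials) < 2.0**-n).astype(float)
    empirical = np.mean(np.abs(xn - 0.0) >= eps)
    print(f"n={n}: Pr(|X_n - 0| >= {eps}) ~ {empirical:.5f} (exact value {2.0**-n:.5f})")
```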



WEAK LAW OF LARGE NUMBERS

Given i.i.d. X_i, let S_n = (1/n) · ∑_{i=1}^{n} X_i. Then

E[S_n] = E[X] = μ, Var(S_n) = (1/n) · Var(X) = σ²/n

As n increases, the variance of S_n decreases, i.e., the values of S_n become clustered around the mean:

S_n → μ in probability; P(|S_n − μ| ≥ ε) → 0 as n → ∞

PROOF OF WEAK LAW OF LARGE NUMBERS

• Chebyshev's Inequality: P(|X − μ| ≥ ε) ≤ σ²/ε²

Proof: σ² = E[(X − μ)²] = ∑_{∀x} P(x) · (x − μ)² ≥ ∑_{|x−μ|≥ε} P(x) · (x − μ)² ≥ ε² · ∑_{|x−μ|≥ε} P(x)

⇒ P(|X − μ| ≥ ε) ≤ σ²/ε²

• WLLN

Applying Chebyshev's inequality to S_n: P(|S_n − μ| ≥ ε) ≤ σ²/(n · ε²) → 0 as n → ∞
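
A Monte Carlo sketch of both statements (my own; the Uniform[0,1] distribution, ε, the seed, and the sample sizes are arbitrary choices): the empirical probability P(|S_n − μ| ≥ ε) falls as n grows and stays below the Chebyshev bound σ²/(nε²).

```python
import numpy as np

rng = np.random.default_rng(2)
mu, var = 0.5, 1 / 12            # mean and variance of a Uniform[0, 1] sample
eps, trials = 0.05, 5_000
for n in (10, 100, 1000):
    s_n = rng.random((trials, n)).mean(axis=1)       # 'trials' independent copies of S_n
    empirical = np.mean(np.abs(s_n - mu) >= eps)
    chebyshev = var / (n * eps**2)
    print(f"n={n:4d}: P(|S_n - mu| >= {eps}) ~ {empirical:.4f}  "
          f"(Chebyshev bound {chebyshev:.4f})")
```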

TYPICAL SET

Let x^n be an i.i.d. sequence of X_i for i = 1 to n. Definition of the typical set:

T_ε^n = { x^n ∈ X^n : | −(1/n) · log p(x^n) − H(X) | < ε } with H(X) = H(X_i)
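
For a small alphabet the typical set can be enumerated exhaustively; in the sketch below (my own illustration, not the lecture's example) n, p, and ε are arbitrary choices, and membership is just the defining condition |−(1/n) log p(x^n) − H(X)| < ε.

```python
import itertools
import numpy as np

n, p, eps = 10, 0.3, 0.1                             # assumed illustrative values
h = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))     # H(X) for Bernoulli(p)

typical_prob, typical_count = 0.0, 0
for seq in itertools.product([0, 1], repeat=n):
    k = sum(seq)
    prob = p**k * (1 - p)**(n - k)
    sample_entropy = -np.log2(prob) / n
    if abs(sample_entropy - h) < eps:                # membership in T_eps^n
        typical_prob += prob
        typical_count += 1

print(f"H(X) = {h:.3f} bits; |T_eps^n| = {typical_count} of {2**n} sequences, "
      f"carrying total probability {typical_prob:.3f}")
```

For such a small n the typical set carries only a modest fraction of the probability; the familiar high-probability guarantee is asymptotic in n.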

EXAMPLE OF TYPICAL SET

Consider a Bernoulli distribution with parameter p
