ECE566: Information Theory - Fall 2011
Lecture 6+7 - Date: 10/24/2011
Scribe: Duong Nguyen-Huu

LONG MARKOV CHAIN

For a Markov chain X_1 → X_2 → X_3 → · · · → X_n, the mutual information with X_1 can only decrease as you move further down the chain (equivalently, it increases as you get closer to X_1):

I(X_1; X_2) ≥ I(X_1; X_3) ≥ I(X_1; X_4) ≥ · · · ≥ I(X_1; X_n)

Proof:

Given a Markov chain X_1 → X_2 → X_3 → X_4 (1), we first show that X_1 → X_3 → X_4 (2) also forms a Markov chain:

P(X_1, X_4 | X_3) = P(X_1 | X_3) · P(X_4 | X_1, X_3) = P(X_1 | X_3) · P(X_4 | X_3)

where the second step uses the Markov property of chain (1). Applying the Data Processing Theorem to chains (1) and (2) then gives:

I(X_1; X_2) ≥ I(X_1; X_3) ≥ I(X_1; X_4)
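
As a quick numerical sanity check (not part of the original notes), the sketch below builds a short binary chain in which every arrow is a binary symmetric channel and evaluates I(X_1; X_k); the helper `mutual_information`, the crossover probability, and the chain length are all my own choices.

```python
import numpy as np

def mutual_information(pxy):
    """I(X;Y) in bits, computed from a joint distribution table pxy[x, y]."""
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    mask = pxy > 0
    return float((pxy[mask] * np.log2(pxy[mask] / (px @ py)[mask])).sum())

# X1 ~ Bernoulli(0.5); every arrow in the chain is a binary symmetric channel.
eps = 0.1
bsc = np.array([[1 - eps, eps], [eps, 1 - eps]])     # P(next state | current state)
p1 = np.array([0.5, 0.5])

# Joint of (X1, Xk): P(x1, xk) = P(x1) * P(xk | x1), with k-1 channel uses.
for k in range(2, 5):
    channel = np.linalg.matrix_power(bsc, k - 1)
    joint = np.diag(p1) @ channel
    print(f"I(X1; X{k}) = {mutual_information(joint):.4f} bits")
# The printed values decrease in k, matching I(X1;X2) >= I(X1;X3) >= I(X1;X4).
```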

SUFFICIENT STATISTICS

• If the probability density function (pdf) of X depends on an unknown parameter θ and you extract a statistic T(X) from your observation, then θ → X → T(X) forms a Markov chain, so I(θ; T(X)) ≤ I(θ; X).

• T(X) is sufficient for θ if the stronger condition θ → X → T(X) → θ holds:

⇔ I(θ; T(X)) = I(θ; X) (1) (T(X) contains as much information about θ as X does)

⇔ θ → T(X) → X → θ (2) (reverse chain)

⇔ P(X | T(X), θ) = P(X | T(X)) (3) (Markov property of (2))

Proof of (1): Using the chain and the reverse chain with the data processing theorem, we show that:

I(θ; T(X)) ≤ I(θ; X) and I(θ; T(X)) ≥ I(θ; X) ⇒ equality
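
A small numerical illustration of equivalence (1), not from the notes: put a two-point prior on θ, let X be n i.i.d. Bernoulli(θ) observations, and compare I(θ; X) with I(θ; T(X)) for T(X) = ∑ X_i. The prior, the values of θ, n, and the helper `mi` are all my own assumptions.

```python
import itertools
import numpy as np

# Assumed setup: theta in {0.3, 0.7} with a uniform prior, X = (X1,...,Xn) i.i.d.
# Bernoulli(theta), and T(X) = X1 + ... + Xn.
thetas, prior, n = [0.3, 0.7], [0.5, 0.5], 4

def mi(joint):
    """Mutual information in bits of a joint probability table."""
    joint = np.asarray(joint)
    pa = joint.sum(axis=1, keepdims=True)
    pb = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float((joint[mask] * np.log2(joint[mask] / (pa @ pb)[mask])).sum())

# Joint P(theta, x) over all 2^n sequences, and P(theta, T) over the n+1 sums.
seqs = list(itertools.product([0, 1], repeat=n))
joint_x = np.array([[w * th**sum(s) * (1 - th)**(n - sum(s)) for s in seqs]
                    for th, w in zip(thetas, prior)])
joint_t = np.zeros((len(thetas), n + 1))
for j, s in enumerate(seqs):
    joint_t[:, sum(s)] += joint_x[:, j]

print(f"I(theta; X)    = {mi(joint_x):.6f} bits")
print(f"I(theta; T(X)) = {mi(joint_t):.6f} bits")    # equal, since T is sufficient
```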

EXAMPLES OF SUFFICIENT STATISTICS

Example 1: Consider X = (X_1, . . . , X_n) with the X_i i.i.d. Bernoulli(θ), and the statistic

T(X) = ∑_{i=1}^{n} x_i with x_i ∈ {0, 1}

We have P(x | θ, T(X) = k) = 1 / C(n, k) (one over the binomial coefficient), independent of θ,

⇒ P(x | θ, T(X) = k) = P(x | T(X) = k), so T(X) is sufficient for θ.
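
The uniformity of P(x | θ, T(X) = k) can be checked directly by enumeration; in the sketch below (my own, not from the notes), n, k, and the values of θ are arbitrary illustrative choices.

```python
import itertools
from math import comb

# For X1..Xn i.i.d. Bernoulli(theta), the conditional law of the sequence given
# sum = k should be uniform over the C(n, k) arrangements, whatever theta is.
n, k = 5, 2
for theta in (0.2, 0.5, 0.8):
    probs = {s: theta**sum(s) * (1 - theta)**(n - sum(s))
             for s in itertools.product([0, 1], repeat=n) if sum(s) == k}
    total = sum(probs.values())
    cond = sorted(p / total for p in probs.values())
    print(f"theta={theta}: P(x | T=k) in [{cond[0]:.4f}, {cond[-1]:.4f}], "
          f"1/C(n,k) = {1 / comb(n, k):.4f}")
```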

Note: Fisher-Neyman factorization theorem

If the probability density function is f_θ(x), then T is sufficient for θ if and only if functions g and h can be found such that:

f_θ(x) = h(x) · g_θ(T(x))



i.e., the density f can be factored into a product such that one factor, h, does not depend on θ, and the other factor, which does depend on θ, depends on x only through T(x).

Example 2: If X_1, . . . , X_n are independent and uniformly distributed on the interval [0, θ], then T(X) = max(X_1, . . . , X_n) is sufficient for θ: the sample maximum is a sufficient statistic for the population maximum.

To see this, consider the joint probability density function of X = (X_1, . . . , X_n). Because the observations are independent, the pdf can be written as a product of individual densities:

f_X(x_1, . . . , x_n) = (1/θ) 1{0 ≤ x_1 ≤ θ} · · · (1/θ) 1{0 ≤ x_n ≤ θ} (1)

= (1/θ^n) 1{0 ≤ min{x_i}} 1{max{x_i} ≤ θ} (2)

where 1{...} is the indicator function. Thus the density takes the form required by the Fisher-Neyman factorization theorem, where h(x) = 1{min{x_i} ≥ 0}, and the rest of the expression is a function of only θ and T(x) = max{x_i}.
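
A minimal numeric sketch of this factorization (not from the notes; θ, n, and the random seed are arbitrary assumptions): the joint density evaluated at sampled points matches h(x) · g_θ(max(x)).

```python
import numpy as np

rng = np.random.default_rng(0)
theta, n = 2.5, 4                # assumed values for the illustration

def f_joint(x, theta):
    """Joint density of n i.i.d. Uniform[0, theta] observations."""
    x = np.asarray(x)
    return (1 / theta**n) * float(np.all((x >= 0) & (x <= theta)))

def h(x):                        # factor that does not involve theta
    return float(np.min(x) >= 0)

def g(t, theta):                 # factor that sees the data only through T(x) = max
    return (1 / theta**n) * float(t <= theta)

for _ in range(3):
    x = rng.uniform(0, theta, size=n)
    assert np.isclose(f_joint(x, theta), h(x) * g(np.max(x), theta))
print("f_theta(x) == h(x) * g_theta(max(x)) at all sampled points")
```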

FANO’S INEQUALITY

This scenario is viewed as a communication link between a source and a receiver. X is the data transmitted from the source, and Y is the data received at the receiver. The channel is noisy. If we form an estimate X̂ of X from Y, what is P(X ≠ X̂)?

P_e ≡ P(X ≠ X̂) ≥ [H(X|Y) − H(P_e)] / log(N − 1) ≥ [H(X|Y) − 1] / log(N − 1)

N is the size of the outcome set of X. This inequality gives a lower bound on the probability of error when estimating X from Y.

Proof: Let E = 1{X ≠ X̂} be the indicator of an error, so E is a binary random variable with:

P(E = 1) = P_e, P(E = 0) = 1 − P_e, and H(E) = H(P_e)

We have: H(E, X | Y) = H(X | Y) + H(E | X, Y) = H(X | Y) (1) (because knowing X and Y, we know E, so H(E | X, Y) = 0).

Also, we have: H(E, X | Y) = H(E | Y) + H(X | E, Y)

≤ H(E) + H(X | Y, E = 0) · (1 − P_e) + H(X | Y, E = 1) · P_e

= H(P_e) + H(X | Y, E = 1) · P_e (since H(X | Y, E = 0) = 0)

≤ H(P_e) + log(N − 1) · P_e (2) (since H(X | Y, E = 1) is maximized when X is equally likely among the other N − 1 choices)

From (1) and (2), we have: H(X | Y) ≤ H(P_e) + log(N − 1) · P_e

⇒ P_e ≥ [H(X|Y) − H(P_e)] / log(N − 1) ≥ [H(X|Y) − 1] / log(N − 1)

The last inequality holds since H(P_e) ≤ 1.

FANO’S INEQUALITY EXAMPLE

Consider a source X = 1 : 5 with P(X) = [0.35, 0.35, 0.1, 0.1, 0.1]^T, and let Y ∈ {1, 2}: if X ≤ 2 then Y = X with probability 6/7 (and Y equals the other value with probability 1/7), while if X > 2 then Y = 1 or 2 with equal probability.

Our best strategy is to guess X̂ = Y. We now calculate the actual error probability and the bound given by Fano's inequality.



• Actual error of the guess X̂ = Y: P_e = 1 − ∑_{i=1}^{5} P(X = i, Y = i) = 1 − (0.35 · (6/7) + 0.35 · (6/7)) = 0.4

• The lower bound given by Fano's inequality:

We have P(X | Y = 1) = P(X, Y = 1) / P(Y = 1) = P(Y = 1 | X) · P(X) / P(Y = 1)

From this, we have P(X | Y = 1) = [0.6, 0.1, 0.1, 0.1, 0.1]^T and P(X | Y = 2) = [0.1, 0.6, 0.1, 0.1, 0.1]^T.

Therefore,

H(X|Y) = H(X | Y = 1) · P(Y = 1) + H(X | Y = 2) · P(Y = 2)

= 0.5 · (2 · 0.6 · log(1/0.6) + 8 · 0.1 · log(1/0.1)) = 1.771 bits.

Hence, P_e = 0.4 > [H(X|Y) − 1] / log(N − 1) = (1.771 − 1) / log(5 − 1) = 0.3855
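
The numbers above are easy to reproduce; the sketch below is my own code (with my own indexing convention for X and Y) recomputing P_e, H(X|Y), and the Fano bound.

```python
import numpy as np

px = np.array([0.35, 0.35, 0.1, 0.1, 0.1])           # P(X = 1..5)
p_y_given_x = np.zeros((5, 2))                       # columns correspond to Y = 1, 2
p_y_given_x[0] = [6/7, 1/7]
p_y_given_x[1] = [1/7, 6/7]
p_y_given_x[2:] = [0.5, 0.5]

joint = px[:, None] * p_y_given_x                    # P(X = i, Y = j)
py = joint.sum(axis=0)

# Error probability of the guess Xhat = Y: everything off the X == Y "diagonal".
pe = 1 - joint[0, 0] - joint[1, 1]

# H(X|Y) = -sum_{x,y} P(x, y) log P(x | y), in bits.
px_given_y = joint / py
h_x_given_y = -(joint * np.log2(px_given_y)).sum()

bound = (h_x_given_y - 1) / np.log2(5 - 1)
print(f"P_e = {pe:.4f}, H(X|Y) = {h_x_given_y:.4f} bits, Fano bound = {bound:.4f}")
```

Running this prints P_e = 0.4, H(X|Y) ≈ 1.771 bits, and a bound of ≈ 0.3855, matching the calculation above.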

SUMMARY

• Markov X → Y → Z

P(X, Z | Y) = P(X | Y) P(Z | Y)

• Data Processing Theorem

I(X; Y) ≥ I(X; Z)

I(X; Y) ≥ I(X; Y | Z)

• Fano's Inequality

P_e ≥ [H(X|Y) − 1] / log(N − 1)

STRONG AND WEAK TYPICALITY

Consider a random variable X with P(X = A, B, C) = [0.5, 0.25, 0.25]

• Strongly typical sequence: ABAACABCAABC

Correct proportions: P(A) = 6/12; P(B) = 3/12; P(C) = 3/12

H(X) = −0.5 · log(0.5) − 2 · 0.25 · log(0.25) = 1.5 bits

• Weakly typical sequence: BBBBBBBBBBAA

Incorrect proportions: P(A) = 2/12; P(B) = 10/12; P(C) = 0
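
One way to see the difference quantitatively (a sketch of my own, not from the notes) is to compare the sample entropy −(1/n) log p(x^n) of each sequence with H(X) = 1.5 bits; this gap is exactly the quantity used in the typical-set definition later in these notes.

```python
import numpy as np

p = {"A": 0.5, "B": 0.25, "C": 0.25}
h_x = -sum(q * np.log2(q) for q in p.values())       # H(X) = 1.5 bits

for seq in ("ABAACABCAABC", "BBBBBBBBBBAA"):
    sample_entropy = -np.mean([np.log2(p[s]) for s in seq])
    print(f"{seq}: -(1/n) log p(x^n) = {sample_entropy:.3f} bits "
          f"(H(X) = {h_x:.3f}, gap = {abs(sample_entropy - h_x):.3f})")
```

The first sequence has sample entropy exactly 1.5 bits (gap 0), while the second is off by about 0.33 bits.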

CONVERGENCE OF RANDOM VARIABLES

• Almost Sure Convergence:

Pr(lim_{n→∞} X_n = X) = 1

Example: Let X_n = ±2^{−n} with p = [0.5, 0.5]. Then

Pr(lim_{n→∞} X_n = 0) = 1

• Convergence in Probability:

lim_{n→∞} Pr(|X_n − X| ≥ ε) = 0, ∀ε > 0

Example: Consider X_n ∈ {1, 0} with p = [1/2^n, 1 − 1/2^n]. Since Pr(|X_n − 0| ≥ ε) = 1/2^n for any 0 < ε ≤ 1, choosing n > −log(ε) makes this probability smaller than ε, and

lim_{n→∞} Pr(|X_n − 0| ≥ ε) = 0
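
A short simulation of the second example (not from the notes; the value of ε, the random seed, and the number of trials are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
eps, trials = 0.1, 200_000
for n in (1, 3, 5, 8):
    # X_n = 1 with probability 2^-n, otherwise 0.
    xn = (rng.random(trials) < 2.0**-n).astype(float)
    empirical = np.mean(np.abs(xn - 0.0) >= eps)
    print(f"n={n}: Pr(|X_n - 0| >= {eps}) ~ {empirical:.5f} (exact value {2.0**-n:.5f})")
```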



WEAK LAW OF LARGE NUMBERS

Given i.i.d. X_i, let S_n = (1/n) · ∑_{i=1}^{n} X_i. Then

E[S_n] = E[X] = μ, Var(S_n) = (1/n) · Var(X) = σ²/n

As n increases, the variance of S_n decreases, i.e., the values of S_n become clustered around the mean:

S_n → μ in probability; P(|S_n − μ| ≥ ε) → 0 as n → ∞

PROOF OF WEAK LAW OF LARGE NUMBERS

• Chebyshev's Inequality: P(|X − μ| ≥ ε) ≤ σ²/ε²

Proof: σ² = E[(X − μ)²] = ∑_{∀x} P(x) · (x − μ)² ≥ ∑_{|x−μ|≥ε} P(x) · (x − μ)² ≥ ε² · ∑_{|x−μ|≥ε} P(x)

⇒ P(|X − μ| ≥ ε) ≤ σ²/ε²

• WLLN

Applying Chebyshev's inequality to S_n: P(|S_n − μ| ≥ ε) ≤ σ²/(n · ε²) → 0 as n → ∞
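
A Monte Carlo sketch of both statements (my own; the Uniform[0,1] distribution, ε, the seed, and the sample sizes are arbitrary choices): the empirical probability P(|S_n − μ| ≥ ε) falls as n grows and stays below the Chebyshev bound σ²/(nε²).

```python
import numpy as np

rng = np.random.default_rng(2)
mu, var = 0.5, 1 / 12            # mean and variance of a Uniform[0, 1] sample
eps, trials = 0.05, 5_000
for n in (10, 100, 1000):
    s_n = rng.random((trials, n)).mean(axis=1)       # 'trials' independent copies of S_n
    empirical = np.mean(np.abs(s_n - mu) >= eps)
    chebyshev = var / (n * eps**2)
    print(f"n={n:4d}: P(|S_n - mu| >= {eps}) ~ {empirical:.4f}  "
          f"(Chebyshev bound {chebyshev:.4f})")
```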

TYPICAL SET

Let x^n be an i.i.d. sequence of X_i for i = 1 to n. Definition of the typical set:

T_ε^n = { x^n ∈ X^n : | −(1/n) · log p(x^n) − H(X) | < ε } with H(X) = H(X_i)
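
For a small alphabet the typical set can be enumerated exhaustively; in the sketch below (my own illustration, not the lecture's example) n, p, and ε are arbitrary choices, and membership is just the defining condition |−(1/n) log p(x^n) − H(X)| < ε.

```python
import itertools
import numpy as np

n, p, eps = 10, 0.3, 0.1                             # assumed illustrative values
h = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))     # H(X) for Bernoulli(p)

typical_prob, typical_count = 0.0, 0
for seq in itertools.product([0, 1], repeat=n):
    k = sum(seq)
    prob = p**k * (1 - p)**(n - k)
    sample_entropy = -np.log2(prob) / n
    if abs(sample_entropy - h) < eps:                # membership in T_eps^n
        typical_prob += prob
        typical_count += 1

print(f"H(X) = {h:.3f} bits; |T_eps^n| = {typical_count} of {2**n} sequences, "
      f"carrying total probability {typical_prob:.3f}")
```

For such a small n the typical set carries only a modest fraction of the probability; the familiar high-probability guarantee is asymptotic in n.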

EXAMPLE OF TYPICAL SET

Consider a Bernoulli distribution with parameter p
