Scribe 9 - Classes
ECE566: Information Theory - Fall 2011
Lecture 6+7 - Date: 10/24/2011
Scribe: Duong Nguyen-Huu
LONG MARKOV CHAIN
For a Markov chain X_1 → X_2 → X_3 → ··· → X_n, the mutual information with X_1 decreases as you move further along the chain:
I(X_1; X_2) ≥ I(X_1; X_3) ≥ I(X_1; X_4) ≥ ··· ≥ I(X_1; X_n)
Proof:
Given the Markov chain X_1 → X_2 → X_3 → X_4 (1), we first show that X_1 → X_3 → X_4 (2) also forms a Markov chain:
P(X_1, X_4 | X_3) = P(X_1 | X_3) · P(X_4 | X_1, X_3) = P(X_1 | X_3) · P(X_4 | X_3)
Applying the Data Processing Theorem to chains (1) and (2):
I(X_1; X_2) ≥ I(X_1; X_3) ≥ I(X_1; X_4)
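The decreasing mutual information can be checked numerically. The sketch below builds a hypothetical binary chain (a uniform source passed through two cascaded binary symmetric channels with crossover 0.1 — an illustrative choice, not from the notes) and computes I(X_1; X_2) and I(X_1; X_3):

```python
import numpy as np

def mutual_info(joint):
    """Mutual information (bits) of a 2-D joint distribution."""
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float((joint[mask] * np.log2(joint[mask] / (px @ py)[mask])).sum())

# Hypothetical chain X1 -> X2 -> X3: each step is a binary symmetric
# channel with crossover probability 0.1.
bsc = np.array([[0.9, 0.1], [0.1, 0.9]])   # rows: P(next | current)
p1 = np.array([0.5, 0.5])                   # uniform source X1

joint12 = np.diag(p1) @ bsc                 # P(X1, X2)
joint13 = np.diag(p1) @ (bsc @ bsc)         # P(X1, X3): cascade of two BSCs

i12 = mutual_info(joint12)
i13 = mutual_info(joint13)
print(i12, i13)
assert i12 >= i13                            # data processing inequality
```

Here I(X_1; X_2) = 1 − H(0.1) ≈ 0.531 bits, while the cascade has effective crossover 0.18, giving I(X_1; X_3) = 1 − H(0.18) ≈ 0.320 bits, consistent with the inequality above.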
SUFFICIENT STATISTICS
• If the probability density function (pdf) of X depends on an unknown parameter θ and you extract a statistic T(X) from your observation, then θ → X → T(X) is a Markov chain, so I(θ; T(X)) ≤ I(θ; X).
• T(X) is sufficient for θ if the stronger condition θ → X → T(X) → θ holds, which is equivalent to each of:
⇔ I(θ; T(X)) = I(θ; X) (1) (T(X) carries as much information about θ as X does)
⇔ θ → T(X) → X → θ (2) (reverse chain)
⇔ P(X | T(X), θ) = P(X | T(X)) (3) (Markov property of (2))
Proof of (1): Applying the data processing theorem to the chain and to the reverse chain, we get
I(θ; X) ≥ I(θ; T(X)) and I(θ; T(X)) ≥ I(θ; X) ⇒ equality.
EXAMPLES OF SUFFICIENT STATISTICS
Example 1: Consider X ~ Bernoulli(θ) and
T(X) = ∑_{i=1}^{n} x_i with x_i ∈ {0, 1}.
We have P(x | θ, T(X) = k) = 1/C(n,k), independent of θ,
⇒ P(x | θ, T(X) = k) = P(x | T(X) = k), so T(X) is sufficient for θ.
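The claim P(x | θ, T(X) = k) = 1/C(n,k) can be verified directly by computing P(x)/P(T = k); the sequence and θ values below are illustrative choices:

```python
from math import comb

def cond_prob(x, theta):
    """P(sequence x | sum(x) = k) under i.i.d. Bernoulli(theta) draws."""
    n, k = len(x), sum(x)
    p_x = theta**k * (1 - theta)**(n - k)   # P(x), depends on theta
    p_T = comb(n, k) * p_x                  # P(T = k), binomial probability
    return p_x / p_T                        # theta cancels: = 1 / C(n, k)

x = [1, 0, 1, 1, 0]                         # n = 5, k = 3
print(cond_prob(x, 0.3), cond_prob(x, 0.8), 1 / comb(5, 3))
```

The θ-dependent factor cancels in the ratio, so the conditional probability is 1/C(5,3) = 0.1 regardless of θ.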
Note: Fisher-Neyman factorization theorem
If the probability density function is f_θ(x), then T is sufficient for θ if and only if functions g and h can be found such that
f_θ(x) = h(x) · g_θ(T(x)),
i.e. the density f can be factored into a product such that one factor, h, does not depend on θ, and the other factor, which does depend on θ, depends on x only through T(x).
Example 2: If X_1, ..., X_n are independent and uniformly distributed on the interval [0, θ], then T(X) = max(X_1, ..., X_n) is sufficient for θ: the sample maximum is a sufficient statistic for the population maximum.
To see this, consider the joint probability density function of X = (X_1, ..., X_n). Because the observations are independent, the pdf can be written as a product of individual densities:
f_X(x_1, ..., x_n) = (1/θ) 1{0 ≤ x_1 ≤ θ} ··· (1/θ) 1{0 ≤ x_n ≤ θ}   (1)
= (1/θ^n) 1{0 ≤ min{x_i}} · 1{max{x_i} ≤ θ}   (2)
where 1{...} is the indicator function. Thus the density takes the form required by the Fisher-Neyman factorization theorem, with h(x) = 1{min{x_i} ≥ 0}, and the rest of the expression a function of θ and T(x) = max{x_i} only.
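The factorization in (2) can be checked numerically: the sketch below (sample values and θ grid are illustrative) compares the joint density against h(x) · g_θ(max x_i) with g_θ(t) = θ^{-n} 1{t ≤ θ}:

```python
def joint_density(x, theta):
    """Product of the n individual Uniform[0, theta] densities."""
    out = 1.0
    for xi in x:
        out *= (1.0 / theta) if 0 <= xi <= theta else 0.0
    return out

def h(x):
    """theta-free factor: 1{min x_i >= 0}."""
    return 1.0 if min(x) >= 0 else 0.0

def g(t, theta, n):
    """theta-dependent factor, a function of T(x) = max x_i only."""
    return (1.0 if t <= theta else 0.0) / theta**n

x = [0.4, 1.1, 0.7]
for theta in (1.0, 1.2, 2.0, 5.0):
    lhs = joint_density(x, theta)
    rhs = h(x) * g(max(x), theta, len(x))
    assert abs(lhs - rhs) < 1e-12        # factorization holds
print("Fisher-Neyman factorization verified for all theta tested")
```

Note that for θ = 1.0 both sides are zero (since max x_i = 1.1 > θ), and for larger θ both equal θ^{-3}, so the density indeed depends on x only through min and max.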
FANO'S INEQUALITY
This scenario is viewed as a communication link between a source and a receiver. X is the data transmitted from the source, and Y is the data received at the receiver. The channel is noisy. If we estimate X from Y, what is P(X ≠ X̂)?
P_e ≡ P(X ≠ X̂) ≥ (H(X|Y) − H(P_e)) / log(N − 1) ≥ (H(X|Y) − 1) / log(N − 1)
N is the size of the outcome set of X. This inequality gives a lower bound on the error of estimating X from Y.
Proof: Let E = 1(X ≠ X̂) be an indicator function, so E is a binary random variable with
P(E = 1) = P_e, P(E = 0) = 1 − P_e, and H(E) = H(P_e).
We have: H(E, X | Y) = H(X|Y) + H(E | X, Y) = H(X|Y)   (1)   (because X̂ is a function of Y, so knowing X and Y determines E).
Also: H(E, X | Y) = H(E|Y) + H(X | E, Y)
≤ H(E) + H(X | Y, E = 0) · (1 − P_e) + H(X | Y, E = 1) · P_e
= H(P_e) + H(X | Y, E = 1) · P_e   (since H(X | Y, E = 0) = 0)
≤ H(P_e) + log(N − 1) · P_e   (2)   (since H(X | Y, E = 1) is maximized when X is equally likely among the other N − 1 choices)
From (1) and (2): H(X|Y) ≤ H(P_e) + log(N − 1) · P_e
⇒ P_e ≥ (H(X|Y) − H(P_e)) / log(N − 1) ≥ (H(X|Y) − 1) / log(N − 1)
The last inequality holds since H(P_e) ≤ 1.
FANO'S INEQUALITY EXAMPLE
Consider a source X ∈ {1, ..., 5} with P(X) = [0.35, 0.35, 0.1, 0.1, 0.1]^T, and let Y ∈ {1, 2}. If X ≤ 2 then Y = X with probability 6/7, while if X > 2 then Y = 1 or 2 with equal probability.
Our best strategy is to guess X̂ = Y. We now calculate the actual error probability and the bound given by Fano's inequality.
• Actual error of the guess: P_e = 1 − P(X̂ = X) = 1 − (0.35 · (6/7) + 0.35 · (6/7)) = 0.4
• The lower bound given by Fano's inequality:
We have P(X | Y = 1) = P(X, Y = 1) / P(Y = 1) = P(Y = 1 | X) · P(X) / P(Y = 1)
From this, we have P(X | y = 1) = [0.6, 0.1, 0.1, 0.1, 0.1]^T and P(X | y = 2) = [0.1, 0.6, 0.1, 0.1, 0.1]^T.
Therefore,
H(X|Y) = H(X | Y = 1) · P(Y = 1) + H(X | Y = 2) · P(Y = 2)
= 0.5 · (2 · 0.6 · log(1/0.6) + 8 · 0.1 · log(1/0.1)) = 1.771 bits.
Hence, P_e = 0.4 > (H(X|Y) − 1) / log(N − 1) = (1.771 − 1) / log(5 − 1) = 0.3855.
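The arithmetic above can be reproduced in a few lines (logs are base 2, as throughout the example):

```python
from math import log2

def H(p):
    """Entropy in bits of a discrete distribution."""
    return -sum(q * log2(q) for q in p if q > 0)

# Conditional distributions derived in the example; P(Y=1) = P(Y=2) = 0.5.
p_x_given_y1 = [0.6, 0.1, 0.1, 0.1, 0.1]
p_x_given_y2 = [0.1, 0.6, 0.1, 0.1, 0.1]
H_X_given_Y = 0.5 * H(p_x_given_y1) + 0.5 * H(p_x_given_y2)

pe_actual = 1 - (0.35 * 6/7 + 0.35 * 6/7)       # error of guessing x_hat = y
fano_bound = (H_X_given_Y - 1) / log2(5 - 1)    # (H(X|Y) - 1) / log(N - 1)

print(H_X_given_Y, pe_actual, fano_bound)
assert pe_actual >= fano_bound
```

Running this gives H(X|Y) ≈ 1.771 bits, P_e = 0.4, and a Fano bound of ≈ 0.3855, matching the hand calculation.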
SUMMARY
• Markov chain X → Y → Z:
P(X, Z | Y) = P(X|Y) · P(Z|Y)
• Data Processing Theorem:
I(X; Y) ≥ I(X; Z)
I(X; Y) ≥ I(X; Y | Z)
• Fano's Inequality:
P_e ≥ (H(X|Y) − 1) / log(N − 1)
STRONG AND WEAK TYPICALITY
Consider a random variable X with P(X = A, B, C) = [0.5, 0.25, 0.25].
• Strongly typical sequence: ABAACABCAABC
Correct proportions: P(A) = 6/12, P(B) = 3/12, P(C) = 3/12
H(X) = −0.5 · log(0.5) − 2 · 0.25 · log(0.25) = 1.5 bits
• Weakly typical sequence: BBBBBBBBBBAA
Incorrect proportions: P(A) = 2/12, P(B) = 10/12, P(C) = 0
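Typicality compares the per-symbol log-likelihood −(1/n) log p(x^n) with H(X). The short computation below, using the two sequences above, shows that the strongly typical sequence attains H(X) = 1.5 bits exactly, while the second sequence's value (22/12 ≈ 1.83 bits) deviates from it:

```python
from math import log2

p = {"A": 0.5, "B": 0.25, "C": 0.25}
H = -sum(q * log2(q) for q in p.values())       # H(X) = 1.5 bits

def per_symbol_loglik(seq):
    """-(1/n) * log2 p(x^n), the quantity typicality compares to H(X)."""
    return -sum(log2(p[s]) for s in seq) / len(seq)

strong = "ABAACABCAABC"   # proportions exactly match p -> equals H(X)
weak = "BBBBBBBBBBAA"     # proportions are off -> deviates from H(X)
print(per_symbol_loglik(strong), per_symbol_loglik(weak), H)
```

Because the first sequence has exactly the source proportions, its per-symbol log-likelihood is exactly H(X); a sequence with skewed proportions can still be weakly typical when its log-likelihood happens to be close to H(X), even though it is not strongly typical.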
CONVERGENCE OF RANDOM VARIABLES
• Almost Sure Convergence:
Pr(lim_{n→∞} X_n = X) = 1
Example: Let X_n = ±2^{−n} with p = [0.5, 0.5]. Then
Pr(lim_{n→∞} X_n = 0) = 1
• Convergence in Probability:
lim_{n→∞} Pr(|X_n − X| ≥ ε) = 0, ∀ε > 0
Example: Consider X_n ∈ {1, 0} with p = [1/2^n, 1 − 1/2^n]. For any ε ∈ (0, 1], Pr(|X_n − 0| ≥ ε) = 2^{−n}, which falls below ε once n > −log(ε). Therefore
lim_{n→∞} Pr(|X_n − 0| ≥ ε) = 0
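The threshold in the convergence-in-probability example can be checked numerically (logs base 2, matching the 2^{−n} probabilities; ε = 0.01 is an illustrative choice):

```python
from math import log2

# X_n takes value 1 with probability 2**-n and 0 otherwise, so for any
# 0 < eps <= 1 the tail probability is P(|X_n - 0| >= eps) = 2**-n.
eps = 0.01
for n in range(1, 30):
    tail = 2.0 ** -n                  # P(|X_n| >= eps)
    if n > -log2(eps):                # threshold: -log2(0.01) ~ 6.64
        assert tail < eps             # past the threshold, tail < eps
print("tail at n = 10:", 2.0 ** -10)  # 0.0009765625
```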
WEAK LAW OF LARGE NUMBERS
Given i.i.d. X_i, let S_n = (1/n) · ∑_{i=1}^{n} X_i. Then
E[S_n] = E[X] = µ, Var(S_n) = (1/n) · Var(X) = σ²/n
As n increases, the variance of S_n decreases, i.e., the values of S_n become clustered around the mean:
S_n → µ in probability: P(|S_n − µ| ≥ ε) → 0 as n → ∞
PROOF OF WEAK LAW OF LARGE NUMBERS
• Chebyshev's Inequality: P(|X − µ| ≥ ε) ≤ σ²/ε²
Proof: σ² = E[(X − µ)²] = ∑_{∀x} P(x) · (x − µ)² ≥ ∑_{|x−µ|≥ε} P(x) · (x − µ)² ≥ ε² · ∑_{|x−µ|≥ε} P(x)
⇒ P(|X − µ| ≥ ε) ≤ σ²/ε²
• WLLN:
Applying Chebyshev to S_n: P(|S_n − µ| ≥ ε) ≤ σ²/(n · ε²) → 0 as n → ∞
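The Chebyshev bound on P(|S_n − µ| ≥ ε) can be compared with the empirical deviation frequency by simulation. The sketch below uses i.i.d. Uniform(0,1) samples (µ = 0.5, σ² = 1/12) with illustrative choices of ε, n, and trial count:

```python
import random

random.seed(0)
mu, var, eps, trials = 0.5, 1 / 12, 0.1, 2000

results = {}
for n in (10, 100, 1000):
    # Empirical frequency of the event |S_n - mu| >= eps.
    deviations = sum(
        abs(sum(random.random() for _ in range(n)) / n - mu) >= eps
        for _ in range(trials)
    )
    bound = var / (n * eps * eps)     # Chebyshev: sigma^2 / (n * eps^2)
    results[n] = (deviations / trials, bound)
    print(n, results[n])
```

The empirical frequencies stay below the Chebyshev bound for each n, and both shrink toward 0 as n grows, illustrating the 1/n decay in the proof.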
TYPICAL SET
Let x^n be an i.i.d. sequence of X_i for i = 1 to n. Definition of the typical set:
T_ε^n = { x^n ∈ X^n : |−(1/n) · log p(x^n) − H(X)| < ε }, with H(X) = H(X_i)
EXAMPLE OF TYPICAL SET
Consider a Bernoulli distribution with p