
The Monotonicity of Information in the Central Limit Theorem and ...

For clarity of presentation of ideas, we focus first on the i.i.d. case, beginning with Fisher information.

Proposition I (Monotonicity of Fisher Information): For i.i.d. random variables with absolutely continuous densities,
$$I(Y_n) \le I(Y_{n-1}), \tag{24}$$
with equality iff $X_1$ is normal or $I(Y_n) = \infty$.

Proof: We use the following notational conventions: the (unnormalized) sum is $V_n = \sum_{i \in [n]} X_i$, the leave-one-out sum leaving out $X_j$ is $V^{(j)} = \sum_{i \ne j} X_i$, and the normalized leave-one-out sum is $Y^{(j)} = \frac{1}{\sqrt{n-1}} \sum_{i \ne j} X_i$. If $X' = aX$, then $\rho_{X'}(X') = \frac{1}{a}\rho_X(X)$; hence
$$\rho_{Y_n}(Y_n) = \sqrt{n}\,\rho_{V_n}(V_n) \overset{(a)}{=} \sqrt{n}\,E\big[\rho_{V^{(j)}}(V^{(j)}) \mid V_n\big] = \sqrt{\tfrac{n}{n-1}}\,E\big[\rho_{Y^{(j)}}(Y^{(j)}) \mid V_n\big] \overset{(b)}{=} \frac{1}{\sqrt{n(n-1)}} \sum_{j=1}^{n} E\big[\rho_{Y^{(j)}}(Y^{(j)}) \mid Y_n\big]. \tag{25}$$
Here, (a) follows from the application of Lemma I to $V_n = V^{(j)} + X_j$, keeping in mind that $Y_{n-1}$ (hence $V^{(j)}$) has an absolutely continuous density, while (b) follows from symmetry. Set $\rho_j = \rho_{Y^{(j)}}(Y^{(j)})$; then we have
$$\rho_{Y_n}(Y_n) = E\bigg[\frac{1}{\sqrt{n(n-1)}} \sum_{j=1}^{n} \rho_j \,\bigg|\, Y_n\bigg]. \tag{26}$$
Since the length of a vector is not less than the length of its projection (i.e., by the Cauchy–Schwarz inequality),
$$I(Y_n) = E\big[\rho_{Y_n}(Y_n)^2\big] \le \frac{1}{n(n-1)}\, E\bigg[\bigg(\sum_{j=1}^{n} \rho_j\bigg)^{\!2}\bigg]. \tag{27}$$
Lemma II yields
$$E\bigg[\bigg(\sum_{j=1}^{n} \rho_j\bigg)^{\!2}\bigg] \le (n-1) \sum_{j \in [n]} E\big[\rho_j^2\big] = (n-1)\,n\,I(Y_{n-1}), \tag{28}$$
which gives the inequality of Proposition I on substitution into (27). The inequality implied by Lemma II can be tight only if each $\rho_j$ is an additive function, but we already know that $\rho_j$ is a function of the sum. The only functions that are both additive and functions of the sum are linear functions of the sum; hence the two sides of (24) can be finite and equal only if the score $\rho_j$ is linear, i.e., if all the $X_i$ are normal. It is trivial to check that $X_1$ normal or $I(Y_n) = \infty$ implies equality.
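As a numerical illustration of Proposition I (not part of the original argument), one can take $X_1$ to be a two-component Gaussian mixture: then each standardized sum $Y_n$ is again a Gaussian mixture with binomial weights, so its density and score are available in closed form and $I(Y_n) = \int f_n'(x)^2 / f_n(x)\,dx$ can be evaluated by quadrature. The sketch below, in Python, uses the illustrative choice $X_1 \sim \frac{1}{2}N(-1,1) + \frac{1}{2}N(1,1)$ (the function names, grid range, and step count are assumptions of this example, not from the paper); since $X_1$ is non-normal, the computed values should decrease strictly toward $1/\mathrm{Var}(X_1) = 1/2$.

```python
import math

def phi(x):
    """Standard normal density."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def mixture(n):
    """Means and weights of the Gaussian mixture representing Y_n.

    For X_1 ~ (1/2)N(-1,1) + (1/2)N(1,1), the sum V_n is a binomial
    mixture of N(2k - n, n) components, so Y_n = V_n / sqrt(n) is a
    mixture of N((2k - n)/sqrt(n), 1) with weights C(n, k) / 2^n.
    """
    return [((2 * k - n) / math.sqrt(n), math.comb(n, k) / 2 ** n)
            for k in range(n + 1)]

def fisher_information(n, lo=-12.0, hi=12.0, steps=24000):
    """Trapezoidal estimate of I(Y_n) = integral of f'(x)^2 / f(x) dx."""
    comps = mixture(n)
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        x = lo + i * h
        f = sum(w * phi(x - m) for m, w in comps)           # density f_n(x)
        df = sum(-w * (x - m) * phi(x - m) for m, w in comps)  # derivative f_n'(x)
        weight = 0.5 if i in (0, steps) else 1.0            # trapezoid endpoints
        total += weight * df * df / f
    return total * h

I1, I2, I3 = (fisher_information(n) for n in (1, 2, 3))
print(I1, I2, I3)  # strictly decreasing toward 1/Var(X_1) = 1/2
```

Note that this check sidesteps the projection identity of Lemma I entirely: because the mixture densities are explicit, the Fisher integrals are computed directly, and the strict decrease is exactly the non-normal case of (24).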
We can now prove the monotonicity result for entropy in the i.i.d. case.

Theorem I: Suppose the $X_i$ are i.i.d. random variables with densities, and suppose $X_1$ has mean $0$ and finite variance. Then, with
$$Y_n = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} X_i, \tag{29}$$
we have
$$H(Y_n) \ge H(Y_{n-1}). \tag{30}$$
The two sides are finite and equal iff $X_1$ is normal.

Proof: Recall the integral form of the de Bruijn identity, which is now a standard method to "lift" results from Fisher divergence to relative entropy. This identity was first stated in its differential form by Stam [2] (and attributed by him to de Bruijn), and proved in its integral form by Barron [6]: if $X_t$ is equal in distribution to $X + \sqrt{t}\,Z$, where $Z$ is normally distributed and independent of $X$, then
$$H(X) = \frac{1}{2}\log(2\pi e v) - \frac{1}{2} \int_0^{\infty} \bigg[I(X_t) - \frac{1}{v+t}\bigg]\,dt \tag{31}$$
is valid in the case that the variances of $Z$ and $X$ are both $v$. This has the advantage of positivity of the integrand, but the disadvantage that it seems to depend on $v$. One can use
$$\log v = \int_0^{\infty} \bigg[\frac{1}{1+t} - \frac{1}{v+t}\bigg]\,dt \tag{32}$$
to re-express it in the form
$$H(X) = \frac{1}{2}\log(2\pi e) - \frac{1}{2} \int_0^{\infty} \bigg[I(X_t) - \frac{1}{1+t}\bigg]\,dt. \tag{33}$$
Combining this with Proposition I finishes the proof.

IV. EXTENSIONS

For the case of independent, non-identically distributed (i.n.i.d.) summands, we need a general version of the "variance drop" lemma.

Lemma III (Variance Drop: General Version): Suppose we are given a class of functions $\psi^{(s)} : \mathbb{R}^{|s|} \to \mathbb{R}$ for each $s \in \Omega_m$, with $E\psi^{(s)}(X_s) = 0$ for each $s$. Let $w$ be any probability distribution on $\Omega_m$. Define
$$U(X_1, \ldots, X_n) = \sum_{s \in \Omega_m} w_s\, \psi^{(s)}(X_s), \tag{34}$$
where we write $\psi^{(s)}(X_s)$ for a function of $X_s$. Then
$$EU^2 \le \binom{n-1}{m-1} \sum_{s \in \Omega_m} w_s^2\, E\big[\psi^{(s)}(X_s)\big]^2, \tag{35}$$
and equality can hold only if each $\psi^{(s)}$ is an additive function (in the sense defined earlier).

Remark: When $\psi^{(s)} = \psi$ (i.e., all the $\psi^{(s)}$ are the same), $\psi$ is symmetric in its arguments, and $w$ is uniform, then $U$ defined above is a U-statistic of degree $m$ with symmetric, mean-zero kernel $\psi$. Lemma III then becomes the well-known bound for the variance of a U-statistic shown by Hoeffding [12], namely $EU^2 \le \frac{m}{n} E\psi^2$. This gives our core inequality (7).

Proposition II: Let $\{X_i\}$ be independent random variables with densities and finite variances. Define
$$T_n = \sum_{i \in [n]} X_i \quad \text{and} \quad T_m^{(s)} = T^{(s)} = \sum_{i \in s} X_i, \tag{36}$$
where $s \in \Omega_m = \{s \subset [n] : |s| = m\}$. Let $w$ be any probability distribution on $\Omega_m$. If each $T_m^{(s)}$ has an absolutely continuous density, then
$$I(T_n) \le \binom{n-1}{m-1} \sum_{s \in \Omega_m} w_s^2\, I\big(T_m^{(s)}\big), \tag{37}$$
where $w_s = w(\{s\})$. Both sides can be finite and equal only if each $X_i$ is normal.

Proof: In the sequel, for convenience, we abuse notation by using $\rho$ to denote several different score functions; $\rho(Y)$ always means $\rho_Y(Y)$. For each $s$, Lemma I and the fact that $T_m^{(s)}$ has an absolutely continuous density imply
$$\rho(T_n) = E\bigg[\rho\bigg(\sum_{i \in s} X_i\bigg) \,\bigg|\, T_n\bigg]. \tag{38}$$
Taking a convex combination of these identities gives, for any $\{w_s\}$ such that $\sum_{s \in \Omega_m} w_s = 1$,
$$\rho(T_n) = \sum_{s \in \Omega_m} w_s\, E\bigg[\rho\bigg(\sum_{i \in s} X_i\bigg) \,\bigg|\, T_n\bigg] = E\bigg[\sum_{s \in \Omega_m} w_s\, \rho(T^{(s)}) \,\bigg|\, T_n\bigg]. \tag{39}$$
By applying the Cauchy–Schwarz inequality and Lemma III in succession, we get
$$I(T_n) \le E\bigg[\bigg(\sum_{s \in \Omega_m} w_s\, \rho(T^{(s)})\bigg)^{\!2}\bigg] \le \binom{n-1}{m-1} \sum_{s \in \Omega_m} E\big[w_s\, \rho(T^{(s)})\big]^2 = \binom{n-1}{m-1} \sum_{s \in \Omega_m} w_s^2\, I(T^{(s)}). \tag{40}$$
The application of Lemma III can yield equality only if each $\rho(T^{(s)})$ is additive; since the score $\rho(T^{(s)})$ is already a function of the sum $T^{(s)}$, it must in fact be a linear function, so that each $X_i$ must be normal.

APPENDIX

A. Proof of Fact I

$$\sum_{s \in \Omega_m} \bar{a}_s^2 = \sum_{s \in \Omega_m} \sum_{i \in s} a_i^2 = \sum_{i \in [n]} \sum_{s \ni i,\, |s| = m} a_i^2 = \binom{n-1}{m-1} \sum_{i \in [n]} a_i^2. \tag{41}$$

B. Proof of Fact II

Let $E_s$ denote the integrating out of the variables in $s$, so that $E_j = E_{\{j\}}$.
Keeping in mind that the order of integrating out independent variables does not matter (i.e., the $E_j$ are commuting projection operators in $L^2$), we can write
$$\phi = \prod_{j=1}^{n} \big[E_j + (I - E_j)\big]\,\phi = \sum_{s \subset [n]} \prod_{j \notin s} E_j \prod_{j \in s} (I - E_j)\,\phi = \sum_{s \subset [n]} \phi_s, \tag{42}$$
where
$$\phi_s = \bar{E}_s \phi \equiv E_{s^c} \prod_{j \in s} (I - E_j)\,\phi. \tag{43}$$
In order to show that the subspaces $H_s$ are orthogonal, observe that for any distinct $s_1$ and $s_2$ there is at least one $j$ such that one of $H_{s_1}$ and $H_{s_2}$ is contained in the image of $E_j$ while the other is contained in the image of $(I - E_j)$; hence every vector in $H_{s_1}$ is orthogonal to every vector in $H_{s_2}$.

REFERENCES

[1] C. Shannon, "A mathematical theory of communication," Bell System Tech. J., vol. 27, pp. 379–423, 623–656, 1948.
[2] A. Stam, "Some inequalities satisfied by the quantities of information of Fisher and Shannon," Information and Control, vol. 2, pp. 101–112, 1959.
[3] S. Artstein, K. M. Ball, F. Barthe, and A. Naor, "Solution of Shannon's problem on the monotonicity of entropy," J. Amer. Math. Soc., vol. 17, no. 4, pp. 975–982 (electronic), 2004.
[4] M. Madiman and A. Barron, "Generalized entropy power inequalities and monotonicity properties of information," submitted, 2006.
[5] R. Shimizu, "On Fisher's amount of information for location family," in Statistical Distributions in Scientific Work, G. P. Patil et al., Eds. Reidel, 1975, vol. 3, pp. 305–312.
[6] A. Barron, "Entropy and the central limit theorem," Ann. Probab., vol. 14, pp. 336–342, 1986.
[7] O. Johnson and A. Barron, "Fisher information inequalities and the central limit theorem," Probab. Theory Related Fields, vol. 129, no. 3, pp. 391–409, 2004.
[8] D. Shlyakhtenko, "A free analogue of Shannon's problem on monotonicity of entropy," preprint, 2005. [Online]. Available: arXiv:math.OA/0510103.
[9] H. Schultz, "Semicircularity, gaussianity and monotonicity of entropy," preprint, 2005. [Online]. Available: arXiv:math.OA/0512492.
[10] A. Dembo, T. Cover, and J. Thomas, "Information-theoretic inequalities," IEEE Trans. Inform. Theory, vol. 37, no. 6, pp. 1501–1518, 1991.
[11] N. Blachman, "The convolution inequality for entropy powers," IEEE Trans. Inform. Theory, vol. IT-11, pp. 267–271, 1965.
[12] W. Hoeffding, "A class of statistics with asymptotically normal distribution," Ann. Math. Statist., vol. 19, no. 3, pp. 293–325, 1948.
[13] R. von Mises, "On the asymptotic distribution of differentiable statistical functions," Ann. Math. Statist., vol. 18, no. 3, pp. 309–348, 1947.
[14] B. Kurkjian and M. Zelen, "A calculus for factorial arrangements," Ann. Math. Statist., vol. 33, pp. 600–619, 1962.
[15] R. L. Jacobsen, "Linear algebra and ANOVA," Ph.D. dissertation, Cornell University, 1968.
[16] H. Rubin and R. A. Vitale, "Asymptotic distribution of symmetric statistics," Ann. Statist., vol. 8, no. 1, pp. 165–170, 1980.
[17] B. Efron and C. Stein, "The jackknife estimate of variance," Ann. Statist., vol. 9, no. 3, pp. 586–596, 1981.
[18] A. Takemura, "Tensor analysis of ANOVA decomposition," J. Amer. Statist. Assoc., vol. 78, no. 384, pp. 894–900, 1983.
