
The Monotonicity of Information in the Central Limit Theorem and ...

information and entropy (discussed in Section II) are familiar in past work on entropy power inequalities. An additional trick is needed to obtain the denominators of $n-1$ and $\binom{n-1}{m-1}$ in (2) and (3) respectively. This is a simple variance drop inequality for statistics expressible via sums of functions of $m$ out of $n$ variables, which is familiar in other statistical contexts (as we shall discuss). It was first used for information inequality development in ABBN [3]. The variational characterization of Fisher information that is an essential ingredient of ABBN [3] is not needed in our proofs. For clarity of presentation, we find it convenient to first outline the proof of (6) for i.i.d. random variables. Thus, in Section III, we establish the monotonicity result (6) in a simple and revealing manner that boils down to the geometry of projections (conditional expectations). Whereas ABBN [3] requires that $X_1$ have a $C^2$ density for monotonicity of Fisher divergence, absolute continuity of the density suffices in our approach. Furthermore, whereas the recent preprint of Shlyakhtenko [8] proves the analogue of the monotonicity fact for non-commutative or "free" probability theory, his method implies a proof for the classical case only assuming finiteness of all moments, while our direct proof requires only finite variance assumptions. Our proof also reveals in a simple manner the cases of equality in (6) (cf. Schultz [9]). Although we do not write it out for brevity, the monotonicity of entropy for standardized sums of $d$-dimensional random vectors has an identical proof.

We recall that for a random variable $X$ with density $f$, the entropy is $H(X) = -E[\log f(X)]$.
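To make the definition $H(X) = -E[\log f(X)]$ concrete, the following small Python sketch (our illustration, not part of the paper) estimates the entropy of a Gaussian by Monte Carlo averaging of $-\log f$ and compares it with the closed form $\frac{1}{2}\log(2\pi e \sigma^2)$:

```python
import math
import random

def gaussian_entropy_mc(sigma=1.0, n_samples=200_000, seed=0):
    """Monte Carlo estimate of H(X) = -E[log f(X)] for X ~ N(0, sigma^2)."""
    rng = random.Random(seed)
    log_norm = -0.5 * math.log(2 * math.pi * sigma**2)
    total = 0.0
    for _ in range(n_samples):
        x = rng.gauss(0.0, sigma)
        log_f = log_norm - x * x / (2 * sigma**2)  # log density at the sample
        total -= log_f
    return total / n_samples

# Closed form for a Gaussian: H = (1/2) log(2*pi*e*sigma^2)
exact = 0.5 * math.log(2 * math.pi * math.e)
estimate = gaussian_entropy_mc()
print(estimate, exact)  # the estimate matches the closed form to about two decimals
```

The Gaussian is a convenient test case because it maximizes entropy for a given variance, which is also why it appears as the limit in the monotonicity results below.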
For a differentiable density, the score function is $\rho_X(x) = \frac{\partial}{\partial x} \log f(x)$, and the Fisher information is $I(X) = E[\rho_X^2(X)]$. They are linked by an integral form of the de Bruijn identity due to Barron [6], which permits certain convolution inequalities for $I$ to translate into corresponding inequalities for $H$.

Underlying our inequalities is the demonstration, for independent, not necessarily identically distributed (i.n.i.d.) random variables with absolutely continuous densities, that
$$I(X_1 + \ldots + X_n) \leq \binom{n-1}{m-1} \sum_{s \in \Omega_m} w_s^2 \, I\Big(\sum_{i \in s} X_i\Big) \qquad (7)$$
for any non-negative weights $w_s$ that add to 1 over all subsets $s \subset \{1, \ldots, n\}$ of size $m$. Optimizing over $w$ yields an inequality for inverse Fisher information that extends the original inequality of Stam:
$$\frac{1}{I(X_1 + \ldots + X_n)} \geq \frac{1}{\binom{n-1}{m-1}} \sum_{s \in \Omega_m} \frac{1}{I\big(\sum_{i \in s} X_i\big)}. \qquad (8)$$
Alternatively, using a scaling property of Fisher information to re-express our core inequality (7), we see that the Fisher information of the sum is bounded by a convex combination of Fisher informations of scaled partial sums:
$$I(X_1 + \ldots + X_n) \leq \sum_{s \in \Omega_m} w_s \, I\Bigg(\frac{\sum_{i \in s} X_i}{\sqrt{w_s \binom{n-1}{m-1}}}\Bigg). \qquad (9)$$
This integrates to give an inequality for entropy that is an extension of the "linear form of the entropy power inequality" developed by Dembo et al [10]. Specifically, we obtain
$$H(X_1 + \ldots + X_n) \leq \sum_{s \in \Omega_m} w_s \, H\Bigg(\frac{\sum_{i \in s} X_i}{\sqrt{w_s \binom{n-1}{m-1}}}\Bigg). \qquad (10)$$
Likewise, using the scaling property of entropy on (10) and optimizing over $w$ yields our extension of the entropy power inequality:
$$\exp\{2H(X_1 + \ldots + X_n)\} \geq \frac{1}{\binom{n-1}{m-1}} \sum_{s \in \Omega_m} \exp\Big\{2H\Big(\sum_{j \in s} X_j\Big)\Big\}.$$
Thus both inverse Fisher information and entropy power satisfy an inequality of the form
$$\binom{n-1}{m-1} \, \psi(X_1 + \ldots + X_n) \geq \sum_{s \in \Omega_m} \psi\Big(\sum_{i \in s} X_i\Big). \qquad (11)$$

We motivate the form (11) using the following almost trivial fact, which is proved in the Appendix for the reader's convenience. Let $[n] = \{1, 2, \ldots, n\}$.

Fact I: For arbitrary non-negative numbers $\{a_i^2 : i \in [n]\}$,
$$\sum_{s \in \Omega_m} \sum_{i \in s} a_i^2 = \binom{n-1}{m-1} \sum_{i \in [n]} a_i^2, \qquad (12)$$
where the first sum on the left is taken over the collection $\Omega_m = \{s \subset [n] : |s| = m\}$ of sets containing $m$ indices.

If Fact I is thought of as $(m,n)$-additivity, then (8) and (3) represent the $(m,n)$-superadditivity of inverse Fisher information and entropy power respectively. In the case of normal random variables, the inverse Fisher information and the entropy power equal the variance. Thus in that case (8) and (3) become Fact I with $a_i^2$ equal to the variance of $X_i$.

II. SCORE FUNCTIONS AND PROJECTIONS

The first tool we need is a projection property of score functions of sums of independent random variables, which is well-known for smooth densities (cf. Blachman [11]). For completeness, we give the proof. As shown by Johnson and Barron [7], it is sufficient that the densities are absolutely continuous; see [7, Appendix 1] for an explanation of why this is so.

Lemma I: [CONVOLUTION IDENTITY FOR SCORE FUNCTIONS] If $V_1$ and $V_2$ are independent random variables, and $V_1$ has an absolutely continuous density with score $\rho_{V_1}$, then $V_1 + V_2$ has the score
$$\rho_{V_1 + V_2}(v) = E[\rho_{V_1}(V_1) \mid V_1 + V_2 = v]. \qquad (13)$$

Proof: Let $f_{V_1}$ and $f_V$ be the densities of $V_1$ and $V = V_1 + V_2$ respectively. Then, either bringing the derivative

inside the integral for the smooth case, or via the more general formalism in [7],
$$f'_V(v) = \frac{\partial}{\partial v} E[f_{V_1}(v - V_2)] = E[f'_{V_1}(v - V_2)] = E[f_{V_1}(v - V_2) \, \rho_{V_1}(v - V_2)], \qquad (14)$$
so that
$$\rho_V(v) = \frac{f'_V(v)}{f_V(v)} = E\bigg[\frac{f_{V_1}(v - V_2)}{f_V(v)} \, \rho_{V_1}(v - V_2)\bigg] = E[\rho_{V_1}(V_1) \mid V_1 + V_2 = v]. \qquad (15)$$

The second tool we need is a "variance drop lemma", which goes back at least to Hoeffding's seminal work [12] on U-statistics (see his Theorem 5.2). An equivalent statement of the variance drop lemma was formulated in ABBN [3]. In [4], we prove and use a more general result to study the i.n.i.d. case. First we need to recall a decomposition of functions in $L^2(\mathbb{R}^n)$, which is nothing but the Analysis of Variance (ANOVA) decomposition of a statistic. The following conventions are useful. $[n]$ is the index set $\{1, 2, \ldots, n\}$. For any $s \subset [n]$, $X_s$ stands for the collection of random variables $\{X_i : i \in s\}$. For any $j \in [n]$, $E_j \psi$ denotes the conditional expectation of $\psi$, given all random variables other than $X_j$, i.e.,
$$E_j \psi(x_1, \ldots, x_n) = E[\psi(X_1, \ldots, X_n) \mid X_i = x_i \ \forall i \neq j] \qquad (16)$$
averages out the dependence on the $j$-th coordinate.

Fact II: [ANOVA DECOMPOSITION] Suppose $\psi : \mathbb{R}^n \to \mathbb{R}$ satisfies $E\psi^2(X_1, \ldots, X_n) < \infty$, i.e., $\psi \in L^2$, for independent random variables $X_1, X_2, \ldots, X_n$. For $s \subset [n]$, define the orthogonal linear subspaces
$$H_s = \{\psi \in L^2 : E_j \psi = \psi 1_{\{j \notin s\}} \ \forall j \in [n]\} \qquad (17)$$
of functions depending only on the variables indexed by $s$. Then $L^2$ is the orthogonal direct sum of this family of subspaces, i.e., any $\psi \in L^2$ can be written in the form
$$\psi = \sum_{s \subset [n]} \psi_s, \qquad (18)$$
where $\psi_s \in H_s$.

Remark: In the language of ANOVA familiar to statisticians, $\psi_\emptyset$ (the component for the empty set) is the mean; $\psi_{\{1\}}, \psi_{\{2\}}, \ldots, \psi_{\{n\}}$ are the main effects; $\{\psi_s : |s| = 2\}$ are the pairwise interactions, and so on. Fact II implies that for any subset $s \subset [n]$, the function $\sum_{R \subset s} \psi_R$ is the best approximation (in mean square) to $\psi$ that depends only on the collection $X_s$ of random variables.

Remark: The historical roots of this decomposition lie in the work of von Mises [13] and Hoeffding [12]. For various refinements and interpretations, see Kurkjian and Zelen [14], Jacobsen [15], Rubin and Vitale [16], Efron and Stein [17], and Takemura [18]; these works include applications of such decompositions to experimental design, linear models, U-statistics, and jackknife theory. The Appendix contains a brief proof of Fact II for the convenience of the reader.

We say that a function $f : \mathbb{R}^d \to \mathbb{R}$ is an additive function if there exist functions $f_i : \mathbb{R} \to \mathbb{R}$ such that $f(x_1, \ldots, x_d) = \sum_{i \in [d]} f_i(x_i)$.

Lemma II: [VARIANCE DROP] Let $\psi : \mathbb{R}^{n-1} \to \mathbb{R}$. Suppose, for each $j \in [n]$, $\psi_j = \psi(X_1, \ldots, X_{j-1}, X_{j+1}, \ldots, X_n)$ has mean 0. Then
$$E\bigg[\sum_{j=1}^{n} \psi_j\bigg]^2 \leq (n-1) \sum_{j \in [n]} E[\psi_j]^2. \qquad (19)$$
Equality can hold only if $\psi$ is an additive function.

Proof: By the Cauchy–Schwarz inequality, for any $V_j$,
$$\bigg[\frac{1}{n} \sum_{j \in [n]} V_j\bigg]^2 \leq \frac{1}{n} \sum_{j \in [n]} [V_j]^2, \qquad (20)$$
so that
$$E\bigg[\sum_{j \in [n]} V_j\bigg]^2 \leq n \sum_{j \in [n]} E[V_j]^2. \qquad (21)$$
Let $\bar{E}_s$ be the operation that produces the component $\bar{E}_s \psi = \psi_s$ (see the Appendix for a further characterization of it); then
$$E\bigg[\sum_{j \in [n]} \psi_j\bigg]^2 = E\bigg[\sum_{s \subset [n]} \sum_{j \in [n]} \bar{E}_s \psi_j\bigg]^2 \stackrel{(a)}{=} \sum_{s \subset [n]} E\bigg[\sum_{j \notin s} \bar{E}_s \psi_j\bigg]^2 \stackrel{(b)}{\leq} \sum_{s \subset [n]} (n-1) \sum_{j \notin s} E[\bar{E}_s \psi_j]^2 \stackrel{(c)}{=} (n-1) \sum_{j \in [n]} E[\psi_j]^2. \qquad (22)$$
Here, (a) and (c) employ the orthogonal decomposition of Fact II and Parseval's theorem.
The inequality (b) is based on two facts: firstly, $E_j \psi_j = \psi_j$ since $\psi_j$ does not depend on $X_j$, and hence $\bar{E}_s \psi_j = 0$ if $j \in s$; secondly, we can ignore the empty set $\emptyset$ in the outer sum since each $\psi_j$ has mean 0 (so that $\bar{E}_\emptyset \psi_j = 0$), and therefore the index set $\{j : j \notin s\}$ in the inner sum has at most $n-1$ elements, so that (21) applies with $n-1$ in place of $n$. For equality to hold, $\bar{E}_s \psi_j$ can only be nonzero when $s$ has exactly one element, i.e., each $\psi_j$ must consist only of main effects and no interactions, so that it must be additive.

III. MONOTONICITY IN THE IID CASE

For i.i.d. random variables, inequalities (2) and (3) reduce to the monotonicity $H(Y_n) \geq H(Y_m)$ for $n > m$, where
$$Y_n = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} X_i. \qquad (23)$$
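The monotonicity statement can be probed numerically in a simple case. The sketch below (our illustration, not from the paper) takes $X_i \sim \mathrm{Uniform}(0,1)$, for which $S_n = X_1 + \ldots + X_n$ has the Irwin–Hall density, and integrates $-f \log f$ numerically to obtain $H(Y_n) = H(S_n) - \frac{1}{2}\log n$; the values increase with $n$ toward the Gaussian maximum $\frac{1}{2}\log(2\pi e \, \mathrm{Var}(X_1))$:

```python
import math

def irwin_hall_pdf(x, n):
    """Density of S_n = X_1 + ... + X_n for i.i.d. Uniform(0,1) summands."""
    if x <= 0 or x >= n:
        return 0.0
    total = 0.0
    for k in range(int(math.floor(x)) + 1):
        total += (-1) ** k * math.comb(n, k) * (x - k) ** (n - 1)
    return total / math.factorial(n - 1)

def entropy_standardized_sum(n, grid=20_000):
    """H(Y_n) for Y_n = S_n / sqrt(n), via H(Y_n) = H(S_n) - 0.5*log(n),
    integrating -f log f over (0, n) on a uniform grid."""
    h = n / grid
    acc = 0.0
    for i in range(1, grid):  # endpoints contribute 0 since f -> 0 there
        f = irwin_hall_pdf(i * h, n)
        if f > 0:
            acc -= f * math.log(f) * h
    return acc - 0.5 * math.log(n)

H = [entropy_standardized_sum(n) for n in (1, 2, 3, 4)]
print(H)  # increasing in n, below the Gaussian limit 0.5*log(2*pi*e/12) ~ 0.1765
```

Here $H(Y_1) = 0$ (the uniform on a unit interval), and each further standardized sum has strictly larger entropy, consistent with (23) and with the fact that equality in (6) holds only in the Gaussian case.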