Journal of Inequalities in Pure and Applied Mathematics
Volume 4, Issue 4, Article 71, 2003
http://jipam.vu.edu.au/

ON FISHER INFORMATION INEQUALITIES AND SCORE FUNCTIONS IN NON-INVERTIBLE LINEAR SYSTEMS

C. VIGNAT AND J.-F. BERCHER

E.E.C.S., University of Michigan, 1301 N. Beal Avenue, Ann Arbor, MI 48109, USA
vignat@univ-mlv.fr
URL: http://www-syscom.univ-mlv.fr/~vignat/

Équipe Signal et Information, ESIEE and UMLV, 93162 Noisy-le-Grand, France
jf.bercher@esiee.fr

Received 04 June, 2003; accepted 09 September, 2003
Communicated by F. Hansen

ISSN (electronic): 1443-5756. © 2003 Victoria University. All rights reserved.
Part of this work was presented at the International Symposium on Information Theory, Lausanne, Switzerland, 2002.

ABSTRACT. In this note, we review properties of score functions and discuss inequalities on the Fisher information matrix of a random vector subjected to linear non-invertible transformations. We give alternate derivations of results previously published in [6] and provide new interpretations of the cases of equality.

Key words and phrases: Fisher information, Non-invertible linear systems.

2000 Mathematics Subject Classification: 62B10, 93C05, 94A17.

1. INTRODUCTION

The Fisher information matrix J_X of a random vector X is a useful theoretical tool for describing the propagation of information through systems. For instance, it is directly involved in the derivation of the Entropy Power Inequality (EPI), which describes the evolution of the entropy of random vectors submitted to linear transformations. The first results about information transformation were given in the 1960s by Blachman [1] and Stam [5]. Later, Papathanasiou [4] derived an important series of Fisher Information Inequalities (FII) with applications to the characterization of normality. In [6], Zamir extended the FII to the case of non-invertible linear systems. However, the proofs given in his paper, completed in the technical report [7], involve complicated derivations, especially for the characterization of the cases of equality.

The main contributions of this note are threefold. First, we review some properties of score functions and characterize the estimation of a score function under linear constraint.
Second, we give two alternate derivations of Zamir's FII inequalities and show how they can be related to Papathanasiou's results. Third, we examine the cases of equality and give an interpretation that highlights the concept of extractable components of the input vector of a linear system, and its relationship with the concepts of pseudoinverse and Gaussianity.

2. NOTATIONS AND DEFINITIONS

In this note, we consider a linear system with an (n × 1) random vector input X and an (m × 1) random vector output Y, represented by an m × n matrix A, with m ≤ n, as

    Y = AX.

Matrix A is assumed to have full row rank (rank A = m).

Let f_X and f_Y denote the probability densities of X and Y. The probability density f_X is supposed to satisfy the three regularity conditions (cf. [4]):

(1) f_X(x) is continuous and has continuous first and second order partial derivatives,
(2) f_X(x) is defined on R^n and lim_{||x|| → ∞} f_X(x) = 0,
(3) the Fisher information matrix J_X (with respect to a translation parameter) is defined as

    [J_X]_{i,j} = ∫_{R^n} (∂ ln f_X(x)/∂x_i) (∂ ln f_X(x)/∂x_j) f_X(x) dx

and is supposed nonsingular.

We also define the score function φ_X(·) : R^n → R^n associated with f_X according to

    φ_X(x) = ∂ ln f_X(x)/∂x.

The statistical expectation operator E_X is

    E_X[h(X)] = ∫_{R^n} h(x) f_X(x) dx.

E_{X,Y} and E_{X|Y} will denote the joint and conditional expectations, computed with the joint and conditional probability density functions f_{X,Y} and f_{X|Y} respectively.

The covariance matrix of a random vector g(X) is defined by

    cov[g(X)] = E_X[(g(X) − E_X[g(X)])(g(X) − E_X[g(X)])^T].

The gradient operator ∇_X is defined by

    ∇_X h(X) = [∂h(X)/∂x_1, ..., ∂h(X)/∂x_n]^T.

Finally, in what follows, a matrix inequality such as A ≥ B means that the matrix (A − B) is nonnegative definite.

3. PRELIMINARY RESULTS

We derive here a first theorem that extends Lemma 1 of [7]. The problem addressed is to find an estimator φ̂_X(X) of the score function φ_X(X) in terms of the observations Y = AX. Obviously, this estimator depends on Y, but this dependence is omitted here for notational convenience.
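As a concrete illustration of the definitions above (this example is not part of the original paper), recall that for a Gaussian vector X ~ N(0, Σ) the score is φ_X(x) = −Σ^{-1}x and the Fisher information matrix is J_X = Σ^{-1}. The short Python/numpy sketch below, with an arbitrary covariance Sigma, checks the identity J_X = E_X[φ_X(X) φ_X(X)^T] by Monte Carlo.

    import numpy as np

    rng = np.random.default_rng(0)

    # Arbitrary 3x3 covariance matrix (illustrative choice only).
    B = rng.standard_normal((3, 3))
    Sigma = B @ B.T + 3 * np.eye(3)
    Sigma_inv = np.linalg.inv(Sigma)

    # Sample X ~ N(0, Sigma); the Gaussian score is phi_X(x) = -Sigma^{-1} x.
    X = rng.multivariate_normal(np.zeros(3), Sigma, size=200_000)
    phi = -X @ Sigma_inv                     # row k holds phi_X evaluated at sample k

    # Monte Carlo estimate of E[phi_X(X) phi_X(X)^T]; it should approach J_X = Sigma^{-1}.
    J_mc = phi.T @ phi / X.shape[0]
    print(np.max(np.abs(J_mc - Sigma_inv)))  # small, up to sampling error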


Theorem 3.1. Under the hypotheses expressed in Section 2, the solution to the minimum mean square estimation problem

(3.1)    φ̂_X(X) = arg min_w E_{X,Y}[ ||φ_X(X) − w(Y)||^2 ]    subject to Y = AX,

is

(3.2)    φ̂_X(X) = A^T φ_Y(Y).

The proof we propose here relies on elementary algebraic manipulations according to the rules expressed in the following lemma.

Lemma 3.2. If X and Y are two random vectors such that Y = AX, where A is a full row rank matrix, then for any smooth functions g_1 : R^m → R, g_2 : R^n → R, h_1 : R^n → R^n and h_2 : R^m → R^m,

    Rule 0:  E_X[g_1(AX)] = E_Y[g_1(Y)],
    Rule 1:  E_X[φ_X(X) g_2(X)] = −E_X[∇_X g_2(X)],
    Rule 2:  E_X[φ_X(X) h_1^T(X)] = −E_X[∇_X h_1^T(X)],
    Rule 3:  ∇_X h_2^T(AX) = A^T ∇_Y h_2^T(Y),
    Rule 4:  E_X[∇_X φ_Y^T(Y)] = −A^T J_Y.

Proof. Rule 0 is proved in [2, vol. 2, p. 133]. Rule 1 and Rule 2 are easily proved using integration by parts. For Rule 3, denote by h_k the k-th component of the vector h = h_2, and remark that

    ∂h_k(AX)/∂x_j = Σ_{i=1}^m (∂h_k(AX)/∂y_i)(∂y_i/∂x_j) = Σ_{i=1}^m A_{ij} [∇_Y h_k(Y)]_i = [A^T ∇_Y h_k(Y)]_j.

Now h^T(Y) = [h_1(Y), ..., h_m(Y)], so that

    ∇_X h^T(AX) = [∇_X h_1(AX), ..., ∇_X h_m(AX)] = A^T ∇_Y h^T(Y).

Rule 4 can be deduced as follows:

    E_X[∇_X φ_Y^T(Y)] = A^T E_X[∇_Y φ_Y^T(Y)]        (Rule 3)
                      = A^T E_Y[∇_Y φ_Y^T(Y)]        (Rule 0)
                      = −A^T E_Y[φ_Y(Y) φ_Y(Y)^T]    (Rule 2)
                      = −A^T J_Y.                     □
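These integration-by-parts rules can be checked numerically in simple cases. The following sketch is added here for illustration (the standard Gaussian density and the test function g2 are arbitrary choices, not taken from the paper): for X ~ N(0, I) in R^2 the score is φ_X(x) = −x, and the two sides of Rule 1 are estimated by Monte Carlo.

    import numpy as np

    rng = np.random.default_rng(1)

    # Standard Gaussian X in R^2, for which phi_X(x) = -x.
    n_samples = 500_000
    X = rng.standard_normal((n_samples, 2))
    phi = -X

    # Arbitrary smooth test function g2 : R^2 -> R and its gradient.
    g2 = np.sin(X[:, 0]) + X[:, 1] ** 2
    grad_g2 = np.column_stack((np.cos(X[:, 0]), 2 * X[:, 1]))

    # Rule 1: E[phi_X(X) g2(X)] = -E[grad_X g2(X)]; both sides are 2-vectors.
    lhs = (phi * g2[:, None]).mean(axis=0)
    rhs = -grad_g2.mean(axis=0)
    print(lhs, rhs)                          # agree up to Monte Carlo error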


For the proof of Theorem 3.1, we will also need the following orthogonality result.

Lemma 3.3. For all multivariate functions h : R^m → R^n, the estimator φ̂_X(X) = A^T φ_Y(Y) satisfies

(3.3)    E_{X,Y}[ (φ_X(X) − φ̂_X(X))^T h(Y) ] = 0.

Proof. Expand (3.3) into two terms and compute the first term using the trace operator tr(·):

    E_{X,Y}[φ_X(X)^T h(Y)] = tr E_{X,Y}[φ_X(X) h^T(Y)]
                           = −tr E_Y[∇_X h^T(Y)]         (Rule 2, Rule 0)
                           = −tr A^T E_Y[∇_Y h^T(Y)].    (Rule 3)

The second term writes

    E_{X,Y}[φ̂_X(X)^T h(Y)] = tr E_{X,Y}[φ̂_X(X) h^T(Y)]
                            = tr E_Y[A^T φ_Y(Y) h^T(Y)]
                            = tr A^T E_Y[φ_Y(Y) h^T(Y)]
                            = −tr A^T E_Y[∇_Y h^T(Y)],   (Rule 2)

thus the two terms are equal and their difference is zero.    □

Using Lemma 3.2 and Lemma 3.3, we are now in a position to prove Theorem 3.1.

Proof of Theorem 3.1. From Lemma 3.3, we have

    E_{X,Y}[ (φ_X(X) − φ̂_X(X)) h(Y) ] = E_{X,Y}[ (φ_X(X) − A^T φ_Y(Y)) h(Y) ]
                                       = E_Y[ E_{X|Y}[ (φ_X(X) − A^T φ_Y(Y)) h(Y) ] ] = 0.

Since this is true for all h, the inner expectation is null, so that

    E_{X|Y}[φ_X(X)] = A^T φ_Y(Y).

Hence, we deduce that the estimator φ̂_X(X) = A^T φ_Y(Y) is nothing else but the conditional expectation of φ_X(X) given Y. Since it is well known (see [8] for instance) that the conditional expectation is the solution of the Minimum Mean Square Error (MMSE) estimation problem addressed in Theorem 3.1, the result follows.    □

Theorem 3.1 not only restates Zamir's result in terms of an estimation problem, but also extends its conditions of application, since our proof does not require, as in [7], the independence of the components of X.

4. FISHER INFORMATION MATRIX INEQUALITIES

As was shown by Zamir [6], the result of Theorem 3.1 may be used to derive the pair of Fisher Information Inequalities stated in the following theorem.

Theorem 4.1. Under the assumptions of Theorem 3.1,

(4.1)    J_X ≥ A^T J_Y A

and

(4.2)    J_Y ≤ (A J_X^{-1} A^T)^{-1}.
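Before turning to the proofs, here is a quick numerical sanity check of (4.1) and (4.2), added for illustration and not part of the original text: for a Gaussian input X ~ N(0, Σ) one has J_X = Σ^{-1} and J_Y = (AΣA^T)^{-1} in closed form, so both inequalities can be tested directly by inspecting eigenvalues. The matrices A and Σ below are arbitrary choices.

    import numpy as np

    rng = np.random.default_rng(2)

    n, m = 4, 2
    A = rng.standard_normal((m, n))          # full row rank with probability one
    B = rng.standard_normal((n, n))
    Sigma = B @ B.T + n * np.eye(n)          # arbitrary nonsingular covariance of X

    # Gaussian case: J_X = Sigma^{-1} and J_Y = (A Sigma A^T)^{-1}.
    J_X = np.linalg.inv(Sigma)
    J_Y = np.linalg.inv(A @ Sigma @ A.T)

    # (4.1): J_X - A^T J_Y A must be nonnegative definite.
    print(np.linalg.eigvalsh(J_X - A.T @ J_Y @ A).min())   # ~0 or positive (the difference is singular)

    # (4.2): (A J_X^{-1} A^T)^{-1} - J_Y must be nonnegative definite;
    # for Gaussian X it vanishes identically (equality, cf. Section 5).
    bound = np.linalg.inv(A @ np.linalg.inv(J_X) @ A.T)
    print(np.linalg.eigvalsh(bound - J_Y).min())           # ~0: equality holds in the Gaussian case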


We exhibit here an extension and two alternate proofs of these results that do not even rely on Theorem 3.1. The first proof relies on a classical matrix inequality combined with the algebraic properties of score functions as expressed by Rule 1 to Rule 4. The second (partial) proof is deduced as a particular case of results expressed by Papathanasiou [4].

The first proof we propose is based on the well-known result expressed in the following lemma.

Lemma 4.2. If

    U = [ A  B ]
        [ C  D ]

is a symmetric nonnegative definite block matrix such that D^{-1} exists, then

    A − B D^{-1} C ≥ 0,

with equality if and only if rank(U) = dim(D).

Proof. Consider the block L∆M factorization [3] of matrix U:

    U = [ I  B D^{-1} ] [ A − B D^{-1} C  0 ] [ I         0 ]
        [ 0  I        ] [ 0               D ] [ D^{-1} C  I ]
             (L)               (∆)                (M^T)

We remark that the symmetry of U implies that L = M, and thus

    ∆ = L^{-1} U L^{-T},

so that ∆ is a symmetric nonnegative definite matrix. Hence, all its principal minors are nonnegative, and

    A − B D^{-1} C ≥ 0.    □

Using this matrix inequality, we can complete the proof of Theorem 4.1 by considering the two following (m + n) × (m + n) matrices

    U_1 = E[ [φ_X(X) ; φ_Y(Y)] [φ_X(X) ; φ_Y(Y)]^T ],
    U_2 = E[ [φ_Y(Y) ; φ_X(X)] [φ_Y(Y) ; φ_X(X)]^T ],

where [· ; ·] denotes vertical stacking. For matrix U_1, we have, from Lemma 4.2,

(4.3)    E_X[φ_X(X) φ_X^T(X)] ≥ E_{X,Y}[φ_X(X) φ_Y^T(Y)] ( E_Y[φ_Y(Y) φ_Y^T(Y)] )^{-1} E_{X,Y}[φ_Y(Y) φ_X^T(X)].

Then, using the rules of Lemma 3.2, we can recognize that

    E_X[φ_X(X) φ_X^T(X)] = J_X,
    E_Y[φ_Y(Y) φ_Y^T(Y)] = J_Y,
    E_{X,Y}[φ_X(X) φ_Y^T(Y)] = −E_X[∇_X φ_Y^T(Y)] = A^T J_Y,
    E_{X,Y}[φ_Y(Y) φ_X^T(X)] = (A^T J_Y)^T = J_Y A.

Replacing these expressions in inequality (4.3), we deduce the first inequality (4.1).

Applying the result of Lemma 4.2 to matrix U_2 similarly yields

    J_Y ≥ J_Y^T A J_X^{-1} A^T J_Y.

Multiplying both on the left and on the right by J_Y^{-1} = (J_Y^{-1})^T gives J_Y^{-1} ≥ A J_X^{-1} A^T, which is inequality (4.2) after inversion.
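Lemma 4.2 is the standard Schur-complement characterization of nonnegative definiteness. The small sketch below (an illustration added here, not taken from the paper) draws a random symmetric nonnegative definite U, partitions it into blocks, and checks that A − B D^{-1} C has nonnegative eigenvalues. In the proof above, U_1 and U_2 are exactly matrices of this type, with blocks given by J_X, A^T J_Y and J_Y.

    import numpy as np

    rng = np.random.default_rng(3)

    # Random symmetric nonnegative definite U of size (m + n) x (m + n), here 2 + 3.
    m, n = 2, 3
    V = rng.standard_normal((m + n, m + n))
    U = V @ V.T                                  # symmetric, nonnegative definite

    A, B = U[:m, :m], U[:m, m:]
    C, D = U[m:, :m], U[m:, m:]                  # symmetry of U gives C = B^T

    # Lemma 4.2: the Schur complement A - B D^{-1} C is nonnegative definite.
    schur = A - B @ np.linalg.inv(D) @ C
    print(np.linalg.eigvalsh(schur).min())       # >= 0 up to rounding error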


Another proof of inequality (4.2) is now exhibited, as a consequence of a general result derived by Papathanasiou [4]. This result states as follows.

Theorem 4.3 (Papathanasiou [4]). If g(X) is a function R^n → R^m such that, for all i ∈ [1, m], g_i(x) is differentiable and var[g_i(X)] < ∞, then the covariance matrix cov[g(X)] of g(X) verifies

    cov[g(X)] ≥ E_X[∇^T g(X)] J_X^{-1} E_X[∇ g(X)].

Now, inequality (4.2) simply results from the choice g(X) = φ_Y(AX), since in this case cov[g(X)] = J_Y and E_X[∇^T g(X)] = −J_Y A. Note that Papathanasiou's theorem does not allow us to retrieve inequality (4.1).
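Theorem 4.3 can be explored numerically as well. In the sketch below (added for illustration; the Gaussian density and the function g are arbitrary choices, not from the paper), X is Gaussian so that J_X^{-1} = Σ, and Monte Carlo estimates of cov[g(X)] and E[∇^T g(X)] J_X^{-1} E[∇ g(X)] are compared through the eigenvalues of their difference.

    import numpy as np

    rng = np.random.default_rng(4)

    # Gaussian X in R^2 with covariance Sigma, so that J_X^{-1} = Sigma.
    Sigma = np.diag([1.0, 2.0])
    n_samples = 500_000
    X = rng.multivariate_normal(np.zeros(2), Sigma, size=n_samples)

    # Arbitrary smooth g : R^2 -> R^2 and its gradient matrix nabla g (n x m).
    g = np.column_stack((np.sin(X[:, 0]), X[:, 1] ** 3))
    grad_g = np.zeros((n_samples, 2, 2))         # grad_g[k, i, j] = d g_j / d x_i at sample k
    grad_g[:, 0, 0] = np.cos(X[:, 0])
    grad_g[:, 1, 1] = 3 * X[:, 1] ** 2

    cov_g = np.cov(g, rowvar=False)              # Monte Carlo estimate of cov[g(X)]
    E_grad = grad_g.mean(axis=0)                 # E[nabla g(X)], an n x m matrix
    bound = E_grad.T @ Sigma @ E_grad            # E[nabla^T g] J_X^{-1} E[nabla g]

    # Theorem 4.3: cov[g(X)] - bound is nonnegative definite.
    print(np.linalg.eigvalsh(cov_g - bound).min())   # positive here, up to Monte Carlo error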


5. CASE OF EQUALITY IN MATRIX FII

We now make explicit the cases of equality in both inequalities (4.1) and (4.2). The case of equality in inequality (4.2) was already characterized in [7] and introduces the notion of 'extractable components' of the vector X. Our alternate proof also makes use of this notion and establishes a link with the pseudoinverse of matrix A.

Case of equality in inequality (4.1). The case of equality in inequality (4.1) is characterized by the following theorem.

Theorem 5.1. Suppose that the components X_i of X are mutually independent. Then equality holds in (4.1) if and only if matrix A possesses (n − m) null columns or, equivalently, if A writes, up to a permutation of its column vectors,

    A = [A_0 | 0_{m×(n−m)}],

where A_0 is an m × m non-singular matrix.

Proof. According to the first proof of Theorem 4.1 and the case of equality in Lemma 4.2, equality holds in (4.1) if there exists a non-random matrix B and a non-random vector c such that

    φ_X(X) = B φ_Y(Y) + c.

However, as the random variables φ_X(X) and φ_Y(Y) have zero mean, E_X[φ_X(X)] = 0 and E_Y[φ_Y(Y)] = 0, necessarily c = 0. Moreover, applying Rule 2 and Rule 4 yields

    E_{X,Y}[φ_X(X) φ_Y(Y)^T] = A^T J_Y

on one side, and

    E_{X,Y}[φ_X(X) φ_Y(Y)^T] = B J_Y

on the other side, so that finally B = A^T and

    φ_X(X) = A^T φ_Y(Y).

Now, since A has rank m, it can be written, up to a permutation of its columns, under the form

    A = [A_0 | A_0 M],

where A_0 is an invertible m × m matrix and M is an m × (n − m) matrix. Suppose M ≠ 0 and express X equivalently as X = [X_0 ; X_1], where X_0 has dimension m and X_1 has dimension n − m, so that

    Y = AX = A_0 X_0 + A_0 M X_1 = A_0 X̃,

with X̃ = X_0 + M X_1. Since A_0 is square and invertible, it follows that

    φ_Y(Y) = A_0^{-T} φ_X̃(X̃),

so that

    φ_X = A^T φ_Y(Y)
        = A^T A_0^{-T} φ_X̃(X̃)
        = [A_0^T ; M^T A_0^T] A_0^{-T} φ_X̃(X̃)
        = [I ; M^T] φ_X̃(X̃)
        = [φ_X̃(X̃) ; M^T φ_X̃(X̃)].

As X has independent components, φ_X can be decomposed as

    φ_X = [φ_{X_0}(X_0) ; φ_{X_1}(X_1)],

so that finally

    [φ_{X_0}(X_0) ; φ_{X_1}(X_1)] = [φ_X̃(X̃) ; M^T φ_X̃(X̃)],

from which we deduce that

    φ_{X_1}(X_1) = M^T φ_{X_0}(X_0).

As X_0 and X_1 are independent, this is not possible unless M = 0, which is the equality condition expressed in Theorem 5.1. Conversely, if these conditions are met, then obviously equality is reached in inequality (4.1).    □

Case of equality in inequality (4.2). Assuming that the components of X are mutually independent, the case of equality in inequality (4.2) is characterized as follows.

Theorem 5.2. Equality holds in inequality (4.2) if and only if each component X_i of X verifies at least one of the following conditions:

a) X_i is Gaussian,
b) X_i can be recovered from the observation of Y = AX, i.e. X_i is 'extractable',
c) X_i corresponds to a null column of A.

Proof. According to the (first) proof of inequality (4.2), equality holds, as previously, if and only if there exists a matrix C such that

(5.1)    φ_Y(Y) = C φ_X(X),

which implies that J_Y = C J_X C^T. Then, as by assumption J_Y^{-1} = A J_X^{-1} A^T, C = J_Y A J_X^{-1} is such a matrix. Denoting φ̃_X(X) = J_X^{-1} φ_X(X) and φ̃_Y(Y) = J_Y^{-1} φ_Y(Y), equality (5.1) writes

(5.2)    φ̃_Y(Y) = A φ̃_X(X).


The rest of the proof relies on the following two well-known results:

• if X is Gaussian, then equality holds in inequality (4.2),
• if A is a non-singular square matrix, then equality holds in inequality (4.2) irrespective of X.

We thus need to isolate the 'invertible part' of matrix A. To this aim, we consider the pseudoinverse A# of A and form the product A#A. This matrix writes, up to a permutation of rows and columns,

    A#A = [ I  0  0 ]
          [ 0  M  0 ]
          [ 0  0  0 ],

where I is the n_i × n_i identity, M is an n_ni × n_ni matrix and 0 is an n_z × n_z matrix, with n_z = n − n_i − n_ni (i stands for invertible, ni for not invertible and z for zero). Remark that n_z is exactly the number of null columns of A. Following [6, 7], n_i is the number of 'extractable' components, that is, the number of components of X that can be deduced from the observation Y = AX. We provide here an alternate characterization of n_i as follows: the set of solutions of Y = AX is the affine set

    X = A#Y + (I − A#A)Z = X_0 + (I − A#A)Z,

where X_0 is the minimum norm solution of the linear system Y = AX and Z is any vector. Thus, n_i is exactly the number of components shared by X and X_0.

The expression of A#A allows us to express R^n as the direct sum R^n = R_i ⊕ R_ni ⊕ R_z, and to express accordingly X as X = [X_i^T, X_ni^T, X_z^T]^T. Then equality in inequality (4.2) can be studied separately in the three subspaces as follows:

(1) restricted to subspace R_i, A is an invertible operator, and thus equality holds without condition,
(2) restricted to subspace R_ni, equality (5.2) writes M φ̃(X_ni) = φ̃(M X_ni), which means that necessarily all components of X_ni are Gaussian,
(3) restricted to subspace R_z, equality holds without condition.    □

As a final note, remark that, although A is supposed to be of full rank, we only have n_i ≤ rank A. For instance, consider the matrix

    A = [ 1  0  0 ]
        [ 0  1  1 ],

for which n_i = 1 and n_ni = 2. This example shows that the notion of 'extractability' should not be confused with invertibility restricted to a subspace: A is clearly invertible on the subspace x_3 = 0. However, such a subspace is irrelevant here since, as we deal with continuous random input vectors, X has null probability of belonging to this subspace.
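This closing example can be made concrete with a few lines of code (added here for illustration, using numpy's Moore-Penrose pseudoinverse): the product A#A exhibits one extractable component (n_i = 1), a 2 × 2 block playing the role of M (n_ni = 2), and no null columns (n_z = 0), and the minimum norm solution X_0 = A#Y shares exactly its first component with X.

    import numpy as np

    # The closing example: A has full row rank (rank 2) but only x_1 is extractable.
    A = np.array([[1.0, 0.0, 0.0],
                  [0.0, 1.0, 1.0]])

    A_pinv = np.linalg.pinv(A)       # Moore-Penrose pseudoinverse A#
    P = A_pinv @ A                   # A# A, the projector onto the row space of A
    print(np.round(P, 3))
    # [[1.  0.  0. ]
    #  [0.  0.5 0.5]
    #  [0.  0.5 0.5]]
    # One diagonal entry equal to 1  ->  n_i = 1 extractable component;
    # the remaining 2x2 block plays the role of M, so n_ni = 2 and n_z = 0.

    x = np.array([3.0, -1.0, 2.0])   # arbitrary input vector
    y = A @ x
    x0 = A_pinv @ y                  # minimum norm solution of Y = AX
    print(x0)                        # first component equals x[0]; the others do not match x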


ACKNOWLEDGMENT

The authors wish to acknowledge Prof. Alfred O. Hero III at EECS for useful discussions and suggestions, particularly regarding Theorem 3.1. The authors also wish to thank the referee for his careful reading of the manuscript, which helped to improve its quality.

REFERENCES

[1] N.M. BLACHMAN, The convolution inequality for entropy powers, IEEE Trans. on Information Theory, IT-11 (1965), 267–271.

[2] W. FELLER, An Introduction to Probability Theory and its Applications, Vol. 2, John Wiley & Sons, New York, 1971.

[3] G. GOLUB AND C. VAN LOAN, Matrix Computations, Johns Hopkins University Press, 1996.

[4] V. PAPATHANASIOU, Some characteristic properties of the Fisher information matrix via Cacoullos-type inequalities, J. Multivariate Analysis, 14 (1993), 256–265.

[5] A.J. STAM, Some inequalities satisfied by the quantities of information of Fisher and Shannon, Inform. Control, 2 (1959), 101–112.

[6] R. ZAMIR, A proof of the Fisher information matrix inequality via a data processing argument, IEEE Trans. on Information Theory, 44(3) (1998), 1246–1250.

[7] R. ZAMIR, A necessary and sufficient condition for equality in the matrix Fisher information inequality, Technical report, Tel Aviv University, Dept. Elec. Eng. Syst., 1997. Available online: http://www.eng.tau.ac.il/~zamir/techreport/crb.ps.gz

[8] H.L. VAN TREES, Detection, Estimation, and Modulation Theory, Part I, John Wiley and Sons, New York, 1968.
