
Maximization of Information Divergence from Multinomial Distributions

Jozef Juríček
Charles University in Prague
Faculty of Mathematics and Physics
Department of Probability and Mathematical Statistics

Supervisor: Ing. František Matúš, CSc.
Academy of Sciences of the Czech Republic
Institute of Information Theory and Automation
Department of Decision-Making Theory

ROBUST 2010, Králiky, 4th February

Outline: 1 Introduction · 2 Preliminaries · 3 Result · 4 Further Research

References

[1] Ay, N., Knauf, A. (2006). Maximizing multi-information. Kybernetika 42 517-538.
[2] Csiszár, I., Matúš, F. (2003). Information projections revisited. IEEE Transactions on Information Theory 49 1474-1490.
[3] Matúš, F. (2004). Maximization of information divergences from binary i.i.d. sequences. Proceedings IPMU 2004 2 1303-1306. Perugia, Italy.

• This problem is a generalization of the problem solved in [3].


1 Introduction

• The more general problem of maximizing information divergence from an exponential family emerged in probabilistic models for evolution and learning in neural networks, based on infomax principles.

• Maximizers admit an interpretation as stochastic systems of high complexity w.r.t. the exponential family.

• Loosely speaking, the problem is to find all (empirical) distributions (data) which lie farthest from the model when modelling by the exponential family, in the sense of the Kullback-Leibler divergence and the maximum likelihood estimation method.

• In other words, in the specific situation of the multinomial family ($N$ independent identical trials, each with $n$ possible outcomes; $N \geq 2$, $n \geq 2$ fixed):

\[
  \mathcal{M} = \Bigl\{\, Q(z) = \tbinom{N}{z} \prod_{j=1}^{n} p_j^{z_j},\ z \in Z \ :\ p \in [0,1]^n,\ \sum_{j=1}^{n} p_j = 1 \,\Bigr\},
\]
where
\[
  Z = \Bigl\{\, z \in \{0,1,\dots,N\}^n : \sum_{j=1}^{n} z_j = N \,\Bigr\} \ \text{the state space (or configurations)}, \qquad \tbinom{N}{z} = \frac{N!}{z_1!\, z_2! \cdots z_n!},
\]
calculate $\sup_{P \in \mathcal{P}(Z)} D(P\|\mathcal{M})$ and find all maximizers $\arg\sup_{P \in \mathcal{P}(Z)} D(P\|\mathcal{M})$. Here
\[
  D(P\|\nu) = \begin{cases} \displaystyle\sum_{z \in s(P)} P(z) \ln \frac{P(z)}{\nu(z)}, & s(P) \subseteq s(\nu), \\ +\infty, & \text{otherwise}, \end{cases} \qquad s(P) = \{\, z \in Z : P(z) > 0 \,\},
\]
\[
  D(P\|\mathcal{E}) = \inf_{Q \in \mathcal{E}} D(P\|Q) = \min_{Q \in \overline{\mathcal{E}}} D(P\|Q), \qquad \mathcal{P}(Z) \ \text{the simplex of all probability measures on } Z.
\]

[Figure: the multinomial family $\mathcal{M}$ inside the simplex $\mathcal{P}(Z)$; panels for $N = 2, n = 2$ and $N = 3, n = 2$.]
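All objects above are finite, so they can be checked directly. A minimal Python sketch of the definitions (the function names `state_space`, `multinomial_pmf`, and `kl` are mine, not from the poster):

```python
import math
from itertools import product

def state_space(N, n):
    """Z = { z in {0,...,N}^n : z_1 + ... + z_n = N }."""
    return [z for z in product(range(N + 1), repeat=n) if sum(z) == N]

def multinomial_pmf(z, p):
    """Q(z) = (N choose z) * prod_j p_j^{z_j} for a configuration z."""
    N = sum(z)
    coef = math.factorial(N)
    for zj in z:
        coef //= math.factorial(zj)
    return coef * math.prod(pj ** zj for pj, zj in zip(p, z))

def kl(P, Q):
    """D(P||Q): sum over the support s(P); +infinity if s(P) is not in s(Q)."""
    d = 0.0
    for z, pz in P.items():
        if pz > 0.0:
            qz = Q.get(z, 0.0)
            if qz == 0.0:
                return math.inf
            d += pz * math.log(pz / qz)
    return d

# Sanity checks for N = 2, n = 2: Z has three configurations, Q is a pm.
Z = state_space(2, 2)
Q = {z: multinomial_pmf(z, (0.5, 0.5)) for z in Z}
print(sorted(Z))   # [(0, 2), (1, 1), (2, 0)]
print(Q[(1, 1)])   # 0.5
print(kl(Q, Q))    # 0.0
```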



2 Preliminaries

• Exponential family $\mathcal{E}_{\mu,f} = \bigl\{\, Q_{\mu,f,\vartheta} \sim \bigl(e^{\langle \vartheta, f(z)\rangle}\mu(z)\bigr)_{z \in Z} : \vartheta \in \mathbb{R}^d \,\bigr\}$, where $\mu$ is a nonzero reference measure on a finite set $Z$ and $f : Z \to \mathbb{R}^d$ is the directional statistic.

Theorem 1. Let $P \in \mathcal{P}(Z)$, let $\mu$ be a strictly positive measure on $Z$ ($s(\mu) = Z$), and let $\mathcal{E} = \mathcal{E}_{\mu,f}$ be the exponential family. Then there exists a unique rI-projection (generalized MLE; GMLE) $P^{\mathcal{E}} \in \overline{\mathcal{E}}$ such that $P^{\mathcal{E}} = \arg\inf_{Q \in \mathcal{E}} D(P\|Q) = \arg\min_{Q \in \overline{\mathcal{E}}} D(P\|Q)$. For $P$ an empirical distribution such that $P^{\mathcal{E}} \in \mathcal{E}$, $P^{\mathcal{E}}$ is the MLE for data with empirical distribution $P$. Details in [2].

• $\mathcal{M} = \mathcal{E}_{\mu,f}$ with $f(z) = z$, $\mu(z) = \tbinom{N}{z}$, $z \in Z = \{\, z \in \{0,1,\dots,N\}^n : \sum_{j=1}^{n} z_j = N \,\}$.
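Theorem 1 makes $D(P\|\mathcal{M})$ computable in closed form: for the multinomial family the mean-value characterisation of the rI-projection gives the familiar MLE $\hat p_j = \mathrm{E}_P[z_j]/N$, so $D(P\|\mathcal{M}) = D(P\|Q_{\hat p})$. A sketch of this computation (helper names are mine):

```python
import math

def multinomial_pmf(z, p):
    """Q(z) = (N choose z) * prod_j p_j^{z_j}."""
    N = sum(z)
    coef = math.factorial(N)
    for zj in z:
        coef //= math.factorial(zj)
    return coef * math.prod(pj ** zj for pj, zj in zip(p, z))

def divergence_from_M(P, N, n):
    """D(P||M) = D(P||Q_{p_hat}), p_hat_j = E_P[z_j]/N (rI-projection / MLE)."""
    p_hat = [sum(pz * z[j] for z, pz in P.items()) / N for j in range(n)]
    return sum(pz * math.log(pz / multinomial_pmf(z, p_hat))
               for z, pz in P.items() if pz > 0.0)

# P puts mass 1/2 on (2,0) and (0,2); its projection is the fair multinomial,
# and D(P||M) = ln 2, matching (N-1) ln n for N = n = 2.
P = {(2, 0): 0.5, (0, 2): 0.5}
print(divergence_from_M(P, 2, 2))   # 0.6931... = ln 2
```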



• Denote $X = \{1,\dots,n\}^N$; for a permutation $\pi \in S_N$, $\pi : \{1,\dots,N\} \leftrightarrow \{1,\dots,N\}$, $x \in X$ and $P \in \mathcal{P}(X)$ denote $x^{\pi} = (x_{\pi(1)},\dots,x_{\pi(N)})^{\top}$ and $P^{\pi}(x) = P(x^{\pi})$.

• Exchangeable distributions $\mathcal{E} := \{\, P \in \mathcal{P}(X) : P(x) = P(x^{\pi}),\ x \in X,\ \forall \pi \in S_N \,\}$.

• 1-factorizable distributions $\mathcal{F} := \{\, Q \in \mathcal{P}(X) : Q(x) = \prod_{i=1}^{N} Q_i(x_i),\ x \in X \,\}$, where $Q_i(x_i) = \sum_{x' \in X :\, x'_i = x_i} Q(x')$, $i = 1,\dots,N$, are the marginals.

Lemma 2. Denote $X^z := \{\, x \in X : \forall j \in \{1,\dots,n\} : |\{\, i \in \{1,\dots,N\} : x_i = j \,\}| = z_j \,\}$, $z \in Z$ (recall $Z = \{\, z \in \{0,1,\dots,N\}^n : \sum_{j=1}^{n} z_j = N \,\}$). It holds:

(i) The mapping $h : \mathcal{P}(Z) \leftrightarrow \mathcal{E}$ such that $h(P) = P'$, $P'(x) = P(z)/\tbinom{N}{z}$ for the $z \in Z$ with $x \in X^z$, is a bijection, $h{\restriction}\mathcal{M} : \mathcal{M} \leftrightarrow \mathcal{E} \cap \mathcal{F} = \mathcal{E} \cap \overline{\mathcal{F}}$, and for $h^{-1} : \mathcal{E} \leftrightarrow \mathcal{P}(Z)$, the inverse of $h$, $h^{-1}(P') = P$ with $P(z) = \tbinom{N}{z} P'(x)$ for any $x \in X^z$.

(ii) For any $P \in \mathcal{P}(Z)$ and $Q \in \mathcal{M}$, it holds $D(P\|Q) = D\bigl(h(P)\,\big\|\,h(Q)\bigr)$.

(iii) For any $P \in \mathcal{E}$ and $Q \in \mathcal{F} \setminus (\mathcal{E} \cap \mathcal{F})$, there exists $\pi \in S_N$ such that for $Q^{\pi}$, $Q^{\pi}(x) = Q(x^{\pi})$, it holds $Q^{\pi} \not\equiv Q$ and $D(P\|Q) = D(P\|Q^{\pi})$.

(iv) For any $P \in \mathcal{E}$: $D(P\|\mathcal{F}) = \inf_{Q \in \mathcal{E} \cap \mathcal{F}} D(P\|Q)$ and $\arg\inf_{Q \in \mathcal{E} \cap \mathcal{F}} D(P\|Q) = P^{\mathcal{F}} \in \mathcal{E} \cap \overline{\mathcal{F}}$.

(v) $\sup_{P \in \mathcal{P}(Z)} D(P\|\mathcal{M}) = \sup_{P \in \mathcal{E}} D(P\|\mathcal{E} \cap \mathcal{F}) = \sup_{P \in \mathcal{E}} D(P\|\mathcal{F}) \le \sup_{P \in \mathcal{P}(X)} D(P\|\mathcal{F})$ and $\arg\sup_{P \in \mathcal{P}(Z)} D(P\|\mathcal{M}) = h^{-1}\bigl(\arg\sup_{P \in \mathcal{E}} D(P\|\mathcal{E} \cap \mathcal{F})\bigr) = h^{-1}\bigl(\arg\sup_{P \in \mathcal{E}} D(P\|\mathcal{F})\bigr)$.

[Diagram: $h$ maps $\mathcal{P}(Z)$ bijectively onto $\mathcal{E} \subset \mathcal{P}(X)$, and $h{\restriction}\mathcal{M}$ maps $\mathcal{M}$ onto $\mathcal{E} \cap \mathcal{F}$, with inverse $h^{-1}$.]

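Lemma 2(i)-(ii) can be spot-checked numerically: $h$ spreads the mass of each configuration $z$ uniformly over the sequences of type $z$, and divergences are preserved. A sketch under the poster's definitions (function names mine):

```python
import math
from itertools import product

def type_of(x, n):
    """The configuration z of a sequence x: z_j = |{ i : x_i = j }|."""
    return tuple(sum(1 for xi in x if xi == j) for j in range(1, n + 1))

def binom_N_z(z):
    """(N choose z) = N! / (z_1! ... z_n!)."""
    coef = math.factorial(sum(z))
    for zj in z:
        coef //= math.factorial(zj)
    return coef

def h(P, N, n):
    """h(P)(x) = P(z) / (N choose z) for the z with x in X^z: exchangeable on X."""
    return {x: P.get(type_of(x, n), 0.0) / binom_N_z(type_of(x, n))
            for x in product(range(1, n + 1), repeat=N)}

def kl(P, Q):
    """D(P||Q), assuming s(P) is contained in s(Q)."""
    return sum(pz * math.log(pz / Q[z]) for z, pz in P.items() if pz > 0.0)

# Check (ii) on an example: P on Z, Q the multinomial with p = (0.3, 0.7), N = 2.
P = {(2, 0): 0.5, (0, 2): 0.5}
Q = {(2, 0): 0.09, (1, 1): 0.42, (0, 2): 0.49}
print(abs(kl(P, Q) - kl(h(P, 2, 2), h(Q, 2, 2))) < 1e-12)   # True
```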



• For $P \in \mathcal{P}(X)$: $D(P\|\mathcal{F}) = M(P)$, the multi-information.

Theorem 3 (Maximizing the multi-information).
\[
  \arg\sup_{P \in \mathcal{P}(X)} D(P\|\mathcal{F}) = \Bigl\{\, P_{\Pi} = \tfrac{1}{n}\sum_{j=1}^{n} \delta_{(j,\pi_2(j),\dots,\pi_N(j))^{\top}} : \Pi = (\pi_2,\dots,\pi_N) \in S_n^{N-1} \,\Bigr\},
\]
$D(P_{\Pi}\|\mathcal{F}) = (N-1)\ln n$ and $P_{\Pi}^{\mathcal{F}} = U_X = \tfrac{1}{n^N}\sum_{x \in X} \delta_x$, $\forall \Pi \in S_n^{N-1}$. Main result of [1].
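Theorem 3 can be illustrated by computing the multi-information $M(P) = \sum_i H(P_i) - H(P) = D(P\|\mathcal{F})$ directly for a maximizer $P_{\Pi}$. A small check in Python (names mine), with $N = 3$, $n = 2$ and $\pi_2 = \pi_3 = \mathrm{id}$:

```python
import math
from itertools import product

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def multi_information(P, N, n):
    """M(P) = sum_i H(P_i) - H(P) = D(P || product of its own marginals)."""
    mi = -entropy(P.values())
    for i in range(N):
        marginal = [sum(pz for x, pz in P.items() if x[i] == j)
                    for j in range(1, n + 1)]
        mi += entropy(marginal)
    return mi

# P_Pi = (1/n) sum_j delta_{(j, pi2(j), ..., piN(j))} with identity permutations:
N, n = 3, 2
P = {(j, j, j): 1.0 / n for j in range(1, n + 1)}
print(multi_information(P, N, n))   # 1.3862... = 2 ln 2 = (N-1) ln n
```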

3 Result

Notation: $e^{j,j} = (0,\dots,0,1_j,0,\dots,0_n)^{\top}$, $\epsilon^{j,j} = \delta_{2e^{j,j}}$; $e^{k,l} = (0,\dots,0,1_k,0,\dots,0,1_l,0,\dots,0_n)^{\top}$, $\epsilon^{k,l} = 2\delta_{e^{k,l}}$, $k < l$.

Corollary 4 (Maximizing $D(\cdot\|\mathcal{M})$). $\arg\sup_{P \in \mathcal{P}(Z)} D(P\|\mathcal{M}) = h^{-1}\bigl(\mathcal{E} \cap \arg\sup_{P \in \mathcal{P}(X)} D(P\|\mathcal{F})\bigr)$.
For $N = 2$,
\[
  \arg\sup_{P \in \mathcal{P}(Z)} D(P\|\mathcal{M}) = \Bigl\{\, P_{\pi} = \tfrac{1}{n} \sum_{\substack{j \in \{1,\dots,n\}: \\ j \le \pi(j)}} \epsilon^{j,\pi(j)} \ :\ \pi \in S_n,\ \forall j,k \in \{1,\dots,n\} : [\pi(j) = k] \Rightarrow [\pi(k) = j] \,\Bigr\}.
\]
For $N > 2$, the only maximizer is $P_{\mathrm{Id}} = \tfrac{1}{n}\sum_{j=1}^{n} \delta_{N e^{j,j}}$.
In all cases $\sup_{P \in \mathcal{P}(Z)} D(P\|\mathcal{M}) = (N-1)\ln n$, and for every maximizer $P_{\sup}$ it holds $P_{\sup}^{\mathcal{M}}(z) = \tbinom{N}{z}/n^N$, $z \in Z$ (i.e. $P_{\sup}^{\mathcal{M}} = h^{-1}(U_X)$).

• The application of Theorem 1, Lemma 2 and Theorem 3 gives an essentially simpler proof (and in a more general situation) than the proof given in [3].

[Figure: the maximizers for $N = 2, n = 2$ and $N = 3, n = 2$.]

4 Further research

(?) For $\mathcal{F}_k$, the $k$-factorizable distributions, study of $D\bigl(\,\cdot\,\big\|\,h^{-1}(\mathcal{E} \cap \mathcal{F}_k)\bigr)$.

Example ($N = 2$, $n = 3$):
\[
  \arg\sup_{\mathcal{P}(X)} D(P\|\mathcal{F}) = \bigl\{\, \tfrac{1}{3}(\delta_{11}+\delta_{22}+\delta_{33}),\ \tfrac{1}{3}(\delta_{11}+\delta_{23}+\delta_{32}),\ \tfrac{1}{3}(\delta_{13}+\delta_{22}+\delta_{31}),\ \tfrac{1}{3}(\delta_{12}+\delta_{21}+\delta_{33}),\ \tfrac{1}{3}(\delta_{12}+\delta_{23}+\delta_{31}),\ \tfrac{1}{3}(\delta_{13}+\delta_{21}+\delta_{32}) \,\bigr\},
\]
\[
  \arg\sup_{\mathcal{E}} D(P\|\mathcal{E} \cap \mathcal{F}) = \bigl\{\, \tfrac{1}{3}(\delta_{11}+\delta_{22}+\delta_{33}),\ \tfrac{1}{3}(\delta_{11}+\delta_{23}+\delta_{32}),\ \tfrac{1}{3}(\delta_{13}+\delta_{22}+\delta_{31}),\ \tfrac{1}{3}(\delta_{12}+\delta_{21}+\delta_{33}) \,\bigr\},
\]
\[
  \arg\sup_{\mathcal{P}(Z)} D(P\|\mathcal{M}) = \bigl\{\, \tfrac{1}{3}(\delta_{200}+\delta_{020}+\delta_{002}),\ \tfrac{1}{3}\delta_{200}+\tfrac{2}{3}\delta_{011},\ \tfrac{1}{3}\delta_{020}+\tfrac{2}{3}\delta_{101},\ \tfrac{1}{3}\delta_{002}+\tfrac{2}{3}\delta_{110} \,\bigr\},
\]
\[
  \sup_{\mathcal{P}(Z)} D(P\|\mathcal{M}) = \ln 3.
\]
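The four $N = 2$, $n = 3$ maximizers listed above can be verified numerically: each attains $D(P\|\mathcal{M}) = \ln 3 = (N-1)\ln n$. A check using the closed-form rI-projection $\hat p_j = \mathrm{E}_P[z_j]/N$ (helper names mine):

```python
import math

def multinomial_pmf(z, p):
    """Q(z) = (N choose z) * prod_j p_j^{z_j}."""
    N = sum(z)
    coef = math.factorial(N)
    for zj in z:
        coef //= math.factorial(zj)
    return coef * math.prod(pj ** zj for pj, zj in zip(p, z))

def divergence_from_M(P, N, n):
    """D(P||M) via the MLE p_hat_j = E_P[z_j]/N (rI-projection onto M)."""
    p_hat = [sum(pz * z[j] for z, pz in P.items()) / N for j in range(n)]
    return sum(pz * math.log(pz / multinomial_pmf(z, p_hat))
               for z, pz in P.items() if pz > 0.0)

third = 1.0 / 3.0
maximizers = [
    {(2, 0, 0): third, (0, 2, 0): third, (0, 0, 2): third},
    {(2, 0, 0): third, (0, 1, 1): 2 * third},
    {(0, 2, 0): third, (1, 0, 1): 2 * third},
    {(0, 0, 2): third, (1, 1, 0): 2 * third},
]
for P in maximizers:
    print(abs(divergence_from_M(P, 2, 3) - math.log(3)) < 1e-9)   # True, 4 times
```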

Contact: jozef.juricek@matfyz.cz · jozef.juricek.matfyz.cz
