
Towards Optimal Learning of Chain graphs under Composition

Jose M. Peña
Dag Sonntag

ADIT, Linköping University, Sweden

15-16 March 2012


Overview

• Chain graphs
• Expressiveness
• Equivalence
• Learning


Chain graphs

• A chain graph for four binary random variables.

p(ABCD) = p(A) p(B) p(CD|AB)
        = (1/Z) ψ1_A(A) ψ2_B(B) ψ3_AB(AB) ψ3_CA(CA) ψ3_BD(BD) ψ3_CD(CD)

ψ1_A(A)      A = 0   A = 1
             0.2     0.8

ψ2_B(B)      B = 0   B = 1
             0.3     0.7

ψ3_AB(AB)    B = 0   B = 1
A = 0        1       2
A = 1        4       3

ψ3_CA(CA)    A = 0   A = 1
C = 0        5       6
C = 1        8       7

ψ3_BD(BD)    D = 0   D = 1
B = 0        9       10
B = 1        11      12

ψ3_CD(CD)    D = 0   D = 1
C = 0        13.3    14.4
C = 1        15.5    16.6

• A chain graph represents a probability distribution by factorizing it according to a graph.

• The graph represents independencies holding in the distribution, which
  – makes efficient inference possible, and
  – gives insight into the distribution without numerical calculation.
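The factorization above can be evaluated numerically. The following is a minimal Python sketch (not from the slides) that uses the example potentials; the one assumption added here is that the normalization for the {C, D} component is carried out per configuration of its parents A and B, in line with the general factorization definition given on a later slide.

```python
from itertools import product

# Example potentials, copied from the tables above.
psi1_A = {0: 0.2, 1: 0.8}
psi2_B = {0: 0.3, 1: 0.7}
psi3_AB = {(0, 0): 1, (0, 1): 2, (1, 0): 4, (1, 1): 3}              # indexed (A, B)
psi3_CA = {(0, 0): 5, (0, 1): 6, (1, 0): 8, (1, 1): 7}              # indexed (C, A)
psi3_BD = {(0, 0): 9, (0, 1): 10, (1, 0): 11, (1, 1): 12}           # indexed (B, D)
psi3_CD = {(0, 0): 13.3, (0, 1): 14.4, (1, 0): 15.5, (1, 1): 16.6}  # indexed (C, D)

def p_CD_given_AB(c, d, a, b):
    """p(CD|AB): product of the potentials of the {C, D} component, normalized
    over C and D for the given parent configuration (a, b) (assumption noted above)."""
    term = lambda cc, dd: psi3_AB[(a, b)] * psi3_CA[(cc, a)] * psi3_BD[(b, dd)] * psi3_CD[(cc, dd)]
    Z = sum(term(cc, dd) for cc, dd in product((0, 1), repeat=2))
    return term(c, d) / Z

def p_ABCD(a, b, c, d):
    """The joint p(ABCD) = p(A) p(B) p(CD|AB)."""
    return psi1_A[a] * psi2_B[b] * p_CD_given_AB(c, d, a, b)

# Sanity check: a distribution that factorizes according to the CG sums to one.
print(sum(p_ABCD(a, b, c, d) for a, b, c, d in product((0, 1), repeat=4)))  # 1.0 (up to floating point)
```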



Chain graphs con't

• A chain graph (CG) is a graph (possibly) containing both undirected and directed edges and no directed pseudocycles.

• A connectivity component of a CG G is a set of nodes such that there exists an undirected path in G between every pair of nodes in the set. The set must be maximal with respect to set inclusion.

• The subgraph of G induced by a set of nodes I, G_I, is the graph over I where two nodes are connected by a (un)directed edge if that edge is in G.

• The moral graph of G, G^m, is the undirected graph where two nodes are adjacent iff they are adjacent in G or they are both in Pa(Bi) for some connectivity component Bi of G.

• C(G^m) denotes the set of complete sets in G^m that are maximal with respect to set inclusion.

• Example: [Figure: G, its connectivity components, G_{B3 ∪ Pa(B3)}, (G_{B3 ∪ Pa(B3)})^m and C((G_{B3 ∪ Pa(B3)})^m).] The components are B1 = {A}, B2 = {B}, B3 = {C, D}, and C((G_{B3 ∪ Pa(B3)})^m) = {{A, B}, {A, C}, {B, D}, {C, D}}.
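The constructions on this slide can be written down compactly. Below is a minimal Python sketch computing connectivity components, the moral graph of an induced subgraph, and its maximal complete sets. The graph encoding and the example edges A → C, B → D, C − D are assumptions of the sketch (the original graph figure is not reproduced here); they are consistent with the components and complete sets listed above.

```python
from itertools import combinations

# Example CG (an assumption; consistent with the components and complete sets above):
nodes = {"A", "B", "C", "D"}
directed = {("A", "C"), ("B", "D")}      # arcs, written (tail, head)
undirected = {frozenset(("C", "D"))}     # undirected edges

def connectivity_components(node_set, undirected_edges):
    """Maximal sets of nodes pairwise connected by undirected paths."""
    comps, seen = [], set()
    for v in node_set:
        if v in seen:
            continue
        comp, stack = {v}, [v]
        while stack:
            x = stack.pop()
            for e in undirected_edges:
                if x in e:
                    (y,) = e - {x}
                    if y not in comp:
                        comp.add(y)
                        stack.append(y)
        comps.append(comp)
        seen |= comp
    return comps

def parents(component, arcs):
    """Pa(.): tails of arcs pointing into the component from outside it."""
    return {t for (t, h) in arcs if h in component and t not in component}

def moral_graph(I, arcs, undirected_edges):
    """Moral graph of G_I: adjacencies of G_I plus an edge between every pair of
    parents of each connectivity component of G_I."""
    d = {(t, h) for (t, h) in arcs if t in I and h in I}
    u = {e for e in undirected_edges if e <= I}
    edges = {frozenset(e) for e in d} | set(u)
    for comp in connectivity_components(I, u):
        edges |= {frozenset(p) for p in combinations(sorted(parents(comp, d)), 2)}
    return edges

def maximal_complete_sets(I, edges):
    """C(.): complete node sets that are maximal with respect to set inclusion."""
    complete = [set(S) for r in range(1, len(I) + 1) for S in combinations(sorted(I), r)
                if all(frozenset(p) in edges for p in combinations(S, 2))]
    return [S for S in complete if not any(S < T for T in complete)]

B3 = {"C", "D"}                                    # the component {C, D}
I = B3 | parents(B3, directed)                     # B3 together with Pa(B3)
print(connectivity_components(nodes, undirected))  # the components {A}, {B}, {C, D} (in some order)
print(maximal_complete_sets(I, moral_graph(I, directed, undirected)))
# -> the four pairs {A, B}, {A, C}, {B, D}, {C, D}
```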



Chain graphs con't

• We say that a probability distribution p factorizes according to G if
  – p(x) = ∏_{i=1..n} p(x_Bi | x_Pa(Bi)), where
  – p(x_Bi | x_Pa(Bi)) = (1/Z_i) ∏_{C ∈ C((G_{Bi ∪ Pa(Bi)})^m)} ψi_C(x_C), where each ψi_C(x_C) is a nonnegative real function, and each Z_i is a normalization constant.

• Bayesian networks and Markov networks ⊂ CGs ⊂ factor graphs.

• A CG for four binary random variables.

p(ABCD) = p(A) p(B) p(CD|AB)
        = (1/Z) ψ1_A(A) ψ2_B(B) ψ3_AB(AB) ψ3_CA(CA) ψ3_BD(BD) ψ3_CD(CD)

ψ1_A(A)      A = 0   A = 1
             0.2     0.8

ψ2_B(B)      B = 0   B = 1
             0.3     0.7

ψ3_AB(AB)    B = 0   B = 1
A = 0        1       2
A = 1        4       3

ψ3_CA(CA)    A = 0   A = 1
C = 0        5       6
C = 1        8       7

ψ3_BD(BD)    D = 0   D = 1
B = 0        9       10
B = 1        11      12

ψ3_CD(CD)    D = 0   D = 1
C = 0        13.3    14.4
C = 1        15.5    16.6



Chain graphs con't

• The ancestors of a set of nodes I in G, An(I), are the nodes V1 such that there exists a path V1, ..., Vl with Vl ∈ I where Vi − Vi+1 or Vi → Vi+1 is in G for all i.

• Given three disjoint sets of nodes I, J and K, we say that I is separated from J given K in G, I ⊥_G J|K, when every path in (G_{An(IJK)})^m from a node in I to a node in J has some node in K (sketched below).

• We say that a probability distribution p is Markovian with respect to G when I ⊥_p J|K if I ⊥_G J|K for all I, J and K.

• We say that p is faithful to G when I ⊥_p J|K iff I ⊥_G J|K for all I, J and K.

• If p is strictly positive, then p factorizes according to G iff p is Markovian with respect to G.

• This is the classical Lauritzen-Wermuth-Frydenberg interpretation of CGs. There is an alternative interpretation due to Andersson-Madigan-Perlman.
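A minimal Python sketch of the separation test mentioned above: moralize the subgraph induced by An(I ∪ J ∪ K) and check that K blocks every path from I to J. The graph encoding and the example CG A → C, B → D, C − D are assumptions carried over from the earlier sketches, not notation from the slides.

```python
from itertools import combinations

def ancestors(I, node_set, arcs, undirected_edges):
    """An(I): nodes that reach I along edges Vi -> Vi+1 or Vi - Vi+1 (I itself included)."""
    an = set(I)
    changed = True
    while changed:
        new = {x for x in node_set - an
               if any((x, y) in arcs or frozenset((x, y)) in undirected_edges for y in an)}
        an |= new
        changed = bool(new)
    return an

def moral_graph(I, arcs, undirected_edges):
    """Moral graph of G_I (same construction as in the earlier sketch)."""
    d = {(t, h) for (t, h) in arcs if t in I and h in I}
    u = {e for e in undirected_edges if e <= I}
    edges = {frozenset(e) for e in d} | set(u)
    comps, seen = [], set()
    for v in I:                                   # connectivity components of G_I
        if v in seen:
            continue
        comp, stack = {v}, [v]
        while stack:
            x = stack.pop()
            for e in u:
                if x in e:
                    (y,) = e - {x}
                    if y not in comp:
                        comp.add(y)
                        stack.append(y)
        comps.append(comp)
        seen |= comp
    for comp in comps:                            # marry the parents of each component
        pa = {t for (t, h) in d if h in comp and t not in comp}
        edges |= {frozenset(p) for p in combinations(sorted(pa), 2)}
    return edges

def separated(I, J, K, node_set, arcs, undirected_edges):
    """I _|_G J | K: every path from I to J in (G_{An(I u J u K)})^m hits K."""
    an = ancestors(set(I) | set(J) | set(K), node_set, arcs, undirected_edges)
    edges = moral_graph(an, arcs, undirected_edges)
    reached = set(I) - set(K)
    stack = list(reached)
    while stack:                                  # search the moral graph, never entering K
        x = stack.pop()
        for e in edges:
            if x in e:
                (y,) = e - {x}
                if y not in reached and y not in K:
                    reached.add(y)
                    stack.append(y)
    return not (reached & set(J))

# Example CG (assumption, as before): A -> C, B -> D, C - D.
nodes = {"A", "B", "C", "D"}
directed = {("A", "C"), ("B", "D")}
undirected = {frozenset(("C", "D"))}
print(separated({"A"}, {"B"}, set(), nodes, directed, undirected))   # True:  A _|_G B
print(separated({"A"}, {"B"}, {"C"}, nodes, directed, undirected))   # False: conditioning on C connects A and B
```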



Expressiveness

• Each entry of the table below contains the exact and (MCMC) approximate fractions of CG independence models that are, in this order, Markov network independence models, Bayesian network independence models, none of them, and the time required to compute them (in hours, C++ implementation run on a Pentium 2.4 GHz, 512 MB RAM and Windows 2000).∗

NODES   EXACT                          APPROXIMATE
2       1.00000, 1.00000, 0.00000      1.00000, 1.00000, 0.00000, 1.5 h
3       0.72727, 1.00000, 0.00000      0.73600, 1.00000, 0.00000, 1.9 h
4       0.32000, 0.92500, 0.06000      0.32200, 0.92700, 0.06200, 2.3 h
5       0.08890, 0.76239, 0.22007      0.08200, 0.76500, 0.22600, 2.8 h
6                                      0.02100, 0.56900, 0.42100, 3.4 h
7                                      0.00500, 0.40200, 0.59400, 4.2 h
8                                      0.00000, 0.30200, 0.69800, 5.1 h
9                                      0.00000, 0.19800, 0.80200, 6.4 h
10                                     0.00000, 0.13700, 0.86300, 8.2 h
11                                     0.00000, 0.06400, 0.93600, 12.5 h
12                                     0.00000, 0.05100, 0.94900, 12.9 h
13                                     0.00000, 0.04100, 0.95900, 19.2 h

∗ Peña. AISTATS, 2007.



Expressiveness con't

• What if the pure CG independence models do not correspond to any probability distribution? Well, then CGs are of no use for our purpose.

• Let D+(G) and N(G) denote, respectively, all the strictly positive discrete probability distributions and all the regular Gaussian distributions that factorize according to G.

• Theorem: Let G be a CG of dimension d. The parameter space for D+(G) has positive Lebesgue measure with respect to R^d. The subset of the parameter space for D+(G) that corresponds to the probability distributions in D+(G) that are not faithful to G has zero Lebesgue measure with respect to R^d.∗

• Theorem: Let G be a CG of dimension d. The parameter space for N(G) has positive Lebesgue measure with respect to R^d. The subset of the parameter space for N(G) that corresponds to the probability distributions in N(G) that are not faithful to G has zero Lebesgue measure with respect to R^d.†

∗ Peña. IJAR, 2009.
† Peña. AISTATS, 2011.



Equivalence

• Corollary: Let G and H denote two CGs. The following statements are equivalent in the framework of both strictly positive discrete probability distributions and regular Gaussian distributions:
  – G and H are factorization equivalent.
  – G and H are Markovian equivalent.
  – G and H are independence equivalent.

• A path V1, ..., Vl in G is called a complex if the subgraph of G induced by the set of nodes in the path looks like V1 → V2 − ... − V_{l-1} ← Vl.

• G and H are Markovian equivalent iff they have the same adjacencies and the same complexes.

• Thanks to the corollary above, this graphical characterization applies to the other two definitions of equivalence.

A → B → C → D = A ← B ← C ← D = A − B − C − D = A ← B − C → D
A → B → C → D ≠ A → B ← C ← D ≠ A → B → C ← D ≠ A → B − C ← D
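This characterization is easy to check by brute force on small graphs. Below is a minimal Python sketch (graph encoding as in the earlier sketches, an assumption of this note): it enumerates complexes, compares adjacencies and complexes of two CGs, and reproduces one equality and one inequality from the examples above.

```python
from itertools import permutations

def adjacencies(arcs, undirected_edges):
    return {frozenset(e) for e in arcs} | set(undirected_edges)

def complexes(node_set, arcs, undirected_edges):
    """Paths V1 -> V2 - ... - V_{l-1} <- Vl whose induced subgraph contains no other
    edges; each complex is recorded once (up to reversal). Brute force over paths."""
    adj = adjacencies(arcs, undirected_edges)
    found = set()
    for k in range(3, len(node_set) + 1):
        for path in permutations(sorted(node_set), k):
            consecutive = ((path[0], path[1]) in arcs
                           and (path[-1], path[-2]) in arcs
                           and all(frozenset((path[i], path[i + 1])) in undirected_edges
                                   for i in range(1, k - 2)))
            no_extra_edges = all(frozenset((path[i], path[j])) not in adj
                                 for i in range(k) for j in range(i + 2, k))
            if consecutive and no_extra_edges:
                found.add(min(path, path[::-1]))
    return found

def markov_equivalent(node_set, cg1, cg2):
    """Same adjacencies and same complexes (the characterization above)."""
    (d1, u1), (d2, u2) = cg1, cg2
    return (adjacencies(d1, u1) == adjacencies(d2, u2)
            and complexes(node_set, d1, u1) == complexes(node_set, d2, u2))

nodes = {"A", "B", "C", "D"}
chain    = ({("A", "B"), ("B", "C"), ("C", "D")}, set())              # A -> B -> C -> D
equiv    = ({("B", "A"), ("C", "D")}, {frozenset(("B", "C"))})        # A <- B - C -> D
nonequiv = ({("A", "B"), ("D", "C")}, {frozenset(("B", "C"))})        # A -> B - C <- D
print(markov_equivalent(nodes, chain, equiv))     # True
print(markov_equivalent(nodes, chain, nonequiv))  # False: nonequiv has the complex A -> B - C <- D
```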



Learning

• Given a probability distribution p, find a CG G such that p is Markovian with respect to G and p is not Markovian with respect to any subgraph of G.

• The boundary of a node A, Bd(A), is the set of nodes B such that B → A or B − A is in G.

• A node Vl is a descendant of a node V1 in G if there exists a path V1, ..., Vl such that Vi − Vi+1 or Vi → Vi+1 is in G for all i and Vi → Vi+1 is in G for some i.

• Nd(A) denotes the non-descendants of the node A in G.

• p is Markovian with respect to G iff A ⊥_p Nd(A) \ Bd(A) | Bd(A) for all A (sketched below).
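A minimal Python sketch of the notions above (Bd, descendants, Nd) and of the Markov check, assuming the CG encoding of the earlier sketches and a caller-supplied independence oracle indep(I, J, K) for p (for example, statistical tests on data); none of these helper names come from the slides.

```python
def bd(a, arcs, undirected_edges):
    """Bd(A): all B with B -> A or B - A in G."""
    return ({t for (t, h) in arcs if h == a}
            | {next(iter(e - {a})) for e in undirected_edges if a in e})

def descendants(a, node_set, arcs, undirected_edges):
    """Nodes reachable from A along -> and - edges using at least one -> edge.
    The search state records whether a directed edge has been used so far."""
    frontier, seen, out = [(a, False)], {(a, False)}, set()
    while frontier:
        x, used = frontier.pop()
        for (t, h) in arcs:
            if t == x and (h, True) not in seen:
                seen.add((h, True))
                frontier.append((h, True))
                out.add(h)
        for e in undirected_edges:
            if x in e:
                y = next(iter(e - {x}))
                if (y, used) not in seen:
                    seen.add((y, used))
                    frontier.append((y, used))
                    if used:
                        out.add(y)
    return out

def nd(a, node_set, arcs, undirected_edges):
    """Nd(A): every node other than A that is not a descendant of A (whether A itself
    is counted is left implicit on the slides; excluding it is an assumption here)."""
    return node_set - {a} - descendants(a, node_set, arcs, undirected_edges)

def markovian(node_set, arcs, undirected_edges, indep):
    """p is Markovian w.r.t. G iff A is independent of Nd(A) \ Bd(A) given Bd(A) for every A."""
    return all(indep({a},
                     nd(a, node_set, arcs, undirected_edges) - bd(a, arcs, undirected_edges),
                     bd(a, arcs, undirected_edges))
               for a in node_set)
```

The CKES listing below queries such an oracle with the same Bd and Nd sets when deciding which adjacencies to add or remove.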

CKES

1  G = empty graph
2  Repeat while possible
3    if there exist two nodes A and B such that B ∈ Nd(A) \ Bd(A) and A ⊥̸_p B | Bd(A) then
4      Make A and B adjacent in G
5  Repeat while possible
6    if there exist two nodes A and B such that B ∈ Bd(A) and A ⊥_p B | Bd(A) \ B then
7      Remove from G the adjacency between A and B
8    else
9      Replace G by any other CG equivalent to G



Learning con't

CKES

1  G = empty graph
2  Repeat while possible
3    if there exist two nodes A and B such that B ∈ Nd(A) \ Bd(A) and A ⊥̸_p B | Bd(A) then
4      Make A and B adjacent in G
5  Repeat while possible
6    if there exist two nodes A and B such that B ∈ Bd(A) and A ⊥_p B | Bd(A) \ B then
7      Remove from G the adjacency between A and B
8    else
9      Replace G by any other CG equivalent to G



Learning con't

• Theorem: The previous algorithm is correct if p satisfies the composition property, i.e., if I ⊥_p J|L and I ⊥_p K|L then I ⊥_p JK|L for all disjoint sets of nodes I, J, K, L.∗

• How restrictive is the composition property assumption?
  – Milder than the faithfulness assumption.
  – Every regular Gaussian distribution satisfies it.
  – It is closed under marginalization and conditioning (context-specific independencies disabled).

• Given a probability distribution p, find a CG G such that p is Markovian with respect to G and p is not Markovian with respect to any subgraph of G.
  – Usually, many CGs are solutions.
  – How to find one that has the smallest dimension or one that represents the most independencies in p?

∗ Peña. arXiv:1109.5404v1 [stat.ML].



Learning con't

• Evaluated against the LCD algorithm by Ma et al. (JMLR, 2008).

• Datasets sampled from faithful probability distributions (LCD's assumption, not ours).

• Precision: percentage of the independencies represented in the model learnt that are true.

• Recall: percentage of the true independencies that are represented in the model learnt (both metrics are sketched below).

• Run each algorithm 100 times on each dataset, and report the best model learnt in terms of precision and recall.
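These two metrics are straightforward to compute once the independencies are enumerated. A minimal sketch, assuming each independence statement is encoded as a hashable triple (I, J, K) and that "true" and "learnt" are the statements represented by the generating model and by the learnt model:

```python
def precision_recall(learnt, true):
    """Precision: fraction of the learnt model's independencies that are true.
    Recall: fraction of the true independencies represented by the learnt model."""
    learnt, true = set(learnt), set(true)
    precision = len(learnt & true) / len(learnt) if learnt else 1.0
    recall = len(learnt & true) / len(true) if true else 1.0
    return precision, recall
```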





Learning con't

• What happens if faithfulness fails but the composition property holds?

[Figure: the true CG over A, B, C, D with a hidden node H, together with the CGs learnt by CKES (inclusion optimal) and by LCD (not inclusion optimal).]

• Future work:
  – Compute the BIC score for the 100 models learnt, and see if it correlates with the best model in terms of precision and recall.
  – Comparison with the order-based learning algorithm developed by J. I. Alonso and J. M. Puerta (UCLM).

