
Towards Optimal Learning of Chain graphs under Composition

Jose M. Peña
Dag Sonntag

ADIT, Linköping University, Sweden

15-16 March 2012


Overview

• Chain graphs
• Expressiveness
• Equivalence
• Learning


Chain graphs

• A chain graph for four binary random variables.

p(ABCD) = p(A) p(B) p(CD|AB)
        = (1/Z) ψ1_A(A) ψ2_B(B) ψ3_AB(AB) ψ3_CA(CA) ψ3_BD(BD) ψ3_CD(CD)

ψ1_A(A)      A = 0   A = 1
             0.2     0.8

ψ2_B(B)      B = 0   B = 1
             0.3     0.7

ψ3_AB(AB)    B = 0   B = 1
A = 0        1       2
A = 1        4       3

ψ3_CA(CA)    A = 0   A = 1
C = 0        5       6
C = 1        8       7

ψ3_BD(BD)    D = 0   D = 1
B = 0        9       10
B = 1        11      12

ψ3_CD(CD)    D = 0   D = 1
C = 0        13.3    14.4
C = 1        15.5    16.6

• A chain graph represents a probability distribution by factorizing it according to a graph.

• The graph represents independencies holding in the distribution, which
  – makes efficient inference possible, and
  – gives insight into the distribution without numerical calculation.
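The factorization above can be evaluated numerically. The following is a minimal Python sketch (not from the slides) that uses the example potentials; the one assumption added here is that the normalization for the {C, D} component is carried out per configuration of its parents A and B, in line with the general factorization definition given on a later slide.

```python
from itertools import product

# Example potentials, copied from the tables above.
psi1_A = {0: 0.2, 1: 0.8}
psi2_B = {0: 0.3, 1: 0.7}
psi3_AB = {(0, 0): 1, (0, 1): 2, (1, 0): 4, (1, 1): 3}              # indexed (A, B)
psi3_CA = {(0, 0): 5, (0, 1): 6, (1, 0): 8, (1, 1): 7}              # indexed (C, A)
psi3_BD = {(0, 0): 9, (0, 1): 10, (1, 0): 11, (1, 1): 12}           # indexed (B, D)
psi3_CD = {(0, 0): 13.3, (0, 1): 14.4, (1, 0): 15.5, (1, 1): 16.6}  # indexed (C, D)

def p_CD_given_AB(c, d, a, b):
    """p(CD|AB): product of the potentials of the {C, D} component, normalized
    over C and D for the given parent configuration (a, b) (assumption noted above)."""
    term = lambda cc, dd: psi3_AB[(a, b)] * psi3_CA[(cc, a)] * psi3_BD[(b, dd)] * psi3_CD[(cc, dd)]
    Z = sum(term(cc, dd) for cc, dd in product((0, 1), repeat=2))
    return term(c, d) / Z

def p_ABCD(a, b, c, d):
    """The joint p(ABCD) = p(A) p(B) p(CD|AB)."""
    return psi1_A[a] * psi2_B[b] * p_CD_given_AB(c, d, a, b)

# Sanity check: a distribution that factorizes according to the CG sums to one.
print(sum(p_ABCD(a, b, c, d) for a, b, c, d in product((0, 1), repeat=4)))  # 1.0 (up to floating point)
```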



Chain graphs con't

• A chain graph (CG) is a graph (possibly) containing both undirected and directed edges and no directed pseudocycles.

• A connectivity component of a CG G is a set of nodes such that there exists an undirected path in G between every pair of nodes in the set. The set must be maximal with respect to set inclusion.

• The subgraph of G induced by a set of nodes I, G_I, is the graph over I where two nodes are connected by a (un)directed edge if that edge is in G.

• The moral graph of G, G^m, is the undirected graph where two nodes are adjacent iff they are adjacent in G or they are both in Pa(Bi) for some connectivity component Bi of G.

• C(G^m) denotes the set of complete sets in G^m that are maximal with respect to set inclusion.

• Example: [Figure: G, its connectivity components, G_{B3 ∪ Pa(B3)}, (G_{B3 ∪ Pa(B3)})^m and C((G_{B3 ∪ Pa(B3)})^m).] The components are B1 = {A}, B2 = {B}, B3 = {C, D}, and C((G_{B3 ∪ Pa(B3)})^m) = {{A, B}, {A, C}, {B, D}, {C, D}}.
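The constructions on this slide can be written down compactly. Below is a minimal Python sketch computing connectivity components, the moral graph of an induced subgraph, and its maximal complete sets. The graph encoding and the example edges A → C, B → D, C − D are assumptions of the sketch (the original graph figure is not reproduced here); they are consistent with the components and complete sets listed above.

```python
from itertools import combinations

# Example CG (an assumption; consistent with the components and complete sets above):
nodes = {"A", "B", "C", "D"}
directed = {("A", "C"), ("B", "D")}      # arcs, written (tail, head)
undirected = {frozenset(("C", "D"))}     # undirected edges

def connectivity_components(node_set, undirected_edges):
    """Maximal sets of nodes pairwise connected by undirected paths."""
    comps, seen = [], set()
    for v in node_set:
        if v in seen:
            continue
        comp, stack = {v}, [v]
        while stack:
            x = stack.pop()
            for e in undirected_edges:
                if x in e:
                    (y,) = e - {x}
                    if y not in comp:
                        comp.add(y)
                        stack.append(y)
        comps.append(comp)
        seen |= comp
    return comps

def parents(component, arcs):
    """Pa(.): tails of arcs pointing into the component from outside it."""
    return {t for (t, h) in arcs if h in component and t not in component}

def moral_graph(I, arcs, undirected_edges):
    """Moral graph of G_I: adjacencies of G_I plus an edge between every pair of
    parents of each connectivity component of G_I."""
    d = {(t, h) for (t, h) in arcs if t in I and h in I}
    u = {e for e in undirected_edges if e <= I}
    edges = {frozenset(e) for e in d} | set(u)
    for comp in connectivity_components(I, u):
        edges |= {frozenset(p) for p in combinations(sorted(parents(comp, d)), 2)}
    return edges

def maximal_complete_sets(I, edges):
    """C(.): complete node sets that are maximal with respect to set inclusion."""
    complete = [set(S) for r in range(1, len(I) + 1) for S in combinations(sorted(I), r)
                if all(frozenset(p) in edges for p in combinations(S, 2))]
    return [S for S in complete if not any(S < T for T in complete)]

B3 = {"C", "D"}                                    # the component {C, D}
I = B3 | parents(B3, directed)                     # B3 together with Pa(B3)
print(connectivity_components(nodes, undirected))  # the components {A}, {B}, {C, D} (in some order)
print(maximal_complete_sets(I, moral_graph(I, directed, undirected)))
# -> the four pairs {A, B}, {A, C}, {B, D}, {C, D}
```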



Chain graphs con't

• We say that a probability distribution p factorizes according to G if
  – p(x) = ∏_{i=1..n} p(x_Bi | x_Pa(Bi)), where
  – p(x_Bi | x_Pa(Bi)) = (1/Z_i) ∏_{C ∈ C((G_{Bi ∪ Pa(Bi)})^m)} ψi_C(x_C), where each ψi_C(x_C) is a nonnegative real function, and each Z_i is a normalization constant.

• Bayesian networks and Markov networks ⊂ CGs ⊂ factor graphs.

• A CG for four binary random variables.

p(ABCD) = p(A) p(B) p(CD|AB)
        = (1/Z) ψ1_A(A) ψ2_B(B) ψ3_AB(AB) ψ3_CA(CA) ψ3_BD(BD) ψ3_CD(CD)

ψ1_A(A)      A = 0   A = 1
             0.2     0.8

ψ2_B(B)      B = 0   B = 1
             0.3     0.7

ψ3_AB(AB)    B = 0   B = 1
A = 0        1       2
A = 1        4       3

ψ3_CA(CA)    A = 0   A = 1
C = 0        5       6
C = 1        8       7

ψ3_BD(BD)    D = 0   D = 1
B = 0        9       10
B = 1        11      12

ψ3_CD(CD)    D = 0   D = 1
C = 0        13.3    14.4
C = 1        15.5    16.6



Chain graphs con't

• The ancestors of a set of nodes I in G, An(I), are the nodes V1 such that there exists a path V1, ..., Vl with Vl ∈ I where Vi − Vi+1 or Vi → Vi+1 is in G for all i.

• Given three disjoint sets of nodes I, J and K, we say that I is separated from J given K in G, I ⊥_G J|K, when every path in (G_{An(IJK)})^m from a node in I to a node in J has some node in K (sketched below).

• We say that a probability distribution p is Markovian with respect to G when I ⊥_p J|K if I ⊥_G J|K for all I, J and K.

• We say that p is faithful to G when I ⊥_p J|K iff I ⊥_G J|K for all I, J and K.

• If p is strictly positive, then p factorizes according to G iff p is Markovian with respect to G.

• This is the classical Lauritzen-Wermuth-Frydenberg interpretation of CGs. There is an alternative interpretation due to Andersson-Madigan-Perlman.
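A minimal Python sketch of the separation test mentioned above: moralize the subgraph induced by An(I ∪ J ∪ K) and check that K blocks every path from I to J. The graph encoding and the example CG A → C, B → D, C − D are assumptions carried over from the earlier sketches, not notation from the slides.

```python
from itertools import combinations

def ancestors(I, node_set, arcs, undirected_edges):
    """An(I): nodes that reach I along edges Vi -> Vi+1 or Vi - Vi+1 (I itself included)."""
    an = set(I)
    changed = True
    while changed:
        new = {x for x in node_set - an
               if any((x, y) in arcs or frozenset((x, y)) in undirected_edges for y in an)}
        an |= new
        changed = bool(new)
    return an

def moral_graph(I, arcs, undirected_edges):
    """Moral graph of G_I (same construction as in the earlier sketch)."""
    d = {(t, h) for (t, h) in arcs if t in I and h in I}
    u = {e for e in undirected_edges if e <= I}
    edges = {frozenset(e) for e in d} | set(u)
    comps, seen = [], set()
    for v in I:                                   # connectivity components of G_I
        if v in seen:
            continue
        comp, stack = {v}, [v]
        while stack:
            x = stack.pop()
            for e in u:
                if x in e:
                    (y,) = e - {x}
                    if y not in comp:
                        comp.add(y)
                        stack.append(y)
        comps.append(comp)
        seen |= comp
    for comp in comps:                            # marry the parents of each component
        pa = {t for (t, h) in d if h in comp and t not in comp}
        edges |= {frozenset(p) for p in combinations(sorted(pa), 2)}
    return edges

def separated(I, J, K, node_set, arcs, undirected_edges):
    """I _|_G J | K: every path from I to J in (G_{An(I u J u K)})^m hits K."""
    an = ancestors(set(I) | set(J) | set(K), node_set, arcs, undirected_edges)
    edges = moral_graph(an, arcs, undirected_edges)
    reached = set(I) - set(K)
    stack = list(reached)
    while stack:                                  # search the moral graph, never entering K
        x = stack.pop()
        for e in edges:
            if x in e:
                (y,) = e - {x}
                if y not in reached and y not in K:
                    reached.add(y)
                    stack.append(y)
    return not (reached & set(J))

# Example CG (assumption, as before): A -> C, B -> D, C - D.
nodes = {"A", "B", "C", "D"}
directed = {("A", "C"), ("B", "D")}
undirected = {frozenset(("C", "D"))}
print(separated({"A"}, {"B"}, set(), nodes, directed, undirected))   # True:  A _|_G B
print(separated({"A"}, {"B"}, {"C"}, nodes, directed, undirected))   # False: conditioning on C connects A and B
```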



Expressiveness

• Each entry of the table below contains the exact and (MCMC) approximate fractions of CG independence models that are, in this order, Markov network independence models, Bayesian network independence models, none of them, and the time required to compute them (in hours, C++ implementation run on a Pentium 2.4 GHz, 512 MB RAM and Windows 2000).∗

NODES   EXACT                          APPROXIMATE
2       1.00000, 1.00000, 0.00000      1.00000, 1.00000, 0.00000, 1.5 h
3       0.72727, 1.00000, 0.00000      0.73600, 1.00000, 0.00000, 1.9 h
4       0.32000, 0.92500, 0.06000      0.32200, 0.92700, 0.06200, 2.3 h
5       0.08890, 0.76239, 0.22007      0.08200, 0.76500, 0.22600, 2.8 h
6                                      0.02100, 0.56900, 0.42100, 3.4 h
7                                      0.00500, 0.40200, 0.59400, 4.2 h
8                                      0.00000, 0.30200, 0.69800, 5.1 h
9                                      0.00000, 0.19800, 0.80200, 6.4 h
10                                     0.00000, 0.13700, 0.86300, 8.2 h
11                                     0.00000, 0.06400, 0.93600, 12.5 h
12                                     0.00000, 0.05100, 0.94900, 12.9 h
13                                     0.00000, 0.04100, 0.95900, 19.2 h

∗ Peña. AISTATS, 2007.



Expressiveness con't

• What if the pure CG independence models do not correspond to any probability distribution? Well, then CGs are of no use for our purpose.

• Let D+(G) and N(G) denote, respectively, all the strictly positive discrete probability distributions and all the regular Gaussian distributions that factorize according to G.

• Theorem: Let G be a CG of dimension d. The parameter space for D+(G) has positive Lebesgue measure with respect to R^d. The subset of the parameter space for D+(G) that corresponds to the probability distributions in D+(G) that are not faithful to G has zero Lebesgue measure with respect to R^d.∗

• Theorem: Let G be a CG of dimension d. The parameter space for N(G) has positive Lebesgue measure with respect to R^d. The subset of the parameter space for N(G) that corresponds to the probability distributions in N(G) that are not faithful to G has zero Lebesgue measure with respect to R^d.†

∗ Peña. IJAR, 2009.
† Peña. AISTATS, 2011.



Equivalence

• Corollary: Let G and H denote two CGs. The following statements are equivalent in the framework of both strictly positive discrete probability distributions and regular Gaussian distributions:
  – G and H are factorization equivalent.
  – G and H are Markovian equivalent.
  – G and H are independence equivalent.

• A path V1, ..., Vl in G is called a complex if the subgraph of G induced by the set of nodes in the path looks like V1 → V2 − ... − V_{l-1} ← Vl.

• G and H are Markovian equivalent iff they have the same adjacencies and the same complexes.

• Thanks to the corollary above, this graphical characterization applies to the other two definitions of equivalence.

A → B → C → D = A ← B ← C ← D = A − B − C − D = A ← B − C → D
A → B → C → D ≠ A → B ← C ← D ≠ A → B → C ← D ≠ A → B − C ← D
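This characterization is easy to check by brute force on small graphs. Below is a minimal Python sketch (graph encoding as in the earlier sketches, an assumption of this note): it enumerates complexes, compares adjacencies and complexes of two CGs, and reproduces one equality and one inequality from the examples above.

```python
from itertools import permutations

def adjacencies(arcs, undirected_edges):
    return {frozenset(e) for e in arcs} | set(undirected_edges)

def complexes(node_set, arcs, undirected_edges):
    """Paths V1 -> V2 - ... - V_{l-1} <- Vl whose induced subgraph contains no other
    edges; each complex is recorded once (up to reversal). Brute force over paths."""
    adj = adjacencies(arcs, undirected_edges)
    found = set()
    for k in range(3, len(node_set) + 1):
        for path in permutations(sorted(node_set), k):
            consecutive = ((path[0], path[1]) in arcs
                           and (path[-1], path[-2]) in arcs
                           and all(frozenset((path[i], path[i + 1])) in undirected_edges
                                   for i in range(1, k - 2)))
            no_extra_edges = all(frozenset((path[i], path[j])) not in adj
                                 for i in range(k) for j in range(i + 2, k))
            if consecutive and no_extra_edges:
                found.add(min(path, path[::-1]))
    return found

def markov_equivalent(node_set, cg1, cg2):
    """Same adjacencies and same complexes (the characterization above)."""
    (d1, u1), (d2, u2) = cg1, cg2
    return (adjacencies(d1, u1) == adjacencies(d2, u2)
            and complexes(node_set, d1, u1) == complexes(node_set, d2, u2))

nodes = {"A", "B", "C", "D"}
chain    = ({("A", "B"), ("B", "C"), ("C", "D")}, set())              # A -> B -> C -> D
equiv    = ({("B", "A"), ("C", "D")}, {frozenset(("B", "C"))})        # A <- B - C -> D
nonequiv = ({("A", "B"), ("D", "C")}, {frozenset(("B", "C"))})        # A -> B - C <- D
print(markov_equivalent(nodes, chain, equiv))     # True
print(markov_equivalent(nodes, chain, nonequiv))  # False: nonequiv has the complex A -> B - C <- D
```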



Learning

• Given a probability distribution p, find a CG G such that p is Markovian with respect to G and p is not Markovian with respect to any subgraph of G.

• The boundary of a node A, Bd(A), is the set of nodes B such that B → A or B − A is in G.

• A node Vl is a descendant of a node V1 in G if there exists a path V1, ..., Vl such that Vi − Vi+1 or Vi → Vi+1 is in G for all i and Vi → Vi+1 is in G for some i.

• Nd(A) denotes the non-descendants of the node A in G.

• p is Markovian with respect to G iff A ⊥_p Nd(A) \ Bd(A) | Bd(A) for all A (sketched below).
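A minimal Python sketch of the notions above (Bd, descendants, Nd) and of the Markov check, assuming the CG encoding of the earlier sketches and a caller-supplied independence oracle indep(I, J, K) for p (for example, statistical tests on data); none of these helper names come from the slides.

```python
def bd(a, arcs, undirected_edges):
    """Bd(A): all B with B -> A or B - A in G."""
    return ({t for (t, h) in arcs if h == a}
            | {next(iter(e - {a})) for e in undirected_edges if a in e})

def descendants(a, node_set, arcs, undirected_edges):
    """Nodes reachable from A along -> and - edges using at least one -> edge.
    The search state records whether a directed edge has been used so far."""
    frontier, seen, out = [(a, False)], {(a, False)}, set()
    while frontier:
        x, used = frontier.pop()
        for (t, h) in arcs:
            if t == x and (h, True) not in seen:
                seen.add((h, True))
                frontier.append((h, True))
                out.add(h)
        for e in undirected_edges:
            if x in e:
                y = next(iter(e - {x}))
                if (y, used) not in seen:
                    seen.add((y, used))
                    frontier.append((y, used))
                    if used:
                        out.add(y)
    return out

def nd(a, node_set, arcs, undirected_edges):
    """Nd(A): every node other than A that is not a descendant of A (whether A itself
    is counted is left implicit on the slides; excluding it is an assumption here)."""
    return node_set - {a} - descendants(a, node_set, arcs, undirected_edges)

def markovian(node_set, arcs, undirected_edges, indep):
    """p is Markovian w.r.t. G iff A is independent of Nd(A) \ Bd(A) given Bd(A) for every A."""
    return all(indep({a},
                     nd(a, node_set, arcs, undirected_edges) - bd(a, arcs, undirected_edges),
                     bd(a, arcs, undirected_edges))
               for a in node_set)
```

The CKES listing below queries such an oracle with the same Bd and Nd sets when deciding which adjacencies to add or remove.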

CKES

1  G = empty graph
2  Repeat while possible
3    if there exist two nodes A and B such that B ∈ Nd(A) \ Bd(A) and A ⊥̸_p B | Bd(A) then
4      Make A and B adjacent in G
5  Repeat while possible
6    if there exist two nodes A and B such that B ∈ Bd(A) and A ⊥_p B | Bd(A) \ B then
7      Remove from G the adjacency between A and B
8    else
9      Replace G by any other CG equivalent to G



Learning con't

CKES

1  G = empty graph
2  Repeat while possible
3    if there exist two nodes A and B such that B ∈ Nd(A) \ Bd(A) and A ⊥̸_p B | Bd(A) then
4      Make A and B adjacent in G
5  Repeat while possible
6    if there exist two nodes A and B such that B ∈ Bd(A) and A ⊥_p B | Bd(A) \ B then
7      Remove from G the adjacency between A and B
8    else
9      Replace G by any other CG equivalent to G



Learning con't

• Theorem: The previous algorithm is correct if p satisfies the composition property, i.e., if I ⊥_p J|L and I ⊥_p K|L then I ⊥_p JK|L for all disjoint sets of nodes I, J, K, L.∗

• How restrictive is the composition property assumption?
  – Milder than the faithfulness assumption.
  – Every regular Gaussian distribution satisfies it.
  – It is closed under marginalization and conditioning (context-specific independencies disabled).

• Given a probability distribution p, find a CG G such that p is Markovian with respect to G and p is not Markovian with respect to any subgraph of G.
  – Usually, many CGs are solutions.
  – How to find one that has the smallest dimension or one that represents the most independencies in p?

∗ Peña. arXiv:1109.5404v1 [stat.ML].



Learning con't

• Evaluated against the LCD algorithm by Ma et al. (JMLR, 2008).

• Datasets sampled from faithful probability distributions (LCD's assumption, not ours).

• Precision: percentage of the independencies represented in the model learnt that are true.

• Recall: percentage of the true independencies that are represented in the model learnt (both metrics are sketched below).

• Run each algorithm 100 times on each dataset, and report the best model learnt in terms of precision and recall.
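These two metrics are straightforward to compute once the independencies are enumerated. A minimal sketch, assuming each independence statement is encoded as a hashable triple (I, J, K) and that "true" and "learnt" are the statements represented by the generating model and by the learnt model:

```python
def precision_recall(learnt, true):
    """Precision: fraction of the learnt model's independencies that are true.
    Recall: fraction of the true independencies represented by the learnt model."""
    learnt, true = set(learnt), set(true)
    precision = len(learnt & true) / len(learnt) if learnt else 1.0
    recall = len(learnt & true) / len(true) if true else 1.0
    return precision, recall
```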





Learning con't

• What happens if faithfulness fails but the composition property holds?

[Figure: the true CG over A, B, C, D with a hidden node H, together with the CGs learnt by CKES (inclusion optimal) and by LCD (not inclusion optimal).]

• Future work:
  – Compute the BIC score for the 100 models learnt, and see if it correlates with the best model in terms of precision and recall.
  – Comparison with the order-based learning algorithm developed by J. I. Alonso and J. M. Puerta (UCLM).

