Probabilistic Graphical Models
Lecture 4 – Learning Bayesian Networks
CS/CNS/EE 155
Andreas Krause


Announcements
Another TA: Hongchao Zhou
Please fill out the questionnaire about recitations
Homework 1 is out; due in class Wed Oct 21
Project proposals due Monday Oct 19


Representing the world using BNs
[Figure: true distribution P' over variables s_1,…,s_12, represented as a Bayes net (G,P)]
True distribution P' with conditional independencies I(P'); Bayes net (G,P) with independencies I(P)
Want to make sure that I(P) ⊆ I(P')
Need to understand the CI properties of a BN (G,P)


Factorization Theorem
[Figure: Bayes net over variables s_1,…,s_12]
I_loc(G) ⊆ I(P), i.e., G is an I-map of P (independence map)
⟺ the true distribution P can be represented exactly as a Bayesian network (G,P)


Additional conditional independencies
A BN specifies the joint distribution through a conditional parameterization that satisfies the Local Markov Property:
I_loc(G) = {(X_i ⊥ Nondescendants(X_i) | Pa(X_i))}
But we also talked about additional properties of CI: Weak Union, Intersection, Contraction, …
Which additional CIs does a particular BN specify?
All CIs that can be derived through algebraic operations; but proving CI that way is very cumbersome!
Is there an easy way to find all independences of a BN just by looking at its graph?


Examples
[Figure: example BN with nodes A, B, C, D, E, F, G, H, I, J]


Active trails
An undirected path in a BN structure G is called an active trail for observed variables O ⊆ {X_1,…,X_n} if for every consecutive triple of variables X, Y, Z on the path:
X → Y → Z and Y is unobserved (Y ∉ O), or
X ← Y ← Z and Y is unobserved (Y ∉ O), or
X ← Y → Z and Y is unobserved (Y ∉ O), or
X → Y ← Z and Y or any of Y's descendants is observed.
Any variables X_i and X_j for which ∄ active trail for observations O are called d-separated by O; we write d-sep(X_i; X_j | O).
Sets A and B are d-separated given O if d-sep(X; Y | O) for all X ∈ A, Y ∈ B. Write d-sep(A; B | O).


Soundness of d-separation
Have seen: P factorizes according to G ⇒ I_loc(G) ⊆ I(P)
Define I(G) = {(X ⊥ Y | Z) : d-sep_G(X; Y | Z)}
Theorem (soundness of d-separation): P factorizes over G ⇒ I(G) ⊆ I(P)
Hence, d-separation captures only true independences.
How about I(G) = I(P)?


Completeness of d-separation
Theorem: For "almost all" distributions P that factorize over G it holds that I(G) = I(P).
"Almost all": except for a set of distributions of measure 0, assuming only that no finite set of distributions has measure > 0.


Algorithm for d-separation
How can we check whether X ⊥ Y | Z?
Idea: check every possible path connecting X and Y and verify the conditions. But there are exponentially many paths! ☹
Linear-time algorithm: find all nodes reachable from X:
1. Mark Z and its ancestors
2. Do breadth-first search starting from X; stop if the path is blocked
Have to be careful with implementation details (see reading).
[Figure: example BN with nodes A–I]
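The two-phase idea above can be made concrete. Below is a minimal Python sketch; the careful treatment of all corner cases is in the reading. The encoding of the DAG as a dict from each node to its list of children is an assumption for illustration.

```python
from collections import deque

def d_separated(children, x, y, observed):
    """Check whether x and y are d-separated given `observed` in a DAG.

    children: dict mapping every node to a list of its children.
    Phase 1 marks observed nodes and their ancestors (needed to decide
    when a v-structure is activated). Phase 2 is a BFS from x over
    (node, direction) pairs that stops when a triple is blocked.
    Returns True iff no active trail reaches y.
    """
    parents = {n: [] for n in children}
    for n, cs in children.items():
        for c in cs:
            parents[c].append(n)

    # Phase 1: observed nodes and all their ancestors
    obs_or_ancestor = set()
    stack = list(observed)
    while stack:
        n = stack.pop()
        if n not in obs_or_ancestor:
            obs_or_ancestor.add(n)
            stack.extend(parents[n])

    # Phase 2: BFS. 'up' = we reached the node from one of its children
    # (against the arrow); 'down' = we reached it from one of its parents.
    visited = set()
    queue = deque([(x, 'up')])  # leaving x, any direction is allowed
    while queue:
        node, direction = queue.popleft()
        if (node, direction) in visited:
            continue
        visited.add((node, direction))
        if node == y:
            return False  # found an active trail
        if direction == 'up' and node not in observed:
            # chain <-<- continues to parents; common cause <- -> to children
            for p in parents[node]:
                queue.append((p, 'up'))
            for c in children[node]:
                queue.append((c, 'down'))
        elif direction == 'down':
            if node not in observed:
                # chain ->-> continues to children
                for c in children[node]:
                    queue.append((c, 'down'))
            if node in obs_or_ancestor:
                # v-structure -> node <- is active, since node or one of
                # its descendants is observed; the trail turns back up
                for p in parents[node]:
                    queue.append((p, 'up'))
    return True
```

For example, with the v-structure A → C ← B, d_separated(g, 'A', 'B', set()) is True, but observing C (or any descendant of C) activates the trail.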


Representing the world using BNs
[Figure: true distribution P' over variables s_1,…,s_12, represented as a Bayes net (G,P)]
True distribution P' with conditional independencies I(P'); Bayes net (G,P) with I(P)
Want to make sure that I(P) ⊆ I(P')
Ideally: I(P) = I(P')
Want a BN that exactly captures the independencies in P'!


Minimal I-map
A graph G is called a minimal I-map if it is an I-map, and if deleting any edge makes it no longer an I-map.


Uniqueness of minimal I-maps
Is the minimal I-map unique?
[Figure: three BNs over variables E, B, A, J, M with the same skeleton but different edge orientations]


Perfect maps
Minimal I-maps are easy to find, but can contain many unnecessary dependencies.
A BN structure G is called a P-map (perfect map) for a distribution P if I(G) = I(P).
Does every distribution P have a P-map?


I-equivalence
Two graphs G, G' are called I-equivalent if I(G) = I(G').
I-equivalence partitions graphs into equivalence classes.


Skeletons of BNs
[Figure: two I-equivalent BNs over nodes A–J with the same skeleton]
I-equivalent BNs must have the same skeleton.


Immoralities and I-equivalence
A v-structure X → Y ← Z is called immoral if there is no edge between X and Z ("unmarried parents").
Theorem: I(G) = I(G') ⟺ G and G' have the same skeleton and the same immoralities.
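The theorem gives a direct test. A minimal Python sketch, assuming each DAG is encoded as a dict from every node (including leaves) to its list of children:

```python
from itertools import combinations

def skeleton(children):
    """Undirected edge set of a DAG given as node -> list of children."""
    return {frozenset((u, v)) for u, vs in children.items() for v in vs}

def immoralities(children):
    """All v-structures x -> z <- y whose parents x, y are non-adjacent."""
    skel = skeleton(children)
    parents = {n: set() for n in children}
    for u, vs in children.items():
        for v in vs:
            parents[v].add(u)
    return {(x, z, y)
            for z, ps in parents.items()
            for x, y in combinations(sorted(ps), 2)
            if frozenset((x, y)) not in skel}

def i_equivalent(g1, g2):
    """I(G) = I(G') iff same skeleton and same immoralities (theorem above)."""
    return skeleton(g1) == skeleton(g2) and immoralities(g1) == immoralities(g2)
```

For instance, X → Y and X ← Y pass this test (same skeleton, no immoralities), matching the earlier example of I-equivalent graphs.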


Today: Learning BNs from data
Want a P-map if one exists. Need to find:
the skeleton
the immoralities


Identifying the skeleton
When is there an edge between X and Y?
When is there no edge between X and Y?


Algorithm for identifying the skeleton
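A brute-force sketch of such an algorithm in Python, assuming access to a conditional-independence oracle indep(x, y, cond) for the true distribution: keep the edge X – Y exactly when no conditioning set separates X and Y, and record a witness separating set for each removed edge (used on the next slide). The PC algorithm makes this practical by bounding the conditioning-set size.

```python
from itertools import combinations

def learn_skeleton(variables, indep):
    """Recover the P-map skeleton from a CI oracle.

    indep(x, y, cond) should return True iff X ⊥ Y | cond holds in P'.
    Returns the undirected edges and, for each non-edge, one witness
    set that separates the pair (needed to identify immoralities).
    """
    edges, sepset = set(), {}
    for x, y in combinations(variables, 2):
        others = [v for v in variables if v not in (x, y)]
        # Keep the edge iff X, Y are dependent given *every* conditioning set
        for k in range(len(others) + 1):
            witness = next((set(c) for c in combinations(others, k)
                            if indep(x, y, set(c))), None)
            if witness is not None:
                sepset[frozenset((x, y))] = witness
                break
        else:
            edges.add(frozenset((x, y)))
    return edges, sepset
```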


Identifying immoralities
When is X – Z – Y an immorality?
It is immoral iff for all U with Z ∈ U: ¬(X ⊥ Y | U).
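Continuing the sketch above: with the skeleton and the recorded separating sets, this condition can be checked directly, because Z participates in an immorality X – Z – Y exactly when Z is missing from the set that was found to separate the non-adjacent pair X, Y.

```python
from itertools import combinations

def find_immoralities(variables, edges, sepset):
    """Find all immoralities x - z - y given the learned skeleton.

    x, y must be non-adjacent with a common neighbor z; the triple is
    immoral iff z is not in the witness set that separated x and y.
    """
    imm = set()
    for x, y in combinations(sorted(variables), 2):
        if frozenset((x, y)) in edges:
            continue  # adjacent pairs cannot form an immorality
        for z in variables:
            if (z not in (x, y)
                    and frozenset((x, z)) in edges
                    and frozenset((z, y)) in edges
                    and z not in sepset[frozenset((x, y))]):
                imm.add((x, z, y))
    return imm
```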


From skeleton & immoralities to BN structures
Represent the I-equivalence class as a partially directed acyclic graph (PDAG).
How do we convert a PDAG into a BN?


Testing independence
So far, we assumed that we know I(P'), i.e., all independencies associated with the true distribution P'.
Often, we have access to P' only through sample data (e.g., sensor measurements).
Given variables X, Y, Z, we want to test whether X ⊥ Y | Z.
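One common approach (a sketch of one possible test, not necessarily the lecture's choice) is a chi-squared independence test applied within each stratum of Z, pooling the statistic. Assumes discrete data in an (m, n) integer array; the column-index interface and the threshold alpha are illustrative.

```python
import numpy as np
from scipy.stats import chi2

def ci_test(data, x, y, z_cols, alpha=0.05):
    """Test X ⊥ Y | Z from discrete samples via a stratified chi-squared test.

    data: (m, n) integer array; x, y are column indices; z_cols is a
    list of column indices for Z. Pools the chi-squared statistic and
    degrees of freedom over the observed assignments to Z.
    Returns True if independence cannot be rejected at level alpha.
    """
    stat, dof = 0.0, 0
    z_vals = {tuple(r) for r in data[:, z_cols]} if z_cols else {()}
    for z in z_vals:
        mask = np.all(data[:, z_cols] == z, axis=1) if z_cols else np.ones(len(data), bool)
        sub = data[mask]
        xs, ys = np.unique(sub[:, x]), np.unique(sub[:, y])
        if len(xs) < 2 or len(ys) < 2:
            continue  # degenerate stratum carries no evidence
        # Observed contingency table and expected counts under independence
        table = np.array([[np.sum((sub[:, x] == a) & (sub[:, y] == b))
                           for b in ys] for a in xs], dtype=float)
        expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / table.sum()
        stat += np.sum((table - expected) ** 2 / expected)
        dof += (len(xs) - 1) * (len(ys) - 1)
    p_value = 1 - chi2.cdf(stat, dof) if dof > 0 else 1.0
    return p_value > alpha
```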


Next topic: Learning BNs from data
Two main parts:
learning structure (conditional independencies)
learning parameters (CPDs)


Parameter learning
Suppose X is Bernoulli distributed (a coin flip) with unknown parameter P(X = H) = θ.
Given training data D = {x^(1),…,x^(m)} (e.g., H H T H H H T T H T H H H …), how do we estimate θ?


Maximum Likelihood Estimation
Given: data set D
Hypothesis: the data were generated i.i.d. from a Bernoulli distribution with P(X = H) = θ
Optimize for the θ that makes D most likely:
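In symbols, with m_H heads and m_T tails in D:

```latex
\hat{\theta} \;=\; \arg\max_{\theta} \, P(D \mid \theta)
            \;=\; \arg\max_{\theta} \; \theta^{m_H}\,(1-\theta)^{m_T}
```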


Solving the optimization problem
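A sketch of the standard derivation: take logs (the argmax is unchanged since log is monotone) and set the derivative to zero.

```latex
L(\theta) = \log P(D \mid \theta) = m_H \log\theta + m_T \log(1-\theta), \qquad
\frac{dL}{d\theta} = \frac{m_H}{\theta} - \frac{m_T}{1-\theta} = 0
\;\Longrightarrow\;
\hat{\theta}_{\mathrm{MLE}} = \frac{m_H}{m_H + m_T}
```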


Learning general BNs

                   Known structure   Unknown structure
Fully observable
Missing data


Estimating CPDs
Given data D = {(x_1,y_1),…,(x_n,y_n)} of samples from X, Y, we want to estimate P(X | Y).


MLE for Bayes nets


Algorithm for BN MLE
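A minimal Python sketch of the counting procedure for fully observed discrete data; the MLE for each CPT entry is a normalized count, θ̂_{x|u} = N(x, u) / N(u). The sample encoding (dicts from variable names to values) and the `parents` map are assumptions for illustration.

```python
from collections import Counter

def mle_cpts(data, parents):
    """MLE parameters for a discrete BN with known structure.

    data: list of dicts mapping variable name -> observed value.
    parents: dict mapping variable name -> tuple of its parents' names.
    Returns theta[x][(u, v)] = estimate of P(X = v | Pa_X = u).
    Assumes fully observed data and a known structure.
    """
    theta = {}
    for x, pa in parents.items():
        joint = Counter()  # counts N(pa assignment, x value)
        marg = Counter()   # counts N(pa assignment)
        for sample in data:
            u = tuple(sample[p] for p in pa)
            joint[(u, sample[x])] += 1
            marg[u] += 1
        # theta_{x | u} = N(x, u) / N(u)
        theta[x] = {(u, v): c / marg[u] for (u, v), c in joint.items()}
    return theta
```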


Learning general BNs

                   Known structure   Unknown structure
Fully observable   Easy! ☺           ???
Missing data       Hard (EM)         Very hard (later)


Structure learning
Two main classes of approaches:
Constraint-based: search for a P-map (if one exists): identify the PDAG, then turn the PDAG into a BN (using the algorithm in the reading). Key problem: performing the independence tests.
Optimization-based: define a scoring function (e.g., likelihood of the data) and think of the structure as parameters. More common; simple cases can be solved exactly.


MLE for structure learning
For a fixed structure, we can compute the likelihood of the data:
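For a graph G with MLE parameters θ̂ and m i.i.d. samples, the log-likelihood is a sum of local terms:

```latex
\log P(D \mid G, \hat{\theta}) \;=\; \sum_{j=1}^{m} \sum_{i=1}^{n} \log \hat{\theta}_{\,x_i^{(j)} \,\mid\, \mathrm{pa}_i^{(j)}}
```

where pa_i^(j) denotes the assignment to Pa_i in the j-th sample.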


Decomposable score
Log data likelihood:
The MLE score decomposes over the families of the BN (each node together with its parents):
Score(G; D) = ∑_i FamScore(X_i | Pa_i; D)
Can exploit this for computational efficiency!
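A standard rewriting makes the decomposition concrete: in terms of the empirical mutual information Î and empirical entropy Ĥ,

```latex
\mathrm{Score}(G; D) \;=\; m \sum_{i} \hat{I}\big(X_i;\, \mathrm{Pa}_i\big) \;-\; m \sum_{i} \hat{H}(X_i)
```

The entropy term does not depend on the structure, so only the mutual information between each node and its parents matters. This explains the lemma on the next slide (adding parents can only increase Î) and the edge weights used by Chow-Liu below.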


Finding the optimal MLE structure
Log-likelihood score: want G* = argmax_G Score(G; D)
Lemma: G ⊆ G' ⇒ Score(G; D) ≤ Score(G'; D)


Finding the optimal MLE structure
The optimal solution for MLE is always the fully connected graph! ☹
Non-compact representation; overfitting!
Solutions:
priors over parameters / structures (later)
constrained optimization (e.g., bound the number of parents)


Constrained optimization of BN structures
Theorem: for any fixed bound d ≥ 2 on the number of parents, finding the optimal BN (w.r.t. the MLE score) is NP-hard.
What about d = 1? Then we want to find the optimal tree!


Finding the optimal tree BN
Scoring function
Scoring a tree


Finding the optimal tree skeleton
Can reduce to the following problem:
Given a graph G = (V,E) and nonnegative weights w_e for each edge e = (X_i, X_j)
In our case: w_e = I(X_i, X_j), the mutual information
Want to find the tree T ⊆ E that maximizes ∑_{e∈T} w_e
This is the maximum spanning tree problem! Can be solved in time O(|E| log |E|)!


Chow-Liu algorithm
For each pair X_i, X_j of variables, compute their mutual information.
Define the complete graph with the weight of edge (X_i, X_j) given by the mutual information.
Find a maximum spanning tree to obtain the skeleton.
Orient the skeleton using breadth-first search.
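A minimal end-to-end sketch in Python: empirical mutual information for each pair, a maximum spanning tree via Kruskal's algorithm, and a BFS orientation away from an arbitrary root (any root choice yields an I-equivalent tree). Assumes small discrete alphabets; the data layout is an assumption for illustration.

```python
import numpy as np
from collections import deque
from itertools import combinations

def chow_liu(data):
    """Fit the maximum-likelihood tree-structured BN (Chow-Liu).

    data: (m, n) integer array of m samples over n discrete variables.
    Returns the directed edges (parent, child) of the learned tree.
    """
    m, n = data.shape

    def mutual_info(i, j):
        # Empirical mutual information I(X_i; X_j) from counts
        xi, xj = data[:, i], data[:, j]
        mi = 0.0
        for a in np.unique(xi):
            p_a = np.mean(xi == a)
            for b in np.unique(xj):
                p_b = np.mean(xj == b)
                p_ab = np.mean((xi == a) & (xj == b))
                if p_ab > 0:
                    mi += p_ab * np.log(p_ab / (p_a * p_b))
        return mi

    # Complete graph weighted by pairwise mutual information
    weights = {(i, j): mutual_info(i, j) for i, j in combinations(range(n), 2)}

    # Maximum spanning tree (Kruskal): greedily add the heaviest edge
    # that does not close a cycle, tracked with union-find
    uf = list(range(n))
    def find(u):
        while uf[u] != u:
            uf[u] = uf[uf[u]]
            u = uf[u]
        return u

    undirected = []
    for (i, j), w in sorted(weights.items(), key=lambda kv: -kv[1]):
        ri, rj = find(i), find(j)
        if ri != rj:
            uf[ri] = rj
            undirected.append((i, j))

    # Orient edges away from an arbitrary root by BFS (the slide's last step)
    adj = {i: [] for i in range(n)}
    for i, j in undirected:
        adj[i].append(j)
        adj[j].append(i)
    directed, seen, queue = [], {0}, deque([0])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                directed.append((u, v))
                queue.append(v)
    return directed
```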


Generalizing Chow-Liu
Tree-augmented Naïve Bayes (TAN) model [Friedman '97]
If the evidence variables are correlated, Naïve Bayes models can be overconfident.
Key idea: learn the optimal tree for the conditional distribution P(X_1,…,X_n | Y).
Can do this optimally using Chow-Liu (homework! ☺).


Tasks
Subscribe to the mailing list: https://utils.its.caltech.edu/mailman/listinfo/cs155
Select recitation times.
Read Koller & Friedman, Sections 17.1–17.3, 18.1–18.2, and 18.4.1.
Form groups and think about class projects. If you have difficulty finding a group, email Pete Trautman.
