Bayesian Learning Handouts.pdf - Richard J. Povinelli

Bayesian Learning 

Dr. Richard J. Povinelli 

Many slides adapted from Tom Mitchell’s slides and 

Andrew W. Moore’s slides (http://www.cs.cmu.edu/~awm/tutorials ) 

rev 1.0, 1/20/2003 

Two Roles for Bayesian Methods 


� Provides practical learning algorithms: 

� Naive Bayes learning 

� Bayesian belief network learning 

� Combine prior knowledge (prior probabilities) with 

observed data 

� Requires prior probabilities 

� Provides useful conceptual framework 

� Provides “gold standard” for evaluating other 

learning algorithms 

� Additional insight into Occam's razor 


Discrete Random Variables 

� A is a Boolean-valued random variable if 

A denotes an event, and there is some 

degree of uncertainty as to whether A 

occurs. 

� Examples 

� A = The US president in 2023 will be male 

� A = You wake up tomorrow with a 

headache 

� A = You have Ebola 


Overview of Class 

� Bayes Theorem 

� MAP, ML hypotheses 

� MAP learners 

� Minimum description length principle 

� Bayes optimal classifier 

� Naive Bayes learner 

� Example: Learning over text data 

� Bayesian belief networks 

� Expectation Maximization algorithm 


Ways to deal with Uncertainty 

� Three-valued logic: True / False / Maybe 

� Fuzzy logic (truth values between 0 and 1) 

� Non-monotonic reasoning (especially focused 

on Penguin informatics) 

� Dempster-Shafer theory (and an extension 

known as quasi-Bayesian theory) 

• Possibilistic Logic

� Probability 

Probabilities 


� We write P(A) as “the fraction of 

possible worlds in which A is true” 

� We could at this point spend 2 hours on 

the philosophy of this. 

� But we won’t. 


Visualizing A

[Figure: the event space of all possible worlds has area 1; it is divided into the worlds in which A is true and the worlds in which A is false.]


Interpreting the axioms 

� 0

Theorems from the Axioms 

� Axioms 

� 0

Bayes Rule 

$$P(B \mid A) = \frac{P(A \wedge B)}{P(A)} = \frac{P(A \mid B)\,P(B)}{P(A)}$$

Bayes, Thomas (1763). An essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society of London, 53:370-418.


Using Bayes Rule to Gamble 

$1.00 

The “Win” envelope has 

a dollar and four 

beads in it 


The “Lose” envelope has 

three beads and no 

money 

Interesting question: before deciding, you are allowed to see one bead 

drawn from the envelope. 

Suppose it’s black: How much should you pay? 

Suppose it’s red: How much should you pay? 

Choosing Hypotheses 

$$P(h \mid D) = \frac{P(D \mid h)\,P(h)}{P(D)}$$

Generally want the most probable hypothesis given the training data

Maximum a posteriori hypothesis $h_{MAP}$

$$h_{MAP} = \arg\max_{h \in H} P(h \mid D) = \arg\max_{h \in H} \frac{P(D \mid h)\,P(h)}{P(D)} = \arg\max_{h \in H} P(D \mid h)\,P(h)$$

If we assume $P(h_i) = P(h_j)$, then we can further simplify and choose the maximum likelihood (ML) hypothesis

$$h_{ML} = \arg\max_{h_i \in H} P(D \mid h_i)$$


Using Bayes Rule to Gamble 

$1.00 

R R B B R B B 

The “Win” envelope has 

a dollar and four 

beads in it 


The “Lose” envelope 

has three beads and 

no money 

Trivial question: someone draws an envelope at random and offers to 

sell it to you. How much should you pay? 

Classroom Activity 

� Suppose it’s black: How much should 

you pay? 

� Suppose it’s red: How much should 

you pay? 

� With a partner figure out how you 

represent each of these. You have 5 

minutes. 
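One way to represent the two questions is as a direct application of Bayes rule. The sketch below assumes that the bead strip "R R B B R B B" on the slide means the Win envelope holds two red and two black beads and the Lose envelope holds one red and two black beads; that split is a reading of the drawing, not something the text states. The names `BEADS`, `PRIOR` and `posterior_win` are illustrative.

```python
from fractions import Fraction

# Assumed bead contents, read off the "R R B B R B B" strip (not stated in the text).
BEADS = {"win": {"red": 2, "black": 2}, "lose": {"red": 1, "black": 2}}
# The envelope is drawn at random, so each one has prior probability 1/2.
PRIOR = {"win": Fraction(1, 2), "lose": Fraction(1, 2)}

def posterior_win(color):
    """Return P(win | drawn bead has this color) using Bayes rule."""
    def likelihood(envelope):
        counts = BEADS[envelope]
        return Fraction(counts[color], sum(counts.values()))
    evidence = sum(likelihood(e) * PRIOR[e] for e in PRIOR)  # P(color)
    return likelihood("win") * PRIOR["win"] / evidence

for color in ("black", "red"):
    p = posterior_win(color)
    # The Win envelope holds $1.00, so the fair price in dollars equals p.
    print(color, p, round(float(p), 3))
```

Under these assumed contents, a red bead raises the value of the envelope above the 50 cent prior value and a black bead lowers it.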


Bayes Theorem 

� Does patient have cancer or not? 

� A patient takes a lab test and the result comes back positive. 

The test returns a correct positive result in only 97% of the 

cases in which the disease is actually present, and a correct 

negative result in only 98% of the cases in which the 

disease is not present. Furthermore, 0.007 of the entire 

population have this cancer. 

� Probabilities 

� P(cancer) = 

� P(¬cancer) = 

� P(+ | cancer) = 

� P(- | cancer) = 

� P(+ | ¬cancer) = 

� P(- | ¬cancer) = 
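The blanks above can be filled in mechanically from the three numbers in the problem statement, and Bayes theorem then gives the posterior for a positive test. The following is a minimal sketch using only the figures quoted on the slide (0.007, 97% and 98%); the variable names are illustrative.

```python
# Numbers taken from the problem statement above.
p_cancer = 0.007
p_pos_given_cancer = 0.97        # correct positive rate
p_neg_given_no_cancer = 0.98     # correct negative rate

# Complements fill in the remaining table entries.
p_no_cancer = 1 - p_cancer
p_neg_given_cancer = 1 - p_pos_given_cancer
p_pos_given_no_cancer = 1 - p_neg_given_no_cancer

# Unnormalized posteriors P(+ | h) P(h) for the two hypotheses.
score_cancer = p_pos_given_cancer * p_cancer
score_no_cancer = p_pos_given_no_cancer * p_no_cancer

# Normalizing by P(+) gives the posterior probability of cancer.
p_cancer_given_pos = score_cancer / (score_cancer + score_no_cancer)
h_map = "cancer" if score_cancer > score_no_cancer else "no cancer"

print(round(p_cancer_given_pos, 4), h_map)
```

Even after a positive result the MAP hypothesis is "no cancer", because the small prior outweighs the accuracy of the test.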


Brute Force MAP 

Hypothesis Learner 

1. For each hypothesis h ∈ H, calculate the posterior probability

$$P(h \mid D) = \frac{P(D \mid h)\,P(h)}{P(D)}$$

2. Output the hypothesis $h_{MAP}$ with the highest posterior probability

$$h_{MAP} = \arg\max_{h \in H} P(h \mid D)$$


Relation to Concept Learning II 

� Assume fixed set of instances 

� Assume D is the set of classifications 

D = < c (x1 ),…,c (xm )> 

� Choose P (D |h) 

� P (D |h) = 1 if h consistent with D 

� P (D |h) = 0 otherwise 

� Choose P (h) to have a uniform distribution 

� P (h) = 1/|H| for all h in H 

• Then,

$$P(h \mid D) = \begin{cases} \dfrac{1}{|VS_{H,D}|} & \text{if } h \text{ is consistent with } D \\ 0 & \text{otherwise} \end{cases}$$
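The brute force MAP learner with these choices of P(D|h) and P(h) can be sketched in a few lines. The hypothesis space (threshold rules on integers) and the training data below are hypothetical, chosen only to make the version space easy to see.

```python
from fractions import Fraction

# Hypothetical hypothesis space: h_t(x) is True when x >= t, for t = 0..5.
H = {t: (lambda x, t=t: x >= t) for t in range(6)}
# Hypothetical noise-free training data as (x, c(x)) pairs.
D = [(1, False), (4, True)]

def likelihood(h, data):
    """P(D|h): 1 if h agrees with every training classification, else 0."""
    return 1 if all(h(x) == c for x, c in data) else 0

prior = Fraction(1, len(H))                                  # P(h) = 1/|H|
p_data = sum(likelihood(h, D) * prior for h in H.values())   # P(D) = |VS|/|H|
posterior = {t: likelihood(h, D) * prior / p_data for t, h in H.items()}

version_space = [t for t, p in posterior.items() if p > 0]
print(version_space, posterior)
```

Every consistent hypothesis receives the same posterior, 1/|VS|, so any member of the version space is a MAP hypothesis under these assumptions.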


Characterizing Learning Algorithms 

by Equivalent MAP Learners 


Relation to Concept Learning I 

� Consider our usual concept learning task 

� instance space X, hypothesis space H, training 

examples D 

� consider the FindS learning algorithm (outputs 

most specific hypothesis from the version space 

VS H,D ) 

� What would Bayes rule produce as the MAP 

hypothesis? 

� Does FindS output a MAP hypothesis?? 


Evolution of Posterior Probabilities 


Learning A Real Valued Function 

• Consider any real-valued target function f

• Training examples ⟨x_i, d_i⟩, where d_i is a noisy training value

• d_i = f(x_i) + e_i

• e_i is an independent random variable N(0,σ)

• Then the maximum likelihood hypothesis h_ML is the one that minimizes the sum of squared errors

$$h_{ML} = \arg\min_{h \in H} \sum_{i=1}^{m} \left(d_i - h(x_i)\right)^2$$

Learning A Real Valued Function 

$$h_{ML} = \arg\max_{h \in H} p(D \mid h) = \arg\max_{h \in H} \prod_{i=1}^{m} p(d_i \mid h) = \arg\max_{h \in H} \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2}\left(\frac{d_i - h(x_i)}{\sigma}\right)^2}$$

Learning to Predict Probabilities 

� Consider predicting survival probability from patient data 

� Training examples < x i , d i >, where d i is 1 or 0 

� Want to train neural network to output a probability given x i 

(not a 0 or 1) 

� In this case can show 

$$h_{ML} = \arg\max_{h \in H} \sum_{i=1}^{m} d_i \ln h(x_i) + (1 - d_i) \ln\left(1 - h(x_i)\right)$$

Weight update rule for a sigmoid unit:

$$w_{jk} \leftarrow w_{jk} + \Delta w_{jk}$$

where

$$\Delta w_{jk} = \eta \sum_{i=1}^{m} \left(d_i - h(x_i)\right) x_{ijk}$$
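The update rule can be tried out on a single sigmoid unit, in which case the unit index j is dropped and only the input index k remains. The sketch below is an illustration under stated assumptions: the training pairs, the learning rate `eta` and the number of epochs are all hypothetical, and the first input component is fixed at 1.0 to act as a bias.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, x):
    """h(x): output of one sigmoid unit with weights w."""
    return sigmoid(sum(wk * xk for wk, xk in zip(w, x)))

def log_likelihood(w, data):
    """Sum over examples of d ln h(x) + (1 - d) ln(1 - h(x))."""
    return sum(
        d * math.log(predict(w, x)) + (1 - d) * math.log(1 - predict(w, x))
        for x, d in data
    )

def train(data, eta=0.1, epochs=200):
    w = [0.0] * len(data[0][0])
    for _ in range(epochs):
        # Delta w_k = eta * sum_i (d_i - h(x_i)) * x_ik
        delta = [
            eta * sum((d - predict(w, x)) * x[k] for x, d in data)
            for k in range(len(w))
        ]
        w = [wk + dk for wk, dk in zip(w, delta)]
    return w

# Hypothetical examples: x = (bias input, feature), d is 0 or 1.
data = [((1.0, 0.0), 0), ((1.0, 1.0), 0), ((1.0, 2.0), 1),
        ((1.0, 1.5), 0), ((1.0, 3.0), 1), ((1.0, 0.5), 1)]

w = train(data)
print(w, log_likelihood(w, data))
```

Because the labels overlap along the feature axis, the unit's outputs stay strictly between 0 and 1, which is the behaviour wanted when predicting a probability rather than a hard 0 or 1.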


Example 

� H = decision trees, D = training data 

labels 

• $L_{C_1}(h)$ is # bits to describe tree h

• $L_{C_2}(D \mid h)$ is # bits to describe D given h

• Note $L_{C_2}(D \mid h) = 0$ if examples classified perfectly by h. Need only describe exceptions

• Hence $h_{MDL}$ trades off tree size for training errors


Maximize Natural Log Instead 

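$$\begin{aligned}
h_{ML} &= \arg\max_{h \in H} \ln \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2}\left(\frac{d_i - h(x_i)}{\sigma}\right)^2} \\
&= \arg\max_{h \in H} \sum_{i=1}^{m} \left[ \ln \frac{1}{\sqrt{2\pi\sigma^2}} + \ln e^{-\frac{1}{2}\left(\frac{d_i - h(x_i)}{\sigma}\right)^2} \right] \\
&= \arg\max_{h \in H} \sum_{i=1}^{m} \left[ \ln \frac{1}{\sqrt{2\pi\sigma^2}} - \frac{1}{2}\left(\frac{d_i - h(x_i)}{\sigma}\right)^2 \right] \\
&= \arg\max_{h \in H} \sum_{i=1}^{m} -\left(d_i - h(x_i)\right)^2 \\
&= \arg\min_{h \in H} \sum_{i=1}^{m} \left(d_i - h(x_i)\right)^2
\end{aligned}$$

The first term inside the sum and the factor $1/(2\sigma^2)$ do not depend on h, so dropping them does not change the maximizing hypothesis.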
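The equivalence between maximizing the Gaussian log likelihood and minimizing the sum of squared errors can be checked numerically. The sketch below assumes a tiny hypothesis space of lines h(x) = a·x and made-up training pairs; `slopes`, `data` and `sigma` are illustrative values, not from the slides.

```python
import math

# Hypothetical noisy training pairs (x_i, d_i).
data = [(0.0, 0.1), (1.0, 1.9), (2.0, 4.2), (3.0, 5.8)]
# Hypothetical hypothesis space: h(x) = a * x for a few candidate slopes a.
slopes = [0.5, 1.0, 1.5, 2.0, 2.5]
sigma = 1.0  # assumed noise standard deviation

def sse(a):
    """Sum of squared errors of hypothesis h(x) = a * x."""
    return sum((d - a * x) ** 2 for x, d in data)

def log_likelihood(a):
    """ln p(D|h) under independent Gaussian noise with std sigma."""
    return sum(
        math.log(1.0 / math.sqrt(2 * math.pi * sigma ** 2))
        - 0.5 * ((d - a * x) / sigma) ** 2
        for x, d in data
    )

h_ml = max(slopes, key=log_likelihood)   # maximum likelihood hypothesis
h_lse = min(slopes, key=sse)             # least squared error hypothesis
print(h_ml, h_lse)
```

Changing `sigma` rescales the log likelihood but leaves the selected hypothesis unchanged, which matches the step in the derivation where the constants are dropped.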


Minimum Description Length Principle 

� Occam's razor: prefer the shortest 

hypothesis 

• MDL: prefer the hypothesis h that minimizes

$$h_{MDL} = \arg\min_{h \in H} L_{C_1}(h) + L_{C_2}(D \mid h)$$

where $L_C(x)$ is the description length of x under encoding C

MAP MDL Principle I 

$$\begin{aligned}
h_{MAP} &= \arg\max_{h \in H} P(D \mid h)\,P(h) \\
&= \arg\max_{h \in H} \log_2 P(D \mid h) + \log_2 P(h) \\
&= \arg\min_{h \in H} -\log_2 P(D \mid h) - \log_2 P(h)
\end{aligned}$$

MAP MDL Principle II 

� Interesting fact from information theory: 

• The optimal (shortest expected coding length) code for an event with probability p is $-\log_2 p$ bits.

• So interpret equation on last slide:

• $-\log_2 P(h)$ is length of h under optimal code

• $-\log_2 P(D \mid h)$ is length of D given h under optimal code

� Prefer the hypothesis that minimizes 

� length(h) + length(misclassifications) 
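The correspondence between code lengths and probabilities can be illustrated with a small calculation. In this sketch the two candidate hypotheses and their probabilities are hypothetical, chosen as powers of two so the code lengths come out as whole bits.

```python
import math

def code_length_bits(p):
    """Length in bits of the optimal code for an event with probability p."""
    return -math.log2(p)

# Hypothetical candidates: name -> (P(h), P(D|h)).
candidates = {
    "small_tree": (0.25, 0.125),     # short to describe, several exceptions
    "large_tree": (0.03125, 0.5),    # long to describe, few exceptions
}

def description_length(name):
    p_h, p_d_given_h = candidates[name]
    return code_length_bits(p_h) + code_length_bits(p_d_given_h)

h_mdl = min(candidates, key=description_length)
h_map = max(candidates, key=lambda n: candidates[n][0] * candidates[n][1])

for name in candidates:
    print(name, description_length(name), "bits")
print(h_mdl, h_map)
```

With optimal codes the hypothesis of minimum total description length is the same one that maximizes P(D|h)P(h).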


Classroom Activity 

� Consider: 

� Three possible hypotheses: 

� P (h 1 |D)=0.4, P (h 2 |D)= 0.3, P (h 3 |D)= 0.3 

� Given new instance x, 

� h 1 (x)=+, h 2 (x)=-, h 3 (x)=- 

� What's most probable classification 

of x? 

� With a partner figure out how you 

represent each of these. You have 

5 minutes. 


Gibbs Classifier 

� Bayes optimal classifier provides best result, but can 

be expensive if many hypotheses. 

� Gibbs algorithm: 

1. Choose one hypothesis at random, according to P (h |D) 

2. Use this to classify new instance 

� Surprising fact: Assume target concepts are drawn at 

random from H according to priors on H. Then: 

� E [error Gibbs ] ≤ 2E [error Bayes Optimal ] 

� Suppose correct, uniform prior distribution over H, 

then 

� Pick any hypothesis from VS, with uniform probability 

� Its expected error no worse than twice Bayes optimal 


Most Probable Classification 

of New Instances 

• So far we've sought the most probable hypothesis given the data D (i.e., h_MAP)

• Given new instance x, what is its most probable classification?

• h_MAP(x) is not the most probable classification!


Bayes Optimal Classifier 

Bayes optimal classification:

$$\arg\max_{v_j \in V} \sum_{h_i \in H} P(v_j \mid h_i)\,P(h_i \mid D)$$

Example:

$$P(h_1 \mid D) = .4,\quad P(- \mid h_1) = 0,\quad P(+ \mid h_1) = 1$$

$$P(h_2 \mid D) = .3,\quad P(- \mid h_2) = 1,\quad P(+ \mid h_2) = 0$$

$$P(h_3 \mid D) = .3,\quad P(- \mid h_3) = 1,\quad P(+ \mid h_3) = 0$$

therefore

$$\sum_{h_i \in H} P(+ \mid h_i)\,P(h_i \mid D) = .4, \qquad \sum_{h_i \in H} P(- \mid h_i)\,P(h_i \mid D) = .6$$

and

$$\arg\max_{v_j \in V} \sum_{h_i \in H} P(v_j \mid h_i)\,P(h_i \mid D) = -$$
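The example can be reproduced directly from the definition. This is a minimal sketch using the three posteriors and the deterministic votes given above; the dictionary names are illustrative.

```python
# Posterior probabilities P(h_i | D) from the example.
posterior = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
# P(v | h_i): each hypothesis classifies the new instance deterministically.
vote = {
    "h1": {"+": 1.0, "-": 0.0},
    "h2": {"+": 0.0, "-": 1.0},
    "h3": {"+": 0.0, "-": 1.0},
}

def bayes_optimal(values=("+", "-")):
    """Return the classification maximizing sum_i P(v | h_i) P(h_i | D)."""
    score = {v: sum(vote[h][v] * posterior[h] for h in posterior) for v in values}
    return max(score, key=score.get), score

label, score = bayes_optimal()
h_map = max(posterior, key=posterior.get)
print(label, score, h_map)
```

The MAP hypothesis h1 predicts "+", yet the weighted vote over all hypotheses favours "-", which is why h_MAP(x) need not be the most probable classification.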


Naive Bayes Classifier I 

� Along with decision trees, neural networks, 

nearest neighbor, one of the most practical 

learning methods. 

� When to use 

� Moderate or large training set available 

� Attributes that describe instances are conditionally 

independent given classification 

� Successful applications: 

� Diagnosis 

� Classifying text documents 


Naive Bayes Classifier II 

• Assume target function f: X→V, where each instance x is described by attributes ⟨a_1, a_2, …, a_n⟩.

Most probable value of f(x) is:

$$\begin{aligned}
v_{MAP} &= \arg\max_{v_j \in V} P(v_j \mid a_1, a_2, \ldots, a_n) \\
&= \arg\max_{v_j \in V} \frac{P(a_1, a_2, \ldots, a_n \mid v_j)\,P(v_j)}{P(a_1, a_2, \ldots, a_n)} \\
&= \arg\max_{v_j \in V} P(a_1, a_2, \ldots, a_n \mid v_j)\,P(v_j)
\end{aligned}$$

Naive Bayes assumption:

$$P(a_1, a_2, \ldots, a_n \mid v_j) = \prod_i P(a_i \mid v_j)$$

which gives Naive Bayes classifier:

$$v_{NB} = \arg\max_{v_j \in V} P(v_j) \prod_i P(a_i \mid v_j)$$


Naive Bayes: Example 

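• Consider PlayTennis again, and new instance

• ⟨sun, cool, high, strong⟩

• Want to compute:

$$v_{NB} = \arg\max_{v_j \in V} P(v_j) \prod_i P(a_i \mid v_j)$$

$$P(y)\,P(\text{sun} \mid y)\,P(\text{cool} \mid y)\,P(\text{high} \mid y)\,P(\text{strong} \mid y) = 0.005$$

$$P(n)\,P(\text{sun} \mid n)\,P(\text{cool} \mid n)\,P(\text{high} \mid n)\,P(\text{strong} \mid n) = 0.021$$

$$\rightarrow v_{NB} = n$$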
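The two products can be recomputed from relative frequencies. The counts below are an assumption: they are the ones usually quoted for the 14-example PlayTennis table in Mitchell's textbook and are not given on this slide. The names `prior` and `cond` are illustrative.

```python
from fractions import Fraction as F

# Assumed relative frequencies (standard 14-example PlayTennis table, 9 yes / 5 no).
prior = {"y": F(9, 14), "n": F(5, 14)}
cond = {
    "y": {"sun": F(2, 9), "cool": F(3, 9), "high": F(3, 9), "strong": F(3, 9)},
    "n": {"sun": F(3, 5), "cool": F(1, 5), "high": F(4, 5), "strong": F(3, 5)},
}
instance = ["sun", "cool", "high", "strong"]

def nb_score(v):
    """P(v) times the product of P(a_i | v) over the instance's attribute values."""
    score = prior[v]
    for a in instance:
        score *= cond[v][a]
    return score

scores = {v: nb_score(v) for v in prior}
v_nb = max(scores, key=scores.get)
# Normalizing the two scores gives an estimated posterior for the winner.
p_n = scores["n"] / sum(scores.values())

print({v: round(float(s), 3) for v, s in scores.items()}, v_nb, round(float(p_n), 3))
```

Under these assumed counts the scores round to the 0.005 and 0.021 shown above, and the classifier outputs n.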


Naive Bayes: Subtleties II 

2. What if none of the training instances with target value v_j have attribute value a_i? Then

• $\hat{P}(a_i \mid v_j) = 0$, and ...

• $\hat{P}(v_j) \prod_i \hat{P}(a_i \mid v_j) = 0$

• Typical solution is Bayesian estimate for $\hat{P}(a_i \mid v_j)$

• $\hat{P}(a_i \mid v_j) \leftarrow \dfrac{n_c + m p}{n + m}$

• where

• n is number of training examples for which v = v_j,

• n_c is number of examples for which v = v_j and a = a_i

• p is prior estimate for $\hat{P}(a_i \mid v_j)$

• m is weight given to prior (i.e. number of “virtual” examples)
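The Bayesian estimate is a one-line function. The sketch below uses hypothetical counts to show how it avoids a zero estimate; the function name `m_estimate` is illustrative.

```python
def m_estimate(n_c, n, p, m):
    """Bayesian estimate (n_c + m * p) / (n + m) of P(a_i | v_j)."""
    return (n_c + m * p) / (n + m)

# Hypothetical case: the attribute value never occurs among 5 examples of class v_j.
raw = m_estimate(n_c=0, n=5, p=0.5, m=0)        # m = 0 gives the raw frequency, 0
smoothed = m_estimate(n_c=0, n=5, p=0.5, m=2)   # two "virtual" examples split by the prior
print(raw, smoothed)
```

A common choice, assumed here only for illustration, is a uniform prior p = 1/k for an attribute with k possible values; the LearnNaiveBayesText rule (n_k + 1) / (n + |Vocabulary|) corresponds to p = 1/|Vocabulary| and m = |Vocabulary|.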


Naive Bayes Algorithm 

NaiveBayesLearn(examples)

For each target value v_j

$\hat{P}(v_j)$ ← estimate $P(v_j)$

For each attribute value a_i of each attribute a

$\hat{P}(a_i \mid v_j)$ ← estimate $P(a_i \mid v_j)$

ClassifyNewInstance(x)

$$v_{NB} = \arg\max_{v_j \in V} \hat{P}(v_j) \prod_{a_i \in x} \hat{P}(a_i \mid v_j)$$

Naive Bayes: Subtleties I 

1. Conditional independence assumption is often violated

• $P(a_1, a_2, \ldots, a_n \mid v_j) = \prod_i P(a_i \mid v_j)$

• ...but it works surprisingly well anyway. Note don't need estimated posteriors $\hat{P}(v_j \mid x)$ to be correct; need only that

• $\arg\max_{v_j \in V} \hat{P}(v_j) \prod_i \hat{P}(a_i \mid v_j) = \arg\max_{v_j \in V} P(v_j)\,P(a_1, \ldots, a_n \mid v_j)$

• see [Domingos & Pazzani, 1996] for analysis

• Naive Bayes posteriors often unrealistically close to 1 or 0

Learning to Classify Text I 

� Why? 

� Learn which news articles are of interest 

� Learn to classify web pages by topic 

� Naive Bayes is among most effective 

algorithms 

� What attributes shall we use to 

represent text documents?? 


Learning to Classify Text II 

� Target concept Interesting? : Document 

→{+,-} 

1. Represent each document by vector of words 

� one attribute per word position in document 

2. Learning: Use training examples to estimate 

� P(+) 

� P(-) 

� P(doc|+) 

� P(doc|-) 


LearnNaiveBayesText (Examples, V) 

1. Collect all words and other tokens that occur in Examples

• Vocabulary ← all distinct words and other tokens in Examples

2. Calculate the required P(v_j) and P(w_k | v_j) probability terms

3. For each target value v_j in V do

• docs_j ← subset of Examples for which the target value is v_j

• P(v_j) ← |docs_j| / |Examples|

• Text_j ← a single document created by concatenating all members of docs_j

• n ← total number of words in Text_j (counting duplicate words multiple times)

• for each word w_k in Vocabulary

• n_k ← number of times word w_k occurs in Text_j

• P(w_k | v_j) ← (n_k + 1) / (n + |Vocabulary|)


Twenty NewsGroups 

� Given 1000 training documents from each group 


� Learn to classify new documents according to which 

newsgroup it came from 

� comp.graphics, comp.os.ms-windows.misc, 

comp.sys.ibm.pc.hardware, comp.sys.mac.hardware, 

comp.windows.x 

� misc.forsale, rec.autos, rec.motorcycles, rec.sport.baseball, 

rec.sport.hockey 

� alt.atheism, soc.religion.christian, talk.religion.misc, 

talk.politics.mideast, talk.politics.misc, talk.politics.guns 

� sci.space, sci.crypt, sci.electronics, sci.med 

� Naive Bayes: 89% classification accuracy 


Learning to Classify Text III 

• Naive Bayes conditional independence assumption

• $P(doc \mid v_j) = \prod_{i=1}^{length(doc)} P(a_i = w_k \mid v_j)$

• where $P(a_i = w_k \mid v_j)$ is probability that word in position i is w_k, given v_j

• one more assumption:

• $P(a_i = w_k \mid v_j) = P(a_m = w_k \mid v_j)$, for all i, m


ClassifyNaiveBayesText(Doc) 

• positions ← all word positions in Doc that contain tokens found in Vocabulary

• Return v_NB, where

$$v_{NB} = \arg\max_{v_j \in V} P(v_j) \prod_{i \in positions} P(a_i \mid v_j)$$
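LearnNaiveBayesText and ClassifyNaiveBayesText can be sketched together as follows. Two things here are assumptions rather than part of the slides: the tiny training set is made up, and the classifier sums log probabilities instead of multiplying probabilities, a common way to avoid numerical underflow on long documents that does not change the argmax.

```python
import math
from collections import Counter

def learn_naive_bayes_text(examples):
    """examples: list of (token_list, target_value) pairs."""
    vocabulary = {w for tokens, _ in examples for w in tokens}
    prior, cond = {}, {}
    for v in {label for _, label in examples}:
        docs_v = [tokens for tokens, label in examples if label == v]
        prior[v] = len(docs_v) / len(examples)                     # P(v_j)
        counts = Counter(w for tokens in docs_v for w in tokens)   # word counts in Text_j
        n = sum(counts.values())                                   # total words in Text_j
        cond[v] = {w: (counts[w] + 1) / (n + len(vocabulary)) for w in vocabulary}
    return vocabulary, prior, cond

def classify_naive_bayes_text(doc, model):
    vocabulary, prior, cond = model
    # Keep only the positions whose token is in the vocabulary.
    positions = [w for w in doc if w in vocabulary]
    def log_score(v):
        return math.log(prior[v]) + sum(math.log(cond[v][w]) for w in positions)
    return max(prior, key=log_score)

# Hypothetical training documents for the target concept Interesting?
examples = [
    ("great goal by the goalie".split(), "+"),
    ("the kings won the game".split(), "+"),
    ("buy cheap pills now".split(), "-"),
    ("cheap offer buy now".split(), "-"),
]
model = learn_naive_bayes_text(examples)
print(classify_naive_bayes_text("the goalie won".split(), model))
print(classify_naive_bayes_text("buy cheap shoes".split(), model))
```

The token "shoes" does not occur in the training vocabulary, so it is skipped at classification time, as the definition of positions prescribes.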

Article from rec.sport.hockey 

� Path: cantaloupe.srv.cs.cmu.edu!dasnews.harvard.edu!ogicse!uwm.edu 

� From: xxx@yyy.zzz.edu (John Doe) 

� Subject: Re: This year's biggest and worst (opinion)... 

� Date: 5 Apr 93 09:53:39 GMT 

� I can only comment on the Kings, but the most obvious candidate for 

pleasant surprise is Alex Zhitnik. He came highly touted as a defensive 

defenseman, but he's clearly much more than that. Great skater and 

hard shot (though wish he were more accurate). In fact, he pretty 

much allowed the Kings to trade away that huge defensive liability Paul 

Coffey. Kelly Hrudey is only the biggest disappointment if you thought 

he was any good to begin with. But, at best, he's only a mediocre 

goaltender. A better choice would be Tomas Sandstrom, though not 

through any fault of his own, but because some thugs in Toronto 

decided 


Learning Curve for 20 Newsgroups 

Accuracy vs. Training set size (1/3 withheld for test) 


Conditional Independence I 


� Definition: X is conditionally independent of Y 

given Z if the probability distribution 

governing X is independent of the value of Y 

given the value of Z; that is, if 

� (∀ x i ,y j ,z k ) P (X=x i | Y=y j , Z=z k ) = P (X=x i | Z=z k ) 

� more compactly, we write 

� P(X | Y,Z) = P(X | Z) 


Bayesian Belief Network I 

� Network represents a set of conditional independence 

assertions: 

� Each node is asserted to be conditionally independent of its 

nondescendants, given its immediate predecessors. 

� Directed acyclic graph 


Bayesian Belief Networks 

� Interesting because 

� Naive Bayes assumption of conditional 

independence too restrictive 

� But it's intractable without some such assumptions 

� Bayesian belief networks describe conditional 

independence among subsets of variables 

� Allows combining prior knowledge about 

(in)dependencies among variables with observed 

training data 

� (Also called Bayes nets) 


Conditional Independence II 

� Example: Thunder is conditionally 

independent of Rain, given Lightning 

� P (Thunder | Rain, Lightning) = 

P (Thunder | Lightning) 

� Naive Bayes uses conditional 

independence to justify 

� P (X,Y |Z) = P (X |Y,Z ) P (Y |Z ) 

� P (X,Y |Z) = P (X |Z ) P (Y |Z ) 


Bayesian Belief Network II 

� Represents joint probability distribution over all variables 

� e.g., P (Storm, BusTourGroup,…, ForestFire) 

• in general, $P(y_1, \ldots, y_n) = \prod_{i=1}^{n} P(y_i \mid Parents(Y_i))$, where $Parents(Y_i)$ denotes immediate predecessors of $Y_i$ in graph

� so, joint distribution is fully defined by graph, plus the 

P (y i | Parents(Y i )) 
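The factorization can be illustrated on a very small network. The sketch below assumes a hypothetical three-node chain Storm → Lightning → Thunder with made-up conditional probability table entries; neither the structure nor the numbers come from the slides.

```python
from itertools import product

# Hypothetical graph: each variable maps to the tuple of its parents.
parents = {"Storm": (), "Lightning": ("Storm",), "Thunder": ("Lightning",)}
# Made-up tables: cpt[var][parent values] = P(var = True | parents).
cpt = {
    "Storm": {(): 0.2},
    "Lightning": {(True,): 0.7, (False,): 0.05},
    "Thunder": {(True,): 0.9, (False,): 0.01},
}

def joint(assignment):
    """P(y_1, ..., y_n) as the product of P(y_i | Parents(Y_i))."""
    p = 1.0
    for var, pa in parents.items():
        p_true = cpt[var][tuple(assignment[q] for q in pa)]
        p *= p_true if assignment[var] else 1.0 - p_true
    return p

all_true = joint({"Storm": True, "Lightning": True, "Thunder": True})
# Summing the joint over every assignment should give 1.
total = sum(
    joint(dict(zip(parents, values)))
    for values in product([True, False], repeat=len(parents))
)
print(all_true, total)
```

Three small tables with five numbers in total define all eight joint probabilities, which is the saving the graph structure buys.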


Inference in Bayesian Networks 

� How can one infer the (probabilities of) values of one 

or more network variables, given observed values of 

others? 

� Bayes net contains all information needed for this inference 

� If only one variable with unknown value, easy to infer it 

� In general case, problem is NP hard 

� In practice, can succeed in many cases 

� Exact inference methods work well for some network 

structures 

� Monte Carlo methods “simulate” the network randomly to 

calculate approximate solutions 


Learning Bayes Nets 


� Suppose structure known, variables partially 

observable 

� e.g., observe ForestFire, Storm, 

BusTourGroup, Thunder, but not Lightning, 

Campfire ... 

� Similar to training neural network with hidden 

units 

� In fact, can learn network conditional probability 

tables using gradient ascent! 

� Converge to network h that (locally) maximizes P 

(D |h) 


More on Learning Bayes Nets 

� EM algorithm can also be used. Repeatedly: 

1. Calculate probabilities of unobserved variables, 

assuming h 

2. Calculate new w ijk to maximize E [lnP (D|h)] where 

D now includes both observed and (calculated 

probabilities of) unobserved variables 

� When structure unknown... 

• Algorithms use greedy search to add/subtract

edges and nodes 

� Active research topic 


Learning of Bayesian Networks 

� Several variants of this learning task 

� Network structure might be known or 

unknown 

� Training examples might provide values of 

all network variables, or just some 

� If structure known and observe all 

variables 

• Then it's as easy as training a Naive Bayes

classifier 


Gradient Ascent for Bayes Nets 

� Let wijk denote one entry in the conditional 

probability table for variable Yi in the network 

� wijk = P (Yi =yij | Parents(Yi ) = the list uik of values) 

� e.g., if Yi = Campfire, then uik might be 

 

� Perform gradient ascent by repeatedly 

1. update all $w_{ijk}$ using training data D

$$w_{ijk} \leftarrow w_{ijk} + \eta \sum_{d \in D} \frac{P_h(y_{ij}, u_{ik} \mid d)}{w_{ijk}}$$

2. then, renormalize the $w_{ijk}$ to assure

• $\sum_j w_{ijk} = 1$

• $0 \le w_{ijk} \le 1$


Summary: Bayesian Belief Networks 


� Combine prior knowledge with observed data 

� Impact of prior knowledge (when correct!) is 

to lower the sample complexity 

� Active research area 

� Extend from boolean to real-valued variables 

� Parameterized distributions instead of tables 

� Extend to first-order instead of propositional 

systems 

� More effective inference methods 


Expectation Maximization (EM) 

� When to use: 

� Data is only partially observable 

� Unsupervised clustering (target value 

unobservable) 

� Supervised learning (some instance attributes 

unobservable) 

� Some uses: 

� Train Bayesian Belief Networks 

� Unsupervised clustering (AUTOCLASS) 

� Learning Hidden Markov Models 


EM for Estimating k Means I 

� Given: 

� Instances from X generated by mixture of k Gaussian 

distributions 

• Unknown means ⟨μ_1, …, μ_k⟩ of the k Gaussians

• Don't know which instance x_i was generated by which Gaussian

• Determine:

• Maximum likelihood estimates of ⟨μ_1, …, μ_k⟩

� Think of full description of each instance as 

y i = < x i , z i1 , z i2 >, where 

� z ij is 1 if x i generated by j th Gaussian 

� x i observable 

� z ij unobservable 


EM for Estimating k Means III 

M step:

• Calculate a new maximum likelihood hypothesis h' = ⟨μ'_1, μ'_2⟩, assuming the value taken on by each hidden variable z_ij is its expected value E[z_ij] calculated in the E step. Replace h = ⟨μ_1, μ_2⟩ by h' = ⟨μ'_1, μ'_2⟩.

$$\mu_j \leftarrow \frac{\sum_{i=1}^{m} E[z_{ij}]\, x_i}{\sum_{i=1}^{m} E[z_{ij}]}$$

Generating Data from Mixture 

of k Gaussians 

� Each instance x generated by 

� Choosing one of the k Gaussians with uniform probability 

� Generating an instance at random according to that Gaussian 


EM for Estimating k Means II 

• EM Algorithm: Pick random initial h = ⟨μ_1, μ_2⟩, then iterate

E step:

• Calculate the expected value E[z_ij] of each hidden variable z_ij, assuming the current hypothesis h = ⟨μ_1, μ_2⟩ holds

$$E[z_{ij}] = \frac{p(x = x_i \mid \mu = \mu_j)}{\sum_{n=1}^{2} p(x = x_i \mid \mu = \mu_n)} = \frac{e^{-\frac{1}{2\sigma^2}(x_i - \mu_j)^2}}{\sum_{n=1}^{2} e^{-\frac{1}{2\sigma^2}(x_i - \mu_n)^2}}$$
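The E step and M step for two Gaussians with a shared, known σ can be sketched as below. The data points, the value of `sigma`, the iteration count and the option to pass an explicit starting point through `init` are all assumptions made for illustration; with `init=None` the sketch picks two random data points as the initial means.

```python
import math
import random

def em_two_means(xs, sigma=1.0, iterations=50, init=None, seed=0):
    """Estimate the means of a mixture of two Gaussians with known sigma."""
    mu = list(init) if init is not None else random.Random(seed).sample(xs, 2)
    for _ in range(iterations):
        # E step: E[z_ij] is the normalized Gaussian weight of x_i under mean mu_j.
        expected = []
        for x in xs:
            w = [math.exp(-((x - m) ** 2) / (2 * sigma ** 2)) for m in mu]
            total = sum(w)
            expected.append([wj / total for wj in w])
        # M step: each mean becomes the E[z_ij]-weighted average of the data.
        mu = [
            sum(e[j] * x for e, x in zip(expected, xs)) / sum(e[j] for e in expected)
            for j in range(2)
        ]
    return sorted(mu)

# Hypothetical one-dimensional data with two visible clusters.
xs = [-0.5, 0.0, 0.3, 0.2, 4.6, 5.0, 5.1, 5.3]
means = em_two_means(xs, init=(-0.5, 5.3))
print([round(m, 3) for m in means])
```

This is a sketch rather than a robust implementation: if a point lies very far from both current means, both weights can underflow to zero, so a practical version would work with log weights.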

EM Algorithm 


� Converges to local maximum likelihood 

h and provides estimates of hidden 

variables z ij 

� In fact, local maximum in E [lnP (Y |h)] 

� Y is complete (observable plus 

unobservable variables) data 

• Expected value is taken over possible values of unobserved variables in Y


General EM Problem 

� Given: 

� Observed data X={x 1 ,…, x m } 

� Unobserved data Z={z 1 ,…, z m } 

� Parameterized probability distribution P (Y |h), where 

� Y={y 1 ,…, y m } is the full data y i = x i ∪ z i 

� h are the parameters 

� Determine: 

� h that (locally) maximizes E [lnP (Y |h)] 

� Many uses: 

� Train Bayesian belief networks 

� Unsupervised clustering (e.g., k means) 

� Hidden Markov Models 


Summary Points 

� Bayes Theorem 

� MAP, ML hypotheses 

� MAP learners 

� Minimum description length principle 

� Bayes optimal classifier 

� Naive Bayes learner 

� Example: Learning over text data 

� Bayesian belief networks 

� Expectation Maximization algorithm 


General EM Method 

• Define likelihood function Q(h' | h), which calculates Y = X ∪ Z using observed X and current parameters h to estimate Z

• Q(h' | h) ← E[ln P(Y | h') | h, X]

EM Algorithm:

Estimation (E) step: Calculate Q(h' | h) using the current hypothesis h and the observed data X to estimate the probability distribution over Y.

• Q(h' | h) ← E[ln P(Y | h') | h, X]

Maximization (M) step: Replace hypothesis h by the hypothesis h' that maximizes this Q function.

• h ← argmax_{h'} Q(h' | h)

