02.12.2012 Views

Bayesian Learning Handouts.pdf - Richard J. Povinelli


Bayesian Learning

Dr. Richard J. Povinelli

Many slides adapted from Tom Mitchell’s slides and Andrew W. Moore’s slides (http://www.cs.cmu.edu/~awm/tutorials)

rev 1.0, 1/20/2003

Two Roles for Bayesian Methods

• Provides practical learning algorithms:
  • Naive Bayes learning
  • Bayesian belief network learning
  • Combine prior knowledge (prior probabilities) with observed data
  • Requires prior probabilities
• Provides useful conceptual framework
  • Provides “gold standard” for evaluating other learning algorithms
  • Additional insight into Occam's razor

Discrete Random Variables

• A is a Boolean-valued random variable if A denotes an event, and there is some degree of uncertainty as to whether A occurs.
• Examples
  • A = The US president in 2023 will be male
  • A = You wake up tomorrow with a headache
  • A = You have Ebola

Overview of Class

• Bayes Theorem
• MAP, ML hypotheses
• MAP learners
• Minimum description length principle
• Bayes optimal classifier
• Naive Bayes learner
• Example: Learning over text data
• Bayesian belief networks
• Expectation Maximization algorithm

Ways to Deal with Uncertainty

• Three-valued logic: True / False / Maybe
• Fuzzy logic (truth values between 0 and 1)
• Non-monotonic reasoning (especially focused on Penguin informatics)
• Dempster-Shafer theory (and an extension known as quasi-Bayesian theory)
• Possibilistic logic
• Probability

Probabilities

• We write P(A) as “the fraction of possible worlds in which A is true”
• We could at this point spend 2 hours on the philosophy of this.
• But we won’t.


Visualizing A

[Figure: the event space of all possible worlds, with total area 1. One region marks the worlds in which A is true; the remainder marks the worlds in which A is false.]

Interpreting the axioms<br />

� 0


Theorems from the Axioms<br />

� Axioms<br />

� 0


Bayes Rule<br />

$$P(B \mid A) = \frac{P(A \wedge B)}{P(A)} = \frac{P(A \mid B)\,P(B)}{P(A)}$$

Bayes, Thomas (1763). An essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society of London, 53:370-418.

Using Bayes Rule to Gamble<br />

The “Win” envelope has a dollar ($1.00) and four beads in it. The “Lose” envelope has three beads and no money.

Interesting question: before deciding, you are allowed to see one bead drawn from the envelope.

• Suppose it’s black: How much should you pay?
• Suppose it’s red: How much should you pay?

Choosing Hypotheses<br />

$$P(h \mid D) = \frac{P(D \mid h)\,P(h)}{P(D)}$$

Generally we want the most probable hypothesis given the training data, the maximum a posteriori hypothesis $h_{MAP}$:

$$h_{MAP} = \arg\max_{h \in H} P(h \mid D) = \arg\max_{h \in H} \frac{P(D \mid h)\,P(h)}{P(D)} = \arg\max_{h \in H} P(D \mid h)\,P(h)$$

If we assume $P(h_i) = P(h_j)$, then we can simplify further and choose the maximum likelihood (ML) hypothesis:

$$h_{ML} = \arg\max_{h_i \in H} P(D \mid h_i)$$

Using Bayes Rule to Gamble<br />

The “Win” envelope has a dollar ($1.00) and four beads in it. The “Lose” envelope has three beads and no money.

Beads shown on the slide: R R B B R B B

Trivial question: someone draws an envelope at random and offers to sell it to you. How much should you pay?

Classroom Activity<br />

• Suppose it’s black: How much should you pay?
• Suppose it’s red: How much should you pay?
• With a partner, figure out how you would represent each of these. You have 5 minutes.
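The envelope questions can be worked with Bayes rule. The sketch below is one possible reading, not the slide's official answer: it assumes the bead row "R R B B R B B" means the Win envelope holds two red and two black beads and the Lose envelope holds one red and two black, that either envelope is equally likely, and that a fair price equals the expected payout. The function name `posterior_win` is hypothetical.

```python
from fractions import Fraction

def posterior_win(bead, win_beads="RRBB", lose_beads="RBB"):
    """P(Win | observed bead colour) by Bayes rule, assuming P(Win) = 1/2.

    The bead contents are an assumption read off the slide's bead row.
    """
    prior = Fraction(1, 2)
    like_win = Fraction(win_beads.count(bead), len(win_beads))
    like_lose = Fraction(lose_beads.count(bead), len(lose_beads))
    # P(bead) = P(bead | Win) P(Win) + P(bead | Lose) P(Lose)
    evidence = like_win * prior + like_lose * (1 - prior)
    return like_win * prior / evidence

for bead in "BR":
    p = posterior_win(bead)
    # With a $1.00 prize, the expected payout equals P(Win | bead).
    print(bead, p, "expected payout in dollars:", float(p))
```

Under these assumptions a black bead lowers the value of the envelope below the prior 50 cents and a red bead raises it.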

Bayes Theorem<br />

• Does the patient have cancer or not?
  • A patient takes a lab test and the result comes back positive. The test returns a correct positive result in only 97% of the cases in which the disease is actually present, and a correct negative result in only 98% of the cases in which the disease is not present. Furthermore, 0.007 of the entire population have this cancer.
• Probabilities
  • P(cancer) =
  • P(¬cancer) =
  • P(+ | cancer) =
  • P(- | cancer) =
  • P(+ | ¬cancer) =
  • P(- | ¬cancer) =
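One way to fill in the blanks above is to read the probabilities straight off the problem statement and apply Bayes rule. This is a minimal sketch; the variable names are illustrative only.

```python
# Values stated in the problem.
p_cancer = 0.007
p_pos_given_cancer = 0.97      # correct positive rate
p_neg_given_healthy = 0.98     # correct negative rate

# Complements.
p_healthy = 1 - p_cancer
p_neg_given_cancer = 1 - p_pos_given_cancer
p_pos_given_healthy = 1 - p_neg_given_healthy

# Unnormalized posteriors P(+ | h) P(h) for the two hypotheses.
joint_cancer = p_pos_given_cancer * p_cancer
joint_healthy = p_pos_given_healthy * p_healthy

# Normalize by P(+) to get P(cancer | +).
posterior_cancer = joint_cancer / (joint_cancer + joint_healthy)
h_map = "cancer" if joint_cancer > joint_healthy else "not cancer"

print(round(posterior_cancer, 4), h_map)
```

Even after a positive test the posterior probability of cancer stays well below one half, because the disease is rare, so the MAP hypothesis is still "not cancer".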


Brute Force MAP Hypothesis Learner

1. For each hypothesis $h \in H$, calculate the posterior probability

$$P(h \mid D) = \frac{P(D \mid h)\,P(h)}{P(D)}$$

2. Output the hypothesis $h_{MAP}$ with the highest posterior probability

$$h_{MAP} = \arg\max_{h \in H} P(h \mid D)$$
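The two steps of the brute force learner can be sketched in a few lines. The function `brute_force_map` and the coin-bias example are hypothetical illustrations, assuming a finite hypothesis space and caller-supplied prior and likelihood functions.

```python
def brute_force_map(hypotheses, prior, likelihood, data):
    """Return (h_MAP, posterior) over a finite hypothesis space."""
    # Step 1: P(D | h) P(h) for every hypothesis.
    unnormalized = {h: likelihood(data, h) * prior(h) for h in hypotheses}
    # P(D) = sum over h of P(D | h) P(h)
    evidence = sum(unnormalized.values())
    posterior = {h: u / evidence for h, u in unnormalized.items()}
    # Step 2: pick the hypothesis with the highest posterior.
    h_map = max(posterior, key=posterior.get)
    return h_map, posterior

# Hypothetical example: each hypothesis is a coin bias P(heads).
biases = [0.2, 0.5, 0.8]
flips = "HHTH"

def coin_likelihood(data, theta):
    p = 1.0
    for flip in data:
        p *= theta if flip == "H" else 1 - theta
    return p

h_map, posterior = brute_force_map(
    biases, lambda h: 1 / len(biases), coin_likelihood, flips
)
print(h_map, posterior)
```

The cost is one likelihood evaluation per hypothesis, which is why this learner is only practical for small hypothesis spaces.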

Relation to Concept <strong>Learning</strong> II<br />

• Assume a fixed set of instances $\langle x_1, \ldots, x_m \rangle$
• Assume D is the set of classifications $D = \langle c(x_1), \ldots, c(x_m) \rangle$
• Choose P(D|h)
  • P(D|h) = 1 if h is consistent with D
  • P(D|h) = 0 otherwise
• Choose P(h) to have a uniform distribution
  • P(h) = 1/|H| for all h in H
• Then,

$$P(h \mid D) = \begin{cases} \dfrac{1}{|VS_{H,D}|} & \text{if } h \text{ is consistent with } D \\ 0 & \text{otherwise} \end{cases}$$

Characterizing <strong>Learning</strong> Algorithms<br />

by Equivalent MAP Learners<br />


Relation to Concept <strong>Learning</strong> I<br />

• Consider our usual concept learning task
  • instance space X, hypothesis space H, training examples D
  • consider the FindS learning algorithm (outputs the most specific hypothesis from the version space $VS_{H,D}$)
• What would Bayes rule produce as the MAP hypothesis?
• Does FindS output a MAP hypothesis?

Evolution of Posterior Probabilities<br />

rev 1.0, 1/20/2003<br />

<strong>Learning</strong> A Real Valued Function<br />

• Consider any real-valued target function f
• Training examples $\langle x_i, d_i \rangle$, where $d_i$ is a noisy training value
  • $d_i = f(x_i) + e_i$
  • $e_i$ is an independent random variable N(0,σ)
• Then the maximum likelihood hypothesis $h_{ML}$ is the one that minimizes the sum of squared errors

$$h_{ML} = \arg\min_{h \in H} \sum_{i=1}^{m} \left(d_i - h(x_i)\right)^2$$


<strong>Learning</strong> A Real Valued Function<br />

$$h_{ML} = \arg\max_{h \in H} p(D \mid h) = \arg\max_{h \in H} \prod_{i=1}^{m} p(d_i \mid h) = \arg\max_{h \in H} \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2}\left(\frac{d_i - h(x_i)}{\sigma}\right)^2}$$

<strong>Learning</strong> to Predict Probabilities<br />

• Consider predicting survival probability from patient data
• Training examples $\langle x_i, d_i \rangle$, where $d_i$ is 1 or 0
• Want to train a neural network to output a probability given $x_i$ (not a 0 or 1)
• In this case one can show

$$h_{ML} = \arg\max_{h \in H} \sum_{i=1}^{m} d_i \ln h(x_i) + (1 - d_i) \ln\left(1 - h(x_i)\right)$$

Weight update rule for a sigmoid unit:

$$w_{jk} \leftarrow w_{jk} + \Delta w_{jk}, \quad \text{where} \quad \Delta w_{jk} = \eta \sum_{i=1}^{m} \left(d_i - h(x_i)\right) x_{ijk}$$

Example<br />

• H = decision trees, D = training data labels
• $L_{C_1}(h)$ is the number of bits to describe tree h
• $L_{C_2}(D \mid h)$ is the number of bits to describe D given h
  • Note $L_{C_2}(D \mid h) = 0$ if the examples are classified perfectly by h. Need only describe exceptions
• Hence $h_{MDL}$ trades off tree size for training errors

Maximize Natural Log Instead

Because ln is monotonically increasing, maximizing the log of the likelihood selects the same hypothesis:

$$\begin{aligned}
h_{ML} &= \arg\max_{h \in H} \ln \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2}\left(\frac{d_i - h(x_i)}{\sigma}\right)^2} \\
&= \arg\max_{h \in H} \sum_{i=1}^{m} \left[ \ln \frac{1}{\sqrt{2\pi\sigma^2}} + \ln e^{-\frac{1}{2}\left(\frac{d_i - h(x_i)}{\sigma}\right)^2} \right] \\
&= \arg\max_{h \in H} \sum_{i=1}^{m} \left[ \ln \frac{1}{\sqrt{2\pi\sigma^2}} - \frac{1}{2}\left(\frac{d_i - h(x_i)}{\sigma}\right)^2 \right] \\
&= \arg\max_{h \in H} \sum_{i=1}^{m} -\left(d_i - h(x_i)\right)^2 \\
&= \arg\min_{h \in H} \sum_{i=1}^{m} \left(d_i - h(x_i)\right)^2
\end{aligned}$$

The first term and the factor $1/(2\sigma^2)$ do not depend on h, so they can be dropped without changing the maximizer.
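The equivalence between maximum likelihood under Gaussian noise and least squares can be checked numerically. The following is a small sketch under stated assumptions: the target f(x) = 2x, the noise level, and the hypothesis space of lines through the origin h_w(x) = w·x on a grid of slopes are all made up for illustration.

```python
import math
import random

random.seed(0)
sigma = 0.5
xs = [i / 10 for i in range(20)]
# d_i = f(x_i) + e_i with f(x) = 2x and e_i drawn from N(0, sigma)
ds = [2.0 * x + random.gauss(0.0, sigma) for x in xs]

# Hypothetical finite hypothesis space: h_w(x) = w * x for w in 0.0, 0.1, ..., 4.0
slopes = [s / 10 for s in range(41)]

def log_likelihood(w):
    """ln p(D | h_w) under independent Gaussian noise."""
    return sum(
        -0.5 * math.log(2 * math.pi * sigma ** 2)
        - 0.5 * ((d - w * x) / sigma) ** 2
        for x, d in zip(xs, ds)
    )

def sum_squared_errors(w):
    return sum((d - w * x) ** 2 for x, d in zip(xs, ds))

w_ml = max(slopes, key=log_likelihood)       # maximum likelihood hypothesis
w_ls = min(slopes, key=sum_squared_errors)   # least squares hypothesis
print(w_ml, w_ls)
```

Both searches select the same slope, as the derivation predicts.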

Minimum Description Length Principle<br />

• Occam's razor: prefer the shortest hypothesis
• MDL: prefer the hypothesis h that minimizes

$$h_{MDL} = \arg\min_{h \in H} L_{C_1}(h) + L_{C_2}(D \mid h)$$

where $L_C(x)$ is the description length of x under encoding C

MAP MDL Principle I<br />

$$\begin{aligned}
h_{MAP} &= \arg\max_{h \in H} P(D \mid h)\,P(h) \\
&= \arg\max_{h \in H} \log_2 P(D \mid h) + \log_2 P(h) \\
&= \arg\min_{h \in H} -\log_2 P(D \mid h) - \log_2 P(h)
\end{aligned}$$


MAP MDL Principle II<br />

• Interesting fact from information theory:
  • The optimal (shortest expected coding length) code for an event with probability p is $-\log_2 p$ bits.
• So interpret the equation on the last slide:
  • $-\log_2 P(h)$ is the length of h under the optimal code
  • $-\log_2 P(D \mid h)$ is the length of D given h under the optimal code
• Prefer the hypothesis that minimizes
  • length(h) + length(misclassifications)
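The correspondence between MAP and MDL can be illustrated by converting probabilities to code lengths. This is a sketch only: the three candidate trees and their priors and likelihoods are hypothetical numbers chosen for the example.

```python
import math

def code_length_bits(p):
    """Length in bits of an optimal code for an event of probability p."""
    return -math.log2(p)

# Hypothetical priors P(h) and likelihoods P(D | h) for three decision trees.
candidates = {
    "small tree": {"prior": 0.5, "likelihood": 0.01},
    "medium tree": {"prior": 0.25, "likelihood": 0.1},
    "large tree": {"prior": 0.0625, "likelihood": 0.2},
}

# MAP: maximize P(D | h) P(h).
h_map = max(candidates, key=lambda h: candidates[h]["likelihood"] * candidates[h]["prior"])

# MDL: minimize length(h) + length(D | h) under the optimal codes.
def description_length(h):
    c = candidates[h]
    return code_length_bits(c["prior"]) + code_length_bits(c["likelihood"])

h_mdl = min(candidates, key=description_length)
print(h_map, h_mdl, round(description_length(h_mdl), 2))
```

Taking the negative base-2 logarithm turns the product being maximized into a sum being minimized, so both criteria pick the same hypothesis.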

Classroom Activity<br />

• Consider:
  • Three possible hypotheses:
    • $P(h_1 \mid D) = 0.4$, $P(h_2 \mid D) = 0.3$, $P(h_3 \mid D) = 0.3$
  • Given new instance x,
    • $h_1(x) = +$, $h_2(x) = -$, $h_3(x) = -$
• What is the most probable classification of x?
• With a partner, figure out how you would represent each of these. You have 5 minutes.

Gibbs Classifier<br />

• The Bayes optimal classifier provides the best result, but can be expensive if there are many hypotheses.
• Gibbs algorithm:
1. Choose one hypothesis at random, according to P(h|D)
2. Use this to classify the new instance
• Surprising fact: Assume target concepts are drawn at random from H according to the priors on H. Then:
  • $E[error_{Gibbs}] \le 2\,E[error_{BayesOptimal}]$
• Suppose a correct, uniform prior distribution over H, then
  • Pick any hypothesis from VS, with uniform probability
  • Its expected error is no worse than twice Bayes optimal

Most Probable Classification<br />

of New Instances<br />

• So far we've sought the most probable hypothesis given the data D (i.e., $h_{MAP}$)
• Given a new instance x, what is its most probable classification?
  • $h_{MAP}(x)$ is not necessarily the most probable classification!

Bayes Optimal Classifier<br />

Bayes optimal classification:

$$\arg\max_{v_j \in V} \sum_{h_i \in H} P(v_j \mid h_i)\,P(h_i \mid D)$$

Example:

$$P(h_1 \mid D) = .4, \quad P(- \mid h_1) = 0, \quad P(+ \mid h_1) = 1$$
$$P(h_2 \mid D) = .3, \quad P(- \mid h_2) = 1, \quad P(+ \mid h_2) = 0$$
$$P(h_3 \mid D) = .3, \quad P(- \mid h_3) = 1, \quad P(+ \mid h_3) = 0$$

therefore

$$\sum_{h_i \in H} P(+ \mid h_i)\,P(h_i \mid D) = .4, \qquad \sum_{h_i \in H} P(- \mid h_i)\,P(h_i \mid D) = .6$$

and

$$\arg\max_{v_j \in V} \sum_{h_i \in H} P(v_j \mid h_i)\,P(h_i \mid D) = -$$
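The three-hypothesis example can be reproduced directly. The sketch below uses the posteriors and predictions given in the example; the function name `bayes_optimal` and the dictionary layout are assumptions made for illustration.

```python
# Posteriors P(h_i | D) and deterministic predictions h_i(x) from the example.
posterior = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
prediction = {"h1": "+", "h2": "-", "h3": "-"}

def p_value_given_h(v, h):
    """P(v | h): 1 if hypothesis h predicts v, else 0."""
    return 1.0 if prediction[h] == v else 0.0

def bayes_optimal(values, posterior, p_value_given_h):
    """Weight each hypothesis's vote by its posterior and return the best value."""
    score = {
        v: sum(p_value_given_h(v, h) * posterior[h] for h in posterior)
        for v in values
    }
    return max(score, key=score.get), score

label, score = bayes_optimal(["+", "-"], posterior, p_value_given_h)
h_map = max(posterior, key=posterior.get)
print(label, score, "h_MAP predicts", prediction[h_map])
```

The MAP hypothesis h1 predicts +, while the posterior-weighted vote selects -, which is why the MAP prediction and the most probable classification can differ.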

Naive Bayes Classifier I<br />

• Along with decision trees, neural networks, and nearest neighbor, one of the most practical learning methods.
• When to use
  • Moderate or large training set available
  • Attributes that describe instances are conditionally independent given classification
• Successful applications:
  • Diagnosis
  • Classifying text documents


Naive Bayes Classifier II<br />

• Assume target function f: X→V, where each instance x is described by attributes $\langle a_1, a_2, \ldots, a_n \rangle$.

Most probable value of f(x) is:

$$\begin{aligned}
v_{MAP} &= \arg\max_{v_j \in V} P(v_j \mid a_1, a_2, \ldots, a_n) \\
&= \arg\max_{v_j \in V} \frac{P(a_1, a_2, \ldots, a_n \mid v_j)\,P(v_j)}{P(a_1, a_2, \ldots, a_n)} \\
&= \arg\max_{v_j \in V} P(a_1, a_2, \ldots, a_n \mid v_j)\,P(v_j)
\end{aligned}$$

Naive Bayes assumption:

$$P(a_1, a_2, \ldots, a_n \mid v_j) = \prod_i P(a_i \mid v_j)$$

which gives the Naive Bayes classifier:

$$v_{NB} = \arg\max_{v_j \in V} P(v_j) \prod_i P(a_i \mid v_j)$$

Naive Bayes: Example<br />

• Consider PlayTennis again, and the new instance
  • $\langle sun, cool, high, strong \rangle$
• Want to compute:

$$v_{NB} = \arg\max_{v_j \in V} P(v_j) \prod_i P(a_i \mid v_j)$$

$$P(y)\,P(sun \mid y)\,P(cool \mid y)\,P(high \mid y)\,P(strong \mid y) = 0.005$$
$$P(n)\,P(sun \mid n)\,P(cool \mid n)\,P(high \mid n)\,P(strong \mid n) = 0.021$$

$$\rightarrow v_{NB} = n$$

Naive Bayes: Subtleties II<br />

2. What if none of the training instances with target value $v_j$ have attribute value $a_i$? Then
• $\hat{P}(a_i \mid v_j) = 0$, and ...
• $\hat{P}(v_j) \prod_i \hat{P}(a_i \mid v_j) = 0$
• Typical solution is a Bayesian estimate for $\hat{P}(a_i \mid v_j)$

$$\hat{P}(a_i \mid v_j) \leftarrow \frac{n_c + mp}{n + m}$$

• where
  • n is the number of training examples for which $v = v_j$
  • $n_c$ is the number of examples for which $v = v_j$ and $a = a_i$
  • p is the prior estimate for $\hat{P}(a_i \mid v_j)$
  • m is the weight given to the prior (i.e., number of “virtual” examples)
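The estimate above is a one-line function. The sketch below uses made-up counts to show how it avoids a zero probability; the name `m_estimate` is chosen here for illustration.

```python
def m_estimate(n_c, n, p, m):
    """Bayesian estimate of P(a_i | v_j): (n_c + m*p) / (n + m)."""
    return (n_c + m * p) / (n + m)

# Hypothetical counts: the attribute value never occurs with this target value.
n_c, n = 0, 5
p = 1 / 3   # uniform prior over three possible attribute values (an assumption)
m = 3       # three "virtual" examples

raw = n_c / n                         # relative frequency, here exactly zero
smoothed = m_estimate(n_c, n, p, m)   # pulled toward the prior p
print(raw, smoothed)
```

With m = 0 the estimate reduces to the raw relative frequency, and as m grows it approaches the prior p.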

Naive Bayes Algorithm<br />

NaiveBayesLearn(examples)
  For each target value $v_j$
    $\hat{P}(v_j)$ ← estimate $P(v_j)$
    For each attribute value $a_i$ of each attribute a
      $\hat{P}(a_i \mid v_j)$ ← estimate $P(a_i \mid v_j)$

ClassifyNewInstance(x)

$$v_{NB} = \arg\max_{v_j \in V} \hat{P}(v_j) \prod_{a_i \in x} \hat{P}(a_i \mid v_j)$$
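A compact Python rendering of NaiveBayesLearn and ClassifyNewInstance might look as follows. It is a sketch under assumptions not fixed by the slides: attributes are categorical, conditional probabilities use the m-estimate with a uniform prior p, and the six training examples are hypothetical toy data rather than the PlayTennis table.

```python
from collections import Counter

def naive_bayes_learn(examples, m=1.0):
    """examples: list of (attribute_tuple, target_value) pairs."""
    class_counts = Counter(v for _, v in examples)
    n_attrs = len(examples[0][0])
    # Distinct values seen for each attribute position.
    values = [{x[i] for x, _ in examples} for i in range(n_attrs)]
    # counts[(i, a, v)] = number of examples with attribute i == a and target v
    counts = Counter((i, a, v) for x, v in examples for i, a in enumerate(x))
    p_v = {v: c / len(examples) for v, c in class_counts.items()}

    def p_a_given_v(i, a, v):
        p = 1.0 / len(values[i])  # uniform prior estimate (an assumption)
        return (counts[(i, a, v)] + m * p) / (class_counts[v] + m)

    return p_v, p_a_given_v

def classify_new_instance(x, p_v, p_a_given_v):
    def score(v):
        s = p_v[v]
        for i, a in enumerate(x):
            s *= p_a_given_v(i, a, v)
        return s
    return max(p_v, key=score)

# Hypothetical toy data: (outlook, wind) -> play?
examples = [
    (("sun", "strong"), "n"), (("sun", "weak"), "n"), (("rain", "strong"), "n"),
    (("overcast", "weak"), "y"), (("rain", "weak"), "y"), (("overcast", "strong"), "y"),
]
p_v, p_a_given_v = naive_bayes_learn(examples)
print(classify_new_instance(("sun", "strong"), p_v, p_a_given_v))
```

Training is a single counting pass over the data, which is one reason the method scales well.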

Naive Bayes: Subtleties I<br />

1. Conditional independence assumption is often violated
• $P(a_1, a_2, \ldots, a_n \mid v_j) = \prod_i P(a_i \mid v_j)$
• ...but it works surprisingly well anyway. Note that the estimated posteriors $\hat{P}(v_j \mid x)$ need not be correct; we need only that

$$\arg\max_{v_j \in V} \hat{P}(v_j) \prod_i \hat{P}(a_i \mid v_j) = \arg\max_{v_j \in V} P(v_j)\,P(a_1, \ldots, a_n \mid v_j)$$

• see [Domingos & Pazzani, 1996] for analysis
• Naive Bayes posteriors are often unrealistically close to 1 or 0

<strong>Learning</strong> to Classify Text I<br />

• Why?
  • Learn which news articles are of interest
  • Learn to classify web pages by topic
• Naive Bayes is among the most effective algorithms
• What attributes shall we use to represent text documents?


<strong>Learning</strong> to Classify Text II<br />

• Target concept Interesting?: Document → {+,-}
1. Represent each document by a vector of words
  • one attribute per word position in the document
2. Learning: Use training examples to estimate
  • P(+)
  • P(-)
  • P(doc|+)
  • P(doc|-)

LearnNaiveBayesText (Examples, V)<br />

1. Collect all words and other tokens that occur in Examples
  • Vocabulary ← all distinct words and other tokens in Examples
2. Calculate the required $P(v_j)$ and $P(w_k \mid v_j)$ probability terms
3. For each target value $v_j$ in V do
  • $docs_j$ ← subset of Examples for which the target value is $v_j$
  • $P(v_j)$ ← $|docs_j|$ / |Examples|
  • $Text_j$ ← a single document created by concatenating all members of $docs_j$
  • n ← total number of words in $Text_j$ (counting duplicate words multiple times)
  • for each word $w_k$ in Vocabulary
    • $n_k$ ← number of times word $w_k$ occurs in $Text_j$
    • $P(w_k \mid v_j)$ ← $(n_k + 1)$ / (n + |Vocabulary|)

Twenty NewsGroups<br />

• Given 1000 training documents from each group
• Learn to classify new documents according to which newsgroup they came from
  • comp.graphics, comp.os.ms-windows.misc, comp.sys.ibm.pc.hardware, comp.sys.mac.hardware, comp.windows.x
  • misc.forsale, rec.autos, rec.motorcycles, rec.sport.baseball, rec.sport.hockey
  • alt.atheism, soc.religion.christian, talk.religion.misc, talk.politics.mideast, talk.politics.misc, talk.politics.guns
  • sci.space, sci.crypt, sci.electronics, sci.med
• Naive Bayes: 89% classification accuracy

<strong>Learning</strong> to Classify Text III<br />

• Naive Bayes conditional independence assumption

$$P(doc \mid v_j) = \prod_{i=1}^{length(doc)} P(a_i = w_k \mid v_j)$$

• where $P(a_i = w_k \mid v_j)$ is the probability that the word in position i is $w_k$, given $v_j$
• one more assumption:
  • $P(a_i = w_k \mid v_j) = P(a_m = w_k \mid v_j)$, for all i, m

ClassifyNaiveBayesText(Doc)<br />

• positions ← all word positions in Doc that contain tokens found in Vocabulary
• Return $v_{NB}$, where

$$v_{NB} = \arg\max_{v_j \in V} P(v_j) \prod_{i \in positions} P(a_i \mid v_j)$$
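LearnNaiveBayesText and ClassifyNaiveBayesText translate almost line for line into Python. The sketch below makes a few assumptions beyond the slides: documents arrive already tokenized, every target value has at least one training document, and scores are summed in log space (which gives the same argmax as the product but avoids underflow on long documents). The four training sentences are hypothetical.

```python
import math
from collections import Counter

def learn_naive_bayes_text(examples, target_values):
    """examples: list of (list_of_tokens, target_value) pairs."""
    vocabulary = {w for doc, _ in examples for w in doc}
    p_v, p_w_given_v = {}, {}
    for v in target_values:
        docs_v = [doc for doc, label in examples if label == v]
        p_v[v] = len(docs_v) / len(examples)
        text_v = [w for doc in docs_v for w in doc]  # concatenated documents
        n = len(text_v)                              # duplicates counted
        n_k = Counter(text_v)
        p_w_given_v[v] = {
            w: (n_k[w] + 1) / (n + len(vocabulary)) for w in vocabulary
        }
    return vocabulary, p_v, p_w_given_v

def classify_naive_bayes_text(doc, vocabulary, p_v, p_w_given_v):
    # Keep only positions whose token is in the vocabulary.
    positions = [w for w in doc if w in vocabulary]

    def log_score(v):
        return math.log(p_v[v]) + sum(math.log(p_w_given_v[v][w]) for w in positions)

    return max(p_v, key=log_score)

# Hypothetical training documents.
examples = [
    ("the kings won the game".split(), "hockey"),
    ("great skater and goaltender".split(), "hockey"),
    ("the shuttle reached orbit".split(), "space"),
    ("nasa launched the rocket".split(), "space"),
]
vocabulary, p_v, p_w_given_v = learn_naive_bayes_text(examples, ["hockey", "space"])
print(classify_naive_bayes_text("the goaltender won".split(), vocabulary, p_v, p_w_given_v))
```

Tokens that never appeared in training are skipped at classification time, matching the definition of positions above.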

Article from rec.sport.hockey<br />

Path: cantaloupe.srv.cs.cmu.edu!dasnews.harvard.edu!ogicse!uwm.edu
From: xxx@yyy.zzz.edu (John Doe)
Subject: Re: This year's biggest and worst (opinion)...
Date: 5 Apr 93 09:53:39 GMT

I can only comment on the Kings, but the most obvious candidate for pleasant surprise is Alex Zhitnik. He came highly touted as a defensive defenseman, but he's clearly much more than that. Great skater and hard shot (though wish he were more accurate). In fact, he pretty much allowed the Kings to trade away that huge defensive liability Paul Coffey. Kelly Hrudey is only the biggest disappointment if you thought he was any good to begin with. But, at best, he's only a mediocre goaltender. A better choice would be Tomas Sandstrom, though not through any fault of his own, but because some thugs in Toronto decided


<strong>Learning</strong> Curve for 20 Newsgroups<br />

Accuracy vs. Training set size (1/3 withheld for test)<br />

Conditional Independence I

• Definition: X is conditionally independent of Y given Z if the probability distribution governing X is independent of the value of Y given the value of Z; that is, if
  • $(\forall x_i, y_j, z_k)\; P(X = x_i \mid Y = y_j, Z = z_k) = P(X = x_i \mid Z = z_k)$
• more compactly, we write
  • P(X | Y,Z) = P(X | Z)

<strong>Bayesian</strong> Belief Network I<br />

• Network represents a set of conditional independence assertions:
  • Each node is asserted to be conditionally independent of its nondescendants, given its immediate predecessors.
• Directed acyclic graph

<strong>Bayesian</strong> Belief Networks<br />

• Interesting because
  • Naive Bayes assumption of conditional independence is too restrictive
  • But it's intractable without some such assumptions
  • Bayesian belief networks describe conditional independence among subsets of variables
  • Allows combining prior knowledge about (in)dependencies among variables with observed training data
• (Also called Bayes nets)

Conditional Independence II<br />

• Example: Thunder is conditionally independent of Rain, given Lightning
  • P(Thunder | Rain, Lightning) = P(Thunder | Lightning)
• Naive Bayes uses conditional independence to justify

$$P(X, Y \mid Z) = P(X \mid Y, Z)\,P(Y \mid Z) = P(X \mid Z)\,P(Y \mid Z)$$

<strong>Bayesian</strong> Belief Network II<br />

• Represents the joint probability distribution over all variables
  • e.g., P(Storm, BusTourGroup, …, ForestFire)
  • in general,

$$P(y_1, \ldots, y_n) = \prod_{i=1}^{n} P(y_i \mid Parents(Y_i))$$

    where $Parents(Y_i)$ denotes the immediate predecessors of $Y_i$ in the graph
  • so, the joint distribution is fully defined by the graph, plus the $P(y_i \mid Parents(Y_i))$
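The factorization of the joint distribution can be made concrete with a tiny network. Everything in this sketch is hypothetical: the three-node chain Storm → Lightning → Thunder and all table entries are invented for illustration and are not taken from the slides' network.

```python
from itertools import product

# Hypothetical structure: each variable maps to its immediate predecessors.
parents = {"Storm": (), "Lightning": ("Storm",), "Thunder": ("Lightning",)}

# cpt[var][parent values] = P(var = True | parents); numbers are made up.
cpt = {
    "Storm": {(): 0.2},
    "Lightning": {(True,): 0.6, (False,): 0.05},
    "Thunder": {(True,): 0.9, (False,): 0.01},
}

def joint_probability(assignment):
    """P(y_1, ..., y_n) as the product of P(y_i | Parents(Y_i))."""
    p = 1.0
    for var, pars in parents.items():
        p_true = cpt[var][tuple(assignment[q] for q in pars)]
        p *= p_true if assignment[var] else 1.0 - p_true
    return p

all_true = joint_probability({"Storm": True, "Lightning": True, "Thunder": True})

# Sanity check: the joint sums to 1 over all 2^3 assignments.
total = sum(
    joint_probability(dict(zip(parents, vals)))
    for vals in product([True, False], repeat=len(parents))
)
print(all_true, total)
```

The network stores 5 numbers here instead of the 7 needed for an unconstrained joint over three Boolean variables; the saving grows quickly with more variables.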


Inference in <strong>Bayesian</strong> Networks<br />

• How can one infer the (probabilities of) values of one or more network variables, given observed values of others?
  • The Bayes net contains all information needed for this inference
  • If there is only one variable with an unknown value, it is easy to infer
  • In the general case, the problem is NP-hard
• In practice, can succeed in many cases
  • Exact inference methods work well for some network structures
  • Monte Carlo methods “simulate” the network randomly to calculate approximate solutions

<strong>Learning</strong> Bayes Nets<br />

• Suppose structure known, variables partially observable
  • e.g., observe ForestFire, Storm, BusTourGroup, Thunder, but not Lightning, Campfire ...
• Similar to training a neural network with hidden units
  • In fact, can learn network conditional probability tables using gradient ascent!
  • Converge to network h that (locally) maximizes P(D|h)

More on <strong>Learning</strong> Bayes Nets<br />

• EM algorithm can also be used. Repeatedly:
1. Calculate probabilities of unobserved variables, assuming h
2. Calculate new $w_{ijk}$ to maximize E[ln P(D|h)], where D now includes both observed and (calculated probabilities of) unobserved variables
• When structure unknown...
  • Algorithms use greedy search to add/subtract edges and nodes
  • Active research topic

<strong>Learning</strong> of <strong>Bayesian</strong> Networks<br />

• Several variants of this learning task
  • Network structure might be known or unknown
  • Training examples might provide values of all network variables, or just some
• If structure known and observe all variables
  • Then it's as easy as training a Naive Bayes classifier

Gradient Ascent for Bayes Nets<br />

� Let wijk denote one entry in the conditional<br />

probability table for variable Yi in the network<br />

� wijk = P (Yi =yij | Parents(Yi ) = the list uik of values)<br />

� e.g., if Yi = Campfire, then uik might be<br />

<br />

• Perform gradient ascent by repeatedly
1. update all $w_{ijk}$ using training data D

$$w_{ijk} \leftarrow w_{ijk} + \eta \sum_{d \in D} \frac{P_h(y_{ij}, u_{ik} \mid d)}{w_{ijk}}$$

2. then, renormalize the $w_{ijk}$ to assure
  • $\sum_j w_{ijk} = 1$
  • $0 \le w_{ijk} \le 1$

Summary: <strong>Bayesian</strong> Belief Networks<br />

• Combine prior knowledge with observed data
• Impact of prior knowledge (when correct!) is to lower the sample complexity
• Active research area
  • Extend from Boolean to real-valued variables
  • Parameterized distributions instead of tables
  • Extend to first-order instead of propositional systems
  • More effective inference methods


Expectation Maximization (EM)<br />

• When to use:
  • Data is only partially observable
  • Unsupervised clustering (target value unobservable)
  • Supervised learning (some instance attributes unobservable)
• Some uses:
  • Train Bayesian Belief Networks
  • Unsupervised clustering (AUTOCLASS)
  • Learning Hidden Markov Models

EM for Estimating k Means I<br />

• Given:
  • Instances from X generated by a mixture of k Gaussian distributions
  • Unknown means $\langle \mu_1, \ldots, \mu_k \rangle$ of the k Gaussians
  • Don't know which instance $x_i$ was generated by which Gaussian
• Determine:
  • Maximum likelihood estimates of $\langle \mu_1, \ldots, \mu_k \rangle$
• Think of the full description of each instance as $y_i = \langle x_i, z_{i1}, z_{i2} \rangle$, where
  • $z_{ij}$ is 1 if $x_i$ was generated by the j-th Gaussian
  • $x_i$ observable
  • $z_{ij}$ unobservable

EM for Estimating k Means III<br />

M step:

• Calculate a new maximum likelihood hypothesis $h' = \langle \mu'_1, \mu'_2 \rangle$, assuming the value taken on by each hidden variable $z_{ij}$ is its expected value $E[z_{ij}]$ calculated in the E step. Replace $h = \langle \mu_1, \mu_2 \rangle$ by $h' = \langle \mu'_1, \mu'_2 \rangle$.

$$\mu_j \leftarrow \frac{\sum_{i=1}^{m} E[z_{ij}]\, x_i}{\sum_{i=1}^{m} E[z_{ij}]}$$

Generating Data from Mixture<br />

of k Gaussians<br />

• Each instance x generated by
  • Choosing one of the k Gaussians with uniform probability
  • Generating an instance at random according to that Gaussian

EM for Estimating k Means II<br />

• EM Algorithm: Pick random initial $h = \langle \mu_1, \mu_2 \rangle$, then iterate

E step:

• Calculate the expected value $E[z_{ij}]$ of each hidden variable $z_{ij}$, assuming the current hypothesis $h = \langle \mu_1, \mu_2 \rangle$ holds

$$E[z_{ij}] = \frac{p(x = x_i \mid \mu = \mu_j)}{\sum_{n=1}^{2} p(x = x_i \mid \mu = \mu_n)} = \frac{e^{-\frac{1}{2\sigma^2}(x_i - \mu_j)^2}}{\sum_{n=1}^{2} e^{-\frac{1}{2\sigma^2}(x_i - \mu_n)^2}}$$

EM Algorithm<br />

• Converges to a local maximum likelihood h and provides estimates of the hidden variables $z_{ij}$
• In fact, a local maximum in E[ln P(Y|h)]
  • Y is the complete (observable plus unobservable variables) data
  • The expected value is taken over possible values of the unobserved variables in Y
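The E and M steps for estimating the means can be combined into a short loop. This sketch assumes a known, shared σ and equal mixing weights, as in the slides; the synthetic data (true means 0 and 5), the starting point, and the fixed iteration count are choices made here for illustration.

```python
import math
import random

def em_means(xs, sigma, mu_init, iterations=50):
    """Estimate the means of a Gaussian mixture with known, shared sigma."""
    mu = list(mu_init)
    for _ in range(iterations):
        # E step: E[z_ij] is proportional to exp(-(x_i - mu_j)^2 / (2 sigma^2)).
        expected_z = []
        for x in xs:
            weights = [math.exp(-((x - m) ** 2) / (2 * sigma ** 2)) for m in mu]
            total = sum(weights)
            expected_z.append([w / total for w in weights])
        # M step: each mean becomes the E[z_ij]-weighted average of the x_i.
        mu = [
            sum(e[j] * x for e, x in zip(expected_z, xs))
            / sum(e[j] for e in expected_z)
            for j in range(len(mu))
        ]
    return mu

# Synthetic data from two Gaussians with means 0 and 5 (made up for the example).
random.seed(1)
xs = [random.gauss(0.0, 1.0) for _ in range(200)]
xs += [random.gauss(5.0, 1.0) for _ in range(200)]

mu_hat = sorted(em_means(xs, sigma=1.0, mu_init=(1.0, 4.0)))
print(mu_hat)
```

A different starting point can lead to a different local maximum, so in practice EM is often restarted from several initial hypotheses.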


General EM Problem<br />

• Given:
  • Observed data $X = \{x_1, \ldots, x_m\}$
  • Unobserved data $Z = \{z_1, \ldots, z_m\}$
  • Parameterized probability distribution P(Y|h), where
    • $Y = \{y_1, \ldots, y_m\}$ is the full data, $y_i = x_i \cup z_i$
    • h are the parameters
• Determine:
  • h that (locally) maximizes E[ln P(Y|h)]
• Many uses:
  • Train Bayesian belief networks
  • Unsupervised clustering (e.g., k means)
  • Hidden Markov Models

Summary Points<br />

• Bayes Theorem
• MAP, ML hypotheses
• MAP learners
• Minimum description length principle
• Bayes optimal classifier
• Naive Bayes learner
• Example: Learning over text data
• Bayesian belief networks
• Expectation Maximization algorithm

General EM Method<br />

• Define likelihood function Q(h' | h), which calculates Y = X ∪ Z using observed X and current parameters h to estimate Z
  • Q(h' | h) ← E[ln P(Y | h') | h, X]

EM Algorithm:

Estimation (E) step: Calculate Q(h' | h) using the current hypothesis h and the observed data X to estimate the probability distribution over Y.
  • Q(h' | h) ← E[ln P(Y | h') | h, X]

Maximization (M) step: Replace hypothesis h by the hypothesis h' that maximizes this Q function.
  • h ← argmax over h' of Q(h' | h)
