[...] the whole distribution. We may also be interested in its normalizing constant P(D | H) if we wish to do model comparison. The probability distribution P(w | D, H) is often a complex distribution. In a variational approach to inference, we introduce an approximating probability distribution over the parameters, Q(w; θ), and optimize this distribution (by varying its own parameters θ) so that it approximates the posterior distribution of the parameters P(w | D, H) well.

One objective function we may choose to measure the quality of the approximation is the variational free energy

$$
\tilde{F}(\theta) = \int d^k w \; Q(w;\theta) \, \ln \frac{Q(w;\theta)}{P(D \mid w, H)\, P(w \mid H)} . \tag{33.34}
$$

The denominator P(D | w, H)P(w | H) is, within a multiplicative constant, the posterior probability P(w | D, H) = P(D | w, H)P(w | H)/P(D | H). So the variational free energy $\tilde{F}(\theta)$ can be viewed as the sum of −ln P(D | H) and the relative entropy between Q(w; θ) and P(w | D, H). $\tilde{F}(\theta)$ is bounded below by −ln P(D | H) and only attains this value for Q(w; θ) = P(w | D, H). For certain models and certain approximating distributions, this free energy, and its derivatives with respect to the approximating distribution's parameters, can be evaluated.

The approximation of posterior probability distributions using variational free energy minimization provides a useful approach to approximating Bayesian inference in a number of fields ranging from neural networks to the decoding of error-correcting codes (Hinton and van Camp, 1993; Hinton and Zemel, 1994; Dayan et al., 1995; Neal and Hinton, 1998; MacKay, 1995a). The method is sometimes called ensemble learning to contrast it with traditional learning processes in which a single parameter vector is optimized. Another name for it is variational Bayes. Let us examine how ensemble learning works in the simple case of a Gaussian distribution.

33.5 The case of an unknown Gaussian: approximating the posterior distribution of µ and σ

We will fit an approximating ensemble Q(µ, σ) to the posterior distribution that we studied in Chapter 24,

$$
P(\mu, \sigma \mid \{x_n\}_{n=1}^N) = \frac{P(\{x_n\}_{n=1}^N \mid \mu, \sigma)\, P(\mu, \sigma)}{P(\{x_n\}_{n=1}^N)} \tag{33.35}
$$

$$
= \frac{\displaystyle \frac{1}{(2\pi\sigma^2)^{N/2}} \exp\!\left( -\frac{N(\mu-\bar{x})^2 + S}{2\sigma^2} \right) \frac{1}{\sigma_\mu} \frac{1}{\sigma}}{P(\{x_n\}_{n=1}^N)} . \tag{33.36}
$$

We make the single assumption that the approximating ensemble is separable, in the form Q(µ, σ) = Q_µ(µ) Q_σ(σ). No restrictions on the functional form of Q_µ(µ) and Q_σ(σ) are made.

We write down a variational free energy,

$$
\tilde{F}(Q) = \int d\mu \, d\sigma \; Q_\mu(\mu) Q_\sigma(\sigma) \, \ln \frac{Q_\mu(\mu) Q_\sigma(\sigma)}{P(D \mid \mu, \sigma)\, P(\mu, \sigma)} . \tag{33.37}
$$

We can find the optimal separable distribution Q by considering separately the optimization of $\tilde{F}$ over Q_µ(µ) for fixed Q_σ(σ), and then the optimization of Q_σ(σ) for fixed Q_µ(µ).
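The alternating optimization described in the last sentence can be illustrated numerically without knowing the analytic forms of the optimal factors: discretize µ and σ on a grid and repeatedly set each factor proportional to the exponential of the other factor's average of ln[P(D | µ, σ)P(µ, σ)]. The sketch below is not from the book; the synthetic data, grid ranges, iteration count, and names such as log_joint, q_mu, and q_sigma are illustrative choices, and the prior is the improper 1/σ prior of equation (33.36) with the constant 1/σ_µ dropped.

```python
import numpy as np

# Synthetic data: N samples from a Gaussian (true mean and width are arbitrary).
rng = np.random.default_rng(0)
N = 10
x = rng.normal(loc=1.0, scale=2.0, size=N)
xbar = x.mean()
S = ((x - xbar) ** 2).sum()

# Grids over mu and sigma; the ranges are chosen by hand to cover the posterior mass.
mu = np.linspace(-4.0, 6.0, 400)
sigma = np.linspace(0.2, 8.0, 400)
dmu, dsigma = mu[1] - mu[0], sigma[1] - sigma[0]
MU, SIGMA = np.meshgrid(mu, sigma, indexing="ij")

# ln[ P(D | mu, sigma) P(mu, sigma) ] as in (33.36), with the improper 1/sigma prior
# and the constant 1/sigma_mu omitted.
log_joint = (-0.5 * N * np.log(2 * np.pi * SIGMA**2)
             - (N * (MU - xbar) ** 2 + S) / (2 * SIGMA**2)
             - np.log(SIGMA))

def normalize(log_q, step):
    """Convert log values on a grid to a density with q.sum() * step == 1."""
    log_q = log_q - log_q.max()          # guard against under/overflow
    q = np.exp(log_q)
    return q / (q.sum() * step)

# Start from a uniform Q_sigma; Q_mu is set on the first sweep.
q_sigma = np.full_like(sigma, 1.0 / (len(sigma) * dsigma))

# Coordinate ascent: each factor becomes exp of the other factor's average log joint.
for _ in range(50):
    q_mu = normalize((log_joint * q_sigma[None, :]).sum(axis=1) * dsigma, dmu)
    q_sigma = normalize((log_joint * q_mu[:, None]).sum(axis=0) * dmu, dsigma)

# Variational free energy F(Q) = sum Q ln[ Q / (P(D|mu,sigma)P(mu,sigma)) ] on the grid.
Q = q_mu[:, None] * q_sigma[None, :]
mask = Q > 0
F = (Q[mask] * (np.log(Q[mask]) - log_joint[mask])).sum() * dmu * dsigma

# The bound: F(Q) >= -ln P(D | H), with the evidence estimated by the same grid sum.
log_evidence = np.log(np.exp(log_joint).sum() * dmu * dsigma)
print(f"F(Q) = {F:.3f}  >=  -ln P(D|H) = {-log_evidence:.3f}")
print(f"E_Q[mu] = {(q_mu * mu).sum() * dmu:.3f}, sample mean = {xbar:.3f}")
```

Each sweep performs exactly the two steps named above: optimize $\tilde{F}$ over Q_µ(µ) for fixed Q_σ(σ), then over Q_σ(σ) for fixed Q_µ(µ), so $\tilde{F}(Q)$ cannot increase, and the printed value stays above −ln P(D | H), in line with the bound stated after (33.34).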
