Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted. http://www.cambridge.org/0521642981 You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links.

496  41 — Learning as Inference

Figure 41.3. One step of the Langevin method in two dimensions (c), contrasted with a traditional 'dumb' Metropolis method (a) and with gradient descent (b). The proposal density of the Langevin method is given by 'gradient descent with noise'. [Panels: (a) Dumb Metropolis, with a proposal density Q(x; x^{(1)}) of width ε centred on x^{(1)} and target P*(x); (b) gradient descent step −ηg; (c) Langevin.]

Implementation

How shall we compute the integral (41.18)? For our toy problem, the weight space is three dimensional; for a realistic neural network the dimensionality K might be in the thousands.

Bayesian inference for general data modelling problems may be implemented by exact methods (Chapter 25), by Monte Carlo sampling (Chapter 29), or by deterministic approximate methods, for example, methods that make Gaussian approximations to P(w | D, α) using Laplace's method (Chapter 27) or variational methods (Chapter 33). For neural networks there are few exact methods. The two main approaches to implementing Bayesian inference for neural networks are the Monte Carlo methods developed by Neal (1996) and the Gaussian approximation methods developed by MacKay (1991).

41.4 Monte Carlo implementation of a single neuron

First we will use a Monte Carlo approach in which the task of evaluating the integral (41.18) is solved by treating y(x^{(N+1)}; w) as a function f of w whose mean we compute using

  \langle f(\mathbf{w}) \rangle \simeq \frac{1}{R} \sum_r f(\mathbf{w}^{(r)})    (41.19)

where {w^{(r)}} are samples from the posterior distribution \frac{1}{Z_M} \exp(-M(\mathbf{w})) (cf. equation (29.6)). We obtain the samples using a Metropolis method (section 29.4).
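Equation (41.19) says that, given posterior samples {w^{(r)}}, the predictive mean of any function f of the weights is just the sample average of f evaluated at each sample. A minimal sketch of this estimator (the function name and the example samples are illustrative, not from the book):

```python
import numpy as np

def predictive_mean(f, samples):
    """Estimate <f(w)> by (1/R) * sum_r f(w^(r)) over posterior
    samples w^(r), as in equation (41.19)."""
    return np.mean([f(w) for w in samples], axis=0)

# Illustrative use with made-up posterior samples of a 2-d weight vector:
samples = [np.array([0.9, 1.1]), np.array([1.1, 0.9]), np.array([1.0, 1.0])]
mean_w = predictive_mean(lambda w: w, samples)  # posterior mean of w itself
```

The same `predictive_mean` call serves for any prediction: passing f(w) = y(x^{(N+1)}; w) would give the Monte Carlo estimate of the predictive output at a new input.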
As an aside, a possible disadvantage of this Monte Carlo approach is that it is a poor way of estimating the probability of an improbable event, i.e., a P(t | D, H) that is very close to zero, if the improbable event is most likely to occur in conjunction with improbable parameter values.

How to generate the samples {w^{(r)}}? Radford Neal introduced the Hamiltonian Monte Carlo method to neural networks. We met this sophisticated Metropolis method, which makes use of gradient information, in Chapter 30. The method we now demonstrate is a simple version of Hamiltonian Monte Carlo called the Langevin Monte Carlo method.

The Langevin Monte Carlo method

The Langevin method (algorithm 41.4) may be summarized as 'gradient descent with added noise', as shown pictorially in figure 41.3. A noise vector p is generated from a Gaussian with unit variance. The gradient g is computed,
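A step of this kind — unit-variance Gaussian noise combined with a gradient-based proposal, followed by a Metropolis accept/reject decision — can be sketched as follows. This is a generic Langevin step for an energy function M(w) (minus the log posterior, up to a constant), written as one leapfrog step; it is a sketch under those assumptions, not a transcription of algorithm 41.4, and the function and parameter names are illustrative:

```python
import numpy as np

def langevin_step(w, M, grad_M, epsilon, rng):
    """One Langevin step: 'gradient descent with added noise',
    corrected by a Metropolis accept/reject decision.
    M(w) is the energy; grad_M(w) its gradient; epsilon the step size."""
    p = rng.standard_normal(w.shape)          # noise from a unit-variance Gaussian
    g = grad_M(w)                             # the gradient g is computed
    # Proposal: half gradient step on p, then move w (one leapfrog step)
    p_half = p - 0.5 * epsilon * g
    w_new = w + epsilon * p_half
    p_new = p_half - 0.5 * epsilon * grad_M(w_new)
    # 'Hamiltonians' before and after the proposed move
    H_old = 0.5 * p @ p + M(w)
    H_new = 0.5 * p_new @ p_new + M(w_new)
    if rng.random() < np.exp(min(0.0, H_old - H_new)):
        return w_new                          # accept the proposal
    return w                                  # reject: stay at the current point
```

Iterating `langevin_step` yields a chain {w^{(r)}} whose stationary distribution is proportional to exp(−M(w)), which is exactly what the estimator in equation (41.19) requires.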
