Information Theory, Inference, and Learning ... - Inference Group
Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted. http://www.cambridge.org/0521642981 You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links.

41 — Learning as Inference

    wnew = w ;
    gnew = g ;
    for tau = 1:Tau
      p = p - epsilon * gnew / 2 ;   # make half-step in p
      wnew = wnew + epsilon * p ;    # make step in w
      gnew = gradM ( wnew ) ;        # find new gradient
      p = p - epsilon * gnew / 2 ;   # make half-step in p
    endfor

Algorithm 41.8. Octave source code for the Hamiltonian Monte Carlo method. The algorithm is identical to the Langevin method in algorithm 41.4, except for the replacement of the four lines marked * in that algorithm by the fragment shown here.

[Figure 41.9: two panels of weight traces, labelled "Langevin" and "HMC"; vertical axis: weight values; horizontal axis: 0 to 10,000 gradient evaluations.]

Figure 41.9. Comparison of sampling properties of the Langevin Monte Carlo method and the Hamiltonian Monte Carlo (HMC) method. The horizontal axis is the number of gradient evaluations made. Each figure shows the weights during the first 10,000 iterations. The rejection rate during this Hamiltonian Monte Carlo simulation was 8%.

The Bayesian classifier is better able to identify the points where the classification is uncertain. This pleasing behaviour results simply from a mechanical application of the rules of probability.

Optimization and typicality

A final observation concerns the behaviour of the functions G(w) and M(w) during the Monte Carlo sampling process, compared with the values of G and M at the optimum w_MP (figure 41.5). The function G(w) fluctuates around the value of G(w_MP), though not in a symmetrical way. The function M(w) also fluctuates, but it does not fluctuate around M(w_MP) – obviously it cannot, because M is minimized at w_MP, so M could not go any smaller – furthermore, M only rarely drops close to M(w_MP).
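This concentration of M(w) well above its minimum can be illustrated with a toy model (an illustrative sketch, not from the text): for a d-dimensional standard Gaussian posterior with M(w) = |w|²/2, the minimum M(w_MP) = 0 is at the origin, yet typical samples have M(w) near d/2, because |w|² follows a chi-squared distribution with d degrees of freedom.

```python
import random

# Illustrative sketch (assumed toy model, not the text's neural network):
# sample w from a d-dimensional standard Gaussian and record
# M(w) = |w|^2 / 2.  The minimum M(w_MP) = 0 is essentially never visited;
# samples concentrate near M = d/2.
random.seed(0)
d = 100                       # number of parameters
n_samples = 1000
values = []
for _ in range(n_samples):
    w = [random.gauss(0.0, 1.0) for _ in range(d)]
    values.append(sum(wi * wi for wi in w) / 2.0)

mean_M = sum(values) / n_samples
min_M = min(values)
print(f"M(w_MP) = 0, mean sampled M = {mean_M:.1f}, smallest M seen = {min_M:.1f}")
```

With d = 100, the mean of M over samples sits near 50, and even the smallest value observed in 1000 draws stays far above the minimum of 0.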
In the language of information theory, the typical set of w has different properties from the most probable state w_MP.

A general message therefore emerges – applicable to all data models, not just neural networks: one should be cautious about making use of optimized parameters, as the properties of optimized parameters may be unrepresentative of the properties of typical, plausible parameters; and the predictions obtained using optimized parameters alone will often be unreasonably overconfident.

Reducing random walk behaviour using Hamiltonian Monte Carlo

As a final study of Monte Carlo methods, we now compare the Langevin Monte Carlo method with its big brother, the Hamiltonian Monte Carlo method. The change to Hamiltonian Monte Carlo is simple to implement, as shown in algorithm 41.8. Each single proposal makes use of multiple gradient evaluations
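The leapfrog fragment of algorithm 41.8 can be sketched as a complete Hamiltonian Monte Carlo update in Python. Variable names follow the Octave source (wnew, gnew, p, epsilon, Tau, gradM); the quadratic objective M(w) = w²/2 is an illustrative assumption, not the neural network model of the text.

```python
import math
import random

def M(w):
    """Objective ('energy') whose Boltzmann distribution we sample (assumed quadratic)."""
    return 0.5 * w * w

def gradM(w):
    """Gradient of M."""
    return w

def hmc_step(w, epsilon=0.1, Tau=20, rng=random):
    p = rng.gauss(0.0, 1.0)              # draw a fresh momentum
    H_old = M(w) + 0.5 * p * p           # total energy before the trajectory
    wnew, gnew = w, gradM(w)
    for _ in range(Tau):                 # leapfrog steps, as in algorithm 41.8
        p -= epsilon * gnew / 2          # make half-step in p
        wnew += epsilon * p              # make step in w
        gnew = gradM(wnew)               # find new gradient
        p -= epsilon * gnew / 2          # make half-step in p
    H_new = M(wnew) + 0.5 * p * p        # total energy after the trajectory
    # Metropolis accept/reject on the change in total energy
    if rng.random() < math.exp(min(0.0, H_old - H_new)):
        return wnew                      # accept
    return w                             # reject

random.seed(1)
w = 3.0
samples = []
for t in range(2100):
    w = hmc_step(w)
    if t >= 100:                         # discard a short burn-in
        samples.append(w)
```

For this quadratic M, the chain should sample a standard Gaussian; the Tau leapfrog steps per proposal are what suppress the random walk behaviour of the single-step Langevin method.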
