[Figure 41.10. The marginalized probability, and an approximation to it. (a) The function ψ(a, s²), evaluated numerically. In (b) the functions ψ(a, s²) and φ(a, s²) defined in the text are shown as a function of a for s² = 4. From MacKay (1992b).]

Calculating the marginalized probability

The output y(x; w) depends on w only through the scalar a(x; w), so we can reduce the dimensionality of the integral by finding the probability density of a. We are assuming a locally Gaussian posterior probability distribution over w = w_MP + Δw,

  P(\mathbf{w} \mid D, \alpha) \simeq \frac{1}{Z_Q} \exp\!\left( -\tfrac{1}{2}\, \Delta\mathbf{w}^{\mathsf T} \mathbf{A}\, \Delta\mathbf{w} \right).

For our single neuron, the activation a(x; w) is a linear function of w with ∂a/∂w = x, so for any x, the activation a is Gaussian-distributed.

⊲ Exercise 41.2. [2] Assuming w is Gaussian-distributed with mean w_MP and variance–covariance matrix A⁻¹, show that the probability distribution of a(x) is

  P(a \mid \mathbf{x}, D, \alpha) = \mathrm{Normal}(a_{\mathrm{MP}}, s^2) = \frac{1}{\sqrt{2\pi s^2}} \exp\!\left( -\frac{(a - a_{\mathrm{MP}})^2}{2 s^2} \right),   (41.28)

where a_MP = a(x; w_MP) and s² = xᵀ A⁻¹ x.

This means that the marginalized output is:

  P(t{=}1 \mid \mathbf{x}, D, \alpha) = \psi(a_{\mathrm{MP}}, s^2) \equiv \int \mathrm{d}a\; f(a)\, \mathrm{Normal}(a_{\mathrm{MP}}, s^2).   (41.29)

This is to be contrasted with y(x; w_MP) = f(a_MP), the output of the most probable network. The integral of a sigmoid times a Gaussian can be approximated by:

  \psi(a_{\mathrm{MP}}, s^2) \simeq \phi(a_{\mathrm{MP}}, s^2) \equiv f\!\left( \kappa(s)\, a_{\mathrm{MP}} \right)   (41.30)

with κ(s) = 1/√(1 + πs²/8) (figure 41.10).

Demonstration

[Figure 41.11. The Gaussian approximation in weight space and its approximate predictions in input space. (a) A projection of the Gaussian approximation onto the (w₁, w₂) plane of weight space. The one- and two-standard-deviation contours are shown. Also shown are the trajectory of the optimizer, and the Monte Carlo method's samples. (b) The predictive function obtained from the Gaussian approximation and equation (41.30). (Cf. figure 41.2.)]

Figure 41.11 shows the result of fitting a Gaussian approximation at the optimum w_MP, and the results of using that Gaussian approximation and equation (41.30) to make predictions.
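The comparison in figure 41.10(b) is easy to reproduce numerically. The following Python sketch is not from the book; it is a minimal illustration, assuming NumPy and SciPy are available, and the helper names psi, phi and activation_moments are invented here. It evaluates the marginalized output ψ(a_MP, s²) of equation (41.29) by one-dimensional quadrature, compares it with the approximation φ(a_MP, s²) of equation (41.30), and computes a_MP and s² from a Laplace approximation as in equation (41.28).

import numpy as np
from scipy.integrate import quad


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def psi(a_mp, s2):
    """Marginalized output, equation (41.29): integral of f(a) * Normal(a; a_MP, s^2)."""
    def integrand(a):
        return sigmoid(a) * np.exp(-(a - a_mp) ** 2 / (2.0 * s2)) / np.sqrt(2.0 * np.pi * s2)
    s = np.sqrt(s2)
    value, _ = quad(integrand, a_mp - 10.0 * s, a_mp + 10.0 * s)
    return value


def phi(a_mp, s2):
    """Cheap approximation, equation (41.30): f(kappa(s) * a_MP), kappa = 1/sqrt(1 + pi s^2 / 8)."""
    kappa = 1.0 / np.sqrt(1.0 + np.pi * s2 / 8.0)
    return sigmoid(kappa * a_mp)


def activation_moments(x, w_mp, A_inv):
    """Moments of the Gaussian-distributed activation, equation (41.28):
    a_MP = a(x; w_MP) for a linear activation, s^2 = x^T A^{-1} x."""
    a_mp = x @ w_mp
    s2 = x @ A_inv @ x
    return a_mp, s2


if __name__ == "__main__":
    # Compare psi and phi for s^2 = 4, as in figure 41.10(b).
    for a in [0.0, 1.0, 2.0, 4.0]:
        print(f"a_MP = {a:4.1f}   psi = {psi(a, 4.0):.4f}   phi = {phi(a, 4.0):.4f}")

For moderate s² the two functions should be nearly indistinguishable, which is the point of figure 41.10(b): the quadrature serves only as a check on the cheap approximation (41.30), which is what one would use when making predictions over many inputs x.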
