
41 — Learning as Inference

[Figure 41.10 appears here; see its caption at the end of this section.]

Calculating the marginalized probability

[Figure 41.11 appears here; see its caption at the end of this section.]

The output y(x; w) depends on w only through the scalar a(x; w), so we can reduce the dimensionality of the integral by finding the probability density of a. We are assuming a locally Gaussian posterior probability distribution over w = w_MP + ∆w,
\[
P(\mathbf{w} \mid D, \alpha) \simeq \frac{1}{Z_Q} \exp\!\left( -\tfrac{1}{2}\, \Delta\mathbf{w}^{\mathsf{T}} \mathbf{A}\, \Delta\mathbf{w} \right).
\]
For our single neuron, the activation a(x; w) is a linear function of w with ∂a/∂w = x, so for any x, the activation a is Gaussian-distributed.
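As an illustrative aside (not from the original text), this claim can be checked numerically by drawing weights from the assumed Gaussian posterior and looking at the induced distribution of the activation. The following Python sketch uses made-up values for w_MP, A, and x.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up example: a neuron with two weights.
w_MP = np.array([1.0, 2.0])               # posterior mode (assumed)
A = np.array([[4.0, 1.0], [1.0, 3.0]])    # Hessian of the posterior at w_MP (assumed)
x = np.array([0.5, -1.0])                 # a single input vector (assumed)

# Draw w ~ Normal(w_MP, A^{-1}) and form the activation a = x . w for each sample.
cov = np.linalg.inv(A)
w_samples = rng.multivariate_normal(w_MP, cov, size=100_000)
a_samples = w_samples @ x

# The sample mean and variance of a should match x.w_MP and x^T A^{-1} x.
print(a_samples.mean(), x @ w_MP)
print(a_samples.var(), x @ cov @ x)
```

The empirical mean and variance of the sampled activations should match x·w_MP and xᵀA⁻¹x, the quantities that appear in Exercise 41.2 below.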

⊲ Exercise 41.2.[2] Assuming w is Gaussian-distributed with mean w_MP and variance–covariance matrix $\mathbf{A}^{-1}$, show that the probability distribution of a(x) is
\[
P(a \mid \mathbf{x}, D, \alpha) = \mathrm{Normal}(a_{\mathrm{MP}}, s^2) = \frac{1}{\sqrt{2\pi s^2}} \exp\!\left( -\frac{(a - a_{\mathrm{MP}})^2}{2 s^2} \right),
\tag{41.28}
\]
where $a_{\mathrm{MP}} = a(\mathbf{x}; \mathbf{w}_{\mathrm{MP}})$ and $s^2 = \mathbf{x}^{\mathsf{T}} \mathbf{A}^{-1} \mathbf{x}$.

This means that the marginalized output is
\[
P(t = 1 \mid \mathbf{x}, D, \alpha) = \psi(a_{\mathrm{MP}}, s^2) \equiv \int \! da\, f(a)\, \mathrm{Normal}(a;\, a_{\mathrm{MP}}, s^2).
\tag{41.29}
\]
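Here is a small, self-contained Python sketch (an illustration, not code from the book) that evaluates the marginalized output of equation (41.29) by integrating f(a) Normal(a; a_MP, s²) on a grid, assuming the logistic sigmoid f(a) = 1/(1 + e^(−a)) and the same made-up w_MP, A, and x as above.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def marginalized_output(x, w_MP, A, n_grid=4001):
    """psi(a_MP, s^2) of equation (41.29), evaluated by summing
    f(a) * Normal(a; a_MP, s^2) over a fine grid of a."""
    a_MP = x @ w_MP                       # most probable activation
    s2 = x @ np.linalg.solve(A, x)        # s^2 = x^T A^{-1} x
    s = np.sqrt(s2)
    a = np.linspace(a_MP - 8 * s, a_MP + 8 * s, n_grid)   # cover +/- 8 std devs
    gauss = np.exp(-(a - a_MP) ** 2 / (2 * s2)) / np.sqrt(2 * np.pi * s2)
    return np.sum(sigmoid(a) * gauss) * (a[1] - a[0]), a_MP, s2

# Same made-up neuron as in the previous sketch.
w_MP = np.array([1.0, 2.0])
A = np.array([[4.0, 1.0], [1.0, 3.0]])
x = np.array([0.5, -1.0])

psi, a_MP, s2 = marginalized_output(x, w_MP, A)
print(psi, sigmoid(a_MP))   # marginalized output vs. most probable output f(a_MP)
```

For a_MP ≠ 0, marginalization pulls the output towards 1/2 relative to f(a_MP); the broader the posterior (the larger s²), the stronger this moderation.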

This is to be contrasted with y(x; w_MP) = f(a_MP), the output of the most probable network. The integral of a sigmoid times a Gaussian can be approximated by:
\[
\psi(a_{\mathrm{MP}}, s^2) \simeq \phi(a_{\mathrm{MP}}, s^2) \equiv f(\kappa(s)\, a_{\mathrm{MP}})
\tag{41.30}
\]
with $\kappa = 1/\sqrt{1 + \pi s^2/8}$ (figure 41.10).
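As a rough numerical check of equation (41.30) (again an illustration, not from the book), the following sketch compares the numerically integrated ψ(a, s²) with the approximation φ(a, s²) = f(κ(s) a) for s² = 4, the case shown in figure 41.10(b), assuming the logistic sigmoid.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def psi(a_mp, s2, n_grid=4001):
    # Equation (41.29): integral of f(a) * Normal(a; a_mp, s^2), done on a grid.
    s = np.sqrt(s2)
    a = np.linspace(a_mp - 8 * s, a_mp + 8 * s, n_grid)
    gauss = np.exp(-(a - a_mp) ** 2 / (2 * s2)) / np.sqrt(2 * np.pi * s2)
    return np.sum(sigmoid(a) * gauss) * (a[1] - a[0])

def phi(a_mp, s2):
    # Equation (41.30): f(kappa(s) * a_mp), with kappa = 1/sqrt(1 + pi s^2 / 8).
    kappa = 1.0 / np.sqrt(1.0 + np.pi * s2 / 8.0)
    return sigmoid(kappa * a_mp)

s2 = 4.0
for a_mp in (-6.0, -3.0, 0.0, 3.0, 6.0):
    print(f"a = {a_mp:+.1f}   psi = {psi(a_mp, s2):.4f}   phi = {phi(a_mp, s2):.4f}")
```

If the approximation behaves as in figure 41.10(b), the two columns should nearly coincide across this range of a.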

Demonstration<br />

Figure 41.11 shows the result of fitting a Gaussian approximation at the optimum w_MP, and the results of using that Gaussian approximation and equation (41.30) to obtain the predictive function in input space.

Figure 41.10. The marginalized probability, and an approximation to it. (a) The function ψ(a, s²), evaluated numerically. In (b) the functions ψ(a, s²) and φ(a, s²) defined in the text are shown as a function of a for s² = 4. From MacKay (1992b).

Figure 41.11. The Gaussian approximation in weight space and its approximate predictions in input space. (a) A projection of the Gaussian approximation onto the (w1, w2) plane of weight space. The one- and two-standard-deviation contours are shown. Also shown are the trajectory of the optimizer, and the Monte Carlo method's samples. (b) The predictive function obtained from the Gaussian approximation and equation (41.30). (Cf. figure 41.2.)
