

[Figure 45.1. Samples drawn from Gaussian process priors. Each of the four panels (a)–(d) plots t against x (x from −3.0 to 5.0) and shows two functions drawn from a Gaussian process prior. The corresponding covariance functions are:

(a) $2\exp\!\left(-\dfrac{(x-x')^2}{2(1.5)^2}\right)$

(b) $2\exp\!\left(-\dfrac{(x-x')^2}{2(0.35)^2}\right)$

(c) $2\exp\!\left(-\dfrac{\sin^2(\pi(x-x')/3.0)}{2(0.5)^2}\right)$

(d) $2\exp\!\left(-\dfrac{(x-x')^2}{2(1.5)^2}\right) + xx'$

The decrease in lengthscale from (a) to (b) produces more rapidly fluctuating functions. The periodic properties of the covariance function in (c) can be seen. The covariance function in (d) contains the non-stationary term $xx'$ corresponding to the covariance of a straight line, so that typical functions include linear trends. From Gibbs (1997).]
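These priors are easy to explore numerically. The following is a minimal sketch (not code from the book) of how samples such as those in figure 45.1 can be drawn: evaluate the chosen covariance function on a grid of inputs and draw from the resulting multivariate Gaussian. The grid size, random seed, and jitter term are illustrative choices.

```python
import numpy as np

def cov_a(x, xp):  # (a) squared exponential, lengthscale 1.5
    return 2.0 * np.exp(-(x - xp) ** 2 / (2 * 1.5 ** 2))

def cov_b(x, xp):  # (b) squared exponential, lengthscale 0.35
    return 2.0 * np.exp(-(x - xp) ** 2 / (2 * 0.35 ** 2))

def cov_c(x, xp):  # (c) periodic covariance, period 3.0
    return 2.0 * np.exp(-np.sin(np.pi * (x - xp) / 3.0) ** 2 / (2 * 0.5 ** 2))

def cov_d(x, xp):  # (d) squared exponential plus non-stationary linear term x x'
    return 2.0 * np.exp(-(x - xp) ** 2 / (2 * 1.5 ** 2)) + x * xp

def sample_prior(cov, n_samples=2, jitter=1e-8, seed=0):
    """Draw n_samples functions from the GP prior with the given covariance."""
    x = np.linspace(-3.0, 5.0, 200)
    # covariance matrix on the grid, with a small jitter for numerical stability
    C = cov(x[:, None], x[None, :]) + jitter * np.eye(len(x))
    rng = np.random.default_rng(seed)
    t = rng.multivariate_normal(np.zeros(len(x)), C, size=n_samples)
    return x, t

x, t = sample_prior(cov_c)   # two random functions with the periodic covariance
```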

Multilayer neural networks and Gaussian processes

Figures 44.2 and 44.3 show some random samples from the prior distribution over functions defined by a selection of standard multilayer perceptrons with large numbers of hidden units. Those samples don't seem a million miles away from the Gaussian process samples of figure 45.1. And indeed Neal (1996) showed that the properties of a neural network with one hidden layer (as in equation (45.4)) converge to those of a Gaussian process as the number of hidden neurons tends to infinity, if standard 'weight decay' priors are assumed. The covariance function of this Gaussian process depends on the details of the priors assumed for the weights in the network and the activation functions of the hidden units.
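A rough numerical check of Neal's limit is easy to set up. The sketch below is not Neal's derivation; it simply simulates one-hidden-layer tanh networks with assumed zero-mean Gaussian weight priors (the hidden-to-output weights scaled by $1/\sqrt{H}$ so the output variance stays finite) and shows that the distribution of the output at a fixed input settles down as the number of hidden units H grows. The prior variances chosen here are illustrative.

```python
import numpy as np

def random_network_outputs(x, H, n_networks=5000, sigma_w=5.0, sigma_v=1.0, seed=1):
    """Outputs of many random one-hidden-layer tanh networks at a single input x."""
    rng = np.random.default_rng(seed)
    # input-to-hidden weights and biases (assumed Gaussian priors, for illustration)
    w = rng.normal(0.0, sigma_w, size=(n_networks, H))
    b = rng.normal(0.0, sigma_w, size=(n_networks, H))
    # hidden-to-output weights, scaled so the output variance stays finite as H grows
    v = rng.normal(0.0, sigma_v / np.sqrt(H), size=(n_networks, H))
    h = np.tanh(w * x + b)          # hidden-unit activities at input x
    return np.sum(v * h, axis=1)    # network outputs y(x)

for H in (1, 10, 1000):
    y = random_network_outputs(x=0.5, H=H)
    # the variance converges and a histogram of y approaches a Gaussian shape
    print(H, y.mean(), y.var())
```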

45.3 Using a given Gaussian process model in regression

We have spent some time talking about priors. We now return to our data and the problem of prediction. How do we make predictions with a Gaussian process?

Having formed the covariance matrix C defined in equation (45.32), our task is to infer $t_{N+1}$ given the observed vector $\mathbf{t}_N$. The joint density $P(t_{N+1}, \mathbf{t}_N)$ is a Gaussian; so the conditional distribution
\[
P(t_{N+1} \mid \mathbf{t}_N) = \frac{P(t_{N+1}, \mathbf{t}_N)}{P(\mathbf{t}_N)} \tag{45.34}
\]

is also a Gaussian. We now distinguish between different sizes of covariance matrix C with a subscript, such that $C_{N+1}$ is the $(N+1)\times(N+1)$ covariance matrix.
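Once the covariance matrices are distinguished in this way, the prediction follows from the standard result for conditioning a Gaussian: writing $C_{N+1}$ in terms of the $N \times N$ block $C_N$, the vector $\mathbf{k}$ of covariances between the new input and the N training inputs, and the scalar prior variance $\kappa$ at the new input, the predictive mean is $\mathbf{k}^{\top} C_N^{-1} \mathbf{t}_N$ and the predictive variance is $\kappa - \mathbf{k}^{\top} C_N^{-1} \mathbf{k}$. The sketch below assumes this partitioning and takes C to be the covariance of the noisy targets, as in the text; the covariance function, noise level, and variable names are illustrative, not the book's code.

```python
import numpy as np

def sq_exp(x, xp, r=1.5):
    """Squared-exponential covariance, as in panel (a) of figure 45.1."""
    return 2.0 * np.exp(-(x - xp) ** 2 / (2 * r ** 2))

def gp_predict(x_train, t_train, x_new, cov, noise_var=0.1):
    """Predictive mean and variance of t_{N+1} at x_new, given observed targets."""
    # C_N: covariance of the observed targets (prior covariance plus noise on the diagonal)
    C_N = cov(x_train[:, None], x_train[None, :]) + noise_var * np.eye(len(x_train))
    k = cov(x_train, x_new)                  # covariances between training inputs and x_new
    kappa = cov(x_new, x_new) + noise_var    # prior variance of the target at x_new
    alpha = np.linalg.solve(C_N, t_train)    # C_N^{-1} t_N, solved rather than inverted
    mean = k @ alpha
    var = kappa - k @ np.linalg.solve(C_N, k)
    return mean, var

# example usage with made-up data
x_train = np.array([-2.0, 0.0, 1.0, 3.0])
t_train = np.array([0.5, 1.0, -0.3, 0.2])
mean, var = gp_predict(x_train, t_train, x_new=2.0, cov=sq_exp)
```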

