
45.2 From parametric models to Gaussian processes

Linear models

Let us consider a regression problem using $H$ fixed basis functions, for example one-dimensional radial basis functions as defined in equation (45.3). Let us assume that a list of $N$ input points $\{x^{(n)}\}$ has been specified and define the $N \times H$ matrix $R$ to be the matrix of values of the basis functions $\{\phi_h(x)\}_{h=1}^{H}$ at the points $\{x^{(n)}\}$,

$R_{nh} \equiv \phi_h(x^{(n)})$.   (45.17)
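As a minimal illustrative sketch (not from the text), the design matrix $R$ can be built as follows, assuming the radial basis functions of equation (45.3) are Gaussian bumps $\phi_h(x) = \exp(-(x - c_h)^2 / (2r^2))$; the centres $c_h$, the width $r$, and all numerical values here are hypothetical choices:

```python
import numpy as np

rng = np.random.default_rng(0)

H, N = 5, 20                          # number of basis functions / input points
r = 0.5                               # assumed basis-function width
centres = np.linspace(-2.0, 2.0, H)   # assumed centres c_h
x = rng.uniform(-2.0, 2.0, N)         # the N input points x^(n)

def phi(x, c, r):
    """Radial basis function phi_h(x) centred at c with width r."""
    return np.exp(-(x - c) ** 2 / (2.0 * r ** 2))

# R[n, h] = phi_h(x^(n)), an N x H matrix (equation 45.17).
R = phi(x[:, None], centres[None, :], r)
```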

We define the vector $y_N$ to be the vector of values of $y(x)$ at the $N$ points,

$y_n \equiv \sum_h R_{nh} w_h$.   (45.18)

If the prior distribution of $w$ is Gaussian with zero mean,

$P(w) = \text{Normal}(w;\, 0,\, \sigma_w^2 I)$,   (45.19)

then $y$, being a linear function of $w$, is also Gaussian distributed, with mean zero. The covariance matrix of $y$ is

$Q = \langle y y^T \rangle = \langle R w w^T R^T \rangle = R\, \langle w w^T \rangle\, R^T$   (45.20)

$\phantom{Q} = \sigma_w^2 R R^T$.   (45.21)

So the prior distribution of $y$ is:

$P(y) = \text{Normal}(y;\, 0,\, Q) = \text{Normal}(y;\, 0,\, \sigma_w^2 R R^T)$.   (45.22)
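Continuing the sketch, a draw from this prior can be made either by sampling $w$ from equation (45.19) and forming $y = Rw$, or directly from $\text{Normal}(0, Q)$; the empirical covariance of many such draws should approach $Q = \sigma_w^2 R R^T$ ($\sigma_w$ is an assumed value):

```python
sigma_w = 1.0                          # assumed prior standard deviation of w
Q = sigma_w ** 2 * (R @ R.T)           # equation (45.21)

w = sigma_w * rng.standard_normal(H)   # w ~ Normal(0, sigma_w^2 I)
y = R @ w                              # one function draw: y_n = sum_h R_{nh} w_h

# Empirical check: the sample covariance of many draws approaches Q.
Y = R @ (sigma_w * rng.standard_normal((H, 100_000)))
print(np.max(np.abs(np.cov(Y) - Q)))   # small, shrinking with more draws
```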

This result, that the vector of $N$ function values $y$ has a Gaussian distribution, is true for any selected points $X_N$. This is the defining property of a Gaussian process. The probability distribution of a function $y(x)$ is a Gaussian process if for any finite selection of points $x^{(1)}, x^{(2)}, \ldots, x^{(N)}$, the density $P(y(x^{(1)}), y(x^{(2)}), \ldots, y(x^{(N)}))$ is a Gaussian.

Now, if the number of basis functions $H$ is smaller than the number of data points $N$, then the matrix $Q$ will not have full rank. In this case the probability distribution of $y$ might be thought of as a flat elliptical pancake confined to an $H$-dimensional subspace in the $N$-dimensional space in which $y$ lives.
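This rank deficiency is easy to see numerically in the sketch above, where $H = 5$ and $N = 20$:

```python
# Q is a 20 x 20 matrix, but its rank is at most H = 5: the 'pancake'
# is confined to a 5-dimensional subspace.
print(np.linalg.matrix_rank(Q))   # 5, not 20
```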

What about the target values? If each target $t_n$ is assumed to differ by additive Gaussian noise of variance $\sigma_\nu^2$ from the corresponding function value $y_n$ then $t$ also has a Gaussian prior distribution,

$P(t) = \text{Normal}(t;\, 0,\, Q + \sigma_\nu^2 I)$.   (45.23)
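In the sketch, targets with this prior are obtained by adding noise to the function values ($\sigma_\nu$ is an assumed value):

```python
sigma_nu = 0.1                               # assumed noise standard deviation
t = y + sigma_nu * rng.standard_normal(N)    # t ~ Normal(0, Q + sigma_nu^2 I)
```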

We will denote the covariance matrix of $t$ by $C$:

$C = Q + \sigma_\nu^2 I = \sigma_w^2 R R^T + \sigma_\nu^2 I$.   (45.24)

Whether or not $Q$ has full rank, the covariance matrix $C$ has full rank since $\sigma_\nu^2 I$ is full rank.
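The same numerical check confirms that adding the noise term restores full rank:

```python
C = Q + sigma_nu ** 2 * np.eye(N)         # equation (45.24)
print(np.linalg.matrix_rank(C))           # 20: full rank
print(np.all(np.linalg.eigvalsh(C) > 0))  # True: positive definite
```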

What does the covariance matrix $Q$ look like? In general, the $(n, n')$ entry of $Q$ is

$Q_{nn'} = [\sigma_w^2 R R^T]_{nn'} = \sigma_w^2 \sum_h \phi_h(x^{(n)})\, \phi_h(x^{(n')})$   (45.25)
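In the sketch, this entry-wise sum over basis functions can be checked directly against the matrix product; it shows that $Q_{nn'}$ depends on the inputs only through the pair $(x^{(n)}, x^{(n')})$:

```python
n, m = 3, 7
q_direct = sigma_w ** 2 * np.sum(phi(x[n], centres, r) * phi(x[m], centres, r))
print(np.allclose(q_direct, Q[n, m]))   # True: matches equation (45.25)
```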
