45  Gaussian Processes

After the publication of Rumelhart, Hinton and Williams's (1986) paper on supervised learning in neural networks there was a surge of interest in the empirical modelling of relationships in high-dimensional data using nonlinear parametric models such as multilayer perceptrons and radial basis functions. In the Bayesian interpretation of these modelling methods, a nonlinear function y(x) parameterized by parameters w is assumed to underlie the data $\{x^{(n)}, t_n\}_{n=1}^N$, and the adaptation of the model to the data corresponds to an inference of the function given the data. We will denote the set of input vectors by $X_N \equiv \{x^{(n)}\}_{n=1}^N$ and the set of corresponding target values by the vector $\mathbf{t}_N \equiv \{t_n\}_{n=1}^N$. The inference of y(x) is described by the posterior probability distribution

$$P(y(x) \mid \mathbf{t}_N, X_N) = \frac{P(\mathbf{t}_N \mid y(x), X_N)\, P(y(x))}{P(\mathbf{t}_N \mid X_N)}. \qquad (45.1)$$

Of the two terms on the right-hand side, the first, $P(\mathbf{t}_N \mid y(x), X_N)$, is the probability of the target values given the function y(x), which in the case of regression problems is often assumed to be a separable Gaussian distribution; and the second term, P(y(x)), is the prior distribution on functions assumed by the model. This prior is implicit in the choice of parametric model and the choice of regularizers used during the model fitting. The prior typically specifies that the function y(x) is expected to be continuous and smooth, and has less high-frequency power than low-frequency power, but the precise meaning of the prior is somewhat obscured by the use of the parametric model.

Now, for the prediction of future values of t, all that matters is the assumed prior P(y(x)) and the assumed noise model $P(\mathbf{t}_N \mid y(x), X_N)$ – the parameterization of the function y(x; w) is irrelevant.

The idea of Gaussian process modelling is to place a prior P(y(x)) directly on the space of functions, without parameterizing y(x). The simplest type of prior over functions is called a Gaussian process. It can be thought of as the generalization of a Gaussian distribution over a finite vector space to a function space of infinite dimension. Just as a Gaussian distribution is fully specified by its mean and covariance matrix, a Gaussian process is specified by a mean and a covariance function. Here, the mean is a function of x (which we will often take to be the zero function), and the covariance is a function C(x, x') that expresses the expected covariance between the values of the function y at the points x and x'. The function y(x) in any one data modelling problem is assumed to be a single sample from this Gaussian distribution. Gaussian processes are already well-established models for various spatial and temporal problems – for example, Brownian motion, Langevin processes and Wiener processes are all examples of Gaussian processes; Kalman filters, widely used
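To make the phrase "separable Gaussian distribution" above concrete, one common choice (writing σ_ν² for an assumed noise variance; the symbol and the independence assumption are illustrative, not fixed by the passage) treats each target as the function value at its input plus independent Gaussian noise:

$$P(\mathbf{t}_N \mid y(x), X_N) = \prod_{n=1}^{N} \frac{1}{\sqrt{2\pi\sigma_\nu^2}} \exp\!\left(-\frac{\bigl(t_n - y(x^{(n)})\bigr)^2}{2\sigma_\nu^2}\right),$$

so the likelihood factorizes (separates) over the N data points.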
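A minimal sketch of what "specified by a mean and a covariance function" means in practice, assuming a squared-exponential covariance and a zero mean function (illustrative choices, not ones the text has fixed): at any finite set of input points, a function drawn from the Gaussian process prior is just one sample from a multivariate Gaussian whose covariance matrix is built by evaluating C(x, x') at those points.

import numpy as np

def squared_exponential(x, xp, length_scale=1.0, signal_variance=1.0):
    # Illustrative covariance function C(x, x'): nearby inputs receive
    # strongly correlated function values, distant inputs nearly independent ones.
    return signal_variance * np.exp(-0.5 * ((x - xp) / length_scale) ** 2)

# A finite grid of input points at which we inspect the prior over functions.
x = np.linspace(-5.0, 5.0, 200)

# Covariance matrix C_ij = C(x_i, x_j); the mean is taken to be the zero function.
C = squared_exponential(x[:, None], x[None, :])

# At these points, a function y(x) drawn from the Gaussian process prior is
# simply a sample from a multivariate Gaussian with this mean and covariance.
# The small diagonal "jitter" keeps the matrix numerically positive definite.
rng = np.random.default_rng(0)
y = rng.multivariate_normal(np.zeros_like(x), C + 1e-8 * np.eye(x.size))

Drawing several such samples and plotting them against x shows typical functions under the prior; shrinking length_scale makes the sampled functions wiggle more rapidly, which is how the covariance function encodes the smoothness assumptions discussed above.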
