45.1 Standard methods for nonlinear regression

Here, if we denote by $D$ the linear operator that maps $y(x)$ to the derivative of $y(x)$, we can write equation (45.10) as
$$
\ln P(y(x) \mid \alpha) = -\tfrac{1}{2}\,\alpha \int \mathrm{d}x\, \left[ D^p y(x) \right]^2 + \mathrm{const} = -\tfrac{1}{2}\, y(x)^{\mathsf{T}} A\, y(x) + \mathrm{const}, \tag{45.13}
$$
which has the same form as equation (45.12) with $\mu(x) = 0$ and $A \equiv \alpha\, [D^p]^{\mathsf{T}} D^p$. In order for the prior in equation (45.12) to be a proper prior, $A$ must be a positive definite operator, i.e., one satisfying $y(x)^{\mathsf{T}} A\, y(x) > 0$ for all functions $y(x)$ other than $y(x) = 0$.

Splines can be written as parametric models

Splines may be written in terms of an infinite set of fixed basis functions, as in equation (45.2), as follows. First rescale the $x$ axis so that the interval $(0, 2\pi)$ is much wider than the range of $x$ values of interest. Let the basis functions be a Fourier set $\{\cos hx, \sin hx,\ h = 0, 1, 2, \ldots\}$, so the function is
$$
y(x) = \sum_{h=0}^{\infty} w_{h(\cos)} \cos(hx) + \sum_{h=1}^{\infty} w_{h(\sin)} \sin(hx). \tag{45.14}
$$
Use the regularizer
$$
E_W(\mathbf{w}) = \sum_{h=0}^{\infty} \tfrac{1}{2}\, h^{2p}\, w_{h(\cos)}^2 + \sum_{h=1}^{\infty} \tfrac{1}{2}\, h^{2p}\, w_{h(\sin)}^2 \tag{45.15}
$$
to define a Gaussian prior on $\mathbf{w}$,
$$
P(\mathbf{w} \mid \alpha) = \frac{1}{Z_W(\alpha)} \exp(-\alpha E_W). \tag{45.16}
$$
If $p = 2$ then we have the cubic splines regularizer $E_W(\mathbf{w}) = \int y^{(2)}(x)^2 \,\mathrm{d}x$, as in equation (45.9); if $p = 1$ we have the regularizer $E_W(\mathbf{w}) = \int y^{(1)}(x)^2 \,\mathrm{d}x$, etc. (To make the prior proper we must add an extra regularizer on the term $w_{0(\cos)}$.) Thus in terms of the prior $P(y(x))$ there is no fundamental difference between the 'nonparametric' splines approach and other parametric approaches.

Representation is irrelevant for prediction

From the point of view of prediction at least, there are two objects of interest. The first is the conditional distribution $P(t_{N+1} \mid \mathbf{t}_N, X_{N+1})$ defined in equation (45.7). The other object of interest, should we wish to compare one model with others, is the joint probability of all the observed data given the model, the evidence $P(\mathbf{t}_N \mid X_N)$, which appeared as the normalizing constant in equation (45.5). Neither of these quantities makes any reference to the representation of the unknown function $y(x)$. So, at the end of the day, our choice of representation is irrelevant.

The question we now address is: in the case of popular parametric models, what form do these two quantities take? We will see that for standard models with fixed basis functions and Gaussian distributions on the unknown parameters, the joint probability of all the observed data given the model, $P(\mathbf{t}_N \mid X_N)$, is a multivariate Gaussian distribution with mean zero and with a covariance matrix determined by the basis functions; this implies that the conditional distribution $P(t_{N+1} \mid \mathbf{t}_N, X_{N+1})$ is also a Gaussian distribution, whose mean depends linearly on the values of the targets $\mathbf{t}_N$. Standard parametric models are simple examples of Gaussian processes.
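The positive-definiteness requirement on $A$ can be checked concretely. The following sketch (a numerical illustration, not from the text; the grid size, $\alpha$, and $p$ are arbitrary choices) discretizes $D$ as a forward-difference matrix and inspects the spectrum of the discrete analogue of $A = \alpha [D^p]^{\mathsf{T}} D^p$. Every eigenvalue is non-negative, but $D^p$ annihilates polynomials of degree less than $p$, so $A$ has zero modes: it is only positive semi-definite, which is exactly why the prior needs an extra regularizing term to be proper.

```python
# Numerical sketch: discretize D by forward differences on a grid over [0, 1]
# and examine the spectrum of A = alpha (D^p)^T (D^p) from equation (45.13).
import numpy as np

def diff_matrix(m):
    """(m-1) x m forward-difference matrix approximating d/dx on a unit grid."""
    return (np.eye(m - 1, m, k=1) - np.eye(m - 1, m)) * (m - 1)

n, alpha, p = 100, 1.0, 2          # illustrative choices, not from the text
Dp = np.eye(n)
for _ in range(p):                 # compose p derivative operators: D^p
    Dp = diff_matrix(Dp.shape[0]) @ Dp
A = alpha * Dp.T @ Dp              # discrete analogue of alpha [D^p]^T D^p

eigvals = np.linalg.eigvalsh(A)
# All eigenvalues are >= 0, but D^p annihilates polynomials of degree < p
# (constants, and straight lines when p = 2), so exactly p eigenvalues vanish:
# A is positive semi-definite rather than positive definite, and the prior
# defined through it is improper.
print(eigvals.min())                           # ~0 up to round-off
print(np.sum(eigvals < 1e-9 * eigvals.max()))  # p zero modes
```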
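The prior defined by equations (45.15) and (45.16) factorizes over the Fourier coefficients: each $w_{h(\cos)}$ and $w_{h(\sin)}$ is independently Gaussian with variance $1/(\alpha h^{2p})$. The sketch below draws functions from this prior; the truncation $H$, the value of $\alpha$, and the handling of the unregularized $w_{0(\cos)}$ term are illustrative assumptions. Draws with $p = 2$ come out visibly smoother than draws with $p = 1$, matching the correspondence with the cubic-splines and first-derivative regularizers.

```python
# Sampling functions from the spline prior of equations (45.14)-(45.16).
import numpy as np

def sample_spline_prior(x, p=2, alpha=1.0, H=50, rng=None):
    """Draw y(x) = sum_h w_h(cos) cos(hx) + w_h(sin) sin(hx), with independent
    coefficients w_h ~ Normal(0, 1/(alpha h^(2p))) as implied by (45.16)."""
    rng = np.random.default_rng() if rng is None else rng
    # The h = 0 cosine term is unregularized by (45.15); the text notes an
    # extra regularizer is needed to make the prior proper. For illustration
    # we simply give it unit variance here.
    y = np.full_like(x, rng.normal(0.0, 1.0))
    for h in range(1, H + 1):
        std = 1.0 / np.sqrt(alpha * h ** (2 * p))
        y += rng.normal(0.0, std) * np.cos(h * x)
        y += rng.normal(0.0, std) * np.sin(h * x)
    return y

x = np.linspace(0.0, 2.0 * np.pi, 500)
rough = sample_spline_prior(x, p=1)   # draw from the first-derivative prior
smooth = sample_spline_prior(x, p=2)  # draw from the cubic-splines prior
```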
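The closing claim can be made concrete with a short computation. Assuming a fixed-basis model of the kind the text describes, $t_n = \sum_h w_h \phi_h(x_n) + \nu_n$ with $\mathbf{w} \sim \mathrm{Normal}(\mathbf{0}, \sigma_w^2 I)$ and Gaussian noise, integrating out $\mathbf{w}$ gives $P(\mathbf{t}_N \mid X_N) = \mathrm{Normal}(\mathbf{0}, K)$ with $K = \sigma_w^2 \Phi \Phi^{\mathsf{T}} + \sigma_\nu^2 I$, and conditioning that Gaussian yields a predictive mean that is linear in $\mathbf{t}_N$. The Gaussian-bump basis functions and the values of $\sigma_w$ and $\sigma_\nu$ below are illustrative assumptions, not taken from the text.

```python
# Sketch: the evidence covariance and the predictive distribution for a
# fixed-basis-function model with Gaussian weight prior and Gaussian noise.
import numpy as np

def phi(x, centers=np.linspace(0.0, 1.0, 10), width=0.1):
    """Feature matrix: Phi[n, h] = phi_h(x_n); Gaussian bumps as an example."""
    return np.exp(-0.5 * ((x[:, None] - centers[None, :]) / width) ** 2)

def gp_predict(x_train, t_train, x_new, sigma_w=1.0, sigma_nu=0.1):
    """Mean and variance of P(t_{N+1} | t_N, X_{N+1}).
    With w ~ Normal(0, sigma_w^2 I), marginalizing w makes the targets
    zero-mean Gaussian with covariance K = sigma_w^2 Phi Phi^T + sigma_nu^2 I;
    standard Gaussian conditioning then gives a mean linear in t_N."""
    X = np.concatenate([x_train, x_new])
    Phi = phi(X)
    K = sigma_w**2 * Phi @ Phi.T + sigma_nu**2 * np.eye(len(X))
    N = len(x_train)
    Knn, Ksn, Kss = K[:N, :N], K[N:, :N], K[N:, N:]
    mean = Ksn @ np.linalg.solve(Knn, t_train)       # linear in the targets t_N
    cov = Kss - Ksn @ np.linalg.solve(Knn, Ksn.T)
    return mean, np.diag(cov)

rng = np.random.default_rng(0)
x_train = np.sort(rng.uniform(0.0, 1.0, 20))
t_train = np.sin(2 * np.pi * x_train) + 0.1 * rng.normal(size=20)
mean, var = gp_predict(x_train, t_train, np.linspace(0.0, 1.0, 5))
```

Note that the representation (basis functions plus weight prior) enters only through the covariance matrix $K$; once $K$ is formed, the prediction never refers to the weights again, which is the sense in which the choice of representation is irrelevant.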
