45.1 Standard methods for nonlinear regression

Example 45.3. Adaptive basis functions. Alternatively, we might make a function y(x) from basis functions that depend on additional parameters included in the vector w. In a two-layer feedforward neural network with nonlinear hidden units and a linear output, the function can be written

\[
y(\mathbf{x}; \mathbf{w}) = \sum_{h=1}^{H} w^{(2)}_{h} \tanh\!\left( \sum_{i=1}^{I} w^{(1)}_{hi} x_i + w^{(1)}_{h0} \right) + w^{(2)}_{0}
\tag{45.4}
\]

where I is the dimensionality of the input space and the weight vector w consists of the input weights $\{w^{(1)}_{hi}\}$, the hidden-unit biases $\{w^{(1)}_{h0}\}$, the output weights $\{w^{(2)}_{h}\}$, and the output bias $w^{(2)}_{0}$. In this model, the dependence of y on w is nonlinear.

Having chosen the parameterization, we then infer the function y(x; w) by inferring the parameters w. The posterior probability of the parameters is

\[
P(\mathbf{w} \mid \mathbf{t}_N, X_N) = \frac{P(\mathbf{t}_N \mid \mathbf{w}, X_N)\, P(\mathbf{w})}{P(\mathbf{t}_N \mid X_N)}.
\tag{45.5}
\]

The factor $P(\mathbf{t}_N \mid \mathbf{w}, X_N)$ states the probability of the observed data points when the parameters w (and hence the function y) are known. This probability distribution is often taken to be a separable Gaussian, each data point $t_n$ differing from the underlying value $y(\mathbf{x}^{(n)}; \mathbf{w})$ by additive noise. The factor $P(\mathbf{w})$ specifies the prior probability distribution of the parameters. This too is often taken to be a separable Gaussian distribution. If the dependence of y on w is nonlinear, the posterior distribution $P(\mathbf{w} \mid \mathbf{t}_N, X_N)$ is in general not a Gaussian distribution.

The inference can be implemented in various ways. In the Laplace method, we minimize an objective function

\[
M(\mathbf{w}) = -\ln \big[ P(\mathbf{t}_N \mid \mathbf{w}, X_N)\, P(\mathbf{w}) \big]
\tag{45.6}
\]

with respect to w, locating the locally most probable parameters, then use the curvature of M, $\partial^2 M(\mathbf{w})/\partial w_i \partial w_j$, to define error bars on w. Alternatively, we can use more general Markov chain Monte Carlo techniques to create samples from the posterior distribution $P(\mathbf{w} \mid \mathbf{t}_N, X_N)$.

Having obtained one of these representations of the inference of w given the data, predictions are then made by marginalizing over the parameters:

\[
P(t_{N+1} \mid \mathbf{t}_N, X_{N+1}) = \int \mathrm{d}^{H}\mathbf{w}\; P(t_{N+1} \mid \mathbf{w}, \mathbf{x}^{(N+1)})\, P(\mathbf{w} \mid \mathbf{t}_N, X_N).
\tag{45.7}
\]

If we have a Gaussian representation of the posterior $P(\mathbf{w} \mid \mathbf{t}_N, X_N)$, then this integral can typically be evaluated directly. In the alternative Monte Carlo approach, which generates R samples $\mathbf{w}^{(r)}$ that are intended to be samples from the posterior distribution $P(\mathbf{w} \mid \mathbf{t}_N, X_N)$, we approximate the predictive distribution by

\[
P(t_{N+1} \mid \mathbf{t}_N, X_{N+1}) \simeq \frac{1}{R} \sum_{r=1}^{R} P(t_{N+1} \mid \mathbf{w}^{(r)}, \mathbf{x}^{(N+1)}).
\tag{45.8}
\]
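To make the parameterization of equation (45.4) concrete, here is a minimal NumPy sketch of the network function. The grouping of w into the arrays W1, b1, W2, b2 (input weights, hidden biases, output weights, output bias) is our own illustrative convention, not notation from the text.

```python
import numpy as np

def y(x, W1, b1, W2, b2):
    """Two-layer network of equation (45.4): tanh hidden units, linear output.

    x  : input vector of length I
    W1 : (H, I) input weights   {w^(1)_hi}
    b1 : (H,)   hidden biases   {w^(1)_h0}
    W2 : (H,)   output weights  {w^(2)_h}
    b2 : scalar output bias      w^(2)_0
    """
    hidden = np.tanh(W1 @ x + b1)   # activations of the H hidden units
    return W2 @ hidden + b2         # linear output of the network
```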
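As a sketch of the first step of the Laplace method, the objective (45.6) can be written out explicitly once the separable Gaussian assumptions are made for the noise and the prior: it becomes a sum-of-squares data term plus a quadratic penalty on the weights. The noise and prior standard deviations, the flattened-parameter layout in unpack, and the toy data below are all illustrative choices, not values from the text.

```python
import numpy as np
from scipy.optimize import minimize

def unpack(w_flat, H, I):
    """Split a flat parameter vector into the four groups appearing in (45.4).
    (The packing order is an arbitrary choice made for this sketch.)"""
    W1 = w_flat[:H * I].reshape(H, I)
    b1 = w_flat[H * I:H * I + H]
    W2 = w_flat[H * I + H:H * I + 2 * H]
    b2 = w_flat[-1]
    return W1, b1, W2, b2

def M(w_flat, X, t, H, sigma_noise, sigma_w):
    """Objective (45.6), -ln[P(t_N | w, X_N) P(w)], up to w-independent constants,
    for separable Gaussian noise and a separable Gaussian prior on w."""
    W1, b1, W2, b2 = unpack(w_flat, H, X.shape[1])
    preds = np.array([y(x_n, W1, b1, W2, b2) for x_n in X])
    data_term = ((t - preds) ** 2).sum() / (2 * sigma_noise ** 2)   # -ln P(t_N | w, X_N) + const
    prior_term = (w_flat ** 2).sum() / (2 * sigma_w ** 2)           # -ln P(w) + const
    return data_term + prior_term

# Locate the locally most probable parameters with a general-purpose optimizer.
H, I = 4, 1                                               # illustrative sizes
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(20, I))                      # toy inputs
t = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(20)   # toy noisy targets
w0 = 0.1 * rng.standard_normal(H * I + 2 * H + 1)
w_map = minimize(M, w0, args=(X, t, H, 0.1, 1.0)).x
```

With SciPy's default BFGS method the returned result also carries an approximate inverse Hessian (hess_inv), which can stand in for the curvature of M used to define the error bars in the Laplace method.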
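Given posterior samples, from an MCMC run say, the Monte Carlo approximation (45.8) is just an average of likelihood terms. The sketch below assumes the same Gaussian noise model and reuses y and unpack from the sketches above; the function name predictive and its arguments are our own.

```python
import numpy as np
from scipy.stats import norm

def predictive(t_new, x_new, w_samples, H, sigma_noise):
    """Approximate P(t_{N+1} | t_N, X_{N+1}) via equation (45.8): average the
    Gaussian likelihood of t_new over the R posterior samples w^(r)."""
    I = len(x_new)
    densities = [
        norm.pdf(t_new, loc=y(x_new, *unpack(w_r, H, I)), scale=sigma_noise)
        for w_r in w_samples
    ]
    return float(np.mean(densities))
```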
