41  Learning as Inference

41.1 Neural network learning as inference

In Chapter 39 we trained a simple neural network as a classifier by minimizing an objective function

    M(w) = G(w) + \alpha E_W(w)    (41.1)

made up of an error function

    G(w) = -\sum_n \left[ t^{(n)} \ln y(x^{(n)}; w) + (1 - t^{(n)}) \ln(1 - y(x^{(n)}; w)) \right]    (41.2)

and a regularizer

    E_W(w) = \frac{1}{2} \sum_i w_i^2 .    (41.3)

This neural network learning process can be given the following probabilistic interpretation.

We interpret the output y(x; w) of the neuron literally as defining (when its parameters w are specified) the probability that an input x belongs to class t = 1, rather than the alternative t = 0. Thus y(x; w) ≡ P(t = 1 | x, w). Then each value of w defines a different hypothesis about the probability of class 1 relative to class 0 as a function of x.

We define the observed data D to be the targets {t} – the inputs {x} are assumed to be given, and not to be modelled. To infer w given the data, we require a likelihood function and a prior probability over w. The likelihood function measures how well the parameters w predict the observed data; it is the probability assigned to the observed t values by the model with parameters set to w. Now the two equations

    P(t = 1 | w, x) = y
    P(t = 0 | w, x) = 1 - y    (41.4)

can be rewritten as the single equation

    P(t | w, x) = y^t (1 - y)^{1-t} = \exp\left[ t \ln y + (1 - t) \ln(1 - y) \right] .    (41.5)

So the error function G can be interpreted as minus the log likelihood:

    P(D | w) = \exp[-G(w)] .    (41.6)

Similarly the regularizer can be interpreted in terms of a log prior probability distribution over the parameters:

    P(w | \alpha) = \frac{1}{Z_W(\alpha)} \exp(-\alpha E_W) .    (41.7)
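Since the regularizer (41.3) is quadratic, the prior (41.7) is a zero-mean Gaussian over the weights, and its normalizing constant Z_W(\alpha) follows from the standard Gaussian integral. A short worked derivation (K, the total number of weights, is notation introduced here for the derivation and does not appear in the excerpt):

    P(w | \alpha) = \frac{1}{Z_W(\alpha)} \exp\Big( -\frac{\alpha}{2} \sum_{i=1}^{K} w_i^2 \Big) = \prod_{i=1}^{K} \sqrt{\frac{\alpha}{2\pi}} \exp\Big( -\frac{\alpha w_i^2}{2} \Big) , \quad \text{so} \quad Z_W(\alpha) = \int \exp(-\alpha E_W(w)) \, \mathrm{d}^K w = \Big( \frac{2\pi}{\alpha} \Big)^{K/2} .

Each weight thus has prior variance 1/\alpha: a large \alpha expresses a strong prior belief that the weights are small.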

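Combining (41.6) and (41.7) by Bayes' theorem, the posterior satisfies P(w | D, \alpha) \propto \exp[-M(w)], so the weight vector that minimizes the objective M is also the most probable one given the data. The following sketch (Python with NumPy; the single logistic neuron y(x; w) = 1/(1 + e^{-w \cdot x}) and the toy inputs, targets, and weights are assumptions made for illustration, not taken from the text) evaluates M(w) and checks numerically that exp[-G(w)] equals the product of the Bernoulli factors in (41.5):

```python
import numpy as np

def sigmoid(a):
    # Logistic activation; here we assume y(x; w) = sigmoid(w . x).
    return 1.0 / (1.0 + np.exp(-a))

def G(w, X, t):
    # Cross-entropy error function, equation (41.2).
    y = sigmoid(X @ w)
    return -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))

def E_W(w):
    # Weight-decay regularizer, equation (41.3).
    return 0.5 * np.sum(w ** 2)

def M(w, X, t, alpha):
    # Full objective function, equation (41.1).
    return G(w, X, t) + alpha * E_W(w)

# Toy data (hypothetical): three two-dimensional inputs with binary targets.
X = np.array([[1.0, 2.0], [0.5, -1.0], [-1.5, 0.3]])
t = np.array([1.0, 0.0, 0.0])
w = np.array([0.2, -0.4])

# Check equation (41.6): the likelihood P(D | w) = exp(-G(w)) equals the
# product over examples of the Bernoulli terms y^t (1 - y)^(1 - t) of (41.5).
y = sigmoid(X @ w)
likelihood = np.prod(y ** t * (1 - y) ** (1 - t))
assert np.isclose(likelihood, np.exp(-G(w, X, t)))

print("M(w) =", M(w, X, t, alpha=0.01))
```

The assertion holds because the exponential of the sum in G factorizes into a product over the training examples, which is exactly the rewriting performed in (41.5).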