44 — Supervised Learning in Multilayer Networks

… and weights θ_j^(1), w_jl^(1), θ_i^(2) and w_ij^(2) to random values, and plot the resulting function y(x). I set the hidden units' biases θ_j^(1) to random values from a Gaussian with zero mean and standard deviation σ_bias; the input-to-hidden weights w_jl^(1) to random values with standard deviation σ_in; and the bias and output weights θ_i^(2) and w_ij^(2) to random values with standard deviation σ_out.

The sort of functions that we obtain depends on the values of σ_bias, σ_in and σ_out. As the weights and biases are made bigger we obtain more complex functions with more features and a greater sensitivity to the input variable. The vertical scale of a typical function produced by the network with random weights is of order √H σ_out; the horizontal range in which the function varies significantly is of order σ_bias/σ_in; and the shortest horizontal length scale is of order 1/σ_in.

[Figure 44.3. Properties of a function produced by a random network. The vertical scale of a typical function produced by the network with random weights is of order √H σ_out; the horizontal range in which the function varies significantly is of order σ_bias/σ_in; and the shortest horizontal length scale is of order 1/σ_in. The function shown was produced by making a random network with H = 400 hidden units, and Gaussian weights with σ_bias = 4, σ_in = 8, and σ_out = 0.5.]

Radford Neal (1996) has also shown that in the limit as H → ∞ the statistical properties of the functions generated by randomizing the weights are independent of the number of hidden units; so, interestingly, the complexity of the functions becomes independent of the number of parameters in the model. What determines the complexity of the typical functions is the characteristic magnitude of the weights. Thus we anticipate that when we fit these models to real data, an important way of controlling the complexity of the fitted function will be to control the characteristic magnitude of the weights.

Figure 44.4 shows one typical function produced by a network with two inputs and one output. This should be contrasted with the function produced by a traditional linear regression model, which is a flat plane. Neural networks can create functions of far greater complexity than a linear regression can.

[Figure 44.4. One sample from the prior of a two-input network with {H, σ^w_in, σ^w_bias, σ^w_out} = {400, 8.0, 8.0, 0.05}.]
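To make the construction concrete, the following sketch (Python/NumPy, not from the book) samples one function from the prior of a one-hidden-layer network with a single input and output, using the parameter values quoted in figure 44.3. It assumes tanh hidden units and a linear output unit; the function and variable names are illustrative.

```python
# Sketch: draw one function y(x) from the prior of a one-hidden-layer
# tanh network with Gaussian random weights, as described in the text.
import numpy as np
import matplotlib.pyplot as plt

def sample_random_network_function(x, H=400, sigma_bias=4.0,
                                    sigma_in=8.0, sigma_out=0.5, seed=0):
    """Return y(x) for one random draw of the weights (1 input, 1 output)."""
    rng = np.random.default_rng(seed)
    theta1 = rng.normal(0.0, sigma_bias, size=H)   # hidden-unit biases
    w1 = rng.normal(0.0, sigma_in, size=H)         # input-to-hidden weights
    theta2 = rng.normal(0.0, sigma_out)            # output bias
    w2 = rng.normal(0.0, sigma_out, size=H)        # hidden-to-output weights
    hidden = np.tanh(np.outer(x, w1) + theta1)     # shape (len(x), H)
    return hidden @ w2 + theta2                    # linear output unit

x = np.linspace(-2.0, 4.0, 500)
plt.plot(x, sample_random_network_function(x))
plt.xlabel("Input x"); plt.ylabel("Output y")
plt.show()
```

With these settings the typical vertical scale is of order √H σ_out = 20 × 0.5 = 10, in agreement with the scaling argument above.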
(44.3)niThis objective function is a sum of terms, one for each input/target pair {x, t},measuring how close the output y(x; w) is to the target t.This minimization is based on repeated evaluation of the gradient of E D .This gradient can be efficiently computed using the backpropagation algorithm(Rumelhart et al., 1986), which uses the chain rule to find the derivatives.
