43.1 From Hopfield networks to Boltzmann machines

We concentrate on the first term in the numerator, the likelihood, and derive a maximum likelihood algorithm (though there might be advantages in pursuing a full Bayesian approach as we did in the case of the single neuron). We differentiate the logarithm of the likelihood,
\[
\ln \prod_{n=1}^{N} P(\mathbf{x}^{(n)} \,|\, \mathbf{W}) = \sum_{n=1}^{N} \left[ \tfrac{1}{2} \mathbf{x}^{(n)\mathsf{T}} \mathbf{W} \mathbf{x}^{(n)} - \ln Z(\mathbf{W}) \right], \tag{43.6}
\]
with respect to $w_{ij}$, bearing in mind that $\mathbf{W}$ is defined to be symmetric with $w_{ji} = w_{ij}$.

Exercise 43.1.[2] Show that the derivative of $\ln Z(\mathbf{W})$ with respect to $w_{ij}$ is
\[
\frac{\partial}{\partial w_{ij}} \ln Z(\mathbf{W}) = \sum_{\mathbf{x}} x_i x_j P(\mathbf{x} \,|\, \mathbf{W}) = \langle x_i x_j \rangle_{P(\mathbf{x} \,|\, \mathbf{W})}. \tag{43.7}
\]
[This exercise is similar to exercise 22.12 (p.307).]

The derivative of the log likelihood is therefore:
\[
\frac{\partial}{\partial w_{ij}} \ln P(\{\mathbf{x}^{(n)}\}_{1}^{N} \,|\, \mathbf{W}) = \sum_{n=1}^{N} \left[ x_i^{(n)} x_j^{(n)} - \langle x_i x_j \rangle_{P(\mathbf{x} \,|\, \mathbf{W})} \right] \tag{43.8}
\]
\[
= N \left[ \langle x_i x_j \rangle_{\text{Data}} - \langle x_i x_j \rangle_{P(\mathbf{x} \,|\, \mathbf{W})} \right]. \tag{43.9}
\]
This gradient is proportional to the difference of two terms. The first term is the empirical correlation between $x_i$ and $x_j$,
\[
\langle x_i x_j \rangle_{\text{Data}} \equiv \frac{1}{N} \sum_{n=1}^{N} x_i^{(n)} x_j^{(n)}, \tag{43.10}
\]
and the second term is the correlation between $x_i$ and $x_j$ under the current model,
\[
\langle x_i x_j \rangle_{P(\mathbf{x} \,|\, \mathbf{W})} \equiv \sum_{\mathbf{x}} x_i x_j P(\mathbf{x} \,|\, \mathbf{W}). \tag{43.11}
\]
The first correlation $\langle x_i x_j \rangle_{\text{Data}}$ is readily evaluated – it is just the empirical correlation between the activities in the real world. The second correlation, $\langle x_i x_j \rangle_{P(\mathbf{x} \,|\, \mathbf{W})}$, is not so easy to evaluate, but it can be estimated by Monte Carlo methods, that is, by observing the average value of $x_i x_j$ while the activity rule of the Boltzmann machine, equation (43.3), is iterated.

In the special case $\mathbf{W} = 0$, we can evaluate the gradient exactly because, by symmetry, the correlation $\langle x_i x_j \rangle_{P(\mathbf{x} \,|\, \mathbf{W})}$ must be zero. If the weights are adjusted by gradient descent with learning rate $\eta$, then, after one iteration, the weights will be
\[
w_{ij} = \eta \sum_{n=1}^{N} x_i^{(n)} x_j^{(n)}, \tag{43.12}
\]
precisely the value of the weights given by the Hebb rule, equation (16.5), with which we trained the Hopfield network.

Interpretation of Boltzmann machine learning

One way of viewing the two terms in the gradient (43.9) is as 'waking' and 'sleeping' rules. While the network is 'awake', it measures the correlation between $x_i$ and $x_j$ in the real world, and weights are increased in proportion.
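To make the learning rule concrete, here is a minimal sketch (not from the book) of gradient-based learning for a fully visible Boltzmann machine, written in Python/NumPy. It assumes ±1 states, a symmetric weight matrix with no self-connections, and implements the activity rule as a Gibbs sampler with P(x_i = +1) = 1/(1 + e^{-2 a_i}); the function names, learning rate, and sample counts are illustrative choices, not taken from the text. The 'wake' term is the empirical correlation (43.10), and the 'sleep' term estimates (43.11) by Monte Carlo, as described above.

    import numpy as np

    rng = np.random.default_rng(0)

    def sample_model_correlations(W, n_samples=5000, burn_in=500):
        """Estimate <x_i x_j> under P(x | W), equation (43.11), by Monte Carlo:
        run a Gibbs sampler over +/-1 states and average the outer products x x^T."""
        K = W.shape[0]
        x = rng.choice([-1.0, 1.0], size=K)
        corr = np.zeros_like(W)
        for t in range(burn_in + n_samples):
            for i in range(K):
                a = W[i] @ x                      # activation a_i = sum_j w_ij x_j
                p_plus = 1.0 / (1.0 + np.exp(-2.0 * a))
                x[i] = 1.0 if rng.random() < p_plus else -1.0
            if t >= burn_in:
                corr += np.outer(x, x)
        return corr / n_samples

    def learning_step(W, X, eta=0.05):
        """One step of the learning rule: dw_ij is proportional to
        <x_i x_j>_Data - <x_i x_j>_{P(x|W)}, the averaged gradient (43.9)/N."""
        wake = X.T @ X / X.shape[0]               # empirical correlations, equation (43.10)
        sleep = sample_model_correlations(W)      # model correlations, equation (43.11)
        dW = eta * (wake - sleep)
        np.fill_diagonal(dW, 0.0)                 # keep the no-self-connection convention
        return W + dW

    # Toy data: unit 1 copies unit 0, unit 2 is independent.
    X = rng.choice([-1.0, 1.0], size=(200, 3))
    X[:, 1] = X[:, 0]
    W = np.zeros((3, 3))                          # starting from W = 0, the 'sleep' term
    for _ in range(20):                           # is zero on average, so the first step
        W = learning_step(W, X)                   # is (approximately) Hebbian, cf. (43.12)

Because this sketch uses the averaged gradient (43.9)/N, the first step from W = 0 matches the Hebb-rule weights (43.12) only up to a factor of 1/N, and the Monte Carlo 'sleep' estimate is zero there on average rather than exactly.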
