Information Theory, Inference, and Learning Algorithms
34 — Independent Component Analysis and Latent Variable Modelling

We could also use a tanh nonlinearity with gain β, that is, φ_i(a_i) = −tanh(βa_i), whose implicit probabilistic model is p_i(s_i) ∝ 1/[cosh(βs_i)]^{1/β}. In the limit of large β, the nonlinearity becomes a step function and the probability distribution p_i(s_i) becomes a biexponential distribution, p_i(s_i) ∝ exp(−|s_i|). In the limit β → 0, p_i(s_i) approaches a Gaussian with mean zero and variance 1/β. Heavier-tailed distributions than these may also be used. The Student and Cauchy distributions spring to mind.

Figure 34.3. Illustration of the generative models implicit in the learning algorithm. (a) Distributions over two observables generated by 1/cosh distributions on the latent variables, for G = [3/4, 1/2; 1/2, 1] (compact distribution) and G = [2, −1; −1, 3/2] (broader distribution). (b) Contours of the generative distributions when the latent variables have Cauchy distributions. The learning algorithm fits this amoeboid object to the empirical data in such a way as to maximize the likelihood. The contour plot in (b) does not adequately represent this heavy-tailed distribution. (c) Part of the tails of the Cauchy distribution, giving the contours 0.01 . . . 0.1 times the density at the origin. (d) Some data from one of the generative distributions illustrated in (b) and (c). Can you tell which? 200 samples were created, of which 196 fell in the plotted region. [Panels (a)–(d) are scatter and contour plots over the (x1, x2) plane; only the caption is reproduced here.]

Example distributions

Figures 34.3(a–c) illustrate typical distributions generated by the independent components model when the components have 1/cosh and Cauchy distributions. Figure 34.3d shows some samples from the Cauchy model. The Cauchy distribution, being the more heavy-tailed, gives the clearest picture of how the predictive distribution depends on the assumed generative parameters G.

34.3 A covariant, simpler, and faster learning algorithm

We have thus derived a learning algorithm that performs steepest descents on the likelihood function. The algorithm does not work very quickly, even on toy data; the algorithm is ill-conditioned and illustrates nicely the general advice that, while finding the gradient of an objective function is a splendid idea, ascending the gradient directly may not be. That the algorithm is ill-conditioned can be seen in the fact that it involves a matrix inverse, which can be arbitrarily large or even undefined.

Covariant optimization in general

The principle of covariance says that a consistent algorithm should give the same results independent of the units in which quantities are measured (Knuth,
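Both limits quoted for the implicit prior p_i(s_i) ∝ 1/[cosh(βs_i)]^{1/β} can be checked by expanding its log density; the following short derivation is supplied here only as a verification of the stated results:

```latex
\ln p_i(s_i) = -\tfrac{1}{\beta}\ln\cosh(\beta s_i) + \text{const}
\qquad\Longrightarrow\qquad
\begin{cases}
\beta\to\infty: & \ln\cosh(\beta s_i)\simeq \beta|s_i| - \ln 2
  \;\Rightarrow\; p_i(s_i)\propto e^{-|s_i|},\\[2pt]
\beta\to 0: & \ln\cosh(\beta s_i)\simeq \tfrac12\beta^2 s_i^2
  \;\Rightarrow\; p_i(s_i)\propto e^{-\beta s_i^2/2}.
\end{cases}
```

The first case recovers the biexponential (Laplace) density and the second a zero-mean Gaussian of variance 1/β, as stated in the text.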
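To make the generative model behind Figure 34.3 concrete, here is a minimal sampling sketch of x = Gs with independent 1/cosh or Cauchy latent variables. The inverse-CDF sampler, the random seed, the sample size of 200, and the pairing of mixing matrices with panels are choices made here to mirror the caption; this is an illustrative sketch, not the book's own code.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen here only for reproducibility

def sample_sech(n):
    """Draw n samples from p(s) = 1/(pi*cosh(s)) by inverting its CDF,
    F(s) = (2/pi) * arctan(e^s)."""
    u = rng.uniform(size=n)
    return np.log(np.tan(np.pi * u / 2.0))

# Mixing matrices from the Figure 34.3 caption.
G_compact = np.array([[0.75, 0.5],
                      [0.5,  1.0]])
G_broad   = np.array([[ 2.0, -1.0],
                      [-1.0,  1.5]])

# Panel (a)-style data: 1/cosh latent variables mixed through G.
S = np.vstack([sample_sech(200), sample_sech(200)])   # 2 x 200 latent variables
X = G_compact @ S                                     # 2 x 200 observables x = G s

# Panels (b)-(d)-style data: the same model with Cauchy latent variables.
S_cauchy = rng.standard_cauchy(size=(2, 200))
X_cauchy = G_broad @ S_cauchy
```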
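The matrix inverse blamed for the ill-conditioning in Section 34.3 appears explicitly once a single steepest-ascent step is written out. The sketch below assumes the standard maximum-likelihood ICA gradient, ∂ ln P/∂W_ij = [W^{−T}]_ij + z_i x_j with a = Wx and z_i = φ_i(a_i), which is the form the chapter's earlier derivation leads to; the step size η and the one-observation update are illustrative choices made here.

```python
import numpy as np

def steepest_ascent_step(W, x, eta=0.01, beta=1.0):
    """One steepest-ascent step on ln P(x | W) for a single observation x,
    using the tanh-with-gain nonlinearity phi_i(a_i) = -tanh(beta * a_i)."""
    a = W @ x                                      # activities a = W x
    z = -np.tanh(beta * a)                         # z_i = phi_i(a_i) = d ln p_i(a_i)/da_i
    grad = np.linalg.inv(W.T) + np.outer(z, x)     # [W^-T]_ij + z_i x_j
    # The inv(W.T) term is the matrix inverse the text points to: it can be
    # arbitrarily large, and it is undefined when W is singular.
    return W + eta * grad
```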
