Information Theory, Inference, and Learning Algorithms
… with arbitrarily small probability of error.

9.2 Review of probability and information

As an example, we take the joint distribution XY from exercise 8.6 (p.140).
The marginal distributions P(x) and P(y) are shown in the margins.

  P(x, y)              x                  P(y)
                1     2     3     4
         1     1/8   1/16  1/32  1/32     1/4
    y    2     1/16  1/8   1/32  1/32     1/4
         3     1/16  1/16  1/16  1/16     1/4
         4     1/4   0     0     0        1/4

  P(x)         1/2   1/4   1/8   1/8

The joint entropy is H(X, Y) = 27/8 bits. The marginal entropies are
H(X) = 7/4 bits and H(Y) = 2 bits.

We can compute the conditional distribution of x for each value of y, and
the entropy of each of those conditional distributions:

  P(x | y)             x                  H(X | y)/bits
                1     2     3     4
         1     1/2   1/4   1/8   1/8      7/4
    y    2     1/4   1/2   1/8   1/8      7/4
         3     1/4   1/4   1/4   1/4      2
         4     1     0     0     0        0

                                          H(X | Y) = 11/8

Note that whereas H(X | y = 4) = 0 is less than H(X), H(X | y = 3) is greater
than H(X). So in some cases, learning y can increase our uncertainty about x.
Note also that although P(x | y = 2) is a different distribution from P(x),
the conditional entropy H(X | y = 2) is equal to H(X). So learning that y is 2
changes our knowledge about x but does not reduce the uncertainty of x, as
measured by the entropy. On average, though, learning y does convey
information about x, since H(X | Y) < H(X).

One may also evaluate H(Y | X) = 13/8 bits. The mutual information is
I(X; Y) = H(X) − H(X | Y) = 3/8 bits.

9.3 Noisy channels

A discrete memoryless channel Q is characterized by an input alphabet A_X,
an output alphabet A_Y, and a set of conditional probability distributions
P(y | x), one for each x ∈ A_X.

These transition probabilities may be written in a matrix

    Q_{j|i} = P(y = b_j | x = a_i).                                    (9.1)

I usually orient this matrix with the output variable j indexing the rows
and the input variable i indexing the columns, so that each column of Q is a
probability vector. With this convention, we can obtain the probability of
the output, p_Y, from a probability distribution over the input, p_X, by
right-multiplication:

    p_Y = Q p_X.                                                       (9.2)
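As a quick numerical check of the figures quoted in section 9.2 above (the check is not part of the original text), here is a short Python/NumPy sketch that recomputes H(X, Y), H(X), H(Y), H(X | Y) and I(X; Y) directly from the joint table. The names P_xy and entropy are illustrative choices, not anything from the book.

```python
import numpy as np

# Joint distribution P(x, y) from the table above:
# rows index y = 1..4, columns index x = 1..4.
P_xy = np.array([
    [1/8,  1/16, 1/32, 1/32],
    [1/16, 1/8,  1/32, 1/32],
    [1/16, 1/16, 1/16, 1/16],
    [1/4,  0,    0,    0   ],
])

def entropy(p):
    """Shannon entropy in bits, using the convention 0 * log(1/0) = 0."""
    p = p[p > 0]
    return float(np.sum(p * np.log2(1 / p)))

P_x = P_xy.sum(axis=0)        # marginal P(x) = [1/2, 1/4, 1/8, 1/8]
P_y = P_xy.sum(axis=1)        # marginal P(y) = [1/4, 1/4, 1/4, 1/4]

H_XY = entropy(P_xy.ravel())  # joint entropy H(X, Y) = 27/8 bits
H_X  = entropy(P_x)           # H(X) = 7/4 bits
H_Y  = entropy(P_y)           # H(Y) = 2 bits

# Conditional entropy H(X | Y) = sum_y P(y) H(X | y) = 11/8 bits
H_X_given_Y = sum(P_y[i] * entropy(P_xy[i] / P_y[i]) for i in range(4))

# Mutual information I(X; Y) = H(X) - H(X | Y) = 3/8 bits
I_XY = H_X - H_X_given_Y

print(H_XY, H_X, H_Y, H_X_given_Y, I_XY)  # 3.375 1.75 2.0 1.375 0.375
```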

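The matrix convention of equations (9.1) and (9.2) can be illustrated the same way. The sketch below assumes a small example channel, a binary symmetric channel with flip probability f = 0.1, which is not defined in this excerpt and is used here only to show that each column of Q is a probability vector and that the output distribution is obtained by right-multiplication.

```python
import numpy as np

# Assumed example channel (not from this section): a binary symmetric
# channel with flip probability f = 0.1. Following the convention above,
# rows index outputs y and columns index inputs x, so column i holds
# P(y | x = a_i).
f = 0.1
Q = np.array([
    [1 - f, f    ],   # P(y = 0 | x = 0), P(y = 0 | x = 1)
    [f,     1 - f],   # P(y = 1 | x = 0), P(y = 1 | x = 1)
])

# Each column sums to one, i.e. each column is a probability vector.
assert np.allclose(Q.sum(axis=0), 1.0)

# With an input distribution p_X, the output distribution is p_Y = Q p_X  (9.2).
p_X = np.array([0.9, 0.1])
p_Y = Q @ p_X
print(p_Y)   # [0.82 0.18]
```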