… is a binding site found upstream of the alpha-actin gene in humans. Does the fact that some binding sites consist of a repeated subsequence influence your answer to part (a)?

12.7 Solutions

Solution to exercise 12.1 (p.194). First imagine comparing the string $x$ with another random string $x^{(s)}$. The probability that the first bits of the two strings match is 1/2. The probability that the second bits match is 1/2. Assuming we stop comparing once we hit the first mismatch, the expected number of matches is 1, so the expected number of comparisons is 2 (exercise 2.34, p.38).

Assuming the correct string is located at random in the raw list, we will have to compare with an average of $S/2$ strings before we find it, which costs $2 \times S/2$ binary comparisons; and comparing the correct strings takes $N$ binary comparisons, giving a total expectation of $S + N$ binary comparisons, if the strings are chosen at random.

In the worst case (which may indeed happen in practice), the other strings are very similar to the search key, so that a lengthy sequence of comparisons is needed to find each mismatch. The worst case is when the correct string is last in the list, and all the other strings differ in the last bit only, giving a requirement of $SN$ binary comparisons.

Solution to exercise 12.2 (p.197). The likelihood ratio for the two hypotheses, $H_0\colon x^{(s)} = x$ and $H_1\colon x^{(s)} \neq x$, contributed by the datum 'the first bits of $x^{(s)}$ and $x$ are equal' is

\[
\frac{P(\text{Datum} \mid H_0)}{P(\text{Datum} \mid H_1)} = \frac{1}{1/2} = 2.
\tag{12.5}
\]

If the first $r$ bits all match, the likelihood ratio is $2^r$ to one. On finding that 30 bits match, the odds are a billion to one in favour of $H_0$, assuming we start from even odds. [For a complete answer, we should compute the evidence given by the prior information that the hash entry $s$ has been found in the table at $h(x)$. This fact gives further evidence in favour of $H_0$.]

Solution to exercise 12.3 (p.198). Let the hash function have an output alphabet of size $T = 2^M$. If $M$ were equal to $\log_2 S$ then we would have exactly enough bits for each entry to have its own unique hash. The probability that one particular pair of entries collide under a random hash function is $1/T$. The number of pairs is $S(S-1)/2$. So the expected number of collisions between pairs is exactly

\[
S(S-1)/(2T).
\tag{12.6}
\]

If we would like this to be smaller than 1, then we need $T > S(S-1)/2$, so

\[
M > 2 \log_2 S.
\tag{12.7}
\]

We need twice as many bits as the number of bits, $\log_2 S$, that would be sufficient to give each entry a unique name.

If we are happy to have occasional collisions, involving a fraction $f$ of the names $S$, then we need $T > S/f$ (since the probability that one particular name is collided-with is $f \simeq S/T$), so

\[
M > \log_2 S + \log_2 [1/f],
\tag{12.8}
\]
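As a numerical check on the first part of the solution to exercise 12.1, here is a minimal simulation sketch (my own Python, not from the book; the string length N = 30 and the trial count are arbitrary illustrative choices). It estimates the average number of bit comparisons made before a random string is rejected, which should come out close to 2.

```python
import random

def comparisons(x, y):
    """Bit comparisons made up to and including the first mismatch
    (or all N if the strings happen to be identical)."""
    count = 0
    for a, b in zip(x, y):
        count += 1
        if a != b:
            break
    return count

N = 30                 # string length in bits (illustrative value)
trials = 100_000
rand_string = lambda: [random.randint(0, 1) for _ in range(N)]

x = rand_string()
avg = sum(comparisons(x, rand_string()) for _ in range(trials)) / trials
print(avg)             # close to 2, as argued in the solution
```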
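The odds calculation in the solution to exercise 12.2 is easy to mechanize; this tiny sketch (the helper name posterior_odds is my own, not from the book) simply multiplies the prior odds by the likelihood ratio $2^r$.

```python
def posterior_odds(r, prior_odds=1.0):
    """Odds in favour of H0 (retrieved string equals the key)
    after observing r matching bits, starting from prior_odds."""
    return prior_odds * 2 ** r

print(posterior_odds(30))   # 2**30 is about 1.07e9: 'a billion to one'
```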
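The bounds in the solution to exercise 12.3 translate directly into a few lines of arithmetic. The sketch below is my own illustration, not code from the book; the function names and the example values S = 2**20 and f = 0.01 are assumptions chosen for the demonstration.

```python
import math

def expected_collisions(S, M):
    """Expected number of colliding pairs under a random hash: S(S-1)/(2T), T = 2**M."""
    return S * (S - 1) / (2 * 2 ** M)

def bits_for_rare_collisions(S):
    """Smallest M with expected collisions below 1, i.e. roughly M > 2 log2 S (eq. 12.7)."""
    M = 1
    while expected_collisions(S, M) >= 1:
        M += 1
    return M

def bits_for_collision_fraction(S, f):
    """M > log2 S + log2(1/f) (eq. 12.8): tolerate a fraction f of collided-with names."""
    return math.ceil(math.log2(S) + math.log2(1 / f))

S = 2 ** 20                                    # about a million entries (illustrative)
print(bits_for_rare_collisions(S))             # 39, close to 2 * log2(S) = 40
print(bits_for_collision_fraction(S, 0.01))    # 27, close to 20 + 6.6
```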
