10.07.2015 Views

Information Theory, Inference, and Learning ... - Inference Group

Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted. http://www.cambridge.org/0521642981
You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links.

194    12 - Hash Codes: Codes for Efficient Information Retrieval

memory is independent of the memory size. But in our definition of the task, we assumed that N is about 200 bits or more, so the amount of memory required would be of size 2^200; this solution is completely out of the question. Bear in mind that the number of particles in the solar system is only about 2^190.

The raw list is a simple list of ordered pairs (s, x^(s)) ordered by the value of s. The mapping from x to s is achieved by searching through the list of strings, starting from the top, and comparing the incoming string x with each record x^(s) until a match is found. This system is very easy to maintain, and uses a small amount of memory, about SN bits, but is rather slow to use, since on average five million pairwise comparisons will be made.

⊲ Exercise 12.1. [2, p.202] Show that the average time taken to find the required string in a raw list, assuming that the original names were chosen at random, is about S + N binary comparisons. (Note that you don't have to compare the whole string of length N, since a comparison can be terminated as soon as a mismatch occurs; show that you need on average two binary comparisons per incorrect string match.) Compare this with the worst-case search time, assuming that the devil chooses the set of strings and the search key.

The standard way in which phone directories are made improves on the look-up table and the raw list by using an alphabetically-ordered list.

Alphabetical list. The strings {x^(s)} are sorted into alphabetical order. Searching for an entry now usually takes less time than was needed for the raw list because we can take advantage of the sortedness; for example, we can open the phonebook at its middle page, and compare the name we find there with the target string; if the target is 'greater' than the middle string then we know that the required string, if it exists, will be found in the second half of the alphabetical directory. Otherwise, we look in the first half. By iterating this splitting-in-the-middle procedure, we can identify the target string, or establish that the string is not listed, in ⌈log_2 S⌉ string comparisons. The expected number of binary comparisons per string comparison will tend to increase as the search progresses, but the total number of binary comparisons required will be no greater than ⌈log_2 S⌉ N.

The amount of memory required is the same as that required for the raw list.

Adding new strings to the database requires that we insert them in the correct location in the list. To find that location takes about ⌈log_2 S⌉ binary comparisons.

Can we improve on the well-established alphabetized list? Let us consider our task from some new viewpoints.

The task is to construct a mapping x → s from N bits to log_2 S bits. This is a pseudo-invertible mapping, since for any x that maps to a non-zero s, the customer database contains the pair (s, x^(s)) that takes us back. Where have we come across the idea of mapping from N bits to M bits before?

We encountered this idea twice: first, in source coding, we studied block codes which were mappings from strings of N symbols to a selection of one label in a list. The task of information retrieval is similar to the task (which
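
The raw-list and alphabetical-list searches described above can be compared with a minimal Python sketch. It is an illustration under stated assumptions, not code from the book: names are modelled as equal-length strings of '0' and '1', the function names (`compare_bits`, `raw_list_search`, `alphabetical_search`) are hypothetical, and a return value of s = 0 is assumed to mean "not listed", following the text's remark about non-zero s.

```python
import random


def compare_bits(a, b):
    """Compare two equal-length bit strings from left to right.

    Returns (order, bits_examined). order is -1, 0 or 1; bits_examined is
    the number of binary comparisons made, since the comparison stops at
    the first mismatch.
    """
    for i, (p, q) in enumerate(zip(a, b), start=1):
        if p != q:
            return (-1 if p < q else 1), i
    return 0, len(a)


def raw_list_search(records, x):
    """Scan the raw list from the top.

    Returns (s, binary_comparisons), with s counted from 1 and s = 0
    meaning "not listed" (an assumption made for this sketch).
    """
    total = 0
    for s, stored in enumerate(records, start=1):
        order, used = compare_bits(x, stored)
        total += used
        if order == 0:
            return s, total
    return 0, total


def alphabetical_search(sorted_records, x):
    """Split-in-the-middle search of a sorted list.

    Returns (index or None, string_comparisons, binary_comparisons).
    """
    lo, hi = 0, len(sorted_records)
    string_cmps = bit_cmps = 0
    while lo < hi:
        mid = (lo + hi) // 2
        order, used = compare_bits(x, sorted_records[mid])
        string_cmps += 1
        bit_cmps += used
        if order == 0:
            return mid, string_cmps, bit_cmps
        if order > 0:
            lo = mid + 1   # target is 'greater': look in the second half
        else:
            hi = mid       # otherwise look in the first half
    return None, string_cmps, bit_cmps


if __name__ == "__main__":
    rng = random.Random(0)
    S, N = 1000, 24  # illustrative sizes, far smaller than the text's
    names = [format(v, f"0{N}b") for v in rng.sample(range(2 ** N), S)]

    # Average cost of finding each listed name in the raw list.
    raw_avg = sum(raw_list_search(names, x)[1] for x in names) / S
    print("raw list, mean binary comparisons:", raw_avg, "vs S + N =", S + N)

    # Worst case over listed names in the alphabetical list.
    ordered = sorted(names)
    worst = max(alphabetical_search(ordered, x)[1] for x in ordered)
    print("sorted list, max string comparisons:", worst)
```

This three-way variant of the halving procedure makes at most ⌊log_2 S⌋ + 1 string comparisons, which equals ⌈log_2 S⌉ unless S is an exact power of two. Each string comparison costs at most N binary comparisons, so the total stays close to the ⌈log_2 S⌉ N bound quoted above.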
