10.07.2015 Views

Information Theory, Inference, and Learning ... - Inference Group

Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted. http://www.cambridge.org/0521642981
You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links.

we never actually solved) of making an encoder for a typical-set compression code.

The second time that we mapped bit strings to bit strings of another dimensionality was when we studied channel codes. There, we considered codes that mapped from K bits to N bits, with N greater than K, and we made theoretical progress using random codes.

In hash codes, we put together these two notions. We will study random codes that map from N bits to M bits where M is smaller than N.

The idea is that we will map the original high-dimensional space down into a lower-dimensional space, one in which it is feasible to implement the dumb look-up table method which we rejected a moment ago.

12.2 Hash codes

First we will describe how a hash code works, then we will study the properties of idealized hash codes. A hash code implements a solution to the information-retrieval problem, that is, a mapping from x to s, with the help of a pseudo-random function called a hash function, which maps the N-bit string x to an M-bit string h(x), where M is smaller than N. M is typically chosen such that the ‘table size’ T ≃ 2^M is a little bigger than S – say, ten times bigger. For example, if we were expecting S to be about a million, we might map x into a 30-bit hash h (regardless of the size N of each item x). The hash function is some fixed deterministic function which should ideally be indistinguishable from a fixed random code. For practical purposes, the hash function must be quick to compute.

  string length           N ≃ 200
  number of strings       S ≃ 2^23
  size of hash function   M ≃ 30 bits
  size of hash table      T = 2^M ≃ 2^30

Figure 12.2. Revised cast of characters.

Two simple examples of hash functions are:

Division method. The table size T is a prime number, preferably one that is not close to a power of 2. The hash value is the remainder when the integer x is divided by T.

Variable string addition method. This method assumes that x is a string of bytes and that the table size T is 256. The characters of x are added, modulo 256. This hash function has the defect that it maps strings that are anagrams of each other onto the same hash. It may be improved by putting the running total through a fixed pseudo-random permutation after each character is added. In the variable string exclusive-or method with table size ≤ 65 536, the string is hashed twice in this way, with the initial running total being set to 0 and 1 respectively (algorithm 12.3). The result is a 16-bit hash.

Having picked a hash function h(x), we implement an information retriever as follows. (See figure 12.4.)

Encoding. A piece of memory called the hash table is created of size 2^M b memory units, where b is the amount of memory needed to represent an integer between 0 and S. This table is initially set to zero throughout. Each memory x^(s) is put through the hash function, and at the location in the hash table corresponding to the resulting vector h^(s) = h(x^(s)), the integer s is written – unless that entry in the hash table is already occupied, in which case we have a collision between x^(s) and some earlier x^(s′) which both happen to have the same hash code. Collisions can be handled in various ways – we will discuss some in a moment – but first let us complete the basic picture.
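
As a rough illustration of the two simple hash functions and of the encoding step, here is a minimal Python sketch. It is not the book's algorithm 12.3. The names division_hash, addition_hash and encode are hypothetical. The sketch also assumes that the memories are numbered s = 1, ..., S so that the value 0 can mark an empty table entry, and it merely records collisions rather than resolving them.

```python
def division_hash(x: int, table_size: int) -> int:
    """Division method: the remainder when the integer x is divided by T."""
    return x % table_size


def addition_hash(x: bytes) -> int:
    """Variable string addition method: the bytes of x added modulo 256."""
    total = 0
    for byte in x:
        total = (total + byte) % 256
    return total


def encode(memories, hash_fn, table_size):
    """Write the index s of each memory x^(s) at location h(x^(s)).

    Assumption: s runs from 1 to S, so 0 means 'entry not occupied'.
    A collision is recorded as the pair (s, earlier s') and otherwise ignored.
    """
    table = [0] * table_size          # table initially set to zero throughout
    collisions = []
    for s, x in enumerate(memories, start=1):
        h = hash_fn(x)
        if table[h] == 0:
            table[h] = s              # entry free: store the integer s
        else:
            collisions.append((s, table[h]))
    return table, collisions


# Anagrams share a hash under the addition method, so they collide.
names = [b"listen", b"silent", b"hash"]
table, collisions = encode(names, addition_hash, 256)
```

Here b"listen" and b"silent" land on the same table entry, which is the anagram defect described above; with the division method one would instead pass something like lambda x: division_hash(x, T), with the integers x and a prime T chosen by the user.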
