23.11.2014 Views

Data Structures and Algorithms in Java[1].pdf - Fulvio Frisone


static int hashCode(long i) { return (int) ((i >> 32) + (int) i); }

Indeed, the approach of summing components can be extended to any object x whose binary representation can be viewed as a k-tuple $(x_0, x_1, \ldots, x_{k-1})$ of integers, for we can then form a hash code for x as $\sum_{i=0}^{k-1} x_i$. For example, given any floating point number, we can sum its mantissa and exponent as long integers, and then apply a hash code for long integers to the result.
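The floating point example can be sketched in Java as follows. This is only an illustration under stated assumptions: the class name `FloatSumHash` and the helper `hashDouble` are hypothetical, the number is taken to be an IEEE 754 `double`, and the sign bit is ignored. The standard library call `Double.doubleToLongBits` exposes the raw bits.

```java
final class FloatSumHash {
    // Hash code for a long, as given in the text: add the high and low 32-bit halves.
    static int hashCode(long i) {
        return (int) ((i >> 32) + (int) i);
    }

    // Hypothetical helper: sum the exponent and mantissa fields of a double,
    // then hash the resulting long. The sign bit is ignored in this sketch.
    static int hashDouble(double d) {
        long bits = Double.doubleToLongBits(d);
        long exponent = (bits >> 52) & 0x7FFL;      // 11-bit biased exponent field
        long mantissa = bits & ((1L << 52) - 1);    // 52-bit fraction field
        return hashCode(mantissa + exponent);
    }

    public static void main(String[] args) {
        System.out.println(hashDouble(1.0)); // exponent field 1023, fraction 0
        System.out.println(hashDouble(2.0)); // exponent field 1024, fraction 0
    }
}
```

Because the fields are simply added, different numbers whose fields have the same sum receive the same hash code, which is the usual weakness of a summation hash.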

Polynomial Hash Codes

The summation hash code, described above, is not a good choice for character strings or other variable-length objects that can be viewed as tuples of the form $(x_0, x_1, \ldots, x_{k-1})$, where the order of the $x_i$'s is significant. For example, consider a hash code for a character string s that sums the ASCII (or Unicode) values of the characters in s. This hash code unfortunately produces lots of unwanted collisions for common groups of strings. In particular, "temp01" and "temp10" collide using this function, as do "stop", "tops", "pots", and "spot". A better hash code should somehow take into consideration the positions of the $x_i$'s. An alternative hash code, which does exactly this, is to choose a nonzero constant, $a \neq 1$, and use as a hash code the value

$$x_0 a^{k-1} + x_1 a^{k-2} + \cdots + x_{k-2} a + x_{k-1}.$$

Mathematically speaking, this is simply a polynomial in a that takes the components $(x_0, x_1, \ldots, x_{k-1})$ of an object x as its coefficients. This hash code is therefore called a polynomial hash code. By Horner's rule (see Exercise C-4.11), this polynomial can be written as

$$x_{k-1} + a(x_{k-2} + a(x_{k-3} + \cdots + a(x_2 + a(x_1 + a x_0)) \cdots)).$$

Intuitively, a polynomial hash code uses multiplication by the constant a as a way of "making room" for each component in a tuple of values while also preserving a characterization of the previous components.
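A minimal sketch of both hash codes for character strings, assuming the components $x_i$ are the `char` values of the string and using a = 33 purely as an example constant. The class and method names (`PolynomialHashDemo`, `sumHash`, `polyHash`) are hypothetical. The loop in `polyHash` is Horner's rule: each step multiplies the running value by a and adds the next component.

```java
final class PolynomialHashDemo {
    // Summation hash: adds the character values, so character order is lost.
    static int sumHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h += s.charAt(i);
        }
        return h;
    }

    // Polynomial hash evaluated by Horner's rule:
    // x_0 a^(k-1) + x_1 a^(k-2) + ... + x_(k-1), with int overflow wrapping silently.
    static int polyHash(String s, int a) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = a * h + s.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        String[] words = {"stop", "tops", "pots", "spot"};
        for (String w : words) {
            System.out.println(w + ": sum=" + sumHash(w) + " poly=" + polyHash(w, 33));
        }
    }
}
```

For the four anagrams, `sumHash` returns the same value each time, while `polyHash` with a = 33 returns four different values, because each character is weighted by a power of a that depends on its position.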

Of course, on a typical computer, evaluating a polynomial will be done using the finite bit representation for a hash code; hence, the value will periodically overflow the bits used for an integer. Since we are more interested in a good spread of the object x with respect to other keys, we simply ignore such overflows. Still, we should be mindful that such overflows are occurring and choose the constant a so that it has some nonzero, low-order bits, which will serve to preserve some of the information content even as we are in an overflow situation.
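One way to see why the low-order bits of a matter is to compare a = 32, whose five low-order bits are all zero, with a = 33. The sketch below assumes 32-bit Java `int` arithmetic, which wraps on overflow; the class name `OverflowDemo`, the helper `polyHash`, and the two test strings are hypothetical. For an 8-character string the first character is multiplied by $a^7$, and $32^7 = 2^{35}$ wraps to 0 in 32 bits, so the first character no longer influences the result.

```java
final class OverflowDemo {
    // Polynomial hash by Horner's rule; overflow is ignored and wraps modulo 2^32.
    static int polyHash(String s, int a) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = a * h + s.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        String u = "abcdefgh";
        String v = "zbcdefgh"; // differs from u only in the first character

        // a = 32: the first character is shifted out entirely, so the hashes collide.
        System.out.println(polyHash(u, 32) == polyHash(v, 32)); // true

        // a = 33: odd powers stay odd modulo 2^32, so the difference survives.
        System.out.println(polyHash(u, 33) == polyHash(v, 33)); // false
    }
}
```

With an odd constant, every power of a is odd and therefore nonzero modulo $2^{32}$, so each character keeps some influence on the final value however long the string is.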

We have done some experimental studies that suggest that 33, 37, 39, and 41 are particularly good choices for a when working with character strings that are English words. In fact, in a list of over 50,000 English words formed as the union
