11.07.2015 Views

Data Structures and Algorithm Analysis - Computer Science at ...

Data Structures and Algorithm Analysis - Computer Science at ...

Data Structures and Algorithm Analysis - Computer Science at ...

SHOW MORE
SHOW LESS

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

316 Chap. 9 Searchingbirthday is similar to assigning records to slots in a table (of size 365) using thebirthday as a hash function. Note th<strong>at</strong> this observ<strong>at</strong>ion tells us nothing about whichstudents share a birthday, or on which days of the year shared birthdays fall.To be practical, a d<strong>at</strong>abase organized by hashing must store records in a hashtable th<strong>at</strong> is not so large th<strong>at</strong> it wastes space. Typically, this means th<strong>at</strong> the hashtable will be around half full. Because collisions are extremely likely to occurunder these conditions (by chance, any record inserted into a table th<strong>at</strong> is half fullwill have a collision half of the time), does this mean th<strong>at</strong> we need not worry aboutthe ability of a hash function to avoid collisions? Absolutely not. The differencebetween a good hash function <strong>and</strong> a bad hash function makes a big difference inpractice. Technically, any function th<strong>at</strong> maps all possible key values to a slot inthe hash table is a hash function. In the extreme case, even a function th<strong>at</strong> mapsall records to the same slot is a hash function, but it does nothing to help us findrecords during a search oper<strong>at</strong>ion.We would like to pick a hash function th<strong>at</strong> stores the actual records in the collectionsuch th<strong>at</strong> each slot in the hash table has equal probability of being filled. Unfortun<strong>at</strong>ely,we normally have no control over the key values of the actual records,so how well any particular hash function does this depends on the distribution ofthe keys within the allowable key range. In some cases, incoming d<strong>at</strong>a are welldistributed across their key range. For example, if the input is a set of r<strong>and</strong>omnumbers selected uniformly from the key range, any hash function th<strong>at</strong> assigns thekey range so th<strong>at</strong> each slot in the hash table receives an equal share of the rangewill likely also distribute the input records uniformly within the table. However,in many applic<strong>at</strong>ions the incoming records are highly clustered or otherwise poorlydistributed. When input records are not well distributed throughout the key rangeit can be difficult to devise a hash function th<strong>at</strong> does a good job of distributing therecords throughout the table, especially if the input distribution is not known inadvance.There are many reasons why d<strong>at</strong>a values might be poorly distributed.1. N<strong>at</strong>ural frequency distributions tend to follow a common p<strong>at</strong>tern where a fewof the entities occur frequently while most entities occur rel<strong>at</strong>ively rarely.For example, consider the popul<strong>at</strong>ions of the 100 largest cities in the UnitedSt<strong>at</strong>es. If you plot these popul<strong>at</strong>ions on a number line, most of them will beclustered toward the low side, with a few outliers on the high side. This is anexample of a Zipf distribution (see Section 9.2). Viewed the other way, thehome town for a given person is far more likely to be a particular large citythan a particular small town.2. Collected d<strong>at</strong>a are likely to be skewed in some way. Field samples might berounded to, say, the nearest 5 (i.e., all numbers end in 5 or 0).3. If the input is a collection of common English words, the beginning letterwill be poorly distributed.

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!