11.07.2015 Views

Data Structures and Algorithm Analysis - Computer Science at ...



Chap. 9 Searching

This method of computing sets from bit vectors is sometimes applied to document retrieval. Consider the problem of picking from a collection of documents those few which contain selected keywords. For each keyword, the document retrieval system stores a bit vector with one bit for each document. If the user wants to know which documents contain a certain three keywords, the corresponding three bit vectors are AND’ed together. Those bit positions resulting in a value of 1 correspond to the desired documents. Alternatively, a bit vector can be stored for each document to indicate those keywords appearing in the document. Such an organization is called a signature file. The signatures can be manipulated to find documents with desired combinations of keywords.

9.4 Hashing

This section presents a completely different approach to searching arrays: by direct access based on key value. The process of finding a record using some computation to map its key value to a position in the array is called hashing. Most hashing schemes place records in the array in whatever order satisfies the needs of the address calculation, thus the records are not ordered by value or frequency. The function that maps key values to positions is called a hash function and will be denoted by h. The array that holds the records is called the hash table and will be denoted by HT. A position in the hash table is also known as a slot. The number of slots in hash table HT will be denoted by the variable M, with slots numbered from 0 to M − 1.
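The terms just defined (hash function h, hash table HT, M slots numbered 0 to M − 1) can be made concrete with a minimal sketch. Everything specific here is an assumption for illustration only: the table size M = 10, the modulus hash function, and the helper names `insert` and `search` do not come from the text. Collisions are deliberately not handled, so this is not a usable hashing scheme, only a picture of direct access by key value.

```python
from typing import List, Optional, Tuple

M = 10  # number of slots (assumed value, for illustration only)

# HT: the hash table, slots numbered 0 to M - 1; None marks an empty slot.
HT: List[Optional[Tuple[int, str]]] = [None] * M


def h(key: int) -> int:
    """Hypothetical hash function: map an integer key to a slot in 0..M-1."""
    return key % M


def insert(key: int, value: str) -> None:
    """Store the record in slot h(key). Collisions are NOT resolved here."""
    slot = h(key)
    existing = HT[slot]
    if existing is not None and existing[0] != key:
        raise ValueError(f"slot {slot} already holds key {existing[0]}")
    HT[slot] = (key, value)


def search(key: int) -> Optional[str]:
    """Exact-match lookup: compute the slot directly, then compare keys."""
    record = HT[h(key)]
    if record is not None and record[0] == key:
        return record[1]
    return None


insert(42, "record A")  # h(42) = 2
insert(17, "record B")  # h(17) = 7
print(search(42))  # record A
print(search(52))  # None: 52 also maps to slot 2, but the key stored there is 42
```

Note that `search` never scans the table: it computes one slot and compares one key, which is the sense in which access is "direct".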
The goal for a hashing system is to arrange things such that, for any key value K and some hash function h, i = h(K) is a slot in the table such that 0 ≤ h(K) < M, and we have the key of the record stored at HT[i] equal to K.

Hashing is not good for applications where multiple records with the same key value are permitted. Hashing is not a good method for answering range searches. In other words, we cannot easily find all records (if any) whose key values fall within a certain range. Nor can we easily find the record with the minimum or maximum key value, or visit the records in key order. Hashing is most appropriate for answering the question, “What record, if any, has key value K?” For applications where access involves only exact-match queries, hashing is usually the search method of choice because it is extremely efficient when implemented correctly. As you will see in this section, however, there are many approaches to hashing and it is easy to devise an inefficient implementation. Hashing is suitable for both in-memory and disk-based searching and is one of the two most widely used methods for organizing large databases stored on disk (the other is the B-tree, which is covered in Chapter 10).

As a simple (though unrealistic) example of hashing, consider storing n records each with a unique key value in the range 0 to n − 1. In this simple case, a record
