Algorithms and Data Structures

N. Wirth. Algorithms and Data Structures. Oberon version

5 Key Transformations (Hashing)

5.1 Introduction

The principal question discussed in Chap. 4 at length is the following: Given a set of items characterized by a key (upon which an ordering relation is defined), how is the set to be organized so that retrieval of an item with a given key involves as little effort as possible? Clearly, in a computer store each item is ultimately accessed by specifying a storage address. Hence, the stated problem is essentially one of finding an appropriate mapping H of keys (K) into addresses (A):

    H: K → A

In Chap. 4 this mapping was implemented in the form of various list and tree search algorithms based on different underlying data organizations. Here we present yet another approach that is basically simple and very efficient in many cases. The fact that it also has some disadvantages is discussed subsequently.

The data organization used in this technique is the array structure. H is therefore a mapping transforming keys into array indices, which is the reason for the term key transformation that is generally used for this technique. It should be noted that we shall not need to rely on any dynamic allocation procedures; the array is one of the fundamental, static structures. The method of key transformations is often used in problem areas where tree structures are comparable competitors.

The fundamental difficulty in using a key transformation is that the set of possible key values is much larger than the set of available store addresses (array indices). Take for example names consisting of up to 16 letters as keys identifying individuals in a set of a thousand persons. Hence, there are 26^16 possible keys which are to be mapped onto 10^3 possible indices. The function H is therefore obviously a many-to-one function.
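The size of this mismatch can be made concrete with a few lines of arithmetic. The following is a minimal sketch in Python, not part of the original text; the constant names are chosen here for illustration only. It follows the text in counting 26^16 keys and 10^3 indices and uses the pigeonhole argument to bound how many distinct keys must share at least one index.

```python
# Illustrative constants taken from the example in the text.
KEY_LETTERS = 26       # size of the alphabet
KEY_LENGTH = 16        # letters per key, as counted in the text
TABLE_SIZE = 10 ** 3   # available array indices

# Number of possible keys according to the text's count.
key_space = KEY_LETTERS ** KEY_LENGTH

# Pigeonhole bound: whatever H is chosen, at least one index must be
# shared by at least this many distinct possible keys.
keys_per_index = key_space // TABLE_SIZE

print(f"possible keys : {key_space:.3e}")
print(f"table indices : {TABLE_SIZE}")
print(f"some index is shared by at least {keys_per_index:.3e} possible keys")
```

Only a thousand of these keys are ever present at one time, which is why the technique can work at all; the bound merely shows that no choice of H can rule out two keys landing on the same index.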
Given a key k, the first step in a retrieval (search) operation is to compute its associated index h = H(k), and the second - evidently necessary - step is to verify whether or not the item with the key k is indeed identified by h in the array (table) T, i.e., to check whether T[H(k)].key = k. We are immediately confronted with two questions:

1. What kind of function H should be used?
2. How do we cope with the situation that H does not yield the location of the desired item?

The answer to the second question is that some method must be used to yield an alternative location, say index h', and, if this is still not the location of the wanted item, yet a third index h", and so on. The case in which a key other than the desired one is at the identified location is called a collision; the task of generating alternative indices is termed collision handling. In the following we shall discuss the choice of a transformation function and methods of collision handling.

5.2 Choice of a Hash Function

A prerequisite of a good transformation function is that it distributes the keys as evenly as possible over the range of index values. Apart from satisfying this requirement, the distribution is not bound to any pattern, and it is actually desirable that it give the impression of being entirely random. This property has given this method the somewhat unscientific name hashing, i.e., chopping the argument up, or making a mess. H is called the hash function. Clearly, it should be efficiently computable, i.e., be composed of very few basic arithmetic operations.

Assume that a transfer function ORD(k) is available and denotes the ordinal number of the key k in the set of all possible keys. Assume, furthermore, that the array indices i range over the integers 0 .. N-1, where N is the size of the array. Then an obvious choice is

    H(k) = ORD(k) MOD N

It has the property that the key values are spread evenly over the index range, and it is therefore the basis of most key transformations. It is also very efficiently computable if N is a power of 2. But it is exactly this case that must be avoided if the keys are sequences of letters. The assumption that all keys are equally likely is in this case mistaken. In fact, words that differ by only a few characters then most likely map onto identical indices, thus effectively causing a most uneven distribution. It is therefore particularly recommended to let N be a prime number [5-2]. This has the consequence that a full division operation is needed that cannot be replaced by a mere masking of binary digits, but this is no serious drawback on most modern computers that feature a built-in division instruction.

Often, hash functions are used which consist of applying logical operations such as the exclusive or to some parts of the key represented as a sequence of binary digits. These operations may be faster than division on some computers, but they sometimes fail spectacularly to distribute the keys evenly over the range of indices. We therefore refrain from discussing such methods in further detail.

5.3 Collision Handling

If an entry in the table corresponding to a given key turns out not to be the desired item, then a collision is present, i.e., two items have keys mapping onto the same index. A second probe is necessary, one based on an index obtained in a deterministic manner from the given key. There exist several methods of generating secondary indices. An obvious one is to link all entries with identical primary index H(k) together in a linked list. This is called direct chaining. The elements of this list may be in the primary table or not; in the latter case, storage in which they are allocated is usually called an overflow area.
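The choice H(k) = ORD(k) MOD N and the idea of direct chaining can be tried out together in a short sketch. The Python code below is an illustration under stated assumptions, not the book's implementation: the text leaves ORD unspecified, so `ord_key` assumes that a key's ASCII bytes are read as one base-256 integer, and the names `hash_index` and `ChainedTable` are hypothetical. Each table slot holds a Python list that plays the role of the linked list of collided entries.

```python
def ord_key(key):
    """Assumed ORD: read the key's ASCII bytes as one base-256 integer."""
    return int.from_bytes(key.encode("ascii"), "big")


def hash_index(key, n):
    """H(k) = ORD(k) MOD N."""
    return ord_key(key) % n


class ChainedTable:
    """Direct chaining: every slot keeps all entries with the same primary index."""

    def __init__(self, n):
        self.n = n
        self.slots = [[] for _ in range(n)]

    def insert(self, key, value):
        chain = self.slots[hash_index(key, self.n)]
        for pos, (existing_key, _) in enumerate(chain):
            if existing_key == key:
                chain[pos] = (key, value)   # key already present: overwrite
                return
        chain.append((key, value))          # new key: extend the chain

    def search(self, key):
        """Return the stored value, or None if the key is absent."""
        for existing_key, value in self.slots[hash_index(key, self.n)]:
            if existing_key == key:
                return value
        return None


words = ["cat", "bat", "hat"]

# With N = 256 (a power of 2) and the assumed ORD, only the last letter
# matters, so all three words collide on ord("t") = 116.
power_of_two = [hash_index(w, 256) for w in words]

# With the prime N = 251 the leading letters influence the index as well.
prime = [hash_index(w, 251) for w in words]

table = ChainedTable(256)
for number, word in enumerate(words):
    table.insert(word, number)
```

In this sketch the three words end up in one chain of the 256-slot table and are still retrieved correctly, at the price of a sequential scan of that chain; with N = 251 each word would occupy a chain of its own.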
Direct chaining has the disadvantage that secondary lists must be maintained, and that each entry must provide space for a pointer (or index) to its list of collided items. An alternative solution for resolving collisions is to dispense with links entirely and instead simply look at other entries in the same table until the item is found or an open position is encountered, in which case one may assume that the specified key is not present in the table. This method is called open addressing [5-3]. Naturally, the sequence of indices of secondary probes must always be the same for a given key. The algorithm for a table lookup can then be sketched as follows:

    h := H(k); i := 0;
    REPEAT
      IF T[h].key = k THEN item found
      ELSIF T[h].key = free THEN item is not in table
      ELSE (*collision*)
        i := i+1; h := H(k) + G(i)
      END
    UNTIL found or not in table (or table full)

Various functions for resolving collisions have been proposed in the literature. A survey of the topic by Morris in 1968 [4-8] stimulated considerable activities in this field. The simplest method is to try for the next location - considering the table to be circular - until either the item with the specified key is found or an empty location is encountered. Hence, G(i) = i; the indices h_i used for probing in this case are

    h_0 = H(k)
    h_i = (h_0 + i) MOD N,  i = 1 ... N-1

This method is called linear probing and has the disadvantage that entries have a tendency to cluster around the primary keys (keys that had not collided upon insertion). Ideally, of course, a function G should be chosen that again spreads the keys uniformly over the remaining set of locations. In practice, however,