What is Hashing?

Question: I want to know about direct addressing, dictionary lookup, and indirect addressing. Can you explain about division remainder, digit analysis/extraction, folding, mid-square, and radix conversion? What is collision and what are the differences between double hashing, chaining, bucket addressing, perfect hashing, and dynamic hashing? What is linear probing and what are synonyms?
anonymous

Answer: Hash tables are one of the most efficient means of searching for data. A hash table establishes a mapping between all of the possible keys and the position where the data associated with those keys is stored. I don't know exactly how all of the hashing methods that you are asking about work but the following should go at least partway towards answering your questions.

Direct Addressing is where the hashing algorithm guarantees that no two keys will generate the exact same hash value. This is the ideal situation but is impractical in most situations. Instead most hash functions map multiple keys to the same has value and provide indirect addressing mechanisms to handle the situation where two keys map to the same value. The situation where this occurs is called a collision. Two keys that hash to the same value like this are called synonyms.

A hash function attempts to distribute keys in a random manner so as to minimize the need to resolve collisions. One of the simplest hashing algorithms is division remainder where you divide the key value by some fixed number and take the remainder as the key location (also known as modulo arithmetic). For example let's say we have space for one thousand entries, so we divide the key of each value by 1000 and take the remainder so the entry with key 5078950770 will occupy position 770 in our array (5078950770 modulo 1000 is 770). Of course with this particular example every 1000th key will map to the same location.

A Chained hash table is one where the data is stored in linked lists which contain all of the data entries whose keys map to the same hash value. The linked list (or bucket) can grow to contain as many entries as required but as the list grows it becomes gradually less efficient in the speed at which the entries can be accessed.

An open addressed hash table stores the data in a table instead of in a linked list and can use a variety of probing methods to locate the required entry within the table. Linear probing is the least efficient method as it involves searching sequentially through the table until the desired entry is found. Where only a few entries map to the same hash value this may be good enough but where lots of entries map to the same hash value, a more efficient method is required.

Double hashing is one of the more effective methods of probing an open addressed hash table. It works by adding the hash codings of two auxiliary hash functions. The function for performing double hashing is h(k,i) = (h1(k) + ih2(k)) mod m where 0 <= i <m, m is the number of entries in the table, and h1 and h2 are the hash functions. To ensure that all entries in the table are filled before collisions occur we can either make m a power of 2 and have h2 always return an odd number or we make m prime and have h2 always return a value less than m. For example if h1(k) = k mod m and h2(k) = (k mod (m - 1)) and we set m to 1699 (a prime number) then when we hash 15385 we probe position 94 first and then every 113th position through the table as i increases until we find the entry we're looking for.

Dynamic or Universal hashing involves using a random number generator to dynamically generate the hashing algorithm at run time. This method reduces the chance that your keys will produce a bad distribution across the table.

As you have noticed from the above examples, hashing algorithms work on numeric keys. digit analysis/extraction is the method that you use to convert the actual keys into numerical values that cal be fed through your hashing algorithm. To get the best results you should aim to base your hashing values on those parts of the actual keys most likely to vary from one record to another.