Hashing techniques
Definition
Hashing is the process of transforming a key or input message into a fixed-length value using a hash function. The resulting value is called a hash, hash code, or digest, and it is typically used to locate data in a hash table or to verify data integrity.
A good hashing technique aims to:
- Produce the same hash for the same input every time
- Distribute values uniformly across the available range
- Minimize collisions, where different inputs produce the same hash value
- Allow fast insertion, searching, and deletion in data structures
For example, if a student ID is used as a key, hashing may convert it into an index in an array so that the student record can be stored and accessed quickly.
Main Content
1. Hash Function
- A hash function is the core of any hashing technique. It takes an input key such as a number, string, or record and converts it into a numerical value, usually within a fixed range.
- The quality of the hash function strongly affects performance. A good hash function should be deterministic, fast to compute, and capable of distributing keys evenly to reduce clustering and collisions.
A simple example of a hash function for integers is:
h(key) = key mod table_size
If the table size is 10:
h(23) = 23 mod 10 = 3h(45) = 45 mod 10 = 5h(33) = 33 mod 10 = 3
Here, 23 and 33 map to the same location, which leads to a collision.
A hash function used in practice may be much more sophisticated. For strings, it may combine character codes using multiplication, addition, and modulus operations to create a more uniform distribution.
Important characteristics of a good hash function:
- It should be easy and quick to compute
- It should spread keys evenly
- It should reduce the chance of two different keys mapping to the same index
- It should be stable so that the same key always gives the same result
2. Hash Table
- A hash table is a data structure that stores key-value pairs using a hash function to compute the index where each item should be stored.
- It provides very fast access in average cases, often close to constant time for searching, inserting, and deleting.
A hash table usually contains:
- An array of fixed size
- A hash function to compute array positions
- A strategy to handle collisions
Example of storage:
If the table size is 7 and the keys are 14, 21, and 28, using key mod 7:
14 mod 7 = 021 mod 7 = 028 mod 7 = 0
All three keys map to index 0, so collision handling becomes necessary.
A hash table can store:
- Simple values, such as names or numbers
- Complex records, such as employee details, student marks, or database rows
Why hash tables are powerful:
- They provide fast lookups
- They are suitable for large data sets
- They support efficient membership testing, such as checking whether an item exists
Simple structure:
Index: 0 1 2 3 4 5 6
-------------------------
Table: | | | | | | |
-------------------------
After inserting some values:
Index: 0 1 2 3 4 5 6
-------------------------
Table: |14 | | | | | |
-------------------------
3. Collision Handling
- A collision occurs when two or more different keys produce the same hash value.
- Collision handling is essential because collisions are unavoidable in real systems, especially when the number of keys is larger than the number of available table positions.
There are several common collision handling methods:
a) Chaining
In chaining, each table location stores a list, linked list, or bucket of all items that hash to the same index.
Example: If keys 14, 21, and 28 all hash to index 0, the table entry at index 0 stores all three items in a chain.
Index 0: 14 -> 21 -> 28
Advantages of chaining:
- Simple to implement
- Works well even when the table is moderately full
- Easy to insert new items
b) Open Addressing
In open addressing, when a collision occurs, the algorithm searches for another empty slot in the table according to a probing sequence.
Common probing methods include:
Linear probing
- : check the next slot one by one
Quadratic probing
- : use a quadratic function to determine the next slot
Double hashing
- : use a second hash function to compute the next step
Example of linear probing: If index 3 is occupied, check 4, then 5, then 6, and so on.
Advantages of open addressing:
- Uses only the hash table array
- Avoids extra linked structures
- Can be memory efficient
Disadvantages:
- Performance can degrade as the table becomes full
- Can cause clustering, especially with linear probing
c) Rehashing / Resizing
When collisions become too frequent or the load factor becomes too high, the table can be expanded and all existing keys are inserted again using a new table size.
This improves distribution and performance but requires extra time during resizing.
Working / Process
1. Choose a hash function and table size
Select a suitable hash function based on the type of data, and decide the size of the hash table. The table size is often chosen as a prime number to improve distribution and reduce clustering.
2. Convert the key into an index
Apply the hash function to the key to get an index in the hash table. For example, if the key is 52 and the table size is 10, then 52 mod 10 = 2.
3. Store or retrieve the item and handle collisions if needed
If the computed slot is empty, insert the item there. If another item is already present, apply the collision handling method such as chaining or probing. For searching, use the same hash function and collision strategy to find the item efficiently.
Example workflow:
- Insert key 23 into a table of size 10
- Compute
23 mod 10 = 3 - Place 23 at index 3
- Insert key 33
- Compute
33 mod 10 = 3 - Since index 3 is occupied, handle the collision using chaining or probing
For a chaining-based table:
Index 3: 23 -> 33
For open addressing with linear probing:
- Put 23 at index 3
- If 33 also hashes to 3 and index 4 is empty, place 33 at index 4
Advantages / Applications
- Hashing provides very fast average-case operations for search, insertion, and deletion, making it highly efficient for large data sets.
- It is widely used in dictionaries, symbol tables, caches, database indexing, password storage systems, digital signatures, and file integrity checking.
- Hashing helps reduce the time complexity of many tasks from linear time to near constant time in practical use, especially when the hash function is well designed and collisions are managed properly.
Additional major applications include:
Password verification
- : storing hashed passwords instead of plain text
Data integrity checking
- : detecting whether files have been altered
Database indexing
- : quickly locating rows using key-based hashing
Compiler design
- : storing identifiers in symbol tables
Caching
- : quickly retrieving previously stored results
Distributed systems
- : assigning data across servers using consistent hashing
Why hashing is valuable:
- Fast access
- Efficient memory use in many cases
- Useful for exact-match queries
- Strong support for security-related tasks when cryptographic hashes are used
Summary
- Hashing converts data into a fixed value for fast storage and retrieval.
- A hash table uses a hash function to place and find items efficiently.
- Collisions are handled using methods like chaining, linear probing, quadratic probing, and double hashing.
- Important terms to remember: hash function, hash table, collision, chaining, open addressing, probing, load factor, rehashing, digest, index