Hashing & Indexing

Definition

Hashing is a technique that transforms a key into a fixed-size value called a hash code using a hash function, so that the corresponding data can be stored and retrieved efficiently from a hash table.

Indexing is a technique of creating a separate, organized structure that stores key values along with pointers or references to the actual data location, allowing faster searching, retrieval, and sometimes sorting of records.

In simple terms:

Hashing

answers: “Where should I store or find this item quickly?”

Indexing

answers: “How can I quickly locate this record without scanning everything?”

Main Content

1. Hashing

Hash Function and Hash Table

A hash function takes an input key, such as an integer, string, or object, and produces a numeric hash value.
That hash value is converted into an index of a hash table, where the actual record is stored.
Example: If the key is StudentID = 1024, a hash function may compute a table position like hash(1024) = 7, so the record is stored in bucket 7.
A good hash function should distribute keys uniformly across the table to minimize collisions.
A hash table is usually an array-like structure where each slot is called a bucket.

Collisions and Collision Resolution

A collision occurs when two different keys produce the same hash index.
Collisions are unavoidable in real systems because many possible keys map into a limited number of table positions.
Common collision handling methods include:
- Chaining: multiple elements are stored in a linked list or bucket at the same index.
- Open addressing: if a collision occurs, the algorithm searches for another empty slot.
Example of chaining:
- hash(12) = 3
- hash(23) = 3
- Both values are stored in bucket 3 using a list.
Efficient collision resolution is essential because too many collisions reduce the speed advantage of hashing.

2. Indexing

Primary Index and Secondary Index

A primary index is built on a primary key or ordered field, often corresponding to the physical ordering of data.
A secondary index is built on a non-ordering attribute, such as department name or city, to support searches on fields other than the primary key.
Primary indexing is often sparse, meaning it contains entries for only some records or blocks.
Secondary indexing is often dense, meaning it may include an entry for each record or each key occurrence.
Example:
- A student file stored by roll number can have a primary index on roll number.
- Another index on department can serve as a secondary index.

Dense Index and Sparse Index

A dense index contains an index entry for every search key value or every record.
A sparse index contains entries only for some of the records, usually one entry per block or group.
Dense indexes provide faster lookups but require more memory.
Sparse indexes use less storage but may need additional searching inside a block after locating the approximate area.
Example:
- If a file has records sorted by name, a sparse index might store the first name in each block.
- A dense index might store every name with a pointer to the exact record.

3. Differences, Performance, and Use Cases

Search Efficiency and Complexity

Hashing is ideal for exact-match searches, such as “find employee ID 5001.”
Average-case hashing search time is often close to O(1), though worst-case can degrade to O(n) if many collisions occur.
Indexing, especially with tree-based structures or ordered files, often supports O(log n) search and also enables ordered traversal.
Indexing is better when queries involve ranges, such as “find all students with marks between 70 and 90.”
Hashing is not naturally suited for range queries because hash values do not preserve order.

Storage and Maintenance

Hashing usually requires careful choice of table size, hash function, and load factor.
The load factor is the ratio of stored elements to table slots; higher load factor can increase collisions.
Indexes must be updated when records are inserted, deleted, or modified, which adds overhead.
In databases, maintaining multiple indexes improves search speed but can slow down write operations.
Example:
- A library catalog may use indexing for author names and categories.
- A key-value store may use hashing for quick retrieval by unique key.

Working / Process

1. Key Selection and Structure Preparation

Choose the search field or key, such as employee ID, roll number, product code, or book title.
For hashing, design a hash function and create a hash table of suitable size.
For indexing, decide whether the index will be primary, secondary, dense, or sparse.
Example:
- A database storing customer records may choose CustomerID as the hash key.
- The same database may create an index on LastName for search queries.

2. Insertion and Mapping

For hashing, the hash function converts the key into a bucket location.
The record is placed in the computed bucket; if a collision occurs, the chosen collision strategy is applied.
For indexing, the key is added to the index structure with a pointer/reference to the actual record or block.
Example:
- Record with key 45 hashes to bucket 2.
- If bucket 2 is occupied, the record is chained there or placed in the next available slot.
A simplified view:

Key ---> Hash Function ---> Table Position ---> Stored Record 45 ---> h(45) ---> 2 ---> Data item

3. Searching and Retrieval

To search using hashing, compute the hash of the desired key and inspect the bucket directly.
If the bucket contains multiple entries or a collision chain, search within that bucket only.
To search using indexing, locate the key in the index first, then follow the pointer to the actual record or block.
Example:
- To find employee E104, hash E104 and check that location.
- To find all records in a date range, use the index to jump to the starting point and scan relevant ordered entries.
This process avoids scanning the full dataset and greatly improves performance.

Advantages / Applications

Very fast search and retrieval

Hashing allows rapid exact-key lookup, often close to constant time on average.
Indexing reduces search time by narrowing the search space significantly.
This is especially useful in large databases and information systems.

Efficient database and file access

Indexes are widely used in database management systems to speed up SELECT queries.
Hashing is commonly used in hash tables, caches, symbol tables, password verification systems, and dictionaries.
Example applications:
- Library management systems
- Banking systems
- E-commerce product lookup
- Search engines and content retrieval systems

Supports large-scale data management

These techniques make it possible to handle millions or billions of records efficiently.
Indexing supports ordered access and range searching.
Hashing supports quick direct access for unique identifiers.
Together, they are essential in operating systems, compilers, databases, and web applications.

Summary

Hashing maps a key to a table position for fast direct access.
Indexing stores key-to-location references to speed up record retrieval.
Hashing is best for exact-match searches, while indexing is also useful for ordered and range-based searches.
Important terms to remember: hash function, hash table, collision, chaining, open addressing, primary index, secondary index, dense index, sparse index, load factor.