Hashing & Indexing
Definition
Hashing is a technique that transforms a key into a fixed-size value called a hash code using a hash function, so that the corresponding data can be stored and retrieved efficiently from a hash table.
Indexing is a technique of creating a separate, organized structure that stores key values along with pointers or references to the actual data location, allowing faster searching, retrieval, and sometimes sorting of records.
In simple terms:
Hashing
- answers: “Where should I store or find this item quickly?”
Indexing
- answers: “How can I quickly locate this record without scanning everything?”
Main Content
1. Hashing
Hash Function and Hash Table
- A hash function takes an input key, such as an integer, string, or object, and produces a numeric hash value.
- That hash value is converted into an index of a hash table, where the actual record is stored.
- Example: If the key is
StudentID = 1024, a hash function may compute a table position likehash(1024) = 7, so the record is stored in bucket 7. - A good hash function should distribute keys uniformly across the table to minimize collisions.
- A hash table is usually an array-like structure where each slot is called a bucket.
Collisions and Collision Resolution
- A collision occurs when two different keys produce the same hash index.
- Collisions are unavoidable in real systems because many possible keys map into a limited number of table positions.
- Common collision handling methods include:
- Chaining: multiple elements are stored in a linked list or bucket at the same index.
- Open addressing: if a collision occurs, the algorithm searches for another empty slot.
- Example of chaining:
hash(12) = 3hash(23) = 3- Both values are stored in bucket 3 using a list.
- Efficient collision resolution is essential because too many collisions reduce the speed advantage of hashing.
2. Indexing
Primary Index and Secondary Index
- A primary index is built on a primary key or ordered field, often corresponding to the physical ordering of data.
- A secondary index is built on a non-ordering attribute, such as department name or city, to support searches on fields other than the primary key.
- Primary indexing is often sparse, meaning it contains entries for only some records or blocks.
- Secondary indexing is often dense, meaning it may include an entry for each record or each key occurrence.
- Example:
- A student file stored by roll number can have a primary index on roll number.
- Another index on department can serve as a secondary index.
Dense Index and Sparse Index
- A dense index contains an index entry for every search key value or every record.
- A sparse index contains entries only for some of the records, usually one entry per block or group.
- Dense indexes provide faster lookups but require more memory.
- Sparse indexes use less storage but may need additional searching inside a block after locating the approximate area.
- Example:
- If a file has records sorted by name, a sparse index might store the first name in each block.
- A dense index might store every name with a pointer to the exact record.
3. Differences, Performance, and Use Cases
Search Efficiency and Complexity
- Hashing is ideal for exact-match searches, such as “find employee ID 5001.”
- Average-case hashing search time is often close to O(1), though worst-case can degrade to O(n) if many collisions occur.
- Indexing, especially with tree-based structures or ordered files, often supports O(log n) search and also enables ordered traversal.
- Indexing is better when queries involve ranges, such as “find all students with marks between 70 and 90.”
- Hashing is not naturally suited for range queries because hash values do not preserve order.
Storage and Maintenance
- Hashing usually requires careful choice of table size, hash function, and load factor.
- The load factor is the ratio of stored elements to table slots; higher load factor can increase collisions.
- Indexes must be updated when records are inserted, deleted, or modified, which adds overhead.
- In databases, maintaining multiple indexes improves search speed but can slow down write operations.
- Example:
- A library catalog may use indexing for author names and categories.
- A key-value store may use hashing for quick retrieval by unique key.
Working / Process
1. Key Selection and Structure Preparation
- Choose the search field or key, such as employee ID, roll number, product code, or book title.
- For hashing, design a hash function and create a hash table of suitable size.
- For indexing, decide whether the index will be primary, secondary, dense, or sparse.
- Example:
- A database storing customer records may choose
CustomerIDas the hash key. - The same database may create an index on
LastNamefor search queries.
- A database storing customer records may choose
2. Insertion and Mapping
- For hashing, the hash function converts the key into a bucket location.
- The record is placed in the computed bucket; if a collision occurs, the chosen collision strategy is applied.
- For indexing, the key is added to the index structure with a pointer/reference to the actual record or block.
- Example:
- Record with key
45hashes to bucket2. - If bucket
2is occupied, the record is chained there or placed in the next available slot.
- Record with key
-
A simplified view:
Key ---> Hash Function ---> Table Position ---> Stored Record 45 ---> h(45) ---> 2 ---> Data item
3. Searching and Retrieval
- To search using hashing, compute the hash of the desired key and inspect the bucket directly.
- If the bucket contains multiple entries or a collision chain, search within that bucket only.
- To search using indexing, locate the key in the index first, then follow the pointer to the actual record or block.
- Example:
- To find employee
E104, hashE104and check that location. - To find all records in a date range, use the index to jump to the starting point and scan relevant ordered entries.
- To find employee
- This process avoids scanning the full dataset and greatly improves performance.
Advantages / Applications
Very fast search and retrieval
- Hashing allows rapid exact-key lookup, often close to constant time on average.
- Indexing reduces search time by narrowing the search space significantly.
- This is especially useful in large databases and information systems.
Efficient database and file access
- Indexes are widely used in database management systems to speed up
SELECTqueries. - Hashing is commonly used in hash tables, caches, symbol tables, password verification systems, and dictionaries.
- Example applications:
- Library management systems
- Banking systems
- E-commerce product lookup
- Search engines and content retrieval systems
Supports large-scale data management
- These techniques make it possible to handle millions or billions of records efficiently.
- Indexing supports ordered access and range searching.
- Hashing supports quick direct access for unique identifiers.
- Together, they are essential in operating systems, compilers, databases, and web applications.
Summary
- Hashing maps a key to a table position for fast direct access.
- Indexing stores key-to-location references to speed up record retrieval.
- Hashing is best for exact-match searches, while indexing is also useful for ordered and range-based searches.
- Important terms to remember: hash function, hash table, collision, chaining, open addressing, primary index, secondary index, dense index, sparse index, load factor.