Hashing & Indexing

Comprehensive study notes, diagrams, and exam preparation for Hashing & Indexing.

Hashing & Indexing

Definition

Hashing is a technique that transforms a key into a fixed-size value called a hash code using a hash function, so that the corresponding data can be stored and retrieved efficiently from a hash table.

Indexing is a technique of creating a separate, organized structure that stores key values along with pointers or references to the actual data location, allowing faster searching, retrieval, and sometimes sorting of records.

In simple terms:

Hashing

  • answers: “Where should I store or find this item quickly?”

Indexing

  • answers: “How can I quickly locate this record without scanning everything?”

Main Content

1. Hashing

Hash Function and Hash Table

  • A hash function takes an input key, such as an integer, string, or object, and produces a numeric hash value.
  • That hash value is converted into an index of a hash table, where the actual record is stored.
  • Example: If the key is StudentID = 1024, a hash function may compute a table position like hash(1024) = 7, so the record is stored in bucket 7.
  • A good hash function should distribute keys uniformly across the table to minimize collisions.
  • A hash table is usually an array-like structure where each slot is called a bucket.

Collisions and Collision Resolution

  • A collision occurs when two different keys produce the same hash index.
  • Collisions are unavoidable in real systems because many possible keys map into a limited number of table positions.
  • Common collision handling methods include:
    • Chaining: multiple elements are stored in a linked list or bucket at the same index.
    • Open addressing: if a collision occurs, the algorithm searches for another empty slot.
  • Example of chaining:
    • hash(12) = 3
    • hash(23) = 3
    • Both values are stored in bucket 3 using a list.
  • Efficient collision resolution is essential because too many collisions reduce the speed advantage of hashing.

2. Indexing

Primary Index and Secondary Index

  • A primary index is built on a primary key or ordered field, often corresponding to the physical ordering of data.
  • A secondary index is built on a non-ordering attribute, such as department name or city, to support searches on fields other than the primary key.
  • Primary indexing is often sparse, meaning it contains entries for only some records or blocks.
  • Secondary indexing is often dense, meaning it may include an entry for each record or each key occurrence.
  • Example:
    • A student file stored by roll number can have a primary index on roll number.
    • Another index on department can serve as a secondary index.

Dense Index and Sparse Index

  • A dense index contains an index entry for every search key value or every record.
  • A sparse index contains entries only for some of the records, usually one entry per block or group.
  • Dense indexes provide faster lookups but require more memory.
  • Sparse indexes use less storage but may need additional searching inside a block after locating the approximate area.
  • Example:
    • If a file has records sorted by name, a sparse index might store the first name in each block.
    • A dense index might store every name with a pointer to the exact record.

3. Differences, Performance, and Use Cases

Search Efficiency and Complexity

  • Hashing is ideal for exact-match searches, such as “find employee ID 5001.”
  • Average-case hashing search time is often close to O(1), though worst-case can degrade to O(n) if many collisions occur.
  • Indexing, especially with tree-based structures or ordered files, often supports O(log n) search and also enables ordered traversal.
  • Indexing is better when queries involve ranges, such as “find all students with marks between 70 and 90.”
  • Hashing is not naturally suited for range queries because hash values do not preserve order.

Storage and Maintenance

  • Hashing usually requires careful choice of table size, hash function, and load factor.
  • The load factor is the ratio of stored elements to table slots; higher load factor can increase collisions.
  • Indexes must be updated when records are inserted, deleted, or modified, which adds overhead.
  • In databases, maintaining multiple indexes improves search speed but can slow down write operations.
  • Example:
    • A library catalog may use indexing for author names and categories.
    • A key-value store may use hashing for quick retrieval by unique key.

Working / Process

1. Key Selection and Structure Preparation

  • Choose the search field or key, such as employee ID, roll number, product code, or book title.
  • For hashing, design a hash function and create a hash table of suitable size.
  • For indexing, decide whether the index will be primary, secondary, dense, or sparse.
  • Example:
    • A database storing customer records may choose CustomerID as the hash key.
    • The same database may create an index on LastName for search queries.

2. Insertion and Mapping

  • For hashing, the hash function converts the key into a bucket location.
  • The record is placed in the computed bucket; if a collision occurs, the chosen collision strategy is applied.
  • For indexing, the key is added to the index structure with a pointer/reference to the actual record or block.
  • Example:
    • Record with key 45 hashes to bucket 2.
    • If bucket 2 is occupied, the record is chained there or placed in the next available slot.
  • A simplified view:

    Key ---> Hash Function ---> Table Position ---> Stored Record 45 ---> h(45) ---> 2 ---> Data item

3. Searching and Retrieval

  • To search using hashing, compute the hash of the desired key and inspect the bucket directly.
  • If the bucket contains multiple entries or a collision chain, search within that bucket only.
  • To search using indexing, locate the key in the index first, then follow the pointer to the actual record or block.
  • Example:
    • To find employee E104, hash E104 and check that location.
    • To find all records in a date range, use the index to jump to the starting point and scan relevant ordered entries.
  • This process avoids scanning the full dataset and greatly improves performance.

Advantages / Applications

Very fast search and retrieval

  • Hashing allows rapid exact-key lookup, often close to constant time on average.
  • Indexing reduces search time by narrowing the search space significantly.
  • This is especially useful in large databases and information systems.

Efficient database and file access

  • Indexes are widely used in database management systems to speed up SELECT queries.
  • Hashing is commonly used in hash tables, caches, symbol tables, password verification systems, and dictionaries.
  • Example applications:
    • Library management systems
    • Banking systems
    • E-commerce product lookup
    • Search engines and content retrieval systems

Supports large-scale data management

  • These techniques make it possible to handle millions or billions of records efficiently.
  • Indexing supports ordered access and range searching.
  • Hashing supports quick direct access for unique identifiers.
  • Together, they are essential in operating systems, compilers, databases, and web applications.

Summary

  • Hashing maps a key to a table position for fast direct access.
  • Indexing stores key-to-location references to speed up record retrieval.
  • Hashing is best for exact-match searches, while indexing is also useful for ordered and range-based searches.
  • Important terms to remember: hash function, hash table, collision, chaining, open addressing, primary index, secondary index, dense index, sparse index, load factor.