State machine replication

Comprehensive study notes, diagrams, and exam preparation for State machine replication.

State Machine Replication

Definition

State machine replication is a fault-tolerance technique in which multiple replicas of a service execute the same sequence of client requests in the same order, starting from the same initial state, so that each replica reaches the same final state and produces the same outputs.

The key idea is that the service is treated as a deterministic state machine:

  • it has a state,
  • it accepts commands or requests,
  • it transitions to a new state after each command,
  • and the result of each command depends only on the current state and the command itself.

By replicating this state machine across several servers and ensuring identical command ordering, the system can survive failures while preserving correctness.


Main Content

1. Deterministic State Machines

  • A state machine is an abstract model in which a system moves from one state to another based on inputs or commands.
  • In replication, the state machine must be deterministic, meaning the same input applied to the same state always produces the same output and next state.

A deterministic state machine is essential because replicas must behave identically. If one replica receives a command like transfer $100 from account A to B, every replica must compute exactly the same resulting balances. If the logic includes randomness, time-dependent behavior, or external side effects, replicas may diverge. Therefore, replicated services often isolate non-deterministic elements and only replicate the pure command execution.

Examples include:

  • a key-value store where put(key, value) and get(key) are executed in order,
  • a banking ledger where transactions are applied sequentially,
  • a lock service where clients request locks in a globally agreed order.

2. Total Order of Operations

  • All replicas must process requests in the same sequence, often called total order or log order.
  • Consensus mechanisms such as Paxos, Raft, or Byzantine agreement protocols are used to establish this common order.

The order of operations matters because many systems are not commutative. For instance, if one client deposits money and another withdraws money from the same account, processing the deposit first may succeed while processing the withdrawal first may fail. To ensure every replica reaches the same state, all operations must be delivered in exactly the same order.

A common approach is:

  • clients submit requests to the replication layer,
  • the layer selects an agreed-upon order,
  • each replica appends the request to its log,
  • replicas execute log entries one by one.

This ordered log becomes the source of truth. Even if messages arrive at replicas in different network orders, the consensus protocol ensures that the final execution order is identical everywhere.

3. Replication, Fault Tolerance, and Consistency

  • Multiple replicas provide resilience against failures such as crashes, restarts, and network partitions.
  • Consistency is maintained because replicas follow the same replicated log and deterministic execution model.

Replication improves availability because one server failure does not stop the service. If a leader fails in a leader-based system, another replica can take over after a new consensus round. If some replicas are temporarily unreachable, the remaining replicas can continue serving as long as the system’s fault-tolerance assumptions still hold.

There are important consistency guarantees:

Linearizability

  • : each operation appears to happen atomically at a single point in time.

Strong consistency

  • : all clients observe the same order of committed operations.

Replica agreement

  • : no two correct replicas decide differently on the same log position.

This is especially valuable in systems that require correctness over speed, such as financial systems, metadata services, coordination services, and distributed configuration stores.


Working / Process

1. Client submits a request

  • A client sends an operation, such as insert, delete, deposit, or lock.
  • The request is forwarded to the replication protocol rather than being applied immediately by one server.

2. Consensus determines the order

  • The replicas use a consensus algorithm to agree on the next operation in the global sequence.
  • Once the order is decided, the operation is written into the replicated log and acknowledged by enough replicas to be considered committed.

3. Replicas execute the same commands

  • Each replica applies the committed log entry to its local state machine in order.
  • Because every replica starts from the same state and applies the same deterministic sequence, their states remain identical.
  • If a replica crashes and later recovers, it can replay the log or restore from a snapshot to catch up.

Advantages / Applications

High availability and fault tolerance

  • The system can continue operating even if some replicas fail, making it suitable for critical services that cannot afford downtime.

Strong consistency and predictable behavior

  • All replicas agree on the same sequence of operations, which makes the system easier to reason about and helps prevent data corruption or conflicting updates.

Widely used in real-world distributed systems

  • It is applied in databases, distributed lock managers, configuration systems, service discovery, leader election, replicated storage, and consensus-based platforms where correctness is more important than raw speed.

Summary

State machine replication is a method for building reliable distributed systems by running the same deterministic service on multiple machines and keeping their operations in the same order. It combines replication with consensus so that replicas stay consistent and can continue operating after failures.