THN Interview Prep

Consensus (Raft, Paxos, ZAB)

What it is

Consensus algorithms let a group of nodes agree on a single ordered log of values (or leader choice) despite crashes and network partitions, so clients see coherent state from a quorum of correct nodes.

Raft (one paragraph)

Raft elects a leader via randomized timeouts and heartbeats; the leader appends log entries and replicates them to a majority before committing. It is designed for understandability (separate subproblems: leader election, log replication, safety). Common in etcd, Consul, and many KV systems and coordination layers. Interviewers often want you to say leader + replicated log + majority commit, and that split votes trigger new elections.
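A minimal sketch of the majority-commit rule described above, assuming a leader that tracks a matchIndex per node (names are illustrative, not taken from any particular Raft library):

```go
package main

import (
	"fmt"
	"sort"
)

// commitIndex returns the highest log index replicated on a majority of
// nodes. matchIndex holds, per node (leader included), the highest log
// index known to be replicated on that node.
func commitIndex(matchIndex []int) int {
	sorted := append([]int(nil), matchIndex...)
	sort.Sort(sort.Reverse(sort.IntSlice(sorted)))
	// With n nodes, the entry at position n/2 (0-based) in the descending
	// sort is replicated on at least n/2+1 nodes, i.e. a majority.
	return sorted[len(sorted)/2]
}

func main() {
	// 5-node cluster: leader at index 7, followers at 7, 6, 3, 2.
	fmt.Println(commitIndex([]int{7, 7, 6, 3, 2})) // 6: three of five nodes have >= 6
}
```

In full Raft the leader additionally advances the commit index only to entries from its current term (earlier-term entries commit indirectly), which closes a known safety edge case.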

Paxos (one paragraph)

Paxos (and Multi-Paxos for a sequence of values) solves agreement with proposers, acceptors, and learners; two phases (prepare/promise and accept/accepted) resolve competing proposers. It is the classic academic foundation but notoriously hard to implement correctly in production. Many systems use Raft or ZAB instead for operability, but “Paxos family” still signals you know the space.
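A minimal sketch of the acceptor side of single-decree Paxos, assuming in-memory state and illustrative names; it shows the prepare/promise and accept phases mentioned above:

```go
package main

import "fmt"

// Acceptor sketches a single-decree Paxos acceptor.
type Acceptor struct {
	promised  int    // highest proposal number promised (0 = none yet)
	acceptedN int    // proposal number of the accepted value, if any
	acceptedV string // accepted value, if any
}

// Prepare handles phase 1: promise to ignore proposals numbered below n,
// and report any value already accepted so the proposer must adopt it.
func (a *Acceptor) Prepare(n int) (ok bool, prevN int, prevV string) {
	if n > a.promised {
		a.promised = n
		return true, a.acceptedN, a.acceptedV
	}
	return false, 0, ""
}

// Accept handles phase 2: accept (n, v) unless a higher-numbered prepare
// has been promised in the meantime.
func (a *Acceptor) Accept(n int, v string) bool {
	if n >= a.promised {
		a.promised, a.acceptedN, a.acceptedV = n, n, v
		return true
	}
	return false
}

func main() {
	a := &Acceptor{}
	fmt.Println(a.Prepare(1))     // true 0 ""
	fmt.Println(a.Accept(1, "x")) // true
	fmt.Println(a.Prepare(2))     // true 1 "x": proposer 2 must re-propose "x"
	fmt.Println(a.Accept(1, "y")) // false: superseded by the promise to 2
}
```

The key rule: a proposer that learns of a previously accepted value in phase 1 must propose that same value, which is what makes competing proposers converge.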

ZAB (ZooKeeper Atomic Broadcast) (one paragraph)

ZAB is the protocol behind Apache ZooKeeper: a primary-backup style atomic broadcast in which a leader imposes a total order on state changes and streams them to followers; optimized for linearizable writes and event ordering for coordination (locks, config, service discovery). A good fit when the problem is strongly consistent metadata, not large data volumes.
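ZAB's total order is carried by the zxid: a 64-bit id whose high 32 bits are the leader epoch and low 32 bits a per-epoch counter, so comparing zxids as plain integers recovers the broadcast order. A minimal sketch:

```go
package main

import "fmt"

// zxid packs ZooKeeper's transaction id: high 32 bits are the leader
// epoch, low 32 bits a counter within that epoch.
func zxid(epoch, counter uint32) uint64 {
	return uint64(epoch)<<32 | uint64(counter)
}

func main() {
	before := zxid(4, 9000) // last write under the previous leader
	after := zxid(5, 1)     // first write after a new leader is elected
	fmt.Println(after > before) // true: a new epoch orders after everything before it
}
```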

What interviewers want

  • Why consensus: avoid split-brain and duplicate leaders in replicated state machines.
  • Tradeoff: CP flavor in CAP for the metadata path; latency to majority (see latency-throughput).
  • Not for every datastore—often only control plane (who is leader, config) while data plane uses other replication (see replication).
  • Ability to contrast Raft (teachable, leader-centric) vs Paxos (general, harder) vs ZAB (ZK’s total order broadcast).
  • Cluster size and failure tolerance tradeoffs: be ready to do the back-of-the-envelope majority math for odd-sized membership (worked in the sketch after this list).
  Request flow: Client -> Leader -> replicate to majority -> commit -> apply
  Under partition: the minority side cannot commit, so safety is preserved
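The cluster-size tradeoff from the list above, worked as a back-of-the-envelope sketch (plain quorum math, no library):

```go
package main

import "fmt"

// With n voting members, a quorum is n/2+1 and the cluster tolerates
// f = (n-1)/2 failures. Even sizes add a node but no extra fault
// tolerance, which is why odd membership is the default.
func main() {
	for n := 3; n <= 7; n++ {
		fmt.Printf("n=%d  quorum=%d  tolerates=%d failures\n", n, n/2+1, (n-1)/2)
	}
}
```

Note that n=4 needs a 3-node quorum yet still tolerates only one failure, hence the usual 3-, 5-, or 7-node clusters.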

Failure modes

  • Leader failure: election delay → writes unavailable until a new leader wins a majority (randomized timeouts, sketched after this list, keep split votes rare).
  • Slow followers: drag commit if synchronous; lag on async paths.
  • Misconfigured quorum: even number of nodes or wrong majority math.
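On the leader-failure point above: Raft keeps the election-delay window short by randomizing election timeouts so followers rarely become candidates at the same instant. A minimal sketch with illustrative values:

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// electionTimeout returns a randomized timeout in a typical Raft range
// (here 150-300 ms; values illustrative). The spread makes it unlikely
// that two followers time out simultaneously, which is what causes
// split votes and repeated elections.
func electionTimeout() time.Duration {
	return time.Duration(150+rand.Intn(150)) * time.Millisecond
}

func main() {
	// Each follower draws its own timeout; the first to expire becomes a
	// candidate, and the spread keeps candidacies from colliding.
	for node := 1; node <= 3; node++ {
		fmt.Printf("node %d starts an election after %v without a heartbeat\n",
			node, electionTimeout())
	}
}
```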

Alternatives

  • Primary-secondary without consensus: simpler but manual failover risk.
  • CRDTs / last-write-wins: avoid consensus for some high-scale, low-coordination data; a different correctness model (convergence rather than a single agreed order, see the sketch below).
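For the CRDT alternative, a minimal sketch of a grow-only counter (illustrative, not a production library): replicas increment locally with no coordination and converge by taking element-wise maxima on merge, trading a single agreed order for guaranteed convergence:

```go
package main

import "fmt"

// GCounter is a grow-only counter CRDT: each replica increments only its
// own slot, and merge takes the element-wise maximum, so replicas
// converge without consensus.
type GCounter map[string]int

func (c GCounter) Inc(replica string) { c[replica]++ }

func (c GCounter) Value() int {
	total := 0
	for _, v := range c {
		total += v
	}
	return total
}

func (c GCounter) Merge(other GCounter) {
	for r, v := range other {
		if v > c[r] {
			c[r] = v
		}
	}
}

func main() {
	a, b := GCounter{}, GCounter{}
	a.Inc("a")
	a.Inc("a")
	b.Inc("b")
	a.Merge(b)             // no quorum, no leader: exchange state, take maxima
	fmt.Println(a.Value()) // 3
}
```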
