Replication, sharding & operations
For B-tree vs LSM, CAP/PACELC, quorum, consistent hashing, and leaderless models, see *Storage engines, CAP & distributed data*; this page stays closer to operations, RDS/Dynamo-style failure stories, and backup math.
Core details
Single-leader replication: writes go to the primary; async replicas serve read scaling or analytics. Replication lag is normal; reads may be seconds behind.
Failover: promoting a replica carries split-brain and data-loss-window risks; the procedure (automatic vs operator-led) must be rehearsed.
Sharding / partitioning: split data by a partition key so most queries hit one shard; resharding is painful, so design the key early from the access paths.
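A minimal sketch of partition-key routing, assuming simple hash-modulo placement (names like `NUM_SHARDS` and `shard_for` are illustrative, not any particular database's API):

```python
import hashlib

NUM_SHARDS = 8  # illustrative; real systems use many more shards, or consistent hashing

def shard_for(partition_key: str) -> int:
    """Stable hash of the partition key -> shard id in [0, NUM_SHARDS)."""
    digest = hashlib.md5(partition_key.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# A query keyed by the partition key touches exactly one shard:
print(shard_for("user:12345"))
# A query without it ("all orders from last week") must fan out to every shard,
# which is why the key should be derived from the dominant access paths.
```

Hash-modulo placement is also exactly why resharding hurts: changing `NUM_SHARDS` remaps most keys, which is where consistent hashing (covered on the other page) comes in.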
Hot partition: one key (celebrity user) dominates a shard—mitigate with key redesign, salting, caching, or rate limits.
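A sketch of salting a hot key, assuming a counter-style workload; the bucket count and the in-memory store are stand-ins for whatever the real system uses:

```python
import random
from collections import defaultdict

SALT_BUCKETS = 16                 # assumption: enough buckets to spread the hot key
store = defaultdict(int)          # stand-in for the sharded store

def incr(hot_key: str) -> None:
    salt = random.randrange(SALT_BUCKETS)
    store[f"{hot_key}#{salt}"] += 1   # writes land on 16 sub-keys instead of one

def read(hot_key: str) -> int:
    # Reads fan in across the sub-keys; salting trades read cost for write spread.
    return sum(store[f"{hot_key}#{s}"] for s in range(SALT_BUCKETS))

for _ in range(1000):
    incr("likes:celebrity_user")
print(read("likes:celebrity_user"))   # 1000
```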
Backups:
| Concept | Interview one-liner |
|---|---|
| RPO | max acceptable data loss window |
| RTO | max acceptable downtime to restore |
| Restore drill | periodic proof that backups are actually recoverable |
Logical vs physical backups trade portability against speed and size: logical dumps are portable across versions and engines but slow to restore; physical (file- or block-level) backups restore fast but are tied to the engine and version.
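Back-of-envelope backup math, with assumed numbers (nightly base backup, WAL archived every 5 minutes; restore throughput and replay time are placeholders to swap for your own):

```python
# RPO: worst-case data loss if the primary dies right before the next WAL archive.
wal_archive_interval_min = 5
rpo_minutes = wal_archive_interval_min

# RTO: copy the base backup back, then replay WAL accumulated since it was taken.
base_backup_gb = 500
restore_throughput_gb_per_s = 0.5     # assumption: network/disk limited
wal_replay_hours = 1.5                # assumption: depends on write volume since backup
rto_hours = base_backup_gb / restore_throughput_gb_per_s / 3600 + wal_replay_hours

print(f"RPO ~ {rpo_minutes} min, RTO ~ {rto_hours:.1f} h")
```

The restore drill is what keeps these numbers honest: the RTO you quote should be the measured restore time, not the theoretical one.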
Observability: lag metrics, replication slot backlog, disk growth, checkpoint behavior—tie alerts to customer-visible symptoms where possible.
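A tiny sketch of tying a lag alert to a customer-visible promise rather than an arbitrary internal number (the 30-second staleness budget is an assumption):

```python
def should_page(replica_lag_seconds: float, staleness_budget_s: float = 30.0) -> bool:
    """Page only when lag exceeds what the product promises users about staleness."""
    return replica_lag_seconds > staleness_budget_s

print(should_page(4.2))    # False: lag exists, but users will not notice
print(should_page(95.0))   # True: stale reads are now customer-visible
```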
Understanding
Claiming “strong consistency on replicas” without sticky routing to the primary or version checks invites bugs: the user sees a stale profile right after an update unless you implement read-your-writes.
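A sketch of read-your-writes via sticky routing, assuming a primary/replica pair behind a tiny session-aware router; `Session`, the window length, and the in-memory dicts are all illustrative:

```python
import time

primary: dict = {}
replica: dict = {}   # stand-in: refreshed asynchronously, so it lags the primary

class Session:
    def __init__(self):
        self.last_write_at = float("-inf")

    def write(self, key, value):
        primary[key] = value
        self.last_write_at = time.monotonic()

    def read(self, key, sticky_window_s: float = 5.0):
        # Sticky routing: for a short window after this session writes,
        # read from the primary so the user sees their own update.
        if time.monotonic() - self.last_write_at < sticky_window_s:
            return primary.get(key)
        return replica.get(key)   # otherwise the (possibly stale) replica is fine

s = Session()
s.write("profile:42", "new bio")
print(s.read("profile:42"))   # "new bio", even though the replica is still stale
```

The alternative is a version check: the client remembers the version it wrote and retries or falls back to the primary if a replica returns an older one.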
Global secondary indexes in sharded systems mean cross-shard (scatter-gather) work, so cost jumps; prefer locality with the primary access pattern.
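A sketch of why that costs more: a partition-key lookup touches one shard, while a query on a non-key attribute must scatter to every shard and gather the results (shard layout and data are illustrative):

```python
shards = [
    {"user:1": {"email": "a@x.com"}},   # shard 0
    {"user:2": {"email": "b@x.com"}},   # shard 1
    {"user:3": {"email": "c@x.com"}},   # shard 2
]

def find_by_email(email: str) -> list[str]:
    # GSI-style query: fan out to all shards, filter locally, merge the results.
    # Latency tracks the slowest shard; cost grows with the shard count.
    return [key for shard in shards
                for key, row in shard.items() if row["email"] == email]

print(find_by_email("c@x.com"))   # ['user:3']
```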
Senior understanding
| Failure | Story elements |
|---|---|
| Replica promoted with lag | in-flight writes, split brain, fencing the old primary |
| Backup missing WAL | cannot point-in-time recover—policy gap |
| Shard imbalance | monitoring per-shard QPS/size; reshard plan |
| AZ outage | availability zone placement, quorum if using consensus stores |
Cross-link to idempotent writers: when failover causes duplicate delivery, writes must be safe to apply twice (idempotency).
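A sketch of an idempotent writer, assuming each logical write carries an idempotency key; the in-memory `processed` set stands in for a durable record kept in the same transaction as the write:

```python
processed: set[str] = set()
balance = 0

def apply_payment(idempotency_key: str, amount: int) -> None:
    global balance
    if idempotency_key in processed:   # redelivered after failover
        return                         # no-op: the effect already happened
    balance += amount
    processed.add(idempotency_key)     # in production: same transaction as the write

apply_payment("order-123-charge", 50)
apply_payment("order-123-charge", 50)  # duplicate delivery; safely ignored
print(balance)                         # 50, not 100
```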
Fencing tokens & stale primaries
After a failover, the old primary might still accept writes (split brain) if it still believes it is the leader.
Fencing: attach the current epoch (lease / fencing token / ballot number) to each write; storage rejects writes carrying an outdated token. Alternative: a STONITH-class “make the old node stop” mechanism at the infrastructure level.
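A sketch of fencing at the storage layer, assuming each newly elected leader receives a strictly larger epoch; the class and field names are illustrative:

```python
class FencedStore:
    def __init__(self):
        self.highest_epoch = 0   # highest epoch this store has ever seen
        self.data = {}

    def write(self, epoch: int, key: str, value) -> bool:
        if epoch < self.highest_epoch:
            return False          # stale primary still thinks it leads: reject
        self.highest_epoch = epoch
        self.data[key] = value
        return True

store = FencedStore()
print(store.write(epoch=1, key="x", value="from old primary"))    # True
print(store.write(epoch=2, key="x", value="from new primary"))    # True
print(store.write(epoch=1, key="x", value="late, stale write"))   # False: fenced off
```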