THN Interview Prep

Replication, sharding & operations

For B-tree vs LSM, CAP/PACELC, quorum, consistent hashing, and leaderless models, see Storage engines, CAP & distributed data—this page stays closer to operations, RDS/Dynamo-style failure stories, and backup math.

Core details

Single-leader replication: writes go to primary; async replicas serve read scale or analytics. Replication lag is normal—reads may be seconds behind.

Failover: promoting a replica risks split brain and a data-loss window; rehearse the procedure (automatic vs operator-led).

Sharding / partitioning: split data by partition key so most queries hit one shard; resharding is painful—design key early from access paths.
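A minimal sketch of partition-key routing (names like `shard_for` are illustrative, not from any specific system). The stable-hash detail matters: Python's built-in `hash()` is salted per process, so a library hash is used instead.

```python
import hashlib

def shard_for(key: str, num_shards: int) -> int:
    """Route a partition key to a shard via a stable hash.

    A stable hash (not Python's per-process salted hash()) keeps routing
    deterministic across processes and restarts. Note that plain modulo
    routing remaps most keys when num_shards changes -- one reason
    resharding is painful and the key should be designed early.
    """
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

# All rows for one user hash to the same shard, so single-user
# queries touch exactly one node.
assert shard_for("user:42", 8) == shard_for("user:42", 8)
```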

Hot partition: one key (celebrity user) dominates a shard—mitigate with key redesign, salting, caching, or rate limits.
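Key salting can be sketched as follows (a toy illustration, not a specific product's API): writes spread across sub-keys, at the cost of read fan-out.

```python
import random

def salted_key(key: str, fanout: int) -> str:
    """Spread writes for a hot key across `fanout` sub-keys so no
    single partition absorbs all the traffic."""
    return f"{key}#{random.randrange(fanout)}"

def all_salted_keys(key: str, fanout: int) -> list[str]:
    """The trade-off: readers must fan out across every salt and
    merge the results to reassemble the logical row."""
    return [f"{key}#{i}" for i in range(fanout)]

# A write lands on one of the sub-keys a reader would scan.
assert salted_key("celebrity", 4) in all_salted_keys("celebrity", 4)
```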

Backups:

| Concept | Interview one-liner |
| --- | --- |
| RPO | max acceptable data loss window |
| RTO | max acceptable downtime to restore |
| Restore drill | periodic proof backups are recoverable |

Logical vs physical backups trade portability vs speed and size.
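The RPO/RTO definitions above reduce to simple backup math. A sketch (function names and the throughput figures are illustrative assumptions):

```python
def estimate_rpo_min(snapshot_interval_min: float,
                     wal_archive_lag_min: float = None) -> float:
    """Worst-case data loss. Without WAL archiving you can lose a full
    snapshot interval; with point-in-time recovery, loss shrinks to
    roughly the WAL archive lag."""
    if wal_archive_lag_min is None:
        return snapshot_interval_min
    return wal_archive_lag_min

def estimate_rto_min(data_gb: float, restore_gb_per_min: float,
                     wal_replay_min: float = 0.0) -> float:
    """Restore time: copy the base backup, then replay WAL taken since."""
    return data_gb / restore_gb_per_min + wal_replay_min

# Nightly snapshots with no WAL archiving: up to a day of loss.
assert estimate_rpo_min(24 * 60) == 1440
# 500 GB restored at 5 GB/min plus 20 min of WAL replay: 2 hours.
assert estimate_rto_min(500, 5, 20) == 120
```

This is why a restore drill matters: the RTO estimate is fiction until a real restore has been timed.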

Observability: lag metrics, replication slot backlog, disk growth, checkpoint behavior—tie alerts to customer-visible symptoms where possible.
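"Tie alerts to customer-visible symptoms" can be made concrete: alert on replica lag relative to the staleness users can tolerate, not on an arbitrary raw number. A minimal sketch (thresholds and names are assumptions):

```python
def lag_alert(lag_s: float, user_staleness_slo_s: float) -> str:
    """Grade replication lag against the staleness SLO users actually
    experience, rather than a fixed raw threshold."""
    if lag_s >= user_staleness_slo_s:
        return "page"    # users now see data older than promised
    if lag_s >= 0.5 * user_staleness_slo_s:
        return "ticket"  # trending toward an SLO breach
    return "ok"

assert lag_alert(2, 30) == "ok"
assert lag_alert(20, 30) == "ticket"
assert lag_alert(45, 30) == "page"
```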

Understanding

Claiming "strong consistency" while reading from replicas invites bugs unless you add sticky routing to the primary or version checks: a user who updates their profile sees the stale version until read-your-writes is enforced.
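One common read-your-writes tactic is session-based routing: pin a user's reads to the primary for a short window after they write. A minimal sketch under that assumption (class and parameter names are hypothetical):

```python
import time

class SessionRouter:
    """Route a user's reads to the primary for a short window after a
    write, so they see their own updates despite replica lag."""

    def __init__(self, max_expected_lag_s: float = 5.0):
        self.max_expected_lag_s = max_expected_lag_s
        self._last_write = {}  # user_id -> monotonic timestamp

    def record_write(self, user_id: str) -> None:
        self._last_write[user_id] = time.monotonic()

    def route_read(self, user_id: str) -> str:
        since = time.monotonic() - self._last_write.get(user_id, float("-inf"))
        return "primary" if since < self.max_expected_lag_s else "replica"

router = SessionRouter(max_expected_lag_s=5.0)
assert router.route_read("alice") == "replica"  # no recent write
router.record_write("alice")
assert router.route_read("alice") == "primary"  # sticky after update
```

The window should exceed worst-case replica lag, which is why the lag metrics under Observability feed directly into this knob.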

Global secondary indexes in sharded systems mean cross-shard work—cost jumps; prefer primary access pattern locality.
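The cost jump is visible in a toy scatter-gather sketch (all names here are illustrative): a primary-key lookup touches one shard, while a secondary-attribute query without a global index must touch every shard.

```python
def shard_of(key: str, n: int) -> int:
    return sum(key.encode()) % n  # toy stable hash for the sketch

def lookup_by_key(shards: list, key: str):
    """Primary-key access: the hash routes to exactly one shard."""
    return shards[shard_of(key, len(shards))].get(key), 1  # shards touched

def lookup_by_email(shards: list, email: str):
    """Secondary-attribute access without a global index: scatter to
    every shard and gather matches -- cost grows with shard count."""
    hits = [r for s in shards for r in s.values() if r["email"] == email]
    return hits, len(shards)

shards = [dict() for _ in range(4)]
row = {"email": "a@example.com"}
shards[shard_of("user:1", 4)]["user:1"] = row
assert lookup_by_key(shards, "user:1") == (row, 1)
assert lookup_by_email(shards, "a@example.com") == ([row], 4)
```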

Senior understanding

| Failure | Story elements |
| --- | --- |
| Replica promoted with lag | in-flight writes, split brain, fencing old primary |
| Backup missing WAL | cannot point-in-time recover; policy gap |
| Shard imbalance | per-shard QPS/size monitoring; reshard plan |
| AZ outage | availability-zone placement; quorum if using consensus stores |

Cross-link: failover can cause duplicate delivery, so pair it with idempotent writers (see idempotency).

Fencing tokens & stale primaries

After a failover, the old primary might still accept writes (split brain) if it believes it is leader.

Fencing = attach the current epoch (lease / fencing token / ballot number) to each write; storage rejects writes carrying an outdated token. Alternative: STONITH-style "make the old node stop" at the infrastructure level.
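The storage-side check can be sketched in a few lines (a toy model; the class name and shape are illustrative, not a real store's API):

```python
class FencedStore:
    """Storage that rejects writes carrying a stale epoch/fencing token."""

    def __init__(self):
        self.highest_epoch = 0
        self.data = {}

    def write(self, epoch: int, key: str, value) -> bool:
        if epoch < self.highest_epoch:
            return False  # stale primary: reject the zombie write
        self.highest_epoch = epoch  # remember the newest token seen
        self.data[key] = value
        return True

store = FencedStore()
assert store.write(1, "k", "old-primary")       # epoch 1 accepted
assert store.write(2, "k", "new-primary")       # failover bumps the epoch
assert not store.write(1, "k", "zombie write")  # old primary fenced off
assert store.data["k"] == "new-primary"
```

The key property: the old primary does not need to know it was deposed; the storage layer enforces the ordering for it.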

