Sharding
What it is
Sharding (horizontal partitioning) splits data across multiple physical nodes so each holds a subset of rows or keys. Clients or middleware route queries to the correct shard using a shard key.
Strategies
- Range sharding: consecutive key ranges per shard (e.g. user ids 0–1M on shard A). Range scans are easy; hot ranges can overload one shard.
- Hash sharding: hash(shardKey) mod numShards spreads load more evenly; range queries across shards become harder because matching keys are scattered.
- Directory / lookup service: a central map from key to shard; flexible resharding at the cost of an extra hop and another component to scale.
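A minimal sketch of the hash-sharding route, in Python. The shard count and key format are assumptions for illustration; the key point is to use a stable hash rather than a per-process one so every router agrees.

```python
import hashlib

NUM_SHARDS = 8  # assumed fixed shard count for this sketch


def shard_for(key: str) -> int:
    """Route a key to a shard via hash(shardKey) mod numShards."""
    # Use a stable hash (not Python's salted built-in hash()) so
    # routing is consistent across processes and restarts.
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS


# Every router using the same hash and shard count computes the same result.
assert shard_for("user:42") == shard_for("user:42")
```

Note the built-in weakness: changing `NUM_SHARDS` re-homes most keys, which is what consistent hashing (below, under Resharding) mitigates.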
hash(userId) --> shard index --> DB instance
range(time) --> time shard --> (watch hot days)
Hot keys
A shard key that is too popular (celebrity user, viral object) puts disproportionate load on one node.
Mitigations:
- Cache at edge (see caching).
- Sub-shard: split the hot key into synthetic suffixed sub-keys and merge results in the application (advanced).
- Avoid low-cardinality shard keys that collapse traffic.
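The sub-sharding mitigation above can be sketched as follows. The suffix count and key format are assumptions; writers scatter, readers fan out and merge.

```python
import random

SPLITS = 4  # number of synthetic sub-keys per hot key (assumed)


def write_key(hot_key: str) -> str:
    """Writers append a random suffix, spreading one hot key across SPLITS sub-keys
    (which hash to different shards)."""
    return f"{hot_key}#{random.randrange(SPLITS)}"


def read_keys(hot_key: str) -> list:
    """Readers fan out to every sub-key and merge the results in the application."""
    return [f"{hot_key}#{i}" for i in range(SPLITS)]
```

The trade-off: writes stay cheap, but every read now pays a SPLITS-way fan-out plus an application-side merge, so this is worth it only for genuinely hot keys.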
Resharding
Growing data volume or skew eventually forces you to split, merge, or rebalance shards.
- Double-write / dual read during migration; verify checksums; switch traffic gradually.
- Consistent hashing on a ring reduces the fraction of keys that move when adding nodes (also used by load-balancer algorithms).
- Plan downtime vs online migration; use backlog and rate limits to protect source DB.
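The consistent-hashing point can be made concrete with a toy ring. Node names, virtual-node count, and hash choice are assumptions; the takeaway is that adding a fourth node moves only roughly a quarter of the keys, versus most keys under naive hash-mod-N.

```python
import bisect
import hashlib


def _hash(s: str) -> int:
    return int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big")


class HashRing:
    """Minimal consistent-hash ring; virtual nodes smooth the distribution."""

    def __init__(self, nodes, vnodes=100):
        # Each physical node owns many points ("virtual nodes") on the ring.
        self._ring = sorted(
            (_hash(f"{n}#{v}"), n) for n in nodes for v in range(vnodes)
        )
        self._points = [p for p, _ in self._ring]

    def node_for(self, key: str) -> str:
        # Walk clockwise to the first virtual node at or after the key's hash.
        i = bisect.bisect(self._points, _hash(key)) % len(self._ring)
        return self._ring[i][1]


ring = HashRing(["db-a", "db-b", "db-c"])
before = {f"user:{i}": ring.node_for(f"user:{i}") for i in range(1000)}

ring2 = HashRing(["db-a", "db-b", "db-c", "db-d"])  # one node added
moved = sum(before[k] != ring2.node_for(k) for k in before)
# moved is near 1000/4: only keys claimed by the new node re-home.
```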
When to use
- A single-node DB cannot handle the data volume or QPS even after vertical scaling, and read replicas do not help because writes are the bottleneck.
- Regulatory data residency (shard by region).
Alternatives
- Partitioned tables inside one managed cluster (BigQuery, Citus-style): shard-like distribution without you operating the shards.
- No sharding: stronger consistency story; simpler ops until you must scale out.
Failure modes
- Cross-shard transactions: expensive or unsupported; design aggregates without global transactions.
- Rebalancing bugs: partial writes, wrong routing table.
- Uneven growth: monitor shard size and QPS per shard.
Interview talking points
- Always state your shard key choice and hot-key plan; use back-of-the-envelope estimates for growth per shard.
- Link resharding to zero-downtime stories and monitoring.
- Compare to Dynamo-style partitioning in sql-vs-nosql.
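A back-of-the-envelope growth-per-shard check might look like this. All numbers are hypothetical inputs for the exercise, not recommendations.

```python
# Hypothetical inputs for a sizing exercise.
total_rows = 4_000_000_000          # 4B rows today (assumed)
row_bytes = 1_000                   # ~1 KB average row (assumed)
growth_per_year = 1.5               # 50% annual growth (assumed)
num_shards = 16
target_shard_bytes = 500 * 1024**3  # keep each shard under ~500 GiB (assumed)

bytes_per_shard_now = total_rows * row_bytes / num_shards
bytes_per_shard_2y = bytes_per_shard_now * growth_per_year**2

print(f"per shard now:   {bytes_per_shard_now / 1024**3:.0f} GiB")
print(f"per shard in 2y: {bytes_per_shard_2y / 1024**3:.0f} GiB")
print("reshard needed within 2y:", bytes_per_shard_2y > target_shard_bytes)
```

Walking through numbers like these out loud (data per shard now, at the growth horizon, against a per-shard ceiling) is exactly the estimate the talking point above asks for.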
Related fundamentals
- Back-of-the-envelope estimation
- Latency and throughput — per-shard QPS and tail latency under skew