Storage engines, CAP & distributed data
This page matches the SDE3 / staff bar: not just using RDS or DynamoDB, but explaining on-disk structure, trade-offs under partitions, and what breaks at scale. Stacks that combine Node.js, AWS, and Docker usually hit Amazon RDS (Postgres/MySQL), DynamoDB (or a Cassandra-class wide-column store), and ElastiCache; the wiring between them should account for each store's failure modes.
B-trees vs LSM-trees (storage engine families)
| Aspect | B-tree family (many RDBMS) | LSM-tree family (DynamoDB/RocksDB/Cassandra-style) |
|---|---|---|
| Write path | In-place or copy-on-write pages; updates touch indexed pages | Append to memtable, then flush to sorted SSTables; compaction rewrites |
| Read path | Typically fewer seeks per key lookup (tree height) | May read several SSTable generations + bloom filters |
| Strength | Predictable read latency; mature range scans in RDBMS | High write throughput; good for append-heavy workloads with tuned compaction |
| Pain | Write amplification on hot pages; page splits | Compaction storms; read amplification if tuning is poor; stale merged view during compaction backlog |
Interview line: “Postgres/MySQL engines are B-tree–centric for OLTP; Dynamo and many NoSQL stacks use LSM-style layers because they optimize for partitioned, high-ingest paths—then I’d ask about p99 during compaction and hot keys.”
AWS lens: RDS (Postgres/MySQL) → think B-tree, WAL, checkpoints, vacuum/bloat. DynamoDB → Amazon’s managed LSM-like partitioned store; throttling and hot partitions dominate ops stories.
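To make the LSM write/read path concrete, here is a toy sketch in TypeScript. It is illustration only: real engines add a WAL, bloom filters, and background compaction, and every name here is made up.

```ts
// Toy LSM path: writes go to an in-memory sorted memtable; at a threshold the
// memtable flushes to an immutable "SSTable"; reads check the memtable first,
// then SSTables newest-first (this is where read amplification comes from).
type SSTable = Map<string, string>; // stand-in for a sorted, immutable on-disk file

const MEMTABLE_LIMIT = 4;
let memtable = new Map<string, string>();
const sstables: SSTable[] = []; // newest last; compaction would merge these

function put(key: string, value: string): void {
  memtable.set(key, value); // append-cheap; a WAL would make this crash-safe
  if (memtable.size >= MEMTABLE_LIMIT) {
    // Flush: sort entries by key and seal them into a new SSTable generation.
    const sorted = [...memtable.entries()].sort(([a], [b]) => a.localeCompare(b));
    sstables.push(new Map(sorted));
    memtable = new Map();
  }
}

function get(key: string): string | undefined {
  if (memtable.has(key)) return memtable.get(key);
  for (let i = sstables.length - 1; i >= 0; i--) {
    if (sstables[i].has(key)) return sstables[i].get(key); // may probe many generations
  }
  return undefined; // a bloom filter per SSTable would skip most of these probes
}
```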
Topic mind map (ASCII)
```
Storage engines & distributed data
├── On disk
│   ├── B-tree family (RDS OLTP) .......... seeks, pages, splits
│   └── LSM family (Dynamo/Cassandra) ..... memtable, SSTables, compaction
├── Indexes
│   ├── clustered vs secondary vs heap .... locality & hops
│   └── composite / spatial / over-index .. planner + write cost
├── Theory
│   ├── CAP ............................... partition → A vs C
│   └── PACELC ............................ else → L vs C (replica lag)
├── Replication
│   ├── single-leader ..................... RDS primary + replicas
│   ├── multi-leader ...................... conflicts / merge
│   └── leaderless + quorum ............... W+R>N overlap intuition
└── Partitioning
    ├── hash vs range ..................... skew patterns
    ├── consistent hashing ................ ring, virtual nodes
    └── hot partition (celebrity) ......... throttle, cache, key split
```
Indexing deep dive: clustered, composite, spatial, over-indexing
Clustered vs non-clustered
| Concept | Meaning | Typical engines |
|---|---|---|
| Clustered index | Table rows are stored in the order of that index (or directly in the index structure) | InnoDB primary key is clustered; SQL Server clustered PK |
| Non-clustered / secondary | Index entries point to row/PK/row id; extra hop | Secondary indexes on InnoDB → back to PK |
| Heap table + indexes | Rows live in a heap; indexes reference row locations | PostgreSQL default: heap + separate indexes (no clustered PK in the MySQL sense) |
Staff: “Clustered PK chooses physical locality for range scans on that key; random UUID PK can fragment inserts—sometimes time-ordered or sequenced keys trade insert locality vs security.”
Composite indexes
- The leading column must appear in the WHERE clause for the index to be used as you expect.
- A range predicate on a middle column can stop later columns from being used in classic B-tree composites (see the sketch below).
(Covered in more depth on Indexing & query tuning.)
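A quick way to see leftmost-prefix behavior is to EXPLAIN the same query shapes against a composite index. A minimal Node.js sketch with pg (table, column, and index names are hypothetical):

```ts
import { Pool } from "pg";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

// Assumed index for illustration:
//   CREATE INDEX orders_tenant_created ON orders (tenant_id, created_at);
async function comparePlans(): Promise<void> {
  const queries = [
    // Leading column matched: planner can walk the composite index.
    "SELECT * FROM orders WHERE tenant_id = 42 AND created_at > now() - interval '1 day'",
    // Leading column missing: expect a sequential scan (or a different index).
    "SELECT * FROM orders WHERE created_at > now() - interval '1 day'",
  ];
  for (const sql of queries) {
    const { rows } = await pool.query(`EXPLAIN ${sql}`);
    console.log(sql, "\n", rows.map((r) => r["QUERY PLAN"]).join("\n"), "\n");
  }
  await pool.end();
}

comparePlans().catch(console.error);
```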
Spatial indexes
- R-tree, GiST (Postgres), etc. support nearest-neighbor and containment queries for geo data.
- Cost: larger index structures and less familiar EXPLAIN output; still subject to selectivity and SRID mistakes.
Over-indexing cost
| Cost | Why it matters |
|---|---|
| Write amplification | Every INSERT/UPDATE/DELETE touches each index |
| Planner confusion | Too many similar indexes → wrong plan choice |
| Disk / memory | Buffer pool churn, slower backups & replication |
| Lock granularity | Extra indexes can widen lock or version pressure |
Rule: index from measured slow queries, not anticipation.
CAP theorem & PACELC
CAP (simplified): during a network partition, you cannot have both linearizability (the "C" usually meant in talks) and full availability for every operation—P forces the choice.
Reality check: CAP is not “pick two forever.” It’s a partition scenario lens; most engineering is latency + consistency under normal conditions.
PACELC extends: Else (no partition)—choose Latency vs Consistency.
| If… | Trade |
|---|---|
| Partition | A vs C (availability of conflicting operations vs rejecting/stalling) |
| Normal | L vs C (fast possibly stale vs slower fresher) |
DynamoDB / Cassandra class: tunable R/W quorums and eventual paths—normal ops optimize for L until you turn up consistency (quorum or strongly consistent reads).
RDS: single-region primary + replicas → app still sees replica lag (PACELC “else”: you chose L on the read path).
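In DynamoDB this L-vs-C knob is a literal request flag. A minimal sketch with the AWS SDK v3 (table and key names are made up):

```ts
import { DynamoDBClient, GetItemCommand } from "@aws-sdk/client-dynamodb";

const client = new DynamoDBClient({});

// Default reads are eventually consistent (lower latency, may miss a very
// recent write); ConsistentRead: true trades latency (and double the read
// capacity) for freshness. The PACELC "else" choice, per request.
async function getProfile(userId: string, fresh: boolean) {
  const out = await client.send(
    new GetItemCommand({
      TableName: "user-profiles",     // hypothetical table
      Key: { userId: { S: userId } }, // hypothetical key schema
      ConsistentRead: fresh,
    })
  );
  return out.Item;
}
```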
Replication strategies & quorum
Single-leader (primary / replica)
- All writes to leader; replicas async or sync chain.
- RDS: Multi-AZ for durability/failover; read replicas off the path for scale (lag).
Multi-leader
- Multiple acceptors of writes; conflict resolution required (last-write-wins, CRDTs, app merge).
- Useful multi-region; hard—interviewers probe conflict stories.
Leaderless (Dynamo-style)
- Any replica can serve reads/writes; correctness leans on quorum, version vectors or LWW, and background convergence:
- Read repair — a read touches R replicas; if one is stale, issue a repair write so replicas catch up (sketch below).
- Hinted handoff — if a replica is down, a peer stores a hint and hands off writes when it returns (reduces temporary divergence).
- Anti-entropy — periodic Merkle-tree (or similar) diffs between replicas to find missing data.
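A minimal read-repair sketch, assuming a simple monotonic version number per key (LWW-style; real systems often use version vectors). The Replica interface is invented for illustration:

```ts
// Read from several replicas, return the newest version, and repair any
// replica that returned something older, without blocking the read on repairs.
type Versioned = { value: string; version: number } | undefined;

interface Replica {
  get(key: string): Promise<Versioned>;
  put(key: string, v: { value: string; version: number }): Promise<void>;
}

async function readWithRepair(replicas: Replica[], key: string): Promise<Versioned> {
  const results = await Promise.all(replicas.map((r) => r.get(key)));
  const newest = results.reduce<Versioned>(
    (best, cur) => (cur && (!best || cur.version > best.version) ? cur : best),
    undefined
  );
  if (newest) {
    results.forEach((res, i) => {
      // Fire-and-forget repair writes for stale or missing copies.
      if (!res || res.version < newest.version) void replicas[i].put(key, newest);
    });
  }
  return newest;
}
```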
Quorum (W + R > N intuition)
For N replicas:
- W = minimum nodes that ack write
- R = minimum nodes read from
Often taught: if W + R > N, the read set must overlap the write set, so reads see the write (whether that yields tunable-eventual or strong semantics is vendor-specific). DynamoDB exposes explicit consistency knobs (e.g. strongly consistent reads on demand) rather than letting you set W/R manually—but the mental model is still asked about (see the sketch after the table).
| Pattern | Narrative |
|---|---|
| R=1, W=1 | fast, risky stale |
| W majority | durability bias |
| R+W > N | overlap argument in interviews |
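The overlap argument is just pigeonhole counting; a brute-force check for small N makes it concrete:

```ts
// Every W-sized write set intersects every R-sized read set iff W + R > N.
function alwaysOverlaps(n: number, w: number, r: number): boolean {
  const masks = Array.from({ length: 1 << n }, (_, m) => m); // all subsets as bitmasks
  const ofSize = (k: number) => masks.filter((m) => popcount(m) === k);
  return ofSize(w).every((wm) => ofSize(r).every((rm) => (wm & rm) !== 0));
}

function popcount(m: number): number {
  let c = 0;
  for (; m; m &= m - 1) c++;
  return c;
}

console.log(alwaysOverlaps(3, 2, 2)); // true:  2 + 2 > 3, any two pairs share a node
console.log(alwaysOverlaps(3, 1, 1)); // false: disjoint singleton sets exist
```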
Partitioning (sharding): hash vs range
| Strategy | Pros | Cons |
|---|---|---|
| Hash of partition key | Spread load, simple routing | Range queries across keys expensive |
| Range | Range scans local to shard | Hot ranges (“latest” feeds) |
| Composite | SKU + tenant etc. | design mistakes → uneven shards |
Consistent hashing
- Reduces full reshuffles when nodes join/leave—keys map to a ring; only neighbors move data.
- Interview: mention virtual nodes to balance physical host skew.
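A minimal consistent-hash ring with virtual nodes in TypeScript. This is a sketch: linear scan instead of binary search, md5 as a stand-in hash, and node names are invented.

```ts
import { createHash } from "node:crypto";

const VNODES = 100; // points per physical node; more vnodes → smoother balance

// First 8 hex chars of md5 → 32-bit ring position. Any stable hash works.
function hash(key: string): number {
  return parseInt(createHash("md5").update(key).digest("hex").slice(0, 8), 16);
}

class Ring {
  private points: { pos: number; node: string }[] = [];

  add(node: string): void {
    for (let v = 0; v < VNODES; v++) {
      this.points.push({ pos: hash(`${node}#${v}`), node });
    }
    this.points.sort((a, b) => a.pos - b.pos);
  }

  remove(node: string): void {
    this.points = this.points.filter((p) => p.node !== node);
  }

  // Walk clockwise: the first point at or after the key's position owns it.
  lookup(key: string): string {
    const pos = hash(key);
    const hit = this.points.find((p) => p.pos >= pos) ?? this.points[0];
    return hit.node;
  }
}

const ring = new Ring();
["node-a", "node-b", "node-c"].forEach((n) => ring.add(n));
console.log(ring.lookup("user:123"));
ring.remove("node-b"); // only keys that mapped to node-b's points move
```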
Hot partitions (“celebrity problem”)
- One key absorbs disproportionate traffic → throttling (DynamoDB), splitting the keyspace, caching, write sharding (split the logical key into N physical keys; see the sketch below), rate limits.
Node + Docker: many containers share one logical “service”—they still hit one partition if the key is wrong. Fix is data model, not “more pods.”
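A write-sharding sketch for a hot key: writes pick a random suffix, reads fan out across all suffixes and merge in the app. The shard count and key shapes are illustrative.

```ts
import { randomInt } from "node:crypto";

const SHARDS = 8; // physical keys per hot logical key

// Writes scatter: "celebrity:42" becomes e.g. "celebrity:42#3".
function writeKey(logicalKey: string): string {
  return `${logicalKey}#${randomInt(SHARDS)}`;
}

// Reads gather: query every suffix (ideally batched) and merge in the app.
function readKeys(logicalKey: string): string[] {
  return Array.from({ length: SHARDS }, (_, i) => `${logicalKey}#${i}`);
}

console.log(writeKey("celebrity:42")); // one random shard takes this write
console.log(readKeys("celebrity:42")); // all shards are consulted on read
```

This works when the merge is cheap (counters, feeds); it multiplies read cost by N, so it is a deliberate trade, not a default.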
Node.js, AWS, Docker (how this shows up)
| Piece | Practice |
|---|---|
| Node | Single-threaded event loop; don’t do CPU-heavy compaction in-process; use RDS/Dynamo as dedicated stores |
| Connection pools | pg / mysql2 pools per container; too many replicas × too many pods = connection storm on RDS |
| Docker | Horizontal pods don’t fix hot partition; sticky in-memory cache in each pod ≠ shared consistency |
| RDS | B-tree + WAL mental model; failover DNS; replica lag for reads |
| DynamoDB | Provisioned vs on-demand, GSI/LSI access patterns, partition limits, consider DAX only with clear staleness story |
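The connection-storm row deserves numbers: total RDS connections ≈ pool max × containers per service × services. A per-container pg pool sketch (values are illustrative, not recommendations):

```ts
import { Pool } from "pg";

const pool = new Pool({
  connectionString: process.env.DATABASE_URL,
  max: 10,                        // per-container cap; keep (max × containers) under RDS max_connections
  idleTimeoutMillis: 30_000,      // shed idle connections during quiet periods
  connectionTimeoutMillis: 2_000, // fail fast instead of queueing forever during a storm
});

export async function getUser(id: string) {
  const { rows } = await pool.query("SELECT * FROM users WHERE id = $1", [id]);
  return rows[0];
}
```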
Interview questions & short answers
Q: When does an LSM store beat a B-tree for your workload?
A: Sustained high write or append-heavy ingest where batched compaction is acceptable and read p99 is tuned with bloom filters, level shape, and hot key isolation—not “because NoSQL.”
Q: Explain PACELC in one breath.
A: If partitioned: availability vs consistency; if not: latency vs freshness—replica reads are the everyday ELC trade.
Q: Why W+R>N?
A: Guarantees at least one replica in the read set saw the write’s quorum—overlap argument; map to vendor semantics honestly.
See also
- AWS data services & Node.js at scale
- CDC & outbox projections
- Replication, sharding & operations — backups, RPO/RTO, ops narratives
- Storage models & access paths — product-level SQL vs NoSQL
- Transactions & isolation — ACID/BASE, MVCC, 2PC vs saga
- Application caching & consistency — ElastiCache patterns