Storage engines, CAP & distributed data
This page matches the SDE3 / staff bar: not just using RDS or DynamoDB, but explaining on-disk structure, trade-offs under partitions, and what breaks at scale. Stacks that combine Node.js, AWS, and Docker usually hit Amazon RDS (Postgres/MySQL), DynamoDB (or a Cassandra-class wide-column store), and ElastiCache; the wiring between them should account for each store's failure modes.
B-trees vs LSM-trees (storage engine families)
| Aspect | B-tree family (many RDBMS) | LSM-tree family (DynamoDB/RocksDB/Cassandra-style) |
|---|---|---|
| Write path | In-place or copy-on-write pages; updates touch indexed pages | Append to memtable, then flush to sorted SSTables; compaction rewrites |
| Read path | Typically fewer seeks per key lookup (tree height) | May read several SSTable generations + bloom filters |
| Strength | Predictable read latency; mature range scans in RDBMS | High write throughput; good for append-heavy workloads with tuned compaction |
| Pain | Write amplification on hot pages; page splits | Compaction storms; read amplification if tuning is poor; stale merged view during compaction backlog |
Interview line: “Postgres/MySQL engines are B-tree–centric for OLTP; Dynamo and many NoSQL stacks use LSM-style layers because they optimize for partitioned, high-ingest paths—then I’d ask about p99 during compaction and hot keys.”
AWS lens: RDS (Postgres/MySQL) → think B-tree, WAL, checkpoints, vacuum/bloat. DynamoDB → Amazon’s managed LSM-like partitioned store; throttling and hot partitions dominate ops stories.
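To make the LSM write/read path concrete, here is a toy sketch in TypeScript. It is illustration only: real engines add a WAL, bloom filters, and background compaction, and every name here is made up.

```ts
// Toy LSM path: writes go to an in-memory sorted memtable; at a threshold the
// memtable flushes to an immutable "SSTable"; reads check the memtable first,
// then SSTables newest-first (this is where read amplification comes from).
type SSTable = Map<string, string>; // stand-in for a sorted, immutable on-disk file

const MEMTABLE_LIMIT = 4;
let memtable = new Map<string, string>();
const sstables: SSTable[] = []; // newest last; compaction would merge these

function put(key: string, value: string): void {
  memtable.set(key, value); // append-cheap; a WAL would make this crash-safe
  if (memtable.size >= MEMTABLE_LIMIT) {
    // Flush: sort entries by key and seal them into a new SSTable generation.
    const sorted = [...memtable.entries()].sort(([a], [b]) => a.localeCompare(b));
    sstables.push(new Map(sorted));
    memtable = new Map();
  }
}

function get(key: string): string | undefined {
  if (memtable.has(key)) return memtable.get(key);
  for (let i = sstables.length - 1; i >= 0; i--) {
    if (sstables[i].has(key)) return sstables[i].get(key); // may probe many generations
  }
  return undefined; // a bloom filter per SSTable would skip most of these probes
}
```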
Topic mind map (ASCII)
```
Storage engines & distributed data
├── On disk
│   ├── B-tree family (RDS OLTP) .......... seeks, pages, splits
│   └── LSM family (Dynamo/Cassandra) ..... memtable, SSTables, compaction
├── Indexes
│   ├── clustered vs secondary vs heap .... locality & hops
│   └── composite / spatial / over-index .. planner + write cost
├── Theory
│   ├── CAP ............................... partition → A vs C
│   └── PACELC ............................ else → L vs C (replica lag)
├── Replication
│   ├── single-leader ..................... RDS primary + replicas
│   ├── multi-leader ...................... conflicts / merge
│   └── leaderless + quorum ............... W+R>N overlap intuition
└── Partitioning
    ├── hash vs range ..................... skew patterns
    ├── consistent hashing ................ ring, virtual nodes
    └── hot partition (celebrity) ......... throttle, cache, key split
```
Indexing deep dive: clustered, composite, spatial, over-indexing
Clustered vs non-clustered
| Concept | Meaning | Typical engines |
|---|---|---|
| Clustered index | Table rows are stored in the order of that index (or directly in the index structure) | InnoDB primary key is clustered; SQL Server clustered PK |
| Non-clustered / secondary | Index entries point to row/PK/row id; extra hop | Secondary indexes on InnoDB → back to PK |
| Heap table + indexes | Rows live in a heap; indexes reference row locations | PostgreSQL default: heap + separate indexes (no clustered PK in the MySQL sense) |
Staff: “Clustered PK chooses physical locality for range scans on that key; random UUID PK can fragment inserts—sometimes time-ordered or sequenced keys trade insert locality vs security.”
Composite indexes
- The leading column must appear in the WHERE clause for the index to be used as you expect.
- A range predicate on a middle column can stop later columns from being used in classic B-tree composites (see the sketch below).
(Covered in more depth on Indexing & query tuning.)
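A quick way to see leftmost-prefix behavior is to EXPLAIN the same query shapes against a composite index. A minimal Node.js sketch with pg (table, column, and index names are hypothetical):

```ts
import { Pool } from "pg";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

// Assumed index for illustration:
//   CREATE INDEX orders_tenant_created ON orders (tenant_id, created_at);
async function comparePlans(): Promise<void> {
  const queries = [
    // Leading column matched: planner can walk the composite index.
    "SELECT * FROM orders WHERE tenant_id = 42 AND created_at > now() - interval '1 day'",
    // Leading column missing: expect a sequential scan (or a different index).
    "SELECT * FROM orders WHERE created_at > now() - interval '1 day'",
  ];
  for (const sql of queries) {
    const { rows } = await pool.query(`EXPLAIN ${sql}`);
    console.log(sql, "\n", rows.map((r) => r["QUERY PLAN"]).join("\n"), "\n");
  }
  await pool.end();
}

comparePlans().catch(console.error);
```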
Spatial indexes
- R-tree, GiST (Postgres), etc. support nearest-neighbor and containment queries for geo data.
- Cost: larger index structures and less familiar EXPLAIN output; still subject to selectivity and SRID mistakes.
Over-indexing cost
| Cost | Why it matters |
|---|---|
| Write amplification | Every INSERT/UPDATE/DELETE touches each index |
| Planner confusion | Too many similar indexes → wrong plan choice |
| Disk / memory | Buffer pool churn, slower backups & replication |
| Lock granularity | Extra indexes can widen lock or version pressure |
Rule: index from measured slow queries, not anticipation.
CAP theorem & PACELC
CAP (simplified): during a network partition, you cannot have both linearizability (the "C" usually meant in talks) and full availability for every operation—P forces the choice.
Reality check: CAP is not “pick two forever.” It’s a partition scenario lens; most engineering is latency + consistency under normal conditions.
PACELC extends: Else (no partition)—choose Latency vs Consistency.
| If… | Trade |
|---|---|
| Partition | A vs C (availability of conflicting operations vs rejecting/stalling) |
| Normal | L vs C (fast possibly stale vs slower fresher) |
DynamoDB / Cassandra class: tunable R/W quorums and eventual paths—normal ops optimize for L until you turn up consistency (quorum or strongly consistent reads).
RDS: single-region primary + replicas → app still sees replica lag (PACELC “else”: you chose L on the read path).
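In DynamoDB this L-vs-C knob is a literal request flag. A minimal sketch with the AWS SDK v3 (table and key names are made up):

```ts
import { DynamoDBClient, GetItemCommand } from "@aws-sdk/client-dynamodb";

const client = new DynamoDBClient({});

// Default reads are eventually consistent (lower latency, may miss a very
// recent write); ConsistentRead: true trades latency (and double the read
// capacity) for freshness. The PACELC "else" choice, per request.
async function getProfile(userId: string, fresh: boolean) {
  const out = await client.send(
    new GetItemCommand({
      TableName: "user-profiles",     // hypothetical table
      Key: { userId: { S: userId } }, // hypothetical key schema
      ConsistentRead: fresh,
    })
  );
  return out.Item;
}
```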
Replication strategies & quorum
Single-leader (primary / replica)
- All writes to leader; replicas async or sync chain.
- RDS: Multi-AZ for durability/failover; read replicas off the path for scale (lag).
Multi-leader
- Multiple acceptors of writes; conflict resolution required (last-write-wins, CRDTs, app merge).
- Useful multi-region; hard—interviewers probe conflict stories.
Leaderless (Dynamo-style)
- Any replica can serve reads/writes; correctness leans on quorum, version vectors or LWW, and background convergence:
- Read repair — a read touches R replicas; if one is stale, issue a repair write so replicas catch up (sketch below).
- Hinted handoff — if a replica is down, a peer stores a hint and hands off writes when it returns (reduces temporary divergence).
- Anti-entropy — periodic Merkle-tree (or similar) diffs between replicas to find missing data.
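A minimal read-repair sketch, assuming a simple monotonic version number per key (LWW-style; real systems often use version vectors). The Replica interface is invented for illustration:

```ts
// Read from several replicas, return the newest version, and repair any
// replica that returned something older, without blocking the read on repairs.
type Versioned = { value: string; version: number } | undefined;

interface Replica {
  get(key: string): Promise<Versioned>;
  put(key: string, v: { value: string; version: number }): Promise<void>;
}

async function readWithRepair(replicas: Replica[], key: string): Promise<Versioned> {
  const results = await Promise.all(replicas.map((r) => r.get(key)));
  const newest = results.reduce<Versioned>(
    (best, cur) => (cur && (!best || cur.version > best.version) ? cur : best),
    undefined
  );
  if (newest) {
    results.forEach((res, i) => {
      // Fire-and-forget repair writes for stale or missing copies.
      if (!res || res.version < newest.version) void replicas[i].put(key, newest);
    });
  }
  return newest;
}
```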
Quorum (W + R > N intuition)
For N replicas:
- W = minimum nodes that ack write
- R = minimum nodes read from
Often taught: if W + R > N, the read set must overlap the write set, so reads see the write (whether that yields tunable-eventual or strong semantics is vendor-specific). DynamoDB exposes explicit consistency knobs (e.g. strongly consistent reads on demand) rather than letting you set W/R manually—but the mental model is still asked about (see the sketch after the table).
| Pattern | Narrative |
|---|---|
| R=1, W=1 | fast, risky stale |
| W majority | durability bias |
| R+W > N | overlap argument in interviews |
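The overlap argument is just pigeonhole counting; a brute-force check for small N makes it concrete:

```ts
// Every W-sized write set intersects every R-sized read set iff W + R > N.
function alwaysOverlaps(n: number, w: number, r: number): boolean {
  const masks = Array.from({ length: 1 << n }, (_, m) => m); // all subsets as bitmasks
  const ofSize = (k: number) => masks.filter((m) => popcount(m) === k);
  return ofSize(w).every((wm) => ofSize(r).every((rm) => (wm & rm) !== 0));
}

function popcount(m: number): number {
  let c = 0;
  for (; m; m &= m - 1) c++;
  return c;
}

console.log(alwaysOverlaps(3, 2, 2)); // true:  2 + 2 > 3, any two pairs share a node
console.log(alwaysOverlaps(3, 1, 1)); // false: disjoint singleton sets exist
```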
Partitioning (sharding): hash vs range
| Strategy | Pros | Cons |
|---|---|---|
| Hash of partition key | Spread load, simple routing | Range queries across keys expensive |
| Range | Range scans local to shard | Hot ranges (“latest” feeds) |
| Composite | SKU + tenant etc. | design mistakes → uneven shards |
Consistent hashing
- Reduces full reshuffles when nodes join/leave—keys map to a ring; only neighbors move data.
- Interview: mention virtual nodes to balance physical host skew.
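A minimal consistent-hash ring with virtual nodes in TypeScript. This is a sketch: linear scan instead of binary search, md5 as a stand-in hash, and node names are invented.

```ts
import { createHash } from "node:crypto";

const VNODES = 100; // points per physical node; more vnodes → smoother balance

// First 8 hex chars of md5 → 32-bit ring position. Any stable hash works.
function hash(key: string): number {
  return parseInt(createHash("md5").update(key).digest("hex").slice(0, 8), 16);
}

class Ring {
  private points: { pos: number; node: string }[] = [];

  add(node: string): void {
    for (let v = 0; v < VNODES; v++) {
      this.points.push({ pos: hash(`${node}#${v}`), node });
    }
    this.points.sort((a, b) => a.pos - b.pos);
  }

  remove(node: string): void {
    this.points = this.points.filter((p) => p.node !== node);
  }

  // Walk clockwise: the first point at or after the key's position owns it.
  lookup(key: string): string {
    const pos = hash(key);
    const hit = this.points.find((p) => p.pos >= pos) ?? this.points[0];
    return hit.node;
  }
}

const ring = new Ring();
["node-a", "node-b", "node-c"].forEach((n) => ring.add(n));
console.log(ring.lookup("user:123"));
ring.remove("node-b"); // only keys that mapped to node-b's points move
```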
Hot partitions (“celebrity problem”)
- One key absorbs disproportionate traffic → throttling (DynamoDB), splitting the keyspace, caching, write sharding (split the logical key into N physical keys; see the sketch below), rate limits.
Node + Docker: many containers share one logical “service”—they still hit one partition if the key is wrong. Fix is data model, not “more pods.”
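A write-sharding sketch for a hot key: writes pick a random suffix, reads fan out across all suffixes and merge in the app. The shard count and key shapes are illustrative.

```ts
import { randomInt } from "node:crypto";

const SHARDS = 8; // physical keys per hot logical key

// Writes scatter: "celebrity:42" becomes e.g. "celebrity:42#3".
function writeKey(logicalKey: string): string {
  return `${logicalKey}#${randomInt(SHARDS)}`;
}

// Reads gather: query every suffix (ideally batched) and merge in the app.
function readKeys(logicalKey: string): string[] {
  return Array.from({ length: SHARDS }, (_, i) => `${logicalKey}#${i}`);
}

console.log(writeKey("celebrity:42")); // one random shard takes this write
console.log(readKeys("celebrity:42")); // all shards are consulted on read
```

This works when the merge is cheap (counters, feeds); it multiplies read cost by N, so it is a deliberate trade, not a default.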
Node.js, AWS, Docker (how this shows up)
| Piece | Practice |
|---|---|
| Node | Single-threaded event loop; don’t do CPU-heavy compaction in-process; use RDS/Dynamo as dedicated stores |
| Connection pools | pg / mysql2 pools per container; too many replicas × too many pods = connection storm on RDS |
| Docker | Horizontal pods don’t fix hot partition; sticky in-memory cache in each pod ≠ shared consistency |
| RDS | B-tree + WAL mental model; failover DNS; replica lag for reads |
| DynamoDB | Provisioned vs on-demand, GSI/LSI access patterns, partition limits, consider DAX only with clear staleness story |
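The connection-storm row deserves numbers: total RDS connections ≈ pool max × containers per service × services. A per-container pg pool sketch (values are illustrative, not recommendations):

```ts
import { Pool } from "pg";

const pool = new Pool({
  connectionString: process.env.DATABASE_URL,
  max: 10,                        // per-container cap; keep (max × containers) under RDS max_connections
  idleTimeoutMillis: 30_000,      // shed idle connections during quiet periods
  connectionTimeoutMillis: 2_000, // fail fast instead of queueing forever during a storm
});

export async function getUser(id: string) {
  const { rows } = await pool.query("SELECT * FROM users WHERE id = $1", [id]);
  return rows[0];
}
```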
Interview questions & short answers
Q: When does an LSM store beat a B-tree for your workload?
A: Sustained high write or append-heavy ingest where batched compaction is acceptable and read p99 is tuned with bloom filters, level shape, and hot key isolation—not “because NoSQL.”
Q: Explain PACELC in one breath.
A: If partitioned: availability vs consistency; if not: latency vs freshness—replica reads are the everyday ELC trade.
Q: Why W+R>N?
A: Guarantees at least one replica in the read set saw the write’s quorum—overlap argument; map to vendor semantics honestly.
See also
- AWS data services & Node.js at scale
- CDC & outbox projections
- Replication, sharding & operations — backups, RPO/RTO, ops narratives
- Storage models & access paths — product-level SQL vs NoSQL
- Transactions & isolation — ACID/BASE, MVCC, 2PC vs saga
- Application caching & consistency — ElastiCache patterns