Databases & caching
Data systems are where constraints become real: latency, correctness, money, and compliance all meet the storage layer. Staff-level answers name consistency guarantees, failure modes (replica lag, split brain, hot keys), and operational habits (migrations, backups, observability)—not only “pick Postgres.”
How to use this page
- Use Core database ideas as a vocabulary checklist (ACID, isolation, plans, replication).
- Use Caching alongside databases to keep source of truth vs derived copies explicit.
- Skim Diagrams before mock interviews—practice narrating arrows out loud.
- Run Things to prepare as a literal checklist for projects or interview stories.
- Staff / SDE3 bar (Node.js + AWS): pair RDS (B-tree OLTP) with Dynamo-style partitioned thinking where relevant; treat ElastiCache as a consistency problem, not only latency—internals live under Storage engines, CAP & distributed data and Application caching & consistency.
Topic study plan (deep pages)
Each topic under /databases/topics/... follows: Core details → Understanding → Senior understanding → Diagram.
| Topic | Focus |
|---|---|
| Storage models & access paths | SQL vs NoSQL, document modeling, data warehouse, use cases, when to choose which |
| Storage engines, CAP & distributed data | B-tree vs LSM, CAP/PACELC, quorum, read repair, sharding—RDS / Dynamo mental models |
| AWS data services & Node.js | RDS Proxy, DynamoDB GSI/LSI/streams, ElastiCache cluster, Memcached vs Redis |
| CDC & outbox projections | Transactional outbox, CDC, dual-write antipattern, projections |
| Transactions & isolation | ACID vs BASE, isolation & MVCC, 2PC vs saga, scopes & locks |
| Indexing & query tuning | Normalization, index patterns, query optimization, indexing Q&A |
| Replication, sharding & operations | Leaders, lag, partitions, backups, RPO/RTO |
| Application caching & consistency | ElastiCache/Redis topologies, eviction (LRU/LFU/W-TinyLFU), penetration/herd/avalanche, dual-write |
Core database ideas
Storage models (when you’d reach for each)
Deep page: Storage models & access paths (SQL vs NoSQL, warehouse vs OLTP, document modeling, when to choose which).
| Family | Strength | Typical costs / caveats |
|---|---|---|
| Relational (OLTP) | joins, constraints, transactions | schema migration discipline |
| Wide-column / KV | huge scale, simple access patterns | awkward ad-hoc analytics |
| Document | flexible shapes, nested reads | cross-document consistency harder |
| Search / analytics stores | rankings, aggregations | not a source of truth for money |
| Time-series | metrics, retention policies | cardinality & retention ops |
Interview line: “I’d choose the model after access paths and consistency requirements, not the logo on the box.”
ACID (know what each letter guards)
| Letter | Guards against | Implication |
|---|---|---|
| Atomicity | partial writes | batch work or explicit compensation |
| Consistency | invalid states your rules forbid | constraints + app invariants together |
| Isolation | concurrent anomalies | pick level per transaction, not globally maximal |
| Durability | acknowledged loss | WAL / replication; fsync semantics matter |
Isolation anomalies (name → fix intuition)
| Anomaly | One-line intuition | Common levers |
|---|---|---|
| Dirty read | seeing uncommitted work | isolation ↑, shorter txns |
| Non-repeatable read | row changes between reads in one txn | locking, snapshot isolation |
| Phantom read | new rows appear in range | range locks / higher isolation sparingly |
| Lost update | two writers clobber | optimistic version column, SELECT … FOR UPDATE |
Not every endpoint needs SERIALIZABLE—cost and deadlock risk are real; scope transactions to the smallest critical section.
Indexing & the query path
Deep page: Indexing & query tuning (normalization, index Q&A, optimization patterns).
- B-tree (and friends) power most default indexes; composite index column order matches filter + sort patterns.
- Covering indexes avoid heap/table lookups when the planner can satisfy the query from the index alone.
- Cardinality & statistics skew plans—stale stats → wrong nested loops vs seq scans.
Operational loop:
- Identify hot query text (slow log /
pg_stat_statementsclass metrics). - EXPLAIN (ANALYZE) or vendor equivalent—compare estimated vs actual rows.
- Fix with query rewrite before index soup; add indexes only when the plan proves it.
Schema evolution
Treat migrations as deploy trains: additive-first columns, dual-write / backfill phases, backward-compatible reads until cutover. Breaking changes are timelines, not single PRs.
Connections & pools
Exhausted connection pools look like “the database is slow.” Watch wait time to acquire a connection, not only query duration.
Related depth: Database query path, Profiling services & async, /backend.
AWS + integration: AWS data services & Node.js, CDC & outbox.
Distributed data (production realism)
Replication basics
| Pattern | Read freshness | Ops story |
|---|---|---|
| Single leader | strong on primary; replicas may lag | failover, promotion drills |
| Multi-leader | conflict handling required | rare; careful merge rules |
| quorum / consensus | pricey; correctness-critical configs | operational complexity |
Replica lag surfaces as intermittent “wrong data” unless you engineer read-your-writes (sticky to primary, version checks, or user-visible staleness).
Partitioning / sharding
Choose a partition key from real access paths—resharding later is expensive. Watch hot partitions (celebrity keys) and plan salting or key redesign early.
Backup & recovery (you should be able to narrate)
- RPO/RTO targets tied to product risk.
- Restore drills on a schedule—not “we assume backups work.”
- Logical vs physical backup tradeoffs (portability vs speed).
Caching alongside databases
Caches are derived state. The database (or durable event log) remains authority unless you consciously accept staleness with product guardrails.
Layers (stack north to south)
| Layer | Latency win | Staleness / safety traps |
|---|---|---|
| Browser / CDN | huge for assets & static | personalized HTML at edge |
| Application memory | microsecond class | per-instance inconsistency |
| Shared remote (Redis class) | cross-instance sharing | thundering herd, key poisoning |
| DB buffer pool | automatic | not a substitute for app cache design |
Patterns (who owns invalidation?)
| Pattern | Flow | Owns freshness |
|---|---|---|
| Cache-aside | app reads cache, loads DB on miss, populates | app + TTL/events |
| Read-through | cache library loads on miss | cache service + TTL |
| Write-through | writes go to cache + DB together | tighter but write latency |
| Write-behind | writes acknowledged before durable flush | risk—queue loss |
Mitigate cache stampede: TTL jitter, request coalescing / single-flight, probabilistic early refresh.
Related depth: Caching & consistency, Caching & retry amplification.
Diagrams
Authority vs derived cache
Cache-aside read path
Transaction scope vs long work
Things to prepare
Before you design or pitch a data layer
- Access paths: list the 5–10 hottest queries and their filters/sorts.
- Consistency: which reads must be linearizable vs OK with seconds of lag?
- Growth: data volume curve, retention, archival, legal holds.
- Multi-tenancy: tenant key in every table/index story; noisy neighbor controls.
- Migration runway: additive schema steps, backfills, dual reads/writes.
Before production hardening
- Backups with documented restore drill frequency and owner.
- Replication lag dashboards and alerts tied to user-facing symptoms.
- Connection pool limits per service; saturation metrics.
- Slow query logging and top N review cadence.
- Secrets rotation path; least privilege DB roles.
- Idempotency for writers (see /backend/topics/idempotency-at-least-once).
Interview prep (90 minutes, repeatable)
| Block | Minutes | Output |
|---|---|---|
| Draw read + write path with cache + primary + replica | 15 | can narrate every arrow |
| Walk one slow query from log → plan → fix | 20 | EXPLAIN vocabulary |
| Cache invalidation story for one feature | 15 | events, TTL, stampede |
| Failure: replica lag + dual write | 15 | UX + detection |
| Schema change without downtime | 15 | additive phases |
Memorize one crisp line for each: isolation level choice, stampede mitigation, RPO/RTO, when not to cache.
Pitfalls
- Treating read replicas as strongly consistent without version checks or routing rules.
- Giant transactions holding locks while calling HTTP or queues.
- Unbounded caches without eviction, TTL, or memory caps—OOM incidents.
- Cache as source of truth for balances or inventory without durable reconciliation.
- Indexing every column without examining plans—write amplification and bloated storage.
Related
- /backend — lifecycle, timeouts, messaging, caching entry points.
- /performance — tail latency, retries, thundering herds, DB performance notes.
- /security — injection, least privilege, secrets.
Mark this page when you finish learning it.
Spotted something unclear or wrong on this page?