THN Interview Prep

PostgreSQL Internals (Interview Deep Dive)

PostgreSQL is a row-oriented, MVCC relational database with ACID transactions. Interview answers shine when you connect WAL durability, snapshot isolation, index access paths, and replication trade-offs to real symptoms (write amplification, bloat, replica lag).


WAL (Write-Ahead Log): durability and recovery

Changes go to shared buffers (RAM), but commit requires durable WAL records before acknowledging clients (depending on synchronous_commit and replication settings). On crash, replay reconstructs committed work.
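A minimal sketch of the per-transaction durability knob (the audit_log table is hypothetical): relaxing synchronous_commit risks losing the last few commits after a crash, never corrupting data.

```sql
BEGIN;
-- Applies to this transaction only; the WAL record is still written,
-- but the commit does not wait for it to be flushed to disk.
SET LOCAL synchronous_commit = off;
INSERT INTO audit_log (event) VALUES ('non-critical write');
COMMIT;
```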

Checkpoints flush dirty pages to data files and bound recovery time; more frequent checkpoints mean faster crash recovery at the cost of sharper I/O spikes.
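A hedged example of the usual checkpoint knobs; the values are illustrative, not recommendations, and the right numbers depend on disk throughput and your recovery-time objective.

```sql
ALTER SYSTEM SET max_wal_size = '8GB';                -- allow more WAL between checkpoints
ALTER SYSTEM SET checkpoint_timeout = '15min';        -- upper bound on time between checkpoints
ALTER SYSTEM SET checkpoint_completion_target = 0.9;  -- spread checkpoint writes over the interval
SELECT pg_reload_conf();
```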

Why interviewers care: WAL is the spine for replication (physical/logical) and PITR (continuous archiving). If your disk stalls, commits stall—tie to latency-throughput.

Link mentally to append-only journals in other systems (the AOF in the Redis deep dive, the log in the Kafka deep dive); the durability story rhymes even when implementations differ.


MVCC: snapshots, tuple visibility, and bloat

PostgreSQL uses multi-version concurrency control: updates create new row versions; old versions stick around until vacuum reclaims them.

Readers take a snapshot (transaction ID horizon) and see rows visible at that snapshot—readers do not block writers and vice versa for typical reads—great concurrency. But UPDATE-heavy workloads generate dead tuples → table bloat → autovacuum pressure.

VACUUM (and, in extreme cases, a VACUUM FULL offline rewrite) reclaims that space. Long-running transactions or idle-in-transaction sessions pin snapshots and prevent cleanup—classic replica lag + bloat story.
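A quick way to spot the culprits, assuming you can query pg_stat_activity on the instance in question:

```sql
-- Oldest open transactions: these pin the snapshot horizon and block vacuum cleanup.
SELECT pid, state, xact_start, now() - xact_start AS xact_age, left(query, 60) AS query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_start
LIMIT 10;
```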

Isolation levels map to snapshot behaviors; cross-reference isolation levels and consistency models when contrasting Postgres strong isolation with NoSQL stores.


Indexes: B-tree, GiST, BRIN (high level)

B-tree (default): equality and range on scalar keys; workhorse for WHERE id = ?, ORDER BY, many joins.

GiST (Generalized Search Tree): extensible framework for R-trees, full-text, range types, geometric queries—interviews mention PostGIS and nearest-neighbor searches.

GIN (not in the heading, but often paired in answers): inverted indexes for arrays, jsonb containment, full-text search.

BRIN (Block Range Index): tiny indexes for correlated physical ordering (time-series append-only). Summarizes min/max per page range—great when tables are massive and queries align with insert order; weak when values scatter randomly.

Choose indexes per selectivity and write amplification: each index adds overhead on insert/update.
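Illustrative definitions for each index family; table and column names are assumptions, and the GiST example presumes PostGIS is installed.

```sql
CREATE INDEX ON orders (customer_id, created_at);           -- B-tree: equality + range, ORDER BY
CREATE INDEX ON places USING gist (geom);                   -- GiST: spatial / nearest-neighbor (PostGIS)
CREATE INDEX ON docs   USING gin  (payload jsonb_path_ops); -- GIN: jsonb containment, full-text
CREATE INDEX ON events USING brin (created_at);             -- BRIN: huge, insert-ordered time-series
```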


Replication: physical vs logical

Streaming replication ships WAL records to standbys (near real-time). Standbys apply changes asynchronously by default → replica lag under write spikes or heavy queries on replicas.
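On the primary, pg_stat_replication (PostgreSQL 10+) exposes per-standby lag; a sketch:

```sql
SELECT application_name, state,
       write_lag, flush_lag, replay_lag,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_bytes_behind
FROM pg_stat_replication;
```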

Synchronous replicas trade availability/latency for durability—commits wait for remote flush.

Logical replication decodes WAL into row changes for selective replication, upgrades, or heterogeneous targets.
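A minimal publication/subscription sketch; names and the connection string are made up, and the target must already have matching table definitions.

```sql
-- On the source database:
CREATE PUBLICATION orders_pub FOR TABLE orders, order_items;

-- On the target database:
CREATE SUBSCRIPTION orders_sub
  CONNECTION 'host=primary.example.com dbname=shop user=replicator'
  PUBLICATION orders_pub;
```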

Framing for interviews: read-your-writes across primary/replica is not automatic; route critical reads to primary or use sticky sessions / causal patterns carefully.

Link to replication for generic leader-follower vocabulary.


Planner basics you should mention

The query planner picks sequential scan vs index scan vs bitmap combinations based on statistics (ANALYZE). Bad stats → wrong plans → sudden regressions.
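Worth showing the two commands you would actually reach for (table and predicate are illustrative):

```sql
ANALYZE orders;              -- refresh planner statistics
EXPLAIN (ANALYZE, BUFFERS)   -- planned vs actual rows, plus buffer hits/reads
SELECT *
FROM orders
WHERE customer_id = 42
  AND created_at >= now() - interval '7 days';
```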

Declarative partitioning prunes partitions for time-series queries—pairs with BRIN or per-partition btree indexes.
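A declarative range-partitioning sketch with a per-partition BRIN index; names and date ranges are assumptions.

```sql
CREATE TABLE events (
    id         bigint      NOT NULL,
    created_at timestamptz NOT NULL,
    payload    jsonb
) PARTITION BY RANGE (created_at);

CREATE TABLE events_2024_06 PARTITION OF events
    FOR VALUES FROM ('2024-06-01') TO ('2024-07-01');

-- Queries that filter on created_at prune to the matching partitions.
CREATE INDEX ON events_2024_06 USING brin (created_at);
```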


Practical operations checklist

  • Watch bloat, autovacuum lag, and long transactions (a monitoring query follows this list).
  • Size shared_buffers and work_mem with caution: work_mem is allocated per sort/hash operation, so large values multiply across concurrent queries.
  • Use connection pooling (PgBouncer) to avoid fork storms.
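One hedged starting point, assuming access to the standard statistics views:

```sql
-- Tables accumulating dead tuples, and when autovacuum last got to them.
SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;
```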

Interview phrase

“Postgres gives me MVCC for concurrent reads without locks on rows, WAL for crash-safe commits and replication, and I pick btree vs GiST vs BRIN by data correlation and query shape; I watch vacuum health because updates leave dead tuples behind.”


Hotspots and sequence contention

Primary keys based on monotonic sequences can create insert hotspots on the right edge of btree indexes—similar spirit to NoSQL hot shards. Mitigate with hash partitioning, time-sortable UUIDv7 keys, or sharded-sequence patterns.
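A hash-partitioning sketch that spreads the hot right edge across several smaller, partition-local B-trees; table and partition names are illustrative.

```sql
CREATE TABLE sessions (
    id         bigint      NOT NULL,
    created_at timestamptz NOT NULL DEFAULT now()
) PARTITION BY HASH (id);

CREATE TABLE sessions_p0 PARTITION OF sessions FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE sessions_p1 PARTITION OF sessions FOR VALUES WITH (MODULUS 4, REMAINDER 1);
CREATE TABLE sessions_p2 PARTITION OF sessions FOR VALUES WITH (MODULUS 4, REMAINDER 2);
CREATE TABLE sessions_p3 PARTITION OF sessions FOR VALUES WITH (MODULUS 4, REMAINDER 3);
```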


JSONB and hybrid modeling

JSONB combines relational constraints with flexible schemas—GIN indexes accelerate containment queries. Nice for evolving payloads, but avoid replacing thoughtful relational modeling with opaque blobs when integrity matters.
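A small hybrid-modeling sketch (table name and payload shape are assumptions); jsonb_path_ops keeps the GIN index lean while still serving @> containment.

```sql
CREATE TABLE user_events (
    id      bigserial PRIMARY KEY,
    user_id bigint    NOT NULL,   -- could also carry a foreign key for integrity
    payload jsonb     NOT NULL
);

CREATE INDEX ON user_events USING gin (payload jsonb_path_ops);

-- Containment query served by the GIN index:
SELECT id FROM user_events
WHERE payload @> '{"type": "login", "device": "mobile"}';
```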


Locking: row-level vs predicate gaps

MVCC reads typically avoid blocking, but writes still lock rows. SELECT … FOR UPDATE serializes competing updates. Serializable isolation detects anomalies via SSI—great correctness, occasional serialization failures that apps must retry (pair with idempotency).
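A sketch of both patterns; the accounts schema is hypothetical.

```sql
-- Row-level lock: a second transaction doing the same SELECT ... FOR UPDATE blocks here.
BEGIN;
SELECT balance FROM accounts WHERE id = 42 FOR UPDATE;
UPDATE accounts SET balance = balance - 100 WHERE id = 42;
COMMIT;

-- Serializable isolation: be ready to retry on SQLSTATE 40001 (serialization_failure).
BEGIN ISOLATION LEVEL SERIALIZABLE;
-- ... reads and writes ...
COMMIT;
```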

Deadlocks happen—consistent lock ordering in application code mitigates them.


Extensions ecosystem

PostGIS (spatial), pgvector (embeddings), Citus (sharding extension)—interviews sometimes probe whether you reach for managed services vs self-managed extensions. Extensions alter backup/upgrade risk profiles.


Major upgrades and replication gotchas

Major-version upgrades via logical replication and a switchover minimize downtime, but they require checking version compatibility matrices. Before promoting a replica, confirm it has fully caught up and handle sequence ownership explicitly (logical replication does not carry sequence values).


Observability: what to watch

  • Buffer cache hit ratio (not a silver bullet) alongside I/O patterns.
  • Checkpoint spikes, WAL generation rate, autovacuum workers falling behind.
  • Replication lag (in seconds) on replicas serving reads; see the queries after this list.
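Two hedged starting points; note that PostgreSQL 17 moves the checkpointer columns out of pg_stat_bgwriter into pg_stat_checkpointer, so the first query reflects the pre-17 layout.

```sql
-- Checkpoint pressure: requested checkpoints should be rare relative to timed ones.
SELECT checkpoints_timed, checkpoints_req, buffers_checkpoint, buffers_backend
FROM pg_stat_bgwriter;

-- On a standby: approximate replay lag in wall-clock terms.
SELECT now() - pg_last_xact_replay_timestamp() AS replay_lag;
```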

Compare/contrast with Cassandra and DynamoDB

  • Postgres: rich constraints, joins, strong transactional semantics per node/cluster with careful routing—vertical scale first, then read replicas, then sharding (sharding).
  • DynamoDB/Cassandra: partition-first scale-out with different consistency knobs—access-pattern-first modeling.

Pick Postgres when integrity + complex queries dominate; augment with caches (caching) rather than prematurely distributing.


Sample sizing exercise (interview flavor)

If random reads dominate and working set fits RAM, buffer cache keeps p99 low; when not, indexes and partition pruning matter more than raw CPU. Always connect hardware sizing to query plans and working set, not row counts alone—echo back-of-envelope discipline.
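To ground that, a quick size check against a hypothetical orders table tells you whether table plus indexes plausibly fit in memory:

```sql
SELECT pg_size_pretty(pg_total_relation_size('orders')) AS table_plus_indexes;
```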


Foreign keys, triggers, and migration discipline

Foreign keys enforce referential integrity—great for correctness, sometimes contentious at multi-terabyte migration moments. Triggers hide behavior—powerful but can surprise ORMs and migrations. For zero-downtime deploys, pair schema changes with expand/contract patterns and application dual-writes when necessary.
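A hedged expand-style sketch (names are illustrative): build the index without blocking writes, then add the foreign key without an immediate full-table scan.

```sql
-- Runs without a long write-blocking lock (cannot run inside a transaction block).
CREATE INDEX CONCURRENTLY idx_orders_customer ON orders (customer_id);

-- NOT VALID skips checking existing rows now; VALIDATE later takes a weaker lock.
ALTER TABLE orders
    ADD CONSTRAINT orders_customer_fk
    FOREIGN KEY (customer_id) REFERENCES customers (id) NOT VALID;

ALTER TABLE orders VALIDATE CONSTRAINT orders_customer_fk;
```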


Read replicas for analytics

Offload heavy reporting to replicas with statement timeouts and role separation so OLTP primaries stay crisp. Watch replica replay lag when dashboards must be fresh—see consistency models.
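One way to enforce that separation, assuming a dedicated reporting role:

```sql
CREATE ROLE reporting LOGIN;
ALTER ROLE reporting SET statement_timeout = '30s';  -- runaway dashboard queries get cancelled
```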
