Indexing & query tuning
Topic mind map (ASCII)
Indexing & query tuning
├── Normalization ............. 1NF / 2NF / 3NF / BCNF (+ when to denorm)
├── Index design .............. B-tree, composite order, covering, partial
├── Planner ................... stats, estimates vs actual, seq vs index scan
├── Workload fixes ............ N+1, pagination, SELECT *, OR → UNION
├── Ops ....................... pools, vacuum/bloat (MVCC context), slow log loop
└── Link to storage engines ... clustered vs heap (see storage-engines topic)Core knowledge
| Topic | Carry one line |
|---|---|
| Index | A sorted/probed structure so the engine skips full table scans for matching predicates. |
| Composite index | Column order must match how filters and ORDER BY are written. |
| Covering index | Index contains all columns needed—index-only plan possible. |
| Normalization | Reduce redundancy and update anomalies; may split tables. |
| Denormalize | Duplicate data for read speed—pay on writes and consistency. |
| Query plan | How the engine joins, filters, sorts—verify with EXPLAIN (ANALYZE). |
Normalization (1NF → 3NF → BCNF)
Goal: each fact stored once in the “right” place so updates stay consistent.
1NF — atomic columns
- Each column holds one value per row; no repeating groups hidden in a single column.
- Interview trap: stuffing comma-separated tags in one string → harder to index/query; prefer junction table or array type with eyes open on indexing.
2NF — no partial dependency on part of a composite key
- For tables with composite primary keys, non-key columns must depend on the whole key—not only half.
- Classic fix: split into tables so each non-key depends on one key.
3NF — no transitive dependency
- Non-key columns must depend only on the key, not on other non-key columns.
- Example: if
employee_id → dept_idanddept_id → dept_name,dept_namebelongs indepartments, notemployees.
BCNF (Boyce–Codd)
- Stricter than 3NF: every determinant is a candidate key. Appears in exam-style edge cases; in production you recognize “odd functional dependencies” and refactor.
When interviews ask “why normalize?”
→ Fewer update anomalies (same fact updated in two places), cleaner constraints, smaller write amplification for duplicated blobs.
Denormalization — when and how
| Reason | Pattern | Cost |
|---|---|---|
| Read hot path | Duplicate label on child rows (e.g. product_name on line items) | Writes must update duplicates or accept staleness |
| Reporting | Materialized summary tables fed by jobs | Lag + job failures to monitor |
| Wide events | Snapshot fields for immutable history | Storage; great for audit / analytics |
Staff line: “Denormalize only with a measured hot query and a plan for staleness / rebuild.“
Index types (conceptual)
| Type | Good for | Watch |
|---|---|---|
| B-tree (default) | equality, range, sort | write overhead per index |
| Hash | exact match (where supported) | no range on hash |
| Unique | enforce one row per key | NULL semantics per engine |
| Partial / filtered | index subset of rows | must match query predicates |
| Full-text / GIN / special | text search inside DB | often pair with external search at scale |
Composite column order (repeat until automatic):
- Equality columns in the where clause (
a = ? AND b = ?). - Range column last among filter columns (
c > ?). - INCLUDE / covering columns for SELECT list to enable index-only scans when the engine supports it.
Indexing interview questions & answers
Q1: “Why is my index not used?”
Answer: Check: function on column (WHERE lower(email)) prevents use; wrong column order vs composite; parameter sniffing / stale stats; selectivity so planner prefers seq scan; query not actually filtering on the leading column. Prove with EXPLAIN and actual row counts.
Q2: “Too many indexes—what breaks?”
Answer: Every index slows inserts/updates/deletes and uses disk/memory. Too many also confuse optimizers. Start from top slow queries, add one index, re-profile writes, repeat.
Q3: “Covering index vs wider composite?”
Answer: Covering (including non-key columns or multi-column composite that matches filter + select) avoids heap lookups. Balance index size and write cost—not every query deserves a covering index.
Q4: “N+1 in production—how do you fix?”
Answer: Batch: join, IN (...) with chunking, or dataloader pattern. Prove reduction in round-trips with traces and query count metrics.
Query optimization playbook
Workflow (always this order)
- Measure: slow log,
pg_stat_statementsclass, trace one representative query. - EXPLAIN (ANALYZE): estimated vs actual rows—huge gap ⇒ stats or rewrite.
- Rewrite: narrower
SELECT, push predicates, limit join width, batch instead of loop. - Index: match leading predicates; avoid index soup.
- Re-run under production-like volume; check locks and pool wait.
Common rewrites (mental patterns)
| Problem | Direction |
|---|---|
| Select * | Project only needed columns—less IO, better covering chances |
| OR explosion | UNION of two selective branches or redesign predicates |
| Offset deep pagination | Keyset pagination (WHERE id > ? ORDER BY id LIMIT …) |
| Correlated subquery | Join or lateral when planner struggles |
| Implicit conversion | Match types—index columns not wrapped in casts/functions |
Symptom → first split
| Symptom | First split |
|---|---|
| p95 up, low DB CPU | pool wait, network, app mutex |
| p95 up, high DB CPU | bad plan, missing index, seq scan surprise |
| Intermittent timeouts | stats, lock wait, contention |
Diagram — slow query loop
Diagram — normalization vs denormalize (read vs write)
See also
- Storage engines, CAP & distributed data — B-tree vs LSM, clustered vs secondary, over-indexing at scale
- Storage models & access paths — OLTP vs warehouse, SQL vs NoSQL
- Replication, sharding & operations
- /performance/topics/database-query-path
- /performance/topics/profiling-services-async
Last updated on
Spotted something unclear or wrong on this page?