
Apache Cassandra Internals (Interview Deep Dive)

Cassandra is a wide-column, partition-row store built for multi-datacenter availability and horizontal scale. The interview story centers on the partition key, log-structured storage, tunable consistency, and operational realities (compaction, repairs, LWW pitfalls).

Data model: keyspace → table → partition (by partition key) → rows ordered by clustering columns on disk.


Partition key: routing, hotspots, and query shape

The partition key is hashed to a token (Murmur3 in standard setups) that determines which nodes own the data. All efficient reads/writes should be single-partition queries. Cross-partition reads are possible but expensive; design one table per access pattern, much like DynamoDB but with explicit CQL.

Hot partitions occur when one partition key receives disproportionate traffic or size (wide partitions with millions of rows). Symptoms: latency spikes, compaction pressure, repair strain.

Mitigate by key redesign, bucketizing time series (sensor_id + day_bucket), application-level sharding of hot tenants, and caching.
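
A minimal sketch of the bucketing idea (Python with the DataStax driver; the contact point, keyspace, table, and column names are assumptions for illustration): the application computes a day bucket and includes it in the partition key, so a busy sensor is spread over one partition per day instead of one unbounded partition.

  from datetime import datetime, timezone
  from cassandra.cluster import Cluster

  session = Cluster(["127.0.0.1"]).connect("metrics")   # assumed contact point / keyspace

  def day_bucket(ts: datetime) -> int:
      # One partition per sensor per day keeps partition size bounded.
      return int(ts.strftime("%Y%m%d"))

  insert = session.prepare(
      "INSERT INTO readings (sensor_id, day_bucket, ts, value) VALUES (?, ?, ?, ?)")

  now = datetime.now(timezone.utc)
  session.execute(insert, ("sensor-42", day_bucket(now), now, 21.7))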

Cross-link to sharding: Cassandra makes the partition function explicit; there is no magical load balancer without modeling.


Replication, consistency levels, and lightweight transactions

Cassandra’s replication factor copies partitions to multiple nodes (often across racks/DCs). Clients choose consistency level (CL) per request:

  • ONE, LOCAL_ONE, QUORUM, LOCAL_QUORUM, ALL, etc.

Reads and writes at QUORUM are a common middle ground (R + W > RF, so a quorum read overlaps the most recent acknowledged quorum write); ANY is the loosest write level, accepting even a stored hint as success.
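
A sketch of per-operation tunable consistency (Python driver; contact point, keyspace, and the events table are assumed): the consistency level is set on each statement rather than cluster-wide.

  import uuid
  from cassandra import ConsistencyLevel
  from cassandra.cluster import Cluster
  from cassandra.query import SimpleStatement

  session = Cluster(["127.0.0.1"]).connect("app")    # assumed contact point / keyspace

  # Write and read at LOCAL_QUORUM: within the local DC, R + W > RF, so the
  # read set overlaps the most recent acknowledged write.
  event_id = uuid.uuid4()
  write = SimpleStatement(
      "INSERT INTO events (id, payload) VALUES (%s, %s)",
      consistency_level=ConsistencyLevel.LOCAL_QUORUM)
  session.execute(write, (event_id, "hello"))

  read = SimpleStatement(
      "SELECT payload FROM events WHERE id = %s",
      consistency_level=ConsistencyLevel.LOCAL_QUORUM)
  row = session.execute(read, (event_id,)).one()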

Lightweight transactions (IF NOT EXISTS) use Paxos rounds—powerful but expensive; avoid as default hot path.

Frame trade-offs with consistency models and CAP / PACELC: Cassandra leans AP with tunable C for each operation.


Gossip, failure detection, and cluster membership

Nodes gossip cluster state—membership, schema versions, application state hints. Gossip is eventually consistent; convergence is fast but not instantaneous—brief metadata mismatches can occur during changes.

Phi accrual failure detection (conceptually) outputs a continuous suspicion level rather than a binary alive/dead verdict, so it separates node death from network jitter and reduces false positives compared with naive fixed timeouts.
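
A minimal sketch of the phi accrual idea (plain Python, simplified to an exponential model of heartbeat inter-arrival times; real implementations use a windowed normal distribution): suspicion grows the longer a heartbeat is overdue relative to observed history, and a threshold on phi, not a fixed timeout, marks the node suspect.

  import math
  import time

  class PhiAccrualDetector:
      """Simplified phi accrual: phi = -log10(P(heartbeat still to come))."""

      def __init__(self, window=1000):
          self.intervals = []                # recent inter-arrival samples (seconds)
          self.window = window
          self.last_heartbeat = None

      def heartbeat(self, now=None):
          now = time.monotonic() if now is None else now
          if self.last_heartbeat is not None:
              self.intervals.append(now - self.last_heartbeat)
              self.intervals = self.intervals[-self.window:]
          self.last_heartbeat = now

      def phi(self, now=None):
          now = time.monotonic() if now is None else now
          if not self.intervals:
              return 0.0
          mean = sum(self.intervals) / len(self.intervals)
          elapsed = now - self.last_heartbeat
          p_later = math.exp(-elapsed / mean)     # exponential model of arrivals
          return -math.log10(max(p_later, 1e-300))

  # Mark a node suspect when phi crosses a threshold (e.g. 8) rather than on a hard timeout.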

For interviews: acknowledge gossip is scalable but operational changes (bootstrap, decommission) must be orchestrated; never assume instantaneous global agreement.


Storage engine: LSM, SSTables, memtables, compaction

Writes land in a memtable and append-only commitlog; flush creates SSTables (sorted immutable files). Reads merge memtable + SSTables with bloom filters and partition key caches to skip irrelevant files.
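
A toy sketch of that path (plain Python, heavily simplified: no commitlog, one cell per key, and a set standing in for the Bloom filter) illustrates why a read must consult the memtable plus potentially several SSTables and merge by timestamp.

  from bisect import bisect_left

  class TinySSTable:
      """Immutable, sorted stand-in for an on-disk SSTable."""
      def __init__(self, items):                  # items: (key, (value, ts))
          self.rows = sorted(items)
          self.keys = [k for k, _ in self.rows]
          self.bloom = set(self.keys)             # stand-in for the Bloom filter

      def get(self, key):
          if key not in self.bloom:               # skip files that cannot match
              return None
          i = bisect_left(self.keys, key)
          if i < len(self.keys) and self.keys[i] == key:
              return self.rows[i][1]              # (value, ts)
          return None

  class TinyLSM:
      def __init__(self, flush_at=4):
          self.memtable = {}                      # key -> (value, ts); commitlog omitted
          self.sstables = []                      # oldest first
          self.flush_at = flush_at

      def write(self, key, value, ts):
          self.memtable[key] = (value, ts)
          if len(self.memtable) >= self.flush_at: # flush: memtable becomes an SSTable
              self.sstables.append(TinySSTable(self.memtable.items()))
              self.memtable = {}

      def read(self, key):
          candidates = []
          if key in self.memtable:
              candidates.append(self.memtable[key])
          candidates += [hit for sst in self.sstables if (hit := sst.get(key))]
          # Merge by timestamp: the newest cell wins (last-write-wins).
          return max(candidates, key=lambda vt: vt[1])[0] if candidates else None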

Compaction strategies (SizeTiered, Leveled, TimeWindowCompactionStrategy for time-series) trade read amplification, write amplification, and disk space. Wrong strategy → compaction backlog and read latency cliffs.
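
For instance, a time-series table is commonly switched to TWCS with a window matching its bucketing scheme; the table name and retention below are assumptions for illustration.

  # CQL to pair daily buckets with daily compaction windows (table name assumed).
  tw_compaction_cql = """
  ALTER TABLE metrics.readings
    WITH compaction = {
      'class': 'TimeWindowCompactionStrategy',
      'compaction_window_unit': 'DAYS',
      'compaction_window_size': '1'
    }
    AND default_time_to_live = 2592000   -- 30-day retention, assumed
  """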

Link I/O trade-offs to latency-throughput.


Read repair and hinted handoff

Read repair reconciles stale replicas by comparing digests/values during reads (depending on configuration and version semantics).
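
Conceptually the reconciliation looks like this sketch (plain Python; the replica objects and their get/put methods are hypothetical stand-ins for replica responses): compare digests, and on mismatch pick the newest cell and write it back to stale replicas.

  import hashlib

  def digest(value, ts):
      # Stand-in for the digest a replica returns on a digest read.
      return hashlib.md5(f"{value}:{ts}".encode()).hexdigest()

  def read_with_repair(replicas, key):
      responses = {r: r.get(key) for r in replicas}          # each response is (value, ts)
      if len({digest(v, t) for v, t in responses.values()}) > 1:
          winner = max(responses.values(), key=lambda vt: vt[1])   # newest timestamp wins
          for replica, cell in responses.items():
              if cell != winner:
                  replica.put(key, *winner)                   # repair the stale replica
          return winner[0]
      return next(iter(responses.values()))[0]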

Hinted handoff stores writes for temporarily down replicas and replays them when the replica returns; it helps durability but still requires eventual convergence via repair.

Full repairs (nodetool repair) walk ranges—expensive; in modern clusters often scheduled incrementally. Interview point: anti-entropy is not free.


Last-write-wins (LWW) pitfalls

Cassandra resolves conflicts with timestamps: last write wins per cell, using client-supplied or server-side clocks.

Pitfalls:

  • Clock skew across app servers leads to surprising “older wins” outcomes; use synchronized time (careful NTP) or Cassandra-side timestamps judiciously.
  • Retries with stale timestamps can briefly resurrect old values; pair retries with idempotency and deterministic write paths (see the sketch after this list).
  • Deletes are tombstones retained for gc_grace_seconds; replicas that miss the delete and are not repaired within that window can resurrect data, so design retention and grace periods accordingly.
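
A sketch of pinning write timestamps explicitly (Python driver; contact point, keyspace, and the profiles table are assumed): a retry of the same logical write reuses the original timestamp, so it cannot win over a newer value.

  import time
  import uuid
  from cassandra.cluster import Cluster

  session = Cluster(["127.0.0.1"]).connect("app")   # assumed contact point / keyspace

  insert = session.prepare(
      "INSERT INTO profiles (user_id, email) VALUES (?, ?) USING TIMESTAMP ?")

  # Capture the timestamp (microseconds) once; reuse it verbatim on retries.
  write_ts = int(time.time() * 1_000_000)
  user_id = uuid.uuid4()
  session.execute(insert, (user_id, "a@example.com", write_ts))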

Contrast with CRDTs or external coordination when ordering truly matters.


When Cassandra fits

Excellent for high-ingest time series, write-heavy messaging metadata, and globally replicated data when eventually consistent reads are acceptable.

Poor fit for ad hoc joins, strong cross-row transactions, or tiny datasets—Postgres may win (Postgres deep dive).


Interview phrase

“I model around partition keys so queries stay single-partition; I pick consistency levels per operation and accept LWW semantics—so I care about clocks, tombstones, and repairs; I watch compaction strategy vs workload shape and use Cassandra when AP + scale outweigh relational joins.”


CQL modeling: partition size guardrails

Wide partitions are not theoretical—unbounded collections per partition degrade read repair, compaction, and GC. Cassandra warns via partition size metrics; correct modeling caps row counts per partition or bucketizes time.

Static columns (within partition) enable metadata shared across clustering rows—useful patterns but easy to misuse.
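
A sketch of such a table (hypothetical keyspace, table, and columns): the composite partition key bounds partition growth, and the static column stores per-sensor metadata once per partition rather than once per row.

  readings_ddl = """
  CREATE TABLE IF NOT EXISTS metrics.readings (
      sensor_id   text,
      day_bucket  int,            -- e.g. 20240115; caps partition growth
      ts          timestamp,
      value       double,
      sensor_name text STATIC,    -- one copy per (sensor_id, day_bucket) partition
      PRIMARY KEY ((sensor_id, day_bucket), ts)
  ) WITH CLUSTERING ORDER BY (ts DESC)
  """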


Lightweight transactions in depth (when to avoid)

LWT uses consensus rounds per operation—orders of magnitude slower than normal writes. Reserve for rare invariants (username claim) not per-message throughput.
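
A minimal sketch of the username-claim case (Python driver; contact point, keyspace, and the usernames table are assumed; the driver exposes the [applied] result column, read here via was_applied):

  from cassandra.cluster import Cluster

  session = Cluster(["127.0.0.1"]).connect("app")   # assumed contact point / keyspace

  # Paxos-backed conditional insert: succeeds for exactly one claimant.
  result = session.execute(
      "INSERT INTO usernames (name, owner) VALUES (%s, %s) IF NOT EXISTS",
      ("alice", "user-123"))
  if not result.was_applied:                        # [applied] is False: someone got there first
      raise ValueError("username already taken")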


Materialized views and secondary indexes

Materialized views maintain derived tables—convenient but not free; failures during view builds require operational attention. Secondary indexes hit local partitions but scatter queries—often prefer denormalized tables + batch writers for predictable perf.
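
A sketch of the denormalization pattern (Python driver; contact point, keyspace, and both table schemas are assumed): the same message is written to a by-conversation table and a by-user table, kept in sync with a logged batch rather than a secondary index.

  import uuid
  from cassandra.cluster import Cluster
  from cassandra.query import BatchStatement

  session = Cluster(["127.0.0.1"]).connect("app")   # assumed contact point / keyspace

  by_conv = session.prepare(
      "INSERT INTO messages_by_conversation (conv_id, msg_id, body) VALUES (?, ?, ?)")
  by_user = session.prepare(
      "INSERT INTO messages_by_user (user_id, msg_id, body) VALUES (?, ?, ?)")

  msg_id = uuid.uuid4()
  batch = BatchStatement()     # logged batch: keeps the tables in sync (eventual atomicity, no isolation)
  batch.add(by_conv, ("conv-1", msg_id, "hello"))
  batch.add(by_user, ("user-9", msg_id, "hello"))
  session.execute(batch)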


Multi-datacenter replication

Replication across DCs trades latency for survival—clients pick LOCAL_QUORUM in each DC for steady state; global strong reads are not Cassandra’s default happy path.
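
A sketch of the steady-state setup (Python driver; contact points, DC names, and replication factors are assumptions): DC-aware routing plus LOCAL_QUORUM keeps requests inside the local datacenter while the keyspace replicates to both DCs.

  from cassandra import ConsistencyLevel
  from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT
  from cassandra.policies import DCAwareRoundRobinPolicy

  profile = ExecutionProfile(
      load_balancing_policy=DCAwareRoundRobinPolicy(local_dc="dc1"),   # assumed DC name
      consistency_level=ConsistencyLevel.LOCAL_QUORUM)
  cluster = Cluster(["10.0.0.1"], execution_profiles={EXEC_PROFILE_DEFAULT: profile})
  session = cluster.connect()

  session.execute("""
      CREATE KEYSPACE IF NOT EXISTS app
      WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3, 'dc2': 3}
  """)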


Tombstones and compaction interactions

Tombstones propagate through SSTables and compaction eventually purges them; if compaction stalls, read-path work rises. gc_grace_seconds defines how long tombstones are kept before they may be purged, which is the window repair has to propagate deletes; coordinate it with repair schedules.
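
For example (table name assumed; 864000 seconds is the shipped default), gc_grace_seconds should stay comfortably longer than the repair cadence:

  gc_grace_cql = """
  ALTER TABLE metrics.readings
    WITH gc_grace_seconds = 864000   -- 10 days; repairs must complete well within this window
  """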


Operational commands interview mentions

  • nodetool status for ring health.
  • nodetool tpstats for thread pool backpressure.
  • Repair windows coordinated with compaction strategies—misconfigured windows cause cascading latency.

Client-side retry discipline

Write timeouts and idempotent retries should align with LWW expectations: use a clear write-timestamp policy and avoid blind duplicate retries without understanding ordering.
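
A sketch of flagging a statement as idempotent (Python driver; contact point, keyspace, and table are assumed; the is_idempotent flag is consulted by retry and speculative-execution machinery): only writes that are safe to apply twice, i.e. not counters or LWTs, should be marked.

  from datetime import datetime, timezone
  from cassandra.cluster import Cluster
  from cassandra.query import SimpleStatement

  session = Cluster(["127.0.0.1"]).connect("metrics")   # assumed contact point / keyspace

  # A full-cell overwrite is safe to retry; counters and LWTs are not.
  stmt = SimpleStatement(
      "INSERT INTO readings (sensor_id, day_bucket, ts, value) VALUES (%s, %s, %s, %s)",
      is_idempotent=True)
  session.execute(stmt, ("sensor-42", 20240115, datetime.now(timezone.utc), 21.7))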


Benchmark reality

Cassandra shines at sustained concurrent writes with well-modeled partitions; microbenchmarks without compaction pressure mislead. Align discussion with latency-throughput and realistic dataset sizes.
