Kafka Internals (Interview Deep Dive)
Kafka is a distributed append-only log: producers append records to topics, and consumers read at positions they track (offsets). The storage and replication story is what interviewers probe when you claim “at-least-once with idempotent producers” or “log compaction for config topics.”
Core vocabulary:
- Broker: server process that stores and serves partitions.
- Topic: named category of records.
- Partition: ordered, append-only sub-log of a topic.
- Record: key, value, timestamp, headers.
- Offset: monotonic position inside a partition.
- Consumer group: cooperative readers sharing a topic’s partitions.
- Replica: copy of a partition on another broker.
- Controller: cluster metadata leader.
- ISR: in-sync replica set.
- Leader: the replica of record for a partition’s writes and reads.
Partitions: ordering, scale, and hot keys
A topic splits into partitions for horizontal scale. Total order exists only within a single partition, not across partitions. If you need strict global order, you pay with a single partition (throughput ceiling) or you design for per-key order by hashing the key to a partition: all events for user-123 land in the same partition and keep order, while other users run in parallel.
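To make the per-key ordering claim concrete, here is a minimal sketch of keyed partitioning; the real Java client hashes the serialized key bytes with murmur2, but the shape is the same:

```java
// Sketch of keyed partitioning (the real client uses murmur2 on the
// serialized key bytes; the modulo step is the part that matters).
static int partitionFor(String key, int numPartitions) {
    // Mask the sign bit rather than Math.abs (which overflows on MIN_VALUE).
    return (key.hashCode() & 0x7fffffff) % numPartitions;
}
// partitionFor("user-123", 12) is stable, so all of user-123's events keep
// order in one partition; changing the partition count remaps keys.
```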
Partition count is a trade-off:
- More partitions improve producer/consumer parallelism but increase metadata, election noise, and end-to-end lag variance if misbalanced.
- Too few partitions cap throughput and cause hot spots when one key dominates.
Hot partition risk: if all events share one key, one broker’s leader does all work. Mitigate with key design, sub-key sharding, or business-level splitting. See sharding for the general pattern.
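One mitigation sketch, with an illustrative key format: salt the hot key into N sub-keys, accepting that ordering now holds per shard rather than per entity.

```java
// Split a hot key across `shards` sub-keys; downstream consumers must
// tolerate per-shard ordering (and merge if they need a full view).
static String shardedKey(String hotKey, long sequence, int shards) {
    return hotKey + "#" + (sequence % shards);  // e.g. "tenant-42#3"
}
```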
Replication, leaders, and the ISR (In-Sync Replicas)
Each partition is replicated replication.factor times across brokers. One replica is the leader; followers fetch from the leader like consumers (pull-based replication). A follower is in-sync if it is caught up within configurable lag bounds (replica.lag.time.max.ms, fetch sessions, etc.—exact thresholds vary by version and tuning).
The ISR is the set of replicas eligible to become leader. If followers fall behind or die, they drop from the ISR. On leader failure, the controller elects a new leader from the ISR (never a hopelessly stale replica) so committed data visibility matches your acks settings.
Unclean leader election (allowing non-ISR replicas to lead) buys availability at the risk of data loss on failure—usually disabled for money-path topics.
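As a sketch with the AdminClient (topic name, sizes, and broker address are illustrative), the money-path defaults look like:

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;
import java.util.List;
import java.util.Map;
import java.util.Properties;

Properties conf = new Properties();
conf.put("bootstrap.servers", "broker-1:9092");  // placeholder address

try (AdminClient admin = AdminClient.create(conf)) {
    NewTopic payments = new NewTopic("payments", 12, (short) 3)  // RF=3
        .configs(Map.of(
            "min.insync.replicas", "2",                  // pairs with acks=all
            "unclean.leader.election.enable", "false")); // no stale leaders
    admin.createTopics(List.of(payments)).all().get();   // blocks; throws on failure
}
```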
Link this mental model to replication and consistency models: Kafka’s client-visible guarantees sit on top of replica commit rules and consumer offset management.
Consumer groups and partition assignment
Consumers join a group; the group coordinator assigns each partition to at most one consumer in the group at a time (for standard consumers). Adding consumers scales reads until consumers == partitions (beyond that, extra consumers sit idle). Adding or removing consumers triggers a rebalance; during a rebalance, processing pauses unless you use cooperative (incremental) rebalancing and careful tuning.
Offsets track progress per partition. The consumer commits offsets (to __consumer_offsets or external store). If you commit after side effects, you may lose messages on crash; if you commit before side effects, you may duplicate on crash. That is the classic at-least-once vs at-most-once tension—pair with idempotency.
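A minimal at-least-once consumer loop under those rules (topic name, client properties, and the handler are illustrative):

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import java.time.Duration;
import java.util.List;

// Process first, commit after: a crash between the two replays the batch,
// so the handler must be idempotent (or dedupe downstream).
try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
    consumer.subscribe(List.of("orders"));
    while (true) {
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
        for (ConsumerRecord<String, String> record : records) {
            process(record);  // hypothetical side effect; must be safe to repeat
        }
        consumer.commitSync();  // the at-least-once boundary
    }
}
```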
Static membership and cooperative sticky assignors reduce rebalance churn in large microservice fleets.
Delivery semantics (what you can honestly say)
- At-most-once: commit offsets early; possible message loss; rare for business logic.
- At-least-once (typical): process then commit; duplicates possible; require idempotent consumers or stores.
- Exactly-once in the “end-to-end” product sense is hard: Kafka offers idempotent producer and transactions (write to Kafka + read-process-write with transaction boundaries) for exactly-once within Kafka in defined setups. Cross-system EOS needs outbox, idempotent APIs, or deduplication—see idempotency.
Producer acks:
- acks=0: fire-and-forget.
- acks=1: leader ack (common; window for loss if the leader dies before followers replicate).
- acks=all (or -1): wait for the ISR; combine with min.insync.replicas to avoid a silent single-replica commit.
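Durable-producer settings, sketched as client properties (serializers and broker address are placeholders; min.insync.replicas lives on the topic/broker, not the producer):

```java
Properties p = new Properties();
p.put("bootstrap.servers", "broker-1:9092");
p.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
p.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
p.put("acks", "all");                // wait for the in-sync replica set
p.put("enable.idempotence", "true"); // broker-side dedupe of producer retries
// Combine with topic-level min.insync.replicas=2 (RF=3) so acks=all cannot
// silently degrade to a single-replica commit.
```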
Log retention and compaction
Retention by time or size (retention.ms, retention.bytes) is the default “keep the tail” model for event streams and telemetry.
Log compaction keeps the latest record per key (per partition) and discards older versions for the same key—useful for changelog topics (database CDC, config, feature flags). Compaction is not a primary analytics store; it is a keyed state projection of the log. Tombstones (null values) with delete.retention.ms allow key removal per policy.
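A sketch of a compacted changelog topic plus a tombstone delete, reusing the AdminClient/producer setup above (names and retention values are illustrative):

```java
// Compacted changelog topic: the log converges to the latest value per key.
NewTopic flags = new NewTopic("feature-flags", 6, (short) 3)
    .configs(Map.of(
        "cleanup.policy", "compact",
        "delete.retention.ms", "86400000"));  // keep tombstones ~24h for late readers

// Deleting a key: publish a null value; compaction drops the key once the
// tombstone has been retained for delete.retention.ms.
producer.send(new ProducerRecord<>("feature-flags", "flag-dark-mode", null));
```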
For stream vs queue trade-offs in system design, see message queue vs stream and pub/sub.
Failure and operations talking points
- Brokers die: leaders move; clients refresh metadata; prefer multi-AZ placement.
- Zombie leaders are fenced via leader epochs and controller-driven metadata; still, design consumers to tolerate duplicates.
- Rebalance storms happen if consumers are flaky or session timeouts are too tight.
Interview phrase
“I pick partition count and keying for per-entity order, set acks=all with min.insync.replicas for durability, and treat consumer offset commits as the at-least-once boundary; application-level dedupe or idempotency keys cover the rest.”
Wire protocol, batches, and throughput levers
Clients speak a binary protocol over TCP with batched produce requests and fetch responses. Throughput comes from batching (linger.ms, batch.size), compression (gzip, lz4, snappy, zstd), and pipelining; each lever trades latency for efficiency.
linger.ms intentionally waits to accumulate batches; raise it for throughput-heavy workloads, lower it for tail-sensitive RPC replacement topics.
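A throughput-leaning tuning sketch (the numbers are illustrative starting points, not recommendations):

```java
Properties p = new Properties();
p.put("linger.ms", "20");                         // wait up to 20 ms to fill batches
p.put("batch.size", String.valueOf(256 * 1024));  // 256 KiB per-partition batches
p.put("compression.type", "zstd");                // compress whole batches on the wire
// Latency-sensitive topics flip the trade: linger.ms=0 and the default batch.size.
```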
Transactions and idempotent producer (how far “EOS” goes)
An idempotent producer dedupes retries using producer IDs and sequence numbers—fixes duplicates caused by network ambiguity without consumers changing.
Transactions tie produce/consume offsets across partitions when you use the transactional APIs—powerful for Kafka-to-Kafka pipelines but operationally heavier (transaction coordinators, timeouts).
Neither replaces application dedupe against external databases unless you design writes carefully.
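A read-process-write sketch with the transactional APIs; it assumes a producer configured with a transactional.id, a consumer with isolation.level=read_committed and auto-commit off, and a hypothetical enrich() transform:

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.TopicPartition;
import java.time.Duration;
import java.util.HashMap;
import java.util.Map;

producer.initTransactions();
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
    if (records.isEmpty()) continue;
    producer.beginTransaction();
    try {
        Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>();
        for (ConsumerRecord<String, String> r : records) {
            producer.send(new ProducerRecord<>("orders-enriched", r.key(), enrich(r.value())));
            offsets.put(new TopicPartition(r.topic(), r.partition()),
                        new OffsetAndMetadata(r.offset() + 1));  // next offset to consume
        }
        producer.sendOffsetsToTransaction(offsets, consumer.groupMetadata());
        producer.commitTransaction();  // outputs and offsets become visible atomically
    } catch (Exception e) {
        producer.abortTransaction();   // read_committed consumers never see partials
    }
}
```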
Controller and metadata
The controller manages partition leadership and administrative actions (modern clusters moved from ZooKeeper to a KRaft metadata quorum; operational details differ, but the interview point is that metadata is highly available, not magic). Clients cache cluster metadata and refresh on errors or at periodic intervals.
Client design checklist (LLD-relevant)
- Key choice for partition stickiness; null keys are spread across partitions (round-robin in older clients, sticky batching in newer ones), so they carry no per-key order.
- Schema evolution (Schema Registry) for contracts; think compatibility modes.
- Consumer idempotency and outbox for side effects to SQL/NoSQL.
- Monitoring: consumer lag, ISR shrink events, under-replicated partitions, request latencies on brokers.
For caching related read models built from Kafka, see caching.
Security, ACLs, and multi-tenant boundaries
Production clusters enable SASL authentication (SCRAM, OAuth bearer flows in managed offerings) and ACLs on topics, consumer groups, and transactional IDs. Treat topic prefixes as tenant boundaries when you sell SaaS—complement network policies with ACL hygiene.
Encryption in transit (TLS) is standard; at-rest encryption depends on cloud KMS integrations or disk-level guarantees.
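Client-side, SASL over TLS is a handful of properties; a sketch with SCRAM (credentials and paths are placeholders):

```java
Properties p = new Properties();
p.put("security.protocol", "SASL_SSL");
p.put("sasl.mechanism", "SCRAM-SHA-512");
p.put("sasl.jaas.config",
    "org.apache.kafka.common.security.scram.ScramLoginModule required "
    + "username=\"svc-orders\" password=\"<from-secret-store>\";");
p.put("ssl.truststore.location", "/etc/kafka/client.truststore.jks");
p.put("ssl.truststore.password", "<from-secret-store>");
```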
Exactly-once connectors and external systems
Kafka Connect sink connectors face the same boundary problem as apps: at-least-once delivery into databases requires upsert keys or dedupe tables. Source connectors must track offsets durably.
When sketching LLD for an outbox, position Kafka after the database commit so downstream consumers never observe unpublished events—this mirrors patterns in idempotency.
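A minimal outbox write, assuming a JDBC DataSource and an illustrative outbox table; a separate relay (poller or CDC) publishes committed rows to Kafka afterward:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.util.UUID;

try (Connection db = dataSource.getConnection()) {
    db.setAutoCommit(false);
    try (PreparedStatement order = db.prepareStatement(
             "INSERT INTO orders(id, status) VALUES (?, ?)");
         PreparedStatement outbox = db.prepareStatement(
             "INSERT INTO outbox(event_id, topic, msg_key, payload) VALUES (?, ?, ?, ?)")) {
        order.setString(1, orderId);  // orderId and payloadJson are hypothetical
        order.setString(2, "PLACED");
        order.executeUpdate();
        outbox.setString(1, UUID.randomUUID().toString());
        outbox.setString(2, "orders");
        outbox.setString(3, orderId);
        outbox.setString(4, payloadJson);
        outbox.executeUpdate();
        db.commit();  // the event row exists iff the order row exists
    }
}
```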
Monitoring signals that separate junior from senior answers
- Under-replicated partitions persisting: replication cannot catch up (network, IO, GC pauses on brokers).
- Offline replicas / ISR shrink: durability risk; investigate follower fetch lag.
- Consumer lag rising: processing slower than arrival—scale consumers, optimize tasks, or increase partitions after proving CPU saturation is partition-bound.
- Request handler average idle percent: low idle on one broker pinpoints a hotspot.
Tie lag to product SLOs: a growing lag is a data freshness problem, not only an ops metric.
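Lag is computable from two AdminClient calls, committed offsets versus log end offsets, per partition (group id and client config are illustrative):

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.ListOffsetsResult;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import java.util.HashMap;
import java.util.Map;

try (AdminClient admin = AdminClient.create(conf)) {
    Map<TopicPartition, OffsetAndMetadata> committed =
        admin.listConsumerGroupOffsets("orders-svc")
             .partitionsToOffsetAndMetadata().get();
    Map<TopicPartition, OffsetSpec> query = new HashMap<>();
    committed.keySet().forEach(tp -> query.put(tp, OffsetSpec.latest()));
    Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> ends =
        admin.listOffsets(query).all().get();
    committed.forEach((tp, meta) ->
        System.out.printf("%s lag=%d%n", tp, ends.get(tp).offset() - meta.offset()));
}
```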
Quotas, throughput ceilings, and broker hotspots
Even with ample cluster aggregate throughput, a single overloaded broker (leader-heavy partitions, disk imbalance) can bottleneck produces/fetches. Rack awareness and reassignment tools rebalance leadership; tiered storage (where available) offloads cold segments.
Message headers and schema evolution
Headers carry trace context, tenant IDs, or content-type hints for non-breaking evolution. Pair Schema Registry compatibility rules with consumers that can read both old and new schema versions during upgrades; it is the same discipline as API versioning.
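Headers ride alongside the payload; a sketch (header names and values are conventions, not Kafka-defined, and the variables are hypothetical):

```java
import org.apache.kafka.clients.producer.ProducerRecord;
import java.nio.charset.StandardCharsets;

ProducerRecord<String, byte[]> rec =
    new ProducerRecord<>("orders", orderId, payloadBytes);  // hypothetical vars
rec.headers().add("trace-id", traceId.getBytes(StandardCharsets.UTF_8));
rec.headers().add("content-type",
    "application/vnd.orders.v2+avro".getBytes(StandardCharsets.UTF_8));
producer.send(rec);
```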