Kafka Internals & Production Operations
Kafka is a distributed, replicated, append-only commit log, and almost every claim it makes about durability and ordering reduces to "a partition is a log, replicated to a few brokers, with one leader." This note builds from that primitive up through replication and ISR, exactly-once across topics, KRaft, and the production failures that page on-call at 3am: under-replicated partitions, unclean leader election, rebalancing storms, and consumer-lag avalanches.
Prerequisites: Logs, replication, consumer offsets, and basic distributed-systems failure modes.
After this: Reason about Kafka ordering, durability, consumer groups, rebalancing, and exactly-once boundaries.
Suggested first pass: Read sections 1–5, answer each section in your own words, then use the remaining failure modes and exercises as the advanced pass.
Technically reviewed 21 June 2026 · Primary reference: Apache Kafka design documentation
1. A topic is split into partitions. Each partition is an ordered, append-only log replicated to a few brokers, one of which is the leader.
2. Producers append to the leader; the leader replicates to followers; a write is "committed" once every replica in the in-sync replica set (ISR) has it. Consumers read only committed records, and each consumer group tracks its own offset.
3. Ordering is per-partition only; parallelism is capped at one consumer per partition per group. Most Kafka questions reduce to those two facts.
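The two facts in point 3 can be sketched with a toy model. This is an illustration under stated assumptions, not Kafka's client code: `partition_for` uses CRC32 as a stand-in hash (not the hash Kafka's own partitioner uses), and `assign` is a hypothetical round-robin assignor. Both names are invented for this example.

```python
import zlib
from collections import defaultdict

def partition_for(key: bytes, num_partitions: int) -> int:
    # Stand-in hash (assumption): any stable hash keeps one key on one partition.
    return zlib.crc32(key) % num_partitions

def assign(partitions: list[int], consumers: list[str]) -> dict[str, list[int]]:
    # Hypothetical round-robin assignor: each partition gets exactly one owner in the group.
    assignment: dict[str, list[int]] = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

# Append keyed records; seq is the global send order.
logs = defaultdict(list)
for seq, key in enumerate([b"user-a", b"user-b", b"user-a", b"user-c", b"user-a"]):
    logs[partition_for(key, 3)].append((key, seq))

# Four consumers, three partitions: one consumer has nothing to do.
group = assign([0, 1, 2], ["c1", "c2", "c3", "c4"])
idle = [c for c, parts in group.items() if not parts]
```

All records for `user-a` land in one partition in send order, so their relative order is preserved; records for different keys may sit in different partitions with no ordering guarantee between them. The fourth consumer ends up in `idle`, which is the "no more working consumers than partitions" limit in miniature.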
Kafka's storage model is a replicated, durable, append-only commit log. A consumer group can provide queue-like work sharing, but records are not deleted when a consumer reads them; they age out by retention or are removed by compaction. That is why a group can replay old records and why multiple consumer groups can read the same topic independently, each at its own offset: queue-like sharing is one usage model layered on the log, not the only one.
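A minimal in-memory sketch of that storage model, assuming a single partition and a count-based retention rule for brevity (real retention is time- or size-based). `PartitionLog` and its methods are hypothetical names for this illustration, not part of any Kafka client.

```python
class PartitionLog:
    def __init__(self):
        self.base_offset = 0     # offset of the oldest retained record
        self.records = []        # records[i] has offset base_offset + i
        self.group_offsets = {}  # group id -> next offset to read

    def append(self, value):
        self.records.append(value)
        return self.base_offset + len(self.records) - 1

    def poll(self, group, max_records=10):
        # Reading never deletes; it only advances this group's own offset.
        # Simplification: offsets older than retention are clamped to the earliest record.
        start = max(self.group_offsets.get(group, self.base_offset), self.base_offset)
        idx = start - self.base_offset
        batch = self.records[idx: idx + max_records]
        self.group_offsets[group] = start + len(batch)
        return batch

    def seek(self, group, offset):
        self.group_offsets[group] = offset  # replay is just moving the offset back

    def enforce_retention(self, keep_last):
        # Records leave the log because of retention, not because someone read them.
        drop = max(0, len(self.records) - keep_last)
        self.records = self.records[drop:]
        self.base_offset += drop

log = PartitionLog()
for v in "abcd":
    log.append(v)

billing = log.poll("billing")   # reads everything
audit = log.poll("audit")       # independent group sees the same records
log.seek("billing", 0)
replay = log.poll("billing")    # replay from the start

log.enforce_retention(keep_last=2)
log.seek("audit", 0)
after_retention = log.poll("audit")  # only the retained tail is still readable
```

The design point is that consumer position lives outside the record data: each group holds an offset into a shared log, so reads are non-destructive and replay costs nothing more than a seek.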
Everything difficult then hangs off two invariants. (1) The partition is the unit of ordering, replication, and parallelism: order holds within a partition, never across partitions; throughput scales by adding partitions; a group cannot have more working consumers than partitions. (2) The leader plus ISR is the unit of durability: a record is safe once every in-sync replica holds it, and every durability or availability knob (acks, min.insync.replicas, unclean leader election) tunes what "in-sync" and "committed" mean. Those two invariants frame any Kafka design or incident.
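The second invariant can be sketched as a toy high-watermark model. This is a simplification under stated assumptions: the producer is assumed to use acks=all, each append is one record, and a follower fetch catches up fully. The `Partition` class and the `NotEnoughReplicas` exception are hypothetical names for this illustration, loosely modeled on the broker rejecting acks=all writes when the ISR is smaller than min.insync.replicas.

```python
class NotEnoughReplicas(Exception):
    """Hypothetical error: ISR is smaller than min.insync.replicas."""

class Partition:
    def __init__(self, replicas, min_insync_replicas):
        self.leader = replicas[0]
        self.leo = {r: 0 for r in replicas}  # log end offset per replica
        self.isr = set(replicas)
        self.min_isr = min_insync_replicas

    def high_watermark(self):
        # "Committed" means every in-sync replica has the record.
        return min(self.leo[r] for r in self.isr)

    def produce_acks_all(self):
        # With acks=all, too small an ISR trades availability for durability.
        if len(self.isr) < self.min_isr:
            raise NotEnoughReplicas(f"isr={len(self.isr)} < min={self.min_isr}")
        self.leo[self.leader] += 1

    def fetch(self, follower):
        self.leo[follower] = self.leo[self.leader]  # follower catches up fully

    def shrink_isr(self, follower):
        self.isr.discard(follower)  # a lagging follower is dropped from the ISR

p = Partition(["b1", "b2", "b3"], min_insync_replicas=2)
p.produce_acks_all()
hw_before_fetch = p.high_watermark()   # followers have not fetched yet
p.fetch("b2")
hw_one_follower = p.high_watermark()   # b3 is still in the ISR and still behind
p.shrink_isr("b3")
hw_after_shrink = p.high_watermark()   # ISR is now {b1, b2}, both hold the record

p.shrink_isr("b2")
try:
    p.produce_acks_all()
    rejected = False
except NotEnoughReplicas:
    rejected = True
```

Shrinking the ISR is what lets the high watermark advance past a slow follower, and min.insync.replicas is what stops that shrinking from silently reducing "committed" to "only the leader has it."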