DevOps Interview Prep
Kafka – Cheat Sheet
What Kafka Is
One-Line Summary
“Kafka stores events as immutable logs across partitions, replicates them for fault tolerance, and lets consumer groups read independently using offsets.”
Mental Model
Kafka is not a queue.
It is a distributed, durable, ordered log where consumers move, not messages.
Simple terms:
A giant, durable, append-only log that lets different services send, store, and read events at massive scale and replay them whenever they want.
Technical terms:
A distributed commit log with partitioned topics, replicated storage, consumer groups, offset-based consumption, and high-throughput sequential disk writes. Kafka acts as both a message broker and a durable event store.
Glossary
| Term | Simple Explanation | Technical Explanation |
|---|---|---|
| Topic | A named stream of messages | A partitioned log storing records with keys, values, and timestamps |
| Partition | A slice of a topic | An ordered, immutable sequence of records with index-based offsets |
| Offset | A message’s position in a partition | A monotonically increasing pointer allowing random-access reads |
| Broker | A Kafka server that stores partitions | Handles replication, leader elections, fetch/produce requests |
| Producer | Sends messages into Kafka | Pushes batches to the leader partition via the partitioner |
| Consumer | Reads messages from Kafka | Pulls data and commits offsets to track progress |
| Consumer Group | A set of consumers working together | Kafka distributes partitions across group members |
| ISR | “In-Sync Replicas” that are up to date | Followers fully caught up with the leader; needed for safe failover |
| Retention | How long Kafka keeps messages | Time-based or size-based segment cleanup policies |
| Rebalance | Redistribution of partitions | Triggered by membership changes or topic metadata updates |
| KRaft | Kafka without ZooKeeper | Internal Raft-based metadata quorum |
Kafka Architecture Explained in Words
Mental Diagram
Think of Kafka as:
- Many servers holding long notebooks (brokers)
- Each notebook is split into sections (partitions)
- Writers only append at the end
- Readers remember the page number they reached
Step by Step
- Producers send events to Kafka
- Events go to a topic
- The topic is split into partitions
- Each partition lives on one leader broker
- Other brokers keep replica copies
- Consumers read partitions and track offsets themselves
- Kafka deletes old data based on retention rules
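A minimal sketch of that flow, assuming a local broker at `localhost:9092` and the confluent-kafka Python client (the topic name `events` is made up for illustration):

```python
from confluent_kafka import Producer, Consumer

# Producer appends an event to the end of one partition of the topic.
producer = Producer({"bootstrap.servers": "localhost:9092"})
producer.produce("events", key=b"user-42", value=b'{"action": "login"}')
producer.flush()  # block until the broker acknowledges the write

# Consumer reads from its last committed offset ("the page number it reached").
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "demo-readers",
    "auto.offset.reset": "earliest",  # start from the beginning if no offset exists
})
consumer.subscribe(["events"])
msg = consumer.poll(timeout=5.0)
if msg is not None and msg.error() is None:
    print(msg.topic(), msg.partition(), msg.offset(), msg.value())
consumer.close()
```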
Key Architectural Insight
Kafka moves compute, not data. Messages stay in place; consumers move forward.
Core Strengths
- Scales horizontally through partitions
- At-least-once delivery built-in
- High throughput due to sequential disk I/O
- Consumer groups for load-balanced processing
- Durable storage with retention policies
- Stream processing ecosystem (Kafka Streams, ksqlDB, Connect)
- Decoupled producers and consumers
- Replayable history via offsets
- Fault tolerant through replication
Common Pitfalls – With Explanations
1. Misconfigured partitions
- Bad practice: Using too few or too many partitions.
- Why (simple): Too few = not enough parallelism. Too many = Kafka wastes resources and slows down.
- Why (technical): Partition count determines consumer concurrency, controller load, filesystem overhead, and replication traffic.
2. Wrong retention strategy
- Bad practice: Expecting messages to disappear after being consumed.
- Why (simple): Kafka keeps messages based on time or size, not on read state.
- Why (technical): Consumer offsets are stored separately; retention policies work independently of consumption.
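Both pitfalls above are decided at topic-creation time. A sketch using confluent-kafka's admin API (topic name and values are illustrative, not recommendations):

```python
from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "localhost:9092"})

# Partition count sets the ceiling on consumer parallelism; retention.ms
# controls how long segments live regardless of whether anyone consumed them.
topic = NewTopic(
    "orders",                     # hypothetical topic name
    num_partitions=12,            # sized for expected consumer concurrency
    replication_factor=3,
    config={"retention.ms": str(7 * 24 * 60 * 60 * 1000)},  # 7 days
)
futures = admin.create_topics([topic])
futures["orders"].result()        # raises if creation failed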
3. Consumer offset mistakes
- Bad practice: Auto-committing immediately or too frequently.
- Why (simple): You may lose messages or process them twice.
- Why (technical): Offsets represent read position. Premature commit = potential loss on crash. Late commit = duplicates.
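One way to avoid the premature-commit trap, sketched with confluent-kafka: disable auto-commit and commit only after processing succeeds. This gives at-least-once semantics: duplicates are possible after a crash, loss is not. Group and topic names are hypothetical.

```python
from confluent_kafka import Consumer

def process(value: bytes) -> None:
    print("processing", value)          # stand-in for real business logic

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "billing",
    "enable.auto.commit": False,        # we decide when the offset advances
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["payments"])

while True:
    msg = consumer.poll(timeout=1.0)
    if msg is None or msg.error():
        continue
    process(msg.value())
    # Commit only after successful processing: a crash before this line means
    # the record is re-delivered (duplicate), never silently lost.
    consumer.commit(message=msg, asynchronous=False)
```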
4. Hot partitions
- Bad practice: Using a partition key that sends most events to a single partition.
- Why (simple): One consumer ends up doing all the work while others sit idle.
- Why (technical): The producer hashes the key to pick a partition; a skewed key distribution funnels most traffic into one partition and saturates its leader broker while other consumers sit idle.
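A sketch of how key choice drives placement (librdkafka-based clients hash the key to select a partition); the tenant and user names are invented:

```python
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

# Bad: one huge tenant -> one partition receives nearly all traffic.
producer.produce("events", key=b"tenant-big-corp", value=b"...")

# Better: key by a higher-cardinality entity (e.g., user within the tenant).
# Ordering is still guaranteed per user, and load spreads across partitions.
producer.produce("events", key=b"tenant-big-corp:user-8812", value=b"...")
producer.flush()
```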
5. Rebalancing storms
- Bad practice: Constant consumer group changes or slow consumers.
- Why (simple): Kafka pauses everything each time a rebalance happens.
- Why (technical): Group Coordinator triggers partition reassignment; consumers pause consumption until rebalance completes.
6. ISR (In-Sync Replica) issues
- Bad practice: Overloaded brokers cause replicas to fall behind.
- Why (simple): Fewer replicas mean less safety.
- Why (technical): ISR shrinkage reduces redundancy; leader failover becomes risky.
7. Running Kafka on slow disks or network
- Bad practice: Putting Kafka on cheap HDDs, NFS, or slow cloud disks.
- Why (simple): Kafka becomes unusable under load.
- Why (technical): Kafka relies on sequential disk writes and page cache; slow I/O collapses throughput and increases latency.
8. Treating Kafka like a job queue
- Bad practice: Expecting job-level ACKs, job deletion, or per-message priority.
- Why (simple): Kafka is a log, not a task queue. Jobs don’t disappear after processing.
- Why (technical): Kafka consumers track offsets, and messages remain in partitions per retention settings; no FIFO across partitions.
When to Use Kafka (Good Fits)
High-throughput event streaming
- Why (simple): Kafka handles insane write rates without choking.
- Why (technical): Sequential disk writes + batching + zero-copy send = huge throughput.
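Illustrative producer settings for throughput, assuming the confluent-kafka client; the values are starting points to tune, not recommendations:

```python
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "localhost:9092",
    "linger.ms": 20,              # wait up to 20 ms to fill larger batches
    "batch.size": 131072,         # target batch size in bytes
    "compression.type": "lz4",    # compress whole batches on the wire and disk
    "acks": "all",                # durability still matters at high volume
})
for i in range(100_000):
    producer.produce("metrics", value=f"sample-{i}".encode())
    producer.poll(0)              # serve delivery callbacks, avoid queue overflow
producer.flush()
```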
CDC – Change Data Capture
- Why (simple): Kafka reliably moves DB changes to other systems.
- Why (technical): Tools like Debezium capture binlog/WAL changes → Kafka topics → downstream processors.
Analytics pipelines
- Why (simple): You can stream events into analytics systems in real time.
- Why (technical): Kafka integrates with Spark, Flink, BigQuery, ClickHouse, etc. through Kafka Connect and connectors.
Microservice decoupling
- Why (simple): Services don’t need to call each other directly; everything listens to the stream.
- Why (technical): Loose coupling through pub/sub semantics + durable event logs.
Event replay and audit trails
- Why (simple): You can reprocess history anytime.
- Why (technical): Kafka stores immutable logs with offset-based navigation; consumers can rewind offsets at will.
Stream processing
- Why (simple): You can transform or aggregate data as it flows.
- Why (technical): Kafka Streams provides windowing, joins, aggregations, and state stores.
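Kafka Streams itself is a Java library; as a rough Python analogue, the same consume-transform-produce loop looks like this sketch (topic and group names hypothetical, and without Streams' state stores or windowing):

```python
from confluent_kafka import Consumer, Producer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "enricher",
    "auto.offset.reset": "earliest",
})
producer = Producer({"bootstrap.servers": "localhost:9092"})
consumer.subscribe(["raw-events"])

while True:
    msg = consumer.poll(timeout=1.0)
    if msg is None or msg.error():
        continue
    enriched = msg.value().upper()      # stand-in for a real transformation
    producer.produce("enriched-events", key=msg.key(), value=enriched)
    producer.poll(0)                    # serve delivery callbacks
```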
When Not to Use Kafka (Bad Fits)
Low-latency request/response messaging
- Why (simple): Kafka is fast, but not “RPC fast.” You’ll get milliseconds, not microseconds.
- Why (technical): Kafka batches messages and doesn’t provide synchronous RPC semantics.
Message priority systems
- Why (simple): Kafka doesn't let you easily prioritize certain messages over others.
- Why (technical): Partitions are strictly ordered logs; no priority queues per partition.
Exactly-once job execution
- Why (simple): Kafka can guarantee delivery semantics, not that your business logic runs once.
- Why (technical): Exactly-once processing requires idempotency + careful offset/transaction coordination.
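Kafka's transactions do give exactly-once semantics for pipelines that stay inside Kafka; external side effects (DB writes, HTTP calls) still need idempotency. A consume-transform-produce sketch with confluent-kafka, where the `transactional.id` and topic names are chosen for illustration:

```python
from confluent_kafka import Consumer, Producer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "tx-pipeline",
    "enable.auto.commit": False,
    "isolation.level": "read_committed",   # don't read aborted records
})
producer = Producer({
    "bootstrap.servers": "localhost:9092",
    "transactional.id": "tx-pipeline-1",   # enables idempotence + transactions
})
producer.init_transactions()
consumer.subscribe(["in-topic"])

while True:
    msg = consumer.poll(timeout=1.0)
    if msg is None or msg.error():
        continue
    producer.begin_transaction()
    producer.produce("out-topic", value=msg.value())
    # Commit the consumer offset inside the same transaction: output and
    # progress become atomic (both happen or neither does).
    producer.send_offsets_to_transaction(
        consumer.position(consumer.assignment()),
        consumer.consumer_group_metadata(),
    )
    producer.commit_transaction()
```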
Very small workloads
- Why (simple): Kafka is heavy; running it for tiny message volume is overkill.
- Why (technical): ZooKeeper/KRaft controllers, brokers, retention, replication… too much operational overhead for low throughput.
Dynamic routing, filtering, or complex message semantics
- Why (simple): Kafka doesn’t natively route messages based on content.
- Why (technical): No content-based routing; you must implement it with consumers or use something like RabbitMQ/NATS.
Mini Examples – With Reasoning
Bad: Keying partitions by user ID when a few users produce most events.
- Why (simple): Creates “hot partitions.”
- Why (technical): Hash-based partitioner produces skew → throughput imbalance.
Bad: Assuming Kafka deletes messages after consumption.
- Why (simple): You’ll be surprised when storage fills up.
- Why (technical): Kafka retains based on time/size, not offsets.
Good: Using high-cardinality or composite keys when ordering per entity matters but you need distribution.
- Why (simple): You keep order without bottlenecking the cluster.
- Why (technical): Partition key = entity ID → ordering; large cardinality → even distribution.
Good: Using Kafka for event replay when debugging production issues.
- Why (simple): You can “time-travel” through system events.
- Why (technical): Offsets can be reset per consumer group to reprocess historical data.
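A replay sketch: reset every assigned partition to the beginning (or use `offsets_for_times()` to jump to a timestamp), assuming confluent-kafka. The group id is a throwaway so production offsets are untouched:

```python
from confluent_kafka import Consumer, OFFSET_BEGINNING

def rewind(consumer, partitions):
    # Called on partition assignment: start every partition from offset 0.
    for tp in partitions:
        tp.offset = OFFSET_BEGINNING
    consumer.assign(partitions)

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "debug-replay-2024-01",   # fresh group = independent offsets
})
consumer.subscribe(["payments"], on_assign=rewind)

while True:
    msg = consumer.poll(timeout=1.0)
    if msg is None or msg.error():
        continue
    print(msg.offset(), msg.value())      # "time-travel" through history
```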
End-to-End Example
Simple Story Example
- A payment service emits “payment_completed” events
- Kafka stores them in order
- Analytics and billing services read them independently
- If analytics crashes, it continues from where it left off
- Nothing is lost, nothing is blocked
Technical Walkthrough
- Producer sends records to topic `payments`
- Records are hashed by `payment_id` → specific partition
- Partition leader writes to disk and replicates to ISR followers
- Consumers in different consumer groups poll the partition
- Each group commits its own offsets
- Retention deletes old segments after policy threshold
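The producer half of this walkthrough as a sketch (confluent-kafka assumed; the delivery callback fires once the leader, and with `acks=all` the ISR, has the record):

```python
import json
from confluent_kafka import Producer

def on_delivery(err, msg):
    # Reports the partition the key hashed to and the offset the leader assigned.
    if err is not None:
        print(f"delivery failed: {err}")
    else:
        print(f"payments[{msg.partition()}] @ offset {msg.offset()}")

producer = Producer({"bootstrap.servers": "localhost:9092", "acks": "all"})
event = {"payment_id": "p-1001", "status": "payment_completed"}

# Same payment_id -> same partition -> all events for one payment stay ordered.
producer.produce(
    "payments",
    key=event["payment_id"].encode(),
    value=json.dumps(event).encode(),
    on_delivery=on_delivery,
)
producer.flush()
```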
Failure Scenarios
1. Broker Dies
Simple terms:
Kafka promotes a backup broker automatically and keeps going.
Technical terms: If a broker fails:
- Controller detects failure
- ISR follower is promoted to leader
- Clients update metadata and continue
- No data loss if replicas were in ISR
✅ Works if replication factor ≥ 2 (3 is the usual production default)
2. Consumer Dies
Simple terms:
Another consumer takes over its work.
Technical terms:
- Consumer group rebalance triggers
- Partitions reassigned to healthy consumers
- Consumption resumes from last committed offset
✅ Messages are reprocessed at worst
3. ISR Shrinks (Very Important)
Simple terms:
Backups fall behind. Kafka becomes fragile.
Technical terms:
- Followers lag leader too far
- Removed from ISR
- `min.insync.replicas` may prevent writes
- Broker failure now risks data loss
⚠️ Production red flag
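The two knobs that make ISR shrinkage fail fast instead of silently: topic-level `min.insync.replicas` plus producer `acks=all`. A sketch at topic-creation time (names and values illustrative):

```python
from confluent_kafka import Producer
from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "localhost:9092"})
admin.create_topics([NewTopic(
    "payments",
    num_partitions=12,
    replication_factor=3,
    # Refuse writes unless at least 2 replicas (leader included) are in sync.
    config={"min.insync.replicas": "2"},
)])["payments"].result()

# acks=all makes the producer wait for every in-sync replica; combined with
# min.insync.replicas=2, a produce fails with NOT_ENOUGH_REPLICAS rather than
# accepting under-replicated data when followers fall out of the ISR.
producer = Producer({"bootstrap.servers": "localhost:9092", "acks": "all"})
```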
4. Producer Crashes
Simple terms:
Some messages may be sent twice.
Technical terms:
Retries + acks may cause duplicates unless idempotence is enabled.
✅ Use `enable.idempotence=true`
5. Controller Failure (KRaft)
Simple terms:
Another controller takes over.
Technical terms:
- Raft quorum elects new controller
- Metadata remains consistent
- Cluster keeps operating