DevOps Interview Prep
Kafka – Cheat Sheet
What Kafka Is
One-Line Summary
“Kafka stores events as immutable logs across partitions, replicates them for fault tolerance, and lets consumer groups read independently using offsets.”
Mental Model
Kafka is not a queue.
It is a distributed, durable, ordered log where consumers move, not messages.
Simple terms:
A giant, durable, append-only log that lets different services send, store, and read events at massive scale and replay them whenever they want.
Technical terms:
A distributed commit log with partitioned topics, replicated storage, consumer groups, offset-based consumption, and high-throughput sequential disk writes. Kafka acts as both a message broker and a durable event store.
Glossary
| Term | Simple Explanation | Technical Explanation |
|---|---|---|
| Topic | A named stream of messages | A partitioned log storing records with keys, values, and timestamps |
| Partition | A slice of a topic | An ordered, immutable sequence of records with index-based offsets |
| Offset | A message’s position in a partition | A monotonically increasing pointer allowing random-access reads |
| Broker | A Kafka server that stores partitions | Handles replication, leader elections, fetch/produce requests |
| Producer | Sends messages into Kafka | Pushes batches to the leader partition via the partitioner |
| Consumer | Reads messages from Kafka | Pulls data and commits offsets to track progress |
| Consumer Group | A set of consumers working together | Kafka distributes partitions across group members |
| ISR | “In-Sync Replicas” that are up to date | Followers fully caught up with the leader; needed for safe failover |
| Retention | How long Kafka keeps messages | Time-based or size-based segment cleanup policies |
| Rebalance | Redistribution of partitions | Triggered by membership changes or topic metadata updates |
| KRaft | Kafka without ZooKeeper | Internal Raft-based metadata quorum |
Kafka Architecture Explained in Words
Mental Diagram
Think of Kafka as:
- Many servers holding long notebooks (brokers)
- Each notebook is split into sections (partitions)
- Writers only append at the end
- Readers remember the page number they reached
Step by Step
- Producers send events to Kafka
- Events go to a topic
- The topic is split into partitions
- Each partition lives on one leader broker
- Other brokers keep replica copies
- Consumers read partitions and track offsets themselves
- Kafka deletes old data based on retention rules
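A minimal sketch of that flow, assuming a local broker at `localhost:9092` and the confluent-kafka Python client (the topic name `events` is made up for illustration):

```python
from confluent_kafka import Producer, Consumer

# Producer appends an event to the end of one partition of the topic.
producer = Producer({"bootstrap.servers": "localhost:9092"})
producer.produce("events", key=b"user-42", value=b'{"action": "login"}')
producer.flush()  # block until the broker acknowledges the write

# Consumer reads from its last committed offset ("the page number it reached").
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "demo-readers",
    "auto.offset.reset": "earliest",  # start from the beginning if no offset exists
})
consumer.subscribe(["events"])
msg = consumer.poll(timeout=5.0)
if msg is not None and msg.error() is None:
    print(msg.topic(), msg.partition(), msg.offset(), msg.value())
consumer.close()
```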
Key Architectural Insight
Kafka moves compute, not data. Messages stay in place; consumers move forward.
Core Strengths
- Scales horizontally through partitions
- At-least-once delivery built-in
- High throughput due to sequential disk I/O
- Consumer groups for load-balanced processing
- Durable storage with retention policies
- Stream processing ecosystem (Kafka Streams, ksqlDB, Connect)
- Decoupled producers and consumers
- Replayable history via offsets
- Fault tolerant through replication
Common Pitfalls – With Explanations
1. Misconfigured partitions
- Bad practice: Using too few or too many partitions.
- Why (simple): Too few = not enough parallelism. Too many = Kafka wastes resources and slows down.
- Why (technical): Partition count determines consumer concurrency, controller load, filesystem overhead, and replication traffic.
2. Wrong retention strategy
- Bad practice: Expecting messages to disappear after being consumed.
- Why (simple): Kafka keeps messages based on time or size, not on read state.
- Why (technical): Consumer offsets are stored separately; retention policies work independently of consumption.
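Both pitfalls above are decided at topic-creation time. A sketch using confluent-kafka's admin API (topic name and values are illustrative, not recommendations):

```python
from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "localhost:9092"})

# Partition count sets the ceiling on consumer parallelism; retention.ms
# controls how long segments live regardless of whether anyone consumed them.
topic = NewTopic(
    "orders",                     # hypothetical topic name
    num_partitions=12,            # sized for expected consumer concurrency
    replication_factor=3,
    config={"retention.ms": str(7 * 24 * 60 * 60 * 1000)},  # 7 days
)
futures = admin.create_topics([topic])
futures["orders"].result()        # raises if creation failed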
3. Consumer offset mistakes
- Bad practice: Auto-committing immediately or too frequently.
- Why (simple): You may lose messages or process them twice.
- Why (technical): Offsets represent read position. Premature commit = potential loss on crash. Late commit = duplicates.
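One way to avoid the premature-commit trap, sketched with confluent-kafka: disable auto-commit and commit only after processing succeeds. This gives at-least-once semantics: duplicates are possible after a crash, loss is not. Group and topic names are hypothetical.

```python
from confluent_kafka import Consumer

def process(value: bytes) -> None:
    print("processing", value)          # stand-in for real business logic

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "billing",
    "enable.auto.commit": False,        # we decide when the offset advances
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["payments"])

while True:
    msg = consumer.poll(timeout=1.0)
    if msg is None or msg.error():
        continue
    process(msg.value())
    # Commit only after successful processing: a crash before this line means
    # the record is re-delivered (duplicate), never silently lost.
    consumer.commit(message=msg, asynchronous=False)
```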
4. Hot partitions
- Bad practice: Using a partition key that sends most events to a single partition.
- Why (simple): One consumer ends up doing all the work while others sit idle.
- Why (technical): The producer hashes the key to pick a partition; a skewed key distribution funnels most traffic into one partition and saturates its leader broker while other consumers sit idle.
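A sketch of how key choice drives placement (librdkafka-based clients hash the key to select a partition); the tenant and user names are invented:

```python
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

# Bad: one huge tenant -> one partition receives nearly all traffic.
producer.produce("events", key=b"tenant-big-corp", value=b"...")

# Better: key by a higher-cardinality entity (e.g., user within the tenant).
# Ordering is still guaranteed per user, and load spreads across partitions.
producer.produce("events", key=b"tenant-big-corp:user-8812", value=b"...")
producer.flush()
```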
5. Rebalancing storms
- Bad practice: Constant consumer group changes or slow consumers.
- Why (simple): Kafka pauses everything each time a rebalance happens.
- Why (technical): Group Coordinator triggers partition reassignment; consumers pause consumption until rebalance completes.
6. ISR (In-Sync Replica) issues
- Bad practice: Overloaded brokers cause replicas to fall behind.
- Why (simple): Fewer replicas mean less safety.
- Why (technical): ISR shrinkage reduces redundancy; leader failover becomes risky.
7. Running Kafka on slow disks or network
- Bad practice: Putting Kafka on cheap HDDs, NFS, or slow cloud disks.
- Why (simple): Kafka becomes unusable under load.
- Why (technical): Kafka relies on sequential disk writes and page cache; slow I/O collapses throughput and increases latency.
8. Treating Kafka like a job queue
- Bad practice: Expecting job-level ACKs, job deletion, or per-message priority.
- Why (simple): Kafka is a log, not a task queue. Jobs don’t disappear after processing.
- Why (technical): Kafka consumers track offsets, and messages remain in partitions per retention settings; no FIFO across partitions.
When to Use Kafka (Good Fits)
High-throughput event streaming
- Why (simple): Kafka handles insane write rates without choking.
- Why (technical): Sequential disk writes + batching + zero-copy send = huge throughput.
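Illustrative producer settings for throughput, assuming the confluent-kafka client; the values are starting points to tune, not recommendations:

```python
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "localhost:9092",
    "linger.ms": 20,              # wait up to 20 ms to fill larger batches
    "batch.size": 131072,         # target batch size in bytes
    "compression.type": "lz4",    # compress whole batches on the wire and disk
    "acks": "all",                # durability still matters at high volume
})
for i in range(100_000):
    producer.produce("metrics", value=f"sample-{i}".encode())
    producer.poll(0)              # serve delivery callbacks, avoid queue overflow
producer.flush()
```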
CDC – Change Data Capture
- Why (simple): Kafka reliably moves DB changes to other systems.
- Why (technical): Tools like Debezium capture binlog/WAL changes → Kafka topics → downstream processors.
Analytics pipelines
- Why (simple): You can stream events into analytics systems in real time.
- Why (technical): Kafka integrates with Spark, Flink, BigQuery, ClickHouse, etc. through Kafka Connect and connectors.
Microservice decoupling
- Why (simple): Services don’t need to call each other directly; everything listens to the stream.
- Why (technical): Loose coupling through pub/sub semantics + durable event logs.
Event replay and audit trails
- Why (simple): You can reprocess history anytime.
- Why (technical): Kafka stores immutable logs with offset-based navigation; consumers can rewind offsets at will.
Stream processing
- Why (simple): You can transform or aggregate data as it flows.
- Why (technical): Kafka Streams provides windowing, joins, aggregations, and state stores.
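Kafka Streams itself is a Java library; as a rough Python analogue, the same consume-transform-produce loop looks like this sketch (topic and group names hypothetical, and without Streams' state stores or windowing):

```python
from confluent_kafka import Consumer, Producer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "enricher",
    "auto.offset.reset": "earliest",
})
producer = Producer({"bootstrap.servers": "localhost:9092"})
consumer.subscribe(["raw-events"])

while True:
    msg = consumer.poll(timeout=1.0)
    if msg is None or msg.error():
        continue
    enriched = msg.value().upper()      # stand-in for a real transformation
    producer.produce("enriched-events", key=msg.key(), value=enriched)
    producer.poll(0)                    # serve delivery callbacks
```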
When Not to Use Kafka (Bad Fits)
Low-latency request/response messaging
- Why (simple): Kafka is fast, but not “RPC fast.” You’ll get milliseconds, not microseconds.
- Why (technical): Kafka batches messages and doesn’t provide synchronous RPC semantics.
Message priority systems
- Why (simple): Kafka doesn't let you easily prioritize certain messages over others.
- Why (technical): Partitions are strictly ordered logs; no priority queues per partition.
Exactly-once job execution
- Why (simple): Kafka can guarantee delivery semantics, not that your business logic runs once.
- Why (technical): Exactly-once processing requires idempotency + careful offset/transaction coordination.
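Kafka's transactions do give exactly-once semantics for pipelines that stay inside Kafka; external side effects (DB writes, HTTP calls) still need idempotency. A consume-transform-produce sketch with confluent-kafka, where the `transactional.id` and topic names are chosen for illustration:

```python
from confluent_kafka import Consumer, Producer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "tx-pipeline",
    "enable.auto.commit": False,
    "isolation.level": "read_committed",   # don't read aborted records
})
producer = Producer({
    "bootstrap.servers": "localhost:9092",
    "transactional.id": "tx-pipeline-1",   # enables idempotence + transactions
})
producer.init_transactions()
consumer.subscribe(["in-topic"])

while True:
    msg = consumer.poll(timeout=1.0)
    if msg is None or msg.error():
        continue
    producer.begin_transaction()
    producer.produce("out-topic", value=msg.value())
    # Commit the consumer offset inside the same transaction: output and
    # progress become atomic (both happen or neither does).
    producer.send_offsets_to_transaction(
        consumer.position(consumer.assignment()),
        consumer.consumer_group_metadata(),
    )
    producer.commit_transaction()
```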
Very small workloads
- Why (simple): Kafka is heavy; running it for tiny message volume is overkill.
- Why (technical): ZooKeeper/KRaft controllers, brokers, retention, replication… too much operational overhead for low throughput.
Dynamic routing, filtering, or complex message semantics
- Why (simple): Kafka doesn’t natively route messages based on content.
- Why (technical): No content-based routing; you must implement it with consumers or use something like RabbitMQ/NATS.
Mini Examples – With Reasoning
Bad: Keying partitions by user ID when a few users produce most events.
- Why (simple): Creates “hot partitions.”
- Why (technical): Hash-based partitioner produces skew → throughput imbalance.
Bad: Assuming Kafka deletes messages after consumption.
- Why (simple): You’ll be surprised when storage fills up.
- Why (technical): Kafka retains based on time/size, not offsets.
Good: Using high-cardinality or composite keys when ordering per entity matters but you need distribution.
- Why (simple): You keep order without bottlenecking the cluster.
- Why (technical): Partition key = entity ID → ordering; large cardinality → even distribution.
Good: Using Kafka for event replay when debugging production issues.
- Why (simple): You can “time-travel” through system events.
- Why (technical): Offsets can be reset per consumer group to reprocess historical data.
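A replay sketch: reset every assigned partition to the beginning (or use `offsets_for_times()` to jump to a timestamp), assuming confluent-kafka. The group id is a throwaway so production offsets are untouched:

```python
from confluent_kafka import Consumer, OFFSET_BEGINNING

def rewind(consumer, partitions):
    # Called on partition assignment: start every partition from offset 0.
    for tp in partitions:
        tp.offset = OFFSET_BEGINNING
    consumer.assign(partitions)

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "debug-replay-2024-01",   # fresh group = independent offsets
})
consumer.subscribe(["payments"], on_assign=rewind)

while True:
    msg = consumer.poll(timeout=1.0)
    if msg is None or msg.error():
        continue
    print(msg.offset(), msg.value())      # "time-travel" through history
```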
End-to-End Example
Simple Story Example
- A payment service emits “payment_completed” events
- Kafka stores them in order
- Analytics and billing services read them independently
- If analytics crashes, it continues from where it left off
- Nothing is lost, nothing is blocked
Technical Walkthrough
- Producer sends records to topic `payments`
- Records are hashed by `payment_id` → specific partition
- Partition leader writes to disk and replicates to ISR followers
- Consumers in different consumer groups poll the partition
- Each group commits its own offsets
- Retention deletes old segments after policy threshold
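The producer half of this walkthrough as a sketch (confluent-kafka assumed; the delivery callback fires once the leader, and with `acks=all` the ISR, has the record):

```python
import json
from confluent_kafka import Producer

def on_delivery(err, msg):
    # Reports the partition the key hashed to and the offset the leader assigned.
    if err is not None:
        print(f"delivery failed: {err}")
    else:
        print(f"payments[{msg.partition()}] @ offset {msg.offset()}")

producer = Producer({"bootstrap.servers": "localhost:9092", "acks": "all"})
event = {"payment_id": "p-1001", "status": "payment_completed"}

# Same payment_id -> same partition -> all events for one payment stay ordered.
producer.produce(
    "payments",
    key=event["payment_id"].encode(),
    value=json.dumps(event).encode(),
    on_delivery=on_delivery,
)
producer.flush()
```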
Failure Scenarios
1. Broker Dies
Simple terms:
Kafka promotes a backup broker automatically and keeps going.
Technical terms: If a broker fails:
- Controller detects failure
- ISR follower is promoted to leader
- Clients update metadata and continue
- No data loss if replicas were in ISR
✅ Works if replication factor ≥ 2 (3 is the usual production default)
2. Consumer Dies
Simple terms:
Another consumer takes over its work.
Technical terms:
- Consumer group rebalance triggers
- Partitions reassigned to healthy consumers
- Consumption resumes from last committed offset
✅ Messages are reprocessed at worst
3. ISR Shrinks (Very Important)
Simple terms:
Backups fall behind. Kafka becomes fragile.
Technical terms:
- Followers lag leader too far
- Removed from ISR
- `min.insync.replicas` may prevent writes
- Broker failure now risks data loss
⚠️ Production red flag
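The two knobs that make ISR shrinkage fail fast instead of silently: topic-level `min.insync.replicas` plus producer `acks=all`. A sketch at topic-creation time (names and values illustrative):

```python
from confluent_kafka import Producer
from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "localhost:9092"})
admin.create_topics([NewTopic(
    "payments",
    num_partitions=12,
    replication_factor=3,
    # Refuse writes unless at least 2 replicas (leader included) are in sync.
    config={"min.insync.replicas": "2"},
)])["payments"].result()

# acks=all makes the producer wait for every in-sync replica; combined with
# min.insync.replicas=2, a produce fails with NOT_ENOUGH_REPLICAS rather than
# accepting under-replicated data when followers fall out of the ISR.
producer = Producer({"bootstrap.servers": "localhost:9092", "acks": "all"})
```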
4. Producer Crashes
Simple terms:
Some messages may be sent twice.
Technical terms:
Retries + acks may cause duplicates unless idempotence is enabled.
✅ Use `enable.idempotence=true`
5. Controller Failure (KRaft)
Simple terms:
Another controller takes over.
Technical terms:
- Raft quorum elects new controller
- Metadata remains consistent
- Cluster keeps operating