Apache Kafka Interview Questions for Data Engineers: 15 Questions That Actually Get Asked (2026)
Kafka is in every data engineering job description, but most candidates only know 'producers and consumers.' Master these 15 questions covering partitioning strategy, exactly-once semantics, and Kafka Connect patterns.
Key Takeaways
- Why Kafka Questions Trip Up Even Experienced Engineers
- Q1: How do you decide the number of partitions for a Kafka topic?
- Q2: Explain exactly-once semantics in Kafka
- Q3: What happens during a consumer group rebalance?
Why Kafka Questions Trip Up Even Experienced Engineers
Apache Kafka appears in 70%+ of data engineering job descriptions, yet most candidates can only explain the basics: producers publish messages, consumers read them, topics have partitions. That's a 5-minute tutorial, not interview-ready knowledge.
What separates candidates who get offers: understanding partition strategy trade-offs, exactly-once semantics (and when it actually works), consumer group rebalancing, and Kafka Connect vs custom consumers.
This guide covers 15 real questions from interviews at companies like Netflix, Uber, LinkedIn, and Walmart — with weak answers that get rejected and strong answers that get offers.
Q1: How do you decide the number of partitions for a Kafka topic?
Weak answer: "More partitions = more parallelism. I'd use a high number like 100."
Why it fails: Shows no understanding of the trade-offs. More partitions means more open file handles, longer leader election on broker failure, and higher end-to-end latency.
Strong answer: Partition count is a function of target throughput and consumer parallelism:
- Start with throughput math: If each partition can handle 10MB/s and you need 100MB/s, you need at least 10 partitions.
- Match consumer count: Maximum parallelism = number of partitions. If you have 20 consumers, you need at least 20 partitions.
- Leave room to grow: Increasing partition count is easy (just add more), but decreasing is impossible without recreating the topic.
My production guidelines:
- Start with 2x your expected consumer count
- Cap at 50 partitions per topic unless throughput demands more
- Monitor consumer lag — if it's consistently growing, add partitions + consumers
- For key-based partitioning (e.g., by user_id), changing partition count redistributes keys and breaks ordering guarantees. Plan the initial count carefully (a quick sizing sketch follows this list).
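To make the throughput math concrete, here's a rough sizing helper — a sketch, not a formula to follow blindly. The per-partition throughput figure in particular should come from your own benchmarks, not this example:

```python
import math

# Back-of-the-envelope partition sizing. All inputs are assumptions you
# should replace with measured numbers from your own cluster.
def suggest_partitions(target_mb_s: float,
                       per_partition_mb_s: float,
                       expected_consumers: int,
                       headroom: float = 2.0) -> int:
    """Partition count = max(throughput need, consumer parallelism) x headroom."""
    by_throughput = math.ceil(target_mb_s / per_partition_mb_s)
    return math.ceil(max(by_throughput, expected_consumers) * headroom)

# e.g. 100 MB/s target at ~10 MB/s per partition with 6 consumers -> 20
print(suggest_partitions(target_mb_s=100, per_partition_mb_s=10, expected_consumers=6))
```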
Q2: Explain exactly-once semantics in Kafka
Weak answer: "Kafka supports exactly-once delivery using transactions."
Strong answer: Exactly-once in Kafka has three different scopes — most candidates only know one:
- Idempotent producer (`enable.idempotence=true`): Prevents duplicates within a single producer session. The broker deduplicates using a sequence number per partition. Limitation: Only works within one producer instance. If the producer restarts, duplicates are possible unless you also use transactions.
- Transactional producer: Wraps multiple writes (across topics/partitions) in an atomic transaction. Either ALL messages are committed or NONE. Used by Kafka Streams for exactly-once stream processing.
- Consumer exactly-once: The hardest part. Kafka itself only guarantees at-least-once delivery to consumers. For exactly-once END-TO-END:
- Option A: Transactional producer + read_committed consumers (Kafka Streams does this)
- Option B: Idempotent writes on the consumer side (e.g., MERGE INTO with dedup key)
Production reality: Most teams use at-least-once + idempotent consumers because transactional exactly-once adds latency and complexity. Exceptions: financial systems where duplicate processing causes real money problems.
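To make the mechanics concrete, here's a minimal transactional sketch using the confluent-kafka Python client — the broker address, topic names, ids, and payloads are all illustrative:

```python
from confluent_kafka import Producer, Consumer

producer = Producer({
    "bootstrap.servers": "localhost:9092",
    "transactional.id": "orders-etl-1",  # must be stable across restarts
    # enable.idempotence is implied once transactional.id is set
})
producer.init_transactions()

producer.begin_transaction()
try:
    # Writes to multiple topics commit or abort as one unit.
    producer.produce("orders-clean", key="user-42", value='{"amount": 10}')
    producer.produce("orders-audit", key="user-42", value='{"seen": true}')
    producer.commit_transaction()  # both messages or neither
except Exception:
    producer.abort_transaction()
    raise

# Downstream readers must skip aborted/in-flight transactions:
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "orders-readers",
    "isolation.level": "read_committed",
})
```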
Q3: What happens during a consumer group rebalance?
Weak answer: "Partitions get redistributed among consumers in the group."
Strong answer: A rebalance is triggered when a consumer joins, leaves, or fails a heartbeat. During rebalance:
- Stop-the-world (eager rebalance): ALL consumers in the group stop processing. The group coordinator revokes all partition assignments and reassigns them. This causes processing gaps — no messages are consumed during rebalance.
- Cooperative rebalance (modern approach): Only the affected partitions are revoked and reassigned. Other consumers keep processing. Enabled via `partition.assignment.strategy` (set to the `CooperativeStickyAssignor` class in the Java client, or the string `cooperative-sticky` in librdkafka-based clients).
Why this matters in production:
- A deployment that rolls 10 consumers sequentially triggers 10 rebalances
- Each eager rebalance pauses ALL consumers for seconds
- Fix: Use cooperative-sticky assignment plus `session.timeout.ms=45000` and `heartbeat.interval.ms=15000`
- Better fix: Use static group membership (`group.instance.id`) — consumers that restart within `session.timeout.ms` get their old partitions back without triggering a rebalance (see the config sketch below)
Red flag answer: Saying "just increase the number of consumers" without mentioning rebalance overhead.
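For reference, a minimal consumer configuration along these lines, sketched with the confluent-kafka Python client — the broker address, group id, and instance id are illustrative:

```python
from confluent_kafka import Consumer

# Cooperative rebalancing + static membership (librdkafka-based client).
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "clickstream-etl",                          # illustrative
    "group.instance.id": "clickstream-etl-pod-0",           # static membership: unique per instance
    "partition.assignment.strategy": "cooperative-sticky",  # incremental rebalances
    "session.timeout.ms": 45000,    # restarts within this window keep old partitions
    "heartbeat.interval.ms": 15000,
})
consumer.subscribe(["clickstream"])
```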
Q4: Kafka Connect vs custom consumers — when do you use each?
Weak answer: "Kafka Connect is easier, custom consumers give more control."
Strong answer: The decision depends on your sink system and transformation needs:
Use Kafka Connect when:
- Sinking to a standard system (S3, HDFS, JDBC, Elasticsearch, BigQuery)
- Transformations are simple (field renaming, type conversion, filtering)
- You want reliable sink delivery without writing offset-management code (Connect tracks offsets for you; some connectors, like the S3 sink, can achieve exactly-once)
- You don't want to maintain consumer code
Use custom consumers when:
- Complex business logic per message (enrichment, API calls, conditional routing)
- Custom error handling (dead-letter queues with specific retry policies)
- Non-standard sink systems with no existing connector
- You need fine-grained control over batching and parallelism
Production pattern I use most often:
Kafka → Connect (S3 Sink) → Bronze layer
Bronze → Spark Structured Streaming → Silver layer

Kafka Connect handles the reliable landing. Spark handles the complex transformations. This avoids writing any custom consumer code.
Gotcha: Kafka Connect connectors vary wildly in quality. Always test the connector's failure recovery before production. The Confluent-maintained connectors are reliable; community connectors often aren't.
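As an illustration of the Connect side of that pattern, registering an S3 sink through the Connect REST API might look like this — the endpoint, bucket, and tuning values are all assumptions; check your connector's docs:

```python
import requests

# Hypothetical S3 sink registration against a Connect worker's REST API.
connector = {
    "name": "events-s3-sink",
    "config": {
        "connector.class": "io.confluent.connect.s3.S3SinkConnector",
        "topics": "events",
        "tasks.max": "4",
        "s3.bucket.name": "my-bronze-bucket",     # illustrative
        "s3.region": "us-east-1",
        "storage.class": "io.confluent.connect.s3.storage.S3Storage",
        "format.class": "io.confluent.connect.s3.format.parquet.ParquetFormat",
        "flush.size": "10000",                    # records per file; tune for your volume
    },
}
resp = requests.post("http://localhost:8083/connectors", json=connector)
resp.raise_for_status()
```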
Q5: How would you handle Kafka message ordering across partitions?
Weak answer: "Kafka guarantees ordering within a partition. Use a single partition for global ordering."
Why it fails: A single partition destroys throughput and isn't practical.
Strong answer: Ordering requirements come in three levels:
- Per-key ordering (most common): Produce with a key. All messages with the same key go to the same partition → ordering guaranteed within that key: `producer.send('events', key=user_id, value=event)`
- Global ordering: Rare requirement. Solutions:
- Single partition (low throughput, not recommended)
- Sequence numbers in messages + consumer-side reordering buffer
- Use a different system (Redis Streams, Kinesis) that supports global ordering natively
- Causal ordering: "Event A must be processed before Event B." Use the same key for causally related events, or embed a causal dependency chain in the message schema.
Production gotcha: Key-based partitioning breaks if you change partition count. The hash(key) → partition mapping changes, so the same user_id may land on a different partition. Solution: Use a custom partitioner with consistent hashing, or never change partition count for ordering-sensitive topics.
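Here's a sketch of a custom partitioner, using kafka-python since it accepts a partitioner callable. Note this only makes the key → partition mapping deterministic and library-independent; plain modulo still remaps keys when the partition count changes, so a true consistent-hash ring is more work:

```python
import hashlib
from kafka import KafkaProducer

def stable_partitioner(key_bytes, all_partitions, available_partitions):
    # md5 of the key gives the same mapping in every language/process.
    # Plain modulo (shown here) still remaps keys if the partition count
    # changes -- avoid resizing ordering-sensitive topics.
    idx = int(hashlib.md5(key_bytes).hexdigest(), 16) % len(all_partitions)
    return all_partitions[idx]

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",  # illustrative
    partitioner=stable_partitioner,
)
producer.send("events", key=b"user-42", value=b'{"type": "click"}')
producer.flush()
```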
Advanced: Schema Registry, Compaction, and Backpressure
Q6: Why use a Schema Registry with Kafka?
Without a schema registry, any producer can send any format. One bad producer breaks all consumers. The Schema Registry enforces compatibility rules:
- Backward compatible: New schema can read old data (safe for consumers to upgrade first)
- Forward compatible: Old schema can read new data (safe for producers to upgrade first)
- Full compatible: Both directions (safest, most restrictive)
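If you're on Confluent Schema Registry, compatibility mode is set per subject. A sketch of pinning it via the REST API — the registry URL and subject name are illustrative:

```python
import requests

registry = "http://localhost:8081"   # illustrative
subject = "orders-value"

# Enforce backward compatibility for all future schema versions of this subject.
resp = requests.put(
    f"{registry}/config/{subject}",
    json={"compatibility": "BACKWARD"},  # or FORWARD / FULL
    headers={"Content-Type": "application/vnd.schemaregistry.v1+json"},
)
resp.raise_for_status()
```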
Q7: What is log compaction and when do you use it?
Normal retention deletes old segments by time/size. Compaction keeps only the latest value per key, deleting older versions. Use cases: changelog topics (database CDC), configuration topics, KTable backing topics. Not suitable for: event streams where you need full history.
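Compaction is a topic-level setting. A sketch of creating a compacted changelog topic with the confluent-kafka admin client — topic name and tuning values are illustrative:

```python
from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "localhost:9092"})  # illustrative
topic = NewTopic(
    "user-profile-changelog",               # illustrative name
    num_partitions=12,
    replication_factor=3,
    config={
        "cleanup.policy": "compact",         # keep only the latest value per key
        "min.cleanable.dirty.ratio": "0.1",  # compact sooner than the 0.5 default
        "delete.retention.ms": "86400000",   # keep tombstones 24h so readers see deletes
    },
)
futures = admin.create_topics([topic])
futures["user-profile-changelog"].result()   # blocks; raises on failure
```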
Q8: How do you handle backpressure in Kafka consumers?
- Pause/resume: `consumer.pause(partitions)` stops fetching from specific partitions while processing catches up (see the sketch after this list)
- Reduce `max.poll.records`: Process fewer messages per poll cycle
- Scale horizontally: Add consumers (up to partition count)
- Consumer lag alerting: Monitor `kafka_consumer_group_lag` — alert when lag exceeds SLA threshold
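A rough pause/resume sketch with kafka-python: a bounded queue feeds a slow sink, and the consumer pauses fetching whenever the queue fills. The topic, thresholds, and `slow_write` handler are all hypothetical:

```python
import queue
import threading
from kafka import KafkaConsumer

work = queue.Queue(maxsize=1000)    # bounded buffer in front of the slow sink

def writer():
    while True:
        msg = work.get()
        slow_write(msg)             # hypothetical slow downstream write
        work.task_done()

threading.Thread(target=writer, daemon=True).start()

consumer = KafkaConsumer(
    "events",
    bootstrap_servers="localhost:9092",  # illustrative
    group_id="events-etl",
    max_poll_records=200,
)

while True:
    # Pausing stops fetches but keeps group membership (heartbeats) alive.
    if work.qsize() > 800 and not consumer.paused():
        consumer.pause(*consumer.assignment())
    elif work.qsize() < 200 and consumer.paused():
        consumer.resume(*consumer.paused())

    for tp, messages in consumer.poll(timeout_ms=500).items():
        for msg in messages:
            work.put(msg)           # blocks if the buffer is full
```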
Test Your Kafka Knowledge with DataEngPrep
Reading about Kafka is easy. Articulating these concepts clearly under interview pressure is hard.
Use DataEngPrep's Answer Analyzer to practice:
- Type your answer to any Kafka question
- Get scored on completeness, accuracy, and communication
- See the improved FAANG-level version
Over 1,800 real interview questions covering Kafka, Spark, SQL, Airflow, and more.
Written by the DataEngPrep Team
Our editorial team consists of experienced data engineers who have worked at top tech companies and gone through hundreds of real interviews. Every article is reviewed for technical accuracy and practical relevance to help you prepare effectively.
Learn more about our team →