Question 1

What architecture are you following in your current project, and why?

Accepted Answer

**Section 1 — The Context (The 'Why')**
The primary challenge in choosing and justifying an architecture is aligning technical complexity and operational cost with the business's latency requirements. A naive monolithic pipeline fails when schemas evolve, when a single component becomes a bottleneck, or when teams grow, which adds coordination overhead and deployment risk....

Question 2

CDC During Migration - explain approaches for real-time Change Data Capture

Accepted Answer

CDC captures inserts, updates, and deletes from a source and applies them to a target in near real time, enabling minimal-downtime migrations. **Approaches**:

- Log-based CDC (Debezium, AWS DMS): reads WAL/redo logs; lowest latency, no schema change.
- Trigger-based: triggers on the source; adds load and schema coupling.
- Timestamp/version columns: incremental only; misses deletes and out-of-order updates.
- Dual-write with reconciliation: applications write to both source and target; eventual consistency and complexity....
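Whichever capture approach is used, the target side has to apply inserts, updates, and deletes safely when events are redelivered. A minimal sketch of that apply step follows, assuming a simplified event shape: `ChangeEvent`, its fields, and the per-key `lsn` bookkeeping are hypothetical names for illustration, not the actual Debezium or AWS DMS message format.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class ChangeEvent:
    """Hypothetical, simplified change event (not a real connector format)."""
    op: str                         # "insert", "update", or "delete"
    key: int                        # primary key of the affected row
    row: Optional[Dict[str, Any]]   # new row image; None for deletes
    lsn: int                        # increasing log position from the source


def apply_change(target: Dict[int, Dict[str, Any]],
                 applied_lsn: Dict[int, int],
                 event: ChangeEvent) -> bool:
    """Apply one event; return False if it is stale or a duplicate."""
    if event.lsn <= applied_lsn.get(event.key, -1):
        return False  # already applied, or older than what the target holds
    if event.op == "delete":
        target.pop(event.key, None)
    elif event.op in ("insert", "update"):
        target[event.key] = dict(event.row or {})
    else:
        raise ValueError(f"unknown op: {event.op}")
    applied_lsn[event.key] = event.lsn
    return True


target: Dict[int, Dict[str, Any]] = {}
applied_lsn: Dict[int, int] = {}
events = [
    ChangeEvent("insert", 1, {"name": "a"}, lsn=10),
    ChangeEvent("update", 1, {"name": "b"}, lsn=11),
    ChangeEvent("update", 1, {"name": "b"}, lsn=11),  # redelivered duplicate
    ChangeEvent("insert", 2, {"name": "c"}, lsn=12),
    ChangeEvent("delete", 2, None, lsn=13),
]
results = [apply_change(target, applied_lsn, e) for e in events]
```

Tracking the last applied log position per key is one possible way to make replays harmless; a real migration would persist that position in the target store rather than in memory.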

Question 3

Briefly explain the architecture of Kafka.

Accepted Answer

**Section 1 — The Context (The 'Why')**
Kafka must handle millions of events per second while guaranteeing durability, ordering within partitions, and consumer group coordination. Failure modes include broker loss and consumer rebalance storms, and retention must be traded off against storage cost....
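The "ordering within partitions" guarantee can be illustrated with a small model of keyed partitioning. This is a sketch under stated assumptions, not Kafka's implementation: `choose_partition` uses CRC32 as a stand-in hash (Kafka's Java client hashes keyed records with murmur2), and `NUM_PARTITIONS` and `append_all` are hypothetical names.

```python
import zlib
from collections import defaultdict
from typing import Dict, List, Tuple

NUM_PARTITIONS = 3  # hypothetical partition count for a single topic


def choose_partition(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a key to a partition. CRC32 is a stand-in hash, not Kafka's."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def append_all(records: List[Tuple[str, str]]) -> Dict[int, List[Tuple[str, str]]]:
    """Append each (key, value) record to its partition's log, in arrival order."""
    log: Dict[int, List[Tuple[str, str]]] = defaultdict(list)
    for key, value in records:
        log[choose_partition(key)].append((key, value))
    return dict(log)


records = [
    ("user-1", "login"),
    ("user-2", "login"),
    ("user-1", "click"),
    ("user-1", "logout"),
    ("user-2", "logout"),
]
partitions = append_all(records)


def events_for(key: str) -> List[str]:
    """Read back one key's events from the single partition that holds them."""
    return [v for k, v in partitions[choose_partition(key)] if k == key]
```

Because every record with the same key lands in the same partition, `events_for("user-1")` comes back in the order it was produced, while no order is implied between records that land in different partitions.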

Question 4

Describe the data pipeline architecture you've worked with.

Accepted Answer

**Section 1 — The Context (The 'Why')**
Data pipeline design must reconcile batch latency (hours) with streaming complexity (exactly-once, backpressure). A naive approach either over-engineers (Kafka for daily batch) or under-engineers (no idempotency, no lineage)....
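The idempotency point can be made concrete with a load step that is safe to retry. The sketch below is one possible approach, assuming an in-memory SQLite target and a hypothetical `orders` table keyed by `order_id`; `load_batch` is an illustrative helper, not part of any pipeline framework.

```python
import sqlite3
from typing import Iterable, Tuple


def load_batch(conn: sqlite3.Connection, rows: Iterable[Tuple[str, int]]) -> None:
    """Write rows keyed by order_id so that re-running the same batch is a no-op."""
    with conn:  # one transaction per batch: commit on success, roll back on error
        conn.executemany(
            "INSERT OR REPLACE INTO orders (order_id, amount) VALUES (?, ?)",
            rows,
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id TEXT PRIMARY KEY, amount INTEGER)")

batch = [("o-1", 100), ("o-2", 250)]
load_batch(conn, batch)
load_batch(conn, batch)  # simulated retry after a partial failure upstream

row_count, total = conn.execute(
    "SELECT COUNT(*), SUM(amount) FROM orders"
).fetchone()
```

With a plain `INSERT` and no primary key, the retry would double the total; keying the write on a stable identifier is what keeps the rerun harmless.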

Question 5

Explain the trade-offs between batch and real-time data processing. Provide examples of when each is appropriate.

Accepted Answer

**Why the distinction matters**: Batch and streaming systems are optimized for different access patterns. Batch assumes bounded, complete datasets; streaming assumes unbounded data that never finishes arriving. The choice cascades into storage layout, compute model, operational complexity, and cost structure.

**Batch**: Designed for high-throughput, bounded processing. Lower cost per record because compute is amortized (e.g., spot instances). Easier to reason about correctness (all-or-nothing). Trade-off: end-to-end latency is on the order of hours....
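The bounded versus unbounded distinction shows up directly in code shape. The following sketch computes the same per-user totals both ways, assuming a hypothetical `(timestamp, user, amount)` event tuple; `batch_totals` and `streaming_totals` are illustrative names, not APIs of any engine.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Event = Tuple[int, str, int]  # (timestamp_seconds, user, amount): hypothetical shape


def batch_totals(events: List[Event]) -> Dict[str, int]:
    """Batch: the full, bounded dataset is available; one result at the end."""
    totals: Dict[str, int] = defaultdict(int)
    for _, user, amount in events:
        totals[user] += amount
    return dict(totals)


def streaming_totals(events: Iterable[Event]) -> Iterator[Dict[str, int]]:
    """Streaming: input may never end, so emit an updated result per event."""
    totals: Dict[str, int] = defaultdict(int)
    for _, user, amount in events:
        totals[user] += amount
        yield dict(totals)  # snapshot of the running state


events = [(1, "a", 5), (2, "b", 7), (3, "a", 1)]
final = batch_totals(events)
snapshots = list(streaming_totals(iter(events)))
```

The batch function answers once, after all input is read; the streaming function answers continuously and must keep state between events, which is where the extra operational complexity comes from.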

Question 6

Architect a solution to handle notifications for millions of users with varying preferences.

Accepted Answer

**Section 1 — The Context (The 'Why')**
This system faces scale and failure challenges in production. A naive approach breaks under load, loses data, or violates compliance. The primary challenge varies by domain: notifications must respect user preferences; banking needs ACID; pipelines need idempotency....
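For notifications, respecting preferences amounts to a filter applied before any message is queued. A minimal sketch follows, assuming a hypothetical `Preferences` record with opted-in channels and muted topics; the field names and `channels_for` helper are illustrative only.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Preferences:
    """Hypothetical per-user settings; real systems would load these from a store."""
    channels: Set[str] = field(default_factory=set)      # e.g. {"email", "push"}
    muted_topics: Set[str] = field(default_factory=set)  # topics the user opted out of


def channels_for(user_id: str, topic: str, prefs: Dict[str, Preferences]) -> List[str]:
    """Return the channels a notification may use; empty means do not send."""
    p = prefs.get(user_id)
    if p is None or topic in p.muted_topics:
        return []  # unknown users and muted topics get nothing by default
    return sorted(p.channels)


prefs = {
    "u1": Preferences(channels={"push", "email"}, muted_topics={"marketing"}),
    "u2": Preferences(channels={"sms"}),
}
```

Defaulting to "send nothing" when preferences are missing is a design choice made here for safety; a product might instead fall back to a default channel.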

Question 7

Build a banking system architecture from scratch, highlighting critical workflows, scalability, and data management strategies.

Accepted Answer

**Section 1 — The Context (The 'Why')**
This system faces scale and failure challenges in production. A naive approach breaks under load, loses data, or violates compliance. The primary challenge varies by domain: notifications must respect user preferences; banking needs ACID; pipelines need idempotency....
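The ACID requirement for banking is easiest to see in a funds transfer, where the debit and credit must succeed or fail together. The sketch below uses Python's built-in `sqlite3` as a stand-in database; the `accounts` table and `transfer` function are hypothetical, and a production system would add locking, audit records, and idempotency keys.

```python
import sqlite3


def transfer(conn: sqlite3.Connection, src: str, dst: str, amount: int) -> bool:
    """Move amount from src to dst atomically; undo everything on overdraft."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?",
                (amount, src),
            )
            (balance,) = conn.execute(
                "SELECT balance FROM accounts WHERE id = ?", (src,)
            ).fetchone()
            if balance < 0:
                raise ValueError("insufficient funds")
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?",
                (amount, dst),
            )
        return True
    except ValueError:
        return False  # the debit above was rolled back


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER NOT NULL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 20)])
conn.commit()

ok = transfer(conn, "alice", "bob", 30)       # succeeds
failed = transfer(conn, "bob", "alice", 500)  # would overdraw bob, so it is undone
balances = dict(conn.execute("SELECT id, balance FROM accounts"))
```

The failed transfer leaves both balances exactly as they were after the first one, which is the atomicity property the answer refers to.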

Question 8

Business Role of Data Pipeline

Accepted Answer

**Section 1 — The Context (The 'Why')**
This system faces scale and failure challenges in production. A naive approach breaks under load, loses data, or violates compliance. The primary challenge varies by domain: notifications must respect user preferences; banking needs ACID; pipelines need idempotency....

Question 9

Can Schema Evolution lead to data inconsistencies? If so, how do you manage them?

Accepted Answer

**Section 1 — The Context (The 'Why')**
This system faces scale and failure challenges in production. A naive approach breaks under load, loses data, or violates compliance. The primary challenge varies by domain: notifications must respect user preferences; banking needs ACID; pipelines need idempotency....

Question 10

Can you explain the trade-offs you made during the design process?

Accepted Answer

**Section 1 — The Context (The 'Why')**
The primary challenge for this design is balancing scale, cost, and reliability. At scale, naive approaches fail: single points of failure cause cascades, schema evolution breaks consumers, and over-provisioning inflates cost. Failure modes include silent data loss from non-idempotent writes, cascading job failures from tight coupling, and operational burden from manual intervention....

Top System Design Interview Questions for Data Engineers

Difficulty Breakdown

Key Topics Covered

How to Use This Guide

Companies asking these questions

All 50 Questions

