Red Flag: Saying "we use Spark Streaming" without mentioning backpressure, checkpointing, or exactly-once semantics.
Pro-Move: Discuss a hybrid architecture in which you used streaming to meet real-time SLAs and batch for source-of-truth reconciliation, so that late-arriving data is handled and results stay idempotent.
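The hybrid pattern in the Pro-Move can be sketched in a few lines of plain Python. This is only an illustration under assumed conditions, not a production design: the event shape (`id`, `day`, `amount` in cents) and the function names are hypothetical. The streaming path tolerates at-least-once delivery by deduplicating on event id, and the batch path recomputes totals from the complete record set and overwrites the live view, so rerunning it is safe.

```python
from collections import defaultdict

def apply_stream_event(view, seen_ids, event):
    """Streaming path: at-least-once delivery made idempotent via event-id dedup."""
    if event["id"] in seen_ids:
        return  # redelivered event, already counted
    seen_ids.add(event["id"])
    view[event["day"]] = view.get(event["day"], 0) + event["amount"]

def batch_totals(all_events):
    """Batch path: recompute per-day totals from the full, deduplicated record set."""
    unique = {e["id"]: e for e in all_events}
    totals = defaultdict(int)
    for e in unique.values():
        totals[e["day"]] += e["amount"]
    return dict(totals)

def reconcile(view, totals):
    """Overwrite (not increment) live values with batch values, so reruns are no-ops."""
    corrected = dict(view)
    corrected.update(totals)
    return corrected

# Hypothetical events; amounts are integer cents.
e1 = {"id": "e1", "day": "2024-01-01", "amount": 10}
e2 = {"id": "e2", "day": "2024-01-01", "amount": 20}
e3 = {"id": "e3", "day": "2024-01-01", "amount": 5}  # arrives too late for the live view

live_view, seen = {}, set()
for event in [e1, e2, e1]:  # e1 is delivered twice
    apply_stream_event(live_view, seen, event)

truth = batch_totals([e1, e2, e1, e3])  # nightly job sees everything, including e3
final = reconcile(live_view, truth)
print(live_view, truth, final)
```

Because `reconcile` replaces values instead of adding to them, applying the same batch result twice leaves the view unchanged, which is the idempotency property the Pro-Move refers to.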
This hard-level System Design/Architecture question is reported in data engineering interviews at companies such as Expedia and Swiggy. It tests the deeper understanding that distinguishes strong candidates. Mastering the underlying concepts (join) will help you answer variations of this question confidently.
This is a senior-level question that tests architectural thinking. Lead with the high-level design, then drill into specifics. Discuss trade-offs explicitly; there is rarely one correct answer. Show awareness of scale, fault tolerance, and operational complexity.
**Why the distinction matters**: Systems are optimized for different access patterns. Batch assumes bounded, complete datasets; streaming assumes unbounded, continuously arriving data. The choice cascades into storage layout, compute model, operational complexity, and cost structure.

**Batch**: Designed for high-throughput, bounded processing. Lower cost per record because compute is amortized (e.g., spot instances). Easier to reason about correctness (all-or-nothing). Trade-off: end-to-end latency is on the order of hours. Use for: nightly reconciliation, ML training, complex multi-table joins, and regulatory reporting where freshness isn't critical.

**Streaming**: Designed for low-latency, unbounded processing. Higher cost (always-on consumers, stateful infrastructure). Requires explicit delivery semantics (at-least-once vs. exactly-once), checkpointing, backpressure handling, and a plan for schema evolution. Use for: fraud detection (<100 ms), inventory updates, recommendation systems, and alerting.

**Architectural trade-off**: Hybrid designs are common. In a Lambda architecture, a streaming layer serves real-time views while a batch layer handles historical backfill and complex aggregations. A Kappa architecture instead keeps a single streaming pipeline and backfills by replaying the event log.

**Cost implication**: A streaming pipeline running 24/7 can cost 3–5x a daily batch job for the same volume; justify it with the business value of lower latency.
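To make the bounded versus unbounded distinction concrete, here is a minimal sketch in plain Python that counts the same events per one-minute window in both styles. The window size, the allowed lateness, the watermark rule (maximum event time seen minus allowed lateness), and the sample events are all assumptions chosen for illustration; real engines implement this with more care. The batch function sees the whole dataset at once, while the streaming function must decide when a window is final and what to do with events that arrive after that point.

```python
from collections import defaultdict

WINDOW = 60            # tumbling window size in seconds (assumed)
ALLOWED_LATENESS = 30  # watermark trails the max event time by this much (assumed)

def window_start(ts):
    return ts // WINDOW * WINDOW

def batch_counts(events):
    """Batch: the input is bounded, so every event lands in its window."""
    counts = defaultdict(int)
    for ts, _key in events:
        counts[window_start(ts)] += 1
    return dict(counts)

def stream_counts(events):
    """Streaming: process in arrival order, finalizing windows as the watermark passes."""
    open_windows = defaultdict(int)
    emitted, dropped = {}, []
    max_ts = 0
    for ts, key in events:
        max_ts = max(max_ts, ts)
        watermark = max_ts - ALLOWED_LATENESS
        start = window_start(ts)
        if start + WINDOW <= watermark:
            dropped.append((ts, key))  # window already final: event is too late
        else:
            open_windows[start] += 1
        # Finalize every open window whose end is at or before the watermark.
        for w in sorted(w for w in open_windows if w + WINDOW <= watermark):
            emitted[w] = open_windows.pop(w)
    return emitted, dict(open_windows), dropped

# (event_time_seconds, key) in ARRIVAL order; (50, "e") arrives late.
EVENTS = [(5, "a"), (20, "b"), (70, "c"), (95, "d"), (50, "e"), (130, "f")]

print("batch :", batch_counts(EVENTS))
emitted, still_open, dropped = stream_counts(EVENTS)
print("stream:", emitted, still_open, dropped)
```

In this run the streaming path finalizes the first window with a count of 2 and drops the late event, while the batch path counts 3 for the same window. That gap is what a batch backfill or reconciliation job would correct in a hybrid design.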
According to DataEngPrep.tech, this is one of the most frequently asked System Design/Architecture interview questions, reported at 2 companies. DataEngPrep.tech maintains a curated database of 1,863+ real data engineering interview questions across 7 categories, verified by industry professionals.