Red Flag: Saying 'we use Spark Streaming' without mentioning backpressure, checkpointing, or exactly-once semantics.
Pro-Move: Discuss a hybrid architecture in which you used streaming to meet real-time SLAs and batch for source-of-truth reconciliation, so that late-arriving data is handled and writes stay idempotent.
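The hybrid pattern in the Pro-Move can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not a production design: the event schema `(event_id, day, amount)`, the function names `aggregate` and `reconcile`, and the sample data are all hypothetical. It shows a streaming view that publishes a provisional total, and a batch pass that recomputes from the full source of truth and overwrites it.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical event schema: (event_id, day, amount)
Event = Tuple[str, str, int]


def aggregate(events: Iterable[Event]) -> Dict[str, int]:
    """Sum amounts per day, counting each event_id only once.

    Deduplicating on event_id makes replays and duplicate deliveries harmless.
    """
    seen = set()
    totals: Dict[str, int] = defaultdict(int)
    for event_id, day, amount in events:
        if event_id in seen:
            continue  # duplicate delivery (at-least-once), skip it
        seen.add(event_id)
        totals[day] += amount
    return dict(totals)


def reconcile(serving: Dict[str, int], batch_totals: Dict[str, int]) -> Dict[str, int]:
    """Overwrite provisional streaming values with batch values.

    The write is an overwrite keyed by day, not an increment, so running
    the same reconciliation twice leaves the result unchanged.
    """
    merged = dict(serving)
    merged.update(batch_totals)
    return merged


# Events the streaming path saw in real time; e2 was delivered twice.
stream_events = [
    ("e1", "2024-01-01", 10),
    ("e2", "2024-01-01", 5),
    ("e2", "2024-01-01", 5),
]
# e3 arrived late, after the streaming view had already been published.
all_events = stream_events + [("e3", "2024-01-01", 7)]

provisional = aggregate(stream_events)        # real-time view, missing e3
batch = aggregate(all_events)                 # nightly recompute over everything
final = reconcile(provisional, batch)         # batch corrects the streaming view
final_again = reconcile(final, batch)         # rerun is a no-op (idempotent)

print(provisional, final, final_again == final)
```

The design choice worth calling out in an interview is that reconciliation writes are keyed overwrites rather than increments, which is what makes a retried or rerun batch job safe.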
This hard-level System Design/Architecture question has been reported in data engineering interviews at companies such as Expedia and Swiggy. Although it comes up less often than easier questions, it tests the deeper understanding that distinguishes strong candidates. Mastering the underlying concepts (such as joins) will help you answer variations of this question confidently.
This is a senior-level question that tests architectural thinking. Lead with the high-level design, then drill into specifics. Discuss trade-offs explicitly - there is rarely one correct answer. Show awareness of scale, fault tolerance, and operational complexity.
Why the distinction matters: Systems are optimized for different access patterns. Batch assumes bounded, complete datasets; streaming assumes unbounded, continuously arriving data. The choice cascades into storage layout, compute model, operational complexity, and cost structure.
Batch: Designed for high-throughput, bounded processing. Lower cost per record because compute is amortized and can run on interruptible capacity (e.g., spot instances). Easier to reason about correctness, since a run either completes or is retried as a whole. Trade-off: end-to-end latency is typically on the order of hours. Use for: nightly reconciliation, ML training, complex multi-table joins, and regulatory reporting where freshness isn't critical.
Streaming: Designed for low-latency, unbounded processing. Higher cost (always-on consumers, stateful infrastructure). Requires explicit delivery semantics (at-least-once vs exactly-once), plus checkpointing, backpressure handling, and a plan for schema evolution. Use for: fraud detection (<100ms), inventory updates, recommendation systems, alerting.
Architectural trade-off: A hybrid (Lambda) architecture is common: streaming for real-time views, batch for historical backfill and complex aggregations. Kappa is the alternative that keeps a single streaming path and reprocesses history by replaying the log.
Cost implication: A streaming pipeline running 24/7 can cost 3–5x a daily batch job for the same volume; justify it with the business value of lower latency.
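To make "explicit semantics" concrete, here is a rough sketch of what a streaming job has to decide that a batch job does not: when a window is complete, and what to do with duplicates and late events. It is plain Python rather than any real engine's API. The class `TumblingWindowCounter`, the constants `WINDOW_SECONDS` and `ALLOWED_LATENESS`, and the sample events are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

WINDOW_SECONDS = 60      # hypothetical tumbling window size
ALLOWED_LATENESS = 30    # hypothetical lateness tolerance, in seconds


class TumblingWindowCounter:
    """Counts events per event-time window using a simple watermark."""

    def __init__(self) -> None:
        self.open_windows: Dict[int, int] = defaultdict(int)
        self.seen_ids = set()          # a real system would expire this state
        self.max_event_time = 0
        self.emitted: List[Tuple[int, int]] = []   # (window_start, count)
        self.dropped_late: List[str] = []

    def _watermark(self) -> int:
        # Assume no event older than this will be accepted any more.
        return self.max_event_time - ALLOWED_LATENESS

    def process(self, event_id: str, event_time: int) -> None:
        if event_id in self.seen_ids:
            return  # duplicate from at-least-once delivery
        self.seen_ids.add(event_id)

        window_start = event_time - event_time % WINDOW_SECONDS
        if window_start + WINDOW_SECONDS <= self._watermark():
            # Window already closed: leave this event to batch reconciliation.
            self.dropped_late.append(event_id)
            return

        self.open_windows[window_start] += 1
        self.max_event_time = max(self.max_event_time, event_time)

        # Emit every window whose end is at or behind the watermark.
        for start in sorted(self.open_windows):
            if start + WINDOW_SECONDS <= self._watermark():
                self.emitted.append((start, self.open_windows.pop(start)))


counter = TumblingWindowCounter()
events = [
    ("a", 5), ("b", 50), ("b", 50),   # "b" is delivered twice
    ("c", 70),
    ("d", 55),                        # out of order, but within lateness
    ("e", 130),                       # advances the watermark past window 0
    ("f", 20),                        # too late: window 0 is already closed
]
for event_id, event_time in events:
    counter.process(event_id, event_time)

print(counter.emitted, counter.dropped_late, dict(counter.open_windows))
```

A batch job over the same data needs none of this bookkeeping, because the input is complete before processing starts. That difference is a large part of why streaming carries higher operational complexity.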
According to DataEngPrep.tech, this is one of the most frequently asked System Design/Architecture interview questions, reported at 2 companies. DataEngPrep.tech maintains a curated database of 1,863+ real data engineering interview questions across 7 categories, verified by industry professionals.