Explain wide vs. narrow transformations and how they drive shuffle cost, failure domains, and pipeline design. When would you intentionally add a wide transformation, and how do you minimize its impact?
Spark/Big Data | hard
2
Design a Delta table layout for a mixed workload: point lookups by user_id, range scans by date, and full partition scans. Compare partitioning vs. Z-ordering: when to use each, and the rewrite cost trade-off.
Spark/Big Data | hard
3
Architecturally, how do Job–Stage–Task boundaries in Spark's execution model affect cluster sizing and shuffle cost, and when would you deliberately collapse or split stages?
Spark/Big Data | hard
4
Design a fault-tolerant Spark Streaming checkpoint strategy: what to persist, recovery semantics, and the cost/scalability trade-offs of checkpoint frequency.
Spark/Big Data | hard
5
Explain the Medallion Architecture (Bronze, Silver, Gold layers).
Spark/Big Data | hard
6
Explain the benefits of using DataFrames over RDDs.
Spark/Big Data | hard
7
Explain the difference between batch and streaming data processing in Data Fusion.
Spark/Big Data | hard
8
Given a streaming dataset from Kafka, how would you ingest the data in real time using Spark?
Spark/Big Data | hard