Spark & Big Data questions from BCG data engineering interviews.
These Spark & big data questions are sourced from BCG data engineering interviews, each with an expert-level answer. The set leans toward senior-level depth: 8 of 10 are tagged hard. Recurring themes are partitioning, Spark, and optimization; these patterns appear most often in real interviews and reward the deepest preparation. Many of these questions also surface at Citi and Dunnhumby, so the preparation transfers across companies. The average answer takes about a minute to read; plan roughly an hour to work through the full set thoughtfully.
This collection contains 10 curated questions: 0 easy, 2 medium, and 8 hard. The distribution skews toward harder problems, reflecting the depth expected in senior-level interviews.
The most frequently tested areas in this set are partition (10), spark (6), optimization (5), join (4), python (2), and airflow (1). Focusing on these topics will give you the highest return on your preparation time.
Medium-difficulty questions form the bulk of real interviews, so spend the most time there and practice explaining your reasoning out loud. Hard questions often appear in senior- and staff-level rounds; attempt them once you're comfortable with the basics. For each question, try answering before revealing the solution. Use our AI Mock Interview to simulate real interview conditions and get instant feedback on your responses.
What is the difference between repartition and coalesce in Apache Spark?
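The key distinction to articulate: repartition performs a full shuffle and can increase or decrease the partition count, while coalesce only merges existing partitions downward and avoids a shuffle. A minimal pure-Python sketch of the data movement, where lists stand in for partitions (the function names are illustrative, not Spark's API):

```python
def repartition(partitions, n):
    """Full shuffle: every record is rehashed into n new partitions."""
    new = [[] for _ in range(n)]
    for part in partitions:
        for record in part:
            new[record % n].append(record)  # every record moves
    return new

def coalesce(partitions, n):
    """No shuffle: existing partitions are merged locally down to n."""
    n = min(n, len(partitions))  # coalesce cannot increase the count
    new = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        new[i % n].extend(part)  # whole partitions are concatenated
    return new

parts = [[1, 2], [3, 4], [5, 6], [7, 8]]
print(len(repartition(parts, 8)))  # 8 -- repartition can grow the count
print(len(coalesce(parts, 8)))     # 4 -- coalesce cannot exceed the original
print(len(coalesce(parts, 2)))     # 2 -- merging down works without a shuffle
```

This is also why coalesce is the cheap choice when shrinking partitions before a write, while repartition is needed to rebalance or grow parallelism.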
What strategies can you use to handle skewed data in Spark?
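One standard answer is key salting: append a random suffix to the hot key so its rows spread across several partitions (replicating the matching side of a join accordingly). A pure-Python sketch of how salting rebalances partition sizes, using a simple deterministic hash for illustration (not Spark code):

```python
import random

def bucket(key, n):
    """Toy deterministic hash: sum of bytes mod partition count."""
    return sum(key.encode()) % n

def partition_sizes(keys, n_parts):
    """Count how many rows land in each hash partition."""
    sizes = [0] * n_parts
    for k in keys:
        sizes[bucket(k, n_parts)] += 1
    return sizes

# A heavily skewed distribution: one hot key dominates.
keys = ["hot"] * 9000 + [f"k{i}" for i in range(1000)]

# Without salting, all 9000 "hot" rows land in a single partition.
before = partition_sizes(keys, 8)

# With salting, each "hot" row gets a suffix in [0, 8), spreading
# the hot key across up to 8 distinct hash buckets.
random.seed(0)
salted = [f"{k}_{random.randrange(8)}" if k == "hot" else k for k in keys]
after = partition_sizes(salted, 8)

print(max(before), max(after))  # the largest partition shrinks dramatically
```

In an interview, pair this with the alternatives: broadcast the small side, enable adaptive query execution's skew-join handling, or isolate the hot key into its own job.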
Design a Delta table layout for a mixed workload: point lookups by user_id, range scans by date, and full partition scans. Compare partitioning vs. Z-ordering: when to use each, and the rewrite-cost trade-off.
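The intuition behind Z-ordering: interleave the bits of several columns so that rows close in either dimension tend to land in the same files, letting min/max file statistics prune on both user_id and date without physically partitioning on either. A hypothetical sketch of the bit interleaving behind a Z-value (illustrative only, not Delta's internal implementation):

```python
def z_value(x, y, bits=16):
    """Interleave the bits of two column values into one Morton code."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)       # even bit positions: x
        z |= ((y >> i) & 1) << (2 * i + 1)   # odd bit positions: y
    return z

# Rows sorted by Z-value stay roughly clustered in BOTH dimensions,
# so per-file min/max statistics can prune on either column.
rows = [(u, d) for u in range(4) for d in range(4)]
rows.sort(key=lambda r: z_value(*r))
print(rows[:4])  # the first "file" holds the 2x2 square near the origin
```

Partitioning gives hard directory-level pruning on one column but multiplies small files; Z-ordering gives softer, statistics-based pruning on several columns at the cost of periodic OPTIMIZE rewrites.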
Compare Hadoop and Spark. Which one would you choose for a real-time application, and why?
Explain how HDFS (Hadoop Distributed File System) stores data across nodes.
Explain how to schedule an automated task using Apache Airflow.
How do Spark transformations differ from actions? Provide examples of each.
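The core distinction: transformations (map, filter, select) are lazy and only build up an execution plan, while actions (count, collect, show) force the plan to run. A pure-Python sketch of the lazy-pipeline idea (class and method names are illustrative, not Spark's API):

```python
class LazyDataset:
    """Transformations record a plan; actions execute it."""
    def __init__(self, data, plan=None):
        self.data = data
        self.plan = plan or []            # recorded, not-yet-run steps

    def map(self, fn):                    # transformation: nothing runs yet
        return LazyDataset(self.data, self.plan + [("map", fn)])

    def filter(self, pred):               # transformation: nothing runs yet
        return LazyDataset(self.data, self.plan + [("filter", pred)])

    def collect(self):                    # action: replay the whole plan now
        out = self.data
        for op, fn in self.plan:
            out = [fn(x) for x in out] if op == "map" else [x for x in out if fn(x)]
        return out

ds = LazyDataset([1, 2, 3, 4]).map(lambda x: x * 10).filter(lambda x: x > 15)
print(len(ds.plan))  # 2 deferred steps recorded, no data touched yet
print(ds.collect())  # action runs the plan: [20, 30, 40]
```

Laziness is what lets Spark's optimizer fuse and reorder steps before any data moves, which is worth saying explicitly in the answer.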
How would you optimize Spark jobs for better performance?
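One of the highest-leverage optimizations to mention is broadcasting a small table so the join avoids shuffling the large side. A pure-Python sketch of the broadcast hash join idea (illustrative, not Spark internals; the sample tables are made up):

```python
def broadcast_hash_join(large_rows, small_rows, key):
    """Ship the small side everywhere as a hash map; stream the large side."""
    lookup = {row[key]: row for row in small_rows}   # the "broadcast" copy
    joined = []
    for row in large_rows:                           # no shuffle of the large side
        match = lookup.get(row[key])
        if match is not None:
            joined.append({**row, **match})          # inner-join semantics
    return joined

users = [{"user_id": 1, "country": "DE"}, {"user_id": 2, "country": "FR"}]
events = [{"user_id": 1, "event": "click"}, {"user_id": 3, "event": "view"}]
print(broadcast_hash_join(events, users, "user_id"))
# only user_id 1 matches; user_id 3 is dropped
```

Round out the answer with caching reused datasets, tuning shuffle partition counts, enabling adaptive query execution, and fixing skew before touching cluster sizing.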
What role does Kafka play in real-time data streaming pipelines?
What strategies would you use to reduce latency in a streaming data pipeline?
Get full access to 1,800+ expert answers, AI mock interviews, and personalized progress tracking.