Spark & Big Data questions from BCG data engineering interviews.
These Spark and big data questions are sourced from BCG data engineering interviews. Each includes an expert-level answer.
What is the difference between repartition and coalesce in Apache Spark?
What strategies can you use to handle skewed data in Spark?
Design a Delta table layout for a mixed workload: point lookups by user_id, range scans by date, and full partition scans. Compare partitioning vs. Z-ordering: when to use each, and the rewrite cost trade-off.
Compare Hadoop and Spark. Which one would you choose for a real-time application, and why?
Explain how HDFS (Hadoop Distributed File System) stores data across nodes.
Explain how to schedule an automated task using Apache Airflow.
How do Spark transformations differ from actions? Provide examples of each.
How would you optimize Spark jobs for better performance?
What role does Kafka play in real-time data streaming pipelines?
What strategies would you use to reduce latency in a streaming data pipeline?
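For the skewed-data question above, one commonly cited strategy is key salting: a hot key is split into several sub-keys so its rows spread over more partitions. The sketch below is a plain-Python illustration of that idea, not Spark code. The toy partitioner `partition_of`, the constants `NUM_PARTITIONS` and `NUM_SALTS`, and the sample keys are all hypothetical names chosen for this example.

```python
from collections import Counter

NUM_PARTITIONS = 4  # hypothetical partition count
NUM_SALTS = 4       # hypothetical number of salt buckets per hot key


def partition_of(key, num_partitions=NUM_PARTITIONS):
    """Toy deterministic stand-in for a hash partitioner (an assumption,
    not Spark's actual hash function)."""
    return sum(str(key).encode()) % num_partitions


def partition_sizes(keys):
    """Count how many rows land in each partition."""
    counts = Counter(partition_of(k) for k in keys)
    return [counts.get(p, 0) for p in range(NUM_PARTITIONS)]


def salt_keys(keys, hot_keys, num_salts=NUM_SALTS):
    """Append a round-robin salt suffix to hot keys only."""
    seen = Counter()
    salted = []
    for k in keys:
        if k in hot_keys:
            salted.append(f"{k}#{seen[k] % num_salts}")
            seen[k] += 1
        else:
            salted.append(k)
    return salted


# One hot key with 1,000 rows plus a few ordinary keys with 10 rows each.
keys = ["user_0"] * 1000 + ["user_1", "user_2", "user_3"] * 10

before = partition_sizes(keys)
after = partition_sizes(salt_keys(keys, hot_keys={"user_0"}))

print("max partition size before salting:", max(before))
print("max partition size after salting: ", max(after))
```

Salting changes the key, so an aggregation needs a second pass that strips the salt and combines the partial results. For a join, the other side has to be replicated once per salt value so that every salted sub-key still finds its match.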