Spark & Big Data questions from PwC data engineering interviews.
These Spark & Big Data questions are sourced from PwC data engineering interviews. Each includes an expert-level answer.
Design a cost-aware resource strategy for a Databricks workload with spiky and batch jobs. Explain Dynamic Resource Allocation, when to disable it, and how min/max executors and spot instances affect cost and SLAs.
Explain how Adaptive Query Execution changes the economics of Spark tuning. What problems does it solve at runtime, and when might you still need manual intervention (e.g., salting, broadcast hints)?
Explain Delta Time Travel and the purpose of the VACUUM command.
Explain the architecture of Spark, including the roles of driver, executors, DAGs, and SparkContext.
How do Delta tables handle large-scale data updates efficiently?
How do caching strategies impact memory management in Databricks?
How do you configure retention periods for Delta tables?
How do you decide the number of partitions for repartitioning data in Spark?
How do you handle bad data in Databricks?
How do you identify skewed partitions in a dataset?
How do you resolve merge conflicts in Databricks notebooks?
How do you use Spark UI to debug stages, tasks, and performance issues?
How does the OPTIMIZE command improve query latency in Delta tables?
How does the driver program handle task scheduling?
How is Git version control implemented in Databricks?
How would you identify and resolve a shuffle spill in Spark UI?
What are the limitations of the REORG command with respect to large datasets?
What are the performance trade-offs of using salting to mitigate data skew?
What causes out-of-memory (OOM) issues in Databricks, and how do you resolve them?
What causes data skew in Spark, and how can it be resolved?
What configuration parameters are critical for enabling AQE effectively?
What happens if the VACUUM command is not run periodically?
What happens when an executor fails during task execution?
What insights can you gather from the DAG visualization in Spark UI?
What are the OPTIMIZE and REORG commands used for in Databricks?
What limitations do you face when using Delta tables in a multi-cloud environment?