Real interview questions asked at PwC. Practice the most frequently asked questions and land your next role.
PwC data engineering interviews test your skills across multiple domains. These questions are sourced from real PwC interview experiences and sorted by frequency. Practice the ones that matter most.
Design a cost-aware resource strategy for a Databricks workload with spiky and batch jobs. Explain Dynamic Resource Allocation, when to disable it, and how min/max executors and spot instances affect cost and SLAs.
Explain how Adaptive Query Execution changes the economics of Spark tuning. What problems does it solve at runtime, and when might you still need manual intervention (e.g., salting, broadcast hints)?
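A strong answer usually names the concrete runtime settings involved. A sketch of the relevant Spark 3.x properties (the values shown are common defaults, not recommendations):

```properties
# Core AQE switches: re-optimize the plan at runtime using shuffle statistics
spark.sql.adaptive.enabled                      true
spark.sql.adaptive.coalescePartitions.enabled   true
spark.sql.adaptive.skewJoin.enabled             true
# Tables below this threshold are broadcast automatically; larger tables
# still need an explicit broadcast hint, which is where manual tuning remains
spark.sql.autoBroadcastJoinThreshold            10MB
```

AQE handles post-shuffle partition coalescing, skew-join splitting, and join-strategy switching on its own; salting and explicit hints remain relevant when the skew sits in the data itself rather than in the shuffle layout.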
What challenges do you face when managing multiple notebooks in Git?
What are the differences between Azure Key Vault-backed and Databricks-backed Secret Scopes?
What is a Secret Scope, and how is it used in Databricks?
How do you handle expired secrets in a production environment?
How does resource allocation adjust when a job experiences a sudden load increase?
What are the best practices for logging and monitoring bad data?
What are the implications of enabling schema auto-detection?
What are the potential downsides of enabling dynamic resource allocation?
What role does the executor heap size play in preventing OOM errors?
How do quarantine tables ensure data quality in downstream pipelines?
How does AQE optimize join operations dynamically?
How does improper partitioning affect Spark job performance?
What metrics would you analyze to determine if your partitioning strategy is effective?
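One metric an answer might propose is the ratio of the largest partition to the median partition size, computable from the per-task input sizes shown in the Spark UI. A minimal sketch in plain Python (the sample sizes are illustrative):

```python
from statistics import median

def skew_ratio(partition_sizes):
    # Ratio of the largest partition to the median partition size.
    # A value near 1 means balanced partitions; a large value flags
    # a straggler partition that will dominate stage runtime.
    return max(partition_sizes) / median(partition_sizes)

# Four balanced partitions and one straggler: ratio is roughly 19x
print(skew_ratio([100, 110, 95, 105, 2000]))
```

Other signals to pair with this include task-duration spread (max vs. median task time) and shuffle spill per task.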
Explain Delta Time Travel and the purpose of the vacuum command.
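The commands behind this question can be sketched in Databricks SQL; the table name `events` is a placeholder:

```sql
-- Time travel: read an older snapshot by version number or timestamp
SELECT * FROM events VERSION AS OF 12;
SELECT * FROM events TIMESTAMP AS OF '2024-06-01';

-- VACUUM permanently removes data files that are no longer referenced
-- by the transaction log and are older than the retention window
-- (default 7 days, i.e. 168 hours)
VACUUM events RETAIN 168 HOURS;
```

Note the tension the question probes: VACUUM reclaims storage, but files it deletes are no longer reachable by time travel.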
Explain the architecture of Spark, including the roles of driver, executors, DAGs, and SparkContext.
How do Delta Tables handle large-scale data updates efficiently?
How do caching strategies impact memory management in Databricks?
How do you configure retention periods for Delta tables?
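Retention for Delta tables is configured through table properties; a sketch with placeholder values:

```sql
-- deletedFileRetentionDuration bounds what VACUUM may remove;
-- logRetentionDuration bounds how far back time travel can reach
-- through the transaction log
ALTER TABLE events SET TBLPROPERTIES (
  'delta.deletedFileRetentionDuration' = 'interval 7 days',
  'delta.logRetentionDuration'         = 'interval 30 days'
);
```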
How do you decide the number of partitions for repartitioning data in Spark?
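A common heuristic is to target a fixed partition size (often around 128 MiB) and round up, then ensure at least one task per core. A minimal sketch of that arithmetic, assuming these target values:

```python
import math

def suggest_partitions(input_bytes, target_partition_bytes=128 * 1024 * 1024,
                       total_cores=None):
    # Size-based count: enough partitions that each stays near the target size.
    n = max(1, math.ceil(input_bytes / target_partition_bytes))
    if total_cores:
        # Never fewer partitions than cores, or some executors sit idle.
        n = max(n, total_cores)
    return n

# 10 GiB of input at a 128 MiB target -> 80 partitions
print(suggest_partitions(10 * 1024**3))
```

In practice the answer should also mention downstream factors: shuffle width, small-file pressure on writes, and whether AQE's partition coalescing will adjust the number anyway.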
How do you handle bad data in Databricks?
How do you identify skewed partitions in a dataset?
How do you resolve merge conflicts in Databricks notebooks?
How do you use Spark UI to debug stages, tasks, and performance issues?
How does the OPTIMIZE command improve query latency in Delta tables?
How does the driver program handle task scheduling?
How is Git version control implemented in Databricks?
How would you identify and resolve a shuffle spill in Spark UI?
What are the limitations of the REORG command with respect to large datasets?
What are the performance trade-offs of using salting to mitigate data skewness?
What causes Out of Memory (OOM) issues in Databricks, and how do you resolve them?
What causes data skewness in Spark, and how can it be resolved?
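The salting fix mentioned in several of these questions can be illustrated in plain Python: a single hot key is fanned out across N synthetic keys so no one reducer receives the whole partition. The key name and salt count below are illustrative:

```python
import random

def salted_key(key, num_salts=8, rng=random):
    # Append a random suffix so one hot key becomes num_salts distinct
    # join/aggregation keys, spreading its rows across tasks.
    return f"{key}_{rng.randrange(num_salts)}"

# 1000 rows for one hot key now spread over 8 keys instead of 1
rng = random.Random(42)
buckets = {salted_key("hot_customer", 8, rng) for _ in range(1000)}
print(sorted(buckets))
```

The trade-off (which the salting question above probes): the other side of a join must be replicated once per salt value so matches still occur, so salting trades skew for extra shuffle volume.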
What configuration parameters are critical for enabling AQE effectively?
What happens if the vacuum command is not run periodically?
What happens when an executor fails during task execution?
What insights can you gather from the DAG visualization in Spark UI?
How are the OPTIMIZE and REORG commands used in Databricks?
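Both commands can be sketched in Databricks SQL; `events` and `customer_id` are placeholder names:

```sql
-- OPTIMIZE compacts many small files into fewer large ones;
-- ZORDER co-locates rows by the columns most often used in filters
OPTIMIZE events ZORDER BY (customer_id);

-- REORG rewrites files to apply pending physical changes,
-- e.g. purging data for dropped columns
REORG TABLE events APPLY (PURGE);
```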
What limitations do you face when using Delta Tables in a multi-cloud environment?
Can Schema Evolution lead to data inconsistencies? If so, how do you manage them?
Differentiate between Schema Enforcement and Schema Evolution.
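The distinction can be illustrated with a toy model in plain Python (this mirrors the behavior of Delta's `mergeSchema` option, not its implementation):

```python
def apply_write(table_schema, batch_schema, merge_schema=False):
    # Toy model: enforcement rejects columns the table doesn't know;
    # evolution (merge_schema=True) merges the new columns in.
    extra = set(batch_schema) - set(table_schema)
    if extra and not merge_schema:
        raise ValueError(f"schema enforcement rejected new columns: {sorted(extra)}")
    merged = dict(table_schema)
    merged.update({c: batch_schema[c] for c in extra})
    return merged

table = {"id": "bigint", "amount": "double"}
batch = {"id": "bigint", "amount": "double", "channel": "string"}
print(apply_write(table, batch, merge_schema=True))
```

The follow-up question about inconsistencies falls out of this model: once evolution admits a column, old rows hold nulls for it, and mixed types across writers must be reconciled deliberately.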
Download the complete interview prep bundle with expert answers. Study offline, on your commute, anywhere.