Question 1

Design a cost-aware resource strategy for a Databricks workload with spiky and batch jobs. Explain Dynamic Resource Allocation, when to disable it, and how min/max executors and spot instances affect cost and SLAs.

Accepted Answer

**Section 1 — The Context (The 'Why')**
Databricks costs balloon when clusters sit idle, jobs are over-provisioned, or spot preemption forces repeated recomputation. The challenge is matching executor count, and therefore DBU spend, to the job's actual parallelism while still meeting the SLA....
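The spot-versus-SLA trade-off can be made concrete with a back-of-the-envelope estimate. The sketch below is only an illustration: `estimate_job_cost` and every number passed to it are hypothetical planning inputs, not Databricks pricing or an API.

```python
def estimate_job_cost(node_hours, on_demand_rate, spot_fraction=0.0,
                      spot_discount=0.0, retry_overhead=0.0):
    """Rough blended cost of a job; all inputs are hypothetical planning numbers.

    node_hours      node-hours the job needs when nothing is preempted
    on_demand_rate  price per node-hour for on-demand capacity
    spot_fraction   share of node-hours placed on spot instances (0..1)
    spot_discount   spot price reduction versus on-demand (0..1)
    retry_overhead  extra spot node-hours spent recomputing preempted work (0..1)
    """
    on_demand_hours = node_hours * (1 - spot_fraction)
    # Preemption makes spot capacity redo some work, which inflates its hours.
    spot_hours = node_hours * spot_fraction * (1 + retry_overhead)
    spot_rate = on_demand_rate * (1 - spot_discount)
    return on_demand_hours * on_demand_rate + spot_hours * spot_rate


baseline = estimate_job_cost(100, 1.0)
blended = estimate_job_cost(100, 1.0, spot_fraction=0.8,
                            spot_discount=0.6, retry_overhead=0.25)
print(baseline, blended)
```

With these made-up inputs the blended run is cheaper even after retries, but the retry overhead also lengthens wall-clock time, which is where the SLA risk comes from.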

Question 2

Explain how Adaptive Query Execution changes the economics of Spark tuning. What problems does it solve at runtime, and when might you still need manual intervention (e.g., salting, broadcast hints)?

Accepted Answer

AQE re-optimizes the query plan at shuffle boundaries, using runtime statistics to adjust partition counts, join strategies, and skew handling. Features: (1) Coalesce shuffle partitions: merge small partitions after the shuffle so fewer tasks run. (2) Switch a sort-merge join to a broadcast join when runtime stats show that one side is small. (3) Skew join: split oversized partitions. Why it matters economically: it reduces manual tuning, which means fewer hours spent on configuration and fewer job failures caused by inaccurate statistics....
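Features (1) and (3) can be illustrated in plain Python. This is a simplified sketch of the idea, not Spark's actual implementation. The function names are hypothetical, and the default values are assumed to mirror the documented defaults of `spark.sql.adaptive.advisoryPartitionSizeInBytes` (64 MB), `spark.sql.adaptive.skewJoin.skewedPartitionFactor` (5), and `spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes` (256 MB).

```python
from statistics import median


def coalesce_partitions(sizes_mb, advisory_mb=64):
    """Greedily merge adjacent shuffle partitions up to an advisory size."""
    merged, current = [], 0
    for size in sizes_mb:
        # Close the current group if adding this partition would exceed the target.
        if current and current + size > advisory_mb:
            merged.append(current)
            current = 0
        current += size
    if current:
        merged.append(current)
    return merged


def skewed_partitions(sizes_mb, factor=5, threshold_mb=256):
    """Indices of partitions larger than factor x median AND the absolute threshold."""
    mid = median(sizes_mb)
    return [i for i, s in enumerate(sizes_mb)
            if s > factor * mid and s > threshold_mb]


print(coalesce_partitions([10, 20, 30, 40, 5]))
print(skewed_partitions([50, 60, 55, 900]))
```

Manual intervention such as salting is still worth considering when skew shows up somewhere these runtime rules do not reach.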

Question 3

Explain Delta Time Travel and the purpose of the vacuum command.

Accepted Answer

**Why it matters**: At scale, design choices directly impact reliability, latency, and cost. Wrong decisions compound across jobs and teams.

Delta Time Travel lets you query historical table states by version number or timestamp. Each committed Delta transaction adds a new version to the transaction log. Query with: SELECT * FROM table_name VERSION AS OF 3, or SELECT * FROM table_name TIMESTAMP AS OF '2024-01-01'....
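A small helper makes the two query forms explicit. This is a minimal sketch: `time_travel_query` is a hypothetical name, and it assumes the table name and timestamp come from trusted input, since it builds the SQL by string formatting.

```python
def time_travel_query(table, version=None, timestamp=None):
    """Build a Delta time travel query for exactly one of version or timestamp."""
    if (version is None) == (timestamp is None):
        raise ValueError("pass exactly one of version or timestamp")
    if version is not None:
        return f"SELECT * FROM {table} VERSION AS OF {int(version)}"
    return f"SELECT * FROM {table} TIMESTAMP AS OF '{timestamp}'"


print(time_travel_query("table_name", version=3))
print(time_travel_query("table_name", timestamp="2024-01-01"))
```

The resulting string would typically be passed to `spark.sql(...)` in a notebook or job.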

Question 4

Explain the architecture of Spark, including the roles of driver, executors, DAGs, and SparkContext.

Accepted Answer

**Section 1 — The Context (The 'Why')**
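SparkContext is the application's entry point to the cluster and lives in the driver. The classic anti-pattern is collect(), which pulls every row to the driver and causes an out-of-memory error: a naive developer calls collect() on a billion-row DataFrame. For joins against small tables, use broadcast() so the small side is shipped to the executors.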

**Section 2 — The Diagram**
```
[SparkContext] --> [Driver]
   DAG Build | Schedule
      |
      v
[Executors]
   Tasks | RDD Cache
```

**Section 3 — Component Logic**
The **SparkContext** lives in the driver. The **Driver** builds the DAG and schedules tasks on the executors. collect() pulls all rows to the driver, so use take() or write the result to S3 instead....
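One way to guard against the collect() anti-pattern is a bounded collect. The sketch below assumes `df` is a PySpark DataFrame, of which it uses only `limit()` and `collect()`. `collect_small` and its `max_rows` default are hypothetical.

```python
def collect_small(df, max_rows=10_000):
    """Collect df to the driver only if it has at most max_rows rows."""
    # Fetch one row more than allowed so an oversized result is detectable
    # without pulling the whole dataset to the driver.
    rows = df.limit(max_rows + 1).collect()
    if len(rows) > max_rows:
        raise ValueError(
            f"more than {max_rows} rows; use take() or write to storage instead"
        )
    return rows
```

The driver never holds more than `max_rows + 1` rows, so an accidental call on a very large DataFrame fails fast instead of exhausting driver memory.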

Question 5

How do Delta Tables handle large-scale data updates efficiently?

Accepted Answer

**Why it matters**: At scale, design choices directly impact reliability, latency, and cost. Wrong decisions compound across jobs and teams.

Delta large-scale updates: use MERGE for upserts; it matches rows on a key, updates the matches, and inserts the rest. Partition pruning and file skipping via per-file min/max statistics limit how many files are scanned. OPTIMIZE plus ZORDER improves read performance. Example: MERGE INTO target USING source ON target.id = source.id WHEN MATCHED THEN UPDATE SET * WHEN NOT MATCHED THEN INSERT *....
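The upsert semantics of MERGE can be shown on plain Python dictionaries. This sketch illustrates only the matched/not-matched logic. Delta itself operates on data files rather than in-memory rows, and `merge_upsert` is a hypothetical name.

```python
def merge_upsert(target, source, key="id"):
    """Return the target rows after upserting the source rows on key."""
    merged = {row[key]: dict(row) for row in target}
    for row in source:
        # WHEN MATCHED: overwrite the existing columns with the source values.
        # WHEN NOT MATCHED: insert the source row as a new entry.
        merged[row[key]] = {**merged.get(row[key], {}), **row}
    return list(merged.values())


target = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]
source = [{"id": 2, "v": "B"}, {"id": 3, "v": "c"}]
print(merge_upsert(target, source))
```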

Question 6

How do caching strategies impact memory management in Databricks?

Accepted Answer

**Why it matters**: At scale, design choices directly impact reliability, latency, and cost. Wrong decisions compound across jobs and teams.

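Caching in Databricks: .cache() keeps data in executor storage (for DataFrames the default storage level is MEMORY_AND_DISK, so partitions that do not fit in memory spill to local disk). Impact: it avoids recomputation but consumes executor memory, which can lead to eviction of cached blocks or to OOM errors. Delta Cache stores copies of remote data files on the workers' local SSDs rather than in executor memory....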
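Because cached data keeps occupying executor storage until it is released, it can help to scope the cache explicitly. The sketch below assumes `df` is a PySpark DataFrame, of which it uses only `cache()` and `unpersist()`. The `cached` helper is a hypothetical convenience, not part of Spark.

```python
from contextlib import contextmanager


@contextmanager
def cached(df):
    """Cache df for the duration of a with-block, then release it."""
    df.cache()
    try:
        yield df
    finally:
        # Release executor storage even if the block raises.
        df.unpersist()
```

Usage would look like `with cached(lookup_df) as d: ...`, where several actions inside the block reuse `d` and the memory is freed on exit.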

Question 7

How do you configure retention periods for Delta tables?

Accepted Answer

**Why it matters**: At scale, design choices directly impact reliability, latency, and cost. Wrong decisions compound across jobs and teams.

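Delta retention: set spark.conf.set('spark.databricks.delta.retentionDurationCheck.enabled', 'false') to allow VACUUM with a retention under 7 days (risky: it can delete files that time travel or concurrent readers still need). Table-level retention is set with ALTER TABLE table_name SET TBLPROPERTIES (delta.logRetentionDuration='7 days', delta.deletedFileRetentionDuration='7 days')....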
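The table-property statement can be generated per table. This is a sketch under stated assumptions: `retention_ddl` is a hypothetical helper, the `interval N days` value format follows the form shown in the Delta documentation, and the defaults of 30 and 7 days are assumed to match Delta's own defaults for the two properties.

```python
def retention_ddl(table, log_days=30, deleted_file_days=7):
    """Build an ALTER TABLE statement that sets both Delta retention properties."""
    return (
        f"ALTER TABLE {table} SET TBLPROPERTIES ("
        f"'delta.logRetentionDuration' = 'interval {int(log_days)} days', "
        f"'delta.deletedFileRetentionDuration' = 'interval {int(deleted_file_days)} days')"
    )


print(retention_ddl("table_name", log_days=7, deleted_file_days=7))
```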

Question 8

How do you decide the number of partitions for repartitioning data in Spark?

Accepted Answer

**Why it matters**: At scale, design choices directly impact reliability, latency, and cost. Wrong decisions compound across jobs and teams.

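Partition count: target 100–200 MB per partition, so total_partitions ≈ data_size_MB / 150; alternatively, use 2–4 times the total core count. Rules of thumb: use more partitions for shuffle-heavy jobs and fewer for small data. The number of shuffle output partitions is controlled by spark.sql.shuffle.partitions (default 200)....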
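Both rules of thumb are simple arithmetic. The sketch below uses a hypothetical function name, and the 30,000 MB and 64-core inputs are made up for illustration.

```python
import math


def suggest_partitions(data_size_mb, total_cores, target_mb=150):
    """Return (size-based count, (low, high) core-based range)."""
    # Size rule: roughly target_mb of data per partition, at least one partition.
    by_size = max(1, math.ceil(data_size_mb / target_mb))
    # Core rule: two to four partitions per available core.
    by_cores = (2 * total_cores, 4 * total_cores)
    return by_size, by_cores


print(suggest_partitions(30_000, 64))
```

When the size-based count falls inside the core-based range, as it does for these inputs, either rule gives a reasonable starting point. When they disagree sharply, the data is either too small or too large for the cluster as sized.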

Question 9

How do you handle bad data in Databricks?

Accepted Answer

**Situation**: Faced competing demands—multiple pipelines, stakeholders, deadlines. **Task**: Deliver impact while maintaining quality and preventing burnout. **Action**: (1) Prioritized by business impact and SLA risk. (2) Used ROI (value/time); WIP limits; timeboxing. (3) Communicated trade-offs—'Adding X pushes Y by N days.' (4) Maintained backlog with tech-debt capacity. **Result**: Shipped on time; zero incidents; stakeholder alignment on deferrals....

Question 10

How do you identify skewed partitions in a dataset?

Accepted Answer

**Why it matters**: At scale, design choices directly impact reliability, latency, and cost. Wrong decisions compound across jobs and teams.

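Identify skewed partitions: (1) Spark UI: look for high variance in task duration within a stage, where a few tasks run far longer than the median; (2) count rows per key with df.groupBy(partition_col).count().orderBy(desc('count')); (3) check for imbalance in per-task shuffle read size....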
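Once the per-key counts from step (2) have been collected, which is safe because there is one row per key, a skew ratio against the median makes the imbalance easy to read. In this sketch `skew_report` is a hypothetical helper and the counts are made-up sample data.

```python
from statistics import median


def skew_report(counts, top_n=3):
    """counts maps key -> row count; return the top keys with count / median."""
    med = median(counts.values())
    top = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
    # A ratio far above 1 marks a key that dominates its partition.
    return [(k, c, c / med if med else float("inf")) for k, c in top]


sample = {"a": 100, "b": 120, "c": 110, "d": 5000}
for key, count, ratio in skew_report(sample, top_n=2):
    print(key, count, round(ratio, 1))
```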

PWC Spark & Big Data Interview Questions

Difficulty Breakdown

Key Topics Covered

How to Use This Guide

Companies asking these questions

All 26 Questions

More Interview Prep Guides

Practice with AI — Not Just Reading
