Real interview questions asked at PwC. Practice the most frequently asked questions and land your next role.
PwC data engineering interviews test your skills across multiple domains. These questions are sourced from real PwC interview experiences and sorted by frequency. Practice the ones that matter most.
Design a cost-aware resource strategy for a Databricks workload with spiky and batch jobs. Explain Dynamic Resource Allocation, when to disable it, and how min/max executors and spot instances affect cost and SLAs.
Explain how Adaptive Query Execution changes the economics of Spark tuning. What problems does it solve at runtime, and when might you still need manual intervention (e.g., salting, broadcast hints)?
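A strong answer usually names the concrete runtime settings involved. A sketch of the relevant Spark 3.x properties (the values shown are common defaults, not recommendations):

```properties
# Core AQE switches: re-optimize the plan at runtime using shuffle statistics
spark.sql.adaptive.enabled                      true
spark.sql.adaptive.coalescePartitions.enabled   true
spark.sql.adaptive.skewJoin.enabled             true
# Tables below this threshold are broadcast automatically; larger tables
# still need an explicit broadcast hint, which is where manual tuning remains
spark.sql.autoBroadcastJoinThreshold            10MB
```

AQE handles post-shuffle partition coalescing, skew-join splitting, and join-strategy switching on its own; salting and explicit hints remain relevant when the skew sits in the data itself rather than in the shuffle layout.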
What challenges do you face when managing multiple notebooks in Git?
What are the differences between Azure Key Vault-backed and Databricks-backed Secret Scopes?
What is a Secret Scope, and how is it used in Databricks?
How do you handle expired secrets in a production environment?
How does resource allocation adjust when a job experiences a sudden load increase?
What are the best practices for logging and monitoring bad data?
What are the implications of enabling schema auto-detection?
What are the potential downsides of enabling dynamic resource allocation?
What role does the executor heap size play in preventing OOM errors?
How do quarantine tables ensure data quality in downstream pipelines?
How does AQE optimize join operations dynamically?
How does improper partitioning affect Spark job performance?
What metrics would you analyze to determine if your partitioning strategy is effective?
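One metric an answer might propose is the ratio of the largest partition to the median partition size, computable from the per-task input sizes shown in the Spark UI. A minimal sketch in plain Python (the sample sizes are illustrative):

```python
from statistics import median

def skew_ratio(partition_sizes):
    # Ratio of the largest partition to the median partition size.
    # A value near 1 means balanced partitions; a large value flags
    # a straggler partition that will dominate stage runtime.
    return max(partition_sizes) / median(partition_sizes)

# Four balanced partitions and one straggler: ratio is roughly 19x
print(skew_ratio([100, 110, 95, 105, 2000]))
```

Other signals to pair with this include task-duration spread (max vs. median task time) and shuffle spill per task.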
Explain Delta Time Travel and the purpose of the vacuum command.
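The commands behind this question can be sketched in Databricks SQL; the table name `events` is a placeholder:

```sql
-- Time travel: read an older snapshot by version number or timestamp
SELECT * FROM events VERSION AS OF 12;
SELECT * FROM events TIMESTAMP AS OF '2024-06-01';

-- VACUUM permanently removes data files that are no longer referenced
-- by the transaction log and are older than the retention window
-- (default 7 days, i.e. 168 hours)
VACUUM events RETAIN 168 HOURS;
```

Note the tension the question probes: VACUUM reclaims storage, but files it deletes are no longer reachable by time travel.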
Explain the architecture of Spark, including the roles of driver, executors, DAGs, and SparkContext.
How do Delta Tables handle large-scale data updates efficiently?
How do caching strategies impact memory management in Databricks?
How do you configure retention periods for Delta tables?
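Retention for Delta tables is configured through table properties; a sketch with placeholder values:

```sql
-- deletedFileRetentionDuration bounds what VACUUM may remove;
-- logRetentionDuration bounds how far back time travel can reach
-- through the transaction log
ALTER TABLE events SET TBLPROPERTIES (
  'delta.deletedFileRetentionDuration' = 'interval 7 days',
  'delta.logRetentionDuration'         = 'interval 30 days'
);
```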
How do you decide the number of partitions for repartitioning data in Spark?
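A common heuristic is to target a fixed partition size (often around 128 MiB) and round up, then ensure at least one task per core. A minimal sketch of that arithmetic, assuming these target values:

```python
import math

def suggest_partitions(input_bytes, target_partition_bytes=128 * 1024 * 1024,
                       total_cores=None):
    # Size-based count: enough partitions that each stays near the target size.
    n = max(1, math.ceil(input_bytes / target_partition_bytes))
    if total_cores:
        # Never fewer partitions than cores, or some executors sit idle.
        n = max(n, total_cores)
    return n

# 10 GiB of input at a 128 MiB target -> 80 partitions
print(suggest_partitions(10 * 1024**3))
```

In practice the answer should also mention downstream factors: shuffle width, small-file pressure on writes, and whether AQE's partition coalescing will adjust the number anyway.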
How do you handle bad data in Databricks?
How do you identify skewed partitions in a dataset?
How do you resolve merge conflicts in Databricks notebooks?
How do you use Spark UI to debug stages, tasks, and performance issues?
How does the OPTIMIZE command improve query latency in Delta tables?
How does the driver program handle task scheduling?
How is Git version control implemented in Databricks?
How would you identify and resolve a shuffle spill in Spark UI?
What are the limitations of the REORG command with respect to large datasets?
What are the performance trade-offs of using salting to mitigate data skewness?
What causes Out of Memory (OOM) issues in Databricks, and how do you resolve them?
What causes data skewness in Spark, and how can it be resolved?
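The salting fix mentioned in several of these questions can be illustrated in plain Python: a single hot key is fanned out across N synthetic keys so no one reducer receives the whole partition. The key name and salt count below are illustrative:

```python
import random

def salted_key(key, num_salts=8, rng=random):
    # Append a random suffix so one hot key becomes num_salts distinct
    # join/aggregation keys, spreading its rows across tasks.
    return f"{key}_{rng.randrange(num_salts)}"

# 1000 rows for one hot key now spread over 8 keys instead of 1
rng = random.Random(42)
buckets = {salted_key("hot_customer", 8, rng) for _ in range(1000)}
print(sorted(buckets))
```

The trade-off (which the salting question above probes): the other side of a join must be replicated once per salt value so matches still occur, so salting trades skew for extra shuffle volume.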
What configuration parameters are critical for enabling AQE effectively?
What happens if the vacuum command is not run periodically?
What happens when an executor fails during task execution?
What insights can you gather from the DAG visualization in Spark UI?
How are the OPTIMIZE and REORG commands used in Databricks?
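Both commands can be sketched in Databricks SQL; `events` and `customer_id` are placeholder names:

```sql
-- OPTIMIZE compacts many small files into fewer large ones;
-- ZORDER co-locates rows by the columns most often used in filters
OPTIMIZE events ZORDER BY (customer_id);

-- REORG rewrites files to apply pending physical changes,
-- e.g. purging data for dropped columns
REORG TABLE events APPLY (PURGE);
```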
What limitations do you face when using Delta Tables in a multi-cloud environment?
Can Schema Evolution lead to data inconsistencies? If so, how do you manage them?
Differentiate between Schema Enforcement and Schema Evolution.
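The distinction can be illustrated with a toy model in plain Python (this mirrors the behavior of Delta's `mergeSchema` option, not its implementation):

```python
def apply_write(table_schema, batch_schema, merge_schema=False):
    # Toy model: enforcement rejects columns the table doesn't know;
    # evolution (merge_schema=True) merges the new columns in.
    extra = set(batch_schema) - set(table_schema)
    if extra and not merge_schema:
        raise ValueError(f"schema enforcement rejected new columns: {sorted(extra)}")
    merged = dict(table_schema)
    merged.update({c: batch_schema[c] for c in extra})
    return merged

table = {"id": "bigint", "amount": "double"}
batch = {"id": "bigint", "amount": "double", "channel": "string"}
print(apply_write(table, batch, merge_schema=True))
```

The follow-up question about inconsistencies falls out of this model: once evolution admits a column, old rows hold nulls for it, and mixed types across writers must be reconciled deliberately.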
Download the complete interview prep bundle with expert answers. Study offline, on your commute, anywhere.