Question 1

Design a cost-aware resource strategy for a Databricks workload with spiky and batch jobs. Explain Dynamic Resource Allocation, when to disable it, and how min/max executors and spot instances affect cost and SLAs.

Accepted Answer

**Section 1 — The Context (The 'Why')**
Databricks costs climb sharply when clusters sit idle, jobs are over-provisioned, or spot preemption forces repeated task retries and recomputation. The challenge is matching executor allocation (and therefore DBU spend) to the parallelism a job can actually use while still meeting the SLA....
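
As a rough illustration of how min/max workers and spot settings trade cost against SLA, here is a minimal sketch that builds two cluster specs shaped like a Databricks Clusters API request. It assumes an AWS workspace (`aws_attributes`); the helper name `build_cluster_spec`, the runtime version, the node type, and the worker counts are illustrative choices, not values from the answer above.

```python
def build_cluster_spec(min_workers, max_workers, use_spot=True):
    """Return a dict shaped like a Databricks Clusters API cluster spec.

    Hypothetical helper: the runtime version and node type are placeholders.
    """
    if not 0 <= min_workers <= max_workers:
        raise ValueError("need 0 <= min_workers <= max_workers")

    spec = {
        "spark_version": "13.3.x-scala2.12",  # illustrative runtime
        "node_type_id": "i3.xlarge",          # illustrative node type
        "aws_attributes": {
            # Keep the first node (the driver) on on-demand capacity so a
            # spot reclaim cannot take down the whole cluster.
            "first_on_demand": 1,
            "availability": "SPOT_WITH_FALLBACK" if use_spot else "ON_DEMAND",
        },
    }
    if min_workers == max_workers:
        # Fixed size: predictable runtime, no scale-up lag, higher idle cost.
        spec["num_workers"] = max_workers
    else:
        # Autoscaling: a low floor for quiet periods, a ceiling that caps spend.
        spec["autoscale"] = {"min_workers": min_workers, "max_workers": max_workers}
    return spec


# Spiky, retry-tolerant workload: wide autoscale range on spot capacity.
spiky_spec = build_cluster_spec(min_workers=2, max_workers=20, use_spot=True)

# SLA-bound batch job: fixed size on on-demand capacity.
sla_batch_spec = build_cluster_spec(min_workers=8, max_workers=8, use_spot=False)
```

The floor sets the idle cost you always pay, the ceiling bounds the worst-case bill, and spot capacity lowers the price per node at the risk of preemption.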

Question 2

Explain how Adaptive Query Execution changes the economics of Spark tuning. What problems does it solve at runtime, and when might you still need manual intervention (e.g., salting, broadcast hints)?

Accepted Answer

AQE re-optimizes the query plan at shuffle boundaries, using runtime statistics from completed stages to adjust partition counts, join strategies, and skew handling. Features: (1) Coalesce shuffle partitions: merge small post-shuffle partitions so fewer tasks run. (2) Convert a sort-merge join to a broadcast join when runtime statistics show one side is small. (3) Skew join: split oversized partitions into smaller ones. Why it matters economically: it reduces manual tuning, which means fewer hours spent on configuration and fewer job failures caused by inaccurate statistics....
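
To make features (1) and (3) concrete, the following is a simplified pure-Python model of the two decisions, not Spark's actual implementation. The function names are hypothetical; the default factor of 5 and threshold of 256 (MB) are meant to mirror `spark.sql.adaptive.skewJoin.skewedPartitionFactor` and `spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes`, and the target size plays the role of `spark.sql.adaptive.advisoryPartitionSizeInBytes`.

```python
from statistics import median


def coalesce_partitions(sizes_mb, target_mb):
    """Greedily merge adjacent shuffle partitions up to a target size.

    Returns groups of original partition indices; each group would be
    read by a single task after coalescing.
    """
    groups, current, current_size = [], [], 0
    for index, size in enumerate(sizes_mb):
        # Close the current group if adding this partition would overshoot.
        if current and current_size + size > target_mb:
            groups.append(current)
            current, current_size = [], 0
        current.append(index)
        current_size += size
    if current:
        groups.append(current)
    return groups


def skewed_partitions(sizes_mb, factor=5, threshold_mb=256):
    """Flag partitions that are both far above the median and large in absolute terms."""
    typical = median(sizes_mb)
    return [
        index
        for index, size in enumerate(sizes_mb)
        if size > factor * typical and size > threshold_mb
    ]


coalesced = coalesce_partitions([10, 20, 30, 64, 5, 5], target_mb=64)
skewed = skewed_partitions([100, 120, 110, 2000])
```

Here six post-shuffle partitions collapse into three tasks, and only the 2000 MB partition is flagged for splitting.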

Question 3

What challenges do you face when managing multiple notebooks in Git?

Accepted Answer

Situation: Notebooks are stored in Git. Task: Explain the challenges and mitigations. Action: Challenges: merge conflicts in the notebook JSON, large embedded outputs, hidden dependence on cell execution order, and poor reproducibility. Mitigations: nbstripout to strip outputs before commit, small focused notebooks, Papermill for parameterized runs, and team conventions. Some teams rely on Databricks' built-in Git integration....
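
The output-stripping mitigation can be sketched in a few lines. This is a simplified stand-in for what a tool like nbstripout does, not its implementation; it assumes the standard Jupyter `.ipynb` layout (a top-level `cells` list whose code cells carry `outputs` and `execution_count`), and `strip_outputs` is a hypothetical helper name.

```python
import json


def strip_outputs(notebook_json: str) -> str:
    """Remove outputs and execution counts from a Jupyter notebook's JSON text."""
    notebook = json.loads(notebook_json)
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []            # drop rendered results and images
            cell["execution_count"] = None  # drop run-order noise from diffs
    # Stable key order and indentation keep Git diffs small and predictable.
    return json.dumps(notebook, indent=1, sort_keys=True) + "\n"


sample_notebook = json.dumps({
    "cells": [
        {"cell_type": "markdown", "source": ["# Title"], "metadata": {}},
        {
            "cell_type": "code",
            "source": ["print(1)"],
            "metadata": {},
            "execution_count": 7,
            "outputs": [{"output_type": "stream", "name": "stdout", "text": ["1\n"]}],
        },
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 5,
})

cleaned = json.loads(strip_outputs(sample_notebook))
```

With outputs gone, diffs and merge conflicts involve only source cells, which is most of what makes notebooks reviewable in Git.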

Question 4

What are the differences between Azure Key Vault-backed and Databricks-backed Secret Scopes?

Accepted Answer

**Databricks-backed**: Secrets are stored in the Databricks control plane. Simpler setup; no dependency on other Azure services. Suited to dev and less sensitive secrets. **Key Vault-backed**: Secrets are stored in Azure Key Vault. Enterprise-grade: rotation, audit, RBAC. Suited to production. **Trade-offs**: Key Vault provides central key management, compliance features (HSM, audit logs), and rotation workflows. Databricks-backed is simpler but doesn't integrate with enterprise PKI....

Question 5

What is Secret Scope, and how is it used in Databricks?

Accepted Answer

**Secret Scope**: A namespace for secrets in Databricks. Access via `dbutils.secrets.get(scope="scope", key="key")`. Values are redacted in notebook output. **Use**: DB passwords, API keys, tokens. Used in notebooks, jobs, init scripts. **Types**: Databricks-backed (stored in Databricks) and Azure Key Vault-backed (available on Azure Databricks). **Best practice**: Least privilege, with ACLs per scope. Rotate regularly. Use Key Vault-backed scopes in production....
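
A minimal sketch of reading credentials through a scope, assuming it runs where `dbutils` exists (a Databricks notebook or job). The scope name `prod-db`, the key names, and the helper `jdbc_properties` are hypothetical. The secrets client is passed in as an argument so the function can be exercised outside Databricks with a stand-in object.

```python
def jdbc_properties(secrets, scope="prod-db",
                    user_key="jdbc-user", password_key="jdbc-password"):
    """Build JDBC connection properties from a secret scope.

    `secrets` is anything exposing get(scope=..., key=...); in a notebook
    that would be `dbutils.secrets`. Scope and key names are placeholders.
    """
    return {
        "user": secrets.get(scope=scope, key=user_key),
        "password": secrets.get(scope=scope, key=password_key),
    }


class FakeSecrets:
    """Local stand-in for dbutils.secrets, for testing only."""

    def __init__(self, values):
        self._values = values

    def get(self, scope, key):
        return self._values[(scope, key)]


fake = FakeSecrets({
    ("prod-db", "jdbc-user"): "etl_service",
    ("prod-db", "jdbc-password"): "not-a-real-password",
})
properties = jdbc_properties(fake)
```

In a notebook the call would be `jdbc_properties(dbutils.secrets)`, so no credential ever appears in source code.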

Question 6

How do you handle expired secrets in a production environment?

Accepted Answer

**Prevention**: Rotate before expiry, ideally automatically (Secrets Manager, Vault), and alert 30 days ahead. **Detection**: Monitor expiry dates. **Update**: Propagate the new value to consumers with a gradual rollout. **Incident**: If an expired secret causes an outage, perform an emergency rotation, deploy, and verify. Never hardcode secrets; use a secret manager; test rotation in staging....
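
The detection step can be sketched as a small check over an inventory of expiry dates. This assumes the dates have already been fetched from whatever secret manager is in use; the function name, the inventory shape, and the secret names are hypothetical, and the 30-day window comes from the answer above.

```python
from datetime import date, timedelta


def expiring_secrets(expiry_by_name, today, warn_days=30):
    """Classify secrets as already expired or expiring within the warning window."""
    cutoff = today + timedelta(days=warn_days)
    report = {"expired": [], "expiring_soon": []}
    for name, expires_on in sorted(expiry_by_name.items()):
        if expires_on <= today:
            report["expired"].append(name)        # page someone now
        elif expires_on <= cutoff:
            report["expiring_soon"].append(name)  # schedule a rotation
    return report


inventory = {
    "warehouse-password": date(2023, 12, 31),
    "api-token": date(2024, 1, 15),
    "storage-key": date(2024, 6, 1),
}
expiry_report = expiring_secrets(inventory, today=date(2024, 1, 1))
```

Running a check like this on a schedule and alerting on a non-empty `expiring_soon` list turns expiry from an outage into a routine ticket.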

Question 7

How does resource allocation adjust when a job experiences a sudden load increase?

Accepted Answer

**Mechanisms**: (1) Auto-scaling: Kubernetes HPA, Spark dynamic allocation; (2) Queueing: the backlog is absorbed in Kafka/SQS; (3) Bursting: scale out on cloud capacity, with some provisioning lag possible; (4) Load shedding: drop low-priority work. Example: Spark adds executors when the task backlog grows and removes them when they sit idle....
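
The Spark example can be approximated with a short calculation: the executor target follows the task backlog and is clamped between a floor and a ceiling. This is a simplified model rather than Spark's scheduler code, which also ramps requests up over several rounds and applies idle timeouts. The function name is hypothetical; the parameters are meant to correspond to `spark.dynamicAllocation.minExecutors`, `spark.dynamicAllocation.maxExecutors`, `spark.executor.cores`, and `spark.task.cpus`.

```python
import math


def target_executors(pending_tasks, running_tasks, cores_per_executor,
                     min_executors, max_executors, cpus_per_task=1):
    """Estimate how many executors the current task backlog calls for."""
    # Each executor runs this many tasks concurrently.
    tasks_per_executor = max(1, cores_per_executor // cpus_per_task)
    needed = math.ceil((pending_tasks + running_tasks) / tasks_per_executor)
    # Never drop below the floor or exceed the ceiling.
    return max(min_executors, min(max_executors, needed))


idle_target = target_executors(0, 0, cores_per_executor=4,
                               min_executors=2, max_executors=20)
normal_target = target_executors(37, 3, cores_per_executor=4,
                                 min_executors=2, max_executors=20)
spike_target = target_executors(1000, 0, cores_per_executor=4,
                                min_executors=2, max_executors=20)
```

During a spike the ceiling, not the backlog, decides the cluster size, so the maximum is the setting that bounds both cost and how quickly the backlog drains.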

Question 8

What are the best practices for logging and monitoring bad data?

Accepted Answer

(1) Structured logging: JSON with record_id, rule_id, severity, and a sample of the record. (2) Centralized logs: ELK, Splunk, CloudWatch. (3) Metrics: failed count per rule; alert on spikes. (4) DLQ: store bad records for reprocessing. (5) Dashboards: DQ score, trends. (6) Sampling: log a sample of failures, not every one. WHY: Find the root cause without a full rerun. SCALABILITY: Rely on sampling and aggregation to avoid log explosion....
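
Points (1), (3), (4), and (6) can be combined in one small component. The sketch below uses only the standard library; the class name, logger name, and in-memory dead-letter list are hypothetical stand-ins for a real metrics backend and DLQ.

```python
import json
import logging
import random
from collections import Counter

logger = logging.getLogger("data_quality")


class BadRecordLogger:
    """Count every failure, keep every bad record, but log only a sample."""

    def __init__(self, sample_rate=0.01, seed=None):
        self.sample_rate = sample_rate
        self.failures_by_rule = Counter()  # metric: failed count per rule
        self.dead_letter_queue = []        # stand-in for a real DLQ
        self._rng = random.Random(seed)

    def report(self, record_id, rule_id, severity, record):
        """Register a failed record; return True if it was also logged."""
        self.failures_by_rule[rule_id] += 1
        self.dead_letter_queue.append(record)
        if self._rng.random() >= self.sample_rate:
            return False  # counted and queued, but not logged
        logger.warning(json.dumps({
            "record_id": record_id,
            "rule_id": rule_id,
            "severity": severity,
            "sample": record,
        }, default=str))
        return True


dq = BadRecordLogger(sample_rate=0.0)
for i in range(5):
    dq.report(i, "not_null_email", "error", {"id": i, "email": None})
```

Counts stay exact because every failure increments the counter, while log volume scales with `sample_rate` rather than with the number of bad records.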

PWC Data Engineer Interview Questions

