Real interview questions asked at TCS. Practice the most frequently asked questions and land your next role.
TCS data engineering interviews test your ability across multiple domains. These questions are sourced from real TCS interview experiences and sorted by frequency, so practice the ones that matter most. The set leans toward fundamentals, with 21 easy, 9 medium, and 14 hard questions. Recurring themes are partitioning, Spark, and optimization; these patterns appear most often in real interviews and reward the deepest preparation. Many of these questions also surface at Meesho, so the preparation transfers across companies. The average answer takes around a minute to read, so plan roughly an hour to work through the full set thoughtfully.
This collection of 44 curated questions has a strong foundation of fundamentals-focused material, ideal for building confidence before tackling advanced topics.
The most frequently tested areas in this set are partitioning (19), Spark (15), optimization (11), joins (7), Python (5), and SQL (3). Focusing on these topics will give you the highest return on your preparation time.
Start with the easy questions to warm up and solidify fundamentals. Medium-difficulty questions form the bulk of real interviews — spend the most time here and practice explaining your reasoning out loud. Hard questions often appear in senior and staff-level rounds; attempt them after you're comfortable with the basics. For each question, try answering before revealing the solution. Use our AI Mock Interview to simulate real interview conditions and get instant feedback on your responses.
Design a fault-tolerant Spark Streaming checkpoint strategy: what to persist, recovery semantics, and cost/scalability trade-offs with checkpoint frequency.
How do you handle version conflicts for libraries?
How is Azure Key Vault used to manage encryption keys in Databricks?
What are the differences between %run and dbutils.notebook.run?
Can you describe the role of user groups in setting up these policies?
How do these transformations impact memory usage?
How do you ensure version control when migrating notebooks?
How do you handle passing parameters between notebooks?
How do you identify resource bottlenecks in cluster logs?
How does cluster size impact parallelism limits?
Write a query (WAQ) to produce the desired output: age group count
Write a query (WAQ) to produce the desired output: node-parent relationship
What are the implications of enabling encryption at rest on storage performance?
What are the security considerations for the control plane?
What role do workspace APIs play in this process?
What strategies do you use to retry failed steps in workflows?
Can you give an example of processing nested JSON data using these functions?
How do you install a Python library that is not in the Databricks runtime?
When would you use flatten, explode, or collect_list in Spark?
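To build intuition for this question, here is a minimal plain-Python sketch of what `explode` and `collect_list` do to rows (the sample data is hypothetical; in Spark itself you would use `pyspark.sql.functions.explode` and `collect_list`):

```python
# Plain-Python analogues of Spark's explode and collect_list.
rows = [
    {"user": "a", "tags": ["x", "y"]},
    {"user": "b", "tags": ["z"]},
]

# explode: emit one output row per element of the array column
exploded = [{"user": r["user"], "tag": t} for r in rows for t in r["tags"]]

# collect_list: the inverse -- group scalar values back into a list per key
collected = {}
for r in exploded:
    collected.setdefault(r["user"], []).append(r["tag"])

print(exploded)   # 3 rows: two for user "a", one for user "b"
print(collected)  # {"a": ["x", "y"], "b": ["z"]}
```

`flatten` differs from `explode` in that it merges an array of arrays into a single array within the same row, without changing the row count.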
Find Employees with Maximum Salary in Each Department
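In interviews this is usually answered in SQL with a window function such as `DENSE_RANK`, or with a group-by plus a self-join. A minimal plain-Python sketch of the same logic, using hypothetical sample data and keeping salary ties:

```python
# Max salary per department, keeping ties.
employees = [
    {"name": "Asha",  "dept": "Eng",   "salary": 120},
    {"name": "Ravi",  "dept": "Eng",   "salary": 120},
    {"name": "Meena", "dept": "Sales", "salary": 90},
    {"name": "Kiran", "dept": "Sales", "salary": 80},
]

# Step 1: maximum salary per department
max_by_dept = {}
for e in employees:
    max_by_dept[e["dept"]] = max(max_by_dept.get(e["dept"], 0), e["salary"])

# Step 2: keep every employee matching their department's maximum
top_earners = [e["name"] for e in employees if e["salary"] == max_by_dept[e["dept"]]]
print(top_earners)  # both Eng employees tie at 120, so both are kept
```

Mentioning tie handling explicitly (why `DENSE_RANK` over `ROW_NUMBER`, or why a filter against the group maximum) is what usually distinguishes a strong answer here.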
How do these policies affect query performance?
How do you monitor and debug skewed partitions?
How does dynamic partition pruning differ from static partition pruning?
Number of Rows in Different Joins
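This question typically probes how duplicate keys multiply matched rows. A plain-Python sketch of the row counts for inner, left, and full outer joins on two small key lists (hypothetical data; the counting rule is what matters):

```python
from collections import Counter

left = ["a", "a", "b", "c"]   # 4 rows
right = ["a", "b", "b", "d"]  # 4 rows

lc, rc = Counter(left), Counter(right)

# inner join: sum over matching keys of (left count * right count)
inner = sum(lc[k] * rc[k] for k in lc.keys() & rc.keys())

# left join: inner rows, plus one row per unmatched left-side row
left_join = inner + sum(lc[k] for k in lc.keys() - rc.keys())

# full outer join: left-join rows, plus one row per unmatched right-side row
full = left_join + sum(rc[k] for k in rc.keys() - lc.keys())

print(inner, left_join, full)  # 4 5 6
```

A cross join of the same inputs would produce 4 x 4 = 16 rows, since every row pairs with every row regardless of keys.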
What factors determine the optimal number of partitions for a large file?
What is dynamic partition pruning, and how does it optimize query execution?
Can you give a use case where Delta Live Tables would be ideal?
Explain Delta Live Tables and their features, such as declarative pipeline definition and automatic data validation.
Explain data encryption in Databricks, both at rest and in transit.
Explain the architecture of Databricks, including the control plane and data plane.
How do Delta Live Tables ensure data quality during transformations?
How do you implement row and column-level security in Databricks?
How do you move a Databricks notebook to higher environments?
How does Auto Loader avoid reloading files with the same name?
How does Databricks integrate with external storage systems?
How would you read a large file (e.g., 15GB) efficiently in Spark by increasing parallelism?
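A common way to frame an answer is back-of-the-envelope partition arithmetic. The sketch below assumes a rough target of about 128 MB per partition (Spark's default `spark.sql.files.maxPartitionBytes`) and a hypothetical 32-core cluster; neither number comes from the question itself:

```python
import math

file_size_gb = 15
target_partition_mb = 128   # Spark's default max partition size for file reads
total_cores = 32            # hypothetical cluster: 8 executors x 4 cores each

# partitions needed so each holds roughly target_partition_mb of data
partitions = math.ceil(file_size_gb * 1024 / target_partition_mb)

# round up to a multiple of total cores so work arrives in full waves
tuned = math.ceil(partitions / total_cores) * total_cores

print(partitions, tuned)  # 120 128
```

In the interview itself, follow the arithmetic with the levers that change it: `repartition()` after the read, lowering `maxPartitionBytes` for more parallelism, and noting that too many tiny partitions add scheduling overhead.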
What are the differences between %pip and %conda commands in Databricks?
What are the performance considerations when using Auto Loader?
What are the steps to debug a failed workflow in Databricks?
What determines the maximum parallelism achievable in Databricks?
What happens if the checkpoint location is accidentally deleted?
What is Databricks Auto Loader, and how does it handle new files?
What is the importance of the checkpoint location in Databricks?
What role does executor memory and CPU configuration play in maximizing parallelism?
Get full access to 1,800+ expert answers, AI mock interviews, and personalized progress tracking.