Question 1

Explain the differences between Repartition and Coalesce. When would you use each?

Accepted Answer

**Repartition(n)**: Performs a full shuffle and creates exactly n partitions; it can increase or decrease the partition count. **Coalesce(n)**: Merges existing partitions without a full shuffle; it can only decrease the count. **Why it matters**: A shuffle is expensive because it incurs network and disk I/O. Coalesce avoids it by combining whole partitions instead of redistributing individual records, which can leave the resulting partitions uneven in size. **When Repartition**: Increasing partitions, fixing skew (for example, repartitioning by a well-distributed key), or before a join to align partition counts....
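The contrast can be illustrated with a toy model in plain Python. This is only a sketch of the idea, not Spark's implementation: the `coalesce` and `repartition` functions below are hypothetical stand-ins, partitions are modeled as lists, and round-robin redistribution stands in for the full shuffle.

```python
from typing import List

Partition = List[int]

def coalesce(partitions: List[Partition], n: int) -> List[Partition]:
    """Toy coalesce: merge whole partitions into n groups; records are never split up."""
    if n >= len(partitions):
        # Coalesce cannot increase the partition count.
        return [list(p) for p in partitions]
    merged: List[Partition] = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        # Neighbouring partitions land in the same output group.
        merged[i * n // len(partitions)].extend(part)
    return merged

def repartition(partitions: List[Partition], n: int) -> List[Partition]:
    """Toy repartition: every record is redistributed round-robin (models a full shuffle)."""
    out: List[Partition] = [[] for _ in range(n)]
    idx = 0
    for part in partitions:
        for record in part:
            out[idx % n].append(record)
            idx += 1
    return out

skewed = [[1, 2, 3, 4, 5, 6], [7], [8], [9]]
coalesced = coalesce(skewed, 2)        # sizes stay uneven: 7 and 2 records
repartitioned = repartition(skewed, 2) # sizes are balanced: 5 and 4 records
```

In this model, coalescing keeps the skew of the input because it only glues partitions together, while repartitioning evens out the sizes at the price of touching every record.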

Question 2

Explain the differences between a Data Lake and a Data Warehouse.

Accepted Answer

**Data Lake**: Low-cost object storage (S3, ADLS) for raw structured, semi-structured, and unstructured data. Schema-on-read; used for exploratory analytics, ML, archival. **Data Warehouse**: Structured, curated storage optimized for SQL; schema-on-write; used for BI and reporting. **Why both exist**: Lakes offer flexibility and low cost at scale; warehouses offer query performance and concurrency....

Question 3

Explain the types of triggers in ADF, including schedule, tumbling window, and event-based triggers.

Accepted Answer

Schedule: Fires on a fixed wall-clock recurrence (for example, every N minutes or daily at a set time). Predictable batch windows; simple ops. Tumbling window: Fixed-size, contiguous, non-overlapping intervals; fires once per window. A good fit for idempotent, window-scoped processing, because no overlap means no data is processed twice. Event-based: Fires on events such as a blob being created or deleted in storage, or a custom Event Grid event. Enables near real-time pipelines....
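The non-overlap property of tumbling windows can be sketched in a few lines of Python. This is an illustration of the windowing arithmetic only, not ADF's trigger implementation; `tumbling_windows` and `window_for` are hypothetical helper names.

```python
from datetime import datetime, timedelta
from typing import Iterator, Tuple

Window = Tuple[datetime, datetime]

def tumbling_windows(start: datetime, end: datetime, interval: timedelta) -> Iterator[Window]:
    """Yield contiguous, non-overlapping [window_start, window_end) intervals."""
    current = start
    while current + interval <= end:
        yield current, current + interval
        current += interval

def window_for(ts: datetime, start: datetime, interval: timedelta) -> Window:
    """Return the single window that contains ts (assumes ts >= start)."""
    n = (ts - start) // interval  # timedelta // timedelta gives an int
    window_start = start + n * interval
    return window_start, window_start + interval

origin = datetime(2024, 1, 1)
hour = timedelta(hours=1)
windows = list(tumbling_windows(origin, datetime(2024, 1, 1, 4), hour))
boundary = window_for(datetime(2024, 1, 1, 1), origin, hour)
```

Because each window is half-open, a timestamp that falls exactly on a boundary belongs to the later window only, which is what lets a pipeline scoped to one window be rerun without double-processing its neighbours.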

Question 4

Write a SQL query to find top 3 earners in each department.

Accepted Answer

**Architectural Logic:** Use ROW_NUMBER or DENSE_RANK with PARTITION BY department. ROW_NUMBER = at most 3 rows per department (ties are broken arbitrarily unless the ORDER BY adds a tie-breaker); DENSE_RANK = ties share a rank (may return >3). **Query:** WITH ranked AS (SELECT employee_id, name, department, salary, ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) AS rn FROM employees) SELECT employee_id, name, department, salary, rn FROM ranked WHERE rn <= 3....
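The difference between the two ranking functions is easiest to see on data with ties. The sketch below runs the query against an in-memory SQLite database from Python; the sample rows and the `top_n` helper are made up for illustration, and it assumes a SQLite build with window-function support (3.25 or later).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (employee_id INTEGER, name TEXT, department TEXT, salary INTEGER)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?, ?)",
    [
        (1, "Ana", "Eng", 150),
        (2, "Ben", "Eng", 140),  # three-way tie at 140
        (3, "Cy", "Eng", 140),
        (4, "Di", "Eng", 140),
        (5, "Eve", "Eng", 120),
        (6, "Flo", "Ops", 90),   # department with fewer than 3 employees
        (7, "Gus", "Ops", 80),
    ],
)

def top_n(rank_fn: str, n: int = 3):
    """Run the top-N-per-department query with the given ranking function."""
    sql = f"""
        WITH ranked AS (
            SELECT employee_id, name, department, salary,
                   {rank_fn}() OVER (PARTITION BY department ORDER BY salary DESC) AS rn
            FROM employees
        )
        SELECT department, name, salary, rn
        FROM ranked
        WHERE rn <= ?
        ORDER BY department, rn, name
    """
    return conn.execute(sql, (n,)).fetchall()

row_number_rows = top_n("ROW_NUMBER")  # Eng: 3 rows, Ops: 2 rows
dense_rank_rows = top_n("DENSE_RANK")  # Eng: 5 rows (tie shares rank 2), Ops: 2 rows
```

With ROW_NUMBER, which two of the three tied employees make the cut is not defined by this query; adding a tie-breaker such as `employee_id` to the ORDER BY makes the result deterministic.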

Question 5

Explain how Adaptive Query Execution changes the economics of Spark tuning. What problems does it solve at runtime, and when might you still need manual intervention (e.g., salting, broadcast hints)?

Accepted Answer

AQE re-optimizes the query plan at shuffle boundaries, using runtime statistics to adjust partition counts, join strategies, and skew handling. Features: (1) Coalesce shuffle partitions: merge small partitions post-shuffle, so fewer tasks run. (2) Switch sort-merge join to broadcast join when runtime stats show one side is small. (3) Skew join: split oversized partitions. Why it matters economically: it reduces manual tuning, meaning fewer hours spent on config and fewer job failures caused by bad stats....
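Feature (1) can be pictured as greedy merging of adjacent small partitions toward a target size. The sketch below is a simplified illustration of that idea, not Spark's actual algorithm; the function name, the sizes, and the 64 MB target are assumptions. In Spark itself the behavior is controlled by settings such as `spark.sql.adaptive.enabled`, `spark.sql.adaptive.coalescePartitions.enabled`, and `spark.sql.adaptive.advisoryPartitionSizeInBytes`.

```python
from typing import List

def coalesce_shuffle_partitions(sizes_mb: List[int], target_mb: int) -> List[List[int]]:
    """Group adjacent partition indices so each group stays at or under target_mb.

    A partition that is already larger than the target ends up in a group of its own.
    """
    groups: List[List[int]] = []
    current: List[int] = []
    current_size = 0
    for index, size in enumerate(sizes_mb):
        # Close the current group if adding this partition would exceed the target.
        if current and current_size + size > target_mb:
            groups.append(current)
            current, current_size = [], 0
        current.append(index)
        current_size += size
    if current:
        groups.append(current)
    return groups

sizes = [5, 10, 3, 70, 2, 4, 60, 8]  # post-shuffle partition sizes in MB (made up)
groups = coalesce_shuffle_partitions(sizes, target_mb=64)
```

Here eight shuffle partitions collapse into five tasks. The 70 MB partition is left alone by coalescing; handling it is the job of the skew-join feature, or of manual techniques when AQE's thresholds do not trigger.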

Question 6

Explain wide vs. narrow transformations and how they drive shuffle cost, failure domains, and pipeline design. When would you intentionally add a wide transformation, and how do you minimize its impact?

Accepted Answer

**Section 1 — The Context (The 'Why')**
Wide transformations shuffle data across the cluster; narrow transformations stay partition-local. Shuffle cost often dominates Spark job runtime at scale.

**Section 2 — The Diagram**
```
[Narrow: map, filter] --> [RDD] --> [Wide: join, groupBy] --> [Shuffle]
```

**Section 3 — Component Logic**
**Narrow transformations** (map, filter) do not require data movement....

Question 7

Architecturally, how do Job–Stage–Task boundaries in Spark's execution model impact cluster sizing, shuffle cost, and when would you deliberately collapse or split stages?

Accepted Answer

**Architecture**: Job = one action; Stage = a set of tasks bounded by shuffles; Task = unit of work per partition. Within a stage, narrow transformations (filter, map) are pipelined on each partition without network I/O; shuffles force stage boundaries and dominate cost.

**Why it matters for sizing**: Cluster parallelism is bounded by min(#tasks, #cores). Over-partitioning increases tasks and overhead (scheduler, task launch); under-partitioning underutilizes clusters....
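The sizing bound can be turned into back-of-the-envelope arithmetic. The helpers below are hypothetical and assume one core per task and roughly equal task durations, which real jobs rarely have.

```python
import math

def effective_parallelism(num_tasks: int, total_cores: int) -> int:
    """Tasks that can run at once: bounded by min(#tasks, #cores)."""
    return min(num_tasks, total_cores)

def task_waves(num_tasks: int, total_cores: int) -> int:
    """Number of scheduling rounds needed to run all tasks of a stage."""
    return math.ceil(num_tasks / total_cores)

# Over-partitioned relative to the cluster: 200 tasks on 64 cores.
over = (effective_parallelism(200, 64), task_waves(200, 64))

# Under-partitioned: 16 tasks on 64 cores leaves 48 cores idle.
under = (effective_parallelism(16, 64), task_waves(16, 64))
idle_cores = 64 - under[0]
```

In the first case the last wave runs only 8 tasks while 56 cores wait, so a partition count that is a multiple of the core count tends to waste less.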

Question 8

What are the key components of the Spark execution model (Job, Stage, Task)?

Accepted Answer

**Job**: Triggered by one action (count, write, collect); one job per action. **Stage**: Boundary at shuffle; narrow transformations (map, filter) pipeline into one stage; wide (join, groupBy, distinct) trigger stage boundaries. **Task**: One task per partition per stage; unit of work on an executor. Flow: Action → Job → DAG Scheduler (plans stages) → Task Scheduler (schedules tasks to executors)....
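How shuffles cut a job into stages can be sketched for a simple linear pipeline. This is a toy model, not the DAG Scheduler: `split_into_stages` is a hypothetical helper, the set of wide operations is taken from the answer above, and it ignores details such as a join reading from two upstream stages.

```python
from typing import List

WIDE_OPS = {"join", "groupBy", "distinct"}  # wide transformations named in the answer

def split_into_stages(ops: List[str]) -> List[List[str]]:
    """Start a new stage at every wide op; narrow ops pipeline into the current stage."""
    stages: List[List[str]] = [[]]
    for op in ops:
        if op in WIDE_OPS:
            stages.append([op])    # shuffle boundary
        else:
            stages[-1].append(op)  # narrow op joins the current stage
    return [stage for stage in stages if stage]

def total_tasks(stages: List[List[str]], partitions_per_stage: int) -> int:
    """One task per partition per stage (assumes a constant partition count)."""
    return len(stages) * partitions_per_stage

pipeline = ["map", "filter", "groupBy", "map", "join", "filter"]
stages = split_into_stages(pipeline)
tasks = total_tasks(stages, partitions_per_stage=8)
```

Two wide operations yield three stages here, so with 8 partitions per stage the job runs 24 tasks.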

FedEx Dataworks Data Engineer Interview Questions

