Q: Add a column to the Employees table that shows the name of the employee with the next higher employee_id.

**LEAD:** NEXT value in partition. LEAD(name, 1) OVER (ORDER BY employee_id). LAG for previous. Default for last row: NULL. SELECT employee_id, name, LEAD(name) OVER (ORDER BY employee_id) AS next_employee_name FROM Employees;

Question 1

Architecturally, how do Job–Stage–Task boundaries in Spark's execution model impact cluster sizing, shuffle cost, and when would you deliberately collapse or split stages?

Accepted Answer

**Architecture**: Job = one action; Stage = boundary at shuffle; Task = unit per partition. Stages enable pipelining of narrow transformations (filter, map) across partitions without network I/O; shuffles force stage boundaries and dominate cost.

**Why it matters for sizing**: Cluster parallelism is bounded by min(#tasks, #cores). Over-partitioning increases tasks and overhead (scheduler, task launch); under-partitioning underutilizes clusters....

Question 2

What are the key components of the Spark execution model (Job, Stage, Task)?

Accepted Answer

**Job**: Triggered by one action (count, write, collect); one job per action. **Stage**: Boundary at shuffle; narrow transformations (map, filter) pipeline into one stage; wide (join, groupBy, distinct) trigger stage boundaries. **Task**: One task per partition per stage; unit of work on an executor. Flow: Action → Job → DAG Scheduler (plans stages) → Task Scheduler (schedules tasks to executors)....

Question 3

How many cities does each department operate in? List the top 3 departments in terms of the most number of cities. In case of a tie, order by dept_id.

Accepted Answer

**SQL:** SELECT dept_id, COUNT(DISTINCT city) city_count FROM ... GROUP BY dept_id ORDER BY city_count DESC, dept_id LIMIT 3. **Ties:** RANK() or DENSE_RANK(); filter r<=3. **Why:** Ranking semantics. **Production:** Clarify tie behavior.

Question 4

List every combination of dept_name, employee_name, and city such that the employee belongs to the department and the same city in which the department is located.

Accepted Answer

**Why Join + Same-Location Filter:** Business rule: employee must be in dept AND colocated. The join condition e.city = d.city enforces referential locality—critical for org hierarchy and location-based access control.

**Query Design:** JOIN on dept_id plus e.city = d.city. Alternative: JOIN on (dept_id, city) as composite key if indexed—can be faster....

Question 5

Add a column to the Employees table that shows the name of the employee with the next higher employee_id.

Accepted Answer

**LEAD:** NEXT value in partition. LEAD(name, 1) OVER (ORDER BY employee_id). LAG for previous. Default for last row: NULL.

SELECT employee_id, name,
       LEAD(name) OVER (ORDER BY employee_id) AS next_employee_name
FROM Employees;

Question 6

Find the third-highest salary for each department.

Accepted Answer

SELECT * FROM (SELECT employee_id, department, salary, DENSE_RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS rk FROM employees) t WHERE rk = 3. Same as 2nd-highest; rk = 3. **Parameterize**: Variable N for Nth....

Question 7

Write a PySpark job to find the top 3 employees of each department, where Age < 30 and Salary > department average salary.

Accepted Answer

from pyspark.sql import Window
from pyspark.sql.functions import avg, col, row_number
df = spark.table("employees").filter("age < 30")
dept_avg = df.groupBy("dept_id").agg(avg("salary").alias("dept_avg_sal"))
df2 = df.join(dept_avg, "dept_id").filter(col("salary") > col("dept_avg_sal"))
window = Window.partitionBy("dept_id").orderBy(col("salary").desc())
df2.withColumn("rn", row_number().over(window)).filter("rn <= 3").select("dept_id", "employee_id", "salary", "age").show() **Why**: Filter earl...

Question 8

Write a PySpark script to read a CSV file, filter rows where the age column is less than 18, and write the result to a new CSV file.

Accepted Answer

**Why It Matters (Architectural Logic)**: CSV is ubiquitous but inefficient at scale. Filter early; prefer Parquet for production. Handle nulls and type coercion.

Read CSV with schema inference or explicit schema: `df = spark.read.option("header", "true").csv("s3://bucket/input.csv")`. Filter: `minors_df = df.filter(F.col("age") < 18)`. If age may be string: `df.filter(F.col("age").cast("int") < 18)`....

Freight Tiger Data Engineer Interview Questions

Difficulty Breakdown

Key Topics Covered

How to Use This Guide

Companies asking these questions

All 8 Questions

More Interview Prep Guides

Unlock All Expert Answers