Q: Add a column to the Employees table that shows the name of the employee with the next higher employee_id.

**LEAD:** NEXT value in partition. LEAD(name, 1) OVER (ORDER BY employee_id). LAG for previous. Default for last row: NULL. SELECT employee_id, name, LEAD(name) OVER (ORDER BY employee_id) AS next_employee_name FROM Employees;

Question 1

Architecturally, how do Job–Stage–Task boundaries in Spark's execution model impact cluster sizing, shuffle cost, and when would you deliberately collapse or split stages?

Accepted Answer

**Architecture**: Job = one action; Stage = boundary at shuffle; Task = unit per partition. Stages enable pipelining of narrow transformations (filter, map) across partitions without network I/O; shuffles force stage boundaries and dominate cost.

**Why it matters for sizing**: Cluster parallelism is bounded by min(#tasks, #cores). Over-partitioning increases tasks and overhead (scheduler, task launch); under-partitioning underutilizes clusters....

Question 2

What are the key components of the Spark execution model (Job, Stage, Task)?

Accepted Answer

**Job**: Triggered by one action (count, write, collect); one job per action. **Stage**: Boundary at shuffle; narrow transformations (map, filter) pipeline into one stage; wide (join, groupBy, distinct) trigger stage boundaries. **Task**: One task per partition per stage; unit of work on an executor. Flow: Action → Job → DAG Scheduler (plans stages) → Task Scheduler (schedules tasks to executors)....

Question 3

How many cities does each department operate in? List the top 3 departments in terms of the most number of cities. In case of a tie, order by dept_id.

Accepted Answer

**SQL:** SELECT dept_id, COUNT(DISTINCT city) city_count FROM ... GROUP BY dept_id ORDER BY city_count DESC, dept_id LIMIT 3. **Ties:** RANK() or DENSE_RANK(); filter r<=3. **Why:** Ranking semantics. **Production:** Clarify tie behavior.

Question 4

List every combination of dept_name, employee_name, and city such that the employee belongs to the department and the same city in which the department is located.

Accepted Answer

**Why Join + Same-Location Filter:** Business rule: employee must be in dept AND colocated. The join condition e.city = d.city enforces referential locality—critical for org hierarchy and location-based access control.

**Query Design:** JOIN on dept_id plus e.city = d.city. Alternative: JOIN on (dept_id, city) as composite key if indexed—can be faster....

Question 5

Add a column to the Employees table that shows the name of the employee with the next higher employee_id.

Accepted Answer

**LEAD:** NEXT value in partition. LEAD(name, 1) OVER (ORDER BY employee_id). LAG for previous. Default for last row: NULL.

SELECT employee_id, name,
       LEAD(name) OVER (ORDER BY employee_id) AS next_employee_name
FROM Employees;

Question 6

Find the third-highest salary for each department.

Accepted Answer

SELECT * FROM (SELECT employee_id, department, salary, DENSE_RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS rk FROM employees) t WHERE rk = 3. Same as 2nd-highest; rk = 3. **Parameterize**: Variable N for Nth....

Question 7

Write a PySpark job to find the top 3 employees of each department, where Age < 30 and Salary > department average salary.

Accepted Answer

from pyspark.sql import Window
from pyspark.sql.functions import avg, col, row_number
df = spark.table("employees").filter("age < 30")
dept_avg = df.groupBy("dept_id").agg(avg("salary").alias("dept_avg_sal"))
df2 = df.join(dept_avg, "dept_id").filter(col("salary") > col("dept_avg_sal"))
window = Window.partitionBy("dept_id").orderBy(col("salary").desc())
df2.withColumn("rn", row_number().over(window)).filter("rn <= 3").select("dept_id", "employee_id", "salary", "age").show() **Why**: Filter earl...

Question 8

Write a PySpark script to read a CSV file, filter rows where the age column is less than 18, and write the result to a new CSV file.

Accepted Answer

**Why It Matters (Architectural Logic)**: CSV is ubiquitous but inefficient at scale. Filter early; prefer Parquet for production. Handle nulls and type coercion.

Read CSV with schema inference or explicit schema: `df = spark.read.option("header", "true").csv("s3://bucket/input.csv")`. Filter: `minors_df = df.filter(F.col("age") < 18)`. If age may be string: `df.filter(F.col("age").cast("int") < 18)`....

Freight Tiger Data Engineer Interview Questions

Reading isn't practice. Get AI feedback on your answers.

Other Companies

Freight Tiger Data Engineer Interview Questions

Reading isn't practice. Get AI feedback on your answers.

Other Companies