Question 1

Write a query to find the top three highest-paid employees in each department using window functions.

Accepted Answer

**Architectural Logic:** Same pattern as ID 116—window + filter. DENSE_RANK for 'include ties'; ROW_NUMBER for exactly 3. **Query:** WITH ranked AS (SELECT employee_id, name, department, salary, DENSE_RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS rn FROM employees) SELECT employee_id, name, department, salary FROM ranked WHERE rn <= 3 ORDER BY department, salary DESC....

Question 2

How do you see your career evolving in the next 3-5 years?

Accepted Answer

Situation: I've been Senior DE for 3 years. Task: Articulate realistic path. Action: I aim for Staff/Principal—owning org-wide data strategy. I'm developing cross-functional influence, distributed systems depth, mentorship (sponsored 2 engineers), and external presence....

Question 3

If your team disagrees on the approach to solving a problem, how do you manage the situation?

Accepted Answer

Situation: Team split on data modeling approach. Task: Facilitate fact-based decision. Action: I had each perspective documented; ran decision meeting with timebox. Suggested spike to validate. Used RACI for accountability. Documented decision and rationale. As lead, I made final call when deadlocked, explaining reasoning....

Question 4

Explain the architectural trade-offs when optimizing a query on 100M+ rows: indexing vs. partitioning vs. materialized views. When does each approach become cost-prohibitive or operationally burdensome, and how do you quantify impact?

Accepted Answer

Architectural Logic: Indexing—speeds reads, slows writes, costs storage; partitioning—prunes at scan, requires partition design; materialized views—precompute, cost refresh. Why each: Index for point/range lookup on filter columns; partition for time/entity-based pruning; MV for repeated aggregations. Scalability: Indexes on wide tables → write amplification; over-partitioning → small-file problem; MVs → refresh time and storage....

Question 5

How would you handle nulls in a SQL join? Provide examples using COALESCE.

Accepted Answer

Nulls in SQL joins behave uniquely: NULL = NULL evaluates to NULL (not TRUE), so join keys with nulls typically don't match. Use COALESCE to normalize join keys. Example: SELECT * FROM orders o JOIN customers c ON COALESCE(o.customer_id, -1) = COALESCE(c.id, -1). For left joins where right-side columns may be null, use COALESCE in SELECT: COALESCE(c.name, 'Unknown Customer'). To include null-matching rows explicitly, use: ON (o.key = c.key OR (o.key IS NULL AND c.key IS NULL))....

Question 6

What are your expectations for the role beyond the salary?

Accepted Answer

**Situation**: Context of the challenge. **Task**: Your responsibility. **Action**: Specific steps, tools, collaboration. **Result**: Quantified outcome. Beyond salary, I expect: (1) Impact—working on problems that matter at scale. (2) Growth—learning from strong peers, mentorship, stretch projects. (3) Ownership—end-to-end responsibility for systems. (4) Work-life balance—sustainable pace, flexibility. (5) Technology—modern stack, investment in tooling....

Question 7

What is the most common performance bottleneck in Spark jobs, and how would you resolve it?

Accepted Answer

**#1 Bottleneck: Shuffle**. Network I/O for join, groupBy. Resolve: (1) Broadcast small tables. (2) Column pruning. (3) Partition by key. (4) Fix skew (salting). (5) Tune shuffle partitions. (6) AQE coalesce.

**Other Bottlenecks**: GC (reduce executor memory, G1), data skew, too few executors, source slow.

**Why Shuffle Dominates**: Moves data across network; serialization; often 70–90% of time.

**Scalability Trade-offs**: Each fix has limits. Combine....

Question 8

Write PySpark code to filter records based on specific conditions and add a calculated column.

Accepted Answer

**Code**:

```python
from pyspark.sql.functions import when, col
spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("/path")
result = df.filter(col("region") == "US").withColumn("category", when(col("amount") > 100, "high").otherwise("low"))
```

**Why**: Filter before calculated column reduces data. when/otherwise = built-in, optimized.

**Scalability Trade-offs**: Filter first. Built-in over UDF.

**Cost Implications**: Filter early = less data....

Question 9

Write a PySpark script to filter out invalid records from a dataset and calculate the average for a specific column, ensuring the schema is strictly defined at runtime.

Accepted Answer

**Why It Matters (Architectural Logic)**: Strict schemas reject malformed data at read time—fail fast vs. silent corruption. FAILFAST mode prevents partial loads.

Use a strictly defined schema via StructType to reject malformed records at read time. Define schema: `schema = StructType([StructField("id", LongType()), StructField("value", DoubleType())])`. Read with schema: `df = spark.read.schema(schema).csv(path)`....

Bristol Myers Squibb Data Engineer Interview Questions

Difficulty Breakdown

Key Topics Covered

How to Use This Guide

Companies asking these questions

All 9 Questions

More Interview Prep Guides

Unlock All Expert Answers