Question 1

Write an SQL query to find the second-highest salary from an employee table.

Accepted Answer

**Using subquery with MAX**:
```sql
SELECT MAX(salary) AS second_highest
FROM employee
WHERE salary < (SELECT MAX(salary) FROM employee);
```

**Using LIMIT/OFFSET** (MySQL, PostgreSQL):
```sql
SELECT DISTINCT salary
FROM employee
ORDER BY salary DESC
LIMIT 1 OFFSET 1;
```

**Using DENSE_RANK** (ANSI SQL, most robust):
```sql
SELECT DISTINCT salary  -- DISTINCT collapses tied second-place salaries into one row
FROM (
  SELECT salary, DENSE_RANK() OVER (ORDER BY salary DESC) AS rk
  FROM employee
) t
WHERE rk = 2;
```
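
The three variants agree on ordinary data but differ when no second-highest salary exists. A minimal sketch, assuming Python's built-in `sqlite3` module (SQLite 3.25 or later for window functions) as a stand-in engine and a hypothetical four-row `employee` table; the DENSE_RANK variant uses DISTINCT so ties at rank 2 would collapse into one row:

```python
import sqlite3

# Hypothetical sample data: two employees tie for the top salary.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (name TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?)",
    [("a", 300), ("b", 300), ("c", 200), ("d", 100)],
)

QUERIES = {
    "max_subquery": """
        SELECT MAX(salary) FROM employee
        WHERE salary < (SELECT MAX(salary) FROM employee)
    """,
    "limit_offset": """
        SELECT DISTINCT salary FROM employee
        ORDER BY salary DESC LIMIT 1 OFFSET 1
    """,
    "dense_rank": """
        SELECT DISTINCT salary FROM (
            SELECT salary, DENSE_RANK() OVER (ORDER BY salary DESC) AS rk
            FROM employee
        ) t WHERE rk = 2
    """,
}

def second_highest(connection):
    """Run each variant and return {variant_name: list of result rows}."""
    return {name: connection.execute(sql).fetchall() for name, sql in QUERIES.items()}

results = second_highest(conn)

# Edge case: with a single distinct salary there is no second-highest value.
conn.execute("DELETE FROM employee WHERE salary < 300")
edge_results = second_highest(conn)
```

On the sample data all three return 200. In the edge case the MAX subquery returns one row containing NULL, while the other two return no rows, which matters if the caller expects exactly one row.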

**Architectural Logic & Trade-offs**:
- **Subque...

Question 2

What is the difference between cache() and persist() in Spark? When would you use each?

Accepted Answer

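**cache()**: Shorthand for `persist()` with the default storage level. For RDDs the default is `MEMORY_ONLY` (partitions that do not fit in memory are recomputed when needed); for DataFrames and Datasets it is `MEMORY_AND_DISK` (partitions that do not fit spill to disk).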

**persist(storage_level)**: Explicit control over storage: MEMORY_ONLY, MEMORY_AND_DISK, MEMORY_ONLY_SER, MEMORY_AND_DISK_SER, DISK_ONLY.

**Architectural Logic (Why It Matters)**: Caching trades memory/disk for recomputation cost....
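
The recomputation cost can be illustrated without a cluster. The sketch below is a toy model in plain Python, not the Spark API: `LazyDataset`, `expensive_source`, and the `compute_calls` counter are hypothetical names that mimic lazy evaluation, where every action re-runs the lineage unless the result was cached.

```python
class LazyDataset:
    """Toy stand-in for a lazily evaluated dataset; not the Spark API."""

    def __init__(self, compute):
        self._compute = compute       # function that rebuilds the data from its lineage
        self._cached = None           # materialized rows, filled by the first action after cache()
        self._cache_enabled = False
        self.compute_calls = 0        # how many times the lineage was re-executed

    def cache(self):
        """Mark the dataset for caching; nothing is computed until an action runs."""
        self._cache_enabled = True
        return self

    def _materialize(self):
        if self._cached is not None:
            return self._cached
        self.compute_calls += 1
        rows = list(self._compute())
        if self._cache_enabled:
            self._cached = rows
        return rows

    def count(self):
        return len(self._materialize())

    def total(self):
        return sum(self._materialize())


def expensive_source():
    # Placeholder for a costly read plus transformations.
    return (x * x for x in range(1_000))


# Two actions without caching: the lineage runs twice.
uncached = LazyDataset(expensive_source)
uncached.count()
uncached.total()

# Two actions with caching: the lineage runs once, the second action reuses the rows.
cached = LazyDataset(expensive_source).cache()
cached.count()
cached.total()
```

The cached variant holds all rows in memory to save one recomputation, which is the same trade the storage levels above make explicit.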

Question 3

What is the difference between groupByKey and reduceByKey in Spark?

Accepted Answer

**groupByKey()**: Shuffles all (key, value) pairs to group values per key. Transfers O(total_values) over the network. No local aggregation—you combine values afterward. High memory and network cost.

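**reduceByKey(func)**: Applies `func` (e.g., sum) within each partition before the shuffle, so `func` must be associative and commutative. Shuffles at most one partially aggregated value per key per partition instead of every record, then merges those partial results across partitions.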

**Architectural Logic (Why reduceByKey Wins)**: Shuffle is the bottleneck....
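
The difference in shuffle volume can be counted directly. This is a plain-Python simulation under stated assumptions, not Spark code: the two-partition input and the helper names `group_by_key_sum` and `reduce_by_key_sum` are hypothetical, and "shuffled" simply means the records that would have to leave their partition.

```python
from collections import defaultdict

# Hypothetical input: two partitions of (word, 1) pairs.
partitions = [
    [("a", 1), ("b", 1), ("a", 1), ("a", 1)],
    [("a", 1), ("b", 1), ("b", 1), ("c", 1)],
]

def group_by_key_sum(parts):
    """groupByKey-style: ship every pair, then aggregate after the shuffle."""
    shuffled = [pair for part in parts for pair in part]  # every record crosses the network
    grouped = defaultdict(list)
    for key, value in shuffled:
        grouped[key].append(value)
    return {k: sum(vs) for k, vs in grouped.items()}, len(shuffled)

def reduce_by_key_sum(parts):
    """reduceByKey-style: combine within each partition, then ship the partial sums."""
    shuffled = []
    for part in parts:
        local = defaultdict(int)
        for key, value in part:
            local[key] += value          # map-side combine
        shuffled.extend(local.items())   # at most one record per key per partition
    totals = defaultdict(int)
    for key, partial in shuffled:
        totals[key] += partial
    return dict(totals), len(shuffled)

grouped_result, grouped_shuffled = group_by_key_sum(partitions)
reduced_result, reduced_shuffled = reduce_by_key_sum(partitions)
```

Both functions produce the same totals, but the grouped version moves 8 records while the reduced version moves 5; the gap widens as values per key grow.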

Question 4

Discuss differences between ROW_NUMBER(), RANK(), and DENSE_RANK(), and provide examples from your projects.

Accepted Answer

**ROW_NUMBER()**: Unique sequential numbers (1, 2, 3, ...) with no ties; the numbering is deterministic only when the ORDER BY columns uniquely order the rows.

**RANK()**: Same rank for ties, then skips the ranks the ties consumed (1, 2, 2, 4).

**DENSE_RANK()**: Same rank for ties, with no gaps (1, 2, 2, 3).

**Project examples**: ROW_NUMBER() to deduplicate events by (user_id, event_time), keeping the first row per key; this is critical when upstream sends duplicates. DENSE_RANK() for 'top 10 products per category' reports, where it avoids gaps when filtering....
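
The tie behaviour and the deduplication pattern can be checked on a small dataset. A minimal sketch, assuming Python's built-in `sqlite3` (SQLite 3.25 or later) as the engine; the `scores` and `events` tables and the `ingested_at` column are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (player TEXT, score INTEGER)")
conn.executemany(
    "INSERT INTO scores VALUES (?, ?)",
    [("p1", 90), ("p2", 80), ("p3", 80), ("p4", 70)],
)

# p2 and p3 tie on score; 'player' is added as a tie-breaker for ROW_NUMBER only.
ranked = conn.execute(
    """
    SELECT player,
           ROW_NUMBER() OVER (ORDER BY score DESC, player) AS row_num,
           RANK()       OVER (ORDER BY score DESC)         AS rnk,
           DENSE_RANK() OVER (ORDER BY score DESC)         AS dense_rnk
    FROM scores
    ORDER BY score DESC, player
    """
).fetchall()

# Deduplication: keep the earliest-ingested row per (user_id, event_time).
conn.execute("CREATE TABLE events (user_id INTEGER, event_time TEXT, ingested_at INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [(1, "t1", 10), (1, "t1", 11), (2, "t1", 12)],
)
deduped = conn.execute(
    """
    SELECT user_id, event_time, ingested_at
    FROM (
        SELECT *, ROW_NUMBER() OVER (
            PARTITION BY user_id, event_time ORDER BY ingested_at
        ) AS rn
        FROM events
    ) t
    WHERE rn = 1
    ORDER BY user_id, event_time
    """
).fetchall()
```

Without the `player` tie-breaker, the engine would be free to assign row numbers 2 and 3 to the tied players in either order.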

Question 5

Briefly introduce yourself and walk us through your journey as a Data Engineer so far.

Accepted Answer

**Situation**: I joined as a software engineer and saw data as a bottleneck—pipelines broke, nobody trusted the numbers. **Task**: Transition into data engineering and build reliable, scalable systems. **Action**: I moved from ETL dev to owning cloud data platforms—designed data lakes on AWS/GCP, optimized Spark jobs (reduced costs 40% via partition pruning and skew fixes), implemented Kafka/Flink streaming, and led migrations to Delta Lake....

Question 6

Can you explain the difference between OLTP and OLAP?

Accepted Answer

**OLTP**: Optimized for many small transactions (inserts, updates, deletes). Row-oriented, normalized, high concurrency. Examples: MySQL, PostgreSQL.

**OLAP**: Optimized for complex analytical queries and aggregations on large datasets. Column-oriented, denormalized (star/snowflake). Examples: Snowflake, BigQuery, Redshift.

**Why the split**: Different access patterns; mixing them degrades both. OLTP needs low latency and ACID; OLAP needs scan throughput....
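
The row-oriented versus column-oriented distinction can be pictured with two in-memory layouts of the same data. This is an illustrative sketch only; the `orders` data and both helper functions are hypothetical, and real engines add indexes, compression, and storage pages on top.

```python
# Row layout: each record is stored together, which suits reading or writing one order.
row_store = [
    {"order_id": 1, "customer": "ann", "amount": 30},
    {"order_id": 2, "customer": "bob", "amount": 50},
    {"order_id": 3, "customer": "ann", "amount": 20},
]

# Column layout: each column is stored together, which suits scanning one attribute.
column_store = {
    "order_id": [1, 2, 3],
    "customer": ["ann", "bob", "ann"],
    "amount": [30, 50, 20],
}

def oltp_lookup(rows, order_id):
    """OLTP-style point read: fetch one whole record by key."""
    return next(r for r in rows if r["order_id"] == order_id)

def olap_total(columns):
    """OLAP-style aggregate: scan only the 'amount' column, ignoring the others."""
    return sum(columns["amount"])

one_order = oltp_lookup(row_store, 2)
total_amount = olap_total(column_store)
```

The aggregate never touches `order_id` or `customer`, which is why columnar storage favours scan throughput, while the point lookup gets every field of one order from a single record.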

Question 7

Describe a time when you had to optimize a slow SQL query. What steps did you take?

Accepted Answer

**Situation**: A critical exec report was timing out at 30+ minutes; SLA was 5 minutes. **Task**: Diagnose and fix without changing business logic. **Action**: I ran EXPLAIN (ANALYZE) and found: (1) missing index on join key causing full table scan, (2) cross join that could be inner join with a filter, (3) filter in HAVING that could move to WHERE to reduce rows early, (4) unnecessary ORDER BY on a subquery....
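
Findings (1) and (3) can be reproduced in miniature. The sketch below uses Python's built-in `sqlite3` and SQLite's `EXPLAIN QUERY PLAN` as a stand-in for `EXPLAIN (ANALYZE)`; the `orders` table, the index name, and the data are hypothetical, and the exact plan wording varies by SQLite version.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount INTEGER)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, amount) VALUES (?, ?)",
    [(i % 100, i) for i in range(1_000)],
)

QUERY = "SELECT amount FROM orders WHERE customer_id = 42"

def plan(connection, sql):
    """Return the human-readable detail column of SQLite's query plan."""
    return " | ".join(row[3] for row in connection.execute("EXPLAIN QUERY PLAN " + sql))

plan_before = plan(conn, QUERY)   # no usable index: full scan of orders
conn.execute("CREATE INDEX idx_orders_customer_id ON orders (customer_id)")
plan_after = plan(conn, QUERY)    # the planner can now search via the index

# A filter on a grouping column returns the same rows in WHERE as in HAVING,
# but WHERE discards rows before they are grouped.
having_rows = conn.execute(
    "SELECT customer_id, SUM(amount) FROM orders "
    "GROUP BY customer_id HAVING customer_id = 42"
).fetchall()
where_rows = conn.execute(
    "SELECT customer_id, SUM(amount) FROM orders "
    "WHERE customer_id = 42 GROUP BY customer_id"
).fetchall()
```

Comparing the plan text before and after a change, rather than only the wall-clock time, confirms that the optimizer actually picked up the new index.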

Question 8

Explain the concept of ACID properties in the context of databases.

Accepted Answer

**ACID**: Atomicity (all or nothing), Consistency (valid state transitions), Isolation (concurrent transactions don't interfere), Durability (committed data persists).

**Why it matters**: Without ACID, financial and operational data become inconsistent; retries and failures create duplicates or lost updates.

**Scalability trade-off**: Strict isolation (e.g., Serializable) limits throughput; most OLTP systems use Read Committed or Repeatable Read....
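
Atomicity and consistency can be seen in a two-statement transfer. A minimal sketch, assuming Python's built-in `sqlite3`; the `accounts` table, its CHECK constraint, and the `transfer` helper are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

def transfer(connection, src, dst, amount):
    """Move money in one transaction; returns True on commit, False on rollback."""
    try:
        with connection:  # commits on success, rolls back if an exception escapes
            connection.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
            )
            connection.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        return False

def balances(connection):
    return dict(connection.execute("SELECT name, balance FROM accounts"))

ok = transfer(conn, "alice", "bob", 30)        # both updates commit together
failed = transfer(conn, "alice", "bob", 500)   # debit violates the CHECK, so the credit is undone too
final = balances(conn)
```

The failed transfer leaves both balances exactly as they were after the first one: the credit to `bob` had already executed, but the rollback discards it along with the rejected debit.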

Question 9

Explain the difference between INNER JOIN, LEFT JOIN, RIGHT JOIN, and FULL JOIN.

Accepted Answer

**INNER JOIN**: Only rows with matches in both tables.

**LEFT JOIN**: All rows from the left table, with matches from the right; NULLs where there is no match.

**RIGHT JOIN**: All rows from the right table, with matches from the left; NULLs where there is no match.

**FULL JOIN**: All rows from both tables; NULLs on whichever side has no match.

**Why it matters**: Join choice affects result cardinality and semantics. Wrong join = wrong numbers.

**Scalability**: In distributed engines, hash joins are common, and a small dimension table can be broadcast to avoid a shuffle. FULL OUTER JOIN can be expensive because it typically requires a large shuffle....
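
The effect on cardinality is easiest to see on two tiny tables with one unmatched row each. A minimal sketch, assuming Python's built-in `sqlite3`; the `customers` and `orders` tables are hypothetical, and because older SQLite versions lack RIGHT and FULL JOIN, those two are written here as equivalent rewrites (a swapped LEFT JOIN, and a LEFT JOIN plus unmatched right rows):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER, name TEXT);
    CREATE TABLE orders (customer_id INTEGER, amount INTEGER);
    INSERT INTO customers VALUES (1, 'ann'), (2, 'bob');   -- bob has no order
    INSERT INTO orders VALUES (1, 30), (3, 99);            -- order 99 has no customer
    """
)

def run(sql):
    return conn.execute(sql).fetchall()

inner_rows = run("""
    SELECT c.name, o.amount FROM customers c
    INNER JOIN orders o ON o.customer_id = c.id
""")

left_rows = run("""
    SELECT c.name, o.amount FROM customers c
    LEFT JOIN orders o ON o.customer_id = c.id
    ORDER BY c.id
""")

# Same result as: customers RIGHT JOIN orders.
right_rows = run("""
    SELECT c.name, o.amount FROM orders o
    LEFT JOIN customers c ON o.customer_id = c.id
    ORDER BY o.customer_id
""")

# Same result as: customers FULL JOIN orders.
full_rows = run("""
    SELECT c.name, o.amount FROM customers c
    LEFT JOIN orders o ON o.customer_id = c.id
    UNION ALL
    SELECT NULL, o.amount FROM orders o
    WHERE NOT EXISTS (SELECT 1 FROM customers c WHERE c.id = o.customer_id)
""")
```

The same two tables yield 1, 2, 2, and 3 rows respectively, so an aggregate computed on top of the wrong join silently changes.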

Question 10

How do you handle NULL values in SQL? Mention functions like COALESCE and NULLIF.

Accepted Answer

**Approaches**: IS NULL / IS NOT NULL for filtering.

**COALESCE(val1, val2, ...)**: Returns the first non-NULL argument; useful for defaults.

**NULLIF(val1, val2)**: Returns NULL if the two values are equal, otherwise val1; e.g., NULLIF(divisor, 0) to avoid divide-by-zero.

**Why it matters**: NULL propagates through expressions; aggregate functions ignore NULL (except COUNT(*), which counts rows). An equality join on NULL keys yields no match, because NULL = NULL evaluates to UNKNOWN rather than TRUE.

**Scalability**: COALESCE in SELECT is cheap; wrapping an indexed column in COALESCE inside a WHERE or JOIN predicate can prevent index use....
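
These rules can be verified on a three-row table. A minimal sketch, assuming Python's built-in `sqlite3`; the `metrics` table is hypothetical. SQLite itself returns NULL on division by zero, so `NULLIF` is shown here for portability to engines such as PostgreSQL that raise an error instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metrics (label TEXT, clicks INTEGER, views INTEGER)")
conn.executemany(
    "INSERT INTO metrics VALUES (?, ?, ?)",
    [("a", 10, 100), ("b", 5, 0), ("c", None, 50)],
)

rows = conn.execute(
    """
    SELECT label,
           COALESCE(clicks, 0)             AS clicks_or_zero,  -- default for missing clicks
           clicks * 1.0 / NULLIF(views, 0) AS ctr              -- NULL instead of dividing by 0
    FROM metrics
    ORDER BY label
    """
).fetchall()

# COUNT(*) counts rows; COUNT(col) and SUM(col) skip NULLs.
counts = conn.execute(
    "SELECT COUNT(*), COUNT(clicks), SUM(clicks) FROM metrics"
).fetchone()

# NULL = NULL is UNKNOWN, which sqlite3 surfaces as None.
null_equals_null = conn.execute("SELECT NULL = NULL").fetchone()[0]
```

Row `c` shows propagation: its `ctr` is NULL because `clicks` is NULL, even though `views` is a valid divisor.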

Yash Technologies Data Engineer Interview Questions
