Question 1

Write an SQL query to find the second-highest salary from an employee table.

Accepted Answer

**Using subquery with MAX**:
```sql
SELECT MAX(salary) AS second_highest
FROM employee
WHERE salary < (SELECT MAX(salary) FROM employee);
```

**Using LIMIT/OFFSET** (MySQL, PostgreSQL):
```sql
SELECT DISTINCT salary
FROM employee
ORDER BY salary DESC
LIMIT 1 OFFSET 1;
```

**Using DENSE_RANK** (ANSI SQL, most robust):
```sql
SELECT DISTINCT salary  -- DISTINCT collapses ties at rank 2 into one row
FROM (
  SELECT salary, DENSE_RANK() OVER (ORDER BY salary DESC) AS rk
  FROM employee
) t
WHERE rk = 2;
```
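
The three queries can be compared on a small in-memory table. The sketch below uses Python's built-in `sqlite3` as a stand-in engine (assuming the bundled SQLite is 3.25 or newer, which is when window functions were added); the table contents are hypothetical, and the DENSE_RANK variant here uses `SELECT DISTINCT` so that ties at rank 2 collapse to one row.

```python
import sqlite3

# Hypothetical sample data: two employees tie for the top salary.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (name TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?)",
    [("Ann", 300), ("Bob", 300), ("Cid", 200), ("Dee", 200), ("Eve", 100)],
)

QUERIES = {
    "max_subquery": """
        SELECT MAX(salary) FROM employee
        WHERE salary < (SELECT MAX(salary) FROM employee)
    """,
    "limit_offset": """
        SELECT DISTINCT salary FROM employee
        ORDER BY salary DESC LIMIT 1 OFFSET 1
    """,
    "dense_rank": """
        SELECT DISTINCT salary FROM (
            SELECT salary, DENSE_RANK() OVER (ORDER BY salary DESC) AS rk
            FROM employee
        ) t WHERE rk = 2
    """,
}


def run_all(connection):
    """Run each query and return {query name: list of result rows}."""
    return {name: connection.execute(sql).fetchall() for name, sql in QUERIES.items()}


results = run_all(conn)

# Edge case: with a single distinct salary there is no second-highest value.
conn.execute("DELETE FROM employee WHERE salary < 300")
edge = run_all(conn)
```

With ties present, all three return 200. In the edge case the MAX subquery returns one row holding NULL, while the other two return no rows at all, which is worth knowing before wiring the result into application code.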

**Architectural Logic & Trade-offs**:
- **Subque...

Question 2

Demonstrate the difference between DENSE_RANK() and RANK().

Accepted Answer

**RANK()**: Ties share a rank, and the ranks that follow are skipped (e.g., 1, 2, 2, 4, 5).

**DENSE_RANK()**: Ties share a rank, with no gaps (e.g., 1, 2, 2, 3, 4).

**Why it matters**: RANK preserves "position" semantics (e.g., 4th place); DENSE_RANK gives consecutive integers, which is useful when filtering on the top N distinct values (e.g., top 10).

**Example**: `SELECT name, salary, RANK() OVER (ORDER BY salary DESC) AS rk, DENSE_RANK() OVER (ORDER BY salary DESC) AS dense_rk FROM employee`....
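
To see the two sequences side by side, the example query can be run against a small table. This is a minimal sketch using Python's built-in `sqlite3` (assuming SQLite 3.25+ for window functions); the rows are hypothetical, and the outer `ORDER BY` is added only to make the output order stable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (name TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?)",
    # Hypothetical data: Bob and Cid tie for second place.
    [("Ann", 500), ("Bob", 400), ("Cid", 400), ("Dee", 300), ("Eve", 200)],
)

rows = conn.execute(
    """
    SELECT name, salary,
           RANK()       OVER (ORDER BY salary DESC) AS rk,
           DENSE_RANK() OVER (ORDER BY salary DESC) AS dense_rk
    FROM employee
    ORDER BY salary DESC, name
    """
).fetchall()

rank_values = [r[2] for r in rows]   # RANK leaves a gap after the tie
dense_values = [r[3] for r in rows]  # DENSE_RANK stays consecutive

for row in rows:
    print(row)
```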

Question 3

Discuss differences between ROW_NUMBER(), RANK(), and DENSE_RANK(), and provide examples from your projects.

Accepted Answer

**ROW_NUMBER()**: Unique sequential numbers (1, 2, 3...), so tied rows still get different numbers; the output is deterministic only when the ORDER BY columns uniquely order the rows.

**RANK()**: Same rank for ties; skips the ranks that follow (1, 2, 2, 4).

**DENSE_RANK()**: Same rank for ties; no gaps (1, 2, 2, 3).

**Project examples**: ROW_NUMBER() to deduplicate events by (user_id, event_time), keeping the first row per key, which is critical when upstream sends duplicates. DENSE_RANK() for 'top 10 products per category' reports, since its gap-free ranks let a filter on the rank cover the top 10 distinct values....
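
The "top N per category" case is where the choice of function changes the row count. The following sketch (Python `sqlite3`, assuming SQLite 3.25+; the `product` table, its columns, and the helper `top_n` are hypothetical) takes the top 2 per category with each function.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product (category TEXT, name TEXT, revenue INTEGER)")
conn.executemany(
    "INSERT INTO product VALUES (?, ?, ?)",
    [
        ("toys", "kite", 90), ("toys", "ball", 90),   # tie at the top
        ("toys", "doll", 70), ("toys", "yoyo", 50),
        ("books", "atlas", 80), ("books", "novel", 60), ("books", "comic", 40),
    ],
)


def top_n(func, order_by, n):
    """Return (category, name) rows whose per-category window value is <= n.

    func and order_by are trusted constants here, not user input.
    """
    sql = f"""
        SELECT category, name FROM (
            SELECT category, name,
                   {func}() OVER (PARTITION BY category ORDER BY {order_by}) AS pos
            FROM product
        ) t
        WHERE pos <= ?
        ORDER BY category, pos, name
    """
    return conn.execute(sql, (n,)).fetchall()


# ROW_NUMBER with a tie-breaker: exactly n rows per category.
by_row_number = top_n("ROW_NUMBER", "revenue DESC, name", 2)
# DENSE_RANK on revenue alone: top n distinct revenue levels, ties included.
by_dense_rank = top_n("DENSE_RANK", "revenue DESC", 2)
```

Here `by_row_number` holds two rows per category, while `by_dense_rank` holds three rows for `toys` because the tied products both sit at rank 1. Which behaviour is right depends on whether the report means "N rows" or "N revenue levels".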

Question 4

Explain the differences between Repartition and Coalesce. When would you use each?

Accepted Answer

**Repartition(n)**: Full shuffle; creates exactly n partitions. Can increase or decrease the partition count.

**Coalesce(n)**: Merges partitions without a full shuffle; only decreases the partition count.

**Why it matters**: Shuffle is expensive in network and disk I/O. Coalesce avoids a full shuffle when reducing partitions by merging existing partitions as they are, so the resulting partitions can be uneven in size.

**When Repartition**: Increasing partitions, fixing skew (e.g., repartition by a well-distributed key), or before a join to align partition counts....
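
The difference in data movement can be illustrated with a toy model in plain Python. This is not Spark code: the functions `coalesce` and `repartition` below are hypothetical stand-ins that only mimic the behaviour described above (Spark's real coalesce also takes data locality into account when grouping partitions).

```python
from itertools import chain


def coalesce(partitions, n):
    """Toy model: merge whole existing partitions into n groups; records are not redistributed."""
    if n >= len(partitions):
        return [list(p) for p in partitions]  # coalesce cannot increase the count
    groups = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        groups[i * n // len(partitions)].extend(part)
    return groups


def repartition(partitions, n):
    """Toy model: full shuffle; every record is redistributed round-robin."""
    out = [[] for _ in range(n)]
    for i, record in enumerate(chain.from_iterable(partitions)):
        out[i % n].append(record)
    return out


# Hypothetical input: four partitions with sizes 8, 1, 1, 2.
parts = [list(range(0, 8)), [8], [9], list(range(10, 12))]

sizes_coalesced = [len(p) for p in coalesce(parts, 2)]
sizes_rebalanced = [len(p) for p in repartition(parts, 2)]
```

In this model the coalesced partitions inherit the imbalance of their inputs, while the repartitioned ones come out even, at the cost of touching every record.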

Question 5

What strategies can you use to handle skewed data in Spark?

Accepted Answer

1. **Salting**: Add a random suffix to skewed keys to spread load; requires two-phase aggregation.
2. **Two-phase aggregation**: Aggregate with the salted key, then aggregate again without the salt.
3. **Broadcast**: For small dimension tables, broadcast to avoid a shuffle.
4. **Custom partitioning**: Pre-partition by known skewed keys.
5. **Increase partitions**: Spreads work but doesn't fix the root cause.
6. **AQE Skew Join (Spark 3.0+)**: Automatically splits skewed partitions....
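
Salting with two-phase aggregation can be sketched without a cluster. The plain-Python model below is an illustration only, not Spark code: `NUM_SALTS`, `salted_counts`, and `merge_salted` are hypothetical names, and each `(key, salt)` bucket stands in for the work one task would receive.

```python
import random
from collections import Counter

NUM_SALTS = 4  # assumed salt range; a real job would tune this to the skew


def salted_counts(keys, num_salts=NUM_SALTS, seed=0):
    """Phase 1: count per (key, salt) so a hot key is split over several buckets."""
    rng = random.Random(seed)
    partial = Counter()
    for key in keys:
        partial[(key, rng.randrange(num_salts))] += 1
    return partial


def merge_salted(partial):
    """Phase 2: drop the salt and combine the partial counts per original key."""
    final = Counter()
    for (key, _salt), count in partial.items():
        final[key] += count
    return final


# Hypothetical skewed input: one key dominates.
keys = ["hot"] * 1000 + ["a"] * 10 + ["b"] * 5

partial = salted_counts(keys)
final = merge_salted(partial)
largest_bucket = max(partial.values())
```

The final counts match an unsalted aggregation, but no single bucket in phase 1 has to process all 1000 `hot` records.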

Question 6

How do you remove duplicate rows in BigQuery?

Accepted Answer

Approach: Use `ROW_NUMBER() OVER (PARTITION BY dedup_keys ORDER BY tie_breaker)` to define which row to keep; filter `rn = 1`.

Preferred pattern: `CREATE OR REPLACE TABLE ... AS SELECT * EXCEPT(rn) FROM (SELECT *, ROW_NUMBER() OVER (...) AS rn ...) WHERE rn = 1`.

Why CREATE OR REPLACE over DELETE: BigQuery is columnar; DELETE is a rewrite under the hood. For large tables, CREATE OR REPLACE is a single scan+write vs DELETE's read-modify-write....
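
The keep-one-row-per-key logic can be tried locally before running it against a warehouse. The sketch below uses Python's `sqlite3` as a stand-in for BigQuery (assuming SQLite 3.25+): SQLite has neither `CREATE OR REPLACE TABLE` nor `SELECT * EXCEPT`, so it writes a new table with an explicit column list instead. The `events` schema and the choice of `ingested_at` as tie-breaker are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (user_id INTEGER, event_time TEXT, ingested_at TEXT, payload TEXT)"
)
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?, ?)",
    [
        (1, "2024-01-01T10:00", "2024-01-01T10:01", "first"),
        (1, "2024-01-01T10:00", "2024-01-01T10:05", "resend"),  # duplicate key
        (2, "2024-01-01T11:00", "2024-01-01T11:01", "only"),
    ],
)

# Rebuild into a new table, keeping the earliest-ingested row per dedup key.
conn.execute(
    """
    CREATE TABLE events_dedup AS
    SELECT user_id, event_time, ingested_at, payload
    FROM (
        SELECT *,
               ROW_NUMBER() OVER (
                   PARTITION BY user_id, event_time   -- dedup keys
                   ORDER BY ingested_at               -- tie-breaker
               ) AS rn
        FROM events
    ) t
    WHERE rn = 1
    """
)

deduped = conn.execute(
    "SELECT user_id, payload FROM events_dedup ORDER BY user_id"
).fetchall()
```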

Question 7

Joins and window functions - INNER, LEFT, RIGHT, FULL OUTER, ROW_NUMBER(), RANK(), DENSE_RANK()

Accepted Answer

Joins: INNER = only rows that match on both sides; LEFT = all left rows + matching right rows (NULL fill where there is no match); RIGHT = mirror of LEFT; FULL OUTER = all rows from both sides, NULL-filled where either side has no match. Why it matters: Join choice affects result cardinality and NULL handling, so the wrong join means wrong business logic (e.g., use LEFT to preserve all customers even without orders). Window functions: ROW_NUMBER() = unique rank 1,2,3; RANK() = ties same rank, gaps after; DENSE_RANK() = ties same rank, no gaps....
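
The customers-without-orders case makes the cardinality and NULL effects concrete. This is a minimal sketch in Python's `sqlite3` with hypothetical `customer` and `orders` tables; it covers only INNER and LEFT, since RIGHT and FULL OUTER are not available in older SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customer (id INTEGER, name TEXT);
    CREATE TABLE orders (id INTEGER, customer_id INTEGER, amount INTEGER);
    INSERT INTO customer VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cid');
    INSERT INTO orders VALUES (10, 1, 50), (11, 1, 70), (12, 2, 20);
    """
)

# INNER: only customers with at least one order; Ann appears once per order.
inner_rows = conn.execute(
    """
    SELECT c.name, o.amount
    FROM customer c INNER JOIN orders o ON o.customer_id = c.id
    ORDER BY c.id, o.id
    """
).fetchall()

# LEFT: every customer is kept; Cid has no orders, so amount is NULL.
left_rows = conn.execute(
    """
    SELECT c.name, o.amount
    FROM customer c LEFT JOIN orders o ON o.customer_id = c.id
    ORDER BY c.id, o.id
    """
).fetchall()
```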

Question 8

Convert complex SQL (CTEs, window functions, subqueries) to production-grade PySpark. Discuss when to use spark.sql() vs. DataFrame API, and the implications for testability, partitioning, and execution predictability.

Accepted Answer

Two approaches: spark.sql() for direct translation and DataFrame API for programmatic logic. SQL approach: createOrReplaceTempView, then run ANSI-like SQL. This gives fast parity with the original query, but it is string-based, harder to unit test, and the individual transformation steps are less visible in code (both approaches still go through the same Catalyst optimizer). DataFrame API: composable, testable (pass mock DataFrames), explicit transformations....

Question 9

Difference between ROW_NUMBER(), RANK(), and DENSE_RANK() with examples.

Accepted Answer

**Architectural Logic**: All three assign ordinals over partitions but differ in tie-handling and downstream semantics.

- ROW_NUMBER(): Unique ordering; required for deduplication (pick one row per key).
- RANK(): Ties get the same rank and the next rank skips; the classic leaderboard (e.g., Olympic medals).
- DENSE_RANK(): Ties get the same rank with no skip; suits "top N per category" when N counts distinct values, so the result can hold more than N rows.

**Why**: ROW_NUMBER needs an ORDER BY with a tie-breaker for determinism. RANK reflects competitive gaps....

Question 10

Explain SQL Window Functions with examples.

Accepted Answer

**Architectural Logic**: Window functions compute over a "frame" of rows related to the current row without collapsing rows. Syntax: `func() OVER (PARTITION BY ... ORDER BY ... [frame])`. Categories: Ranking (ROW_NUMBER, RANK, DENSE_RANK), Aggregate (SUM, AVG over partitions), Value (LAG, LEAD, FIRST_VALUE).

**Why**: They enable row-level analytics (running totals, moving averages, prior/next comparisons) without self-joins. Self-joins multiply intermediate rows and are often slower....
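
A running total, a moving average, and a prior-row comparison show the frame syntax in action. The sketch below uses Python's `sqlite3` (assuming SQLite 3.25+) on a hypothetical `sales` table; each input row is preserved in the output.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (day TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("2024-01-01", 10), ("2024-01-02", 30), ("2024-01-03", 20), ("2024-01-04", 40)],
)

rows = conn.execute(
    """
    SELECT day,
           amount,
           -- Aggregate over everything up to and including the current row.
           SUM(amount) OVER (
               ORDER BY day
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
           ) AS running_total,
           -- Two-row moving average: previous row and current row.
           AVG(amount) OVER (
               ORDER BY day
               ROWS BETWEEN 1 PRECEDING AND CURRENT ROW
           ) AS moving_avg_2,
           -- Value function: compare with the prior row (NULL on the first row).
           amount - LAG(amount) OVER (ORDER BY day) AS change_vs_prev
    FROM sales
    ORDER BY day
    """
).fetchall()

running_totals = [r[2] for r in rows]
moving_averages = [r[3] for r in rows]
changes = [r[4] for r in rows]
```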
