Question 1

What strategies can you use to handle skewed data in Spark?

Accepted Answer

**1. Salting**: Add random suffix to skewed keys to spread load; requires two-phase aggregation. **2. Two-phase aggregation**: Aggregate with salted key, then aggregate again without salt. **3. Broadcast**: For small dimension tables, broadcast to avoid shuffle. **4. Custom partitioning**: Pre-partition by known skewed keys. **5. Increase partitions**: Spreads work but doesn't fix root cause. **6. AQE Skew Join (Spark 3.0+)**: Automatically splits skewed partitions....
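
The salting and two-phase aggregation steps can be sketched in plain Python, without a Spark cluster. This is only an illustration of the idea, not Spark code: `salted_sum`, `NUM_SALTS`, and the `seed` parameter are hypothetical names introduced here, and the salt count is an assumption that would need tuning to the actual skew.

```python
import random
from collections import defaultdict

NUM_SALTS = 4  # hypothetical salt count; tune to the degree of skew


def salted_sum(records, num_salts=NUM_SALTS, seed=0):
    """Sum values per key in two phases, mimicking a salted aggregation."""
    rng = random.Random(seed)

    # Phase 1: aggregate on (key, salt) so one hot key spreads over several groups.
    partial = defaultdict(int)
    for key, value in records:
        partial[(key, rng.randrange(num_salts))] += value

    # Phase 2: drop the salt and combine the partial sums per original key.
    totals = defaultdict(int)
    for (key, _salt), subtotal in partial.items():
        totals[key] += subtotal
    return dict(totals)
```

In Spark the same shape would typically be a `groupBy` on the key plus a salt column, followed by a second `groupBy` on the key alone.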

Question 2

Write a Python function to check if a string is a palindrome.

Accepted Answer

**Architectural logic**: A palindrome reads the same forwards and backwards. We need to normalize (lowercase, drop non-alphanumeric characters) and compare. **Approach 1 (string ops)**: `cleaned = "".join(c.lower() for c in s if c.isalnum()); return cleaned == cleaned[::-1]`, which takes O(n) time and O(n) space. **Approach 2 (two-pointer)**: Compare from both ends, skipping non-alphanumeric characters and comparing case-insensitively in place; O(n) time, O(1) extra space because no cleaned copy is built....
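
Both approaches can be written out as a minimal sketch. The function names are chosen here for illustration, and the two-pointer version assumes that skipping non-alphanumeric characters on the fly is acceptable.

```python
def is_palindrome_slice(s: str) -> bool:
    """Approach 1: build a cleaned copy and compare it with its reverse."""
    cleaned = "".join(c.lower() for c in s if c.isalnum())
    return cleaned == cleaned[::-1]


def is_palindrome_two_pointer(s: str) -> bool:
    """Approach 2: walk inward from both ends without building a copy."""
    left, right = 0, len(s) - 1
    while left < right:
        if not s[left].isalnum():
            left += 1  # skip punctuation/whitespace on the left
        elif not s[right].isalnum():
            right -= 1  # skip punctuation/whitespace on the right
        elif s[left].lower() != s[right].lower():
            return False
        else:
            left += 1
            right -= 1
    return True
```

For example, both functions accept `"A man, a plan, a canal: Panama"` and reject `"race a car"`.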

Question 3

How does Spark's Catalyst Optimizer work? Explain its stages.

Accepted Answer

Catalyst: Rule-based + cost-based optimizer for Spark SQL/DataFrame. Stages: 1) Analysis—resolve tables, columns, types via Catalog. 2) Logical optimization—predicate pushdown, projection pruning, constant folding, join reorder. 3) Physical planning—generate plans, cost model picks best (e.g., broadcast vs sort-merge). 4) Code generation—Tungsten generates Java bytecode for tight loops....
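
To make the "rule applied to a tree" idea concrete, here is a toy constant-folding rule in Python. It is not Catalyst code (Catalyst is written in Scala and operates on its own expression classes). `Lit`, `Col`, `Add`, and `fold_constants` are hypothetical names used only for this sketch.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Lit:
    value: int


@dataclass(frozen=True)
class Col:
    name: str


@dataclass(frozen=True)
class Add:
    left: "Expr"
    right: "Expr"


Expr = Union[Lit, Col, Add]


def fold_constants(expr: Expr) -> Expr:
    """Rewrite the tree bottom-up, replacing Add(Lit, Lit) with a single Lit."""
    if isinstance(expr, Add):
        left = fold_constants(expr.left)
        right = fold_constants(expr.right)
        if isinstance(left, Lit) and isinstance(right, Lit):
            return Lit(left.value + right.value)
        return Add(left, right)
    return expr
```

Here `fold_constants(Add(Col("x"), Add(Lit(1), Lit(2))))` returns `Add(Col("x"), Lit(3))`: the literal subtree is evaluated once at planning time instead of once per row.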

Question 4

Walk through the three AQE features in Spark 3.x (coalesce, join switch, skew join)—how they operate at shuffle boundaries, which configs enable them, and what happens when AQE cannot help.

Accepted Answer

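AQE in Spark 3.x (spark.sql.adaptive.enabled, on by default since 3.2) performs runtime reoptimization at shuffle boundaries. Three features: (1) Coalesce shuffle partitions (spark.sql.adaptive.coalescePartitions.enabled): post-shuffle, merge undersized partitions into fewer tasks, which reduces scheduler overhead and small-task waste. (2) Join strategy switch: if runtime statistics show one side is below the broadcast threshold, convert sort-merge join to broadcast hash join. The shuffle has usually already run by then, so the gain comes from skipping the sort and reading shuffle files locally rather than from eliminating the shuffle....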
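
The relevant settings can be collected in one place, as in the sketch below. The keys are Spark SQL configuration names; the values are illustrative rather than tuning advice, and `AQE_CONF` and `apply_conf` are hypothetical helpers introduced here.

```python
# Spark SQL settings related to the three AQE features.
# Values are illustrative assumptions, not recommendations.
AQE_CONF = {
    "spark.sql.adaptive.enabled": "true",
    # (1) Coalesce small post-shuffle partitions.
    "spark.sql.adaptive.coalescePartitions.enabled": "true",
    "spark.sql.adaptive.advisoryPartitionSizeInBytes": "64MB",
    # (2) Runtime broadcast threshold for the join switch (Spark 3.2+).
    "spark.sql.adaptive.autoBroadcastJoinThreshold": "10MB",
    # (3) Split skewed partitions in sort-merge joins.
    "spark.sql.adaptive.skewJoin.enabled": "true",
    "spark.sql.adaptive.skewJoin.skewedPartitionFactor": "5",
    "spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes": "256MB",
}


def apply_conf(set_conf, conf=AQE_CONF):
    """Pass each key/value pair to a setter such as `spark.conf.set`."""
    for key, value in conf.items():
        set_conf(key, value)
```

With an existing `SparkSession` bound to a variable named `spark` (an assumption), the call would be `apply_conf(spark.conf.set)`.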

Question 5

What is Adaptive Query Execution (AQE) in Spark 3.x, and how does it improve performance?

Accepted Answer

AQE re-optimizes at runtime using actual statistics at stage boundaries, addressing the planning-time blind spot (e.g., wrong size estimates, skew). **Three features**: (1) **Coalesce shuffle partitions**: merges small partitions after shuffle to reduce task overhead; avoids 10K tiny tasks. (2) **Switch join strategy**: if one side turns out smaller than estimated, converts sort-merge to broadcast; this skips the sort on both sides and lets tasks read shuffle files locally, though the shuffle itself has usually already happened....
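
The coalescing idea can be illustrated with a simplified greedy merge over partition sizes. This is a sketch of the concept, not Spark's actual algorithm; `coalesce_partitions` and `target_size` are hypothetical names standing in for the advisory partition size.

```python
def coalesce_partitions(sizes, target_size):
    """Greedily merge adjacent partitions until each group nears target_size.

    Returns a list of groups, each a list of original partition indices.
    """
    groups, current, current_size = [], [], 0
    for index, size in enumerate(sizes):
        # Close the current group if adding this partition would exceed the target.
        if current and current_size + size > target_size:
            groups.append(current)
            current, current_size = [], 0
        current.append(index)
        current_size += size
    if current:
        groups.append(current)
    return groups
```

With sizes `[10, 20, 30, 100, 5, 5]` and a target of 64, six partitions collapse into three tasks. The oversized partition stays on its own, which is the case skew handling addresses rather than coalescing.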

Question 6

Identify who is a manager and who is not.

Accepted Answer

Manager = employee whose id appears as manager_id for at least one other employee. SQL: `SELECT e.id, e.name, CASE WHEN EXISTS (SELECT 1 FROM employees e2 WHERE e2.manager_id = e.id) THEN 'Manager' ELSE 'Non-Manager' END AS role FROM employees e`. WHY: EXISTS stops at the first matching row and never duplicates an employee who has several reports, which a plain JOIN would. A NULL manager_id (the CEO) never matches the equality, so it is handled correctly. SCALABILITY: Index on manager_id; for large orgs, consider materialized view or closure table for hierarchy queries....
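
The query can be tried against an in-memory SQLite database from Python. This is a minimal sketch: the table layout `(id, name, manager_id)`, the sample names, the index name, and the added `ORDER BY` (for deterministic output) are assumptions made for the example.

```python
import sqlite3

QUERY = """
SELECT e.id, e.name,
       CASE WHEN EXISTS (SELECT 1 FROM employees e2 WHERE e2.manager_id = e.id)
            THEN 'Manager' ELSE 'Non-Manager' END AS role
FROM employees e
ORDER BY e.id
"""


def classify_employees(rows):
    """Load (id, name, manager_id) rows into an in-memory table and run the query."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, manager_id INTEGER)"
        )
        # Index supports the correlated lookup on manager_id.
        conn.execute("CREATE INDEX idx_employees_manager ON employees (manager_id)")
        conn.executemany("INSERT INTO employees VALUES (?, ?, ?)", rows)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()
```

Passing rows where employee 1 has a NULL `manager_id` and employees 2 and 3 report to 1 labels employee 1 as a manager without any special NULL handling.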

Question 7

Check if a number is prime.

Accepted Answer

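**Logic:** n is prime if n ≥ 2 and it has no divisor in [2, √n]. **Code:** return False when n < 2, then `for i in range(2, int(n**0.5)+1): if n%i==0: return False`, and return True if the loop finishes. O(√n). **Optimize:** Check 2, then odd candidates only; prefer `math.isqrt(n)` over `int(n**0.5)`, since float rounding can be off for very large n. **Large n:** Miller-Rabin probabilistic. **Why:** Validation, sampling....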
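
A complete version of the trial-division check, with the "check 2, then odds" optimization, might look like the sketch below. The function name is chosen here; Miller-Rabin is not covered.

```python
import math


def is_prime(n: int) -> bool:
    """Trial division up to floor(sqrt(n)), testing 2 and then odd candidates only."""
    if n < 2:
        return False
    if n < 4:
        return True  # 2 and 3 are prime
    if n % 2 == 0:
        return False
    # math.isqrt avoids float rounding errors for large n.
    for i in range(3, math.isqrt(n) + 1, 2):
        if n % i == 0:
            return False
    return True
```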

Question 8

Implement a function to find the maximum sum subarray (Kadane's algorithm).

Accepted Answer

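**Kadane:** `res = curr = arr[0]; for x in arr[1:]: curr = max(x, curr + x); res = max(res, curr)`. O(n) time, O(1) space. **All negative:** Returns the max element, because `curr` restarts at `x` whenever extending the previous run would make it smaller. **Why:** Classic DP: `curr` is the best sum of a subarray ending at the current element. **Production:** Handle empty input explicitly, since `arr[0]` raises IndexError on an empty list.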
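
Written out as a function, the loop above could look like this sketch. Raising `ValueError` on empty input is one possible choice, assumed here; returning 0 or `None` would be equally defensible depending on the caller.

```python
from typing import Sequence


def max_subarray_sum(arr: Sequence[int]) -> int:
    """Return the largest sum of any non-empty contiguous subarray (Kadane)."""
    if not arr:
        raise ValueError("arr must be non-empty")
    res = curr = arr[0]
    for x in arr[1:]:
        # Either extend the running subarray or start a new one at x.
        curr = max(x, curr + x)
        res = max(res, curr)
    return res
```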

HashedIn Data Engineer Interview Questions
