Question 1

Tell me about yourself and your experience.

Accepted Answer

**Situation**: I joined the data org when our pipelines were monolithic, causing 4+ hour delays and frequent outages affecting downstream dashboards and ML models.

**Task**: I was tasked with redesigning the data platform to support real-time decisioning while improving reliability and cost efficiency.

**Action**: I led a cross-functional team of 5 engineers to architect a medallion (Bronze/Silver/Gold) architecture on Delta Lake....

Question 2

Implement a Spark job to find the top 10 most frequent words in a large text file.

Accepted Answer

Core logic: read text → split → explode → filter empty → groupBy → count → orderBy desc → limit 10. Code: from pyspark.sql import functions as F; df = spark.read.text("path/to/file.txt"); words = df.select(F.explode(F.split(F.col("value"), "\s+")).alias("word")); top10 = words.filter(F.length(F.col("word")) > 0).groupBy("word").count().orderBy(F.desc("count")).limit(10). **Why \s+**: Handles multiple spaces/tabs; more robust than single space....

Question 3

Combine records by name with concatenated course values

Accepted Answer

**SQL (PostgreSQL)**: SELECT name, STRING_AGG(course, ', ' ORDER BY course) AS courses FROM student_courses GROUP BY name. **MySQL**: GROUP_CONCAT. **Spark**: concat_ws(', ', collect_list('course')) with groupBy. Use collect_set for dedupe....

Question 4

Reverse operation for splitting values back to original format

Accepted Answer

Join back: SQL STRING_AGG(value, ',') or array_to_string(array_agg(col), ','). Python ','.join(list). Spark: groupBy.agg(collect_list).alias('vals').withColumn('restored', concat_ws(',','vals')). Preserve order: array_agg with ORDER BY. Handle nulls....

Question 5

Sort and merge arrays

Accepted Answer

**Why Merge Sorted Arrays:** Mergesort's merge step, merge K sorted streams (logs from multiple sources), and combining partitioned data.

**Two Arrays:** Two pointers, O(m+n). K arrays: (1) Merge pairwise—O(N log K). (2) Heap of (value, stream_id)—O(N log K). Python: heapq.merge(a, b, c)—lazy, memory-efficient for large streams.

**Scalability:** heapq.merge doesn't load all into memory—streams. For Spark: merge sorted part files via sortMerge join....

Question 6

Count records for INNER JOIN and LEFT JOIN

Accepted Answer

**Architectural Logic**: Join type determines output cardinality and null semantics. **INNER JOIN**: Only matching rows from both tables; count ≤ min(left, right) when 1:1 or N:1. **LEFT JOIN**: All left rows; right columns NULL where no match; count = left count (if left is 'one' side). **Cardinality Rules**: INNER can reduce rows (excludes orphans); LEFT preserves left. Use COUNT(*) for total rows; COUNT(right.key) for non-null matches (LEFT JOIN gives matched count)....

Question 7

Create partitioned table

Accepted Answer

**Architectural Logic**: Partitioning enables partition pruning—critical for cost and performance at scale. **BigQuery**: `CREATE TABLE dataset.sales (...) PARTITION BY sale_date` or `PARTITION BY DATE(created_at)`. **Snowflake**: `PARTITION BY (sale_date)` or expression. **Hive/Spark**: `PARTITIONED BY (sale_date DATE)`. **Why Partition**: Predicate on partition column skips non-matching partitions; 90% cost reduction on date-filtered queries....

Question 8

Find average salary for each manager – Assume a table with manager_id and employee_salary

Accepted Answer

SELECT manager_id, AVG(employee_salary) AS avg_salary FROM employees GROUP BY manager_id. **Why simple**: Direct aggregation. **Consider**: HAVING AVG(employee_salary) > X to filter; exclude NULL manager_id if not applicable....

Question 9

Find non-common records in two tables (SQL EXCEPT or NOT IN)

Accepted Answer

EXCEPT: SELECT col FROM t1 EXCEPT SELECT col FROM t2 (A minus B). NOT IN: SELECT * FROM t1 WHERE col NOT IN (SELECT col FROM t2)—**NULL trap**: if t2 has NULL, NOT IN returns nothing. **Safe**: NOT EXISTS (SELECT 1 FROM t2 WHERE t2.col = t1.col). **EXCEPT** eliminates duplicates; NOT EXISTS doesn't....

Question 10

Print only the newest record for each name – Use SQL Window functions (ROW_NUMBER, RANK, etc.)

Accepted Answer

Newest record per name: SELECT * FROM (SELECT *, ROW_NUMBER() OVER (PARTITION BY name ORDER BY created_at DESC) rn FROM t) t2 WHERE rn=1. Or RANK if ties should all be kept. DENSE_RANK for consecutive ranks. Use FIRST_VALUE/LAST_VALUE for single column: SELECT name, FIRST_VALUE(col) OVER (PARTITION BY name ORDER BY ts DESC) FROM t; then GROUP BY or DISTINCT. **Why it matters**: Design choices compound at scale—wrong approach can cause 100× overhead....

Pubmatic Data Engineer Interview Questions

Difficulty Breakdown

Key Topics Covered

How to Use This Guide

Companies asking these questions

All 13 Questions

More Interview Prep Guides

Unlock All Expert Answers