Question 1

Demonstrate the difference between DENSE_RANK() and RANK()

Accepted Answer

**RANK()**: Same rank for ties; skips subsequent ranks (e.g., 1, 2, 2, 4, 5). **DENSE_RANK()**: Same rank for ties; no gaps (e.g., 1, 2, 2, 3, 4). **Why it matters**: RANK preserves "position" semantics (e.g., 4th place after a two-way tie for 2nd); DENSE_RANK gives consecutive integers, which is useful for filtering on distinct values (e.g., the top 10 distinct salaries). **Example**: `SELECT name, salary, RANK() OVER (ORDER BY salary DESC) AS rk, DENSE_RANK() OVER (ORDER BY salary DESC) AS dense_rk FROM employee`....
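
The query above can be tried without a database server. The following is a minimal sketch that runs it through Python's built-in `sqlite3` module (this assumes a bundled SQLite of version 3.25 or later, which added window functions). The five sample rows are hypothetical and chosen only to produce one tie.

```python
import sqlite3

# Hypothetical sample data: Bo and Cy tie on salary.
rows = [("Ana", 900), ("Bo", 800), ("Cy", 800), ("Di", 700), ("Ed", 600)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (name TEXT, salary INTEGER)")
conn.executemany("INSERT INTO employee VALUES (?, ?)", rows)

query = """
    SELECT name, salary,
           RANK()       OVER (ORDER BY salary DESC) AS rk,
           DENSE_RANK() OVER (ORDER BY salary DESC) AS dense_rk
    FROM employee
    ORDER BY salary DESC, name
"""
result = conn.execute(query).fetchall()

for name, salary, rk, dense_rk in result:
    print(f"{name:4} {salary:4} rank={rk} dense_rank={dense_rk}")
```

With this data the `rk` column reads 1, 2, 2, 4, 5 and `dense_rk` reads 1, 2, 2, 3, 4: the tie consumes a position under RANK() but not under DENSE_RANK().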

Question 2

Explain the differences between Data Warehouse, Data Lake, and Delta Lake

Accepted Answer

**Data Warehouse**: Structured, schema-on-write; optimized for SQL analytics (Snowflake, BigQuery). High compute cost, fast queries. **Data Lake**: Raw/semi-structured object storage (S3, ADLS); schema-on-read; low cost, flexible. **Delta Lake**: Open-source storage layer on a data lake adding ACID transactions, schema enforcement, time travel, upserts. **Why the distinction**: Traditional warehouses couple compute and storage, so the two scale together; lakes decouple them (cloud warehouses such as Snowflake and BigQuery also separate the two)....

Question 3

Joins and window functions - INNER, LEFT, RIGHT, FULL OUTER, ROW_NUMBER(), RANK(), DENSE_RANK()

Accepted Answer

Joins: INNER = matching rows only; LEFT = all left rows + matching right rows (NULL-filled where unmatched); RIGHT = mirror of LEFT; FULL OUTER = all rows from both sides (NULL-filled where unmatched). Why it matters: Join choice affects result cardinality and NULL handling, so the wrong join means wrong business logic (e.g., use LEFT to preserve all customers even without orders). Window functions: ROW_NUMBER() = unique sequential number 1, 2, 3 even for ties; RANK() = ties share a rank, gaps after; DENSE_RANK() = ties share a rank, no gaps....
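
As a rough illustration of the cardinality point, the sketch below compares INNER and LEFT joins and the three ranking functions using Python's built-in `sqlite3` module. The `customers` and `orders` tables and their rows are hypothetical. RIGHT and FULL OUTER are left out only because older SQLite builds do not support them; window functions assume SQLite 3.25 or later.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER, name TEXT);
    CREATE TABLE orders (customer_id INTEGER, amount INTEGER);
    INSERT INTO customers VALUES (1, 'Ana'), (2, 'Bo'), (3, 'Cy');  -- Cy has no orders
    INSERT INTO orders VALUES (1, 50), (1, 50), (2, 30);
""")

# INNER JOIN keeps only customers that have at least one order.
inner_rows = conn.execute("""
    SELECT c.name, o.amount
    FROM customers c
    INNER JOIN orders o ON o.customer_id = c.id
    ORDER BY c.name, o.amount
""").fetchall()

# LEFT JOIN keeps every customer; missing order columns come back as NULL (None).
left_rows = conn.execute("""
    SELECT c.name, o.amount
    FROM customers c
    LEFT JOIN orders o ON o.customer_id = c.id
    ORDER BY c.name, o.amount
""").fetchall()

# The three ranking functions side by side over the same ordering.
ranked = conn.execute("""
    SELECT amount,
           ROW_NUMBER() OVER (ORDER BY amount DESC) AS rn,
           RANK()       OVER (ORDER BY amount DESC) AS rk,
           DENSE_RANK() OVER (ORDER BY amount DESC) AS dense_rk
    FROM orders
    ORDER BY rn
""").fetchall()

print(len(inner_rows), len(left_rows))
print(ranked)
```

The LEFT join returns one more row than the INNER join (the customer without orders, with a NULL amount), and the two tied orders get row numbers 1 and 2 but share rank 1.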

Question 4

If you already have an offer, why are you exploring other roles?

Accepted Answer

Situation: I had an offer from Company A. Task: Explain continued exploration professionally. Action: I said I was excited about the offer but wanted to ensure the best long-term fit. I emphasized genuine interest in this role (mission, team, technical challenges). I was transparent about timeline....

Question 5

Introduce yourself, highlighting key projects and tech stacks

Accepted Answer

Situation: Opening of technical interview. Task: Concise, impact-focused intro. Action: 'I'm a Senior DE with X years—built real-time (Kafka/Flink) and batch (Spark/Delta) pipelines. Key projects: migration that cut costs 40%, platform that reduced failures 80%. Stack: Python, SQL, Spark, Airflow, dbt, AWS....

Question 6

Why did you leave your previous job?

Accepted Answer

Situation: Departure. Task: Concise, professional. Action: Give one clear reason (growth, new challenges, company direction, culture, or compensation). Frame the previous role as a positive experience and the move as seeking a new challenge. Redirect to the current opportunity. Be honest without oversharing....

Question 7

Are you willing to relocate to Bangalore?

Accepted Answer

**STAR**: Situation: Role may require Bangalore. Task: Evaluate. **Action**: I am open based on role fit, team, and growth. I consider team presence, project impact, and trajectory. I have worked in distributed teams. I ask about timeline and relocation support. **Result**: For the right opportunity, I am flexible....

Question 8

Count occurrences of a specific word in a file

Accepted Answer

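**Linux**: `grep -o -w 'word' file.txt | wc -l` counts every whole-word occurrence; `grep -c 'word' file.txt` counts matching lines, not occurrences, so it undercounts when the word repeats on a line. **Python**: `open(path).read().split().count('word')` for small files, or `collections.Counter` when counts for several words are needed. **Spark**: `textFile.flatMap(lambda line: line.split()).filter(lambda w: w == 'word').count()`. **Best practice**: Lowercase the text for case-insensitive matching; use word boundaries (`\b`) to avoid substring hits; stream large files instead of reading them fully into memory.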
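
The best-practice points can be combined in a few lines of Python. This is a minimal sketch, not a definitive implementation: the function name `count_word` and the sample text are made up for illustration, and it assumes a UTF-8 text file.

```python
import re
import tempfile
from pathlib import Path


def count_word(path, word, ignore_case=True):
    """Count whole-word occurrences of `word`, reading the file line by line."""
    flags = re.IGNORECASE if ignore_case else 0
    # \b anchors the match at word boundaries so 'sword' and 'words' are skipped.
    pattern = re.compile(rf"\b{re.escape(word)}\b", flags)
    total = 0
    with open(path, encoding="utf-8") as handle:
        for line in handle:  # streams the file instead of loading it all at once
            total += len(pattern.findall(line))
    return total


# Small demonstration on a temporary file with hypothetical content.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.txt"
    sample.write_text("Word word, sword words.\nword\n", encoding="utf-8")
    insensitive = count_word(sample, "word")
    sensitive = count_word(sample, "word", ignore_case=False)

print(insensitive, sensitive)  # 3 2
```

A plain `split().count('word')` would report 1 for the same file, because `word,` keeps its trailing comma and `Word` differs in case.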

Question 9

Discuss Logical Plan vs Physical Plan

Accepted Answer

**Logical Plan**: Abstract 'what'—filter, join, aggregate. Independent of execution. Optimizer applies rules (predicate pushdown, constant folding).

**Physical Plan**: Concrete 'how'—which join algo (sort-merge, broadcast hash), partitioning, resource allocation. Cost-based choices.

**Why Both**: Separation enables optimization without locking to implementation....
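
The same separation can be observed in miniature with any SQL engine. The sketch below uses SQLite's `EXPLAIN QUERY PLAN` through Python's `sqlite3` module as a stand-in (an assumption made for the sake of a dependency-free example; it is not the engine the answer has in mind, and SQLite has no sort-merge or broadcast joins). The query text, which expresses the logical intent, stays fixed while the physical strategy changes once an index exists. The `events` table, the index name, and the helper `physical_plan` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, kind TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(i % 10, "click") for i in range(100)],
)

# The declarative query: it says what to return, not how to find it.
query = "SELECT kind FROM events WHERE user_id = 3"


def physical_plan(connection, sql):
    """Return the human-readable steps SQLite chose for executing `sql`."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return [row[-1] for row in rows]  # the last column holds the description


plan_before = physical_plan(conn, query)
rows_before = conn.execute(query).fetchall()

# Adding an index changes nothing about the query's meaning...
conn.execute("CREATE INDEX idx_events_user ON events (user_id)")

# ...but the planner can now pick a different execution strategy.
plan_after = physical_plan(conn, query)
rows_after = conn.execute(query).fetchall()

print(plan_before)
print(plan_after)
```

The first plan is a full table scan and the second an index search, while both runs return the same rows. In Spark, the analogous inspection would be `DataFrame.explain`, which prints the logical and physical plans for a query.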

Question 10

Discuss the nature and volume of data you manage daily

Accepted Answer

**Data Types**: Structured (transactions, dimensions), semi-structured (logs, events). **Volume**: Hundreds of GB to multi-TB. **Sources**: Batch (Airflow), streaming (Kafka). **Stack**: Spark/Databricks, Snowflake/BigQuery, S3/GCS.

**Responsibilities**: ETL, schema evolution, data quality, serving for BI/ML....

KPMG Data Engineer Interview Questions
