Real questions on Extract, Transform, Load (ETL), the core of data engineering. Covers pipeline design, ingestion patterns, batch vs. streaming, and data quality.
ETL (Extract, Transform, Load) is fundamental to every data engineering role. These questions cover pipeline design, incremental vs full loads, idempotency, late-arriving data, batch vs streaming ingestion, transformation patterns, data quality checks, and SCD (slowly changing dimension) implementations. Prepare for the ETL deep-dives that interviewers consistently ask.
Tell me about yourself and your experience.
What architecture are you following in your current project, and why?
What is the difference between narrow and wide transformations in Apache Spark? Explain with examples.
Explain approaches for real-time Change Data Capture (CDC) during a migration.
Tell me about your family background
What strategies can you use to handle skewed data in Spark?
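The standard answer is salting: split a hot key into several sub-keys so its rows spread across partitions, aggregate partially, then strip the salt and combine. A minimal sketch of the two-stage idea in plain Python (the exact Spark code depends on the workload; `SALTS` and the event data are illustrative assumptions):

```python
import random
from collections import defaultdict

random.seed(0)
SALTS = 4  # number of sub-keys to spread a hot key across (assumed value)

def salted(key):
    # Stage-1 key: (key, salt) so one hot key lands in SALTS buckets
    return (key, random.randrange(SALTS))

events = [("hot", 1)] * 10 + [("cold", 1)] * 2

# Stage 1: partial aggregation on the salted key (spreads 'hot' over 4 buckets)
partial = defaultdict(int)
for key, value in events:
    partial[salted(key)] += value

# Stage 2: strip the salt and combine partials into the final aggregate
final = defaultdict(int)
for (key, _salt), value in partial.items():
    final[key] += value
# final == {'hot': 10, 'cold': 2}
```

In Spark the same shape appears as a `concat(key, rand-salt)` column before the first `groupBy`, followed by a second `groupBy` on the original key.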
Briefly introduce yourself and walk us through your journey as a Data Engineer so far.
Describe the difference between Spark RDDs, DataFrames, and Datasets.
Explain the difference between *args and **kwargs in Python.
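The key point: `*args` collects extra positional arguments into a tuple, while `**kwargs` collects extra keyword arguments into a dict. A minimal illustration:

```python
# *args -> tuple of positional arguments; **kwargs -> dict of keyword arguments
def describe(*args, **kwargs):
    return type(args).__name__, type(kwargs).__name__, len(args), sorted(kwargs)

result = describe(1, 2, 3, mode="fast", retries=2)
# args is (1, 2, 3); kwargs is {"mode": "fast", "retries": 2}
```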
Explain the difference between Spark's map() and flatMap() transformations.
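The semantics are easiest to see by analogy with plain Python (shown here instead of live Spark code so the example is self-contained): `map()` produces exactly one output element per input element, while `flatMap()` lets each input yield zero or more elements and flattens the result.

```python
from itertools import chain

lines = ["a b", "c"]

# map()-style: one output per input, so the result stays nested
mapped = [line.split() for line in lines]  # [['a', 'b'], ['c']]

# flatMap()-style: each input yields many elements, then the result is flattened
flat_mapped = list(chain.from_iterable(line.split() for line in lines))  # ['a', 'b', 'c']
```

In Spark, `rdd.map(f)` keeps the nesting and `rdd.flatMap(f)` flattens it, exactly as above.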
Explain the types of triggers in ADF, including schedule, tumbling window, and event-based triggers.
What are decorators in Python, and how do they work?
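A decorator is a callable that takes a function and returns a replacement callable, applied with the `@` syntax. A minimal example (the call-counting behavior is just an illustration):

```python
import functools

def log_calls(func):
    """Wrap func and count how many times it is invoked."""
    @functools.wraps(func)       # preserve func's name and docstring
    def wrapper(*args, **kwargs):
        wrapper.calls += 1       # record the invocation
        return func(*args, **kwargs)
    wrapper.calls = 0
    return wrapper

@log_calls                       # equivalent to: add = log_calls(add)
def add(a, b):
    return a + b

add(1, 2)
add(3, 4)
# add.calls == 2; add.__name__ is still "add" thanks to functools.wraps
```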
What is the difference between a list and a tuple in Python?
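The core distinction is mutability: lists can be modified in place, while tuples are immutable and therefore hashable (usable as dict keys or set members).

```python
nums_list = [1, 2, 3]    # mutable: supports in-place append/assignment
nums_tuple = (1, 2, 3)   # immutable: hashable, so usable as a dict key

nums_list.append(4)
lookup = {nums_tuple: "ok"}   # using a list as the key would raise TypeError

try:
    nums_tuple[0] = 99        # tuples reject item assignment
except TypeError:
    mutation_failed = True
```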
What is the small-file problem in Spark, and how do you solve it?
Architect incremental load in ADF + Databricks with idempotency, late-arrival handling, and cost/scalability implications of watermark vs. change data capture.
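The heart of any answer is the watermark loop plus an idempotent upsert: extract only rows newer than the last watermark, merge them into the target by key, then advance the watermark. A pure-Python sketch of that control flow (function and field names are hypothetical; in ADF/Databricks the target write would be a Delta `MERGE` and the watermark would live in a control table):

```python
def incremental_load(source_rows, target, watermark):
    """source_rows: dicts with 'id' and 'updated_at'; target: dict keyed by id."""
    # Extract: only rows newer than the last successful watermark
    new_rows = [r for r in source_rows if r["updated_at"] > watermark]
    # Load: upsert by key, so replaying the same batch changes nothing (idempotent)
    for row in new_rows:
        target[row["id"]] = row
    # Advance the watermark only as far as the data actually seen
    return max((r["updated_at"] for r in new_rows), default=watermark)

target = {}
src = [
    {"id": 1, "updated_at": 10},
    {"id": 2, "updated_at": 20},
]
wm = incremental_load(src, target, watermark=0)    # loads both rows, wm == 20
wm = incremental_load(src, target, watermark=wm)   # re-run is a no-op: idempotent
```

Late arrivals (a row whose `updated_at` predates the watermark) are the known weakness of this scheme, which is where CDC-based extraction earns its extra cost.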
Architecturally, how do Job–Stage–Task boundaries in Spark's execution model impact cluster sizing, shuffle cost, and when would you deliberately collapse or split stages?
Architecturally, how would you justify or challenge Hadoop vs. a cloud-native data lake (S3 + EMR/Databricks) for a greenfield enterprise data platform? Discuss scalability ceilings, cost model trade-offs, and operational complexity.
Convert complex SQL (CTEs, window functions, subqueries) to production-grade PySpark. Discuss when to use spark.sql() vs. DataFrame API, and the implications for testability, partitioning, and execution predictability.
Describe the data pipeline architecture you've worked with.
Design a cost-aware resource strategy for a Databricks workload with spiky and batch jobs. Explain Dynamic Resource Allocation, when to disable it, and how min/max executors and spot instances affect cost and SLAs.
Design a Delta table layout for mixed workload: point lookups by user_id, range scans by date, and full partition scans. Compare partitioning vs. Z-ordering—when to use each, and the rewrite cost trade-off.
Explain Common Table Expressions (CTEs) and their benefits.
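A CTE names an intermediate result set with `WITH`, which makes a query readable, reusable within the statement, and easier to test in pieces. A runnable illustration via SQLite (the table and values are made up for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER, amount INTEGER);
    INSERT INTO orders VALUES (1, 50), (2, 150), (3, 200);
""")

# The CTE 'big_orders' isolates the filter step; the outer query aggregates it.
row = conn.execute("""
    WITH big_orders AS (
        SELECT id, amount FROM orders WHERE amount > 100
    )
    SELECT COUNT(*), SUM(amount) FROM big_orders
""").fetchone()
# row == (2, 350)
```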
Explain strategies for managing schema changes in PySpark over time.
Explain the benefits of using DataFrames over RDDs.
Explain the concept of checkpointing in Spark and why it is important.
Explain the difference between Azure Data Factory (ADF) and Databricks.
Explain the difference between batch and streaming data processing in Data Fusion.
Explain the trade-offs between batch and real-time data processing. Provide examples of when each is appropriate.
Explain the use of the MERGE statement in SQL.
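`MERGE` performs an upsert: update the row when the key matches, insert it when it does not (engines like SQL Server, Snowflake, and Delta Lake spell this `MERGE ... WHEN MATCHED / WHEN NOT MATCHED`). SQLite has no `MERGE`, so this runnable sketch shows the equivalent semantics with `INSERT ... ON CONFLICT`:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim_user (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO dim_user VALUES (1, 'old')")

# Upsert each source row: update on key match, insert otherwise
for uid, name in [(1, "new"), (2, "fresh")]:
    conn.execute(
        "INSERT INTO dim_user VALUES (?, ?) "
        "ON CONFLICT(id) DO UPDATE SET name = excluded.name",
        (uid, name),
    )

rows = conn.execute("SELECT id, name FROM dim_user ORDER BY id").fetchall()
# rows == [(1, 'new'), (2, 'fresh')]
```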
Explain wide vs. narrow transformations and how they drive shuffle cost, failure domains, and pipeline design. When would you intentionally add a wide transformation, and how do you minimize its impact?
Given a streaming dataset from Kafka, how would you ingest the data in real-time using Spark?
Have you worked on Data Warehousing projects?
How do you handle conflicts within a team? Provide an example.
How do you handle data skewness in Spark?
How do you handle memory management in Python?
How would you handle duplicate records in an SQL table?
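A common pattern: keep one representative per duplicate group (for example, the lowest `id`) and delete the rest; `ROW_NUMBER() OVER (PARTITION BY ...)` is the window-function variant of the same idea. A runnable SQLite sketch with illustrative data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE emails (id INTEGER, email TEXT);
    INSERT INTO emails VALUES (1, 'a@x.com'), (2, 'a@x.com'), (3, 'b@x.com');
""")

# Keep the lowest id per email; delete every other copy
conn.execute("""
    DELETE FROM emails WHERE id NOT IN (
        SELECT MIN(id) FROM emails GROUP BY email
    )
""")
rows = conn.execute("SELECT id, email FROM emails ORDER BY id").fetchall()
# rows == [(1, 'a@x.com'), (3, 'b@x.com')]
```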
How would you read data from a web API? What steps would you follow after reading the data?
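A reasonable answer covers fetch, parse, validate, and flatten into load-ready records. A hedged stdlib sketch (the URL fetch uses `urllib.request`; `requests` is the common third-party alternative; the payload shape and field names are assumptions, and the demo runs on a canned payload so no network is needed):

```python
import json
from urllib.request import urlopen  # stdlib HTTP client

def fetch_json(url):
    """Read a web API response and parse it as JSON."""
    with urlopen(url, timeout=10) as resp:
        return json.load(resp)

def to_records(payload):
    """Typical next steps: validate, select fields, fill defaults, flatten."""
    return [
        {"id": item["id"], "name": item.get("name", "unknown")}
        for item in payload.get("results", [])
        if "id" in item                      # drop rows missing the key
    ]

# Demonstrated on a canned payload instead of a live call
payload = {"results": [{"id": 1, "name": "a"}, {"id": 2}, {"name": "orphan"}]}
records = to_records(payload)
# records == [{'id': 1, 'name': 'a'}, {'id': 2, 'name': 'unknown'}]
```

After this, the usual steps are deduplication, schema enforcement, and landing the records in a staging table before transformation.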
Tell me about a time when you faced a challenging situation at work and how you handled it.
Explain triggers in ADF, with emphasis on tumbling window triggers.
What are primary keys and foreign keys? Why are they important?
What are the key components of AWS Glue, and how do they work together?
What are the key components of the Spark execution model (Job, Stage, Task)?
What challenges did you face, and how did you tackle them?
What is Azure Data Factory (ADF), and what are its main components?
What is normalization and denormalization? When would you use each?
What is the difference between a generator and a list in Python?
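A list materializes every element up front; a generator produces values lazily, holds only its current state in memory, and can be consumed exactly once:

```python
import sys

squares_list = [n * n for n in range(1000)]   # all 1000 values built now
squares_gen = (n * n for n in range(1000))    # values produced lazily on demand

first_three = [next(squares_gen) for _ in range(3)]   # [0, 1, 4]

# The generator object stays tiny regardless of how many values it will yield
gen_is_smaller = sys.getsizeof(squares_gen) < sys.getsizeof(squares_list)
```

Consuming the rest of `squares_gen` yields only the values not already taken, which is why generators suit one-pass streaming over large extracts.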
What is the difference between DELETE and TRUNCATE?
What is the difference between OLTP and OLAP?
What is the difference between S3 and HDFS?
What is the difference between Spark RDDs, DataFrames, and Datasets?