Question 1

What architecture are you following in your current project, and why?

Accepted Answer

**Section 1 — The Context (The 'Why')**
The primary challenge in choosing and justifying an architecture is alignment between technical complexity, operational cost, and business latency requirements. A naive monolithic pipeline fails when schema evolution hits, when a single component becomes a bottleneck, or when teams scale—causing coordination overhead and deployment risk....

Question 2

CDC during migration: explain the approaches to real-time Change Data Capture (CDC).

Accepted Answer

CDC captures inserts, updates, and deletes from a source and applies them to a target in near real-time, enabling minimal-downtime migrations. **Approaches**: Log-based CDC (Debezium, AWS DMS)—reads WAL/redo logs; lowest latency, no schema change. Trigger-based—triggers on source; adds load and schema coupling. Timestamp/version columns—incremental only; misses deletes and out-of-order updates. Dual-write with reconciliation—applications write to both; eventual consistency and complexity....
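
The apply side of log-based CDC can be sketched in plain Python. This is a minimal illustration, not how Debezium or AWS DMS work internally: `ChangeEvent`, `apply_changes`, and the `lsn` field are hypothetical names, and the target is an in-memory dict standing in for a table.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ChangeEvent:
    lsn: int              # position of the change in the source log (assumed monotonic)
    op: str               # "insert", "update", or "delete"
    key: int              # primary key of the affected row
    row: Optional[dict]   # new row image; None for deletes


def apply_changes(target: dict, events: list, last_applied_lsn: int = 0) -> int:
    """Apply events in log order and return the new checkpoint.

    Events at or below the checkpoint are skipped, so replaying a batch
    after a failure does not change the target (idempotent apply).
    """
    for ev in sorted(events, key=lambda e: e.lsn):
        if ev.lsn <= last_applied_lsn:
            continue
        if ev.op == "delete":
            target.pop(ev.key, None)
        else:  # insert and update are both handled as an upsert
            target[ev.key] = ev.row
        last_applied_lsn = ev.lsn
    return last_applied_lsn


target = {}
events = [
    ChangeEvent(1, "insert", 1, {"name": "alice"}),
    ChangeEvent(2, "insert", 2, {"name": "bob"}),
    ChangeEvent(3, "update", 1, {"name": "alicia"}),
    ChangeEvent(4, "delete", 2, None),
]
checkpoint = apply_changes(target, events)
# Replaying the same batch from the checkpoint is a no-op.
checkpoint_after_replay = apply_changes(target, events, checkpoint)
```

Ordering by log position and checkpointing is what lets this approach capture deletes and out-of-order arrivals, the two cases the timestamp-column approach misses.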

Question 3

Explain the differences between a Data Lake and a Data Warehouse.

Accepted Answer

**Data Lake**: Low-cost object storage (S3, ADLS) for raw, semi-structured, unstructured data. Schema-on-read; used for exploratory analytics, ML, archival. **Data Warehouse**: Structured, curated storage optimized for SQL; schema-on-write; used for BI and reporting. **Why both exist**: Lakes offer flexibility and cost at scale; warehouses offer query performance and concurrency....
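
The schema-on-read versus schema-on-write contrast can be shown with a small sketch. Everything here is an assumption for illustration: the `SCHEMA` mapping, the two functions, and the in-memory lists standing in for lake files and a warehouse table.

```python
import json

# Hypothetical schema shared by both paths: column name -> Python type.
SCHEMA = {"user_id": int, "amount": float}


def write_to_warehouse(table: list, record: dict) -> None:
    """Schema-on-write: validate before storing and reject bad records."""
    if set(record) != set(SCHEMA):
        raise ValueError(f"unexpected columns: {sorted(record)}")
    for col, typ in SCHEMA.items():
        if not isinstance(record[col], typ):
            raise ValueError(f"{col} must be {typ.__name__}")
    table.append(record)


def read_from_lake(raw_lines: list) -> list:
    """Schema-on-read: raw lines are stored untouched; the schema is applied now."""
    rows = []
    for line in raw_lines:
        doc = json.loads(line)
        try:
            rows.append({col: typ(doc[col]) for col, typ in SCHEMA.items()})
        except (KeyError, ValueError, TypeError):
            continue  # unreadable under this schema; the raw line still exists
    return rows


# The lake accepts anything at write time.
lake = ['{"user_id": 1, "amount": "9.5", "extra": "x"}', '{"user_id": "abc"}']
lake_rows = read_from_lake(lake)

# The warehouse refuses a record whose types do not match.
warehouse = []
write_to_warehouse(warehouse, {"user_id": 1, "amount": 9.5})
try:
    write_to_warehouse(warehouse, {"user_id": 2, "amount": "9.5"})
    rejected = False
except ValueError:
    rejected = True
```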

Question 4

Explain the differences between Data Warehouse, Data Lake, and Delta Lake

Accepted Answer

**Data Warehouse**: Structured, schema-on-write; optimized for SQL analytics (Snowflake, BigQuery). Higher compute cost, fast queries. **Data Lake**: Raw/semi-structured object storage (S3, ADLS); schema-on-read; low cost, flexible. **Delta Lake**: Open-source storage layer on a data lake adding ACID transactions, schema enforcement, time travel, upserts. **Why the distinction**: Traditional on-premises warehouses scale compute and storage together, while lakes decouple them; cloud warehouses such as Snowflake and BigQuery also separate the two....

Question 5

What strategies can you use to handle skewed data in Spark?

Accepted Answer

**1. Salting**: Add random suffix to skewed keys to spread load; requires two-phase aggregation. **2. Two-phase aggregation**: Aggregate with salted key, then aggregate again without salt. **3. Broadcast**: For small dimension tables, broadcast to avoid shuffle. **4. Custom partitioning**: Pre-partition by known skewed keys. **5. Increase partitions**: Spreads work but doesn't fix root cause. **6. AQE Skew Join (Spark 3.0+)**: Automatically splits skewed partitions....
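
Salting with two-phase aggregation can be sketched without Spark. This is a plain-Python illustration of the idea only: `salted_sum` and `NUM_SALTS` are hypothetical names, and each `(key, salt)` group stands in for the work one task would receive after a shuffle.

```python
import random
from collections import defaultdict

NUM_SALTS = 4  # hypothetical number of salt buckets per key


def salted_sum(records, num_salts=NUM_SALTS, seed=0):
    """Sum values per key in two phases, spreading each key over several salts."""
    rng = random.Random(seed)

    # Phase 1: aggregate by (key, salt) so a hot key is split across groups.
    partial = defaultdict(int)
    for key, value in records:
        partial[(key, rng.randrange(num_salts))] += value

    # Phase 2: drop the salt and combine the much smaller partial results.
    final = defaultdict(int)
    for (key, _salt), subtotal in partial.items():
        final[key] += subtotal
    return dict(final), dict(partial)


# One hot key dominates the input, as in a skewed join or group-by.
records = [("hot", 1)] * 1000 + [("cold", 1)] * 10
totals, partials = salted_sum(records)
largest_hot_group = max(v for (k, _), v in partials.items() if k == "hot")
```

The totals match an unsalted aggregation, but no single phase-1 group holds all 1000 rows of the hot key.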

Question 6

Describe the difference between Spark RDDs, DataFrames, and Datasets.

Accepted Answer

RDD: Low-level, immutable, JVM-object based. No Catalyst optimization; full control but manual. DataFrame: Distributed collection of rows with a named-column schema; Catalyst + Tungsten optimized. Untyped at compile time. Dataset: Typed extension of DataFrame (Scala/Java); since Spark 2.0 a DataFrame is a Dataset[Row]. Why the evolution: RDD predates the optimizer; DataFrame can bring large speedups via predicate pushdown, Tungsten's compact binary format, and code gen. Dataset adds compile-time type safety while keeping most optimizations, although typed lambdas are opaque to Catalyst. When to use: DataFrame for the large majority of workloads....

Question 7

Explain Fact and Dimension Tables with examples.

Accepted Answer

Architecture: Star schema centralizes measurable events in fact tables; dimensions provide descriptive context. Why this design: Facts are append-heavy and grow without bound; dimensions are smaller and change slowly. Separating them optimizes for different access patterns. Fact grain defines the entire schema: if it is wrong, joins fan out and aggregates double-count. Example: sales_fact (quantity, revenue, date_key, product_key, customer_key) at a grain of one row per transaction....
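
A small star schema can be tried with Python's built-in `sqlite3` module. The `sales_fact` columns follow the example above; `dim_product`, its columns, and all the data values are assumptions made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension: small, descriptive, changes slowly.
CREATE TABLE dim_product (
    product_key  INTEGER PRIMARY KEY,
    product_name TEXT,
    category     TEXT
);
-- Fact: one row per transaction (the grain), numeric measures plus keys.
CREATE TABLE sales_fact (
    date_key     INTEGER,
    product_key  INTEGER REFERENCES dim_product(product_key),
    customer_key INTEGER,
    quantity     INTEGER,
    revenue      REAL
);
INSERT INTO dim_product VALUES
    (1, 'widget', 'hardware'),
    (2, 'gadget', 'hardware'),
    (3, 'ebook',  'digital');
INSERT INTO sales_fact VALUES
    (20240101, 1, 10, 2, 20.0),
    (20240101, 2, 11, 1, 15.0),
    (20240102, 3, 10, 3,  9.0),
    (20240102, 1, 11, 1, 10.0);
""")

# Measures come from the fact table; the dimension supplies the grouping label.
revenue_by_category = conn.execute("""
    SELECT p.category, SUM(f.revenue)
    FROM sales_fact AS f
    JOIN dim_product AS p ON p.product_key = f.product_key
    GROUP BY p.category
    ORDER BY p.category
""").fetchall()
```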

Question 8

Explain the difference between INNER JOIN, LEFT JOIN, RIGHT JOIN, and FULL JOIN.

Accepted Answer

**INNER JOIN**: Only rows with matches in both tables. **LEFT JOIN**: All rows from the left; matches from the right; NULLs where no match. **RIGHT JOIN**: All rows from the right; matches from the left; NULLs where no match. **FULL JOIN**: All rows from both; NULLs where no match. **Why it matters**: Join choice affects result cardinality and semantics. Wrong join = wrong numbers. **Scalability**: Hash joins are common; broadcast joins suit a small dimension table. FULL OUTER can be expensive because both sides must be shuffled....
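
The effect of each join on row counts can be checked with Python's built-in `sqlite3` module. The tables and data are made up for illustration. Because older SQLite builds lack native RIGHT and FULL joins, this sketch emulates them: RIGHT as a LEFT JOIN with the tables swapped, FULL as a LEFT JOIN plus the unmatched right-side rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER, name TEXT);
CREATE TABLE orders (order_id INTEGER, customer_id INTEGER);
INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
-- Order 102 points at a customer that does not exist.
INSERT INTO orders VALUES (100, 1), (101, 1), (102, 4);
""")


def row_count(sql: str) -> int:
    """Run a query and return how many rows it produces."""
    return len(conn.execute(sql).fetchall())


ON = "ON o.customer_id = c.customer_id"

inner_rows = row_count(f"SELECT * FROM customers c JOIN orders o {ON}")
left_rows = row_count(f"SELECT * FROM customers c LEFT JOIN orders o {ON}")
# RIGHT JOIN emulated: keep every order, matched or not.
right_rows = row_count(f"SELECT * FROM orders o LEFT JOIN customers c {ON}")
# FULL JOIN emulated: all left-join rows plus orders with no customer.
full_rows = row_count(f"""
    SELECT c.name, o.order_id FROM customers c LEFT JOIN orders o {ON}
    UNION ALL
    SELECT c.name, o.order_id FROM orders o LEFT JOIN customers c {ON}
    WHERE c.customer_id IS NULL
""")
```

Here the same two tables yield 2, 4, 3, and 5 rows for INNER, LEFT, RIGHT, and FULL respectively, which is why the wrong join produces wrong totals.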

Question 9

What is the difference between a list and a tuple in Python?

Accepted Answer

List: Mutable, []; tuple: immutable, (). Why it matters: Mutability drives use: lists for collections that change; tuples for fixed data, dict keys, multiple return values. Performance: Tuples are slightly faster to create and use less memory (fixed size, no over-allocation). Hashability: Tuples can be dict keys/set members as long as every element is hashable; lists cannot. In data pipelines: Tuples for schema-like rows (column names); lists for buffers, accumulators....
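
A short sketch of these differences, with made-up variable names:

```python
# Lists are mutable: a buffer can grow in place.
buffer = [1, 2]
buffer.append(3)

# Tuples are immutable: item assignment raises TypeError.
point = (3, 4)
try:
    point[0] = 10
    tuple_mutated = True
except TypeError:
    tuple_mutated = False

# A tuple of hashable items works as a dict key.
distances = {point: 5.0}

# A list cannot be a dict key.
try:
    {tuple(buffer): "ok", buffer: "fails"}
    list_hashable = True
except TypeError:
    list_hashable = False

# A tuple that contains a list is not hashable either.
try:
    hash((1, [2]))
    nested_hashable = True
except TypeError:
    nested_hashable = False


def min_max(values):
    """Multiple return values are packed into a tuple."""
    return min(values), max(values)


low, high = min_max(buffer)
```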

Question 10

When would you choose a Snowflake schema over a Star schema?

Accepted Answer

Star: One fact table with denormalized dimensions; simple, fewer joins, fast. Snowflake: Normalized dimensions (e.g., dim_product → dim_category → dim_category_group); more joins, less redundancy. Why Snowflake: When dimension tables are large and hierarchy levels are shared: a single dim_category row serves many products, whereas denormalizing would repeat its attributes across millions of product rows. It also avoids inconsistent attributes across copies (e.g., a category name is updated in one place). Storage: Snowflake saves space when hierarchy is deep and wide....
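
The "updated in one place" benefit can be seen in a small sketch using Python's built-in `sqlite3` module. The table names follow the example above; the columns and data are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Snowflaked dimension: category attributes live in their own table.
CREATE TABLE dim_category (
    category_key  INTEGER PRIMARY KEY,
    category_name TEXT
);
CREATE TABLE dim_product (
    product_key  INTEGER PRIMARY KEY,
    product_name TEXT,
    category_key INTEGER REFERENCES dim_category(category_key)
);
INSERT INTO dim_category VALUES (1, 'Hardwear');  -- misspelled on purpose
INSERT INTO dim_product VALUES
    (1, 'widget', 1),
    (2, 'gadget', 1),
    (3, 'gizmo',  1);
""")

# Fixing the name touches one row, however many products share the category.
cursor = conn.execute(
    "UPDATE dim_category SET category_name = 'Hardware' WHERE category_key = 1"
)
rows_updated = cursor.rowcount

# Every product sees the corrected name through the extra join.
product_categories = conn.execute("""
    SELECT p.product_name, c.category_name
    FROM dim_product AS p
    JOIN dim_category AS c ON c.category_key = p.category_key
    ORDER BY p.product_key
""").fetchall()
```

In a star schema the category name would sit on every product row, so the same fix would update three rows here and could leave copies out of sync if one were missed.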

Data Modeling Interview Questions for Data Engineers
