Question 1

When would you choose a Snowflake schema over a Star schema?

Accepted Answer

Star: One fact table joined to denormalized dimensions; simple, fewer joins, fast. Snowflake: Normalized dimensions (e.g., dim_product → dim_category → dim_category_group); more joins, less redundancy. Why Snowflake: When dimension tables are large and share a hierarchy. A single dim_category row serves many products, whereas denormalizing would repeat the category attributes on every product row, potentially millions of times. It also avoids inconsistent attributes across copies, since a category name is updated in one place. Storage: Snowflake saves space when the hierarchy is deep and wide....
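The "updated in one place" point can be illustrated with a small in-memory example. This is only a sketch using Python's built-in `sqlite3`; the table and column names beyond dim_product and dim_category (such as `fact_sales`, `category_name`, and the sample rows) are assumptions made for illustration.

```python
import sqlite3

# In-memory database; all data below is made up for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_category (category_id INTEGER PRIMARY KEY, category_name TEXT);
CREATE TABLE dim_product (
    product_id INTEGER PRIMARY KEY,
    product_name TEXT,
    category_id INTEGER REFERENCES dim_category(category_id)
);
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    amount REAL
);
INSERT INTO dim_category VALUES (1, 'Audio');
INSERT INTO dim_product VALUES (10, 'Headphones', 1), (11, 'Speaker', 1);
INSERT INTO fact_sales VALUES (10, 50.0), (11, 120.0), (10, 30.0);
""")

# Snowflake benefit: the category name lives in one row, so one UPDATE
# changes it for every product that references it.
conn.execute(
    "UPDATE dim_category SET category_name = 'Audio & Hi-Fi' WHERE category_id = 1"
)

# Snowflake cost: reaching the category from the fact table takes two joins.
rows = conn.execute("""
SELECT c.category_name, SUM(f.amount)
FROM fact_sales f
JOIN dim_product p ON p.product_id = f.product_id
JOIN dim_category c ON c.category_id = p.category_id
GROUP BY c.category_name
""").fetchall()
```

In a star schema the category name would be stored on each dim_product row, so the same rename would touch every product in that category.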

Question 2

How do you ensure data quality and validation in a fast-moving team?

Accepted Answer

Situation: At [Company], our data team shipped 15+ pipelines weekly; quality incidents were causing downstream analytics and ML models to fail silently. Task: I was tasked with implementing a scalable quality framework without slowing velocity. Action: I designed a tiered validation strategy: (1) Schema validation at ingestion—Great Expectations run as pre-commit hooks and in CI. (2) Critical-field blocking—null or out-of-range on key columns fails the pipeline....
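As a rough illustration of tier (2), critical-field blocking, here is a minimal plain-Python sketch. It does not use the Great Expectations API; `CriticalFieldError`, `validate_critical_fields`, and the rule format are hypothetical names chosen for this example.

```python
class CriticalFieldError(Exception):
    """Raised to stop the pipeline when a key column fails validation."""


def validate_critical_fields(rows, rules):
    """Check every row against rules of the form {column: (min, max)}.

    Raises CriticalFieldError on the first null or out-of-range value,
    otherwise returns the number of rows checked.
    """
    for index, row in enumerate(rows):
        for column, (lowest, highest) in rules.items():
            value = row.get(column)
            if value is None:
                raise CriticalFieldError(f"row {index}: {column} is null")
            if not lowest <= value <= highest:
                raise CriticalFieldError(
                    f"row {index}: {column}={value} outside [{lowest}, {highest}]"
                )
    return len(rows)


# Hypothetical rule: order_total must be present and between 0 and 1,000,000.
RULES = {"order_total": (0, 1_000_000)}
checked = validate_critical_fields(
    [{"order_total": 10}, {"order_total": 250}], RULES
)
```

Raising an exception, rather than logging a warning, is what makes the check blocking: the orchestrator sees a failed task and downstream steps do not run.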

Question 3

Tell me about a time when a Spark job failed in production. How did you fix it?

Accepted Answer

Situation: Spark job OOM due to key skew—one partition had 80% of data. Task: Diagnose and fix. Action: Used Spark UI; identified skewed stage. Fixed with salting, increased executor memory, broadcast joins for dimensions. Deployed; added skew monitoring and data quality check....
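The salting idea can be shown without a cluster. The sketch below is plain Python, not Spark code: it appends a random salt to each fact-side key so the hot key spreads over several sub-keys, and replicates the dimension side once per salt so the join still matches. The function names, the salt count of 8, and the sample data are assumptions for illustration.

```python
import random
from collections import Counter


def salt_key(key, num_salts, rng):
    """Append a random salt so one hot key spreads over num_salts sub-keys."""
    return f"{key}_{rng.randrange(num_salts)}"


def explode_dimension(dim_rows, num_salts):
    """Replicate each dimension row once per salt so salted keys still join."""
    return {
        f"{key}_{salt}": value
        for key, value in dim_rows.items()
        for salt in range(num_salts)
    }


# Made-up skewed fact data: 80% of rows share the key "hot".
fact_keys = ["hot"] * 80 + ["a"] * 10 + ["b"] * 10
rng = random.Random(0)
salted_keys = [salt_key(key, 8, rng) for key in fact_keys]

# Size of the largest key group before and after salting.
largest_before = max(Counter(fact_keys).values())
largest_after = max(Counter(salted_keys).values())

# The join on salted keys returns the same dimension values as before.
dim = explode_dimension({"hot": "H", "a": "A", "b": "B"}, 8)
joined = [dim[key] for key in salted_keys]
```

The trade-off is that the dimension side grows by a factor of the salt count, which is why salting is usually applied only to the keys known to be skewed.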

Question 4

What storage format would you choose for analytics-heavy workloads and why?

Accepted Answer

Situation: Analytics workload. Task: Format selection. Action: Choose a columnar format (Parquet or ORC) for its compression and predicate pushdown. In a lakehouse, layer Delta or Iceberg on top for ACID transactions and time travel. Partition by date; Z-order on commonly filtered columns. Avoid CSV/JSON for large scans....

Question 5

What happens if the NameNode goes down?

Accepted Answer

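NameNode: Holds HDFS metadata (directory tree, block locations). Without HA it is a single point of failure. If down: HDFS is unavailable, even though the DataNodes still hold the blocks. Mitigation: HA with a standby NameNode and ZooKeeper-based failover; active and standby share the edit log. Backup: write the namespace to NFS; Secondary NameNode checkpoints (it merges the edit log into the image and is not a failover standby). WHY: Always run HA in prod; test failover; keep runbooks....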

Question 6

What's the time and space complexity of both solutions?

Accepted Answer

**Example, two sum:** Brute force is O(n²) time, O(1) space. Optimized is O(n) average time, O(n) space (hash map). **Merge intervals:** Sort-based is O(n log n) time and O(1) or O(n) space, depending on whether the sort and the output are in place. **Always state both.** **Production:** Choose based on input size; optimize when profiling shows a bottleneck. Document the complexity in comments....
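For the two-sum example, the two solutions might look like the sketch below. The function names and the choice to return an index pair (or `None` when no pair exists) are assumptions, since the original problem statement is not shown.

```python
def two_sum_brute(nums, target):
    """Check every pair: O(n^2) time, O(1) extra space."""
    for i in range(len(nums)):
        for j in range(i + 1, len(nums)):
            if nums[i] + nums[j] == target:
                return (i, j)
    return None


def two_sum_hash(nums, target):
    """Single pass with a hash map: O(n) average time, O(n) extra space."""
    seen = {}  # value -> index where it was first seen
    for j, value in enumerate(nums):
        i = seen.get(target - value)
        if i is not None:
            return (i, j)
        # Keep the first index of each value so duplicates pair correctly.
        seen.setdefault(value, j)
    return None
```

When several pairs sum to the target, the two functions can return different pairs, because they visit candidates in a different order.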

Question 7

Given a list of intervals, merge the overlaps. How do you optimize it?

Accepted Answer

**Algorithm:** Sort by start. If overlaps last, extend; else append. O(n log n). **Code:** sort; res=[intervals[0]]; for s,e: if s<=res[-1][1]: res[-1][1]=max(...); else: res.append. **Why:** Sort enables linear merge....
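A runnable version of the sort-then-sweep approach could look like this. It is a sketch that assumes intervals are `[start, end]` pairs with closed ends (so touching intervals merge), and it adds an empty-input guard that the inline outline does not show.

```python
def merge_intervals(intervals):
    """Merge overlapping [start, end] intervals in O(n log n) time."""
    if not intervals:
        return []
    # Sorting by start means each interval can only overlap the last merged one.
    ordered = sorted(intervals, key=lambda interval: interval[0])
    merged = [list(ordered[0])]
    for start, end in ordered[1:]:
        last = merged[-1]
        if start <= last[1]:
            # Overlap: extend the last interval if this one reaches further.
            last[1] = max(last[1], end)
        else:
            merged.append([start, end])
    return merged
```

The sort dominates the cost; the sweep itself is a single linear pass. If the input is already sorted by start, the sort can be skipped and the whole merge is O(n).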

Question 8

How would you test these functions with edge cases?

Accepted Answer

**Edge cases:** Empty; single element; boundaries; duplicates; negative; null/NaN; type errors; large input. **Tools:** pytest parametrize; Hypothesis. **Example:** @pytest.mark.parametrize('input,expected', [([],0), ([1],1)]). **Why:** Prevent regressions....
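A table of edge cases can be sketched as follows. `count_distinct` is a hypothetical function under test, chosen only because it fits the `([], 0), ([1], 1)` pairs above; the loop at the end stands in for a test runner so the example has no third-party dependency.

```python
def count_distinct(values):
    """Hypothetical function under test: number of distinct items."""
    if values is None:
        raise TypeError("values must be a list, not None")
    return len(set(values))


# Each tuple is (input, expected). With pytest installed, the same list can
# be passed to @pytest.mark.parametrize("values,expected", CASES).
CASES = [
    ([], 0),                                  # empty
    ([1], 1),                                 # single element
    ([2, 2, 2], 1),                           # duplicates
    ([-1, 0, 1], 3),                          # negatives and zero
    (list(range(100_000)), 100_000),          # large input
]


def check_case(values, expected):
    assert count_distinct(values) == expected


# Plain loop standing in for the test runner.
for values, expected in CASES:
    check_case(values, expected)
```

Error paths such as `None` input are usually tested separately, asserting that the expected exception is raised.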

Question 9

Solve the Dutch National Flag problem in one pass. How would you handle it?

Accepted Answer

**Why Three-Way Partition:** Sort 0s, 1s, 2s in one pass—O(n), O(1). Foundation for 3-way quicksort (Dijkstra's). Used in routing (low/med/high priority), bucketing.

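**Invariant:** [0..low)=0, [low..mid)=1, [mid..high) unexamined, [high..n)=2. mid sweeps right: on 0, swap with low and advance both; on 2, decrement high and swap with it without advancing mid (the swapped-in value is still unexamined); on 1, just advance mid.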

**Extensions:** K-way partition needs a different approach. The partition is not stable (equal elements can be reordered), which is acceptable when sorting bare keys....
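The invariant translates fairly directly into code. This is a sketch that assumes the input contains only the values 0, 1, and 2 and treats `high` as an exclusive bound, matching the [high..n)=2 region; the function name is chosen for illustration.

```python
def dutch_flag_sort(values):
    """Sort a list of 0s, 1s, and 2s in place in one pass; O(n) time, O(1) space."""
    low, mid, high = 0, 0, len(values)
    while mid < high:
        if values[mid] == 0:
            # Move the 0 into the [0..low) region; the value swapped back is a 1
            # (or the same element), so mid can advance too.
            values[low], values[mid] = values[mid], values[low]
            low += 1
            mid += 1
        elif values[mid] == 2:
            # Move the 2 into the [high..n) region; the value swapped in is
            # unexamined, so mid stays where it is.
            high -= 1
            values[mid], values[high] = values[high], values[mid]
        else:
            mid += 1
    return values
```

Each iteration either advances `mid` or shrinks `high`, so the loop runs at most n times.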

Question 10

How do partitions improve query performance in fact tables?

Accepted Answer

**Partition pruning**: A filter on the partition column (e.g., dt) skips entire partition files; the query reads only matching segments. **Scalability**: One year of data (1B rows) partitioned by month = 12 segments, so a one-month filter reads roughly 1/12 of the data; unpartitioned = full scan. **Columnar + partition**: Both reduce I/O; partition first (coarse), then columnar (fine). **Trade-off**: Low-cardinality partition (e.g., boolean) = few files, little pruning....
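Partition pruning can be mimicked in plain Python by grouping rows under their partition value and reading only the matching group. This is a toy model, not how a query engine is implemented; the helper names and the 12-month sample data are assumptions for illustration.

```python
from collections import defaultdict


def build_partitions(rows, partition_key):
    """Group rows by partition value, like one directory per dt value."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[partition_key]].append(row)
    return partitions


def query_with_pruning(partitions, wanted_value):
    """Read only the matching partition; return its rows and the rows scanned."""
    scanned = partitions.get(wanted_value, [])
    return scanned, len(scanned)


# Made-up fact table: 12 monthly partitions of 100 rows each.
rows = [
    {"dt": f"2024-{month:02d}", "amount": month}
    for month in range(1, 13)
    for _ in range(100)
]
partitions = build_partitions(rows, "dt")

# Pruned query touches one partition; the unpartitioned filter touches every row.
matched, rows_scanned = query_with_pruning(partitions, "2024-03")
full_scan_result = [row for row in rows if row["dt"] == "2024-03"]
```

Both queries return the same rows, but the pruned one examines 100 rows instead of all 1,200.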

Microsoft Data Engineer Interview Questions
