Question 1

What is the difference between repartition and coalesce in Apache Spark?

Accepted Answer

**Repartition(n)**: Performs a full shuffle to redistribute data across exactly `n` partitions. Can increase or decrease partition count. On a DataFrame, `repartition(n)` with no columns distributes rows round-robin; passing columns hash-partitions on them. Either way, rows are exchanged across the network.

**Coalesce(n)**: Merges existing partitions into fewer partitions without a full shuffle. Only decreases partition count....
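The contrast can be sketched with a toy model in plain Python. This is not Spark code: partitions are modelled as lists, the function names simply mirror the Spark operations, and the grouping rule inside `coalesce` is an assumption made for illustration (Spark's own grouping also weighs data locality).

```python
from typing import List

Partition = List[int]


def coalesce(partitions: List[Partition], n: int) -> List[Partition]:
    """Merge whole partitions into n groups; rows never leave their partition-mates."""
    if n >= len(partitions):
        # Without a shuffle the partition count cannot grow.
        return [list(p) for p in partitions]
    merged: List[Partition] = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        # Adjacent input partitions are glued together (illustrative rule).
        merged[i * n // len(partitions)].extend(part)
    return merged


def repartition(partitions: List[Partition], n: int) -> List[Partition]:
    """Deal every row round-robin into n new partitions (models a full shuffle)."""
    out: List[Partition] = [[] for _ in range(n)]
    i = 0
    for part in partitions:
        for row in part:
            out[i % n].append(row)
            i += 1
    return out


skewed = [[1, 2, 3, 4, 5, 6], [7], [8], [9]]
coalesced = coalesce(skewed, 2)      # sizes 7 and 2: the skew survives
rebalanced = repartition(skewed, 2)  # sizes 5 and 4: rows are evened out
```

The model shows the usual trade-off: merging is cheap but inherits whatever imbalance the input had, while the full redistribution pays for moving every row and gets even partitions in return.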

Question 2

What is the difference between narrow and wide transformations in Apache Spark? Explain with examples.

Accepted Answer

**Narrow transformations**: Each input partition maps to at most one output partition. No shuffle. Examples: map, filter, flatMap, mapPartitions.

**Wide transformations**: Require data from multiple input partitions to produce one output partition. Trigger shuffle. Examples: groupByKey, reduceByKey, join, distinct, repartition.

**Architectural Logic (Why This Matters)**: Spark pipelines narrow transformations and executes them in a single stage....
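A toy model in plain Python (not Spark code) may make the dependency shapes concrete. Partitions are lists of key-value pairs; `narrow_filter` and `wide_reduce_by_key` are hypothetical helper names, and routing keys by CRC32 is an assumption standing in for Spark's hash partitioner.

```python
import zlib
from collections import defaultdict
from typing import Dict, List, Tuple

Pair = Tuple[str, int]


def narrow_filter(partitions: List[List[Pair]], min_value: int) -> List[List[Pair]]:
    # Narrow: output partition i is computed from input partition i alone.
    return [[(k, v) for k, v in part if v >= min_value] for part in partitions]


def wide_reduce_by_key(partitions: List[List[Pair]], num_out: int) -> List[Dict[str, int]]:
    # Wide: every input partition may contribute to every output partition,
    # so rows must be routed by key first (the shuffle) and then summed.
    buckets: List[Dict[str, int]] = [defaultdict(int) for _ in range(num_out)]
    for part in partitions:
        for key, value in part:
            target = zlib.crc32(key.encode("utf-8")) % num_out
            buckets[target][key] += value
    return [dict(b) for b in buckets]


data = [[("a", 1), ("b", 2)], [("a", 3), ("c", 4)]]
filtered = narrow_filter(data, 2)
reduced = wide_reduce_by_key(data, 2)
```

In `filtered` the partition boundaries are untouched, whereas in `reduced` the two `"a"` rows from different input partitions end up summed in a single output partition.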

Question 3

Explain the difference between Spark's map() and flatMap() transformations.

Accepted Answer

**map()**: 1-to-1; each input element yields exactly one output element. **flatMap()**: 1-to-many or 1-to-0; the function returns an iterable and the results are flattened. **Example**: mapping `split` over lines yields one list of words per line; flatMapping it yields the individual words. **Why it matters**: both are narrow transformations and preserve the partition count; map also preserves the element count, while flatMap can change it (e.g., explode JSON array → more rows). **Use flatMap for**: tokenization, exploding arrays, filtering (return an empty list). **Performance**: a flatMap that expands heavily can make some partitions much larger than others (skew); monitor partition sizes....
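The same semantics can be reproduced on plain Python lists. The helpers `map_` and `flat_map` below are hypothetical stand-ins for the Spark transformations, written only to show the shapes of the results.

```python
from itertools import chain
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")
U = TypeVar("U")


def map_(fn: Callable[[T], U], items: Iterable[T]) -> List[U]:
    # One output element per input element.
    return [fn(x) for x in items]


def flat_map(fn: Callable[[T], Iterable[U]], items: Iterable[T]) -> List[U]:
    # fn returns an iterable per element; the iterables are concatenated.
    return list(chain.from_iterable(fn(x) for x in items))


lines = ["to be", "or not", ""]
mapped = map_(str.split, lines)         # [['to', 'be'], ['or', 'not'], []]
flattened = flat_map(str.split, lines)  # ['to', 'be', 'or', 'not']

# Returning an empty list drops the element, so flat_map can act as a filter.
evens = flat_map(lambda n: [n] if n % 2 == 0 else [], [1, 2, 3, 4])  # [2, 4]
```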

Question 4

How does Spark's Catalyst Optimizer work? Explain its stages.

Accepted Answer

Catalyst: Rule-based + cost-based optimizer for Spark SQL/DataFrame. Stages: 1) Analysis: resolve tables, columns, and types via the Catalog. 2) Logical optimization: predicate pushdown, projection pruning, constant folding, join reordering. 3) Physical planning: generate candidate physical plans; a cost model picks the best (e.g., broadcast vs sort-merge join). 4) Code generation: whole-stage code generation (part of Project Tungsten) compiles operators into Java bytecode that runs as tight loops....
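To give a feel for what a rule-based rewrite does, here is a minimal sketch of constant folding over a tiny expression tree. It is plain Python, not Catalyst: the `Lit`, `Col`, and `Add` node classes and the `fold_constants` function are hypothetical and model only one of the logical-optimization rules named above.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Lit:
    value: int


@dataclass(frozen=True)
class Col:
    name: str


@dataclass(frozen=True)
class Add:
    left: "Expr"
    right: "Expr"


Expr = Union[Lit, Col, Add]


def fold_constants(e: Expr) -> Expr:
    """Rewrite bottom-up: an Add of two literals becomes a single literal."""
    if isinstance(e, Add):
        left, right = fold_constants(e.left), fold_constants(e.right)
        if isinstance(left, Lit) and isinstance(right, Lit):
            return Lit(left.value + right.value)
        return Add(left, right)
    return e  # literals and column references are already minimal


plan = Add(Col("price"), Add(Lit(1), Lit(2)))  # price + (1 + 2)
optimized = fold_constants(plan)               # price + 3
```

The rule fires once per plan rather than once per row, which is why folding constants before execution saves work on large inputs.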

Question 5

What is the difference between Managed and External tables in Hive/Spark?

Accepted Answer

Managed: Spark/Hive owns metadata and data. DROP TABLE deletes both. External: Metadata only; data lives in specified location. DROP TABLE drops metadata; data remains. Why External: Shared data across tools (Athena, Glue, Spark); production datasets where accidental DROP would be catastrophic; data lifecycle independent of table. Why Managed: Ephemeral tables, temp outputs; simpler—no orphan paths. Cost: Managed DROP can trigger expensive recursive deletes on object store....

Question 6

Explain the concept of Broadcast Join in Spark. When should it be used?

Accepted Answer

Mechanism: Small table sent to all executors; join happens locally, no shuffle. Triggered by broadcast() hint or spark.sql.autoBroadcastJoinThreshold (default 10MB). Why: Shuffle of large table is expensive; broadcasting small table avoids it. When: One side fits in executor memory (~broadcast threshold). Trade-off: Too large = driver/executor OOM; too small threshold = unnecessary shuffles. Cost: Broadcast data replicated per executor; acceptable for MB-scale....
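The mechanism can be sketched in plain Python as a hash join in which the small side is turned into a lookup table and probed from every partition of the large side. This is a toy model, not Spark code; `broadcast_hash_join` and the sample tables are hypothetical.

```python
from typing import Dict, List, Tuple

Order = Tuple[int, str]          # (customer_id, item)
Joined = Tuple[int, str, str]    # (customer_id, item, customer_name)


def broadcast_hash_join(
    large_partitions: List[List[Order]],
    small_table: List[Tuple[int, str]],
) -> List[List[Joined]]:
    # The lookup dict plays the role of the broadcast copy on each executor.
    lookup: Dict[int, str] = dict(small_table)
    # Each large partition is joined in place: no row of the large side moves.
    return [
        [(cid, item, lookup[cid]) for cid, item in part if cid in lookup]
        for part in large_partitions
    ]


orders = [[(1, "book"), (2, "pen")], [(3, "lamp"), (1, "ink")]]
customers = [(1, "ann"), (2, "bob")]
joined = broadcast_hash_join(orders, customers)
```

The output keeps the partitioning of the large side, which is the point: only the small table is copied around. In PySpark the equivalent hint is the `broadcast` function from `pyspark.sql.functions`, wrapped around the small DataFrame in the join.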

Question 7

Explain the difference between shallow copy and deep copy in Python.

Accepted Answer

Shallow (copy.copy()): New top-level object; nested objects are shared references, so mutating a nested object is visible through the original. Deep (copy.deepcopy()): Recursive copy; fully independent. Why it matters: Shallow cost is proportional to the number of top-level elements only; deep cost is proportional to the total number of objects in the entire structure, so it can be slow for large nested dicts. Use shallow when: No nested mutables, or shared refs are OK. Use deep when: Need full isolation (e.g., config that will be modified)....
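A short demonstration of the difference, using a made-up config dict (the keys and values are illustrative only):

```python
import copy

original = {"retries": 3, "hosts": ["a", "b"]}

shallow = copy.copy(original)   # new dict, but "hosts" is the same list object
deep = copy.deepcopy(original)  # new dict and a new, independent "hosts" list

shallow["hosts"].append("c")    # mutates the list shared with original
shallow["retries"] = 5          # rebinding a top-level key affects only the copy
deep["hosts"].append("d")       # touches only the deep copy
```

After these statements `original["hosts"]` contains `"c"` but not `"d"`, and `original["retries"]` is still 3: only mutations of shared nested objects leak through a shallow copy.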

Question 8

Write a Python function to find the first non-repeating character in a string.

Accepted Answer

Approach: Two passes. Count the characters, then return the first one whose count is 1. Code: `from collections import Counter`, then `def first_non_repeating(s): counts = Counter(s); return next((c for c in s if counts[c] == 1), None)`. Without the import, build `counts` as a dict with a plain loop: `for c in s: counts[c] = counts.get(c, 0) + 1`. Complexity: O(n) time, O(k) space, where k is the number of distinct characters. Why: A single pass can't know whether a char is unique until the full scan is done....
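Laid out as a block with a few calls, the Counter variant of the two-pass approach might look like this. The type hints and the sample strings are additions for illustration.

```python
from collections import Counter
from typing import Optional


def first_non_repeating(s: str) -> Optional[str]:
    """Return the first character that occurs exactly once, or None."""
    counts = Counter(s)  # pass 1: count every character
    # Pass 2: scan in original order so the *first* unique character wins.
    return next((c for c in s if counts[c] == 1), None)


result_swiss = first_non_repeating("swiss")  # 'w'
result_aabb = first_non_repeating("aabb")    # None: every character repeats
result_empty = first_non_repeating("")       # None
```

Scanning the string rather than the counter in the second pass makes the ordering explicit and does not rely on dict insertion order.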

Question 9

Have you worked on Data Warehousing projects?

Accepted Answer

**Architectural context**: A data warehouse is the semantic layer between raw data and business decisions. Design choices—star vs snowflake, SCD strategy, partitioning—directly impact query latency, storage cost, and maintenance burden. **Key responsibilities**: (1) **Schema design**: Star for BI simplicity, snowflake for normalized flexibility. SCD Type 2 for slowly changing dimensions (audit trail, point-in-time correctness)....
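As an illustration of the SCD Type 2 idea mentioned above, here is a minimal in-memory sketch. The `CustomerVersion` record, its field names, and the helper functions are all hypothetical; a real warehouse would typically do this with a merge statement against the dimension table.

```python
from dataclasses import dataclass, replace
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class CustomerVersion:
    customer_id: int
    city: str
    valid_from: date
    valid_to: Optional[date] = None  # None marks the current version


def apply_change(
    history: List[CustomerVersion], customer_id: int, new_city: str, effective: date
) -> List[CustomerVersion]:
    """Close the current version and append a new one; never overwrite."""
    out: List[CustomerVersion] = []
    for row in history:
        if row.customer_id == customer_id and row.valid_to is None:
            if row.city == new_city:
                return list(history)  # nothing changed, keep history as-is
            out.append(replace(row, valid_to=effective))
        else:
            out.append(row)
    out.append(CustomerVersion(customer_id, new_city, effective))
    return out


def city_as_of(history: List[CustomerVersion], customer_id: int, day: date) -> Optional[str]:
    """Point-in-time lookup: which version was valid on the given day?"""
    for row in history:
        if (
            row.customer_id == customer_id
            and row.valid_from <= day
            and (row.valid_to is None or day < row.valid_to)
        ):
            return row.city
    return None


history = [CustomerVersion(1, "Leeds", date(2020, 1, 1))]
history = apply_change(history, 1, "York", date(2023, 6, 1))
```

Because the old row is closed rather than updated, `city_as_of` can still answer questions about 2021 after the customer has moved, which is the audit-trail property the answer refers to.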

Question 10

What is the difference between OLTP and OLAP?

Accepted Answer

**Why the distinction exists**: They serve different access patterns. OLTP = many small, random writes and point reads. OLAP = few, large sequential scans and aggregations. Optimizing for one degrades the other. **OLTP**: Row-oriented storage (fast single-row access). Normalized schema (3NF) to avoid update anomalies. Indexes for lookup (B-tree). ACID for consistency. High concurrency via locking/MVCC. Examples: PostgreSQL, MySQL, Oracle....
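The two access patterns can be contrasted on a tiny in-memory table. This is only a sketch in plain Python with made-up order data: the dict keyed by `order_id` stands in for an index over row-oriented storage, and the dict of lists stands in for a columnar layout.

```python
rows = [
    {"order_id": 1, "customer": "ann", "amount": 30},
    {"order_id": 2, "customer": "bob", "amount": 45},
    {"order_id": 3, "customer": "ann", "amount": 25},
]

# Row-oriented with an index: a point read fetches one complete record.
by_id = {r["order_id"]: r for r in rows}
order_2 = by_id[2]

# Column-oriented: each attribute is stored contiguously, so an aggregate
# touches only the one column it needs.
columns = {key: [r[key] for r in rows] for key in rows[0]}
total_amount = sum(columns["amount"])
```

The point read never looks at other rows, and the aggregate never looks at other columns; each layout is cheap for one pattern and wasteful for the other.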

Dunnhumby Data Engineer Interview Questions
