Question 1

What is the difference between repartition and coalesce in Apache Spark?

Accepted Answer

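**Repartition(n)**: Performs a full shuffle to redistribute data across exactly `n` partitions. Can increase or decrease the partition count. On a DataFrame, `repartition(n)` with no columns distributes rows round-robin; when partitioning columns are supplied it hash-partitions on them. Either way, rows are exchanged across the network.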

**Coalesce(n)**: Merges existing partitions into fewer partitions without a full shuffle. Only decreases partition count....
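The contrast can be illustrated with a toy model in plain Python, where a "partition" is just a list. This is a sketch under simplifying assumptions, not Spark's implementation: the functions `repartition` and `coalesce` below are hypothetical stand-ins, and real Spark also takes data locality into account when it groups partitions.

```python
from itertools import chain

def repartition(partitions, n):
    """Toy full shuffle: every row is redistributed round-robin into n new partitions."""
    out = [[] for _ in range(n)]
    for i, row in enumerate(chain.from_iterable(partitions)):
        out[i % n].append(row)
    return out

def coalesce(partitions, n):
    """Toy coalesce: whole partitions are merged into n groups; no partition is split."""
    if n >= len(partitions):
        # Coalesce cannot increase the partition count.
        return [list(p) for p in partitions]
    out = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        out[i % n].extend(part)
    return out

# One large partition and three small ones.
parts = [[1, 2, 3, 4, 5, 6], [7], [8], [9]]
print([len(p) for p in repartition(parts, 2)])  # [5, 4]
print([len(p) for p in coalesce(parts, 2)])     # [7, 2]
```

In this model, repartition evens out the sizes at the cost of moving every row, while coalesce moves only whole partitions and can leave the result unbalanced.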

Question 2

What is the difference between narrow and wide transformations in Apache Spark? Explain with examples.

Accepted Answer

**Narrow transformations**: Each input partition maps to at most one output partition. No shuffle. Examples: map, filter, flatMap, mapPartitions.

**Wide transformations**: Require data from multiple input partitions to produce one output partition. Trigger shuffle. Examples: groupByKey, reduceByKey, join, distinct, repartition.

**Architectural Logic (Why This Matters)**: Spark pipelines narrow transformations and executes them in a single stage....
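A toy model in plain Python may help make the distinction concrete. It is a sketch only: `narrow_map` and `wide_reduce_by_key` are hypothetical helper names, partitions are modeled as lists, and the "shuffle" is just routing records into buckets by key hash.

```python
from collections import defaultdict
from functools import reduce

def narrow_map(partitions, fn):
    """Narrow: each output partition is computed from exactly one input partition."""
    return [[fn(x) for x in part] for part in partitions]

def wide_reduce_by_key(partitions, fn, num_out):
    """Wide: equal keys may sit in any input partition, so records are first
    routed (shuffled) by key hash, then reduced inside each output bucket."""
    buckets = [defaultdict(list) for _ in range(num_out)]
    for part in partitions:
        for key, value in part:
            buckets[hash(key) % num_out][key].append(value)
    return [{k: reduce(fn, vs) for k, vs in b.items()} for b in buckets]

words = [["a", "b", "a"], ["b", "c"], ["a"]]
pairs = narrow_map(words, lambda w: (w, 1))       # partition layout unchanged
counts = wide_reduce_by_key(pairs, lambda x, y: x + y, num_out=2)
merged = {k: v for bucket in counts for k, v in bucket.items()}
print(merged)
```

The map step never looks outside its own partition, which is why steps like it can be chained within one stage. The reduce step has to gather every record for a key into one place first, which is the shuffle boundary.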

Question 3

Explain the difference between Spark's map() and flatMap() transformations.

Accepted Answer

**map()**: One-to-one; each input element yields exactly one output element.

**flatMap()**: One-to-many or one-to-zero; the function returns an iterable and the results are flattened.

**Example**: Splitting a string with map yields one list of words per input string; splitting with flatMap yields the individual words as separate elements.

**Why it matters**: map preserves the element count; flatMap can change it (e.g., exploding a JSON array produces more rows). Neither changes the number of partitions.

**Use flatMap for**: Tokenization, exploding arrays, filtering (return an empty list).

**Performance**: A flatMap that expands heavily can cause skew; monitor partition sizes....
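The same semantics can be sketched on a plain Python list, without Spark. The list comprehension plays the role of map and `itertools.chain.from_iterable` plays the role of the flattening step; the sample `lines` are made up for illustration.

```python
from itertools import chain

lines = ["hello world", "", "spark is fast"]

# map-like: exactly one output per input (here, one list of tokens per line).
mapped = [line.split() for line in lines]

# flatMap-like: zero or more outputs per input, flattened into one sequence.
flat_mapped = list(chain.from_iterable(line.split() for line in lines))

print(mapped)       # [['hello', 'world'], [], ['spark', 'is', 'fast']]
print(flat_mapped)  # ['hello', 'world', 'spark', 'is', 'fast']
```

The empty line shows the one-to-zero case: it still produces an (empty) element under map but contributes nothing under flatMap.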

Question 4

How does Spark's Catalyst Optimizer work? Explain its stages.

Accepted Answer

**Catalyst**: Rule-based and cost-based optimizer for Spark SQL and the DataFrame API.

**Stages**:

1) Analysis: resolve tables, columns, and types via the Catalog.
2) Logical optimization: predicate pushdown, projection pruning, constant folding, join reordering.
3) Physical planning: generate candidate physical plans; a cost model picks the best (e.g., broadcast vs. sort-merge join).
4) Code generation: whole-stage code generation (part of Tungsten) emits Java code that is compiled to bytecode for tight loops....
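To give a feel for what a rule-based rewrite such as constant folding does, here is a minimal sketch in plain Python. It is not Catalyst code: the tuple-based expression tree and the `fold_constants` function are invented for illustration, and only `add` and `mul` are supported.

```python
import operator

# Hypothetical mini expression language: ("lit", value), ("col", name),
# or (op, left, right) with op in OPS.
OPS = {"add": operator.add, "mul": operator.mul}

def fold_constants(expr):
    """Bottom-up rewrite: an operator over two literals becomes one literal."""
    if expr[0] in ("lit", "col"):
        return expr
    op, left, right = expr
    left, right = fold_constants(left), fold_constants(right)
    if left[0] == "lit" and right[0] == "lit":
        return ("lit", OPS[op](left[1], right[1]))
    return (op, left, right)

# x * (2 + 3) is rewritten to x * 5 before any row is processed.
plan = ("mul", ("col", "x"), ("add", ("lit", 2), ("lit", 3)))
optimized = fold_constants(plan)
print(optimized)  # ('mul', ('col', 'x'), ('lit', 5))
```

The point of the sketch is only that the rewrite happens once on the plan tree, so the arithmetic is not repeated for every row at execution time.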

Question 5

What is the difference between Managed and External tables in Hive/Spark?

Accepted Answer

**Managed**: Spark/Hive owns both the metadata and the data. DROP TABLE deletes both.

**External**: Spark/Hive owns the metadata only; the data lives in a user-specified location. DROP TABLE drops the metadata; the data remains.

**Why External**: Data shared across tools (Athena, Glue, Spark); production datasets where an accidental DROP would be catastrophic; data lifecycle independent of the table.

**Why Managed**: Ephemeral tables and temporary outputs; simpler, with no orphaned paths.

**Cost**: Dropping a managed table can trigger expensive recursive deletes on an object store....

Question 6

Explain the concept of Broadcast Join in Spark. When should it be used?

Accepted Answer

**Mechanism**: The small table is sent to every executor; the join then happens locally, with no shuffle of the large table. Triggered by the broadcast() hint or when the table's estimated size is below spark.sql.autoBroadcastJoinThreshold (default 10MB).

**Why**: Shuffling a large table is expensive; broadcasting the small table avoids it.

**When**: One side fits comfortably in executor memory (roughly within the broadcast threshold).

**Trade-off**: Broadcasting too large a table risks driver or executor OOM; setting the threshold too low causes unnecessary shuffles.

**Cost**: The broadcast data is replicated on every executor; acceptable at MB scale....
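The mechanism can be sketched in plain Python as a hash join in which every partition of the large side probes the same in-memory lookup table. This is a toy model, not Spark's implementation: `broadcast_hash_join` is a hypothetical name, the sample tables are made up, and the join shown is an inner join.

```python
def broadcast_hash_join(large_partitions, small_rows):
    """Inner-join each partition of the large side against an in-memory
    copy of the small side."""
    # "Broadcast": build one lookup table from the small side. In this toy
    # model every partition simply reads the same dict.
    lookup = {}
    for key, value in small_rows:
        lookup.setdefault(key, []).append(value)
    # Each large partition is joined independently; its rows never move.
    return [
        [(key, lv, sv) for key, lv in part for sv in lookup.get(key, [])]
        for part in large_partitions
    ]

orders = [[(1, "order-a"), (2, "order-b")], [(1, "order-c"), (3, "order-d")]]
countries = [(1, "UK"), (2, "FR")]
joined = broadcast_hash_join(orders, countries)
print(joined)
# [[(1, 'order-a', 'UK'), (2, 'order-b', 'FR')], [(1, 'order-c', 'UK')]]
```

The output keeps the partition layout of the large side, which is the property that lets the real join skip the shuffle.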

Question 7

Explain the difference between shallow copy and deep copy in Python.

Accepted Answer

**Shallow (copy.copy())**: Creates a new top-level object; nested objects are shared references, so mutating a nested object is visible through the original as well.

**Deep (copy.deepcopy())**: Copies recursively; the result is fully independent.

**Why it matters**: A shallow copy costs O(n) in the number of top-level elements only; a deep copy costs time proportional to the size of the entire nested structure, so it can be slow for large nested dicts.

**Use shallow when**: There are no nested mutables, or shared references are acceptable.

**Use deep when**: Full isolation is needed (e.g., a config that will be modified)....
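A short demonstration of the difference, using a made-up `config` dict with one nested list:

```python
import copy

config = {"name": "job", "retries": [1, 2]}
shallow = copy.copy(config)
deep = copy.deepcopy(config)

config["retries"].append(3)   # mutate a nested object in place
config["name"] = "renamed"    # rebind a top-level key

print(shallow["retries"])  # [1, 2, 3]  (nested list is shared)
print(deep["retries"])     # [1, 2]     (nested list was copied)
print(shallow["name"])     # job        (top-level dict is separate)
```

Rebinding a top-level key affects neither copy; only in-place mutation of the shared nested list leaks into the shallow copy.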

Question 8

Write a Python function to find the first non-repeating character in a string.

Accepted Answer

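**Approach**: Two passes: count the characters, then return the first one whose count is 1.

**Code**: With `from collections import Counter`, the body of `first_non_repeating(s)` is `counts = Counter(s)` followed by `return next((c for c in s if counts[c] == 1), None)`. Without Counter, build `counts` as a plain dict in a loop, using `counts[c] = counts.get(c, 0) + 1` for each `c` in `s`, then apply the same `return`.

**Complexity**: O(n) time, O(k) space, where k is the number of distinct characters.

**Why**: A single pass cannot know whether a character is unique until the full scan is done, since a later occurrence could still appear....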
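Written out as a runnable function, the two-pass approach might look like the following sketch; the type hints and the sample inputs are additions for illustration.

```python
from collections import Counter
from typing import Optional

def first_non_repeating(s: str) -> Optional[str]:
    """Return the first character that occurs exactly once in s, or None."""
    counts = Counter(s)  # pass 1: count every character
    # pass 2: scan in original order and stop at the first unique character
    return next((c for c in s if counts[c] == 1), None)

print(first_non_repeating("swiss"))  # w
print(first_non_repeating("aabb"))   # None
```

Scanning the string rather than the counter in the second pass is what preserves "first" in input order.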

Dunnhumby Data Engineer Interview Questions
