Real interview questions asked at Fragma Data Systems. Practice the most frequently asked questions and land your next role.
Fragma Data Systems data engineering interviews test your ability across multiple domains: Spark, Python, SQL, Kafka, and pipeline design. These questions are sourced from real Fragma Data Systems interview experiences and sorted by frequency, so practice the ones that matter most.
What is the difference between repartition and coalesce in Apache Spark?
What is the difference between narrow and wide transformations in Apache Spark? Explain with examples.
What are your salary expectations for this role?
Describe the difference between Spark RDDs, DataFrames, and Datasets.
Explain the difference between Spark's map() and flatMap() transformations.
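The distinction can be previewed without a Spark cluster: map() emits exactly one output element per input element, while flatMap() maps and then flattens one level of nesting. A pure-Python analogy (Spark is not assumed installed here):

```python
from itertools import chain

lines = ["hello world", "spark"]

# map()-style: one output per input -> a list of lists
mapped = list(map(str.split, lines))

# flatMap()-style: map, then flatten one level -> a single flat list
flat_mapped = list(chain.from_iterable(map(str.split, lines)))
```

In Spark, `rdd.map(f)` keeps the element count unchanged, whereas `rdd.flatMap(f)` can grow or shrink it, which is why flatMap is the classic choice for tokenizing lines into words.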
How does Spark's Catalyst Optimizer work? Explain its stages.
What is the difference between Managed and External tables in Hive/Spark?
Explain the concept of Broadcast Join in Spark. When should it be used?
How do you optimize Spark jobs for better performance? Mention at least 5 techniques.
What is the difference between a list and a tuple in Python?
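The core answer is mutability: lists can be modified in place, tuples cannot, and immutability is what makes tuples hashable. A minimal demonstration:

```python
nums_list = [1, 2, 3]
nums_list[0] = 99            # lists are mutable: in-place assignment works

nums_tuple = (1, 2, 3)
try:
    nums_tuple[0] = 99       # tuples are immutable -> raises TypeError
except TypeError as exc:
    error = type(exc).__name__

# Because tuples are immutable, they are hashable and can be dict keys.
grid = {(0, 0): "origin"}
```

Worth mentioning in an interview: tuples also signal intent (a fixed-shape record), and CPython can reuse them more cheaply than lists.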
Explain the difference between shallow copy and deep copy in Python.
Write a Python function to find the first non-repeating character in a string.
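One common O(n) answer: count characters in a single pass, then scan the string in order for the first character with count 1. A sketch:

```python
from collections import Counter

def first_non_repeating(s: str):
    """Return the first character that occurs exactly once, or None."""
    counts = Counter(s)                                  # one counting pass
    return next((ch for ch in s if counts[ch] == 1), None)
```

For example, `first_non_repeating("swiss")` returns `"w"`, and a string with no unique character returns `None`.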
What are decorators in Python, and how do they work?
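A decorator is a callable that takes a function and returns a replacement callable; `@decorator` is syntactic sugar for `func = decorator(func)`. A small call-counting sketch:

```python
import functools

def count_calls(func):
    """Wrap func so each call is counted before delegating."""
    @functools.wraps(func)           # preserve func's name and docstring
    def wrapper(*args, **kwargs):
        wrapper.calls += 1
        return func(*args, **kwargs)
    wrapper.calls = 0
    return wrapper

@count_calls                         # equivalent to: add = count_calls(add)
def add(a, b):
    return a + b
```

Mentioning `functools.wraps` is a good signal in interviews: without it, the wrapped function's `__name__` and docstring are replaced by the wrapper's.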
Explain the difference between *args and **kwargs in Python.
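In a function signature, `*args` collects extra positional arguments into a tuple and `**kwargs` collects extra keyword arguments into a dict; at a call site, the same syntax unpacks them. A minimal sketch:

```python
def describe(*args, **kwargs):
    """Return whatever extra arguments were passed, for inspection."""
    return args, kwargs

positional, keywords = describe(1, 2, flag=True)

# The same stars unpack sequences and mappings when calling:
def add(a, b, c):
    return a + b + c

total = add(*[1, 2], **{"c": 3})
```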
Why do you want to join this company?
Describe the data pipeline architecture you've worked with.
What is the difference between OLTP and OLAP?
What is the difference between SQL and NoSQL databases?
Design an anti-skew strategy for a join on a high-cardinality key with a long-tail distribution (e.g., a few keys hold 80% of rows). Cover salting, split-skew, AQE, and cost/operational trade-offs.
Explain the benefits of using DataFrames over RDDs.
How would you implement a sliding window aggregation in Spark Structured Streaming?
How do you keep yourself updated with new data engineering trends?
What data storage would you use for real-time analytics? Why?
What motivates you to work in data engineering?
Explain steps to optimize data read performance from cloud storage (S3 or Azure Blob).
Are you open to learning new tools and technologies?
Describe your approach to managing data deduplication.
How would you design the schema for transactional data storage?
How would you incorporate data security and access control?
Walk me through your resume.
Develop a Python script to clean data by removing duplicates and handling missing values.
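A minimal sketch in plain Python (no pandas assumed): rows are dicts, exact duplicate rows are dropped, and missing required fields are filled with a default. The field names (`id`, `email`) and the fill policy are hypothetical; a real answer would state the business rules for "duplicate" and "missing".

```python
def clean(rows, required=("id", "email"), default="unknown"):
    """Drop exact duplicate rows and fill missing required fields."""
    seen = set()
    cleaned = []
    for row in rows:
        key = tuple(sorted(row.items()))        # duplicate = identical row
        if key in seen:
            continue
        seen.add(key)
        # Keep only the required fields, filling None/absent with default.
        cleaned.append({f: row.get(f) if row.get(f) is not None else default
                        for f in required})
    return cleaned

rows = [
    {"id": 1, "email": "a@x.com"},
    {"id": 1, "email": "a@x.com"},      # exact duplicate -> dropped
    {"id": 2, "email": None},           # missing value -> filled
]
```

With pandas the same logic would be `df.drop_duplicates()` followed by `df.fillna(...)`.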
Can you share an experience where you resolved a conflict within your team?
Create a SQL query to identify customers with purchases above a dynamic threshold.
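One reading of "dynamic threshold" is a value computed from the data itself rather than hard-coded, e.g. the overall average purchase. A sketch using SQLite with a hypothetical `purchases(customer, amount)` schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE purchases (customer TEXT, amount REAL);
    INSERT INTO purchases VALUES
        ('alice', 500), ('bob', 100), ('carol', 300), ('dave', 50);
""")

# Scalar subquery computes the threshold at query time.
query = """
    SELECT customer, amount
    FROM purchases
    WHERE amount > (SELECT AVG(amount) FROM purchases)
    ORDER BY amount DESC;
"""
big_spenders = conn.execute(query).fetchall()
```

Here the average is 237.5, so only `alice` and `carol` qualify; swapping the subquery for a percentile or a per-segment average keeps the same shape.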
How do you monitor consumer lag in Kafka, and how can you reduce it?
How do you optimize partitioning when dealing with large datasets?
How would you deal with a situation where you had to work with a difficult team member?
Optimize a query fetching customer data with a rolling 6-month sales sum.
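One workable shape is a date-range window: sum the current month plus the five preceding months. A SQLite sketch with a hypothetical `sales(month, amount)` schema (months stored as first-of-month dates; a per-customer version would add a customer column to the filter):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (month TEXT, amount REAL);
    INSERT INTO sales VALUES
        ('2024-01-01', 10), ('2024-03-01', 20),
        ('2024-07-01', 30), ('2024-08-01', 40);
""")

# Correlated subquery over a 6-month date range ending at each row's month.
query = """
    SELECT s.month,
           (SELECT SUM(s2.amount)
            FROM sales s2
            WHERE s2.month BETWEEN date(s.month, '-5 months') AND s.month
           ) AS rolling_6m
    FROM sales s
    ORDER BY s.month;
"""
rows = conn.execute(query).fetchall()
```

The optimization talking points: replace the correlated subquery with a `SUM(...) OVER (ORDER BY ... RANGE ...)` window where the engine supports date ranges, and index the date column so the range probe is not a full scan.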
What is your notice period, and are you interviewing elsewhere?
What optimizations would you apply for partitioning strategies?
What technologies are you most comfortable with?
Write a SQL query to find employees earning the second-highest salary.
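The classic answer uses `DISTINCT` salaries with `LIMIT 1 OFFSET 1` (or a nested `MAX` subquery), which also handles ties at the top. A runnable SQLite sketch with a hypothetical `employees(name, salary)` schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (name TEXT, salary INTEGER);
    INSERT INTO employees VALUES
        ('ann', 90), ('bob', 120), ('cat', 120), ('dan', 100);
""")

# DISTINCT makes ties at the top count as one salary level.
query = """
    SELECT name, salary
    FROM employees
    WHERE salary = (SELECT DISTINCT salary FROM employees
                    ORDER BY salary DESC LIMIT 1 OFFSET 1);
"""
second_highest = conn.execute(query).fetchall()
```

With `bob` and `cat` tied at 120, the second-highest salary level is 100, so `dan` is returned; a `DENSE_RANK()` window query is the other interview-standard formulation.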
Write a SQL query to find the top 5 products by sales per region.
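Top-N-per-group is usually answered with `ROW_NUMBER()` partitioned by region. A SQLite sketch (window functions need SQLite >= 3.25) with a hypothetical `sales(region, product, amount)` schema; the toy data has only two products per region, so `rnk <= 5` keeps everything:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, product TEXT, amount REAL);
    INSERT INTO sales VALUES
        ('east', 'widget', 300), ('east', 'gadget', 500),
        ('west', 'widget', 200), ('west', 'gizmo', 100);
""")

# Rank products within each region by total sales, then keep the top 5.
query = """
    SELECT region, product, total FROM (
        SELECT region, product, SUM(amount) AS total,
               ROW_NUMBER() OVER (PARTITION BY region
                                  ORDER BY SUM(amount) DESC) AS rnk
        FROM sales
        GROUP BY region, product
    )
    WHERE rnk <= 5
    ORDER BY region, total DESC;
"""
top5 = conn.execute(query).fetchall()
```

A good follow-up point: `DENSE_RANK()` instead of `ROW_NUMBER()` changes how ties at the cutoff are treated.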
Describe your approach to managing offsets in Kafka.
Explain how you would design a partition strategy for a large dataset in HDFS.
Explain the architecture of Kafka and its core components.
Explain your choice of streaming framework (Kafka, Spark Streaming, etc.).
How do you handle out-of-memory errors in Spark jobs?
How do you reduce shuffle operations in Spark?
How does Kafka ensure message durability and reliability?
How does Spark execute a job? Explain the DAG and stages.
How does lazy evaluation work in Spark?
Implement a Kafka consumer that writes streaming data into a database.
Implement a PySpark job to read CSV data, perform joins, and store output as partitioned Parquet.
What are the different delivery semantics in Kafka (at-least-once, at-most-once, exactly-once)?
What is the role of Zookeeper in Kafka?
Write a PySpark code snippet to filter rows with a specific condition.