Spark & Big Data questions from Fragma Data Systems data engineering interviews.
These Spark and big data questions are sourced from Fragma Data Systems data engineering interviews, and each includes an expert-level answer. The set leans toward senior-level depth: 18 of the 29 questions are tagged hard. Recurring themes are Spark, partitioning, and optimization; these patterns appear most often in real interviews and reward the deepest preparation. Many of the questions also surface at Dunnhumby and Delivery Hero, so the preparation transfers across companies. The average answer takes about a minute to read, so plan roughly an hour to work through the full set thoughtfully.
This collection contains 29 curated questions: 4 easy, 7 medium, and 18 hard. The distribution skews toward harder problems, reflecting the depth expected in senior-level interviews.
The most frequently tested areas in this set are Spark (23 questions), partitioning (20), optimization (17), joins (13), SQL (9), and Python (7). Focusing on these topics will give you the highest return on your preparation time.
Start with the easy questions to warm up and solidify fundamentals. Medium-difficulty questions form the bulk of real interviews — spend the most time here and practice explaining your reasoning out loud. Hard questions often appear in senior and staff-level rounds; attempt them after you're comfortable with the basics. For each question, try answering before revealing the solution. Use our AI Mock Interview to simulate real interview conditions and get instant feedback on your responses.
What is the difference between repartition and coalesce in Apache Spark?
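A minimal sketch of the behavioral difference, assuming a local SparkSession (the app name and counts are illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("repartition-vs-coalesce").getOrCreate()
df = spark.range(1_000_000)  # toy dataset

# repartition(n) triggers a full shuffle and can increase or decrease the
# partition count; data is redistributed roughly evenly across partitions.
evenly_spread = df.repartition(200)

# coalesce(n) only merges existing partitions (no full shuffle), so it can
# only reduce the count; it is cheaper but may leave partitions uneven.
fewer_files = df.coalesce(10)

print(evenly_spread.rdd.getNumPartitions())  # 200
print(fewer_files.rdd.getNumPartitions())    # 10
```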
What is the difference between narrow and wide transformations in Apache Spark? Explain with examples.
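A quick RDD sketch of the contrast, assuming a local SparkSession (names are illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("narrow-vs-wide").getOrCreate()
rdd = spark.sparkContext.parallelize([("a", 1), ("b", 2), ("a", 3)])

# Narrow: each output partition depends on exactly one input partition,
# so no shuffle is needed.
mapped = rdd.mapValues(lambda v: v * 10)

# Wide: an output partition may need rows from every input partition,
# which forces a shuffle and a stage boundary.
reduced = rdd.reduceByKey(lambda a, b: a + b)
print(reduced.collect())  # e.g. [('b', 2), ('a', 4)]
```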
Describe the difference between Spark RDDs, DataFrames, and Datasets.
Explain the difference between Spark's map() and flatMap() transformations.
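A minimal sketch of the contrast, again assuming a local SparkSession:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("map-vs-flatmap").getOrCreate()
rdd = spark.sparkContext.parallelize(["hello world", "spark rocks"])

# map: exactly one output element per input element (here, a list per line)
print(rdd.map(lambda line: line.split(" ")).collect())
# -> [['hello', 'world'], ['spark', 'rocks']]

# flatMap: each input may emit zero or more elements, flattened into one RDD
print(rdd.flatMap(lambda line: line.split(" ")).collect())
# -> ['hello', 'world', 'spark', 'rocks']
```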
How does Spark's Catalyst Optimizer work? Explain its stages.
What is the difference between Managed and External tables in Hive/Spark?
Explain the concept of Broadcast Join in Spark. When should it be used?
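A sketch of the explicit broadcast hint, assuming two Parquet tables (the paths and the join key are illustrative):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("broadcast-join").getOrCreate()

facts = spark.read.parquet("s3://bucket/facts/")       # large table
dims = spark.read.parquet("s3://bucket/dim_country/")  # small lookup table

# broadcast() hints Spark to ship the small table to every executor,
# replacing a shuffle-based join with a map-side hash join.
joined = facts.join(broadcast(dims), on="country_code", how="left")
joined.explain()  # the plan should show BroadcastHashJoin
```

Spark also broadcasts automatically when the smaller side falls under spark.sql.autoBroadcastJoinThreshold (10 MB by default); the explicit hint helps when table statistics mislead the optimizer.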
How do you optimize Spark jobs for better performance? Mention at least 5 techniques.
Design an anti-skew strategy for a join on a high-cardinality key with a long-tail distribution (e.g., a few keys hold 80% of rows). Cover salting, split-skew, AQE, and cost/operational trade-offs.
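One possible salting sketch, assuming a large table skewed on user_id joined to a smaller table (the paths, column names, and bucket count are all illustrative):

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

SALT_BUCKETS = 16  # tune to the skew factor

spark = SparkSession.builder.appName("salted-join").getOrCreate()
big = spark.read.parquet("s3://bucket/events/")  # skewed on user_id
small = spark.read.parquet("s3://bucket/users/")

# Add a random salt to the skewed side so each hot key spreads across
# SALT_BUCKETS partitions instead of landing in one.
big_salted = big.withColumn("salt", (F.rand() * SALT_BUCKETS).cast("int"))

# Replicate each row of the other side once per salt value so every
# salted key still finds its match.
salts = spark.range(SALT_BUCKETS).select(F.col("id").cast("int").alias("salt"))
small_salted = small.crossJoin(salts)

joined = big_salted.join(small_salted, on=["user_id", "salt"]).drop("salt")
```

On Spark 3.x, enabling spark.sql.adaptive.enabled and spark.sql.adaptive.skewJoin.enabled lets AQE split oversized shuffle partitions automatically, which often avoids manual salting at the cost of less predictable plans; the trade-off is the replication overhead of salting versus trusting runtime statistics.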
Explain the benefits of using DataFrames over RDDs.
How do you optimize Spark jobs for performance?
How would you implement a sliding window aggregation in Spark Structured Streaming?
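A sketch of a 10-minute window sliding every 5 minutes, assuming a Kafka source with the spark-sql-kafka connector on the classpath (the broker and topic names are illustrative):

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("sliding-window").getOrCreate()

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "clicks")
    .load()
    .select(
        F.col("timestamp").alias("event_time"),
        F.col("value").cast("string").alias("payload"),
    )
)

# 10-minute windows sliding every 5 minutes; the watermark bounds state
# size by dropping events more than 15 minutes late.
counts = (
    events.withWatermark("event_time", "15 minutes")
    .groupBy(F.window("event_time", "10 minutes", "5 minutes"))
    .count()
)

query = counts.writeStream.outputMode("update").format("console").start()
query.awaitTermination()
```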
What is Spark's Catalyst Optimizer? Explain its stages.
What is the difference between Spark RDDs, DataFrames, and Datasets?
When and how do you use Broadcast Join in Spark?
Describe your approach to managing offsets in Kafka.
Explain how you would design a partition strategy for a large dataset in HDFS.
Explain the architecture of Kafka and its core components.
Explain your choice of streaming framework (Kafka, Spark Streaming, etc.).
How do you handle out-of-memory errors in Spark jobs?
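A configuration-side sketch; all values below are illustrative and depend on cluster size and data volume (skew fixes and avoiding collect() on large results matter just as much):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("memory-tuned")
    .config("spark.executor.memory", "8g")          # heap per executor
    .config("spark.executor.memoryOverhead", "2g")  # off-heap headroom (Python workers, shuffle buffers)
    .config("spark.sql.shuffle.partitions", "400")  # more, smaller shuffle blocks per task
    .getOrCreate()
)
```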
How do you reduce shuffle operations in Spark?
How does Kafka ensure message durability and reliability?
How does Spark execute a job? Explain the DAG and stages.
How does lazy evaluation work in Spark?
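A tiny illustration, assuming a local SparkSession: the transformations below only build a logical plan, and nothing executes until the action at the end.

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("lazy-eval").getOrCreate()
df = spark.range(10_000_000)

# Nothing has run yet: filter/withColumn just extend the logical plan.
transformed = df.filter(F.col("id") % 2 == 0).withColumn("doubled", F.col("id") * 2)

# The action triggers planning, optimization, and execution of the whole chain.
print(transformed.count())
```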
Implement a Kafka consumer that writes streaming data into a database.
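A minimal at-least-once sketch using the kafka-python client with SQLite as a stand-in database (the topic, broker, and table schema are illustrative):

```python
import json
import sqlite3

from kafka import KafkaConsumer  # kafka-python

db = sqlite3.connect("events.db")
db.execute("CREATE TABLE IF NOT EXISTS events (id TEXT, payload TEXT)")

consumer = KafkaConsumer(
    "orders",
    bootstrap_servers="broker:9092",
    group_id="orders-writer",
    enable_auto_commit=False,  # commit only after a successful DB write
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

for record in consumer:
    event = record.value
    db.execute(
        "INSERT INTO events (id, payload) VALUES (?, ?)",
        (event.get("id"), json.dumps(event)),
    )
    db.commit()
    consumer.commit()  # at-least-once: the offset advances only after the write lands
```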
Implement a PySpark job to read CSV data, perform joins, and store output as partitioned Parquet.
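One possible shape of such a job; the paths, column names, and join key are all illustrative:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()

orders = spark.read.csv("s3://bucket/orders.csv", header=True, inferSchema=True)
customers = spark.read.csv("s3://bucket/customers.csv", header=True, inferSchema=True)

enriched = orders.join(customers, on="customer_id", how="inner")

# Partition the output by the columns downstream queries filter on,
# so readers can prune whole directories.
(
    enriched.write.mode("overwrite")
    .partitionBy("country", "order_date")
    .parquet("s3://bucket/enriched_orders/")
)
```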
What are the different delivery semantics in Kafka (at-least-once, at-most-once, exactly-once)?
What is the role of Zookeeper in Kafka?
Write a PySpark code snippet to filter rows with a specific condition.
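A minimal example, assuming a Parquet source with status and amount columns (the path and condition are illustrative):

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("filter-demo").getOrCreate()
df = spark.read.parquet("s3://bucket/transactions/")

# Keep completed transactions above a threshold; filter() and where() are equivalent.
high_value = df.filter((F.col("status") == "COMPLETED") & (F.col("amount") > 1000))
high_value.show()
```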
Get full access to 1,800+ expert answers, AI mock interviews, and personalized progress tracking.