Question 1

What is the difference between cache() and persist() in Spark? When would you use each?

Accepted Answer

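**cache()**: Shorthand for `persist()` with the default storage level. For DataFrames and Datasets the default is `MEMORY_AND_DISK`: partitions are kept in memory and spill to disk if memory is insufficient. For RDDs the default is `MEMORY_ONLY`: partitions that do not fit are dropped and recomputed when needed.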

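**persist(storage_level)**: Explicit control over storage: MEMORY_ONLY, MEMORY_AND_DISK, MEMORY_ONLY_SER, MEMORY_AND_DISK_SER, DISK_ONLY. The `_SER` levels apply to Scala/Java; in PySpark, stored objects are always serialized, so only the non-`_SER` levels are exposed.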

**Architectural Logic (Why It Matters)**: Caching trades memory/disk for recomputation cost....
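As a minimal sketch of the two calls, the helper below marks a DataFrame for reuse and forces materialization. The function name `materialize` is hypothetical; it only relies on the PySpark DataFrame methods `cache()`, `persist()`, `count()` and the `storageLevel` property.

```python
def materialize(df, storage_level=None):
    """Mark df for reuse and run one action so the partitions are actually stored.

    storage_level=None uses cache() (the default level); otherwise pass a
    pyspark.StorageLevel such as StorageLevel.DISK_ONLY.
    """
    if storage_level is None:
        df.cache()
    else:
        df.persist(storage_level)
    # cache() and persist() are lazy: nothing is stored until an action runs.
    df.count()
    return df.storageLevel
```

With a real session this could be called as `materialize(df)` or, after `from pyspark import StorageLevel`, as `materialize(df, StorageLevel.DISK_ONLY)`. Calling `df.unpersist()` once the data is no longer reused releases the stored blocks.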

Question 2

What is the difference between groupByKey and reduceByKey in Spark?

Accepted Answer

**groupByKey()**: Shuffles all (key, value) pairs to group values per key. Transfers O(total_values) over the network. There is no local aggregation; you combine the values afterward. High memory and network cost.

**reduceByKey(func)**: Performs a local reduce (e.g., sum) within each partition before the shuffle, so each partition sends at most one aggregated value per key. The shuffle carries roughly O(partitions × unique_keys) records instead of O(total_values). Values are combined locally first, then across partitions.

**Architectural Logic (Why reduceByKey Wins)**: Shuffle is the bottleneck....
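The difference in shuffle volume can be illustrated with a toy model in plain Python. This is not Spark's implementation; the function names and the sample partitions are assumptions made for illustration, and "shuffled records" simply counts the records that would leave each partition.

```python
def shuffled_records_group_by_key(partitions):
    """groupByKey ships every (key, value) pair across the shuffle."""
    return sum(len(part) for part in partitions)


def map_side_combine(partition, func):
    """Reduce values per key inside one partition, before any shuffle."""
    combined = {}
    for key, value in partition:
        combined[key] = func(combined[key], value) if key in combined else value
    return combined


def reduce_by_key(partitions, func):
    """Return (final result, number of records that cross the shuffle)."""
    combined_parts = [map_side_combine(p, func) for p in partitions]
    # After local combining, each partition sends one record per distinct key.
    shuffled = sum(len(c) for c in combined_parts)
    result = {}
    for part in combined_parts:
        for key, value in part.items():
            result[key] = func(result[key], value) if key in result else value
    return result, shuffled


partitions = [
    [("a", 1), ("a", 2), ("b", 3)],
    [("a", 4), ("b", 5), ("b", 6), ("c", 7)],
]
totals, shuffled = reduce_by_key(partitions, lambda x, y: x + y)
print(totals, shuffled, shuffled_records_group_by_key(partitions))
```

In this model, reduceByKey moves 5 records where groupByKey moves 7; the gap widens as values per key grow. In PySpark the corresponding calls would be `rdd.reduceByKey(lambda x, y: x + y)` and `rdd.groupByKey().mapValues(sum)`.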

Question 3

Describe the difference between Spark RDDs, DataFrames, and Datasets.

Accepted Answer

**RDD**: Low-level, immutable, JVM-object based. No Catalyst optimization; full control, but all tuning is manual. **DataFrame**: A schema-driven collection of `Row` objects; optimized by Catalyst and Tungsten. Untyped at compile time. **Dataset**: Typed extension of DataFrame (Scala/Java only); Catalyst optimization plus compile-time type safety. **Why the evolution**: RDDs predate the optimizer; DataFrames brought speedups, sometimes quoted as 10–100x depending on the workload, via predicate pushdown, columnar execution, and code generation. Datasets add type safety without losing optimization. **When to use**: DataFrame for 95% of workloads....

Question 4

Explain strategies for managing schema changes in PySpark over time.

Accepted Answer

Schema evolution in PySpark is architecturally driven by two competing forces: storage economics (rewriting entire datasets is costly) and query correctness (downstream consumers break when schemas shift). **Why it matters**: At petabyte scale, a full rewrite to add a single column can cost thousands of dollars in compute and hours of downtime. **Strategies with trade-offs**: (1) mergeSchema: additive changes only, zero rewrite cost, but schema drift accumulates; use for append-heavy pipelines....

Question 5

How do you handle data skewness in Spark?

Accepted Answer

Skew occurs when a few keys hold a disproportionate share of the data, causing hotspot tasks and stragglers. **Why it matters**: One task taking 10x longer than the rest blocks the entire stage, and cluster utilization drops. **Strategies with trade-offs**: (1) **Salting**: Add a random suffix to skewed keys; this distributes the load but requires two-phase aggregation (first with the salt, then with the salt removed). Cost: a second shuffle....
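The two-phase aggregation behind salting can be sketched in plain Python. This is a model of the idea rather than Spark code; `salted_sum`, `num_salts`, and the sample records are assumptions chosen for illustration.

```python
import random
from collections import Counter


def salted_sum(records, num_salts, seed=0):
    """Sum values per key in two phases, spreading each key over num_salts buckets."""
    rng = random.Random(seed)

    # Phase 1: aggregate on (key, salt). A hot key is now split across up to
    # num_salts groups, so no single group receives all of its records.
    partial = Counter()
    for key, value in records:
        salt = rng.randrange(num_salts)
        partial[(key, salt)] += value

    # Phase 2: drop the salt and aggregate the much smaller partial results.
    final = Counter()
    for (key, _salt), subtotal in partial.items():
        final[key] += subtotal
    return dict(final), partial


records = [("hot", 1)] * 1000 + [("cold", 1)] * 10
totals, partial = salted_sum(records, num_salts=8)
largest_hot_group = max(v for (k, _), v in partial.items() if k == "hot")
print(totals, largest_hot_group)
```

In PySpark, the same pattern is usually expressed by adding a salt column (for example with `floor(rand() * n)` from `pyspark.sql.functions`), grouping by key and salt, then grouping again by key alone.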

Question 6

What is the difference between Spark RDDs, DataFrames, and Datasets?

Accepted Answer

**RDD**: Low-level, immutable, partitioned collection of objects; no schema; no Catalyst; in PySpark, RDD functions run in Python worker processes, so records are serialized between the JVM and Python. **DataFrame**: Rows with named columns; Catalyst + Tungsten; untyped (Row). **Dataset (Scala/Java)**: Typed DataFrame; compile-time type safety; same optimization as DataFrame. **Architectural trade-off**: RDD gives full control (custom partitioner, arbitrary types) but no optimizer help; DataFrame/Dataset trade control for a speedup that can reach 5–10x on analytical workloads, depending on the query....

Question 7

What is the difference between repartition and coalesce in Spark?

Accepted Answer

**repartition(n)**: Full shuffle; creates exactly n partitions; data is rebalanced. Can increase or decrease the number of partitions. **coalesce(n)**: Merges existing partitions without a full shuffle; on a DataFrame it can only decrease the partition count; partitions are combined as they are (no data movement across nodes for already-co-located partitions), so the result can be unevenly sized. **Why it matters**: repartition is expensive (network I/O); coalesce is cheap when reducing....
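The trade-off can be pictured with a simplified model in plain Python. It is an assumption-laden sketch, not how Spark assigns records: here `coalesce` merges neighbouring partitions as they are, while `repartition` deals every record round-robin, which stands in for the full shuffle.

```python
def coalesce(partitions, n):
    """Merge neighbouring partitions into n groups without splitting any of them."""
    if n >= len(partitions):
        return [list(p) for p in partitions]
    merged = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        # Whole partitions are appended, so existing imbalance is preserved.
        merged[i * n // len(partitions)].extend(part)
    return merged


def repartition(partitions, n):
    """Redistribute every record round-robin, as a stand-in for a full shuffle."""
    out = [[] for _ in range(n)]
    i = 0
    for part in partitions:
        for record in part:
            out[i % n].append(record)
            i += 1
    return out


# One large partition and three tiny ones.
partitions = [list(range(100)), [100], [101], [102]]
coalesced_sizes = [len(p) for p in coalesce(partitions, 2)]
repartitioned_sizes = [len(p) for p in repartition(partitions, 2)]
print(coalesced_sizes, repartitioned_sizes)
```

In this model, coalescing to two partitions leaves sizes of 101 and 2, while repartitioning yields 52 and 51. On a real DataFrame the calls are `df.coalesce(n)` and `df.repartition(n)`, and `df.rdd.getNumPartitions()` reports the resulting count.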

Question 8

How do you manage schema changes in PySpark when processing data over time?

Accepted Answer

**Why it matters**: At scale, design choices directly impact reliability, latency, and cost. Wrong decisions compound across jobs and teams.

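Schema changes in PySpark: (1) `mergeSchema` when reading Parquet or writing to Delta; (2) an explicit schema on read, or `unionByName(other, allowMissingColumns=True)` (Spark 3.1+) to align DataFrames whose column sets differ; (3) add new columns with a default value; (4) use Delta schema evolution. Example: `spark.read.option('mergeSchema', 'true').parquet(path)`....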
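A minimal sketch of two of these options follows. The helper names `read_merged` and `append_with_missing_columns` are hypothetical; the calls inside them (`spark.read.option(...).parquet(...)` and `DataFrame.unionByName(..., allowMissingColumns=True)`) are the PySpark APIs they wrap.

```python
def read_merged(spark, path):
    """Read a Parquet directory whose files have different but compatible schemas.

    With mergeSchema enabled, Spark reconciles the file schemas into one;
    columns missing from older files come back as null.
    """
    return spark.read.option("mergeSchema", "true").parquet(path)


def append_with_missing_columns(old_df, new_df):
    """Union two DataFrames by column name (requires Spark 3.1+).

    Columns present on only one side are filled with null on the other.
    """
    return old_df.unionByName(new_df, allowMissingColumns=True)
```

Schema merging on read is off by default because inspecting every file's schema is relatively expensive, so it is usually enabled only for the reads that need it.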

Accenture Spark & Big Data Interview Questions
