Spark & Big Data questions from Coforge data engineering interviews.
These Spark and big data questions are sourced from Coforge data engineering interviews. Each includes an expert-level answer.
What is the difference between cache() and persist() in Spark? When would you use each?
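As a hint at the answer: `cache()` is shorthand for `persist()` at the default storage level, while `persist()` lets you choose a `StorageLevel` explicitly. The payoff of either is avoiding recomputation of a lazy lineage. A plain-Python analogy (not the Spark API) of that recomputation-avoidance idea:

```python
# Plain-Python analogy (not Spark's API): Spark re-runs a lazy lineage on
# every action unless the intermediate result is cached/persisted.
compute_count = 0

def expensive_transform(data):
    global compute_count
    compute_count += 1            # count how often the "lineage" is recomputed
    return [x * 2 for x in data]

data = [1, 2, 3]

# Without caching: every "action" re-runs the whole transformation.
expensive_transform(data)
expensive_transform(data)
assert compute_count == 2

# With caching: compute once, then reuse the materialized result.
cached = expensive_transform(data)   # third and final computation
total = sum(cached)                  # "action" 1 reuses the cached result
size = len(cached)                   # "action" 2 reuses it too
assert compute_count == 3
```

In real Spark you would reach for `persist()` when memory-only caching is too fragile for the dataset size (e.g. a memory-and-disk or disk-only level); `cache()` when the default level is fine.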
What is the difference between groupByKey and reduceByKey in Spark?
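A plain-Python sketch (not the Spark API) of why `reduceByKey` is usually preferred: it can combine values per key on each partition before the shuffle, so only one partial result per key crosses the network, whereas `groupByKey` ships every individual value to the reducer.

```python
from collections import defaultdict

pairs = [("a", 1), ("b", 2), ("a", 3), ("b", 4), ("a", 5)]

# groupByKey-style: collect every value per key, reduce only afterwards.
# In Spark, all of these values would travel across the shuffle.
grouped = defaultdict(list)
for k, v in pairs:
    grouped[k].append(v)
group_sums = {k: sum(vs) for k, vs in grouped.items()}

# reduceByKey-style: fold each value into a running total per key as it
# arrives, so only one partial sum per key would cross a shuffle boundary.
reduced = {}
for k, v in pairs:
    reduced[k] = reduced.get(k, 0) + v

print(group_sums)  # {'a': 9, 'b': 6}
print(reduced)     # {'a': 9, 'b': 6}
```

Both produce the same sums, but the `reduceByKey`-style fold never materializes the full list of values per key, which is exactly the map-side combine that makes it cheaper at scale.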
What is the difference between narrow and wide transformations in Apache Spark? Explain with examples.
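A plain-Python analogy (not the Spark API) of the distinction: in a narrow transformation each output partition depends on exactly one input partition, so no data moves; in a wide transformation an output partition may need rows from many input partitions, which forces a shuffle.

```python
from collections import defaultdict

parts = [[1, 2], [3, 4]]  # a "dataset" split into two partitions

# Narrow transformation analogy: each output partition is computed from one
# input partition in place (like Spark's map or filter).
narrow = [[x * 10 for x in p] for p in parts]

# Wide transformation analogy: rows must be redistributed by key before the
# result can be built (like Spark's groupByKey, reduceByKey, or join).
buckets = defaultdict(list)
for p in parts:
    for x in p:
        buckets[x % 2].append(x)   # "shuffle" every row to its key's bucket

print(narrow)          # [[10, 20], [30, 40]]
print(dict(buckets))   # {1: [1, 3], 0: [2, 4]}
```

The narrow step touches each partition independently; the wide step has to read from both partitions to fill either bucket, which is why wide transformations mark stage boundaries in Spark.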
Can you explain the architecture of Apache Spark and its components?
When would you architecturally choose Dataset[T] over DataFrame in a Scala Spark pipeline, and what are the scalability and portability trade-offs? Include type-safety benefits vs. operational constraints.
Can you explain dynamic resource allocation in Spark? How does it help optimize job performance?
Explain the DAG in Spark and how it plays a role in execution.
Have you worked with UDFs in Spark? When do you use them, and how do they differ from built-in functions?
How do you handle schema evolution in Spark, especially when reading data from sources like Parquet or Avro?
How do you handle very large datasets in Spark to ensure scalability and efficiency?
How many stages are created in a Spark job, and how are they formed?
How would you handle unstructured data in Hive?
What are the key performance tuning techniques you apply in Spark jobs to improve performance?
What is data shuffling in Spark, and how do you minimize its impact on job performance?
What is one disadvantage of using Scala for data engineering tasks?
What is the command to import data from HDFS to Hive?
What is the difference between map and flatMap in Spark transformations?
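The semantic difference is easy to show with a plain-Python sketch (list comprehensions standing in for the Spark transformations): `map` yields exactly one output element per input element, while `flatMap` may yield zero or more and flattens the results.

```python
lines = ["hello world", "spark rocks"]

# map analogy: one output per input, so splitting gives a nested result.
mapped = [line.split(" ") for line in lines]
# → [['hello', 'world'], ['spark', 'rocks']]

# flatMap analogy: each input may produce many outputs, flattened into
# a single collection of tokens.
flat_mapped = [token for line in lines for token in line.split(" ")]
# → ['hello', 'world', 'spark', 'rocks']

print(mapped)
print(flat_mapped)
```

This is why word-count examples use `flatMap` for tokenization: the downstream pairing and reduction want a flat stream of words, not a list per line.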
What is the difference between partitioning and repartitioning in Spark, and when do you use each?
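A plain-Python analogy (not the Spark API) may help frame the answer: a partition is simply one chunk of the distributed dataset, and repartitioning redistributes all rows into a new number of chunks, which in Spark means a full shuffle.

```python
# Plain-Python analogy: a "partitioned" dataset is data split into chunks.
data = list(range(10))

def partition(rows, n):
    # round-robin split into n partitions (roughly what a hash shuffle does)
    parts = [[] for _ in range(n)]
    for i, x in enumerate(rows):
        parts[i % n].append(x)
    return parts

parts4 = partition(data, 4)  # initial partitioning into 4 chunks

# repartition-style: gather every row and redistribute into a new chunk
# count (in Spark this is a full shuffle; it can grow or shrink the count).
flattened = [x for p in parts4 for x in p]
parts2 = partition(flattened, 2)

print(len(parts4), len(parts2))  # 4 2
```

Worth mentioning in an answer: Spark also offers `coalesce()`, which shrinks the partition count by merging neighbors without a full shuffle, making it cheaper than `repartition()` when you only need fewer partitions.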
Download the complete interview prep bundle with expert answers. Study offline, on your commute, anywhere.