Spark Performance Tuning: 15 Interview Questions That Separate Senior Engineers from Juniors (2026)
Senior Spark interviews at Amazon, Databricks, and Meta focus on performance tuning, not API syntax. Master these 15 questions to prove you've run Spark at scale.
Key Takeaways
- Why Performance Tuning Is the Senior Engineer Bar
- 1. Repartition vs Coalesce – When Each Destroys Performance
- 2. Data Skew – The #1 Production Spark Problem
- 3. Broadcast Joins – Size Limits and Hidden Costs
Why Performance Tuning Is the Senior Engineer Bar
Every engineer can write df.groupBy().count(). What separates a senior data engineer from a mid-level one is what happens when that query takes 4 hours instead of 20 minutes on a 500-node cluster.
At Amazon, the data engineering loop includes at least one "optimize this Spark job" scenario. Databricks interviews expect you to read a Spark UI screenshot and diagnose bottlenecks. Meta's data engineering interviews include production debugging rounds where Spark tuning is front and center.
This guide covers the 15 questions that test whether you've actually operated Spark in production – not just completed a tutorial. Each answer includes the *why* behind the optimization, because interviewers at senior levels care about reasoning, not just the solution.
1. Repartition vs Coalesce – When Each Destroys Performance
Question: You have a Spark DataFrame with 2,000 partitions and need to write it to S3 as 100 Parquet files. Should you use repartition(100) or coalesce(100)?
Answer:
coalesce(100) – because you're *reducing* partitions. Coalesce avoids a full shuffle by merging adjacent partitions on the same executor. Repartition triggers a full shuffle (hash-based redistribution across the network), which is expensive when you're only trying to reduce file count.
When repartition is better: When you need to *increase* partitions or need *evenly distributed* data. Coalesce can create skewed partitions because it just combines existing ones. If partition 1 has 10GB and partition 2 has 100MB, coalesce will keep that skew.
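The direct answer as a minimal sketch (the output path is illustrative):

```python
# Merge 2,000 partitions down to 100 without a shuffle, then write 100 Parquet files
df.coalesce(100).write.mode("overwrite").parquet("s3://bucket/output/")
```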
Production pattern:
```python
from pyspark.sql.functions import col

# Write with a target layout, not a hard-coded partition count
(df.repartition(col("date"))          # even distribution by date
   .sortWithinPartitions(col("id"))   # sorted for downstream reads
   .write.parquet("s3://bucket/table/"))
```

Follow-up trap: "What about coalesce(1)?" – Never in production. A single partition means a single task, zero parallelism, and a likely OOM for any non-trivial dataset. If you need a single output file, write to a temp location, then use hadoop fs -getmerge.
2. Data Skew – The #1 Production Spark Problem
Question: Your Spark job's shuffle stage shows 199 tasks completing in 30 seconds but 1 task taking 45 minutes. What's happening and how do you fix it?
Answer:
This is data skew – one partition has disproportionately more data than the others. The single slow task is processing the heavy partition while all other executors sit idle.
5 strategies, ranked by production impact:
- Salting – Add a random suffix to the skewed key, perform the join/aggregation, then remove it:

```python
from pyspark.sql.functions import col, concat, lit, floor, rand

SALT_BUCKETS = 10

# Spread the hot key across SALT_BUCKETS synthetic keys
skewed_df = skewed_df.withColumn(
    "salted_key", concat(col("join_key"), lit("_"), floor(rand() * SALT_BUCKETS))
)

# Replicate the other side once per salt value so every salted key finds a match
other_df_exploded = other_df.crossJoin(
    spark.range(SALT_BUCKETS).withColumnRenamed("id", "salt")
).withColumn("salted_key", concat(col("join_key"), lit("_"), col("salt")))

result = skewed_df.join(other_df_exploded, "salted_key")
```

- Broadcast join – If the non-skewed side fits in memory (< 8GB), broadcast it.
- Adaptive Query Execution (AQE) – Set spark.sql.adaptive.enabled = true and spark.sql.adaptive.skewJoin.enabled = true (Spark 3.0+). AQE detects skew at runtime and splits oversized partitions automatically.
- Isolate + union – Filter out the hot key, process it separately (maybe with a broadcast), then UNION ALL the results (sketched after this list).
- Pre-aggregate – If you're joining then aggregating, aggregate first to reduce the skewed side before the join.
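A minimal sketch of the isolate + union pattern, assuming the hot key value is already known (the HOT_KEY literal and DataFrame names are hypothetical):

```python
from pyspark.sql.functions import broadcast, col

HOT_KEY = "user_0"  # hypothetical known hot key

hot = skewed_df.filter(col("join_key") == HOT_KEY)
rest = skewed_df.filter(col("join_key") != HOT_KEY)

# The hot key joins via broadcast (no shuffle); everything else takes the normal path
hot_joined = hot.join(broadcast(other_df), "join_key")
rest_joined = rest.join(other_df, "join_key")

result = hot_joined.unionByName(rest_joined)
```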
3. Broadcast Joins – Size Limits and Hidden Costs
Question: Explain broadcast joins in Spark. When should you NOT use them?
Answer:
A broadcast join sends the smaller DataFrame to every executor, avoiding a shuffle on the larger side. Spark automatically broadcasts tables under spark.sql.autoBroadcastJoinThreshold (default: 10MB).
```python
from pyspark.sql.functions import broadcast

result = large_df.join(broadcast(small_df), "key")
```

When NOT to broadcast:
- Table exceeds driver memory (causes OOM on the driver, not executors)
- Many concurrent jobs broadcasting simultaneously (network saturation)
- The "small" table is actually 5GB β executors need to deserialize it, eating into executor memory
Senior-level insight: The 10MB default is almost always too conservative for production. Most teams set it to 100MB–500MB. But the real limit isn't the threshold – it's spark.driver.maxResultSize (default 1GB), because the driver serializes the broadcast table first.
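A minimal sketch of raising the threshold for a session (the 200MB figure is illustrative, not a recommendation from this guide):

```python
# Takes a byte count (or a size string like "200MB"); set to -1 to disable auto-broadcast
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 200 * 1024 * 1024)
```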
Follow-up: "What about broadcast hash join vs sort-merge join?" β Broadcast hash is O(1) lookup per row. Sort-merge join requires both sides sorted, which means a full shuffle, but it works for tables of any size and streams data (lower memory footprint).
4. Spark UI Deep Dive – Reading the DAG
Question: Walk me through how you debug a slow Spark job using the Spark UI.
Answer (structured for interviews):
- Jobs tab – Find the slow job. Check if it's a single stage or multiple stages that are slow.
- Stages tab – For the slow stage:
  - Look at task duration distribution. If the 95th-percentile task time >> median, you have skew.
  - Check shuffle read/write sizes. High shuffle write = expensive wide transformation.
  - Look at GC time. If GC > 10% of task time, increase executor memory or reduce object creation.
- Executors tab – Check for:
  - Executors with disk spill. If shuffle spill > 0, increase spark.sql.shuffle.partitions or executor memory.
  - Uneven task distribution. If one executor ran 500 tasks and others ran 50, check data locality.
- SQL tab – The physical plan shows you exactly which join strategy Spark chose (BroadcastHashJoin, SortMergeJoin, etc.) and where exchanges (shuffles) happen.
The pattern senior engineers follow: Slow job → identify the slow stage → check task metrics (skew? GC? spill?) → trace back to the transformation in code → fix the root cause (partition strategy, join type, or data layout).
5. Shuffle Optimization – The Biggest Performance Lever
Question: What is a shuffle in Spark and how do you minimize shuffle overhead?
Answer:
A shuffle is a data exchange between executors – Spark serializes data, writes it to local disk, transfers it over the network, and deserializes it on the receiving side. Shuffles are expensive because they involve disk I/O, network I/O, and serialization.
Operations that trigger shuffles: groupBy, join (non-broadcast), repartition, distinct, orderBy, window functions with PARTITION BY.
Minimization strategies:
- Reduce data before shuffling – Filter and project early. A WHERE before a JOIN can reduce shuffle data by 90%.
- Use broadcast joins – Eliminate the shuffle entirely for small tables.
- Bucket tables – Pre-shuffle at write time. If table A and table B are both bucketed by user_id into 200 buckets, joins on user_id require zero shuffle (see the sketch after this list).
- Tune partition count – spark.sql.shuffle.partitions defaults to 200, which is too low for big data and too high for small data. Rule of thumb: target 128MB–256MB per partition.
- Enable AQE – Adaptive Query Execution coalesces small shuffle partitions automatically.
- Avoid unnecessary sorts – orderBy triggers a full shuffle + sort. Use sortWithinPartitions if global order isn't needed.
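A minimal bucketing sketch (table and DataFrame names are hypothetical; bucketBy requires saveAsTable):

```python
# Write both sides bucketed by the join key into 200 buckets each.
# Later joins on user_id between these tables skip the shuffle.
(events_df.write.bucketBy(200, "user_id").sortBy("user_id")
          .mode("overwrite").saveAsTable("bucketed_events"))

(users_df.write.bucketBy(200, "user_id").sortBy("user_id")
         .mode("overwrite").saveAsTable("bucketed_users"))
```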
6. Memory Management – OOM Debugging
Question: Your Spark job fails with java.lang.OutOfMemoryError: GC overhead limit exceeded. Walk through your debugging process.
Answer:
- Check which memory area overflowed:
  - Executor memory (spark.executor.memory) – increase if tasks process large partitions
  - Driver memory (spark.driver.memory) – increase if collect(), broadcast(), or toPandas() is used
  - Overhead memory (spark.executor.memoryOverhead) – increase for heavy UDF usage or off-heap operations
- Common root causes:
  - Skewed partition (one partition has 50GB of data)
  - collect() or toPandas() on a large DataFrame (brings all data to the driver)
  - Expensive UDFs creating many short-lived objects (GC pressure)
  - Too few partitions (e.g., 10 partitions for 100GB = 10GB per task)
- Production fix checklist:
```python
from pyspark.sql import SparkSession

# Items 2-4 are launch-time settings: pass them via spark-submit --conf or the
# session builder *before* executors start; spark.conf.set() can't change them later.
spark = (SparkSession.builder
    # 2. Use Kryo serialization (2-10x faster, smaller payloads)
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    # 3. Set memory fractions
    .config("spark.memory.fraction", "0.8")           # default 0.6
    .config("spark.memory.storageFraction", "0.3")    # default 0.5
    # 4. Increase off-heap headroom if using Arrow/UDFs
    .config("spark.executor.memoryOverhead", "2g")
    .getOrCreate())

# 1. Increase partitions to reduce per-task memory
df = df.repartition(2000)
```

7. Catalyst Optimizer – How Spark Plans Queries
Question: How does Spark's Catalyst Optimizer work? Walk through the stages.
Answer:
4 phases:
- Analysis – Resolves column names, table references, and data types against the catalog. Unresolved logical plan → resolved logical plan.
- Logical Optimization – Applies rule-based optimizations:
  - Predicate pushdown – Push filters as close to the data source as possible
  - Column pruning – Only read columns that are actually used
  - Constant folding – Evaluate constant expressions at plan time
  - Join reordering – Put smaller tables on the build side of hash joins
- Physical Planning – Converts the logical plan to one or more physical plans. Chooses join strategies (broadcast vs sort-merge vs shuffle-hash) based on table statistics.
- Code Generation (Tungsten) – Generates optimized Java bytecode for the physical plan. Whole-stage codegen fuses multiple operators into a single Java function, reducing virtual function calls.
Senior-level follow-up: "How do you influence the optimizer?" – Use spark.sql.autoBroadcastJoinThreshold, run ANALYZE TABLE to update statistics, use hint() for join strategy (/*+ BROADCAST(t) */), and check df.explain(True) to verify the plan.
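A short sketch of the hint-and-verify loop (DataFrame names are hypothetical):

```python
# Ask for a broadcast on the small side, then confirm it in the physical plan
result = large_df.join(small_df.hint("broadcast"), "key")
result.explain(True)  # expect BroadcastHashJoin instead of SortMergeJoin
```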
8. Dynamic Partition Pruning (DPP)
Question: What is Dynamic Partition Pruning and how does it optimize joins?
Answer:
DPP is a Spark 3.0+ optimization for star-schema queries. When you join a large fact table (partitioned by date) with a small dimension table that filters to specific dates, Spark can push the dimension filter into the fact table scan – pruning partitions at read time.
Without DPP:
- Scan ALL partitions of fact table
- Shuffle both sides
- Join, then filter
With DPP:
- Scan dimension table (small)
- Extract distinct date values from the filter
- Only scan matching partitions of the fact table
- Join (much less data)
```sql
-- DPP kicks in automatically here
SELECT f.*
FROM fact_sales f
JOIN dim_dates d ON f.sale_date = d.date_key
WHERE d.quarter = 'Q1-2026';
```

Production impact: We've seen DPP reduce scan size from 2TB to 40GB on partitioned Hive tables – a 50x improvement. Ensure spark.sql.optimizer.dynamicPartitionPruning.enabled = true (the default in Spark 3.x).
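A quick sanity check from a notebook – a sketch using the conf name above:

```python
# Should print "true" on Spark 3.x (DPP is on by default)
print(spark.conf.get("spark.sql.optimizer.dynamicPartitionPruning.enabled"))
```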
9–12: Rapid-Fire Senior Questions
9. What is the difference between `cache()` and `persist()`?
cache() = persist(StorageLevel.MEMORY_AND_DISK). persist() lets you choose: MEMORY_ONLY, MEMORY_AND_DISK, DISK_ONLY, OFF_HEAP, or serialized variants. Use MEMORY_AND_DISK_SER for large DataFrames to trade CPU for memory savings.
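A minimal sketch of the explicit persist pattern (the DataFrame name is illustrative):

```python
from pyspark import StorageLevel

df.persist(StorageLevel.MEMORY_AND_DISK)  # the level cache() uses for DataFrames
df.count()        # an action materializes the cached data
df.unpersist()    # release memory once the DataFrame is no longer reused
```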
10. Explain narrow vs wide transformations.
Narrow: each input partition maps to at most one output partition (map, filter, union). No shuffle. Wide: input partitions contribute to multiple output partitions (groupBy, join, repartition). Requires a shuffle. Stage boundaries fall at wide transformations – narrow operations are pipelined together within a stage.
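A two-line illustration, assuming hypothetical user_id and amount columns:

```python
from pyspark.sql.functions import col

# Narrow transformations: no shuffle, pipelined together within one stage
cleaned = df.filter(col("amount") > 0).select("user_id", "amount")

# Wide transformation: shuffle, starts a new stage
totals = cleaned.groupBy("user_id").sum("amount")
```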
11. How do you handle the small files problem?
Small files kill read performance (too many file opens, excessive metadata overhead in HDFS/S3). Solutions: coalesce before write, set spark.sql.files.maxRecordsPerFile, use Delta Lake OPTIMIZE / ZORDER, or run compaction jobs (Hive ALTER TABLE CONCATENATE).
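A minimal compaction sketch (paths and counts are illustrative):

```python
# Cap rows per output file, then reduce partition count before writing
spark.conf.set("spark.sql.files.maxRecordsPerFile", 1_000_000)
df.coalesce(200).write.mode("overwrite").parquet("s3://bucket/compacted/")
```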
12. What's the difference between client mode and cluster mode?
Client mode: driver runs on the submitting machine (good for interactive work, spark-shell). Cluster mode: driver runs on a worker node (good for production – if the submitting machine dies, the job continues). Always use cluster mode in production.
13. Delta Lake and Lakehouse Optimization
Question: You're running Spark on Delta Lake. Your read queries are slow despite OPTIMIZE. What else would you check?
Answer:
- Z-ORDER – If queries filter on columns that aren't the partition key, OPTIMIZE table ZORDER BY (col1, col2) co-locates related data (see the sketch after this list). Without Z-ORDER, Spark must scan all files in a partition even with predicate pushdown.
- Partition strategy – Over-partitioning (e.g., by user_id with 10M users) creates millions of tiny directories. Under-partitioning (no partitions) means full table scans. Sweet spot: partition by date or another low-cardinality column, and Z-ORDER by high-cardinality filter columns.
- File size – Delta Lake's default target file size is 1GB. If your files average 10MB after compaction, increase spark.databricks.delta.optimize.maxFileSize.
- Statistics – Delta Lake keeps per-file min/max stats for the first 32 columns. Ensure your frequently filtered columns are among those 32. Check with DESCRIBE DETAIL table.
- Vacuum – Old file versions (kept for time travel) slow S3 list operations. VACUUM table RETAIN 168 HOURS cleans up stale files.
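A minimal maintenance sketch run through spark.sql, assuming a Delta-enabled session and a hypothetical events table:

```python
# Compact small files and co-locate data on the hottest filter columns
spark.sql("OPTIMIZE events ZORDER BY (user_id, event_type)")

# Drop files older than the 7-day retention window (mind time-travel needs)
spark.sql("VACUUM events RETAIN 168 HOURS")
```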
14. Spark Structured Streaming Tuning
Question: Your Spark Structured Streaming job's processing time per micro-batch is increasing over time. Diagnose and fix.
Answer:
Likely causes:
- State store growth – If using groupBy().agg() with watermarking, the state store grows with distinct keys. Check Spark UI → Structured Streaming tab → "Num Rows Total" in state. Fix: tighten the watermark (withWatermark("event_time", "10 minutes")) to drop old state faster.
- Backpressure – Kafka ingestion rate exceeds processing rate. Set maxOffsetsPerTrigger to cap per-batch size.
- Checkpoint bloat – Old checkpoint files accumulate. The state store compacts periodically, but if your trigger interval is very short (100ms), checkpoint overhead dominates.
- Sink bottleneck – Writing to a slow sink (e.g., JDBC to a single Postgres) creates backpressure. Batch writes, increase the connection pool, or use an intermediate Kafka topic.
```python
df = (spark.readStream.format("kafka")                    # maxOffsetsPerTrigger is a
      .option("kafka.bootstrap.servers", "broker:9092")   # *source* option, so it goes
      .option("subscribe", "events")                      # on readStream (host/topic
      .option("maxOffsetsPerTrigger", 100000)             # names are illustrative)
      .load())

(df.writeStream
   .format("console")  # placeholder sink for the sketch
   .option("checkpointLocation", "s3://checkpoints/job-v2")
   .trigger(processingTime="30 seconds")
   .start())
```

15. End-to-End Pipeline Optimization Scenario
Question: You inherit a Spark pipeline that reads 2TB of Parquet from S3, joins with a 500MB dimension table, aggregates by region and product, and writes to Snowflake. It currently takes 6 hours. Bring it under 1 hour.
Senior answer framework:
- Broadcast the dimension table – 500MB fits in memory. Eliminates one shuffle (steps 1-3 are sketched after this list).
- Predicate pushdown – If only yesterday's data is needed, filter on the partition column *before* reading (spark.read.parquet("s3://...").where(col("date") == "2026-04-26")).
- Column pruning – select() only the columns you need before the join. Reading all 200 columns when you need 15 wastes I/O and memory.
- Tune partitions – 2TB / 256MB target ≈ 8,000 partitions. Set spark.sql.shuffle.partitions = 8000.
- Use Kryo serialization – 2-5x faster than Java serialization.
- Write optimization – Batch the Snowflake write. Use the Snowflake Spark connector with sfWarehouse set to a large warehouse, and use a COPY INTO staging approach instead of row-by-row JDBC.
- Enable AQE – Let Spark coalesce small post-shuffle partitions and handle any skew automatically.
- Infrastructure – If on EMR, use Graviton instances (30% cheaper, 20% faster for Spark). If on Databricks, enable Photon.
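A minimal sketch of steps 1-3 combined (paths and column names are hypothetical):

```python
from pyspark.sql.functions import broadcast, col

# Steps 2-3: prune partitions and columns at scan time
fact = (spark.read.parquet("s3://bucket/fact_sales/")
        .where(col("date") == "2026-04-26")                  # partition pruning
        .select("date", "region", "product_id", "amount"))   # column pruning

dim = spark.read.parquet("s3://bucket/dim_product/")  # ~500MB dimension table

# Step 1: broadcast the small side so the 2TB fact table never shuffles for the join
result = (fact.join(broadcast(dim), "product_id")
              .groupBy("region", "product_name")
              .agg({"amount": "sum"}))
```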
Expected result: Steps 1-3 alone typically reduce runtime by 60-70%. Combined with proper partitioning and AQE, the job should complete in 30-45 minutes.
What to Do Next
These 15 questions represent what senior engineers actually face at Amazon, Databricks, Meta, and Google. The key is not memorizing answers but understanding the *why* behind each optimization.
Practice these interactively:
- AI Mock Interview – Spark Focus – Get real-time feedback on your Spark answers
- 1,800+ Data Engineering Questions – Browse all Spark questions by difficulty
- Spark Interview Questions (Beginner to Advanced) – Start here if you need to build fundamentals first
Remember: at the senior level, every answer should include a production anecdote. "I used broadcast joins to reduce a 4-hour pipeline to 20 minutes at [Company]" is 10x more convincing than a textbook definition.