Spark Performance Tuning: 15 Interview Questions That Separate Senior Engineers from Juniors (2026)
Senior Spark interviews at Amazon, Databricks, and Meta focus on performance tuning, not API syntax. Master these 15 questions to prove you've run Spark at scale.
Key Takeaways
- Why Performance Tuning Is the Senior Engineer Bar
- 1. Repartition vs Coalesce – When Each Destroys Performance
- 2. Data Skew – The #1 Production Spark Problem
- 3. Broadcast Joins – Size Limits and Hidden Costs
Why Performance Tuning Is the Senior Engineer Bar
Every engineer can write df.groupBy().count(). What separates a senior data engineer from a mid-level one is what happens when that query takes 4 hours instead of 20 minutes on a 500-node cluster.
At Amazon, the data engineering loop includes at least one "optimize this Spark job" scenario. Databricks interviews expect you to read a Spark UI screenshot and diagnose bottlenecks. Meta's data engineering interviews include production debugging rounds where Spark tuning is front and center.
This guide covers the 15 questions that test whether you've actually operated Spark in production – not just completed a tutorial. Each answer includes the *why* behind the optimization, because interviewers at senior levels care about reasoning, not just the solution.
1. Repartition vs Coalesce – When Each Destroys Performance
Question: You have a Spark DataFrame with 2,000 partitions and need to write it to S3 as 100 Parquet files. Should you use repartition(100) or coalesce(100)?
Answer:
coalesce(100) – because you're *reducing* partitions. Coalesce avoids a full shuffle by merging adjacent partitions on the same executor. Repartition triggers a full shuffle (hash-based redistribution across the network), which is expensive when you're only trying to reduce file count.
When repartition is better: When you need to *increase* partitions or need *evenly distributed* data. Coalesce can create skewed partitions because it just combines existing ones. If partition 1 has 10GB and partition 2 has 100MB, coalesce will keep that skew.
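The direct answer as a minimal sketch (the output path is illustrative):

```python
# Merge 2,000 partitions down to 100 without a shuffle, then write 100 Parquet files
df.coalesce(100).write.mode("overwrite").parquet("s3://bucket/output/")
```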
Production pattern:
```python
from pyspark.sql.functions import col

# Write with a target layout, not a hard-coded partition count
(df.repartition(col("date"))          # even distribution by date
   .sortWithinPartitions(col("id"))   # sorted for downstream reads
   .write.parquet("s3://bucket/table/"))
```

Follow-up trap: "What about coalesce(1)?" – Never in production. A single partition means a single task, zero parallelism, and a likely OOM for any non-trivial dataset. If you need a single output file, write to a temp location, then use hadoop fs -getmerge.
2. Data Skew – The #1 Production Spark Problem
Question: Your Spark job's shuffle stage shows 199 tasks completing in 30 seconds but 1 task taking 45 minutes. What's happening and how do you fix it?
Answer:
This is data skew – one partition has disproportionately more data than the others. The single slow task is processing the heavy partition while all other executors sit idle.
5 strategies, ranked by production impact:
- Salting – Add a random suffix to the skewed key, perform the join/aggregation, then remove it:

```python
from pyspark.sql.functions import col, concat, lit, floor, rand

SALT_BUCKETS = 10

# Spread the hot key across SALT_BUCKETS synthetic keys
skewed_df = skewed_df.withColumn(
    "salted_key", concat(col("join_key"), lit("_"), floor(rand() * SALT_BUCKETS))
)

# Replicate the other side once per salt value so every salted key finds a match
other_df_exploded = other_df.crossJoin(
    spark.range(SALT_BUCKETS).withColumnRenamed("id", "salt")
).withColumn("salted_key", concat(col("join_key"), lit("_"), col("salt")))

result = skewed_df.join(other_df_exploded, "salted_key")
```

- Broadcast join – If the non-skewed side fits in memory (< 8GB), broadcast it.
- Adaptive Query Execution (AQE) – Set spark.sql.adaptive.enabled = true and spark.sql.adaptive.skewJoin.enabled = true (Spark 3.0+). AQE detects skew at runtime and splits oversized partitions automatically.
- Isolate + union – Filter out the hot key, process it separately (maybe with a broadcast), then UNION ALL the results (sketched after this list).
- Pre-aggregate – If you're joining then aggregating, aggregate first to reduce the skewed side before the join.
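A minimal sketch of the isolate + union pattern, assuming the hot key value is already known (the HOT_KEY literal and DataFrame names are hypothetical):

```python
from pyspark.sql.functions import broadcast, col

HOT_KEY = "user_0"  # hypothetical known hot key

hot = skewed_df.filter(col("join_key") == HOT_KEY)
rest = skewed_df.filter(col("join_key") != HOT_KEY)

# The hot key joins via broadcast (no shuffle); everything else takes the normal path
hot_joined = hot.join(broadcast(other_df), "join_key")
rest_joined = rest.join(other_df, "join_key")

result = hot_joined.unionByName(rest_joined)
```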
3. Broadcast Joins – Size Limits and Hidden Costs
Question: Explain broadcast joins in Spark. When should you NOT use them?
Answer:
A broadcast join sends the smaller DataFrame to every executor, avoiding a shuffle on the larger side. Spark automatically broadcasts tables under spark.sql.autoBroadcastJoinThreshold (default: 10MB).
```python
from pyspark.sql.functions import broadcast

result = large_df.join(broadcast(small_df), "key")
```

When NOT to broadcast:
- Table exceeds driver memory (causes OOM on the driver, not executors)
- Many concurrent jobs broadcasting simultaneously (network saturation)
- The "small" table is actually 5GB β executors need to deserialize it, eating into executor memory
Senior-level insight: The 10MB default is almost always too conservative for production. Most teams set it to 100MB–500MB. But the real limit isn't the threshold – it's spark.driver.maxResultSize (default 1GB), because the driver serializes the broadcast table first.
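A minimal sketch of raising the threshold for a session (the 200MB figure is illustrative, not a recommendation from this guide):

```python
# Takes a byte count (or a size string like "200MB"); set to -1 to disable auto-broadcast
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 200 * 1024 * 1024)
```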
Follow-up: "What about broadcast hash join vs sort-merge join?" β Broadcast hash is O(1) lookup per row. Sort-merge join requires both sides sorted, which means a full shuffle, but it works for tables of any size and streams data (lower memory footprint).
4. Spark UI Deep Dive – Reading the DAG
Question: Walk me through how you debug a slow Spark job using the Spark UI.
Answer (structured for interviews):
- Jobs tab – Find the slow job. Check if it's a single stage or multiple stages that are slow.
- Stages tab – For the slow stage:
  - Look at task duration distribution. If the 95th-percentile task time >> median, you have skew.
  - Check shuffle read/write sizes. High shuffle write = expensive wide transformation.
  - Look at GC time. If GC > 10% of task time, increase executor memory or reduce object creation.
- Executors tab – Check for:
  - Executors with disk spill. If shuffle spill > 0, increase spark.sql.shuffle.partitions or executor memory.
  - Uneven task distribution. If one executor ran 500 tasks and others ran 50, check data locality.
- SQL tab – The physical plan shows you exactly which join strategy Spark chose (BroadcastHashJoin, SortMergeJoin, etc.) and where exchanges (shuffles) happen.
The pattern senior engineers follow: Slow job → identify the slow stage → check task metrics (skew? GC? spill?) → trace back to the transformation in code → fix the root cause (partition strategy, join type, or data layout).
5. Shuffle Optimization – The Biggest Performance Lever
Question: What is a shuffle in Spark and how do you minimize shuffle overhead?
Answer:
A shuffle is a data exchange between executors – Spark serializes data, writes it to local disk, transfers it over the network, and deserializes it on the receiving side. Shuffles are expensive because they involve disk I/O, network I/O, and serialization.
Operations that trigger shuffles: groupBy, join (non-broadcast), repartition, distinct, orderBy, window functions with PARTITION BY.
Minimization strategies:
- Reduce data before shuffling – Filter and project early. A WHERE before a JOIN can reduce shuffle data by 90%.
- Use broadcast joins – Eliminate the shuffle entirely for small tables.
- Bucket tables – Pre-shuffle at write time. If table A and table B are both bucketed by user_id into 200 buckets, joins on user_id require zero shuffle (see the sketch after this list).
- Tune partition count – spark.sql.shuffle.partitions defaults to 200, which is too low for big data and too high for small data. Rule of thumb: target 128MB–256MB per partition.
- Enable AQE – Adaptive Query Execution coalesces small shuffle partitions automatically.
- Avoid unnecessary sorts – orderBy triggers a full shuffle + sort. Use sortWithinPartitions if global order isn't needed.
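A minimal bucketing sketch (table and DataFrame names are hypothetical; bucketBy requires saveAsTable):

```python
# Write both sides bucketed by the join key into 200 buckets each.
# Later joins on user_id between these tables skip the shuffle.
(events_df.write.bucketBy(200, "user_id").sortBy("user_id")
          .mode("overwrite").saveAsTable("bucketed_events"))

(users_df.write.bucketBy(200, "user_id").sortBy("user_id")
         .mode("overwrite").saveAsTable("bucketed_users"))
```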
6. Memory Management – OOM Debugging
Question: Your Spark job fails with java.lang.OutOfMemoryError: GC overhead limit exceeded. Walk through your debugging process.
Answer:
- Check which memory area overflowed:
  - Executor memory (spark.executor.memory) – increase if tasks process large partitions
  - Driver memory (spark.driver.memory) – increase if collect(), broadcast(), or toPandas() is used
  - Overhead memory (spark.executor.memoryOverhead) – increase for heavy UDF usage or off-heap operations
- Common root causes:
  - Skewed partition (one partition has 50GB of data)
  - collect() or toPandas() on a large DataFrame (brings all data to the driver)
  - Expensive UDFs creating many short-lived objects (GC pressure)
  - Too few partitions (e.g., 10 partitions for 100GB = 10GB per task)
- Production fix checklist:
```python
from pyspark.sql import SparkSession

# Items 2-4 are launch-time settings: pass them via spark-submit --conf or the
# session builder *before* executors start; spark.conf.set() can't change them later.
spark = (SparkSession.builder
    # 2. Use Kryo serialization (2-10x faster, smaller payloads)
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    # 3. Set memory fractions
    .config("spark.memory.fraction", "0.8")           # default 0.6
    .config("spark.memory.storageFraction", "0.3")    # default 0.5
    # 4. Increase off-heap headroom if using Arrow/UDFs
    .config("spark.executor.memoryOverhead", "2g")
    .getOrCreate())

# 1. Increase partitions to reduce per-task memory
df = df.repartition(2000)
```

7. Catalyst Optimizer – How Spark Plans Queries
Question: How does Spark's Catalyst Optimizer work? Walk through the stages.
Answer:
4 phases:
- Analysis – Resolves column names, table references, and data types against the catalog. Unresolved logical plan → resolved logical plan.
- Logical Optimization – Applies rule-based optimizations:
  - Predicate pushdown – Push filters as close to the data source as possible
  - Column pruning – Only read columns that are actually used
  - Constant folding – Evaluate constant expressions at plan time
  - Join reordering – Put smaller tables on the build side of hash joins
- Physical Planning – Converts the logical plan to one or more physical plans. Chooses join strategies (broadcast vs sort-merge vs shuffle-hash) based on table statistics.
- Code Generation (Tungsten) – Generates optimized Java bytecode for the physical plan. Whole-stage codegen fuses multiple operators into a single Java function, reducing virtual function calls.
Senior-level follow-up: "How do you influence the optimizer?" – Use spark.sql.autoBroadcastJoinThreshold, run ANALYZE TABLE to update statistics, use hint() for join strategy (/*+ BROADCAST(t) */), and check df.explain(True) to verify the plan.
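A short sketch of the hint-and-verify loop (DataFrame names are hypothetical):

```python
# Ask for a broadcast on the small side, then confirm it in the physical plan
result = large_df.join(small_df.hint("broadcast"), "key")
result.explain(True)  # expect BroadcastHashJoin instead of SortMergeJoin
```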
8. Dynamic Partition Pruning (DPP)
Question: What is Dynamic Partition Pruning and how does it optimize joins?
Answer:
DPP is a Spark 3.0+ optimization for star-schema queries. When you join a large fact table (partitioned by date) with a small dimension table that filters to specific dates, Spark can push the dimension filter into the fact table scan – pruning partitions at read time.
Without DPP:
- Scan ALL partitions of fact table
- Shuffle both sides
- Join, then filter
With DPP:
- Scan dimension table (small)
- Extract distinct date values from the filter
- Only scan matching partitions of the fact table
- Join (much less data)
```sql
-- DPP kicks in automatically here
SELECT f.*
FROM fact_sales f
JOIN dim_dates d ON f.sale_date = d.date_key
WHERE d.quarter = 'Q1-2026';
```

Production impact: We've seen DPP reduce scan size from 2TB to 40GB on partitioned Hive tables – a 50x improvement. Ensure spark.sql.optimizer.dynamicPartitionPruning.enabled = true (the default in Spark 3.x).
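A quick sanity check from a notebook – a sketch using the conf name above:

```python
# Should print "true" on Spark 3.x (DPP is on by default)
print(spark.conf.get("spark.sql.optimizer.dynamicPartitionPruning.enabled"))
```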
9–12: Rapid-Fire Senior Questions
9. What is the difference between `cache()` and `persist()`?
cache() = persist(StorageLevel.MEMORY_AND_DISK). persist() lets you choose: MEMORY_ONLY, MEMORY_AND_DISK, DISK_ONLY, OFF_HEAP, or serialized variants. Use MEMORY_AND_DISK_SER for large DataFrames to trade CPU for memory savings.
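A minimal sketch of the explicit persist pattern (the DataFrame name is illustrative):

```python
from pyspark import StorageLevel

df.persist(StorageLevel.MEMORY_AND_DISK)  # the level cache() uses for DataFrames
df.count()        # an action materializes the cached data
df.unpersist()    # release memory once the DataFrame is no longer reused
```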
10. Explain narrow vs wide transformations.
Narrow: each input partition maps to at most one output partition (map, filter, union). No shuffle. Wide: input partitions contribute to multiple output partitions (groupBy, join, repartition). Requires a shuffle. Stage boundaries fall at wide transformations – narrow operations are pipelined together within a stage.
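A two-line illustration, assuming hypothetical user_id and amount columns:

```python
from pyspark.sql.functions import col

# Narrow transformations: no shuffle, pipelined together within one stage
cleaned = df.filter(col("amount") > 0).select("user_id", "amount")

# Wide transformation: shuffle, starts a new stage
totals = cleaned.groupBy("user_id").sum("amount")
```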
11. How do you handle the small files problem?
Small files kill read performance (too many file opens, excessive metadata overhead in HDFS/S3). Solutions: coalesce before write, set spark.sql.files.maxRecordsPerFile, use Delta Lake OPTIMIZE / ZORDER, or run compaction jobs (Hive ALTER TABLE CONCATENATE).
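A minimal compaction sketch (paths and counts are illustrative):

```python
# Cap rows per output file, then reduce partition count before writing
spark.conf.set("spark.sql.files.maxRecordsPerFile", 1_000_000)
df.coalesce(200).write.mode("overwrite").parquet("s3://bucket/compacted/")
```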
12. What's the difference between client mode and cluster mode?
Client mode: driver runs on the submitting machine (good for interactive work, spark-shell). Cluster mode: driver runs on a worker node (good for production – if the submitting machine dies, the job continues). Always use cluster mode in production.
13. Delta Lake and Lakehouse Optimization
Question: You're running Spark on Delta Lake. Your read queries are slow despite OPTIMIZE. What else would you check?
Answer:
- Z-ORDER – If queries filter on columns that aren't the partition key, OPTIMIZE table ZORDER BY (col1, col2) co-locates related data (see the sketch after this list). Without Z-ORDER, Spark must scan all files in a partition even with predicate pushdown.
- Partition strategy – Over-partitioning (e.g., by user_id with 10M users) creates millions of tiny directories. Under-partitioning (no partitions) means full table scans. Sweet spot: partition by date or another low-cardinality column, and Z-ORDER by high-cardinality filter columns.
- File size – Delta Lake's default target file size is 1GB. If your files average 10MB after compaction, increase spark.databricks.delta.optimize.maxFileSize.
- Statistics – Delta Lake keeps per-file min/max stats for the first 32 columns. Ensure your frequently filtered columns are among those 32. Check with DESCRIBE DETAIL table.
- Vacuum – Old file versions (kept for time travel) slow S3 list operations. VACUUM table RETAIN 168 HOURS cleans up stale files.
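A minimal maintenance sketch run through spark.sql, assuming a Delta-enabled session and a hypothetical events table:

```python
# Compact small files and co-locate data on the hottest filter columns
spark.sql("OPTIMIZE events ZORDER BY (user_id, event_type)")

# Drop files older than the 7-day retention window (mind time-travel needs)
spark.sql("VACUUM events RETAIN 168 HOURS")
```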
14. Spark Structured Streaming Tuning
Question: Your Spark Structured Streaming job's processing time per micro-batch is increasing over time. Diagnose and fix.
Answer:
Likely causes:
- State store growth – If using groupBy().agg() with watermarking, the state store grows with distinct keys. Check Spark UI → Structured Streaming tab → "Num Rows Total" in state. Fix: tighten the watermark (withWatermark("event_time", "10 minutes")) to drop old state faster.
- Backpressure – Kafka ingestion rate exceeds processing rate. Set maxOffsetsPerTrigger to cap per-batch size.
- Checkpoint bloat – Old checkpoint files accumulate. The state store compacts periodically, but if your trigger interval is very short (100ms), checkpoint overhead dominates.
- Sink bottleneck – Writing to a slow sink (e.g., JDBC to a single Postgres) creates backpressure. Batch writes, increase the connection pool, or use an intermediate Kafka topic.
```python
df = (spark.readStream.format("kafka")                    # maxOffsetsPerTrigger is a
      .option("kafka.bootstrap.servers", "broker:9092")   # *source* option, so it goes
      .option("subscribe", "events")                      # on readStream (host/topic
      .option("maxOffsetsPerTrigger", 100000)             # names are illustrative)
      .load())

(df.writeStream
   .format("console")  # placeholder sink for the sketch
   .option("checkpointLocation", "s3://checkpoints/job-v2")
   .trigger(processingTime="30 seconds")
   .start())
```

15. End-to-End Pipeline Optimization Scenario
Question: You inherit a Spark pipeline that reads 2TB of Parquet from S3, joins with a 500MB dimension table, aggregates by region and product, and writes to Snowflake. It currently takes 6 hours. Bring it under 1 hour.
Senior answer framework:
- Broadcast the dimension table – 500MB fits in memory. Eliminates one shuffle (steps 1-3 are sketched after this list).
- Predicate pushdown – If only yesterday's data is needed, filter on the partition column *before* reading (spark.read.parquet("s3://...").where(col("date") == "2026-04-26")).
- Column pruning – select() only the columns you need before the join. Reading all 200 columns when you need 15 wastes I/O and memory.
- Tune partitions – 2TB / 256MB target ≈ 8,000 partitions. Set spark.sql.shuffle.partitions = 8000.
- Use Kryo serialization – 2-5x faster than Java serialization.
- Write optimization – Batch the Snowflake write. Use the Snowflake Spark connector with sfWarehouse set to a large warehouse, and use a COPY INTO staging approach instead of row-by-row JDBC.
- Enable AQE – Let Spark coalesce small post-shuffle partitions and handle any skew automatically.
- Infrastructure – If on EMR, use Graviton instances (30% cheaper, 20% faster for Spark). If on Databricks, enable Photon.
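A minimal sketch of steps 1-3 combined (paths and column names are hypothetical):

```python
from pyspark.sql.functions import broadcast, col

# Steps 2-3: prune partitions and columns at scan time
fact = (spark.read.parquet("s3://bucket/fact_sales/")
        .where(col("date") == "2026-04-26")                  # partition pruning
        .select("date", "region", "product_id", "amount"))   # column pruning

dim = spark.read.parquet("s3://bucket/dim_product/")  # ~500MB dimension table

# Step 1: broadcast the small side so the 2TB fact table never shuffles for the join
result = (fact.join(broadcast(dim), "product_id")
              .groupBy("region", "product_name")
              .agg({"amount": "sum"}))
```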
Expected result: Steps 1-3 alone typically reduce runtime by 60-70%. Combined with proper partitioning and AQE, the job should complete in 30-45 minutes.
What to Do Next
These 15 questions represent what senior engineers actually face at Amazon, Databricks, Meta, and Google. The key is not memorizing answers but understanding the *why* behind each optimization.
Practice these interactively:
- AI Mock Interview – Spark Focus – Get real-time feedback on your Spark answers
- 1,800+ Data Engineering Questions – Browse all Spark questions by difficulty
- Spark Interview Questions (Beginner to Advanced) – Start here if you need to build fundamentals first
Remember: at the senior level, every answer should include a production anecdote. "I used broadcast joins to reduce a 4-hour pipeline to 20 minutes at [Company]" is 10x more convincing than a textbook definition.