Most candidates say 'coalesce is always better.' That's wrong. See the FAANG-level answer that demonstrates real Spark expertise.
What is the difference between repartition and coalesce in Apache Spark?
repartition() does a full shuffle to create the specified number of partitions. coalesce() reduces partitions without a full shuffle by merging existing partitions. coalesce is more efficient because it avoids the shuffle. You should use coalesce when reducing partitions and repartition when increasing them.
Both control partition count, but the trade-offs matter in production:
coalesce(n) — merges partitions locally without a full shuffle:
```python
from pyspark.sql.functions import col

# After filtering 1B rows down to 1M, most partitions are nearly empty
df_filtered = df.filter(col('active') == True)
df_filtered.coalesce(10).write.parquet('/output')
```
Fast, but creates uneven partitions if upstream data is skewed. Never use coalesce to INCREASE partitions — it silently does nothing.
repartition(n) — full shuffle, guarantees even distribution:
```python
from pyspark.sql.functions import col

# Before writing to storage — prevents small-files problem
df.repartition(200).write.parquet('/warehouse/events')

# Before a skewed join — spread hot keys evenly
df.repartition(100, col('user_id')).join(dim_users, 'user_id')
```
Production gotcha: After coalesce(1), a single executor holds all data — OOM risk on large datasets. For single-file output, prefer repartition(1) + monitoring or write to temp then merge.
Junior engineers memorize 'coalesce = no shuffle, repartition = shuffle.' Senior engineers explain WHEN each matters: small-files, skew, write optimization, and the hidden dangers of coalesce on skewed data.