Most candidates say 'coalesce is always better.' That's wrong. See the FAANG-level answer that demonstrates real Spark expertise.
What is the difference between repartition and coalesce in Apache Spark?
repartition() does a full shuffle to create the specified number of partitions. coalesce() reduces partitions without a full shuffle by merging existing partitions. coalesce is more efficient because it avoids the shuffle. You should use coalesce when reducing partitions and repartition when increasing them.
Both control partition count, but the trade-offs matter in production:
coalesce(n) — merges partitions locally without a full shuffle:
```python
from pyspark.sql.functions import col

# After filtering 1B rows down to 1M, most partitions are nearly empty
df_filtered = df.filter(col('active') == True)
df_filtered.coalesce(10).write.parquet('/output')
```
Fast, but creates uneven partitions if upstream data is skewed. Never use coalesce to INCREASE partitions — it silently does nothing.
repartition(n) — full shuffle, guarantees even distribution:
```python
from pyspark.sql.functions import col

# Before writing to storage — prevents small-files problem
df.repartition(200).write.parquet('/warehouse/events')

# Before a skewed join — spread hot keys evenly
df.repartition(100, col('user_id')).join(dim_users, 'user_id')
```
Production gotcha: After coalesce(1), a single executor holds all data — OOM risk on large datasets. For single-file output, prefer repartition(1) + monitoring or write to temp then merge.
Junior engineers memorize 'coalesce = no shuffle, repartition = shuffle.' Senior engineers explain WHEN each matters: small-files, skew, write optimization, and the hidden dangers of coalesce on skewed data.