80% of candidates give the textbook answer about shuffle. See the production-level response with real performance numbers and code that gets offers.
What is the difference between groupByKey and reduceByKey in Spark?
groupByKey shuffles all key-value pairs across the network and groups them by key. reduceByKey first performs a local reduce (map-side combine) before shuffling, which reduces the amount of data transferred. reduceByKey is more efficient and should be preferred over groupByKey.
The core difference is map-side combining:
```python
# groupByKey: shuffles ALL values, then groups
rdd.groupByKey()  # Transfers O(total_values) across the network

# reduceByKey: combines locally FIRST, then shuffles
rdd.reduceByKey(lambda a, b: a + b)  # Transfers at most one record per key per map partition
```
Real performance impact: On a 10M row dataset with 100K unique keys, reduceByKey can reduce shuffle data by 10-100x. I've seen production jobs drop from 45 minutes to 3 minutes by switching.
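To make the shuffle-volume argument concrete, here is a small pure-Python sketch that counts how many records would cross the shuffle with and without a map-side combine. It is a simulation, not Spark itself: the function names and the toy partition layout are assumptions made up for illustration, and real savings depend on how many values each key has within a single partition.

```python
def shuffle_records_without_combine(partitions):
    """groupByKey-style: every (key, value) pair crosses the shuffle."""
    return sum(len(part) for part in partitions)


def shuffle_records_with_combine(partitions, reduce_fn):
    """reduceByKey-style: each partition pre-reduces, then emits one record per key."""
    shuffled = 0
    for part in partitions:
        local = {}
        for key, value in part:
            # Fold the value into the partition-local result for this key.
            local[key] = reduce_fn(local[key], value) if key in local else value
        shuffled += len(local)
    return shuffled


# Hypothetical toy data: 4 partitions, 1,000 records each, 10 distinct keys.
partitions = [[(i % 10, 1) for i in range(1000)] for _ in range(4)]

without = shuffle_records_without_combine(partitions)
with_combine = shuffle_records_with_combine(partitions, lambda a, b: a + b)
print(without, with_combine)
```

In this toy layout every key appears 100 times per partition, so the combine step shrinks the shuffle by the same factor. If each key appeared only once per partition, the two counts would be identical.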
```python
from pyspark.sql import functions as F

# DataFrame API handles map-side combining automatically
df.groupBy('key').agg(F.sum('value'))  # Spark SQL plans a partial aggregation before the shuffle
```
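One way to check this claim on your own cluster is to print the physical plan. The sketch below assumes PySpark is installed and a local session is acceptable; the function name, app name, and sample rows are hypothetical. For a simple sum, the plan typically shows a partial aggregate step before the `Exchange` (shuffle) and a final aggregate after it, though the exact plan text varies by Spark version.

```python
def show_partial_aggregation():
    """Build a tiny DataFrame, print its physical plan, and return the totals."""
    # Imported lazily so this module still loads where PySpark is not installed.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = (
        SparkSession.builder
        .master("local[2]")
        .appName("partial-agg-demo")  # hypothetical app name
        .getOrCreate()
    )
    df = spark.createDataFrame([("a", 1), ("a", 2), ("b", 3)], ["key", "value"])
    totals = df.groupBy("key").agg(F.sum("value").alias("total"))

    # Prints the physical plan; look for an aggregate on each side of the Exchange.
    totals.explain()

    rows = sorted((row["key"], row["total"]) for row in totals.collect())
    spark.stop()
    return rows
```

Calling `show_partial_aggregation()` prints the plan and returns the per-key totals as a sorted list of tuples.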
groupByKey is still acceptable for:
- Complex aggregations that don't fit a simple reduce (e.g., collecting distinct values)
- Low-cardinality keys (< 10K unique keys with small values)
- Cases where you need ALL values per key (not just an aggregate)
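A per-key median is a typical case where every value is needed, because a median cannot be built from partial medians. The sketch below is a minimal illustration under that assumption: `median_per_key` expects an existing pair RDD, and `group_values` is a hypothetical local stand-in for `groupByKey` so the logic can be checked without a cluster.

```python
import statistics
from collections import defaultdict


def median_per_key(rdd):
    """Per-key median on a pair RDD; needs the full value list, so groupByKey fits."""
    return rdd.groupByKey().mapValues(lambda values: statistics.median(values))


def group_values(pairs):
    """Local stand-in for groupByKey: collect every value under its key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


# Hypothetical sample data, checked locally without Spark.
pairs = [("a", 5), ("a", 1), ("a", 3), ("b", 10), ("b", 20)]
medians = {key: statistics.median(values) for key, values in group_values(pairs).items()}
print(medians)
```

If a key can hold more values than fit in one executor's memory, even this pattern becomes risky, which is why it belongs on the short list of exceptions rather than being the default.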
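Other RDD operators worth knowing:
- aggregateByKey: Takes a zero value plus separate functions for merging within a partition and across partitions, so the result type can differ from the value type
- combineByKey: Most flexible, with custom createCombiner, mergeValue, and mergeCombiners functions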
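As a sketch of how these two operators relate, here is a per-key average written both ways. The accumulator layout `(running_sum, count)` and all function names are assumptions chosen for illustration; both Spark helpers expect an existing pair RDD with numeric values. The last lines replay the same functions on two made-up partitions so the merge logic can be checked without Spark.

```python
from functools import reduce

# Accumulator is (running_sum, count); ZERO is the per-partition starting point.
ZERO = (0.0, 0)


def seq_op(acc, value):
    """Within a partition: fold one value into the accumulator."""
    return (acc[0] + value, acc[1] + 1)


def comb_op(left, right):
    """Across partitions: merge two accumulators."""
    return (left[0] + right[0], left[1] + right[1])


def average_by_key(rdd):
    """Per-key mean using aggregateByKey."""
    return rdd.aggregateByKey(ZERO, seq_op, comb_op).mapValues(lambda acc: acc[0] / acc[1])


def average_by_key_combine(rdd):
    """The same computation spelled out with combineByKey."""
    return rdd.combineByKey(
        lambda value: (value, 1),  # createCombiner: first value for a key in a partition
        seq_op,                    # mergeValue: later values in the same partition
        comb_op,                   # mergeCombiners: merge results across partitions
    ).mapValues(lambda acc: acc[0] / acc[1])


# Local check without a cluster: two hypothetical partitions holding one key's values.
part_a = reduce(seq_op, [1.0, 2.0], ZERO)
part_b = reduce(seq_op, [3.0], ZERO)
total, count = comb_op(part_a, part_b)
mean = total / count
print(part_a, part_b, mean)
```

An average cannot be expressed directly with reduceByKey, because averaging two averages is not associative; carrying the sum and count separately is what makes the map-side combine safe.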
My production rule: Use DataFrame API for 95% of aggregations. If forced to use RDDs, reduceByKey for simple reduces, aggregateByKey for complex ones, groupByKey only when you genuinely need the full value list.
The textbook answer about shuffle is table stakes. The senior answer explains that DataFrames handle this automatically, quantifies the real performance impact, and knows when groupByKey is actually acceptable.