Most candidates confuse when to partition versus when to bucket. Here is the answer, with real performance numbers and production best practices.
Partitioning divides data into directories based on column values like date or country. Bucketing divides data into a fixed number of files using a hash function on a column. Partitioning is good for filtering and bucketing is good for joins. You can use both together.
Both are physical data organization strategies, but they solve different problems:
Partitioning — directory-level segregation:
```sql
-- Hive/Spark SQL
CREATE TABLE events (
  user_id STRING,
  event   STRING,
  ts      TIMESTAMP
)
PARTITIONED BY (dt STRING);

-- Queries with WHERE dt = '2026-01-01' read ONLY that directory.
-- This is 'partition pruning': it skips 99%+ of the data.
```
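Partition pruning can be sketched in plain Python as a toy model (this is an illustration of the idea, not Spark internals; the table layout and `scan` helper are hypothetical):

```python
# Toy model of partition pruning. Each partition is a directory keyed by
# its dt value; a query with a dt filter opens only the matching
# directory and skips the rest.

# Simulated on-disk layout: dt directory -> rows stored under it
events_by_partition = {
    "dt=2026-01-01": [("u1", "click"), ("u2", "view")],
    "dt=2026-01-02": [("u3", "click")],
    "dt=2026-01-03": [("u1", "purchase"), ("u4", "view")],
}

def scan(table, dt_filter=None):
    """Return (rows, directories_read). A dt filter prunes to one dir."""
    dirs = [f"dt={dt_filter}"] if dt_filter else list(table)
    rows = [row for d in dirs for row in table.get(d, [])]
    return rows, dirs

rows, dirs_read = scan(events_by_partition, dt_filter="2026-01-01")
print(dirs_read)  # ['dt=2026-01-01'] -- the other two days are never read
```

Without the filter, `scan` touches every directory, which is exactly the full-table scan that partitioning exists to avoid.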
Bucketing — hash-based file distribution:
```sql
CREATE TABLE orders (
  order_id BIGINT,
  user_id  STRING,
  amount   DECIMAL
)
CLUSTERED BY (user_id) INTO 256 BUCKETS;
```
When two tables are bucketed on the same column with the same bucket count, Spark can join them WITHOUT a shuffle — each bucket joins with its counterpart directly.
```python
# Both tables bucketed by user_id into 256 buckets.
# The join becomes a local per-bucket merge -- no network shuffle.
orders.join(users, 'user_id')  # bucket join detected automatically
```
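The shuffle-free mechanics can be sketched in plain Python: because both sides use the same hash function and the same bucket count, rows with equal keys always land in the same bucket index, so the join is a loop of independent per-bucket merges. This is a toy model; the hash below is illustrative (Spark actually uses Murmur3), and the helper names are mine:

```python
NUM_BUCKETS = 4  # small count for illustration; the text above uses 256

def bucket_of(key: str) -> int:
    # Deterministic hash -> bucket index. Both tables MUST use the same
    # function and bucket count, or the bucket join does not apply.
    return sum(key.encode()) % NUM_BUCKETS  # toy hash, not Spark's Murmur3

def bucketize(rows, key_idx=0):
    buckets = [[] for _ in range(NUM_BUCKETS)]
    for row in rows:
        buckets[bucket_of(row[key_idx])].append(row)
    return buckets

orders = [("u1", 30.0), ("u2", 12.5), ("u3", 99.0)]
users = [("u1", "alice"), ("u2", "bob"), ("u3", "carol")]

order_buckets = bucketize(orders)
user_buckets = bucketize(users)

# Per-bucket merge: bucket i on one side can only match bucket i on the
# other, so no row ever moves between buckets (no "shuffle").
joined = []
for ob, ub in zip(order_buckets, user_buckets):
    names = {uid: name for uid, name in ub}
    joined += [(uid, amt, names[uid]) for uid, amt in ob if uid in names]

print(sorted(joined))
```

The outer loop over bucket pairs is embarrassingly parallel, which is why a bucket join scales without any network traffic between tasks.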
Combine both for the best of each: `PARTITIONED BY (dt) CLUSTERED BY (user_id) INTO 128 BUCKETS` prunes on the date partition while the bucket layout speeds the join.

File size target: 128 MB-1 GB per file. Anything smaller calls for a compaction job; anything larger, increase the bucket count.
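One way to turn that file-size target into a bucket count (a rule-of-thumb sketch: the 128 MB-1 GB band comes from the text above, but the sizing function and its ~512 MB midpoint default are my own assumption):

```python
import math

def suggest_bucket_count(partition_bytes: int,
                         target_file_bytes: int = 512 * 1024**2) -> int:
    """Rule-of-thumb bucket count so each bucket file lands inside the
    128 MB-1 GB sweet spot (targeting ~512 MB here). Illustrative only:
    real sizing must also account for compression ratio and data growth.
    """
    return max(1, math.ceil(partition_bytes / target_file_bytes))

# A 100 GB partition -> 200 buckets of ~512 MB each
print(suggest_bucket_count(100 * 1024**3))  # 200
```

Note that the bucket count is fixed at table-creation time, so it pays to size for the largest partition you expect rather than today's average.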
The key insight: partitioning optimizes FILTER queries (WHERE date = X), bucketing optimizes JOIN queries (ON a.user_id = b.user_id). The senior answer includes the small-files anti-pattern and the bucket-join optimization.