What is the small-file problem in Spark, and how do you solve it?

Reviewed by Aditya Kumar · Last reviewed 2026-03-24

Why This Question Matters

This hard-level Spark/Big Data question comes up in data engineering interviews at companies like Daniel Wellington, Incedo, and Swiggy. Although it is not among the most common questions, it tests the deeper understanding that distinguishes strong candidates. Mastering the underlying concepts (partitions, Spark execution) will help you answer variations of this question confidently.

How to Approach This

This is a senior-level question that tests architectural thinking. Lead with the high-level design, then drill into specifics. Discuss trade-offs explicitly, since there is rarely one correct answer. Show awareness of scale, fault tolerance, and operational complexity. The answer below includes a code example that demonstrates the implementation pattern.

The small-file problem in Spark refers to the significant performance degradation and resource inefficiencies caused by processing and storing data in a large number of very small files (kilobytes to a few megabytes each). This leads to substantial overhead in metadata management, task scheduling, and I/O operations, particularly in distributed file systems like HDFS or object storage like S3.

Why it's a Problem

With Hadoop-style input formats, each small file becomes at least one Spark partition, and each partition maps to a task. Spark's DataFrame file sources can pack several small files into one partition, but every file still has to be listed, opened, and read separately. Thousands of small files therefore generate thousands of tasks or file opens, which loads the driver with scheduling and file-listing work, inflates per-task startup and teardown costs, and replaces large sequential reads with many small ones. On object storage like S3, listing a prefix that holds hundreds of thousands of files can take minutes, because LIST results are paginated (up to 1,000 keys per request), rate-limited, and billed per request. On HDFS, the NameNode keeps metadata for every file and block in memory, so excessive file counts strain its heap.

Common root causes include excessive write parallelism (e.g., spark.sql.shuffle.partitions set too high for the data volume), over-partitioning with df.write.partitionBy() on high-cardinality columns, and frequent micro-batch writes in streaming applications.
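
As a rough illustration of the scale involved, the sketch below compares one dataset stored as many small files against the same data stored in larger files. The function name, the 1 TiB dataset, and the two file sizes are hypothetical choices for the example; the 1,000-keys-per-page figure is the S3 ListObjectsV2 page limit.

```python
import math

S3_LIST_PAGE_SIZE = 1000  # max keys returned per S3 ListObjectsV2 response


def listing_cost(total_bytes: int, file_size_bytes: int) -> dict:
    """Estimate file count and LIST pages for a dataset stored at a given file size."""
    num_files = math.ceil(total_bytes / file_size_bytes)
    return {
        "files": num_files,  # each file needs at least one open/GET
        "list_pages": math.ceil(num_files / S3_LIST_PAGE_SIZE),
    }


TIB = 1024 ** 4
MIB = 1024 ** 2

small = listing_cost(TIB, 1 * MIB)    # 1 TiB stored as 1 MiB files (hypothetical)
large = listing_cost(TIB, 256 * MIB)  # same data stored as 256 MiB files
print(small)
print(large)
```

Under these assumptions the small-file layout needs roughly a million file opens and over a thousand LIST pages, while the 256 MiB layout needs a few thousand opens and a handful of LIST pages.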

Solutions and Trade-offs

The primary goal is to consolidate small files into larger ones, typically between 128MB and 1GB. That range lines up with the default HDFS block size (128MB) and keeps the per-object request overhead low on object stores like S3.

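Consolidate Data Before Write: Use df.coalesce(num_partitions) or df.repartition(num_partitions) before writing. coalesce merges existing partitions without a full shuffle, so it is cheaper, but it can leave partitions unevenly sized and can reduce the parallelism of the upstream stage. repartition performs a full shuffle and produces evenly sized partitions.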

    # 200 is illustrative; pick roughly (expected output size / target file size)
    df.repartition(200).write.parquet("s3://my-bucket/path")
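
One way to choose the partition count instead of hard-coding it is to divide the expected output size by a target file size. The helper below is a minimal sketch: the function name, the 256 MiB default, and the 50 GiB figure are assumptions for illustration, and total_bytes should be an estimate of the written (compressed) size, which Spark does not report directly before the write.

```python
import math


def target_partitions(total_bytes: int, target_file_bytes: int = 256 * 1024 ** 2) -> int:
    """Number of output partitions so each written file is roughly target_file_bytes."""
    if total_bytes <= 0:
        return 1
    return max(1, math.ceil(total_bytes / target_file_bytes))


# Hypothetical usage with an existing DataFrame `df` whose output is about 50 GiB:
#   n = target_partitions(50 * 1024 ** 3)
#   df.repartition(n).write.parquet("s3://my-bucket/path")
n = target_partitions(50 * 1024 ** 3)
print(n)
```

The max(1, ...) guard keeps tiny or empty outputs from producing a zero partition count, which repartition would reject.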

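Table Formats: Modern data lake table formats like Delta Lake, Apache Iceberg, and Apache Hudi address this directly. Delta Lake provides an OPTIMIZE command (e.g., OPTIMIZE table_name ZORDER BY column_name), Iceberg provides the rewrite_data_files procedure, and Hudi provides compaction and clustering. Several of these can also merge small files automatically during or after writes. Because they track data files in table metadata (a transaction log in Delta Lake) rather than relying on directory listings, readers avoid expensive LIST operations as well.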
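
The compaction commands can be issued from PySpark through spark.sql. The sketch below only builds the SQL text so it can be inspected before running; the helper names, table names, and catalog name are hypothetical, and the statements assume Delta Lake or Iceberg is already configured on the session.

```python
from typing import Optional, Sequence


def delta_optimize_sql(table: str, zorder_by: Optional[Sequence[str]] = None) -> str:
    """Build a Delta Lake OPTIMIZE statement, optionally with Z-ordering."""
    sql = f"OPTIMIZE {table}"
    if zorder_by:
        sql += f" ZORDER BY ({', '.join(zorder_by)})"
    return sql


def iceberg_rewrite_sql(catalog: str, table: str) -> str:
    """Build a call to Iceberg's rewrite_data_files Spark procedure."""
    return f"CALL {catalog}.system.rewrite_data_files(table => '{table}')"


# Identifiers are interpolated as-is, so pass only trusted names.
# Hypothetical usage with an existing SparkSession `spark`:
#   spark.sql(delta_optimize_sql("events", ["user_id"]))
#   spark.sql(iceberg_rewrite_sql("my_catalog", "db.events"))
print(delta_optimize_sql("events", ["user_id"]))
print(iceberg_rewrite_sql("my_catalog", "db.events"))
```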

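Adjust Spark Configurations: Tune spark.sql.shuffle.partitions, which sets the number of partitions after a shuffle and therefore the number of files written when the final stage of a job is a shuffle. With adaptive query execution enabled, Spark can also coalesce small shuffle partitions automatically (spark.sql.adaptive.coalescePartitions.enabled). For reading, spark.sql.files.maxPartitionBytes (128MB by default) caps how many bytes of small files are packed into a single input partition.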
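
To get a feel for how the read-side settings interact, the following is a simplified model of the way Spark's file sources pack small files into input partitions using spark.sql.files.maxPartitionBytes and spark.sql.files.openCostInBytes (defaults 128 MiB and 4 MiB). It is a sketch, not Spark's implementation: it ignores splitting of large files and format-specific details, and the function names and the parallelism of 8 are assumptions.

```python
from typing import List, Sequence

MIB = 1024 ** 2


def max_split_bytes(file_sizes: Sequence[int], max_partition_bytes: int,
                    open_cost: int, parallelism: int) -> int:
    """Per-partition byte budget: capped by maxPartitionBytes, floored by open cost."""
    total = sum(size + open_cost for size in file_sizes)
    bytes_per_core = total // parallelism
    return min(max_partition_bytes, max(open_cost, bytes_per_core))


def pack_files(file_sizes: Sequence[int], max_partition_bytes: int = 128 * MIB,
               open_cost: int = 4 * MIB, parallelism: int = 8) -> List[List[int]]:
    """Greedily pack files (largest first) into partitions under the byte budget."""
    budget = max_split_bytes(file_sizes, max_partition_bytes, open_cost, parallelism)
    partitions: List[List[int]] = []
    current: List[int] = []
    current_size = 0
    for size in sorted(file_sizes, reverse=True):
        if current and current_size + size > budget:
            partitions.append(current)
            current, current_size = [], 0
        # Each file is charged its size plus a fixed "open cost".
        current_size += size + open_cost
        current.append(size)
    if current:
        partitions.append(current)
    return partitions


files = [1 * MIB] * 10_000  # hypothetical: 10,000 files of 1 MiB each
default_parts = pack_files(files)
bigger_parts = pack_files(files, max_partition_bytes=256 * MIB)
print(len(default_parts), len(bigger_parts))

# On a real session the setting would be changed with, for example:
#   spark.conf.set("spark.sql.files.maxPartitionBytes", str(256 * MIB))
```

In this model the open cost dominates for 1 MiB files, so raising maxPartitionBytes reduces the task count, but every task still opens dozens of files. Compacting the files themselves removes that cost at the source.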

Batch Streaming Writes: Use a longer trigger interval (e.g., 5-10 minutes instead of 1 minute) so that each micro-batch accumulates more data and writes fewer, larger files. The trade-off is higher end-to-end latency.
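
The effect of the trigger interval on file count is simple arithmetic, sketched below. The function name and the figure of 200 files per micro-batch are assumptions for illustration; the real number depends on how many non-empty partitions each batch writes.

```python
def files_per_day(trigger_seconds: int, files_per_batch: int) -> int:
    """Upper bound on files written per day if every micro-batch writes files_per_batch files."""
    batches = 86_400 // trigger_seconds
    return batches * files_per_batch


one_minute = files_per_day(60, 200)    # 1-minute trigger
ten_minutes = files_per_day(600, 200)  # 10-minute trigger
print(one_minute, ten_minutes)

# Hypothetical usage on a streaming DataFrame `stream_df`:
#   stream_df.writeStream.trigger(processingTime="10 minutes") \
#       .format("parquet").option("checkpointLocation", "...").start("...")
```

Moving from a 1-minute to a 10-minute trigger cuts the daily file count by a factor of ten in this model, and each file is correspondingly larger.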

While larger files improve read performance and reduce overhead, excessively large files can hinder parallelism (fewer tasks) and make granular updates or deletions less efficient. Optimal file size balances these factors.

In the interview, also mention the direct impact on cloud storage costs (S3 LIST/GET requests) and how managed data warehousing solutions like Snowflake abstract this away through their micro-partitioning and clustering mechanisms.
