Small File, Large Impact — Addressing the Small File Issue in Spark

Siraj
Jul 14, 2023

Introduction:

As data engineers, most of us have used or are still using Apache Spark in our day-to-day work. The way it handles and processes large-scale data still amazes me. It has proven to be powerful and has revolutionized the way we handle big data.
One critical challenge that often gets overlooked is the small file problem. While Spark is renowned for its ability to handle large datasets efficiently, the presence of numerous small files can drastically impact the performance and scalability of your Spark applications.

In this article, we’ll delve into the reasons behind the small file problem and explore effective strategies to overcome it, enabling smoother and more efficient data processing in Spark.

What Causes the Small File Problem?

Every file that Spark processes requires some metadata to be created and stored. This metadata includes information like the file location, schema, partitions, etc.
When you have a large dataset comprising hundreds of thousands of small files, the storage and maintenance of metadata for each small file adds significant overhead to the overall job execution.

Some key sources of overhead are:

File metadata storage in memory — Storing metadata for a large number of small files takes up a lot of memory on the driver.

Increased memory pressure on executors — When a dataset is spread across many small files, each task processes only a handful of records but still pays the fixed per-file cost of opening readers and allocating buffers, so the ratio of overhead to useful data grows and memory pressure on the executors increases.

High task scheduling overhead — With small files, Spark ends up creating many tasks to process the dataset. The overhead of scheduling and coordinating these tasks reduces performance.

Sub-optimal disk I/O — Reading from a large number of small files leads to random disk I/O, which is very inefficient.

Inefficient parallelism — Spark’s parallel processing capabilities are based on dividing data into partitions and executing tasks on each partition concurrently. With an abundance of small files, the number of partitions grows with the file count, but each task does very little useful work, so scheduling overhead dominates and overall execution slows down.

Serialization costs — Serializing and deserializing data from many small files has CPU costs.

So, in summary, having many small files results in high metadata overhead, increased memory pressure, inefficient disk I/O, serialization costs, and inefficient parallelism, all of which can significantly slow down your Spark jobs.
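
To see the problem in practice, one quick check is how many input partitions (and therefore tasks) Spark creates when pointed at a folder of small files. The snippets in this article are PySpark sketches; the paths, names, and numbers are placeholders rather than recommendations.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("small-file-demo").getOrCreate()

# Placeholder path: a directory holding a very large number of small JSON files
df = spark.read.json("/data/events/raw/")

# Each input partition becomes at least one task; with many small files this
# number, and the scheduling overhead that comes with it, grows quickly
print(df.rdd.getNumPartitions())
```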

Strategies to Overcome the Small File Problem

Here are some effective strategies to avoid or mitigate the small file problem in Spark:

Combine small files into larger files — Use a preprocessing (compaction) step to pack small files into larger ones. Consolidating files optimizes metadata management, I/O operations, and task scheduling, resulting in improved performance.
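
As a rough sketch, a compaction job can be as simple as reading the small files and rewriting them with fewer partitions (paths and the partition count are placeholders):

```python
# Read the existing small files...
df = spark.read.parquet("/data/events/small_files/")

# ...and rewrite them as a handful of larger files. coalesce(32) is only
# illustrative; pick a count that yields roughly 128 MB to 1 GB per output file.
(df.coalesce(32)
   .write
   .mode("overwrite")
   .parquet("/data/events/compacted/"))
```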

Use file formats with small metadata — File formats like Parquet and ORC require less metadata storage compared to formats like JSON. Converting small files to Parquet can help as Spark works very well with Parquet.
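
For example, converting a folder of small JSON files to Parquet takes only a couple of lines (paths are placeholders):

```python
# Read the small JSON files and rewrite them as Parquet, which keeps its
# schema and statistics in a compact footer instead of per-record text
json_df = spark.read.json("/data/raw_json/")

(json_df.write
        .mode("overwrite")
        .parquet("/data/raw_parquet/"))
```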

Tune partitioning with repartition and coalesce — repartition() redistributes the data across a chosen number of partitions (with a shuffle), which helps spread work evenly across executor tasks after reading many small files, while coalesce() merges partitions down to a smaller count without a full shuffle, which is handy for writing fewer, larger output files.
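
Here is a sketch of both calls, along with two related read-side settings that control how Spark groups small files into input splits: spark.sql.files.maxPartitionBytes (the maximum amount of data packed into one split) and spark.sql.files.openCostInBytes (the cost charged per opened file). The values shown are illustrative only.

```python
# Read-side tuning: let Spark pack more small files into each input split
spark.conf.set("spark.sql.files.maxPartitionBytes", 256 * 1024 * 1024)
spark.conf.set("spark.sql.files.openCostInBytes", 8 * 1024 * 1024)

df = spark.read.parquet("/data/events/small_files/")

# repartition() shuffles data into the requested number of partitions,
# useful for spreading work evenly across executors
balanced = df.repartition(200)

# coalesce() merges down to fewer partitions without a full shuffle,
# useful just before a write to avoid producing many tiny output files
fewer_files = df.coalesce(16)
```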

File Compression — Compressing files reduces storage requirements, optimizes I/O operations, and improves data transfer across the network. Popular codecs like Snappy and Gzip (Parquet in Spark applies Snappy by default) can significantly reduce file sizes while maintaining data integrity.
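
In practice the codec is just a writer option; reusing the DataFrame from the earlier sketches:

```python
# Explicitly choose a compression codec when writing Parquet
(df.write
   .option("compression", "gzip")   # or "snappy", Spark's Parquet default
   .mode("overwrite")
   .parquet("/data/events/compressed/"))
```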

Partitioning — If consolidating files is not feasible due to specific requirements, consider partitioning the data on meaningful, low-cardinality attributes such as date or location; partitioning on something as fine-grained as customer ID tends to recreate the small file problem one directory at a time. Partitioning enables Spark to prune unnecessary data while executing queries, reducing the data scanned and enhancing query performance.
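
A minimal sketch, assuming the DataFrame has an event_date column to partition on:

```python
# Write data partitioned by date so queries filtering on event_date
# only scan the matching directories (partition pruning)
(df.write
   .partitionBy("event_date")
   .mode("overwrite")
   .parquet("/data/events/partitioned/"))
```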

Archiving and Cleanup — Regularly archive or remove obsolete and infrequently accessed small files. Archiving files to long-term storage solutions like HDFS snapshots or cloud storage can help preserve historical data while reducing clutter. Cleaning up unnecessary small files can improve overall system performance and reduce maintenance overhead.

Data Ingestion Optimization — When ingesting data into Spark, consider using larger batch sizes or buffering techniques to reduce the number of small files generated. Additionally, explore real-time ingestion frameworks like Apache Kafka to aggregate data before Spark writes it out to storage.
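
As a rough sketch, a Structured Streaming job reading from Kafka can use a longer trigger interval and coalesce each micro-batch before writing; the broker, topic, paths, and interval below are placeholders:

```python
stream = (spark.readStream
               .format("kafka")
               .option("kafka.bootstrap.servers", "broker:9092")
               .option("subscribe", "events")
               .load())

def write_batch(batch_df, batch_id):
    # Fewer, larger files per micro-batch instead of one file per task
    batch_df.coalesce(4).write.mode("append").parquet("/data/events/ingested/")

query = (stream.writeStream
               .foreachBatch(write_batch)
               .trigger(processingTime="10 minutes")   # larger micro-batches
               .option("checkpointLocation", "/chk/events/")
               .start())
```

The longer the trigger interval, the fewer and larger the output files, at the cost of higher end-to-end latency, so pick an interval that matches how fresh the data needs to be.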

Summary:

Which of the strategies above gives the best results depends on the scenario you are facing. I’ll write a follow-up article soon about the compaction process I carried out in my org to handle small files.

Remember, in the world of Big Data, it’s not just the size that matters, but also the structure and organization of your data that can make all the difference.
