The Battle of the Compressors: Optimizing Spark Workloads with ZStd, Snappy and More for Parquet
Hello!
Hope you’re having a wonderful time working with challenging issues around Data and Data Engineering.
In this article, let’s look at the different compression algorithms Apache Spark offers when writing data to Parquet, and which one to leverage for your particular use case.
- Snappy — A fast and efficient compression algorithm developed by Google. It prioritizes speed over compression ratio. Snappy is Spark’s default codec for Parquet output (a configuration sketch follows this list).
- LZ4 — A very fast compression algorithm that focuses on decompression speed. LZ4 provides a good balance between speed and compression ratio.
- ZStandard (ZStd) — A modern compression algorithm that provides a good compression ratio while still being pretty fast. ZStd offers compression ratios comparable to ZLib with faster compression/decompression speeds.
- Deflate/ZLib (gzip) — A common compression standard that provides good compression ratios but is slower than the algorithms above. ZLib aims for a balance between compression ratio and speed; in Spark’s Parquet writer this family is exposed as the gzip codec.
- Brotli — A more modern compression algorithm that can provide 20–26% better compression ratios than ZLib, but at lower speeds. Brotli compression can produce smaller output sizes but takes longer to compress/decompress.
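If you want every Parquet write in a session to use the same codec, you can set it once through Spark configuration instead of per write. A minimal sketch using the spark.sql.parquet.compression.codec setting (the app name and output path are just placeholders):
from pyspark.sql import SparkSession
# Set a session-wide default codec for Parquet output
spark = (
    SparkSession.builder
    .appName("compression-config")
    .config("spark.sql.parquet.compression.codec", "zstd")
    .getOrCreate()
)
# All Parquet writes now use zstd unless overridden with .option("compression", ...)
df = spark.range(0, 1000).toDF("number")
df.write.mode("overwrite").parquet("data/default_codec")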
Let’s talk about some of the above algorithms in detail:
Snappy:
- Developed by Google for use in the MapReduce framework and Google Cloud Platform. Released in 2011.
- Snappy is designed as a very fast compressor/decompressor that prioritizes speed over compression ratio.
- Snappy aims to be significantly faster than LZMA/ZLib while still providing reasonable compression.
- Snappy uses a compression algorithm based on concepts from LZ77 but simplified and optimized for speed.
- No preset dictionary or entropy coding; instead, Snappy compresses data in small blocks using LZ77-style matching, including overlapping copies.
- Typical compression ratio is around 2x on large data workloads, lower than ZLib, which typically achieves 3–4x.
- Decompression is extremely fast, over an order of magnitude faster than ZLib in some tests.
- Single-threaded compression but highly parallelizable decompression.
- Supported natively in Spark SQL; it is the default compression codec for Parquet output (spark.sql.parquet.compression.codec).
- Useful when speed is critical and data tends to be repetitive, like log files, JSON/XML documents, etc.
- Not the highest compression ratio, but very fast performance: much faster than ZLib while still providing reasonable compression.
In summary, Snappy emphasizes speed over everything else. It has very low CPU overhead for compression and extremely fast decompression, which makes it a good choice when Spark needs to work very fast, even if some compression ratio is sacrificed.
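To see this speed trade-off on your own workload, a rough timing sketch like the one below can help (the synthetic dataframe, row counts, and paths are placeholders, not a rigorous benchmark):
import time
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("snappy-timing").getOrCreate()
# A simple synthetic dataframe; swap in your own table for realistic numbers
df = spark.range(0, 5_000_000).selectExpr("id", "id % 100 AS bucket", "uuid() AS payload")
start = time.time()
df.write.mode("overwrite").option("compression", "snappy").parquet("data/snappy_timing")
print(f"snappy write took {time.time() - start:.1f}s")
start = time.time()
df.write.mode("overwrite").option("compression", "gzip").parquet("data/gzip_timing")
print(f"gzip write took {time.time() - start:.1f}s")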
ZStandard (ZStd):
- Developed by Facebook in 2016 as a modern replacement for ZLib/Deflate.
- ZSTD provides compression ratios comparable to ZLib, but much faster compression and decompression speeds. Often 2–5x faster.
- Zstd leverages multiple compression techniques, including Finite State Entropy (tANS) coding, Huffman coding of literals, repeated-offset sequence modeling, and optional dictionary compression.
- Zstd efficiently scales across multiple threads and cores. Enables multi-threaded compression.
- Zstd offers compression levels from 1 (fastest) up to 22 (best compression). Level 3, the default, provides a good balance (a tuning sketch follows this list).
- Supported natively in Spark since version 2.3.
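If you want to tune how hard ZStd works, Spark exposes spark.io.compression.zstd.level for its internal (shuffle/broadcast) compression, while the level used for the Parquet files themselves comes from the underlying Parquet codec configuration. A minimal sketch, assuming a recent Spark/Parquet build that honors the parquet.compression.codec.zstd.level Hadoop setting (treat that key as an assumption to verify for your version):
from pyspark.sql import SparkSession
spark = (
    SparkSession.builder
    .appName("zstd-tuning")
    # Use zstd for Spark's internal shuffle/broadcast compression as well
    .config("spark.io.compression.codec", "zstd")
    .config("spark.io.compression.zstd.level", "3")
    # Assumption: recent Parquet builds read the file-level zstd level from this Hadoop conf key
    .config("spark.hadoop.parquet.compression.codec.zstd.level", "6")
    .getOrCreate()
)
df = spark.range(0, 1_000_000).toDF("number")
df.write.mode("overwrite").option("compression", "zstd").parquet("data/zstd_level6")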
Brotli:
- Developed by Google in 2015, originally for web content compression.
- Can provide 20–26% better compression than ZLib/Deflate, but at slower speeds.
- Uses a combination of LZ77, Huffman coding, and 2nd order context modeling to achieve higher compression.
- Designed for compressing text-based formats like JSON, XML, HTML. Performs especially well on larger texts.
- Slower compression and decompression than ZStd, but produces smaller output size.
- Spark accepts brotli as a Parquet compression option, but the codec itself is not bundled; it requires a third-party Brotli codec library on the classpath.
- Effective for cold data that is compressed once and read many times later.
In summary, ZStd hits a sweet spot of good compression and speed, while Brotli optimizes for compression ratio at the cost of speed. Choosing between them depends on your priorities and dataset characteristics.
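One practical way to decide is to write a representative sample of your data with each codec and compare write time and on-disk size. A minimal local-filesystem sketch (paths and the synthetic dataframe are placeholders; Brotli is left out because it needs the extra codec library, and on HDFS/S3 you would query the filesystem API instead of os.walk):
import os
import time
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("codec-comparison").getOrCreate()
df = spark.range(0, 2_000_000).selectExpr("id", "uuid() AS payload")
def dir_size_mb(path):
    # Sum the size of all files under a local output directory
    total = sum(
        os.path.getsize(os.path.join(root, f))
        for root, _, files in os.walk(path)
        for f in files
    )
    return total / (1024 * 1024)
for codec in ["snappy", "lz4", "zstd", "gzip"]:
    path = f"data/compare_{codec}"
    start = time.time()
    df.write.mode("overwrite").option("compression", codec).parquet(path)
    print(f"{codec}: {time.time() - start:.1f}s, {dir_size_mb(path):.1f} MB")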
Here is some example code for reference on how to use the different compression options discussed:
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('compression').getOrCreate()
# Create a sample dataframe
df = spark.range(0, 1000).toDF("number")
# Write dataframe using Snappy (default)
df.write.parquet("data/snappy")
# Write using LZ4
df.write.option("compression", "lz4").parquet("data/lz4")
# Write using ZStandard
df.write.option("compression", "zstd").parquet("data/zstd")
# Write using Deflate/ZLib (exposed in Spark as the gzip codec)
df.write.option("compression", "gzip").parquet("data/gzip")
# Write using Brotli (requires a third-party Brotli codec library on the classpath)
df.write.option("compression", "brotli").parquet("data/brotli")
If you have any queries regarding this topic or other topics around Data Engineering, feel free to hit me up with a message on LinkedIn.
LinkedIn: Sirajudeen A
If you liked the article hit me with a follow here! Thanks!!