Siraj · Enabling Python print statements in Docker detached container logs · Docker is used to run our application in a containerized fashion. Using one Docker image, we can have multiple instances of our same… · 1 min read · Jan 22, 2024
Siraj · The Battle of the Compressors: Optimizing Spark Workloads with ZStd, Snappy and More for Parquet · Hello! Hope you’re having a wonderful time working with challenging issues around Data and Data Engineering. · 3 min read · Oct 22, 2023
Siraj · Small File, Large Impact — Addressing the Small File Issue in Spark · Introduction: · 4 min read · Jul 14, 2023
Siraj · Distributed Computing 103: Advanced Techniques and Best Practices · In Distributed Computing 101, we explored the basics, and in 102 we explored the strategies for scaling and optimizing your distributed… · 5 min read · Apr 3, 2023
Siraj · Distributed Computing 102: Scaling and Optimizing Your Data Processing · In Distributed Computing 101, we explored the fundamentals of parallel processing and the technical terminologies that are essential to… · 4 min read · Mar 29, 2023
Siraj · Distributed Computing 101: An Introduction to the World of Parallel Processing · In today’s data-driven world, organizations are collecting and processing more data than ever before. As data volumes continue to grow… · 3 min read · Mar 28, 2023
Siraj · Mastering JSON Handling in Apache Spark: A Guide to MapType, ArrayType, and Custom Schemas · One of the many strengths of Spark is its ability to process complex data formats, such as JSON. JSON, or JavaScript Object Notation, is a… · 4 min read · Jan 31, 2023
Siraj · Garbage Collection in Spark: Why it Matters and How to Optimize it for Optimal Performance · Garbage collection is an essential component of any Java-based application, and this is especially true for Apache Spark, which heavily… · 3 min read · Jan 12, 2023
Siraj · Apache Spark Performance — Tuning Techniques for Optimal Data Processing · Apache Spark is a powerful tool for big data processing, but as with any complex system, it can be challenging to get the best performance… · 3 min read · Jan 11, 2023
Siraj · Reading excel files with Pyspark in AWS Glue and EMR · Ever tried to read Excel files on Glue (PySpark) and given up because it wasn’t working out? Well, look no more: this article will show you… · 3 min read · Jul 1, 2021