Enabling Python print statements in Docker detached container logs
Docker is used to run our application in a containerized fashion. Using one docker image we can have multiple instances of our same…
Jan 22

The Battle of the Compressors: Optimizing Spark Workloads with ZStd, Snappy and More for Parquet
Hello! Hope you’re having a wonderful time working with challenging issues around Data and Data Engineering.
Oct 22, 2023

Small File, Large Impact — Addressing the Small File Issue in Spark
Jul 14, 2023

Distributed Computing 103: Advanced Techniques and Best Practices
In Distributed Computing 101, we explored the basics and in 102 we explored the strategies for scaling and optimizing your distributed…
Apr 3, 2023

Distributed Computing 102: Scaling and Optimizing Your Data Processing
In Distributed Computing 101, we explored the fundamentals of parallel processing and the technical terminologies that are essential to…
Mar 29, 2023

Distributed Computing 101: An Introduction to the World of Parallel Processing
In today’s data-driven world, organizations are collecting and processing more data than ever before. As data volumes continue to grow…
Mar 28, 2023

Mastering JSON Handling in Apache Spark: A Guide to MapType, ArrayType, and Custom Schemas
One of the many strengths of Spark is its ability to process complex data formats, such as JSON. JSON, or JavaScript Object Notation, is a…
Jan 31, 2023

Garbage Collection in Spark: Why it Matters and How to Optimize it for Optimal Performance
Garbage collection is an essential component of any Java-based application, and this is especially true for Apache Spark, which heavily…
Jan 12, 2023

Apache Spark Performance — Tuning Techniques for Optimal Data Processing
Apache Spark is a powerful tool for big data processing, but as with any complex system, it can be challenging to get the best performance…
Jan 11, 2023

Reading Excel files with PySpark in AWS Glue and EMR
Ever tried to read Excel files on Glue (PySpark) and given up because it wasn’t working out? Well, look no more: this article will show you…
Jul 1, 2021