Reading Excel files with PySpark in AWS Glue and EMR

Siraj
3 min read · Jul 1, 2021


Ever tried to read Excel files on Glue (PySpark) and gave up because it wasn’t working out? Well, look no more: this article will show you the steps involved in working with Excel files using PySpark in Glue and EMR.

PySpark doesn’t ship with built-in support for Excel files. There is, however, a library created by crealytics for Scala (spark-excel), which we’ll be using from PySpark. Working with Excel in Databricks is easy: go to the library section, enter the package name, and with a single click it’s downloaded, installed, and ready to use. That isn’t the case with AWS Glue.

A small detour for people working on Glue for the first time: AWS Glue works differently in that the libraries we want to use must first be shipped to an S3 bucket, and their S3 paths supplied while creating the Glue job (pure-Python libraries go in the Python library path box, while JARs such as spark-excel go in the Dependent JARs path). Only then can these libraries be used in the code. Also note that Glue doesn’t support Python libraries that rely on C extensions. At the time of writing, Glue provides only Spark 2.4.3, and the default Scala version that comes with it is 2.11; these are not the latest versions available elsewhere.

So coming back to the point of working with Excel files in Glue: you must manually download (from the MVN Repository) all of the jars, keeping the exact versions shown in the picture below. If the versions differ, it might not work.

[Image: spark-excel dependencies]

Ship all these libraries to an S3 bucket and point the Glue job at them: since these are JARs, their comma-separated S3 paths go in the job’s Dependent JARs path box (the --extra-jars job parameter). Make sure the Glue job’s IAM role has the necessary policies to access this bucket.
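As a rough sketch, here’s how the jar paths could be wired into a job with boto3. The bucket, job name, role, and script location are placeholders, and the jar list is illustrative rather than the exact dependency set:

import boto3

glue = boto3.client("glue")

# Hypothetical S3 locations for the spark-excel jar and its dependencies
jar_paths = ",".join([
    "s3://my-bucket/jars/spark-excel_2.11-0.13.1.jar",
    "s3://my-bucket/jars/poi-4.1.2.jar",
    # ...remaining dependency jars shipped to the same prefix
])

glue.create_job(
    Name="read-excel-job",                    # placeholder job name
    Role="MyGlueServiceRole",                 # role must be able to read the jar bucket
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://my-bucket/scripts/sample_script.py",
        "PythonVersion": "3",
    },
    DefaultArguments={
        "--extra-jars": jar_paths,            # comma-separated S3 paths to the jars
    },
    GlueVersion="1.0",                        # a Glue version that provides Spark 2.4.3
)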

Now we’ll jump into the code. After initializing the SparkSession, we can read the Excel file as shown below.

[Image: sample Excel file read using PySpark]
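A minimal sketch of that read in PySpark, assuming the jars above are attached to the job and a hypothetical input file on S3:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("excel-read").getOrCreate()

# Read the first sheet, treating the first row as the header
df = (
    spark.read.format("com.crealytics.spark.excel")
    .option("header", "true")
    .option("inferSchema", "true")
    .load("s3://my-bucket/input/sample.xlsx")   # placeholder path
)

df.show(5)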

The options available for reading are listed below; a short PySpark sketch follows the listing.

spark.read
.format("com.crealytics.spark.excel")
.option("dataAddress", "'My Sheet'!B3:C35") // Optional, default: "A1"
.option("header", "true") // Required
.option("treatEmptyValuesAsNulls", "false") // Optional, default: true
.option("setErrorCellsToFallbackValues", "true") // Optional, default: false, where errors will be converted to null. If true, any ERROR cell values (e.g. #N/A) will be converted to the zero values of the column's data type.
.option("usePlainNumberFormat", "false") // Optional, default: false, If true, format the cells without rounding and scientific notations
.option("inferSchema", "false") // Optional, default: false
.option("addColorColumns", "true") // Optional, default: false
.option("timestampFormat", "MM-dd-yyyy HH:mm:ss") // Optional, default: yyyy-mm-dd hh:mm:ss[.fffffffff]
.option("maxRowsInMemory", 20) // Optional, default None. If set, uses a streaming reader which can help with big files
.option("excerptSize", 10) // Optional, default: 10. If set and if schema inferred, number of rows to infer schema from
.option("workbookPassword", "pass") // Optional, default None. Requires unlimited strength JCE for older JVMs
.schema(myCustomSchema) // Optional, default: Either inferred schema, or all columns are Strings
.load("Worktime.xlsx")

Similarly, the options available for writing are listed below; again, a PySpark sketch follows the listing.

df.write
.format("com.crealytics.spark.excel")
.option("dataAddress", "'My Sheet'!B3:C35")
.option("header", "true")
.option("dateFormat", "yy-mmm-d") // Optional, default: yy-m-d h:mm
.option("timestampFormat", "mm-dd-yyyy hh:mm:ss") // Optional, default: yyyy-mm-dd hh:mm:ss.000
.mode("append") // Optional, default: overwrite.
.save("Worktime2.xlsx")

In EMR we can read and write in code the same way as shown above. The only addition is an extra parameter in the spark-submit command: for Scala 2.12, --packages com.crealytics:spark-excel_2.12:0.13.1, or for Scala 2.11, --packages com.crealytics:spark-excel_2.11:0.13.1.

Example command: spark-submit --packages com.crealytics:spark-excel_2.11:0.13.1 sample_script.py

I hope this helped!!
