Processing XML with AWS Glue and Databricks Spark-XML
A fast introduction to Glue and some tricks for XML processing, which has never been easy
Working with unstructured data can be cumbersome, and keeping control over it can become a mammoth task if you have strict rules on data quality and structure.
In this article I will share my experience of processing XML files with Glue transforms versus the Databricks Spark-XML library.
AWS Glue is “the” ETL service provided by AWS. It has three main components: the Data Catalog, Crawlers and ETL Jobs. A Crawler helps you extract information (schema and statistics) about your data, while the Data Catalog is used for centralised metadata management. With ETL Jobs, you can process data stored on AWS data stores with either Glue's proposed scripts or your own custom scripts with additional libraries and jars.
XML… Firstly, you can use a Glue crawler to explore the data schema. As XML data is mostly multilevel nested, the crawled metadata table will have complex data types such as structs and arrays of structs. And you won’t be able to query the XML with Athena, since it is not supported, so it is necessary to convert the XML into a flat format. To flatten it you can either take the easy way and use Glue’s magic(!), a simple trick of converting it to CSV, or you can use Glue transforms to flatten the data, which I will elaborate on shortly.
You need to be careful with the flattening: it might produce null values even when the data is present in the original structure.
I will give examples of the alternative approaches; which one to choose depends on your use case.
- Crawl XML
- Convert to CSV with Glue Job
- Using Glue PySpark Transforms to flatten the data
- An Alternative : Use Databricks Spark-xml
Dataset: http://opensource.adobe.com/Spry/data/donuts.xml
Code & Snippets: https://github.com/elifinspace/GlueETL/tree/article-2
0. Upload dataset to S3:
Download the file from the given link and go to S3 service on AWS console.
Click on the bucket name and click on Upload. (This is the easiest way to do it; you can also set up the AWS CLI to interact with AWS services from your local machine, which requires a bit more work, including installing and configuring the AWS CLI.)
Click on Add files and choose the file you would like to upload, just click Upload.
1. Crawl XML Metadata
First of all, if you know which tag in the XML data to use as the base level for schema exploration, you can create a custom classifier in Glue. Without a custom classifier, Glue will infer the schema from the top level.
In the example XML dataset above, I will choose “items” as the row tag for my classifier and create the classifier as easily as follows:
Go to Glue UI and click on Classifiers tab under Data Catalog section.
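If you prefer to script this step instead of using the console, the same classifier can be created with boto3. This is a sketch under my own assumptions: the classifier name `donuts-xml-classifier` is a hypothetical label, and the actual API call needs AWS credentials, so it is left commented out.

```python
# Sketch: create the XML custom classifier programmatically.
# "items" is the row tag from the donuts.xml example; the classifier
# name below is a placeholder of my own choosing.

def xml_classifier_params(name, row_tag):
    """Build the request body for glue.create_classifier."""
    return {
        "XMLClassifier": {
            "Name": name,
            "Classification": "xml",  # must be "xml" for an XML classifier
            "RowTag": row_tag,        # the tag Glue treats as one row while crawling
        }
    }

params = xml_classifier_params("donuts-xml-classifier", "items")

# Requires AWS credentials; run in an environment where boto3 is configured:
# import boto3
# boto3.client("glue").create_classifier(**params)
```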
I create the crawler with the classifier:
Add Another Data Store : No
You can use your IAM role with the relevant read/write permissions on the S3 bucket or you can create a new one :
Frequency: Run On Demand
Review and Click Finish.
Now we are ready to run the crawler: select the crawler and click on Run Crawler. Once the status is ‘Ready’, visit the Database section and see the tables in the database.
Go to Tables and filter your DB:
Click on table name and the output schema is as follows:
Now we have an idea of the schema, but we have complex data types and need to flatten the data.
2. Convert to CSV:
This will be simple, as we will use the script proposed by Glue:
Go to Jobs section in ETL menu and Add Job:
Now the magic step: (if we had selected Parquet as the format, we would have to do the flattening ourselves, since Parquet can hold complex types; for CSV the flattened mapping is produced automatically.)
I leave everything as default, review, save and continue with edit script.
Glue proposed script:
I have added some lines to the proposed script to generate a single CSV output, otherwise the output will be multiple small csv files based on partitions.
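The single-file trick boils down to coalescing the data to one partition before writing. Below is a minimal sketch of that idea, with placeholder names; the full script is in the GitHub repo linked above. It assumes the Glue job runtime, so the Glue imports live inside the function.

```python
# Sketch: write a DynamicFrame as a single CSV file instead of one file
# per partition. The output path is a placeholder.

def write_single_csv(glueContext, dyf, output_path):
    """Coalesce to one partition, then write CSV via the Glue writer."""
    from awsglue.dynamicframe import DynamicFrame  # available in the Glue job runtime

    # Round-trip through a Spark DataFrame so we can coalesce to 1 partition.
    single = DynamicFrame.fromDF(dyf.toDF().coalesce(1), glueContext, "single")
    glueContext.write_dynamic_frame.from_options(
        frame=single,
        connection_type="s3",
        connection_options={"path": output_path},
        format="csv",
    )
```

Note that `coalesce(1)` funnels all data through a single executor, which is fine for a small dataset like this but not for large ones.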
Save and click on Run Job. This will bring up a configuration review, so you can set the DPU to 2 (the minimum) and the timeout as follows:
Let’s run it and see the output. You can monitor the status in the Glue UI as follows:
Once the Run Status is Succeeded, go to your target S3 location:
Click on the file name and go to the Select From tab as below:
If you scroll down, you can preview and query small files easily by clicking Show File Preview/Run SQL(Athena in the background):
3. Glue PySpark Transforms for Unnesting
There are two PySpark transforms provided by Glue:
- Relationalize: unnests the nested columns, pivots array columns, generates join keys for relational operations (joins, etc.) and produces a list of frames
- UnnestFrame: unnests the frame, generates join keys for array-type columns and produces a single frame with all fields, including the join-key columns.
We will use a Glue DevEndpoint to visualize these transformations:
A Glue DevEndpoint is the connection point to data stores for debugging your scripts and doing exploratory analysis on data, using a Glue Context with a SageMaker or Zeppelin notebook.
Moreover, you can also access this endpoint from Cloud9, the cloud-based IDE for writing, running and debugging your code. You just need to generate an SSH key on the Cloud9 instance and add the public key while creating the endpoint. To connect to the endpoint, use the “SSH to Python REPL” command in the endpoint details (click on the endpoint name in the Glue UI), replacing the private-key parameter with the location of your key on the Cloud9 instance.
- Create a Glue DevEndpoint and a SageMaker Notebook:
I will also use this endpoint for the Databricks spark-xml example, so download the jar file to your PC from https://mvnrepository.com/artifact/com.databricks/spark-xml_2.11/0.4.1, upload the jar to S3 and set “Dependent jars path” accordingly:
You can leave every other configuration as default and click Finish. It takes approx. 6 minutes for the endpoint to become Ready.
Once the endpoint is ready, we are ready to create a notebook to connect to it.
Choose your endpoint and click Create SageMaker Notebook from the Actions drop-down list. It will take a couple of minutes for the notebook to be ready once created.
Open the notebook and create a new PySpark notebook:
You can copy and paste the boilerplate from the CSV job we created previously, change the glueContext line as below, and comment out the job-related libraries and snippets:
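For reference, here is a minimal sketch of what the notebook boilerplate reduces to. The S3 path is a placeholder for your own bucket, and the Glue imports sit inside the function because they only exist on the DevEndpoint.

```python
# Minimal DevEndpoint notebook boilerplate, with the Job/bookmark pieces
# stripped out. The S3 path is a placeholder.

def make_items_frame():
    """Read the XML from S3 into a DynamicFrame, one record per <items> tag."""
    from awsglue.context import GlueContext   # available on the Glue DevEndpoint
    from pyspark.context import SparkContext

    glueContext = GlueContext(SparkContext.getOrCreate())
    return glueContext.create_dynamic_frame.from_options(
        connection_type="s3",
        connection_options={"paths": ["s3://your-bucket/donuts.xml"]},
        format="xml",
        format_options={"rowTag": "items"},  # matches the custom classifier's tag
    )
```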
You can find more about format options in https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-format.html
- Relationalize:
I used the frame created by from_options for the following steps. (The outputs will be the same even if you use the catalog option; the catalog does not persist a static schema for the data.)
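Applying Relationalize looks roughly like the sketch below. The staging path and names are placeholders; the transform returns a DynamicFrameCollection with one frame per nested level, which is what the screenshots that follow walk through.

```python
def relationalize_items(glueContext, dyf, staging_path):
    """Flatten a nested DynamicFrame into a collection of relational frames."""
    from awsglue.transforms import Relationalize  # Glue-provided transform

    dfc = Relationalize.apply(
        frame=dyf,
        staging_path=staging_path,        # S3 (or local) path for intermediate files
        name="root",                      # name of the top-level frame
        transformation_ctx="relationalize",
    )
    # One frame per nested/array level, e.g. "root" and "root_fillings_filling",
    # linked through the join keys that Relationalize generates.
    return {name: dfc.select(name) for name in dfc.keys()}
```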
It will be clearer if you look at the root table. For example, the fillings field holds only an integer value in the root table, and this value matches the id column in the root_fillings_filling frame above.
An important thing to notice is that the “batters.batter” field propagated into multiple columns. For item 2 the “batters.batter” column is identified as a struct, yet for item 3 this field is an array! So here comes the difficulty of working with Glue.
- Unnest Frame:
Let’s see how this transform gives us a different output:
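The call itself is a one-liner; a sketch, again with the Glue import kept inside the function since it only exists in the Glue runtime:

```python
def unnest_items(dyf):
    """Flatten a DynamicFrame into a single wide frame with join-key columns."""
    from awsglue.transforms import UnnestFrame  # Glue-provided transform

    unnested = UnnestFrame.apply(frame=dyf)
    # unnested.printSchema() would show structs spread into dotted columns,
    # while array-of-struct fields remain arrays.
    return unnested
```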
UnnestFrame can spread out the upper-level structs, but it is not effective at flattening arrays of structs. And since we cannot apply UDFs on dynamic frames, we need to convert the dynamic frame into a Spark DataFrame and apply explode on the array-type columns to spread them into multiple rows. I will leave this part for your own investigation.
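As a starting point for that investigation, here is one possible sketch: a small helper that picks the array-typed columns out of `df.dtypes`, and a function that explodes each of them. `explode_outer` (rather than `explode`) is used so rows with empty arrays are kept as nulls instead of being dropped.

```python
def array_columns(dtypes):
    """Pick out array-typed columns from df.dtypes-style (name, type) pairs."""
    return [name for name, t in dtypes if t.startswith("array")]

def explode_arrays(dyf):
    """Convert a DynamicFrame to a DataFrame and explode every array column
    into multiple rows (one row per array element)."""
    from pyspark.sql.functions import explode_outer, col  # Spark runtime

    df = dyf.toDF()
    for name in array_columns(df.dtypes):
        df = df.withColumn(name, explode_outer(col(name)))
    return df
```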
And finally… Databricks spark-xml:
It may not be the best solution, but this package is very useful in terms of control and accuracy. A good feature is that unparseable records are also detected, and a _corrupt_record column is added with the relevant information.
So you don’t need to consider whether a column is a struct or an array; you can write a generic function for exploding array columns by making use of the extracted schema.
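Reading with spark-xml looks roughly like this sketch, assuming the jar is on the classpath (e.g. via the DevEndpoint's dependent-jars path) and using "item" as the row tag so each `<item>` becomes one row; the path is a placeholder:

```python
def read_with_spark_xml(spark, path):
    """Read XML into a DataFrame with the Databricks spark-xml package.
    Requires the spark-xml jar on the Spark classpath."""
    return (
        spark.read.format("com.databricks.spark.xml")
        .option("rowTag", "item")  # one row per <item> element
        .option("columnNameOfCorruptRecord", "_corrupt_record")
        .load(path)
    )
```

Because spark-xml yields a plain Spark DataFrame with its schema fully extracted, the same generic explode-the-array-columns approach shown earlier applies directly here.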
Just to mention: I used Databricks spark-xml in the Glue environment, but you can use it in a standalone Python script, since it is independent of Glue.
We saw that even though Glue provides one-line transforms for dealing with semi-structured and unstructured data, if we have complex data types we need to work with samples and see what fits our purpose.
Hope you enjoyed it !