
Big Data Transformations with Complex and Nested Data Types

Apache Spark Programming Tips & Tricks

Photo by Laura Ockel on Unsplash

Introduction

Apache Spark is a distributed Big Data analytics framework designed to transform, engineer, and process massive amounts of data (think terabytes and petabytes) across a cluster of machines. When working with diverse datasets, you will often come across complex data types and formats that require expensive compute and transformations (think IoT devices). Under the hood, Apache Spark is highly specialized and a master of its craft when it comes to scaling big data engineering efforts. In this blog, using the native Scala API, I will walk you through examples of 1.) how to flatten and normalize semi-structured JSON data with a nested schema (array and struct), 2.) how to pivot your data, and 3.) how to save the data to storage in Parquet format for downstream analytics. Note that the same exercises can be achieved using the Python API and Spark SQL.

Step 1: Normalize Semi-Structured Nested Schema

1a.) Let’s view our beautiful multi-line JSON data (dummy data from my favorite video game).
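If you want to follow along, here is a minimal sketch of how such a file could be read. The file path and every field name except subclass and super (which come up later in this post) are placeholders rather than the actual dataset.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("NestedDataTypes")
  .master("local[*]") // for local experimentation
  .getOrCreate()

// Each record in the file might look roughly like this (illustrative shape):
// {
//   "player": "guardian_one",
//   "characters": [
//     {
//       "name": "Hunter",
//       "class": "Agile",
//       "subclass": ["Nightstalker", "Gunslinger", "Arcstrider"],
//       "super": ["Shadowshot", "Golden Gun", "Arc Staff"]
//     }
//   ]
// }

// multiLine is needed because each JSON document spans several lines
val rawDF = spark.read
  .option("multiLine", true)
  .json("/mnt/demo/video_game.json") // placeholder path

rawDF.show(truncate = false)
```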

1b.) Next, to improve performance, I will map and construct our schema as a new StructType() to avoid triggering an unnecessary Spark job when reading the JSON data.
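A sketch of what that could look like for the placeholder fields above:

```scala
import org.apache.spark.sql.types._

// An explicit schema lets Spark skip schema inference, so no extra job is
// launched just to sample the file. Field names other than subclass and
// super are assumptions for illustration.
val characterSchema = StructType(Seq(
  StructField("name", StringType),
  StructField("class", StringType),
  StructField("subclass", ArrayType(StringType)),
  StructField("super", ArrayType(StringType))
))

val gameSchema = StructType(Seq(
  StructField("player", StringType),
  StructField("characters", ArrayType(characterSchema))
))

val df = spark.read
  .option("multiLine", true)
  .schema(gameSchema)
  .json("/mnt/demo/video_game.json") // same placeholder path as above
```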

1c.) Now we can print our schema and examine the data, which you can see is a crazy bundle of joy because of the data types involved. There are 12 total rows with 5 columns in this dataset; however, in its native nested schema it appears as only 2 rows with 2 columns.
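Continuing the sketch, the inspection itself is just a couple of calls (the commented output reflects the placeholder schema, not the real dataset):

```scala
df.printSchema()
// root
//  |-- player: string (nullable = true)
//  |-- characters: array (nullable = true)
//  |    |-- element: struct (containsNull = true)
//  |    |    |-- name: string (nullable = true)
//  |    |    |-- class: string (nullable = true)
//  |    |    |-- subclass: array (nullable = true)
//  |    |    |    |-- element: string (containsNull = true)
//  |    |    |-- super: array (nullable = true)
//  |    |    |    |-- element: string (containsNull = true)

df.show(truncate = false) // the nested form shows only 2 rows and 2 columns
```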

1d.) It is time to normalize this dataset using built-in Spark DataFrame API functions, including explode, which turns elements of array data types into individual rows, and dot-star (*) selection, which unpacks the subfields of struct data types. Since the subclass and super columns have a 1-to-1 element pairing, the slick arrays_zip function will also be used with dot selection to avoid creating extra row combinations during the explode, as shown in the sketch below.
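Here is one way to express that with the placeholder columns from the sketch above (only subclass and super are column names taken from this post; the rest are assumed):

```scala
import org.apache.spark.sql.functions.{arrays_zip, col, explode}

// First explode: one row per character, then unpack the struct with
// dot-star selection so its subfields become top-level columns.
val charDF = df
  .withColumn("character", explode(col("characters")))
  .select(col("player"), col("character.*"))

// Second explode: zip the parallel subclass/super arrays first so their
// 1-to-1 pairing is preserved instead of producing a cross product.
val flatDF = charDF
  .withColumn("pair", explode(arrays_zip(col("subclass"), col("super"))))
  .select(
    col("player"),
    col("name"),
    col("class"),
    col("pair.subclass").as("subclass"),
    col("pair.super").as("super")
  )
```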

1e.) It is time to check out the transformed dataset and its schema. As expected, it returns 12 rows with 5 columns.
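With the sketch above, the sanity check is simply:

```scala
flatDF.printSchema()
flatDF.show(12, truncate = false)
println(flatDF.count()) // expect 12 rows across 5 flat columns
```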

Step 2: Transform and Re-Shape Data

2a.) This next exercise takes our flattened dataset and applies the pivot function, which triggers a wide transformation where distinct values of a specific column are transposed into individual columns. Pivots can be performed with or without aggregation. The without-aggregation shape is often required for data science use cases, where many columns (a.k.a. features) are fed as input to learning algorithms. An efficient performance tip is to specify the unique values in the pivot function input so Spark does not have to trigger an additional job to discover them.
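A hedged sketch against the placeholder columns used earlier (the pivot values here are made up; substitute the real distinct values of your pivot column):

```scala
import org.apache.spark.sql.functions.first

// Supplying the distinct pivot values up front saves Spark from running
// an extra job just to discover them.
val subclassValues = Seq("Nightstalker", "Gunslinger", "Arcstrider")

// Pivot with aggregation: count of supers per player and subclass
val pivotCounts = flatDF
  .groupBy("player")
  .pivot("subclass", subclassValues)
  .count()

// Pivot "without aggregation": there is no dedicated API for this, so a
// common trick is a no-op aggregate such as first() on the column being spread.
val pivotWide = flatDF
  .groupBy("player", "name", "class")
  .pivot("subclass", subclassValues)
  .agg(first("super"))
```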

Step 3: Write to Parquet Format

3a.) The final exercise will simply write our data out to storage. Parquet format will be used because it is a splittable file format, highly compressed for space efficiency, and optimized for columnar storage, making it fabulous for downstream big data analytics.
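For example, with the placeholder DataFrames from the earlier sketches:

```scala
// Write the pivoted result to storage as Parquet (path and save mode are
// illustrative); the flattened flatDF could be written the same way.
pivotWide.write
  .mode("overwrite")
  .parquet("/mnt/demo/output/characters_pivoted")
```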

Conclusion

These exercises just scratch the surface of what Apache Spark is capable of for big data engineering and advanced analytics use cases. Thank you for reading this blog.

