
Type safety in data parsers using PySpark

Ensuring type safety while parsing data using Apache PySpark

Photo by Thom Milkovic on Unsplash

One of the primary tasks of a Data Engineer is to ingest data from multiple sources. These sources could be API endpoints, streaming services, cron jobs uploading files to the cloud, etc. The data ingested from these sources is dumped into the data lake, where it is subsequently parsed and used in ETLs for downstream business logic.

But the question here is: can we rely on source systems, especially third parties, to keep the data types of the fields they produce consistent?

Instead of relying on these sources, it is best to offload type checks and type conversions to the consumption level, i.e., to let the data consumer define the target data types as per their needs.

This article presents an approach to ensuring type safety in parsers, along with its pros and cons.


The Approach

Take the example of data in JavaScript Object Notation (JSON) format: fields in the JSON can be of multiple types. Sometimes we may not even know which fields have which data types, and it becomes extremely difficult to keep track of them and ensure that the type of each field is maintained correctly.

Hence, it is essential to enforce uniformity across all the fields in the JSON schema. The safest way to do this is to enforce StringType on every leaf node in the schema tree. A leaf node is a field of a primitive type such as string, bigint, or timestamp, i.e. not of struct or array type.

Let’s have a look at how this can be done:

Consider that you have loaded the JSON data into a DataFrame in which the raw JSON payload appears in a column named json.
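A minimal sketch of such a setup, assuming the raw files are read as plain text and each record lands in a single string column named json (the path and session name are illustrative):

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("type-safe-parser").getOrCreate()

# Hypothetical example: each line of the ingested file is one raw JSON document.
# The payload is kept as an unparsed string in a column named `json`.
df = spark.read.text("s3://my-datalake/raw/events/")  # path is illustrative
df = df.withColumnRenamed("value", "json")
```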

First, the schema is inferred from the json column. The enforce_stringtype function then performs a depth-first search (DFS) on the schema metadata retrieved from the schema inferred by Spark. For each field in the schema metadata, it checks whether the field is a leaf. If it is a leaf field (i.e. not of struct or array type), its type is converted to string; otherwise, the function traverses into the struct or array field and converts all of its leaf fields to string type.
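A sketch of what enforce_stringtype might look like (only the function name and described behaviour come from the article; the implementation details are an assumption). It walks the dict produced by StructType.jsonValue() depth-first and rewrites every leaf type to "string":

```python
def enforce_stringtype(node):
    """DFS over the schema's JSON representation (StructType.jsonValue()),
    replacing every leaf type (anything that is not a struct or array)
    with 'string'."""
    # Leaf nodes appear as plain strings such as 'long', 'double', 'boolean'.
    if isinstance(node, str):
        return "string"
    if isinstance(node, dict):
        if node.get("type") == "struct":
            # Recurse into every field of the struct.
            for field in node["fields"]:
                field["type"] = enforce_stringtype(field["type"])
        elif node.get("type") == "array":
            # Recurse into the element type of the array.
            node["elementType"] = enforce_stringtype(node["elementType"])
    return node

# Infer the schema of the raw payloads, then relax every leaf to string.
inferred_schema = spark.read.json(df.rdd.map(lambda row: row.json)).schema
string_schema_json = enforce_stringtype(inferred_schema.jsonValue())
```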

The schema metadata is then converted back to a StructType, which is used to enforce the changed schema on the json column.
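Continuing the sketch above, the modified metadata can be rebuilt into a StructType with StructType.fromJson and used to parse the raw column into a type_safe_json column (the column names follow the article; the exact call pattern is an assumption):

```python
from pyspark.sql.types import StructType

# Rebuild a StructType from the modified metadata and re-parse the raw
# payload with it: every leaf now comes out as a string.
string_schema = StructType.fromJson(string_schema_json)
df = df.withColumn("type_safe_json", F.from_json("json", string_schema))
```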

The required transformations can then be applied to the fields obtained from the type_safe_json column, converting them to the target data types specified by the consumer of the data.
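For example, the consumer might cast individual leaves to the types they need (the field names and target types below are illustrative):

```python
# Hypothetical target data types defined by the consumer of the data.
parsed = df.select(
    F.col("type_safe_json.user_id").cast("bigint").alias("user_id"),
    F.col("type_safe_json.amount").cast("double").alias("amount"),
    F.col("type_safe_json.created_at").cast("timestamp").alias("created_at"),
)
```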


PROs

There are several advantages that this approach provides.

  1. Converting to StringType does not result in any data loss, nor does it drop rows between ingestion and consumption.
  2. Any data type change made to fields in the JSON by a source, for instance, from integer to double, is handled by this approach.
  3. Transformations applied after enforcing StringType can ensure that a particular field ends up with the intended data type.

CONs

Even though this approach provides several advantages, it does have some disadvantages.

  1. Transformations applied to fields after enforcing StringType are an added computation. Moreover, those fields might already have the target data type, provided the source system is reliable; converting them to string and back to the target type is then a redundant computation.
  2. Specifying a target data type for each field to be transformed is manual work. It can get tedious when there is a huge number of fields to transform.
  3. Downcasting (for example, double to int) will lead to data loss; this has to be addressed by the consumer.

To sum up, trusting source systems to retain the same data type for each field is questionable. This method can be used to guard against data type conflicts arising from source systems, with the disadvantages mentioned above. Transferring the responsibility of specifying target data types to the consumers makes the parsers independent of any data type checks. This also aligns with deciding on the schema when the data is consumed (schema-on-read).

