Gallia: A Library for Data Transformation

A schema-aware Scala library for practical data transformation: ETL, feature engineering, HTTP responses, etc

Anthony Cros
Towards Data Science

--

Photo by Joshua Sortino on Unsplash

Gallia is a Scala library for generic data transformation with a focus on practicality, readability, and scalability (if needed).

It is a personal project which I started after years of frustration with existing tools. It is meant to help data engineers get the job done without necessarily forgoing scalability if it is required. It is also intended to fill a gap between libraries like pandas (for those who value a powerful type system like Scala’s) and Spark SQL (for those who find SQL hard to fathom past a certain query complexity). More generally, it was created to offer a one-stop shop paradigm for most or all data transformations needs within one’s application.

Its execution happens in two phases, each traversing a dedicated execution Directed Acyclic Graph (DAG):

  1. A initial meta phase which ignores the data entirely and ensures that transformation steps are consistent (schema-wise).
  2. A subsequent data phase where the data is actually processed.
Gallia logo
Image via LogoGround under license to Anthony Cros.

Trivial example

Here’s a very basic example of usage:

This will successfully print: {“name”:”TONY”,”age”:40}

However, because Gallia is schema-aware, the following will fail before any data is actually processed:

Because incrementing a string is (typically) nonsensical, the initial “meta” phase will return an error complaining that a string cannot be incremented.

Notes:

  • JSON is used for all examples due to its ubiquity as an object notation, however Gallia is not JSON-specific for its data serialization or in-memory representation (see JSON flaws)
  • The schema is actually inferred here, which is more succinct but typically not ideal (see providing schemas instead)

More complex example

Let’s walk through a more complex example of usage. Your boss, Bill, provides you with the following spreadsheets (graciously dumped as TSV):

employees
projects
issues

He would like you to create a report for each employee at the company (Initech), and projects “W2K” and “W3K”, based on the following template:

Code and data

You can achieve the above result with Gallia and the following code:

Runnable code for this example can be found on github. The repository also contains the input and output data, along with a dump of all intermediate top-level schema/data pairs, which should help clarify any ambiguity. If we consider line 31 above for instance (generate is_manager field), we can find all intermediates files like so:

Note that this is not a standard Gallia feature, just a convenience provided for the sake of the article. Note also that only the first data entity is provided for each step.

EDIT: I also added a manual counterpart, which can be compared easily with Gallia processing. The manual counterpart exclusively uses the standard library.

Walkthrough

The transformation steps above are meant to be as self-explanatory as possible. For instance .remove(“Employee ID”) means exactly that: the “Employee ID” field will no longer exist past this point.

Instead we will detail below a few of the more involved operations.

Nesting operation

The first nesting operation is straightforward:

.nest(“first”, “middle”, “last”).under(“name”)

It basically turns an entity:

{“first”: “Peter”, “middle”: “Ron”, “last”: “Gibbons”, …}

into

{“name”: { “first”: “Peter”, “middle”: “Ron”, “last”: “Gibbons”}, … }

However the second nesting operation is (intentionally) more complex:

.renest{ _.filterKeys(_.startsWith(“Address”)) }.usingSeparator(“ “)

It takes advantage of two mechanisms:

  • Selecting a subset of keys via a predicate, namely startsWith(“Address”) here
  • Re-nesting using a common separator, namely the space character here

The target selection zeroes in on the only four fields with an “Address” prefix. Meanwhile the renesting mechanism uses the provided separator to reconstruct the implied nesting:

Therefore we go from:

{
...
"Address number": 9,
"Address street": "Channel Street",
"Address city" : "Houston",
"Address zip" : 77001,
...
}

to:

{
...
"Address": {
"number": 9,
"street": "Channel Street",
"city" : "Houston",
"zip" : 77001
},
...
}

Note that this is very typical of processing data which has been “flattened” to fit a rectangular format.

More details on re-nesting with Gallia can be found here

Bring operation

issues.bring(projects, target = "Name", via = "Project ID" <~> "ID")

This statement can be read in plain English as "bring field Name from projects into issues via their matching fields". Here, we have to explicitly name the matching fields because their keys differ ("Project ID" vs "ID"), otherwise they could be guessed¹.

"bring" is a special type of left join. It is used for convenience when one simply wants to augment data with a few fields from another source. This is in contrast with:

  • A Gallia join, which would correspond to the namesake SQL operation, and therefore imply a potential denormalization².
  • A Gallia co-group, which would correspond to the namesake Spark operation: no denormalization but the grouped sides end up being nested under _left and _right fields respectively.

Pivot operation (nested)

.transformObjects(“issues”).using {
_ .countBy(“Status”) // defaults to “_count”
.pivot(_count).column(“Status”)
.asNewKeys(“OPEN”, "IN_PROGRESS", “RESOLVED”) }

The first thing to note here is that we are applying a transformation to nested elements, as opposed to data elements at the root of each entity. The underscore, which has a special meaning in Scala, therefore corresponds to each entity nested under the issues field (from the preceding group-by Employee ID operation, see intermediate meta and data).

From here, for all such “issue” entities, we do a count by status (OPEN, IN_PROGRESS, or RESOLVED), followed by a pivot on that count. That is if these are the issues for a given assignee:

then the counting operation results in:

and the pivot results in:

It should be noted that the .asNewKeys(“OPEN”, “IN_PROGRESS”, “RESOLVED”) part is required due to the — very important — fact that Gallia maintains a schema throughout transformations. In the present situation, the values for Status can only be known after looking at the entire set of data³, and must therefore be provided ahead of time. More details on this matter can also be found in the schema and (DAG) heads sections of the documentation.

Future articles

In future articles I’d like to discuss ways to achieve the same result using alternative technologies such as SQL, Spark sql, pandas, etc (relates to this idea)

I’d also like to discuss how the above transformations can scale using Spark RDDs, support for which is built-in Gallia (see documentation). The above example is intentionally trivial, but if the data contained billions of rows, the code would not be able to handle it as it is.

Conclusion

For a real life and significantly more complex example, see this repository. Also see the full list of examples.

Gallia’s main strengths can be summed up like so:

  • The most common/useful data operations are provided, or at least scheduled.
  • Readable DSL that domain experts should be able to at least partially comprehend.
  • Scaling is not an afterthought and Spark RDDs can be leveraged when required.
  • Meta-awareness, meaning inconsistent transformations are rejected whenever possible (for instance, cannot use a field that’s been removed already).
  • Can process individual entities, not just collections thereof; that is, there’s no need to create “dummy” collections of one entity in order to operate on that entity.
  • Can process nested entities of any multiplicity in a natural way.
  • Macros are available for a smooth integration with case class hierarchies.
  • Provides flexible target selection — i.e. which field(s) to act on — which ranges from explicit reference to actual queries, including when nesting is involved.
  • The execution DAG is sufficiently abstracted that its optimization is a well-separated concern (e.g. predicate pushdowns, pruning, …); note however, that few such optimizations are in place at the moment.

Gallia is still a fairly new project and as such I’m looking forward to any feedback on it! The main repository can be found on github, and the parent organization contains all related repositories.

Gallia logo
Image via LogoGround under license to Anthony Cros.

Footnotes

[1]: This would not be the case if relations between the classes were provided (for instance via implicits), see two related tasks:

Note also that bring/join/co-group operations on more than one key is not available yet: see task multiple keys (t210304115501). The workaround is to generate a temporary field.

[2]: bring and left join are functionally equivalent if there are never multiple matches on the right-side.

[3]: This may be circumvented if the field is explicitly set to be an enum, which is not the case in our example because we used schema inferring. Gallia may also offer “opaque objects” in the future: see corresponding task (t210202172304)

--

--

Independent software engineer/architect with a focus on big data processing, domain modelling, software architecture, and bioinformatics.