
The Spark 3.1.1 release comes with a lot of new features!
Check out, for example, all the Kubernetes-related tasks completed for this release.
As you can probably guess from the title of this post, we are not going to talk about Kubernetes, but about nested fields.
If you have been working with Spark long enough, you have probably had some nightmares dealing with deeply nested fields.
Let’s start from the beginning.
What are nested fields?
Nested fields are fields that contain other fields or objects.
For example:
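A minimal record with a nested field could look like the JSON below. The field names are taken from the update example that follows; the phone number and age values are made up for illustration:

```json
{
  "id": 1,
  "personal_info": {
    "address": "James St",
    "phone_number": "555-0100",
    "age": 30
  }
}
```

Here `personal_info` is a struct: a single column whose value is itself a set of named fields.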

How do I manage nested fields in Spark?
Handling them in Spark has always been somewhat problematic.
Imagine that the person with id=1 has changed their address from "James St" to "St Peter" and you now need to update it.
Since Spark does not let you modify nested fields directly, you have to rebuild the whole struct so you don't lose the other fields.
It would be something like:
df
  .withColumn("personal_info",
    struct(
      lit("St Peter").as("address"),
      col("personal_info.phone_number"),
      col("personal_info.age")))
Hmm, OK, but how do I manage nested fields in Spark now?
Well, this new 3.1.1 release introduces a new Column method called withField.
Let's say hi 😁
From Spark 3.1.1 onwards we can change nested fields like this:
df
  .withColumn("personal_info",
    col("personal_info").withField("address", lit("St Peter")))
Much better, right? We don't need to reconstruct the whole object; we can just reference the field we want to change and that's it!
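For completeness, here is a self-contained sketch of the example above. The session setup and the sample data (phone number, age) are my own assumptions, not from the original post; it assumes Spark 3.1+ on the classpath:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lit, struct}

object WithFieldDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("withFieldDemo")
      .getOrCreate()
    import spark.implicits._

    // Build a one-row DataFrame with a nested personal_info struct
    val df = Seq(("James St", "555-0100", 30))
      .toDF("address", "phone_number", "age")
      .select(
        lit(1).as("id"),
        struct(col("address"), col("phone_number"), col("age")).as("personal_info"))

    // Update only the nested address, keeping the other fields intact
    val updated = df.withColumn("personal_info",
      col("personal_info").withField("address", lit("St Peter")))

    updated.show(false)
    spark.stop()
  }
}
```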
Let's look at a more complex example.
Our new input JSON is the following (for simplicity I omitted some brackets):
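Reconstructed from the schema and the sample output shown below, the input could look roughly like this (the values are inferred from the examples, not the author's original file; the topping type is null in the sample data):

```json
{
  "items": {
    "item": {
      "batters": {
        "batter": {
          "id": "1001",
          "topping": { "id": "5001", "type": null },
          "type": "Regular"
        }
      },
      "id": "0001",
      "name": "Cake",
      "ppu": 0.55,
      "type": "donut"
    }
  }
}
```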

I’ll print the Spark schema to give more clarity about the input Data.
root
|-- items: struct (nullable = true)
| |-- item: struct (nullable = true)
| | |-- batters: struct (nullable = true)
| | | |-- batter: struct (nullable = true)
| | | | |-- id: string (nullable = true)
| | | | |-- topping: struct (nullable = true)
| | | | | |-- id: string (nullable = true)
| | | | | |-- type: string (nullable = true)
| | | | |-- type: string (nullable = true)
| | |-- id: string (nullable = true)
| | |-- name: string (nullable = true)
| | |-- ppu: double (nullable = true)
| | |-- type: string (nullable = true)
Now we need to change the id inside the topping field, which is deeply nested; to reach the value, we would have to navigate:
items -> item -> batters -> batter -> topping -> id
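To appreciate the difference, this is roughly what the rebuild would look like before withField existed. This is a sketch of the old approach, not the author's code: every enclosing struct has to be reassembled by hand, listing every sibling field so none is lost:

```scala
// Pre-3.1.1: rebuild every enclosing struct just to change topping.id
df.withColumn("items",
  struct(
    struct(
      struct(
        struct(
          col("items.item.batters.batter.id"),
          struct(
            lit(12).as("id"),
            col("items.item.batters.batter.topping.type")
          ).as("topping"),
          col("items.item.batters.batter.type")
        ).as("batter")
      ).as("batters"),
      col("items.item.id"),
      col("items.item.name"),
      col("items.item.ppu"),
      col("items.item.type")
    ).as("item")))
```

One typo, or one forgotten sibling field, and data silently disappears from the output.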
Prior to Spark 3.1.1 this would be a real pain, but now it would be as simple as this:
df
  .withColumn("items",
    col("items")
      .withField("item.batters.batter.topping.id", lit(12)))
And if we print the result we can see that the new id is no longer 5001 but 12:
+----------------------------------------------------------+
|items |
+----------------------------------------------------------+
|{{{{1001, {12, None}, Regular}}, 0001, Cake, 0.55, donut}}|
+----------------------------------------------------------+
And that is it! Cool right?
…and what about dropping nested fields?
Good news! We can drop them too. Let's drop the same field we just changed; to do it, we need to call dropFields:
df
  .withColumn("items",
    col("items")
      .dropFields("item.batters.batter.topping.id"))
Let’s print the output again:
+------------------------------------------------------+
|items |
+------------------------------------------------------+
|{{{{1001, {None}, Regular}}, 0001, Cake, 0.55, donut}}|
+------------------------------------------------------+
And success! The id is gone.
At this point, you might be asking…
Can I modify a column and drop a column at the same time?
The answer is yes!
df
  .withColumn("items",
    col("items")
      .withField("item.batters.batter.topping.id", lit(12))
      .dropFields("item.batters.batter.topping.type"))
If we print the result and the schema, we can see that the id is 12 and the topping.type is gone:
+----------------------------------------------------+
|items |
+----------------------------------------------------+
|{{{{1001, {12}, Regular}}, 0001, Cake, 0.55, donut}}|
+----------------------------------------------------+
root
|-- items: struct (nullable = true)
| |-- item: struct (nullable = true)
| | |-- batters: struct (nullable = true)
| | | |-- batter: struct (nullable = true)
| | | | |-- id: string (nullable = true)
| | | | |-- topping: struct (nullable = true)
| | | | | |-- id: integer (nullable = false)
| | | | |-- type: string (nullable = true)
| | |-- id: string (nullable = true)
| | |-- name: string (nullable = true)
| | |-- ppu: double (nullable = true)
| | |-- type: string (nullable = true)
Conclusions
Spark is evolving fast, and things that weren't possible before might be possible now; it is always good to stay up to date, not just with Spark but with every technology.
This will definitely be a game-changer for everyone who works with deeply nested fields.