
Delta Lake – Type widening

What is type widening and why does it matter?

Photo by Luca Florio on Unsplash

Delta Lake is on its way to a new major release, and the community is eagerly awaiting plenty of new features. One of them is called Type Widening, and this post is dedicated to explaining what it is and why it is useful.


The only constant in life is change – Heraclitus

Heraclitus’s quote applies not only to the world we live in but also to data, the information that describes it. We’re in an era of fast-paced business environments where adaptation is key to keeping up with all the new requirements that emerge from our ever-changing world. Quite often, those requirements result in changes to the underlying structure of our data, so we need to be able to accommodate such changes to keep our data faithful to what it describes.

In Delta Lake, we describe our data using Delta Tables, and there are several ways to accommodate those changes seamlessly using the current schema evolution feature. We can: add a new column that better describes our entities; rename a column to a more relevant name; reorder columns if the column order is not optimal; drop a column we no longer need; or change a column’s type due to, for instance, lack of scale (numbers no longer fit inside an integer and need to be stored in a long). While all of the above are already possible in Delta Lake, changing column types requires rewriting the whole table*, which is not ideal, especially for tables at the terabyte scale. This is where Type Widening enters the game.

*there are workarounds using column mapping, but they are not very user-friendly.

Type Widening

Type widening allows changing a given column type to a wider type (one that can hold the same or more information).

There are two kinds of Type Widening: automatic, which, as the name suggests, is applied automatically, and explicit, which requires an ALTER TABLE command. Every automatic type change is also supported as an explicit type change.

Automatic

  • byte → short → int → long
  • float → double
  • decimal precision and scale increase, e.g. (6,2) → (10,4)
  • date → timestampNTZ (timestamp without timezone)
  • varchar(x) → varchar(y) – (x < y)
  • varchar(x) → string

Explicit only

  • int → double
  • byte/short/int/long → decimal
  • char(x) → char(y)/varchar(y) – (x < y)
  • char(x) → string

These changes are supported on nested fields as well as inside arrays or maps.
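Taken together, the two lists can be summarized as a small rule table. The sketch below is purely illustrative (it is not Delta’s implementation) and covers only the primitive chains; decimal precision/scale and char/varchar length increases are left out for brevity:

```python
# Classify a primitive type change as automatic widening, explicit-only
# widening, or unsupported, following the lists above.

AUTOMATIC_CHAINS = [
    ["byte", "short", "int", "long"],  # widening left to right
    ["float", "double"],
]

EXPLICIT_ONLY = {
    ("int", "double"),
    ("byte", "decimal"),
    ("short", "decimal"),
    ("int", "decimal"),
    ("long", "decimal"),
}

def widening_kind(from_type: str, to_type: str) -> str:
    """Return 'automatic', 'explicit', or 'unsupported' for a type change."""
    for chain in AUTOMATIC_CHAINS:
        if from_type in chain and to_type in chain:
            if chain.index(from_type) < chain.index(to_type):
                return "automatic"
    if from_type == "date" and to_type == "timestamp_ntz":
        return "automatic"
    if (from_type, to_type) in EXPLICIT_ONLY:
        return "explicit"
    return "unsupported"

print(widening_kind("byte", "long"))   # automatic
print(widening_kind("int", "double"))  # explicit
print(widening_kind("long", "int"))    # unsupported: narrowing is never allowed
```

Note that narrowing (e.g. long → int) never qualifies: the target type must be able to hold every value the source type can.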

How to use it?

As stated before, we can do either automatic or explicit type widening. To use this feature, we first need to enable it on the table:

ALTER TABLE t SET TBLPROPERTIES ('delta.enableTypeWidening' = true)

For automatic, after enabling the feature for the table we will be writing into, we need to ensure that we have schema evolution enabled for the session so that types are widened automatically when the source contains a wider type than the respective target.

Schema evolution can be enabled like:

df.write.option("mergeSchema", "true") (or the same option on df.writeStream)

or

SET spark.databricks.delta.schema.autoMerge.enabled = true

For explicit Type Widening we explicitly run the ALTER TABLE commands with the requested (valid) changes:

ALTER TABLE t CHANGE COLUMN col TYPE type

What happens on type changes?

When a type is widened, the table’s schema is changed through a metadata action. Even though the existing files were written with a different schema, they remain untouched; without Type Widening, achieving the same result would have required rewriting them all.

Table with type changing from integer to long – Image by author

To maintain a history of all the schema changes, additional information is stored under the relevant fields’ metadata, as part of the writer protocol. Below is an example of such changes on a map value:

{
    "name" : "mapColumn",
    "type" : {
      "type": "map",
      "keyType": "integer",
      "valueType": "long",
      "valueContainsNull": true
    },
    "nullable" : true,
    "metadata" : { 
      "delta.typeChanges": [
        {
          "tableVersion": 2,
          "fromType": "short",
          "toType": "integer",
          "fieldPath": "value"
        },
        {
          "tableVersion": 5,
          "fromType": "integer",
          "toType": "long",
          "fieldPath": "value"
        }
      ]
    }
  }

Four pieces of information will be stored under the metadata field in delta.typeChanges:

  • tableVersion – The version of the table when the change was applied
  • fromType – The type of the column before the change
  • toType – The type of the column after the change
  • fieldPath (optional) – key, value, and element indicate a map’s key, a map’s value, and an array’s element, respectively. For nested arrays and maps, it takes the full path from the parent as a prefix
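To make the reader-side use of this metadata concrete, here is a hedged sketch (the helper name and structure are my own, not Delta’s reader code): given the recorded changes for a field and the table version at which a data file was written, it derives the type stored in that file and the type it must be upcast to.

```python
# Given delta.typeChanges entries for a field, work out which type a data
# file actually stores, so the reader knows what to upcast from.

def type_for_file(type_changes: list[dict], file_version: int,
                  current_type: str) -> tuple[str, str]:
    """Return (stored_type, target_type) for a file written at file_version."""
    stored = current_type
    # Walk the changes oldest to newest: the file stores whatever type was
    # in effect before the first change committed after it was written.
    for change in sorted(type_changes, key=lambda c: c["tableVersion"]):
        if file_version < change["tableVersion"]:
            stored = change["fromType"]
            break
    return stored, current_type

# The two changes from the JSON example above: short→integer at version 2,
# integer→long at version 5.
changes = [
    {"tableVersion": 2, "fromType": "short", "toType": "integer", "fieldPath": "value"},
    {"tableVersion": 5, "fromType": "integer", "toType": "long", "fieldPath": "value"},
]

print(type_for_file(changes, file_version=1, current_type="long"))  # ('short', 'long')
print(type_for_file(changes, file_version=3, current_type="long"))  # ('integer', 'long')
print(type_for_file(changes, file_version=6, current_type="long"))  # ('long', 'long')
```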

This information will be used by readers so that they can upcast the narrower types when reading older files that still contain the fromType type.

With time, UPDATE/MERGE operations will run and trigger a natural rewrite of all the files that are still using the old schema.

Natural rewrite – Image by author

At that point, the writer will remove the metadata stored in delta.typeChanges, provided all the files share the schema of the latest Metadata action with type changes. That way, readers can read the table without any special handling of files containing previous types.
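The cleanup condition can be sketched in a few lines. This is a simplified illustration under the assumption that each live file is tagged with the table version it was written at, not Delta’s actual implementation:

```python
# Once every live file was written at or after the latest type change, no
# file needs upcasting anymore, so delta.typeChanges can be dropped.

def can_drop_type_changes(file_versions: list[int],
                          type_changes: list[dict]) -> bool:
    """True if all live files already use the latest (widest) schema."""
    if not type_changes:
        return True
    latest_change = max(c["tableVersion"] for c in type_changes)
    return all(v >= latest_change for v in file_versions)

changes = [{"tableVersion": 5, "fromType": "integer", "toType": "long"}]
print(can_drop_type_changes([1, 3, 6], changes))  # False: files from v1 and v3 remain
print(can_drop_type_changes([5, 6, 7], changes))  # True: everything rewritten since v5
```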

How to disable Type Widening?

To disable this feature entirely, all we need to do is set the property to false:

ALTER TABLE t SET TBLPROPERTIES ('delta.enableTypeWidening' = false)

or

ALTER TABLE t DROP FEATURE typeWidening

After disabling Type Widening, Delta needs to ensure that all non-compatible readers can read the table without any issues.

If no type changes exist in the available history, the feature is dropped successfully. If, however, the current table version contains type changes, Delta triggers a rewrite of the affected files (those with RowCommitVersion ≥ the latest Metadata action with type changes) and removes all the Type Widening metadata. After the job executes, an exception is thrown telling users to re-run the DROP FEATURE command once the retention period expires. For tables that only carry Type Widening residues in historical versions (accessible only through time travel), no rewrite is needed: an exception is simply thrown, again asking users to re-run the command after the retention period expires.
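The three scenarios can be condensed into a small decision helper. This is a hypothetical summary of the flow described above, not code from Delta:

```python
# Summarize what DROP FEATURE typeWidening has to do, depending on where
# (if anywhere) type changes still exist.

def drop_feature_action(current_has_type_changes: bool,
                        history_has_type_changes: bool) -> str:
    """Classify the outcome of dropping the typeWidening feature."""
    if current_has_type_changes:
        # Files written with narrower types are rewritten, metadata is
        # stripped, and the user must retry after the retention period.
        return "rewrite, then retry after retention period"
    if history_has_type_changes:
        # Only time-travel versions are affected: no rewrite, just wait.
        return "retry after retention period"
    return "dropped successfully"

print(drop_feature_action(True, True))    # rewrite, then retry after retention period
print(drop_feature_action(False, True))   # retry after retention period
print(drop_feature_action(False, False))  # dropped successfully
```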

Conclusion

In this post, we have learned what Type Widening is and why it matters: it allows type changes without undergoing full table rewrites. This feature is still under development at the time of writing (see: https://github.com/delta-io/delta/issues/2622) and may still change, but the general idea is that type changes should be more user-friendly and not require complex operations. In the meantime, let’s wait for Delta’s new major release and all the awesome features it will include!


If you wish to read my latest post on a highly requested Delta Lake feature called Liquid Clustering, you can find it here:

Delta Lake – Partitioning, Z-Order and Liquid Clustering

