The world’s leading publication for data science, AI, and ML professionals.

Which tool should you use for database migrations?

sqitch, flyway, and liquibase all have merits

Choose a paradigm based on your needs.

All my Data teams have had a relational database at the core of our operations. When you have a relational database, you will inevitably have to change it! This post will help you decide between three useful frameworks for versioned database migrations – sqitch, flyway, and liquibase. There is an associated repository in GitHub that walks through each implementation of the same tables in a dockerized PostgreSQL database.

tl;dr

A quick decision tree to help you decide (Image by author)
A quick decision tree to help you decide (Image by author)

Why these three?

I focused this guide on sqitch, flyway, and liquibase because they all share two important characteristics. First, they all work with virtually any relational Database – this cuts out Microsoft’s excellent DACPACs. Second, they all allow users to write migrations in plain SQL – this cuts out ORM implementations for migrations like Alembic and Orator.

Does your data come from a CRUD application?

you have options

The defining feature of many OLTP workloads is that the backend receives and stores data from the application layer. All three of these migration tools excel at the basic functionality of keeping a database and its application in sync. Migrations are atomically applied to a database and the data remains at rest in the same tables.

Does your data come from cloud storage?

consider flyway

In contrast, many OLAP workloads are easily reproduced by loading all the staging files again from cloud storage like S3 or Azure Blob Storage. Flyway shines here as a Sql executor to reliably reproduce the state of a given database. In particular, it is easy to separate the DML from the DDL in different types of plain SQL files. This allows you to run the same transformations again, but load in the latest data.

Flyway excels at repeating the same task with new data in 'callbacks' (Image by author)
Flyway excels at repeating the same task with new data in ‘callbacks’ (Image by author)

Are there many ordered DML transformations?

consider flyway

If your data manipulation language moves data between multiple tables before being ready to use downstream, then flyway has a unique advantage over sqitch because you can easily infer the order of execution from the file names. Flyway is broken down into ‘versioned migrations’ (V prefix, most commonly your DDL) and ‘callbacks’ (afterMigrate prefix, most commonly your DML). Most of my teams have used XXXX numerical scripting as a naming convention. In the simplified example below, it is easy to see that the gross table is loaded AFTER the price table.

This kind of interpretability is super helpful for developers who are responsible for batch analytical workloads. [Warning: biased by my focus in data science]. Virtually all statistical models I have deployed depend on pipelines that iteratively cleanse and aggregate data until they are ready for the model. If you are not constrained by operating in real-time, this translates to a lot of ‘please execute this SQL in order’ at some interval. Flyway is a sure-fire way to get the same result every time.

The order of execution is clear from the file system (Image by author)
The order of execution is clear from the file system (Image by author)

In contrast, the same ordering logic is hidden with sqitch in a sqitch.plan file. This is made exponentially more difficult when teams use cross-reference ticket numbers as their file names. Now you have to look up in three places to figure out what (file system), when (sqitch.plan), and why (JIRA et al.) anything was done.

The order of execution is apparent in the sqitch.plan, but not the file system (Image by author)
The order of execution is apparent in the sqitch.plan, but not the file system (Image by author)

Do you need to revert migrations?

consider sqitch

Sqitch has more overhead in development in that you have to write logic to make a change, logic to test if it worked, and logic to revert the change if your test fails. This is more work at the outset, but it comes with the excellent feature that you can roll back migrations to any given point in history. Unfortunately, rolling back migrations is a paid feature in flyway.

trust but verify with sqitch (Image by author)
trust but verify with sqitch (Image by author)

Do you use environment variables to parameterize your deployments?

avoid liquibase

I would be surprised if you weren’t doing this already. Keeping database secrets out of version control is security 101. Unfortunately, using environment variables is a paid feature in liquibase. Both flyway and sqitch support parameterization out of the box. Of course, you can dive down and modify their docker entrypoint, but WHY is this necessary?!

Please keep your secrets out of version control (Image by author)
Please keep your secrets out of version control (Image by author)

Still not sure?

Fork this repository to try building out some of your core table structures. All the painful things like configuration files and volume mappings are taken care of, and you can simply swap in your migrations into any of one of the three tools.


Related Articles