Notes from Industry

3 Steps for a Successful Data Migration

Mark Grover
Towards Data Science
8 min read · Oct 7, 2021


Photo by Patrick Fore on Unsplash

Migrations are one of the most common initiatives that data teams own. But as everyone knows, they are complex beasts. They always take longer than they need to, many different teams are involved, and they’re super visible — not just to users, but execs, too. Everyone is anxious when a migration is underway. And, there’s always one ongoing or about to start.

Over my career, I have been a data engineer myself and have worked with hundreds of data engineers. One thing I’ve learned is that how a team handles migrations is a good proxy for the quality of that data engineering team.

I’ll dive into the 3 crucial steps that I have seen great data engineering teams take to ensure a successful migration:

  1. Determine the complexity and order of migration
  2. Migrate and test level-1 tables
  3. Provide tools for downstream owners and kick-off ripple migration

Each of these steps could be a blog post of its own, but for the sake of coherence I have tried to tackle them in one post.

Migrations are usually driven by something happening organizationally, or externally to the company, that pushes the team to migrate. Migrations are exhausting because success is tied to killing something rather than simply moving something. In most migrations, getting 90% done isn’t good enough; even getting 100% of the new thing built is not good enough. You really have to kill something old for a migration to be successful.

Here are some common reasons why data teams migrate, along with what success looks like in each of these cases:

  1. Wholesale migration from one data warehouse to another, usually to reduce costs, consolidate, increase scalability or reliability. Success: no activity on the previous warehouse.
  2. Migrate from one source of data to another, due to a decision made “upstream”. A common example is a company moving from Hubspot to Salesforce as its CRM, so all the downstream sales analytics pipelines, dashboards and metrics need to be updated due to the change. Another common example is “decomp”, the decomposition of upstream services from a monolith into a number of microservices. These microservices end up having their own databases, which means the data warehouse ETL has to be rebuilt on the new, decomposed sources of data. Success: zero queries on the old data.

It turns out that it’s much harder to kill something than to stand up something new. And, this is precisely why migrations are so hard and complex.

These are the 3 steps that I have seen great data engineering teams take when it comes to migrations.

Step 1 — Determine the complexity and order of migration

Complexity of migration

The first step in any migration is understanding how complex it is. Is it a 2-week, 2-month, or 2-year migration? The closest proxy for the complexity of a migration comes from two factors about the data source in question:

  1. How many levels of data assets do you have?
  2. How many data assets overall are built on the data source?

A data asset can be a table, dashboard, etc.

To look at it another way, if you had a lineage graph like the one below, these factors are:

  1. What’s the depth of the graph?
  2. How many nodes are there in the graph?
Image by Author: Lineage Graph
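
If your data catalog can export lineage as an edge list, both factors are easy to compute. Here is a minimal Python sketch; the table and dashboard names, and the shape of the edge list, are hypothetical stand-ins for whatever your catalog actually returns:

```python
# Hypothetical lineage export: upstream asset -> list of direct downstream assets.
lineage = {
    "raw.orders": ["staging.orders_clean"],
    "staging.orders_clean": ["marts.daily_revenue", "dash.revenue_overview"],
    "marts.daily_revenue": ["dash.exec_kpis"],
    "dash.revenue_overview": [],
    "dash.exec_kpis": [],
}

def depth(asset: str) -> int:
    """Longest downstream path from this asset, i.e. how many levels of data assets exist."""
    children = lineage.get(asset, [])
    if not children:
        return 0
    return 1 + max(depth(child) for child in children)

def downstream(asset: str) -> set:
    """Every asset that directly or transitively depends on this asset."""
    found = set()
    for child in lineage.get(asset, []):
        found.add(child)
        found |= downstream(child)
    return found

source = "raw.orders"
print(f"Depth of the graph below {source}: {depth(source)}")   # 3
print(f"Assets built on {source}: {len(downstream(source))}")  # 4
```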

Order of migration

Great data engineering teams are thoughtful and intentional about the order in which they choose to migrate. In a lineage graph like above, you still have to choose which “branch” to migrate first.

When determining the order of which branch to choose first, here are the three most common categories that great teams think about:

  1. Cost — we’re going to migrate the costliest data first.
  2. Usage Impact — we’re going to migrate the most used data sets last in order to reduce user impact.
  3. Risk — we’re going to migrate the least risky data first. For example, save core financial metrics to be migrated last, since they get reported externally.
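
As a rough illustration, these categories can be turned into a simple ranking. The sketch below assumes you already have per-branch cost, usage, and risk estimates (all numbers are made up) and orders branches so that the costliest data goes first, with usage and risk as tie-breakers:

```python
# Hypothetical per-branch estimates: cost from warehouse billing, usage from
# query logs over the last 30 days, risk from a manual assessment (1 = low, 5 = high).
branches = [
    {"name": "clickstream", "monthly_cost": 12000, "queries_30d": 150, "risk": 1},
    {"name": "marketing",   "monthly_cost": 3000,  "queries_30d": 400, "risk": 2},
    {"name": "finance",     "monthly_cost": 5000,  "queries_30d": 900, "risk": 5},
]

# Costliest data first; most-used and riskiest data last.
migration_order = sorted(
    branches,
    key=lambda b: (-b["monthly_cost"], b["queries_30d"], b["risk"]),
)
print([b["name"] for b in migration_order])  # ['clickstream', 'finance', 'marketing']
```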

Migrations are easy, as long as you don’t have to worry about dependencies.

This brings me to the hardest part of migration. Once you have picked which branch to migrate, you have to determine how the existing data is being used.

Level-1 usage means that there are only leaf-level consumers of the data, like dashboards and ad hoc queries on the tables. Level-2 usage means that other tables are built on the table, and those tables in turn have dashboards and ad hoc queries run on them.

Image by Author: Levels of Data Usage

An automated data catalog that captures lineage of your data can help you gain a good understanding of your levels of usage.

The data catalog that surfaces this lineage graph should be able to automatically capture these 3 kinds of relationships at a minimum:

  1. Table level lineage
  2. Dashboards/reports built on the tables
  3. Names of users and number of ad hoc queries run by them, over the last 30 days

Nice to have (not must have):

  • Column level lineage — this is hard to capture at a high level of granularity and accuracy and not needed for many migrations.
  • More information about what process/ETL was used to generate the downstream tables.

There are other use-cases for column level lineage, like meeting regulatory needs or deprecating a column. However, for the larger migrations as described at the beginning of this post, I find that table level lineage is sufficient.
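
With table-level lineage and dashboard relationships captured, working out a table’s level of usage is mostly a graph question. Here is a rough sketch, assuming a hypothetical catalog export that lists each table’s downstream tables, dashboards, and 30-day ad hoc query counts:

```python
# Hypothetical catalog export: for each table, its downstream tables, the
# dashboards built on it, and ad hoc query counts per user over the last 30 days.
catalog = {
    "warehouse.orders": {
        "downstream_tables": ["warehouse.daily_revenue"],
        "dashboards": ["Revenue overview"],
        "ad_hoc_queries_30d": {"alice": 12, "bob": 3},
    },
    "warehouse.daily_revenue": {
        "downstream_tables": [],
        "dashboards": ["Exec KPIs"],
        "ad_hoc_queries_30d": {"carol": 7},
    },
}

def usage_level(table: str) -> int:
    """Level 1: only leaf consumers (dashboards, ad hoc queries).
    Level 2+: other tables are built on top of this one."""
    downstream_tables = catalog[table]["downstream_tables"]
    if not downstream_tables:
        return 1
    return 1 + max(usage_level(t) for t in downstream_tables)

for table in catalog:
    print(table, "-> level", usage_level(table))
# warehouse.orders -> level 2
# warehouse.daily_revenue -> level 1
```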

Step 2 — Migrate & test level-1 tables

In most organizations, you are going to have more than one level of usage. The first step is always to migrate the level-1 tables. In some organizations, no real ETL/ELT is being done; in such cases, this step is a no-op.

There are two common cases here:

  1. The upstream change is “absorbed”: the rebuilt level-1 table exposes the same schema and semantics as before, so nothing downstream of it has to change.
  2. The upstream change is “leaked”: the rebuilt table itself changes, and its downstream consumers will have to migrate as well.

Testing

If the new tables are supposed to be equivalent to the old tables, it’s important to do data testing. A good data quality product will be able to profile the old data and the new data, highlighting the differences in row count, distribution, min, max, average, etc. You would often want to run the old and new jobs in parallel for a while (at least 2 weeks) and check that the differences between the two profiles stay within an agreed tolerance over that period.
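
If you don’t have a data quality product in place, a rough version of that profiling can be scripted. The sketch below assumes both tables fit into pandas DataFrames and uses an arbitrary 1% relative tolerance; both assumptions should be adjusted to your setup:

```python
import pandas as pd

def profile(df: pd.DataFrame) -> dict:
    """Basic profile of a table: row count plus min/max/mean for numeric columns."""
    stats = {"row_count": len(df)}
    for col in df.select_dtypes("number").columns:
        stats[f"{col}_min"] = df[col].min()
        stats[f"{col}_max"] = df[col].max()
        stats[f"{col}_mean"] = df[col].mean()
    return stats

def out_of_tolerance(old: dict, new: dict, rel_tol: float = 0.01) -> list:
    """Return the metrics whose relative difference exceeds the tolerance."""
    failures = []
    for metric, old_val in old.items():
        new_val = new.get(metric, 0)
        if old_val and abs(new_val - old_val) / abs(old_val) > rel_tol:
            failures.append(metric)
    return failures

# In practice old_df and new_df would be the same time window loaded from the
# old and new pipelines while they run in parallel.
old_df = pd.DataFrame({"amount": [10.0, 20.0, 30.0]})
new_df = pd.DataFrame({"amount": [10.0, 20.0, 31.0]})
print(out_of_tolerance(profile(old_df), profile(new_df)))  # ['amount_max', 'amount_mean']
```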

Step 3 — Provide tools for downstream owners and kick-off ripple migration

Migrations have a ripple effect. Great data engineering teams have a clear understanding of this and are able to build feedback loops in both directions to orchestrate the migration.

Once you have migrated level-1 tables, it’s time to notify downstream consumers of the data. There are 3 kinds of possible downstream consumers:

  1. Dashboards built on the tables
  2. Ad hoc queriers of the data
  3. Derived downstream tables

For each category of downstream consumers, you are passing the responsibility of downstream migration to them. This means that:

  • Downstream dashboard owners will need to migrate their dashboard to use the new source of data.
  • Ad hoc queriers of the data will need to query the new sources of data instead of the old ones.
  • Downstream table owners will need to migrate their ETL job to read from the new sources of data.

The last case, downstream table owners, is the most interesting. Either their table absorbs the change and no one further downstream of them needs to be notified, or their ETL job or table can’t absorb the change and they trigger their own ripple effect, asking the consumers of their data to migrate in turn.

Great data engineering teams provide tools for downstream consumers to know what their data is being used for, so they can perform their part of the migration.

Downstream tables and dashboards

Usually, data engineering teams will file JIRA tickets through an integration with their data catalog, which creates a ticket for each downstream table or dashboard owner.

Dashboard owners update their dashboards, if needed. They often end up deprecating them since they no longer need them.

Downstream table owners have to do the same thing that the level-1 owners did. They either absorb or leak the change. If they absorb the change, they close the ticket. If they leak the change, they create tickets for their downstream consumers. They can use the same catalog and JIRA integration to create tickets for their downstream users.
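
Here is a rough sketch of what that catalog-to-JIRA handoff could look like, using the open-source jira Python client; the server URL, project key, table names, and the shape of the catalog export are all placeholders:

```python
from jira import JIRA  # pip install jira

# Placeholder server and credentials; use your own JIRA instance and API token.
client = JIRA(server="https://yourcompany.atlassian.net",
              basic_auth=("migration-bot@yourcompany.com", "API_TOKEN"))

# Hypothetical catalog lineage export: direct downstream consumers of the
# table that was just migrated, along with their owners.
downstream = [
    {"asset": "warehouse.daily_revenue", "type": "table", "owner": "carol"},
    {"asset": "Revenue overview", "type": "dashboard", "owner": "dana"},
]

for consumer in downstream:
    client.create_issue(
        project="DATA",  # placeholder project key
        summary=f"Migrate {consumer['type']} '{consumer['asset']}' off warehouse.orders",
        description=(
            f"Owner: {consumer['owner']}. warehouse.orders has been replaced by "
            "warehouse.orders_v2. Please repoint this asset and close the ticket, "
            "or file tickets for your own consumers if the change leaks further downstream."
        ),
        issuetype={"name": "Task"},
    )
```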

Ad hoc queriers

While the JIRA ticket works well for downstream dashboard and table owners, it doesn’t work well for those who query the data ad hoc. Even if they were notified via Slack or email, a new data user can and will easily miss that detail.

It’s important to document the status of the migration so that those who mistakenly use the wrong data are politely redirected to the new one.

A good way to do that is to have a banner in your data catalog tool like so:

Image by Author: Banner on a table announcing a migrated table

For example, if a new user stumbles upon an old table and its outdated documentation in their data catalog, they’ll be notified via a banner that a new version of the table exists and redirected to use that one instead, similar to how Amazon shows a banner when a newer version of a product becomes available.

Taming the beast

Migrations don’t have to be the complex beasts that they always seem to be. There are numerous ways to ensure a smooth migration; they just involve a little extra love and care. The three steps above are the most common ways that I’ve seen data engineering teams tame the beast that is a migration, and at the heart of all of them lies lineage: being able to quickly understand the upstream and downstream dependencies in your data is one of the most crucial ingredients of a successful migration.

To read more posts like this and stay in touch, follow me on Twitter or receive a monthly newsletter by subscribing here.
