Why Does It Hurt So Much: Basic Strategies for Data Migrations

Recognizing and overcoming the challenges of moving your data around

Steven Van den Berghe
Towards Data Science


Photo by Franki Chamaki on Unsplash

Congratulations, you just bought a new IT system. Unless it will support an entirely new business process, you are now faced with the task of getting your data from a legacy system into your newly acquired tool. Data migrations are fraught with challenges, pitfalls and dangers. At the same time, they are among the most common data projects we deal with. Yet migrating the data is often an afterthought, or at best vastly underestimated. It is therefore no surprise that these initiatives often go wrong. In this post, I discuss some of the stumbling blocks and strategies to mitigate the risks.

Basic strategies: you won’t clean it afterwards, probably

Your mom doesn’t live here anymore (Photo by Lunail on Freepik)

While copying data from one system to another is usually not that hard technically, the core issue of data migrations is how to perform them without deteriorating the target system’s data quality. There are at least four basic issues you will encounter, illustrated in the sketch after this list:

  • Data loss: any data elements that are not mapped to the target will not be in the target system. They are effectively lost.
  • Data duplication: some data elements may already exist in the target system, leading to duplicates if matching is not done effectively.
  • Validity problems: data in the source system may be in different formats than in the target system. Without transformations, you may load poorly formed data into the target system, or be unable to load it at all, for example when the target system enforces validation rules.
  • Consistency problems: this is related to the duplication issue above. What happens if you have different values for what appears to be the same data element in the source and target systems? Which value is accurate? Are they both? Do the fields really have the same semantics?
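
A minimal sketch of what profiling for these issues can look like before anything is loaded. The file name, field names and matching key are hypothetical, chosen purely for illustration; a real migration would use the source system’s actual schema:

```python
import csv
import re

# Field names, file name and matching key are hypothetical, for illustration only.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def validate_record(record: dict) -> list[str]:
    """Return the validity problems found in one source record."""
    problems = []
    if not record.get("email") or not EMAIL_RE.match(record["email"]):
        problems.append("invalid email")
    if not record.get("account_id"):
        problems.append("missing account_id")  # no key means relations can't be rebuilt
    return problems

def find_duplicates(records: list[dict], key=("first_name", "last_name")) -> list[tuple]:
    """Flag records that collide on a naive matching key (duplication risk)."""
    seen, dupes = {}, []
    for rec in records:
        k = tuple((rec.get(f) or "").strip().lower() for f in key)
        if k in seen:
            dupes.append((seen[k], rec))
        else:
            seen[k] = rec
    return dupes

with open("legacy_contacts.csv", newline="", encoding="utf-8") as f:
    records = list(csv.DictReader(f))

invalid = [r for r in records if validate_record(r)]
print(f"{len(invalid)} invalid records, {len(find_duplicates(records))} potential duplicates")
```

Running checks like these on a full extract early in the project gives you a realistic picture of the cleansing effort, instead of discovering it mid-load.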

There are three ways to deal with legacy data in the context of a migration. First, you can throw it away, which, from the perspective of not disturbing the data quality in the target system, is probably the best option. What you don’t have can’t be wrong or ill-formed.

This approach comes with a hefty price: the business will, in effect, lose any way to interact with or even consult their legacy data. While there are approaches that allow for this strategy while complying with legal requirements (you can, for example, store the data on a drive in its original form), this does not help the business if they need the data in their day-to-day operations, which is almost always the case.

The second approach is cleansing the data at the source or in transit during the migration. While this is a superior approach, it takes time, money and resources. Moreover, legacy systems often offer poor support for such cleansing operations.
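
Cleansing in transit usually boils down to a pure transformation step between extract and load. A sketch of what that can look like, assuming made-up legacy field names, date formats and status codes:

```python
from datetime import datetime

# Field names, date formats and status codes are assumptions for illustration.
def cleanse_in_transit(record: dict) -> dict:
    """Normalize one legacy record into the target system's expected formats."""
    cleaned = dict(record)
    # Legacy dates stored as DD/MM/YYYY; target expects ISO 8601.
    if cleaned.get("created_on"):
        cleaned["created_on"] = (
            datetime.strptime(cleaned["created_on"], "%d/%m/%Y").date().isoformat()
        )
    # Trim whitespace and normalize casing on free-text name fields.
    for field in ("first_name", "last_name"):
        if cleaned.get(field):
            cleaned[field] = cleaned[field].strip().title()
    # Map legacy status codes onto the target's controlled vocabulary.
    status_map = {"A": "active", "I": "inactive", "D": "archived"}
    cleaned["status"] = status_map.get(cleaned.get("status"), "unknown")
    return cleaned

print(cleanse_in_transit({"created_on": "03/02/2021", "first_name": " eva ", "status": "A"}))
# {'created_on': '2021-02-03', 'first_name': 'Eva', 'status': 'active'}
```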

The third approach is migrating the data as-is and leveraging the new application to cleanse it in the target. Often, this is where we end up, minus the actual cleansing. The data migration project is over, resources are reallocated, and if we’re having this discussion, the chances of there being clear governance around data quality are slim to none.

This will result in two things. First, the experience in the new system will be marred by the influx of poor-quality data. All the demos of the slick new application were done by the vendor using specially crafted, high-quality data. Remember that fast lookup of contacts you were shown? Well, now the system is sluggish out of the box, and the problem of there being seven contacts with the same name in the legacy system is still there. Yet you were promised all of this would be solved by implementing the new tool.

Second, if you migrate to a shared platform, not only do you keep the same problems you had in the legacy system, you also insert your problems into the lives of the coworkers who use that system. Maybe they have whittled their duplicates down to an acceptable level, only to come into the office and discover that someone has destroyed all their efforts in one fell swoop.

I have been involved in projects where we successfully chose a hybrid approach, throwing away some data (do you really need twenty years of invoice data?), cleansing some in transit or at the source and cleansing afterwards in the target. The only way to make that last step work is to have an approved project, with a budget and locked resources to complete the cleansing exercise within an agreed-upon timespan. If you don’t have all of these, simply assume that the mess you migrate will be there forever.

No-migration means you start from an empty database

One of my favorite data migration strategies (Photo by Joaquin Corbalan at Freepik)

I want to spend a little time on the concept of not doing the data migration at all. I’ll be honest with you: just like the best code is code not written, the best data migration is the one not performed. However, no-migration can have vastly different meanings depending on who uses the words.

Let me go first: when I say no-migration, I mean that you’ll be starting from an empty database. Your history: gone. Your accounts: gone. Your contacts: bye-bye. As I mentioned before, we can store that stuff somewhere for compliance reasons, but for all practical purposes, you will not be able to use anything operationally. If you’re moving to a shared platform, you may be able to reuse some of the data there. If you want to use externally purchased data, we can make that work too.

But when the business says no migration, they may mean something else entirely: maybe they don’t want to spend any money or time doing the migration properly, so they’d prefer to just shove everything in the legacy system into the new application as it is. Which is, in effect, also a data migration, just a poorly done one.

I’ve seen this happen in multiple organizations and it always leads to tragedy. At some point, when the consequences of not migrating become apparent, there is a tendency to move the goalposts little by little: maybe we should just try to migrate some of the master data, right? Account data, can’t be that hard.

After an initial round of testing, the business would really like to be able to see what they have done for a client in the past, so maybe we should try to load some of the history as well. And so on. Guess what? You are now in the middle of a data migration, only without doing any of the analysis and without approved budgets or resources.

Load fast, fail fast

Photo by Guillaume Jaillet on Unsplash

Speaking of analysis, what’s the right balance between analyzing and planning beforehand and doing the actual migration?

In my experience, problems usually start surfacing when you actually try to load data into the target system. I’m not saying you shouldn’t spend any time on analysis; on the contrary. But even if you find people who know both source and target systems and understand the business context really well, it is exceedingly hard to fully analyze the situation before you start extracting, transforming and loading data.

Thinking in terms of products is helpful: what do you need to actually do the migration? Are you looking at sets of CSVs? If so, which fields do they have? Where will they map to in the target application? Does everything you have in the legacy system have a place in the data model of the new system? Do we have keys to map relations, and if not, can we create them? The same questions apply if you are planning to load the data through API calls; a sketch of such a field mapping follows below.
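
An explicit field mapping also doubles as documentation of your data-loss decisions: every source column absent from the map is data you have chosen to leave behind. The column and field names below are made up for illustration:

```python
# Hypothetical mapping from legacy CSV columns to target-system fields.
FIELD_MAP = {
    "cust_no": "account_id",      # source key, needed to rebuild relations
    "cust_name": "account_name",
    "tel": "phone",
    # "fax" deliberately has no target field: a conscious data-loss decision
}

def map_record(source_record: dict) -> tuple[dict, set]:
    """Translate one source record and report which source fields are dropped."""
    mapped = {target: source_record.get(source) for source, target in FIELD_MAP.items()}
    dropped = set(source_record) - set(FIELD_MAP)  # everything we silently lose
    return mapped, dropped

record, lost = map_record(
    {"cust_no": "A-17", "cust_name": "Acme", "tel": "+32 2 555 0100", "fax": "n/a"}
)
print(record)  # {'account_id': 'A-17', 'account_name': 'Acme', 'phone': '+32 2 555 0100'}
print(lost)    # {'fax'}
```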

You may say that these things should be addressed during the analysis and you would be right. What I’m saying is that I have rarely encountered a situation where all of these things were properly resolved before loading data into the target.

Usually, we don’t have to load all objects at the same time, so testing one object at a time may bring to light problems you didn’t plan for.

Of course, at some point you’ll want to test the sequence of multiple objects, especially if there are relations between them. In terms of testing, if at all feasible, I’m a big fan of running the entire migration at least once during the test phase. Obviously, if you are looking at a one-week migration, this may prove difficult. On the other hand, the longer the migration takes, the bigger the chance you’ll stumble upon unforeseen issues, some of which may not be teased out when using smaller samples.

Maybe there are objects that can be loaded before the actual migration, but then some method of pushing updates from the legacy system needs to be put in place, since the legacy system will still be receiving updates on certain data elements. While there are arguments for pre-migrating specific data elements, keep in mind that this adds an extra layer of complexity to the migration and poses a significant data quality risk.
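
The usual way to push those updates is a watermark-based delta sync. A minimal sketch, assuming the legacy system keeps a reliable updated_at timestamp on every record (a big assumption in itself) and a hypothetical upsert_into_target loader:

```python
from datetime import datetime, timezone

def upsert_into_target(record: dict) -> None:
    # Stand-in for the real target-system load (API call, bulk import, ...).
    print(f"upserting {record['account_id']}")

def extract_delta(legacy_records: list[dict], since: datetime) -> list[dict]:
    """Select records changed in the legacy system after the watermark."""
    return [r for r in legacy_records if r["updated_at"] > since]

def push_updates(legacy_records: list[dict], last_sync: datetime) -> datetime:
    """Push changed records to the target and return the new watermark."""
    delta = extract_delta(legacy_records, last_sync)
    for record in delta:
        upsert_into_target(record)
    # Advance the watermark; persist it between runs in a real setup.
    return max((r["updated_at"] for r in delta), default=last_sync)

last_sync = datetime(2024, 1, 1, tzinfo=timezone.utc)
records = [{"account_id": "A-17", "updated_at": datetime(2024, 3, 5, tzinfo=timezone.utc)}]
last_sync = push_updates(records, last_sync)
```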

Your overall data maturity has a direct impact on the costs of data migrations

In conclusion, I would like to stress that there are no free lunches, and there are definitely no free data migrations. That said, many of the issues you encounter are a direct consequence of a low overall maturity in data-related matters. In organizations with high data maturity, questions about data migrations are asked before buying or implementing new systems. Similarly, when you create new types of data in the organization, you should already be thinking about the unavoidable data migration down the road. That means thinking about granularity (more granular data can always be aggregated when needed, while the inverse is not true), generic data models and shared taxonomies.
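
To make the granularity point concrete, a toy example with made-up figures: daily records can always be rolled up into monthly totals, but monthly totals can never be broken back down into days:

```python
from collections import defaultdict

# Made-up daily revenue records: the granular form.
daily = [
    {"date": "2024-01-03", "amount": 120.0},
    {"date": "2024-01-17", "amount": 80.0},
    {"date": "2024-02-05", "amount": 200.0},
]

# Rolling up to monthly totals is always possible...
monthly = defaultdict(float)
for row in daily:
    monthly[row["date"][:7]] += row["amount"]
print(dict(monthly))  # {'2024-01': 200.0, '2024-02': 200.0}

# ...but given only the monthly totals, the daily breakdown is gone for good.
```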

When such issues are addressed throughout the data lifecycle, rather than at the moment when you are time-pressed to move your data around, costs tend to be significantly lower.

Also, having a higher data maturity implies that your business users will be able to reason about their data in a more structured way, leading to clearer requirements and more realistic expectations, both of which are key to a successful data migration.
