Doing a Cloud Migration? When Should You Add a Data Catalog and Governance?

Do it early. Keep it agile.

Jon Loyens
Towards Data Science



It’s not quite the chicken-and-egg question, but it is a dilemma that many enterprise data leaders face as they adopt a modern data stack, which generally starts with a cloud data warehouse: when is the right time to introduce a data catalog?

It’s tempting to try to take this one step at a time: I’ll migrate all my data to my data warehouse first, then I’ll worry about discovery and governance. But I’m here to argue that you should do both at the same time.

I’ll lay out my argument and then describe a process for adopting agile data governance methodologies that will actually help you do that migration faster, coordinating the effort via a data catalog and governance platform. Both internally and throughout our customer base, we’ve seen how doing this can increase your migration’s chances of success and accelerate the adoption of a data-driven culture.

Get It Right by Applying Agile Methods to Data Stack Development

Adopting agile data governance via a data catalog while migrating to a modern data stack lets you reap the benefits of your new data platform, and see ROI from it, almost immediately. It also helps you avoid the pitfalls of the waterfall development methods that plague data and analytics work and slow innovation.

The methods described below have proven successful both internally at data.world — where we migrated to Snowflake from AWS Athena using our own data catalog — and with our customers:

1. Build an Analytics Backlog

Create a list of metrics that the business needs or wants. It’s usually best to phrase the metrics as questions, such as “What is the daily average session length of visitors to our website?” or “What is our average order value for a given time period?” These are the equivalent of user stories in software development. By starting with high-value questions, you’ll see patterns emerge that can help with your next step. This is also where considering a metrics or semantic layer can be truly helpful. Ultimately, though, each metric will need to be broken down into the data sources required to calculate it, and this is where thinking ahead to step two becomes valuable: it lets you create reusable data products beyond simple metrics.
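To make that concrete, here’s a minimal sketch of what a backlog entry might look like if you captured it in code. The fields, metric names, and source names are all hypothetical; the point is simply that each question gets tied to the sources needed to answer it:

```python
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class AnalyticsStory:
    """One backlog item: a metric phrased as a business question."""
    question: str                  # the user-story-style metric question
    metric_name: str               # short handle for a metrics/semantic layer
    required_sources: list[str] = field(default_factory=list)

backlog = [
    AnalyticsStory(
        question="What is the daily average session length of visitors to our website?",
        metric_name="avg_daily_session_length",
        required_sources=["web_events", "sessions"],
    ),
    AnalyticsStory(
        question="What is our average order value for a given time period?",
        metric_name="avg_order_value",
        required_sources=["orders", "order_items"],
    ),
]

# Sources that recur across stories are good candidates for
# reusable data products beyond any single metric.
source_counts = Counter(src for story in backlog for src in story.required_sources)
print(source_counts.most_common())
```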

2. Decide on an Architectural Style

Like a well-designed software application, the data in your data warehouse or data lake should conform to an architectural style, such as a star schema, a snowflake schema, or a data vault. You can select the style based on the kinds of questions in your analytics backlog and the shape and types of data you predominantly have available in your enterprise. Think in layers of data models, from raw data to clean data to transformed analytic models; you can compare this layering to layering software from raw API to business logic to UX.

The architectural style you choose will have a big impact on how analysts and data scientists access and use the data, and applying it consistently will make your data platform much more usable for all data consumers. At data.world, we use a star-schema layout and an ELT (extract, load, transform) architectural pattern. The star schema of fact and dimension tables works particularly well for tracking the activity of our membership base while also letting us pivot our analytics by time period or by customer org.
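As a deliberately tiny illustration of the star-schema pattern, the sketch below builds one fact table and two dimension tables in an in-memory SQLite database. The table names, columns, and numbers are all invented for the example, but the pivot-by-org-and-period query at the end shows why the layout works:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Dimension tables describe the "who" and "when";
# the fact table records one row per activity event, keyed to dimensions.
cur.executescript("""
CREATE TABLE dim_org  (org_id INTEGER PRIMARY KEY, org_name TEXT);
CREATE TABLE dim_date (date_id TEXT PRIMARY KEY, month TEXT);
CREATE TABLE fact_activity (
    org_id  INTEGER REFERENCES dim_org(org_id),
    date_id TEXT REFERENCES dim_date(date_id),
    sessions INTEGER
);
""")

cur.executemany("INSERT INTO dim_org VALUES (?, ?)",
                [(1, "Acme"), (2, "Globex")])
cur.executemany("INSERT INTO dim_date VALUES (?, ?)",
                [("2023-01-01", "2023-01"), ("2023-01-02", "2023-01")])
cur.executemany("INSERT INTO fact_activity VALUES (?, ?, ?)",
                [(1, "2023-01-01", 4), (1, "2023-01-02", 6), (2, "2023-01-01", 3)])

# Pivoting by customer org and by time period is a single join-and-group:
for row in cur.execute("""
    SELECT o.org_name, d.month, SUM(f.sessions) AS total_sessions
    FROM fact_activity f
    JOIN dim_org  o ON o.org_id  = f.org_id
    JOIN dim_date d ON d.date_id = f.date_id
    GROUP BY o.org_name, d.month
"""):
    print(row)
```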

3. Select a Toolchain

Once you have an architectural style and a backlog of analytics stories, it’s time to choose some tools. How well these tools work together is critical to maintaining agility in a world with ever-expanding data science and analytics use cases. Different data platforms support different architectural styles. The linchpins of the toolchain are your data platform/query layer, your ETL/data-integration tooling, and your data catalog.

Data quality, profiling, lineage, and other tools can be integrated as your use matures. Having a data catalog with an open and flexible metadata model is critical to adding new tools over time; it also gives you the basis to expand your BI, ML/AI, and data science toolbox to support data consumers. At data.world, we’ve adopted JIRA to manage our analytics backlog, Snowflake as our data platform, dbt for transforms, and a variety of analytics tools, all coordinated via a data.world data catalog.
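As one example of the glue an open metadata model enables: dbt writes model metadata to a manifest file after each run, and a catalog can ingest it. The sketch below is an outline under stated assumptions; the manifest layout varies by dbt version, and publish_to_catalog is a hypothetical stand-in for whatever registration API your catalog exposes:

```python
import json

def publish_to_catalog(name: str, description: str, upstream: list[str]) -> None:
    # Placeholder for a real catalog API call; here we just print.
    print(f"registering {name}: {description!r}, upstream={upstream}")

# dbt writes target/manifest.json after a run; treat the exact schema
# as version-dependent, hence the defensive .get() calls.
with open("target/manifest.json") as f:
    manifest = json.load(f)

for unique_id, node in manifest.get("nodes", {}).items():
    if node.get("resource_type") != "model":
        continue
    publish_to_catalog(
        name=node["name"],
        description=node.get("description", ""),
        upstream=node.get("depends_on", {}).get("nodes", []),
    )
```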

4. Gather Your Team

Now it’s time to bring together the data consumers and producers who will work on the initial analytics stories. Good agile processes incorporate a diversity of stakeholders at every touchpoint; this keeps feedback loops tight, and it might be the single most important thing driving adoption. Consider who you’ll tap to coordinate your data sprints as well, and anoint someone to play the role of data product manager or owner at this point. Data engineers, stewards, and product managers can’t retreat into a cave for months only to emerge and expect analysts and data scientists to start using the results. If you don’t involve all stakeholders in developing data products, getting feedback and providing value in real time, the chances of failure go up dramatically. Involving everyone also increases your ability to capture ROI from your data products, since the opportunity to gain an advantage from data-driven insights is often fleeting. Waiting months for a “perfect” data warehouse means the opportunity to capture value will have passed.

5. Pick Your First Analytics Stories

In classic agile/scrum fashion, now is the time to group, prioritize, and select the first set of stories to tackle. All the stakeholders should be involved in this process. Grouping can be done using traditional techniques like card sorting and affinity exercises. Sizing, business impact, and the team available can also play a role in which stories get done first.

Make sure to keep the analysis concrete, not hypothetical. Pick stories that are closely tied to jobs the data consumers need to get done, so that clear, measurable value is delivered at the end. Additionally, time-box these deliverables and set a date to measure the results. This will help you rein in the temptation to boil the ocean on your first iteration.
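If you want something more quantitative than card sorting to seed the discussion, a simple impact-over-effort score works. The stories and numbers below are purely illustrative; in practice the whole team supplies them during sizing:

```python
# Hypothetical sizing output from the team: impact and effort on a 1-10 scale.
stories = [
    {"story": "Daily average session length", "impact": 8, "effort": 3},
    {"story": "Average order value by period", "impact": 9, "effort": 5},
    {"story": "Churn by customer segment", "impact": 7, "effort": 8},
]

# Favor high impact and low effort; adjust the weighting to taste.
for s in sorted(stories, key=lambda s: s["impact"] / s["effort"], reverse=True):
    print(f"{s['story']}: score {s['impact'] / s['effort']:.2f}")
```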

6. Gather Your Sources in a Data Catalog

It’s time for data producers, typically DBAs or data engineers, to gather up the raw data sources needed to answer the questions posed in the first set of analytics stories. As producers curate sources by story in your data catalog, consumers can evaluate those sources and ask questions about them. It’s critical to capture these initial questions and findings in real time so they don’t disappear into the ether of chat or email. A great data catalog makes this curation, profiling, and Q&A process fluid and eases the overall workflow. This is the step where it becomes clear why you should build your catalog and warehouse at the same time.
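A quick profiling pass over each raw source gives consumers something concrete to react to. Here’s a minimal sketch using pandas; the file path, and the assumption that the raw source is available as a CSV at all, are inventions for the example:

```python
import pandas as pd

# Hypothetical raw source export; swap in your actual extract.
df = pd.read_csv("raw/web_events.csv")

# A one-screen profile: types, null rates, and cardinality per column.
# These are exactly the things consumers tend to ask questions about.
profile = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "null_pct": (df.isna().mean() * 100).round(1),
    "distinct": df.nunique(),
})
print(f"{len(df)} rows")
print(profile)
```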

7. Build and Document Your Data Products

As data sources get refined into the architectural style you’ve selected, data consumers should be working with the data in real time and evaluating how well the models answer the metric questions posed. Data stewards build data dictionaries and business glossaries right next to the data being used, and since you’ve curated the sources by analytics story, the appropriate data assets are now discoverable by purpose. By making the data catalog the fulcrum around which collaboration happens on your new data platform, all of this knowledge capture happens in real time, which minimizes the chore of going back to scrape data dictionaries from Google Sheets or write documentation after the fact. By incorporating your data catalog as you build the assets, you ensure their reuse and minimize your knowledge debt.
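One small way to keep documentation next to the data is to generate the data dictionary from the live schema rather than a detached spreadsheet. This sketch reuses the toy SQLite star schema from step two; the column descriptions are invented for the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE fact_activity (org_id INTEGER, date_id TEXT, sessions INTEGER)"
)

# Stewards supply the business meaning; the schema supplies the rest.
descriptions = {
    "org_id": "Foreign key to dim_org; the customer organization.",
    "date_id": "Foreign key to dim_date; the activity date.",
    "sessions": "Count of member sessions for the org on that date.",
}

# Emit a markdown data dictionary from the live schema, so the docs
# can't drift from the columns that actually exist.
print("| column | type | description |")
print("|--------|------|-------------|")
for _, name, ctype, *_ in conn.execute("PRAGMA table_info(fact_activity)"):
    print(f"| {name} | {ctype} | {descriptions.get(name, 'TODO')} |")
```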

8. Peer Review the Analysis

At the end of your first data sprint, it’s time to peer review the work. The process is far more efficient with an enterprise data catalog in place. Your data catalog acts as a consumer- and SME-friendly environment for asking questions and understanding results, and it prevents the kind of data brawls that happen when people show up to decision meetings with different results and definitions. Everyone can see who contributed to the work and what questions have already been asked, so work can be quickly and efficiently validated and extended. Your data work is all in one place: the data catalog.

9. Publish the Results

Congratulations! You’ve got your first set of data models in your shiny new data platform. Everything is curated by analytics story, peer reviewed, and documented in your cloud data catalog. You’ve done something good for the business and made it reusable at the same time. Best of all, your team did it without a massive post-hoc documentation effort, because the work was done in the data catalog from the beginning. You’ve also taken the first steps toward adopting a data mesh architecture and treating data assets as products.

10. Refine and Expand

Because you worked on your data platform and data catalog at the same time, in an agile way, your assets are well documented and organized by the time they’re published. With the next sprint coming up, you can now expand or refine the assets that are already published. A jumping-off point where assets are well documented and organized around use cases makes each subsequent sprint easier. You can then expand the program to include more lines of business or working groups. This expansion drives adoption, data literacy, and the data-driven culture we all aspire to.

If you’ve already started down the path of migrating to a new cloud data warehouse or data lake, you can still adopt agile data governance practices and chip away at any knowledge debt you have — it’s never too late! Adopting a data catalog that allows you to work iteratively on reducing your knowledge debt will be the key to not feeling like you have an ocean to boil. If you’re interested in learning more or if you’re already working in this way, I’d love to hear from you!
