Follow These Best Practices for High-Quality Data Ingestion

How to choose the right tool and integrate it into your data pipeline

Photo by Jon Tyson on Unsplash

Data ingestion may just be the most important step in the ETL/ELT process. After all, you can’t do any transformation without data! Data ingestion pipelines connect your tools and databases to your data warehouse, the hub of your entire data stack.

The processes you set up to ingest data into your warehouse set the standards for all other processes within your data team. Your transformations and analyses that follow are only as good as the quality of data you ingest. This is why it is imperative that you follow best practices from the start, documenting them along the way.

When you document best practices from the beginning, you are forced to follow a set structure. Without this structure, things can start to get sloppy. You create a new data connector that doesn’t follow the naming convention you wanted, but you have already paid for the data to be ingested and it is too late to change it. You sign a contract with an ingestion company that doesn’t offer an integration for the connector most important to your business. These are the nightmares of the modern data stack that documented standards help you avoid.

So, what are the best practices to create a data ingestion pipeline? In this article I’m going to share the best practices that I’ve learned while building out a data stack from scratch. These practices include choosing an ingestion tool, documenting sources, orchestration, testing, and monitoring. By sharing what I’ve done in the past, you can learn from both my successes and my failures before embarking on the journey yourself.

Compare data ingestion tools

When I first set out to decide on the tools we would use for our data stack, my manager at the time had some awesome systems in place. I was a new analytics engineer who had never worked full-time at a startup. I was just learning the ropes! Luckily, he had a lot of experience choosing the right tools and putting frameworks together. Had he not built a handy Excel spreadsheet to compare the data ingestion tools we were deciding between, who knows where our ingestion would be today.

So, what exactly was this framework for deciding on an ingestion tool? It was basically a scorecard with three categories: must-haves, nice-to-haves, and dealbreakers. Under each category were the qualities and features we cared about, each scored 1–3 depending on the tool’s capabilities.

There are a lot of important things to consider when you design your data ingestion pipeline; here are just a few that we included (a small scoring sketch follows the list):

  • Data connectors available (Shopify, Azure Blob, Facebook, NetSuite, etc.)
  • Capabilities of your team (time to set-up, time to maintain, skillset, etc.)
  • Budget (monthly costs to ingest expected volume of data)
  • Support (Slack communities, dedicated support agents, etc.)
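
To make this concrete, here is a minimal sketch of what such a scorecard might look like in code. The tool names, criteria, weights, and dealbreakers below are hypothetical placeholders; the point is simply that dealbreakers eliminate a tool outright, while must-haves count more than nice-to-haves.

```python
# Minimal sketch of a tool-comparison scorecard. Tool names, criteria,
# and weights are hypothetical placeholders, not recommendations.
CATEGORY_WEIGHTS = {"must-have": 3, "nice-to-have": 1}

# Each criterion: (category, {tool: score from 1-3})
criteria = {
    "has Shopify connector":    ("must-have",    {"Tool A": 3, "Tool B": 1}),
    "setup/maintenance effort": ("must-have",    {"Tool A": 2, "Tool B": 3}),
    "fits monthly budget":      ("must-have",    {"Tool A": 1, "Tool B": 3}),
    "Slack community support":  ("nice-to-have", {"Tool A": 3, "Tool B": 2}),
}

# Dealbreakers knock a tool out regardless of its score.
dealbreakers = {
    "Tool A": [],
    "Tool B": ["no NetSuite connector"],
}

def total_score(tool: str) -> int:
    """Sum each 1-3 score, weighted by how important its category is."""
    return sum(
        CATEGORY_WEIGHTS[category] * scores[tool]
        for category, scores in criteria.values()
    )

for tool in ("Tool A", "Tool B"):
    blocked = dealbreakers[tool]
    status = f"ELIMINATED: {', '.join(blocked)}" if blocked else f"score = {total_score(tool)}"
    print(f"{tool}: {status}")
```

We kept ours in an Excel spreadsheet rather than a script, but the scoring logic was the same.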

These are all critical to deciding on the right ingestion tool for your team. You may not find one tool that satisfies all of your needs, but you do want to find the one that satisfies most of them. If you can’t find everything you are looking for in one tool, strategize on how you can use a few different ones and still achieve the same results. For example, maybe you have an analytics engineer who can set up and maintain connectors for most of your sources with Airbyte, but a few sources don’t have integrations, so you need your engineering team to build something out.

Document your data ingestion pipeline sources

I can never stress the importance of documentation enough, especially if you work on a small data team. Documentation is your lifeline when something goes wrong and the person who owns it is sick, on vacation, or left the company. Everything in your modern data stack should be documented to the point where anyone else on the team can take it over in a time of need. The same goes for data ingestion.

Make sure you keep a document of the various ingestion tools you are using and the connectors you have set up within each tool. It can be easy to lose track of where your raw data is coming from if you don’t do this. In addition, note any special requirements or changes you had to make for a connector to work. For example, for one of our connectors I had to enable change data capture on the database in order to properly capture the data. Write these things down so it’s not a guessing game for your team.
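
If you use a self-hosted tool like Airbyte, you can even bootstrap part of this inventory automatically. Below is a rough sketch that pulls the connection list from a local Airbyte instance and writes it to a Markdown table you can annotate by hand; the endpoint, response fields, host, and workspace ID are assumptions based on Airbyte’s config API and will vary by deployment and version.

```python
# Rough sketch: dump an Airbyte instance's connections into a Markdown inventory.
# The API host, workspace ID, and response fields are assumptions and may
# differ in your Airbyte version or deployment.
import requests

AIRBYTE_API = "http://localhost:8000/api/v1"   # placeholder host
WORKSPACE_ID = "your-workspace-id"             # placeholder workspace

resp = requests.post(
    f"{AIRBYTE_API}/connections/list",
    json={"workspaceId": WORKSPACE_ID},
    timeout=30,
)
resp.raise_for_status()

lines = [
    "| Connection | Status | Notes (fill in manually) |",
    "|---|---|---|",
]
for conn in resp.json().get("connections", []):
    lines.append(f"| {conn.get('name')} | {conn.get('status')} |  |")

with open("ingestion_sources.md", "w") as f:
    f.write("\n".join(lines) + "\n")
```

The empty “Notes” column is where special requirements, like the change data capture setting mentioned above, get recorded by hand.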

Keep a copy of all raw data in your warehouse at all times

I wrote about this extensively in Snowflake Data Warehouse Architecture: How to Organize Databases, Schemas and Tables, so be sure to check that out if you haven’t. One of the most basic best practices of data ingestion is the protection of your raw data. Your raw data should always live in a separate database in your data warehouse. No transformation tools or people should have write access to this database. It should be strictly read-only.

This is the database that will ingest all of your raw data from different sources. It acts as a "backup" of all of your data in case something goes wrong in your data modeling. Let’s say you accidentally run a data model that deletes random rows in a table and mixes up column values. You can no longer trust that data. Without a raw copy you are stuck: deleting the bad data means it is gone forever, but keeping it means working with values you can’t use. With a raw database in place, you can simply delete the data saved in your data models and re-run the models against your untouched raw sources.

Raw data to the rescue!
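
To make “strictly read-only” concrete, here is roughly what the grants could look like in Snowflake, issued through the Python connector. The database, role, and credential values are made-up placeholders, and the exact statements will differ for other warehouses.

```python
# Sketch of locking a raw database down to read-only for transformation users.
# Database, role, and credentials are hypothetical; adapt to your warehouse.
import snowflake.connector

conn = snowflake.connector.connect(
    account="your_account",    # placeholder credentials
    user="your_admin_user",
    password="...",
    role="SECURITYADMIN",
)
cur = conn.cursor()

read_only_grants = [
    "GRANT USAGE ON DATABASE RAW TO ROLE TRANSFORMER",
    "GRANT USAGE ON ALL SCHEMAS IN DATABASE RAW TO ROLE TRANSFORMER",
    "GRANT SELECT ON ALL TABLES IN DATABASE RAW TO ROLE TRANSFORMER",
    # Note: no INSERT/UPDATE/DELETE/CREATE grants -- the role can only read.
]
for stmt in read_only_grants:
    cur.execute(stmt)

cur.close()
conn.close()
```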

Run syncs and models synchronously

Depending on how often you run your data models, your raw data should always be synced to your warehouse right before your models are run. Whether you are using dbt for your transformations like I am, or something else, the transformation should kick off right after the sync completes. This ensures that syncs and the dbt models that depend on them aren’t running in parallel and producing inaccurate results in downstream analyses. By running syncs and dbt models synchronously, you also allow for more accurate validation of your data.

Luckily, data ingestion tools like Airbyte have been partnering with powerful orchestration tools like Airflow, Prefect, and Dagster. Personally, I use Prefect to orchestrate my data models. Prefect makes this sequencing easy by offering "Airbyte Connection Tasks", which let you trigger syncs of your data connectors directly within your data pipeline. Because the syncs run inside your pipeline, you can make downstream models depend on them. This means you will always have the most accurate and fresh data in your data models.

Image by author

You simply specify where your Airbyte server is hosted, along with the port, API version, and connection ID. It’s as simple as that! Then, you use these tasks within your Prefect flows and set them as upstream dependencies for your dbt model runs. You can read this tutorial to create a data ingestion and transformation pipeline with Airbyte, Prefect, and dbt.
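
For illustration, here is a minimal sketch of such a flow using Prefect 1.x-style tasks (newer Prefect versions use the prefect-airbyte collection instead, so the imports will differ). The host, port, connection ID, and dbt project path are placeholders.

```python
# Sketch of a Prefect (1.x) flow that triggers an Airbyte sync and only runs
# dbt once the sync has finished. Host, port, connection ID, and project path
# are placeholders for your own setup.
from prefect import Flow
from prefect.tasks.airbyte import AirbyteConnectionTask
from prefect.tasks.shell import ShellTask

# Points at a self-hosted Airbyte server and one of its connections.
sync_shopify = AirbyteConnectionTask(
    airbyte_server_host="localhost",
    airbyte_server_port=8000,
    airbyte_api_version="v1",
    connection_id="your-shopify-connection-id",  # placeholder
)

run_dbt = ShellTask(name="dbt run")

with Flow("ingest-then-transform") as flow:
    sync = sync_shopify()
    # The upstream dependency guarantees dbt only runs after the sync succeeds.
    models = run_dbt(
        command="cd dbt_project && dbt run",
        upstream_tasks=[sync],
    )

if __name__ == "__main__":
    flow.run()
```

The `upstream_tasks` argument is what gives you the ordering guarantee: the dbt run only starts once the Airbyte sync has completed successfully.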

Create alerting at the data source

It is imperative that you create data alerts and testing at the source. Many people think testing and alerting should be done on the final data model, but really it is quite the opposite. When you focus on the downstream rather than the upstream, debugging becomes a 10,000-piece puzzle that you will never solve. You’ll have to do lots of digging just to find that the problem was never actually the data model.

Trust me, I know because I’ve been there. When we had all of our data models written within a data visualization platform, it was impossible to debug intermediate models. I had to spend weeks investigating a problem that turned out to originate at the data source itself. The first alerts you ever build should be at the data source level. If there is an issue with a data source, you need to know right away, before it impacts any downstream data models.

Not to mention, it is much easier to fix issues at your data source than it is to poke around your data models, testing every line of code to see if it’s performing as expected. Errors at your data source can be as simple as fixing a human error in a spreadsheet, updating a credential, or retriggering a cron job. They are much more straightforward to resolve than problems buried in large amounts of code.

My favorite way to set up testing and alerts at the source is using dbt tests and re_data. With dbt tests you can check for null values, unique primary keys, and accepted values. You can use these simple tests to ensure your data looks as expected. re_data is a dbt package that lets you monitor metrics like freshness, row_count, and descriptive statistics of your data. You can even set up Slack alerts that tell you when a metric falls outside its typical z-score range.

Slack alerts in themselves are powerful. We are all on Slack every day communicating with our company and team. Getting data alerts delivered directly to a #data-alerts channel means you don’t need to sift through your emails or manually visit a dashboard every morning. If something is wrong, your tools will tell you via Slack. Make sure you always set them up!
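
As a rough sketch of how source-level tests and Slack can fit together (separate from re_data’s own alerting), the script below runs dbt tests against source tables and posts to a Slack incoming webhook when anything fails. The webhook URL is a placeholder, and the `source:*` selector assumes your tests are defined on dbt sources.

```python
# Sketch: run dbt tests on source tables and alert a Slack channel on failure.
# The webhook URL is a placeholder; the "source:*" selector assumes your tests
# are defined on dbt sources.
import subprocess
import requests

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder

result = subprocess.run(
    ["dbt", "test", "--select", "source:*"],
    capture_output=True,
    text=True,
)

if result.returncode != 0:
    # Send the tail of dbt's output so the failing tests are visible in Slack.
    requests.post(
        SLACK_WEBHOOK_URL,
        json={"text": f":rotating_light: Source tests failed:\n```{result.stdout[-1500:]}```"},
        timeout=10,
    )
```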

Conclusion

Following data ingestion best practices from the beginning stages of your data stack will set your team up for future success. Documentation of these best practices will make requirements clear from the very start, leaving little room for things to go wrong. When high standards are set for one piece of the puzzle, it is only bound to trickle down to every other piece.

To reiterate, remember the following when deciding on an ingestion tool and integrating it into your stack:

  1. Write down all of the most important connectors for your business, the skills and time your team has to offer, and your budget when deciding on the right ingestion tool.
  2. Document all of your data sources and how they are being ingested into your data warehouse, including any special setup.
  3. Always keep a database with your raw, untouched data.
  4. Run data syncs and models synchronously so there are no gaps in data.
  5. Create alerts at the data sources rather than downstream data models.

While this may add more work to your plate, following these best practices now will only save you trouble down the line. You are proactively solving for the moment something in your data ingestion pipeline breaks, finding a solution now before it negatively impacts the business. It is always better to think about these things from the start rather than reacting after things go wrong.

Subscribe to my analytics engineering newsletter to stay up-to-date on best practices, new tools, and all things Modern Data Stack.

