5 Things to Do When Evaluating ELT/ETL Tools

A list to make evaluating ELT/ETL tools a bit less daunting

Eva Revear
Towards Data Science



We’ve all been there: you’ve attended (many!) meetings with sales reps from all of the SaaS data integration tooling companies and have been granted 14-day access to try their wares. Now you have to decide what to test in order to figure out, definitively, whether the tool is the right commitment for you and the team.

I wanted to throw together some notes on key evaluation questions, as well as a few ways to check functionality, since I’m confident this is a process I will encounter again and again, and I like to have a template for these types of things.

These are primarily collected with cloud-based integration platforms such as, but not limited to, Fivetran, Airbyte, and Rivery in mind, but they could apply to other cases as well!

If you have a favorite way to test out new data tools, add them to the comments!

1. Create a rubric

You can find a million articles on evaluation criteria for data integration tooling (I really like this one!), but ultimately it comes down to your data platform and the problems within it that you are trying to solve.

Gather the team together and determine what these things are. There are, of course, obvious features, like required source and destination connectors, that can be deal breakers, but maybe you’re also looking for a metadata solution that provides lineage, trying to improve monitoring, or needing to scale something that was built in-house and is no longer holding its own.

Laying all of that out also makes it easier to divide the evaluation work across team members so it can run in parallel.

2. Get a simple pipeline running right away

Pick something pretty simple and get it up and running on day one. This will help create an overall picture of logging, metadata, latency, change data capture (CDC), and all the other things that come with a pipeline.

If you are lucky, you might even run into a platform error over the course of the 14 days and get to see how it is handled from the tooling company’s side. If you are dealing with an open-source option, it can also help you understand whether you are equipped to manage such issues in-house.
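To make that picture concrete, it helps to have a repeatable completeness check you can run each day of the trial. Below is a minimal sketch, assuming a Postgres-compatible source and destination; the connection strings, table names, and columns are placeholders to adapt to your own pipeline.

    # Compare row counts and freshness between source and destination.
    # All connection details and table names here are hypothetical.
    import psycopg2  # assumes Postgres-compatible endpoints; swap in your drivers

    def fetch_one(conn_str: str, query: str):
        with psycopg2.connect(conn_str) as conn:
            with conn.cursor() as cur:
                cur.execute(query)
                return cur.fetchone()

    SOURCE = "postgresql://user:pass@source-host/app_db"      # placeholder
    DEST = "postgresql://user:pass@warehouse-host/analytics"  # placeholder

    src_count, src_max = fetch_one(SOURCE, "SELECT COUNT(*), MAX(updated_at) FROM orders")
    dst_count, dst_max = fetch_one(DEST, "SELECT COUNT(*), MAX(updated_at) FROM raw.orders")

    print(f"rows: source={src_count} dest={dst_count} (lag={src_count - dst_count})")
    print(f"freshness: source={src_max} dest={dst_max}")

Running something like this before and after each sync window gives a rough, tool-independent view of latency and of whether CDC is keeping up.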

Key questions

  • Do the documentation and UI guide you through setting up permissions and keys, scheduling, schema setup, etc. in a way that’s intuitive, or do you have to reach out to the technical rep for help?
  • If platform errors do occur, are they obvious via logs or is it hard to tell if you or the platform are the problem?
  • How quickly are customers notified and issues resolved when the platform goes down?

3. Create a few end-to-end transforms

Some tools come with built-in dbt integrations; some allow for fully custom Python-based transformations. Translating a few transforms, maybe even a somewhat complex one, end-to-end from your existing solution can give you a good idea of how heavy a lift it will be to move everything over, if it is possible at all.
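As a concrete yardstick, here is a sketch of the kind of pre-landing transform that often has to be ported: flattening nested JSON into a tabular shape. The field names are hypothetical; the question is whether the tool’s transform layer (dbt models, Python steps) can express the same logic, or whether it has to move into the warehouse.

    # Flatten nested JSON events into one row per event.
    # The record shape and field names are made up for illustration.
    import pandas as pd

    raw = [
        {"id": 1, "user": {"name": "a", "plan": {"tier": "pro"}},
         "events": [{"type": "click"}, {"type": "view"}]},
    ]

    # Explode the event array and pull nested attributes up as columns.
    df = pd.json_normalize(raw, record_path="events",
                           meta=["id", ["user", "name"], ["user", "plan", "tier"]])
    print(df)  # columns: type, id, user.name, user.plan.tier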

Key questions

  • Can you land the data in the same format it lands in now, or will it change in ways that significantly impact downstream dependencies?
  • Are there types of transformations that you do prior to landing data that can’t be done in the tool (joining in supplemental data sources, parsing messy multi-level JSON, etc.) and will now have to be done in the database post-landing?

4. Throw a non-native data source at it

Try to process something from a source or format that isn’t natively supported (dummy up some fixed-width files, or pick an in-house tool that exports data in an unconventional way), or at least talk through how you could with your technical sales representative. Even if that’s not an issue right now, it is worthwhile to understand what the options are for putting that functionality into place if something does come up.
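For example, you could dummy up a fixed-width file like this (the widths and field names are made up for illustration) and then see what it takes to get the tool to ingest it:

    # Write a small fixed-width test file, then confirm it parses back.
    # Widths and fields are invented for this sketch.
    import pandas as pd

    rows = [("0001", "WIDGET    ", "19.99"),
            ("0002", "GADGET    ", " 5.50")]
    with open("dummy_fixed_width.txt", "w") as f:
        for r in rows:
            f.write("".join(r) + "\n")

    # pandas can read this directly; the tool under evaluation may need
    # a custom connector or a pre-processing step to do the same.
    df = pd.read_fwf("dummy_fixed_width.txt",
                     widths=[4, 10, 5],
                     names=["id", "product", "price"])
    print(df)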

Key questions

  • When an unsupported source comes up, will you have enough flexibility from the tool to build a solution within its framework?
  • When you start adding custom functionality to the framework, do the same logging, error handling, state management, etc. still apply?

5. Force an error

Somewhere along one of the test pipelines that you’ve built, throw in a badly formatted file, add bad code into a transform, change the schema, or wreak havoc in some other creative way to see what happens.
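One cheap way to do this is to drop a deliberately malformed file into whatever landing location your test pipeline reads from. The sketch below assumes a local directory; the path and schema are placeholders to adapt to your own setup.

    # Plant a corrupt file alongside good data and watch how the tool reacts.
    # The path and columns are hypothetical.
    from pathlib import Path

    good = "id,amount,created_at\n1,10.50,2024-01-01\n"
    bad = "1|not_a_number|yesterday\n\x00garbage row with a null byte\n"

    landing = Path("landing/orders")  # placeholder path
    landing.mkdir(parents=True, exist_ok=True)
    (landing / "orders_bad.csv").write_text(good + bad)

Does the tool fail the whole sync, skip the file, or quarantine the bad rows, and how clearly do the logs explain which of those it did?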

Third-party tools like these can be black boxes in some respects, and nothing is more frustrating, when a pipeline goes down, than incomprehensible error messages.

Key questions

  • Do error messages and logs make it clear what went wrong and where?
  • What happens to the data that was in the pipeline once you put a fix in place? Does anything get lost, or loaded more times than it should have been? (A quick check is sketched after this list.)
  • Are there options to redirect bad data and allow the rest of the pipeline to keep going?
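On that second point, a duplicate-key check like the sketch below (the table, key, and connection details are hypothetical) can confirm whether recovery double-loaded anything:

    # Look for primary keys that landed more than once after recovery.
    # Table, key, and connection string are placeholders.
    import psycopg2  # swap in your warehouse's driver

    QUERY = """
        SELECT id, COUNT(*) AS copies
        FROM raw.orders
        GROUP BY id
        HAVING COUNT(*) > 1
    """

    with psycopg2.connect("postgresql://user:pass@warehouse-host/analytics") as conn:
        with conn.cursor() as cur:
            cur.execute(QUERY)
            dupes = cur.fetchall()

    print(f"{len(dupes)} keys loaded more than once" if dupes else "no duplicates")

Comparing source row counts against the destination, as in the earlier sketch, covers the lost-data half of the question.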

A couple of bonuses

Have a non-technical user ingest a Google Sheet

Needing to integrate data from a manually uploaded spreadsheet is a more common use case than DEs often like to admit. A tool should make this easy for the producing business team to do without the DEs getting involved at all.

Read through the Reddit threads on the tool

I have found Reddit to be very useful when looking at tooling options. Folks are typically very reasonable in their assessments of positive and negative experiences with a tool, and open to answering questions. At the end of the day, even a thorough trial phase will miss things, and this can be an easy way to see if you have blind spots.
