
Collaboration Between Data Scientists and Data Engineers Made Simple

How to create synergy between your data scientists and data engineers

DayTwo holds the largest microbiome database in the world, with over 85K unique genomic sequences. Analyzing this much data, which can reach hundreds of TB, requires a reliable engineering infrastructure that lets multiple data scientists explore and analyze the data continuously and simultaneously.

Data Scientist Requirements Checklist

At the heart of every research project we conduct is the data analysis phase, in which our data scientists explore massive amounts of data in search of valuable business insights. Such explorations should be designed carefully, since failing to understand the limitations of the infrastructure can lead to disappointment and wasted time.

For this reason, before each experiment, we go over a short checklist to uncover the dependencies and obstacles we might encounter:

  1. What is the actual size of our input and output data?
  2. Based on the sizes above, how much memory is needed?
  3. Should we run it in Jupyter notebooks or in VS Code?
  4. Should we invest in parallelism?
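The first two checklist items can be answered with a quick back-of-the-envelope calculation. The snippet below is a minimal sketch, not our production tooling: the function names, the 8-bytes-per-value default, and the overhead factor of 2 are illustrative assumptions, and it only covers dense numeric tables.

```python
import math


def estimate_memory_gb(n_rows: int, n_cols: int,
                       bytes_per_value: int = 8,
                       overhead_factor: float = 2.0) -> float:
    """Rough in-memory size (GiB) of a dense numeric table.

    overhead_factor is an assumed safety margin for temporary copies
    created during processing; tune it for the actual workload.
    """
    raw_bytes = n_rows * n_cols * bytes_per_value
    return raw_bytes * overhead_factor / 1024 ** 3


def fits_in_memory(required_gb: float, available_gb: float) -> bool:
    """True if the estimated requirement fits on a single machine."""
    return required_gb <= available_gb


def chunks_needed(required_gb: float, available_gb: float) -> int:
    """Minimum number of chunks so that each one fits in memory."""
    return max(1, math.ceil(required_gb / available_gb))


# Hypothetical experiment: the row and column counts are made up.
required = estimate_memory_gb(n_rows=85_000, n_cols=1_000_000)
if not fits_in_memory(required, available_gb=256):
    print(f"~{required:.0f} GiB needed; split into at least "
          f"{chunks_needed(required, 256)} chunks or parallelize")
```

If the estimate does not fit on a single machine, that is usually the point at which the parallelism question stops being optional.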

Notably, not all experiments require analyzing large amounts of data, but for the sake of this post we will assume that this is the case.

Data Engineers to The Rescue

The way we see it, data engineers are an essential part of our work as data scientists. A research project that lacks the support of our data engineers can look drastically different, and not in a good way. Their contribution starts with a solid infrastructure for collecting and querying the data and continues through feature selection and model training. Any aspect of the project can be simplified and better structured when the proper tools and scripts are available.

Photo by Todd Quackenbush on Unsplash

Going back to our use case of analyzing tens of TB of data, we take advantage of our Data Engineering team's capabilities by defining a shared channel through which we send and receive data.

To put it more simply, we define two core components: (1) the computations that should be performed, and (2) the relevant data subset on which they should run.

This way, analysis tasks that would otherwise be complicated and undocumented become structured and easy to track. This lets our researchers focus on finding insights and frees them from most of the engineering hassles data scientists usually encounter.

image by author
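To make the two components concrete, here is a minimal sketch of what a request sent over such a shared channel could look like. Everything in it is a hypothetical illustration rather than our actual interface: the `AnalysisRequest` class, its field names, and the choice of JSON as the message format are all assumptions.

```python
import json
from dataclasses import asdict, dataclass
from typing import Any, Dict, List


@dataclass(frozen=True)
class AnalysisRequest:
    """A structured request from a data scientist to the data platform.

    computations: names of the calculations to perform (hypothetical names).
    subset: filters selecting the relevant slice of the data.
    """
    computations: List[str]
    subset: Dict[str, Any]
    requested_by: str = "unknown"

    def __post_init__(self) -> None:
        # A request without computations has nothing to run.
        if not self.computations:
            raise ValueError("at least one computation is required")

    def to_message(self) -> str:
        """Serialize the request so it can be sent over the shared channel."""
        return json.dumps(asdict(self), sort_keys=True)

    @staticmethod
    def from_message(message: str) -> "AnalysisRequest":
        """Rebuild the request on the receiving (data engineering) side."""
        return AnalysisRequest(**json.loads(message))


# Example request; the computation name and filter values are made up.
request = AnalysisRequest(
    computations=["mean_abundance"],
    subset={"cohort": "A", "min_reads": 10_000},
    requested_by="yaron",
)
message = request.to_message()
received = AnalysisRequest.from_message(message)
```

Because every request is an explicit, serializable object, it can be logged and replayed, which is what makes the analysis easy to track.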

Epilogue

Data engineers have long been an essential part of every tech company. At DayTwo, beyond their obvious role of "organizing the company databases", we leverage their deep knowledge of building tools and infrastructure to simplify our Data Science team's workflows. Integrating our data engineers into our data science flows enables us to expand our research and data analysis capabilities, with the added benefit of exposing more people to one of today's hottest topics: MLOps.

If you’d like to hear more about the latest and greatest we’ve been working on, don’t hesitate to contact me.

Yaron

