Data: Where Engineering and Science Meet

TDS Editors
Towards Data Science
3 min read · Sep 29, 2022


Does the divide between data engineering and data science still hold?

In some places, certainly: the engineers responsible for moving, storing, and maintaining raw data and the specialists who process, visualize, and analyze it stay in their distinct lanes. In many contexts and organizations, however, things have gotten a lot hazier. One person (or a single team) might be tasked with a range of duties that fall along the spectrum between traditional data engineering and data science work.

This week, we’ve chosen a selection of excellent posts with a strong data-engineering angle. Some tackle workflows you might be very familiar with, while others address processes you’ve yet to explore. We think you’ll find them useful either way: it’s never a bad idea to add depth to our toolkit, or to get a more concrete sense of the work our colleagues are doing. Here we go!

  • Building cost-effective pipelines. We should all be mindful of the resources that our data- and compute-heavy work consumes, says Xiaoxu Gao, “no matter if you are an engineer who manages the resources or a manager who receives the bills.” This article guides us towards making smart decisions while keeping cloud-related expenses under control.
  • How to avoid the worst data-migration pitfalls. Moving to a new place is rarely fun in the physical world, and—alas!—it can be a pain in the digital realm, too. Hanzala Qureshi provides a helpful resource for planning and executing a smooth data migration, with a clear focus on maintaining data quality and “testing, testing, and more testing.”
  • A streamlined database workflow? Yes, please. If you’re a data scientist who isn’t quite SQL-fluent yet, that shouldn’t be a barrier to collaborating with your data-engineering friends on important database projects. Kenneth Leung walks us through the magic of PyMySQL, which makes it possible to access and query MySQL databases with Python.
  • Effective pipelines matter in ML, too. It makes sense that machine learning practitioners devote so much energy to the predictive power of their models; it’s right at the core of what they do. As Zolzaya Luvsandorj stresses, however, it’s just as important to ensure your model can receive raw data, preprocess it, and produce outputs without a hitch.
  • Master the art of orchestrating containerized applications. Kubernetes is a key building block in many data infrastructures, but learning how to work with it can seem daunting. Percy Bolmér is here to help, with a thorough tutorial that dives deep into the code, but keeps things manageable and accessible.
  • Get familiar with the ins and outs of data storage. Cloud storage is a technology almost everyone uses, but few people care to learn about its inner workings. If you’d like to get a better grasp of its function in the context of data-science workflows, Matt Sosna’s latest contribution covers the essentials of AWS in great detail.
  • Designing a robust bridge between GitHub and Docker. Khuyen Tran’s recent tutorial makes it clear that a little bit of data-engineering know-how can simplify processes on a very local, personal level. It leverages open-source tools to store and execute your code in different locations, all while automating major recurring chunks of the process.

Wait, there’s more! (Isn’t there always?) If you’re looking for engaging reads on other topics, here are a few options we recommend:

Thank you for supporting the work we publish! If you’d like to make the biggest impact, consider becoming a Medium member.

Until the next Variable,

TDS Editors
