The Open-Source Spirit of Data Science

Published in Towards Data Science · 3 min read · Nov 10, 2022

It took roughly a decade for data science and machine learning to grow from ancillary functions within businesses into major, lucrative industries in their own right. Still, many of the tools that data professionals use every day have retained a grassroots, community-driven approach, and the ecosystem as a whole has preserved a fondness for freely shared open-source software.

This week, let’s explore the intersection of data science and open-source culture. We’ve selected a handful of recent articles that celebrate this relationship and center projects and products that eschew tech’s tendency to focus on walled gardens and bottom lines. Let’s dive in.

The whys, hows, and whats of contributing to open-source projects. As a prolific contributor, Maarten Grootendorst has a unique perspective on the benefits data scientists can reap by joining open-source projects. His experience has also helped him develop a pragmatic approach to the challenges of less-structured (and sometimes all-out chaotic) development workflows.
A practical assessment of an emerging programming language. Python and R fans have been debating the languages’ respective strengths for a very long time, but in recent years Julia, which is backed by a strong community of developers, has emerged as one of data scientists’ most popular alternatives to the Big Two. Natassha Selvaraj walks us through the factors that make Julia a compelling contender, and offers a beginner-friendly introduction in case you’d like to get started with the basics.

What’s better than one open-source tool? Two of them working in harmony. Khuyen Tran is an expert when it comes to building open-source workflows for data science and ML practitioners. A recent hands-on tutorial is a case in point: it brings together the power of GitHub Actions and DVC (Data Version Control) to streamline experimentation in your data pipelines.
How tech giants came to recognize the power (and business logic) of open source. As a Google software engineer working on the Keras library, Haifeng Jin has a unique perspective on the decision-making process that leads FAANG-sized companies to invest in non-proprietary technology. In his debut TDS post, he reflects on how—and why—TensorFlow has stayed an open-source product despite the potential to commercialize it.

What else is new and worth learning about this week? Well, we were hoping you’d ask.

If you’re a data or ML pro who also happens to love good writing and working with authors, we’re currently accepting applications for three volunteer Editorial Associate positions.
We recently celebrated our love of all things geospatial with a selection of excellent articles on the topic.
Adrienne Kline’s popular Statistics Bootcamp series is back—don’t miss the latest installment, which covers confidence intervals in detail.
For a concise introduction to the basics of anomaly detection, head right over to Abiodun Olaoye’s new explainer.
If you’ve found yourself complaining (once again) about Python’s slowness, you’ll appreciate Casey Cheng’s thorough guide to speeding up loops.
Interested in learning more about reinforcement learning? Hennie de Harder presents a practical application in a new and accessible post on solving multi-armed bandit problems.

Thank you, as always, for your support. If you’d like to make the biggest impact, consider becoming a Medium member.

Until the next Variable,

TDS Editors

Written by TDS Editors