Why Software Development Skills are Essential for Data Science

Data Scientists Should Learn From Software Engineers

Opinion

Photo by Peter Gombos on Unsplash

Overview of Your Journey

  1. Traditional Roles for Data Scientists
  2. A Changing World
  3. Four Basic Software Development Skills
  4. How Software Development Skills can Guide Decisions
  5. Wrapping Up

Traditional Roles for Data Scientists

Data scientist was dubbed the sexiest job of the 21st century by the Harvard Business Review, Forbes, and others almost a decade ago. No wonder: data scientists were offered high wages and interesting problems to solve. It quickly became a popular career for both college graduates and self-taught learners to aspire to.

The data scientist of the 2010s had an incredibly broad scope and ill-defined responsibilities. A data scientist was simply someone who could generate insight from data.

Two data scientists at different companies could have widely different responsibilities:

  • At company A, a data scientist could work with technologies like Microsoft Excel, Power BI, and SQL databases. The data scientist at company A gives presentations where they share the insights they have discovered.
  • At company B, a data scientist could work in Jupyter notebooks to develop machine learning models. The data scientist at company B uses high-level statistics and communicates their insights via reports.

Businesses gradually understood that having predictive models in production is really valuable. Yet, something was terribly wrong. Many companies learned the hard way that getting reliable models into production is not easy, especially if the whole data science team was hired for being good at Microsoft Excel and high-level statistics.

It became clear that without software development skills, data science teams fail to deploy models to production.

Does that mean that companies should hire software developers to do the jobs of data scientists? No! The skills that data scientists have for understanding and making sense of data are a culmination of years of experience. This should not be underestimated 🔥

Note: In addition to a lack of software development skills, there are other problems that can make it hard to deploy models to production. Having siloed data sources or bad communication practices are other common reasons.


A Changing World

The common solution to the production problem was to hire more people to compensate for the missing skills. We now have many new roles, like Machine Learning Engineer or MLOps Engineer, that aim to partly fill this gap. Along with cloud-based deployment solutions, this has certainly helped 😃

Emphasis is now placed on cross-functional teams. A cross-functional data team has people with varying backgrounds, which makes it more realistic to deploy models to production.

The team may include a person with a DevOps background and a person with a testing background. For how to develop cross-functional teams, you can start by taking a look at this blog post.

However, the deployment process is also affected by the software development skills of data scientists. Machine learning models are often developed with jumbled code in Jupyter notebooks. They are then left to traditional software developers, testers, and DevOps personnel to "fix up" and deploy. This approach creates a waterfall structure that slows the whole process down.

The following should be taken into consideration:

Cross-functional teams become truly useful when the members of the team actively work to improve in each other's areas.

What does this mean for data scientists in cross-functional teams? Data scientists should strive to pick up some basic skills in the field of software development. This also lets them branch out into areas such as MLOps, which is in incredibly high demand.

This can initially seem very time-consuming! I am not saying that a data scientist should become a full-fledged software developer. I am not advocating that data scientists should necessarily build websites or manage Kubernetes clusters (although kudos to you if you want to do that!).

The goal should be to grasp and incorporate the basics of these fields to more rapidly deploy models to production and generate value. Even better, companies should allocate (paid 💸 ) time for data scientists to learn and practice the four software development skills in the next section.


Four Basic Software Development Skills

At the beginning of their software development journey, data scientists should focus on four key skills:

1 – You Should Learn How to Write High-Quality Code 🐍

If you are working in Python, then you should consistently follow a style guide (typically PEP 8). You should write modular code (in the form of functions and classes) that is reusable and testable. Variable names should be carefully picked.

The aim is to write code that is as close as possible to being self-documenting. Picking up a few design principles can help make your code more reusable and extendable. Here my advice is to dream big, but start small. Each week, find a new part of your code that you can improve. You will be surprised at how fast you can improve your skills.
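To make this concrete, here is a minimal sketch of what small, modular, self-documenting Python can look like. The function names and the preprocessing steps (mean imputation followed by min-max scaling) are my own illustrative assumptions, not a prescribed workflow:

```python
from statistics import mean
from typing import Optional


def fill_missing_with_mean(values: list[Optional[float]]) -> list[float]:
    """Replace missing entries (None) with the mean of the observed values."""
    observed_values = [value for value in values if value is not None]
    if not observed_values:
        raise ValueError("Cannot impute: the column has no observed values.")
    column_mean = mean(observed_values)
    return [column_mean if value is None else value for value in values]


def min_max_scale(values: list[float]) -> list[float]:
    """Rescale values linearly to the range [0, 1]."""
    lowest, highest = min(values), max(values)
    if lowest == highest:
        # A constant column has no spread, so map every value to 0.0.
        return [0.0 for _ in values]
    value_range = highest - lowest
    return [(value - lowest) / value_range for value in values]


def preprocess_column(values: list[Optional[float]]) -> list[float]:
    """Impute missing entries, then scale the column to [0, 1]."""
    return min_max_scale(fill_missing_with_mean(values))


print(preprocess_column([1.0, None, 3.0]))  # prints [0.0, 0.5, 1.0]
```

Each function does one thing, has a descriptive name, and can be reused or tested on its own. Compare that with a single notebook cell that imputes, scales, and plots all at once.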

2 – You Should Learn How to Utilize the Command Line 💻

Learning to use the command line (typically bash for data science) is greatly beneficial. Many command-line tools (like csvkit and curl) are super useful for a data scientist.

But more than a specific tool, just the mindset of getting comfortable with living in the command line is really useful. When mastered, the command line becomes your best friend rather than your enemy. My advice is to either start with a general introductory course on the command line (there are plenty of free ones on e.g. YouTube) or to start with csvkit. For csvkit, I have made a free video series that you can check out to get started 😸

3 – You Should Learn How to Use Version Control 💾

Version control software (typically Git) groups the changes to your codebase over time into chunks that you can revert to. This makes the coding process reliable, and nothing is lost needlessly. When done right, many contributors can work on the same project without worrying about breaking each other’s code.

In the modern world, using version control should be a given in any serious data science team. That way, you can gradually introduce new changes or revert to previous versions if necessary.

4 – You Should Learn How to Write Tests ⏰

Writing tests helps ensure the reliability of your code. When the code is modified (and let’s be real, it probably will be modified in some way), you can make those changes safely by checking that all tests still pass. Writing tests for your code will also force you to think about edge cases, and thus become more aware of how your code behaves.

Knowing how to set up a minimal CI/CD environment for evaluating tests automatically is also super useful. But if you are new to testing, then I would suggest starting with writing some basic unit tests with the pytest library in Python.
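As a rough sketch of what such basic unit tests can look like, here is a small example. The function `split_rows` and the file name `test_split.py` are hypothetical names I made up for illustration. pytest collects functions whose names start with `test_` and reports any failing `assert`:

```python
def split_rows(rows: list, test_fraction: float) -> tuple[list, list]:
    """Split rows into (train, test), taking the test rows from the end."""
    if not 0.0 <= test_fraction <= 1.0:
        raise ValueError("test_fraction must be between 0 and 1.")
    test_size = round(len(rows) * test_fraction)
    split_index = len(rows) - test_size
    return rows[:split_index], rows[split_index:]


def test_split_keeps_every_row_exactly_once():
    rows = list(range(10))
    train, test = split_rows(rows, test_fraction=0.2)
    assert train + test == rows


def test_split_respects_requested_fraction():
    train, test = split_rows(list(range(10)), test_fraction=0.2)
    assert len(train) == 8
    assert len(test) == 2


def test_split_handles_empty_input():
    # Edge case: no rows in, no rows out, and no exception.
    assert split_rows([], test_fraction=0.5) == ([], [])


def test_split_rejects_invalid_fraction():
    # Edge case: a fraction outside [0, 1] should fail loudly.
    try:
        split_rows([1, 2, 3], test_fraction=1.5)
    except ValueError:
        return
    assert False, "expected ValueError for test_fraction=1.5"
```

If you save this as `test_split.py` and run `pytest test_split.py` from the command line, all four tests should pass. With pytest imported, the last test is more idiomatically written using the `pytest.raises(ValueError)` context manager. I used plain `try`/`except` here only to keep the sketch free of imports.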

Other Topics?

The four topics above should not be seen as an exhaustive list. Other topics, such as containerization with e.g. Docker and REST API development with e.g. FastAPI, are also important. However, the four topics above are the most fundamental software development skills for a data scientist to pick up.

To exemplify, knowing how to write high-quality code will help you once you decide to pick up FastAPI for API development. In the same way, being comfortable with the command line makes technologies like Docker a lot less intimidating.


How Software Development Skills can Guide Decisions

How would software development skills help you to make important decisions in a data science team? Here are three examples that are taken from the real world:

  • A colleague mentions a cool new product that offers Jupyter Notebooks on steroids. Say hello to Jupyter Notebooks with extra widgets and GUI menus for developing machine learning models. Amazing! However, you realize that the new notebook files are impossible to test and to put under version control. You realize this only because you know the basics of these topics. Should your team adopt this new tool? Probably not. A lack of version control and testing is a deal-breaker for most serious tasks.
  • Sometimes data scientists interview new team members in technical and non-technical interviews. Say you are hiring a software developer, tester, or MLOps practitioner for your team. Having no experience with software development will make evaluating the candidate much harder. With no software development skills, you might even end up giving the candidate a technical interview that is irrelevant.
  • Many machine learning tools are starting to assume software development skills from their users. Take a look at MLflow, a popular tool for tracking artefacts of machine learning models (as well as other things like packaging models). Even MLflow’s quickstart contains git commands and terminal commands like curl!

Wrapping Up

Photo by Spencer Bergen on Unsplash

I hope I have convinced you that software engineering practices are super useful for a data scientist to know!

If you are interested in data science, programming, or anything in between, then feel free to add me on LinkedIn and say hi ✋

