Opinion

Importance of Software Design in Data Science

Why you should not neglect software design and development principles in data science and machine learning

Arnuld On Data
Towards Data Science
4 min read · Nov 9, 2020

Photo by ev on Unsplash

UPDATE, Feb 26, 2021: I will keep adding links at the bottom of this post to support my point: this blog post might look like an opinion piece, but it is not. The need for software design is a fact of data science in production.

Background

Everyone talks about R, Python, statistics, data analysis, and machine learning as the primary skills needed to get hired as a data scientist. I have taken many introductory and advanced data science and machine learning courses. Yes, you need to know how to code, but what almost no one talks about is software design principles.

Software Design and Development

We are going to write code in any data science or machine learning project, and a production environment requires a lot of it. Do not assume that joining a few for loops, creating some variables, and writing a list comprehension is all the coding data science involves. Most data science courses give that impression: they cover basic Python and stop at list comprehensions. We need to know general software design principles and understand the features of the programming language we are using. Whether you are building a tool or running an analysis, a single source file can easily run to 100–500 lines of code or more. You cannot just patch that many lines together with basic variables and loops. Actually, you can, but the result will be a nightmare for you, your team, and your organization. This is where design approaches like object-oriented programming come in. I read every day, and the only person I have seen write about this so far is Rebecca Vickery:

This is also why we need to understand language features such as generators and context managers. As data science and machine learning become more automated and the focus shifts to the cloud, we might not even need to know a programming language (perhaps 15 years from now). But for now, I think one needs to know basic software design principles. You can start with a few resources:
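To make the point about generators and context managers concrete, here is a minimal Python sketch of my own. It is an illustration, not a recommended implementation: the function names (`timed`, `read_rows`, `mean_of_column`, `write_sample`) and the tiny CSV file are all made up for this example. The generator reads one row at a time, so memory use stays flat no matter how large the file is, and the context manager guarantees the timing report is printed even if the block fails.

```python
import csv
import os
import tempfile
import time
from contextlib import contextmanager


@contextmanager
def timed(label):
    """Context manager: report how long the wrapped block took, even on error."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        print(f"{label}: {elapsed:.4f}s")


def read_rows(path):
    """Generator: yield one parsed row at a time instead of loading the whole file."""
    # The with-statement closes the file once the generator finishes or is closed.
    with open(path, newline="") as handle:
        for row in csv.DictReader(handle):
            yield row


def mean_of_column(rows, column):
    """Running mean over any iterable of dict rows, keeping only two numbers in memory."""
    total = 0.0
    count = 0
    for row in rows:
        total += float(row[column])
        count += 1
    return total / count if count else float("nan")


def write_sample(path):
    """Write a tiny, made-up CSV file so the example is self-contained."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["id", "price"])
        writer.writerows([[1, 10.0], [2, 20.0], [3, 30.0]])


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        sample = os.path.join(tmp, "sample.csv")
        write_sample(sample)
        with timed("mean price"):
            print(mean_of_column(read_rows(sample), "price"))
```

The design choice worth noticing is that `mean_of_column` does not care where the rows come from. It accepts any iterable, so the same function works on a list in a notebook and on a multi-gigabyte file in production.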

Software in Production

Then there is real-life work in the data science industry. A good example is machine learning in production. Yes, it is great that we know the models, how to run them on our machines and in the cloud, and how to make them more efficient at hackathons. But a production environment is a totally different beast. If you are a software developer, you know what I am talking about: segmentation faults, crashes, integration issues, version control issues, and so on. Luckily, a few people have written about this, and I even came across a blog about it by Luigi Patruno:

Conclusion

Yes, the bigger picture is about data: what sense we can make of it, how we can present our insights to stakeholders, and how we tell decision-makers what business value can be derived from it. So focus on that first. Once you are done with that, just do not forget the coding part; it is what produces all those results. It is an often neglected but important part. Your predictions are only as good as your data, and the software that runs those predictions is only as good as your understanding of software design and development. You do not need to master it like a software developer, but you definitely need to know the basics.

UPDATE Nov 12, 2020: I just found a post by Rebecca Vickery where she talked in detail about software engineering principles. It is even better than what I wrote. Find it here:

UPDATE Feb 26, 2021: Found more evidence that software engineering (what I call software design) is really important in data science, this time from Kaggle Grandmaster Vladimir Iglovikov:

UPDATE Mar 12, 2021: Here is another viewpoint by Kurtis Pykes on how important software engineering practices are for data science:
