Part 1 – The Motivation
Hi there, if you’re a data scientist and you care about your code – you’re in the right place. This is the first part of a series of posts on Clean Code for data scientists:
Part 1 – this is what you’re currently reading. In this part, I’ll share why I think it’s important to devote your time and energy to writing clean code.
Part 2 – practical tips to implement.
Part 3 – how to practice clean code as a data science team.
I’ll begin by explaining what clean code is and why it is a crucial practice for data scientists.
What is clean code?
Clean code is code that is readable for people, not only for compilers. It’s easy to change and maintain. The best way I’ve found to describe clean code is the metric "WTFs per minute": when you go over a piece of code that’s new to you, how many times do you think to yourself "Whaaaaat?"
When we think of clean code, we might assume it’s something only software engineers and architects have to worry about. Actually, no. Stay tuned as I’ll try to convince you that clean code is crucial for your work as a data scientist and offer some practical tools and tips for implementing it.
The importance of clean code for data scientists
In general, clean code is readable, so it’s easier to debug and refactor. This ease makes the code maintainable, so adding to it or changing it is no big deal. As a result, productivity doesn’t drop over time as the codebase grows and becomes more complex. Plus, clean code is less prone to bugs.
As data scientists, we do research, we learn, we plan, but at the end of the day, we need to write code to make it all happen. Even the smartest state-of-the-art algorithm is bound to make mistakes if your code is buggy or unreadable to others (and to yourself). Our algorithm is only as good as our code.
As data scientists, we sometimes work alone and sometimes as a team. Writing clean code is crucial in both cases:
- In a team, it’ll make it easier to understand and refactor your team members’ code (which is maybe the number one rule for clean code, spoiler alert). It’ll also make onboarding smoother for new team members (speaking from the heart, as I just recently started a new position).
- When working alone, or doing most projects by yourself, cleaner code will help you jump between projects and go back and forth with a minimal "what did I do here?" phase.
Robert C. Martin ("Uncle Bob") says that "the ratio of time spent reading versus writing is well over 10 to 1". We are constantly reading old code as part of the effort to write new code. And writing new code isn’t easy, let alone when you’re trying to do so while figuring out what was there before.
Just to give you an example, take a look at this short function and try to understand what it does:
Now look at this function, doing exactly the same thing:
You get so much information just by looking at the names, and each line tells a story. Now imagine a whole codebase made up only of messy code. Wouldn’t it take longer to understand what’s happening there, let alone refactor it? And the truth is, I only spent about half a minute more writing the nicer piece of code.
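To make the contrast concrete, here is a hypothetical pair of functions in Python. They are not the snippets referred to above, and every name in them is made up for illustration. Both do exactly the same thing; only the second tells you what that thing is.

```python
# Hypothetical "messy" version: the names say nothing about intent.
def f(l, t):
    r = []
    for x in l:
        if x[1] > t:
            r.append(x[0])
    return r


# Hypothetical "clean" version with identical behavior.
def get_customers_above_spend_threshold(customer_spend_pairs, spend_threshold):
    """Return the names of customers whose spend exceeds the threshold."""
    return [
        customer_name
        for customer_name, spend in customer_spend_pairs
        if spend > spend_threshold
    ]
```

In the second version, the function name, the argument names, and the loop variables each answer a question you would otherwise have to work out by reading the body.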
The struggle of clean code for data scientists
You might reach this point and say, "Yes! I want to write cleaner code, why haven’t I done that all along?" And the truth is you might have good reasons. For a start, it requires knowledge, practice, and desire. And as data scientists, it’s not always easy to keep our code as neat as we would like.
High-risk code writing
The number one reason, from my experience, is the nature of our work being "high-risk". Meaning, when we write the first line of code in the script, we usually don’t know what will happen with it – Will it work? Will it be in production? Will we use it ever again? Is it worth anything?
We might end up spending much of our time on risky POCs or one-time data explorations. In those cases, writing the neatest possible code, which takes time, might not be the right way to go. But then the POC we wrote in a sketchy fashion turns into an actual project, it even gets to production, and its code is a mess! Sound familiar? It used to happen to me all the time.
Time-consuming
What’s common to all code writers out there is the time aspect. Writing clean code costs more time up front, since you need to think twice before writing each line. We’re always pushed or encouraged to get things done fast, and that can come at the expense of our code.
Just remember: getting things done fast, while in a hurry, can come back to bite you later when you’re dealing with bugs on a daily basis. The time you spend writing clean code will usually pay for itself in time saved on bugs.
Metrics
Another reason for the struggle is the way we are measured. As data scientists, our employer is looking for results: accurate predictions, insightful data findings, up-to-date technologies. But rarely does anyone look under the hood and count the quarter’s task as successful only if the code is neat. Code infrastructure and code health are internal goals for your team, and it’s harder to make others aware of them.
Some basic principles
If you’ve made it this far and you want to make your code cleaner, here are some of my favorite guidelines:
Match your code quality to the draft level
High-risk POCs or data explorations can start as drafts, with your code getting neater as the project progresses.
Follow standard conventions
This includes meaningful names for variables, functions, and classes; constants instead of hard-coded strings and integers; and consistent formatting.
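As a minimal sketch of what this can look like in Python: the constants, the business rule, and the function names below are all assumptions made up for illustration.

```python
# Named constants replace "magic" values scattered through the code.
MIN_VALID_AGE = 18         # hypothetical business rule
MISSING_LABEL = "unknown"  # hypothetical placeholder for missing categories


def clean_category(raw_category):
    """Normalize a raw category string, falling back to MISSING_LABEL."""
    if raw_category is None or not raw_category.strip():
        return MISSING_LABEL
    return raw_category.strip().lower()


def is_eligible(age):
    """Check an age against the minimum valid age."""
    return age >= MIN_VALID_AGE
```

If the rule changes, you update `MIN_VALID_AGE` in one place instead of hunting for every `18` in the script.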
Make your functions awesome
Each function should do one thing and one thing only. Functions should be short and have no side effects.
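Here is one possible sketch, with hypothetical function names. Each function does a single job, and none of them modifies its input.

```python
def drop_missing_values(values):
    """Return a new list without None entries; the input list is left untouched."""
    return [value for value in values if value is not None]


def compute_mean(values):
    """Return the arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def mean_of_observed(values):
    """Compose the two single-purpose steps above."""
    return compute_mean(drop_missing_values(values))
```

Because `drop_missing_values` builds a new list rather than deleting items in place, whoever calls it can keep using the original data without surprises.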
Less is more
Your code should be as short as possible, without comments (unless they tell you WHY, but never HOW). Plus, make sure you delete commented-out code, as it only causes clutter.
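A small illustration of a WHY comment, assuming a hypothetical income feature and downstream model. A HOW comment such as "loop over incomes and take the log of each" would only repeat what the code already says.

```python
import math


def log_transform(incomes):
    # WHY: income is heavily right-skewed, and the downstream linear model
    # (a hypothetical choice here) behaves better on a roughly symmetric
    # feature. log1p is used rather than log so zero incomes stay defined.
    return [math.log1p(income) for income in incomes]
```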
Keep it simple
Simpler is always better, so try to reduce complexity as much as possible.
Boy scout rule – leave the campground cleaner than you found it
When working in a team – don’t be afraid to change someone else’s code. When working alone – change your own code when giving it a second look.
The newspaper article principle
The top of your script should hold the highest-level functions, with more detail as you go down.
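A rough sketch of this layout, using a made-up pipeline: the function names, the `amount` field, and the scoring rule are all assumptions. The "headline" function comes first, and the details follow below it.

```python
SCORE_WEIGHT = 0.5  # hypothetical scoring weight


def run_pipeline(raw_rows):
    """Headline: the whole story in three steps."""
    valid_rows = remove_invalid_rows(raw_rows)
    scores = score_rows(valid_rows)
    return summarize(scores)


def remove_invalid_rows(rows):
    """Keep only rows that have a non-negative 'amount' field."""
    return [row for row in rows if row.get("amount", -1) >= 0]


def score_rows(rows):
    """Hypothetical scoring rule: amount scaled by a fixed weight."""
    return [row["amount"] * SCORE_WEIGHT for row in rows]


def summarize(scores):
    """Return the count and total of the scores."""
    return {"count": len(scores), "total": sum(scores)}
```

Python looks up function names when they are called, not when the calling function is defined, so `run_pipeline` can sit above its helpers as long as it is only called after the whole module has loaded.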
If you want to know more, you’re welcome to check out the second part of my tutorial, which explains how to implement these guidelines and lists a few more to watch out for.
The change starts now
Writing clean code is possible for anyone, anytime. It doesn’t matter if you’re junior or experienced; all it takes is one simple thing: wanting it. The change starts from within.
In yoga, there’s a notion that thinking about doing something has value, even if you don’t necessarily succeed. Your intentions are what matter.
I find writing code very similar. We don’t necessarily have to follow all of the protocols or do it perfectly. Just having it in the back of our minds, wanting it, and thinking about it will surely make our code better. And of course, practice makes perfect. If you don’t believe me, just try and see for yourself!