Part 2 – The HOW
If you’re here, it means that you understand the importance of Clean Code and that you want to master it. That’s amazing. And as I’ve mentioned in the previous post, your intentions and thought will do most of the work. Let’s just add some practical tips and you’ll be good to go.
I want to write clean code, but how?
If you’re really serious and you want to know the whole theory, I recommend reading "Clean Code: A Handbook of Agile Software Craftsmanship" by Robert C. "Uncle Bob" Martin. But if you don’t have the time, or you’re looking for more data-science-oriented tips, that’s fine. I’ll share with you my two cents and what I’ve learned and used.
However, I must warn you: from now on, these are merely suggestions. There’s no right or wrong (for example, in naming conventions); you just have to stick with something. Plus, there are probably more guidelines out there. I’m just listing what works best for me and what I find most handy:
- Match your code quality to the draft level
- The boy scouts rule – leave the campground cleaner than you found it
- Keep it simple
- Name your variables, classes, and functions thoughtfully
- Make sure your functions are short, doing one thing only, and without side effects
- The newspaper article – the top of your script should be the most top-level functions, with more details as you go down
- Less is more – short, no comments
Match your code quality to the draft level
As I’ve mentioned in the previous post, one of the challenges of writing clean code as data scientists is that we often start a project (a POC or data exploration) without knowing whether it will work or whether we’ll ever use it again. So it’s hard (and rightfully doesn’t pay off) to invest much time in writing clean code for these cases. The problem arises when it does work: we integrate it into our solution and later find this poorly written code in production.
The way to overcome this problem, without investing too much in exploratory code while still keeping your real-life solutions neat, is to match your code quality to the draft level.
What do I mean by code quality? To better define code quality, I divide my code into three quality levels.
1. Total mess
Just think of yourself trying out a new algorithm you just learned of, checking if it even works on your data, or getting a new data set and learning what’s inside. It’ll be quick and dirty – no fancy classes, no well-thought-out variable names, you’ll probably name your data frame "df" and it’ll all be just thrown into a script like your clothes in a laundry hamper.
2. Relatively readable code
It might not be the greatest piece of code ever written, but someone else will be able to understand what’s going on. The variable names should have meaning, and the code should be well divided into functions or classes. It won’t follow all protocols and might not include unit tests, but it’s a good place to start and refactor.
3. Top of the line – The Mona Lisa of code
This is the code that’ll grant you the Nobel Prize for greatest code craftsmanship. Yes, I might have exaggerated. This is production-level code, so it should reflect the best you’ve got and match all the guidelines you wish to follow, because you’ll want this code to be as bug-immune and as easy to maintain as possible in the future (either by yourself or by your team members).
Now, with these quality levels in mind, I match my writing level to the task and its maturity, and let it evolve along the way:
- If you’re writing a low-risk algorithm and you’re sure of your way, start with a high level of code quality (somewhere between 2 and 3).
- If you’re starting a POC, implementing a high-risk algorithm, or conducting possibly one-time data exploration, start with 1 and gradually move to 2 as you’re progressing and learning what’s working and what’s not. This way, you start with low-quality code, move fast, try different approaches, algorithms, etc. and improve your code as you start to see the potential.
- For any other purpose, work with 2 and gradually move it up to 3 as you prepare to release it.
- The most important rule is to never leave messy code behind! Do a basic refactor before moving on, while you still remember what you tried to achieve and how. Messy code left behind will be useless in the future and you’ll probably just have to rewrite it.
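To make the jump from level 1 to level 2 a bit more concrete, here is a minimal sketch. The data and the function name "mean_score_per_group" are made up for the illustration. The logic is the same in both versions; only the readability changes.

```python
from collections import defaultdict

# Level 1: quick and dirty exploration, short names, everything inline.
d = [("a", 1.0), ("b", 3.0), ("a", 2.0)]
r = {}
for k, v in d:
    r.setdefault(k, []).append(v)
r = {k: sum(v) / len(v) for k, v in r.items()}


# Level 2: the same logic, readable enough for someone else to pick up.
def mean_score_per_group(scored_samples):
    """Return the average score of every group in (group, score) pairs."""
    scores_by_group = defaultdict(list)
    for group, score in scored_samples:
        scores_by_group[group].append(score)
    return {
        group: sum(scores) / len(scores)
        for group, scores in scores_by_group.items()
    }


# Hypothetical sample data, identical to the level 1 input.
scored_samples = [("a", 1.0), ("b", 3.0), ("a", 2.0)]
mean_scores = mean_score_per_group(scored_samples=scored_samples)
```

Both versions produce the same dictionary, but only the second one can be read, tested, and reused without asking the author what "d" and "r" were.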
General guidelines
The boy scouts rule
If I had to name my number one rule, it would be the boy scouts rule. The boy scouts rule states to always leave the campground cleaner than you found it. Or in our case – always leave the code better than you found it. What does it mean?
- When working in a team – don’t be afraid to change someone else’s code. When you encounter poorly written code, large or small, that needs to be changed – change it. You don’t need to ask permission or think it over – it’s your duty.
- When working alone – change your own code when giving it a second look, keep refactoring and editing to make it better.
- Applying the boy scouts rule will make sure your codebase is always at its highest level.
- It’s simplest to follow the boy scouts rule when you’re working on code with high-quality tests and high test coverage. That way, you feel secure you’re not causing any harm by refactoring. However, a lack of tests shouldn’t prevent you from refactoring. You can always write some tests to make sure the functionality remains the same and refactor away.
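When there are no tests yet, one way to do this is to pin down the current behavior before touching anything. The sketch below assumes a small, hypothetical "normalize" function that we want to clean up; the test simply checks that the refactored version returns exactly what the old one did.

```python
def normalize(values):
    # Original, slightly clumsy implementation that we want to refactor.
    result = []
    lo = min(values)
    hi = max(values)
    for v in values:
        result.append((v - lo) / (hi - lo))
    return result


def normalize_refactored(values):
    """Scale values linearly to the [0, 1] range."""
    smallest, largest = min(values), max(values)
    value_range = largest - smallest
    return [(value - smallest) / value_range for value in values]


def test_refactoring_keeps_behaviour():
    # Hypothetical sample inputs; in real code, use cases from your data.
    samples = [[1, 2, 3], [10.0, 0.0, 5.0], [-4, 4]]
    for sample in samples:
        assert normalize_refactored(sample) == normalize(sample)


test_refactoring_keeps_behaviour()
```

Once the test passes, the old function can be deleted and the new one can take over its name.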
Keep it simple
A second important guideline is "keep it simple". I once got great advice – when writing code, use half of the brain cells you would use for reading code. The reason is obvious – reading code is much harder, so we should aim for simplicity when writing it. Always ask yourself – could this be written more simply?
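As a small illustration, here are two ways of computing a median (both are my own sketch, not taken from any library). The first is shorter on paper but demands far more brain cells from the reader than the second.

```python
def median_clever(values):
    # Correct, but the reader has to unpack it piece by piece.
    return (lambda s: s[len(s) // 2] if len(s) % 2
            else sum(s[len(s) // 2 - 1:len(s) // 2 + 1]) / 2)(sorted(values))


def median(values):
    # Same result, written so that each step can be read on its own.
    sorted_values = sorted(values)
    middle = len(sorted_values) // 2
    has_odd_length = len(sorted_values) % 2 == 1
    if has_odd_length:
        return sorted_values[middle]
    return (sorted_values[middle - 1] + sorted_values[middle]) / 2
```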
Naming conventions
Writing code is like telling a story, and your name selections are a huge part of that storytelling. While writing code, make sure you are investing time and effort in choosing names that reveal as much information as possible.
When you choose a name for your function/variable, for the next person reading it (might be future you, of course), that name creates an expectation of what will be there. These expectations are a result of two factors:
- The semantic meaning of that name. We expect the function "def load_training_set()" to load the data, but we wouldn’t expect it to filter out null values, right? We expect the variable "student" to represent a single person, so if it’s a list, that’s confusing.
- The formatting – how it’s written. If the standard convention is writing constant values in capital letters, and you name your data frame "TRAINING_DATA", it makes your code less readable and harder to interpret.
You should always think about the expectations you create when naming an instance in your code and make sure it matches its content. If not – rename it.
The following guidelines contain both general recommendations and specific conventions. Note that this particular convention is not sacred. What matters is having a consistent one that serves the purpose – making your code more readable, matching the reader’s expectations, and having a standardized codebase.
In general, throughout your code, pay attention to the following:
- Coherent terminology – use one term per concept (for example, don’t use both "fetch" and "get" to describe the same action).
- Avoid using the same word for two purposes.
- Use variable names as a way to say more about what the code is doing.
Constant variables
Use constant variables instead of hard-coded strings or integers, and write them as all capitals:
- For example, instead of 0.8, use "CLASSIFICATION_THRESHOLD=0.8".
- Give them a descriptive name.
- Place them at the top of the code or in an external settings file.
- It’ll be easier to read, search, change the value, and test.
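A minimal sketch of the idea, using the threshold from the example above. The function name "is_positive" and the sample probabilities are assumptions made up for the illustration.

```python
# Defined once, at the top of the file or in a settings module.
CLASSIFICATION_THRESHOLD = 0.8


def is_positive(predicted_probability):
    # The name of the constant explains what the bare 0.8 never could.
    return predicted_probability >= CLASSIFICATION_THRESHOLD


# Hypothetical model outputs.
predicted_probabilities = [0.35, 0.8, 0.92]
predicted_labels = [
    is_positive(predicted_probability=probability)
    for probability in predicted_probabilities
]
```

Changing the threshold now means editing one line, and searching for "CLASSIFICATION_THRESHOLD" finds every place that depends on it.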
Variable names
Your variable names should be descriptive, pronounceable, searchable, explanatory, and intention-revealing:
- For example, instead of "df", use "train_set".
- Use plural for arrays and lists rather than naming their type. For example, instead of "student_list", use "students".
- Use words, not abbreviations. For example, instead of "lr", use "learning_rate".
- Better long than ambiguous.
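Here is a short before-and-after sketch of these rules. The values and student names are invented for the example.

```python
# Before: the reader has to guess what each name holds.
lr = 0.01
student_list = ["Dana", "Noam"]

# After: names say what the values are, and the plural signals a collection.
learning_rate = 0.01
students = ["Dana", "Noam"]

greetings = []
for student in students:  # "student" is one person, "students" is many
    greetings.append(f"Hello, {student}")
```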
Class names
Your class names should be:
- A noun or noun phrase like Classifier, DataHandler.
- Avoid generic names like Manager, Processor.
Function names
Your function names should:
- Say what they do.
- Contain a verb or verb phrase, like "evaluate_model".
- Be as specific as you can. For example, instead of "load", use "load_training_data".
- Imitate the way you would tell your friend what the function does.
- Start with _ if they are meant for internal use in this class only. For example, "_load_training_data".
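Putting the class and function naming rules together, a sketch could look like this. The class "TrainingDataLoader", its methods, and the "feature,label" row format are all hypothetical and only serve the illustration.

```python
class TrainingDataLoader:
    """Noun-phrase class name; the methods below are verb phrases."""

    def __init__(self, raw_rows):
        self.raw_rows = raw_rows

    def load_training_data(self):
        # Specific name: says what is loaded, not just "load".
        return [self._parse_row(raw_row=raw_row) for raw_row in self.raw_rows]

    def _parse_row(self, raw_row):
        # Leading underscore: meant for internal use in this class only.
        feature, label = raw_row.split(",")
        return float(feature), int(label)


loader = TrainingDataLoader(raw_rows=["0.5,1", "1.5,0"])
training_data = loader.load_training_data()
```

Reading the last two lines out loud ("the loader loads the training data") is close to how you would describe it to a friend, which is the point.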
Functions
- Should be small – split a large function into smaller ones.
- Should be smaller than that.
- Rule of thumb – no longer than 20 lines. When you have so many lines of code inside a single function, it’s very hard to understand what it does. But if you split it into smaller chunks, each with a meaningful name, that’s a whole other ball game.
- If you find yourself looking for a few lines of code inside a function – they should probably be their own function.
- Every function should do one thing and ONE THING ONLY. It should either do something or answer something – not both.
- Don’t repeat yourself – this is what functions are for. If needed – split the function into smaller methods and reuse them.
- No side effects – don’t change anything you’re not supposed to.
- When calling a function, always write the parameter names along with their values. It’ll make it easier to read and understand.
- The fewer arguments the better – try to avoid more than 3 parameters.
- Try to use functions to encapsulate complex conditionals. For example, instead of "if timer.has_expired() and not timer.is_recurrent()", use "if should_be_deleted(timer)".
- Avoid negative conditionals. For example, instead of "if (not follow_up_needed())", use "if (follow_up_not_needed())".
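Several of these points fit into one small sketch. The "Timer" class below is a hypothetical stand-in for the timer in the example above, and I am assuming that a timer should be deleted once it has expired and is not recurrent.

```python
from dataclasses import dataclass


@dataclass
class Timer:
    # Hypothetical stand-in for the timer object mentioned in the text.
    expired: bool
    recurrent: bool

    def has_expired(self):
        return self.expired

    def is_recurrent(self):
        return self.recurrent


def should_be_deleted(timer):
    # Answers one question only; the complex conditional lives here.
    return timer.has_expired() and not timer.is_recurrent()


def select_timers_to_delete(timers):
    # Builds a new list instead of changing the input: no side effects.
    return [timer for timer in timers if should_be_deleted(timer)]


timers = [
    Timer(expired=True, recurrent=False),
    Timer(expired=True, recurrent=True),
    Timer(expired=False, recurrent=False),
]
# Parameter names are written out in every call.
timers_to_delete = select_timers_to_delete(timers=timers)
```

Each function is a few lines long, takes a single argument, and either answers something or builds something, never both.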
Formatting
Formatting is the way you style your code – the way it looks without diving into the names and actions. Good formatting serves two purposes:
- Communication – if your variable/function names are verbal communication, formatting is the non-verbal part. It tells a lot. If you choose to place two functions one after the other – it might tell the reader that they are related, or will run one after the other.
- Focus – once the formatting is done nicely and the code looks organized, it frees your mind to handle what’s really important – its content. Looking at visually messy code can divert you from its meaning.
Here are some of the formatting guidelines I try to apply in my code:
- Vertical openness between concepts – every line of code does one thing only.
- Each blank line is a visual cue that identifies a new and separate concept, so use it wisely.
- Related code (and related functions) should appear close vertically.
- Declare variables close to their usage.
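A minimal sketch of what this can look like in practice. The function and its data are made up; what matters is the layout: one action per line, a blank line between concepts, and each variable declared right before it is used.

```python
def summarize_scores(scores):
    valid_scores = [score for score in scores if score is not None]

    total = sum(valid_scores)
    mean_score = total / len(valid_scores)

    highest_score = max(valid_scores)

    return {"mean": mean_score, "max": highest_score}


# Hypothetical input with one missing value.
summary = summarize_scores(scores=[1.0, None, 3.0])
```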
The newspaper principle
One of my favorite formatting guidelines is the newspaper principle, which states that your script should start with the high-level functions and dive deeper into details as you scroll down.
- The topmost parts of the source file should provide high-level concepts and algorithms.
- Detail should increase as we move downward until at the end, where we find the lowest level functions and details in the source file.
- It’s like reading a newspaper – you start with the title at the top, then a subtitle, and afterward the entire text with the details.
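In Python, a script laid out this way could look like the sketch below. All function names and the three pipeline steps are hypothetical, chosen only to show the ordering.

```python
def run_training_pipeline(raw_values):
    """Headline: the whole story in three steps."""
    clean_values = remove_missing_values(values=raw_values)
    scaled_values = scale_to_unit_range(values=clean_values)
    return summarize(values=scaled_values)


def remove_missing_values(values):
    return [value for value in values if value is not None]


def scale_to_unit_range(values):
    smallest, largest = min(values), max(values)
    return [(value - smallest) / (largest - smallest) for value in values]


def summarize(values):
    return {"count": len(values), "mean": sum(values) / len(values)}


# Python looks up function names when they are called, so the helpers may
# be defined below the function that uses them, as long as the call itself
# comes after all the definitions.
report = run_training_pipeline(raw_values=[2.0, None, 4.0, 6.0])
```

A reader who only needs the big picture can stop after the first function; the details are there for whoever keeps scrolling.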
Less is more
Try to keep your code concise:
- Short short short – the shorter your code is, the faster you can read and understand it. Not at the expense of a mess, of course. You still need a new line for every single action, but try to avoid unnecessary ones.
- Delete commented-out code – trust me, no one will miss it.
Strive for zero comments
You should try to have zero comments in your code – clean code doesn’t need comments.
Why?
- Code should explain itself more than 90% of the time – use your function and variable names for explanations instead of comments.
- We often forget to update the comments when we update the code.
- It adds clutter.
If you do add a comment:
- It should not be about HOW you’re doing something, but only about WHY you’re doing it.
- Make sure it’s necessary, non-trivial, and not redundant.
- Make it as short as possible.
- Delete outdated comments as your code changes.
- Don’t comment out code – just remove it.
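To show the difference between a HOW comment and a WHY comment, here is a small sketch. The function names and the log-transform mentioned in the comment are assumptions made up for the example.

```python
def smooth_counts_with_how_comment(counts):
    # Add 1 to every count.  (HOW: repeats the code, adds nothing new)
    return [count + 1 for count in counts]


def smooth_counts(counts):
    # A zero count would break the log-transform applied later on.  (WHY)
    return [count + 1 for count in counts]


smoothed_counts = smooth_counts(counts=[0, 2, 5])
```

The two functions behave identically; only the second comment tells the reader something the code itself cannot.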
If you want to learn more and see some real-life examples, I recommend this post:
https://towardsdatascience.com/python-clean-code-6-best-practices-to-make-your-python-functions-more-readable-7ea4c6171d60
To sum up, just remember that as a data scientist, your best algorithm is only as good as the code it’s written with. Well-written algorithms will be (as close as possible to) bug-proof and easy to maintain. All it takes is your good intentions plus some practical knowledge and tips, which you hopefully now have. Let me know how it goes!
References:
- "Clean Code: A Handbook of Agile Software Craftsmanship" by Robert C. "Uncle Bob" Martin.
- 7±2 Reasons Psychology Will Help You Write Better Code by Moran Weber & Jonathan Avinor (the lecture is in Hebrew)
- Clean code slides from a lecture by Arturo Herrero