You have to admit, seeing a comment like "This is super clean 😎" or "Didn’t know this could be done this way" on your code or pull request fills you with a wonderful feeling. Personal experience has taught me that embracing good software engineering principles and making the most of existing language functionality is the recipe for good code that others will be grateful for.
As an MLE I use Python day in and day out. Python is a great option for ML practitioners due to its low barrier to entry combined with the massive ecosystem of scientific tooling.
This means that an individual with little to no software engineering knowledge can quickly start using Python.
This last statement can be said in two different tones of voice: positive or negative (try it!).
It may look like a blessing at first, but in the grand scheme of things, the lack of the constraints of software engineering principles (e.g. types, objects) dissuades engineers (MLE) and scientists (DS/AS) from writing good code (trust me, we already have a bad reputation among software engineers as not-so-good engineers). This inevitably leads to unreadable, unmaintainable and untestable spaghetti code in most cases. Worse, one day it becomes some unsuspecting victim’s worst nightmare to reuse this evil code. You may even see a domino effect, where code built on top of bad code leads to … more bad code. Ultimately, this could even lead to organizational headaches down the track.
The bottom line is, doing something in Python is easy, but doing something the right way in Python is difficult. After 8+ years grappling with Python, I’m still learning different (and better) ways to improve my code. I’ve been blessed with good software engineers who would constructively criticize my code when I did things in an inefficient manner. Count your blessings if you have the same support. Here, I’m going to share a few levers you can pull to take your Python skills to the next level.
1. Dataclasses help to remove the clutter
Say you want to manage a list of students with their heights. You may use a list of tuples to do this.
students = [("Jack", 168), ("Zhou", 172), ("Emma", 165), ("Shan", 170)]
But what if you want to add other attributes like weight, grade and gender later on? There’s no way to use the above data structure without getting a headache and making tons of mistakes. You may use a dictionary, but that’s still clunky. A better solution is to use dataclasses.
import dataclasses

@dataclasses.dataclass
class Student:
    name: str
    height: int
So much cleaner! Then you simply instantiate a bunch of Student objects.
students = [
    Student(name="Jack", height=168),
    Student(name="Zhou", height=172),
    Student(name="Emma", height=165),
    Student(name="Shan", height=170)
]
Then you can access attributes with students[0].name-like syntax. No more relying on obscure knowledge like name being at the 0th position, or using error-prone string keys (should you use a dictionary). There are many other cool things you can do with dataclasses, such as:
- Make objects immutable (by using @dataclasses.dataclass(frozen=True))
- Define getters and setters
- Use dataclasses.field for additional control over an attribute
- Convert to a dictionary (dataclasses.asdict()) or a tuple (dataclasses.astuple()) for serialization and for compatibility reasons
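A few of the features listed above can be sketched in one place. This is a minimal illustration rather than a recommended design: the grades field is a hypothetical addition that is not part of the original Student class, and list[int] as an annotation assumes Python 3.9 or newer.

```python
import dataclasses


@dataclasses.dataclass(frozen=True)
class Student:
    name: str
    height: int
    # Hypothetical extra attribute; default_factory gives each
    # instance its own fresh list instead of a shared one
    grades: list[int] = dataclasses.field(default_factory=list)


jack = Student(name="Jack", height=168)

# frozen=True makes attribute assignment raise FrozenInstanceError
try:
    jack.height = 170
except dataclasses.FrozenInstanceError:
    print("Student is immutable")

# asdict and astuple are module-level functions that take the instance
student_dict = dataclasses.asdict(jack)
student_tuple = dataclasses.astuple(jack)
```

Note that frozen=True only blocks reassignment of attributes; a mutable value such as the grades list can still be modified in place.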
You can read more about dataclasses here.
2. Compare in Python with style
Reduction and sorting are such an important part of any Machine Learning project. You are probably using the min, max or sorted functions on lists of simple data types, e.g. str, float, etc. But did you know that there’s a neat trick that widens the range of problems that can be solved using these basic functions?
You can use min, max and sorted to solve problems creatively using a special argument called key. The key allows you to define the logic to extract a "comparison key" from each item in your iterable.
Say you wanted to sort the following list of students by height:
students = [("Jack", 168), ("Zhou", 172), ("Emma", 165), ("Shan", 170)]
sorted_heights = sorted(students, key=lambda x: x[1])
Even cooler, say you had a list of dataclasses.dataclass objects instead.
sorted_heights = sorted(students, key=lambda x: x.height)
Looks slick. Perhaps you want to find the student with the maximum height.
tallest_student = max(students, key=lambda x: x.height).name
Did you know that you could even simulate the argmax operation in plain Python? For those who don’t know, argmax gives the index of the maximum value in a list/array. Again, it’s an essential computation in a lot of algorithms.
a = [3, 4, 5, 2, 1]
max_idx = max(enumerate(a), key=lambda x: x[1])[0]
There have been many occasions in my life where I wrote many lines of code for something I could’ve achieved by simply paying more attention to the key.
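To make the key idea concrete, here is a small self-contained sketch that reuses the Student dataclass from the first section. The variable names are illustrative only, and operator.attrgetter is shown as an optional alternative to a lambda, not as something you must use.

```python
import dataclasses
import operator


@dataclasses.dataclass
class Student:
    name: str
    height: int


students = [
    Student(name="Jack", height=168),
    Student(name="Zhou", height=172),
    Student(name="Emma", height=165),
    Student(name="Shan", height=170),
]

# min accepts key exactly like max and sorted do
shortest = min(students, key=lambda s: s.height).name

# attrgetter("height") behaves like lambda s: s.height;
# reverse=True sorts from tallest to shortest
tallest_first = [
    s.name
    for s in sorted(students, key=operator.attrgetter("height"), reverse=True)
]

# Another way to write argmax: pick the index whose value is largest
heights = [s.height for s in students]
argmax = max(range(len(heights)), key=heights.__getitem__)
```

When several items share the maximum, max returns the first one it encounters, so this argmax reports the lowest matching index.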
3. Make defaultdict your default
When using dictionaries, there’s a handy variant that may make your life easier. Say you want to track how stock prices changed over 3 years. Assume the original format is the following.
stock_prices = [
    ("abc", 95), ("foo", 20), ("abc", 100),
    ("abc", 110), ("foo", 18), ("foo", 25)
]
And you want to convert this to a dictionary. You can do:
stock_price_dict = {}
for code, price in stock_prices:
    if code not in stock_price_dict:
        stock_price_dict[code] = [price]
    else:
        stock_price_dict[code].append(price)
This gets the job done, no doubt. But here’s a more elegant version of the same code using defaultdict.
from collections import defaultdict

stock_price_dict = defaultdict(list)
for code, price in stock_prices:
    stock_price_dict[code].append(price)
Woah, the code is much cleaner this way. No more worrying about whether the value is already instantiated or not. And it really shows you know your Python and data structures.
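The same pattern works with other factories too. The sketch below repeats the grouping example in runnable form, adds a counting variant with defaultdict(int), and shows one caveat worth knowing. The names observation_counts and the "xyz" code are made up for illustration.

```python
from collections import defaultdict

stock_prices = [
    ("abc", 95), ("foo", 20), ("abc", 100),
    ("abc", 110), ("foo", 18), ("foo", 25),
]

# Group prices by stock code; missing keys start as an empty list
stock_price_dict = defaultdict(list)
for code, price in stock_prices:
    stock_price_dict[code].append(price)

# defaultdict(int) starts every missing key at 0, handy for counting
observation_counts = defaultdict(int)
for code, _ in stock_prices:
    observation_counts[code] += 1

# Caveat: merely reading a missing key inserts it with the default value
_ = stock_price_dict["xyz"]
has_xyz = "xyz" in stock_price_dict
```

If that insert-on-read behavior is unwanted, for example before serializing the result, converting with dict(stock_price_dict) gives you a plain dictionary again.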
4. Say "I do" to the itertools
itertools is a module in Python’s standard library for performing advanced iteration over data structures with ease.
You may have had times where you wanted to iterate over multiple lists to create a single list. In Python you might do:
student_list = [["Jack", "Mary"], ["Zhou", "Shan"], ["Emma", "Deepti"]]
all_students = []
for students in student_list:
    all_students.extend(students)
Would you believe that, with itertools, it is a one-liner?
import itertools
all_students = list(itertools.chain.from_iterable(student_list))
Say you want to remove students that are less than 170cm tall. With itertools that’s another one-liner.
students = [
    Student(name="Jack", height=168),
    Student(name="Zhou", height=172),
    Student(name="Emma", height=165),
    Student(name="Shan", height=170)
]
above_170_students = list(itertools.filterfalse(lambda s: s.height < 170, students))
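When removing items by a condition, it helps to know how dropwhile and filterfalse differ, since their names suggest similar behavior. The sketch below runs both on plain heights (a simplification of the Student objects, chosen to keep the example short).

```python
import itertools

heights = [168, 172, 165, 170]

# dropwhile only skips the leading run of matching items,
# then yields everything that follows, matching or not
after_dropwhile = list(itertools.dropwhile(lambda h: h < 170, heights))

# filterfalse tests every item and keeps those where the predicate is False
after_filterfalse = list(itertools.filterfalse(lambda h: h < 170, heights))
```

Here after_dropwhile still contains 165, because dropping stops as soon as 172 is reached, whereas after_filterfalse keeps only the heights of 170 and above. For unsorted data, filterfalse (or the built-in filter with the inverted condition) is the one that removes every match.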
There are many other useful functions such as accumulate, islice, starmap, etc. You can check out more here. Use itertools rather than reinventing the wheel, which leads to unwieldy and inefficient code. By using itertools you also get the added benefit of speed, as its functions are implemented efficiently in C under CPython.
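For a feel of the three functions named above, here is a brief sketch on made-up data. The prices and pairs lists are illustrative values, not taken from the earlier examples.

```python
import itertools
import operator

prices = [95, 100, 110, 104]

# accumulate: running results of a two-argument function;
# with max it yields the running maximum
running_max = list(itertools.accumulate(prices, max))

# islice: lazily slice any iterable, including generators
# that do not support the [start:stop] syntax
first_two = list(itertools.islice((p for p in prices), 2))

# starmap: call a function with each tuple unpacked as its arguments
pairs = [(2, 3), (4, 5)]
products = list(itertools.starmap(operator.mul, pairs))
```

All three return lazy iterators, which is why each result is wrapped in list() here; in a loop you can consume them directly without building the intermediate list.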
5. Packing/Unpacking arguments
Packing and unpacking are achieved via the star (*) and double star (**) operators. The easiest way to understand this concept is using functions. You can define a function with packed arguments or unpacked arguments. Let’s say we define the following two functions:
# f1 takes a fixed set of separate (unpacked) arguments
def f1(a: str, b: str, c: str):
    return " ".join([a, b, c])

# f2 packs all positional arguments into the tuple args
def f2(*args):
    return " ".join(args)
When calling f1 you can only pass 3 arguments, whereas f2 can accept an arbitrary number of arguments, which are packed into a tuple args. So you’d call:
f1("I", "love", "Python")
f2("I", "love", "Python")
f2("I", "love", "Python", "and", "argument", "packing")
All of these would work. If you want a dictionary where the key is the argument name and the value is the argument passed, use the double star operator.
def f3(**kwargs):
    return " ".join([f"{k}={v}" for k, v in kwargs.items()])
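A short usage sketch may help here. It redefines f1 and f3 so it runs on its own, and it also shows the other direction: at the call site, * and ** unpack a sequence or dictionary into separate arguments. The words and student values are illustrative.

```python
def f1(a: str, b: str, c: str) -> str:
    return " ".join([a, b, c])


def f3(**kwargs) -> str:
    return " ".join(f"{k}={v}" for k, v in kwargs.items())


# Keyword arguments are packed into the dict kwargs, in call order
packed = f3(name="Jack", height=168)

# At the call site the operators do the opposite: * unpacks a sequence
words = ["I", "love", "Python"]
sentence = f1(*words)  # equivalent to f1("I", "love", "Python")

# ... and ** unpacks a dictionary into keyword arguments
student = {"name": "Zhou", "height": 172}
unpacked = f3(**student)
```

Unpacking with * only works for f1 when the sequence has exactly three items; any other length raises a TypeError, since f1 declares a fixed number of parameters.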
To see how much better this is, here’s what the alternative to writing f2 with *args would look like:
def f2(text_list: list[str]):
    return " ".join(text_list)

f2(("I", "love", "Python", "..."))
Those double parentheses are already giving me shivers! Using *args is much sweeter.
zip() is a real-life function that accepts an arbitrary number of iterables. It creates a series of tuples by taking the first item of each iterable, then the second item of each iterable, and so on. In the following example, zip() lets you convert between the two formats:
[("a1", "b1"), ("a2", "b2"), ("a3", "b3")] # Format 1
<->
[("a1", "a2", "a3"), ("b1", "b2", "b3")] # Format 2
This is a surprisingly common necessity when you work with data. Remember our student example above? Say we just needed the list of student names sorted by height. We can simply zip the tuples into two sequences.
students = [("Jack", 168), ("Zhou", 172), ("Emma", 165), ("Shan", 170)]
sorted_students, _ = zip(*sorted(students, key=lambda x: x[1]))
It’s bringing the layout of data from format 1 to format 2 and discarding the second list (as that will contain all the heights).
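The conversion between the two formats can be checked with a small round trip. This is a sketch using the same placeholder values as the format illustration; the variable names are my own.

```python
format_1 = [("a1", "b1"), ("a2", "b2"), ("a3", "b3")]

# Format 1 -> Format 2: * unpacks the rows so zip receives
# each row as a separate argument and groups items by position
format_2 = list(zip(*format_1))

# Applying the same trick again restores the original layout
round_trip = list(zip(*format_2))

# The student example, keeping both resulting tuples this time
students = [("Jack", 168), ("Zhou", 172), ("Emma", 165), ("Shan", 170)]
sorted_names, sorted_heights = zip(*sorted(students, key=lambda x: x[1]))
```

One thing to watch for: if the input list is empty, zip(*...) yields nothing, so unpacking the result into two names raises a ValueError. Guard against empty inputs if that can happen in your data.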
So, if you’re developing a function that needs to work with an arbitrary number of arguments, use argument packing. It’s more elegant than passing a tuple or a dictionary.
Conclusion
Those are 5 things that you can do differently next time you write an ML model in Python. Sticking with good software engineering principles and language standards gives a group of engineers and scientists common ground to work cohesively and iterate rapidly. By choosing to become better software engineers, you’ll also be lighting the way for your colleagues to do the same, which will enable frictionless collaboration among individuals and teams.