A Data Scientist’s Guide To Improving Python Code Quality

Tools and packages to write production-worthy Python code

Photo by Christopher Gower on Unsplash

Background

Nowadays, Data Scientists are increasingly involved in the production side of deploying a machine learning model. This means we need to be able to write production-standard Python code, just like our fellow software engineers. In this article, I want to go over some of the key tools and packages that can help you create production-worthy code for your next model.

Linters

Overview

A linter is a tool that catches small bugs, formatting errors, and odd design patterns that can lead to runtime problems and unexpected outputs.

In Python, we have PEP8, which fortunately gives us a global style guide for how our code should look. Numerous Python linters check code against PEP8; my preference is flake8.

Flake8

Flake8 is actually a combination of the Pyflakes, pycodestyle and McCabe linting packages. It checks for errors and code smells, and enforces PEP8 standards.

To install flake8, run pip install flake8. You can then use it by running flake8 <file_name.py>. It really is that simple!

For example, let’s say we have the function add_numbers in a file flake8_example.py:

def add_numbers(a,b):
    result = a+  b
    return result

print(add_numbers(5, 10))

To call flake8 on this file, we execute flake8 flake8_example.py and the output looks like this:

Photo by author.

Flake8 has picked up several styling errors that we should correct to be in line with PEP8.

See here for more information about flake8 and how to customise it for your needs.
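As a rough sketch, this is what flake8_example.py might look like once the spacing and blank-line issues are addressed. The exact warnings flake8 reports can vary with its version and configuration, so treat this as one PEP8-friendly version rather than the only valid fix:

```python
def add_numbers(a, b):
    # One space after the comma and around the operator.
    result = a + b
    return result


# Two blank lines separate the function from top-level code.
print(add_numbers(5, 10))
```

Running flake8 on a file like this should produce no output, which is how flake8 signals that it found nothing to report.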

Code Formatters

Overview

Linters often just tell you what’s wrong with your code but don’t actively fix it for you. Formatters do fix your code: they help expedite your workflow, ensure your code adheres to style guides, and make it more readable for other people.

isort

The isort package sorts your imports into the order recommended in PEP8. It can easily be installed with pip install isort.

Imports should be written on separate lines:

# Correct
import pandas
import numpy 

# Incorrect 
import pandas, numpy

They should also be grouped in the following order:

  • Standard library (e.g. sys)
  • Related third party (e.g. pandas)
  • Local (e.g. functions from other files in the repo)

# Correct
import math
import os
import sys

import pandas as pd

# Incorrect
import math
import os
import pandas as pd
import sys

Finally, the names imported from a single package should be listed in alphabetical order:

# Correct
from collections import Counter, defaultdict

# Incorrect
from collections import defaultdict, Counter
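Putting these rules together, a small script with only standard library imports might look like the sketch below. The variable names are purely illustrative and not from the original repo. Third-party imports such as pandas would go in their own group after a blank line, followed by local imports:

```python
import math
import os
import sys
from collections import Counter, defaultdict

# Illustrative usage so that every import is actually needed.
counts = Counter(["a", "b", "a"])
roots = defaultdict(list)
roots["sixteen"].append(math.sqrt(16))

print(counts["a"])
print(roots["sixteen"])
print(os.path.basename("data/train.csv"))
print(sys.version_info.major)
```

Here each plain import sits on its own line in alphabetical order, and the names pulled from collections are listed alphabetically as well.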

The following commands show you how to run isort from the terminal:

# Format imports in every file
isort .

# Format in specific file
isort <file_name.py>

For more information on isort, check out their site here.

Black

Black reformats your code based on its own style guide, which is a subset of PEP8. See here for the current guide Black adheres to when formatting.

To install Black, simply run pip install black. To call it on a file, run black <file_name.py>.

Below is an example for a file called black_example.py:

# Before running black 
def   add_numbers  (  x, y ) :

    result= x  +y
    return result

Then we run black black_example.py:

# After running black
def add_numbers(x, y):
    result = x + y
    return result

The output in the terminal will also look like this:

Photo by author.

For more information and how to customise your black formatter, see their homepage here.

Unit Tests

Overview

Unit tests provide a structured way to ensure your code is doing what it is meant to do. They test small pieces of your code, like functions and classes, to verify they behave as expected. Tests are quite simple to set up and can save you hours of debugging time, so they are highly recommended for Data Scientists.

Pytest

Pytest is one of the most popular unit testing frameworks, alongside Python’s built-in unittest package, and is easily installed with pip install pytest.

To use pytest, we first need a function we can test. Let’s go back to our add_numbers function, which will be in a file called pytest_example.py:

def add_numbers(x, y):
    result = x + y
    return result

Now in a separate file called test_pytest_example.py, we write the corresponding function’s unit test:

from pytest_example import add_numbers


def test_add_numbers():
    assert add_numbers(5, 13) == 18

To run this test, we simply execute pytest test_pytest_example.py:

Photo by author.

As you can see, our test passed!
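A single assertion only covers one case. As a minimal sketch, a few extra tests could probe negative numbers, zero, and floats. The test names below are hypothetical, and add_numbers is redefined inline only so the snippet runs on its own:

```python
# In the article's layout this would be imported with
# `from pytest_example import add_numbers`; it is redefined here
# only so the sketch is self-contained.
def add_numbers(x, y):
    result = x + y
    return result


def test_add_negative_numbers():
    assert add_numbers(-5, -13) == -18


def test_add_zero():
    assert add_numbers(7, 0) == 7


def test_add_floats():
    # Compare floats with a tolerance rather than exact equality.
    assert abs(add_numbers(0.1, 0.2) - 0.3) < 1e-9
```

Pytest collects any function whose name starts with test_ in a file named test_*.py, so these would run alongside the original test without further setup.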

If you want a more detailed and comprehensive tutorial on pytest and unit testing, check out my previous post on the subject:

Debugging Made Easy: Use Pytest to Track Down and Fix Python Code

Type Checker

Overview

The final topic we will cover is typing, and no, not the keyboard kind! Python is a dynamically typed language, which means it does not enforce strict types for its variables. A variable x can hold an integer at one point and a string at another in the same code. This can be problematic and lead to unexpected bugs, so there are tools that make Python behave more like a statically typed language.
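A tiny illustration of this behaviour, using a throwaway variable:

```python
x = 10
print(type(x).__name__)  # int

# Rebinding the same name to a string is perfectly legal at runtime.
x = "ten"
print(type(x).__name__)  # str
```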

Mypy

We can ensure our variables and functions have the expected types by using the mypy package. It checks that a function’s inputs and outputs match the types declared in its type hints.

For example, for the add_numbers function, we expect the inputs and outputs to both be float. This can be specified in the function:

def add_numbers(x: float, y: float) -> float:
    result = x + y
    return result

print(add_numbers(10, 10))
print(add_numbers("10", "10"))

Now, let’s say we pass the following arguments into the function and print the results:

print(add_numbers(10, 10))
print(add_numbers("10", "10"))

The output would look like this:

print(add_numbers(10, 10))
>>> 20

print(add_numbers("10", "10"))
>>> 1010

We see the first output is what we expect, but the second is not. We passed in two str values, so the + operator concatenated them instead of adding them. The Python interpreter didn’t raise an error because type hints are not enforced at runtime.

We can use mypy to catch these errors and avoid any bugs downstream. To do this, call mypy as mypy <file_name.py>. So, for this example we execute mypy mypy_example.py:

Photo by author.

As we can see, mypy has picked up that the arguments specified in line 6 are str, whereas the function expects float.
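One possible way to resolve this kind of error, sketched here under the assumption that the values arrive as strings (for example from a CSV file or user input), is to convert them before the call. The names raw_x, raw_y and total are illustrative only:

```python
def add_numbers(x: float, y: float) -> float:
    result = x + y
    return result


# Hypothetical raw inputs, e.g. values read from a file or user input.
raw_x, raw_y = "10", "10"

# Convert to float before the call so the arguments match the annotations.
total = add_numbers(float(raw_x), float(raw_y))
print(total)  # 20.0
```

With the conversion in place, the arguments match the float annotations, so the call should pass the type check and return a numeric sum instead of a concatenated string.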

If you want a more detailed and comprehensive tutorial on mypy and typing, check out my previous post on the subject:

A Data Scientist’s Guide to Python Typing: Boosting Code Clarity

What’s The Need?

To summarise, you might be wondering why we need all these tools. Together, they give your Python code:

  • Readability: Your code becomes more intuitive and readable to other developers and data scientists. This allows for better collaboration and quicker delivery times.
  • Robustness: The code will be less prone to errors, and it becomes harder to introduce new ones, particularly when unit tests are in place.
  • Easier Bug Detection: Through the use of linters and tests, we can detect inconsistencies and odd results in the code, which limits the risk of shipping errors to production.

You can view the whole code used in this post at my GitHub here:

Medium-Articles/Software Engineering /code-quality-example at main · egorhowell/Medium-Articles

Another Thing!

I have a free newsletter, Dishing the Data, where I share weekly tips for becoming a better Data Scientist. There is no "fluff" or "clickbait," just pure actionable insights from a practicing Data Scientist.

Dishing The Data | Egor Howell | Substack
