PYTHON-DEV FOR DATA SCIENTISTS

Best practices for setting up a Python environment

Pyenv -> Pipx -> Poetry, three pillars of the python toolchain

Adiamaan Keerthi
Towards Data Science
9 min read · Aug 15, 2021

Image by author

Are you a data scientist who just built an ML model but couldn’t make it work in a Docker container? Are you afraid of trying out new python versions because you don’t want to mess up your local python development environment? Do you work on multiple side projects simultaneously and want to sandbox each environment separately? Are you unclear about managing your python application life cycle from development to publishing?

You have been setting up your python environment wrong!

Unlike python devs, data scientists rarely care about the development environment and portability of their code. Code mostly lives in jupyter notebooks and is handed over to developers who take care of the deployment. This works perfectly well when you have just one version of python on your machine and work on a single project all the time. But once you start to work with multiple python versions or multiple projects needing different environments, it gets progressively more difficult to sandbox these environments.

Once you sandbox your environment, you start adding packages to your project, each with its own list of dependencies. You need to manage your dependency graph and ensure your collaborators can reproduce exactly the same dependencies.

Staying on top of new python versions and developments is a necessity to level up your python skills. You don’t want to be the person on your team who is uncomfortable with f-strings, afraid of async, doesn’t understand the walrus operator, and messes up the local dev environment when starting a new project.

The following is the list of tools and steps for setting up your python environment to handle these hairy situations.

Pyenv

Whenever you have a fresh OS install, start by installing pyenv first. Pyenv is a command line tool and doesn’t depend on python being installed.

Using pyenv, you can set a global python version and a local python version for each project. The global version should be a stable release that serves as the base for most of your projects; individual projects can then use newer or older versions as needed. For example, you can have 3.8.2 as your global version and still try out, say, python 3.9.2 or 3.10 in specific projects. For Windows, a port called pyenv-win is available.

Pyenv Cheatsheet

  1. List available versions and install:

This will give a list of python versions that you can install using pyenv. The list might vary based on the OS.

# List all available python versions
>pyenv install -l
:: [Info] :: Mirror: https://www.python.org/ftp/python
2.4-win32
2.4.1-win32
2.4.2-win32

3.9.0a3
3.9.0a4-win32
3.9.0a4
# Install a specific python version
>pyenv install 3.9.0a4
# Rehash your new installables. Needed to update pyenv shims.
>pyenv rehash

2. Global and Local versions:

Now that installing multiple python versions is trivial, you can switch between them. Even though you can have multiple versions of python on your machine, you need to set one version as your global version. You can set and check your global python version as follows,

# Set global python version
>pyenv global 3.9.0a4
# Check global python version
>pyenv global
3.9.0a4

Unlike the global version, you can set up specific python versions for specific projects. For example, you have a project running in python 3.8.2; you can first install that python version and then set up a local python version for that project. This overrides the global python version.

# Install another version of python
>pyenv install 3.8.2
# Change directory to your project
>cd DreamProject
# Set local python version inside your project
DreamProject>pyenv local 3.8.2
# Check your local python version
DreamProject>pyenv local
3.8.2

This creates a .python-version file in your project with the python version inside it. Pyenv uses this file to set the python version whenever you are inside the scope of this directory. Now that you have set your python version, you can open up a new terminal and verify.
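
To make the lookup concrete, here is a minimal Python sketch of how a local version file could be resolved: start in the current directory and walk up through its parents until a .python-version file is found, otherwise fall back to the global version. This is a simplified model for illustration, not pyenv's actual implementation. The function name resolve_python_version and the global_version parameter are made up for this example.

```python
from pathlib import Path
from typing import Optional


def resolve_python_version(start: Path, global_version: Optional[str] = None) -> Optional[str]:
    """Return the version named by the nearest .python-version file.

    Simplified model (an assumption, not pyenv's real code): check `start`,
    then each parent directory, and fall back to `global_version`.
    """
    start = start.resolve()
    for directory in [start, *start.parents]:
        version_file = directory / ".python-version"
        if version_file.is_file():
            # Use the first non-empty line of the file as the version
            for line in version_file.read_text().splitlines():
                if line.strip():
                    return line.strip()
    return global_version


if __name__ == "__main__":
    # Prints the version that applies to the current working directory
    print(resolve_python_version(Path.cwd(), global_version="3.9.0a4"))
```

Because the search walks up the directory tree, any subfolder of DreamProject picks up the same local version without needing its own file.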

Pipx

Use pipx to install global python tools. mypy, flake8, black, and poetry can be installed using pipx once and reused across projects.

Keep it DRY. When you reuse python tools across projects, it is better to install them once globally. Pipx is used to install python tools globally: it puts each tool in its own isolated virtual environment and exposes its command on your path. Type checkers like mypy, linters like flake8, formatters like black, and dependency management tools like poetry can be installed once and reused across projects. This keeps a single version of each tool and avoids version mismatches between projects. If you need to override the global version, you can install the tool in your virtual environment as well.

For example, you can install the black formatter once on your computer and reuse it across projects.

# Verify global python version is active
>pyenv global
3.9.0a4
# Install pipx
>pip install pipx
# Make sure pipx is added to the path
>pipx ensurepath
# Install black globally
>pipx install black
# Check if install is successful
>black --version
black, version 20.8b1
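
If you install several tools this way, the last check can also be scripted. The following is a small sketch, assuming the tools were put on your path by pipx ensurepath. The helper name locate_tools is hypothetical, and it only reports where each command resolves; it does not check versions.

```python
import shutil
from typing import Dict, Iterable, Optional


def locate_tools(names: Iterable[str], search_path: Optional[str] = None) -> Dict[str, Optional[str]]:
    """Map each tool name to the path of its executable, or None if not found.

    `search_path` defaults to the PATH environment variable.
    """
    return {name: shutil.which(name, path=search_path) for name in names}


if __name__ == "__main__":
    for tool, location in locate_tools(["black", "flake8", "mypy", "poetry"]).items():
        print(f"{tool}: {location or 'not found on PATH'}")
```

The path reported for a tool is the kind of value you would point your IDE at.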

Now that you have black installed, you can set this path in your IDE to start using it across projects.

In vscode, you can set it up by adding the following line to the user settings.json file,

// Set path for the black executable
"python.formatting.blackPath": "C:\\Users\\username\\.local\\pipx\\venvs\\black\\Scripts\\black.exe",

Poetry

Poetry is a perfect tool for the entire lifecycle of your python application: creating a virtual environment, setting up a dev environment, installing packages, resolving dependencies, packaging, and publishing your code.

Poetry helps a developer through the entire lifecycle of the project. Usually, a project starts by creating a virtual environment, adding packages needed for the project, and then ends with packaging the application to the end-user or publishing it in PyPI. We will see how you can do it all in poetry below.

Python project lifecycle using poetry

  1. Initiate poetry: This will create a pyproject.toml file in your directory which contains meta-information related to your project. You can open this file up and edit it later as well.
# Create a directory and setup python version
DreamProject>pyenv local 3.8.2
# Initiate poetry. This will ask meta info related to the project.
DreamProject>poetry init
This command will guide you through creating your pyproject.toml config.
Package name [DreamProject]:
Version [0.1.0]:
Description []:
Author [aspiring_dev <aspiring_dev@gmail.com>, n to skip]:
License []:
Compatible Python versions [^3.8]:

2. Create a virtual environment: Note that we have only created a toml file so far and we have to start by creating a virtual environment. It can be done as below,

# Create virtual environment
DreamProject>poetry shell
Creating virtualenv DreamProject-hJUIGXBx-py3.8 in C:\Users\username\AppData\Local\pypoetry\Cache\virtualenvs
Spawning shell within C:\Users\username\AppData\Local\pypoetry\Cache\virtualenvs\DreamProject-hJUIGXBx-py3.8
Microsoft Windows [Version 10.0.19043.1165]
(c) Microsoft Corporation. All rights reserved.

3. Install dev/prod packages: For a data science project, we will start by installing pandas and ipython. One thing to note here is that not all packages are equal. Some are dev packages and some are prod packages. You may use ipython and jupyter notebooks while developing and testing, but you don’t need them when you are deploying the application as a script in a main.py file or exposing a class or a function. This separation avoids packaging dev tools into the final build that is shipped to the end-user or deployed.

# Add prod packages, will be included in final build
DreamProject>poetry add pandas
Using version ^1.3.1 for pandas
Updating dependencies
Resolving dependencies...
Writing lock file
Package operations: 5 installs, 0 updates, 0 removals
• Installing six (1.16.0)
• Installing numpy (1.21.1)
• Installing python-dateutil (2.8.2)
• Installing pytz (2021.1)
• Installing pandas (1.3.1)
# Add dev packages, will not be included in final build
DreamProject>poetry add ipython --dev
Using version ^7.26.0 for ipython
Updating dependencies
Resolving dependencies...
Writing lock file
Package operations: 13 installs, 0 updates, 0 removals
• Installing ipython-genutils (0.2.0)
• Installing parso (0.8.2)
• Installing traitlets (5.0.5)
.......
• Installing pygments (2.9.0)
• Installing ipython (7.26.0)

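poetry add does more than pip install. It resolves the full dependency graph of your project, records the result in a lock file, and installs a set of package versions that are compatible with each other. Pip also installs a package’s dependencies, but versions of pip before 20.3 had no strict resolver and could leave you with conflicting versions installed.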

Dependency hell arises every time your project grows and starts to depend on several packages. Each of the packages that you install has dependencies on its own, so whenever you install or upgrade a package it gets progressively difficult to resolve this dependency graph.

For example, let’s say that you install package x, which depends on requests <= 2.24.0, and package y, which depends on requests > 2.25.0. Even though both x and y depend on requests, their version requirements do not overlap, so no single version of requests satisfies both. Poetry will detect and flag this conflict, whereas older versions of pip would not. This gets harder as you keep adding packages and the dependency graph grows. But poetry makes sure that every time you add a package, it resolves the dependencies for you and installs the right package versions.
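
The x and y example can be checked with a few lines of Python. This is only a sketch of the overlap test for a single package, assuming plain numeric versions. A real resolver such as poetry's does far more, because it has to search across the whole dependency graph. The helper names and the list of candidate versions are made up for illustration.

```python
import operator
from typing import List, Tuple

Version = Tuple[int, ...]

# Comparison operators supported by this sketch (two-character symbols first)
_OPERATORS = [
    ("<=", operator.le),
    (">=", operator.ge),
    ("==", operator.eq),
    ("<", operator.lt),
    (">", operator.gt),
]


def parse_version(text: str) -> Version:
    """Turn '2.24.0' into (2, 24, 0). Plain numeric versions only."""
    return tuple(int(part) for part in text.strip().split("."))


def satisfies(version: str, constraint: str) -> bool:
    """Check one version against one constraint such as '<=2.24.0'."""
    for symbol, compare in _OPERATORS:
        if constraint.startswith(symbol):
            bound = parse_version(constraint[len(symbol):])
            return compare(parse_version(version), bound)
    raise ValueError(f"Unsupported constraint: {constraint!r}")


def compatible_versions(candidates: List[str], constraints: List[str]) -> List[str]:
    """Keep only the candidates that satisfy every constraint."""
    return [v for v in candidates if all(satisfies(v, c) for c in constraints)]


# Illustrative candidate versions of requests (an assumption for this example)
candidates = ["2.23.0", "2.24.0", "2.25.0", "2.25.1", "2.26.0"]

print(compatible_versions(candidates, ["<=2.24.0"]))             # what package x accepts
print(compatible_versions(candidates, [">2.25.0"]))              # what package y accepts
print(compatible_versions(candidates, ["<=2.24.0", ">2.25.0"]))  # both together: empty list
```

An empty result for the combined constraints is the situation a resolver has to report as a conflict instead of silently picking one side.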

4. Collaborate with your code and environment: Once you are done with your development or you want to collaborate with your teammates, you need to start using version control and a central repository. Others can then clone your repo and start working on it. Traditionally, requirements.txt was used to set up an environment and install all the packages. With poetry, once you clone the repo locally, you can just run poetry install to install all the packages and dependencies at the versions recorded in the lock file (so commit poetry.lock along with pyproject.toml), removing discrepancies in environment setup across machines. Poetry provides a repeatable, exact copy of your working environment and dependency tree.

# After cloning the repo run poetry install
DreamProject>poetry install

This will install all the required packages for this project in a virtual environment.
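
If you want to double check that a teammate's environment really matches, a quick sanity check from inside the virtual environment could look like the sketch below. It uses importlib.metadata from the standard library (python 3.8+). The helper name find_mismatches is hypothetical, and this is not something poetry itself runs.

```python
from importlib import metadata
from typing import Dict, Optional, Tuple


def find_mismatches(expected: Dict[str, str]) -> Dict[str, Tuple[str, Optional[str]]]:
    """Compare expected versions with what is installed in the active environment.

    Returns {package: (expected_version, installed_version_or_None)} for every
    package that is missing or installed at a different version.
    """
    mismatches = {}
    for name, wanted in expected.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches


if __name__ == "__main__":
    # Versions taken from the install log shown earlier in this article
    print(find_mismatches({"pandas": "1.3.1", "numpy": "1.21.1"}))
```

An empty dictionary means every listed package is installed at the expected version.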

5. Package and distribute your code: After development and testing, it is time to package and distribute your code. poetry build helps with building the python project. It can generate both package.tar.gz and package.whl formats.

It is best practice to use the wheel format instead of tarballs (source distributions). With the wheel format, the package is already built by the developer and made available to the user ready to install. With tarballs, the user gets the source code and needs to build it into a wheel before installing it. Building from source is difficult when you don’t have developer tools and the codebase contains other languages like C or C++. Wheel files are also usually smaller than tarballs, making them ideal for PyPI packaging and distribution.

# Build the project
>poetry build

Building DreamProject (1.0.0)
- Building sdist
- Built dreamproject-1.0.0.tar.gz

- Building wheel
- Built dreamproject-1.0.0-py2.py3-none-any.whl
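
The wheel filename in the build output is not arbitrary. It follows the naming convention from the wheel specification (PEP 427): name, version, an optional build tag, python tag, ABI tag, and platform tag. Here is a small sketch that splits a filename into those parts. The function names are my own, and the parser assumes a well-formed filename.

```python
from typing import Dict


def parse_wheel_filename(filename: str) -> Dict[str, str]:
    """Split a wheel filename into its PEP 427 components."""
    if not filename.endswith(".whl"):
        raise ValueError(f"Not a wheel file: {filename!r}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) == 5:
        name, version, python_tag, abi_tag, platform_tag = parts
        build_tag = ""
    elif len(parts) == 6:
        name, version, build_tag, python_tag, abi_tag, platform_tag = parts
    else:
        raise ValueError(f"Unexpected wheel filename: {filename!r}")
    return {
        "name": name,
        "version": version,
        "build_tag": build_tag,
        "python_tag": python_tag,
        "abi_tag": abi_tag,
        "platform_tag": platform_tag,
    }


def is_pure_python(filename: str) -> bool:
    """A wheel tagged 'none-any' has no ABI or platform requirement."""
    info = parse_wheel_filename(filename)
    return info["abi_tag"] == "none" and info["platform_tag"] == "any"


info = parse_wheel_filename("dreamproject-1.0.0-py2.py3-none-any.whl")
print(info["python_tag"], info["abi_tag"], info["platform_tag"])
print(is_pure_python("dreamproject-1.0.0-py2.py3-none-any.whl"))
```

For the wheel built above, the tags py2.py3-none-any say that it is a pure python package that installs on any platform, which is why no compilation is needed on the user's side.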

6. Publish your code: Based on the type of development, you can either publish your code to PyPI or use a private repository. By default, poetry publish will publish your code to PyPI.

$ poetry publish

Publishing dreamproject (1.0.0) to PyPI

- Uploading dreamproject-1.0.0.tar.gz 100%
- Uploading dreamproject-1.0.0-py2.py3-none-any.whl 58%

Notes:

1. As noted by Duncan McRae in the comments, sometimes package install will fail in poetry if the python version is not compatible.

For example, let’s say you have a local python version of 3.9. Poetry will initialize the python version as ^3.9, which means the project declares compatibility with any python version from 3.9 up to, but not including, 4.0. But if you install a package, say scipy, which requires python >=3.9,<3.10, then poetry fails the installation because the project’s range is wider than the package’s. If this occurs, you may need to open up the pyproject.toml file and change the python version to >=3.9,<3.10. This behavior is valid as of version 1.1.6 and may change in the future.
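
To see why narrowing the range helps, here is a rough Python sketch of the check. It expands a caret constraint into a lower and upper bound and tests whether the project's python range fits inside the range a dependency supports. This is my simplified reading of the behavior described above, not poetry's code. The helper names are hypothetical, and edge cases such as ^0 or ^0.0 are not handled.

```python
from typing import Tuple

Version = Tuple[int, int, int]
VersionRange = Tuple[Version, Version]  # (inclusive lower bound, exclusive upper bound)


def parse_version(text: str) -> Version:
    """Turn '3.9' into (3, 9, 0), padding missing parts with zeros."""
    parts = [int(p) for p in text.strip().split(".")]
    parts += [0] * (3 - len(parts))
    return (parts[0], parts[1], parts[2])


def caret_range(spec: str) -> VersionRange:
    """Expand '^3.9' into ((3, 9, 0), (4, 0, 0)): bump the leftmost non-zero part."""
    if not spec.startswith("^"):
        raise ValueError(f"Not a caret constraint: {spec!r}")
    lower = parse_version(spec[1:])
    major, minor, patch = lower
    if major > 0:
        upper = (major + 1, 0, 0)
    elif minor > 0:
        upper = (0, minor + 1, 0)
    else:
        upper = (0, 0, patch + 1)
    return lower, upper


def fits_inside(inner: VersionRange, outer: VersionRange) -> bool:
    """True if every version allowed by `inner` is also allowed by `outer`."""
    return inner[0] >= outer[0] and inner[1] <= outer[1]


project = caret_range("^3.9")                                 # >=3.9,<4.0
dependency = (parse_version("3.9"), parse_version("3.10"))    # >=3.9,<3.10
narrowed = (parse_version("3.9"), parse_version("3.10"))      # edited pyproject.toml

print(fits_inside(project, dependency))   # False: ^3.9 also allows 3.10 and later
print(fits_inside(narrowed, dependency))  # True
```

The first check fails because ^3.9 promises support for python 3.10 and later, which the dependency does not support. Narrowing the project's range removes that promise.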

2. Francesco Padovani asked an excellent question in the comments: “Don’t know, isn’t conda similar to poetry? And isn’t conda used by the data science community? So if anything, data scientists care more than python devs about the portability of the code since many python devs use pip instead. What am I missing?”

These are my reasons for personally using poetry instead of conda.

  1. Conda does not use PyPI and has its own package index. The official channel is a few versions behind for some packages, but there are other channels that have the latest versions available. You do have to be careful about who publishes the packages you install from those channels.

2. Conda is owned by Anaconda Inc, unlike poetry which is an open-source project.

3. Conda has access to the system and can change the system binaries, unlike poetry which is sandboxed within the virtual environment, making it perfect for Docker containers.

4. poetry is compatible with conda. So you can still use conda for setting up complex packages.

Senior Data Scientist @ Shopify 🚀 | ✍🏽 All my articles are forever free and not monetized 💰