
Machine Learning Workflow for Research Scientists



Simple setup to get started with collaborative Machine Learning development.

Who is this Wiki for?

There is a long list of Python and GitHub workflow tutorials, and plenty of Machine Learning tutorials too. However, there is no single easy-to-follow workflow that ties together Machine Learning, an ML cloud service, local development, and benchmarking. I am going to talk about three free services that anyone with basic internet access can use: Google Colaboratory (the ML cloud service), VS Code (local ML development), and Weights and Biases (benchmarking). TL;DR: this is a process guide for deploying your ML models as a researcher!

What won’t you learn?

How to code in Python, develop Machine Learning models, or use PyTorch.

What will you learn?

  • How to contribute code to well-structured projects
  • How to work with multiple Google Colaboratory notebooks
  • How to create simple Python documentation
  • How to scale up as your project and team grow in complexity and size
  • How to benchmark and track projects

Machine Learning Environment Setup

Google Colaboratory (Colab)

The easiest way to get started with Machine Learning (for FREE!) is Colab. Let us look at the simplest out-of-the-box setup to get you started.

  1. Link your Google Account to store notebooks.
  2. Set up storage.
  • Adding files is easily done from the Colab UI: in the left pane of the notebook, click the folder icon and then the upload symbol. However, your instance can easily time out, depending on your network, RAM usage, and whether you are using Colab Pro or the free version, so you might prefer a more permanent solution.
  • The permanent solution uses your Google Drive to host your files, especially larger datasets. There are ways to connect Colab to external data storage like AWS, but for most research purposes and smaller projects the Google Drive approach works well. You can check out other approaches at data storage connection. Let us now look at the Google Drive mounting method (reference).
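A minimal sketch of this pattern, assuming the dataset is stored on Drive as a zip file (the paths and archive name below are placeholders for your own data):

    # Mount your Google Drive at /content/drive (you will be asked to authorize).
    from google.colab import drive
    drive.mount('/content/drive')

    # Copy the zipped dataset from Drive to the instance's local disk and
    # unzip it there -- local disk reads are much faster than reading from Drive.
    !cp "/content/drive/MyDrive/my_dataset.zip" /content/
    !unzip -q /content/my_dataset.zip -d /content/data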

    Here, we mount the drive using the first two commands. Next, to speed up training, we transfer the actual data from the drive to the Colab instance; here, we have first stored our dataset as a zip file. We need to rerun these commands every time we reconnect our Colab instance, but this can greatly speed up training and data management.

    Package installation is straightforward: most common packages come pre-installed out-of-the-box. You can run other system commands by starting the line with an ! (exclamation mark) too.
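For example (wandb here is just an arbitrary package that does not ship pre-installed):

    # Install an extra package from inside a notebook cell.
    !pip install wandb

    # Any other shell command works the same way, e.g. checking the assigned GPU.
    !nvidia-smi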

If you have a really powerful machine, or if you simply prefer running your files locally, I’d recommend using VS Code. Not only is it open source, but it also has powerful extensions that emulate notebooks, complete code, and add a ton of Python-specific tools. My top recommended extensions are (in no specific order):

  • Python for VSCode (syntax support)
  • Visual Studio IntelliCode (AI-assisted completions)
  • Pylance (supercharged Python language support)
  • Code Runner (easily run code snippets)

You will need to have python3 and pip3 installed.


Git, GitHub, and Desktop Management

The standard way to use Git is via the terminal. Another approach is GitHub’s official app, GitHub Desktop, which supports the most common operating systems.

Here are some standard methods for easy and error-free collaboration (for local files). NOTE: Adapted from R for Research Scientists [1].

Setting up a local repository from a remote one

  1. Initialize your remote repository

First, fork the upstream repository. Next, access your remote fork’s URL, clone it, and set this remote URL as the origin to push changes to. Finally, set the original repository (the one you forked from) as your upstream.
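A hedged sketch of these commands, with placeholder URLs and repository names:

    # Clone your fork; the clone automatically becomes the "origin" remote.
    git clone https://github.com/<your-username>/<repo>.git
    cd <repo>

    # Add the original repository (the one you forked) as "upstream".
    git remote add upstream https://github.com/<original-owner>/<repo>.git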

  2. Before starting to code

Always, always, always fetch any changes from the upstream (parent) repository before you start. Otherwise, changes you make locally could create conflicts when you later merge and push your code.
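For example (assuming the default branch is called main):

    # Fetch the latest upstream changes and merge them into your local main branch.
    git fetch upstream
    git checkout main
    git merge upstream/main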

  3. Adding new features
  4. Prevent tiny commits with amend
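A hedged sketch of both steps (the branch and file names are placeholders; avoid amending commits you have already pushed):

    # Develop each new feature on its own branch.
    git checkout -b my-feature
    git add new_feature.py
    git commit -m "Add new feature"

    # Fold a small follow-up fix into the previous commit instead of
    # creating a new tiny commit.
    git add new_feature.py
    git commit --amend --no-edit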

Make sure your source code is also synced with a cloud service like Dropbox, or that you are using Colab. This ensures you do not lose work because of an unsaved file or a power failure.

  5. Incorporate upstream changes before rebasing your feature branch

    If you get stuck, use the following link to troubleshoot further: troubleshoot, or search for the issue on StackOverflow.
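A hedged sketch of this step (again assuming main as the default branch and my-feature as the feature branch):

    # Bring your local main branch up to date with upstream...
    git fetch upstream
    git checkout main
    git merge upstream/main

    # ...then replay your feature branch on top of the updated main.
    git checkout my-feature
    git rebase main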

The Shorter Option

  1. Create a repository on GitHub or decide to use an existing one
  2. Go to your Colab file
  3. Next, select File->Save a copy in GitHub
  4. Authorize your GitHub account, select the appropriate repository, add the commit message, and press OK!
  5. To use version control, make changes to your code and repeat steps (2)-(4)

The Longer Option

Colab provides the option to download your source code to your computer. Go to File->Download .ipynb or File->Download .py to store the notebook locally or convert it to a Python file. Next, you can use the local Git workflow above to maintain version control. To access the file in Colab again, however, you will have to re-upload it using the File->Open Notebook option.


Unit Testing

The simplest way to create unit tests in Python is the built-in unittest module. Let us look at a simple example (adapted from GeeksforGeeks).
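A minimal reconstruction of that example, testing string repetition and uppercasing:

    import unittest

    class TestStringMethods(unittest.TestCase):

        def test_repeat(self):
            # Repeating a string four times.
            self.assertEqual('geek' * 4, 'geekgeekgeekgeek')

        def test_upper(self):
            # Converting a string to uppercase.
            self.assertEqual('geek'.upper(), 'GEEK')

    if __name__ == '__main__':
        unittest.main()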

The above example is pretty self-explanatory. It tests two simple operations: repeating a string four times and converting a string to uppercase. Such tests are super useful for deterministic functions. Creating similar tests for something like gradient descent is not as easy; however, if we know the possible bounds on a function’s output, we can write tests for those bounds. Simple ML testing libraries are indeed available for TensorFlow (deprecated in TF 2.0) and PyTorch, and unittest could well be used to define something similar, as demonstrated in PyTorch code review.

These libraries check for the following items:

  • Whether variables change
  • If certain variables are fixed, do they remain fixed over iterations
  • Is the output value reasonable (i.e. under certain bounds)?
  • Is the model well connected (i.e. do all inputs contribute to the training operation)?
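As a hedged sketch (not the API of those libraries), the first check, whether variables change after an optimization step, could be written with unittest and PyTorch like this:

    import unittest

    import torch
    import torch.nn as nn

    class TestTrainingStep(unittest.TestCase):

        def test_parameters_update(self):
            # A tiny model and one optimization step on random data.
            model = nn.Linear(4, 2)
            optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
            inputs, targets = torch.randn(8, 4), torch.randn(8, 2)

            before = [p.clone() for p in model.parameters()]
            loss = nn.functional.mse_loss(model(inputs), targets)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            # Every trainable parameter should have changed after the step.
            for p0, p1 in zip(before, model.parameters()):
                self.assertFalse(torch.equal(p0, p1))

    if __name__ == '__main__':
        unittest.main()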

Documentation

Maintaining documentation is hard. The easiest way to get automated documentation (automation alone is not recommended; have a look at How I Judge Documentation Quality?) is the pdoc module. Let’s get started with installation and maintenance.

Install pdoc for Python 3 using the following command.
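Presumably something like this, assuming the pdoc3 distribution (which provides the --html flag used below):

    pip3 install pdoc3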

To automate documentation creation, we need to add comments using the Google docstrings format. Let us look at a quick example for a function that adds two numbers (stored in the file _TestExample.py).
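A sketch of what _TestExample.py might contain:

    """A toy module used to demonstrate documentation generation with pdoc."""


    def add(num1, num2):
        """Adds two numbers.

        Args:
            num1: The first number.
            num2: The second number.

        Returns:
            The sum of num1 and num2.
        """
        return num1 + num2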

Now, to output the documentation in html files, we will use the following command.
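Presumably along these lines (pdoc3 syntax; docs is a placeholder output directory):

    pdoc --html --output-dir docs _TestExample.py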

Here, _TestExample.py is the source code, and the --output-dir parameter provides the location to store the HTML files for the wiki. However, this command will not work on a .ipynb file. The best way to create documentation is to download the .py file from Colab and then use the pdoc command. Another way to convert a notebook to a Python file is the following command.
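For instance (MyNotebook.ipynb is a placeholder; requires Jupyter’s nbconvert):

    jupyter nbconvert --to script MyNotebook.ipynb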

Furthermore, to output HTML documentation for an entire Python source folder, we can use the following pdoc command.
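Presumably (again with docs as a placeholder output directory):

    pdoc --html --output-dir docs src-folder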

Here, src-folder is the folder holding the Python source code.


Benchmarking

ML systems get complex very easily. You have to track projects, experiments within those projects, metrics inside each experiment, hyperparameters for each metric collection, and so much more. Furthermore, multiple experiment runs slow down the experimentation phase and make results harder to track. Recently, ML tooling got an excellent upgrade: users can now pipe all the interesting numbers out onto a good-looking dashboard. Enter Weights and Biases (WandB). Not only do you get a free account as an academic, but you can also invite a team to collaborate, which is especially useful if you work in research and/or open source. WandB exists to give you control of your data, store multiple experiment runs, and compare your results easily!

Let us look at the simplest way to get WANDB running on your project. You can find a similar code setup once you create your project on wandb.ai.
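A minimal sketch of such a setup (the project name, hyperparameters, and training function below are placeholders):

    import wandb

    # Start a run and record hyperparameters (assumes `wandb login` was run once).
    wandb.init(project="my-awesome-project",
               config={"learning_rate": 0.01, "epochs": 10})

    for epoch in range(wandb.config.epochs):
        train_loss = train_one_epoch()  # hypothetical training function
        # Log metrics for this run; they appear live on the wandb.ai dashboard.
        wandb.log({"epoch": epoch, "loss": train_loss})

    wandb.finish()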


Putting It All Together

Workflow

An overview of the entire workflow for good collaboration on ML projects.

Getting started with Colab and VS Code is super easy. However, to avoid hassles in long-term projects and collaborations, it is best to establish sane practices early on. Hopefully, following this guide and the many references in it gives you the necessary peace of mind. Code on!


References

[1] N. Brewer, "Publications in RStudio," Mar. 10, 2020. https://nicole-brewer.com/r-for-research-scientists/ (accessed Feb. 06, 2021).


Originally published at ML Workflow for Research Scientists (thehimalayanleo.github.io)

