Data Scientists Work in the Cloud. Here’s How to Practice This as a Student (Part 2: Python)

Because data scientists don't write production code in the Udemy code editor

Image by Luke Chesser on Unsplash

If you want to be a data scientist, it’s not enough to know how to code – you also have to know how to run your code in the cloud.

This was a real problem for me when I was applying for my first Data Science job.

Job descriptions often contained requirements like "AWS" or "GCP," but the coding courses I’d taken focused on how to write correct syntax, and didn’t teach me much about the systems needed to actually run my code in a cloud environment.

In this series, I am offering some advice about how to practice cloud coding for data science. It’s aimed at two types of people:

  1. Aspiring data scientists – If you’re someone who’s got a bit of data experience (e.g., you’ve used Python in a local Anaconda environment/Jupyter notebook, or you’ve run SQL queries in a sandbox on Udemy or DataCamp), this will help you plug an important gap in your skillset as you prepare for an industry data science job.
  2. Data scientists who want to get out of local Jupyter notebooks – In the words of Pau Labarta Bajo, "ML models inside Jupyter notebooks have a business value of $0.00." This series will help you get out of notebooks and increase the value you can provide to your company.

Contents

My first article looked at how to practice SQL in the cloud via BigQuery’s free sandbox:

Data Scientists Work in the Cloud. Here’s How to Practice This as a Student (Part 1: SQL)

This article moves onto Python.

We’ll cover two ways to run Python in "real-world" cloud platform environments for free:

  1. Running Jupyter Notebooks using Google Colab and the BigQuery API
  2. Running Python scripts and ML models via GitHub Actions

I’ve picked these platforms because they’re free, beginner-friendly, and widely used in industry.

For the avoidance of doubt, I’m not affiliated with either of them – I’m just a fan.

Let’s dive in!

3 ways to run Python code (but only 2 are important)

When I say "run Python code in the cloud," what do I mean by this?

Technically, there are three different ways to run Python code:

  1. In an interactive shell environment on the command line (useful for quick calculations and checks, but not much else):
Image by author
  2. In an IPython environment or Jupyter Notebook, which consists of runnable "cells" and in-line outputs/visualisations:
Image by author
  3. As a standalone .py script, e.g., main.py or your_script.py (generally executed on the command line with a command like python your_script.py):
Image by author

In industry data science, the most important of these three approaches are (1) Jupyter notebooks, and (2) running Python scripts. Let’s look at how to practice them in the cloud.

1. Run Jupyter Notebooks in Google Colab (or IBM Watson Studio, or AWS SageMaker Studio Lab, or…)

Notebooks are a wonderful tool for data scientists.

During my master’s degree, I mostly used local Jupyter notebooks installed/run via Anaconda. Now, as a full-time data scientist, I mostly run my notebooks in the cloud (e.g., in a browser-based environment like GCP’s Vertex AI or AWS’s SageMaker Studio).

Luckily, the big cloud providers all let you use notebooks on their platforms for free – for example, Google Colab, AWS SageMaker Studio Lab, and IBM Watson Studio.

If you’re applying to a company which uses a particular stack, I’d recommend having a go at using notebooks in that particular environment.

If your target company uses AWS, for instance, you could try SageMaker Studio Lab.

For the sake of this tutorial, however, I’ll use Google Colaboratory, "a free Jupyter notebook environment that requires no setup and runs entirely in the cloud." The main reasons I like Colab are (1) unlike some of the others, it doesn’t require a credit card to get set up, and (2) it will follow on nicely from my previous tutorial which focused on BigQuery.

To get set up, you can follow this setup tutorial, which includes instructions on how to customise the environment, e.g., by enabling GPUs.

When you load up a new notebook in Colab, you’ll see some starter data in the sample_data folder. It also comes with lots of pre-installed packages, so you can run statements like import pandas as pd without pip-installing pandas first. This is common in many cloud platforms – often, your company’s cloud engineers will pre-configure coding environments for the entire company (e.g., to ensure that all notebooks follow security best practices), so lots of packages will come pre-installed.
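
If you want to check what’s already available, you can run a shell command from a notebook cell (in Colab and Jupyter, prefixing a line with ! sends it to the shell):

# Check whether pandas is pre-installed, and which version
!pip list | grep -i pandas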

Once you’re in Colab, you can load data from the file system just as you would in a local notebook. E.g., this is how we’d load the sample_data/california_housing_train.csv data set:

Image by author
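
In case that screenshot is hard to read, the gist of it is something like this (a minimal sketch – adjust the path and columns to whatever you want to explore):

import pandas as pd

# Load one of Colab's built-in sample data sets into a DataFrame
df = pd.read_csv('sample_data/california_housing_train.csv')
df.head()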

Beyond the data in sample_data/, you’ve got a few options for loading external data with which to practice Python (this article provides a nice overview of the 7 different data loaders).

An example: Loading data from BigQuery

Let’s look at an example of how to load data from another cloud source.

If you’re working with SQL data in BigQuery, you can load those data directly into Colab using the BigQuery API via the %%bigquery cell magic. Here’s a sample query which follows on from my previous tutorial using the StackOverflow data set:

%%bigquery df
SELECT
  EXTRACT(YEAR FROM creation_date) AS Year,
  COUNT(*) AS Number_of_Questions,
  ROUND(100 * SUM(IF(answer_count > 0, 1, 0)) / COUNT(*), 1) AS Percent_Questions_with_Answers
FROM
  `bigquery-public-data.stackoverflow.posts_questions`
GROUP BY
  Year
HAVING
  Year = 2015
ORDER BY
  Year

If you run that query in a cell in your Colab notebook, the data returned by the query (everything from SELECT onwards) will be saved as a pandas DataFrame called df (the name we specified after %%bigquery on the first line of the cell), which you can then use in the rest of your notebook.
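
One caveat: depending on your Colab session, you may first need to authenticate with your Google account (and load the BigQuery magic) before that cell will run. A minimal sketch:

# Authenticate this Colab session with your Google account
from google.colab import auth
auth.authenticate_user()

# Load the BigQuery cell magic if it isn't already available
%load_ext google.cloud.bigquery

If the magic complains about a missing default project, you can also pass one explicitly, e.g., %%bigquery df --project your-project-id (where your-project-id is a placeholder for your own GCP project ID).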

2. Run Python scripts via GitHub Actions

Data scientists are often criticised for being too obsessed with Jupyter notebooks:

ML models inside Jupyter notebooks have a business value of $0.00 (Pau Labarta Bajo)

That’s a little hyperbolic for my liking! (Models can also be used as part of one-off analyses, for which notebooks are perfect.) But Pau’s point is still applicable in many cases: for many Python scripts/models to have value, they need to be deployed and run in the cloud as pipelines which can be called on demand (real-time inference) or regularly run in batches (batch inference).

Let’s look at a simple example of how to run a Python script in the cloud with GitHub Actions.

Training the model (.ipynb) and creating the scoring script (.py)

First, create a new directory and make a new notebook called train.ipynb, which is where we’ll train our ML model:

mkdir ml-github-actions
cd ml-github-actions
touch train.ipynb

Since this article focuses on how to run Python code in the cloud (rather than train complex models), we’ll just use a DummyClassifier which always predicts the class most frequently observed during training:

import numpy as np
from sklearn.dummy import DummyClassifier
import joblib

X = np.array([[-1], [1], [1], [1]]) # 4 records for model training, each with one input feature (with a value of 1 or -1)
y = np.array([0, 1, 1, 1]) # Target can be either 0 or 1

dummy_clf = DummyClassifier(strategy="most_frequent")
dummy_clf.fit(X, y)

joblib.dump(dummy_clf, 'dummy_clf.joblib')

When you run that, it will create a new file called dummy_clf.joblib.
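
As a quick (optional) sanity check, the fitted model should predict the majority class, 1, regardless of the input you give it:

# The dummy model ignores its input and always predicts the most frequent class (1)
dummy_clf.predict(np.array([[-1], [1]]))  # returns array([1, 1])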

Next, let’s create a file called scoring_data.csv which contains a bunch of (fake) input data for which we want to generate model scores:

Image by author. I created the data in Excel, saved it as a CSV, and uploaded it into my ml-github-actions directory.
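
If you’d rather create the file programmatically, a hypothetical equivalent in pandas would be the following (the column name and values are placeholders – the dummy model ignores them anyway):

import pandas as pd

# Hypothetical scoring data: a single feature column, mirroring the
# one-feature training data above. The exact values are arbitrary.
pd.DataFrame({'feature': [1, -1, 1, 1, -1]}).to_csv('scoring_data.csv', index=False)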

Next, we create a Python script called score.py which loads the model, loads the data, and generates predictions. We’ll use Python’s logging library to log the scores to a file called scores.log:

touch score.py

Then add the following code to score.py:
import pandas as pd
import joblib
import logging

def main():
    """
    Loads data and model, generates predictions, and logs
    to scores.log
    """

    # Load data to be scored
    scoring_data = pd.read_csv('scoring_data.csv')

    # Load model
    dummy_clf = joblib.load('dummy_clf.joblib')

    # Generate scores
    scores = dummy_clf.predict(scoring_data)

    # Log scores
    logging.basicConfig(level=logging.INFO)
    logger = logging.getLogger(__name__)
    f_handler = logging.FileHandler('scores.log')
    f_format = logging.Formatter('%(asctime)s - %(levelname)s - %(message)s')
    f_handler.setFormatter(f_format)
    logger.addHandler(f_handler)
    logger.info(f'Scores: {scores}')

if __name__ == '__main__':
    main()
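
Before going any further, it’s worth checking that the script runs locally:

# Run the scoring script, then inspect the log file it writes to
python score.py
cat scores.log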

Finally, save a record of your package versions to a requirements.txt file:

pip freeze > requirements.txt
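
(A caveat: if you run pip freeze from a large environment, you’ll capture far more packages than this project needs. A minimal, hand-written requirements.txt for this example would only need something like the list below, ideally pinned to the versions you actually used.)

pandas
scikit-learn
joblib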

This is what your final directory structure should look like:

ml-github-actions
├── train.ipynb
├── dummy_clf.joblib
├── scoring_data.csv
├── score.py
└── requirements.txt

Upload to GitHub

Next, we’ll jump over to GitHub and create a new private repository (no template needed), then we’ll push our code to GitHub:

echo "# ml-github-actions" >> README.md
git init
git add .
git commit -m "first commit"
git branch -M main
git remote add origin https://github.com/mattschapman/ml-github-actions.git
git push -u origin main

Create a GitHub Action to run your score.py script at regular intervals

Once your code is on GitHub, you’ll need to create an additional file called .github/workflows/actions.yml (this is where we configure our GitHub Action(s) and schedule it to run at regular intervals).

To do this within GitHub, click ‘Add File’ > ‘+ Create new file’, then type .github/workflows/actions.yml in the file name and copy the code from below, which will run the Action once every 5 minutes:

name: run score.py

on:
  schedule:
    - cron: "*/5 * * * *" # Run every 5 minutes

jobs:
  build:
    runs-on: ubuntu-latest
    steps:

      - name: checkout repo content
        uses: actions/checkout@v4 # Checkout the repo content to GitHub runner

      - name: setup python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'

      - name: install python packages
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt

      - name: execute py script
        run: python score.py

      - name: commit files
        run: |
          git config --local user.email "[email protected]"
          git config --local user.name "GitHub Actions"
          git add -A
          git diff-index --quiet HEAD || (git commit -a -m "updated logs" --allow-empty)

      - name: push changes
        uses: ad-m/github-push-action@master
        with:
          github_token: ${{ secrets.GITHUB_TOKEN }}
          branch: ${{ github.ref }}

Finally, go to Settings > Actions > General and change Workflow permissions to Read and write (instead of just Read), tick the box Allow GitHub Actions to create and approve pull requests, and click Save.

Ta-da!

We’ve now set up a cloud-based scoring pipeline which will run your score.py script once every 5 minutes and save the predictions to a log file.

Images by author. Note: runs can be a little irregular (i.e., not always exactly 5 minutes apart), and it may take up to 15 minutes for the first run to kick off.
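
Once a few runs have completed, you can pull the commits that the Action pushes back to the repository and inspect the log locally:

# Fetch the commits created by the GitHub Action, then view the log
git pull origin main
cat scores.log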

What’s next?

We’ve now set up an automated pipeline for running Python code at regular intervals in the cloud.

Pretty awesome, right?

To take this to the next level and make it useful in a real-world scenario, you would need to:

  1. train a decent model (instead of using a DummyClassifier) – see the sketch below, and
  2. update the scoring_data.csv file regularly, or pull data from a live API, so that the model scores new data on each run.
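
For the first of those, any scikit-learn estimator saved with joblib will slot straight into the same score.py script. A hypothetical sketch, with placeholder training data:

import numpy as np
from sklearn.linear_model import LogisticRegression
import joblib

# Placeholder training data: replace with your real features and labels
X = np.array([[-1], [1], [1], [1]])
y = np.array([0, 1, 1, 1])

clf = LogisticRegression().fit(X, y)
joblib.dump(clf, 'dummy_clf.joblib')  # score.py loads this file unchanged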

Get the code

Thanks for reading. I hope you found this helpful!

You can view the code on GitHub.

One more thing –

Feel free to connect with me on X or LinkedIn, or get my data science/AI writing in your inbox via AI in Five!

Until next time 🙂

