Data Scientists Work in the Cloud. Here’s How to Practice This as a Student (Part 2: Python)

If you want to be a data scientist, it’s not enough to know how to code – you also have to know how to run your code in the cloud.
This was a real problem for me when I was applying for my first Data Science job.
Job descriptions often contained requirements like "AWS" or "GCP," but the coding courses I’d taken focused on how to write correct syntax, and didn’t teach me much about the systems needed to actually run my code in a cloud environment.
In this series, I am offering some advice about how to practice cloud coding for data science. It’s aimed at two types of people:
- Aspiring data scientists – If you’re someone who’s got a bit of data experience (e.g., you’ve used Python in a local Anaconda environment/Jupyter notebook, or you’ve run SQL queries in a sandbox on Udemy or DataCamp), this will help you plug an important gap in your skillset as you prepare for an industry data science job.
- Data scientists who want to get out of local Jupyter notebooks – In the words of Pau Labarta Bajo, "ML models inside Jupyter notebooks have a business value of $0.00." This series will help you get out of notebooks and increase the value you can provide to your company.
Contents
My first article looked at how to practice SQL in the cloud via BigQuery’s free sandbox:
Data Scientists Work in the Cloud. Here’s How to Practice This as a Student (Part 1: SQL)
This article moves onto Python.
We’ll cover two ways to run Python in "real-world" cloud platform environments for free:
- Running Jupyter Notebooks using Google Colab and the BigQuery API
- Running Python scripts and ML models via GitHub Actions
I’ve picked these platforms because they’re free, beginner-friendly, and widely used in industry.
For the avoidance of doubt, I’m not affiliated with either of them – I’m just a fan.
Let’s dive in!
3 ways to run Python code (but only 2 are important)
When I say "run Python code in the cloud," what do I mean by this?
Technically, there are three different ways to run Python code:
- In an interactive shell environment on the command line (useful for quick calculations and checks, but not much else).
- In an IPython environment or Jupyter Notebook, which consists of runnable "cells" and in-line outputs/visualisations.
- As a standalone `.py` script, e.g., `main.py` or `your_script.py`, generally executed on the command line with a command like `python your_script.py` (see the sketch below).
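To make that third option concrete, here’s a minimal, entirely hypothetical `your_script.py` – save it to a file and run `python your_script.py` from the command line:
# your_script.py – a tiny standalone script (hypothetical example)
import sys
def main():
    """Print a greeting plus any command-line arguments passed to the script."""
    print("Hello from a standalone Python script!")
    print(f"Arguments passed: {sys.argv[1:]}")
if __name__ == "__main__":
    main()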

In industry data science, the most important of these three approaches are (1) Jupyter notebooks, and (2) running Python scripts. Let’s look at how to practice them in the cloud.
1. Run Jupyter Notebooks in Google Colab (or IBM Watson Studio, or AWS SageMaker Studio Lab, or…)
Notebooks are a wonderful tool for data scientists.
During my master’s degree, I mostly used local Jupyter notebooks installed/run via Anaconda. Now, as a full-time data scientist, I mostly run my notebooks in the cloud (e.g., in a browser-based environment like GCP’s Vertex AI or AWS’s SageMaker Studio).
Luckily, the big cloud providers all let you use notebooks on their platforms for free:
- IBM: via Watson Studio
- AWS: via SageMaker Studio Lab
- GCP: via Vertex AI or Google Colab
- Azure: via Azure Machine Learning studio
If you’re applying to a company which uses a particular stack, I’d recommend having a go at using notebooks in that particular environment.
If your target company uses AWS, for instance, you could try SageMaker Studio Lab.
For the sake of this tutorial, however, I’ll use Google Colaboratory, "a free Jupyter notebook environment that requires no setup and runs entirely in the cloud." The main reasons I like Colab are (1) unlike some of the others, it doesn’t require a credit card to get set up, and (2) it will follow on nicely from my previous tutorial which focused on BigQuery.
To get set up, you can follow this setup tutorial which includes instructions on how to customise the environment, e.g., enabling GPU acceleration.
When you load up a new notebook in Colab, you’ll see some starter data in the `sample_data` folder. It also comes with lots of pre-installed packages, so you’re able to run statements like `import pandas as pd` without pip-installing pandas. This is common in many cloud platforms – often, your company’s cloud engineers will pre-configure coding environments for the entire company (e.g., to ensure that all notebooks follow security best practices), so lots of packages will be pre-installed.
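For example, you can sanity-check this in a fresh cell – no pip install required:
import pandas as pd
print(pd.__version__)  # pandas comes pre-installed in the Colab runtime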
Once you’re in Colab, you can load data from the file system just as you would in a local notebook. E.g., this is how we’d load the `sample_data/california_housing_train.csv` data set.
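A minimal sketch (plain pandas, nothing Colab-specific):
import pandas as pd
# Read the bundled sample CSV into a DataFrame and peek at the first rows
california_housing = pd.read_csv('sample_data/california_housing_train.csv')
california_housing.head()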

Beyond the data in `sample_data/`, you’ve got a few options for loading external data with which to practice Python (this article provides a nice overview of the 7 different data loaders).
An example: Loading data from BigQuery
Let’s look at an example of how to load data from another cloud source.
If you’re working with SQL data in BigQuery, you can load those data directly into Colab using the BigQuery API via the `%%bigquery` cell magic. Here’s a sample query which follows on from my previous tutorial using the StackOverflow data set:
%%bigquery df
SELECT
  EXTRACT(YEAR FROM creation_date) AS Year,
  COUNT(*) AS Number_of_Questions,
  ROUND(100 * SUM(IF(answer_count > 0, 1, 0)) / COUNT(*), 1) AS Percent_Questions_with_Answers
FROM
  `bigquery-public-data.stackoverflow.posts_questions`
GROUP BY
  Year
HAVING
  Year = 2015
ORDER BY
  Year
If you run that query in a cell in your Colab notebook, the data generated by that query (from `SELECT` onwards) will be saved as a local pandas DataFrame called `df` (a name we specified on line 1), which you can then use in the rest of your notebook.
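If the `%%bigquery` magic isn’t recognised or you hit an authentication error, the usual Colab setup is a couple of extra cells along these lines:
from google.colab import auth
auth.authenticate_user()  # Opens a Google sign-in flow inside Colab
%load_ext google.cloud.bigquery  # Loads the %%bigquery cell magic
You can also pass a billing project explicitly on the magic line, e.g., `%%bigquery df --project your-project-id` (where `your-project-id` is a placeholder for your own Google Cloud project).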
2. Run Python scripts via GitHub Actions
Data scientists are often criticised for being too obsessed with Jupyter notebooks:
ML models inside Jupyter notebooks have a business value of $0.00 (Pau Labarta Bajo)
That’s a little hyperbolic for my liking! (Models can also be used as part of one-off analyses, for which notebooks are perfect.) But Pau’s point is still applicable in many cases: for many Python scripts/models to have value, they need to be deployed and run in the cloud as pipelines which can be called on demand (real-time inference) or regularly run in batches (batch inference).
Let’s look at a simple example of how to run a Python script in the cloud with GitHub Actions.
Training the model (.ipynb) and creating the scoring script (.py)
First, create a new directory and make a new notebook called `train.ipynb`, which is where we’ll train our ML model:
mkdir ml-github-actions
cd ml-github-actions
touch train.ipynb
Since this article focuses on how to run Python code in the cloud (rather than train complex models), we’ll just use a `DummyClassifier` which always predicts the class most frequently observed during training:
import numpy as np
from sklearn.dummy import DummyClassifier
import joblib
X = np.array([-1, 1, 1, 1]) # 4 random records for model training, each with one input feature (which has a value of 1 or -1)
y = np.array([0, 1, 1, 1]) # Target can be either 0 or 1
dummy_clf = DummyClassifier(strategy="most_frequent")
dummy_clf.fit(X, y)
joblib.dump(dummy_clf, 'dummy_clf.joblib')
When you run that, it will create a new file called `dummy_clf.joblib`.
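Because three of the four training labels are 1, this dummy model will simply predict 1 for every row it scores – which is fine here, since the point of the exercise is the pipeline around the model rather than the model itself.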
Next, let’s create a file called `scoring_data.csv` which contains a bunch of (fake) input data for which we want to generate model scores.
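The exact values don’t matter to a `DummyClassifier`; as a rough sketch (with made-up numbers mirroring the single input feature we trained on), you could generate the file like this:
import pandas as pd
# Hypothetical scoring data: one column, matching the single feature used in training
scoring_data = pd.DataFrame({'feature': [1, -1, 1, 1, -1]})
scoring_data.to_csv('scoring_data.csv', index=False)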

Save this file in your `ml-github-actions` directory. Next, we create a Python script called `score.py` which loads the model, loads the data, and generates predictions. We’ll use Python’s `logging` library to log the scores to a file called `scores.log`:
touch score.py
import pandas as pd
import joblib
import logging
def main():
    """
    Loads data and model, generates predictions, and logs
    to scores.log
    """
    # Load data to be scored
    scoring_data = pd.read_csv('scoring_data.csv')
    # Load model
    dummy_clf = joblib.load('dummy_clf.joblib')
    # Generate scores
    scores = dummy_clf.predict(scoring_data)
    # Log scores
    logging.basicConfig(level=logging.INFO)
    logger = logging.getLogger(__name__)
    f_handler = logging.FileHandler('scores.log')
    f_format = logging.Formatter('%(asctime)s - %(levelname)s - %(message)s')
    f_handler.setFormatter(f_format)
    logger.addHandler(f_handler)
    logger.info(f'Scores: {scores}')
if __name__ == '__main__':
    main()
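Before pushing anything to GitHub, it’s worth sanity-checking the script locally: running `python score.py` from inside the `ml-github-actions` directory should create a `scores.log` file containing a timestamped line of predictions.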
Finally, save a record of your package versions to a `requirements.txt` file:
pip freeze > requirements.txt
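Note that `pip freeze` captures everything installed in your current environment. If that pulls in environment-specific packages that won’t install cleanly on a GitHub runner, a hand-written `requirements.txt` listing just `pandas`, `joblib`, and `scikit-learn` (the latter is needed to unpickle the `DummyClassifier`) also works for this example.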
This is what your final directory structure should look like:
ml-github-actions
├── train.ipynb
├── dummy_clf.joblib
├── scoring_data.csv
├── score.py
└── requirements.txt
Upload to GitHub
Next, we’ll jump over to GitHub and create a new private repository (no template needed), then we’ll push our code to GitHub (swap the remote URL below for your own username and repository):
echo "# ml-github-actions" >> README.md
git init
git add .
git commit -m "first commit"
git branch -M main
git remote add origin https://github.com/mattschapman/ml-github-actions.git
git push -u origin main
Create a GitHub Action to run your score.py script at regular intervals
Once your code is on GitHub, you’ll need to create an additional file called `.github/workflows/actions.yml` (this is where we configure our GitHub Action and schedule it to run at regular intervals).
To do this within GitHub, click ‘Add File’ > ‘+ Create new file’, then type `.github/workflows/actions.yml` as the file name and copy the code from below, which will run the Action once every 5 minutes:
name: run score.py
on:
  schedule:
    - cron: "*/5 * * * *" # Run every 5 minutes
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - name: checkout repo content
        uses: actions/checkout@v4 # Checkout the repo content to GitHub runner
      - name: setup python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - name: install python packages
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt
      - name: execute py script
        run: python score.py
      - name: commit files
        run: |
          git config --local user.email "[email protected]"
          git config --local user.name "GitHub Actions"
          git add -A
          git diff-index --quiet HEAD || (git commit -a -m "updated logs" --allow-empty)
      - name: push changes
        uses: ad-m/github-push-action@master
        with:
          github_token: ${{ secrets.GITHUB_TOKEN }}
          branch: ${{ github.ref }}
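One optional tweak while you’re testing: adding a `workflow_dispatch:` trigger under `on:` lets you run the workflow manually from the Actions tab, rather than waiting for the next scheduled run (which GitHub sometimes delays at busy times).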
Finally, go to Settings > Actions > General and change Workflow permissions to Read and write (instead of just Read), tick the box Allow GitHub Actions to create and approve pull requests, and click Save.
Ta-da!
We’ve now set up a cloud-based scoring pipeline which will run your `score.py` script once every 5 minutes and save the predictions to a log file.


What’s next?
We’ve now set up an automated pipeline for running Python code at regular intervals in the cloud.
Pretty awesome, right?
To take this to the next level and make it useful in your real-world scenario, you would need to:
- train a decent model (instead of using a `DummyClassifier`), and
- update the `scoring_data.csv` file regularly, or pull data from a live API, so that the model is scoring new data each time.
Get the code
Thanks for reading. I hope you found this helpful!
You can view the code on GitHub.
One more thing –
Feel free to connect with me on X or LinkedIn, or get my data science/AI writing in your inbox via AI in Five!
Until next time 🙂