Google AI Platform lets you train models in a variety of managed runtime environments, so submitting a training job can be as simple as a single command:
gcloud ai-platform jobs submit training ${JOB_NAME} \
    --region $REGION \
    --scale-tier=CUSTOM \
    --job-dir ${BUCKET}/jobs/${JOB_NAME} \
    --module-name trainer.task \
    --package-path trainer \
    --config trainer/config/config_train.json \
    --master-machine-type complex_model_m_gpu \
    --runtime-version 1.15
However, Google's managed runtime versions get deprecated from time to time, and you may need a custom runtime environment of your own. This tutorial explains how to set one up and train a TensorFlow recommendation model on AI Platform Training with a custom container.
My repository can be found here: https://github.com/mshakhomirov/recommendation-trainer-customEnvDocker/
Overview
This tutorial explains how to train a user-item-ratings recommendation model using the WALS (weighted alternating least squares) algorithm.
- This is a very common scenario: users have rated some content or products, and you need to recommend similar products to them.
- This is production-grade example code that can handle a user-ratings matrix of any size.
This guide covers the following steps:
- Local environment setup
- Write a Dockerfile and create a custom container
- Run the Docker image locally
- Push the image to GCP Container Registry
- Submit a custom container training job
- Schedule model training with Airflow
Prerequisites:
- GCP developer account
- Docker installed
- Python 2
- Cloud SDK installed
- AI Platform Training & Prediction, Compute Engine, and Container Registry APIs enabled (a sample command is shown right after this list)
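If the APIs are not enabled yet, you can usually do it with a single command (assuming the standard service names for these three APIs):
gcloud services enable ml.googleapis.com compute.googleapis.com containerregistry.googleapis.com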
Creating the resources mentioned above will cost roughly $0.20. Don't forget to clean up when you're finished.
Training dataset
Our training data (see the repo) will look like this:

It is very similar to the MovieLens ratings dataset, but simplified for development purposes. You can apply this schema to almost anything, including Google Analytics page views or any other product- or content-related user activity.
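If you don't have the repo open, a few illustrative rows might look like the following (the column names here are an assumption based on the MovieLens-style schema; check data/ratings_small.csv in the repo for the exact format):
userId,itemId,rating
1,101,5.0
1,102,3.0
2,101,4.0
2,103,1.0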
Step 1. After you have installed Docker, you need to authenticate it. Use gcloud as the credential helper for Docker:
gcloud auth configure-docker
Step 2. Create your Cloud Storage bucket and set your local environment variable:
export BUCKET_NAME="your_bucket_name"
export REGION=us-central1
gsutil mb -l $REGION gs://$BUCKET_NAME
Hint: Try doing everything in one project in the same region.
Step 3. Clone the repo.
cd Documents/code/
git clone git@github.com:mshakhomirov/recommendation-trainer-customEnvDocker.git
cd recommendation-trainer-customEnvDocker/wals_ml_engine
Step 4. Write a Dockerfile
The Dockerfile is already included in the repo.
This part is very important; without it your training instance won't be able to save the model to Cloud Storage:
# Make sure gsutil will use the default service account
RUN echo '[GoogleCompute]\nservice_account = default' > /etc/boto.cfg
With this Dockerfile you will build an image with the following custom environment dependencies:
TensorFlow==1.15
numpy==1.16.6
pandas==0.20.3
scipy==0.19.1
sh
These pinned dependency versions are the main reason I'm using a custom container: Google AI Platform's runtime version 1.15 ships TensorFlow 1.15 but a different pandas version, which doesn't work for my use case, where pandas must be 0.20.3.
Step 5. Build your Docker image.
export PROJECT_ID=$(gcloud config list project --format "value(core.project)")
export IMAGE_REPO_NAME=recommendation_bespoke_container
export IMAGE_TAG=tf_rec
export IMAGE_URI=gcr.io/$PROJECT_ID/$IMAGE_REPO_NAME:$IMAGE_TAG
docker build -f Dockerfile -t $IMAGE_URI ./
Test it locally:
docker run $IMAGE_URI
Output would be:
task.py: error: argument --job-dir is required
That is expected: this image will be used as our custom environment, and its entry point is trainer/task.py, which requires the --job-dir argument.
For example, once we push the image, we will be able to submit a training job from our local machine with this command:
gcloud ai-platform jobs submit training ${JOB_NAME} \
    --region $REGION \
    --scale-tier=CUSTOM \
    --job-dir ${BUCKET}/jobs/${JOB_NAME} \
    --master-image-uri $IMAGE_URI \
    --config trainer/config/config_train.json \
    --master-machine-type complex_model_m_gpu \
    -- \
    ${ARGS}
Here the --master-image-uri parameter replaces --runtime-version. Check mltrain.sh in the repo for more details.
Step 6. Push the image to Container Registry
docker push $IMAGE_URI
Output should be:
The push refers to repository [gcr.io/<your-project>/recommendation_bespoke_container]
Step 7. Submit training job
Run the script included in the repo:
./mltrain.sh train_custom gs://$BUCKET_NAME data/ratings_small.csv --data-type user_ratings
Output:

This means that your training job has been successfully submitted using a custom environment. Now if you go to Google AI Platform console you should be able to see your training running:

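You can also monitor the job from the command line, for example (replace <your_job_name> with the job name that mltrain.sh generated for you):
gcloud ai-platform jobs describe <your_job_name>
gcloud ai-platform jobs stream-logs <your_job_name>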
Run model training using Cloud Composer (Airflow)
Now let’s deploy a Cloud Composer environment to orchestrate model training updates.
Step 8. Create a Cloud Composer environment in your project:
export CC_ENV=composer-recserve
gcloud composer environments create $CC_ENV --location europe-west2
Step 9. Get the name of the Cloud Storage bucket created for you by Cloud Composer:
gcloud composer environments describe $CC_ENV \
    --location europe-west2 --format="csv[no-heading](config.dagGcsPrefix)" | sed 's/.\{5\}$//'
In the output, you see the location of the Cloud Storage bucket, like this:
gs://[region-environment_name-random_id-bucket]
In my case it was:
gs://europe-west2-composer-recse-156e7e30-bucket
We will upload plugins here.
Step 10. Set a shell variable that contains the path to that output:
export AIRFLOW_BUCKET="gs://europe-west2-composer-recse-156e7e30-bucket"
Step 11. Upload Airflow plugins
In the airflow/plugins folder there are two files. These plugins serve as helper modules that our DAG uses to submit the training job.
gcloud composer environments storage plugins import \
    --location europe-west2 --environment composer-recserve --source airflow/plugins/
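Conceptually, a helper like this can be as simple as a custom operator that shells out to gcloud to submit the training job. The sketch below is purely illustrative: the class name, arguments, and file name are made up, and the actual plugins in the repo are different.
# plugins/submit_training_plugin.py -- illustrative sketch only
import subprocess

from airflow.models import BaseOperator
from airflow.plugins_manager import AirflowPlugin
from airflow.utils.decorators import apply_defaults


class SubmitTrainingJobOperator(BaseOperator):
    """Submits an AI Platform training job that runs our custom container."""

    @apply_defaults
    def __init__(self, job_name, region, image_uri, job_dir,
                 training_args=None, *args, **kwargs):
        super(SubmitTrainingJobOperator, self).__init__(*args, **kwargs)
        self.job_name = job_name
        self.region = region
        self.image_uri = image_uri
        self.job_dir = job_dir
        self.training_args = training_args or []

    def execute(self, context):
        # Build the same gcloud command we used from the local machine.
        cmd = [
            'gcloud', 'ai-platform', 'jobs', 'submit', 'training', self.job_name,
            '--region', self.region,
            '--scale-tier', 'CUSTOM',
            '--master-machine-type', 'complex_model_m_gpu',
            '--master-image-uri', self.image_uri,
            '--job-dir', self.job_dir,
            '--',
        ] + self.training_args
        subprocess.check_call(cmd)


class SubmitTrainingJobPlugin(AirflowPlugin):
    name = 'submit_training_job_plugin'
    operators = [SubmitTrainingJobOperator]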
Step 12. Check Cloud Composer permissions
Now go to the GCP Cloud Composer web UI and make sure Composer's service account has all the permissions required to launch the jobs (Cloud ML, BigQuery, etc.). You can check this in the IAM console.

Also, make sure your Composer environment has these PyPI packages installed:

Step 13. Upload your DAG file
Copy the DAG file model_training.py to the dags folder in your Cloud Composer bucket:
gsutil cp airflow/dags/model_training.py ${AIRFLOW_BUCKET}/dags
All done. Now we can go to Airflow web console and check our jobs.
Accessing the Airflow web console
The Airflow web console allows you to manage the configuration and execution of the DAG. For example, using the console you can:
- Inspect and change the schedule of the DAG execution.
- Manually execute tasks within the DAG.
- Inspect task logs.
Step 14. Run this command to get Airflow console uri:
gcloud composer environments describe $CC_ENV \
    --location europe-west2 --format="csv[no-heading](config.airflow_uri)"
You see the URL for the console website, which looks like the following:
https://za4fg484711dd1p-tp.appspot.com
To access the Airflow console for your Cloud Composer instance, go to the URL from the output. There you will find your DAG:

Click the recommendations_model_training DAG and check the logs. If everything is okay, you will see some successful activity there. You will also notice that your custom-environment training job is in progress.
Let's imagine that we extract training data from BigQuery and that this is the first step of our ML pipeline. Go to the model_training.py DAG and uncomment this:
...
...
# t1 = BigQueryToCloudStorageOperator(
# task_id='bq_export_op',
# source_project_dataset_table='%s.recommendation_training' % DATASET,
# destination_cloud_storage_uris=[training_file],
# export_format='CSV',
# dag=dag
# )
...
...
...
# t3.set_upstream(t1)
This enables the export of your BigQuery table <your-project>.staging.recommendation_training to Cloud Storage, and the DAG will now look like this:

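For reference, here is a heavily simplified sketch of how the export and training tasks in model_training.py might be wired together. The task id and operator arguments for t1 come from the snippet above; everything else (bucket, image URI, job name, and the BashOperator standing in for the repo's plugin-based training helper) is illustrative, not the actual repo code:
# model_training.py -- simplified, illustrative sketch
from datetime import datetime, timedelta

from airflow import DAG
from airflow.contrib.operators.bigquery_to_gcs import BigQueryToCloudStorageOperator
from airflow.operators.bash_operator import BashOperator

DATASET = 'staging'                               # assumption: dataset from the article
BUCKET = 'gs://your_bucket_name'                  # assumption: your training bucket
training_file = BUCKET + '/data/recommendation_training.csv'  # assumption

default_args = {'owner': 'airflow', 'start_date': datetime(2020, 1, 1)}

dag = DAG('recommendations_model_training',
          default_args=default_args,
          schedule_interval=timedelta(days=1))

# t1: export the training table from BigQuery to Cloud Storage
t1 = BigQueryToCloudStorageOperator(
    task_id='bq_export_op',
    source_project_dataset_table='%s.recommendation_training' % DATASET,
    destination_cloud_storage_uris=[training_file],
    export_format='CSV',
    dag=dag,
)

JOB_NAME = 'recommendation_training_{{ ds_nodash }}'   # unique job name per DAG run
IMAGE_URI = 'gcr.io/your-project/recommendation_bespoke_container:tf_rec'  # assumption

# t3: submit the AI Platform training job that runs our custom container
# (the real DAG uses the helper plugins from Step 11; gcloud via BashOperator
# is shown here only to keep the sketch self-contained)
t3 = BashOperator(
    task_id='ml_engine_training_op',   # task id assumed
    bash_command=(
        'gcloud ai-platform jobs submit training ' + JOB_NAME + ' '
        '--region us-central1 '
        '--scale-tier CUSTOM '
        '--master-machine-type complex_model_m_gpu '
        '--master-image-uri ' + IMAGE_URI + ' '
        '--job-dir ' + BUCKET + '/jobs/' + JOB_NAME + ' '
        '-- --data-type user_ratings'
    ),
    dag=dag,
)

# training runs only after the BigQuery export succeeds
t3.set_upstream(t1)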
That’s it! We’ve set up our ML pipeline orchestration.
Conclusion
We've just set up ML pipeline orchestration with Airflow, where training runs in a custom environment defined by a Docker image. This is important because you no longer depend on Google runtime environments that might get deprecated, and you can meet any custom runtime version requirements set by your data science team. It also becomes much easier to set up a reliable CI/CD ML pipeline using version control.
Apache Airflow is a great orchestration manager for pipelines where each step depends on the successful completion of the previous step. In this tutorial we deployed a Google Cloud Composer environment where our ML pipeline is represented in a directed acyclic graph (DAG) to perform the model data extraction and training steps.