
Training your ML model using Google AI Platform and Custom Environment containers

Complete guide using TensorFlow, the Airflow scheduler and Docker

Photo by Setyaki Irham on Unsplash

Google AI Platform lets you train models in a variety of managed runtime environments, so training a model can be as simple as one command:

gcloud ai-platform jobs submit training ${JOB_NAME} \
    --region $REGION \
    --scale-tier=CUSTOM \
    --job-dir ${BUCKET}/jobs/${JOB_NAME} \
    --module-name trainer.task \
    --package-path trainer \
    --config trainer/config/config_train.json \
    --master-machine-type complex_model_m_gpu \
    --runtime-version 1.15

However, Google's managed runtime versions get deprecated over time, and you might want to use your own custom runtime environment instead. This tutorial explains how to set one up and train a TensorFlow recommendation model on AI Platform Training with a custom container.

My repository can be found here: https://github.com/mshakhomirov/recommendation-trainer-customEnvDocker/

Overview

This tutorial explains how to train a user-item ratings recommendation model with the WALS (weighted alternating least squares) algorithm; a short illustration of the idea follows the list below.

  • This is a very common scenario: users rate content or products, and you need to recommend similar items to them.
  • The code is production grade and will handle a user-ratings matrix of any size.
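
To make the idea concrete: WALS factorises the sparse user-item ratings matrix R into two low-rank factors U and V so that R ≈ U·Vᵀ, alternating between solving for U with V fixed and vice versa. The repo uses TensorFlow's implementation; the snippet below is only a tiny, unweighted NumPy sketch of the alternating-least-squares idea, not the project's training code.

# Illustrative, unweighted ALS on a tiny dense ratings matrix.
# Real WALS additionally down-weights the unobserved (zero) entries.
import numpy as np

def als(R, k=3, reg=0.1, iters=20):
    n_users, n_items = R.shape
    rng = np.random.RandomState(0)
    U = rng.normal(size=(n_users, k))
    V = rng.normal(size=(n_items, k))
    I = np.eye(k)
    for _ in range(iters):
        # Fix V, solve the ridge-regression problem for every user row of U
        U = np.linalg.solve(V.T.dot(V) + reg * I, V.T.dot(R.T)).T
        # Fix U, solve for every item row of V
        V = np.linalg.solve(U.T.dot(U) + reg * I, U.T.dot(R)).T
    return U, V

R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 0, 5, 4]], dtype=float)

U, V = als(R)
print(np.round(U.dot(V.T), 1))  # reconstructed ratings, R is approximately U.dot(V.T)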

This guide covers the following steps:

  • Local environment setup
  • Write a Dockerfile and Create a custom container
  • Run docker image locally
  • Push the image to GCP Container Registry
  • Submit a custom container training job
  • Schedule model training with Airflow

Prerequisites:

  1. GCP developer account
  2. Docker installed
  3. Python 2
  4. Cloud SDK installed
  5. The AI Platform Training & Prediction, Compute Engine and Container Registry APIs enabled (see the command after this list)
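
If you prefer the command line, the three APIs from the last prerequisite can be enabled in one go (standard gcloud service names; run this in your project):

gcloud services enable ml.googleapis.com compute.googleapis.com containerregistry.googleapis.com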

Creating the resources mentioned above will incur costs of around $0.20. Don’t forget to clean up when you’re finished.

Training dataset

Our training data (see the repo) will look like this:

Image by author

It is very similar to the MovieLens ratings dataset but simplified for development purposes. You can apply this schema to almost anything, including Google Analytics page views or any other product- or content-related user activity.
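
For reference, the schema is essentially the classic user/item/rating triple. The column names and values below are illustrative only; the real sample file is data/ratings_small.csv in the repo:

userId,itemId,rating
1,101,5.0
1,102,3.5
2,101,4.0
2,103,2.0
3,102,1.0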

Step 1. After you have installed Docker you need to authenticate it. Use gcloud as the credential helper for Docker:

gcloud auth configure-docker

Step 2. Create your Cloud Storage bucket and set your local environment variable:

export BUCKET_NAME="your_bucket_name"
export REGION=us-central1
gsutil mb -l $REGION gs://$BUCKET_NAME

Hint: Try doing everything in one project in the same region.

Step 3. Clone the repo.

cd Documents/code/
git clone git@github.com:mshakhomirov/recommendation-trainer-customEnvDocker.git
cd recommendation-trainer/wals_ml_engine

Step 4. Write a Dockerfile

The Dockerfile is already included in the repo:
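
Roughly, it looks like the sketch below. Treat it as a hedged outline: the base image, package layout and entry point path are my assumptions, and the Dockerfile in the repo is the authoritative version.

# Sketch only -- see the Dockerfile in the repo for the real thing
FROM tensorflow/tensorflow:1.15.0

# Pin the exact dependency versions the trainer expects
RUN pip install numpy==1.16.6 pandas==0.20.3 scipy==0.19.1 sh

# (the real Dockerfile also installs the Cloud SDK so gsutil is available)

# Make sure gsutil will use the default service account
RUN echo '[GoogleCompute]\nservice_account = default' > /etc/boto.cfg

# Copy the training package into the image and define the entry point
COPY trainer/ /trainer/
WORKDIR /
ENTRYPOINT ["python", "trainer/task.py"]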

This bit is very important; without it your instance won’t be able to save the model to Cloud Storage:

# Make sure gsutil will use the default service account
RUN echo '[GoogleCompute]\nservice_account = default' > /etc/boto.cfg

With this Dockerfile you will build an image with these custom environment dependencies:

TensorFlow==1.15
numpy==1.16.6
pandas==0.20.3
scipy==0.19.1
sh

These dependency versions are the main reason I’m using a custom container.

Google AI Platform’s runtime version 1.15 comes with TensorFlow 1.15 but a different pandas version, which doesn’t work for my use case: pandas must be 0.20.3.

Step 5. Build your Docker image.

export PROJECT_ID=$(gcloud config list project --format "value(core.project)")
export IMAGE_REPO_NAME=recommendation_bespoke_container
export IMAGE_TAG=tf_rec
export IMAGE_URI=gcr.io/$PROJECT_ID/$IMAGE_REPO_NAME:$IMAGE_TAG
docker build -f Dockerfile -t $IMAGE_URI ./

Test it locally:

docker run $IMAGE_URI

Output would be:

task.py: error: argument --job-dir is required

This is expected, because the image will be used as our custom environment whose entry point is:

"trainer/task.py"

For example, once we have pushed our image, we will be able to submit a training job from our local machine like this:

gcloud ai-platform jobs submit training ${JOB_NAME} \
    --region $REGION \
    --scale-tier=CUSTOM \
    --job-dir ${BUCKET}/jobs/${JOB_NAME} \
    --master-image-uri $IMAGE_URI \
    --config trainer/config/config_train.json \
    --master-machine-type complex_model_m_gpu \
    -- \
    ${ARGS}

Here the --master-image-uri parameter replaces --runtime-version. Check mltrain.sh in the repo for more details.

Step 6. Push the image to Container Registry

docker push $IMAGE_URI

Output should be:

The push refers to repository [gcr.io/<your-project>/recommendation_bespoke_container]
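
You can double-check that the image landed in Container Registry with standard gcloud commands:

gcloud container images list --repository=gcr.io/$PROJECT_ID
gcloud container images list-tags gcr.io/$PROJECT_ID/$IMAGE_REPO_NAME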

Step 7. Submit training job

Run the script included in the repo:

./mltrain.sh train_custom gs://$BUCKET_NAME data/ratings_small.csv --data-type user_ratings

Output:

This means that your training job has been successfully submitted using a custom environment. Now if you go to the Google AI Platform console you should see your training job running:
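
You can also follow the job from your terminal with standard gcloud commands (replace JOB_NAME with the name of the job the script just submitted):

gcloud ai-platform jobs describe $JOB_NAME
gcloud ai-platform jobs stream-logs $JOB_NAME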

Image by author

Run model training using Cloud Composer (Airflow)

Now let’s deploy a Cloud Composer environment to orchestrate model training updates.

Step 8. Create a Cloud Composer environment in your project:

export CC_ENV=composer-recserve
gcloud composer environments create $CC_ENV --location europe-west2

Step 9. Get the name of the Cloud Storage bucket created for you by Cloud Composer:

gcloud composer environments describe $CC_ENV \
    --location europe-west2 --format="csv[no-heading](config.dagGcsPrefix)" | sed 's/.\{5\}$//'

In the output, you see the location of the Cloud Storage bucket, like this:

gs://[region-environment_name-random_id-bucket]

In my case it was:

gs://europe-west2-composer-recse-156e7e30-bucket

We will upload plugins here.

Step 10. Set a shell variable that contains the path to that output:

export AIRFLOW_BUCKET="gs://europe-west2-composer-recse-156e7e30-bucket"

Step 11. Upload Airflow plugins

In the airflow/plugins folder there are two files. These plugins serve as helper modules for running our DAG and submitting the training job.

gcloud composer environments storage plugins import \
    --location europe-west2 --environment composer-recserve --source airflow/plugins/
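
For illustration only, a helper operator inside such a plugin might look roughly like the sketch below; the class name, arguments and hard-coded values are invented, so check airflow/plugins in the repo for the real code:

# Hypothetical sketch of a helper operator a plugin like this could expose
import subprocess

from airflow.models import BaseOperator
from airflow.plugins_manager import AirflowPlugin
from airflow.utils.decorators import apply_defaults

class SubmitTrainingJobOperator(BaseOperator):
    """Submits an AI Platform training job that runs the custom container."""

    @apply_defaults
    def __init__(self, job_name, bucket, image_uri, *args, **kwargs):
        super(SubmitTrainingJobOperator, self).__init__(*args, **kwargs)
        self.job_name = job_name
        self.bucket = bucket
        self.image_uri = image_uri

    def execute(self, context):
        # Shell out to gcloud, mirroring the command used in mltrain.sh
        subprocess.check_call([
            'gcloud', 'ai-platform', 'jobs', 'submit', 'training', self.job_name,
            '--region', 'us-central1',
            '--scale-tier=CUSTOM',
            '--master-machine-type', 'complex_model_m_gpu',
            '--master-image-uri', self.image_uri,
            '--job-dir', '%s/jobs/%s' % (self.bucket, self.job_name),
        ])

class TrainingPlugin(AirflowPlugin):
    name = 'training_plugin'
    operators = [SubmitTrainingJobOperator]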

Step 12. Check Cloud Composer permissions

Now go to the GCP Cloud Composer web UI and make sure Composer’s service account has all the permissions required to launch the jobs, e.g. Cloud ML and BigQuery. You can check this in the IAM console.

Also, make sure your Composer environment has these PyPi packages installed:

Image by author

Step 13. Upload your DAG file

Copy the DAG model_training.py file to the dags folder in your Cloud Composer bucket:

gsutil cp airflow/dags/model_training.py ${AIRFLOW_BUCKET}/dags
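
To confirm that Composer picked the DAG up, you can list the DAGs in the environment:

gcloud composer environments run $CC_ENV --location europe-west2 list_dags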

All done. Now we can go to the Airflow web console and check our jobs.

Accessing the Airflow web console

The Airflow web console allows you to manage the configuration and execution of the DAG. For example, using the console you can:

  • Inspect and change the schedule of the DAG execution.
  • Manually execute tasks within the DAG.
  • Inspect task logs.

Step 14. Run this command to get the Airflow console URI:

gcloud composer environments describe $CC_ENV \
    --location europe-west2 --format="csv[no-heading](config.airflow_uri)"

The output shows the URL of the console website, which looks like the following:

https://za4fg484711dd1p-tp.appspot.com

To access the Airflow console for your Cloud Composer instance, go to the URL displayed in the output. There you will see your DAG:

DAG. Image by author

Click the recommendations_model_training DAG and check the logs. If everything is okay you will see some successful activity there, and you’ll notice that your custom-environment training job is in progress.

Let’s imagine that we extract the training data from BigQuery as the first step of our ML pipeline. Go to the model_training.py DAG and uncomment this:

...
...
# t1 = BigQueryToCloudStorageOperator(
#     task_id='bq_export_op',
#     source_project_dataset_table='%s.recommendation_training' % DATASET,
#     destination_cloud_storage_uris=[training_file],
#     export_format='CSV',
#     dag=dag
# )
...
...
...
# t3.set_upstream(t1)

This enables the extract-and-save step from your BigQuery table:

<your-project>.staging.recommendation_training

and the DAG will now look like this:

DAG. Image by author
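
For reference, the overall shape of model_training.py is roughly the following. Treat it as a hypothetical reconstruction: the schedule, the bucket and file paths, the task id of the training step, and the use of a BashOperator instead of the repo's plugin helpers are all my assumptions; the DAG in the repo is the source of truth.

# Hypothetical reconstruction of the DAG's shape -- see airflow/dags/model_training.py
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.contrib.operators.bigquery_to_gcs import BigQueryToCloudStorageOperator

DATASET = 'staging'
BUCKET = 'gs://your_bucket_name'                               # assumption
training_file = BUCKET + '/data/recommendation_training.csv'   # assumption

default_args = {
    'owner': 'airflow',
    'start_date': datetime(2021, 1, 1),
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

dag = DAG('recommendations_model_training',
          default_args=default_args,
          schedule_interval='@weekly')                          # assumption

# t1: export fresh training data from BigQuery to Cloud Storage
t1 = BigQueryToCloudStorageOperator(
    task_id='bq_export_op',
    source_project_dataset_table='%s.recommendation_training' % DATASET,
    destination_cloud_storage_uris=[training_file],
    export_format='CSV',
    dag=dag)

# t3: submit the custom-container training job
# (shown here as a BashOperator for simplicity; the repo uses its plugin helpers)
t3 = BashOperator(
    task_id='ml_training_op',
    bash_command=('./mltrain.sh train_custom %s data/ratings_small.csv '
                  '--data-type user_ratings' % BUCKET),
    dag=dag)

t3.set_upstream(t1)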

That’s it! We’ve set up our ML pipeline orchestration.

Conclusion

We’ve just set up ML pipeline orchestration with Airflow, where training runs in a custom environment built from a Docker image. This matters because you no longer depend on Google runtime versions that might be deprecated, and you can meet any custom runtime requirements set by your data science team. It also makes it much easier to set up a reliable, version-controlled CI/CD ML pipeline.

Apache Airflow is a great orchestration manager for pipelines where each step depends on the successful completion of the previous step. In this tutorial we deployed a Google Cloud Composer environment where our ML pipeline is represented in a directed acyclic graph (DAG) to perform the model data extraction and training steps.
