
Deploying TFLite model on GCP Serverless

Deploying quantized model in a Serverless fashion

Model deployment is tricky. With the continuously changing landscape of cloud platforms and AI libraries that update almost weekly, backward compatibility and finding the correct deployment method are big challenges. In today's blog post, we will see how to deploy a TFLite model on the Google Cloud Platform in a serverless fashion.

This blog post is structured in the following way:

  • Understanding Serverless and other ways of Deployment
  • What is Quantization and TFLite?
  • Deploying TFLite model using GCP Cloud Run API
Img Src: https://pixabay.com/photos/man-pier-silhouette-sunrise-fog-8091933/

Understanding Serverless and other ways of Deployment

Let's first understand what we mean by serverless, because serverless doesn't mean without a server.

An AI model, or any application for that matter, can be deployed in several different ways, which fall into three major categories.

Serverless: In this case, the model is stored in a cloud container registry and only runs when a user makes a request. When a request comes in, a server instance is automatically launched to fulfill it, and it shuts down again after a period of inactivity. Starting, configuring, scaling, and shutting down are all handled by the Cloud Run API provided by the Google Cloud Platform; AWS Lambda and Azure Functions are the equivalents on other clouds.

Serverless has its own advantages and disadvantages.

  • The biggest advantage is cost saving: if you don't have a large user base, a dedicated server sits idle most of the time and your money goes to waste. Another advantage is that we don't need to think about scaling the infrastructure; depending on the load, the platform automatically replicates instances and handles the traffic.
  • In the disadvantage column, there are three things to consider. First, there is a small payload limit, meaning it cannot be used to run bigger models. Secondly, the server automatically shuts down after roughly 15 minutes of idle time, so when we make a request after a long gap, the first request takes much longer than the consecutive ones; this is called the cold start problem. And lastly, there are no proper GPU-based instances for serverless yet.

Server instances: In this scheme, the server is always up, and you keep paying for it even when no one is requesting your application. For applications with larger user bases, keeping the server up and running is important. Within this strategy, we can deploy our apps in multiple ways; one way is to launch a single server instance that you scale manually every time traffic increases. In practice, these servers are launched with the help of Kubernetes clusters, which define the rules for scaling the infrastructure and handle traffic management for us.

  • The biggest advantage is that we can work with the largest models and applications and get precise control over our resources, from GPU-based instances to regular ones. But managing and scaling these server instances properly is a big task and often requires a lot of fiddling. They can also get very expensive for GPU-based instances, since many AI models require a GPU for fast inference.

Two great resources to understand Kubernetes and Docker:

Docker for dummies… 🐳 🧠💡🚀: Dockerize a hello-world node app with me, in 15 mins (medium.com)

Kubernetes 101: Introduction to Container Orchestration 🎵 🐳

Edge Deployment: When we need the fastest response, or have to operate in places without internet, we go with edge deployment. This deployment type is meant for IoT devices and other small devices that do not have much memory or an internet connection. For instance, if we want AI in a drone, we want the AI module deployed on the drone itself, not on some cloud server.

  • This deployment type can only handle a very small payload due to the device's hardware limitations. There is no cloud cost because everything runs locally on the device. However, making models small enough to fit on an IoT device is quite challenging and requires a completely new set of strategies.

There is far more to deployment strategies than can be covered in one blog post. Here's another good blog giving an overview of the broader MLOps landscape.

MLOps: Managing AI models at Scale

What is Quantization and TFLite?

Quantization is a model compression technique in which we convert the model's weights to lower precision, reducing the size of the model and making inference faster. For example, converting float32 weights to float16 roughly halves the model's storage footprint. Quantization can greatly improve speed and is often used for edge deployment. Deploying a quantized model in a serverless fashion can also be a great cost saver, because quantization makes the AI model small enough to fit within serverless limits.

NOTE: People often think they need GPU instances to serve AI models because they used GPU instances to train them, but that's not true. With CPU instances and a proper deployment strategy, most AI applications can serve a very large user base.

Quantization is just one of many ways to compress a model; there are other methodologies like pruning, weight sharing, etc.

Here’s an article detailing all the model compression techniques:

Deep learning model compression

What is TFLite?

According to the TensorFlow website, "TensorFlow Lite is a set of tools that enables on-device machine learning by helping developers run their models on mobile, embedded, and edge devices."

There are many ways to quantize AI models; the two main categories are post-training quantization and quantization-aware training. In the former, we train our model normally and apply quantization to the weights after training is complete; in the latter, quantization is simulated during training itself. Usually, quantization-aware training performs better than post-training quantization.
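
For reference, quantization-aware training looks roughly like the sketch below, built on the TensorFlow Model Optimization toolkit (tensorflow-model-optimization). This is only a minimal sketch: it assumes you already have a compiled Keras model named model and training arrays train_images and train_labels, none of which are part of this blog's pipeline. The rest of this post uses post-training quantization.

import tensorflow as tf
import tensorflow_model_optimization as tfmot

# Wrap the existing Keras model so fake-quantization ops are inserted
# into the graph; the wrapped model is then fine-tuned as usual.
q_aware_model = tfmot.quantization.keras.quantize_model(model)

q_aware_model.compile(optimizer="adam",
                      loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])

# A short fine-tuning run lets the weights adapt to the quantization noise.
q_aware_model.fit(train_images, train_labels, epochs=2)

# The fine-tuned model is converted with TFLiteConverter, just like the
# post-training example shown later in this post.
converter = tf.lite.TFLiteConverter.from_keras_model(q_aware_model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_qat_model = converter.convert()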

Let's jump directly into the code for quantization. We are using a post-training quantized image segmentation model for this blog. The image below shows the architecture of our AI pipeline.

AI Pipeline architecture (Img Src: Belongs to author)

I’m making the following assumptions here:

  • You already have an image segmentation model saved in .hdf5 or .h5 format.

If not, follow this tutorial from Keras official website: https://keras.io/examples/vision/oxford_pets_image_segmentation/

  • You have a variable called _train_input_img_paths_ storing the paths to all the training images. Once again, you can follow the Keras official example linked in the first assumption.
  • If you have your own custom data loaders, modify the _representative_dataset()_ method accordingly (a path-based alternative is sketched right below).
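
In case you don't have a generator like train_gen from the Keras tutorial, here is a minimal, hypothetical sketch of a path-based representative dataset. It assumes _train_input_img_paths_ holds JPEG/PNG files and that the model expects 128×128 RGB inputs scaled to [0, 1].

import numpy as np
from PIL import Image

def representative_dataset():
    # A few hundred samples are enough; the converter only needs a
    # representative slice of the training data, not the full set.
    for path in train_input_img_paths[:200]:
        img = Image.open(path).convert("RGB").resize((128, 128))
        x = np.array(img, dtype=np.float32) / 255.0
        yield [np.expand_dims(x, axis=0)]

With that out of the way, here is the main post-training quantization snippet, which uses the Keras Sequence (train_gen) instead:
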
import os
import numpy as np
import tensorflow as tf

## Load your tensorflow model
model = tf.keras.models.load_model("your_model.hdf5")

# Representative dataset used by the converter for calibration.
# train_gen, batch_size, and train_input_img_paths come from your data
# pipeline (e.g., the Keras tutorial linked above).
def representative_dataset():
    for j in range(0, len(train_input_img_paths) // batch_size):
        x_train, _ = train_gen.__getitem__(j)
        yield [x_train.astype(np.float32)]

# Convert the model to the TensorFlow Lite format with float16 quantization
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_types = [tf.float16]

tflite_quant_model = converter.convert()

# Save the quantized model to file (create the output directory first)
os.makedirs('post_training_quantization', exist_ok=True)
with open('post_training_quantization/model_quantized_float16.tflite', 'wb') as f:
    f.write(tflite_quant_model)
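
Before building the Docker image, it's worth checking locally that the quantized model loads and produces an output of the expected shape. Below is a minimal sketch; it assumes a sample image named test_image.jpg sits in the working directory and mirrors the preprocessing used later in app.py.

import numpy as np
import tensorflow as tf
from PIL import Image

# Load the quantized model and allocate its tensors
interpreter = tf.lite.Interpreter(
    model_path="post_training_quantization/model_quantized_float16.tflite")
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Preprocess one image the same way the server will: resize,
# convert RGB to BGR, scale to [0, 1], and add a batch dimension
img = Image.open("test_image.jpg").resize((128, 128))
x = np.array(img)[:, :, ::-1] / 255.0
x = np.expand_dims(x, axis=0).astype(input_details[0]['dtype'])

# Run inference and inspect the output shape
interpreter.set_tensor(input_details[0]['index'], x)
interpreter.invoke()
print(interpreter.get_tensor(output_details[0]['index']).shape)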

Now we are ready to deploy our TFLite model in a serverless fashion using Google Cloud Run API.

Deploying TFLite model using GCP Cloud Run API

We need these resources and files to deploy our model and make predictions.

  • Dockerfile
  • app.py
  • client.py
  • requirements.txt
  • quantized model

Let's first understand the deployment flow.

The serverless deployment flow starts with containerizing our application, app.py (we use Docker here), and pushing the Docker image to a container registry (Google Container Registry in our case); the registry takes care of the versioning, availability, and security of our images. We then configure and deploy the image to a serverless platform (the Google Cloud Run API) and let the platform handle the execution and scaling of our service.

The serverless mode of deployment abstracts away infrastructure management and provides automatic scaling, giving us more time to focus on developing and deploying our application code.

Dockerfile

FROM python:3.9-slim

# Set the working directory inside the Docker image
WORKDIR /app

# Copy the requirements.txt file to the working directory
COPY requirements.txt ./requirements.txt

# Install the required Python packages specified in requirements.txt
RUN pip install -r requirements.txt

# Copy the pre-trained model file from your local machine to the Docker image
COPY model_quantized_float16.tflite ./post_training_quantization/model_quantized_float16.tflite

# Copy the entire content of the current directory to the working directory inside the Docker image
COPY . .

# Specify the command to run when the Docker container starts
CMD ["python", "app.py"]

Overall, this Dockerfile sets up the necessary environment and dependencies for running the Flask application (app.py) inside a Docker container. It ensures that the required Python packages and the pre-trained model file are available within the container.

app.py

from flask import Flask, request, jsonify
from PIL import Image
import tensorflow as tf
import numpy as np
import io

app = Flask(__name__)

# Load the pre-trained TensorFlow Lite model
model = tf.lite.Interpreter(model_path="post_training_quantization/model_quantized_float16.tflite")
model.allocate_tensors()

@app.route('/predict', methods=['POST'])
def predict():
    """
    Endpoint for making predictions.
    Expects a POST request with an image file in the 'file' field.
    Returns a JSON response with the predicted result.
    """
    # Read the image file from the request
    data = request.files['file'].read()

    # Open and resize the image using Pillow (PIL)
    image = Image.open(io.BytesIO(data)).resize((128, 128))

    # Convert the image to a NumPy array
    image = np.array(image)  # RGB

    # Convert RGB to BGR (required by the model)
    image = image[:, :, ::-1]

    # Normalize the image by dividing by 255.0
    image = image / 255.0

    # Get input and output details of the TensorFlow Lite model
    input_details = model.get_input_details()
    output_details = model.get_output_details()

    # Expand dimensions of the image to match the input shape of the model
    image = np.expand_dims(image, axis=0).astype(input_details[0]['dtype'])

    # Set the input tensor of the model
    model.set_tensor(input_details[0]['index'], image)

    # Run the model inference
    model.invoke()

    # Get the output tensor of the model
    output_data = model.get_tensor(output_details[0]['index'])

    # Convert the output from a NumPy array to a Python list
    output_data_list = output_data.tolist()

    # Return the predicted result as a JSON response
    return jsonify({"result": output_data_list})

if __name__ == '__main__':
    # Run the Flask app on the specified host and port.
    # Cloud Run expects the container to listen on port 8080 by default.
    app.run(host='0.0.0.0', port=8080)

client.py

import requests
import numpy as np
import matplotlib.pyplot as plt
import json
import time

# Use the URL of your deployed application
url = 'put_your_http_url_which_youll_get_after_successfull_deployment/predict'

# Open your image file in binary mode
with open('test_image.jpg', 'rb') as img_file:
    file_dict = {'file': img_file}

    start_time = time.time()  # Start measuring the time

    # Make a POST request to the server
    response = requests.post(url, files=file_dict)

    end_time = time.time()  # Stop measuring the time

# The response will contain the segmented image data and shape
response_dict = json.loads(response.text)

# Convert result list to numpy array. Adjust dtype according to your model's output.
segmented_image_array = np.array(response_dict['result'], dtype=np.float16)

elapsed_time = end_time - start_time
print(f"Request completed in {elapsed_time:.2f} seconds")

# Plot the image using matplotlib
plt.imshow(segmented_image_array.squeeze(), cmap='gray')  # use squeeze to remove single-dimensional entries from the shape of an array.
plt.show()
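
If you want a hard mask rather than the raw model output, a simple threshold on the returned array is enough. Continuing from client.py above, here is a minimal sketch that assumes the model returns per-pixel probabilities in [0, 1] for a binary segmentation:

# Threshold the predicted probabilities into a binary mask
mask = (segmented_image_array.squeeze() > 0.5).astype(np.uint8)

plt.imshow(mask, cmap='gray')
plt.title('Thresholded segmentation mask')
plt.show()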

Note: When I trained my image segmentation model, I used BGR format (the default mode of OpenCV); if you trained with RGB, remove the RGB-to-BGR conversion line (image = image[:, :, ::-1]) from app.py.

Also, set the url variable in client.py to your own endpoint URL, which you will get after successfully deploying to the Google Cloud Run API.

And lastly, use the same version of Python in your Dockerfile and your local environment to avoid breaking anything during deployment.
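
A quick way to check exactly which versions you are running locally, so you can mirror them in requirements.txt and the Dockerfile (this assumes the packages are already installed in your local environment):

import sys
import flask
import PIL
import tensorflow as tf

# Print the local versions to pin in requirements.txt / the Dockerfile
print("Python     :", sys.version)
print("TensorFlow :", tf.__version__)
print("Flask      :", flask.__version__)
print("Pillow     :", PIL.__version__)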

requirements.txt

flask==2.0.1
jinja2==3.0.1
tensorflow==2.10.1
Pillow

Quantized model

And lastly, we need to keep model_quantized_float16.tflite in the same folder as app.py, since the Dockerfile copies the quantized model from there into the Docker image.

This is how my directory looks after collecting all the resources:

Img Src: Belongs to author

Setting Up Serverless

  1. The first step is to install the gcloud CLI (Command Line Interface); I used Windows, and it's pretty straightforward: https://cloud.google.com/sdk/docs/install
  2. Navigate to your folder using the standard CLI command
cd path_to_folder
  3. Log in to the gcloud CLI
gcloud auth login

This will open a window in your browser and ask for a few permissions; allow them.

  4. Set up a project in GCP; it's easier to use the GUI for this.

Here’s the link to create a GCP project:

Create a Google Cloud project | Google Workspace | Google for Developers

GCP project Dashboard (Img Src: Belongs to author)
  5. Set the project ID in the gcloud CLI; you can find your project ID on the dashboard.
gcloud config set project PROJECT_ID
  6. Build the container image. Replace <PROJECT_ID> with your own project ID everywhere it appears.
docker build -t gcr.io/<PROJECT_ID>/tflite-app .
Building container (Img Src: Belongs to author)
  7. Push the Docker image to the Google Container Registry
docker push gcr.io/<PROJECT_ID>/tflite-app
Google Container registry (Img Src: Belongs to author)
  8. Deploy the service with the Cloud Run API through the gcloud CLI. This will ask you to choose a server region and confirm a few prompts; allow them.
gcloud run deploy tflite-service --image gcr.io/<PROJECT_ID>/tflite-app --platform managed
Model Deployed (Img Src: Belongs to author)

If everything is successful, you will see a service URL in your gcloud CLI, which you need to paste into client.py. Otherwise, go to the logs and try to fix the errors.

Cloud Run API console (Img Src: Belongs to author)

Key things to Note here:

It's almost guaranteed that something or other will break during this deployment; the biggest reason is a version mismatch between packages.

Use the exact same versions in your requirements.txt and Dockerfile as you used to train and quantize the model. Remember that GCP often lags behind the latest TensorFlow and Python releases, so it's safer to use slightly older versions.

I trained my model on Python 3.8.15; the other versions are given in requirements.txt. The errors in the logs are often unclear, so always use the exact same versions; if you can't find the required versions on GCP, change the versions in your local environment instead.

The next biggest reason for a failed deployment is that you haven't enabled the required APIs or you don't have the required permissions and IAM roles. If you are using GCP for the first time, it's easier to use an account with the Owner role and all permissions enabled.

Making predictions

Just run client.py in your gcloud CLI or your standard command prompt.

Here’s what my output looks like:

Model prediction (Img Src: Belongs to author)

I trained a binary image segmentation model on some private data. Due to privacy, I can reveal neither the details of my model nor the data, but everything described here should work with any image segmentation model.

Versioning

And lastly, if you need more resources from the start or want to minimize the cold start problem, you can create a new revision of the same service with just a few additional steps.

Go to the Google Cloud console > search for Cloud Run > select the deployed service > click the 'Edit & deploy new revision' button. You will get the following options; choose according to your needs (for instance, setting a minimum number of instances keeps a container warm and reduces cold starts), save them, and a new revision of your service will automatically handle the next requests.

Solving cold start problem (Img Src: Belongs to author)

Conclusion

  • Choosing the right strategy for deployment is crucial for cost saving.
  • We can make our models faster and smaller using quantization techniques.
  • Serverless deployment with a quantized model is a great strategy and can easily serve many requests without costly GPU instances.
  • Serverless takes away the hassle of scaling.

Thanks for your time and patience, and happy learning ❤. Follow me for more such awesome content.

Here’s my reading list for MLOps, discussing several other key concepts and strategies:

MLOps

