
Building a Multi-Purpose GenAI Powered Chatbot

Utilize SageMaker Inference Components to work with Multiple LLMs Efficiently

Image from Unsplash

Large Language Models (LLMs) are immensely powerful and can help solve a variety of NLP tasks such as question answering, summarization, entity extraction, and more. As generative AI use-cases continue to expand, real-world applications often need to solve several of these NLP tasks at once. For instance, if you have a chatbot for users to interface with, a common ask is to summarize the conversation with the chatbot. This is useful in many settings such as doctor-patient transcripts, virtual phone calls/appointments, and more.

How can we build something that solves these types of problems? We could use multiple LLMs: one for question answering and another for summarization. An alternative would be to take a single LLM and fine-tune it across the different domains, but we will focus on the former approach for this use-case. Working with multiple LLMs, however, introduces certain challenges that must be addressed.

Hosting even a single model is computationally expensive and requires large GPU instances. With multiple LLMs, the traditional approach requires a persistent endpoint and dedicated hardware for each model, which adds the overhead of managing multiple endpoints and paying for the infrastructure to serve all of them.

With SageMaker Inference Components we can address this issue. Inference Components allow you to host multiple models on a single endpoint. Each model has its own dedicated container, and you can allocate hardware and scale on a per-model basis. This lets us serve both models behind a single endpoint while optimizing cost and performance.

For today’s article we’ll take a look at how we can build a multi-purpose Generative AI powered chatbot that comes with question answering and summarization enabled. Let’s take a quick look at some of the tools we will use here:

  • SageMaker Inference Components: For hosting our models we will be using SageMaker Real-Time Inference. Within Real-Time Inference we will use the Inference Components feature to host multiple models while allocating hardware for each model. If you are new to Inference Components, please refer to my starter article here.
  • Streamlit: Streamlit is an open-source Python library that simplifies web development. With Streamlit we will build our chatbot UI for question answering and summarization. If you are new to Streamlit you can reference Heiko Hotz’s starter article here; we will use parts of it as a template to build our UI.
  • Models:
      – Question Answering Model: For the question answering portion of the chatbot we will use a Llama7B Chat model. Llama7B Chat is optimized specifically for chat-based conversations. The model server/container we will be utilizing is the AWS Large Model Inference (LMI) container powered by DJL Serving. The LMI container allows for model partitioning and other optimizations such as batching and quantization. We will use the existing LMI Llama 7B Chat deployment example to build our model artifacts.
      – Summarization Model: For summarizing the chat conversation, we will be using an open-source fine-tuned HuggingFace Hub model by Karthick Kaliannan Neelamohan (Apache 2.0 License). The base model is BART and it has already been fine-tuned on the popular SAMSUM and DIALOGSUM datasets. If you have your own model and data, you can also fine-tune it yourself.

Now that we understand the different components we’ll be working with, let’s get right into the example!

NOTE: This article assumes an intermediate understanding of Python, LLMs, and Amazon SageMaker Inference. I would suggest following this article for getting started with Amazon SageMaker Inference.

DISCLAIMER: I am a Machine Learning Architect at AWS and my opinions are my own.

Table of Contents

  1. Setup & Endpoint Creation
  2. Inference Components Deployment
     a. Llama7B Chat Inference Component Creation
     b. BART Summarization Inference Component Creation

  3. Streamlit UI Creation & Demo
  4. Additional Resources & Conclusion

1. Setup & Endpoint Creation

For development we will be working in SageMaker Studio on a ml.c5.xlarge instance with a conda_python3 kernel. To work with creating our SageMaker Endpoint and Inference Components we will use the AWS Boto3 Python SDK and the higher level SageMaker Python SDK.

import boto3
import sagemaker
import time
from time import gmtime, strftime

#Setup
client = boto3.client(service_name="sagemaker")
runtime = boto3.client(service_name="sagemaker-runtime")
boto_session = boto3.session.Session()
s3 = boto_session.resource('s3')
region = boto_session.region_name
sagemaker_session = sagemaker.Session()
bucket = sagemaker_session.default_bucket()
role = sagemaker.get_execution_role()
print(f"Role ARN: {role}")
print(f"Region: {region}")

# client setup
s3_client = boto3.client("s3")
sm_client = boto3.client("sagemaker")
smr_client = boto3.client("sagemaker-runtime")

Before creating the Inference Components we first need to create a SageMaker Real-Time Endpoint. Here we specify our instance type and count, and enable managed AutoScaling at the endpoint level.

Note that this is different from enabling AutoScaling at a per model/Inference Component level. There you can apply an AutoScaling policy for each Inference Component to scale the number of model copies, where each copy retains the amount of hardware you allocated to the Inference Component. With managed AutoScaling at the endpoint level you want to ensure that there is enough compute to handle the scaling you enable at the Inference Component level. You can also still define your own AutoScaling policy at the endpoint level, but be careful to avoid conflicts between your Inference Component and endpoint-level policies.
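As a reference, here is a minimal sketch of what an Inference Component-level AutoScaling policy could look like with the Application Auto Scaling API. The component name, capacity limits, and target value below are illustrative assumptions, not values from this example:

import boto3

aas_client = boto3.client("application-autoscaling")

# hypothetical Inference Component name, for illustration only
ic_name = "llama7b-chat-ic-example"

# register the Inference Component's copy count as a scalable target
aas_client.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=f"inference-component/{ic_name}",
    ScalableDimension="sagemaker:inference-component:DesiredCopyCount",
    MinCapacity=1,
    MaxCapacity=4,
)

# target tracking policy on invocations per copy (example target value)
aas_client.put_scaling_policy(
    PolicyName=f"{ic_name}-copy-scaling",
    ServiceNamespace="sagemaker",
    ResourceId=f"inference-component/{ic_name}",
    ScalableDimension="sagemaker:inference-component:DesiredCopyCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerInferenceComponentInvocationsPerCopy"
        },
        "TargetValue": 10.0,
        "ScaleInCooldown": 300,
        "ScaleOutCooldown": 300,
    },
)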

# Container Parameters, increase health check for LLMs: 
variant_name = "AllTraffic"
instance_type = "ml.g5.12xlarge" # 4 GPUs available per instance
model_data_download_timeout_in_seconds = 3600
container_startup_health_check_timeout_in_seconds = 3600

# Setting up managed AutoScaling at endpoint level
initial_instance_count = 1
max_instance_count = 2
print(f"Initial instance count: {initial_instance_count}")
print(f"Max instance count: {max_instance_count}")

We also set container-level timeout parameters to accommodate LLMs and enable Least Outstanding Requests (LOR) routing.

# Endpoint Config Creation
epc_name = "ic-epc-chatbot" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
endpoint_config_response = client.create_endpoint_config(
    EndpointConfigName=epc_name,
    ExecutionRoleArn=role,
    ProductionVariants=[
        {
            "VariantName": variant_name,
            "InstanceType": instance_type,
            "InitialInstanceCount": 1,
            "ModelDataDownloadTimeoutInSeconds": model_data_download_timeout_in_seconds,
            "ContainerStartupHealthCheckTimeoutInSeconds": container_startup_health_check_timeout_in_seconds,
            "ManagedInstanceScaling": {
                "Status": "ENABLED",
                "MinInstanceCount": initial_instance_count,
                "MaxInstanceCount": max_instance_count,
            },
            # can set to least outstanding or random: https://aws.amazon.com/blogs/machine-learning/minimize-real-time-inference-latency-by-using-amazon-sagemaker-routing-strategies/
            "RoutingConfig": {"RoutingStrategy": "LEAST_OUTSTANDING_REQUESTS"},
        }
    ],
)

#Endpoint Creation
endpoint_name = "ic-ep-chatbot" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
create_endpoint_response = client.create_endpoint(
    EndpointName=endpoint_name,
    EndpointConfigName=epc_name,
)
print("Endpoint Arn: " + create_endpoint_response["EndpointArn"])

Once our endpoint has been created we can add the two Inference Components representing our two different models.

2. Inference Components Deployment

An Inference Component represents a single model container. Generally, to create an Inference Component you reference a SageMaker Model object and inherit its model data and container information. Along with this information you specify the compute you are allocating and the copy count of your model. Each copy is allocated the same compute that you originally assigned to the Inference Component.

a. Llama7B Chat Inference Component Creation

For Llama7B Chat we will be using the DJL Serving powered LMI container. With the LMI container you can provide a serving.properties file that defines the model you are working with as well as any optimizations such as batching and quantization.

engine=MPI
option.model_id=TheBloke/Llama-2-7B-Chat-fp16
option.task=text-generation
option.trust_remote_code=true
option.tensor_parallel_degree=1
option.max_rolling_batch_size=32
option.rolling_batch=lmi-dist
option.dtype=fp16

The "model_id" parameter specifies the model that we are working with. In this case, the model weights are pulled from the HuggingFace Model ID specified. If you have a custom fine-tuned model you can specify the S3 path to those model weights. Along with this serving file, you can specify an inference script if you have any custom pre/post processing or your own model loading logic. Once this has been configured you create a model.tar.gz as expected for the SageMaker Model object and specify the container you are using (managed or your own).

%%sh
# package the serving.properties (and optional inference script) into a model tarball
mkdir -p mymodel
rm -f mymodel.tar.gz
mv serving.properties mymodel/
mv model.py mymodel/
tar czvf mymodel.tar.gz mymodel/
rm -rf mymodel

# In a separate Python cell: upload the tarball to S3 (prefix is illustrative) and retrieve the LMI image
code_artifact = sagemaker_session.upload_data("mymodel.tar.gz", bucket, "llama7b-chat")
print(f"Model tarball uploaded to: {code_artifact}")

image_uri = sagemaker.image_uris.retrieve(
        framework="djl-deepspeed",
        region=sagemaker_session.boto_session.region_name,
        version="0.26.0"
    )
print(f"Image being used: {image_uri}")

# create sagemaker model object
from sagemaker.utils import name_from_base
llama_model_name = name_from_base(f"Llama-7b-chat")
print(llama_model_name)

create_model_response = sm_client.create_model(
    ModelName=llama_model_name,
    ExecutionRoleArn=role,
    PrimaryContainer={"Image": image_uri, "ModelDataUrl": code_artifact},
)
model_arn = create_model_response["ModelArn"]

print(f"Created Model: {model_arn}")

For the Inference Component we can inherit this metadata from the SageMaker Model object. Along with this we specify the compute requirements and copy count. In this case for Llama7B Chat we specify a single GPU per copy.

llama7b_ic_name = "llama7b-chat-ic" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
variant_name = "AllTraffic"

# llama inference component creation
create_llama_ic_response = sm_client.create_inference_component(
    InferenceComponentName=llama7b_ic_name,
    EndpointName=endpoint_name,
    VariantName=variant_name,
    Specification={
        "ModelName": llama_model_name,
        "ComputeResourceRequirements": {
            # need just one GPU for llama 7b chat
            "NumberOfAcceleratorDevicesRequired": 1,
            "NumberOfCpuCoresRequired": 1,
            "MinMemoryRequiredInMb": 1024,
        },
    },
    # can setup autoscaling for copies, each copy will retain the hardware you have allocated
    RuntimeConfig={"CopyCount": 1},
)

print("IC Llama Arn: " + create_llama_ic_response["InferenceComponentArn"])

Once the Inference Component has been created we can run a sample inference by specifying the Inference Component name as a header.

import json
content_type = "application/json"
chat = [
  {"role": "user", "content": "Hello, how are you?"},
  {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
  {"role": "user", "content": "I am software engineer looking to learn more about machine learning."},
]

payload = {"chat": chat, "parameters": {"max_tokens":256, "do_sample": True}}
response = smr_client.invoke_endpoint(
    EndpointName=endpoint_name,
    InferenceComponentName=llama7b_ic_name, #specify IC name
    ContentType=content_type,
    Body=json.dumps(payload),
    )
result = json.loads(response['Body'].read().decode())
print(result['content'])

b. BART Summarization Inference Component Creation

The creation of the BART Inference Component will be very similar to the Llama7B Chat component. The main difference is that we are using a different container, so the packaging of the model data and the image_uri will vary. In this case we use the HuggingFace PyTorch inference image and specify the HuggingFace Model ID and the NLP task we are solving for.

from sagemaker.utils import name_from_base

bart_model_name = name_from_base(f"bart-summarization")
print(bart_model_name)

# replace with your region if needed
hf_transformers_image_uri = '763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-inference:1.13.1-transformers4.26.0-cpu-py39-ubuntu20.04'

# env variables
env = {'HF_MODEL_ID': 'knkarthick/MEETING_SUMMARY',
      'HF_TASK':'summarization',
      'SAGEMAKER_CONTAINER_LOG_LEVEL':'20',
      'SAGEMAKER_REGION':'us-east-1'}

create_model_response = sm_client.create_model(
    ModelName=bart_model_name,
    ExecutionRoleArn=role,
    # in this case no model data point directly towards HF Hub
    PrimaryContainer={"Image": hf_transformers_image_uri, 
                      "Environment": env},
)
model_arn = create_model_response["ModelArn"]
print(f"Created Model: {model_arn}")

We once again pass the created SageMaker Model object to the Inference Component and specify the hardware needed for the model.

bart_ic_name = "bart-summarization-ic" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
variant_name = "AllTraffic"

# BART inference component creation
create_bart_ic_response = sm_client.create_inference_component(
    InferenceComponentName=bart_ic_name,
    EndpointName=endpoint_name,
    VariantName=variant_name,
    Specification={
        "ModelName": bart_model_name,
        "ComputeResourceRequirements": {
            # will reserve one GPU
            "NumberOfAcceleratorDevicesRequired": 1,
            "NumberOfCpuCoresRequired": 8,
            "MinMemoryRequiredInMb": 1024,
        },
    },
    # can setup autoscaling for copies, each copy will retain the hardware you have allocated
    RuntimeConfig={"CopyCount": 1},
)

print("IC BART Arn: " + create_bart_ic_response["InferenceComponentArn"])

Once both the Inference Components have been created we can visualize them on the SageMaker Studio UI:

Inference Components (Screenshot by Author)
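We can also verify them programmatically by listing the Inference Components attached to the endpoint, as a quick sketch with the Boto3 SageMaker client shows:

# list the Inference Components attached to our endpoint
list_ic_response = sm_client.list_inference_components(
    EndpointNameEquals=endpoint_name
)
for ic in list_ic_response["InferenceComponents"]:
    print(ic["InferenceComponentName"], ic["InferenceComponentStatus"])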

3. Streamlit UI Creation & Demo

Now that we have our SageMaker Endpoint and Inference Components created, we can stitch this all together on a Streamlit application. We set up environment variables for our Endpoint and Inference Components to reference later for invocation.

import json
import os
import streamlit as st
from streamlit_chat import message
import boto3

smr_client = boto3.client("sagemaker-runtime")
os.environ["endpoint_name"] = "enter endpoint name here"
os.environ["llama_ic_name"] = "enter llama IC name here"
os.environ["bart_ic_name"] = "enter bart IC name here"

We also set up Streamlit Session State variables to retain user inputs, model outputs, and the chat conversation. We create a clear button to empty our conversation; when this button is clicked, we reset the state variables we defined.

# session state variables store user and model inputs
if 'generated' not in st.session_state:
    st.session_state['generated'] = []
if 'past' not in st.session_state:
    st.session_state['past'] = []
if 'chat_history' not in st.session_state:
    st.session_state['chat_history'] = []

# clear button
clear_button = st.sidebar.button("Clear Conversation", key="clear")
# reset everything upon clear
if clear_button:
    st.session_state['generated'] = []
    st.session_state['past'] = []
    st.session_state['chat_history'] = []

We create a submit button to take user input; when it is clicked, we invoke our Llama7B Chat model.

if submit_button and user_input:
    st.session_state['past'].append(user_input)
    model_input = {"role": "user", "content": user_input}
    st.session_state['chat_history'].append(model_input)
    payload = {"chat": st.session_state['chat_history'], "parameters": {"max_tokens": 400, "do_sample": True,
                                                                        "maxOutputTokens": 2000}}
    # invoke llama
    response = smr_client.invoke_endpoint(
        EndpointName=os.environ.get("endpoint_name"),
        InferenceComponentName=os.environ.get("llama_ic_name"), # specify IC name
        ContentType="application/json",
        Body=json.dumps(payload),
    )
    full_output = json.loads(response['Body'].read().decode())
    print(full_output)
    display_output = full_output['content']
    print(display_output)
    st.session_state['chat_history'].append(full_output)
    st.session_state['generated'].append(display_output)
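The conversation stored in session state can then be rendered in the UI, for instance with the message component from streamlit_chat that we imported earlier. Here is a minimal sketch; the container usage and key naming are illustrative:

# render the conversation stored in session state
if st.session_state['generated']:
    with st.container():
        for i in range(len(st.session_state['generated'])):
            message(st.session_state['past'][i], is_user=True, key=str(i) + '_user')
            message(st.session_state['generated'][i], key=str(i))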

If we start our app with the following command, we will see the UI and can watch our question answering chat model at work.

streamlit run app.py
UI (Screenshot by Author)

In the sidebar we also create a summarize button:

summarize_button = st.sidebar.button("Summarize Conversation", key="summarize")
Summarize Button (Screenshot by Author)

Upon summarization we invoke the fine-tuned BART model. Note that for the BART model we want the input to be structured in a format that the model can understand, something similar to the following:

BART Model Sample Input (Screenshot by Author)
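As a concrete illustration, applying the formatting logic from the snippet below to the sample chat we sent to Llama7B Chat earlier would produce text roughly like this:

Ram: Hello, how are you?
AI: I'm doing great. How can I help you today?
Ram: I am a software engineer looking to learn more about machine learning.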

We capture the entire conversation, both user inputs and model outputs, in our "chat_history" state variable, format it for the model, and invoke the BART Inference Component.

# for summarization
if summarize_button:
    st.header("Summary")
    st.write("Generating summary....")
    chat_history = st.session_state['chat_history']
    text = ""
    for resp in chat_history:
        if resp['role'] == "user":
            text += f"Ram: {resp['content']}\n"
        elif resp['role'] == "assistant":
            text += f"AI: {resp['content']}\n"
    summary_payload = {"inputs": text}
    summary_response = smr_client.invoke_endpoint(
        EndpointName=os.environ.get("endpoint_name"),
        InferenceComponentName=os.environ.get("bart_ic_name"), #specify IC name
        ContentType="application/json",
        Body=json.dumps(summary_payload),
    )
    summary_result = json.loads(summary_response['Body'].read().decode())
    summary = summary_result[0]['summary_text']
    st.write(summary)
Summary (Screenshot by Author)

4. Additional Resources & Conclusion

GenAI-Samples/Multi-Purpose-Chatbot at master · RamVegiraju/GenAI-Samples

The code for the entire example can be found at the link above. I hope this article was a good example of how you can utilize multiple LLMs in a cost- and performance-efficient manner for real-world use-cases.

Stay tuned for more GenAI/AWS articles and deep dives into the topics we have discussed above. As always, thank you for reading, and feel free to leave any feedback.


If you enjoyed this article feel free to connect with me on LinkedIn and subscribe to my Medium Newsletter.

