
LLMs for Everyone: Running the HuggingFace Text Generation Inference in Google Colab

Experimenting with Large Language Models for free (Part 3)

Image by Markus Spiske, Unsplash

In the first part of this story, we used a free Google Colab instance to run a Mistral-7B model and extract information with the FAISS (Facebook AI Similarity Search) database. In the second part, we used a LLaMA-13B model and the LangChain library to build a chat with text summarization and other features. In this part, I will show how to use the HuggingFace 🤗 Text Generation Inference (TGI) toolkit, which allows us to run a large language model (LLM) as a service. As in the previous parts, we will test it in a Google Colab instance, completely for free.

Text Generation Inference

Text Generation Inference (TGI) is a production-ready toolkit for deploying and serving Large Language Models (LLMs). Running an LLM as a service allows us to use it from different clients, from Python notebooks to mobile apps. TGI's functionality is interesting to test, but it turned out that its system requirements are pretty high, and not everything works as smoothly as expected:

  • A free Google Colab instance provides only 12.7 GB of RAM, which is often not enough to load a 13B or even 7B model "in one piece." The AutoModelForCausalLM class from HuggingFace allows us to use "sharded" models that were split into smaller chunks. This works well in Python (a minimal sketch of this loading pattern is shown after the list), but for some reason, the same functionality does not work in TGI, and the instance crashes with a "not enough memory" error.
  • VRAM size can be a second issue. In my tests with TGI v1.3.4, 8-bit quantization worked well with the bitsandbytes library, but 4-bit quantization (the bitsandbytes-nf4 option) did not. I specifically verified this in Colab Pro on a 40 GB NVIDIA A100 GPU; even with bitsandbytes-nf4 or bitsandbytes-fp4 enabled, the required VRAM size was 16.4 GB, which is too high for a free Colab instance (and even for Colab Pro users, the 40 GB NVIDIA A100 is 2–4x more expensive to use than the 16 GB NVIDIA T4).
  • TGI needs Rust to be installed. A free Google Colab instance does not have a full-fledged terminal, so proper installation is also a challenge.
  • TGI works as a service and needs to run in the background, which can also be tricky without having a terminal.
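
For reference, the "sharded" loading mentioned in the first point looks roughly like this in plain Python (a minimal sketch; the model name and memory settings are illustrative):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative model name; any sharded HuggingFace checkpoint is loaded the same way
model_id = "mistralai/Mistral-7B-Instruct-v0.1"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",          # let accelerate spread the shards over GPU/CPU
    torch_dtype=torch.float16,  # half precision to lower the memory footprint
    low_cpu_mem_usage=True,     # load shard by shard instead of "in one piece"
)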

But, with a bit of tweaking, these issues can be solved, and we can successfully run a 13B model in a free Colab instance. Let’s get into it!

Install

Before running a Text Generation Inference, we need to install it, and the first step is to install Rust:

import locale
locale.getpreferredencoding = lambda: "UTF-8"

!curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
!/root/.cargo/bin/rustup component add rust-src
!cp /root/.cargo/bin/cargo /usr/local/sbin

The commands themselves are self-explanatory; the tricky parts are specifying the "-y" flag so the installation runs non-interactively, and copying the "cargo" binary into /usr/local/sbin so the Colab instance can find it.
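
Optionally, we can verify that the toolchain is now visible from the notebook:

!cargo --version
!/root/.cargo/bin/rustc --version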

After that, we can download and compile TGI itself. I will be using version 1.3.4, which was the latest at the time of writing this article:

!pip install accelerate autoawq vllm==0.2.2 -U
!wget https://github.com/huggingface/text-generation-inference/archive/refs/tags/v1.3.4.tar.gz
!tar -xf v1.3.4.tar.gz
!source ~/.cargo/env && cd text-generation-inference-1.3.4 && BUILD_EXTENSIONS=False make install

The process is not fast and takes about 10 minutes. At least the Google Colab instance is free, so we are not charged for every minute of access.

When the compilation is done, we are ready for a real test. Well, almost: to run text-generation-launcher in the background, we need a terminal, which is not available in a free Google Colab instance but can easily be added using pip:

!pip install colab-xterm
%load_ext colabxterm
%xterm

Now, let’s run the text-generation-launcher command in the terminal window:

source ~/.cargo/env && text-generation-launcher --model-id TheBloke/Llama-2-13B-chat-AWQ --quantize awq --port 5000 --hostname 127.0.0.1 --huggingface-hub-cache=/content/data

As I mentioned before, 4-bit quantization with the bitsandbytes library does not work in v1.3.4 for some reason, so it is important to use an AWQ model and to specify --quantize awq as a parameter. If everything was done correctly, we should see something like this:

Colab-xterm in the notebook, Image by author
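
Before moving on, we can also confirm from a notebook cell that the server is responding; here I assume TGI's default /info endpoint on the port we chose above:

import requests

# Quick sanity check: ask the running TGI server which model it serves
info = requests.get("http://127.0.0.1:5000/info")
print(info.json())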

TGI Test

When the Text Generation Inference server is running, we can try to use it. Because TGI runs as a web service, we can connect to it using different clients. For example, let’s use the Python requests library:

import requests

data = {
    'inputs': 'What is the distance to the Moon?',
    'parameters': {'max_new_tokens': 512}
}

response = requests.post('http://127.0.0.1:5000/generate', json=data)
print(response.json())

#> {'generated_text': "The average distance from the Earth to the Moon is 
#> about 384,400 kilometers (238,900 miles). This is called the lunar 
#> distance or lunar mean distance."}
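
The parameters field accepts the usual generation settings as well; for example (the values below are illustrative, not tuned):

import requests

# The same /generate endpoint with a few more generation parameters
data = {
    'inputs': 'What is the distance to the Moon?',
    'parameters': {
        'max_new_tokens': 512,
        'do_sample': True,
        'temperature': 0.7,
        'top_p': 0.95,
        'repetition_penalty': 1.03,
    }
}

response = requests.post('http://127.0.0.1:5000/generate', json=data)
print(response.json()['generated_text'])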

We can also use the InferenceClient provided by HuggingFace:

from huggingface_hub import InferenceClient

client = InferenceClient(model="http://127.0.0.1:5000")
client.text_generation(prompt="What is the distance to the Moon?",
                       max_new_tokens=512)
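
The client can also return generation metadata, such as the number of generated tokens and the finish reason; a minimal sketch, assuming the details flag of text_generation:

from huggingface_hub import InferenceClient

client = InferenceClient(model="http://127.0.0.1:5000")

# details=True returns the generated text together with generation metadata
output = client.text_generation(prompt="What is the distance to the Moon?",
                                max_new_tokens=512,
                                details=True)
print(output.generated_text)
print(output.details.generated_tokens, output.details.finish_reason)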

With this client, we can use streaming, so new tokens will appear one by one:

from huggingface_hub import InferenceClient

client = InferenceClient(model="http://127.0.0.1:5000")
for token in client.text_generation(prompt="What is the distance to the Moon?",
                                    max_new_tokens=512,
                                    stream=True):
    print(token)

This will provide the answer in the "ChatGPT" style:

Streaming mode, Image by author

For Python asyncio users, the async version is also available:

from huggingface_hub import AsyncInferenceClient

client = AsyncInferenceClient("http://127.0.0.1:5000")
async for token in await client.text_generation(prompt="What is the distance to the Moon?",
                                                stream=True):
    print(token)
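
The snippet above relies on top-level await, which notebooks support out of the box; in a regular Python script, the same logic would be wrapped in an async function and run with asyncio.run (a minimal sketch):

import asyncio
from huggingface_hub import AsyncInferenceClient

async def main():
    client = AsyncInferenceClient("http://127.0.0.1:5000")
    # Stream the generated tokens one by one
    async for token in await client.text_generation(prompt="What is the distance to the Moon?",
                                                    max_new_tokens=512,
                                                    stream=True):
        print(token, end="")

asyncio.run(main())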

Text Generation Inference can also be used with the popular LangChain framework, which gives us access to many features, like chat history, text summarization, agents, and so on:

from langchain_community.llms import HuggingFaceTextGenInference
from langchain.prompts import PromptTemplate
from langchain.schema.output_parser import StrOutputParser
from langchain.callbacks.tracers import ConsoleCallbackHandler

llm = HuggingFaceTextGenInference(
    inference_server_url="http://127.0.0.1:5000",
    max_new_tokens=512,
    top_k=10,
    top_p=0.95,
    typical_p=0.95,
    temperature=0.01,
    repetition_penalty=1.03,
    streaming=True,
)

template = """<s>[INST] <<SYS>>
Provide a correct and short answer to the question.
<</SYS>>
{question} [/INST]"""

prompt = PromptTemplate(template=template, input_variables=["question"])
chain = prompt | llm | StrOutputParser()

chain.invoke({"question": "What is the distance to the Moon?"},
             config={
    # "callbacks": [ConsoleCallbackHandler()]
})
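
Because streaming=True is set, the same chain can also be consumed token by token through LangChain's streaming interface (a minimal sketch that reuses the chain defined above):

# Stream the answer chunk by chunk instead of waiting for the full response
for chunk in chain.stream({"question": "What is the distance to the Moon?"}):
    print(chunk, end="")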

More details about using LangChain in Google Colab can be found in the previous part of this tutorial:

LLMs for Everyone: Running the LLaMA-13B model and LangChain in Google Colab

All LangChain code from that part should work with TGI as well.

As for system resource consumption, a LLaMA-2 13B model fits well within the free Google Colab instance limits:

System resources tab, Image by author

Conclusion

In this article, we were able to run the 🤗 Text Generation Inference toolkit in a free Google Colab instance. This toolkit is designed to deploy and serve large language models. It was originally made for high-end hardware, and running it on a budget GPU or in a free Google Colab instance can be tricky. But as we have seen, it is doable, and it is great for testing and self-education. Naturally, Google is not a charity, and free Colab instances still have plenty of limitations: the RAM size is limited, the GPU backend may sometimes be unavailable, and, generally speaking, free resources are not guaranteed. In practice, though, the free instance ran smoothly during all these tests, which is a generous step from Google considering that, at the time of this writing, a 16 GB GPU cost about $500.

Last but not least, using Google Colab is great for another reason. Large language models are, by definition, large. Using the "Llama-2-13B-chat-AWQ" model requires about 10 GB of web traffic and the same amount of temporary files on your hard drive. Testing 3–4 different models may easily require 50–100 GB. Google Colab instances are located in data centers with fast internet, and the download speed is much higher than what I have at home. When using a local PC, there is also an almost inevitable mess of incompatible CUDA, PyTorch, TensorFlow, and other library versions. In Colab, when the test is done, it is enough to press the "Disconnect and delete runtime" button, and all temporary files will be removed; at the next start, we always have a clean system again.

Those who are interested in language models and natural language processing are also welcome to read my other articles.

If you enjoyed this story, feel free to subscribe to Medium, and you will get notifications when my new articles are published, as well as full access to thousands of stories from other authors. You are also welcome to connect via LinkedIn. If you want to get the full source code for this and other posts, feel free to visit my Patreon page.

Thanks for reading.

