
In the first part of the story, we used a free Google Colab instance to run a Mistral-7B model and extract information using the FAISS (Facebook AI Similarity Search) database. In the second part of the story, we used the LLaMA-13B model and the LangChain library to build a chat with text summarization and other features. In this part, I will show how to use the HuggingFace 🤗 Text Generation Inference (TGI). TGI is a toolkit that allows us to run a large language model (LLM) as a service. As in the previous parts, we will test it in a Google Colab instance, completely for free.
Text Generation Inference
Text Generation Inference (TGI) is a production-ready toolkit for deploying and serving Large Language Models (LLMs). Running LLM as a service allows us to use it with different clients, from Python notebooks to mobile apps. It is interesting to test the TGI’s functionality, but it turned out that its system requirements are pretty high, and not everything works as smoothly as expected:
- A free Google Colab instance provides only 12.7 GB of RAM, which is often not enough to load a 13B or even a 7B model "in one piece." The AutoModelForCausalLM class from HuggingFace allows us to use "sharded" models that were split into smaller chunks (see the sketch after this list). It works well in Python, but for some reason, this functionality does not work in TGI, and the instance crashes with a "not enough memory" error.
- VRAM size can be a second issue. In my tests with TGI v1.3.4, 8-bit quantization worked well with the bitsandbytes library, but 4-bit quantization (the bitsandbytes-nf4 option) did not. I specifically verified this in Colab Pro on a 40 GB NVIDIA A100 GPU; even with bitsandbytes-nf4 or bitsandbytes-fp4 enabled, the required VRAM size was 16.4 GB, which is too much for a free Colab instance (and even for Colab Pro users, the 40 GB NVIDIA A100 costs 2–4x more than the 16 GB NVIDIA T4).
- TGI needs Rust to be installed. A free Google Colab instance does not have a full-fledged terminal, so a proper installation is also a challenge.
- TGI works as a service and needs to run in the background, which can also be tricky without having a terminal.
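For reference, the sharded loading mentioned in the first point looks roughly like this in plain Python (a minimal sketch; the model ID is only an example, and any sharded checkpoint from the Hub works the same way):
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"  # example of a sharded checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,   # half precision to reduce memory usage
    device_map="auto",           # place shards on GPU/CPU automatically
    low_cpu_mem_usage=True,      # load shard by shard instead of "in one piece"
)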
But, with a bit of tweaking, these issues can be solved, and we can successfully run a 13B model in a free Colab instance. Let’s get into it!
Install
Before running a Text Generation Inference, we need to install it, and the first step is to install Rust:
import locale
locale.getpreferredencoding = lambda: "UTF-8"
!curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
!/root/.cargo/bin/rustup component add rust-src
!cp /root/.cargo/bin/cargo /usr/local/sbin
The command itself is self-explanatory; the tricky part here is to specify the "-y" flag to start the process automatically and to copy the "cargo" binary into /usr/local/sbin so the Colab instance can find it.
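To make sure the toolchain is in place, we can optionally print the installed versions (assuming the default rustup locations):
!/root/.cargo/bin/cargo --version
!/root/.cargo/bin/rustc --version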
After that, we can download and compile the TGI itself. I will be using version 1.3.4, which is the latest at the time of writing this article:
!pip install accelerate autoawq vllm==0.2.2 -U
!wget https://github.com/huggingface/text-generation-inference/archive/refs/tags/v1.3.4.tar.gz
!tar -xf v1.3.4.tar.gz
!source ~/.cargo/env && cd text-generation-inference-1.3.4 && BUILD_EXTENSIONS=False make install
The process is not fast and takes about 10 minutes. At least the Google Colab instance is free, and we are not charged for every minute of access.
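Optionally, we can verify that the launcher binary was built (a quick sanity check; I assume here that make install placed it into ~/.cargo/bin, and the --help output is truncated):
!source ~/.cargo/env && text-generation-launcher --help | head -20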
When the compilation is done, we are ready for a real test! Well, almost. To run text-generation-launcher in the background, we need a terminal, which is not available in a free Google Colab instance but can be easily added using pip:
!pip install colab-xterm
%load_ext colabxterm
%xterm
Now, let’s run the text-generation-launcher command in the terminal window:
source ~/.cargo/env && text-generation-launcher --model-id TheBloke/Llama-2-13B-chat-AWQ --quantize awq --port 5000 --hostname 127.0.0.1 --huggingface-hub-cache=/content/data
As I mentioned before, 4-bit quantization with the bitsandbytes library does not work in v1.3.4 for some reason, so it is important to use an AWQ model and specify --quantize awq as a parameter. If everything was done correctly, we should see something like this:

TGI Test
When the Text Generation Inference server is running, we can try to use it. Because TGI runs as a web service, we can connect to it using different clients. For example, let’s use the Python requests library:
import requests
data = {
    'inputs': 'What is the distance to the Moon?',
    'parameters': {'max_new_tokens': 512}
}
response = requests.post('http://127.0.0.1:5000/generate', json=data)
print(response.json())
#> {'generated_text': "The average distance from the Earth to the Moon is
#> about 384,400 kilometers (238,900 miles). This is called the lunar
#> distance or lunar mean distance."}
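Besides /generate, the TGI server also exposes informational endpoints; as a quick sanity check (a small sketch using the /info route from TGI's REST API), we can ask the server which model is actually loaded:
import requests

# /info returns metadata about the loaded model (model ID, quantization, etc.)
info = requests.get('http://127.0.0.1:5000/info')
print(info.json())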
We can also use an InferenceClient, made by HuggingFace:
from huggingface_hub import InferenceClient
client = InferenceClient(model="http://127.0.0.1:5000")
client.text_generation(prompt="What is the distance to the Moon?",
                       max_new_tokens=512)
With this client, we can use streaming, so new tokens will appear one by one:
from huggingface_hub import InferenceClient
client = InferenceClient(model="http://127.0.0.1:5000")
for token in client.text_generation(prompt="What is the distance to the Moon?",
                                    max_new_tokens=512,
                                    stream=True):
    print(token)
This will provide the answer in the "ChatGPT" style:

For Python asyncio users, the async version is also available:
from huggingface_hub import AsyncInferenceClient
client = AsyncInferenceClient("http://127.0.0.1:5000")
async for token in await client.text_generation(prompt="What is the distance to the Moon?",
                                                stream=True):
    print(token)
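Outside a notebook, where top-level await is not available, the same loop can be wrapped into a coroutine and driven with asyncio.run (a minimal sketch):
import asyncio
from huggingface_hub import AsyncInferenceClient

async def main():
    client = AsyncInferenceClient("http://127.0.0.1:5000")
    async for token in await client.text_generation(prompt="What is the distance to the Moon?",
                                                    stream=True):
        print(token)

asyncio.run(main())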
Text Generation Inference can also be used with the popular LangChain framework, which gives us access to a lot of features, like chat history, text summarization, agents, and so on:
from langchain_community.llms import HuggingFaceTextGenInference
from langchain.prompts import PromptTemplate
from langchain.schema.output_parser import StrOutputParser
from langchain.callbacks.tracers import ConsoleCallbackHandler
llm = HuggingFaceTextGenInference(
    inference_server_url="http://127.0.0.1:5000",
    max_new_tokens=512,
    top_k=10,
    top_p=0.95,
    typical_p=0.95,
    temperature=0.01,
    repetition_penalty=1.03,
    streaming=True,
)
template = """<s>[INST] <<SYS>>
Provide a correct and short answer to the question.
<</SYS>>
{question} [/INST]"""
prompt = PromptTemplate(template=template, input_variables=["question"])
chain = prompt | llm | StrOutputParser()
chain.invoke({"question": "What is the distance to the Moon?"},
             config={
                 # "callbacks": [ConsoleCallbackHandler()]
             })
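As a small taste of the chat-history feature mentioned above, here is a minimal sketch (one possible approach, reusing the same TGI-backed llm object) built on LangChain's ConversationBufferMemory:
from langchain.chains import ConversationChain
from langchain.memory import ConversationBufferMemory

# The memory object keeps previous turns and injects them into each new prompt
conversation = ConversationChain(llm=llm, memory=ConversationBufferMemory())
print(conversation.predict(input="What is the distance to the Moon?"))
print(conversation.predict(input="And how long does light take to travel that distance?"))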
More details about using LangChain in Google Colab can be found in the previous part of this tutorial:
LLMs for Everyone: Running the LLaMA-13B model and LangChain in Google Colab
All LangChain code from that part should work with TGI as well.
As for system resource consumption, a LLaMA-2 13B model fits well within the free Google Colab instance limits:

Conclusion
In this article, we were able to run the Text Generation Inference toolkit from 🤗 in a free Google Colab instance. This toolkit is designed to deploy and serve large language models. It was originally made for high-end hardware, and running it on a budget GPU or in a free Google Colab instance can be tricky. But as we can see, it is doable, and it is great for testing and self-education. Naturally, Google is not a charity, and free Colab instances still have plenty of limitations: the RAM size is limited, the GPU backend may sometimes not be available, and generally speaking, free resources are not guaranteed. But in practice, during all these tests, a free instance was running smoothly, which is a generous step from Google, considering that at the time of this writing, a 16 GB GPU cost about $500.
Last but not least, using Google Colab is great for another reason. Large language models are, by definition, large. Using a "Llama-2-13B-chat-AWQ" model will require about 10 GB of web traffic and the same amount of temporary files on your hard drive. Testing 3–4 different models may easily require 50–100 GB. Google Colab instances are located in data centers with fast internet, and the download speed is much faster compared to what I have at home. When using a local PC, there is also an almost inevitable mess with incompatible CUDA, PyTorch, TensorFlow, and other library versions. In Colab, when the test is done, it is enough to press the "Disconnect and delete runtime" button, and all temporary files will be removed; at the next start, we always have a clean system again.
Those who are interested in using language models and natural language processing are also welcome to read other articles:
- LLMs for Everyone: Running LangChain and a MistralAI 7B Model in Google Colab
- Natural Language Processing For Absolute Beginners
- 16, 8, and 4-bit Floating Point Formats – How Does it Work?
- Python Data Analysis: What Do We Know About Pop Songs?
If you enjoyed this story, feel free to subscribe to Medium, and you will get notifications when my new articles are published, as well as full access to thousands of stories from other authors. You are also welcome to connect via LinkedIn. If you want to get the full source code for this and other posts, feel free to visit my Patreon page.
Thanks for reading.