
In the first part of the story, we used a free Google Colab instance to run a Mistral-7B model and extract information using the FAISS (Facebook AI Similarity Search) database. In this part, we will go further, and I will show how to run a LLaMA 2 13B model; we will also test some extra LangChain functionality, such as building chat-based applications and using agents. In the same way as in the first part, all used components are based on open-source projects and will work completely for free.
Let’s get into it!
LLaMA.cpp
LLaMA.CPP is a very interesting open-source project, originally designed to run LLaMA models on MacBooks, but its functionality grew far beyond that. First, it is written in plain C/C++ without external dependencies and can run on almost any hardware (CUDA, OpenCL, and Apple Silicon are supported; it can even work on a Raspberry Pi). Second, LLaMA.CPP can be connected with LangChain, which allows us to test a lot of its functionality for free without having an OpenAI key. Last but not least, because LLaMA.CPP works everywhere, it’s a good candidate to run in a free Google Colab instance. As a reminder, Google provides free access to Python notebooks with 12 GB of RAM and 16 GB of VRAM, which can be opened using the Colab Research page. The code is opened in the web browser and runs in the cloud, so everybody can access it, even from a minimalistic budget PC.
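To verify what your particular instance got, you can run a quick check directly in the notebook. This is just an optional sketch; it assumes that nvidia-smi and the psutil package are available, which is normally the case in Colab:
# Optional sanity check of the assigned resources
!nvidia-smi --query-gpu=name,memory.total --format=csv

import psutil
print(f"RAM: {psutil.virtual_memory().total / 1024**3:.1f} GB")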
Before using LLaMA, let’s install the library. The installation itself is easy; we only need to enable the LLAMA_CUBLAS flag before running pip:
!CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip3 install llama-cpp-python
!pip3 install huggingface-hub
!pip3 install sentence-transformers langchain langchain-experimental
!huggingface-cli download TheBloke/Llama-2-7b-Chat-GGUF llama-2-7b-chat.Q4_K_M.gguf --local-dir /content --local-dir-use-symlinks False
For the first test, I will be using a 7B model. Here, I also installed the huggingface-hub library, which allows us to automatically download the "Llama-2-7b-Chat" model in the GGUF format needed for LLaMA.CPP. I also installed the LangChain library, which will be used for further testing.
Now, let’s load the model and test that it works:
from langchain.llms import LlamaCpp
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
n_gpu_layers = 40  # number of model layers to offload to the GPU
n_batch = 512      # batch size for prompt processing
callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])

llm = LlamaCpp(
    model_path="/content/llama-2-7b-chat.Q4_K_M.gguf",
    temperature=0.1,
    n_gpu_layers=n_gpu_layers,
    n_batch=n_batch,
    callback_manager=callback_manager,
    verbose=True,
)
When the model is loaded, we can test it using only one line of code:
llm("What is the distance to the Moon? Write the short answer.")
Here, I also used a StreamingStdOutCallbackHandler, which allows us to have a smooth "streamed" output in the "ChatGPT" style:

As for resources, because of 4-bit quantization, a 7B model fits the Google Colab free limit well:

As we can see, the model needs only about 1.6 GB of RAM and 4.2 GB of VRAM, so in theory, it can run on almost any budget PC. Using Google Colab, we can even run a 13B model completely for free! We only need to change the model name in the "download" command:
!huggingface-cli download TheBloke/Llama-2-13B-chat-GGUF llama-2-13b-chat.Q4_K_M.gguf --local-dir /content --local-dir-use-symlinks False
Naturally, this model requires more resources, but it’s still enough for a free instance:

Our model is ready; let’s see how we can use it in LangChain.
LangChain
LangChain is an open-source Python framework designed for developing applications powered by language models. In theory, it is "cross-platform," and different language models can be used with minimal code changes. But in practice, it is not always clear what needs to be changed, and most of the examples in the official documentation are made for OpenAI, which is not free: the user pays for every API call. Because of that, LLaMA.CPP is a great way to study this library without any extra costs. Let’s get started!
1. LLM Chain
LCEL (LangChain Expression Language) is one of the basic concepts of the LangChain library. It allows us to create a prompt and bind it to a language model:
from langchain.prompts import PromptTemplate
from langchain.schema.output_parser import StrOutputParser
from langchain.callbacks.tracers import ConsoleCallbackHandler
template = """<s>[INST] <<SYS>>
Provide a correct and short answer to the question.
<</SYS>>
{question} [/INST]"""
prompt = PromptTemplate(template=template, input_variables=["question"])
chain = prompt | llm | StrOutputParser()
chain.invoke({"question": "What is the distance to the Moon?"},
config={
# "callbacks": [ConsoleCallbackHandler()]
})
#> Sure! The average distance from Earth to the Moon is about
#> 384,400 kilometers (238,900 miles).
Here, I created a prompt instructing the LLaMA model what to do, connected it to the LLM that was created during the previous step, and added a StrOutputParser instance to clean the output text. The callbacks parameter is optional and allows us to debug the chain; it is useful if we want to see what actual prompts were sent to the model.
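For example, the same call with the callback enabled (it is already imported above) prints the trace, including the prompt that was actually sent to the model:
chain.invoke(
    {"question": "What is the distance to the Moon?"},
    config={"callbacks": [ConsoleCallbackHandler()]},
)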
2. Combining the chains
Using LCEL, we can easily combine two chains. Here, I added a second chain that uses the output of the first one as an input.
template2 = """<s>[INST] <<SYS>>
Use the summary {summary} and give 2 one sentence examples of
practical applications of the subject [/INST]
<</SYS>>
[/INST]
"""
prompt2 = PromptTemplate(
    input_variables=["summary"],
    template=template2,
)
chain2 = {"summary": prompt | llm | StrOutputParser()}
| prompt2 | llm | StrOutputParser()
chain2.invoke({"question": "What is the distance to the Moon?"},
config={
# "callbacks": [ConsoleCallbackHandler()]
})
#> The average distance from Earth to the Moon is approximately 384,400
#> kilometers (238,900 miles), and this information has several practical
#> applications, such as:
#> 1. Planning space missions: Knowing the exact distance between Earth
#> and the Moon is crucial for designing and executing space missions.
#> 2. Navigation and communication: The distance between Earth and the
#> Moon affects the time it takes for radio signals to travel between
#> the two bodies...
If we enable the ConsoleCallbackHandler, we will see that in this example, the language model is called twice:
[llm/start] Exiting Prompt run with output:
"<s>[INST] <<SYS>>nProvide a correct and short answer to the question.n<</SYS>>nWhat is the distance to the Moon? [/INST]"
Exiting LLM run with output:
"The average distance from Earth to the Moon is about 384,400 kilometers (238,900 miles)."
[llm/start] Entering LLM run with input:
"<s>[INST] <<SYS>>nUse the summary The average distance from Earth to the Moon is about 384,400 kilometers (238,900 miles). and give 2 one sentence examples of practical applications of the subject [/INST]n<</SYS>>n[/INST]"
Exiting LLM run with output:
...
The LangChain library does all the needed work for us and makes all LLM calls "under the hood." Things like this are important to keep in mind, especially if we use a paid API instead of a free local model (a 2x increase in the bill can be a bad surprise if we’re not aware of it).
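To make the number of calls visible without reading the full trace, we can attach a tiny custom callback that simply counts the model invocations. This is a minimal sketch of my own (not part of the original chain); it relies on the standard on_llm_start hook of LangChain callback handlers:
from langchain.callbacks.base import BaseCallbackHandler

class LLMCallCounter(BaseCallbackHandler):
    """ Count how many times the underlying LLM is actually called """
    def __init__(self):
        self.calls = 0

    def on_llm_start(self, serialized, prompts, **kwargs):
        self.calls += len(prompts)

counter = LLMCallCounter()
chain2.invoke({"question": "What is the distance to the Moon?"},
              config={"callbacks": [counter]})
print(f"LLM calls: {counter.calls}")
#> LLM calls: 2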
3. Automatic routing
Let’s test a more sophisticated example and use different prompts for different requests. Here, I will use the HuggingFaceEmbeddings class and cosine similarity to determine whether the question is about space or math:
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.utils.math import cosine_similarity
from langchain.schema.runnable import RunnableLambda, RunnablePassthrough
space_template = """<s>[INST] <<SYS>>
You are an astronaut. You are great at answering questions about space.
Provide a short answer to the question, understandable to a small kid.
<</SYS>>
{query} [/INST]"""
math_template = """<s>[INST] <<SYS>>
You are a mathematician. You are great at answering math questions.
Provide a short answer to the question.
<</SYS>>
{query} [/INST]"""
embeddings = HuggingFaceEmbeddings()
prompt_templates = [space_template, math_template]
prompt_embeddings = embeddings.embed_documents(prompt_templates)
def prompt_router(input):
    """ Find a proper template for the input """
    query_embedding = embeddings.embed_query(input["query"])
    similarity = cosine_similarity([query_embedding], prompt_embeddings)[0]
    most_similar = prompt_templates[similarity.argmax()]
    print("Using MATH" if most_similar == math_template else "Using SPACE")
    return PromptTemplate.from_template(most_similar)

chain = (
    {"query": RunnablePassthrough()}
    | RunnableLambda(prompt_router)
    | llm
    | StrOutputParser()
)
The logic here is simple. The HuggingFaceEmbeddings class converts the question into a numerical representation. Then, we use the cosine similarity metric to determine whether the question is closer to "math" or "space".
The output looks like this:
chain.invoke("How far is Mars?", config={
# "callbacks": [ConsoleCallbackHandler()]
})
#> Using SPACE
#> Oh, wow! That's a really cool question! *adjusts spacesuit* Mars is
#> actually quite far from Earth! *grin* It's like, really, really far!
#> *estimates with hands* Let me see... if I hold out my hand like this
#> (gestures), that's how far Mars is from Earth! *smiling* It's about 140
#> million miles away!
As we can see, automatic prompt detection can be useful if the questions are asked by different people, for example, adults and kids.
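To build some intuition about the routing decision itself, we can also print the raw similarity scores that prompt_router works with. A small sketch that reuses the embeddings and prompt_embeddings objects created above:
query = "How far is Mars?"
scores = cosine_similarity([embeddings.embed_query(query)], prompt_embeddings)[0]
# prompt_templates = [space_template, math_template], so the order is the same here
for name, score in zip(["space", "math"], scores):
    print(f"{name}: {score:.3f}")
The template with the higher score is the one that prompt_router selects.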
4. Basic chat
We can also use the LLM with a ChatPromptTemplate class, which allows the user to have a conversation with the model.
from langchain.chains import LLMChain
from langchain.prompts.chat import (
    ChatPromptTemplate,
    HumanMessagePromptTemplate,
    SystemMessagePromptTemplate,
)
from langchain.schema import AIMessage, HumanMessage
from langchain_experimental.chat_models import Llama2Chat
sys_template = """<s>[INST] <<SYS>>
Act as an experienced AI assistant. Write only one sentence answers.
<</SYS>>
[/INST]
"""
chat_prompt = ChatPromptTemplate.from_messages([
    SystemMessagePromptTemplate.from_template(sys_template),
    HumanMessage(content="Hello, how are you doing?"),
    AIMessage(content="I'm doing well, thanks!"),
    HumanMessage(content="May I ask you a question about Moon?"),
    AIMessage(content="Yes, sure."),
    HumanMessagePromptTemplate.from_template("{question}"),
])
model = Llama2Chat(llm=llm)
chain = chat_prompt | model | StrOutputParser()
chain.invoke({"question": "How big is it?"},
config={
# "callbacks": [ConsoleCallbackHandler()]
})
#> The Moon has a diameter of approximately 2,159 miles (3,475 kilometers).
Here, I created a SystemMessagePromptTemplate object with the needed instructions for the model and added a history of the conversation. Again, LangChain will do all the needed work to combine this data into a final prompt. We can enable the ConsoleCallbackHandler and see what is being sent to the model:
[llm/start] Entering LLM run with input:
"System: <s>[INST] <<SYS>>nAct as an experienced AI assistant. Write only one sentence answers.n<</SYS>>n[/INST]nnHuman: Hello, how are you doing?nAI: I'm doing well, thanks!nHuman: May I ask you a question about Moon?nAI: Yes, sure.nHuman: How big is it?"
5. Chat with memory and message summary
The possibility of storing all messages in the text body is useful, but the final prompt can easily become too long. The same is true for humans; we usually cannot remember every phrase from past conversations, but we remember the general idea of what we were talking about. The same idea can be used for an LLM with the help of the ConversationSummaryMemory class.
from langchain.chains import ConversationChain
from langchain.memory import (
    ConversationBufferMemory,
    ConversationSummaryMemory,
    CombinedMemory,
    ChatMessageHistory,
)

conv_memory = ConversationBufferMemory(memory_key="chat_history_lines",
                                       input_key="input")
summary_memory = ConversationSummaryMemory(llm=llm, input_key="input")
memory = CombinedMemory(memories=[conv_memory, summary_memory])
template = """<s>[INST] <<SYS>>
Act as an experienced AI assistant. Write one-sentence answers only.
<</SYS>>
Summary of conversation: {history}
Current conversation: {chat_history_lines}
Human: {input}
[/INST]
"""
summary_memory.save_context({"input": "Hi, how are you"},
                            {"output": "Thanks, I am fine"})
summary_memory.save_context({"input": "May I ask you questions about Moon?"},
                            {"output": "Yes, sure"})
summary_memory.load_memory_variables({})

prompt = PromptTemplate(
    input_variables=["history", "input", "chat_history_lines"],
    template=template,
)
conversation = ConversationChain(llm=llm, verbose=True, memory=memory,
                                 prompt=prompt)
conversation.run("How far is it?")
#> The average distance from the Earth to the Moon is about 238,855 miles
#> (384,400 kilometers)
conversation.run("And what about Mars?")
#> The average distance from Earth to Mars is about 140 million miles
#> (225 million kilometers)
Questions and responses look simple, but a lot of things are going on under the hood. The ConversationSummaryMemory class calls the LLM every time a new "request-response" pair is added. The summary is also updated automatically every time the ConversationChain is called. Practically, for our short dialog, the sequence looks like this:
#> save_context({"input": "Hi, how are you"}, {"output": "Thanks, I am fine"})
The human greets the AI and asks how it is doing. The AI responds that it is fine.
#> save_context({"input": "May I ask you questions about Moon?"}, {"output": "Yes, sure"})
The human greets the AI and asks how it is doing. The AI responds that it is fine. The human asks if they can ask questions about the moon. The AI agrees.
#> conversation.run("How far is it?")
<s>[INST] <<SYS>> Act as an experienced AI assistant. Write one-sentence answers only. <</SYS>> Summary of conversation: The human greets the AI and asks how it is doing. The AI responds that it is fine. The human asks if they can ask questions about the moon. The AI agrees. Current conversation: Human: How far is it? [/INST]
The human greets the AI and asks how it is doing. The AI responds that it is fine. The human asks if they can ask questions about the moon. The AI agrees. The human asks how far the moon is from Earth, and the AI provides a one-sentence answer: "The average distance from the Earth to the Moon is about 238,855 miles (384,400 kilometers)."
#> conversation.run("And what about Mars?")
<s>[INST] <<SYS>> Act as an experienced AI assistant. Write short answers only. <</SYS>> Summary of conversation: The human says "hi" and the AI responds with a brief message indicating it is functioning properly. The human asks if they can ask questions about the moon. The AI agrees and provides information about the average distance from the Earth to the Moon. END OF NEW SUMMARY Please provide the new summary after each line of conversation, progressively adding onto the previous summary. Current conversation: Human: How far is it? AI: Sure thing! I am ready to help answer your questions about the moon. The average distance from the Earth to the Moon is about 238,855 miles (384,400 kilometers). Human: And what about Mars? [/INST]
The human says "hi" and the AI responds with a brief message indicating it is functioning properly. The human asks if they can ask questions about the moon. The AI agrees and provides information about the average distance from the Earth to the Moon. Now, the human wants to know about Mars. Sure thing! Here is the updated summary of our conversation so far:nnHuman: Hi! AI: Hi! I amm functioning properly. Human: Can I ask questions about the moon? AI: Of course! I d be happy to help. The average distance from the Earth to the Moon is about 238,855 miles (384,400 kilometers).nnNow, what would you like to know about Mars?
Here, it’s interesting to notice two things. First, the conversation summary is not always perfect (at least for a 13B model), and outputs can be pretty long. In my tests, the chain sometimes returned an error because the number of tokens exceeded the maximum LLaMA context limit. Second, as we can see in this example, we provided only two questions, but the LLM was executed six times! Again, it does not matter too much for a local LLaMA model, but it can be a surprise during extensive testing with a paid API.
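One possible way to keep the prompt size under control is to replace the summary memory with LangChain's ConversationSummaryBufferMemory, which keeps the most recent messages verbatim and summarizes only the older ones once a token limit is reached. This is my own workaround, not part of the original example, so treat it as a sketch:
from langchain.memory import ConversationSummaryBufferMemory

buffer_memory = ConversationSummaryBufferMemory(
    llm=llm,              # the same LlamaCpp instance is used for summarization
    max_token_limit=256,  # older messages are summarized once this limit is reached
)
conversation2 = ConversationChain(llm=llm, memory=buffer_memory, verbose=True)
conversation2.run("How far is the Moon?")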
6. Agents
Connecting external tools to a language model through an agent is a powerful idea that allows the model to perform more specific tasks. In this example, I will be using the PythonREPLTool class, which allows the model to execute Python code.
As we can see from the GitHub source, PythonREPLTool is just a wrapper that executes Python code using a multiprocessing.Process call. We can easily see how it works:
from langchain_experimental.tools import PythonREPLTool
tool = PythonREPLTool()
tool.run('import math; print(math.sqrt(5))')
#> 2.23606797749979
By the way, at the moment of writing this text, there are no sanity checks inside this class, and it can be unsafe. If the user asks, for example, to delete system files, the tool may execute this command without hesitation.
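If you still want to experiment, one naive mitigation is to filter obviously dangerous commands before passing them to the tool. This is only a sketch of the idea with a hypothetical blocklist, not a substitute for a real sandbox:
UNSAFE_PATTERNS = ("os.system", "os.remove", "shutil.rmtree", "subprocess", "open(")

def guarded_run(code: str) -> str:
    """ Refuse obviously dangerous commands before passing them to the REPL """
    if any(pattern in code for pattern in UNSAFE_PATTERNS):
        return "Refused: the command looks potentially destructive."
    return tool.run(code)

print(guarded_run("print(2 + 2)"))                       #> 4
print(guarded_run("import os; os.remove('/tmp/test')"))  #> Refused: ...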
To use a Python agent, we need only several lines of code:
from langchain_experimental.agents.agent_toolkits import create_python_agent
from langchain.agents.agent_types import AgentType
agent = create_python_agent(llm=llm,
                            tool=tool,
                            verbose=True,
                            agent_type=AgentType.ZERO_SHOT_REACT_DESCRIPTION)
agent.agent.llm_chain.verbose = True
agent.run("What is a square root of 5?")
The output looks like this:
#> You are an agent designed to write and execute python code to answer questions.
#> You have access to a python REPL, which you can use to execute python code.
#> If you get an error, debug your code and try again.
#> Only use the output of your code to answer the question.
#> You might know the answer without running any code, but you should still run the code to get the answer.
#> If it does not seem like you can write code to answer the question, just return "I don't know" as the answer.
#> Python_REPL: A Python shell. Use this to execute python commands.
#> Input should be a valid python command. If you want to see the output
#> of a value, you should print it out with `print(...)`.
#> Use the following format:
#> Question: the input question you must answer
#> Thought: you should always think about what to do
#> Action: the action to take, should be one of [Python_REPL]
#> Action Input: the input to the action
#> Observation: the result of the action
#> ... (this Thought/Action/Action Input/Observation can repeat N times)
#> Thought: I now know the final answer
#> Final Answer: the final answer to the original input question
#> Begin!
#> Question: What is a square root of 5?
#> Thought: Hmm, this one looks easy. I think I can just use the built-in `sqrt()` function.
#> Action: Python_REPL
#> Action Input: `print(sqrt(5))`
#> Observation: NameError("name 'sqrt' is not defined")
#> Thought: Oh dear, it looks like I need to import the math module first.
#> Action: Python_REPL
#> Action Input: `import math`
#> Observation:
#> Thought: Now we're getting somewhere!
#> Action: Python_REPL
#> Action Input: `print(math.sqrt(5))`
#> Observation: 2.23606797749979
#> Finished chain.
#> 2.23606797749979
Here, it’s interesting to notice several things.
- As we can see, this prompt template is pretty complex, especially for an open-source model (the full template can also be printed directly; see the snippet after this list). In practice, the 7B LLaMA model was not able to complete the challenge at all. The 13B model did it after several trials and errors, but the result does not look consistent to me. For example, the "import math" statement was missing in the first call, so the print command had to be executed twice, and only because of that did the code succeed.
- In the same way as in the previous examples, the LLM was called several times, which can cause extra charges when using a paid API.
- There can be serious security concerns if users are allowed to run arbitrary Python code on the system. Last but not least, because of the use of natural language prompts, harmful attacks can also be challenging to detect.
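For reference, the full ReAct prompt template shown above can also be printed without running the agent at all, using the same llm_chain attribute that was used to enable verbose mode (assuming the prompt is a plain PromptTemplate, as in the version used here):
# Show the complete prompt template that the agent sends to the model
print(agent.agent.llm_chain.prompt.template)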
I would be too scared to use a Python agent in production for now, but from a self-education perspective, it is fun to see how it works.
Conclusion
Using large language models can be fun. In this article, we were able to run a LLaMA 2 13B model on a free Google Colab instance and test its functionality using only free components. The open-source LangChain library allows us to build pretty complex things, like a chat with a conversation summary, in only a few lines of code. LLaMA.CPP is also an interesting project that allows us to run not only LLaMA but also other (Alpaca, Vicuna, etc.) language models on different hardware. Last but not least, the ability to run language models and frameworks for free is great for experimenting, prototyping, or self-education.
In the next part, I will show how to run a HuggingFace Text Generation Inference toolkit in Google Colab:
LLMs for Everyone: Running the HuggingFace Text Generation Inference in Google Colab
Those who are interested in using language models and natural language processing are also welcome to read other articles:
- LLMs for Everyone: Running LangChain and a MistralAI 7B Model in Google Colab
- Natural Language Processing For Absolute Beginners
- 16, 8, and 4-bit Floating Point Formats – How Does it Work?
- Python Data Analysis: What Do We Know About Pop Songs?
If you enjoyed this story, feel free to subscribe to Medium, and you will get notifications when my new articles are published, as well as full access to thousands of stories from other authors. You are also welcome to connect via LinkedIn. If you want to get the full source code for this and other posts, feel free to visit my Patreon page.
Thanks for reading.