
On a Time Crunch but Still Want to Learn to Develop Multi-Agent AI?

These 3 starter projects only take a weekend (and a few cups of coffee, maybe)

Photo by Kaboompics.com from Pexels.

Talk of the town: AI agents will replace SaaS.

This wasn’t an offhand remark from some hype peddler. It was Satya Nadella, the CEO of Microsoft, who said it.

If you’re a developer or anyone in tech, you’d want to know how to build AI agents yourself. But not many of us can afford the time to learn them from scratch.

I’m writing this for you if you find yourself in this group.

This is also how I learned to develop AI apps, just more informally. So I organized it into a post you can follow along with and use your weekend to learn the basics of developing AI-powered apps.

I started this post with a reference to AI agents. I will get to that.

But before that, we should get our feet wet with LLMs. Don’t worry; this isn’t a very theoretical post. We will first build an app that uses an LLM only. Then, we will build a retrieval-augmented generation (RAG) app. Lastly, we will build a real agentic app.

You may have some knowledge of Gen AI tools. So feel free to skip to your desired project.


What you need …

As I mentioned, this won’t take more than a weekend to finish. But besides time, you need a few other things.

First, I assume you’re comfortable programming in Python and have experience developing something. (Your pet projects and homework count, too.)

We will be using Python frameworks like Streamlit, LangChain, and Phidata.

You need an active internet connection because we’ll use OpenAI models through their APIs. Oh, this also means you need an active OpenAI API key.

Your PC should be compatible with and capable of running vector stores like Chroma. Most consumer-grade laptops can run them without issue.

You can begin by installing the required libraries using the following command.

pip install -qU streamlit python-dotenv langchain-community langchain-openai langchain-chroma phidata beautifulsoup4 yfinance duckduckgo-search newspaper4k tavily-python

Also, create a .env file in your working directory and update it with your OpenAI API key.

OPENAI_API_KEY=sk-proj-XXXXX

Ensure you always start your Python scripts by reading the .env file with python-dotenv.

from dotenv import load_dotenv

load_dotenv()

# The rest of your code

You’re all set to start building.

1 Research paper summarizer

This is the easiest yet the most enjoyable tool I created to learn LLMs.

I don’t use this app directly, but it gave me a solid understanding of LLMs, prompts, and the LangChain ecosystem.

This app summarizes a research paper into a few easily digestible lines.

from langchain_community.document_loaders import WebBaseLoader
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

# Create an instance of LLM
llm = ChatOpenAI(model="gpt-4o-mini")

# Load document
loader = WebBaseLoader("https://arxiv.org/abs/2303.16634")
docs = loader.load()

# Prompt template 
prompt_template = """Write a concise summary of the following:
"{context}"
CONCISE SUMMARY:"""

# Define prompt
prompt = ChatPromptTemplate.from_template(prompt_template)

# LLM Chain
chain = prompt | llm

# Invoke chain
result = chain.invoke({"context": docs})
print(result)

>> content='The paper titled "G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment" presents G-Eval, a new framework for evaluating the quality of text generated by natural language generation (NLG) systems. The authors, Yang Liu et al., argue that traditional evaluation metrics like BLEU and ROUGE do not correlate well with human judgments, particularly in creative contexts. Their approach utilizes large language models (LLMs) with chain-of-thought (CoT) reasoning and a form-filling method, showing significant improvements in evaluation accuracy over previous methods. G-Eval achieves a Spearman correlation of 0.514 with human evaluations in summarization tasks, indicating better alignment with human judgment. The study also discusses potential biases in LLM-based evaluations. The paper was submitted to arXiv on March 29, 2023, and last revised on May 23, 2023.' additional_kwargs={'refusal': None} response_metadata={'token_usage': {'completion_tokens': 184, 'prompt_tokens': 1991, 'total_tokens': 2175, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_name': 'gpt-4o-mini-2024-07-18', 'system_fingerprint': 'fp_72ed7ab54c', 'finish_reason': 'stop', 'logprobs': None} id='run-46e4511c-e109-41ee-8e37-f7ef7f57b267-0' usage_metadata={'input_tokens': 1991, 'output_tokens': 184, 'total_tokens': 2175, 'input_token_details': {'audio': 0, 'cache_read': 0}, 'output_token_details': {'audio': 0, 'reasoning': 0}}

This is a super simple application. However, it teaches us much about the LangChain ecosystem and how to use it to build LLM-powered apps.

In this app, we read the content of a web page (a research paper about G-Eval, to be precise) and summarize it with the help of GPT-4o-mini. For this model to work, we should have set the OpenAI API key as an environment variable, which we’ve done in the last section.

We should pay attention to the prompt template and the chain. A prompt template is a prompt with placeholders for variables we insert dynamically. Anything inside curly braces is treated as a variable name. We supply values for these variables at runtime to turn the template into an actual prompt.

In our case, we’ve named the variable {context}.

Next, we create something called a chain. A chain glues together the components of an execution pipeline: the input variables (here, the context), the prompt, the LLM, and more components we’ll add later. We can then call (or, more precisely, invoke) the chain with the input variables.
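
To see what the template produces at runtime, you can format it yourself before sending anything to the model. Here’s a quick, optional check that reuses the prompt object defined above (the context string is just a stand-in):

# Optional: inspect the formatted prompt before it reaches the LLM.
# {context} is replaced by whatever value we pass at invoke time.
formatted = prompt.invoke({"context": "Lightning is a giant electrostatic discharge ..."})
print(formatted.to_string())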

I. Answering questions (not just summarizing)

The last example used only a single input variable. Now let’s modify the prompt template to accept two variables.

This time, we pass in both the context and a question, so the user can not only summarize the context but also ask any question about it.

Let’s look at an example:

...

# Prompt template 
prompt_template = """Answer the user's question based on the following context:
context: {context}
question: {question}
"""

# Define prompt
prompt = ChatPromptTemplate.from_template(prompt_template)

# LLM Chain
chain = prompt | llm

# Invoke chain
result = chain.invoke({"context": docs, "question": "What was the GPT version used in this study?"})
print(result)

>> content='The study used GPT-4 as the backbone model in the G-Eval framework for NLG evaluation.' additional_kwargs={'refusal': None} response_metadata={'token_usage': {'completion_tokens': 22, 'prompt_tokens': 2001, 'total_tokens': 2023, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_name': 'gpt-4o-mini-2024-07-18', 'system_fingerprint': 'fp_72ed7ab54c', 'finish_reason': 'stop', 'logprobs': None} id='run-43a7a03e-1699-4036-8006-87f42e8f38f2-0' usage_metadata={'input_tokens': 2001, 'output_tokens': 22, 'total_tokens': 2023, 'input_token_details': {'audio': 0, 'cache_read': 0}, 'output_token_details': {'audio': 0, 'reasoning': 0}}

In the new version, the prompt answers whatever the user asks based on the context. Note that the invoke method now receives both input variables as a dictionary.

II. Complex chains

If you look carefully, the output of the invoke method is an object. What if we want a nice text-only output that the rest of our app can use?

In the following example, I’ve shown how to format the output with LangChain’s built-in output parser. I’ve also added a lambda function to show that we can run any function inside a chain.

...

from langchain_core.output_parsers import StrOutputParser
parser = StrOutputParser()

...

chain = prompt | llm | parser | (lambda x: x.split())

...

>> ['The', 'study', 'used', 'GPT-4', 'as', 'the', 'backbone', 'model', 'for', 'the', 'G-Eval', 'framework.']

This chain isn’t complex yet. But hold on.

2 Q&A app over private data

The previous app was helpful, but only to a certain extent.

The app’s main limitation is the size of the input: a single web page, not the entire Harry Potter series.

The latest models support large context windows. For instance, Google’s Gemini 1.5 can handle up to 1 million tokens. In the English language, this translates to roughly 750,000 words.

That’s enough to fit an academic textbook comfortably. But there are a few other problems to consider.

  1. With a larger context, the LLM tends to lose focus. With too much noise in the context, it’s hard for the model to find the answer to a specific question, and the result is often a hallucination.
  2. You pay for input tokens as well. LLM API providers like OpenAI charge for both input and output tokens, so a larger context means a bigger bill. Not to mention the cost of internet bandwidth.
  3. A larger context means a slower response. This isn’t rocket science: if you give the LLM too much information to traverse, you should expect a slower response. If you use an API provider, network latency delays the response further.

For these reasons, it’s wise to provide the LLM with only the relevant information. For instance, if the user asks a question about lightning, instead of passing an entire physics textbook as context, filter the sections that mention lightning (or related terms) and provide only those.

This approach is known as retrieval augmented generation, or RAG for short.


Most RAG apps follow this approach.

  1. Your data is chunked into smaller segments, ideally limited to one idea per chunk.
  2. Each chunk is converted to a vector embedding and stored in a vector store, like Chroma.
  3. The user’s question is also converted to a vector embedding during retrieval.
  4. The vector store will compute the similarity between the question’s vector version and the existing vectors of the chunks in the database.
  5. The top n chunks with the highest similarity will be the context to answer your question in the RAG chain.

Here’s how this all happens in code.

# This is to securely load our secrets
from dotenv import load_dotenv
load_dotenv()

# 1. Load the content
# -----------------------------------------
import bs4
from langchain_community.document_loaders import WebBaseLoader

loader = WebBaseLoader(
    web_paths=("https://docs.djangoproject.com/en/5.0/topics/performance/",),
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(
            id="docs-content"
        )
    ),
)
doc_content = loader.load()

# 2. Indexing
# -----------------------------------------
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=1000, chunk_overlap=200
)
docs = text_splitter.split_documents(doc_content)

vector_store = Chroma.from_documents(documents=docs, embedding=OpenAIEmbeddings())
retriever = vector_store.as_retriever()

# 3. LLM 
# -----------------------------------------
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(temperature=0.5)

# 4. RAG Chain
# -----------------------------------------
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

prompt = """
Answer the question based on the context below:
{context}

Question: {question}
"""

prompt_template = ChatPromptTemplate.from_template(prompt)

chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt_template
    | llm
    | StrOutputParser()
)

# 5. Invoking the chain
# -----------------------------------------
response = chain.invoke(
    "How can I improve site speed?",
)

print(response)

>> To improve site speed, you can consider the following strategies:

1. Utilize Django's caching framework: Implement caching to save dynamic content and reduce the need for recalculating data for each request. Django offers different levels of cache granularity, allowing you to cache specific views or difficult-to-produce pieces of content.

2. Optimize template performance: Use {% block %} instead of {% include %} as it is faster. Avoid heavily-fragmented templates, as they can impact performance. Enable the cached template loader to avoid compiling templates every time they need to be rendered.

3. Minimize static file loading times: Use ManifestStaticFilesStorage to append content-dependent tags to static file names, allowing web browsers to cache them long-term without missing future changes.

4. Consider alternative software implementations: Check if alternative implementations of Python software you're using can execute the same code faster. PyPy, for example, can offer substantial performance gains for heavyweight applications.

5. Benchmark and measure performance: Use performance benchmarking tools like django-debug-toolbar and third-party services like Yahoo's Yslow and Google PageSpeed to analyze and report on your site's performance. Identify areas for improvement and prioritize optimizations based on their impact.

By implementing these strategies and continuously monitoring and optimizing your site's performance, you can enhance its speed and provide a better user experience.

In the above example, we index a web page (one of Django’s official documentation pages) in a Chroma DB.

Please pay attention to step #2, where we chunk and index the web page and store it in a Chroma DB. To chunk the text content of the web page, we’ve used the recursive character splitting technique, one of many ways we can create chunks.
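
If you’re curious about what the splitter actually produced, a quick, optional inspection helps (the exact numbers depend on the page content):

# Optional: peek at the chunks created by the splitter
print(f"Number of chunks: {len(docs)}")
print(docs[0].page_content[:300])  # first 300 characters of the first chunk
print(docs[0].metadata)            # the source URL is kept in the metadata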


We then use OpenAI embeddings to convert the chunks to their vector representations and store them in the Chroma DB. One important thing to note: we must use the same embedding model when we embed the questions at retrieval time. Fortunately, LangChain takes care of this for us.

Now note step #4, where we create the RAG chain. It may look similar to our last app, with two notable differences. In place of the context, we now pass in the retriever; the retriever wraps the vector store and fetches relevant chunks for you. And instead of a question, we have a RunnablePassthrough object. This lets us invoke the chain with the question as its only argument; we no longer need to pass a dictionary.
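
To demystify that last part, here is a tiny, optional sketch. RunnablePassthrough simply forwards whatever it receives, and the dictionary at the start of the chain sends the same question string to both the retriever (which fills the context) and the passthrough (which fills the question):

from langchain_core.runnables import RunnablePassthrough

# RunnablePassthrough forwards its input unchanged
print(RunnablePassthrough().invoke("How can I improve site speed?"))

>> How can I improve site speed?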


Isn’t this easy?

But under the hood, every time you ask a question (every time you invoke the chain), the chain converts the question to a vector, retrieves similar chunks from the vector store, passes them as context to the LLM, and generates an answer to your question.
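
If you’d like to see those steps without the chain abstraction, here is roughly the same flow written out by hand, as a sketch that reuses the retriever, prompt_template, llm, and StrOutputParser defined above:

# 1. Retrieve the chunks most similar to the question
question = "How can I improve site speed?"
relevant_chunks = retriever.invoke(question)

# 2. Fill the prompt template with the retrieved context and the question
filled_prompt = prompt_template.invoke(
    {"context": relevant_chunks, "question": question}
)

# 3. Ask the LLM and parse the answer into plain text
answer = StrOutputParser().invoke(llm.invoke(filled_prompt))
print(answer)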

Congratulations on building your first RAG app. We’re now ready to take on AI agents.

3 Get things done with AI agents

The term ‘AI agents’ may confuse or even intimidate you, but it’s a simple concept.

Think of an agent as a worker to whom you assign a task. The worker then has to use the tools to finish the task.

This sounds like a traditional software program. Indeed, it is. However, unlike traditional software, where instructions execute in a fixed order, an agent figures out which tools to use and in what order.

Here’s an agent in action.

from phi.agent import Agent
from phi.model.openai import OpenAIChat
from phi.tools.yfinance import YFinanceTools

agent = Agent(
    model=OpenAIChat(model="gpt-4o"),
    tools=[YFinanceTools(stock_price=True, analyst_recommendations=True, company_info=True, company_news=True)],
    show_tool_calls=True,
    markdown=True,
)

agent.print_response(
    "Write a comparison between the top chip manufacturers, use all tools available."
)

And here’s its output.

Image by the author.

This entire report took only 26 seconds to generate. That may sound a bit high, but here’s the point.

How long would it take you to do this manually? Even if you could write code to do it, how long would it take you to finish writing that Python script?

This code is only a few lines long. But it can do countless things with a single tool, YFinanceTools, which fetches stock-related information. Note that we haven’t told the LLM which companies to look at; it has figured that out on its own.

I. Agents with many tools

Agents can use multiple tools, too. Here’s an example.

from phi.agent import Agent
from phi.tools.newspaper4k import Newspaper4k
from phi.tools.duckduckgo import DuckDuckGo

agent = Agent(
    tools=[
      Newspaper4k(), 
      DuckDuckGo()
    ], 
    debug_mode=True, 
    show_tool_calls=True
)

agent.print_response(
    "Find the list of companies mentioned in https://www.ycombinator.com/companies/industry/generative-ai and find the latest news about them"
)

In this example, we ask the agent to read a link and find the latest news about the companies mentioned in it. The link points to a list of Gen-AI companies funded by Y Combinator.

We expect the first tool, Newspaper4k, to read through the article and then the second tool, DuckDuckGo, to find related news for each startup mentioned in the link.

Here’s the output:

This task isn’t easy to perform manually, or even to code, yet our agent completed it quickly.

II. Multi-agent architecture

Sometimes, it’s best to have specialized agents work together to finish our task.

Rather than giving a single person all your tools and asking them to perform a complex task, you’re better off handing it to a few specialists. Each team member completes a subtask with the tools at their disposal, and a leader coordinates the work between them.

That sounds wild.

This is what multi-agent systems are. And we’re going to build one.

from phi.agent import Agent
from phi.model.openai import OpenAIChat
from phi.tools.tavily import TavilyTools
from phi.tools.yfinance import YFinanceTools

web_agent = Agent(
    name="Web Agent",
    model=OpenAIChat(model="gpt-4o-mini"),
    tools=[TavilyTools()],
    instructions=["Always include sources"],
    show_tool_calls=True,
    markdown=True
)

finance_agent = Agent(
    name="Finance Agent",
    role="Get financial data",
    model=OpenAIChat(model="gpt-4o-mini"),
    tools=[
      YFinanceTools(
        stock_price=True, 
        analyst_recommendations=True, 
        company_info=True
      )
    ],
    instructions=["Use tables to display data"],
    show_tool_calls=True,
    markdown=True,
)

agent_team = Agent(
    model=OpenAIChat(model="gpt-4o"),
    team=[web_agent, finance_agent],
    instructions=[
      "Always include sources",
      "Use tables to display data",
      "show gif for news posts",
    ],
    show_tool_calls=True,
    markdown=True,
)

agent_team.print_response(
  "Find the most recent tech news, then fetch and summarize the latest stock details of the companies mentioned in the news", 
  stream=True
)

The above code has two agents. The first specializes in searching the internet for news; it searches programmatically using a tool called Tavily (which needs its own API key in your .env file). The second fetches financial information from Yahoo Finance.

We’ve got a third agent, too.

This one has one responsibility – coordinating the two agents to complete the task it’s asked for.

Here’s what the output looks like when we ask the team of agents to "Find the most recent tech news, then fetch and summarize the latest stock details of the companies mentioned in the news."

Screenshot by the author.

The web agent searched credible sources like Forbes and MSN to find the latest tech news. The finance agent then used YFinanceTools to fetch the recent stock information, and the team presented a neat report with the summaries.

Final thoughts

Multi-agent architectures have a lot of potential to automate many of our tedious tasks. They can coordinate agents in ways that aren’t possible with deterministic code.

Tools like LangChain help anyone with some Python knowledge develop intelligent apps powered by LLMs. Likewise, frameworks like Phidata and CrewAI can help you build agents and teams of agents.

The time it takes to start these isn’t too crazy, either. Perhaps you need only a weekend and a few cups of coffee.

