
TinyLlama – The Promising Generation of Powerful Smaller Language Models

Learn about TinyLlama, a smaller language model capable of a variety of complex tasks with a small amount of compute

TinyLlama is an open-source project that trains a small language model of around 1.1B parameters. The project aims to have a language model capable of performing tasks a full LLM like Llama 2 can achieve but with less memory usage. This article will discuss how TinyLlama can be implemented and run locally on your computer. Furthermore, it will also discuss TinyLlama’s current performance, along with its strengths and weaknesses.

ChatGPT’s imagination of a TinyLlama model. OpenAI. (2024). ChatGPT (4) [Large language model]. https://chat.openai.com

Table of contents

· Table of contents
· Motivation
· Implementing the model locally
· Testing the model
  · Fibonacci sequence
  · RAG
  · Generating dialog
  · Coding with TinyLlama
· My thoughts on the model
· Conclusion

Motivation

My motivation for writing this article is to keep up with the latest trends in machine learning. Though TinyLlama was released a few months ago, the project is still unique, and I appreciate the goal the authors are trying to achieve. Having a small language model opens up many opportunities that are not possible with LLMs. Full-sized LLMs, for example, require a lot of compute, making every query compute-intensive, while smaller language models naturally require much less compute to run. Furthermore, when using LLMs, one often has to access them through an API like OpenAI’s or run them locally on a powerful computer. Running an LLM on a mobile phone is difficult, and even when possible, it would take a long time. With smaller language models, we could have models capable of running locally on your phone without an internet connection. This could be useful for completing tasks on the phone and keeping your data private, since it is not sent to an API. You can learn more about TinyLlama and the advantages of smaller language models on the TinyLlama GitHub page.

This is the fourth article in my series on implementing and testing new projects within machine learning. I have previously written about Open-Sora-Plan to create synthetic videos, how to use Alibaba’s vision-language model, and time series forecasting with Amazon’s Chronos model:

How to Effectively Forecast Time Series with Amazon’s New Time Series Forecasting Model

Implementing the model locally

First, you should download the required packages: PyTorch, accelerate, and transformers version ≥ 4.34. You can download PyTorch from the PyTorch website, which will give you a command like the one below if you are on Windows:

pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

Then download transformers and accelerate:

pip install transformers
pip install accelerate

To implement the model locally, you can follow the TinyLlama HuggingFace page. Running the code below will start downloading the model, which is around 2.2 GB as of this article’s writing.

import torch
from transformers import pipeline

pipe = pipeline("text-generation", model="TinyLlama/TinyLlama-1.1B-Chat-v1.0", torch_dtype=torch.bfloat16, device_map="auto")

You can then run the model with the following code:

def prompt_tinyllama(prompt, system_prompt=""):
    # Build the chat messages in the format expected by the chat template
    messages = [
        {
            "role": "system",
            "content": system_prompt,
        },
        {"role": "user", "content": prompt},
    ]
    # Apply the model's chat template to get the full text prompt
    prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    # Generate a response and return only the assistant's part of the output
    outputs = pipe(prompt, max_new_tokens=256, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
    return outputs[0]["generated_text"].split("<|assistant|>")[1]

prompt = "How many helicopters can a human eat in one sitting?"
system_prompt = "You are a friendly chatbot who always responds in the style of a pirate"
response = prompt_tinyllama(prompt, system_prompt)
print(response)

This will output:

While it's true that humans can eat large amounts of food, it's not true that
they can eat multiple helicopters in one sitting. A human can eat one cup of
cooked pasta, 1 cup of canned beans, 1/2 cup of canned corn, 1 cup of canned 
tomatoes, 1 cup of cooked rice, 1 cup of canned peas, 1 cup of canned potatoes,
and 1 cup of canned pumpkin in one sitting. However, it's unlikely that a human
would consume a full 10-15 helicopters in one sitting, especially if they are
consuming multiple servings of pasta, beans, corn, tomatoes, rice, peas,
pumpkin, and poultry (such as chicken or turkey) in one sitting.

In the code above, you should first specify two variables. The prompt variable is the question you are asking the language model. This can be any question you like, which the model will hopefully be able to answer. Furthermore, if you want to give the language model additional context, for example when using a RAG system, you can also include that context in the prompt. For example, you can insert the text of a document into the prompt and ask the language model a specific question about the document. You can do this with the code below:

context = "This is the document text, the document is a document about the OpenAI LLM API"
question = "What is the document about"
system_prompt = "You are an expert at answering questions about documents"

prompt = f"Given the following context: {context}, answer the following question: {question}"

This will give the model a document as context and ask the model a question about it. It also sets a system prompt (with the system_prompt variable) that tells the language model it is an expert within the field you are asking about. Telling the language model it is an expert is a common trick to improve the quality of language model output.

You can also adjust the other generation parameters of the model. Decreasing the temperature parameter, for example, will make the language model more deterministic and keep it focused on answering the question you asked. Increasing the temperature, however, increases the model’s creativity, allowing for more open-ended answers.
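As a minimal sketch of how you could expose this, here is a variant of the function with the temperature as an argument (this tweak, including the function name, is my own illustration and not part of the TinyLlama examples):

def prompt_tinyllama_with_temperature(prompt, system_prompt="", temperature=0.7):
    # Same as prompt_tinyllama above, but with the temperature exposed as an argument
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": prompt},
    ]
    chat_prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    outputs = pipe(chat_prompt, max_new_tokens=256, do_sample=True, temperature=temperature, top_k=50, top_p=0.95)
    return outputs[0]["generated_text"].split("<|assistant|>")[1]

# Lower temperature for more deterministic answers, higher for more creative ones
factual = prompt_tinyllama_with_temperature("What is the capital of France?", temperature=0.1)
creative = prompt_tinyllama_with_temperature("Write a two-line poem about llamas", temperature=1.0)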

Unsplash image of a tiny llama. Photo by Chris on Unsplash

Testing the model

After implementing and running the model locally, it is time to test it. I will perform different qualitative tests to see how well the model performs. I will also note the time the model takes for inference on my GeForce GTX 1660 Super. If you have a less or more powerful graphics card, the inference times will naturally deviate from the numbers I see.
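One way to measure the inference time is to wrap the prompt_tinyllama call with a timer, for example (the timing code is my own sketch, not part of the original setup):

import time

start = time.perf_counter()
response = prompt_tinyllama(prompt, system_prompt)
elapsed = time.perf_counter() - start
print(f"Response generated in {elapsed:.1f} seconds")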

Fibonacci sequence

The first test I will perform is to query the model for the first ten values of the Fibonacci sequence, which is the sequence in the image below:

The Fibonacci sequence. Image by the author.

I enter the following prompt:

prompt = "What are the 10 first digits of the fibonacci sequence?"
system_prompt = "You are a friendly chatbot who corresponds concisely with a clear answer"
response = prompt_tinyllama(prompt, system_prompt)

The model responds in 2.1 seconds with:

The first 10 digits of the Fibonacci sequence are:

1, 1, 2, 3, 5, 8, 13, 21, 34, 55

Note that the first and last digits are not part of the sequence.

Correct! This shows the model can quickly perform simple tasks.

RAG

RAG is an important use case for LLMs since it allows them to access vast amounts of data when generating a response. I have previously written about how you can use RAG to search your mail inbox in the article below:

How to Make a RAG System to Gain Powerful Access to Your Data

To test the RAG capabilities of TinyLlama, I prompt it by giving it some context and then asking questions about the context. This will showcase how well TinyLlama can extract information from its context.

I query the model with the following:

context = "This is the document text, the document is a document about the OpenAI LLM API"
question = "What is the document about"
system_prompt = "You are an expert at answering questions about documents concisely and accurately"

prompt = f"Given the following context: {context}, answer the following question: {question}"

response = prompt_tinyllama(prompt, system_prompt)

And TinyLlama responds in 1.1 seconds with:

Sure, here's the answer:

The document is a document about the OpenAI LLM API.

Correct! But the context was short. I will now test again with a longer context, approximately the size of TinyLlama’s context length of 2048 tokens.
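To check how close a prompt is to that 2048-token limit, you can count its tokens with the pipeline’s tokenizer, for example (a small check of my own, not part of the TinyLlama examples):

# Count how many tokens a prompt uses, to compare against the 2048-token context window
num_tokens = len(pipe.tokenizer(prompt)["input_ids"])
print(f"The prompt uses {num_tokens} tokens")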


I first set the max_new_tokens parameter in the prompt_tinyllama function to 1024 so I can generate more tokens with a single query (a one-line change, sketched after the code below). I then generate some random text twice using a prompt like:

prompt = "Generate a random text of around 1000 tokens"
system_prompt = "You are an expert at creating random text"
random_text1 = prompt_tinyllama(prompt, system_prompt)
random_text2 = prompt_tinyllama(prompt, system_prompt)
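The max_new_tokens change mentioned above only amounts to adjusting the generation call inside prompt_tinyllama, roughly like this:

# Inside prompt_tinyllama: allow up to 1024 new tokens instead of 256
outputs = pipe(prompt, max_new_tokens=1024, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)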

I then give the same RAG query as before, but pad the context with the random text so that it approaches the model’s context length. This should make the task more difficult for the model since it has a lot of noise to deal with before it can find the correct answer to my question. I can then prompt the TinyLlama model with the following:

context = f"{random_text1}. This is the document text, the document is a document about the OpenAI LLM API. {random_text2}"
question = "What is the document about"
system_prompt = "You are an expert at answering questions about documents concisely and accurately"

prompt = f"Given the following context: {context}, answer the following question concisely: {question}"

response = prompt_tinyllama(prompt, system_prompt)

The model responds in 6.3 seconds with:

The text is a sample text for a document generated by the OpenAI LLM API. It is a document about the API and features a variety of sensory experiences, including the sounds of birds chirping, the feel of the breeze on the skin, the sight of a sunrise, the sound of birds singing, the taste of a freshly brewed cup of coffee, the smell of blooming flowers, the feeling of a cool breeze, and the sight of a sunset. The text also describes the world as coming alive with the sounds of birds and the rustling of leaves, with the cool breeze carrying the scent of flowers and the sound of birds singing. The document is a natural, concise summary of the API's functionalities and capabilities.

Wow, the model still manages to sift through the noise I added and find the correct answer. You may notice that the model also produces a lot of text after the correct answer, which differs from the desired behavior, especially considering I specifically prompted the model to answer concisely. Still, this response from TinyLlama is impressive!

Generating dialog

The authors suggested another interesting use case of TinyLlama on the GitHub page: creating dialog for video games. I will, therefore, test the model’s ability to create a dialog between two people given a setting.

I prompt the model with the following:

prompt = "Generate a dialog between person A and person B about the weather today, which is sunny, with a storm incoming"
system_prompt = "You are an expert at creating dialog for video games"
response = prompt_tinyllama(prompt, system_prompt)

The model answers in 6.8 seconds with:

Person A: Hey, can you believe it? Today's weather is so perfect! It's sunny and warm, with a beautiful storm coming up.

Person B: (confused) Oh, right. I remember hearing about that. Is it going to be raining or not?

Person A: (sarcastic) Don't worry, it's not going to rain. The storm is just coming up. It's supposed to be beautiful tomorrow.

Person B: (skeptical) Okay, I guess you're right. I just hope it doesn't ruin my plans for the weekend.

Person A: (laughs) Oh, you're planning to go out tomorrow? That's great! Don't forget to bring your raincoat.

Person B: (nods) Okay, I'll be sure to. Have a great weekend, anyway.

Person A: (smiling) You too. See you later

Wow, the model can create dialog for video games. Imagine how interactive video games could feel if custom-generated dialog like this could be produced in seconds. I am surprised by the quality of the output, especially from a model that requires so little compute to run.

Coding with TinyLlama

For the last test, I also wanted to check out TinyLlama’s coding performance. I therefore gave it a standard coding question: implementing binary search on a sorted list. I used the prompt:

prompt = "Generate a Python script to find a number in a sorted array using binary search"
system_prompt = "You are an expert at coding"
response = prompt_tinyllama(prompt, system_prompt)

The model responded in 7.7 seconds with:

# Here's a Python script that finds a number in a sorted array using binary search:

# Import the sorted() function from the built-in sorted module
from sorted import sorted

# Define the function to search for a number in the sorted array
def binary_search(arr, target):
    low = 0
    high = len(arr) - 1
    while low <= high:
        mid = (low + high) // 2
        if arr[mid] == target:
            return mid
        elif arr[mid] < target:
            low = mid + 1
        else:
            high = mid - 1
    return -1

# Define the main function to call the binary search function
def main():
    # Prompt the user for an array and target number
    arr = input("Enter an array of integers separated by spaces: ").split()
    target = int(arr.pop())

    # Call the binary search function to find the index of the target number
    index = binary_search(sorted(list(map(int, arr))), target)

    # Print the index if

Since I only allowed 256 output tokens, the model’s answer was cut short. I did, however, extract the binary_search function and tested it. The function runs immediately and works as expected, giving the correct answers. Though the binary search question is common, it’s impressive that TinyLlama can generate such precise output locally, on a computer with a not-too-powerful graphics card, in seconds.
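For reference, a quick check of the generated binary_search function could look like this (the test values are my own):

# Sanity check of the binary_search function generated by TinyLlama
numbers = [1, 3, 5, 7, 9, 11, 13]
assert binary_search(numbers, 7) == 3    # target present: returns its index
assert binary_search(numbers, 4) == -1   # target absent: returns -1
print("binary_search works as expected")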

My thoughts on the model

After prompting the model with different prompts and getting a feel for the performance of the model, I am quite impressed with its capabilities. TinyLlama is a small language model of 1.1B parameters, but it still performs well on the tasks I have given it. Furthermore, the model responds quickly, given its small size, which makes the model capable of running on less powerful devices in the future. It is also impressive how TinyLlama can perform various tasks, from RAG to coding to generating dialog for video games. This is a good model, with many use cases in real-world scenarios.

Conclusion

In this article, I have shown how you can implement TinyLlama locally on your computer. Furthermore, I have discussed how to prompt and test the model on various tasks, from creating dialog to coding and working in a RAG system. Overall, the model is impressive, and it can be useful in a wide range of scenarios, from incorporating it into video games to create custom dialog for a player, to having a local language model on your mobile phone.

You can also read my articles on WordPress.

