
How to Implement and Test Phi3: Microsoft’s Powerful New Large Language Model

Learn about Phi3: Microsoft's new LLM capable of tasks like question answering and information extraction

This article discusses Phi3, the newly released large language model from Microsoft, an LLM capable of performing various tasks with an unusually large context window relative to the model's size. I will discuss how you can run Phi3 locally and run tests to understand how well it performs on tasks such as concise question answering, JSON formatting, and information extraction. Finally, I will give my thoughts on the model and its performance.

ChatGPT visualization of a small language model working hard. Image by ChatGPT. OpenAI. (2024). ChatGPT (4) [Large language model]. https://chat.openai.com

Table of contents

· Motivation
· Running the model locally
· Testing the model
  ∘ Testing concise answers on simple prompts
  ∘ Testing object formatting ability
  ∘ Testing information extraction/context length utilization
· My overall thoughts on Phi3
· Conclusion

Motivation

My motivation for this article is that Phi3 is one of the latest large language models released by Microsoft, making it an interesting model to test. Smaller language models are particularly interesting because they aim to squeeze more capability out of fewer parameters than their larger counterparts. Smaller models can also run on smaller devices, and a language model that runs locally on your phone could be a huge advancement in AI.

Furthermore, this article is part of a series where I test out the latest models released within machine learning. I have previously written about testing two other language models: TinyLlama, which is similar to Phi3 in that it is a smaller language model, and Llama3, Meta’s latest large language model:

Unleash Llama3 – How you can use the latest big-tech open-source LLM

Running the model locally

One of the simplest ways to run Phi3 locally on your computer is to use Ollama. First, you need to download Ollama from this website. Make sure the Ollama application is installed and running when you execute the Python code, as the Python package communicates with the local Ollama server. You can then install the pip package for Ollama with the following:

pip install ollama

This now allows you to download any Ollama model with the following code:

import ollama
ollama.pull("<model name>:<model tag>")

You can find a lot of different models in the Ollama model library, though this article will discuss only the Phi3 model. As of writing this article, there are six different Phi3 models available on Ollama, represented with the following tags:

  1. latest
  2. 3.8B
  3. instruct
  4. mini
  5. 3.8b-mini-instruct-4k-q4_k_M
  6. 3.8b-mini-instruct-4k-fp16

To install any of these models, you can therefore run the following:

ollama.pull("phi3:<model tag>")
# for example
ollama.pull("phi3:instruct")
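If you want to confirm the download succeeded, the Ollama Python package also exposes a list call. Here is a small sketch; treat the key names as an assumption, since the exact shape of the response can vary between package versions:

import ollama

# list the models available locally to confirm the pull succeeded
# (older package versions return a dict; newer versions return a
# typed object, so adjust the access accordingly)
for model in ollama.list()["models"]:
    print(model["name"])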

Instruct means the model is trained to follow specific instructions, which is the version I recommend using. Tags 5 and 6 are reduced-precision variants: q4 (4-bit quantization) and fp16 (16-bit floating point). Quantization is a process that lowers the computing and storage required for the language model by reducing the precision of the model parameters. Unfortunately, at the time of writing, only 4K context window versions are available on Ollama. Still, in the section on testing information extraction later in this article, I will show you how to run the 128K context-length version as well (which is a bit more involved). Running the ollama.pull command will download the model, and you can then prompt the model with the following code (in this case, using the instruct version of Phi3):

def prompt_phi3(prompt, system_prompt="You are a great question answering machine that answers in a concise manner."):
    # include the system prompt only if one is provided
    if system_prompt:
        response = ollama.chat(model="phi3:instruct", messages=[
            {'role': 'system', 'content': system_prompt},
            {'role': 'user', 'content': prompt},
        ])
    else:
        response = ollama.chat(model="phi3:instruct", messages=[
            {'role': 'user', 'content': prompt},
        ])

    return response["message"]["content"]
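As a quick sanity check, you can call the helper directly (the exact wording of the response will vary between runs):

answer = prompt_phi3("What is the capital of Norway? Only output the answer, nothing else")
print(answer)  # expect something along the lines of: Oslo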

Testing the model

I will perform several tests to qualitatively check the performance of Phi3. I will reuse some tests from my articles on TinyLlama and Llama3, along with an improved information extraction test to truly probe how well Phi3 can use its 128K context length. I designed these tests around the tasks I typically need LLMs to perform, which is the essence of what testing should indicate: how well the machine-learning model can perform the tasks you ask of it. I have written more about this in my article on testing graph quality below:

How to Test Graph Quality to Improve Graph Machine Learning Performance

Testing concise answers on simple prompts

Prompt: What is the capital of Norway? Only output the answer, nothing else

Phi3 response: Oslo

Prompt: What are the ten first digits of pi? Only output the answer, nothing else

Phi3 response: 3141592653

(this is the correct answer)

Prompt: How many stars are on the flag of the USA? Only output the answer, nothing else.

Phi3 response: 50

Prompt: Who is the CEO of Microsoft?

Phi3 response: The current CEO of Microsoft is Satya Nadella, who took office on February 4, 2014.

My thoughts:

Phi3 performs very well here, giving clear and concise answers. It answers all four questions correctly and, for the first three, responds with only the answer, exactly as instructed. Notice that when Phi3 is not explicitly told to output only the answer, it responds with a full sentence, which I regard as a natural response. Overall, Phi3 passed this test with flying colors.

Testing object formatting ability

As in my Llama3 test, I will ask Phi3 to summarize a text and respond in JSON format. The text was generated by Llama3 and reads:

The majestic edifice stands tall and proud, its grandeur commanding attention from all who lay eyes on it. The Building of Elegance, as it is affectionately known, is an architectural masterpiece that has been a beloved landmark in the city for generations.

As you approach the structure, you can't help but be struck by its sheer scale and majesty. Rising high into the sky, the building's façade gleams with a subtle sheen, its cream-colored stones polished to perfection by years of gentle wear. The entrance, flanked by grandiose columns, is a symphony of ornate carvings and intricate moldings that seem to dance across the surface.

Stepping inside, you're enveloped in a warm, welcoming atmosphere. The lobby's high ceiling soars above you like a vaulted sky, adorned with delicate frescoes depicting scenes of mythological grandeur. A gleaming marble floor stretches out before you, reflecting the soft glow of overhead lights and casting a gentle sheen across the room.

To either side of the entrance, sweeping staircases curve upward like great wings, leading visitors to the various levels above. The air is filled with the soft hum of activity – whispers, footsteps, and the occasional rustle of papers – as people go about their daily lives within these hallowed halls.

As you look around, your gaze falls upon an array of stunning features. To one side stands a magnificent chandelier, its delicate crystals refracting light into a kaleidoscope of colors that seem to shift and shimmer with every movement. Nearby, a grand fountain bubbles and splashes, its gentle song providing a soothing accompaniment to the bustle below.

The building's interior is just as impressive as its exterior. Halls lined with gleaming wooden paneling stretch out in all directions, punctuated by occasional doorways leading to various offices, meeting rooms, or quiet nooks for contemplation. Every now and then, a stunning stained-glass window bursts forth from the stone walls, casting vibrant hues across the floor.

The Building of Elegance has been home to countless stories over the years – whispers of romance, triumphs of industry, and moments of quiet reflection. It is a place where memories are made, where dreams take shape, and where the passing of time is measured by the soft ticking of clockwork mechanisms hidden away in the walls.

As you wander through these hallowed halls, you can't help but feel a sense of reverence for this magnificent structure – a true gem of architecture that has stood the test of time. And as you look up at its soaring heights, you can't help but wonder what secrets it holds within its ancient stones, waiting to be discovered by those who come after.

The point here, however, is not the content of the text but whether Phi3 can take it in, provide some useful information about it, and return the response in a valid JSON format. This is a crucial task for an LLM, as one often wants data summarized into a response that can be parsed programmatically, and repairing an invalid JSON object in code is difficult. Responding with perfect JSON also requires the LLM to respond nearly flawlessly, as any error while constructing the JSON object makes it invalid.
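To make this concrete, here is a small illustration (not part of the experiment itself) of how strict json.loads is: any prose around the object, a common LLM failure mode, makes parsing fail.

import json

valid = '{"title": "The Building of Elegance"}'
print(json.loads(valid)["title"])  # parses fine

# any surrounding prose makes the whole string invalid JSON
invalid = 'Sure! Here is the summary: {"title": "The Building of Elegance"}'
try:
    json.loads(invalid)
except json.JSONDecodeError as e:
    print("Invalid JSON:", e)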

I prompt Phi3 with the following prompt, where building_text is the text displayed above:

prompt_phi3(f"Summarize the following text in a JSON format. Only respond with a JSON object. Text: {building_text}.")

And the first response Phi3 gives is:

{'title': 'The Building of Elegance',
 'description': 'A majestic edifice known as The Building of Elegance stands tall and proud in the city. Its grandeur commands attention with a sheer scale, cream-colored stones polished by years, grandiose columns at its entrance, delicate frescoes adorning its lobby ceiling, sweeping staircases leading to various levels above, magnificent chandelier and bubbling fountain, and halls lined with gleaming wooden paneling. This architectural masterpiece has been a beloved landmark for generations.',
 'features': [{'description': 'Sheer scale', 'location': 'entire building'},
  {'description': 'Cream-colored stones polished by years',
   'location': 'façade'},
  {'description': 'Grandiose columns at entrance', 'location': 'entrance'},
  {'description': 'Delicate frescoes adorning lobby ceiling',
   'location': 'lobby'},
  {'description': 'Sweeping staircases leading to various levels above',
   'location': 'entrance'},
  {'description': 'Magnificent chandelier refracting light into kaleidoscope of colors',
   'location': 'side of entrance'},
  {'description': 'Grand fountain bubbling and splashing softly',
   'location': 'nearby chandelier'},
  {'description': 'Halls lined with gleaming wooden paneling',
   'location': 'various areas inside the building'},
  {'description': 'Stunning stained-glass windows casting vibrant hues across floor',
   'location': 'walls of various rooms'}]}

This response looks good. Now it is time to see whether Phi3 can consistently create valid JSON objects. I will run the same prompt 100 times and count how many times Phi3 fails to generate a valid JSON object. To run this test, you must first install tqdm with the following:

pip install tqdm

You can then run the experiment with

# test ability to make valid JSON objects
import json
from tqdm.auto import tqdm

failed_count = 0
num_runs = 100
for _ in tqdm(range(num_runs)):
  try:
    json_string = prompt_phi3(f"Summarize the following text in a JSON format. Only respond with a JSON object. Text: {building_text}.")
    obj = json.loads(json_string)
    print("Success")
  except Exception as e:
    failed_count += 1
    print("Failed: ", e)

print(f"Failed {failed_count/num_runs*100}% of the time to make a valid JSON object")

Running this code shows that Phi3 fails to provide a valid JSON object in 91 of 100 runs.

My thoughts

Phi3 can produce valid JSON objects, as the first response above shows, and it manages to do so in 9% of runs. However, a 9% success rate is disappointing. In an automated system with no human in the loop, you would have to run the code on average around 11 times per text before Phi3 produced a valid JSON summary. That is not workable in practice, so based on this test, I think Phi3 is not particularly good at returning formatted objects.

Testing information extraction/context length utilization

Another important aspect of LLMs is their ability to perform information extraction. Information extraction in this context means giving the LLM a lot of text and then asking specific questions about that text. The context length of the model is naturally important in this case, as a longer context length allows for prompting the model with longer texts to perform information extraction.

To perform this test, I also switched to the 128K context-length version of Phi3. To use this model, follow this GitHub page on running the Phi3 model with ONNX files. Also note that when setting the minimum and maximum number of tokens as described on that page, the token counts include the input tokens.
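For reference, the generation loop in that tutorial is built on the onnxruntime-genai package. Below is a minimal sketch of what it looks like; I am assuming the 0.x-era API used by the tutorial's example script (the model directory name and exact method names may differ in newer releases), and the main(args, prompt) call I use later in this article wraps essentially this logic:

import onnxruntime_genai as og

# path to the downloaded Phi3 128K ONNX model folder (assumed location)
model = og.Model("phi3-mini-128k-instruct-onnx")
tokenizer = og.Tokenizer(model)

params = og.GeneratorParams(model)
# note: max_length counts the input tokens plus the generated tokens
params.set_search_options(max_length=2048)
params.input_ids = tokenizer.encode("<|user|>\nWhat is the capital of Norway?<|end|>\n<|assistant|>")

# generate token by token until the model emits an end token
generator = og.Generator(model, params)
while not generator.is_done():
    generator.compute_logits()
    generator.generate_next_token()

print(tokenizer.decode(generator.get_sequence(0)))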

ChatGPT imagination of an LLM performing information extraction. Image by ChatGPT. OpenAI. (2024). ChatGPT (4) [Large language model]. https://chat.openai.com

Phi3 comes in two context-length variants: one with 4K and one with 128K. A 128K context length is impressive, but it is important to verify that the model actually utilizes the full window. To test this, I generated a long text of around 100K tokens. Within it, there is one particular sentence I need the model to extract:

The company is on floor number 27

So, around 100K tokens (around 75K words) will surround the information about which floor the company is on, and I will then prompt Phi3 to extract the floor number. To test that the model can utilize its full context length, I will run this ten times, placing the floor-number sentence in 10 different locations in the text. My code for performing this is as follows. I first load the generated text, which I have verified contains no other mention of which floor the company is on:

with open(r"random_text.txt", encoding="utf-8") as f:
    random_text = f.read()

I then have a function to insert the important information:

important_information = "The company is on floor number 27. "

def insert_text(full_text, text_to_insert, index):
    # the insertion index must fall within the text
    assert 0 <= index <= len(full_text)
    return f"{full_text[:index]} {text_to_insert} {full_text[index:]}"
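As a quick example, inserting at index 0 places the sentence at the very start of the noise text:

# place the needle at the very beginning of the haystack
sample = insert_text(random_text, important_information, 0)
print(sample[:80])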

Finally, I use np.linspace to get ten evenly spaced indices over the length of the text; for each run, I insert the sentence at one of those indices and prompt the model. Note that random_text here is the text I generated with Llama3 to act as noise; it could be any other text you want to use.

import numpy as np

# get 10 indices evenly spaced over the length of the random text
indices = np.linspace(0, len(random_text), 10, dtype=int)

responses = []
for idx in tqdm(indices):
    random_text_with_info = insert_text(random_text, important_information, idx)
    assert important_information in random_text_with_info
    prompt = f"In the following text: {random_text_with_info}, what is the floor number of the company?"
    print("PROMPT:", prompt)
    # main and args come from the Phi3 ONNX runner script linked above
    response = main(args, prompt)
    responses.append(response)
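One simple way to tally the results is to check whether the correct floor number appears in each response (a quick sketch I am adding for illustration; matching the literal string "27" could in principle over-count if the digits appear for other reasons):

# count how many of the runs contain the correct floor number
correct = sum("27" in response for response in responses)
print(f"Phi3 found the floor number in {correct}/{len(responses)} runs")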

I wanted to test the model with the full 128K tokens, but unfortunately, I could not due to computing limitations. I therefore ran the test with 16K tokens instead. Initially, I struggled to make the model find the correct answer. After trying several different prompts, however, the model discovered the correct answer in 6/10 cases with 16K tokens.

My thoughts

Even though I could not test the full 128K context length of Phi3, locating the important information in only 6/10 instances at 16K tokens suggests the model does not use its context well. Furthermore, like many other LLMs, Phi3 proved very sensitive to the prompt in this test: I had to tweak the prompt several times before it could retrieve the floor number. Prompt sensitivity is something many LLMs struggle with, but it can severely limit how useful a model is for information extraction. Imagine, for example, scanning text where the information to extract is stated less explicitly, or where you cannot be sure the information is present at all (making prompt tweaking impossible). In these cases, the weakness described in this section would be a serious downside of the model.

My overall thoughts on Phi3

Overall, Phi3 is an interesting model that is good at answering questions concisely. As a smaller model, inference with Phi3 is much faster than inference with Llama3, which can be an argument for using Phi3 for typical LLM tasks such as question answering or information extraction.

Phi3 also has some downsides, especially its unreliability at returning formatted responses such as JSON objects and its sensitivity to prompt wording during information extraction. Having to tweak the prompt to find a specific piece of information, such as a floor number, is a severe weakness to keep in mind if using Phi3 for information extraction tasks.

Although Phi3 is a small model, I am not too impressed by its capabilities, mostly due to its low performance on formatted responses and information extraction. It is cool that a model with so few parameters offers a 128K context-length option. Still, when the model only retrieves the correct information in 6/10 instances at 16K tokens, it suggests the model cannot fully utilize its context length.

Conclusion

In this article, I discussed Microsoft's new language model, Phi3, and my motivation for testing new models: staying up to date with the latest innovations in machine learning. I then showed you how to run Phi3 locally on your computer and how to run several tests to check its performance. Finally, I gave my thoughts on the model: a smaller model that performs well on concise question answering but with lackluster performance on returning formatted responses and information extraction.

You can also read my articles on WordPress.
