This article discusses how to use Microsoft’s new language model, WizardLM2, and how you can test it (and language models in general) to get a surface-level view of its performance. Additionally, I will go through the model’s strengths and weaknesses and give my thoughts on how well it performs.

Table of contents
· Motivation · Implementing WizardLM2 locally · Testing WizardLM2 ∘ Testing concise question-answering ∘ Testing formatted responses ∘ Testing information extraction from context · My overall thoughts · Conclusion
Motivation
My motivation for writing this article is to test out the latest machine-learning models. To keep up with all the advances, I pay attention to websites like PapersWithCode, GitHub Trending, and HuggingFace. I discovered this model through the HuggingFace announcement and then ran and tested it myself as quickly as possible. I highly recommend regularly reading websites such as PapersWithCode and HuggingFace to keep up with all the advancements in AI. Knowing the latest available (open-source) models and their capabilities can be a huge advantage when solving various data science problems.
I have previously written about other new machine-learning models, including Phi3, Llama3, TinyLlama, and the DocOwl vision-language model, one of which is linked below.
Using a Multimodal Document ML Model to Query Your Documents
Implementing WizardLM2 locally
One of the easiest approaches to running WizardLM2 locally in Python is to use the Ollama application. To do this, first download the application from Ollama’s website. Then, install and run the application; the application must always be running when using Ollama in Python.
You can then go to the WizardLM2 page in Ollama’s model library and browse the different versions of WizardLM2 you can download. As of writing this article, there are 22 versions of WizardLM2 on Ollama. There are two main models of WizardLM2: the 7B parameter model and the 8x22B parameter model. The rest are quantized versions of these main models. Quantized models store the model weights at lower precision, which makes them take up less storage space and makes inference faster. If you are running on a less powerful computer, I recommend running a quantized version of the 7B parameter WizardLM2 model. In Python, you can download the smallest WizardLM2 model with:
import ollama
ollama.pull("wizardlm2:7b-q2_K")
You can also find all the other models you can download on the WizardLM2 model page on Ollama.
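If you want to verify from Python which models have been downloaded, the ollama package also exposes a list call. A minimal sketch (the exact structure of the returned data depends on your ollama package version, so it is simply printed here rather than indexed into):
import ollama

# Print all models currently available to the local Ollama server.
# The response structure varies between ollama package versions,
# so it is printed as-is instead of accessing specific fields.
print(ollama.list())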
To prompt the model, you can then use the following function:
def prompt_wizard(prompt, system_prompt="You are a great question answering machine that answers in a concise manner."):
    if len(system_prompt) > 0:
        response = ollama.chat(model="wizardlm2:7b-q2_K", messages=[
            {'role': 'system', 'content': system_prompt},
            {'role': 'user', 'content': prompt},
        ])
    else:
        response = ollama.chat(model="wizardlm2:7b-q2_K", messages=[
            {'role': 'user', 'content': prompt},
        ])
    return response["message"]["content"]
The function calls Ollama, either with or without a system prompt. The if-else statement is needed because ollama.chat raises an error if you pass it an empty system prompt. Also, if you use a model tag different from the smallest model (7b-q2_K), you must change the model tag in the code above. A variant that takes the model tag as a parameter is sketched below.
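If you want to avoid hard-coding the model tag, you can pass it as a parameter instead. Below is a minimal sketch of such a variant (prompt_model is a hypothetical helper, not part of the code above; it defaults to the wizardlm2:7b-q2_K tag used in this article):
def prompt_model(prompt, model="wizardlm2:7b-q2_K", system_prompt=""):
    # Build the message list, only adding a system message when one is given,
    # since ollama.chat errors out on an empty system prompt
    messages = []
    if system_prompt:
        messages.append({'role': 'system', 'content': system_prompt})
    messages.append({'role': 'user', 'content': prompt})
    response = ollama.chat(model=model, messages=messages)
    return response["message"]["content"]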
Testing WizardLM2

Testing concise question-answering
The first test is similar to the one I have run for other language models like Llama3, Phi3, and TinyLlama: testing the model’s question-answering capabilities. Since most advanced large language models today can answer these questions correctly, an additional challenge is added: the model must respond with only the answer, with no text around it. A large language model’s ability to respond concisely with only the answer is important, especially if its output is to be used directly in an automated system. Humans can naturally filter the precise answer out of a verbose response, but when the model is part of a larger automated system, it is crucial that it can output the precise answer directly, with no additional text. A small example of why this matters is shown below.
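As a small illustration (a hypothetical downstream step, not part of the tests below), consider an automated pipeline that compares the model’s output directly against an expected value:
expected = "Washington, D.C."

# A concise reply can be compared directly against the expected value
concise_reply = "Washington, D.C."
print(concise_reply.strip() == expected)  # True

# A verbose reply fails the exact comparison, even though the correct
# answer is buried somewhere inside it
verbose_reply = "The capital of the USA is Washington, D.C., where Congress meets."
print(verbose_reply.strip() == expected)  # False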
I ask the following prompts to the model:
prompts = ["What is the capital of the USA? Only output the answer, nothing else"
"What are the 10 first digits of Eulers number. Only output the answer, nothing else"
"Mention 3 well-known airline companies. Only output the answer, nothing else"
"How many countries are in the EU? Only output the answer, nothing else"
]
And prompt the model with the following:
for prompt in prompts:
    print(prompt_wizard(prompt))
The model then responds with:
["The capital of the USA is Washington, D.C. (Washington, District of Columbia). However, the federal government's administrative 'headquarters' are in various locations due to the nature of a government spread across different cities as per the U.S. Constitution and its amendments. The most prominent among them is the Capitol in Washington, D.C. where both the Senate and House of Representatives meet.",
"The first ten digits of Euler's number (e) are approximately 28415926535.nn(Note: Euler's number is an irrational and transcendental number with an infinite sequence of non-repeating digits, so the above figure is an approximation of the first ten digits after 284.)nnTo see the next part of the sequence, one would typically add a few more digits (e.g., 10106) to reach the next integer value (28415926535 + 10106 = 2841592736) and then continue with the actual base-2 calculations to find the following digits.",
"1. Delta Air Linesnn2. American Airlinesnn3. United Airlines (formerly known as United Airlines and also as UAL)",
"27 countries plus Portugal, as of my knowledge cutoff in 2023, are members of the European Union (EU), following the departure of the United Kingdom. The EU itself is an supranational organization of those member states. However, there are also a few countries that are part of the EU's single market but not part of its governance structures, often referred to as 'EU27' plus the UK."]
These results are disappointing. The model answers questions 1, 3, and 4 correctly but answers question 2 incorrectly. Even worse, the model cannot provide concise answers like I have seen Phi3 and Llama3 do. One reason behind this is likely that I am using a quantized model, though Phi3 was of a similar size to the WizardLM2 quantized version I am using, so I would expect more from the model.
Testing formatted responses
Another interesting test I have applied to some other language models is the ability of the large language model to output formatted responses. One of the most common output formats for large language models is a JSON object, which is what this test will ask for. Given a 435-word text describing a building (randomly generated with Llama3), the WizardLM2 model is asked to summarize the building in JSON format. I then try to convert the response into a JSON object. If the model’s output cannot be directly loaded as a JSON object, the model fails; if it can, the model succeeds. The experiment is repeated 100 times to reduce the effect of randomness in the results. To run this experiment, the following code will be used:
First, load the text describing the building:
building_text = """The majestic edifice stands tall and proud, its grandeur commanding attention from all who lay eyes on it. The Building of Elegance, as it is affectionately known, is an architectural masterpiece that has been a beloved landmark in the city for generations.
As you approach the structure, you can't help but be struck by its sheer scale and majesty. Rising high into the sky, the building's façade gleams with a subtle sheen, its cream-colored stones polished to perfection by years of gentle wear. The entrance, flanked by grandiose columns, is a symphony of ornate carvings and intricate moldings that seem to dance across the surface.
Stepping inside, you're enveloped in a warm, welcoming atmosphere. The lobby's high ceiling soars above you like a vaulted sky, adorned with delicate frescoes depicting scenes of mythological grandeur. A gleaming marble floor stretches out before you, reflecting the soft glow of overhead lights and casting a gentle sheen across the room.
To either side of the entrance, sweeping staircases curve upward like great wings, leading visitors to the various levels above. The air is filled with the soft hum of activity — whispers, footsteps, and the occasional rustle of papers — as people go about their daily lives within these hallowed halls.
As you look around, your gaze falls upon an array of stunning features. To one side stands a magnificent chandelier, its delicate crystals refracting light into a kaleidoscope of colors that seem to shift and shimmer with every movement. Nearby, a grand fountain bubbles and splashes, its gentle song providing a soothing accompaniment to the bustle below.
The building's interior is just as impressive as its exterior. Halls lined with gleaming wooden paneling stretch out in all directions, punctuated by occasional doorways leading to various offices, meeting rooms, or quiet nooks for contemplation. Every now and then, a stunning stained-glass window bursts forth from the stone walls, casting vibrant hues across the floor.
The Building of Elegance has been home to countless stories over the years — whispers of romance, triumphs of industry, and moments of quiet reflection. It is a place where memories are made, where dreams take shape, and where the passing of time is measured by the soft ticking of clockwork mechanisms hidden away in the walls.
As you wander through these hallowed halls, you can't help but feel a sense of reverence for this magnificent structure — a true gem of architecture that has stood the test of time. And as you look up at its soaring heights, you can't help but wonder what secrets it holds within its ancient stones, waiting to be discovered by those who come after."""
Then, the WizardLM2 model is prompted, and I attempt to convert the output to a JSON object 100 times and print out the number of times the model can output a correctly formatted JSON object directly.
# test ability to make valid JSON object
import json
from tqdm.auto import tqdm

failed_count = 0
num_runs = 100
for _ in tqdm(range(num_runs)):
    try:
        json_string = prompt_wizard(f"Summarize the following text in a JSON format. Only respond with a JSON object. Text: {building_text}.")
        obj = json.loads(json_string)
        print("Success")
    except Exception as e:
        failed_count += 1
        print("Failed: ", e)
print(f"Failed {failed_count/num_runs*100}% of the time to make a valid JSON object")
In the cases where the model fails, there is typically only a small error in the JSON object that makes it invalid. Still, this is a good test of how well the model can follow instructions and respond in the desired format.
The model creates a correctly formatted JSON object in 16/100 cases, a success rate of 16%. This is higher than the Phi3 model’s, though I still regard it as quite poor performance. Only receiving valid JSON objects 16% of the time means you would, on average, have to query the model 6.25 times before getting a correctly formatted response. One way to work around this in practice is sketched below.
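As a minimal sketch of such a workaround (it reuses the prompt_wizard function from earlier; prompt_for_json and max_attempts are hypothetical names introduced here), you can strip any markdown code fences, a common cause of invalid output, and retry until json.loads succeeds:
import json

def prompt_for_json(prompt, max_attempts=10):
    # Retry the prompt until the model returns a string that parses as JSON,
    # or give up after max_attempts tries
    for _ in range(max_attempts):
        raw = prompt_wizard(prompt).strip()
        # Strip markdown code fences such as ```json ... ``` if present,
        # since they are a common reason json.loads fails on model output
        cleaned = raw.removeprefix("```json").removeprefix("```").removesuffix("```").strip()
        try:
            return json.loads(cleaned)
        except json.JSONDecodeError:
            continue
    raise ValueError(f"No valid JSON object after {max_attempts} attempts")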
Testing information extraction from context
Another important aspect of a large language model is its ability to extract information from its context. This test therefore gives the WizardLM2 model a lot of text in which a single sentence contains the relevant piece of information we are after. The model is then asked to retrieve that piece of information. I use this website’s random Latin text generator to generate the surrounding text. An unfortunate side effect of creating text this way is that the sentence to extract is in a different language than the noise text, though it is a quick and easy way to generate a lot of text.
One scenario where information extraction capabilities are crucial is in a retrieval augmented generation (RAG) system, where the model is given context retrieved with vector similarity from encoded text strings, and the model is asked to respond to a question with the given context. I have written more about this in the article below:
How to Make a RAG System to Gain Powerful Access to Your Data
To run this experiment, I followed the same format as in my Phi3 testing, though with different text around the important information. I first load 990 words of random Latin text with:
# load random text
random_text = """Lorem ipsum dolor sit amet, consectetur adipiscing elit. Proin tincidunt dapibus leo, in dictum nunc volutpat et. In id suscipit velit, vel rutrum velit. Donec vel vehicula nulla, id accumsan felis. Integer consectetur felis purus, a rhoncus felis fermentum sit amet. Donec tempor, quam eu feugiat pulvinar, turpis elit placerat erat, vel commodo ante ex eu nisi. In hac habitasse platea dictumst. Integer vel dapibus ex, id cursus ligula. Sed posuere orci nec ex consequat, a sollicitudin dolor posuere.
Proin blandit, nulla at auctor iaculis, neque odio efficitur orci, vitae malesuada urna est eget leo. Fusce aliquet tempor rutrum. Ut eget neque dolor. In hac habitasse platea dictumst. Ut rutrum lectus eget facilisis tristique. Donec nec ante tortor. Nam vitae mi eu dui pulvinar posuere. Mauris sagittis viverra lorem, vitae interdum neque ultricies quis. Donec at arcu non ante lobortis malesuada. Cras molestie convallis lorem, non laoreet leo tristique et. Donec et nibh nisi. Nam interdum libero ac ante aliquam consequat. In tristique a nisi et euismod. Proin consectetur ipsum a leo feugiat euismod.
Nam sed libero sollicitudin, aliquet mauris eu, sodales sem. Phasellus vulputate leo eu odio accumsan, ut lacinia est lobortis. In hac habitasse platea dictumst. Nunc id nulla molestie, sodales leo a, imperdiet libero. Phasellus dapibus tellus quis magna ultricies, ac cursus odio faucibus. Pellentesque ullamcorper sodales turpis, eget rhoncus massa imperdiet eget. Proin vel feugiat ligula, in ornare turpis. Morbi eget nunc vitae arcu sagittis posuere non eget nisi. Pellentesque laoreet ut ex et tempus. Sed sit amet erat libero. Donec a neque metus.
Phasellus risus sapien, lobortis id tristique nec, tincidunt non sem. Cras et urna sed erat congue mollis non non urna. Quisque pretium consectetur suscipit. Praesent iaculis, tortor ut congue finibus, magna augue tempus ligula, et pharetra eros mauris a nisi. Nulla venenatis gravida blandit. Aenean ultricies orci ipsum, eget blandit elit vulputate vel. Etiam sit amet dapibus eros, eu semper orci. Donec gravida lacinia convallis. In nec est eget nisi lobortis mollis sit amet sit amet risus.
Proin sit amet enim mauris. Nunc imperdiet elit commodo dolor posuere pharetra. Suspendisse nec dapibus mi, at tincidunt neque. Duis blandit eget urna ac porta. Morbi et ipsum nec mi sagittis eleifend vitae at tellus. Maecenas vel hendrerit arcu. Fusce ac sem justo. Nullam mauris magna, pharetra sit amet quam ac, ullamcorper venenatis magna. Donec eget nisi enim. Curabitur nec congue dolor. Suspendisse eget efficitur dolor. In varius libero sit amet lectus porta, ut vulputate dolor rutrum. In tincidunt dui sem, non pulvinar odio tincidunt a.
Aliquam at ante eu tellus consequat tincidunt vel sit amet velit. Aenean pharetra ullamcorper urna, luctus egestas nisl ultrices quis. Vestibulum sodales id tellus sed rutrum. Ut at accumsan tortor. Curabitur pretium magna viverra, viverra nulla et, tempor velit. Quisque massa nisl, rhoncus et volutpat a, tincidunt vel elit. Sed quis leo turpis. Pellentesque et est dictum, gravida magna id, aliquet leo. In odio arcu, dictum nec suscipit id, scelerisque id risus. Vestibulum ante ipsum primis in faucibus orci luctus et ultrices posuere cubilia curae; Aliquam consectetur luctus posuere. Aliquam nec ultricies ipsum.
Nam vulputate varius congue. Nullam quis leo sed enim aliquet feugiat nec nec felis. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Vivamus viverra, quam non ornare dignissim, dui nulla fermentum est, nec finibus orci mi nec ligula. Nam lobortis mauris lacus, eu placerat nulla vulputate ut. Nulla facilisi. Nunc molestie neque lorem, in facilisis nibh ornare in. Nunc nec metus nec nunc suscipit blandit. Duis cursus sollicitudin ipsum, tincidunt commodo sem. Aenean in feugiat purus. Proin suscipit pellentesque dolor ut pharetra. Sed sollicitudin nisl egestas sodales rutrum. Vivamus laoreet enim eleifend eros pellentesque laoreet. Nullam tincidunt massa ut felis imperdiet, ac aliquet libero mattis. Nulla tempor sem quis sapien malesuada aliquam. Sed rhoncus tortor a nisi scelerisque fermentum.
Suspendisse elit nulla, tempor eu posuere quis, aliquam in tortor. Donec vitae dictum lectus. Duis tempor vestibulum velit, ut convallis enim aliquet vitae. Sed at nibh et risus vulputate fermentum et vitae urna. Sed accumsan felis vitae est ornare semper. Aliquam rhoncus, massa ut tempor vulputate, augue tellus lobortis nunc, eget sollicitudin lorem dolor non nibh. Cras sagittis augue suscipit congue imperdiet. Nunc pharetra, neque non porta congue, magna tellus iaculis quam, vel pellentesque magna quam pulvinar nisi. Etiam quis condimentum ex. Nullam gravida velit eu justo accumsan tincidunt. Fusce ornare justo et mi egestas, vitae blandit est lobortis. Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis egestas.
Suspendisse potenti. Aenean vehicula pellentesque porttitor. Duis ut nulla in quam volutpat euismod vitae id libero. Etiam id scelerisque mauris, condimentum commodo libero. Nam eget ligula ac ipsum consequat iaculis. Donec mollis est a neque placerat porta. Vestibulum sed bibendum diam, nec pretium elit. Sed semper diam metus, ac condimentum neque fringilla vel. In eu felis id diam blandit eleifend. Duis eget sodales lorem. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Praesent tincidunt lacus vel vestibulum tincidunt. Maecenas auctor, mi eu cursus feugiat, ligula metus posuere turpis, vel ultrices purus dui a justo. Aliquam dolor elit, dapibus eu nunc a, luctus gravida magna. Orci varius natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus.
Nunc mattis mauris sed arcu malesuada, sit amet tempus libero eleifend. Sed pretium lobortis auctor. Curabitur in magna dolor. Proin ac libero ac justo volutpat rhoncus a sit amet mi. Donec vitae libero cursus, vehicula massa in, efficitur ipsum. Cras suscipit accumsan dolor egestas interdum. Mauris rutrum justo non egestas aliquet. Ut eleifend odio at turpis mollis vulputate. Nunc id leo suscipit, facilisis ipsum nec, egestas dui. Aliquam est leo, volutpat eget tellus vitae, luctus finibus metus.
Vivamus nec porta sapien. Etiam a blandit est, sollicitudin feugiat est. Interdum et malesuada fames ac ante ipsum primis in faucibus. Sed venenatis magna rhoncus nulla pretium convallis. Sed quis lectus pretium, vehicula sem elementum, iaculis ipsum. Praesent semper vestibulum tellus nec consectetur. Morbi vel eleifend erat. Nullam elit lorem, laoreet a ex nec, malesuada rutrum tellus. Mauris porta tempus bibendum. Nulla."""
Then, you need a function to insert the important piece of information. Note that the important information can be any text; you only need to ensure the text states a specific fact. I will use a text that says the company is on floor 27 and then proceed to ask the model which floor the company is on.
important_information = "The company is on floor number 27. "
def insert_text(full_text, text_to_insert, index):
    assert index >= 0 and index <= len(full_text)
    return f"{full_text[:index]} {text_to_insert} {full_text[index:]}"
I then insert the important information in different locations of the random Latin text and test if the model can extract the important piece of information from the different areas of the context. This will ensure that the model can use its entire context length and not just certain parts of its context. The code to insert the text and prompt the model for the answer is:
import numpy as np
# get 10 indices evenly split over length of random text
indices = np.linspace(0, len(random_text), 10, dtype=int)
responses = []
for idx in tqdm(indices):
    random_text_with_info = insert_text(random_text, important_information, idx)
    assert important_information in random_text_with_info
    prompt = f"In the following text: {random_text_with_info}, what is the floor number of the company?"
    print("PROMPT:", prompt)
    response = prompt_wizard(prompt)
    responses.append(response)
You can then manually check the responses to see how often the model is correct, or do a rough automatic check as sketched below.
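As a rough automatic check (a minimal sketch; it simply assumes that any response mentioning 27 is correct, which can overcount), you can scan the responses for the floor number:
# Count how many responses mention the correct floor number.
# This is only a rough proxy for correctness; a response could mention 27
# in the wrong context, so a manual read-through is still recommended.
correct = sum("27" in response for response in responses)
print(f"{correct}/{len(responses)} responses mention the correct floor number")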
When running the information extraction ten times, the model correctly extracts the company’s floor number six out of ten times. This is decent performance, as the task is made more difficult by the mix of languages in the text, although the floor number is stated explicitly. Furthermore, the model is mostly correct when the important piece of information appears later in the context. The important information is inserted at roughly 1/10th intervals of the text (the first run places it at the beginning of the text, the second run at around 1/10 of the full text length, and the tenth run at the end of the text). This implies WizardLM2 utilizes the later parts of its context well but struggles to use the beginning of its context.
My overall thoughts
My overall thoughts about the WizardLM2 model are mixed. I am disappointed by the model’s performance on the concise question-answering task, where it gave an incorrect response to the first digits of Euler’s number. The model also failed to respond with only the answer, adding a lot of unnecessary text around it. This is in contrast to Microsoft’s Phi3 model, which I have shown performs well on this task in my Phi3 review article.
However, on the output formatting task, WizardLM2 creates valid JSON objects around 16% of the time, which is much better than the 9% the Phi3 model achieved. Unfortunately, only managing to create a valid object 16% of the time is still a severe restriction of the WizardLM2 model, though less quantized versions or the larger 8x22B version of WizardLM2 might perform better here.
Lastly, the information extraction task was run, where WizardLM2 correctly extracted the important information in 6/10 instances. This is decent, but the most noteworthy part of this experiment is that WizardLM2 is better at using the later part of its context and struggles with information placed early in the context. This can be highly problematic, considering important information can be placed anywhere in the context, for example, in a RAG system. Furthermore, the information extraction experiment only used around 1,000 words, approximately 1,333 tokens, while the WizardLM2 context length is 8K. It is worrying that the model struggles with contexts of only around 1,333 tokens when the maximum context length is 8K.
Conclusion
This article started with my motivation for keeping up with the latest models within machine learning, using websites like PapersWithCode, GitHub Trending, and HuggingFace. The article then discussed implementing and running WizardLM2 locally using the Ollama application and Python package. Continuing, I ran three different tests:
- concise question answering
- correct formatted output
- information extraction
These tests were used to better understand how well the WizardLM2 model performs. Lastly, I gave my own thoughts on the WizardLM2 model. Overall, the model’s performance is inconsistent and falls short of the capabilities we require of today’s language models.