What is this about?
In December 2023, Apple released MLX, an array framework for machine learning on Apple silicon developed by its machine learning research team. This tutorial will explore the framework and demonstrate deploying the Mistral-7B model locally on a MacBook Pro (MBP). We’ll set up a local chat interface to interact with the deployed model and test its inference performance in terms of tokens generated per second. Additionally, we’ll delve into the MLX API to understand the available levers for altering the model’s behaviour and influencing the generated text.
As usual, the code is available in a public GitHub repository: https://github.com/marshmellow77/mlx-deep-dive
Why is this important?
Apple’s new machine learning framework, MLX, offers notable advantages over other deep learning frameworks thanks to its unified memory architecture for machine learning on Apple silicon. Unlike traditional frameworks such as PyTorch and JAX, which require costly data copying between CPU and GPU, MLX keeps data in shared memory accessible to both. This design eliminates the overhead of data transfers, facilitating faster execution, particularly with the large datasets common in machine learning. For complex ML tasks on Apple devices, MLX’s shared memory architecture could lead to significant speed-ups. This makes MLX highly relevant for developers looking to run models on-device, for example on iPhones.
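To make this concrete, here is a minimal sketch of the unified memory model (based on my understanding of the mlx core API): the same arrays can be consumed by operations dispatched to the CPU and the GPU without any explicit device transfers.
import mlx.core as mx

# Arrays live in unified memory; no .to("cuda") / .cpu() copies as in PyTorch
a = mx.random.normal((4096, 4096))
b = mx.random.normal((4096, 4096))

# The same buffers can be used by operations on either device
c = mx.matmul(a, b, stream=mx.gpu)  # runs on the GPU
d = mx.add(a, b, stream=mx.cpu)     # runs on the CPU

mx.eval(c, d)  # MLX is lazy; this forces the computation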
With Apple’s expertise in silicon design, MLX hints at the exciting capabilities that could be integrated into their chips for future on-device AI applications. The potential of MLX to accelerate and streamline ML tasks on Apple platforms makes it a framework developers should keep on their radar.
Initial setup
Before deploying the model, some setup is required. Firstly, it’s essential to install certain libraries. Remember to create a virtual environment before proceeding with the installations:
pip install mlx-lm
This library allows us to deploy a Large Language Model (LLM) locally, enabling it to run with just five lines of code:
from mlx_lm import load, generate
model, tokenizer = load("mistralai/Mistral-7B-Instruct-v0.2")
prompt = """<s>[INST] Hello world! [/INST]"""
response = generate(model, tokenizer, prompt=prompt)
print(response)
The first time this script is executed, it will download the model, which may take some time. In subsequent runs, the model will load from the local cache, significantly speeding up the process. Once the model is downloaded, we receive the response back:

This is quite impressive, but there’s a lot happening behind the scenes. Let’s pull back the curtains and gain a better understanding of what’s actually occurring:

Once the load() method is invoked, it checks whether the model is available locally on the machine. If it isn’t, the method downloads the model from the Hugging Face Model Hub, in our case from https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2/tree/main. After the weights are loaded, they are converted into MLX format. Additionally, the model will be quantized if specified in the configuration.
Quantization and readily converted models
Note that the configuration of the Mistral-7B-Instruct-v0.2 model does not specify quantization, which means we will load the model with its full weights. However, it is possible to convert the original weights of a model into MLX format (weights.npz) and quantize them at the same time using:
python convert.py --torch-path <path_to_torch_model> -q
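Depending on the version you have installed, the mlx_lm package may also expose a convert() helper that downloads a Hugging Face model and optionally quantizes it in one step. The following is only a hedged sketch; the exact arguments may differ between versions, so check your installed version’s documentation first:
# Hedged sketch: assumes your mlx_lm version provides a convert() helper
from mlx_lm import convert

convert(
    "mistralai/Mistral-7B-Instruct-v0.2",  # Hugging Face repo to convert
    quantize=True,                         # quantize the weights during conversion
)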
There is also an active MLX community on Hugging Face that has already converted several models into MLX format:

Note: As of now, we won’t be able to load these converted models with the code used above, since the load() method specifically requires model weights in the .safetensors format. To load models in the weights.npz format, we can take inspiration from this example.
Mistral instruction model
Another interesting aspect of the code is the prompt. LLMs are increasingly used in chatbot applications to enable more natural conversations. Unlike simply generating continuous text, chatbots need to understand dialogue context, which includes exchanges between different roles such as "user" and "assistant." The model input consists of a series of messages rather than a single passage. Each model typically has its own chat template, and we can see what Mistral’s chat template looks like on their model page:

In the sample code above, we have manually applied this template by inserting the special tokens into the string. Later, we will explore a more efficient method to achieve this.
Chatbot interface
Since we have downloaded the instruct version of the Mistral-7B model, we should maximise its potential by using it as a chatbot model. Fortunately, this can be easily done with Streamlit, an open-source framework that allows us to build a chatbot interface with just a few lines of code.
First, let’s install Streamlit:
pip install streamlit
We won’t cover the entire application (the full code is available on GitHub), but a few notable highlights include:
Loading the model
import streamlit as st
from mlx_lm import load

# Cache the model loading to avoid reloading every time
@st.cache_resource
def load_model():
    model, tokenizer = load("mistralai/Mistral-7B-Instruct-v0.2")
    return model, tokenizer

model, tokenizer = load_model()
Just like in our first example, we utilise the mlx_lm library to load our model. To prevent the model from being reloaded every time we send a new message to the chatbot, we take advantage of Streamlit’s cache management by using the st.cache_resource decorator.
Keeping track of the conversation
Similar to caching the model, we also need to keep track of the entire conversation. Chatbots remember previous parts of a conversation using a simple method: every time a new user input is received, the entire previous conversation is sent to the model in one long prompt. To keep track of the conversation, we initialise a stateful variable like this:
if 'messages' not in st.session_state:
    st.session_state['messages'] = []
We will then add user and assistant messages as follows:
st.session_state['messages'].append({"role": "user", "content": prompt})
...
st.session_state['messages'].append({"role": "assistant", "content": response})
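To see where these snippets fit, here is a condensed sketch of the surrounding chat loop (not the exact code from the repository); it uses Streamlit’s chat widgets st.chat_input and st.chat_message:
import streamlit as st
from mlx_lm import generate

# model and tokenizer come from load_model() above

# Re-render the conversation so far
for message in st.session_state['messages']:
    with st.chat_message(message["role"]):
        st.markdown(message["content"])

if prompt := st.chat_input("Your message"):
    st.session_state['messages'].append({"role": "user", "content": prompt})
    with st.chat_message("user"):
        st.markdown(prompt)

    # Format the full history into a single prompt (see the next section)
    formatted_conversation = tokenizer.apply_chat_template(
        st.session_state['messages'], tokenize=False
    )
    response = generate(model, tokenizer, prompt=formatted_conversation)

    st.session_state['messages'].append({"role": "assistant", "content": response})
    with st.chat_message("assistant"):
        st.markdown(response)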
Converting the conversation history into the required format
As observed earlier, the Mistral-7B-Instruct model requires a very specific format for its prompts to reflect the previous conversation. We can use the apply_chat_template() method of the tokenizer to convert the conversation history into one long prompt that adheres to this specification:
formatted_conversation = tokenizer.apply_chat_template(st.session_state['messages'], tokenize=False)
Let’s see how this works and ensure it’s functioning correctly. We can test it by running the following:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
chat = [
    {'role': 'user', 'content': 'Hello, how are you?'},
    {'role': 'assistant', 'content': "I'm good, thank you for asking. How can I assist you today?"},
    {'role': 'user', 'content': 'Can you tell me the weather forecast for London?'},
    {'role': 'assistant', 'content': 'Sure, the forecast for London is sunny with a high of 23 degrees.'},
    {'role': 'user', 'content': "Thank you, that's very helpful!"},
]
print(tokenizer.apply_chat_template(chat, tokenize=False))
We get as a result:
<s>[INST] Hello, how are you? [/INST]I'm good, thank you for asking. How can I assist you today?</s>[INST] Can you tell me the weather forecast for London? [/INST]Sure, the forecast for London is sunny with a high of 23 degrees.</s>[INST] Thank you, that's very helpful! [/INST]
It looks like the chat template method is working properly and we can use it to format our conversation history.
Time measurement
We will also measure the time it takes to receive a response and calculate how many tokens the model generates per second (TPS). Technically, this measure isn’t entirely accurate because it doesn’t account for the time the model takes to process the prompt (which increases as the conversation progresses). However, from a user experience perspective, this distinction is often less noticeable. Therefore, I will use this method for measurement.
If you’re interested in measuring the tokens generated per second specifically during the generation phase, i.e., after the model has processed the prompt, you might want to check out this example.
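For reference, the measurement used here boils down to something like the following sketch: wall-clock time around the generate() call, with the token count taken by re-tokenizing the response (so prompt processing time is included, as discussed above).
import time

# model, tokenizer and formatted_conversation as defined in the snippets above
start = time.time()
response = generate(model, tokenizer, prompt=formatted_conversation)
elapsed = time.time() - start

# Count the generated tokens by tokenizing the response text
num_tokens = len(tokenizer.encode(response))
print(f"{num_tokens} tokens in {elapsed:.1f} s -> {num_tokens / elapsed:.1f} TPS")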
Testing the chatbot
Once the code development is complete to our satisfaction, we can start the chatbot by typing:
streamlit run mlx-chatbot.py
This will automatically open a browser tab at http://localhost:8501/ (if it doesn’t, just type this address into your browser).
We can now start interacting with the model:


In the beginning, the token generation rate is around 15 tokens per second; later, as the conversation gets longer, it drops to 12–13. This is because the model needs more time to process the entire previous conversation, which is reflected in the way we measure TPS in this example.
Memory consumption
One of the key questions is, of course, how much memory is consumed by this chatbot and the locally deployed LLM. When running this on a MacBook Pro M1 Max with 64 GB of memory, the script initially uses 14 GB of memory (measured via top -o mem).

However, once the conversation grows longer, the memory consumption quickly shoots up to 46 GB and stays there. I suspect that this is the maximum amount of memory the program is allowed to use:

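If you want to monitor this from within the script rather than via top, one simple option is psutil (an extra dependency, not part of the chatbot code); a minimal sketch:
# Requires: pip install psutil
import os
import psutil

def log_memory_usage():
    """Print the resident memory of the current process in GB."""
    rss_gb = psutil.Process(os.getpid()).memory_info().rss / 1024**3
    print(f"Resident memory: {rss_gb:.1f} GB")

# Call this e.g. after every generate() to see how usage grows with the conversation
log_memory_usage()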
Exploring the MLX API
Sometimes we want our chatbot to behave differently, for example to give longer or shorter answers. In the example above, where I started a conversation with "Hello", the response was too verbose for my taste. Note that this is separate from setting the max_tokens parameter: if a model produces a longer message and max_tokens is set low, for instance at 50 new tokens, the response will simply be truncated.
Usually, there are other inference parameters that can be used to steer the model’s behaviour, such as repetition penalty or length penalty. Unfortunately, the MLX API does not (yet?) support these parameters. To understand which parameters are supported, we can delve deeper into the MLX code and inspect the generate() API:
parser.add_argument(
    "--prompt", default=DEFAULT_PROMPT, help="Message to be processed by the model"
)
parser.add_argument(
    "--max-tokens",
    "-m",
    type=int,
    default=DEFAULT_MAX_TOKENS,
    help="Maximum number of tokens to generate",
)
parser.add_argument(
    "--temp", type=float, default=DEFAULT_TEMP, help="Sampling temperature"
)
parser.add_argument("--seed", type=int, default=DEFAULT_SEED, help="PRNG seed")
So far, it only supports max_tokens, temp, and seed. However, I would expect that over time other inference parameters will be added, either by the MLX research team or by the community.
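In the Python API, these roughly map onto the generate() call and MLX’s random seed. A hedged sketch (argument names may differ between mlx_lm versions):
import mlx.core as mx
from mlx_lm import load, generate

mx.random.seed(42)  # corresponds to the --seed CLI argument

model, tokenizer = load("mistralai/Mistral-7B-Instruct-v0.2")
prompt = """<s>[INST] Hello world! [/INST]"""

# max_tokens and temp mirror the CLI flags above
response = generate(model, tokenizer, prompt=prompt, max_tokens=100, temp=0.7)
print(response)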
Conclusion
In this tutorial we deployed the very capable Mistral-7B model locally onto a MacBook Pro with 64 GB of memory and created a chatbot interface to interact with it. This is only the beginning, and it definitely points towards a future where local LLM deployment becomes more common and important.
There are many ways to continue and improve this tutorial, which I will leave up to the readers. Some ideas include:
- Exploring other LLMs and/or using quantized versions
- Improving the interface to add file upload, or fields to change the inference parameters (e.g. temperature, max tokens, etc)
- Measuring and monitoring memory consumption more closely and testing out the limits
- Advanced: Cloning the MLX framework and implementing other inference parameters
Heiko Hotz
👋 Follow me on Medium and LinkedIn to read more about Generative AI, Machine Learning, and Natural Language Processing.
👥 If you’re based in London join one of our NLP London Meetups.
