Set up a local LLM on CPU with chat UI in 15 minutes

This blog post shows how to run an LLM locally and how to set up a ChatGPT-like GUI in four easy steps.

Kasper Groes Albin Ludvigsen
Towards Data Science


Thanks to the global open source community, it is now easier than ever to run performant large language models (LLMs) on consumer laptops or CPU-based servers and easily interact with them through well-designed graphical user interfaces.

This is particularly valuable to organizations that are not allowed, or not willing, to use services that require sending data to a third party.

This tutorial shows how to set up a local LLM with a neat ChatGPT-like UI in four easy steps. If you have the prerequisite software installed, it will take you no more than 15 minutes of work (excluding the computer processing time used in some of the steps).

This tutorial assumes you have the following installed on your machine:

  • Ollama
  • Docker
  • React
  • Python and common packages including transformers

Now let’s get going.

Step 1 – Decide which Huggingface LLM to use

The first step is to decide which LLM you want to run locally. Maybe you already have an idea. Otherwise, for English, the instruct version of Mistral 7b seems to be the go-to choice. For Danish, I recommend Munin-NeuralBeagle, although it's known to over-generate tokens (perhaps because it's a merge of a model that was not instruction fine-tuned). For other Scandinavian languages, see ScandEval's evaluation of Scandinavian generative models.

Once you've decided which LLM to use, copy the Huggingface "path" to the model. For Mistral 7b it would be "mistralai/Mistral-7B-v0.1". You'll need it in the next step.
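If you want to sanity-check that you copied the path correctly, you can, for example, load the model's tokenizer with the transformers library (one of the prerequisites). This is just an optional check, and it assumes you have access to the model on Huggingface (some models require you to accept their terms first):

from transformers import AutoTokenizer

# The Huggingface "path" copied from the model page
MODEL_ID = "mistralai/Mistral-7B-v0.1"

# Downloads only the tokenizer files, not the full model weights
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
print(tokenizer("Hello, local LLM!"))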

Step 2 – Quantize the LLM

The next step is to quantize your chosen model, unless you selected a model that is already quantized. If your model's name ends in GGUF or GPTQ, it's already quantized.

Quantization is a technique that converts the weights of a model (its learned parameters) to a smaller data type than the original, e.g. from fp16 to int4. This makes the model take up less memory and also makes inference faster, which is a nice feature if you're running on CPU.
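To get a feel for the savings, here is a back-of-the-envelope calculation for a 7B-parameter model (weights only; runtime memory such as the KV cache comes on top):

# Approximate memory footprint of the weights of a 7B-parameter model
params = 7_000_000_000

fp16_gb = params * 2 / 1e9    # 2 bytes per weight  -> ~14 GB
int4_gb = params * 0.5 / 1e9  # 0.5 bytes per weight -> ~3.5 GB

print(f"fp16: ~{fp16_gb:.0f} GB, int4: ~{int4_gb:.1f} GB")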

The script quantize.py in my repo local_llm is adapted from Maxime Labonne's fantastic Colab notebook (see his LLM course for other great LLM resources). You can use his notebook or my script. The method has been tested on Mistral and Mistral-like models.

To quantize, first clone my repo:

git clone https://github.com/KasperGroesLudvigsen/local_llm.git

Now, change the MODEL_ID variable in the quantize.py file to reflect your model of choice. This is where you need the Huggingface "path" that you copied in the first step. So if you want to use Mistral 7b:

MODEL_ID = "mistralai/Mistral-7B-v0.1"

Then, in your terminal, run the script:

python quantize.py

This will take some time. While the quantization process runs, you can proceed to the next step.

The script will produce a directory containing the model files for the model you selected, as well as the quantized version of the model, which has the file extension ".gguf".
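For the curious, the sketch below shows roughly what a llama.cpp-based quantization flow looks like: download the original weights, convert them to a fp16 GGUF file, then quantize that file down to 4 bits. The exact script and binary names depend on your llama.cpp version (newer versions ship convert_hf_to_gguf.py and llama-quantize), so treat this as an illustration of the idea rather than a drop-in replacement for quantize.py:

import subprocess
from huggingface_hub import snapshot_download

MODEL_ID = "mistralai/Mistral-7B-v0.1"  # the Huggingface "path" from step 1
local_dir = "mistral7b"  # matches the folder name used in the Modelfile in step 3

# 1) Download the original weights from Huggingface
snapshot_download(repo_id=MODEL_ID, local_dir=local_dir)

# 2) Convert the Huggingface checkpoint to a fp16 GGUF file
#    (script name and flags vary between llama.cpp versions)
subprocess.run(
    ["python", "llama.cpp/convert_hf_to_gguf.py", local_dir,
     "--outtype", "f16", "--outfile", f"{local_dir}/model.f16.gguf"],
    check=True,
)

# 3) Quantize the fp16 GGUF down to 4 bits (Q4_K_M is a common choice);
#    the binary's location depends on how you built llama.cpp
subprocess.run(
    ["llama.cpp/llama-quantize",
     f"{local_dir}/model.f16.gguf", f"{local_dir}/quantized.gguf", "Q4_K_M"],
    check=True,
)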

Step 3 – Build and run the Ollama version of the model

We will run the model with Ollama. Ollama is a software framework that neatly wraps a model into an API. Ollama also integrates easily with various front ends as we’ll see in the next step.

To build an Ollama image of the model, you need a so-called model file which is a plain text file that configures the Ollama image. If you’re acquainted with Dockerfiles, Ollama’s model files will look familiar.

In the example below, we first specify which LLM to use. We're assuming that there is a folder in your repo called mistral7b and that the folder contains a model called quantized.gguf. Then we set the model's context window to 8,000 tokens, roughly Mistral 7b's maximum context size. In the Modelfile, you can also specify which prompt template to use, and you can specify stop tokens.

Save the model file, e.g. as Modelfile.txt.

For more configuration options, see Ollama’s GitHub.

FROM ./mistral7b/quantized.gguf

PARAMETER num_ctx 8000

TEMPLATE """<|im_start|>system {{ .System }}<|im_end|><|im_start|>user {{ .Prompt }}<|im_end|><|im_start|>assistant<|im_end|>"""

PARAMETER stop <|im_end|>
PARAMETER stop <|im_start|>user
PARAMETER stop <|end|>

Now that you have made the Modelfile, build an Ollama image from the Modelfile by running this from your terminal. This will also take a few moments:

ollama create choose-a-model-name -f <location of the file e.g. ./Modelfile>

When the “create” process is done, start the Ollama server by running this command. This will expose all your Ollama model(s) in a way that the GUI can interact with them.

ollama serve
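Before wiring up a GUI, you can check that the server responds by calling Ollama's REST API directly, for example with Python's requests. The model name choose-a-model-name below is assumed to match whatever you passed to the create command above:

import requests

response = requests.post(
    "http://localhost:11434/api/generate",  # Ollama's default address and port
    json={
        "model": "choose-a-model-name",  # the name you gave in `ollama create`
        "prompt": "Briefly explain what quantization does to an LLM.",
        "stream": False,  # return one JSON object instead of a stream of chunks
    },
    timeout=300,
)
print(response.json()["response"])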

Step 4 – Set up chat UI for Ollama

The next step is to set up a GUI to interact with the LLM. Several options exist. In this tutorial, we'll use "Chatbot Ollama" – a very neat GUI that has a ChatGPT feel to it. "Ollama WebUI" is a similar option. You can also set up your own chat GUI with Streamlit, as sketched below.
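If you'd rather roll your own GUI, here is a minimal Streamlit sketch that sends the conversation to Ollama's chat endpoint. Save it as, say, app.py and launch it with "streamlit run app.py"; the model name choose-a-model-name from step 3 is assumed. The rest of this tutorial sticks with Chatbot Ollama:

import requests
import streamlit as st

st.title("Local LLM chat")

# Keep the conversation history across Streamlit reruns
if "messages" not in st.session_state:
    st.session_state.messages = []

# Replay the conversation so far
for msg in st.session_state.messages:
    with st.chat_message(msg["role"]):
        st.write(msg["content"])

if prompt := st.chat_input("Ask the local model something"):
    st.session_state.messages.append({"role": "user", "content": prompt})
    with st.chat_message("user"):
        st.write(prompt)

    # Send the whole conversation to the Ollama server started in step 3
    reply = requests.post(
        "http://localhost:11434/api/chat",
        json={
            "model": "choose-a-model-name",  # the name from `ollama create`
            "messages": st.session_state.messages,
            "stream": False,
        },
        timeout=300,
    ).json()["message"]["content"]

    st.session_state.messages.append({"role": "assistant", "content": reply})
    with st.chat_message("assistant"):
        st.write(reply)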

By running the commands below, you'll clone the Chatbot Ollama GitHub repo, move into the project directory, and install its dependencies:

git clone https://github.com/ivanfioravanti/chatbot-ollama.git
cd chatbot-ollama
npm ci

The next step is to build a Docker image from the Dockerfile. If you're on Linux, you need to change the OLLAMA_HOST environment variable in the Dockerfile from http://host.docker.internal:11434 to http://localhost:11434.

Now, build the Docker image and run a container from it by executing these commands from a terminal. Run them from the root of the project (the chatbot-ollama directory).

docker build -t chatbot-ollama .

docker run -p 3000:3000 chatbot-ollama

The GUI is now running inside a Docker container on your local computer. In the terminal, you'll see the address at which the GUI is available (e.g. "http://localhost:3000").

Visit that address in your browser, and you should now be able to chat with the LLM through the Ollama Chat UI.

Conclusion

This concludes this brief tutorial on how to easily set up a chat UI that lets you interact with an LLM running on your local machine. Easy, right? Only four steps were required:

  1. Select a model on Huggingface
  2. (Optional) Quantize the model
  3. Wrap model in Ollama image
  4. Build and run a Docker container that wraps the GUI

Remember, it’s all made possible because open source is awesome 👏

GitHub repo for this article: https://github.com/KasperGroesLudvigsen/local_llm

That’s it! I hope you enjoyed the story. Let me know what you think!

