Running Llama 2 on CPU Inference Locally for Document Q&A

Clearly explained guide for running quantized open-source LLM applications on CPUs using Llama 2, C Transformers, GGML, and LangChain

Kenneth Leung
Towards Data Science

Photo by NOAA on Unsplash

Third-party commercial large language model (LLM) providers like OpenAI’s GPT-4 have democratized LLM use via simple API calls. However, teams may still require self-managed or private deployments for model inference within the enterprise perimeter, for reasons of data privacy and compliance.

The proliferation of open-source LLMs has fortunately opened up a vast range of options for us, thus reducing our reliance on these third-party providers.

When we host open-source models on-premises or in the cloud, dedicated compute capacity becomes a key consideration. While GPU instances may seem the most convenient choice, costs can easily spiral out of control.

In this easy-to-follow guide, we will discover how to run quantized versions of open-source LLMs locally on CPU for retrieval-augmented generation (aka document Q&A) in Python. In particular, we will leverage the latest, highly performant Llama 2 chat model in this project.
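To give a flavour of what this looks like in practice, here is a minimal sketch of loading a quantized Llama 2 GGML model for CPU inference through LangChain’s CTransformers wrapper. The model filename and generation settings below are illustrative assumptions, not the exact values used later in the guide.

```python
# Minimal sketch: CPU inference with a quantized Llama 2 GGML model
# via the C Transformers library, wrapped by LangChain.
from langchain.llms import CTransformers

llm = CTransformers(
    # Hypothetical local path -- point this at the quantized GGML binary
    # you downloaded (e.g. a llama-2-7b-chat GGML file).
    model="models/llama-2-7b-chat.ggmlv3.q8_0.bin",
    model_type="llama",  # tells GGML which model architecture to load
    config={"max_new_tokens": 256, "temperature": 0.01},
)

# The wrapped model can be called directly with a prompt string.
print(llm("Explain quantization in one sentence."))
```

In a full retrieval-augmented generation setup, this `llm` object would be combined with a document retriever inside a LangChain question-answering chain.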

Contents

(1) Quick Primer on Quantization


Responses (31)


Why do we embed the documents with a different model than the LLM we use later for queries? That does not make sense to me. Isn't the embedding space totally different between the two models?


It would be great if you could (it is on your list) do this tutorial for GPUs. Most of us doing AI already have desktops and laptops with an Nvidia GPU, so it would be great to get the speed, and even a bigger Llama 2 model, by implementing this on a GPU instead of a CPU.
Thanks, Steve
