Running Llama 2 on CPU Inference Locally for Document Q&A

Clearly explained guide for running quantized open-source LLM applications on CPUs using Llama 2, C Transformers, GGML, and LangChain

Kenneth Leung
Towards Data Science

Photo by NOAA on Unsplash

Third-party commercial large language model (LLM) providers like OpenAI’s GPT-4 have democratized LLM use via simple API calls. However, teams may still require self-managed or private deployments for model inference within the enterprise perimeter, for reasons of data privacy and compliance.

The proliferation of open-source LLMs has fortunately opened up a vast range of options for us, thus reducing our reliance on these third-party providers.

When we host open-source models on-premises or in the cloud, dedicated compute capacity becomes a key consideration. While GPU instances may seem the most convenient choice, costs can easily spiral out of control.

In this easy-to-follow guide, we will discover how to run quantized versions of open-source LLMs locally on CPU for retrieval-augmented generation (aka document Q&A) in Python. In particular, we will leverage the latest, highly performant Llama 2 chat model in this project.
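To give a flavour of what this looks like in practice, here is a minimal sketch of loading a quantized Llama 2 GGML model for CPU inference through LangChain’s CTransformers wrapper. The model filename and generation settings below are illustrative assumptions, not the exact values used later in the guide.

```python
# Minimal sketch: CPU inference with a quantized Llama 2 GGML model
# via the C Transformers library, wrapped by LangChain.
from langchain.llms import CTransformers

llm = CTransformers(
    # Hypothetical local path -- point this at the quantized GGML binary
    # you downloaded (e.g. a llama-2-7b-chat GGML file).
    model="models/llama-2-7b-chat.ggmlv3.q8_0.bin",
    model_type="llama",  # tells GGML which model architecture to load
    config={"max_new_tokens": 256, "temperature": 0.01},
)

# The wrapped model can be called directly with a prompt string.
print(llm("Explain quantization in one sentence."))
```

In a full retrieval-augmented generation setup, this `llm` object would be combined with a document retriever inside a LangChain question-answering chain.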

Contents

(1) Quick Primer on Quantization


Responses (31)


Why do we embed the documents with a different model than the LLM we use later for queries? That does not make sense to me. Isn't the embedding space totally different between the two models?


It would be great if you could (it is on your list) do this tutorial for GPUs. Most of us doing AI already have desktops and laptops with an Nvidia GPU, so it would be great to get the speed, and even a bigger Llama 2 model, by implementing this on a GPU instead of a CPU.
Thanks, Steve
