
It’s easy to assume that the only way we can perform inference with LLMs made up of billions of parameters is with a GPU. While it’s true that GPUs provide significant acceleration over CPUs in deep learning, the hardware should always be selected based on the use case. For example, suppose your end users only need a response every 30 seconds. In that case, there are diminishing returns in struggling (financially and logistically) to reserve accelerators that deliver answers in under 30 seconds.

This all comes back to a fundamental principle: being a "Compute-Aware AI Developer" – working backward from the goals of your application to the right software and hardware. Imagine starting a home project like hanging a new shelf and reaching straight for the sledgehammer without even considering that a smaller, more precise hammer would be the right tool for the job.
In this article, we will perform inference with Falcon-7b and Falcon-40b on a 4th Generation Xeon CPU using Hugging Face Pipelines. Falcon-40b is a 40-billion-parameter, decoder-only model developed by the Technology Innovation Institute (TII) in Abu Dhabi. It outperforms several models like LLaMA, StableLM, RedPajama, and MPT, and it uses the FlashAttention method to deliver faster, more optimized inference, with significant speed improvements across different tasks.
Environment Setup
Once you have accessed your Xeon compute instance, you must secure enough storage to download the checkpoints and model shards for Falcon. We recommend at least 150 GB of storage if you want to test both the 7-billion and 40-billion Falcon versions. You also need enough RAM to load the model into memory and enough cores to run the workload efficiently. We successfully ran both Falcon versions on a 32-core, 64 GB RAM VM (4th Gen Xeon) on the Intel Developer Cloud. However, this is only one of many valid compute specifications, and further testing would likely improve performance.
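If you want a rough, upfront check that an instance meets these numbers before downloading anything, a few lines of standard-library Python are enough. The snippet below is purely illustrative (the 150 GB figure simply mirrors the recommendation above, and the RAM query assumes a Linux host):
import os
import shutil

# Free disk space on the root volume, in GB
free_gb = shutil.disk_usage("/").free / 1e9
print(f"Free disk space: {free_gb:.0f} GB (aim for ~150 GB to test both Falcon versions)")

# Logical CPU cores available to the workload
print(f"CPU cores: {os.cpu_count()}")

# Total physical RAM via sysconf (Linux)
ram_gb = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 1e9
print(f"Total RAM: {ram_gb:.0f} GB")
# compute-check.py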
- Install miniconda. You can find the latest version on their website: https://docs.conda.io/en/latest/miniconda.html
- Create a conda environment
conda create -n falcon python==3.8.10
- Activate your conda environment
conda activate falcon
- Install dependencies
pip install -r requirements.txt
You can find the contents of the requirements.txt file below, and an optional snippet for verifying the installation follows this list.
transformers==4.29.2
torch==2.0.1
accelerate==0.19.0
einops==0.6.1
# requirements.txt
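Once the environment is active, an optional sanity check like the one below (not part of the setup steps themselves, and the filename is only a suggestion) confirms that the pinned packages import cleanly:
import torch
import transformers
import accelerate
import einops

# Print the installed versions of the pinned dependencies
print(f"transformers {transformers.__version__}")
print(f"torch {torch.__version__}")
print(f"accelerate {accelerate.__version__}")
print(f"einops {einops.__version__}")
# verify-env.py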
Running Falcon with Hugging Face Pipelines
Hugging Face pipelines provide a simple and high-level interface for applying pre-trained models to various natural language processing (NLP) tasks, such as text classification, named entity recognition, text generation, and more. These pipelines abstract away the complexities of model loading, tokenization, and inference, allowing users to quickly utilize state-of-the-art models for NLP tasks with just a few lines of code.
Below is a convenient script you can run in the cmd/terminal to experiment with the raw pre-trained Falcon models.
from transformers import AutoTokenizer
import transformers
import torch
import argparse
import time


def main(FLAGS):
    # Resolve the Hugging Face model ID from the requested Falcon version
    model = f"tiiuae/falcon-{FLAGS.falcon_version}"

    tokenizer = AutoTokenizer.from_pretrained(model, trust_remote_code=True)

    # Build a text-generation pipeline that loads Falcon in bfloat16
    generator = transformers.pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
        torch_dtype=torch.bfloat16,
        trust_remote_code=True,
        device_map="auto",
    )

    user_input = "start"

    # Simple interactive loop; type "stop" to exit
    while user_input != "stop":
        user_input = input(f"Provide Input to {model} parameter Falcon (not tuned): ")

        start = time.time()

        if user_input != "stop":
            sequences = generator(
                f""" {user_input}""",
                max_length=FLAGS.max_length,
                do_sample=False,
                top_k=FLAGS.top_k,
                num_return_sequences=1,
                eos_token_id=tokenizer.eos_token_id,
            )

            inference_time = time.time() - start

            for seq in sequences:
                print(f"Result: {seq['generated_text']}")

            print(f"Total Inference Time: {inference_time} seconds")


if __name__ == "__main__":
    parser = argparse.ArgumentParser()

    parser.add_argument('-fv',
                        '--falcon_version',
                        type=str,
                        default="7b",
                        help="select 7b or 40b version of falcon")
    parser.add_argument('-ml',
                        '--max_length',
                        type=int,
                        default=25,
                        help="used to control the maximum length of the generated text in text generation tasks")
    parser.add_argument('-tk',
                        '--top_k',
                        type=int,
                        default=5,
                        help="specifies the number of highest probability tokens to consider at each step")

    FLAGS = parser.parse_args()
    main(FLAGS)
# falcon-demo.py
To run the script (falcon-demo.py), call it with python and pass any parameters you want to override:
python falcon-demo.py --falcon_version "7b" --max_length 25 --top_k 5
The script has 3 optional parameters to help control the execution of the Hugging Face pipeline:
- falcon_version: allows you to select from Falcon’s 7 billion or 40 billion parameter versions.
- max_length: used to control the maximum length of the generated text in text generation tasks.
- top_k: specifies the number of highest probability tokens to consider at each step.
You can hack the script to add/remove/edit the parameters. What is important is that you now have access to one of the most powerful open-source models ever released!
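For example, a hypothetical --do_sample flag (not part of the script above) could be wired into the argument parser to toggle sampling in the pipeline call; the sketch below shows only the parser side of that change:
# Illustrative only: extending the argument parser with a hypothetical --do_sample flag
import argparse

parser = argparse.ArgumentParser()
# ...the existing falcon-demo.py arguments would remain here...
parser.add_argument('-ds',
                    '--do_sample',
                    action='store_true',
                    help="enable sampling instead of greedy decoding")
FLAGS = parser.parse_args()
# main() would then pass do_sample=FLAGS.do_sample to the generator call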
Playing with Raw Falcon
Raw Falcon is not tuned for any particular purpose, so it will likely spew nonsense (Figure 2). Still, this doesn’t stop us from asking a few questions to test it out. When the script is done downloading the model and creating the pipeline, you will be prompted to provide input to the model. When you’re ready to stop, type "stop".

The script prints the inference time to give you an idea of how long the model takes to respond based on the current parameters provided to the pipeline and the compute you have made available to this workload.
Tip: You can significantly alter the inference time by adjusting the max_length parameter.
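For instance, rerunning the script with a larger value (the number below is just an example) takes noticeably longer to respond because more tokens are generated:
python falcon-demo.py --falcon_version "7b" --max_length 100 --top_k 5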
This tutorial is designed to share how to get Falcon running on a CPU with Hugging Face Transformers, but it does not explore options for further optimization on Intel CPUs. Libraries like the Intel Extension for Transformers offer capabilities to accelerate Transformer-based models through techniques like quantization, distillation, and pruning. Quantization is a widely used model compression technique that can reduce model size and improve inference latency, making it a valuable next step for enhancing the performance of this workflow.
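The snippet below is not the Intel Extension for Transformers workflow itself; it is only a rough illustration of the quantization idea using the dynamic quantization utility that ships with stock PyTorch, which swaps Linear layer weights for int8 versions. Whether this preserves accuracy or actually speeds up a model of Falcon's size is something you would need to benchmark:
# Rough sketch of post-training dynamic quantization with stock PyTorch (illustrative only)
import torch
from transformers import AutoModelForCausalLM

# Loading Falcon-7B in full precision requires roughly 28 GB of RAM
model = AutoModelForCausalLM.from_pretrained("tiiuae/falcon-7b", trust_remote_code=True)

# Replace nn.Linear layers with dynamically quantized int8 equivalents
quantized_model = torch.ao.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)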
Summary and Discussion
Foundational LLMs create opportunities for developers to build exciting AI applications. However, half the battle is usually finding a model whose license allows commercial derivatives. Falcon presents a rare opportunity because it combines strong performance with license flexibility.
Although Falcon is fairly democratized from an open-source perspective, its size creates new challenges for engineers/enthusiasts. This tutorial helped address this by combining Falcon’s "truly open" license, Hugging Face Pipelines, and the availability/accessibility of CPUs to give developers more access to this powerful model.
A few exciting things to try would be:
- Fine-tune Falcon to a specific task by leveraging the Intel Extension for PyTorch (a minimal inference-side sketch of the extension follows this list)
- Use model compression tools available in Intel Neural Compressor (INC) and Intel Extension for Transformers
- Play with the parameters of Hugging Face pipelines to optimize performance for your particular use case.
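For the first bullet, the sketch below offers only a first, inference-side look at the Intel Extension for PyTorch rather than a fine-tuning recipe. It assumes intel-extension-for-pytorch has been pip-installed (it is not in the requirements.txt above) and should be treated as a starting point, not a definitive workflow:
# Rough sketch: bfloat16 inference with Intel Extension for PyTorch on Xeon (illustrative only)
import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "tiiuae/falcon-7b"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name, trust_remote_code=True, torch_dtype=torch.bfloat16
)
model.eval()

# Apply IPEX's graph- and kernel-level optimizations for bfloat16 inference
model = ipex.optimize(model, dtype=torch.bfloat16)

inputs = tokenizer("Falcon is", return_tensors="pt")
with torch.no_grad(), torch.cpu.amp.autocast(dtype=torch.bfloat16):
    output = model.generate(**inputs, max_new_tokens=25)
print(tokenizer.decode(output[0], skip_special_tokens=True))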
Don’t forget to follow my profile for more articles like this!