
Introduction
Integrating Large Language Models (LLMs) into real-world enterprise applications is a pressing need. However, generative AI is evolving so quickly that most teams struggle to keep up with the advancements.
One option is to use managed services like the ones provided by OpenAI. These services offer a streamlined path to production, yet for those who either lack access to them or prioritize factors like security and privacy, an alternative avenue emerges: open-source tools.
Open-source generative AI tools are extremely popular right now, and companies are scrambling to get their AI-powered apps out the door. In the rush to build quickly, companies often forget that to truly gain value from generative AI they need to build production-ready apps, not just prototypes.
In this article, I want to show you the performance difference for Llama 2 using two different inference methods. The first method will be a containerized Llama 2 model served via FastAPI, a popular choice among developers for serving models as REST API endpoints. The second method will be the same containerized model served via Text Generation Inference (TGI), an open-source library developed by Hugging Face to easily deploy LLMs.
Both methods are intended for real-world use in business applications, but it's important to realize that they don't scale the same way. We'll dive into this comparison to see how each performs and to understand the differences better.
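To make the first setup concrete, here is a minimal sketch of what serving a Hugging Face Llama 2 checkpoint behind FastAPI might look like. The model id, route, and request schema are my own illustrative choices, not necessarily the exact code behind the benchmarks later in this post.

```python
# Minimal sketch of serving Llama 2 behind a FastAPI endpoint.
# The model id, route, and request schema are illustrative assumptions.
import torch
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "meta-llama/Llama-2-13b-chat-hf"  # assumed checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)

app = FastAPI()

class GenerateRequest(BaseModel):
    prompt: str
    max_new_tokens: int = 128

@app.post("/generate")
def generate(req: GenerateRequest):
    inputs = tokenizer(req.prompt, return_tensors="pt").to(model.device)
    # Requests are generated one at a time -- there is no batching here,
    # which is exactly the limitation discussed later in this post.
    output_ids = model.generate(**inputs, max_new_tokens=req.max_new_tokens)
    text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    return {"generated_text": text}
```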
What powers LLM inference at OpenAI and Cohere
Have you ever wondered why ChatGPT is so fast?
Large language models require a ton of computing power, and due to their sheer size, they often need multiple GPUs. When working with large GPU clusters, companies have to be very mindful of how their compute is being utilized.
LLM providers like OpenAI run large GPU clusters to power inference for their models. To squeeze as much performance as possible out of their GPUs, they use tools like the NVIDIA Triton Inference Server to increase throughput and reduce latency.

While Triton is highly performant and has many benefits, it's quite difficult for developers to use. Many of the newer models on Hugging Face are not supported on Triton, and the process of adding support for them is not trivial.
This is where Text Generation Inference (TGI) comes in. This tool offers many of the same performance improvements as Triton, but it’s user-friendly and works well with Hugging Face models.
LLM inference optimizations
Before we jump into the benchmarks, I want to cover a few of the optimization techniques used by modern inference servers such as TGI to speed up LLMs.
1. Tensor Parallelism
LLMs are often too large to fit on a single GPU. Using a concept called model parallelism, a model can be split across multiple GPUs. Tensor parallelism is a type of model parallelism that splits the model's weight matrices into shards, each GPU processes its shard in parallel, and the partial results are then combined.

To put it simply, imagine you’re working on a big jigsaw puzzle, but it’s so huge that you can’t put all the pieces on one table. So, you decide to work with your friend. You split the puzzle into sections, and each of you works on your own section at the same time. This way, you can solve the puzzle faster.
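As a toy illustration of the idea (not how TGI implements it internally), here is a sketch of splitting a single linear layer's weight matrix column-wise across two shards and combining the partial results. It runs on CPU; a real implementation places each shard on a different GPU and uses collective communication to gather the outputs.

```python
# Toy sketch of tensor parallelism for one linear layer: the weight matrix
# is split column-wise into two shards, each shard computes its partial
# output, and the results are concatenated back together.
import torch

hidden_size, output_size = 4096, 4096
x = torch.randn(1, hidden_size)           # one token's hidden state
W = torch.randn(hidden_size, output_size)

# Full (single-device) computation, for reference.
y_full = x @ W

# Shard the weight matrix across two "devices".
W_shard_0, W_shard_1 = W.chunk(2, dim=1)  # each is (hidden_size, output_size // 2)

# Each device computes its slice of the output in parallel.
y_0 = x @ W_shard_0
y_1 = x @ W_shard_1

# An all-gather step stitches the partial outputs back together.
y_parallel = torch.cat([y_0, y_1], dim=1)

print(torch.allclose(y_full, y_parallel))  # True
```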
2. Continuous batching
When you make an API call to an LLM, it processes the request in one go and returns the output. If you make 5 API calls, it will process each one sequentially. This essentially means we have a batch size of 1: only 1 request can be processed at a given time. As you might guess, this design isn't ideal because each new request has to wait for the one before it to complete.

By increasing the batch size you can handle more requests in parallel. With a batch size of 4, you would be able to handle 4 of the 5 API calls in parallel. You would have to wait until all 4 requests have finished before handling the 5th request.

Continuous batching builds on the idea of using a bigger batch size and goes a step further by immediately picking up new requests as they come in. For example, say your GPU can handle a batch size of 4, meaning it can process 4 requests in parallel. If you make 5 requests, 4 of them are processed in parallel, and as soon as one finishes, the freed slot immediately starts working on the 5th request.
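Here is a simplified sketch of that scheduling idea using asyncio. It only models the queueing behaviour (slots freeing up and queued requests starting immediately), not the token-level batching that TGI actually performs; the slot count and sleep times are arbitrary.

```python
# Simplified sketch of continuous-batching-style scheduling: at most
# MAX_BATCH requests run at once, and as soon as one finishes, the next
# queued request starts immediately.
import asyncio
import random
import time

MAX_BATCH = 4
slots = asyncio.Semaphore(MAX_BATCH)

async def handle_request(request_id: int) -> None:
    async with slots:  # wait for a free slot in the batch
        started = time.perf_counter()
        # Stand-in for generating a variable-length response.
        await asyncio.sleep(random.uniform(0.5, 2.0))
        print(f"request {request_id} finished in {time.perf_counter() - started:.2f}s")

async def main() -> None:
    # Fire 5 requests; 4 run in parallel, the 5th starts as soon as
    # any of the first 4 completes.
    await asyncio.gather(*(handle_request(i) for i in range(5)))

asyncio.run(main())
```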
Llama 2 Benchmarks
Now that we have a basic understanding of the optimizations that allow for faster LLM inferencing, let’s take a look at some practical benchmarks for the Llama-2 13B model.
There are 2 main metrics I wanted to test for this model:
- Throughput (tokens/second)
- Latency (time it takes to complete one full inference)
I wanted to compare the performance of Llama inference using two different instances. One instance runs via FastAPI, while the other operates through TGI. Both setups utilize GPUs for computation.
Note: Neither of the two instances quantizes the weights of the model.
The TGI setup employs two GPUs (NVIDIA RTX A4000) to leverage tensor parallelism, whereas FastAPI relies on a single, albeit more powerful, GPU (NVIDIA A100).
Making a direct comparison between these two approaches is a bit tricky since FastAPI doesn’t allow the model to be distributed across two GPUs. To level the playing field, I opted to equip the FastAPI instance with a more powerful GPU.
Each API request made to the model had the exact same prompt and the generated output token limit was set to 128 tokens.
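For reference, the load test against the TGI instance boils down to something like the following sketch: fire a batch of concurrent requests at TGI's /generate endpoint and record latency and tokens/second for each. The endpoint URL, prompt, and concurrency level are placeholders rather than the exact values used; the FastAPI instance was exercised the same way against its own route.

```python
# Sketch of the benchmark client: send N concurrent requests to the TGI
# /generate endpoint and record latency and tokens/second per request.
# The URL, prompt, and concurrency level are illustrative placeholders.
import time
from concurrent.futures import ThreadPoolExecutor

import requests

TGI_URL = "http://localhost:8080/generate"   # assumed local TGI endpoint
PROMPT = "Explain what an inference server does."
MAX_NEW_TOKENS = 128
CONCURRENT_REQUESTS = 100

def one_request(_: int) -> tuple[float, float]:
    payload = {
        "inputs": PROMPT,
        "parameters": {"max_new_tokens": MAX_NEW_TOKENS},
    }
    start = time.perf_counter()
    resp = requests.post(TGI_URL, json=payload, timeout=120)
    latency = time.perf_counter() - start
    resp.raise_for_status()
    # Approximate per-request throughput as generated tokens / wall-clock time.
    return latency, MAX_NEW_TOKENS / latency

with ThreadPoolExecutor(max_workers=CONCURRENT_REQUESTS) as pool:
    results = list(pool.map(one_request, range(CONCURRENT_REQUESTS)))

latencies, throughputs = zip(*results)
print(f"avg latency: {sum(latencies) / len(latencies):.2f}s")
print(f"avg throughput: {sum(throughputs) / len(throughputs):.2f} tokens/s per request")
```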
Throughput Results:


Analysis:
- In both cases, throughput decreases as the number of inference requests increases
- Batching significantly improves LLM throughput, which is why TGI comes out ahead
- Although the FastAPI instance has more GPU memory (VRAM) available to handle requests in larger batches, it doesn't use it efficiently
Latency Results:


Analysis:
- Tensor parallelism results in a nearly 5X reduction in latency for TGI!
- As the number of requests increases, the latency for the FastAPI-based instance surpasses 100 seconds
As is evident from the results, optimized inference servers far outperform readily available API wrappers. As a final test, I wanted to evaluate the performance of TGI when the generated output token limit is increased from 128 to 256 tokens.
Throughput TGI 128 tokens vs 256 tokens:

As you can see, the throughput is quite similar despite doubling the number of generated tokens. One thing to note that’s not on this chart is that at 300 concurrent requests, the throughput dwindled to approximately 2 tokens/sec while producing a 256-token output. At this throughput, the latency was over 100 seconds per request and there were multiple request timeouts. Due to these substantial performance limitations, the results from this scenario were excluded from the chart.
Latency TGI 128 tokens vs 256 tokens:

Unlike the throughput, the latency clearly spikes when generating longer sequences of text. Adding more GPUs would help decrease the latency, but it would come at a financial cost.
Conclusion
My goal for this blog post was to compare the real-world performance of LLMs at scale (hundreds of concurrent requests). It's often easy to deploy models behind an API wrapper such as FastAPI, but in the case of LLMs, you'd be leaving a significant amount of performance on the table.
Even with the optimization techniques used by modern inference servers, the performance cannot match that of a managed service like ChatGPT. OpenAI almost certainly runs several large GPU clusters to power inference for its models, along with its own in-house optimization techniques.
However, for generative AI use cases, enterprises will likely have to adopt inference servers, as they are far more scalable and reliable than traditional model deployment techniques.
Thanks for reading!
Peace.