
This article explains my personal experiences using five common methods for serving open source LLMs: AWS SageMaker, Hugging Face Inference Endpoints, Together.AI, vLLM and Petals.ml.
The struggle…
You’ve felt the pain, struggle and glory of serving your own fine-tuned open source LLM; however, you ultimately decided to return to OpenAI or Anthropic due to cost, inference time, reliability and technology challenges 🙁 You’ve also given up on renting an A100 GPU (many providers have GPUs fully booked until the end of 2023!). And you don’t have $100K to shell out for a 2-tier A100 server box. Still, you’re dreaming, and you really want to get open source to work for your solution. Perhaps your firm does not want to send its private data to OpenAI, or you want a fine-tuned model for a very specific task?

In this article, I will outline and compare some of the most effective inference methods/platforms for serving open source LLMs in 2023. I will compare and contrast five methods and explain when you should use one or the other. I have personally tried all five and will detail my experience with each: AWS SageMaker, Hugging Face Inference Endpoints, Together.AI, vLLM and Petals.ml. I don’t have all the answers, but I will do my best to detail my experiences. I have no monetary connection with any of these providers and am simply sharing my experiences for the benefit of others. Please tell me about your experiences!
Why open source?
Open source models have a plethora of advantages including control, privacy and potential cost reductions. For example, you could fine-tune a smaller open source model for your particular use case, resulting in accurate results and fast inference times. Privacy control means that inference can be done on your own servers. On the other hand, cost reduction is much harder than you might think. OpenAI has economies of scale and competitive pricing. Their pricing for GPT-3.5 Turbo is very hard to compete with, and has been estimated to be close to the raw cost of the electricity needed to run it. Still, there are methods and techniques you can use to save money and get excellent results with open source models. For example, my fine-tuned version of Stable Beluga 2 is currently outperforming GPT-3.5 Turbo significantly, and is cheaper for my application. So I strongly suggest giving open source a shot for your application.
Hugging Face Inference Endpoints
This is the most common and simplest method for serving an open source LLM. It only takes a couple of clicks and is foolproof. After all, Hugging Face was originally an NLP company. Your model probably already exists on Hugging Face as well, so this should be the go-to option for quickly testing your model. The GPU server costs do tend to run on the higher side, though. If you instead used RunPod.io to deploy your model, you would have more options for providers and lower costs. Hugging Face has open sourced its text-generation-inference library (which powers Inference Endpoints) and provides a Docker image that is easy to modify. So if you want more control, go with a custom solution on RunPod. Here is a tutorial on how to do it on RunPod.
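To give a flavor of what calling a deployed endpoint looks like, here is a minimal sketch. The endpoint URL and token are placeholders you copy from the Hugging Face UI after deploying, and the exact payload fields depend on your model’s task:

```python
import requests

# Placeholders: copy these from your Inference Endpoint's page after deployment
ENDPOINT_URL = "https://YOUR-ENDPOINT.endpoints.huggingface.cloud"
HF_TOKEN = "hf_..."  # your Hugging Face access token

def query(prompt: str) -> str:
    """Send a text-generation request to a deployed Inference Endpoint."""
    response = requests.post(
        ENDPOINT_URL,
        headers={
            "Authorization": f"Bearer {HF_TOKEN}",
            "Content-Type": "application/json",
        },
        json={"inputs": prompt, "parameters": {"max_new_tokens": 200}},
        timeout=60,
    )
    response.raise_for_status()
    # Text-generation endpoints typically return [{"generated_text": "..."}]
    return response.json()[0]["generated_text"]

print(query("Explain what an inference endpoint is in one sentence."))
```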
vLLM
This solution is very interesting due to its inference speed. The project claims up to 24x higher throughput than Hugging Face Transformers! Using it personally, I found the speed to be about 10 times faster than Hugging Face Transformers, though I did run into a few bugs here and there. The project is being actively worked on and is not foolproof yet. Still, I strongly suggest you try it for your application. Due to the faster inference speed, the cost will be significantly lower in comparison to HF Transformers.
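As a quick illustration, here is a minimal offline-inference sketch with vLLM’s Python API. The model name is just an example; swap in whichever open source model you are serving:

```python
from vllm import LLM, SamplingParams

# Example model; any Hugging Face causal LM that vLLM supports works here
llm = LLM(model="meta-llama/Llama-2-7b-chat-hf")

sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=200)

prompts = [
    "What makes PagedAttention fast?",
    "Summarize the benefits of open source LLMs.",
]

# generate() batches the prompts and handles scheduling internally,
# which is where much of the throughput win over plain Transformers comes from
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.prompt)
    print(output.outputs[0].text)
```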

Petals.ml
This is the most interesting solution. The developers of Petals.ml have discovered a way to run LLMs at home, BitTorrent-style. This allows you to do fine-tuning and inference up to 10 times faster than offloading. In practice, this means that only a small part of the model is loaded on your own GPU, and the rest lives on a GPU network swarm. In other words, a network of GPUs works together to do the compute. This is very interesting because it democratizes LLM usage to some extent, i.e. anyone has the ability to run huge LLMs without paying a cent! The paper describing the technology can be found here. I strongly suggest you give Petals.ml a try!
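For context, here is roughly what using Petals looks like from the client side, assuming the petals package is installed and the model is available on the public swarm. The model name is an example taken from their docs:

```python
from transformers import AutoTokenizer
from petals import AutoDistributedModelForCausalLM

# Example model hosted on the public Petals swarm
MODEL_NAME = "petals-team/StableBeluga2"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
# Only a few transformer blocks run locally; the rest are served by the swarm
model = AutoDistributedModelForCausalLM.from_pretrained(MODEL_NAME)

inputs = tokenizer("A recipe for a simple pasta dish:", return_tensors="pt")["input_ids"]
outputs = model.generate(inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0]))
```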
Together.AI
They provide an API for open source models with excellent pricing. You can fine-tune and deploy open source models using Together.AI’s compute cluster. They advertise pricing at roughly 20% of AWS, and their platform is straightforward and easy to get started with. Therefore, I would highly suggest this platform. Their API is about 1/10th the price of GPT-3.5 Turbo. This is my new favorite way to deploy open source models!
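As a rough sketch of how calling their hosted models works, the snippet below POSTs to their inference endpoint. The exact URL, payload fields and model names are assumptions based on their API at the time of writing, so treat this as an outline and check their docs:

```python
import os
import requests

# Assumed endpoint and payload shape; consult Together's API docs for the current version
API_URL = "https://api.together.xyz/inference"
API_KEY = os.environ["TOGETHER_API_KEY"]

response = requests.post(
    API_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "model": "togethercomputer/llama-2-13b-chat",  # example model name
        "prompt": "List three uses for a fine-tuned open source LLM.",
        "max_tokens": 200,
        "temperature": 0.7,
    },
    timeout=60,
)
response.raise_for_status()
print(response.json())
```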
AWS SageMaker
The tried and true method for deploying ML models. SageMaker is not particularly beginner-friendly and is probably the most expensive method in comparison to the methods provided above. It’s also the most complex. However, if your business is already using AWS, this may be your only option. Also, if you have free compute on AWS like me, why not give it a shot? Here is a tutorial on how to do it by AI Anytime: https://www.youtube.com/watch?v=A9Pu4xg-Nas&ab_channel=AIAnytime.
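To show the general shape of a SageMaker deployment, here is a minimal sketch using the sagemaker Python SDK’s Hugging Face integration. The model ID, instance type and framework versions are examples; pick ones that match your model and region:

```python
import sagemaker
from sagemaker.huggingface import HuggingFaceModel

# Assumes you are running inside SageMaker (e.g. a notebook) with an execution role
role = sagemaker.get_execution_role()

# Example Hub model and task; replace with your own fine-tuned model
hub = {
    "HF_MODEL_ID": "meta-llama/Llama-2-7b-chat-hf",
    "HF_TASK": "text-generation",
}

huggingface_model = HuggingFaceModel(
    env=hub,
    role=role,
    transformers_version="4.28",  # example versions; use a supported combination
    pytorch_version="2.0",
    py_version="py310",
)

# Instance type is an example; GPU instances like ml.g5.2xlarge suit 7B models
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
)

print(predictor.predict({"inputs": "Hello, SageMaker!"}))

# Remember to tear the endpoint down when finished, or it keeps billing
predictor.delete_endpoint()
```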
Conclusion
In conclusion, I highly suggest trying Together.AI and Petals.ml due to the many upsides of these platforms. If you require privacy and very fast inference speeds, I suggest using vLLM. If you are forced to use AWS, then go with SageMaker. Finally, if you want something simple and efficient (especially for testing), go with Hugging Face Inference Endpoints.
📢 Hey there! If you found this article helpful, please consider:
👏 Clapping 50 times (this helps a lot!)
✏️ Leaving a comment
🌟 Highlighting parts you found insightful
👣 Following me
Any questions? 🤔 Don’t hesitate to ask. Supporting me this way is a free and easy way to show appreciation for my detailed tutorial articles! 😊
Final Notes
If you’re interested in learning more about developing full stack AI apps only with Python, please feel free to sign up here. As always, please leave your own experiences and comments below. I look forward to reading them all.
That’s all folks – if you’ve made it this far, please comment below and add me on LinkedIn.
My Github is here.
Other Deep Learning Blogs
Sequence Models by Andrew Ng – 11 Lessons Learned
Computer Vision by Andrew Ng – 11 Lessons Learned
Deep Learning Specialization by Andrew Ng – 21 Lessons Learned