Pierre Lienhart in Towards Data Science
The AQLM Quantization Algorithm, Explained
In this blog post, we cover the AQLM quantization algorithm, which sets a new state of the art for compressing LLMs down to 2 bits!
13 min read · Mar 13, 2024
Pierre Lienhart
LLM Inference Series: 5. Dissecting model performance
In this post, we look deeper into the different types of bottleneck that affect model latency and explain what arithmetic intensity is.
14 min read · Feb 2, 2024
Pierre Lienhart
LLM Inference Series: 4. KV caching, a deeper look
In this post, we look at how large the KV cache, a common optimization for LLM inference, can grow, and at common mitigation strategies.
18 min read · Jan 15, 2024
Pierre Lienhart
LLM Inference Series: 3. KV caching unveiled
In this post, we introduce the KV caching optimization for LLM inference, where it comes from, and what it changes.
11 min read · Dec 22, 2023
Pierre Lienhart
LLM Inference Series: 2. The two-phase process behind LLMs’ responses
After a quick reminder on the Transformer architecture, this post covers the text generation algorithm used by Transformer decoder models.
4 min read · Dec 22, 2023
Pierre Lienhart
LLM Inference Series: 1. Introduction
In this post, I introduce the outline of this deep dive series about the specifics and challenges of hosting LLMs for inference.
3 min read · Dec 22, 2023