Understand BLOOM, the Largest Open-Access AI, and Run It on Your Local Computer

See BLOOM in action solving math, translation, and coding problems.

Cristian Arteaga
Towards Data Science

--

BLOOM is an open-access multilingual language model that contains 176 billion parameters and was trained for 3.5 months on 384 A100 80GB GPUs. A BLOOM checkpoint takes 330 GB of disk space, so running this model on a desktop computer seems infeasible. However, all you need is enough disk space, at least 16GB of RAM, and some patience (you don’t even need a GPU) to run this model on your computer.

BLOOM is a collaborative effort of more than 1,000 scientists and the amazing Hugging Face team. It is remarkable that such a large multilingual model is openly available to everybody. By the end of this tutorial, you will know how to run this massive language model on your local computer and see it in action generating text such as:

- INPUT: "The SQL command to extract all the users whose name starts with A is: "
OUTPUT: "SELECT * FROM users WHERE name LIKE 'A%'"
- INPUT: "The Spanish translation of thank you for your help is: "
OUTPUT: "gracias por su ayuda"
- INPUT: "John is 4 times as old as Bob. Bob is 3 years younger than Mike. Mike is 10 years old. What is John's age? Let's think step by step. "
OUTPUT: "First, we need to find out how old Bob is. Bob is 3 years younger than Mike. So, Bob is 10–3=7 years old. Now, we need to find out how old John is. John is 4 times as old as Bob. So, John is 4 times 7=28 years old"

This tutorial uses some components of Hugging Face’s transformers library, along with custom Python code, to strategically load the model weights from disk and generate a sequence of tokens. For the sake of learning, the inference code in this tutorial was written from scratch and does not use the out-of-the-box implementation available in Hugging Face Accelerate. For production, Hugging Face Accelerate is much more robust and versatile. The Python code in this tutorial generates one token roughly every 3 minutes on a computer with an 11th-gen Intel i5 processor, 16GB of RAM, and a Samsung 980 PRO NVMe SSD (a fast drive significantly speeds up inference, since the weights are read from disk at every step).

BLOOM Architecture

BLOOM is a causal language model, which means that it was trained as a next-token predictor. This apparently simple strategy of predicting the next token in a sentence, based on a set of preceding tokens, has been shown to give large language models a certain degree of reasoning ability (arXiv:2205.11916). This enables BLOOM and similar models to connect multiple concepts in a sentence and solve non-trivial problems such as arithmetic, translation, and programming with fair accuracy. BLOOM uses a Transformer architecture composed of an input embedding layer, 70 Transformer blocks, and an output language-modeling layer, as shown in the figure below. Each Transformer block has a self-attention layer and a multi-layer perceptron layer, with input and post-attention layer norms.

BLOOM architecture

To predict the next token in a sentence using BLOOM, we simply need to pass the input tokens (in the form of embeddings) through each of the 70 BLOOM blocks. Given that this is a sequential operation, we can load only one block into RAM at a time to avoid memory overflow. Similarly, the word embeddings and the output language-modeling layer can be loaded on demand from disk.

Download a Pre-trained BLOOM Checkpoint

Use the code below to download the 176B-parameter version of BLOOM from the Hugging Face models repository: https://huggingface.co/bigscience/bloom. It downloads the specific BLOOM checkpoint 2a3d62e. Although BLOOM’s model size is around 330GB, git lfs also downloads additional linked files, so the total download size is almost 700GB. Make sure you have enough disk space.

git lfs install
export GIT_LFS_SKIP_SMUDGE=1
git clone https://huggingface.co/bigscience/bloom
cd bloom
git lfs fetch origin 2a3d62e
git lfs checkout

The downloaded folder contains a sharded BLOOM checkpoint, as shown below. Sharded means that the checkpoint was split into 72 files, named pytorch_model_00001-of-00072.bin through pytorch_model_00072-of-00072.bin, for convenient handling.

> ls -la
6.7 GB pytorch_model_00001-of-00072.bin
4.6 GB pytorch_model_00002-of-00072.bin
...
4.6 GB pytorch_model_00071-of-00072.bin
57 KB pytorch_model_00072-of-00072.bin
0.5 KB config.json
14 MB tokenizer.json
13 KB pytorch_model.bin.index.json

The file 00001 contains the word embeddings and associated layer norm, the files 00002 to 00071 contain the 70 BLOOM blocks, and the file 00072 contains the final layer norm. The output language modeling layer uses the same weights as the word embeddings. In case you are curious, the pytorch_model.bin.index.json file specifies how the BLOOM layers are distributed across the shards.
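If you are curious about this mapping, you can inspect the index file with a few lines of Python. The snippet below is a quick sketch; the weight names in the comments are assumptions based on BLOOM’s layer naming, so check them against your local copy of the file.

import json

# Inspect how the weights are distributed across the 72 shards
# (run this inside the downloaded bloom folder)
with open("pytorch_model.bin.index.json") as f:
    index = json.load(f)

weight_map = index["weight_map"]  # maps each weight name to its shard file
print(weight_map["word_embeddings.weight"])  # expected: pytorch_model_00001-of-00072.bin
print(weight_map["ln_f.weight"])             # expected: pytorch_model_00072-of-00072.bin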

Inference

Now let’s use the downloaded BLOOM model to do inference. First, we need to install Hugging Face transformers v4.20.0, as shown below. This specific version is required because the custom Python code in this tutorial uses methods available only in that version of transformers.

pip install transformers==4.20.0

Second, we create a method (get_state_dict) that takes as input a shard number (1 to 72), reads the shard from disk, and returns a dictionary with the corresponding model weights (a state dictionary). This method can also remove a prefix from the dictionary keys so that the weights can be loaded into the model objects with PyTorch’s load_state_dict. We also create the tokenizer and configuration objects by loading them from the downloaded folder.
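Below is a minimal sketch of this step. The model_path value is an assumption (point it to wherever you cloned the repository), and the prefix-stripping logic simply mirrors the shard layout described in the previous section.

import torch
from transformers import BloomConfig, BloomTokenizerFast

model_path = "./bloom"  # folder cloned from the Hugging Face repository

def get_state_dict(shard_num, prefix=None):
    # Read one shard (1 to 72) from disk into a state dictionary on the CPU
    d = torch.load(f"{model_path}/pytorch_model_{shard_num:05d}-of-00072.bin",
                   map_location="cpu")
    # Optionally strip a key prefix (e.g. "h.0.") so the keys match the
    # sub-module the weights will be loaded into with load_state_dict
    return d if prefix is None else {k.replace(prefix, ""): v for k, v in d.items()}

# Tokenizer and configuration are loaded directly from the downloaded folder
tokenizer = BloomTokenizerFast.from_pretrained(model_path)
config = BloomConfig.from_pretrained(model_path)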

Third, we create three methods to load the state dictionaries into the different model objects. We use these methods during inference to load only specific parts of the model into RAM. The three methods follow a similar pattern: 1) read a shard from disk, 2) create a model object, 3) fill in the model object’s weights using load_state_dict, and 4) return the model object. The only exception is the load_block method, which does not create a new block object but instead overwrites the object passed as a parameter to save RAM.
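The sketch below illustrates this pattern. The shard-to-layer assignment (shard 1 for the embeddings, shards 2 to 71 for the blocks) follows the description above, and the key prefixes such as "word_embeddings_layernorm." and "h.{block_num}." are assumptions based on BLOOM’s state-dictionary naming.

from torch import nn
from transformers.models.bloom.modeling_bloom import BloomBlock

def load_embeddings():
    # Shard 1 holds the word embeddings and their layer norm
    state_dict = get_state_dict(shard_num=1, prefix="word_embeddings_layernorm.")
    embeddings = nn.Embedding.from_pretrained(state_dict.pop("word_embeddings.weight"))
    lnorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_epsilon,
                         dtype=torch.bfloat16)
    lnorm.load_state_dict(state_dict)
    return embeddings, lnorm

def load_causal_lm_head():
    # The output layer reuses the word-embedding weights from shard 1
    linear = nn.Linear(config.hidden_size, config.vocab_size, bias=False,
                       dtype=torch.bfloat16)
    linear.load_state_dict({"weight": get_state_dict(shard_num=1)["word_embeddings.weight"]})
    return linear

def load_block(block_obj, block_num):
    # Shards 2 to 71 hold BLOOM blocks 0 to 69; overwrite the existing
    # block object in place instead of allocating a new one
    block_obj.load_state_dict(get_state_dict(shard_num=block_num + 2,
                                             prefix=f"h.{block_num}."))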

Fourth, we create a method to do a full forward pass through all of BLOOM’s layers. This method takes as input an array of token input ids and returns the token id predicted as the next one in the sentence. The method starts by creating an attention mask and the position encodings (ALiBi). Then, it does a forward pass on the embedding layer to create the initial hidden_states. Next, it sequentially passes the hidden_states through the 70 BLOOM blocks and the output language-model head to generate the output logits. The argmax takes the output logits and returns the token id with the highest prediction probability. Note that after using the embeddings, we delete them to avoid overflowing the memory. Also, every time we call a BLOOM block, we read a new state dictionary from disk but overwrite the weights of the existing block object to save memory.
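Below is a sketch of this forward pass. It is written against the transformers 4.20.0 API: in that version, BloomBlock accepts a layer_number argument and build_alibi_tensor takes the sequence length rather than an attention mask (later versions changed both signatures), so treat the exact calls as assumptions tied to v4.20.0.

from transformers.models.bloom.modeling_bloom import build_alibi_tensor

# Single reusable block whose weights are overwritten shard by shard
block = BloomBlock(config, layer_number=1).bfloat16()

# The final layer norm lives in shard 72 under the "ln_f." prefix
final_lnorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_epsilon,
                           dtype=torch.bfloat16)
final_lnorm.load_state_dict(get_state_dict(shard_num=72, prefix="ln_f."))

@torch.no_grad()  # no gradients are needed for inference
def forward(input_ids):
    # 1. Attention mask and ALiBi position encodings
    attention_mask = torch.ones(input_ids.shape, dtype=torch.bfloat16)
    alibi = build_alibi_tensor(input_ids.shape[1], config.num_attention_heads,
                               torch.bfloat16)
    # 2. Word embeddings and input layer norm; delete them right after use
    embeddings, lnorm = load_embeddings()
    hidden_states = lnorm(embeddings(input_ids))
    del embeddings, lnorm
    # 3. Pass the hidden states sequentially through the 70 BLOOM blocks,
    #    reloading the block weights from disk at every step
    for block_num in range(70):
        load_block(block, block_num)
        hidden_states = block(hidden_states, attention_mask=attention_mask,
                              alibi=alibi)[0]
        print(".", end="", flush=True)
    hidden_states = final_lnorm(hidden_states)
    # 4. Language-model head and greedy next-token selection
    lm_head = load_causal_lm_head()
    logits = lm_head(hidden_states)
    return torch.argmax(logits[:, -1, :], dim=-1)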

Finally, we define an input sentence, tokenize it, and sequentially call the forward method to predict the next tokens in the sentence, one token at a time. Note that at every step we concatenate the newly generated token with the previous tokens (input_ids) so that it conditions the next prediction.
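A sketch of this loop is shown below. The prompt and the max_tokens value mirror the SQL example from the beginning of the tutorial, and decoding is greedy because the forward method above simply takes the argmax.

# Tokenize the prompt and generate new tokens one at a time
input_sentence = "The SQL command to extract all the users whose name starts with A is: "
input_ids = tokenizer.encode(input_sentence, return_tensors="pt")

max_tokens = 10  # number of new tokens to generate
for i in range(max_tokens):
    print(f"Token {i + 1} ", end="")
    new_id = forward(input_ids)
    # Append the predicted token so it conditions the next prediction
    input_ids = torch.cat([input_ids, new_id.unsqueeze(0)], dim=-1)
    print(tokenizer.decode(new_id))

print(tokenizer.decode(input_ids[0]))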

INPUT: The SQL command to extract all the users whose name starts with A is:
OUTPUT:
Token 1 ....... SELECT
Token 2 ....... *
Token 3 ....... FROM
Token 4 ....... users
Token 5 ....... WHERE
Token 6 ....... name
Token 7 ....... LIKE
Token 8 ....... 'A
Token 9 ....... %'
Token 10 .......

The SQL command to extract all the users whose name starts with A is: SELECT * FROM users WHERE name LIKE 'A%'

This example shows that BLOOM can generate a meaningful SQL query. You can run other examples (for instance, the ones mentioned at the beginning of this tutorial) to see how powerful BLOOM is. Just remember to increase the number of tokens to generate by adjusting the max_tokens variable.

Conclusion

BLOOM has been deemed one of the most important AI models of the decade due to its open-access and multilingual nature. This ground-breaking technology will transform research and practice in Natural Language Processing. By following this tutorial, you can leverage the power of BLOOM for text generation, even if you have limited computational resources. Furthermore, you can use the Hugging Face transformers library to fine-tune BLOOM for downstream tasks such as question answering and text classification. If the large version of BLOOM is too big for your application or your available computational resources, you can take advantage of the smaller versions of BLOOM available in the Hugging Face models repository (https://huggingface.co/bigscience).

A Jupyter Notebook with all the source code in this tutorial is available in the Blog section of my website: https://arteagac.github.io
