
Table of Contents
- What Bloom is, and Why we Should Tread Carefully
- Setting Up Your Environment
- Downloading a Pre-Trained Tokenizer & Model
- Running Inference: Strategies for Better Responses
- Conclusions & Next Steps
- References
What Bloom is, and Why we Should Tread Carefully
Bloom is a new 176B parameter multi-lingual LLM (Large Language Model) from BigScience, a Huggingface-hosted open collaboration with hundreds of researchers and institutions around the world. The most remarkable thing about Bloom, aside from the diversity of contributors, is that it is completely open source: Huggingface has made the full pre-trained model (as well as some smaller versions) available to the public via their transformers API. Other organizations conducting research into LLMs, including OpenAI, Meta and Google, have chosen to keep their LLMs largely internal, or have restricted access to tightly controlled groups of closed beta testers.

There is a conversation to be had about the dangers of using these models in the real world, let alone making them publicly accessible. Concerns run the gamut from reinforcing unfair and systemic bias to accelerating the spread of misinformation online. Voices far more qualified than my own have advocated, and continue to advocate, for more human-accountable, transparent and equitable development and use of this technology. If you're not familiar with that work, I'd encourage you to pause here and spend some time catching up on the work of folks like Timnit Gebru (DAIR Institute), Margaret Mitchell and the team at the Partnership on AI, among many others.
Accordingly, I would encourage everyone to stick to the intended uses and be mindful of the risks and limitations laid out on Bloom's model card as you proceed beyond this Hello World-style introductory tutorial.
Setting Up Your Environment
We’re going to be using the 1.3B parameter version of the general Bloom model in PyTorch, running inference using just the CPU. While I am using a Python 3 Jupyter Lab VM on Google Cloud’s Vertex service, you should be able to follow along on almost any local or hosted *nix Jupyter environment.
First we need to set up a virtual environment as a cleanroom to install all of the correct versions of our dependencies. We're going to create an environment named .venv (which also produces a hidden directory by the same name) and then activate it to start working:
# venv ships with Python 3.3+, so there is nothing to pip install
python -m venv .venv
source .venv/bin/activate
Next we'll install the packages we're going to need into our .venv environment:
pip install transformers
pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cpu
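If you'd like a quick sanity check before moving on (optional, and assuming the installs above succeeded), you can confirm both libraries import cleanly from inside the venv:
python -c "import torch, transformers; print(torch.__version__, transformers.__version__)"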
Lastly, we need to register our new environment with Jupyter Lab as a kernel. Doing this from inside the still-active venv, via the ipykernel package, ensures the kernel points at the environment we just set up:
pip install ipykernel
python -m ipykernel install --user --name=venv
When you go to the Select a Kernel option in Jupyter Lab you should now see venv as an option. Let's select and connect to it.
Downloading a Pre-Trained Tokenizer & Model
Starting up our example notebook (also available on GitHub), we first import a few modules from the packages we installed into venv previously:
import transformers
from transformers import BloomForCausalLM
from transformers import BloomTokenizerFast
import torch
Now, on to the main event: we download the pre-trained Bloom 1.3B parameter general LLM. While I haven't sized it exactly, this version of the model's weights & biases seems to take up about 1.5 GB of space. Critically, we also need to fetch Bloom's tokenizer. This is what will let us turn our input text ("prompt") into a sequence of token IDs that Bloom can understand:
model = BloomForCausalLM.from_pretrained("bigscience/bloom-1b3")
tokenizer = BloomTokenizerFast.from_pretrained("bigscience/bloom-1b3")
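As an optional aside (my own addition, not part of the original walkthrough), a couple of standard PyTorch calls will let you confirm the parameter count of the checkpoint you just downloaded:
# optional sanity check: count the parameters we just loaded
num_params = sum(p.numel() for p in model.parameters())
print(f"{num_params:,} parameters, dtype {next(model.parameters()).dtype}")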
Speaking of which, let’s set some globals, including our prompt text:
prompt = "It was a dark and stormy night"
result_length = 50
inputs = tokenizer(prompt, return_tensors="pt")
A few notes:
- result_length calibrates the maximum size of the response (in tokens) we get back from the model for our prompt; note that max_length counts the prompt tokens as part of that total.
- inputs contains the tokenized representation of prompt, encoded specifically for use by PyTorch. If we were using TensorFlow we'd pass return_tensors="tf". A quick peek at what's inside inputs follows below.
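Here's a minimal, optional sketch of that peek (not part of the original walkthrough); the exact token IDs you see will depend on your prompt and tokenizer version:
# inputs behaves like a dict of PyTorch tensors
print(inputs["input_ids"])                        # token IDs for the prompt, shape (1, num_tokens)
print(inputs["input_ids"].shape)
print(tokenizer.decode(inputs["input_ids"][0]))   # decodes back to the original prompt text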
Running Inference: Strategies for Better Responses
Before we send the model our prompt, we need to think about which decoding / search strategies might work best for our use case. With autoregressive transformers (trained for next-token prediction) we have a number of options for searching the answer space for the most "reasonable" output. This great article by Patrick von Platen (Huggingface) does an excellent job explaining the details and math behind the three techniques we'll be trying, so I won't reinvent the wheel here. I will, however, give you the TL;DR version of each:
- Greedy Search simply chooses the next word at each timestep t+1 that has the highest predicted probability of following the word at t. One of the main issues here is that greedy search will miss a word with a high probability at t+1 if it is preceded by a word with a low probability at t.
- Beam Search keeps track of the num_beams most likely word sequences and outputs the most likely one. Sounds great, but this method breaks down when the output length can be highly variable, as in the case of open-ended text generation. Both greedy and beam search also produce outputs whose distribution does not align very well with the way humans might perform the same task (i.e. both are liable to produce fairly repetitive, boring text).
- Sampling With Top-k + Top-p is a combination of three methods. By sampling, we mean that the next word is chosen randomly based on its conditional probability distribution (von Platen, 2020). In Top-k, we choose the k most likely words, and then redistribute the probability mass amongst them before the next draw. Top-p adds an additional constraint to Top-k, in that we're choosing from the smallest set of words whose cumulative probability exceeds p.
Now we'll try all 3 strategies so we can compare the outputs:
# Greedy Search
print(tokenizer.decode(model.generate(inputs["input_ids"],
max_length=result_length
)[0]))
It was a dark and stormy night, and the wind was blowing hard. The snow was falling fast, and the ground was covered with it. The horses were all frozen to the ground, and the men were huddled
# Beam Search
print(tokenizer.decode(model.generate(inputs["input_ids"],
max_length=result_length,
num_beams=2,
no_repeat_ngram_size=2,
early_stopping=True
)[0]))
It was a dark and stormy night, and the wind was blowing hard. I was in the middle of the road, when I heard a loud crash. It came from the house at the other side of my road. A man was
# Sampling Top-k + Top-p
print(tokenizer.decode(model.generate(inputs["input_ids"],
max_length=result_length,
do_sample=True,
top_k=50,
top_p=0.9
)[0]))
It was a dark and stormy night. It was almost noon. As I got out of the car and took off my shoes, a man walked over to me and sat down. He had a mustache, thick hair and brown eyes. He
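One practical note: because the last strategy samples randomly, your output will differ from the one above on every run. If you want repeatable results while you experiment, one option (my own addition, not part of the original walkthrough) is to fix PyTorch's random seed before generating:
# fix the RNG seed so repeated sampling runs produce the same text
torch.manual_seed(42)
print(tokenizer.decode(model.generate(inputs["input_ids"],
                       max_length=result_length,
                       do_sample=True,
                       top_k=50,
                       top_p=0.9
                      )[0]))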
Conclusions & Next Steps
To my eye, all of these results appear mostly reasonable. You'll find that as you iterate and adjust the parameters and prompts, some strategies may produce better outputs for your specific use case. In fact, constructing prompts to coax LLMs into doing something useful is emerging as a bit of an art and a science unto itself.
As a bonus, the inconsistency between the prompt's "night" and the "almost noon" in the sampling Top-k + Top-p output illustrates a valuable point: it can be easy to mistake LLMs for reasoning machines with internal models of the world that they use to structure their responses (the way humans do). We don't need deep learning, big data or LLMs to prove that humans will anthropomorphize anything. Instead, we should see LLMs for what they are: syntactically believable sentence generators, which should be deployed with eyes wide open as to their limitations (and with plenty of mitigating engineering and inclusive design).
With that in mind, my own journey with Bloom will follow a few threads forward, largely focused on adapting both the text generation and classification heads to problems in modern auditing. Specifically:
- Code summarization. Can Bloom summarize the logic of a code block in plain English?
- Transfer learning for token classification. Can Bloom be trained to identify risks and/or controls in process documentation?
- Reliability. What guarantees, if any, can we build into Bloom predictions as to the factual accuracy of generated summaries and classifications?
Happy generating!
References
- Bloom Model Card, 2022, Huggingface
- Bloom transformers Documentation, 2022, Huggingface
- How to generate text: using different decoding methods for language generation with Transformers, 2020, Patrick von Platen
- venv Module Documentation, 2022, Python.org
- Prompt Engineering Tips and Tricks with GPT-3, 2021, Andrew Cantino
- Getting Started with Bloom: Sample Notebook, 2022, Danie Theron