Large Language Models (LLMs) are capable and general-purpose tools, but they often lack the domain-specific knowledge that is frequently stored in enterprise repositories.
Fine-tuning a custom LLM with your own data can bridge this gap, and data preparation is the first step in this process. It is also a crucial step that can significantly influence your fine-tuned model’s performance.
However, manually creating datasets is expensive and time-consuming. Another approach is to leverage an LLM to generate synthetic datasets, often using high-performance models such as GPT-4, which can also turn out to be very costly.
In this article, I aim to bring to your attention a cost-efficient alternative for automating the creation of instruction datasets from various documents. This solution relies on a lightweight open-source library called Bonito.

Getting Started with Bonito, the Open-Source Solution
Understanding Instructions
Before we dive into the Bonito library and how it works, we first need to understand what an instruction actually is.
An instruction is a text or prompt given to an LLM, such as Llama or GPT-4. It directs the model to produce a specific kind of answer. Through instructions, people can guide the conversation, ensuring that the model’s replies are relevant, helpful, and in line with what the user wants. Writing clear and precise instructions is important to achieve the desired outcome.
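For illustration, a single instruction-tuning example is often stored as a simple instruction/input/output record. The field names and content below are just a common convention used as a hypothetical example, not a fixed standard:

# A hypothetical instruction-tuning example (field names follow a common convention)
example = {
    "instruction": "Answer the question based on the context.",
    "input": "Context: Circular 12/552 sets governance requirements for banks in Luxembourg. "
             "Question: Which country's banks does Circular 12/552 apply to?",
    "output": "Banks in Luxembourg."
}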
Introducing Bonito, an Open-Source Model for Conditional Task Generation
Bonito is an open-source model designed for conditional task generation. It can be used to create synthetic instruction tuning datasets to adapt Large Language Models to users’ specialized, private data.

The research paper underlying Bonito’s development illustrates how it can be effectively employed to adapt both pre-trained and instruction-tuned models to various tasks without requiring any text annotations.
The model itself is fine-tuned from mistralai/Mistral-7B-v0.1 with a new large-scale dataset containing 1.65M examples.
Bonito also supports a variety of task types, including multiple-choice question answering, yes-no question answering, natural language inference, topic classification, and more.
How to Use Bonito
The easiest way to use the Bonito model is through the accompanying package, built on top of the transformers and vllm libraries.
In the next section, I’ll show you how to easily use the Bonito package to create a synthetic dataset from a PDF document.
Step-by-Step Guide to Generating Datasets
In this guide, I’ll show you how to generate a question-answering dataset from a PDF document using the Bonito package.
For this example, I’ve selected Circular 12/552 issued by the CSSF, Luxembourg’s financial regulator, which pertains to bank governance and central administration. The motivation behind this choice stems from the observation that tools like ChatGPT often struggle with domain-specific knowledge, particularly regulatory requirements within specific industries and from smaller countries like Luxembourg.
My aim is to transform this circular into an instruction dataset suitable for fine-tuning an LLM. The tailored LLM will be able to comprehend the underlying regulatory requirements, respond to inquiries about them, and ultimately extend its utility to broader applications such as risk management, impact assessment, and ongoing monitoring.
Prerequisite: Since Bonito is fine-tuned from Mistral 7B, I ran this demo on a Google Colab A100 GPU instance. It should also work locally on a machine with a sufficient GPU and RAM.
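If you are unsure whether your environment exposes a suitable GPU, a quick sanity check (a minimal sketch, assuming PyTorch is installed, as it is on Colab) is:

import torch

# Check whether a CUDA-capable GPU is visible to PyTorch
print(torch.cuda.is_available())          # True if a GPU is available
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # e.g. the name of the attached GPU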
You can find my Colab notebook here.
Step 1: Installing the Bonito Package and Other Dependencies
Besides the Bonito package, we’ll also need:
- The Datasets and Hugging Face Hub libraries to handle datasets and interact with the Hugging Face Hub
- PyMuPDF and SpaCy: PyMuPDF is used for reading and extracting text from PDF files, while SpaCy handles natural language processing tasks such as sentence splitting.
!pip install -e git+https://github.com/BatsResearch/bonito#egg=bonito
!pip install datasets huggingface_hub
!pip install pymupdf spacy
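Note that spacy.load("en_core_web_sm"), used in Step 2, expects the small English model to be present. It is typically preinstalled on Colab, but if it is missing in your environment you can download it with:

# Download spaCy's small English model if it isn't already installed
!python -m spacy download en_core_web_sm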
Step 2: Processing the PDF Document
First, we use the PyMuPDF library to extract text from the document.
import fitz  # PyMuPDF

def extract_text_from_pdf(pdf_path):
    doc = fitz.open(pdf_path)  # Open the PDF file
    text = ""
    for page in doc:  # Iterate through each page
        text += page.get_text()  # Extract text and append it to the text variable
    return text

pdf_path = 'cssf_12_552_governance.pdf'  # Specify the path to your PDF document
text = extract_text_from_pdf(pdf_path)  # Call the function with the path to your PDF
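As a quick sanity check, you can preview the beginning of the extracted text to confirm the extraction worked as expected:

# Preview the first 500 characters of the extracted text
print(text[:500])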
Next, we process the extracted text by splitting it into sentences. This step uses SpaCy, a library for advanced natural language processing (NLP).
import spacy

nlp = spacy.load("en_core_web_sm")  # Load the English language model

def split_into_sentences(text):
    doc = nlp(text)  # Process the text with SpaCy
    sentences = [sent.text.strip() for sent in doc.sents]  # Extract sentences and strip whitespace
    return sentences

sentences = split_into_sentences(text)  # Split the extracted text into sentences
Finally, we transform the list of sentences into a format that the Bonito model can consume, using the datasets library:
from datasets import Dataset
# Assuming sentences is a list of strings, where each string is a sentence
data = {"sentence": sentences}
dataset = Dataset.from_dict(data)
print(dataset)
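Optionally (this step is not part of the original workflow), you can drop very short fragments such as page numbers or headers before generation, which tends to improve the quality of the generated tasks. A minimal sketch using the datasets filter method, with an arbitrary five-word threshold:

# Optional: keep only sentences with a reasonable amount of text
dataset = dataset.filter(lambda example: len(example["sentence"].split()) > 5)
print(dataset)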
Step 3: Generating the Synthetic Dataset
Now it’s time to utilize the Bonito library to generate a synthetic dataset tailored for question answering!
from bonito import Bonito, SamplingParams

# Initialize the Bonito model
bonito = Bonito("BatsResearch/bonito-v1")

sampling_params = SamplingParams(max_tokens=256, top_p=0.95, temperature=0.5, n=1)
synthetic_dataset = bonito.generate_tasks(
    dataset,
    context_col="sentence",
    task_type="qg",
    sampling_params=sampling_params
)
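Once generation finishes, it is worth inspecting a few of the generated examples before going further (the exact column names may vary with the Bonito version):

# Inspect the structure and the first generated example
print(synthetic_dataset)
print(synthetic_dataset[0])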
In this example, we use Bonito for "question generation" (qg) to create questions for the dataset. But Bonito can handle a wide array of tasks. Here’s a brief overview of the types of tasks Bonito can manage:
- Extractive Question Answering (exqa): Generates answers to questions based on a given text snippet, extracting the answer directly from the text.
- Multiple-Choice Question Answering (mcqa): Provides answers to questions from a set of multiple choices.
- Question Generation (qg): Creates questions based on the content of a provided text.
- Question Answering Without Choices (qa): Answers questions without providing multiple-choice options.
- Yes-No Question Answering (ynqa): Generates yes or no answers to questions.
- Coreference Resolution (coref): Identifies mentions in a text that refer to the same entity.
- Paraphrase Generation (paraphrase): Rewrites sentences or phrases with different wording while retaining the original meaning.
- Paraphrase Identification (paraphrase_id): Determines whether two sentences or phrases convey the same meaning.
- Sentence Completion (sent_comp): Fills in missing parts of a sentence.
- Sentiment Analysis (sentiment): Identifies the sentiment expressed in a text, such as positive, negative, or neutral.
- Summarization: Condenses a longer text into a shorter summary, capturing the main points.
- Text Generation (text_gen): Creates coherent and contextually relevant text based on a prompt.
- Topic Classification (topic_class): Categorizes text into predefined topics.
- Word Sense Disambiguation (wsd): Determines the meaning of a word based on its context.
- Textual Entailment (te): Predicts whether a given text logically follows from another text.
- Natural Language Inference (nli): Determines the relationship between two pieces of text, such as contradiction, entailment, or neutrality.
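Switching tasks only requires changing the task_type argument. For example, to generate yes-no question answering examples instead, reusing the same dataset and sampling parameters from above:

# Generate yes-no question answering examples from the same sentences
ynqa_dataset = bonito.generate_tasks(
    dataset,
    context_col="sentence",
    task_type="ynqa",
    sampling_params=sampling_params
)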
Step 4: Saving the Generated Dataset
Now we can either save the generated dataset locally or upload it to the Hugging Face Hub.
To upload the dataset to the Hugging Face Hub, first log in to the Hub.
from huggingface_hub import notebook_login
notebook_login()
Then create a repository for the dataset and push it to the hub.
from huggingface_hub import create_repo

repo_name = "dataset_12_552"  # Choose a name for your dataset repository
repo_url = create_repo(repo_name, repo_type="dataset")
print("Repository URL:", repo_url)

synthetic_dataset.push_to_hub("Ronal999/dataset_12_552")
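If you prefer to keep the dataset local instead of pushing it to the Hub, the datasets library can also write it to disk (the directory name here is just an example):

# Save the generated dataset locally instead of (or in addition to) pushing it to the Hub
synthetic_dataset.save_to_disk("dataset_12_552")

# It can be reloaded later with:
from datasets import load_from_disk
reloaded = load_from_disk("dataset_12_552")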
Here is the dataset I’ve created from my document. It will, of course, need some further cleaning and refinement to ensure its quality before the fine-tuning process.

Closing Thoughts
Creating a high-quality instruction dataset is key to achieving a well-performing model, but it can be a time-consuming process.
In this guide, we’ve looked at how to use Bonito, a specially fine-tuned open-source model, to create instruction datasets from any text. This approach offers a practical alternative to manual annotation or to paid models like GPT-4, which can get really expensive.
Bonito is a relatively new approach, released just last month. Since a significant amount of knowledge is in unstructured data scattered across various documents, I’ve employed Bonito to automate the generation of datasets from multiple documents. These datasets are then used to train a local LLM, enabling me to customize my models to comprehend and utilize specific knowledge.
Now that you have your synthetic dataset prepared, you are all set to launch the fine-tuning process! Be sure to check out the following article to kick off the efficient fine-tuning of your LLM.
Unleash Mistral 7B’s Power: How to Efficiently Fine-tune a LLM on Your Own Data
Before you go! 🦸🏻‍♀️
If you liked my story and you want to support me:
- Throw some Medium love 💕 (claps, comments and highlights); your support means the world to me. 👏
- Follow me on Medium and subscribe to get my latest articles 🫶