
How to Generate Instruction Datasets from Any Documents for LLM Fine-Tuning

Generate high-quality synthetic datasets economically using lightweight libraries

Large Language Models (LLMs) are capable, general-purpose tools, but they often lack domain-specific knowledge, which is frequently stored in enterprise repositories.

Fine-tuning a custom LLM with your own data can bridge this gap, and data preparation is the first step in this process. It is also a crucial step that can significantly influence your fine-tuned model’s performance.

However, manually creating datasets is expensive and time-consuming. Another approach is to have an LLM generate synthetic datasets, but this typically relies on high-performance models such as GPT-4, which can turn out to be very costly.

In this article, I want to bring to your attention a cost-efficient alternative for automating the creation of instruction datasets from various documents: a lightweight open-source library called Bonito.

Image generated by author using Bing chat powered by DALL.E 3

Getting Started with Bonito, the Open-Source Solution

Understanding Instructions

Before we dive into the Bonito library and how it works, we first need to understand what an instruction actually is.

An instruction is a text prompt given to an LLM, such as Llama or GPT-4, that directs the model to produce a specific kind of answer. Through instructions, people can guide the conversation, ensuring that the model’s replies are relevant, helpful, and in line with what the user wants. Writing clear and precise instructions is important to achieve the desired outcome.
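To make this concrete, an instruction-tuning example typically pairs an instruction (and optional context) with the desired response. Here is a hypothetical entry; the field names and text are purely illustrative, not a fixed schema:

# A hypothetical instruction-tuning example (field names and text are illustrative only)
example = {
    "instruction": "Summarize the main obligation described in the passage.",
    "context": "Institutions shall establish a robust internal governance framework ...",
    "response": "Institutions are required to put a robust internal governance framework in place.",
}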

Introducing Bonito, an Open-Source Model for Conditional Task Generation

Bonito is an open-source model designed for conditional task generation. It can be used to create synthetic instruction tuning datasets to adapt Large Language Models to users’ specialized, private data.

Bonito workflow. Source: Learning to Generate Instruction Tuning Datasets for Zero-Shot Task Adaptation

The research paper underlying Bonito’s development illustrates how it can be effectively employed to adapt both pre-trained and instruction-tuned models to various tasks without requiring any text annotations.

The model itself is fine-tuned from mistralai/Mistral-7B-v0.1 with a new large-scale dataset containing 1.65M examples.

Bonito also supports a variety of task types, including multiple-choice question answering, yes-no question answering, natural language inference, and topic classification.

How to Use Bonito

The easiest way to use the Bonito model is through its package, which is built on top of the transformers and vllm libraries.

In the next section, I’ll show you how to easily use the Bonito package to create a synthetic dataset from a PDF document.


Step-by-Step Guide to Generating Datasets

In this guide, I’ll show you how to generate a question-answering dataset from a PDF document using the Bonito package.

For this example, I’ve selected Circular 12/552, issued by the CSSF, Luxembourg’s financial regulator, which pertains to bank governance and central administration. I chose it because tools like ChatGPT often struggle with domain-specific knowledge, particularly regulatory requirements within specific industries and from smaller countries like Luxembourg.

My aim is to transform this circular into an instruction dataset suitable for fine-tuning an LLM. The tailored LLM will be able to comprehend the underlying regulatory requirements, answer questions about them, and ultimately extend to broader applications such as risk management, impact assessment, and ongoing monitoring.

Prerequisite: Since Bonito is fine-tuned from Mistral 7B, I ran this demo on a Google Colab A100 GPU instance. It should also work locally on a machine with sufficient GPU memory and RAM.
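As an optional sanity check before loading the model, you can confirm that a GPU is visible. This snippet is just a convenience and assumes torch is available in your environment (it is preinstalled on Colab):

import torch

# Verify that a CUDA GPU is available before loading the 7B model
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
else:
    print("No GPU detected; running Bonito on CPU will be impractically slow.")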

You can find my Colab notebook here.

Step 1: Installing the Bonito Package and Other Dependencies

Besides the Bonito package, we’ll also need:

  • The datasets and huggingface_hub libraries to handle datasets and interact with the Hugging Face Hub
  • PyMuPDF and SpaCy: PyMuPDF is used for reading and extracting text from PDF files, while SpaCy handles natural language processing tasks such as sentence splitting.
!pip install -e git+https://github.com/BatsResearch/bonito#egg=bonito
!pip install datasets huggingface_hub
!pip install pymupdf spacy

Step 2: Processing the PDF Document

First, we use the PyMuPDF library to extract text from the document.

import fitz  # PyMuPDF

def extract_text_from_pdf(pdf_path):
    doc = fitz.open(pdf_path)  # Open the PDF file
    text = ""
    for page in doc:  # Iterate through each page
        text += page.get_text()  # Extract text and append it to the text variable
    return text

pdf_path = 'cssf_12_552_governance.pdf'  # Specify the path to your PDF document
text = extract_text_from_pdf(pdf_path)  # Call the function with the path to your PDF

Next, we process the extracted text by splitting it into sentences. This step uses SpaCy, a library for advanced natural language processing (NLP).

import spacy

nlp = spacy.load("en_core_web_sm")  # Load the English model (download it first with: python -m spacy download en_core_web_sm)

def split_into_sentences(text):
    doc = nlp(text)  # Process the text with SpaCy
    sentences = [sent.text.strip() for sent in doc.sents]  # Extract sentences and strip whitespace
    return sentences

sentences = split_into_sentences(text)  # Split the extracted text into sentences

Finally, we transform the list of sentences into a Hugging Face Dataset, the format expected by Bonito, using the datasets library:

from datasets import Dataset

# Assuming sentences is a list of strings, where each string is a sentence
data = {"sentence": sentences}
dataset = Dataset.from_dict(data)

print(dataset)

Step 3: Generating the Synthetic Dataset

Now it’s time to utilize the Bonito library to generate a synthetic dataset tailored for question answering!

from bonito import Bonito, SamplingParams

# Initialize the Bonito model
bonito = Bonito("BatsResearch/bonito-v1")

sampling_params = SamplingParams(max_tokens=256, top_p=0.95, temperature=0.5, n=1)
synthetic_dataset = bonito.generate_tasks(
    dataset,
    context_col="sentence",
    task_type="qg",
    sampling_params=sampling_params
)
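Before going further, it is worth peeking at the first synthetic example. This assumes generate_tasks returns a standard Hugging Face Dataset, which the push_to_hub call later in this guide relies on:

# Inspect the size of the synthetic dataset and its first generated example
print(synthetic_dataset)
print(synthetic_dataset[0])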

In this example, we use Bonito for "question generation" (qg) to create questions for the dataset. But Bonito can handle a wide array of tasks. Here’s a brief overview of the task types Bonito can manage; a short sketch after the list shows how to switch between them:

  • Extractive Question Answering (exqa): Generates answers to questions based on a given text snippet, extracting the answer directly from the text.
  • Multiple-Choice Question Answering (mcqa): Provides answers to questions from a set of multiple choices.
  • Question Generation (qg): Creates questions based on the content of a provided text.
  • Question Answering Without Choices (qa): Answers questions without providing multiple-choice options.
  • Yes-No Question Answering (ynqa): Generates yes or no answers to questions.
  • Coreference Resolution (coref): Identifies mentions in a text that refer to the same entity.
  • Paraphrase Generation (paraphrase): Rewrites sentences or phrases with different wording while retaining the original meaning.
  • Paraphrase Identification (paraphrase_id): Determines whether two sentences or phrases convey the same meaning.
  • Sentence Completion (sent_comp): Fills in missing parts of a sentence.
  • Sentiment Analysis (sentiment): Identifies the sentiment expressed in a text, such as positive, negative, or neutral.
  • Summarization: Condenses a longer text into a shorter summary, capturing the main points.
  • Text Generation (text_gen): Creates coherent and contextually relevant text based on a prompt.
  • Topic Classification (topic_class): Categorizes text into predefined topics.
  • Word Sense Disambiguation (wsd): Determines the meaning of a word based on its context.
  • Textual Entailment (te): Predicts whether a given text logically follows from another text.
  • Natural Language Inference (nli): Determines the relationship between two pieces of text, such as contradiction, entailment, or neutrality.
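For instance, based on the shorthand codes above, generating a yes-no question answering dataset from the same sentences should only require changing the task_type argument. Here is a sketch that reuses the objects defined earlier:

# Same call as before, but with the yes-no question answering task type
ynqa_dataset = bonito.generate_tasks(
    dataset,
    context_col="sentence",
    task_type="ynqa",
    sampling_params=sampling_params
)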

Step 4: Saving the Generated Dataset

Now we can either save the generated dataset locally or upload it to the Hugging Face Hub.
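For the local option, the datasets library can write the dataset to disk in Arrow format and reload it later; the directory name below is just an example:

from datasets import load_from_disk

# Save the synthetic dataset locally and reload it when needed
synthetic_dataset.save_to_disk("dataset_12_552_local")
reloaded_dataset = load_from_disk("dataset_12_552_local")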

To upload the dataset to the Hugging Face Hub, first log in.

from huggingface_hub import notebook_login

notebook_login()

Then create a repository for the dataset and push it to the hub.

from huggingface_hub import create_repo

repo_name = "dataset_12_552"  # Choose a name for your dataset repository
repo_url = create_repo(repo_name, repo_type="dataset")
print("Repository URL:", repo_url)
synthetic_dataset.push_to_hub("Ronal999/dataset_12_552")

Here is the dataset I’ve created from my document. It will, of course, need some further cleaning and refinement to ensure its quality before the fine-tuning process.

My synthetic dataset generated with Bonito
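As a minimal sketch of that cleaning step, you could drop empty generations and exact duplicates with pandas. The column names "input" and "output" are assumptions here; adjust them to the actual schema of your generated dataset:

from datasets import Dataset

# Convert to pandas for easy filtering; column names are assumed, adjust as needed
df = synthetic_dataset.to_pandas()
df = df[df["output"].str.strip().str.len() > 0]       # drop empty generations
df = df.drop_duplicates(subset=["input", "output"])   # remove exact duplicates
cleaned_dataset = Dataset.from_pandas(df, preserve_index=False)
print(cleaned_dataset)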

Closing Thoughts

Creating a high-quality instruction dataset is key to achieving a well-performing model, but it can be a time-consuming process.

In this guide, we’ve looked at how to use Bonito, a purpose-built, open-source model, to create instruction datasets from any text. This approach offers a solid alternative to manual annotation or to paid models like GPT-4, which can get very expensive.

Bonito is a relatively new approach, released just last month. Since a significant amount of knowledge is in unstructured data scattered across various documents, I’ve employed Bonito to automate the generation of datasets from multiple documents. These datasets are then used to train a local LLM, enabling me to customize my models to comprehend and utilize specific knowledge.

Now that your synthetic dataset is prepared, you are all set to launch the fine-tuning process! Be sure to check out the following article to kick off efficient fine-tuning of your LLM.

Unleash Mistral 7B’ Power: How to Efficiently Fine-tune a LLM on Your Own Data

Before you go! 🦸🏻 ‍♀️

If you liked my story and you want to support me:

  1. Throw some Medium love 💕 (claps, comments and highlights), your support means the world to me.👏
  2. Follow me on Medium and subscribe to get my latest article🫶


Reference

  1. Learning to Generate Instruction Tuning Datasets for Zero-Shot Task Adaptation
