Fine-Tuning Language Models the Easy Way with blather

An open-source library to train and use generative text models in a few lines of code

Daniel Carroll
Towards Data Science

--

Photo by saeed karimi on Unsplash

If you want to skip ahead and just see how to fine-tune a model on your own dataset, you can skip the rest of the article and check out the Colab notebook!

Presenting guest lectures is one of my favorite things; all the joy of teaching without any of the responsibility. My goal in a lecture is to get the students excited about machine learning, get them playing around with practical examples, and then get the heck out of the way.

This week I had the opportunity to talk with some students from the University of Arizona’s Cyber program about machine learning. In a pre-lecture survey, one of the major concerns the students identified was the developing threat of high-fidelity bots capable of influencing conversations on social media.

Creating botnets capable of influencing elections or spreading disinformation is a task of much greater scope than what can be accomplished in a single class, both in terms of the ethical questions around disseminating such content and in terms of complexity. As a simple first step, we could start by talking about how one might get a bot to articulate posts with suitable content and style. This subtask can be accomplished with fine-tuned natural language models. Language models like the GPT family, XLNet, or our friends from Sesame Street, BERT and ELMo, are powerful models that are beyond most individuals’ means to train from scratch. Thankfully, the open-source community has stepped up and released pre-trained models, which allow those of us without access to serious data and compute to experiment with them.

Unfortunately, direct application of these models produces output strongly biased toward the datasets our open-source friends trained them on. For our proposed botnet application, this would make it difficult to get posts anywhere close to the desired personality we are attempting to mimic. To overcome this shortcoming we apply a method known as fine-tuning: taking a model trained on a large generic dataset and partially retraining it on a smaller dataset whose style and content better match the target. This approach lets us use models that would otherwise be intractable for us to train from scratch due to lack of data or compute resources.

Here I ran into a problem: I wanted to introduce the students to some of the high-level concepts of natural language models and give them tools to train and experiment with their own, without boring them to death or exposing all of the details in an overwhelming manner. If only there were a simple library that allowed me to train models and then use them to generate text without having to understand much of what was going on beneath the surface.

This led me to develop blather, a simple abstraction library sitting on top of the wonderful Hugging Face library (https://huggingface.co/).

blather is basically just the standard Hugging Face fine-tuning workflow with fewer options available to you; something like the sketch below.
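
For context, here is a rough sketch of fine-tuning GPT-2 directly with the transformers library. This is illustrative only, not blather’s actual internals; the exact classes and arguments vary between transformers versions, and the file name and hyperparameters are placeholders.

# A rough, version-dependent sketch of fine-tuning GPT-2 with the
# Hugging Face transformers library. Illustrative only; not taken
# from blather's source code.
from transformers import (GPT2LMHeadModel, GPT2TokenizerFast, TextDataset,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Build a language-modeling dataset from a plain text file
dataset = TextDataset(tokenizer=tokenizer, file_path="prince.txt",
                      block_size=128)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

# Fine-tune for one epoch
args = TrainingArguments(output_dir="out", num_train_epochs=1,
                         per_device_train_batch_size=2)
Trainer(model=model, args=args, data_collator=collator,
        train_dataset=dataset).train()

# Generate a continuation from a prompt
inputs = tokenizer("The fool was known to blather about",
                   return_tensors="pt")
print(tokenizer.decode(model.generate(**inputs, max_length=50,
                                      do_sample=True)[0],
                       skip_special_tokens=True))

None of this is difficult, but it is a lot of ceremony to put in front of students who just want to see a model write.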

blather allows a user to read a dataset from a text file and fine-tune a model on it (right now only GPT-2 is supported, but I plan on adding support for other models in the future). The user can then generate samples from the model or save/load the model as desired.

Step 1. Install blather

!pip install blather

Step 2. Import blather

from blather import Blather

Step 3. Create a blather object

blather = Blather()

Step 4. Read Some Text

blather.read("prince.txt")

This will produce a training-time estimate and eventually some training stats. You can optionally pass an epochs parameter with the desired number of training cycles; it defaults to training for a single epoch (see the example after the output below).

Training...
Estimated Training time of 3 minutes
Running Validation...
Validation Loss: 1.62
[{'Training Loss': 2.586719648164395,
'Valid. Loss': 1.6193444598011855,
'epoch': 1}]
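
For example, to train for three epochs instead of one (assuming epochs is accepted as a keyword argument to read; I have not verified the exact signature):

# Train for three epochs; the keyword form is an assumption based on
# the description above.
blather.read("prince.txt", epochs=3)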

Step 5. Write Some Text

blather.write("The fool was known to blather about")

This calls the model to produce a generated ending to our prompt text; in this case, our Machiavellian bot says that…

The fool was known to blather about it at every time, and I never heard a word of it," said the pope to Olivertronicus in the year 1543
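
As noted earlier, the fine-tuned model can also be saved and reloaded. The method names below are assumptions based on that description rather than a verified API, so check the project’s documentation for the exact calls.

# Hypothetical save/load calls; the exact method names and signatures
# may differ in the actual blather API.
blather.save("machiavelli.model")

blather = Blather()
blather.load("machiavelli.model")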

Summary

I wrote blather for people who are interested in fine-tuning their own language models but don’t want to get too deep into the details of doing so. I hope it’s helpful, or at the very least entertaining, to a few folks. If you’re interested in working on the project, or just want to talk data science or ML, please reach out.
