LoRA – Intuitively and Exhaustively Explained

Exploring the modern wave of machine learning: cutting edge fine tuning

Natural Language Processing | Machine Learning

"Lora The Tuner" By Daniel Warfield using MidJourney. All images by the author unless otherwise specified.
"Lora The Tuner" By Daniel Warfield using MidJourney. All images by the author unless otherwise specified.

Fine tuning is the process of tailoring a Machine Learning model to a specific application, which can be vital in achieving consistent and high quality performance. In this article we’ll discuss "Low-Rank Adaptation" (LoRA), one of the most popular fine tuning strategies. First we’ll cover the theory, then we’ll use LoRA to fine tune a language model, improving its question answering abilities.

The results of fine tuning. Before fine tuning, the output is gibberish: the model repeats the question and bogus answers over and over. After fine tuning, the output is clear, concise, and accurate.

Who is this useful for? Anyone interested in learning state of the art machine learning approaches. We’ll be focusing on language modeling in this article, but LoRA is a popular choice in many machine learning applications.

How advanced is this post? This article should be approachable to novice data scientists and enthusiasts, but contains topics which are critical in advanced applications.

Pre-requisites: While not required, a solid working understanding of large language models (LLMs) would probably be useful. Feel free to refer to my article on transformers, a common form of language model, for more information:

Transformers – Intuitively and Exhaustively Explained

You’ll also probably want to have an idea of what a gradient is. I also have an article on that:

What Are Gradients, and Why Do They Explode?

If you don’t feel confident on either of these topics you can still get a lot from this article, but they exist if you get confused.

What, and Why, is Fine Tuning?

As the state of the art of machine learning has evolved, expectations of model performance have increased, requiring more complex machine learning approaches to match the demand for heightened performance. In the earlier days of machine learning it was feasible to build a model and train it in a single pass.

Training, in its simplest sense. You take an untrained model, give it data, and get a performant model.

This is still a popular strategy for simple problems, but for more complex problems it can be useful to think of training as two parts; "pre-training" then "fine tuning". The general idea is to do an initial training pass on a bulk dataset and to then refine the model on a tailored dataset.

Pre Training and Fine Tuning, a refinement of the typical single-shot training strategy.

This "pre-training" then "fine tuning" strategy can allow data scientists to leverage multiple forms of data and use large pre-trained models for specific tasks. As a result, pre-training then fine tuning is a common and incredibly powerful paradigm. It comes with a few difficulties, though, which we’ll discuss in the following section.

Difficulties with Fine Tuning

The most basic form of fine tuning is to use the same exact process you used to pre-train a model to then fine tune that model on new data. You might train a model on a huge corpus of general text data, for instance, then fine tune that model using the same training strategy on a more specific dataset.

In its simplest form, pre-training and fine tuning are procedurally identical. You pre-train a model on one set of data, then fine tune on another set of data.

This strategy can be expensive. LLMs are absolutely massive; to fine tune using this strategy you would need enough memory to store not only the entire model, but also gradients for every parameter in the entire model (gradients being the thing that lets the model know what direction to tweak its parameters). Both the parameters and the gradients need to live on a GPU, which is why training LLMs requires so much GPU memory.

Back propagation, which is the strategy used in training machine learning models. Machine learning models are "differentiable", which means you can calculate "gradients", which can tell you how a small change to a certain parameter will impact model output. We generate a prediction, calculate gradients, calculate how wrong the prediction is, then use the gradients to improve the parameters of the model. Both pre-training and fine tuning employ back propagation, which requires the computation of a gradient for every learnable parameter in the model. This means, if you have a 100 billion parameter model, you need to store 100 billion gradients as well. This cycle is done repeatedly, perhaps billions of times, to train a model.

On top of the issue of storing gradients, it’s common to save "checkpoints", which are copies of the model at a particular state throughout the training process. This is a great strategy, allowing one to experiment with the model at different phases of the fine-tuning process, but it means we need to store numerous full-size copies of the model. Falcon 180B, a popular modern LLM, requires around 360GB in storage. If we wanted to store a checkpoint of the model ten times throughout the fine-tuning process it would consume 3.6 terabytes of storage, which is a lot. Perhaps even more importantly, it takes time to save such a large amount of data. The data typically has to come off the GPU, into RAM, then onto storage; potentially adding significant delay to the fine-tuning process.
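To put some rough numbers on this, here's a hypothetical back-of-the-envelope sketch in Python (assuming fp16 weights and gradients, and ignoring optimizer state, activations, and other overhead):

"""A rough sketch of why full fine tuning is expensive (hypothetical numbers)
"""

params = 180e9        # parameters in a Falcon-180B-sized model
bytes_per_value = 2   # fp16

#weights and gradients both need to live on the GPU during training
weights_gb = params * bytes_per_value / 1e9
gradients_gb = params * bytes_per_value / 1e9
print(f"weights:   ~{weights_gb:.0f} GB")    #~360 GB
print(f"gradients: ~{gradients_gb:.0f} GB")  #another ~360 GB

#storage for full-size checkpoints adds up just as quickly
checkpoints = 10
print(f"checkpoints: ~{weights_gb * checkpoints / 1000:.1f} TB")  #~3.6 TB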

LoRA can help us deal with these issues and more. Less GPU memory usage, smaller file sizes, faster fine-tuning times, the list goes on and on. In a practical sense one can generally consider LoRA a direct upgrade of the traditional style of fine-tuning. We'll cover exactly how LoRA works and how it can achieve such remarkable improvements in the following sections.

LoRA in a Nutshell

"Low-Rank Adaptation" (LoRA) is a form of "parameter efficient fine tuning" (PEFT), which allows one to fine tune a large model using a small number of learnable parameters. LoRA employs a few concepts which, when used together, massively improve fine tuning:

  1. We can think of fine tuning as learning changes to parameters, instead of adjusting parameters themselves.
  2. We can try to compress those changes into a smaller representation by removing duplicate information.
  3. We can "load" our changes by simply adding them to the pre-trained parameters.

Don’t worry if that’s confusing; in the following sections we’ll go over these ideas step by step.

1) Fine Tuning as Parameter Changes

As we previously discussed, the most basic approach to fine tuning consists of iteratively updating parameters. Just like normal model training, you have the model make an inference, then update the parameters of the model based on how wrong that inference was.

Recall the back propagation diagram previously discussed. This is the basic form of fine tuning.

LoRA thinks of this slightly differently. Instead of thinking of fine tuning as learning better parameters, you can think of fine tuning as learning parameter changes. You can freeze the model parameters, exactly how they are, and learn the changes to those parameters necessary to make the model perform better at the fine tuned task.

This is done very similarly to training; you have the model make an inference, then update based on how wrong the inference was. However, instead of updating the model parameters, you update the change in the model parameters.

In LoRA we freeze the model parameters, and create a new set of values which describes the change in those parameters. We then learn the parameter changes necessary to perform better on the fine tuning task.

You might be thinking this is a bit of a silly abstraction. The whole point of LoRA is to make fine tuning smaller and faster; how does adding more data and extra steps allow us to do that? In the next section we'll discuss exactly that.

2) Parameter Change Compression

For the sake of illustration, dense networks are often represented as a series of weighted connections. Each input gets multiplied by some weight, and the results are added together to create outputs.

A conceptual diagram of a dense network as a list of neurons connected by weights. The value of a particular neuron would be the sum of all inputs multiplied by their respective weights.

This is a completely accurate visualization from a conceptual perspective, but under the hood this actually happens via matrix multiplication. A matrix of values, called a weight matrix, gets multiplied by a vector of inputs to create the vector of outputs.

A conceptual diagram of matrix multiplication. Source

To give you an idea of how matrix multiplication works: in the example above, the red dot is equal to a₁₁•b₁₂ + a₁₂•b₂₂. As you can see, this combination of multiplication and addition is very similar to that found in the neuron example. If we create correctly shaped matrices, matrix multiplication ends up being exactly identical to the concept of weighted connections.

Thinking of a dense network as weighted connections on the left, and as matrix multiplication on the right. On the right hand side diagram, the vector on the left would be the input, the matrix in the center would be the weight matrix, and the vector on the right would be the output. Only a portion of values are included for readability.

From the perspective of LoRA, understanding that weights are actually a matrix is incredibly important, as a matrix has certain properties which can be leveraged to condense information.
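As a quick, toy illustration (arbitrary shapes and values), the two views produce identical outputs:

"""A toy sketch showing that a dense layer is just matrix multiplication
"""
import numpy as np

W = np.array([[0.2, -0.5, 1.0],
              [0.7,  0.1, -0.3]])   #weight matrix: 2 outputs, 3 inputs
x = np.array([1.0, 2.0, 3.0])       #input vector

#"weighted connections" view: each output is a sum of inputs times weights
out_manual = np.array([sum(W[i, j] * x[j] for j in range(3)) for i in range(2)])

#matrix multiplication view: the exact same computation
out_matmul = W @ x

print(np.allclose(out_manual, out_matmul))  #True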

Matrix Property 1) Linear Independence

You can think of a matrix, which is a two dimensional array of values, as either rows or columns of vectors. For now let’s just think of matrices as rows of vectors. Say we have a matrix consisting of two vectors which look something like this:

A matrix consisting of two vectors, represented as rows in the matrix.

Each of these vectors points in a different direction. You can't squash and stretch one vector to be equal to the other vector.

Each row of the matrix, plotted as a vector. No matter how the blue vector gets squashed or stretched, it will never point in the same direction as the red vector, and vice versa.

Let’s add a third vector into the mix.

Vectors A and B are pointing in the same exact direction, while vector C is pointing in a different direction. As a result, no matter how you squash and stretch either A or B, they can never be used to describe C. Therefore, C is linearly independent from A and B. However, you can stretch A to equal B, and vice versa, so A and B are linearly dependent.

Let’s say A and B pointed in slightly different directions.

Now A and B can be used together (with some squashing and stretching) to describe C, and likewise A and B can be described by the other vectors. In this situation we would say none of the vectors are linearly independent, because all vectors can be described with other vectors in the matrix.

Using A and B to describe C. B can be multiplied by a negative number to flip its direction, then added to A.

Conceptually speaking, linearly independent vectors can be thought of as containing different information, while linearly dependent vectors contain some duplicate information between them.

Matrix Property 2) Rank

The idea of rank is to quantify the amount of linear independence within a matrix. I'll skip the nitty gritty details and get straight to the point: we can break a matrix down into some number of linearly independent vectors; this form of the matrix is called "reduced row echelon form".

A matrix (left) and that same matrix in reduced row echelon form (right). In the RREF matrix you can see that there are four linearly independent vectors (rows). Each of these vectors can be used in combination to describe all vectors in the input matrix.

By breaking the matrix down into this form (I won’t describe how because this is only useful to us conceptually), you can count how many linearly independent vectors can be used to describe the original matrix. The number of linearly independent vectors is the "rank" of the matrix. The rank of the RREF matrix above would be four, as there are four linearly independent vectors.

A little note I'll drop in here: regardless of whether you consider a matrix in terms of rows of vectors or columns of vectors, the rank is always the same. This is a mathy little detail which isn't super important, but it does have conceptual implications for the next section.
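To make rank a bit more tangible, here's a small sketch (with made-up values) using numpy's built-in rank computation:

"""A small sketch of rank: rows that duplicate information don't increase it
"""
import numpy as np

M = np.array([[1.0, 0.0, 2.0, 1.0],
              [0.0, 1.0, 1.0, 0.0],
              [1.0, 1.0, 3.0, 1.0],   #row 1 + row 2: no new information
              [2.0, 0.0, 4.0, 2.0]])  #2 * row 1: no new information

print(np.linalg.matrix_rank(M))  #2, even though the matrix has 4 rows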

Matrix Property 3) Matrix Factors

So, matrices can contain some level of "duplicate information" in the form of linear dependence. We can exploit this idea using factorization to represent a large matrix in terms of two smaller matrices. Similarly to how a large number can be represented as the multiplication of two smaller numbers, a matrix can be thought of as the multiplication of two smaller matrices.

The two vectors on the right, when multiplied together, are equivalent to the matrix on the left. Even though they have the same value, the vectors on the right occupy 40% of the space that the matrix on the left occupies. The larger the matrix becomes, the more space the factors tend to save.

If you have a large matrix, with a significant degree of linear dependence (and thus a low rank), you can express that matrix as a factor of two comparatively small matrices. This idea of factorization is what allows LoRA to occupy such a small memory footprint.
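Here's a small sketch (with arbitrary dimensions) of just how dramatic those savings can be:

"""A sketch of how factorization saves space for low-rank matrices
"""
import numpy as np

d, r = 1000, 4                     #a 1000x1000 matrix with rank at most 4

A = np.random.randn(d, r)          #1000 x 4 factor
B = np.random.randn(r, d)          #4 x 1000 factor
M = A @ B                          #1000 x 1000 matrix, rank <= 4

print(M.size)                      #1,000,000 values in the full matrix
print(A.size + B.size)             #8,000 values in the factors, 125x smaller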

The Core Idea Behind LoRA

LoRA thinks of tuning not as adjusting parameters, but as learning parameter changes. With LoRA we don’t learn the parameter changes directly, however; we learn the factors of the parameter change matrix.

Diagram of LoRA, from the LoRA paper. Matrices A and B are trained to find optimal changes to the pretrained weights. We'll talk about "r" in a future section.

This idea of learning factors of the change matrix relies on the core assumption that weight matrices within a large language model have a lot of linear dependence, as a result of having significantly more parameters than are theoretically required. Over-parameterization has been shown to be beneficial in pre-training (which is why modern machine learning models are so large). The idea behind LoRA is that, once you've learned the general task with pre-training, you can do fine tuning with significantly less information.

learned over-parametrized models in fact reside on a low intrinsic dimension. We hypothesize that the change in weights during model adaptation also has a low "intrinsic rank", leading to our proposed Low-Rank Adaptation (LoRA) approach. LoRA allows us to train some dense layers in a neural network indirectly by optimizing rank decomposition matrices of the dense layers’ change during adaptation instead, while keeping the pre-trained weights frozen – The LoRA Paper

This results in a significantly smaller number of parameters being trained, which means an overall faster, more storage-efficient, and more memory-efficient fine tuning process.

Fine-Tuning Flow with LoRA

Now that we understand how the pieces of LoRA generally work, let’s put it all together.

So, first, we freeze the model parameters. We’ll be using these parameters to make inferences, but we won’t update them.

We create two matrices. These are sized in such a way that, when they’re multiplied together, they’ll be the same size as the weight matrices of the model we’re fine tuning. In a large model, with multiple weight matrices, you would create one of these pairs for each weight matrix.

The LoRA paper refers to these matrices as matrix "A" and matrix "B". Together, these matrices represent the learnable parameters during LoRA fine tuning.

We calculate the change matrix by multiplying A and B together.

Then we pass our input through both the frozen weights and the change matrix.

We calculate a loss based on the combination of both outputs, then we update matrices A and B based on that loss.

Note, while the change matrix is displayed here for illustrative purposes, in reality it's computed on the fly and never stored, which is why LoRA has such a small memory footprint. In reality, only the model parameters, the matrices A and B, and the gradients of A and B are stored during training.

We do this operation until we’ve optimized the factors of the change matrix for our fine tuning task. The backpropagation step to update the matrices A and B is much faster than the process to update the full set of model parameters, on account of A and B being significantly smaller. This is why, despite more operations in the training process, LoRA is still typically faster than traditional fine-tuning.
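To make this flow concrete, here is a minimal, hypothetical sketch of a LoRA-style linear layer in PyTorch. This is not the implementation used by the peft library later in this article; it's just meant to show the mechanics:

"""A minimal, hypothetical sketch of a LoRA-style linear layer
"""
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, linear: nn.Linear, r: int = 8, alpha: float = 8.0):
        super().__init__()
        self.linear = linear
        #freeze the pre-trained weights
        for p in self.linear.parameters():
            p.requires_grad = False

        out_features, in_features = linear.weight.shape
        #the learnable factors of the change matrix
        self.A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_features, r))
        self.scale = alpha / r

    def forward(self, x):
        #frozen path plus the low-rank change path (B @ A is the change matrix)
        return self.linear(x) + (x @ self.A.T @ self.B.T) * self.scale

#usage: wrap an existing layer; only A and B receive gradients
layer = LoRALinear(nn.Linear(512, 512), r=8)
output = layer(torch.randn(2, 512))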

When we ultimately want to make inferences with this fine tuned model, we can simply compute the change matrix and add the changes to the weights. This means LoRA does not change the inference time of the model.

A cool little note: we can even multiply the change matrix by a scaling factor, allowing us to control the level of impact the change matrix has on the model. In theory, we could use a bit of this LoRA and a dash of that LoRA at the same time, an approach which is common in image generation.
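A brief sketch of that merging step (hypothetical shapes; W is a frozen weight matrix, A and B are learned factors):

"""Merging a LoRA into the frozen weights for inference (a sketch)
"""
import torch

W = torch.randn(512, 512)  #frozen pre-trained weight matrix
A = torch.randn(8, 512)    #learned factor A
B = torch.randn(512, 8)    #learned factor B

scale = 0.8                      #how strongly this LoRA is applied
W_merged = W + scale * (B @ A)   #add the change matrix into the weights

#in principle, multiple LoRAs could be blended the same way:
#W_merged = W + 0.5 * (B1 @ A1) + 0.3 * (B2 @ A2)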

A Note on LoRA For Transformers

When researching this article I found a conceptual disconnect which a lot of people didn’t discuss. It’s fine to treat a machine learning model as a big box of weights, but in actuality many models have a complex structure which isn’t very box like. It wasn’t obvious to me how, exactly, this concept of a change matrix applies to the parameters in something like a transformer.

The transformer diagram, which I cover in another article. The "Nx" symbol represents the fact that both the left and right side get repeated numerous times. This is not a clean square of weights, and thus it's not obvious how LoRA might be applied. Image source

Based on my current understanding, for transformers specifically, there are two things to keep in mind:

  1. Typically the dense network in a transformer’s multi-headed self-attention layer (the one that constructs the query, key, and value) is only of depth one. That is, there’s only an input layer and an output layer connected by weights.
  2. These shallow dense networks, which comprise most of the learnable parameters in a transformer, are very very large. There might be over 100,000 input neurons being connected to 100,000 output neurons, meaning a single weight matrix, describing one of these networks, might have 10B parameters. So, even though these networks might be of depth one, they’re super duper wide, and thus the weight matrix describing them is super duper large.

From the perspective of LoRA on transformer models, these are the chief parameters being optimized; you’re learning factorized changes for each of these incredibly large, yet shallow, dense layers which exist within the model. Each of these shallow dense layers, as previously discussed, has weights which can be represented as a matrix.
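Putting rough numbers on that (a hypothetical, round-number example):

"""Rough, hypothetical numbers for one wide, shallow attention projection
"""
d_in, d_out = 100_000, 100_000

full_weight_matrix = d_in * d_out      #10,000,000,000 parameters
r = 8                                  #the LoRA rank, discussed in the next section
lora_factors = d_in * r + r * d_out    #1,600,000 parameters

print(full_weight_matrix)  #10 billion values in the full weight matrix
print(lora_factors)        #1.6 million values in the LoRA factors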

A Note on LoRA Rank

LoRA has a hyperparameter, named r, which describes the inner dimension of the A and B matrices used to construct the change matrix discussed previously. Higher r values mean larger A and B matrices, which means they can encode more linearly independent information in the change matrix.

Diagram of LoRA, from the LoRA paper. The "r" parameter can be thought of as an "information bottleneck". Low r values mean A and B can encode less information with a smaller memory footprint. Larger r values mean A and B can encode more information, but with a larger memory footprint.
A conceptual diagram of LoRA with an r value equal to 1 and 2. In both examples the decomposed A and B matrices result in the same sized change matrix, but r=2 is able to encode more linearly independent information into the change matrix, due to having more information in the A and B matrices.

It turns out the core assumption the LoRA paper makes, that the change to model parameters has a low intrinsic rank, is a pretty strong assumption. The folks at Microsoft (the publishers of LoRA) tried out a few r values and found that even A and B matrices of rank one perform surprisingly well.

From the LoRA paper

Generally, in selecting r, the advice I’ve heard is the following: When the data is similar to the data used in pre-training, a low r value is probably sufficient. When fine tuning on very new tasks, which might require substantial logical changes within the model, a high r value may be required.
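For a sense of how r trades memory for capacity, here's a small sketch counting LoRA parameters for a single hypothetical 4096 x 4096 weight matrix:

"""How LoRA parameter count scales with r (hypothetical 4096 x 4096 layer)
"""
d = 4096
for r in [1, 2, 4, 8, 16, 64]:
    lora_params = d * r + r * d
    print(f"r={r:>2}: {lora_params:>9,} LoRA parameters "
          f"({100 * lora_params / (d * d):.2f}% of the full matrix)")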

LoRA in Python

Considering how much theory we went over you might be expecting a pretty long tutorial, but I have good news! HuggingFace has a module which makes LoRA super duper easy.

In this example we’ll be fine tuning a pre-trained model for question answering. Let’s go ahead and jump right in. Full code can be found here:

MLWritingAndResearch/LoRA.ipynb at main · DanielWarfield1/MLWritingAndResearch

1) Downloading Dependencies

We’ll be using a few modules which are beyond a simple PyTorch project. This is what they do:

  • bitsandbytes: for representing models using smaller datatypes, saving on memory.
  • datasets: for downloading datasets
  • accelerate: a required dependency for some of the other modules
  • loralib: LoRA implementation
  • peft: a general "parameter efficient fine tuning" module, our interface for LoRA
  • transformers: for downloading and using pre-trained transformers from huggingface.
!pip install -q bitsandbytes datasets accelerate loralib
!pip install -q git+https://github.com/huggingface/peft.git git+https://github.com/huggingface/transformers.git

2) Loading Pre-Trained Model

We’ll be using BLOOM, an open source and permissively licensed language model. We’ll be using the 560 million parameter version to save on memory, but you could apply this same strategy to larger versions of BLOOM.

"""Importing dependencies and downloading pre-trained bloom model
"""

import torch
import torch.nn as nn
import bitsandbytes as bnb
from transformers import AutoTokenizer, AutoConfig, AutoModelForCausalLM

#loading model
model = AutoModelForCausalLM.from_pretrained(
    # "bigscience/bloom-3b",
    # "bigscience/bloom-1b1",
    "bigscience/bloom-560m",
    torch_dtype=torch.float16,
    device_map='auto',
)

#loading tokenizer for this model (which turns text into an input for the model)
tokenizer = AutoTokenizer.from_pretrained("bigscience/tokenizer")

3) Setting Up LoRA

Configuring LoRA with the following parameters:

  • r: the rank of the A and B matrices
  • lora_alpha: this is a pretty controversial parameter. A lot of people have a lot of ideas about it. You can consider it a scaling factor, and by default it should be equal to r, as far as I understand.
  • target_modules: the portions of the model we want to optimize with LoRA. The BLOOM model has modules named query_key_value which we want to optimize.
  • lora_dropout: dropout is a technique which hides inputs to suppress the model from overfitting (called regularization). This is a probability of being hidden.
  • bias: neural networks typically have two parameters per connection, a "weight" and a "bias". We’re only training weights in this example.
  • task_type: not super necessary, used in the superclass PeftConfig. Setting to CAUSAL_LM because the specific language model we’re using is "causal".
"""Setting up LoRA using parameter efficient fine tuning
"""

from peft import LoraConfig, get_peft_model

#defining how LoRA will work in this particular example
config = LoraConfig(
    r=8,
    lora_alpha=8,
    target_modules=["query_key_value"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

#this actually overwrites the model in memory, so
#the rename is only for legibility.
peft_model = get_peft_model(model, config)

4) Examining Memory Savings

One of the big ideas of LoRA is that training involves significantly fewer trainable parameters, meaning a large savings in terms of memory consumption. Let’s see exactly how much we’re saving in this particular example.

"""Comparing parameters before and after LoRA
"""

trainable_params = 0
all_param = 0

#iterating over all parameters
for _, param in peft_model.named_parameters():
    #adding parameters to total
    all_param += param.numel()
    #adding parameters to trainable if they require a gradient
    if param.requires_grad:
        trainable_params += param.numel()

#printing results
print(f"trainable params: {trainable_params}")
print(f"all params: {all_param}")
print(f"trainable: {100 * trainable_params / all_param:.2f}%")
The results of comparing the trainable parameters in LoRA to the parameters in the original model. In this example, we're training just over one tenth of a percent of the original parameter count.

5) Loading Fine Tuning Dataset

We’ll be using the SQUAD dataset to improve our language model’s performance on question answering. The Stanford Question Answering Dataset (SQUAD) is a high quality, commonly used, and permissively licensed dataset.

"""Loading SQUAD dataset
"""

from datasets import load_dataset
qa_dataset = load_dataset("squad_v2")

6) Re-Structuring Data

We’re going to be fine tuning the language model on a specific structure of data. The model will expect text in this general form:

**CONTEXT:**
{context}

**QUESTION:**
{question}

**ANSWER:**
{answer}</s>

We’ll provide to the model the context and question, and the model will be expected to provide an answer to us. So, we’ll be reformatting the data in SQUAD to respect this format.

"""Reformatting SQUAD to respect our defined structure
"""

#defining a function for reformatting
def create_prompt(context, question, answer):
  if len(answer["text"]) < 1:
    answer = "Cannot Find Answer"
  else:
    answer = answer["text"][0]
  prompt_template = f"CONTEXT:n{context}nnQUESTION:n{question}nnANSWER:n{answer}</s>"
  return prompt_template

#applying the reformatting function to the entire dataset
mapped_qa_dataset = qa_dataset.map(lambda samples: tokenizer(create_prompt(samples['context'], samples['question'], samples['answers'])))
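
If you want to sanity check the formatting (optional), you can print one formatted prompt before tokenization, using the dataset and function defined above:

"""Optional sanity check: inspecting one formatted prompt
"""
sample = qa_dataset["train"][0]
print(create_prompt(sample["context"], sample["question"], sample["answers"]))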

7) Fine Tuning on SQUAD using LoRA

This code is largely co-opted. In the absence of a rigid validation procedure, the best practice is to copy from a successful tutorial or, better yet, directly from the documentation. If you were to train an actual model for an actual use case, you would probably want to research and potentially optimize some of these parameters.

"""Fine Tuning
This code is largely co-opted. In the absence of a rigid validation
procedure, the best practice is to copy from a successful tutorial or,
better yet, directly from the documentation.
"""

import transformers

trainer = transformers.Trainer(
    model=peft_model,
    train_dataset=mapped_qa_dataset["train"],
    args=transformers.TrainingArguments(
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        warmup_steps=100,
        max_steps=100,
        learning_rate=1e-3,
        fp16=True,
        logging_steps=1,
        output_dir='outputs',
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False)
)
peft_model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
trainer.train()
The loss (how much error is in the model). We don't have to look at loss super closely in this example, but it serves as a good metric. We train for 100 steps and, while there's some random variation in loss across steps, the loss generally goes down throughout the course of training, which is good.

8) Checking LoRA Size

Let’s go ahead and save our LoRA optimization:

"""Saving the LoRA fine tuning locally
"""
model_id = "BLOOM-560m-LoRA"
peft_model.save_pretrained(model_id)

Then check how large the file is on our file system:

!ls -lh {model_id}

The BLOOM 560m model, in its float16 datatype, is over 1 gigabyte in total size. With LoRA, since we only need to save the decomposed matrices, our checkpoint size is a mere 3 megabytes. That’s like compressing the entirety of the game "Plants vs Zombies" down into a single image taken on an iPhone.
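
If you later wanted to re-attach this saved adapter to a fresh copy of the base model (a sketch, not run in the notebook above, reusing the imports and model_id from earlier), the peft library's PeftModel class can load it from the directory we just saved:

"""Reloading the saved LoRA adapter onto a fresh base model (a sketch)
"""
from peft import PeftModel

base_model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom-560m",
    torch_dtype=torch.float16,
    device_map='auto',
)
reloaded_model = PeftModel.from_pretrained(base_model, model_id)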

9) Testing

Ok, so we have a LoRA fine tuned model, let’s ask it a few questions. First, we’ll define a helper function which will take in a context and question, run predictions, and generate a response.

"""Helper Function for Comparing Results
"""

from IPython.display import display, Markdown

def make_inference(context, question):

    #turn the input into tokens
    batch = tokenizer(f"**CONTEXT:**n{context}nn**QUESTION:**n{question}nn**ANSWER:**n", return_tensors='pt', return_token_type_ids=False)
    #move the tokens onto the GPU, for inference
    batch = batch.to(device='cuda')

    #make an inference with both the fine tuned model and the raw model
    with torch.cuda.amp.autocast():
        #I think inference time would be faster if the LoRA weights were
        #merged into the model, but keeping them separate lets me compare
        #the model before and after fine tuning simultaneously

        #raw model
        peft_model.disable_adapter_layers()
        output_tokens_raw = model.generate(**batch, max_new_tokens=200)

        #LoRA model
        peft_model.enable_adapter_layers()
        output_tokens_qa = peft_model.generate(**batch, max_new_tokens=200)

    #display results
    display(Markdown("# Raw Modeln"))
    display(Markdown((tokenizer.decode(output_tokens_raw[0], skip_special_tokens=True))))
    display(Markdown("n# QA Modeln"))
    display(Markdown((tokenizer.decode(output_tokens_qa[0], skip_special_tokens=True))))

Let’s take a look at a few examples, and see just how much better our fine tuned model is at question answering:

Example 1)

context = "You are a monster, and you eat yellow legos."
question = "What is the best food?"

make_inference(context, question)

Example 2)

context = "you are a math wizard"
question = "what is 1+1 equal to?"

make_inference(context, question)
We're only using a 560M parameter model, so it's not a surprise that it's not very good at basic reasoning. Asking it what 1+1 is might have been a bit of a stretch, but at least it failed more elegantly.

Example 3)

context = "Answer the riddle"
question = "What gets bigger the more you take away?"

make_inference(context, question)
Again, we're only using a 560M parameter model. That said, while the fine-tuned model still failed to answer the riddle, it failed significantly more elegantly.

Conclusion

And that’s it! We covered the concept of fine tuning, and how LoRA thinks of fine tuning as learning changes in parameters, rather than learning new parameters directly. We learned about linear independence and rank, and how the change matrix can be represented by small factors because of its low intrinsic rank. We put it all together, went through LoRA step by step, then used the HuggingFace PEFT module to implement LoRA on a question answering task.

Follow For More!

I describe papers and concepts in the ML space, with an emphasis on practical and intuitive explanations.

Never expected, always appreciated. By donating you allow me to allocate more time and resources towards more frequent and higher quality articles. Link

Attribution: All of the resources in this document were created by Daniel Warfield, unless a source is otherwise provided. You can use any resource in this post for your own non-commercial purposes, so long as you reference this article, https://danielwarfield.dev, or both. An explicit commercial license may be granted upon request.

