
In 2021, Hu et al. proposed Low-Rank Adaptation (LoRA) for large language models (LLMs). This method significantly reduces the cost of fine-tuning LLMs by training only a small number of added parameters (low-rank networks) while keeping the model's original parameters (high-rank networks) frozen.
With LoRA, we still need an existing pre-trained model to fine-tune: because of the low-rank restriction, it can't pre-train a good LLM from scratch. This leaves pre-training unaffordable for most individuals and organizations.
To reduce this cost, Lialin et al. (2023) proposed ReLoRa, a modification of LoRA that makes it possible to pre-train LLMs from scratch.
In this article, I first explain how ReLoRa works. Then, I analyze and comment on the results presented in the scientific paper describing ReLoRa. In the last section, I show how to set up and run ReLoRa on your computer.
_Note about the licenses: The scientific paper published on arXiv and describing ReLoRa is distributed under a CC BY 4.0 license. The source code of ReLoRa is published on GitHub and distributed under an Apache 2.0 license allowing commercial use._
ReLoRa: from low-rank to high-rank networks
To understand how ReLoRa works, we must first take a closer look at LoRA.
LoRA works by adding two new sets of trainable parameters, the low-rank matrices A and B, whose product forms the weight update. After training, this update is merged back into the original frozen high-rank network of the pre-trained model.
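To make this concrete, here is a minimal PyTorch sketch of a LoRA linear layer. This is my own illustration under simplified assumptions (random frozen weights, hypothetical class name), not the implementation from the paper or from the peft library:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA sketch: frozen weight W plus a trainable low-rank update B @ A."""
    def __init__(self, in_features, out_features, rank=8, alpha=16):
        super().__init__()
        # Frozen pre-trained weight (randomly initialized here for the sketch)
        self.weight = nn.Parameter(torch.randn(out_features, in_features), requires_grad=False)
        # Trainable low-rank factors: A is Gaussian-initialized, B starts at zero,
        # so the update B @ A is zero at the beginning of training
        self.A = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_features, rank))
        self.scaling = alpha / rank

    def forward(self, x):
        return x @ self.weight.T + self.scaling * (x @ self.A.T @ self.B.T)

    @torch.no_grad()
    def merge(self):
        # Fold the low-rank update back into the frozen weight
        self.weight += self.scaling * (self.B @ self.A)
```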
It may seem obvious, but it is important to understand that the sum of two low-rank updates can have a higher rank than either update alone, while never exceeding the sum of their individual ranks. For two updates ΔW1 = B1·A1 and ΔW2 = B2·A2, this can be formalized as follows:

rank(ΔW1 + ΔW2) ≤ rank(ΔW1) + rank(ΔW2)
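A quick numerical check, a toy example of mine rather than something from the paper, illustrates this: summing two independent rank-4 updates almost surely produces a rank-8 matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 64, 4

# Two independent rank-r updates, like those produced by two LoRA cycles
dW1 = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))
dW2 = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))

print(np.linalg.matrix_rank(dW1))        # 4
print(np.linalg.matrix_rank(dW2))        # 4
print(np.linalg.matrix_rank(dW1 + dW2))  # 8: the sum reaches rank(dW1) + rank(dW2)
```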
LoRA trains only these two sets of parameters. However, if we could reset them, train them, and merge them back into the original high-rank network multiple times in a row, we would increase the total rank of the accumulated update over time. In other words, we would obtain a higher-rank model.
Why doesn't LoRA perform these resets?
Because there are several significant obstacles to overcome to make these resets useful. Standard LLM training uses the Adam optimizer, which keeps its own states. Resetting LoRA's trainable parameters without also resetting Adam's states would push the new LoRA parameters in the same direction as those of the previous iteration. The model wouldn't learn anything new.
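You can inspect these optimizer states directly in PyTorch; this small illustration is mine, not code from the ReLoRa repository:

```python
import torch

layer = torch.nn.Linear(4, 4)
optimizer = torch.optim.Adam(layer.parameters(), lr=1e-3)

layer(torch.randn(2, 4)).sum().backward()
optimizer.step()

# Adam keeps per-parameter first and second moment estimates. Naively
# reinitializing a parameter leaves these stale moments behind, so subsequent
# updates would keep pushing the fresh parameter in the old direction.
state = optimizer.state[layer.weight]
print(state["exp_avg"].shape, state["exp_avg_sq"].shape)  # torch.Size([4, 4]) twice
```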
One of the main ideas proposed in ReLoRa is to also partially reset the Adam optimizer's states, in combination with "a jagged scheduler to stabilize training and warm starts". This jagged scheduler resets the learning rate to 0 at each restart and warms it up again over a few training steps.

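The figure in the paper shows the resulting sawtooth-shaped learning-rate curve. Here is a minimal sketch of such a jagged schedule; the function and argument names are mine, not the repository's:

```python
import math

def jagged_cosine_lr(step, total_steps, cycle_length, restart_warmup, base_lr=1e-3):
    # Underlying cosine decay over the whole training run
    lr = 0.5 * base_lr * (1 + math.cos(math.pi * step / total_steps))
    # After each ReLoRa reset, ramp the learning rate back up from 0
    pos_in_cycle = step % cycle_length
    if step >= cycle_length and pos_in_cycle < restart_warmup:
        lr *= pos_in_cycle / restart_warmup
    return lr

# Example: 100 steps, a reset every 25 steps, 5 warm-up steps after each reset
lrs = [jagged_cosine_lr(s, 100, 25, 5) for s in range(100)]
```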
More formally, the full ReLoRa training procedure combines the LoRA resets, the partial optimizer reset, and the jagged scheduler into a single loop.

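The paper presents this loop as an algorithm. Below is a hedged Python-style sketch of it; lora_modules, prune_optimizer_states, and reset_frequency are placeholder names of mine, not identifiers from the repository:

```python
# Sketch of the ReLoRa training loop (not the repository's actual code)
for step, batch in enumerate(dataloader):
    loss = model(batch).loss
    loss.backward()
    optimizer.step()       # Adam updates the LoRA parameters only
    scheduler.step()       # jagged schedule shown above
    optimizer.zero_grad()

    if (step + 1) % reset_frequency == 0:
        for module in lora_modules(model):
            module.merge()   # fold the low-rank update B @ A into the frozen W
            module.reinit()  # fresh A (random) and B (zeros)
        # Partial optimizer reset: the paper prunes most of Adam's moments for
        # the LoRA parameters instead of zeroing them entirely
        prune_optimizer_states(optimizer)
        # The jagged scheduler then warms the learning rate up from 0
```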
Results: ReLoRa yields a lower perplexity than LoRA
In practice, ReLoRa seems to yield results similar to standard pre-training, for a much lower cost, but only above a minimum number of parameters.
_Table (from the ReLoRa paper): validation perplexity of full pre-training, LoRA, and ReLoRa across model sizes._
As the table above shows, ReLoRa performs roughly as well as "full training" (without LoRA and without freezing any parameters) for models with 250 million parameters or more.
They also included results for pre-training with standard LoRA. It performs poorly, which illustrates very well how critical the resets are in ReLoRa.
For these experiments, they used a neural network architecture similar to Meta's LLaMA models. For most of these models, training took only one day on 8 RTX 4090 GPUs (which are consumer GPUs).
They didn't experiment with models larger than 350M parameters (approximately the size of BERT large) due to the computational cost.
In an ablation study, the authors also demonstrate that ReLoRa's restarts and the scheduler's new warm-ups are essential to achieving a lower perplexity. They also show that removing the jagged scheduler can lead to training divergence.
The computational efficiency of ReLoRa
ReLoRa iteratively trains and adds new parameters to the model, while keeping the parameters from previous iterations frozen.
ReLoRa is thus as memory-efficient as LoRA. Moreover, the frozen parameters can be quantized at low precision to further reduce memory usage, for instance with QLoRA, which I described in a previous article.
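To get an intuition for the savings, here is a back-of-the-envelope estimate; the parameter counts and byte sizes are my own assumptions, not figures from the paper:

```python
# Rough memory estimate for a 250M-parameter model. Assumption: about 2% of
# the parameters are trainable low-rank adapters.
total_params = 250e6
lora_params = 5e6

frozen_fp16 = total_params * 2 / 1e9    # GB: frozen weights in fp16
frozen_4bit = total_params * 0.5 / 1e9  # GB: frozen weights quantized to 4-bit

# Trainable parameters need a weight and a gradient (fp16 each) plus two fp32
# Adam moments: 2 + 2 + 4 + 4 = 12 bytes per parameter
trainable = lora_params * 12 / 1e9

print(f"frozen fp16: {frozen_fp16:.2f} GB, frozen 4-bit: {frozen_4bit:.2f} GB")
print(f"trainable overhead: {trainable:.2f} GB")
```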
Running ReLoRa on your computer
Lialin et al. (2023) released their implementation of ReLoRa on GitHub.
Since ReLoRa already works well at 250 million parameters, a size small enough for consumer hardware, we can run it, e.g., on a GPU with more than 6 GB of VRAM or on a free instance of Google Colab.
You can easily reproduce their experiments and pre-train your own 250-million-parameter LLM as follows.
Note: If you want to directly test it without coding anything, I created a Google Colab notebook on The Kaitchup (my substack newsletter). Search for notebook #3.
First, clone the repository:
git clone https://github.com/Guitaricet/peft_pretraining.git
And then install all the requirements:
cd peft_pretraining
pip install -r requirements.txt
Since training takes a lot of time, I recommend first trying the framework with hyperparameters that stop the training early and shorten the validation (which can take hours on consumer hardware).
The framework doesn’t have the option to shorten the validation, so we will have to do it manually. Open the file torchrun_main.py, and replace the line:
if evaluated_on_tokens > target_eval_tokens:
Note: At the time of writing this article, this is line 129.
with:
if evaluated_on_tokens > target_eval_tokens or total_batches > 10:
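For context, here is roughly where the modified condition sits. This is a sketch of the evaluation loop, assuming the condition guards a break as in the repository; the exact surrounding code may differ:

```python
# Sketch of the validation loop in torchrun_main.py after the edit
for total_batches, batch in enumerate(eval_dataloader, start=1):
    ...  # compute the loss and accumulate evaluated_on_tokens
    if evaluated_on_tokens > target_eval_tokens or total_batches > 10:
        break  # stop after ~10 batches instead of the full validation set
```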
For better performance, the authors of the framework recommend running it in two steps.
The first step just initializes and trains the network for a few steps.
These are the hyperparameters I used to test that everything is working:
torchrun --nproc-per-node 1 torchrun_main.py \
    --model_config configs/llama_250m.json \
    --batch_size 4 \
    --total_batch_size 8 \
    --lr 5e-4 \
    --max_length 512 \
    --tags warm_start_250M \
    --save_every 10 \
    --num_training_steps 2 \
    --workers 1 \
    --eval_every 1
Note: I set --nproc-per-node to 1 since I have only 1 GPU. You should change it to the number of GPUs you have.
In the configs directory, you will see several llama files. They contain configurations for models of different sizes, all with an architecture similar to the LLaMA model. Here, I chose the 250M-parameter configuration.
The output is very verbose since the framework logs everything with wandb. It will also prompt you to make a choice:
wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
I entered "3" since I don’t have an account.
Then, in the second step, we rerun the framework with ReLoRa’s hyperparameters and PEFT:
torchrun --nproc-per-node 1 torchrun_main.py \
    --model_config configs/llama_250m.json \
    --batch_size 4 \
    --total_batch_size 8 \
    --lr 1e-3 \
    --max_length 512 \
    --use_peft \
    --relora 5 \
    --cycle_length 5 \
    --restart_warmup_steps 10 \
    --scheduler cosine_restarts \
    --warmup_steps 2 \
    --reset_optimizer_on_relora True \
    --num_training_steps 50 \
    --save_every 10 \
    --eval_every 10 \
    --continue_from checkpoints/llama_250m-2023-07-19-09-39-08/model_3 \
    --tags relora_250M
With --continue_from, we give the path of the model saved at the first step. Your model will have a different name, so you will have to change it. You can find it in the "checkpoints" directory.
Once you have confirmed that everything is working, i.e., the run completes without errors and the perplexity decreases, you can relaunch everything with reasonable hyperparameters so that the model is much better trained.
_Note: Don't forget to remove the change we made in torchrun_main.py to shorten the validation. You should at least increase the total number of batches to 100 to get a meaningful validation perplexity._
torchrun --nproc-per-node 1 torchrun_main.py \
    --model_config configs/llama_250m.json \
    --batch_size 4 \
    --total_batch_size 8 \
    --lr 5e-4 \
    --max_length 512 \
    --tags warm_start_250M \
    --save_every 1000 \
    --num_training_steps 10000
torchrun --nproc-per-node 1 torchrun_main.py \
    --model_config configs/llama_250m.json \
    --batch_size 4 \
    --total_batch_size 8 \
    --lr 1e-3 \
    --max_length 512 \
    --use_peft \
    --relora 5000 \
    --cycle_length 5000 \
    --restart_warmup_steps 100 \
    --scheduler cosine_restarts \
    --warmup_steps 500 \
    --reset_optimizer_on_relora True \
    --num_training_steps 10000 \
    --save_every 5000 \
    --eval_every 5000 \
    --continue_from <your checkpoint from step one> \
    --tags relora_250M
Note: Increase the batch size if you have GPUs with a lot of VRAM.
By default (it's hardcoded), the framework uses the C4 dataset (English only) for pre-training. This dataset, which was also used to pre-train Google's T5 models, covers many domains and genres.
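For reference, here is how the English C4 corpus can be streamed with the Hugging Face datasets library; the repository hardcodes something similar, though its exact loading code may differ:

```python
from datasets import load_dataset

# Stream C4 instead of downloading the full corpus (hundreds of gigabytes)
c4 = load_dataset("c4", "en", split="train", streaming=True)
print(next(iter(c4))["text"][:200])
```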
Once pre-training is finished, you will still have to fine-tune the resulting model for the downstream tasks or domains of your choice. You can do that very efficiently with QLoRA.
Conclusion
To sum up, ReLoRa is a new pre-training method exploiting low-rank networks. It is like performing LoRA multiple times in a row, but with:
- A partial reset of the optimizer’s states
- A jagged learning rate schedule
- A short warm-up after each reset of the low-rank parameters
Thanks to ReLoRa, we can now pre-train LLMs on consumer hardware.
It remains to be seen whether this approach is also competitive for very large language models (with more than 1 billion parameters). According to ReLoRa's authors, ReLoRa should work even better as the model gets larger, but this claim still has to be verified empirically by future work.
If you like this article and would be interested in reading the next ones, the best way to support my work is to become a Medium member.
_If you are already a member and want to support this work, just follow me on Medium._