
Say Once! Repeating Words Is Not Helping AI

How and why does repeating tokens harm LLMs, and why is this a problem?

| ARTIFICIAL INTELLIGENCE | NLP | LLMs

Image by Kristina Flour on Unsplash

Large Language Models (LLMs) have demonstrated their capabilities and taken the world by storm. Every big company now has a model with a fancy name, but under the hood they are all transformers. Everyone dreams of trillion-parameter models, but is there really no limit?

In this article, we address the following questions:

  • Is a bigger model guaranteed to perform better than a smaller one?
  • Do we have enough data to train huge models?
  • What happens if, instead of collecting new data, we reuse the same data?

Scaling over the sky: what is hurting the wing?

Image by Sean Pollock on Unsplash

OpenAI defined its scaling laws, stating that model performance follows a power law in the number of parameters and the number of data points used for training. This, together with the search for emergent properties, fueled the parameter race: the bigger the model, the better.

Is that true? Are bigger models giving better performance?

Recently, emergent properties have come into question: as shown by Stanford researchers, emergent abilities may not actually exist.

Emergent Abilities in AI: Are We Chasing a Myth?

The scaling law probably assigns much less importance to the dataset than it deserves. With Chinchilla, DeepMind showed that one should scale not only the parameters but also the data. In fact, Chinchilla (70 B parameters) outperforms the much larger Gopher (280 B parameters).

"Overlaid predictions. We overlay the predictions from our three different approaches, along with projections from Kaplan et al. (2020). We find that all three methods predict that current large models should be substantially smaller and therefore trained much longer than is currently done." Image source: here
"Overlaid predictions. We overlay the predictions from our three different approaches, along with projections from Kaplan et al. (2020). We find that all three methods predict that current large models should be substantially smaller and therefore trained much longer than is currently done." Image source: here

Recently, the machine learning community got excited about LLaMA not only because it is open source, but also because its 65 B-parameter version outperformed OPT-175B.

META’s LLaMA: A small language model beating giants

As DeepMind showed in the Chinchilla article, one can estimate how many tokens are required to optimally train a state-of-the-art LLM. One can also estimate how many high-quality tokens exist in total. Recent research has looked precisely into this question, concluding that:

  • Language datasets have grown exponentially, at roughly 50% per year, reaching about 2e12 words by the end of 2022. This shows that the collection and publication of new language datasets is a very active field.
  • On the other hand, the total stock of words on the internet is also growing; the authors estimate it at between 7e13 and 7e16 words, i.e., 1.5 to 4.5 orders of magnitude larger than the datasets currently in use.
  • However, since training focuses on high-quality text, the authors estimate the quality stock at between 4.6e12 and 1.7e13 words. They project that we will exhaust the stock of quality words between 2023 and 2027, and the entire stock between 2030 and 2050 (a back-of-the-envelope sketch follows this list).
  • The stock of images is not much better off either (only three to four orders of magnitude larger than current datasets).
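
To get a feel for these estimates, here is a back-of-the-envelope projection in Python. It uses only the rough figures quoted above (about 2e12 words used at the end of 2022, ~50% yearly growth, a quality stock of 4.6e12 to 1.7e13 words); it is a simplification of the analysis by Villalobos et al., not their actual model.

```python
# Back-of-the-envelope projection: when do training datasets outgrow the stock
# of high-quality text? Uses only the rough figures quoted above; the actual
# analysis by Villalobos et al. is more sophisticated.
DATASET_WORDS_2022 = 2e12               # words in the largest datasets, end of 2022
YEARLY_GROWTH = 1.5                     # ~50% growth per year
QUALITY_STOCK = (4.6e12, 1.7e13)        # low / high estimates of quality words

def year_of_exhaustion(stock: float, start_year: int = 2022) -> int:
    """First year in which the projected dataset size exceeds the given stock."""
    words, year = DATASET_WORDS_2022, start_year
    while words < stock:
        words *= YEARLY_GROWTH
        year += 1
    return year

for stock in QUALITY_STOCK:
    print(f"stock of {stock:.1e} words exhausted around {year_of_exhaustion(stock)}")
# Roughly 2025 for the low estimate and 2028 for the high one, consistent with
# the 2023-2027 window estimated in the paper.
```
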
Projection of data usage. Image source: here

Why is this happening?

Well, because we humans are finite and do not produce text at the pace of ChatGPT. In fact, the evolution of the number of internet users (actual and projected) speaks volumes:

Real and projected evolution of internet users. Image source: here

Moreover, not everyone is happy about their texts, code, and other content being used to train artificial intelligence models. Wikipedia, Reddit, and other sources historically used for training would like companies to pay for their data, while companies invoke fair use; at present, the regulatory landscape is unclear.

Putting these data together, a clear trend emerges: the number of tokens required to optimally train an LLM is growing faster than the stock of available tokens.

Image source: here

According to the scaling law defined by Chinchilla (the number of tokens required for optimal LLM training), we have already hit the limit: the graph shows that with PaLM-540B we reached it (10.8 trillion tokens required versus roughly 9 trillion in stock).
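
The 10.8-trillion figure follows from the commonly cited Chinchilla rule of thumb of roughly 20 training tokens per parameter. Here is a minimal sketch; the 20x factor is an approximation derived from Hoffmann et al., not an exact law, and the ~9 T stock is the estimate quoted above.

```python
# Chinchilla-style rule of thumb: a compute-optimal model needs roughly
# ~20 training tokens per parameter (approximation based on Hoffmann et al., 2022).
TOKENS_PER_PARAM = 20          # commonly cited approximation, not an exact constant
QUALITY_TOKEN_STOCK = 9e12     # ~9 trillion quality tokens, as estimated above

def compute_optimal_tokens(n_params: float) -> float:
    """Rough number of training tokens needed to train n_params optimally."""
    return TOKENS_PER_PARAM * n_params

for name, params in [("LLaMA-65B", 65e9), ("Gopher-280B", 280e9), ("PaLM-540B", 540e9)]:
    needed = compute_optimal_tokens(params)
    verdict = "exceeds" if needed > QUALITY_TOKEN_STOCK else "within"
    print(f"{name}: ~{needed / 1e12:.1f}T tokens needed ({verdict} the ~9T stock)")
# PaLM-540B comes out at ~10.8T tokens, the figure mentioned above.
```
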

Some authors have called this problem the "token crisis." Moreover, so far we have considered only English-language tokens, while there are about seven thousand other languages. Fifty-six percent of the entire web is in English, and the remaining 44 percent is split among only about 100 other languages. This is reflected in the performance of models in those languages.


Can we get more data?

Image by Karen Vardazaryan on Unsplash

As we have seen, more parameters do not automatically translate into better performance. For better performance, we need quality tokens (text), but these are in short supply. How can we obtain them? Could artificial intelligence itself help us?

Why aren't we using ChatGPT to produce text?

If we humans are not producing enough text, why not automate the process? A recent study shows that this is not a good solution. Stanford Alpaca was trained on 52,000 examples derived from GPT-3 and only apparently achieved similar performance: in reality, the model imitates the style of the target model but not its knowledge.

Why not train longer?

For PaLM, Gopher, and LLaMA (and indeed for the other LLMs), the papers clearly state that the models were trained for very few epochs (often just one). This is not a limitation of the transformer architecture: the Vision Transformer (ViT), for example, was trained for 300 epochs on ImageNet (about 1 million images), as shown in the table:

Image source: here

Why? Because it is extremely expensive. In the LLaMA article, the authors trained for only one epoch (and two epochs only for part of the dataset). Even so, they report:

When training a 65B-parameter model, our code processes around 380 tokens/sec/GPU on 2048 A100 GPU with 80GB of RAM. This means that training over our dataset containing 1.4T tokens takes approximately 21 days. (source)

Training an LLM for even a few epochs is extremely expensive. As calculated by Dmytro Nikolaiev (Dimid), this means about 4.0 million dollars to train a model similar to Meta's LLaMA on the Google Cloud Platform.
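
The 21-day figure follows directly from the quoted throughput, and a rough cost estimate can be built on top of it. In the sketch below, the A100 hourly price is my own assumption for illustration; it is not a number from the LLaMA paper or from Dimid's analysis.

```python
# Reproducing the 21-day estimate from the quoted throughput, plus a rough
# cost figure. The A100 hourly price is an assumption for illustration only.
TOKENS_TOTAL   = 1.4e12          # LLaMA training tokens
TOKENS_PER_SEC = 380 * 2048      # 380 tokens/sec/GPU on 2048 A100 GPUs

seconds = TOKENS_TOTAL / TOKENS_PER_SEC
print(f"training time: ~{seconds / 86_400:.0f} days")        # ~21 days

ASSUMED_GPU_HOUR_USD = 3.9       # illustrative A100 price; real cloud pricing varies
gpu_hours = 2048 * seconds / 3600
print(f"rough cost: ~${gpu_hours * ASSUMED_GPU_HOUR_USD / 1e6:.1f}M")  # ballpark of $4M
```
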

Training for additional epochs would therefore multiply these costs. Moreover, until recently we did not know whether this extra training is even useful: it simply had not been tested.

Recently, a group of researchers at the National University of Singapore studied what happens when an LLM is trained for multiple epochs:

To Repeat or Not To Repeat: Insights from Scaling LLM under Token-Crisis

Repetita iuvant aut continuata secant

Image by Unseen Studio on Unsplash

So far we have seen that a model's performance depends not only on the number of parameters but also on the number of quality tokens used for training. These quality tokens are not infinite, however, and we are approaching the limit. If we cannot find enough quality tokens and generating them with AI is not an option, what can we do?

Can we use the same training set and train longer?

There is a Latin saying that repetition is beneficial (repetita iuvant), but over time someone added "continuing bores" (continuata secant).

The same is true for neural networks: increasing the number of epochs improves performance (the loss decreases); at some point, however, while the training loss keeps falling, the validation loss begins to rise. The network is overfitting: it starts to learn patterns that are present only in the training set and loses its ability to generalize.
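
The standard countermeasure is to monitor the validation loss and stop when it stops improving. Here is a minimal, framework-agnostic sketch of that early-stopping logic; `train_one_epoch` and `validation_loss` are hypothetical callables standing in for your own training code, not anything from the paper.

```python
# Minimal early-stopping sketch: stop when validation loss has not improved
# for `patience` consecutive epochs. `train_one_epoch` and `validation_loss`
# are user-supplied callables (placeholders here, not from the paper).
def fit(train_one_epoch, validation_loss, max_epochs=100, patience=3):
    best_val, bad_epochs = float("inf"), 0
    for epoch in range(max_epochs):
        train_one_epoch()                      # training loss keeps falling...
        val_loss = validation_loss()
        if val_loss < best_val:
            best_val, bad_epochs = val_loss, 0
        else:
            bad_epochs += 1                    # ...but validation loss starts rising
            if bad_epochs >= patience:
                print(f"stopping at epoch {epoch}: likely overfitting")
                break
```
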

Overfitting/overtraining in supervised learning. Image source: here

Ok, this has been studied extensively for small neural networks, but what about huge transformers?

The authors of this study used the T5 model (an encoder-decoder model) on the C4 dataset. They trained several versions of the model with increasing numbers of parameters, each time checking that the larger model outperformed the smaller one (indicating that it had received a sufficient number of tokens, as Chinchilla's law prescribes). They noted a linear relationship between the number of tokens required and the size of the model (confirming what DeepMind observed with Chinchilla).

Image source: here

The C4 dataset is limited (it does not have infinite tokens), so as the number of parameters grew, the authors found themselves in a token-scarcity regime. They therefore decided to simulate what happens when an LLM sees repeated data: they sampled a fixed budget of tokens, so that the model saw the same tokens again during training (a sketch of this setup follows the list below). This showed that:

  • Repeated tokens lead to degraded performance.
  • Larger models are more susceptible to overfitting under token-crisis conditions (so despite consuming more computational resources, they end up with degraded performance).
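
As a rough illustration of the setup described above (not the paper's actual T5/C4 pipeline): sample a fixed token budget from the corpus once and recycle it for several passes, so the model keeps seeing the same tokens.

```python
# Illustration of the repeated-data setup: instead of streaming fresh tokens,
# sample a fixed budget once and recycle it for several "epochs". This is a
# simplified sketch of the idea, not the paper's actual T5/C4 pipeline.
import random

def repeated_token_stream(corpus_tokens, token_budget, n_repeats, seed=0):
    """Yield the same sampled token subset n_repeats times."""
    rng = random.Random(seed)
    start = rng.randrange(0, max(1, len(corpus_tokens) - token_budget))
    subset = corpus_tokens[start:start + token_budget]  # fixed sample of the corpus
    for _ in range(n_repeats):
        yield from subset                               # the model sees it again and again

# Toy usage: a "corpus" of 1M token ids, a 100k budget repeated 4 times.
corpus = list(range(1_000_000))
stream = repeated_token_stream(corpus, token_budget=100_000, n_repeats=4)
print(sum(1 for _ in stream))  # 400,000 tokens seen, but only 100,000 unique
```
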
Image source: here

In addition, these models are used for downstream tasks. Often an LLM is trained unsupervised on a large amount of text and then fine-tuned on a smaller dataset for a downstream task. Or it may go through a process called alignment (as in the case of ChatGPT).

When an LLM is trained on repeated data, its performance is degraded even if it is later fine-tuned on another dataset. So downstream tasks are affected as well.

Image source: here

Why repeated tokens are not a good idea

Image by Brett Jordan on Unsplash

We just saw that repeated tokens harm training. But why does this happen?

The authors decided to investigate by keeping the number of repeated tokens fixed and increasing the number of total tokens in the dataset. The results show that a larger dataset alleviates multi-epoch degradation issues.

Image source: here

Last year, Galactica was published (a model that was supposed to help scientists but survived only three days online). Apart from the spectacular debacle, its article suggested that part of its results stemmed from the quality of the data; according to the authors, data quality reduced the risk of overfitting:

We are able to train on it for multiple epochs without overfitting, where upstream and downstream performance improves with use of repeated tokens. (source)

Image source: here

For the Galactica authors, repeated tokens not only did not harm training but actually improved downstream performance.

In this new study, the authors use the Wikipedia dataset, which is considered of higher quality than C4, and add repeated tokens. The results show a similar level of degradation, which contradicts what is stated in Galactica's article.

Image source: here

The authors also investigated whether the degradation is due to model scaling. When a model is scaled up, both the number of parameters and the computational cost increase, so the authors decided to study these two factors separately (see the sketch after this list):

  • Mixture-of-Experts (MoE), because although it increases the number of parameters, it maintains a similar computational cost.
  • ParamShare, on the other hand, reduces the number of parameters but maintains the same computational cost.
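
To see why MoE decouples parameter count from per-token compute, here is a bare-bones top-1 MoE layer in PyTorch: adding experts adds parameters, but each token still passes through a single expert-sized feed-forward block. This is only an illustrative sketch, not the architecture used in the paper.

```python
# Bare-bones top-1 Mixture-of-Experts layer: parameters grow with the number
# of experts, but each token is routed to a single expert, so per-token
# compute stays close to that of one dense feed-forward block.
import torch
import torch.nn as nn

class Top1MoE(nn.Module):
    def __init__(self, d_model: int, d_ff: int, n_experts: int):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)  # scores one expert per token
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (tokens, d_model)
        expert_idx = self.router(x).argmax(dim=-1)        # top-1 routing decision
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = expert_idx == i
            if mask.any():
                out[mask] = expert(x[mask])               # each token visits one expert
        return out

moe = Top1MoE(d_model=64, d_ff=256, n_experts=8)
dense = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 64))
n_params = lambda m: sum(p.numel() for p in m.parameters())
print(round(n_params(moe) / n_params(dense), 1))  # ~8x the parameters, similar per-token compute
```
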
Image source: here

The results show that the model with fewer parameters is less affected by repeated tokens. In contrast, the MoE model (greater number of parameters) is more prone to overfitting. The result is interesting because MoE has been used successfully in many AI models, so the authors suggest that although MoE is a useful technique when there is enough data, it can hurt performance when there are not enough tokens.

The authors also explored whether the training objective affects the degradation. In general, two pre-training objectives are commonly used: next-token prediction (causal language modeling) and masked language modeling (denoising).

Recently, with PaLM 2, Google adopted UL2, which mixes these two kinds of training objectives. UL2 has been shown to accelerate model training; interestingly, however, it is more prone to overfitting and shows greater multi-epoch degradation.
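
As a toy illustration of how the two objectives shape inputs and targets (deliberately simplified: real denoising objectives, and UL2's mixture, use span corruption with sentinel tokens and several noise configurations):

```python
# Toy illustration of the two pre-training objectives on a token sequence.
# Deliberately simplified: real denoising objectives (and UL2's mixture)
# use span corruption with sentinel tokens and several noise settings.
import random

tokens = ["the", "cat", "sat", "on", "the", "mat"]

# 1) Next-token prediction (causal LM): predict each token from its prefix.
causal_pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]
print(causal_pairs[2])  # (['the', 'cat', 'sat'], 'on')

# 2) Masked / denoising LM: hide some tokens and predict them from the rest.
rng = random.Random(0)
masked_positions = sorted(rng.sample(range(len(tokens)), k=2))
corrupted = ["<mask>" if i in masked_positions else t for i, t in enumerate(tokens)]
targets = [tokens[i] for i in masked_positions]
print(corrupted, "->", targets)
```
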

Image source: here

The authors next explored how they could try to alleviate multi-epoch degradation. Since regularization techniques are used precisely to prevent overfitting, the authors tested whether these techniques had a beneficial effect here as well.

Dropout turns out to be one of the most effective techniques for alleviating the problem. This is not surprising, because dropout is one of the most effective regularization techniques, is easily parallelized, and is used by most models.

Image source: here

Moreover, the authors find it works best to start training without dropout and only add dropout at a later point in training.
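
A minimal sketch of that schedule: keep dropout disabled for the first part of training, then switch it on. The step threshold and dropout rate below are illustrative choices, not the paper's exact recipe.

```python
# Sketch of "add dropout later in training": keep p = 0 for the first part of
# training, then enable dropout. Threshold, rate, and model are illustrative.
import torch.nn as nn

def set_dropout(model: nn.Module, p: float) -> None:
    """Set the rate of every nn.Dropout module in the model."""
    for module in model.modules():
        if isinstance(module, nn.Dropout):
            module.p = p

model = nn.Sequential(nn.Linear(128, 128), nn.ReLU(), nn.Dropout(p=0.0), nn.Linear(128, 128))

total_steps, dropout_start_step = 10_000, 5_000
for step in range(total_steps):
    if step == dropout_start_step:
        set_dropout(model, 0.1)   # dropout only kicks in for the later part of training
    # ... forward pass, loss, and optimizer step on the next batch would go here ...
```
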

Image source: here

On the other hand, the authors note that using dropout in some models, especially larger ones, can lead to a slight reduction in performance. So although it helps against overfitting, it can lead to undesired behavior in other contexts. So much so that models such as GPT-3, PaLM, LLaMA, Chinchilla, and Gopher do not use it in their architectures.

Image source: here

As described in the table below, the models the authors used for their experiments are, by today's standards, almost small. Even so, testing different hyperparameters when designing an LLM is expensive:

For instance, in our specific scenario, training T5-XL five times would require approximately $37,000 USD for renting Google Cloud TPUs. Considering even larger models like PaLM and GPT-4, trained on even larger datasets, this cost becomes unmanageable (source)

Image source: here

Since, in their experiments, a sparse MoE model approximates the behavior of a dense model (which is more computationally expensive), it can be used to search for the best hyperparameters.

For example, the authors show that different learning rates can be tested on the MoE model, which exhibits the same behavior as the equivalent dense model. So, for the authors, one can sweep hyperparameters with the MoE model and then train the dense model with the chosen values, saving cost:

sweeping the MoE Large model incurred an expenditure of approximately 10.6K USD on the Google Cloud Platform. Conversely, training the Dense XL model only once required 7.4K USD. Consequently, the entire development process, including sweeping, amounted to a total cost of 18K USD, which is only 0.48 times the expense of directly tuning the Dense XL model (source)
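
The arithmetic behind the quoted savings is easy to check:

```python
# Quick check of the quoted figures: sweep a cheap MoE proxy, then train the
# dense model once, versus sweeping the dense model directly.
moe_sweep_cost   = 10_600   # sweeping the MoE Large model (USD, from the paper)
dense_train_cost = 7_400    # training the Dense XL model once (USD, from the paper)
n_settings       = 5        # e.g. five hyperparameter settings, as in the T5-XL example

with_moe_proxy = moe_sweep_cost + dense_train_cost   # 18,000 USD
direct_sweep   = n_settings * dense_train_cost       # 37,000 USD
print(round(with_moe_proxy / direct_sweep, 2))       # ~0.49, roughly the "0.48 times" quoted
```
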

Image source: here

Parting thoughts

In recent years there has been a race to have the biggest model. On the one hand, this race has been motivated by the fact that at a certain scale, properties were emerging that were impossible to predict with smaller models. On the other hand, the scaling law of OpenAI stated that performance is a function of the number of model parameters.

In the past year this paradigm has come into crisis.

Recently, LLaMA showed the importance of data quality, and Chinchilla provided a new rule for calculating the number of tokens needed to train a model optimally: a model with a given number of parameters requires a corresponding amount of data to perform at its best.

Subsequent studies have shown that the number of quality tokens is not infinite, while the number of model parameters is growing faster than the number of tokens we humans can generate.

This raises the question of how to solve the token crisis. Recent studies show that using LLMs to generate tokens is not a viable path, and this new work shows that reusing the same tokens for multiple epochs can actually degrade performance.

Work like this is important because, although we are training and using LLMs more and more, there are many basic aspects we still do not understand. This work answers with experimental data a question that seems elementary: what happens when an LLM is trained for multiple epochs?

Moreover, this article is part of a growing body of literature showing that uncritically increasing the number of parameters is unnecessary. Bigger and bigger models are increasingly expensive and consume more and more electricity. Given the need to optimize resources, this article suggests that training a huge model without enough data is simply a waste.

This article also suggests that we need new architectures to replace the transformer: it is time to focus research on new ideas instead of continuing to scale models.

If you have found this interesting:

You can look for my other articles, subscribe to get notified when I publish new ones, become a Medium member to access all its stories (an affiliate link through which I earn a small commission at no cost to you), and connect with or reach me on LinkedIn.

Here is the link to my GitHub repository, where I am planning to collect code and many resources related to Machine Learning, artificial intelligence, and more.

GitHub – SalvatoreRa/tutorial: Tutorials on machine learning, artificial intelligence, data science…

or you may be interested in one of my recent articles:

Scaling Isn’t Everything: How Bigger Models Fail Harder

META’S LIMA: Maria Kondo’s way for LLMs training

Google Med-PaLM 2: is AI ready for medical residency?

To AI or not to AI: how to survive?


References

A list of the principal references consulted for this article:

  1. Fuzhao Xue et al., 2023, To Repeat or Not To Repeat: Insights from Scaling LLM under Token-Crisis, link
  2. Hugo Touvron et al., 2023, LLaMA: Open and Efficient Foundation Language Models, link
  3. Arnav Gudibande et al., 2023, The False Promise of Imitating Proprietary LLMs, link
  4. PaLM 2, Google blog, link
  5. Pathways Language Model (PaLM): Scaling to 540 Billion Parameters for Breakthrough Performance, Google blog, link
  6. Buck Shlegeris et al., 2022, Language models are better than humans at next-token prediction, link
  7. Pablo Villalobos et al., 2022, Will we run out of data? An analysis of the limits of scaling datasets in Machine Learning, link
  8. Susan Zhang et al., 2022, OPT: Open Pre-trained Transformer Language Models, link
  9. Jordan Hoffmann et al., 2022, An empirical analysis of compute-optimal large language model training, link
  10. Ross Taylor et al., 2022, Galactica: A Large Language Model for Science, link
  11. Zixiang Chen et al., 2022, Towards Understanding Mixture of Experts in Deep Learning, link
  12. Jared Kaplan et al., 2020, Scaling Laws for Neural Language Models, link
  13. How AI could fuel global warming, TDS, link
  14. Masked language modeling, HuggingFace blog, link
  15. Mixture-of-Experts with Expert Choice Routing, Google blog, link
  16. Why Meta's latest large language model survived only three days online, MIT Technology Review, link
  17. Exploring Transfer Learning with T5: the Text-To-Text Transfer Transformer, Google blog, link
  18. Scaling laws for reward model overoptimization, OpenAI blog, link
  19. An empirical analysis of compute-optimal large language model training, DeepMind blog, link
  20. Xiaonan Nie et al., 2022, EvoMoE: An Evolutional Mixture-of-Experts Training Framework via Dense-To-Sparse Gate, link
  21. Tianyu Chen et al., 2022, Task-Specific Expert Pruning for Sparse Mixture-of-Experts, link
  22. Bo Li et al., 2022, Sparse Mixture-of-Experts are Domain Generalizable Learners, link
