| ARTIFICIAL INTELLIGENCE | FUTURE | TRANSFORMERS |

"’The Godfather of A.I.’ Leaves Google and Warns of Danger Ahead", is the title of the New York Times. How we can know if LMs are a threat to humanity if they are not open-source? What is actually happening? How the world of language models is on the brink of Changement.
The call for an open-source crusade

A short while ago GPT-4 was revealed to the public, and I think we all went to read the technical report and were disappointed.

Recently, Nature also addressed the issue: we need large language models (LLMs) to be open-source.
Many LLMs are proprietary, not released, and we don't know what data they were trained on. This means they cannot be inspected and tested for limitations, especially with regard to bias.
In addition, sharing information and code with ChatGPT carries a risk of leakage, as Samsung discovered. Not to mention that some states believe that data storage by these companies violates the GDPR.
This is why we need LLMs to be open-source, and there should be more investment in the development of new LLMs, such as BLOOM (a 176 B parameter LLM developed by the BigScience academic consortium).
There has often been sensationalism in recent months, both about the real capabilities of these LLMs and about the risks of artificial intelligence. If researchers cannot test the models, they cannot really assess their capabilities, and the same goes for analyzing the risks. In addition, an open-source model is much more transparent, and the community can try to identify the source of problematic behavior.
Moreover, it is not only a demand from academia: institutions are alarmed by AI as well. The European Union is currently discussing the EU AI Act, which could reshape the future of LLMs. At the same time, the White House is pushing tech CEOs to limit the risks of AI. Thus, open source could actually become a future requirement for language models.
Why is ChatGPT that good?

We have all heard about ChatGPT, and how it seemed revolutionary. But how was it trained?
Let us start with the fact that ChatGPT was trained on top of an LLM (GPT-3.5, to be precise). Typically, these GPT-like language models are trained with next-token prediction: given a sequence of tokens w_1, ..., w_t, the model must predict the next token w_{t+1}.
The underlying architecture is the transformer. The original transformer consists of an encoder that receives the input sequence and a decoder that generates the output sequence; GPT-like models use only the decoder stack. The heart of this system is multi-head self-attention, which allows the model to learn information about the context and the dependencies between the various parts of the sequence.
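A minimal sketch of next-token prediction training, assuming PyTorch and the Hugging Face transformers library, with GPT-2 as a small stand-in for a GPT-like decoder-only model:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

text = "Large language models are trained to predict the next token"
inputs = tokenizer(text, return_tensors="pt")

# When labels == input_ids, the model shifts them internally by one position,
# so the loss is the cross-entropy of predicting token t+1 from tokens 1..t.
outputs = model(**inputs, labels=inputs["input_ids"])
print(outputs.loss)  # average next-token cross-entropy over the sequence

# One toy gradient step
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
outputs.loss.backward()
optimizer.step()
```

Pre-training is essentially this loop repeated over hundreds of billions of tokens.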

GPT-3 was trained with this principle (like the other models in the Generative Pre-trained Transformer, GPT, family), only with many more parameters and much more data (about 570 GB of data and 175 B parameters).
GPT-3 has tremendous capabilities; however, when it comes to generating text it often hallucinates, lacks helpfulness, is hard to interpret, and frequently produces biased output. In other words, the model is not aligned with what we expect from a model that generates text like a human.
How do we obtain ChatGPT from GPT-3?
The process is called Reinforcement Learning from Human Feedback (RLHF), and it was described by the authors in this article:
Training language models to follow instructions with human feedback
Here I will describe it very generally and succinctly. Specifically, it consists of three steps:
- Supervised fine-tuning: the first step, in which the LLM is fine-tuned to learn a supervised policy (the baseline or SFT model).
- Mimic human preferences: in this step, annotators vote on a set of outputs from the baseline model. This curated dataset is used to train a new model, the reward model.
- Proximal Policy Optimization (PPO): here the reward model is used to fine-tune the SFT model and obtain the policy model.

To prepare for the first step, OpenAI collected a series of prompts and asked human annotators to write down the expected responses (12–15 K prompts). Some of these prompts came from GPT-3 users, so what one writes in ChatGPT will probably be used for the next model.
The authors used as a starting model a GPT-3.5 variant that had already been fine-tuned on programming code, which also explains the coding capabilities of ChatGPT.
This step, however, is not exactly scalable, since it relies on supervised learning. In any case, the model obtained this way is not yet aligned.
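A minimal sketch of this supervised fine-tuning step, assuming the demonstration data is a list of (prompt, response) pairs and a generic causal LM; the prompt tokens are masked out of the loss so the model is only trained to reproduce the human-written answer:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Assumption: any causal LM works here; "gpt2" is only a small stand-in.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Explain photosynthesis briefly."
response = " Plants convert light into chemical energy."

prompt_ids = tok(prompt, return_tensors="pt").input_ids
full_ids = tok(prompt + response, return_tensors="pt").input_ids

# Labels: -100 on the prompt part so only response tokens contribute to the loss.
labels = full_ids.clone()
labels[:, : prompt_ids.shape[1]] = -100

loss = model(input_ids=full_ids, labels=labels).loss
loss.backward()  # an optimizer step over many such pairs gives the SFT model
```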

For the second step, the annotators ranked a set of responses from the SFT model, from worst to best, according to how desirable each response is. This yields a much larger dataset (about 10x), and the SFT model's responses are fed to a new model, the reward model, which must learn to rank them in order of preference.
During this stage, the reward model learns to assign a score to each response, and it does well when its scores reproduce the human ranking of the outputs.
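A minimal sketch of how such a reward model can be trained from rankings, assuming each ranked pair is reduced to a (chosen, rejected) example and a scalar-output head sits on top of a language model (the names and numbers below are illustrative):

```python
import torch
import torch.nn.functional as F

def pairwise_reward_loss(reward_chosen: torch.Tensor,
                         reward_rejected: torch.Tensor) -> torch.Tensor:
    # The reward model should score the preferred answer higher than the
    # rejected one: a standard pairwise ranking loss.
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy scores that a scalar reward head would produce for a batch of pairs.
reward_chosen = torch.tensor([1.2, 0.3, 0.8])
reward_rejected = torch.tensor([0.4, 0.5, -0.1])
print(pairwise_reward_loss(reward_chosen, reward_rejected))
```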

So we have the SFT model, and we use its weights to initialize a new PPO model. This model is fine-tuned using Proximal Policy Optimization (PPO).
In other words, we use a reinforcement learning algorithm. The PPO model receives a random prompt and responds to it, after which it receives a penalty or a reward. Unlike classical Q-learning, here the policy is updated after each response (the model learns directly from experience, on-policy).
In addition, the authors use a per-token Kullback-Leibler (KL) penalty to keep the model's response distribution close to that of the SFT model. This is because we want to optimize the model with RL (thanks to the reward model) while not letting it forget what it learned in step 1 from the human-curated prompts.
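A minimal sketch of how such a per-token KL penalty can enter the reward used by PPO, assuming we already have the reward-model score for a full response and the per-token log-probabilities of both the PPO policy and the frozen SFT model (symbols and the value of beta are illustrative):

```python
import torch

def penalized_rewards(logprobs_policy: torch.Tensor,
                      logprobs_sft: torch.Tensor,
                      reward_model_score: float,
                      beta: float = 0.02) -> torch.Tensor:
    # Per-token penalty: beta * (log pi_policy - log pi_sft),
    # which keeps the PPO policy close to the SFT model it started from.
    kl_per_token = logprobs_policy - logprobs_sft
    rewards = -beta * kl_per_token
    # The reward-model score is typically added only at the end of the response.
    rewards[-1] += reward_model_score
    return rewards

# Toy example for a 5-token response.
lp_policy = torch.tensor([-1.0, -0.8, -1.2, -0.5, -0.9])
lp_sft = torch.tensor([-1.1, -0.9, -1.0, -0.6, -1.0])
print(penalized_rewards(lp_policy, lp_sft, reward_model_score=0.7))
```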
Finally, the model is evaluated on three aspects: helpfulness, truthfulness, and harmlessness. After all, these were exactly the aspects we wanted to optimize.
A curious note: when evaluated on classic benchmarks (question answering, summarization, classification), the model performs worse than GPT-3. This is the cost of alignment.
Alpaca, a revolutionary animal

As mentioned, there is a real need to study the behavior of these models, and this is only possible if they are open source. On the other hand, any LM can be aligned using RLHF.
RLHF is much less expensive and computationally intensive than training a model from scratch. On the other hand, it requires annotators (you do indeed need a dataset of instructions). But can't these steps be automated?
The first step was Self-Instruct: in this 2022 article, the authors propose a semi-automated method. The general idea is to start from a set of manually written instructions. This set serves both as a seed and as a way to ensure that most NLP tasks are covered.
Starting from only 175 seed instructions, the authors prompted the model to generate a dataset of about 50K instructions, which was then used for instruction tuning.
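A minimal sketch of the self-instruct idea: sample a few seed instructions, ask a strong model to produce new ones, and accumulate them into a training set. It assumes the legacy OpenAI completion API as it was available at the time (text-davinci-003); the prompt wording below is illustrative, not the exact one from the paper:

```python
import random
import openai  # assumes the pre-1.0 openai client used at the time

seed_instructions = [
    "Summarize the following article in one sentence.",
    "Translate this sentence into French.",
    "Write a Python function that reverses a string.",
]

def generate_new_instructions(n_seeds: int = 3) -> str:
    # Show a few seed instructions and ask the model to continue the list.
    examples = "\n".join(f"- {s}" for s in random.sample(seed_instructions, n_seeds))
    prompt = (
        "Here are some task instructions:\n"
        f"{examples}\n"
        "Write 5 new, diverse task instructions in the same style:\n"
    )
    response = openai.Completion.create(
        model="text-davinci-003",
        prompt=prompt,
        max_tokens=256,
        temperature=0.9,
    )
    return response["choices"][0]["text"]

# Repeating this loop (with filtering and deduplication, as in the paper) is how
# a large instruction set can be bootstrapped from a small handful of seeds.
```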

Having a method, all that was needed was a model. ChatGPT is based on OpenAI's GPT-3.5, but couldn't a smaller model be used? Does it really need more than 100 B parameters?
The Stanford researchers instead used LLaMA, specifically the 7B version, and 52 K instructions generated following the self-instruct method (generated with OpenAI's text-davinci-003). The real value of Alpaca is that the authors simplified the pipeline and greatly reduced costs, so that any academic lab could replicate the process (available in this repository). As they state:
For our initial run, fine-tuning a 7B LLaMA model took 3 hours on 8 80GB A100s, which costs less than $100 on most cloud compute providers. (source)
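For a sense of what those 52 K examples look like when fed to the model, this is roughly the prompt template used for Alpaca-style instruction tuning (a sketch; the fine-tuning itself is then ordinary supervised next-token training on the formatted text):

```python
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n{output}"
)

example = {
    "instruction": "Give three tips for staying healthy.",
    "output": "1. Eat a balanced diet. 2. Exercise regularly. 3. Sleep enough.",
}

# Each generated example is rendered to plain text like this and used for
# standard supervised fine-tuning of the 7B LLaMA model.
print(ALPACA_TEMPLATE.format(**example))
```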
The initial evaluation showed that Alpaca is almost as good as GPT-3.5 (in some cases even exceeding it). This may seem surprising given that the model is 20 times smaller. On the other hand, Alpaca learned to behave like GPT on a range of inputs (so the training acts as a kind of knowledge distillation). At the same time, the model shares the usual limitations of language models: hallucinations, toxicity, and stereotypes.
Alpaca thus demonstrates that any academic laboratory can train its own version of ChatGPT (using LLaMA, which is available only for research). On the other hand, any company using another model can align it and create its own version of ChatGPT. In addition, similar models could even be deployed on cell phones or Raspberry Pi computers.
The authors released a demo, but it was shut down after a short time (for safety reasons). Also, although one had to apply to use LLaMA (and access the model weights), the weights were leaked online a few days later.
Are LLMs on the brink of a revolution?

It seems like years have passed since ChatGPT was released, but it has only been a few months. Up to that point we were talking about the power law: how a model needed more parameters, more data, and more training in order for emergent behaviors to appear.
These ideas led to the notion that we could define a kind of Moore's law for language models. And indeed, in recent years we have seen almost exponential growth (from 1.5 B parameters for GPT-2 to 175 B for GPT-3).
What has changed?
The first blow to this doctrine was the arrival of Chinchilla. DeepMind's model showed that it is not only a matter of data quantity but also of data quality. Then META's LLaMA showed that even smaller models trained on a curated dataset can achieve similar if not better results than huge models.
It is not just a matter of models; data is the other issue. Humans do not produce enough data, probably not enough to feed a hypothetical GPT-5 at the scale the power law would require. Moreover, data will not be as accessible as before.
In fact, Reddit (a popular data source) has announced that AI developers will have to pay to access its content. Wikipedia has considered the same, and now Stack Overflow is moving in the same direction: it will require companies to pay.
"Community platforms that fuel LLMs absolutely should be compensated for their contributions so that companies like us can reinvest back into our communities to continue to make them thrive," Stack Overflow’s Chandrasekar says. "We’re very supportive of Reddit’s approach." (source)
And even if one manages to get the data, it may still not be safe for a company to use it. Getty has sued an AI art generator, and artists themselves have filed lawsuits. Not to mention that programmers have done the same with GitHub Copilot, which was trained on code from public repositories. In addition, the music industry (notoriously litigious) has spoken out against AI-generated music and has pressured streaming services over it. Even if AI companies appeal to fair use, it is by no means a given that they will have the same access to data in the future.
There is another factor to consider: apart from extending models to other modalities, the transformer architecture has not changed since 2017. All language models are based on the dogma that multi-head self-attention is all you need. Until recently Sam Altman was convinced that the scalability of the architecture was the key to AGI, but as he said at a recent MIT event, the key to AGI does not lie in more layers and more parameters.

The transformer has clear limitations, and this is reflected in LLMs: hallucinations, toxicity, and bias. Modern LLMs are not capable of critical thinking. Techniques such as chain-of-thought prompting and prompt engineering serve as patches to mitigate the problem.
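As a quick illustration of what such a patch looks like in practice, chain-of-thought prompting simply asks the model to spell out intermediate steps before answering (the prompt strings below are an illustrative sketch, not from a specific paper):

```python
question = "A train travels 60 km in 45 minutes. What is its average speed in km/h?"

# Standard prompt: the model is asked for the answer directly.
direct_prompt = f"Q: {question}\nA:"

# Chain-of-thought prompt: the model is nudged to reason step by step first,
# which often reduces arithmetic and logic mistakes.
cot_prompt = f"Q: {question}\nA: Let's think step by step."
```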
Moreover, multi-head self-attention has been shown to solve problems that RNNs could not and to enable emergent behaviors such as in-context learning, but it comes at a quadratic cost. Recent work suggests that one cannot simply replace self-attention with non-quadratic variants of attention without losing expressiveness. However, works such as SpikeGPT and Hyena show that cheaper alternatives not based on self-attention exist and achieve comparable results in language modeling.
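The quadratic cost comes from the attention score matrix itself: every token attends to every other token, so memory and compute grow with the square of the sequence length. A minimal sketch:

```python
import torch

seq_len, d_model = 1024, 64
q = torch.randn(seq_len, d_model)
k = torch.randn(seq_len, d_model)
v = torch.randn(seq_len, d_model)

# The score matrix is (seq_len x seq_len): doubling the context length
# quadruples this tensor, which is the bottleneck cheaper alternatives avoid.
scores = q @ k.T / d_model**0.5
attn = torch.softmax(scores, dim=-1) @ v
print(scores.shape)  # torch.Size([1024, 1024])
```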
Also, as shown above, aligning a model using RLHF has a cost in terms of performance on various tasks. Therefore, LLMs will not replace "expert models"; in the future they will perhaps act as orchestrators of other models (as suggested, for example, by HuggingGPT).
You cannot stop open source, and why it is always winning

Is MidJourney or DALL-E better? It is perhaps difficult to say. What is certain is that Stable Diffusion is the winning technology. Because it is open-source, Stable Diffusion has spawned countless applications and inspired much derivative research (ControlNet, synthetic data for medical imaging, parallels with the brain).
Through the work of the community, Stable Diffusion has been improved across its various versions, and there are endless variations. Moreover, there is no application of DALL-E that does not have a counterpart based on Stable Diffusion, but the reverse is not true.
Why then has the same not happened for language models?
So far the main problem has been that training a language model was a prohibitive undertaking: BigScience's BLOOM indeed required a huge consortium. But LLaMA has shown that much smaller models can compete with monsters of more than 100 B parameters, and Alpaca showed that LM alignment can be done at little cost (less than $1,000 in total). These are the elements that allowed Simon Willison to say that "Large language models are having their Stable Diffusion moment."
From Alpaca to the present day, many open-source models have come out. Not only has Stability AI released a number of models that are competitive with the giants and can be used by everyone, but other companies have also released chatbots and models. In just a few weeks we have seen Dolly, HuggingChat, Koala, and many more.

Now, some of the models mentioned are indeed open-source, but only for non-commercial use: although they are open to academic research, this means they cannot be exploited by interested companies.
This is only part of the story. There are already models on Hugging Face that can be easily trained (models, datasets, and pipelines), and several models are commercially available (to date more than 10).

Open-source model, private data, and new applications

Dario Amodei, CEO of Anthropic, is seeking billions to beat OpenAI with the biggest model in the world. The rest of the world, however, is moving in another direction. For example, Bloomberg, which is not a well-known player in AI, has released an LLM for finance (trained on 363 billion tokens from financial sources).
Why do we want an LLM for finance? Why not just use ChatGPT?
Google's Med-PaLM showed that a generalist model has poor performance compared to a model fine-tuned on a specific domain (in this case, datasets of medical and scientific articles, and so on).
Fine-tuning an LLM is clearly expensive, especially if we are talking about models with hundreds of billions of parameters. Smaller models are much less expensive, though still not negligible. META's LLaMA, being open-source, has partly solved this problem: the authors of LLaMA-Adapter showed that only 1.2 million parameters need to be added in order to fine-tune it (and the training took less than an hour).
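LLaMA-Adapter has its own mechanism (learnable adaptation prompts), but the general idea of freezing the base model and training only a few million added parameters can be sketched with the Hugging Face peft library and LoRA as an analogous parameter-efficient technique (the model name and hyperparameters below are illustrative, not the LLaMA-Adapter recipe):

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Assumption: any causal LM checkpoint you have access to; LLaMA weights are
# gated, so a small open model is used here as a placeholder.
base = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-1b")

config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["query_key_value"],  # module names depend on the architecture
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, config)
# Only the injected low-rank matrices are trainable: a tiny fraction of the base model.
model.print_trainable_parameters()
```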
While it is true that LLaMA is not commercially available, there are many other models that are (from small to large). What will ultimately enable a successful application in a given field is data.
As Samsung unpleasantly discovered, using ChatGPT inside a company is a risk. Even though ChatGPT now allows users to disable chat history and opt out of having their data used to train the model, companies will still consider it risky to hand over their data.
Many companies will instead consider training their own chatbot: a model fine-tuned on their own corporate data that remains internal. After all, the technology is available and affordable even for companies with small budgets. Moreover, the low cost allows them to fine-tune regularly as new data arrives or when a better open-source model is released. Companies that hold the data will be much more reluctant to grant it to others.
Moreover, we have seen how important it is to have quality data. Data in medicine and many other fields are difficult to collect (expensive, regulated, scarce), and companies that possess them have an advantage. OpenAI could spend billions trying to collect, for example, medical data, but beyond the cost, patient recruitment takes years and requires an established network (which it does not have). Companies that hold the data now will be more restrictive about sharing it with models that can memorize what they are exposed to.

In addition, works such as HuggingGPT and AudioGPT are showing that an LLM can act as an interface for the user to interact with expert models (text-to-image, audio models, and much more). In recent years, many companies have hired data scientists and developed specialized models for their needs (pharmaceutical companies' models for drug discovery and design, manufacturing companies' models for component design and predictive maintenance, and so on). Thus, data scientists can now instruct LLMs to connect with these previously trained models and allow internal non-technical users to interact with them through textual prompts.
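A heavily simplified sketch of this orchestrator idea (everything here is hypothetical; HuggingGPT's actual planning prompts and APIs are far more elaborate): the LLM only decides which in-house expert model to call, and the expert does the specialized work.

```python
# Hypothetical in-house expert models already trained by the company's data scientists.
def predict_maintenance(sensor_readings: list[float]) -> str:
    return "bearing wear likely within 30 days"  # placeholder expert output

def forecast_demand(history: list[int]) -> str:
    return "expected demand next month: 12,400 units"  # placeholder expert output

EXPERTS = {"maintenance": predict_maintenance, "demand": forecast_demand}

def route(user_request: str) -> str:
    # In HuggingGPT-style systems, an LLM produces this routing decision from
    # the request text; a keyword rule stands in for that call here.
    task = "maintenance" if "machine" in user_request.lower() else "demand"
    return EXPERTS[task]([])  # the orchestrator passes the relevant data along

print(route("Will this machine need maintenance soon?"))
```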
There is another element pointing toward this scenario: the regulations on generative AI are unclear (for example, Google has not released its generative music model for fear of copyright infringement). In addition to the copyright issue, questions about liability remain open. Therefore, many companies may internalize the technology and create their own AI assistants in the coming months.
Parting thoughts

Dr. Hinton said that when people used to ask him how he could work on technology that was potentially dangerous, he would paraphrase Robert Oppenheimer, who led the U.S. effort to build the atomic bomb: "When you see something that is technically sweet, you go ahead and do it."
He does not say that anymore. (source)
Hinton recently stated that we need to discuss the risks of artificial intelligence. But we cannot study the risks of a bomb exploding if it is inside a black box. That is why it is increasingly urgent for models to be open-source.
LLMs are in a phase of change anyway. Creating bigger and bigger models is unsustainable and does not give the same advantage as it once did. The future of the next LLMs will lie in data and probably in new architectures no longer based on self-attention.
However, data will not be as accessible as it once was; companies are beginning to block access to it. Microsoft says it is willing to allow companies to create their own versions of ChatGPT, but companies will be skeptical.
Some companies fear for their business (it seems ChatGPT has already claimed its first victim), and others are afraid of data leakage. Or, simply, the technology is finally within reach of almost all companies, and each will create a chatbot tailored to its needs.
In conclusion, we can see several trends (some of which are already happening):
- A mounting fear of AI is pushing for open-source models
- This is leading to an increasing number of open-source LLMs being published, which in turn shows that you can use smaller models and reduce the cost of their alignment.
- LLMs threaten several businesses, and companies fear for their business models. Thus, many are restricting access to their data or asking AI companies for payment.
- The reduction in cost, the fear of competition, the new relevance of proprietary data, and the availability of open-source models are leading companies to train their own chatbots on their own data.
What do you think about the future of LLMs? Let me know in the comments.
If you have found this interesting:
You can look for my other articles, you can also subscribe to get notified when I publish articles, you can become a Medium member to access all its stories (affiliate links of the platform for which I get small revenues without cost to you) and you can also connect or reach me on LinkedIn.
Here is the link to my GitHub repository, where I am planning to collect code and many resources related to machine learning, artificial intelligence, and more.
GitHub – SalvatoreRa/tutorial: Tutorials on machine learning, artificial intelligence, data science…
or you may be interested in one of my recent articles:
Welcome Back 80s: Transformers Could Be Blown Away by Convolution
META DINO: how self-supervised learning is changing computer vision
Looking into Your Eyes: How Google AI Model Can Predict Your Age from the Eye
The Mechanical Symphony: Will AI Displace the Human Workforce?