
ChatGPT, and Large Language Models (LLMs) more broadly, have become ubiquitous in our lives. Yet the mathematics and internal structure of LLMs remain obscure to the general public.
So, how can we move beyond perceiving LLMs like ChatGPT as magical black boxes? Physics may provide an answer.
Everyone is somewhat familiar with our physical world. Objects such as cars, tables, and planets are composed of trillions of atoms, governed by a simple set of physical laws. Similarly, complex systems like ChatGPT have emerged that are capable of generating highly sophisticated concepts, like art and science.
It turns out that the equations governing the building blocks of LLMs are analogous to our physical laws. So, by understanding how complexity arises from our simple physical laws, we might glean some insight into how and why LLMs work.
Complexity from Simplicity

Our world is inherently complex, yet it can be described by a remarkably small number of fundamental interactions. For instance, complicated snowflakes and bubble films can be linked to simple attractive forces between molecules.
So, what is the commonality in how complex structures arise? In physics, complexity is generated as we zoom out from the smallest to the largest scales.
Drawing an analogy to language, English starts with a modest number of fundamental constituents – 26 symbols. These symbols can combine to form around 100,000 usable words, each carrying a distinctive meaning. From these words, a countless number of sentences, passages, books, and volumes can be generated.
This linguistic hierarchy is similar to the ones found in physics. Our current fundamental law (the Standard Model) starts with a limited number of elementary particles such as quarks and electrons, along with a few interactions mediated by force carriers like photons. These particles and forces combine to form atoms, each carrying a distinctive chemical property. From these atoms, an immense number of molecules, structures, cells, and beings are created.
In our physical world, there is a sort of emergent universality: even though many complex systems have totally different origins, they often share universal characteristics. For instance, many substances, despite having distinctive chemical properties, share the same three common phases (solid, liquid, and gas). As a more extreme example, the physics behind certain materials (Type-I superconductors) can be borrowed to describe fundamental physics (the celebrated Higgs Mechanism).
It is important to keep in mind the difference between language and physics: physical laws are dictated and constrained by nature, while languages are human creations that seem comparatively unconstrained. A priori, there is no requirement for linguistic complexity to resemble physical complexity.
However, as we will argue below, ChatGPT and other LLMs contain particle-physics-like structures. If these structures are indeed key to the success of LLMs, this hints that linguistic complexity shares some commonalities with physical complexity, and it could provide valuable insight into why and how LLMs work.
The Physics of Language Models

In order to connect LLMs to physics, we need to relate their underlying mathematical structures. In physics, the movement of particles (or more generally, fields or states) can be written schematically as:

dxᵢ/dt = −∂/∂xᵢ Potential(x₁, x₂, …)
(** technical note: the Hamiltonian formalism makes it more precise, although the derivative part needs to be modified a bit)
Intuitively, it states that particles move because of forces, which come from the slopes of an abstract object called the potential. This is analogous to water flowing down a stream, where the potential comes from gravity and fluid dynamics.
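To see this equation in action, here is a minimal numerical sketch in Python (my own illustration; the attractive pair potential, step size, and starting positions are arbitrary choices):

```python
import numpy as np

# A minimal sketch: three particles on a line, each moving down the slope
# of a shared potential, integrated with small discrete steps. Here the
# potential is a simple attractive spring between every pair of particles.

def potential(x):
    # Total potential energy: 0.5 * (xi - xj)^2 summed over pairs.
    diffs = x[:, None] - x[None, :]
    return 0.25 * np.sum(diffs ** 2)  # factor 0.25: each pair is counted twice

def force(x, eps=1e-5):
    # Force = minus the slope of the potential (numerical derivative).
    grad = np.array([(potential(x + eps * e) - potential(x - eps * e)) / (2 * eps)
                     for e in np.eye(len(x))])
    return -grad

x = np.array([0.0, 1.0, 3.0])  # starting positions of the three particles
for _ in range(100):
    x = x + 0.05 * force(x)    # dx/dt = -dPotential/dx, in discrete steps
print(x)  # all three particles drift toward their common center
```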
It turns out that the structure of LLMs is very similar: they break sentences down into fundamental constituents called tokens, and the AI model modifies these tokens layer by layer in an analogous way:

d(tokenᵢ)/d(layer) = −∂/∂tokenᵢ Potential(token₁, token₂, …)
This will be made more precise in the technical section below. From this, we can draw an analogy:
Transformer-based language models treat words as particles, which move around under one another's influence, generating intriguing patterns.
In this way, just as water molecules can build beautiful snowflakes, or liquid-soap mixtures can create intricate bubble patterns, interesting results from ChatGPT might be attributed to its physics-like behavior.
In the following optional section, we'll describe in more detail how this analogy can be made rigorous, and then dive into the specifics of how this insight could help us understand LLMs.
Technical Detour
Below we provide a more detailed explanation of how one can think of LLMs as physics models.
From the physics side, on a microscopic level, each particle is typically influenced by all the particles in the system. For instance, let's consider a hypothetical world with only 3 particles; in this case, there would be a total of 3 × 3 = 9 possible potentials between one particle and another. Schematically, we can represent this as follows:

Force on particle₁ = − slope of (Potential₁₁ + Potential₁₂ + Potential₁₃)
Force on particle₂ = − slope of (Potential₂₁ + Potential₂₂ + Potential₂₃)
Force on particle₃ = − slope of (Potential₃₁ + Potential₃₂ + Potential₃₃)
(** In physics, the potentials are usually symmetric, i.e. Potential₁₂ = Potential₂₁; we are relaxing this constraint here.)
To see how this relates to LLMs, let’s recall some basic facts:
- To feed data into LLMs, a document or text is broken down into tokens, which typically consist of one word or part of a word (see the toy example after this list). Like particles, tokens are thought of as the smallest indivisible constituents in an LLM.
- LLMs have multiple layers, and in each layer, all the tokens are modified by self-attention modules.
- The final output layer aggregates the tokens to form predictions, which can be used for classification or for generating text/images.
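As a toy illustration of the first point (my caricature; real tokenizers use learned subword vocabularies such as BPE rather than whitespace splitting):

```python
# A toy caricature of tokenization: real tokenizers (e.g. BPE) split text
# into learned subword pieces; here we simply split on whitespace and map
# each token to an integer id from a tiny hand-made vocabulary.
vocab = {"I": 0, "like": 1, "physics": 2}
tokens = "I like physics".split()
token_ids = [vocab[t] for t in tokens]
print(tokens, token_ids)  # ['I', 'like', 'physics'] [0, 1, 2]
```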
If we take a three-token example (say, from the sentence "I like physics"), what would the equations look like?
There are some small differences depending on the specific type of LLM we're working with: BERT or GPT.
BERT Models
For BERT-like models (typically used for classification), each layer modifies the tokens schematically as follows:

tokenᵢ(layer₂) = tokenᵢ(layer₁) + Attentionᵢ(token₁(layer₁), token₂(layer₁), token₃(layer₁)),   for i = 1, 2, 3
(** the reason layer₁ appears on the right-hand side is the residual connection)
If we think of the layer as analogous to the time dimension, then the structure of the equation is similar to the equations governing the movements of three particles, although in LLMs the layers are discrete, as opposed to continuous in physics.
To make the analogy complete, we still need to convert the attention portion into a sort of potential. Let's dig deeper mathematically. Pick a specific token tᵢ; at each layer, it gets modified according to the self-attention mechanism (ignoring multiple attention heads):

tᵢ(layer₂) = tᵢ(layer₁) + Σⱼ exp[(Q tᵢ)·(K tⱼ)] V tⱼ
where Q, K, and V are the Query, Key, and Value matrices typically seen in an attention module. For now, we are ignoring normalization layers. The crux is that the exponential form can be re-written as the derivative of a kind of potential term!

Σⱼ exp[(Q tᵢ)·(K tⱼ)] V tⱼ = M ∂/∂tᵢ Σⱼ exp[(Q tᵢ)·(K tⱼ)],   where M = V (QᵀK)⁻¹
(** While QᵀK may not always be invertible and this equation may not be exact, V is an arbitrary weight in our model, so we can always trade M for V in our attention module to achieve the same model performance.)
In this way, passing tokens through the layers of an LLM is analogous to having a group of particles interact under some pairwise interactions! It's sort of like gas molecules bumping into one another and forming weather patterns.
(** in this view, we can interpret the normalization and matrix multiplication M as a sort of projection, so that the token-particles are properly constrained in the system. It’s analogous to a roller coaster being confined to its tracks.)
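We can verify this rewriting numerically. Below is a minimal sketch (my own illustration, with small random weight matrices and a finite-difference slope; the softmax denominator is ignored, as in the schematic above):

```python
import numpy as np

# A numerical check: the un-normalized attention update on token i equals
# M times the gradient of the potential U_i = sum_j exp((Q t_i)·(K t_j)).
rng = np.random.default_rng(0)
d = 4
Q, K, V = 0.3 * rng.normal(size=(3, d, d))  # three random d-by-d matrices
tokens = rng.normal(size=(3, d))            # three tokens: "I like physics"

def attention_update(t_i):
    # sum_j exp((Q t_i)·(K t_j)) V t_j, with the softmax denominator ignored
    return sum(np.exp((Q @ t_i) @ (K @ t_j)) * (V @ t_j) for t_j in tokens)

def potential(t_i):
    return sum(np.exp((Q @ t_i) @ (K @ t_j)) for t_j in tokens)

def gradient(f, x, eps=1e-6):
    # finite-difference slope of f at x
    return np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                     for e in np.eye(len(x))])

M = V @ np.linalg.inv(Q.T @ K)  # trade V for M, as in the note above
t_1 = tokens[0]
print(np.allclose(attention_update(t_1), M @ gradient(potential, t_1), atol=1e-5))
# True: the attention "force" on a token is the slope of a potential
```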
GPT Models
For (Chat)GPT-like models, the discussion gets modified slightly. The attention module has an extra causal structure: tokens can only be modified by the ones before them. This means that the equations are missing a few terms:

tᵢ(layer₂) = tᵢ(layer₁) + Σⱼ≤ᵢ exp[(Q tᵢ)·(K tⱼ)] V tⱼ   (the terms with j > i are absent)
Following our analogy, this means that particles come in one at a time, and each one gets stuck after going through all the interaction layers. It's sort of like growing a crystal one atom at a time.
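Continuing the numerical sketch above (same imports and variables), the causal version simply truncates the sum:

```python
# For a GPT-like model, token i only "feels" the tokens before it (j <= i),
# so the sum over j in the attention update is truncated by the causal mask.
def causal_attention_update(i):
    t_i = tokens[i]
    return sum(np.exp((Q @ t_i) @ (K @ t_j)) * (V @ t_j)
               for t_j in tokens[: i + 1])  # terms with j > i are absent

print(causal_attention_update(0))  # the first token only interacts with itself
```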
One thing to keep in mind is that our physics analogy isn't 100% exact, as fundamental features like symmetries and conservation laws (of energy and momentum) that are so ubiquitous in physics do not apply to LLMs.
Emergence in Language Models

Now that we have our physics analogies, how does it help us understand LLMs? The hope is that, like complex physical systems, we can draw analogies from other, more familiar and well-understood systems to gain insights into LLMs. However, I have to caution the reader that most of our discussions below will be speculative, as confirming them would require conducting detailed experimental studies on LLMs.
(* Indeed, if I had more resources, I imagine some of these ideas could become fruitful academic research projects.)
Below is a sampler of how we may use the language of physics to reframe our understanding of LLMs.
LLM Training
Using the language of thermal physics, we can think of LLMs as a tunable physical system, and model training is analogous to applying thermal pressure to the system to adjust its parameters. This viewpoint was described in my other article, "The Thermodynamics of Machine Learning," so I won’t delve too much into the details here.
Emergence of Intelligence?
While there is plenty of discussion on whether systems like ChatGPT are intelligent or not, I will refrain from adding to this controversial topic as I am not even sure how one may define intelligence. Nevertheless, it is clear that ChatGPT can consistently produce sophisticated and interesting outputs.
If we subscribe to our physics analogy, this should not be surprising. From snowflakes to tornadoes, we know that even simple laws can give rise to highly complex behaviors, and from complex behaviors, structures that appear intelligent could arise.
Complexity as a concept is not easily defined, so in order to proceed further, we can examine some key features of complex systems. One such feature is the phase transition.
Phase Transition
Many complex physical systems possess distinctive phases, each with a highlighted set of physical properties. Thus, it is reasonable to suggest that within LLMs, there could be distinctive phases as well, with each phase tuned to be helpful in specific tasks (such as coding vs proofreading).
How might we verify or refute such a claim? This is where things could get interesting. In physics, phases arise when interactions start to form interesting structures. Some examples include:
- When water cools down, the attractive forces between molecules get stronger, causing the molecules to stick together and form solids.
- When a metal is cooled to extremely low temperatures, electrons may become attracted to each other through sound-wave (phonon) interactions, forming Type-I superconductors.
Perhaps something analogous could occur in LLMs? For instance, in ChatGPT, one might speculate that certain groups of tokens, such as those from "code" or "proofread" requests, could trigger an avalanche of particular forces that drives a specific type of output.
Another technical aspect of phase transitions is the modification of symmetries. This is related to the creation of structures, such as ice crystal patterns forming from water vapor. While LLMs do not possess physical symmetries, they should contain some sort of permutation symmetry of the model weights: performance should be the same as long as the weights are initialized with the same statistics and trained in the same paradigm, so hidden units can be relabeled without changing the model (the sketch below demonstrates this). The specific values of particular weights only become important during training, which can be thought of as a freezing-in of the weights. However, to continue this discussion, we would need to delve into the technical subject of spontaneous symmetry breaking, which we'll save for a later date.
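Here is a toy demonstration of this weight symmetry (my own example, using a small two-layer network rather than a full LLM):

```python
import numpy as np

# Permuting the hidden units of a two-layer network, along with the weights
# attached to them, leaves the output unchanged: a permutation symmetry of
# the model weights.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(8, 4)), rng.normal(size=8)  # input -> hidden
W2, b2 = rng.normal(size=(3, 8)), rng.normal(size=3)  # hidden -> output

def mlp(x, W1, b1, W2, b2):
    h = np.maximum(W1 @ x + b1, 0.0)  # ReLU hidden layer
    return W2 @ h + b2

x = rng.normal(size=4)
perm = rng.permutation(8)  # relabel the 8 hidden units

# Permute the rows of W1/b1 and the matching columns of W2.
print(np.allclose(mlp(x, W1, b1, W2, b2),
                  mlp(x, W1[perm], b1[perm], W2[:, perm], b2)))  # True
```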
Are LLMs Efficient?
While there are many criticisms regarding the perceived inefficiency of LLMs due to their large number of parameters (especially when compared to physical models), these criticisms may not be fully warranted.
Why? It comes down to the technical limitations of our computers, which result in significant differences between physics and LLMs:
- Physical laws have infinite precision, while LLMs have finite precision.
- Physics exhibits huge hierarchies, with some forces being tiny and others large. In LLMs, we attempt to make all outputs/weights similar in size through normalization.
- In physics, tiny effects can accumulate into enormous influences (such as Earth's gravity). In LLMs, these tiny effects are often rounded away and eliminated, as the sketch after this list illustrates.
- Nature is an incredibly efficient computer, with interactions computed instantaneously across all scales with infinite precision. LLMs, on the other hand, are relatively slow computers limited by finite precision.
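Here is a minimal sketch of the precision gap (my illustration, with float32 standing in for an LLM's finite precision):

```python
import numpy as np

# In float32, an effect below machine epsilon is rounded away entirely, so
# it can never accumulate no matter how many times it is applied. In exact
# arithmetic (nature's "infinite precision"), it would add up.
x = np.float32(1.0)
tiny = np.float32(1e-8)  # below float32 machine epsilon (~1.2e-7)

print(x + tiny == x)               # True: the tiny effect is lost
print(x + np.float32(1e8) * tiny)  # 2.0: only pre-aggregated effects survive
```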
This means that while we can strive to make LLMs mimic physics better and create more powerful models, in practice, computers are simply incapable of fully simulating our world (as discussed in "Why We Don’t Live In a Simulation"). Therefore, resorting to a large number of parameters may be a last-resort tactic to address some of these deficiencies.
It is even plausible that given finite precision, there may be an upper limit to the complexity achievable with standard computers. This could make it very challenging to significantly reduce the number of parameters (although advancements in quantum computing might change this in the future).
Improvements to LLMs
Could our physics analogy help provide hints for the next generation of LLMs? I believe it’s possible. Logically, there are two possible directions to pursue based on our beliefs:
- Physics-like features are desirable: We should draw more inspiration from physics to create better model structures.
- Physics-like features are undesirable: Physics-like features may actually limit the capabilities of LLMs due to inherent computational limits, so we should avoid them.
Since we are using physics to understand LLMs, I’ll focus on the first possibility. Under this assumption, how could we address the shortcomings of LLMs like ChatGPT?
- Preserving Hierarchies: Instead of solely focusing on normalizing weights and reducing precision, we should explore alternative approaches to account for diverse interactions with different strengths and scales. We could draw inspiration from how electromagnetism (which is very strong) and gravity (which is very weak) are combined in nature.
- Accommodating Different Phases: Describing both ice and water using the same basic molecular equations is inefficient. It's more efficient to use different descriptions for different phases (say, sound waves vs. water waves). We could create a better structure that naturally accommodates these macroscopic differences within the model.
- Advanced Physics Techniques: In physics, we don't study emergent phenomena using only the fundamental equations. Techniques like thermodynamics, mean-field theory, and renormalization help us simplify the problem. Incorporating some of these ideas into the building blocks of LLMs could improve their efficiency. For example, recent advances in linear attention (A. Katharopoulos et al.) may already be interpreted as a sort of mean-field approach; a sketch follows this list.
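Here is a simplified sketch of that linear-attention idea (my non-causal illustration, using the elu(x) + 1 feature map from the Katharopoulos et al. paper): instead of computing every pairwise exp(q·k) interaction, each token interacts with one aggregated summary of all tokens, which is what gives it a mean-field flavor.

```python
import numpy as np

# Linear attention: replace exp(q·k) with a feature map phi(q)·phi(k), so
# all keys/values can be aggregated once into a shared summary that every
# query then reads from, in a mean-field spirit.
def phi(x):
    return np.where(x > 0, x + 1.0, np.exp(x))  # elu(x) + 1 feature map

def linear_attention(Q, K, V):
    S = phi(K).T @ V        # one aggregated summary of all tokens: (d, d_v)
    z = phi(K).sum(axis=0)  # shared normalization term: (d,)
    return (phi(Q) @ S) / (phi(Q) @ z)[:, None]

rng = np.random.default_rng(0)
Q = rng.normal(size=(5, 4))  # 5 tokens, feature dimension 4
K = rng.normal(size=(5, 4))
V = rng.normal(size=(5, 2))
print(linear_attention(Q, K, V).shape)  # (5, 2): one output per token
```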
By exploring these avenues, we may be able to enhance the capabilities and efficiencies of LLMs, leveraging insights from physics to advance the field further.
Epilogue
In summary, we have showcased how the mathematics of LLMs resembles that of physics. This allows us to use our intuitions about everyday physical systems to understand new emergent phenomena such as ChatGPT. I hope this helps demystify some of the characteristics of LLMs.
More generally, I hope I have conveyed how physics can provide valuable insights into complex subjects like LLMs. I firmly believe that science is most effective when we borrow insights from seemingly disparate fields.
If you enjoyed this article, you might be interested in my other pieces on similar topics, such as the link between physics and AI.
Please leave a comment or provide feedback, as it encourages me to write more insightful pieces! 👋
A Physicist’s View of Machine Learning: The Thermodynamics of Machine Learning
The Meaning Behind Logistic Classification, from Physics
Why Causation Is Correlation: A Physicist’s Perspective (Part 1)