
Note: This article presents key findings from our recent research paper, "A non-ergodic framework for understanding emergent capabilities in Large Language Models" [link1] [link2]. While the paper offers a comprehensive mathematical framework and detailed experimental evidence, this post aims to make these insights accessible to a broader audience.
However, how many kinds of sentence are there? Say assertion, question and command? There are countless kinds; countless different kinds of use of all the things we call "signs", "words", "sentences". Furthermore, this diversity is not something fixed, given once for all; but new types of language, new language-games, as we may say, come into existence, and others become obsolete and get forgotten. We can get a rough picture of this from the changes in mathematics.
Ludwig Wittgenstein, Philosophical Investigations, first published in 1953, 4th ed. 2009, Wiley-Blackwell, UK
Motivation
I remember the summer of 2006 vividly. As I sat at my desk, surrounded by stacks of papers on innovation management theory and the logic of invention, my research had hit a wall. The econometric frameworks for understanding how innovation emerges in organizations felt mechanical and out-of-date. I was missing something essential about how new possibilities arise from existing capabilities. That’s when I first found Stuart Kauffman’s book ‘Investigations’ ¹. Its ambitious subtitle—"The Nature of autonomous agents and the worlds they mutually create"—is both appealing and spooky. When I finally began reading, I found myself drawn into Kauffman’s exploration of how complex biological systems perpetually generate novelty. His central idea, the "adjacent possible," described how each new innovation in biological systems opens up new possibilities, creating an ever-expanding space of potential futures.
The idea wouldn’t let me go. Kauffman’s mathematical framework for understanding how biological systems explore and expand their possibility spaces through restricted combinations of existing elements rather than random search felt profound. It opened a path to think about innovation not as a linear process but as an organic expansion of possibilities.
Fast forward to 2024 and the rise of increasingly sophisticated large language models such as GPT-4o and Claude 3.5. Current research primarily emphasizes empirical observations and scaling laws—regularities, found both in the empirical analysis of LLMs and in many real-world complex systems, that describe how a system’s properties change with size. But prevailing approaches still struggle to explain how large language models actually behave. Fulfilling the promises of superintelligence and AGI will require a clear answer to the question of how these cutting-edge models work.
Theoretical physics (and theoretical biology too), in its effort to understand the essence of a natural process stripped of accessories, finds that things become more symmetrical, more similar, and more indistinguishable. This simplifies the development of mathematically simpler and more comprehensible models. Hilbert asserted that all reality can be expressed mathematically and is, ultimately, geometric in nature: according to him, physics can be articulated through geometry. I’m convinced this vision can be very useful when trying to understand large language models.
The way these AI systems seem to develop new capabilities—not gradually but through sudden transitions—reminded me of Kauffman’s ideas of how biological systems evolve. Could the same theoretical framework that helped explain the expansion of biological possibility spaces help us understand how large language models develop their capabilities?
Emergence in complex systems: when the whole becomes more than its parts
When OpenAI released GPT-4o, something unexpected happened. The model didn’t just get marginally better at existing tasks—it suddenly showed entirely new capabilities that weren’t explicitly programmed into it. In physics, this phenomenon—complex behaviors arising unexpectedly from simpler rules and interactions—is called "emergence," and it plays a central role in describing complex systems with simple rules. But to understand why this is so significant for AI, we first need to understand how emergence works in complex systems.
A complex system isn’t just complicated or messy—it has specific properties that make it fundamentally different from other systems. Think of the difference between an automatic Swiss watch and an ant colony. A watch, though sophisticated, behaves in predictable ways that can be understood by analyzing its parts. Ant colony dynamics, however, show behaviors that can’t be predicted just by analyzing individual ants’ movements.
Complex systems share three key properties: non-linear interactions between components, spontaneous self-organization at critical points, and the emergence of new properties at different scales.

In complex systems, changes do not happen gradually. At certain moments, small changes lead to sudden qualitative shifts—what physicists call phase transitions. The classic example is water turning to ice. No matter how slowly you decrease the temperature, at exactly 0°C water molecules suddenly realign into a crystalline structure (strictly speaking, below 4°C water molecules already begin to form isolated, ice-like clusters). These transitions follow a mathematical pattern.
When a system approaches a critical point, certain quantities follow a power law that can be written as ξ ~ |T – Tₖ|^(-ν), where T is the system parameter (like temperature), Tₖ is the critical point, and ν is a critical exponent. This equation tells us something fundamental: emergence often happens through sudden reorganizations at specific critical points, not through gradual improvement. In physics, phase transitions are tied to the fundamental concept of symmetry².
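To make this scaling relation concrete, here is a tiny numerical sketch (my own toy example, not taken from our experiments) that evaluates ξ ~ |T – Tₖ|^(-ν) for illustrative parameters and shows how the correlation length blows up as T approaches the critical point:

```python
import numpy as np

def correlation_length(T, T_c=273.15, nu=0.63, xi_0=1.0):
    """Toy correlation length xi ~ |T - T_c|^(-nu) near a critical point.

    T_c, nu and xi_0 are illustrative values (nu = 0.63 is roughly the
    3D Ising exponent); they are not fitted to any real data.
    """
    return xi_0 * np.abs(T - T_c) ** (-nu)

# Approach the critical temperature from above and watch xi diverge.
for dT in [10.0, 1.0, 0.1, 0.01, 0.001]:
    xi = correlation_length(273.15 + dT)
    print(f"T - T_c = {dT:>7.3f} K  ->  xi = {xi:10.2f} (arbitrary units)")
```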
When we analyzed how these models develop new capabilities, we found patterns similar to phase transitions in physical systems. For example, we observed that the ability to solve multi-step mathematical problems doesn’t improve linearly with model size. Instead, there’s a critical point where the model suddenly transitions from random attempts to structured, step-by-step solutions.

But perhaps the most surprising aspect of emergence in LLMs is self-organization. Just as birds in a flock follow simple local rules that lead to complex collective behavior, neural networks develop organized patterns of information processing without explicit programming. We can measure this self-organization through the attention patterns in transformer models. At critical points, these patterns spontaneously reorganize from relatively uniform distributions to highly specialized structures. This reorganization happens through sudden transitions that point to the emergence of new capabilities.

This behavior would explain why traditional statistical measures such as perplexity often fail to fully capture the capabilities of the model. Just as measuring the average velocity of water molecules wouldn’t predict ice formation, statistical metrics can’t capture the complex reorganizations happening within these models. Doesn’t this sound a lot like the concept of emergence in complex systems? Does this indicate that language models follow the fundamental laws discovered by physicists in nature?
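One simple way to quantify this kind of attention reorganization is to track the entropy of each attention head’s distribution: near-uniform attention has high entropy, while specialized heads concentrate their weight on a few positions and the entropy drops. The sketch below is a minimal illustration of such a probe, not the measurement code from our paper; it assumes you have already extracted an attention tensor of shape (heads, query positions, key positions) from a transformer.

```python
import numpy as np

def attention_entropy(attn):
    """Mean entropy (in bits) of the attention distributions.

    attn: array of shape (heads, n_query, n_key); each slice along the last
    axis is assumed to be a normalized attention distribution.
    """
    eps = 1e-12
    ent = -np.sum(attn * np.log2(attn + eps), axis=-1)  # entropy per (head, query)
    return float(ent.mean())

# Illustrative comparison: uniform vs. highly specialized attention patterns.
heads, n_q, n_k = 8, 16, 16
uniform = np.full((heads, n_q, n_k), 1.0 / n_k)

specialized = np.full((heads, n_q, n_k), 0.01 / (n_k - 1))
for q in range(n_q):
    specialized[:, q, q] = 0.99  # each query attends almost entirely to one key

print(f"uniform attention:     {attention_entropy(uniform):.2f} bits")
print(f"specialized attention: {attention_entropy(specialized):.2f} bits")
```

A sharp drop in a metric like this across training checkpoints or scales would be one signature of the reorganization described above.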
The adjacent possible: from biology to AI
Kauffman’s theory of the adjacent possible³, or TAP, provides a powerful tool for analyzing how systems can be both constrained by their current state and perpetually creative. In biological systems, each new innovation opens up new possibilities, but these possibilities are constrained by what can be reached from the current state. Think of the sequence of characters in the word "LOT": the consonant "L", followed by the vowel "O", followed by another consonant "T". If we keep the consonants and substitute vowels from the set P = [a, e, i, o, u] (giving (|P| choose i) candidate combinations), the adjacent possible leads us to the series LAT, LET, LIT, LOT, LUT. But, as far as I know, the word "LUT" is not in the English dictionary (the lexicon). In other words, not all combinatorially adjacent states are actually reachable.
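To make the "LOT" example concrete, here is a small sketch (mine, for illustration only) that enumerates the vowel-substitution neighbours of "LOT" and filters them against a toy lexicon; a real experiment would use a full English word list.

```python
# Adjacent possible of "LOT" under single-vowel substitution.
VOWELS = ["a", "e", "i", "o", "u"]

# Toy stand-in for an English lexicon; a real test would use a full word list.
LEXICON = {"lat", "let", "lit", "lot"}

def vowel_neighbours(word):
    """All strings reachable by replacing each vowel position with another vowel."""
    neighbours = set()
    for pos, ch in enumerate(word):
        if ch.lower() in VOWELS:
            for v in VOWELS:
                neighbours.add(word[:pos] + v + word[pos + 1:])
    return neighbours

candidates = vowel_neighbours("lot")        # the combinatorially adjacent states
reachable = sorted(candidates & LEXICON)    # those the lexicon actually allows
print("candidates:", sorted(candidates))    # ['lat', 'let', 'lit', 'lot', 'lut']
print("reachable: ", reachable)             # 'lut' is filtered out
```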
This same principle might apply to how language models develop their capabilities. Each new ability emerges not in isolation but from combinations of existing patterns and capabilities, constrained by the model’s architecture, training, and context. Just as biological systems explore their possibility spaces through restricted combinations of existing elements, language models might navigate their semantic spaces through constrained exploration of learned patterns.

We need to test three key hypotheses
To test whether the theory of the adjacent possible can describe LLM behavior, in the above-mentioned paper we developed three hypotheses about how language models develop new capabilities:
- Capabilities emerge through phase transitions in semantic space, similar to how biological systems undergo sudden reorganizations at critical points.
- These transitions are shaped by the multiplicative interaction of architectural, training, and contextual constraints.
- The path a model takes through its possibility space fundamentally influences its future capabilities, making these systems inherently non-ergodic.
As noted above, current research primarily emphasizes empirical observations and scaling laws¹, but these still struggle to explain model behavior. The challenge becomes particularly evident when we observe that capabilities don’t emerge gradually; they often appear through sudden transitions at certain scales. Furthermore, these models show a strong dependence on their interaction history, suggesting that their behavior might be fundamentally different from that of simple statistical systems.
In nature, some complex biological systems, like developing organisms and ecosystems, show similar patterns: capabilities emerge through sudden transitions rather than gradual improvements, system behavior depends strongly on historical context, and responses change significantly based on the evolution path taken to reach a particular state. In this post, we’ll explore some fundamental aspects of language model behavior that parallel complex systems in nature. By understanding these principles, we can better predict and potentially guide how language models develop new capabilities.
Hypothesis 1. Non-ergodic systems: the memory effect
To understand how language models develop capabilities, we first need to address a fundamental question: do these models explore their possibility space uniformly over time (ergodic behavior), or does their history fundamentally constrain their future states (non-ergodic behavior)? Ergodic theory studies the long-term statistical behavior of dynamical systems⁴. Formally, a system is non-ergodic when its time averages differ from its ensemble averages.
Consider this simple example. Let’s ask the model the question, "What is quantum entanglement?" but in two different contexts.
User: I'm watching a sci-fi movie about time travel.
AI: Those can be fascinating! Time travel in science fiction often plays with
some interesting physics concepts.
User: What is quantum entanglement?
AI: Quantum entanglement is like a cosmic connection between particles. Imagine
two characters in your sci-fi movie who are always connected - when one
makes a choice, it instantly affects the other, no matter how far apart
they are. In physics terms, entangled particles behave similarly -
measuring one immediately influences the other, regardless of distance.
This "spooky action at a distance," as Einstein called it, is one reason
why quantum physics seems to defy our usual understanding of time and
space.
User: I'm trying to understand how quantum computers work.
AI: That's a great topic! Quantum computing is built on some fundamental
principles of quantum mechanics.
User: What is quantum entanglement?
AI: Quantum entanglement is a fundamental property that makes quantum computers
powerful. When two quantum bits (qubits) become entangled, their quantum
states are interconnected. This means that performing operations on one
qubit instantly affects its entangled partner. This property allows
quantum computers to perform certain calculations exponentially faster than
classical computers because they can manipulate multiple entangled qubits
simultaneously, creating quantum circuits that process information in ways
impossible for traditional computers.
The same prompt, given to multiple instances of a model (ensemble average), produces different results than giving prompts sequentially to a single instance (time average). Thus, the system’s future "states" (its possible responses) depend critically on its history⁵; time averages and ensemble averages diverge. This provides an initial argument that LLMs are non-ergodic systems.
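A minimal way to operationalize this distinction might look like the sketch below. It assumes some callable ask(history, prompt) that returns a model response—any chat API or local model could be plugged in; the dummy function used here is purely illustrative—and it compares a crude statistic, response length, under the two averaging procedures. This is my own illustrative protocol, not the measurement code from the paper.

```python
from statistics import mean

def ensemble_average(ask, prompt, n_instances=5):
    """Same prompt sent to n fresh, independent sessions (no shared history)."""
    return mean(len(ask([], prompt)) for _ in range(n_instances))

def time_average(ask, prompts, probe):
    """One ongoing session: feed prompts sequentially, probing after each turn."""
    history, lengths = [], []
    for p in prompts:
        history.append((p, ask(history, p)))
        lengths.append(len(ask(history, probe)))
    return mean(lengths)

# Dummy stand-in for a real model call: response length grows with history.
def dummy_ask(history, prompt):
    return "x" * (20 + 5 * len(history))

probe = "What is quantum entanglement?"
print("ensemble average length:", ensemble_average(dummy_ask, probe))
print("time average length:    ",
      time_average(dummy_ask, ["sci-fi movie", "quantum computers"], probe))
```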

In our experiments, we measured this effect systematically across different models. When we tracked how models solved complex problems, we found that some models showed an important difference in their solution paths depending on the previous context. Even more telling, the length and style of their responses varied by more than 10 steps on average when we changed the order of the information presented.
Our experimental results indicate that large language models share three fundamental properties with non-ergodic systems:
- Path-dependent state space: In contrast to a gas molecule that eventually explores all potential states regardless of initial conditions, language models show constrained exploration patterns. Each interaction effectively modifies the state space accessible to the model’s next tokens. This is why the same prompt can lead to different solutions depending on previous interactions—the model isn’t exploring all possible responses but rather following paths shaped by its interaction history.
- Time-ensemble inequivalence: Another defining feature of non-ergodic systems is that averaging over time produces different results than averaging over multiple instances. We observed this directly: giving the same prompt to many independent instances (ensemble average) produced significantly different results than presenting prompts sequentially within a single ongoing interaction (time average). This inequivalence is a key property of non-ergodic systems (see the toy simulation after this list).

- Memory effects: Just as a river’s future path depends on the channels it has already carved, each model interaction creates patterns that influence future responses. The fact that response length and solution paths changed by an average of over 10 steps shows that these aren’t just effects of the context; they’re fundamental changes in how the model navigates its knowledge space.
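For readers who want a self-contained toy model of time-ensemble inequivalence, the classic Pólya urn (not a language model, just an analogy) shows the same signature: each individual run’s time-averaged fraction of red balls settles on its own random value, while the average across many runs stays close to 0.5. A quick simulation:

```python
import random

def polya_run(steps=2000, seed=None):
    """One Polya urn trajectory: draw a ball, return it plus one more of the same color."""
    rng = random.Random(seed)
    red, black = 1, 1
    fractions = []
    for _ in range(steps):
        if rng.random() < red / (red + black):
            red += 1
        else:
            black += 1
        fractions.append(red / (red + black))
    return fractions

runs = [polya_run(seed=s) for s in range(500)]

# Time average of the red fraction within each run vs. the average across runs.
time_averages = [sum(fr) / len(fr) for fr in runs]
ensemble_average = sum(fr[-1] for fr in runs) / len(runs)

print("first five per-run time averages:", [round(t, 3) for t in time_averages[:5]])
print("ensemble average at the final step:", round(ensemble_average, 3))
```

The point of the analogy is only the statistical signature: early draws shape each urn’s future in a way that no amount of later averaging undoes.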

These properties have direct practical implications when using large language models. For example, sequential ordering of prompts matters because each interaction modifies the accessible state space. Solutions found in one context might be unreachable in another due to path dependence. This concept helps explain why prompt engineering is more than just finding the right words; it’s about understanding how to guide a non-ergodic system through its state space to reach desired capabilities.
Hypothesis 2. Phase transitions: when models learn to rise
Language models don’t develop new abilities through steady improvement but through sudden changes—what physics calls phase transitions. These shifts don’t happen gradually; they occur at specific critical points where the entire system reorganizes itself. A classic example is water turning to ice: no matter how slowly you decrease the temperature, at exactly 0°C water molecules suddenly realign into a crystalline structure in a complete reorganization of the system’s behavior. At certain critical points, a system’s fundamental behavior changes not because of external forces but through internal reorganization. In complex systems theory this is called self-organized criticality, or SOC: a complex system naturally organizes itself toward a critical point at the edge between two different regimes, without intervention from outside the system. Our research shows that language models behave in a strikingly similar way.
- Critical points in learning: Large language models show distinct critical points where their capabilities suddenly change. Performance doesn’t improve smoothly—instead, we observed sharp transitions where problem-solving abilities emerge suddenly. At these critical points, the model’s internal organization fundamentally shifts.
- Symmetry breaking: Before this phase transition, large language models explore possibilities somewhat randomly. But after reaching a critical point, they develop structured, consistent problem-solving strategies. This mirrors how magnetic materials suddenly align their internal structure at certain temperatures—a fundamental principle in physics called symmetry breaking⁶.
- Energy landscapes: The path to these transitions follows predictable patterns. We observed that just before a capability emerges, the model’s attention patterns (its way of processing information) reorganize themselves. This reorganization isn’t random; it follows specific patterns that we can measure and potentially predict. These reorganizations represent the model finding lower-energy configurations that enable more efficient information processing. For instance, when a model suddenly develops the ability to solve multi-step math problems, we observe a rapid shift from random attempts to structured, step-by-step solutions.
Understanding these phase transitions gives us new insights into how these models develop. By identifying and analyzing these critical points, we might be able to predict and eventually guide the emergence of new model capabilities.
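As a minimal sketch of what "identifying a critical point" could look like in practice—with made-up numbers, not our experimental data—the snippet below takes a hypothetical accuracy-versus-scale curve and locates the scale at which performance jumps most sharply:

```python
import numpy as np

# Hypothetical accuracy of one capability at increasing model scales (log10 of parameters).
scales = np.array([8.0, 8.5, 9.0, 9.5, 10.0, 10.5, 11.0])
accuracy = np.array([0.02, 0.03, 0.04, 0.05, 0.38, 0.71, 0.74])  # sharp jump, not smooth

# Discrete derivative of accuracy with respect to scale.
d_acc = np.diff(accuracy) / np.diff(scales)

# The critical region is where the derivative peaks.
k = int(np.argmax(d_acc))
print(f"largest jump between 10^{scales[k]} and 10^{scales[k + 1]} parameters "
      f"(slope {d_acc[k]:.2f} per decade)")
```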
Hypothesis 3. The three forces shaping large language model behavior
Kauffman’s theory of the adjacent possible starts with a surprisingly simple but powerful equation:
Mₜ₊₁ = Mₜ + Σ αⁱ · (Mₜ choose i), with the sum running over i = 1, …, Mₜ   (Equation 1)
This equation captures a simple idea: new possibilities don’t emerge randomly but through constrained combinations of what already exists. Mₜ represents the number of possibilities available at time t, and the summation describes how these possibilities can combine to create new ones. The αⁱ term is the key to the equation: it describes the interaction of constraints that limits which combinations are actually possible. As a result of these combinations and constraint interactions, we obtain the space of available possibilities Mₜ₊₁ at time t+1.
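A tiny numerical sketch (my own illustration, with an arbitrary constant α) makes the behavior of this recursion visible: for a long stretch almost nothing happens, and then the number of possibilities explodes—the same "sudden rather than gradual" signature discussed throughout this post. With a single constant α, the binomial theorem collapses the inner sum to (1 + α)^Mₜ − 1, which the sketch uses for convenience:

```python
# TAP-style recursion: M_{t+1} = M_t + sum over i of alpha^i * (M_t choose i).
# For a single constant alpha, the binomial theorem gives the sum as (1 + alpha)^M_t - 1;
# treating M_t as a continuous quantity here is a simplification.

def tap_trajectory(m0=6.0, alpha=0.08, steps=40, cap=5_000.0):
    m, traj = float(m0), [float(m0)]
    for _ in range(steps):
        if m > cap:                          # stop once the explosion is under way
            break
        m = m + (1.0 + alpha) ** m - 1.0     # closed form of the inner sum
        traj.append(m)
    return traj

for t, m in enumerate(tap_trajectory()):
    print(f"t = {t:2d}   M_t ~ {m:,.0f}")
```

Long plateaus followed by combinatorial explosions are exactly the dynamics Kauffman uses to argue that the adjacent possible expands in bursts.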
When we applied this framework to large language models, we discovered three fundamental types of constraints that shape how these models develop new capabilities:
- Architectural Constraints
The model’s architecture sets fundamental limits on what it can process. Memory capacity acts like a cell’s physical size, limiting how much information can be held at once. Attention mechanisms behave like cellular pathways, governing how different pieces of information can interact. Lastly, network connectivity creates the equivalent of cellular signaling networks, establishing possible routes for information flow.
- Training Constraints
Similar to how evolutionary history shapes what adaptations are possible for an organism, training molds the model’s landscape of possibilities. Training examples carve preferred pathways through the possibility space, while the distribution of the training data creates "valleys" that attract similar patterns. These learned patterns act like enzymes, making some transformations easier while inhibiting others.
- Contextual Constraints
Like a cell reacting (or better, adapting) to its current environment, context dynamically shapes what’s possible. Recent interactions modify the local "landscape" of token possibilities, as illustrated in figure 2. The current context provides scaffolding that enables some combinations while blocking others. These effects create temporary patterns that drive the model’s behavior.
The multiplicative nature of constraints
Constraints can interact in different ways, for example additively or multiplicatively. What we found in our research is that large language model constraints interact multiplicatively. This means that, like a chemical reaction requiring the right temperature, the right pH, and the right catalysts all at once, a model capability only emerges when all constraints align favorably. This multiplicative interaction explains why model capabilities often appear suddenly rather than gradually. It’s similar to how you can’t partially boil water—it needs to reach exactly 100°C (at sea level) for the phase transition to occur. All the necessary conditions must be met concurrently.
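A back-of-the-envelope example (mine, not from the paper) shows why multiplicative interaction produces sudden rather than gradual emergence: if a capability requires three constraint factors, each scored between 0 and 1, their product stays near zero until all three are simultaneously high, whereas an additive combination rises smoothly.

```python
# Three illustrative constraint "readiness" scores in [0, 1]:
# architectural, training, and contextual.
scenarios = {
    "one constraint ready":  (0.9, 0.1, 0.1),
    "two constraints ready": (0.9, 0.9, 0.1),
    "all constraints ready": (0.9, 0.9, 0.9),
}

for name, (arch, train, ctx) in scenarios.items():
    additive = (arch + train + ctx) / 3       # rises smoothly and gradually
    multiplicative = arch * train * ctx       # stays near zero until all align
    print(f"{name:22s}  additive = {additive:.2f}   multiplicative = {multiplicative:.2f}")
```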

To adapt the original TAP equation (Equation 1) to how language models explore their possibility space, we developed the following equation (the focus of the paper mentioned at the beginning of this post):
Aₜ₊₁ = Aₜ + Σ gₗ( Σ α(t,i) · (|Vₜ| choose i) ), with the outer sum running over the hierarchical levels l = 1, …, L and the inner sum over i = 1, …, |Vₜ|   (Equation 2)
Aₜ₊₁ represents the accessible state space at the next time step, and Aₜ is the current accessible state space. The first sum captures the hierarchical levels of language processing (from tokens to phrases to broader structures), where gₗ is a hierarchical function that maps these combinations into the model’s semantic space and L is the total number of hierarchical levels in the model’s processing⁷. The second sum runs over all possible combinations within the vocabulary (or lexicon) of size |Vₜ|. The parameter α(t,i) combines all constraints (architectural, training, and contextual) at time t, and the binomial coefficient (|Vₜ| choose i) counts the possible combinations of tokens. Equation 2 may point to approaches for better directing the development of AI systems in the future.
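Purely to illustrate how the pieces of Equation 2 fit together—with a tiny vocabulary, made-up constraint values, and toy choices for gₗ and α(t,i), none of which come from the paper—one could compute a single update step like this:

```python
from math import comb

def alpha(t, i, arch=0.5, train=0.4, ctx=0.3):
    """Toy combined constraint: multiplicative factors, decaying with combination size i."""
    return (arch * train * ctx) ** i

def g(level, x, damping=0.5):
    """Toy hierarchical map g_l: higher levels contribute progressively less."""
    return damping ** level * x

def next_accessible_space(A_t, vocab_size, levels, t=0):
    """One step of the adapted update: A_{t+1} = A_t + sum_l g_l( sum_i alpha(t,i) * C(|V_t|, i) )."""
    inner = sum(alpha(t, i) * comb(vocab_size, i) for i in range(1, vocab_size + 1))
    return A_t + sum(g(l, inner) for l in range(1, levels + 1))

# Toy numbers only: a 10-token vocabulary and 3 hierarchical levels.
print(next_accessible_space(A_t=100.0, vocab_size=10, levels=3))
```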
Practical implications
Our theoretical framework offers several potential insights for working with large language models, though we must be careful not to overstate their immediate practical applications.
What we can say with (statistical) confidence
- Context ordering matters: Our experiments show that the order in which information is presented to a model significantly affects its responses, reflecting fundamental constraints on how models access their capabilities. For instance, we observed that presenting mathematical concepts before problem-solving tasks consistently led to better performance than presenting the same information in reverse order.
- Critical points in training: The identification of phase transitions (symmetry breaking) suggests that certain capabilities might emerge suddenly rather than gradually during training. This could help explain why intermediate checkpoints often show surprisingly large jumps in performance. However, we cannot yet reliably predict when these transitions will occur.
- Resource allocation trade-offs: Understanding the multiplicative nature of constraints helps explain why simply increasing computational resources doesn’t always improve performance proportionally. All three types of constraints—architectural, training, and contextual—need to align for new capabilities to emerge.
Limitations and open questions
While our framework explains observed phenomena, its predictive power remains limited. We cannot yet reliably forecast when specific capabilities will emerge in new models or architectures. Though we understand that architectural constraints play a crucial role, translating this insight into concrete architecture improvements requires additional research. The link between model architecture and capability emergence remains largely correlational. Our framework suggests that training strategies might be optimized around critical points, but we don’t yet have reliable methods to identify these points in advance.
Future Research Directions
The most promising areas for applying this framework include:
- Diagnostic tools: developing better tools for identifying when models are approaching critical transitions in capability development.
- Resource optimization: understanding how to better balance computational resources across different architectural components based on constraint interactions.
- Training curriculum optimization: exploring how the order of training examples might be optimized to work with, rather than against, the natural phase transitions in model development.
Disclaimer
While our framework provides a new lens for understanding LLM behavior, it should be seen as a starting point rather than a complete theory. Like any model of complex systems, it necessarily simplifies many aspects of LLM behavior. Its primary value lies in suggesting new ways to think about and investigate these systems, rather than providing immediate solutions to practical challenges.
¹ Readers can find a copy of the book ‘Investigations’ on the Santa Fe Institute website: https://www.santafe.edu/research/results/working-papers/investigations
² Symmetries can be seen as a geometric language used by physics to describe fundamental observations of reality. In mathematics, symmetries are studied through group theory, a highly powerful tool that enables us to model complex observations in an accessible manner. This language can be useful in understanding complex phenomena such as the behavior of large language models. A comprehensive explanation of the importance of symmetries in real-world phenomena can be found in this book: https://link.springer.com/book/10.1007/978-3-030-03633-1
³ Kauffman, S. (2022). Is There a Fourth Law for Non-Ergodic Systems That Do Work to Construct Their Expanding Phase Space? Entropy, 24(10), 1383. https://doi.org/10.3390/E24101383
⁴ A deeper dive into ergodic theory can be found here: https://plato.stanford.edu/entries/ergodic-hierarchy/#Ergo
⁵ Markovian dynamics refer to systems where the probability distribution of future states depends only on the current state, not on the sequence of events that preceded it. Formally, for a system with states {Xₜ}, it satisfies P(Xₜ₊₁|Xₜ,Xₜ₋₁,…,X₁) = P(Xₜ₊₁|Xₜ). Language models exhibit non-Markovian behavior because their responses depend on their entire interaction history.
⁶ Symmetry breaking refers to a phenomenon where a system that appears symmetric under certain conditions transitions to a state with less symmetry. A classic example is a pencil balanced perfectly on its tip—while the initial setup is symmetric (it could fall in any direction), once it falls, this symmetry is "broken" as it chooses one specific direction. In particle physics, this concept explains how particles gain mass through the Higgs mechanism, where an initially symmetric state transitions to one with lower symmetry. In our analysis of language models, we observe similar symmetry breaking when models transition from exploring all possibilities equally (symmetric state) to developing preferred, structured approaches (broken symmetry state).
⁷ The hierarchical nature of non-ergodicity in language models shows interesting parallels with LeCun’s H-JEPA framework. While H-JEPA explicitly designs architectures for hierarchical prediction-based learning, our framework describes how hierarchies emerge naturally through constraint interactions. Both approaches postulate that lower levels constrain higher-level predictions, creating path-dependent learning. This similarity suggests that non-ergodic behavior might be fundamental to hierarchical learning in AI systems, rather than just an architectural artifact. The key difference lies in implementation: H-JEPA builds hierarchical structures into the architecture, while we observe them emerging from constraint interactions. This alignment between designed and emergent hierarchies suggests promising directions for architecture design, though the underlying mechanisms may differ substantially. LeCun’s work provides an interesting perspective on how we might intentionally structure these hierarchical relationships in future AI systems.
For any comments, please contact me at [email protected]