Complex systems are ubiquitous in nature, and physicists have found great success using thermodynamics to study these systems. Machine learning can be very complex too, so can we use thermodynamics to understand it?

As a theoretical physicist turned data scientist, people often ask me how relevant my academic training has been. While it is true that my ability to calculate particle interactions and understand the structures of our Universe has no direct relevance to my daily work, the physics intuitions that I learned are of immeasurable value.
Probably the area of physics most relatable to data science is statistical physics. Below, I’ll share some thoughts on how I connect the dots and draw inspiration from physics to help me understand an important part of data science – machine learning (ML).
While the thoughts below are definitely not fully mathematically rigorous, I do believe some of them are profoundly important in helping us understand the why and how of ML.
Models as Dynamical Systems
One of the key problems of data science is to predict/describe some quantities using other quantities. For instance, one might want to predict the price of a house based on its features, or understand how the number of patrons visiting a restaurant is affected by its menu.
To achieve this, data scientists build mathematical objects called models that turn raw inputs into useful outputs.
To make these models work, we train them. Many readers have probably heard the conventional explanations of how models work; here I’ll take an unconventional path – using physics analogies.
The boring way of thinking about a model mathematically is that it is a parameterized function of some sort. Let’s scratch that and think physically. A model is like a machine that turns raw material into useful outputs. So it is a system made of many smaller, simpler moving parts – like a box filled with different particles.

Where do the smaller/simpler parts come from? Well, ML models are typically constructed from layers of simple mathematical operations: multiplication, addition, or basic logical operations (e.g. the splits in decision trees or ReLU units in neural networks). This is just like many large physical systems in the real world: a crystal made up of atoms, or a pool of water made up of water molecules. In other words,
ML models can be thought of as dynamical systems constructed from smaller constituents with simple interactions
In this language, the goal of training is to intelligently assemble these constituents into something coherent – like a beautiful snowflake of some sort.
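To make the "simple constituents" idea concrete, here is a minimal sketch in Python/NumPy of a tiny model built from nothing but multiplications, additions, and an elementwise threshold (ReLU). The layer sizes and weights are arbitrary choices, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny two-layer "model": nothing but matrix multiplications, additions,
# and an elementwise threshold (ReLU). Sizes and weights are arbitrary.
W1, b1 = rng.normal(size=(8, 3)), np.zeros(8)
W2, b2 = rng.normal(size=(1, 8)), np.zeros(1)

def model(x):
    hidden = np.maximum(0.0, W1 @ x + b1)  # simple constituents: multiply, add, threshold
    return W2 @ hidden + b2                # another multiply-and-add layer

x = np.array([1.0, 2.0, 3.0])  # a raw input, e.g. crude features of a house
print(model(x))                # the model's response
```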

Data as Thermal Baths
What does it mean, then, to feed an input to a model and get some output? Well, in the language of interactions, feeding an input to a model is equivalent to poking/perturbing the system with some external influence, while the output is the system’s response.
What would the analogy for data be? In the real world we cannot control which data points "choose" to be collected; they are usually assumed to be randomly sampled. The clear physical analogy is a thermal bath – a source of randomness and uncertainty.

This is what makes the thermodynamic analogy powerful: thermal physics extracts insights from the noisiness of thermal fluctuations; ML models extract insights from noisy data.
Note that in ML there is usually a distinction between training/fitting a model and using a trained model to evaluate a particular dataset. This distinction is entirely enforced by us and doesn’t really have a physical analog. In any case, we are forgoing mathematical rigor, so let’s just see how far the analogy can take us.
Training as Dynamical Processes
To get further, let’s really dig deep into the dynamical part of our model.
Let’s look at the model training process. How is that done? Usually we feed the model some training data and push the model to minimize some objective loss function. The loss function measures how close the model’s predictions are to the actual data (possibly with some extra regularization terms).
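As a deliberately simple illustration (certainly not the only way training is done), here is a hedged sketch of gradient descent on a plain squared-error loss; the synthetic data and hyperparameters are all made up:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic "training data", randomly sampled (our thermal bath of examples).
X = rng.normal(size=(100, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=100)  # noisy targets

w = np.zeros(3)   # the model's parameters, initialized arbitrarily
lr = 0.1          # learning rate (a made-up value)

for step in range(200):
    pred = X @ w
    loss = np.mean((pred - y) ** 2)        # the loss, i.e. the "energy" being minimized
    grad = 2 * X.T @ (pred - y) / len(y)   # gradient of the loss w.r.t. w
    w -= lr * grad                         # move downhill in the loss landscape

print(w, loss)
```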
What’s the analog in physics? Let’s see, what is the quantity that tends to get minimized? It’s energy! Here I am not referring to any specific type of energy, but rather thinking of energy as a concept. For instance, electrons on a metal sphere will tend to spread themselves apart, which lowers the overall energy. Forces can be thought of as arising from the tendency to minimize these energies.

Still, systems don’t just blindly minimize these "energies". The sun doesn’t just spontaneously explode and cool down, and we don’t just instantaneously freeze and lose all our body heat.
The evolution of a system boils down to the fundamental interactions between its constituents. Energy guides the system but doesn’t completely dictate what it does. In physics, "energy" simply affords us one specific view of a system.
So let’s think of the loss function in ML as an energy of some sort. While data scientists always find novel ways to minimize the loss function, the loss function isn’t everything. ML models don’t just blindly minimize loss functions. Instead, we should think of ML model training as the evolution of a complex dynamical system:
Training a model is similar to a dynamical system organizing itself under some interactions
Adding in external training data, this dynamical system is now under the influence of a thermal bath! Together, we get a very powerful analogy:
Training a model is akin to a dynamical system organizing itself, under the influence of a thermal bath of data

In fact, this gives us a better understanding of regularization terms in machine learning: they are extra energies, or rather extra interactions, that help us engineer a more desirable dynamical evolution of our models.
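In code, this reading is quite literal: a regularizer is just an extra term added to the loss "energy". A minimal sketch, reusing the NumPy setup above and assuming a simple L2 penalty with an arbitrary weight lam:

```python
def regularized_loss(w, X, y, lam=0.01):
    # Data-fitting "energy" plus an extra interaction (an L2 penalty)
    # that pulls the parameters toward zero.
    mse = np.mean((X @ w - y) ** 2)
    penalty = lam * np.sum(w ** 2)
    return mse + penalty
```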
So finally we arrive at the physicists’ view of an ML model:
- It’s a dynamical system made up of many small interacting components
- As a model trains, it reorganizes itself under the influence of an external thermal bath (i.e. the random data source)
Now, let’s apply some physical laws to ML!
The Second Law: Entropy Always Goes Up

Perhaps one of the most famous "laws" of physics, the Second Law of Thermodynamics states that there exists a quantity called entropy, which can only increase as time goes on (see my other article on entropy for a detailed exploration).
In simple terms, entropy captures how generic a system’s configuration is, and the second law states that the most generic configuration will most likely prevail (an almost tautological statement).
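A quick way to see why "the most generic configuration prevails" is to count the microstates behind each macrostate. In this toy sketch, for 100 coin flips the 50/50 outcome dominates simply because it can be realized in vastly more ways:

```python
import math

# For 100 coin flips, count the microstates (orderings) behind each
# macrostate (number of heads). The 50/50 macrostate prevails because
# it can be realized in by far the most ways.
N = 100
for heads in (0, 10, 25, 50):
    print(f"{heads:3d} heads: {math.comb(N, heads):.3e} arrangements")
```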
For us, the important point is that:
Thermodynamic systems DO NOT just minimize energy. Instead, they maximize entropy. In other words, systems tend to settle into their most likely configurations.
When applying this to ML models, the statement becomes
ML models DO NOT necessarily minimize the loss function; instead, they simply settle into the most likely configuration
This may sound counter-intuitive as we are always told to minimize the loss function, and the loss function is often touted as a key performance metric (especially in Kaggle competitions).
However, remember that the training data (and testing data) are always just a subset of the full data. So the goal really is to roughly minimize the loss function while also minimizing the risk of overfitting to the training data. The entropy-maximization part is therefore a desirable compromise: we want the model to generalize, not overfit.
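Here is a small, hedged sketch of that tension: a high-degree polynomial drives the training loss nearly to zero, yet typically does worse on held-out data than a simple straight line. The data is synthetic and the exact numbers depend on the random draw:

```python
import numpy as np

rng = np.random.default_rng(2)

# Overfitting in miniature: fit noisy linear data with a line and with a
# degree-15 polynomial, then compare training and held-out errors.
x_train = rng.uniform(-1, 1, 20)
y_train = 2 * x_train + rng.normal(scale=0.3, size=20)
x_test = rng.uniform(-1, 1, 200)
y_test = 2 * x_test + rng.normal(scale=0.3, size=200)

for degree in (1, 15):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```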
(* There is a caveat to the applicability of this law: one requirement is ergodicity, the notion that a dynamical system can efficiently sample almost all possible configurations; many ML models probably don’t come close to any sort of ergodicity. But hey, we’re not trying to be 100% mathematically rigorous, so let’s not be too pedantic.)
Thermal Equilibrium?

We can bring another powerful concept in thermal physics to help us understand models: the notion of thermal equilibrium.
So what happens when a system reaches maximum entropy? The final outcome is thermal equilibrium. In physics terms, this means that:
- While the microscopic configuration of a system can change continuously, the macroscopic behavior of the system stops changing.
- The "historical" behaviors of the system is completely lost, as the system essentially forgets what happened in the past.
The ML analogies are quite profound. Models that reach thermal equilibrium have the following properties:
- Models initialized with different random seeds and trained on different random samples of data will end up with the same performance (macroscopic behavior), even though their parameters may differ (microscopic configurations) – see the sketch below
- The overall model performance is not sensitive to the training trajectory (the system forgets its history)
Both of these are highly desirable outcomes for ML models. So perhaps the goal of training ML models is to drive a system toward some sort of thermal equilibrium!
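The first of these properties is easy to probe empirically. Here is a hedged sketch using scikit-learn; the dataset, architecture, and hyperparameters are all arbitrary choices, and whether the held-out scores actually land close together depends on the problem and the training budget:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

# Same architecture and data, different random seeds: compare macroscopic
# behavior (held-out score) with microscopic configuration (the weights).
X, y = make_regression(n_samples=2000, n_features=10, noise=5.0, random_state=0)
y = (y - y.mean()) / y.std()  # standardize targets so the network trains sanely
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for seed in (1, 2, 3):
    net = MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000, random_state=seed)
    net.fit(X_train, y_train)
    score = net.score(X_test, y_test)   # macroscopic: R^2 on held-out data
    weights = net.coefs_[0][0, :3]      # microscopic: a few first-layer weights
    print(f"seed {seed}: R^2 = {score:.3f}, sample weights = {weights}")
```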
Again, this analogy isn’t 100% rigorous, and one can easily find counter-examples like simple linear models. Still, it’s undeniable that there are some intriguing parallels and insights we can draw from physics.
Conclusion
To summarize, using concepts like energy, thermal bath, and entropy from physics, we are able to view ML models as complex dynamical systems built from simple interactions.
Just like in nature, an ML model evolves under complex interactions guided by the loss function (akin to energy), which allows it to reorganize itself as we feed it data (the thermal bath). Eventually, the system comes to a thermal equilibrium, resulting in an intriguing structure similar to a snowflake or crystal.
Why is this viewpoint useful? Because it gives us some hints on why ML works or doesn’t work:
- ML models don’t just minimize a single loss function. Instead, they evolve dynamically, and we need to consider this dynamical evolution when thinking about ML.
- We cannot fully understand ML models using just a handful of performance metrics. These metrics capture the macroscopic behaviors but miss out on the microscopic details. We should think of metrics as tiny windows into a complex dynamical system, with each metric highlighting just one aspect of our models.
As a physicist, I’d argue that these viewpoints give us more intuition about the often-touted "black magic" of ML.
In summary, we can describe a data scientist’s job as that of a systems engineer: creating the right kind of microscopic interactions and environment for the model to mold itself into a desirable macroscopic structure – building that perfect snowflake. To me, this is a far more poetic job description than sitting in front of a computer, writing lines and lines of code, and waiting for the training job to finish. Anyway, that’s how I like to think about data science.
Well, that’s all the insight for now – I hope you enjoyed it!
Please do leave comments and feedback below; that’d encourage me to write more insight pieces!