
The Meaning Behind Logistic Classification, from Physics

Why do we use the logistic and softmax functions? Thermal physics may have an answer.

Logistic regression is ubiquitous, but what’s the intuition behind why it works? (Image by Author)

Logistic regression is perhaps the most popular and well-known machine learning model. It solves the binary classification problem: predicting which of two categories a data point belongs to.

The key ingredient is the logistic function, which converts an input into a probability, written in the form:
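$$ p = \frac{1}{1 + e^{-z}} = \frac{e^{z}}{1 + e^{z}} $$

where z is a real-valued score (for a linear model, a weighted sum of the input features).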

But why does the exponential make an appearance? What is the intuition behind it, beyond just converting a real number into a probability?

It turns out, thermal physics has an answer. But before digging into insights from physics, let’s understand the mathematics first.

Mathematical Quandary

Before tackling the "why" behind the logistic function, let’s understand its properties first.

Time to get the math out of the way! (Photo by Michal Matlon on Unsplash)

It is helpful to understand the logistic function in a more generalized form, one applicable to multiple categories, so that the logistic function becomes a special two-category case.

This can be done by introducing multiple extra variables zᵢ, one for each category. We then arrive at the softmax function, where each class i is assigned a probability:
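$$ p_i = \frac{e^{z_i}}{\sum_j e^{z_j}} $$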

For linear models, the zᵢ’s are generally linear sums of features. The softmax has many uses beyond linear models:

  • as the final classification layer in a neural network
  • as the attention weights in a transformer
  • as the sampling layer for choosing an action in reinforcement learning

But why do we choose this functional form for parameterizing probabilities? The common explanation is that it is simply a conversion tool. Indeed, any set of probabilities, as long as none of them are zero, can be written in this form. Mathematically, we can always solve for zᵢ by simply taking logarithms:
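$$ z_i = \ln p_i + \text{constant} $$

(The constant reflects the fact that adding the same number to every zᵢ leaves the softmax probabilities unchanged.)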

But is this the only special property of the softmax, as a conversion tool? Not quite, because there are infinitely many choices for converting numbers into probabilities. In fact, we can consider functions of other forms:
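$$ p_i = \frac{f(z_i)}{\sum_j f(z_j)} $$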

To produce a sensible probability function for optimization, we require the following criteria on f:

  • Always positive, always increasing, ranges from 0 to ∞, and differentiable

So exp(z) is just one of infinitely many choices, as can be seen below:

Alternatives to exp(z) that can also provide reasonable conversions from numbers to probabilities (Image by Author)
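As a quick illustration, here is a minimal NumPy sketch showing how any such f converts scores into probabilities (the helper generalized_softmax and the particular alternatives to exp are my own illustrative choices, not from the original figure):

```python
import numpy as np

def generalized_softmax(z, f):
    """Convert raw scores z into probabilities using a positive, increasing f."""
    w = f(np.asarray(z, dtype=float))
    return w / w.sum()

scores = np.array([-1.0, 0.5, 2.0])

# The usual choice: f(z) = exp(z)
print(generalized_softmax(scores, np.exp))

# Alternatives that are also positive, increasing, and range from 0 to infinity
print(generalized_softmax(scores, lambda z: np.log1p(np.exp(z))))  # softplus
print(generalized_softmax(scores, lambda z: np.exp(z**3)))         # a steeper variant
```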

So is exp(z) just one convention out of a sea of possibilities? Why should exp(z) be preferable? While we could simply justify the choice by model performance, we should strive for understanding whenever possible. In this case, there is a theoretical motivation, and it comes from thermal physics.

A Thermal Physics Link

It turns out that these probabilities are ubiquitous in thermal physics, where they are called the Boltzmann distribution, which describes the probability that a particle (or a system more generally) occupies a particular energy state:
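$$ p_i = \frac{e^{-E_i/T}}{\sum_j e^{-E_j/T}} $$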

where Eᵢ denotes the energy of state i and T the temperature. When the temperature is non-zero, we can measure energy in multiples of the temperature, so that T can be conveniently set to 1.

The Boltzmann distribution is a bit confusing though: it models systems that are governed by precise and deterministic equations, so how do probability and statistics come into the picture (ignoring quantum physics)?

Thermal physics uses probability to manage our ignorance in the complex world (Photo by Rainer Gelhot on Unsplash)

The crux is complexity. While we have equations that describe the details of a system, they tend to be very complicated. Additionally, these equations often exhibit chaotic behavior, leading to high unpredictability (the butterfly effect). So practically, these detailed deterministic equations are not so useful.

How do we understand these complex systems then? Luckily in real life, we rarely need to know about the microscopic details of a system, as we cannot measure them anyway. It is often sufficient for us to consider macroscopic and emergent quantities (like temperature, energy, or entropy). This is where probability theory comes in – it is a bridge between the micro and the macro:

  • Microscopic details are modeled using probability distributions
  • Macroscopic quantities are modeled as various averages of these distributions

For an analogy, imagine we want to study all the digits of some irrational numbers, say √2, π, and e:

  • √2 = 1.41421356237309504880…
  • π = 3.14159265358979323846…
  • e = 2.71828182845904523536…

The task seems daunting, as each number arises from a different mathematical concept, and getting the precise digits would require specific numerical methods. However, if we simply look at the macroscopic behavior of the digits, we’ll easily find that each of the 10 digits appears roughly 10% of the time. The point is that these digits, like many dynamical systems, tend to explore all the possibilities without prejudice. In more technical terms,

Chaotic systems tend to explore all the possibilities, or in other words, maximize entropy
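As a quick sanity check of the digit-frequency claim above, here is a small sketch, assuming the mpmath library is available (it is not mentioned in the article), that tallies the first 1,000 decimal digits of π:

```python
from collections import Counter
from mpmath import mp

mp.dps = 1000            # work with ~1000 significant decimal digits of precision
digits = str(mp.pi)[2:]  # drop the leading "3." and keep the decimal digits

counts = Counter(digits)
for d in sorted(counts):
    print(d, round(counts[d] / len(digits), 3))  # each digit appears close to 10% of the time
```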

What does this have to do with our Boltzmann distribution? Well, mathematically the Boltzmann distribution is a maximum entropy distribution, under the constraint that the statistics it describes respect a key physical law: the conservation of energy. In other words, the Boltzmann distribution is the solution to:
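$$ \max_{\{p_i\}} \Big( -\sum_i p_i \ln p_i \Big) \quad \text{subject to} \quad \sum_i p_i = 1, \quad \sum_i p_i E_i = \text{constant} $$

Solving this with Lagrange multipliers gives pᵢ ∝ exp(−Eᵢ/T), where 1/T is the multiplier enforcing the energy constraint.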

So the specific form of exp(−E) comes from energy conservation.

Intuitively, we can think of energy as a sort of budget. Chaotic systems try to explore all possibilities (to maximize entropy). If a category has a high energy cost, the system will explore that category less, so that there are more opportunities to explore other categories. Yet the probability doesn’t drop to zero, because the system still wants to explore this energy-inefficient category. The exponential form is the result of the compromise between efficiency and exploration.

Going back to data science and classification, our data are not part of a dynamical system, and there is no conservation of energy, so how are these physics insights useful?

Energies for Categories

Energy is the basic unit of all interactions, but does it have a place in classification problems? (Photo by Fré Sonneveld on Unsplash)

The crux is that in data science, we are not necessarily interested in literally how the data came to be, which is an almost impossible thing to determine. Rather, we are trying to construct a system that can mimic relevant behaviors in our data. From this viewpoint, a model becomes a sort of dynamical system that is molded by data in some desirable ways (this was explored in my article).

In thermal physics, the Boltzmann distribution is effective at capturing rough microscopic details while faithfully reproducing macroscopic physical quantities (temperature, pressure, etc.). So it’s at least plausible that it could lend that superpower to data science.

Indeed, rather than looking for some sort of energy conservation laws in our data, we could just enforce a notion of energy conservation in our classification model. This way, analogous to the Boltzmann distribution in thermal physics:

The softmax function in a model assumes a maximum entropy guess for categories, under the assumption that there is a sort of conserved budget for making these guesses

The maximum entropy can be justified as a sort of maximum likelihood estimate (see my article on entropy). The remaining question is: why does it make sense to create an artificially fixed energy budget? Here are some reasons:

  1. Central Limit Theorem: the energies are often linear sums of features, so they have a well-defined mean. It is therefore not much of a leap to enforce the average of these energies over categories to be constant.
  2. Regularization: this keeps the probabilities away from the extremes, as a probability of exactly 1 or 0 would require some energies to be infinite.
  3. Variance reduction: by imposing a reasonable constraint, we introduce bias while reducing variance in our model (bias/variance trade-off)

Points 1 and 3 are particularly salient when using softmax in deep neural networks. Since network layers often have some sort of normalization already, it makes even more sense to enforce a fixed energy to ensure good statistical behavior downstream.

So, how do these insights help us understand our models? Well, we can use them to explain some anecdotal facts about models that utilize softmax (e.g. logistic regression):

  1. Issues with class imbalance: softmax assumes maximum entropy, so it intentionally pushes the model to explore all possible categories without prejudice.
  2. Issues with a large number of categories: softmax assigns probability to all possible categories, even ones that are impossible (e.g. labeling cats as vehicles). For data with cleanly separated categories, clustering, nearest-neighbors, support-vector-machine, and random-forest models could perform better.

Beyond just the model structure, our thermal physics analogy also helps us understand the training paradigm, which we turn to next.

Putting on the Training Pressure

The Boltzmann distribution allows us to compute things like pressure; what’s the analogy in classification? (Photo by NOAA on Unsplash)

The utility of the Boltzmann distribution goes beyond simply categorizing states of a system: it enables us to compute other useful emergent quantities – things like pressure. This pressure can be the physical pressure we experience, or more abstract thermodynamic forces like magnetic fields and chemical potentials.

As an example, pressure is given by how the (logarithm of the summed) Boltzmann factors changes with volume:
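$$ P = T \, \frac{\partial \ln Z}{\partial V}, \qquad Z = \sum_i e^{-E_i/T} $$

where Z, the partition function, is the sum of the Boltzmann factors over all states.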

More generally, thermodynamic pressures are defined as some sort of change in our statistical distributions with respect to some variable.

Jumping back to data science, what are some quantities that look like "pressure" in classification? Well, one related quantity is the Cross Entropy, which is the loss function that is typically minimized when training classification models. The cross entropy is often estimated via sampling:
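$$ \text{CrossEntropy} \approx -\frac{1}{N} \sum_{n=1}^{N} \ln p_{l}(x_n) $$

(with N the number of sampled data points)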

where l indicates the proper category a particular data point is in. To optimize this, we can perform gradient descent: take the derivative with respect to the model parameters and update until the derivative reaches zero.
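To make this concrete, here is a minimal NumPy sketch of that training loop for a linear softmax classifier (the helper names fit_softmax_classifier and softmax, and the toy data, are illustrative choices of mine, not from the article):

```python
import numpy as np

def softmax(Z):
    # subtract the row-wise max for numerical stability
    Z = Z - Z.max(axis=1, keepdims=True)
    expZ = np.exp(Z)
    return expZ / expZ.sum(axis=1, keepdims=True)

def fit_softmax_classifier(X, y, n_classes, lr=0.1, n_steps=1000):
    """Gradient descent on the cross-entropy loss of a linear softmax model."""
    n, d = X.shape
    W = np.zeros((d, n_classes))
    Y = np.eye(n_classes)[y]        # one-hot encode the true categories
    for _ in range(n_steps):
        P = softmax(X @ W)          # predicted probabilities
        grad = X.T @ (P - Y) / n    # the "pressure" we keep applying
        W -= lr * grad              # update until the gradient is ~0
    return W

# toy usage: two Gaussian blobs, two categories
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(1, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
W = fit_softmax_classifier(X, y, n_classes=2)
print((softmax(X @ W).argmax(axis=1) == y).mean())  # training accuracy
```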

Using the physics analogy, we could then view the derivative/gradient as a sort of thermodynamic pressure!

What this means for model training is that we are imposing pressure on our system (our model) until it equalizes. The model is trained when this internal "model pressure" reaches zero, a sort of thermal equilibrium.

While the analogy isn’t 100% exact, it gives us an intuition for how classification models work (more specifically, models that utilize softmax and are optimized through something like cross-entropy). So we conclude that:

Classification models are mock thermodynamic systems driven to equilibrium to mimic categories in our data

This viewpoint can perhaps explain why simple logistic regressions are effective, even when assumptions regarding linear models are often violated in real life.

Conclusion

Hopefully I’ve shown you some intriguing connections between simple logistic regressions and thermal physics.

Data science is a very broad discipline. Like thermal physics, we can think of data science as a high level way of understanding our macroscopic world, while properly handling microscopic details of which we may be ignorant. Perhaps it is not surprising that many conceptual and mathematical tools in data science can be linked or even traced back to physics, as they both share the common objective of modelling our world.

If you like this article, please do leave a comment. Also let me know if you would like to see other abstract concepts demystified and properly explained. Happy reading 👋 .

If you like my article, you may be interested in my other insight pieces:

What does Entropy Measure? An Intuitive Explanation

A Physicist’s View of Machine Learning: The Thermodynamics of Machine Learning

Entropy Is Not Disorder: A Physicist’s Perspective

Why We Don’t Live in a Simulation

