Thoughts and Theory
Authors: Patrick Huembeli (EPFL), Juan Miguel Arrazola (Xanadu), Nathan Killoran (Xanadu), Masoud Mohseni (Google Quantum AI), Peter Wittek
In Memory of Peter Wittek
When we think of Peter, his positive nature and his can-do attitude inevitably come to mind. He had an ability to inspire and carry the people around him, convincing them that almost everything is possible. Back in 2018 he came up with the idea for an interactive blog post on the subject of energy-based models and pitched it to us. It was almost impossible to resist his enthusiastic manner and of course we embarked on this adventure with him. Equipped with our knowledge in physics and machine learning and hardly any clue about JavaScript, we started our journey. After a lot of hard work, we were finally able to complete this project. We are overjoyed with the final result and we wish he could see it. We miss you, Peter. We miss you as a co-worker, as an inspiration, and above all as our friend. Since Medium does not support JavaScript and equations written in LaTeX, we recommend checking out the interactive version of this post as well.
Introduction
We present an overview of energy-based models, a historically important class of generative machine learning models, from a rather simple physical perspective. We assume no prior knowledge in machine learning. Rather than a purely mathematical formulation of the main concepts, we motivate energy-based models by connecting them to the physical behaviour of interacting particles. We explain familiar concepts from different perspectives and uncover key physical concepts underlying the theory and practice of energy-based models. This includes the design, implementation, and training of Hopfield networks [1] and Boltzmann machines [2], highlighting how physical principles can explain both the success and challenges associated with these models. We focus on these models because of their historical significance – they paved the way for future revolutions in the field – and because of their immediate connection to fundamental physical principles. Additionally, attention is placed on using physics to understand important concepts such as energy functions, Boltzmann distributions, and contrastive divergence [3]. Finally, we reflect on possible future trends in machine intelligence based on recent advances, shedding light on cutting-edge research at the interface of physics and machine learning.
Energy-based models
An energy-based model is a probabilistic model governed by an energy function that describes the probability of a certain state. Energy-based models emerged in the machine learning literature in the 1980s [1, 2]. They have since been further developed, extended, and improved over several decades of work [4]. Some energy-based models have recently been "back-ported" to physics, and used for example to model the wavefunctions of quantum systems [5, 6, 7, 8]. Furthermore, they are still being employed in other areas [9, 10, 11, 12], and they have become competitive with GANs for specific tasks [13]. Many types of energy-based models are generative, guaranteed to sample a valid probability distribution, and can often be stacked to create deep, hierarchical representations [14]. More recent developments include their use in reinforcement learning [15, 16, 17], as potential replacements for discriminators in GANs [18], and the appearance of quantum energy-based models [19, 20].
To understand the origin of energy-based models, imagine a scientist with the goal of designing a physical system that is capable of learning and memorizing patterns. An initial strategic choice is to use the random behaviour of the particles to model the probability distribution of some data that should be learned. The goal is to design probabilistic systems that harness randomness to generalize memorized patterns. Mathematically, a probabilistic system is characterized by a probability distribution that determines the likely states of the system. The challenge is to design systems that are sufficiently complex to give rise to rich behaviour, but also simple enough such that they can be efficiently trained and characterized.
For large systems, it is overwhelmingly difficult to keep track of all their rapidly-fluctuating internal degrees of freedom. Typically only coarse-grained information can be accessed, meaning that we can only keep track of macroscopic properties of the system and not of all degrees of freedom. An example of such coarse-grained information is the total energy [21], which can in principle be determined for any given state of the system.
More precisely, the total energy of a system is defined in terms of an energy function E(x), which assigns energy values to all its possible states. The state of an n-particle system can be described as x=(_x_₁, _x_₂,…, _x_ₙ), where _x_ᵢ denotes a relevant degree of freedom of the i-th particle.
The Boltzmann distribution
What probability distribution P(x) can be assigned to a state x with energy E(x)? If we do not have any information – and therefore no constraint – about a particular degree of freedom, we should remain maximally flexible in the choice of model while remaining consistent with the quantities that are constrained. Therefore, it is reasonable to choose the distribution with maximum entropy, because the maximum entropy model reduces the amount of potentially biased or unsubstantiated prior information built into a model. This strategy is known as Jaynes’ maximum entropy principle [22]. It states that in assigning a model on the basis of partial available information, the distribution with the largest possible entropy should be chosen. The resulting distribution P(x) is the solution to the optimization problem
maximize S(P) = −∑ₓ P(x) ln P(x)   subject to   ∑ₓ P(x) = 1,  ∑ₓ P(x)E(x) = ⟨E⟩,
giving [22]
P(x) = exp[−E(x)/T] / Z,
where T is a free parameter and Z=∑ₓ exp[−E(x)/T] is a normalization constant known as the partition function. This probability distribution first appeared in statistical physics and is now widely used in other fields. In its original physical interpretation, the Boltzmann distribution describes the probability of finding a system in a state x when the system is in thermal equilibrium with a heat bath at temperature T. The Boltzmann distribution establishes a concrete relationship between energy and probability: low-energy states are the most likely to be observed.
The energy function of a physical system can be expressed as a sum over contributions arising both from the internal energy of each particle and the interactions between them. In such cases, the energy function can be written as
E(x) = ∑ᵢ _θ_ᵢ _f_ᵢ(x),
for appropriate parameters _θ_ᵢ and functions _f_ᵢ(x). The resulting Boltzmann distribution at temperature T is uniquely determined by the parameters _θ_ᵢ or, equivalently, by the expectation values ⟨_f_ᵢ(x)⟩, which are the sufficient statistics of the distribution [24]. Knowledge of the expectation values ⟨_f_ᵢ(x)⟩ fixes the parameters _θ_ᵢ and therefore also the properties of the resulting Boltzmann distribution. The parameters _θ_ᵢ determine the relevance of these expectation values in minimizing the energy. When a particular _θ_ᵢ is small, the corresponding sufficient statistic, i.e., the state of the particle or the interaction term, will contribute little to the energy.
It is important to recognize the role of temperature: it determines the relative probability of observing higher-energy states, not just the lowest energy ones. In the limit of zero temperature, only those states corresponding to global minima of the energy function can be observed. In the limit of infinite temperature, all states are equally likely. Physically, temperature quantifies the average energy of the interactions between the system and environment. Such exchanges lead to sporadic "jumps" towards states of higher energy. The higher the temperature, the more common and widespread such jumps will be. The role of temperature in the Boltzmann distribution is illustrated in the figure below, where you can study the effect of varying temperature and energy-function parameters.

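The effect of temperature can also be seen in a minimal numerical sketch. The four states and their energies below are arbitrary choices for illustration only:

```python
import numpy as np

# Hypothetical system with four states and assigned energies E(x).
energies = np.array([0.0, 1.0, 2.0, 4.0])

def boltzmann(T):
    """P(x) = exp(-E(x)/T) / Z, the maximum-entropy distribution."""
    weights = np.exp(-energies / T)
    return weights / weights.sum()   # dividing by the partition function Z

# Low temperature concentrates probability on the ground state;
# high temperature approaches the uniform distribution.
for T in (0.1, 1.0, 100.0):
    print(f"T={T:>5}: {np.round(boltzmann(T), 3)}")
```

At T=0.1 essentially all probability sits on the lowest-energy state, while at T=100 every state has probability close to 1/4, mirroring the zero- and infinite-temperature limits described above.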
Architectures
The next step in designing an energy-based model is to select the energy function. We start with arguably the simplest interesting model: a collection of particles with two degrees of freedom whose energy function depends on the individual state of particles and the pairwise interactions between them. The states of n such particles are described in terms of the vector σ=(_σ_₁,_σ_₂,…,_σ_ₙ), where _σ_ᵢ∈{−1,1} is the state of the i-th particle. The resulting energy function is
E(σ) = −∑ᵢ _b_ᵢ_σ_ᵢ − ∑ᵢ<ⱼ _w_ᵢⱼ_σ_ᵢ_σ_ⱼ.
This energy function is known as the Ising model [25]. The parameters _b_ᵢ set the individual energies of the particles, whose contribution −_b_ᵢ_σ_ᵢ=∓_b_ᵢ depends on the state. The parameters _w_ᵢⱼ introduce energy contributions due to pairwise interactions: for _w_ᵢⱼ<0, equal states (_σ_ᵢ=_σ_ⱼ) have higher energy, while for _w_ᵢⱼ>0, opposite states (_σ_ᵢ=−_σ_ⱼ) lead to higher energies. This energy-based model is known as a Hopfield network [1]. The parameters _w_ᵢⱼ in the energy function can be viewed as weighted edges in a graph, so the model can be represented by a network – a neural network, when the particles themselves are viewed as neurons.
What tasks can be performed with this model? Consider the zero-temperature case: at equilibrium, only the lowest-energy states can be observed. If the system is set to a higher-energy state and allowed to equilibrate at zero temperature, it reverts to one of the ground states with lowest energy. If data is encoded into the ground states of the system, the model can therefore retrieve data instances from incomplete or corrupted inputs, which are non-equilibrium states. The model can serve as an auto-associative memory [1], capable of "remembering" patterns when similar ones are given as input.
The energy function of Hopfield networks considers only pairwise interactions. Extending the scope to more complicated models is possible, but comes at a significant price: they will typically be much more difficult to train and much harder to perform inference on them. Instead, new particles can be added to the network which are not used to represent data, but whose role is to increase the overall complexity of the model. They are referred to as hidden nodes (as in nodes in a network) and act as intermediaries between the remaining visible nodes, which encode data. Each collection of hidden or visible nodes is known as a layer.
Physically, the hidden nodes enable effective higher-order interactions between visible nodes, leading to a new effective energy function for the visible nodes [26, 27]. The resulting networks are called Boltzmann machines, in allusion to the Boltzmann distribution governing their behaviour. They are generalizations of Hopfield networks in the sense that these are contained as a special case: a Boltzmann machine simplifies to a Hopfield network when the interactions _w_ᵢⱼ between hidden and visible nodes are set to zero. Importantly, Boltzmann machines are not only more powerful than Hopfield networks, but in a sense as powerful as can be: they are universal approximators, in principle able to replicate any discrete probability distribution with arbitrary precision [28].
Simulating and training Boltzmann machines can be challenging. To facilitate progress, models can be studied where some connections are set to zero. In the most extreme case, all intralayer connections are removed, leaving present only connections between visible and hidden nodes. The resulting models are known as Restricted Boltzmann machines (RBMs). Conventionally, the state of an RBM is written in terms of visible (v) and hidden (h) nodes, σ=(v,h), with its energy function given by
E(v,h) = −∑ᵢ _a_ᵢ_v_ᵢ − ∑ⱼ _b_ⱼ_h_ⱼ − ∑ᵢ,ⱼ _w_ᵢⱼ_v_ᵢ_h_ⱼ,

where _a_ᵢ and _b_ⱼ are the biases of the visible and hidden nodes, respectively.
Compared to the fully-connected Boltzmann machine, the RBM is less expressive because it has fewer parameters. Nevertheless, the advantages gained in simulating and training these models surpass the loss in expressivity, especially since less complex models tend to generalize better [29]. The three fundamental energy-based models we study in this article, Hopfield networks, Boltzmann machines, and RBMs, are illustrated in the figure below where we introduce the different architectures and illustrate how hidden nodes can make the model more complex.
When the energy function doesn’t include bias terms, Hopfield networks and Boltzmann machines have a so-called flip symmetry: if we flip the values of all the nodes, the states before and after the flip have the same energy. Therefore patterns that differ by a flip of all nodes cannot be differentiated. Adding hidden nodes can break this symmetry by assigning non-symmetric hidden-node states to each visible pattern. The figure below illustrates this behaviour (check out the interactive version for more intuition).

Sampling
From a physical perspective, sampling from the Boltzmann distribution is conceptually straightforward: place the system in contact with an environment at the desired temperature and register the system’s state at different times. Building such systems and engineering their energy functions is, however, extremely challenging in practice. There are recent proposals to build probabilistic computers that employ stochastic magnetic tunneling junctions interacting with thermal noise to physically realize probabilistic bits (p-bits) [59], as well as FPGA-based simulations of them [60]. Today, such probabilistic machines are in practice simulated on standard computing infrastructure (Turing machines with a von Neumann architecture), which incurs additional energy and time overheads compared to the direct physical realizations that may become available in the near future [59].
Consider first the zero-temperature case. The key physical principle underlying the Boltzmann distribution is the connection between energy and probability. A strategy for simulating sampling from a Boltzmann distribution is to identify low-energy states and, depending on the temperature, occasionally select states with higher energy as well. One simple method is to locally change the state of each particle whenever doing so decreases the total energy of the system. For an Ising energy function, the change in energy (ΔE) introduced by changing the i-th particle’s state from _σ_ᵢ to −_σ_ᵢ is
Δ_E_ᵢ = 2_σ_ᵢ(_b_ᵢ + ∑ⱼ _w_ᵢⱼ_σ_ⱼ).
To search for equilibrium states, we iteratively apply the update rule
_σ_ᵢ → −_σ_ᵢ if Δ_E_ᵢ < 0,  and _σ_ᵢ → _σ_ᵢ otherwise,
that flips the value of the node only if this change decreases the energy. Starting from a random initialization and repeatedly updating the states of individual particles, the system converges to a local minimum [30]. This method is not guaranteed to find the true ground states, i.e., the global minima of the energy function. For finite temperature, the strategy is similar, except that it is now possible to occasionally jump to higher-energy states. In this case, the update rule is:
_σ_ᵢ → −_σ_ᵢ if Δ_E_ᵢ < 0;  otherwise _σ_ᵢ → −_σ_ᵢ with probability p and _σ_ᵢ → _σ_ᵢ with probability 1−p,
where p=exp(−(1/T) Δ_E_ᵢ). Physically, these jumps to higher-energy states mimic the random thermal fluctuations arising from exchanging energy with an environment at finite temperature. This sampling algorithm is a specific instance of the Metropolis-Hastings algorithm [31, 32], which is particularly convenient when the distribution is known up to a normalization constant, as is the case for the Boltzmann distribution. The figure below illustrates the algorithm in action.

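The finite-temperature update rule above can be sketched in a few lines of Python. The toy system below (a small fully-connected ferromagnet with all couplings equal to one) is a hypothetical choice for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

def metropolis_step(state, b, w, T):
    """Single-spin Metropolis update for the Ising energy
    E(s) = -sum_i b_i s_i - sum_{i<j} w_ij s_i s_j."""
    i = rng.integers(len(state))
    # Energy change for flipping spin i: dE = 2 s_i (b_i + sum_j w_ij s_j)
    dE = 2 * state[i] * (b[i] + w[i] @ state)
    # Always accept downhill moves; accept uphill moves with prob exp(-dE/T)
    if dE < 0 or rng.random() < np.exp(-dE / T):
        state[i] *= -1
    return state

n = 8
b = np.zeros(n)
w = np.ones((n, n)) - np.eye(n)    # w_ij = 1 for i != j, so aligned states win
state = rng.choice([-1, 1], size=n)

for _ in range(2000):
    metropolis_step(state, b, w, 0.5)

print("final state:", state)   # at low T, the spins end up nearly aligned
```

Because every coupling is positive, the two ground states are all-up and all-down; at T=0.5 the chain quickly settles into one of them, and thermal "jumps" away from it are rare.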
Gibbs Sampling
One possible improvement to this algorithm is to find models where the local update rule is even better at finding low-energy states. The issue with local updates is that the change in energy
Δ_E_ᵢ = 2_σ_ᵢ(_b_ᵢ + ∑ⱼ _w_ᵢⱼ_σ_ⱼ)
for node i depends on the states of all other nodes. Therefore, each time a state _σ_ᵢ is updated, the effect spreads to all other nodes, changing their local update rules. For an RBM, however, this is not the case. The energy change of a visible node,
Δ_E_ᵢ = 2_v_ᵢ(_a_ᵢ + ∑ⱼ _w_ᵢⱼ_h_ⱼ),
does not depend on any of the other visible nodes: when one of them is flipped, other visible nodes are unaffected. This makes the local update rule more effective because the entire collection of visible nodes can be treated as a single entity. A similar statement holds for updating hidden nodes when the visible ones are fixed. In fact, the lack of intralayer interactions in an RBM implies that the conditional probabilities P(v|h) and P(h|v) factorize:
P(v|h) = ∏ᵢ P(_v_ᵢ|h),   P(h|v) = ∏ⱼ P(_h_ⱼ|v).
Moreover, because of the independence between nodes in the same layer, the individual conditional probabilities can be calculated analytically [33], and are given by
P(_h_ⱼ=1|v) = 1 / (1 + exp[−2(_b_ⱼ + ∑ᵢ _w_ᵢⱼ_v_ᵢ)/T]),   P(_v_ᵢ=1|h) = 1 / (1 + exp[−2(_a_ᵢ + ∑ⱼ _w_ᵢⱼ_h_ⱼ)/T]).
The remaining probabilities P(hⱼ=−1|v) and P(_v_ᵢ=−1|h) follow from these equations. These properties of RBMs permit a new sampling strategy:
- Fix the hidden nodes. Sample visible nodes from the conditional distribution P(v|h)=∏ᵢ P(_v_ᵢ|h) by independently sampling the state of each node from its distribution P(_v_ᵢ|h).
- Fix the visible nodes according to the samples of the previous step. Sample hidden nodes from the conditional distribution P(h|v)=∏ⱼ P(_h_ⱼ|v) by independently sampling the state of each node from its distribution P(_h_ⱼ|v).
- Repeat the above steps N times, for suitably chosen N.
The resulting states (v,h) will be approximately sampled from the system’s Boltzmann distribution. This algorithm is known as Gibbs sampling, and it is the method typically used in practice to sample RBMs [3].
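The alternating scheme above can be sketched for ±1-valued units. The weights below are random, hypothetical values, and we assume the RBM energy convention E(v,h) = −a·v − b·h − v·W·h:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_layer(other, W, bias, T=1.0):
    """Sample one layer of a +-1-valued RBM given the other layer.
    P(unit = 1 | other) = 1 / (1 + exp(-2 (bias + W @ other) / T))."""
    p = 1.0 / (1.0 + np.exp(-2.0 * (bias + W @ other) / T))
    return np.where(rng.random(len(p)) < p, 1, -1)

def gibbs_sample(W, a, b, steps=100):
    """Alternate between the two factorized conditionals of the RBM."""
    n_v, n_h = W.shape
    v = rng.choice([-1, 1], size=n_v)
    for _ in range(steps):
        h = sample_layer(v, W.T, b)   # P(h|v) factorizes over hidden units
        v = sample_layer(h, W, a)     # P(v|h) factorizes over visible units
    return v, h

# Hypothetical small RBM: 4 visible and 2 hidden units, random weights.
n_v, n_h = 4, 2
W = rng.normal(scale=0.5, size=(n_v, n_h))
a, b = np.zeros(n_v), np.zeros(n_h)
v, h = gibbs_sample(W, a, b)
print("visible sample:", v, "hidden sample:", h)
```

Note that each call to `sample_layer` updates a whole layer at once, which is exactly the advantage over single-spin Metropolis updates discussed above.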
Training
We have learned that probabilistic models can be built from the same principles as physical systems at thermal equilibrium. We have also identified useful architectures and have developed algorithms to sample from their Boltzmann distributions. A final challenge remains: how can these systems be trained to perform specific tasks? In this context, training is equivalent to an inverse Ising problem: identifying the parameters of the energy function that give rise to a desired probability distribution.
For illustrative purposes, consider a simple physical system: a mass m attached to a spring. The spring is characterized by a constant k and the mass experiences a position-dependent force F(y)=−ky, where y is the displacement of the mass. If the mass is placed in a gravitational field, it experiences a constant external force _F_ᴳ=mg, where g is the acceleration due to gravity. The mass is only in equilibrium at the precise position where these two forces balance, i.e., when _F_ᴳ+F(y)=0.

For a model with a single-parameter energy-function E(x)=θf(x), the expectation value ⟨f(x)⟩ is a sufficient statistic of the Boltzmann distribution – knowing this expectation uniquely fixes the parameter θ and therefore also the distribution P(x)=1/Z exp[−θf(x)/T]. The goal is to train an energy-based model to reproduce the statistics of a dataset, specified as a set of states (x⁽¹⁾,x⁽²⁾,…,x⁽ᵐ⁾). Training the model is equivalent to identifying the parameter θ such that the sufficient statistics of the model coincide with those of the data, i.e., such that ⟨f(x)⟩ᵐᵒᵈᵉᴵ=⟨f(x)⟩ᵈᵃᵗᵃ. Since the expectation over the model distribution depends on θ, we can write ⟨f(x)⟩ᵐᵒᵈᵉᴵ=:−F(θ) for a suitable function F(θ). The expectation over data is constant, so we can write ⟨f(x)⟩ᵈᵃᵗᵃ=:_F_ᴰ. Interpreting these as generalized forces acting in opposite directions, and the parameter θ as a generalized position, the model is trained when the position θ is such that the forces are in equilibrium: _F_ᴰ+F(θ)=0.
When forces are unbalanced, for instance if the pull of gravity outweighs the restoring force of the spring, objects accelerate and change position. For an object starting at rest, the initial displacement due to an imbalance of the forces is
Δθ = η (_F_ᴰ + F(θ)),
where η>0 is a constant that depends on the object’s mass and the time for which the forces act. The first force _F_ᴰ can be interpreted as an external force due to an outside system, like the Earth’s gravitational field. For energy-based models, the external system is actually a set of training data that provides a constant "pull" towards the preferred states appearing in the dataset. The second force F(θ) represents the system’s natural preference for certain states. This provides an internal lift, working against the downward pull of the training data. Crucially, in the presence of two competing forces acting in different directions, the resulting displacement moves the object to a position that reduces the imbalance of forces. For example, in a spring-mass system, if _F_ᴳ>ky, the mass is pulled downwards to a new position y′>y that increases the force due to the spring, bringing both forces closer to balance. By repeatedly applying the update rule above for sufficiently small step sizes, a parameter value that balances the two fictitious forces can be found, leading to a trained model. We now develop this physical intuition into concrete training strategies for energy-based models.
Training Hopfield networks
Training a Hopfield network is equivalent to identifying parameters such that the ground states of the energy function are the states of the input data. Suppose we are given a single n-dimensional data vector σ⁽¹⁾. There is a straightforward way to ensure it is a ground state: set all _b_ᵢ=0 and fix _w_ᵢⱼ=_σ_ᵢ⁽¹⁾_σ_ⱼ⁽¹⁾. Each interaction term in the energy is then −_σ_ᵢ⁽¹⁾_σ_ᵢ_σ_ⱼ⁽¹⁾_σ_ⱼ, which attains its minimum value −1 for every pair precisely when the magnitude of the overlap ⟨σ⁽¹⁾,σ⟩=∑ᵢ _σ_ᵢ⁽¹⁾_σ_ᵢ is maximized, i.e., when σ=±σ⁽¹⁾. For more data points σ⁽¹⁾,σ⁽²⁾,…,σ⁽ᵐ⁾, we follow a similar rule, this time setting the interaction parameters to
_w_ᵢⱼ = (1/m) ∑ₖ _σ_ᵢ⁽ᵏ⁾_σ_ⱼ⁽ᵏ⁾ = ⟨_σ_ᵢ_σ_ⱼ⟩ᵈᵃᵗᵃ.
Note the appearance of the fictitious force we studied before. This strategy is known as Hebbian learning and works best when all data vectors are nearly mutually orthogonal, in which case all data points are local minima of the energy function [34, 35]. If not, so-called spurious minima can appear that do not correspond to data vectors, leading to erroneous memory retrieval [36].
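As a concrete sketch, the Hebbian rule combined with the zero-temperature dynamics from the sampling section can recover a stored pattern from a corrupted input. The two orthogonal patterns below are toy examples:

```python
import numpy as np

def hebbian_weights(patterns):
    """Hebbian rule: w_ij = mean over patterns of s_i s_j (with w_ii = 0),
    for the energy convention E(s) = -(1/2) s.W.s."""
    W = (patterns.T @ patterns) / len(patterns)
    np.fill_diagonal(W, 0)
    return W

def retrieve(W, state, sweeps=50):
    """Zero-temperature dynamics: flip a spin whenever that lowers the energy."""
    state = state.copy()
    for _ in range(sweeps):
        for i in range(len(state)):
            # Energy change for flipping spin i: dE = 2 s_i (sum_j w_ij s_j)
            if 2 * state[i] * (W[i] @ state) < 0:
                state[i] *= -1
    return state

# Two mutually orthogonal toy patterns (hypothetical data).
patterns = np.array([[1,  1, 1,  1, -1, -1, -1, -1],
                     [1, -1, 1, -1,  1, -1,  1, -1]])
W = hebbian_weights(patterns)

corrupted = patterns[0].copy()
corrupted[0] *= -1                     # flip one bit of the first pattern
print(retrieve(W, corrupted))          # the first pattern is restored
```

Because the patterns are orthogonal, both are exact local minima, and the single corrupted bit is flipped back on the first sweep.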
To address the issue of spurious minima, the concept of "unlearning" was introduced to improve the performance of Hopfield networks [37]. For unlearning, parameters are set according to
_w_ᵢⱼ = ⟨_σ_ᵢ_σ_ⱼ⟩ᵈᵃᵗᵃ − ϵ⟨_σ_ᵢ_σ_ⱼ⟩ᵐᵒᵈᵉˡ,
where ϵ>0 is a small parameter. The model equilibrium states
⟨_σ_ᵢ_σ_ⱼ⟩ᵐᵒᵈᵉˡ
are obtained by choosing a random initial state and using the sampling algorithms described earlier to find equilibrium states. During sampling, the weights of the model are set according to the Hebbian rule. The addition of this expectation value over model states leads to an increase in energy of all minima, including spurious ones. By appropriately selecting the value of ϵ, this increase in energy can lead to the disappearance of spurious minima, which are consequently "unlearned".
Training Boltzmann machines

Hebbian learning and unlearning techniques are problematic when applied to Boltzmann machines: since only the visible nodes encode data, it is not clear how to assign values to the hidden nodes. Instead, the Hebbian learning rule can be promoted to an update rule that iteratively improves the weights with each step. By doing so, we allow the hidden nodes to "move" together with the visible ones, leading to training of the complete model. Starting from an initial value _w_ᵢⱼ, a weight is updated to
_w_ᵢⱼ → _w_ᵢⱼ + η(⟨_σ_ᵢ_σ_ⱼ⟩ᵈᵃᵗᵃ − ϵ⟨_σ_ᵢ_σ_ⱼ⟩ᵐᵒᵈᵉˡ),
where η>0 is a small learning rate. Whenever _σ_ᵢ_σ_ⱼ includes hidden nodes, averages over the data are taken by fixing the visible units to the data and then using Gibbs sampling to obtain values for the hidden nodes. In the specific case of an RBM, when setting ϵ=1, the update rule takes the form
_w_ᵢⱼ → _w_ᵢⱼ + η(⟨_v_ᵢ_h_ⱼ⟩ᵈᵃᵗᵃ − ⟨_v_ᵢ_h_ⱼ⟩ᵐᵒᵈᵉˡ).
This rule is known as the contrastive divergence formula for training RBMs. The Hebbian learning term ⟨_v_ᵢ_h_ⱼ⟩ᵈᵃᵗᵃ, usually referred to as the positive phase, decreases the energy of the states of the training data. The unlearning term ⟨_v_ᵢ_h_ⱼ⟩ᵐᵒᵈᵉᴵ (the negative phase) increases the energy of all the states that are near equilibrium. The figure below illustrates how the competing Hebbian learning term and the unlearning term can help to learn a pattern while avoiding local spurious minima.
![Learning and Unlearning (The interactive version of this figure can be found here): A randomly initialized system has an energy landscape (here represented by a simplified curve) with many different local minima. In physics this is often referred to as the spin-glass phase [34]. In energy-based models, learning certain states (red dots) means that their energy needs to be decreased. This can be achieved through "learning steps" that locally update the weights of the model so as to decrease the energy of data states, using what we referred to as the Hebbian learning term (positive phase). You can try this by clicking on the "Without unlearning" button. This strategy suffers from the drawback that other points can remain local minima, depending on the initialization of the system. Instead, by alternating between learning and unlearning steps - which you can do by clicking on the "With unlearning" button - the occurrence of spurious minima can be reduced via the unlearning term (negative phase). Hovering over the orange and red dots shows which patterns they represent. On the left we show what happens with the energies of certain states from a more global perspective.](https://towardsdatascience.com/wp-content/uploads/2021/06/1C9Vc4608Vsebunkqecpquw.png)
The role of contrastive divergence is to gradually shape the energy function of the model until all low-energy states of the model correspond to data points. Viewing the weights as generalized positions, we interpret the term Δ_w_ᵢⱼ:=⟨_v_ᵢ_h_ⱼ⟩ᵈᵃᵗᵃ−⟨_v_ᵢ_h_ⱼ⟩ᵐᵒᵈᵉᴵ as a net force originating from two competing forces: a fixed external one due to data, and an internal force due to specific generalized positions of the system. When the forces are not balanced, this causes a shift in position in a direction that brings the forces closer to balance. Each shift in weight propagates across the entire model affecting all forces, so the strategy of the contrastive divergence training algorithm is to set a small learning rate η until all forces are balanced and the model is trained.
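A minimal single-pattern sketch of this training loop, using one Gibbs step to approximate the model expectation (commonly called CD-1), can look as follows. The ±1 units, absence of biases, and layer sizes are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_hidden(v, W, T=1.0):
    """P(h_j = 1 | v) for +-1 units with energy E(v,h) = -v.W.h."""
    p = 1.0 / (1.0 + np.exp(-2.0 * (W.T @ v) / T))
    return np.where(rng.random(len(p)) < p, 1, -1)

def sample_visible(h, W, T=1.0):
    p = 1.0 / (1.0 + np.exp(-2.0 * (W @ h) / T))
    return np.where(rng.random(len(p)) < p, 1, -1)

def cd1_update(W, v_data, eta=0.05):
    """One contrastive-divergence step:
    dW = eta * (<v h>_data - <v h>_model), with the model expectation
    approximated by a single Gibbs step (CD-1)."""
    h_data = sample_hidden(v_data, W)      # positive (learning) phase
    v_model = sample_visible(h_data, W)    # one Gibbs step away from the data
    h_model = sample_hidden(v_model, W)    # negative (unlearning) phase
    return W + eta * (np.outer(v_data, h_data) - np.outer(v_model, h_model))

# Hypothetical task: memorize a single 6-bit pattern with 3 hidden units.
pattern = np.array([1, -1, 1, 1, -1, -1])
W = rng.normal(scale=0.1, size=(6, 3))
for _ in range(500):
    W = cd1_update(W, pattern)

# After training, a Gibbs chain started at the pattern should typically
# reconstruct it (up to occasional thermal flips).
h = sample_hidden(pattern, W)
v = sample_visible(h, W)
print("bits matching the pattern:", int((v == pattern).sum()), "of 6")
```

Note the two phases map directly onto the discussion above: the positive phase lowers the energy of the data state, while the negative phase raises the energy of whatever the model currently prefers.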
The figure below collects the concepts we have covered thus far, showing how a trained RBM can recover data instances from corrupted inputs. Once the model is trained, the minima of the energy function correspond to the encoded data. A damaged input, which has a large energy, can be repaired by allowing the system to equilibrate back to low-energy states.

Outlook: Future Research in Physics and Machine Learning
Energy-based models are experiencing a resurgence of interest: they are stable to train, require few parameters, show high sample quality and generalization ability, and are amenable to compositionality [13]. These are characteristics that other generative models, like variational autoencoders or generative adversarial networks, struggle with. Progress has also been made in improving the scaling of training [13, 38]. With these developments, there is an indication that this family of models continues to be relevant in machine learning. However, energy-based models also need improvements on several frontiers. Learning and inference over discrete data are still extremely challenging due to the combinatorial explosion of the state space, often referred to as the curse of dimensionality. In particular, training and probabilistic inference in energy-based models are still inefficient. This is due to the difficulty of calculating or sampling from the partition function, a task that is #P-hard from a computational complexity perspective and that originates in the worst-case NP-hardness of discrete optimization problems. There is a significant opportunity for developing physics-based models that could address these difficulties, in particular from the perspective of non-equilibrium statistical physics or the quantum models that we discuss below. There are exciting emergent areas of research at the interfaces of machine learning, quantum computing, many-body physics, and optimization. This convergence provides a unique opportunity to make contributions at the mutual interfaces between these major disciplines.
In quantum mechanics, the possible states of a system are extended to also include linear combinations, known as superpositions, of different states. For instance, states of n particles were previously denoted by vectors x=(_x_₁, _x_₂,…, _x_ₙ). In a quantum setting, more general states ψ are possible, having the form
|ψ⟩ = ∑ₓ c(x) |x⟩,
where the complex coefficients c(x) satisfy ∑ₓ |c(x)|²=1. This property gives rise to concepts such as interference and entanglement. The second change is that energy functions, usually referred to as Hamiltonians in this context, are replaced by operators. They generally take the form
Ĥ = ∑ᵢ _θ_ᵢ f̂ᵢ,
where hats are used to denote that these are operators. This change has profound implications: operators generally do not commute, i.e., it may hold that
f̂ᵢ f̂ⱼ ≠ f̂ⱼ f̂ᵢ.
This property underlies concepts such as the uncertainty principle and the no-cloning theorem [39].
As a concrete example, we can describe a quantum Boltzmann machine, with a Hamiltonian
Ĥ = −∑ᵢ _b_ᵢ σ̂ᵢ⁽ᶻ⁾ − ∑ᵢ<ⱼ _w_ᵢⱼ σ̂ᵢ⁽ᶻ⁾σ̂ⱼ⁽ᶻ⁾.
The individual operators _σ_ᵢ⁽ᶻ⁾, when expressed as matrices, take the form
σ̂⁽ᶻ⁾ = [[1, 0], [0, −1]].
The superscript is used to denote a specific basis in which the operator is diagonal. This model can be made more interesting by using non-commuting operators
Ĥ = −∑ᵢ _b_ᵢ σ̂ᵢ⁽ᶻ⁾ − ∑ᵢ _c_ᵢ σ̂ᵢ⁽ˣ⁾ − ∑ᵢ<ⱼ _w_ᵢⱼ σ̂ᵢ⁽ᶻ⁾σ̂ⱼ⁽ᶻ⁾,
where
σ̂⁽ˣ⁾ = [[0, 1], [1, 0]].
Notice that _σ_ᵢ⁽ˣ⁾ does not commute with _σ_ᵢ⁽ᶻ⁾ acting on the same particle. This is known as the transverse-field Ising model. A quantum Boltzmann machine is a system described by any such Hamiltonian involving at most pairwise interactions between particles. Theoretically understanding these models, deriving methods for efficiently training them, and implementing them in practice are all tasks currently being pursued [19, 40, 41].
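The non-commutation of the two Pauli operators on the same particle is easy to verify numerically:

```python
import numpy as np

# Pauli matrices appearing in the transverse-field Ising Hamiltonian.
sigma_z = np.array([[1, 0], [0, -1]])
sigma_x = np.array([[0, 1], [1, 0]])

# On the same particle the two operators do not commute; in fact they
# anticommute: sigma_x sigma_z = -sigma_z sigma_x.
print(sigma_x @ sigma_z)   # [[0, -1], [1, 0]]
print(sigma_z @ sigma_x)   # [[0, 1], [-1, 0]]
```

Operators acting on *different* particles, by contrast, always commute, which is why the non-trivial quantum behaviour comes from mixing σ⁽ᶻ⁾ and σ⁽ˣ⁾ on the same site.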
Beyond quantum models, we can also consider spin glasses – a generalization of the Ising Hamiltonian – which provide a computationally universal language for representing discrete optimization problems [42, 54]. The energy functions of such systems contain terms of the form
−_w_ᵢⱼₖ _σ_ᵢ_σ_ⱼ_σ_ₖ,  −_w_ᵢⱼₖₗ _σ_ᵢ_σ_ⱼ_σ_ₖ_σ_ₗ,  …
Remarkably, the physical complexity of the low-energy states of such generalized spin-glass systems, and the significant difficulty of finding all such states, is fundamentally related to the computational complexity of many important problems in computer science. For example, the task is mathematically equivalent to counting high-quality solutions to NP-hard combinatorial optimization problems within a given approximation ratio [43], or to estimating the partition function for probabilistic inference in graphical models [44]. Consequently, exploring the structure and dynamics of generalized spin-glass complexes and their phase transitions could lead to a deeper understanding of emergent many-body phenomena in non-equilibrium complex dynamical systems, which have been among the major open problems for decades across diverse disciplines including condensed matter physics, physical chemistry, biology, theoretical neuroscience, economics, and the social sciences.
A general class of probabilistic physics-inspired approaches to sample the solution space of such problems is based on Markov Chain Monte Carlo techniques, for example the Metropolis-Hastings algorithm, simulated annealing [45], and parallel tempering [46]. More advanced methods combine probabilistic cluster-update strategies over a backbone algorithm from the Markov Chain Monte Carlo family [47, 48, 49, 50, 51, 52]. However, these approaches either break down for frustrated spin-glass systems [48], percolate for dimensions larger than two [49], or assume random tree-like subgraphs [50, 51, 52] that are not necessarily related to the actual low-energy excitation "droplets" of the underlying spin-glass problem.
Unfortunately, the discrete nature of spin-glass systems, together with their inherent disorder and frustration, makes it difficult to apply off-the-shelf gradient-based machine learning algorithms to such complex systems. Recently, several attempts have been made to apply advanced techniques from generative modelling, such as normalizing flows combined with MCMC-based tools like parallel tempering, together with certain relaxation techniques, to characterize spin-glass complexes [54].
It has been demonstrated that certain physical and computational properties of the spin-glass phase can be successfully learned, including multi-modal steady-state distributions and topological structures among metastable states. Remarkably, the learning itself can correspond to a phase transition of a spin glass within the layers of the trained normalizing flows [54]. Overall, there is an opportunity to employ such combinations of machine learning and statistical mechanics techniques to efficiently sample high-quality solutions in an instance-wise fashion for certain subsets of hard problems [61].
Progress in physics and machine learning will benefit from an interplay between both fields. There are emerging interfaces of machine learning with quantum computation, quantum control, and many-body systems. Machine learning tools can help understand physical data [55, 56], design new experiments [57], and solve old problems in new ways [58]. Similarly, insights from physics can inspire new models and lead to new hardware. Physics and machine learning, in combination, continue to expand perspectives, increase understanding, and push the frontiers of technology.
The interactive version of this post can be found here. Since Medium does not support Javascript and equations written in LaTeX, we recommend checking out the interactive post as well.
References
- Neural networks and physical systems with emergent collective computational abilities Hopfield, J.J., 1982. Proceedings of the National Academy of Sciences, Vol 79(8), pp. 2554–2558. National Academy of Sciences.
- A learning algorithm for Boltzmann machines Ackley, D.H., Hinton, G.E. and Sejnowski, T.J., 1985. Cognitive Science, Vol 9(1), pp. 147–169. Elsevier.
- On contrastive divergence learning Carreira-Perpinan, M.A. and Hinton, G.E., 2005. Aistats, Vol 10, pp. 33–40.
- A tutorial on energy-based learning LeCun, Y., Chopra, S. and Hadsell, R., 2006.
- Solving the quantum many-body problem with artificial neural networks Carleo, G. and Troyer, M., 2017. Science, Vol 355(6325), pp. 602–606. American Association for the Advancement of Science.
- Efficient representation of quantum many-body states with deep neural networks Gao, X. and Duan, L., 2017. Nature Communications, Vol 8(1), pp. 662. Nature Publishing Group.
- Neural-network quantum state tomography Torlai, G., Mazzola, G., Carrasquilla, J., Troyer, M., Melko, R. and Carleo, G., 2018. Nature Physics, Vol 14(5), pp. 447. Nature Publishing Group.
- Restricted Boltzmann machines in quantum physics Melko, R.G., Carleo, G., Carrasquilla, J. and Cirac, J.I., 2019. Nature Physics, Vol 15(9), pp. 887–892. Nature Publishing Group.
- Restricted Boltzmann machines for collaborative filtering Salakhutdinov, R., Mnih, A. and Hinton, G., 2007. Proceedings of the 24th International Conference on Machine Learning, pp. 791–798.
- Deep structured energy based models for anomaly detection Zhai, S., Cheng, Y., Lu, W. and Zhang, Z., 2016. arXiv:1605.07717.
- Phone recognition with the mean-covariance restricted Boltzmann machine Dahl, G., Ranzato, M., Mohamed, A. and Hinton, G.E., 2010. Advances in Neural Information Processing Systems, pp. 469–477.
- On autoencoders and score matching for energy based models Swersky, K., Buchman, D., Freitas, N.D., Marlin, B.M. and others, 2011. Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 1201–1208.
- Implicit generation and generalization in energy-based models Du, Y. and Mordatch, I., 2019. arXiv:1903.08689.
- Deep belief networks are compact universal approximators Le Roux, N. and Bengio, Y., 2010. Neural Computation, Vol 22(8), pp. 2192–2207. MIT Press.
- A connection between generative adversarial networks, inverse reinforcement learning, and energy-based models Finn, C., Christiano, P., Abbeel, P. and Levine, S., 2016. arXiv preprint arXiv:1611.03852.
- Reinforcement learning with deep energy-based policies Haarnoja, T., Tang, H., Abbeel, P. and Levine, S., 2017. Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 1352–1361.
- Model based planning with energy based models Du, Y., Lin, T. and Mordatch, I., 2019. arXiv:1909.06878.
- Energy-based generative adversarial network Zhao, J., Mathieu, M. and LeCun, Y., 2016. arXiv:1609.03126.
- Quantum Boltzmann machine Amin, M.H., Andriyash, E., Rolfe, J., Kulchytskyy, B. and Melko, R., 2018. Physical Review X, Vol 8(2), pp. 021050. APS.
- A Quantum Approximate Optimization Algorithm for continuous problems Verdon, G., Arrazola, J.M., Bradler, K. and Killoran, N., 2019. arXiv preprint arXiv:1902.00409.
- Classical Mechanics Goldstein, H., 2002. Pearson Education.
- Information theory and statistical mechanics Jaynes, E.T., 1957. Physical Review, Vol 106(4), pp. 620. APS.
- Learning discriminative sufficient statistics score space for classification Li, X., Wang, B., Liu, Y. and Lee, T.S., 2013. Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 49–64.
- Inverse Ising inference using all the data Aurell, E. and Ekeberg, M., 2012. Physical Review Letters, Vol 108(9), pp. 090201. APS.
- An introduction to the Ising model Cipra, B.A., 1987. The American Mathematical Monthly, Vol 94(10), pp. 937–959. Taylor and Francis.
- Nonperturbative k-body to two-body commuting conversion Hamiltonians and embedding problem instances into Ising spins Biamonte, J., 2008. Physical Review A, Vol 77(5), pp. 052331. APS.
- Resource efficient gadgets for compiling adiabatic quantum optimization problems Babbush, R., O’Gorman, B. and Aspuru-Guzik, A., 2013. Annalen der Physik, Vol 525(10–11), pp. 877–888. Wiley Online Library.
- Representational power of restricted Boltzmann machines and deep belief networks Le Roux, N. and Bengio, Y., 2008. Neural Computation, Vol 20(6), pp. 1631–1649. MIT Press.
- Exploring generalization in deep learning Neyshabur, B., Bhojanapalli, S., McAllester, D. and Srebro, N., 2017. Advances in Neural Information Processing Systems, pp. 5947–5956.
- Neural Networks: A Systematic Introduction Rojas, R., 1996. Springer-Verlag.
- Monte Carlo sampling methods using Markov chains and their applications Hastings, W.K., 1970. Biometrika, Vol 57(1), pp. 97–109. Oxford University Press.
- The Metropolis-Hastings Algorithm Robert, C.P. and Casella, G., 1999. Monte Carlo Statistical Methods, pp. 231–283. Springer.
- An introduction to restricted Boltzmann machines Fischer, A. and Igel, C., 2012. Iberoamerican Congress on Pattern Recognition, pp. 14–36.
- Spin-glass models of neural networks Amit, D.J., Gutfreund, H. and Sompolinsky, H., 1985. Physical Review A, Vol 32(2), pp. 1007. APS.
- Spin-glass models of a neural network [link] van Hemmen, J.L., 1986. Physical Review A, Vol 34(4), pp. 3435–3445. DOI: 10.1103/PhysRevA.34.3435
- Neural Networks: A Systematic Introduction Rojas, R., 2013. Springer Science & Business Media.
- "Unlearning" has a stabilizing effect in collective memories [link] Hopfield, J.J., Feinstein, D.I. and Palmer, R.G., 1983. Nature, Vol 304(5922), pp. 158. DOI: 10.1038/304158a0
- On the Anatomy of MCMC-based Maximum Likelihood Learning of Energy-Based Models Nijkamp, E., Hill, M., Han, T., Zhu, S. and Wu, Y.N., 2019.
- Quantum Computation and Quantum Information Nielsen, M.A. and Chuang, I., 2002. AAPT.
- Quantum variational autoencoder [link] Khoshaman, A., Vinci, W., Denis, B., Andriyash, E., Sadeghi, H. and Amin, M.H., 2018. Quantum Science and Technology, Vol 4(1), pp. 014001. IOP Publishing. DOI: 10.1088/2058-9565/aada1f
- Tomography and generative training with quantum Boltzmann machines Kieferova, M. and Wiebe, N., 2017. Physical Review A, Vol 96(6), pp. 062327. APS.
- Spin Glasses and Complexity [link] Stein, D.L. and Newman, C.M., 2013. Princeton University Press.
- Information, Physics, and Computation Mezard, M. and Montanari, A., 2009. Oxford University Press, Inc.
- The Nature of Computation [link] Moore, C. and Mertens, S., 2011. Oxford University Press, Inc.
- Optimization by Simulated Annealing [link] Kirkpatrick, S., Gelatt, C.D. and Vecchi, M.P., 1983. Science, Vol 220(4598), pp. 671–680. American Association for the Advancement of Science. DOI: 10.1126/science.220.4598.671
- Parallel tempering: Theory, applications, and new perspectives [link] Earl, D.J. and Deem, M.W., 2005. Phys. Chem. Chem. Phys., Vol 7(23), pp. 3910–3916. The Royal Society of Chemistry. DOI: 10.1039/B509983H
- Nonuniversal critical dynamics in Monte Carlo simulations [link] Swendsen, R.H. and Wang, J., 1987. Phys. Rev. Lett., Vol 58(2), pp. 86–88. American Physical Society. DOI: 10.1103/PhysRevLett.58.86
- Collective Monte Carlo Updating for Spin Systems [link] Wolff, U., 1989. Phys. Rev. Lett., Vol 62(4), pp. 361–364. American Physical Society. DOI: 10.1103/PhysRevLett.62.361
- A cluster Monte Carlo algorithm for 2-dimensional spin glasses [link] Houdayer, J., 2001. Eur. Phys. J. B, Vol 22(4), pp. 479–484. DOI: 10.1007/PL00011151
- From Fields to Trees [link] Hamze, F. and de Freitas, N., 2004. Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, pp. 243–250. AUAI Press.
- Efficient subgraph-based sampling of Ising-type models with frustration Selby, A., 2014.
- Solving spin glasses with optimized trees of clustered spins [link] Hen, I., 2017. Phys. Rev. E, Vol 96(2), pp. 022105. American Physical Society. DOI: 10.1103/PhysRevE.96.022105
- A Probability Density Theory for Spin-Glass Systems Hartnett, G.S. and Mohseni, M., 2020.
- Self-Supervised Learning of Generative Spin-Glasses with Normalizing Flows Hartnett, G.S. and Mohseni, M., 2020.
- Unsupervised learning of phase transitions: From principal component analysis to variational autoencoders Wetzel, S.J., 2017. Physical Review E, Vol 96(2), pp. 022140. APS.
- Automated discovery of characteristic features of phase transitions in many-body localization Huembeli, P., Dauphin, A., Wittek, P. and Gogolin, C., 2019. Physical Review B, Vol 99(10), pp. 104106. APS.
- Active learning machine learns to create new quantum experiments Melnikov, A.A., Nautrup, H.P., Krenn, M., Dunjko, V., Tiersch, M., Zeilinger, A. and Briegel, H.J., 2018. Proceedings of the National Academy of Sciences, Vol 115(6), pp. 1221–1226. National Academy of Sciences.
- Discovering physical concepts with neural networks Iten, R., Metger, T., Wilming, H., Del Rio, L. and Renner, R., 2018. arXiv preprint arXiv:1807.10300.
- Integer factorization using stochastic magnetic tunnel junctions Borders, W.A., Pervaiz, A.Z., Fukami, S., Camsari, K.Y., Ohno, H. and Datta, S., 2019. Nature, Vol 573(7774), pp. 390–393. Nature Publishing Group.
- Autonomous probabilistic coprocessing with petaflips per second Sutton, B., Faria, R., Ghantasala, L.A., Jaiswal, R., Camsari, K.Y. and Datta, S., 2020. IEEE Access, Vol 8, pp. 157238–157252. IEEE.
- M. Mohseni, et al., 2021, in preparation.