RESEARCH LOG

Non Equilibrium Statistical Physics and Learning Machines II

March 4, 2009

In the last post we described a stochastic learning procedure which approximately follows the gradient of a function called progressive contrastive divergence.

We asked some questions about the thermodynamic properties of such a mechanical process.

It should be clear that, because the Markov chain which draws samples from the model is not necessarily allowed to relax to equilibrium, this process falls under the domain of non-equilibrium thermodynamics.

To formalise this discussion, imagine that we record several instances of the learning procedure described in the previous post.  Thus we have a sequence of model parameters \theta_{ij}(t) and state variables x_{i}(t).  We can calculate the mechanical work done on the parameters and the heat dissipated by accounting for the change in the model’s energy as the learning process is carried out:

W = \sum_t \left[ E(x(t+1), \theta(t+1)) - E(x(t+1), \theta(t)) \right]

Q = \sum_t \left[ E(x(t+1), \theta(t)) - E(x(t), \theta(t)) \right]

The equations above just state that the heat dissipated Q is the cumulative change in energy due to the change of the model variables x_i during the MCMC sampling process while holding the model parameters constant. Similarly, the mechanical work W is the cumulative change in energy due to the change of the model parameters \theta while holding the x_i constant. Each iteration of the gradient descent algorithm first updates the state of the model variables with the Markov chain and then updates the model parameters, so every iteration does some work and dissipates some heat.
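As a rough illustration of the bookkeeping above, here is a minimal Python sketch that accumulates W and Q from a recorded trajectory. The pairwise energy function and the toy two-variable trajectory are assumptions made up for this example; they are not the model from the previous post.

```python
def energy(x, theta):
    """Hypothetical pairwise energy E(x, theta) = -1/2 * sum_ij theta_ij x_i x_j."""
    n = len(x)
    return -0.5 * sum(theta[i][j] * x[i] * x[j]
                      for i in range(n) for j in range(n))


def work_and_heat(xs, thetas, energy_fn=energy):
    """Accumulate W and Q for recorded states xs[t] and parameters thetas[t]."""
    work = 0.0
    heat = 0.0
    for t in range(len(xs) - 1):
        # Heat: x moves from x(t) to x(t+1) while theta(t) is held fixed.
        heat += energy_fn(xs[t + 1], thetas[t]) - energy_fn(xs[t], thetas[t])
        # Work: theta moves from theta(t) to theta(t+1) while x(t+1) is held fixed.
        work += energy_fn(xs[t + 1], thetas[t + 1]) - energy_fn(xs[t + 1], thetas[t])
    return work, heat


# Toy recorded trajectory (made-up values, for illustration only).
xs = [[1, -1], [1, 1], [-1, 1]]
thetas = [
    [[0.0, 1.0], [1.0, 0.0]],
    [[0.0, 0.5], [0.5, 0.0]],
    [[0.0, -1.0], [-1.0, 0.0]],
]
W, Q = work_and_heat(xs, thetas)
total_change = energy(xs[-1], thetas[-1]) - energy(xs[0], thetas[0])
```

Because the two sums telescope, W + Q equals the total change in energy between the first and last recorded step, which is a handy sanity check on any implementation.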

Recall that the second law of thermodynamics states that:

\langle W \rangle \geq \Delta F

Where F = E - TS is the free energy (energy minus temperature times entropy).

The average work required for a mechanical process is greater than or equal to the change in free energy of the system. Equality is achieved only when the process is carried out quasi-statically, so that the system is at thermal equilibrium at every step. Precisely, the rate r at which the controlled variables are changed must be much smaller than the inverse of the system's relaxation time \tau, that is r \ll 1/\tau. For such a quasi-static process no work is dissipated and the total entropy does not change. For any non-equilibrium process performed faster than this, some of the input work is dissipated and there is a net increase in entropy. The more conventional formulation of the second law states merely that the change in entropy is non-negative.

Note also that the free energy change is closely related to the intractable partition function Z(\theta). Since F(\theta) = -\beta^{-1} \log Z(\theta) and \Delta F = F(\theta_f) - F(\theta_0):

-\beta \Delta F = \log \frac{Z(\theta_f)}{Z(\theta_0)}

Thus if we could devise a method for estimating the free energy change between \theta_0 and \theta_f, we could estimate the log ratio of partition functions (a critical task for model comparison).

If all that could be said about non-equilibrium processes was that the work was lower bounded by the free energy change then there would not be much to discuss here – but there are wonderful results which relate the \Delta F between two equilibrium states to the work done by a non-equilibrium process connecting the two.

The first applies to near-equilibrium processes, where the work W can be expected to have a Gaussian distribution with variance \sigma^2:

\langle W \rangle = \Delta F + \beta \sigma^2 / 2

This is called the “Fluctuation-Dissipation estimator”. It comes from the fluctuation-dissipation theorem of linear-response theory. It states that the average work dissipated in a near-equilibrium process, \langle W \rangle - \Delta F, is set by the size of the fluctuations of the work, namely its variance \sigma^2. In units where \beta = 1 the prefactor reduces to 1/2.
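To make the estimator concrete, here is a minimal sketch that recovers \Delta F from a set of work samples as the sample mean minus \beta/2 times the sample variance. The function name and the synthetic Gaussian work samples are assumptions for illustration only; in practice the samples would come from repeated runs of the process.

```python
import random
import statistics


def fd_estimate(work_samples, beta=1.0):
    """Fluctuation-dissipation estimate: <W> - beta * var(W) / 2."""
    mean_w = statistics.fmean(work_samples)
    var_w = statistics.variance(work_samples)  # unbiased sample variance
    return mean_w - 0.5 * beta * var_w


# Synthetic check: draw Gaussian work values whose mean is chosen so that
# the estimator should return true_delta_f (all values are made up).
rng = random.Random(0)
true_delta_f, sigma, beta = 2.0, 1.0, 1.0
mu = true_delta_f + 0.5 * beta * sigma ** 2
samples = [rng.gauss(mu, sigma) for _ in range(50000)]
estimate = fd_estimate(samples, beta)
```

The estimate is only trustworthy when the work distribution really is close to Gaussian, which is the near-equilibrium assumption stated above.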

This deep result connects the diffusion constant of a particle undergoing Brownian motion to the viscosity of the fluid it is embedded in, and the Johnson noise of an ohmic conductor to its resistance.

For processes where the system is perturbed far from equilibrium there is no reason to expect the fluctuations to be Gaussian. In this case the Jarzynski equality gives an exact relation between an exponential average of the work and the free energy change:

\langle \exp (-\beta W) \rangle = \exp (-\beta \Delta F)

This remarkable identity was proven in 1997 – there are now several proofs which cover different cases, and it has been proposed as a method for calculating the free energy change in experiments where a single RNA molecule is pulled apart by optical tweezers.
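The identity can be checked numerically on a system small enough that \Delta F is known in closed form. The sketch below is a toy under stated assumptions, not the learning procedure from the previous post: the model is n independent ±1 spins in a field h with E(x, h) = -h \sum_i x_i, so Z(h) = (2 \cosh \beta h)^n; the field is ramped from 0 to 1 in ten steps with one Metropolis sweep per step, and all function names are hypothetical.

```python
import math
import random


def energy(x, h):
    """Toy energy of independent +/-1 spins in a field h."""
    return -h * sum(x)


def metropolis_sweep(x, h, beta, rng):
    """One in-place single-spin-flip Metropolis sweep at fixed field h."""
    for i in range(len(x)):
        d_e = 2.0 * h * x[i]  # energy change if spin i is flipped
        if d_e <= 0 or rng.random() < math.exp(-beta * d_e):
            x[i] = -x[i]


def sample_work(n_spins, schedule, beta, rng):
    """Work done along one trajectory; schedule[0] must be 0 so that the
    uniform initial draw is an exact equilibrium sample."""
    x = [rng.choice((-1, 1)) for _ in range(n_spins)]
    work = 0.0
    for h_old, h_new in zip(schedule[:-1], schedule[1:]):
        metropolis_sweep(x, h_old, beta, rng)            # relax x at fixed h
        work += energy(x, h_new) - energy(x, h_old)      # change h at fixed x
    return work


def jarzynski_estimate(works, beta):
    """Delta F = -(1/beta) * log <exp(-beta W)>, using a log-sum-exp shift."""
    a = [-beta * w for w in works]
    m = max(a)
    log_mean = m + math.log(sum(math.exp(v - m) for v in a) / len(a))
    return -log_mean / beta


rng = random.Random(0)
n_spins, beta = 4, 1.0
schedule = [k / 10 for k in range(11)]
works = [sample_work(n_spins, schedule, beta, rng) for _ in range(10000)]
delta_f_exact = -(n_spins / beta) * math.log(math.cosh(beta * schedule[-1]))
delta_f_je = jarzynski_estimate(works, beta)
mean_work = sum(works) / len(works)
```

With these settings the Jarzynski estimate should land close to the exact \Delta F, while the plain average of the work sits above it, as the second law requires. The exponential average is dominated by rare low-work trajectories, so the estimate becomes noisy when the process is driven hard.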

So what’s the point? The utility of the Jarzynski equality to learning machines was recognized immediately – it provides a way to estimate the partition function of Markov random fields. Neal has called it Annealed Importance Sampling (AIS).

We will discuss this method and its relationship to the JE and other non-equilibrium work theorems in the next installment.

Also at some point I will write up a list of citations which relate the ideas described here to their primary sources – I’m being a bit lazy.
