In the last post we discussed the Jarzynski Equality (JE), which relates the exponentially weighted average of the work done by a non-equilibrium process to the free energy change between the initial and final states of that process.
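For reference, the equality is usually written in the form below. The symbols are the standard ones and are assumed here rather than taken from the original post: W is the work done on the system along one realization of the process, the angle brackets average over realizations that start in equilibrium, β is the inverse temperature, and ΔF is the free energy difference between the final and initial states.

```latex
\left\langle e^{-\beta W} \right\rangle = e^{-\beta \Delta F}
```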
We’re going to take a little diversion and explain why this formula is true, then discuss some of the statistical issues that arise when trying to use it to estimate free energy differences.
In the context of machine learning, this equation relates the ratio of the normalizing constants of two probabilistic models to the cumulative work done on the model’s parameters by a learning process that connects these two models. It’s been a while, so let’s revisit our definition of “mechanical work” in the context of training an exponential family model, e.g. a Markov random field:
Our model consists of variables and parameters; the parameters describe the coupling between the variables. The probability of a given configuration of the variables is a normalized exponential (Boltzmann) distribution, and the form which appears in the argument of the exponential is the energy of the model.
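As a concrete illustration, here is a minimal sketch of such a model, small enough to enumerate exactly. The specifics are assumptions made for the example and are not taken from the post: the variables are ±1 spins, the parameters are a symmetric coupling matrix `J`, the energy is the quadratic form `-0.5 * x @ J @ x`, and the temperature is set to one.

```python
import itertools
import numpy as np

def energy(x, J):
    """Energy of configuration x under symmetric couplings J (assumed quadratic form)."""
    return -0.5 * x @ J @ x

def boltzmann(J):
    """Enumerate all +/-1 configurations; return (states, probabilities, Z)."""
    n = J.shape[0]
    states = np.array(list(itertools.product([-1, 1], repeat=n)))
    weights = np.exp([-energy(x, J) for x in states])  # unnormalized e^{-E}
    Z = weights.sum()                                  # normalizing constant
    return states, weights / Z, Z

# Hypothetical couplings for a three-variable model.
J = np.array([[0.0, 0.8, -0.3],
              [0.8, 0.0, 0.5],
              [-0.3, 0.5, 0.0]])
states, probs, Z = boltzmann(J)
```

The normalizing constant `Z` is the quantity whose ratio between two models the Jarzynski Equality estimates. Exact enumeration like this is only feasible for a handful of variables, which is why the gradient has to be approximated by MCMC in practice.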
The learning task is to find the values of the parameters which maximize the log-likelihood of the model. The gradient of the log-likelihood is intractable to evaluate, but we can construct a stochastic approximation to it by MCMC. This approximate gradient can be thought of as a stochastic force which pushes the model parameters in the direction of higher likelihood. We can thus describe the path that the parameters take towards their maximum likelihood value as a stochastic process generated by a Langevin equation, in which the stochastic force is the MCMC estimate of the log-likelihood gradient.
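One common way to write this down is sketched below. The notation is assumed, not recovered from the original post: x denotes the variables, θ the parameters, E(x; θ) the energy with p(x | θ) proportional to exp(-E), the x_n are N training examples, and the x̃_m are M MCMC samples drawn from the model at the current parameters. The learning rate is absorbed into the unit of time.

```latex
\frac{d\theta}{dt} = \hat{F}(\theta, t),
\qquad
\hat{F}(\theta, t) =
  -\frac{1}{N}\sum_{n=1}^{N} \frac{\partial E(x_n;\theta)}{\partial\theta}
  + \frac{1}{M}\sum_{m=1}^{M} \frac{\partial E(\tilde{x}_m;\theta)}{\partial\theta}
```

The first term is the exact data average. The second term replaces the intractable model average with a finite MCMC sample, and the sampling noise in that term is what makes the force stochastic.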
Proof of JE using Crooks Fluctuation Theorem
The simplest proof of the JE (in my opinion) was derived by Crooks (J. Stat. Phys. 90, 1481 (1998)). It is based on the assumption that the system has Markovian dynamics, that is, that the probability of each transition depends only on the present state, and that the system’s dynamics satisfy the detailed balance condition. The detailed balance condition (also called microscopic reversibility) means that when the Markov chain has relaxed to equilibrium, the probability of observing every transition is equal to the probability of observing the reverse transition.
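A better way to understand this is to consider the master equation for the Markov chain, which expresses the rate of change of the probability of each state as the total probability flow into that state minus the total flow out of it. Setting the left-hand side of the master equation to zero gives the stationarity condition; detailed balance is the stronger requirement that the inflow and outflow cancel for every pair of states separately.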
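In symbols, a standard form of the master equation and of the detailed balance condition is sketched below. The notation is assumed for this sketch: p_i(t) is the probability of state i at time t, T_{i→j} is the transition rate from i to j, and π is the equilibrium distribution. For a discrete-time chain the derivative on the left becomes the difference p_i(t+1) - p_i(t).

```latex
\frac{dp_i(t)}{dt} = \sum_{j} \left[ T_{j \to i}\, p_j(t) - T_{i \to j}\, p_i(t) \right],
\qquad
T_{j \to i}\, \pi_j = T_{i \to j}\, \pi_i \quad \text{for every pair } (i, j)
```

Detailed balance makes each bracket in the sum vanish on its own at equilibrium, which is sufficient, but not necessary, for the left-hand side to be zero.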
Now consider a learning protocol with the following steps:
- The model variables are updated using one step of the Markov chain, holding the parameters of the model fixed. This process does no mechanical work and dissipates a quantity of heat set by the change in the energy of the model at fixed parameters.
- The parameters of the model are updated using the Langevin equation, holding the variables fixed. This process involves mechanical work equal to the change in the energy of the model at fixed variables.
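The bookkeeping in these two steps can be sketched on a toy system. Everything specific here is an assumption made for illustration: a single ±1 variable with energy `-theta * x`, a Metropolis update for the variable, unit temperature, and a fixed linear ramp of `theta` standing in for the Langevin parameter update. Heat is counted as energy absorbed by the system, so dissipated heat is its negative.

```python
import numpy as np

def energy(x, theta):
    """Toy energy of a single +/-1 variable (assumed form)."""
    return -theta * x

def run_protocol(thetas, rng):
    """Alternate variable and parameter updates; return (work, heat, energy change)."""
    x = rng.choice([-1, 1])          # theta starts at 0, so uniform is equilibrium
    theta = thetas[0]
    e_start = energy(x, theta)
    work, heat = 0.0, 0.0
    for theta_next in thetas[1:]:
        # Step 1: Metropolis update of the variable at fixed theta (heat only).
        x_prop = -x
        d_e = energy(x_prop, theta) - energy(x, theta)
        if rng.random() < np.exp(-d_e):
            heat += d_e
            x = x_prop
        # Step 2: parameter update at fixed variable (work only).
        work += energy(x, theta_next) - energy(x, theta)
        theta = theta_next
    return work, heat, energy(x, theta) - e_start

rng = np.random.default_rng(0)
thetas = np.linspace(0.0, 1.0, 21)   # deterministic ramp, a stand-in for Langevin
work, heat, delta_e = run_protocol(thetas, rng)
```

Along every trajectory the two contributions add up to the total energy change, `work + heat == delta_e` up to rounding, which is the first law for this protocol.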
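Using the Markov property, the probability of the entire sequence of states factorizes into a product of single-step transition probabilities.
Now consider the probability of the time-reversed trajectory, in which the same states are visited in the opposite order.
From detailed balance we know that the ratio of each forward transition probability to the probability of its reverse is the exponential of minus the energy change in that step (in units of the temperature).
So the ratio of the probabilities of the forward and time-reversed trajectories is a product of such exponentials, one for each step.
Thus the ratio of the probabilities of the forward and time-reversed trajectories of the process is related to the heat dissipated during the process. This is called the Crooks fluctuation theorem.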
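Written out, the argument looks roughly as follows. The notation is assumed for this sketch rather than recovered from the original post: x_t and θ_t are the variables and parameters at step t, P(x → x' | θ) is the transition probability of the Markov chain at fixed θ, P_F and P_R are the probabilities of the forward and reversed sequences conditioned on their respective starting states, and Q is the heat absorbed by the system, so the dissipated heat is -Q.

```latex
\begin{aligned}
P_F &= \prod_{t=0}^{T-1} P(x_t \to x_{t+1} \mid \theta_t), \\
P_R &= \prod_{t=0}^{T-1} P(x_{t+1} \to x_t \mid \theta_t), \\
\frac{P(x_t \to x_{t+1} \mid \theta_t)}{P(x_{t+1} \to x_t \mid \theta_t)}
  &= e^{-\beta\left[E(x_{t+1};\theta_t) - E(x_t;\theta_t)\right]}, \\
\frac{P_F}{P_R} &= e^{-\beta Q},
\qquad
Q = \sum_{t=0}^{T-1}\left[E(x_{t+1};\theta_t) - E(x_t;\theta_t)\right]
\end{aligned}
```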
The Jarzynski Equality quickly follows. By the first law, the heat Q and the work W together account for the total change in the energy of the model. We must compute the average of the exponentially weighted work, where the average is taken over all possible paths, with the initial state drawn from equilibrium.
This is a functional or path integral, and the Crooks theorem makes this one especially simple to evaluate: all that is required is to replace the probability of the forward path with the probability of the time-reversed path.
Now the weight factor in the average becomes the exponential of the reversible work. This comes from the fact that the total work W can be decomposed into the sum of a dissipated part, which is absorbed by the change of path probability, and the reversible work. The reversible work is a path-independent quantity which is equal to the change in free energy.
The free energy is path independent, meaning it depends only on the initial and final values of the parameters, so it can be moved out of the path integral, and we are left with the Jarzynski Equality: the exponentially weighted average of the work equals the exponential of minus the free energy change.
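The result can be checked by brute force on a system small enough to sum over every path. The sketch below rests on assumptions made for the example: a single ±1 variable with energy `-theta * x`, a Metropolis kernel, unit temperature, and a fixed schedule of `theta` values in place of the Langevin update. For this model the normalizing constant is 2 cosh(θ), so the right-hand side of the equality is a ratio of hyperbolic cosines.

```python
import itertools
import numpy as np

SPINS = np.array([-1.0, 1.0])

def energies(theta):
    """Energies of the two states under the assumed form E = -theta * x."""
    return -theta * SPINS

def metropolis_kernel(theta):
    """2x2 Metropolis transition matrix, K[i, j] = P(i -> j) at fixed theta."""
    e = energies(theta)
    K = np.zeros((2, 2))
    for i in range(2):
        j = 1 - i
        K[i, j] = min(1.0, np.exp(-(e[j] - e[i])))
        K[i, i] = 1.0 - K[i, j]
    return K

def jarzynski_average(thetas):
    """Exact average of exp(-W) over all paths, starting from equilibrium."""
    pi0 = np.exp(-energies(thetas[0]))
    pi0 /= pi0.sum()
    kernels = [metropolis_kernel(th) for th in thetas[:-1]]
    n_steps = len(thetas) - 1
    total = 0.0
    for path in itertools.product(range(2), repeat=n_steps + 1):
        prob = pi0[path[0]]
        work = 0.0
        for t in range(n_steps):
            # Variable update at fixed theta_t, then parameter change at fixed state.
            prob *= kernels[t][path[t], path[t + 1]]
            s = path[t + 1]
            work += energies(thetas[t + 1])[s] - energies(thetas[t])[s]
        total += prob * np.exp(-work)
    return total

thetas = np.linspace(0.2, 1.5, 7)                    # hypothetical schedule
lhs = jarzynski_average(thetas)                      # <exp(-W)>
rhs = np.cosh(thetas[-1]) / np.cosh(thetas[0])       # Z_final / Z_initial
```

Here `lhs` and `rhs` agree to numerical precision for any schedule, fast or slow. In practice the average has to be estimated from sampled trajectories rather than enumerated, and that is where the statistical issues come in.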
The assumptions that this proof is based on are actually much stronger than are required. There are many other proofs out there. I like this one because it’s very simple and the assumptions, Markovian dynamics and detailed balance, apply to the cases I want to treat: machine learning with MCMC evaluation of the gradient. The Gibbs sampler satisfies detailed balance.
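The last claim can be verified numerically on a tiny model. This is a sketch under stated assumptions: ±1 variables, a hypothetical symmetric coupling matrix `J`, energy `-0.5 * x @ J @ x`, unit temperature, and a random-scan Gibbs sampler that picks one site uniformly at random per step. A full systematic sweep is a product of such single-site updates and need not satisfy detailed balance as a whole, although each single-site update does.

```python
import itertools
import numpy as np

def energy(x, J):
    """Assumed quadratic energy of a +/-1 configuration."""
    return -0.5 * x @ J @ x

def gibbs_kernel(J):
    """Transition matrix of random-scan Gibbs and the exact equilibrium distribution."""
    n = J.shape[0]
    states = [np.array(s, dtype=float) for s in itertools.product([-1, 1], repeat=n)]
    index = {tuple(s): k for k, s in enumerate(states)}
    K = np.zeros((len(states), len(states)))
    for k, x in enumerate(states):
        for site in range(n):
            up, down = x.copy(), x.copy()
            up[site], down[site] = 1.0, -1.0
            # Conditional probability that the chosen site is set to +1.
            p_up = 1.0 / (1.0 + np.exp(energy(up, J) - energy(down, J)))
            K[k, index[tuple(up)]] += p_up / n
            K[k, index[tuple(down)]] += (1.0 - p_up) / n
    weights = np.exp([-energy(x, J) for x in states])
    return K, weights / weights.sum()

J = np.array([[0.0, 0.8, -0.3],
              [0.8, 0.0, 0.5],
              [-0.3, 0.5, 0.0]])
K, pi = gibbs_kernel(J)
flow = pi[:, None] * K                       # flow[i, j] = pi_i * P(i -> j)
detailed_balance_holds = np.allclose(flow, flow.T)
```

A symmetric `flow` matrix means the equilibrium probability flow from i to j equals the flow from j to i for every pair of states, which is exactly the detailed balance condition used in the proof.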