RESEARCH LOG

Interlude: Proof of JE from Crooks Fluctuation Theorem

March 29, 2009

In the last post we discussed the Jarzynski Equality

\langle \exp (-\beta W) \rangle = \exp (-\beta \Delta F)

which relates the exponentially weighted average of the work done by a non-equilibrium process to the free energy change between the initial and final states of that process.

We’re going to take a little diversion and explain why this formula is true, then discuss some of the statistical issues that arise when trying to use it to estimate free energy differences.

In the context of machine learning this equation relates the ratio of the normalizing constants of two probabilistic models to the cumulative work done on the model’s parameters \theta by a learning process that connects these two models. It’s been a while, so let’s revisit our definition of “mechanical work” in the context of training an exponential family model, e.g. a Markov random field:

Our model consists of variables x_i and parameters \theta_{ij} which describe the couplings between the variables.  The probability of a certain configuration of the variables is given by:

p(x | \theta) = \frac{1}{Z(\theta)}\exp ( x^T \cdot \theta \cdot x )

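The form x^T \cdot \theta \cdot x which appears in the argument of the exponential plays the role of the energy of the model. With the Boltzmann convention p \propto \exp(-\beta E) used in the proof, it corresponds to E(x, \theta) = -x^T \cdot \theta \cdot x at \beta = 1.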
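To make the definition concrete, here is a minimal Python sketch that evaluates p(x | \theta) by brute-force enumeration. It assumes binary variables x_i in {-1, +1} and a small three-variable coupling matrix. Both choices, and the helper names `quadratic_form` and `exact_distribution`, are illustrative and not part of the setup above.

```python
import itertools
import math

def quadratic_form(x, theta):
    """Return x^T . theta . x for a configuration x and coupling matrix theta."""
    n = len(x)
    return sum(x[i] * theta[i][j] * x[j] for i in range(n) for j in range(n))

def exact_distribution(theta):
    """Enumerate every configuration and return (configs, probabilities, Z)."""
    n = len(theta)
    # Assumption: each variable takes the values -1 or +1.
    configs = list(itertools.product([-1, 1], repeat=n))
    weights = [math.exp(quadratic_form(x, theta)) for x in configs]
    Z = sum(weights)  # the normalizing constant Z(theta)
    return configs, [w / Z for w in weights], Z

# Hypothetical symmetric couplings between three variables.
theta = [[0.0, 0.5, 0.0],
         [0.5, 0.0, -0.3],
         [0.0, -0.3, 0.0]]
configs, probs, Z = exact_distribution(theta)
for x, p in zip(configs, probs):
    print(x, round(p, 4))
```

The sum defining Z(\theta) has 2^n terms, so this only works for toy models. That cost is the reason the normalizing constant has to be estimated for models of realistic size.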

The learning task is to find the values of the parameters \theta which maximize the log-likelihood of the model.  The gradient of the log-likelihood is intractable to evaluate exactly, but we can construct a stochastic approximation to it by MCMC. This approximate gradient can be thought of as a stochastic force which pushes the model parameters in the direction of higher likelihood. We can thus describe the path that the parameters take towards their maximum likelihood value as a stochastic process generated by the following Langevin equation:

m \ddot{\theta}_{ij} = \tilde{F}_{ij}(\theta) - \beta \dot{\theta}_{ij} - \gamma |\theta_{ij}|

The stochastic force is given by:

\tilde{F}_{ij} = \langle x_i x_j \rangle_{data} - \langle x_i x_j \rangle_{model}
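As a rough illustration of how this force might be estimated, the sketch below replaces the model average with an average over Gibbs samples. It assumes variables x_i in {-1, +1}, the model p(x | \theta) \propto \exp(x^T \cdot \theta \cdot x), a tiny made-up data set, and hypothetical function names. It is one possible estimator, not necessarily the one used in the earlier posts.

```python
import math
import random

def gibbs_sweep(x, theta, rng):
    """Resample every x_i in {-1, +1} from its conditional under exp(x^T theta x)."""
    n = len(x)
    for i in range(n):
        # Only terms containing x_i matter: x_i * sum_j (theta_ij + theta_ji) * x_j.
        field = sum((theta[i][j] + theta[j][i]) * x[j] for j in range(n) if j != i)
        p_up = 1.0 / (1.0 + math.exp(-2.0 * field))
        x[i] = 1 if rng.random() < p_up else -1

def model_moments(theta, n_sweeps, rng, burn_in=500):
    """Estimate <x_i x_j>_model by averaging over Gibbs samples."""
    n = len(theta)
    x = [rng.choice([-1, 1]) for _ in range(n)]
    acc = [[0.0] * n for _ in range(n)]
    for sweep in range(burn_in + n_sweeps):
        gibbs_sweep(x, theta, rng)
        if sweep >= burn_in:
            for i in range(n):
                for j in range(n):
                    acc[i][j] += x[i] * x[j]
    return [[a / n_sweeps for a in row] for row in acc]

def data_moments(data):
    """Compute <x_i x_j>_data as a plain average over the data vectors."""
    n = len(data[0])
    return [[sum(x[i] * x[j] for x in data) / len(data) for j in range(n)]
            for i in range(n)]

def stochastic_force(data, theta, n_sweeps, rng):
    """Noisy estimate of F_ij = <x_i x_j>_data - <x_i x_j>_model."""
    d = data_moments(data)
    m = model_moments(theta, n_sweeps, rng)
    n = len(theta)
    return [[d[i][j] - m[i][j] for j in range(n)] for i in range(n)]

# Hypothetical data set and starting parameters.
data = [(1, 1, -1), (1, 1, 1), (-1, -1, 1), (1, -1, -1)]
theta = [[0.0] * 3 for _ in range(3)]
force = stochastic_force(data, theta, n_sweeps=5000, rng=random.Random(0))
```

The estimate is noisy because the model average comes from a finite number of samples. That sampling noise is what makes the force on the parameters stochastic.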

Proof of JE using Crooks Fluctuation Theorem

The simplest proof (in my opinion) of the JE was derived by Crooks (J. Stat. Phys. 90, 1481 (1998)). It is based on the assumption that the system has Markovian dynamics, that is, that the probability of each transition depends only on the present state:

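p(x_n | x_{n-1}, x_{n-2}, ..., x_0) = p(x_n | x_{n-1})

so that the joint probability of a trajectory factorizes as

p(x_n, x_{n-1}, ..., x_0) = p(x_n | x_{n-1})p(x_{n-1}|x_{n-2})...p(x_1|x_0)p(x_0)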

The second assumption is that the system’s dynamics satisfy the detailed balance condition.  The detailed-balance condition (also called microscopic reversibility) means that when the Markov chain has relaxed to equilibrium, the probability flow through every transition is equal to the probability flow through the reverse transition:

p(x_i | x_j)e^{-\beta E(x_j, \theta)} = p(x_j | x_i)e^{-\beta E(x_i, \theta)}

A better way to understand this is to consider the master equation for the Markov chain:

\frac{d}{dt}p(x_i) = \sum_j [p(x_i | x_j)p(x_j) - p(x_j | x_i) p(x_i)]

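Setting the left hand side of the master equation to zero gives the condition for a stationary distribution. Detailed balance is the stronger requirement that every term in the sum vanishes separately when p(x) is the equilibrium distribution, p(x) \propto e^{-\beta E(x, \theta)}.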
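A small numerical check can make this concrete. The sketch below builds a Metropolis transition matrix for four states with made-up energies and verifies that the Boltzmann distribution satisfies detailed balance and makes the right hand side of the master equation vanish. The Metropolis rule with a uniform proposal is an assumption chosen for brevity; any kernel that satisfies detailed balance would do.

```python
import math

def boltzmann(energies, beta=1.0):
    """Equilibrium probabilities proportional to exp(-beta * E)."""
    weights = [math.exp(-beta * e) for e in energies]
    z = sum(weights)
    return [w / z for w in weights]

def metropolis_matrix(energies, beta=1.0):
    """P[i][j] = p(x_i | x_j): Metropolis rule, uniform proposal over the other states."""
    n = len(energies)
    P = [[0.0] * n for _ in range(n)]
    for j in range(n):
        for i in range(n):
            if i != j:
                accept = min(1.0, math.exp(-beta * (energies[i] - energies[j])))
                P[i][j] = accept / (n - 1)
        # Whatever probability is not spent on moves stays on state j.
        P[j][j] = 1.0 - sum(P[i][j] for i in range(n) if i != j)
    return P

def detailed_balance_gap(P, pi):
    """Largest |p(x_i|x_j) pi_j - p(x_j|x_i) pi_i| over all pairs of states."""
    n = len(pi)
    return max(abs(P[i][j] * pi[j] - P[j][i] * pi[i])
               for i in range(n) for j in range(n))

def master_equation_rhs(P, p):
    """Right hand side of the master equation for every state i."""
    n = len(p)
    return [sum(P[i][j] * p[j] - P[j][i] * p[i] for j in range(n))
            for i in range(n)]

energies = [0.0, 0.7, 1.5, -0.4]  # hypothetical energies of four states
pi = boltzmann(energies)
P = metropolis_matrix(energies)
uniform = [0.25] * 4
print(detailed_balance_gap(P, pi))      # zero up to rounding
print(master_equation_rhs(P, pi))       # every entry zero up to rounding
print(master_equation_rhs(P, uniform))  # nonzero: this distribution is not stationary
```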

Now consider a learning protocol with the following steps:

  1. The model variables x(t) are updated using one step of the Markov chain to x(t+1), holding the parameters of the model \theta(t) fixed.  This process does no mechanical work and dissipates a quantity of heat equal to the energy lost by the variables, Q(t) = E(x(t), \theta(t)) - E(x(t+1),\theta(t))
  2. The parameters of the model \theta are updated using the Langevin equation. This process involves mechanical work W(t) = E(x(t+1), \theta(t+1))-E(x(t+1),\theta(t))

The probability of the entire sequence of states can be written using the Markov property as

p(x(t_0), ... x(t_n) | \theta(t_0), ..., \theta(t_n)) = p(x(t_n) | x(t_{n-1}),\theta(t_{n-1})) p(x(t_{n-1})| x(t_{n-2}), \theta(t_{n-2})) ... p(x(t_1) | x(t_0), \theta(t_0))

Now consider the probability of the time-reversed trajectory, in which each transition is undone at the same parameter value at which it was made:

p(x(t_n), ... x(t_0) | \theta(t_n), ..., \theta(t_0)) = p(x(t_0) | x(t_1),\theta(t_0)) p(x(t_1)| x(t_2), \theta(t_1)) ... p(x(t_{n-1}) | x(t_n), \theta(t_{n-1}))

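From detailed balance, applied at the fixed parameter value \theta(t_{i-1}), we know that

p(x(t_i) | x(t_{i-1}), \theta(t_{i-1})) = p(x(t_{i-1}) | x(t_i), \theta(t_{i-1})) \exp (\beta Q(t_i))

where Q(t_i) is the heat dissipated in the step from x(t_{i-1}) to x(t_i).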

So we can write the ratio of the probabilities of the forward and time-reversed trajectories as

\frac{p(x(t_0), ... x(t_n) | \theta(t_0), ..., \theta(t_n))}{p(x(t_n), ... x(t_0) | \theta(t_n), ..., \theta(t_0))} = e^{\beta \sum_{i=1}^n Q(t_i)}

Thus the ratio of the probabilities of the forward and time-reversed trajectories is fixed by the heat dissipated during the process.  This relation is at the heart of the Crooks fluctuation theorem.
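The ratio can be checked numerically on a toy system. The sketch below uses two coupled spins with a scalar coupling \theta, a Metropolis kernel, and a hand-picked path and parameter schedule, all of which are assumptions made for illustration. Heat is counted as the energy the variables give up during each x update, which is the sign for which the ratio equals \exp(+\beta \sum Q). Each reverse transition is evaluated at the same \theta as its forward partner.

```python
import math

BETA = 1.0
STATES = [(-1, -1), (-1, 1), (1, -1), (1, 1)]

def energy(x, theta):
    # Assumed convention: p(x | theta) is proportional to exp(-BETA * energy).
    return -theta * x[0] * x[1]

def transition_prob(x_new, x_old, theta):
    """Metropolis p(x_new | x_old, theta), uniform proposal over the other states."""
    def move(a, b):  # probability of moving from b to a different state a
        accept = min(1.0, math.exp(-BETA * (energy(a, theta) - energy(b, theta))))
        return accept / (len(STATES) - 1)
    if x_new != x_old:
        return move(x_new, x_old)
    return 1.0 - sum(move(s, x_old) for s in STATES if s != x_old)

def path_ratio_and_heat(path, thetas):
    """Return (forward / reverse product of transition probabilities, total heat)."""
    forward = reverse = 1.0
    heat = 0.0
    for t in range(len(path) - 1):
        x_old, x_new, theta = path[t], path[t + 1], thetas[t]
        forward *= transition_prob(x_new, x_old, theta)
        reverse *= transition_prob(x_old, x_new, theta)  # same theta, opposite direction
        heat += energy(x_old, theta) - energy(x_new, theta)  # energy given up by x
    return forward / reverse, heat

# Hypothetical path x(t_0), ..., x(t_3) and parameter values theta(t_0), ..., theta(t_2).
path = [(1, 1), (1, -1), (-1, -1), (-1, 1)]
thetas = [0.2, 0.6, 1.0]
ratio, heat = path_ratio_and_heat(path, thetas)
print(ratio, math.exp(BETA * heat))  # the two numbers agree up to rounding
```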

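The Jarzynski Equality quickly follows.  When the forward and reversed paths both start in equilibrium, at \theta(t_0) and \theta(t_n) respectively, the Boltzmann weights of the end points combine with the heat, and the exponent in the path ratio becomes the dissipated work, which we continue to call Q: Q = W - \Delta F. We must compute the average

\langle \exp( -\beta W ) \rangle, where the average is taken over all possible paths: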

\langle \exp( -\beta W ) \rangle = \int_{x(t_0), ..., x(t_n)} p(x(t_0), ..., x(t_n) | \theta(t_0), ..., \theta(t_n)) \exp( -\beta \sum_i W(t_i) )

This is a functional or path integral, and the Crooks theorem makes this one especially simple to evaluate.  All that is required is to replace the probability of the forward path with the probability of the time-reversed path multiplied by \exp(\beta Q):

p(x(t_0), ..., x(t_n) | \theta(t_0), ..., \theta(t_n)) \rightarrow p(x(t_n), ..., x(t_0) | \theta(t_n), ..., \theta(t_0)) \exp(\beta Q)

Now the weight factor in the average becomes \exp(-\beta(W - Q)) = \exp(-\beta \Delta F).  This comes from the fact that the total work W can be decomposed into the sum of the heat and the reversible work. The reversible work is a path independent quantity which is equal to the change in free energy \Delta F.

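The free energy is path independent, meaning it depends only on the initial and final values of the parameters \theta, so \exp(-\beta \Delta F) can be moved out of the path integral. What remains is the integral of the normalized probability of the time-reversed paths, which equals one, and we are left with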

\langle \exp(-\beta W) \rangle = \exp(-\beta \Delta F)
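For a system small enough to enumerate, the equality can be verified exactly by summing over every path. The sketch below does this for two coupled spins under the two-step protocol described earlier. It assumes a fixed, hand-picked schedule for \theta in place of the Langevin updates, an equilibrium starting state at \theta(t_0), a Metropolis kernel, and the convention p \propto \exp(-\beta E). None of these details come from the derivation itself.

```python
import itertools
import math

BETA = 1.0
STATES = [(-1, -1), (-1, 1), (1, -1), (1, 1)]

def energy(x, theta):
    return -theta * x[0] * x[1]

def partition_function(theta):
    return sum(math.exp(-BETA * energy(x, theta)) for x in STATES)

def transition_prob(x_new, x_old, theta):
    """Metropolis p(x_new | x_old, theta), uniform proposal over the other states."""
    def move(a, b):
        accept = min(1.0, math.exp(-BETA * (energy(a, theta) - energy(b, theta))))
        return accept / (len(STATES) - 1)
    if x_new != x_old:
        return move(x_new, x_old)
    return 1.0 - sum(move(s, x_old) for s in STATES if s != x_old)

def jarzynski_average(thetas):
    """Exact <exp(-BETA * W)>, summed over every path x(t_0), ..., x(t_n)."""
    n = len(thetas) - 1
    z0 = partition_function(thetas[0])
    total = 0.0
    for path in itertools.product(STATES, repeat=n + 1):
        prob = math.exp(-BETA * energy(path[0], thetas[0])) / z0  # equilibrium start
        work = 0.0
        for t in range(n):
            # Step 1: x update at fixed theta(t); contributes heat but no work.
            prob *= transition_prob(path[t + 1], path[t], thetas[t])
            # Step 2: theta update at fixed x(t+1); contributes work.
            work += energy(path[t + 1], thetas[t + 1]) - energy(path[t + 1], thetas[t])
        total += prob * math.exp(-BETA * work)
    return total

thetas = [0.0, 0.4, 0.8, 1.2]  # hypothetical parameter schedule
lhs = jarzynski_average(thetas)
rhs = partition_function(thetas[-1]) / partition_function(thetas[0])  # exp(-BETA * Delta F)
print(lhs, rhs)  # the two numbers agree up to rounding
```

In a realistic model the sum over paths is replaced by an average over sampled trajectories, and the exponential weighting makes that average sensitive to rare low-work paths.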

The assumptions that this proof is based on are actually much stronger than are required.  There are many other proofs out there.  I like this one because it’s very simple and the assumptions (Markovian dynamics and detailed balance) apply to the cases I want to treat: machine learning with MCMC evaluation of the gradient.  Each single-variable update of the Gibbs sampler satisfies detailed balance.
