RESEARCH LOG

Non-Equilibrium Statistical Physics and Learning Machines I

March 2, 2009 · Leave a Comment

Boltzmann Machines and Contrastive Divergence Learning

One formalization of the idea of “learning from data” is to take a rich statistical model such as a Hinton and Sejnowski’s Boltzmann machine:

p(x;\theta) = \frac{1}{Z(\theta)}e^{-\beta E(x,\theta)}

and choose values of the parameters \theta for which the model will assign high probability to values of x that are like the data and low probability to values that are unlike the data.  One method of doing this is to maximize the log-likelihood of the data.  This estimator is called the maximum likelihood estimator and it has a theoretically desirable property called asymptotic efficiency.   This means that in the limit of a large sample it is unbiased and achieves  the minimum variance possible for an unbiased estimator.  If we are given a data set \{x_i\} the average  log-likelihood and its gradient are:

p(x;\theta) = \prod_i^N \frac{1}{Z(\theta)}\exp[-\beta E(x_i,\theta)]

\langle \log p(x;\theta) \rangle_{data} = - \beta \langle E(x,\theta) \rangle_{data} - \log Z(\theta)

\langle \frac{d}{d\theta} \log p(x;\theta) \rangle_{data} = -\beta \langle \frac{d E(x,\theta)}{d\theta} \rangle_{data} - \frac{d}{d\theta} \log Z(\theta)

To evaluate the derivatives of \log Z(\theta) note that:

\frac{d}{d\theta} \log Z(\theta) = \frac{1}{Z(\theta)}\frac{d Z(\theta)}{d\theta}

\frac{d}{d\theta} \log Z(\theta) = \frac{-\beta}{Z(\theta)}\sum_{x} \frac{dE(x,\theta)}{d\theta} \exp[-\beta E(x,\theta)]

\frac{d}{d\theta} \log Z(\theta) = -\beta \langle \frac{d E(x,\theta)}{d\theta}\rangle_{model}

Thus the average gradient of the log-likelihood is:

\langle \frac{d}{d\theta} \log p(x;\theta) \rangle_{data} = -\beta \langle \frac{d E(x,\theta) }{d\theta}\rangle_{data} -\beta \langle \frac{d E(x,\theta)}{d\theta}\rangle_{model}

Now if we assume that the energy E(x,\theta) has a quadratic form E = x^T \cdot \theta \cdot x  we can evaluate the derivatives of the energy function and the gradient takes on an especially simple form:

\langle \frac{d}{d\theta} \log p(x;\theta) \rangle_{data} = -\beta ( \langle x_i x_j \rangle_{data} - \langle x_i x_j \rangle_{model} )

Unfortunately, for many models the log-likelihood or even its gradient are impractical to compute because of the  normalizing constant (a.k.a. the partition function) which appears in the evaluation of the model averages \langle x_i x_j \rangle_{model}.  Z requires summing over all configurations of the model  variables – a set which grows exponentially with the number of variables.

Despite this difficulty practical solutions exist that make use of approximate methods.  One strategy is to choose a family of functions with a set of variational free parameters for which the likelihood can be evaluated  and use an optimization procedure to choose the function in this family which is closest to the desired density.

Another method is to replace the expectation over the model distribution which appears in the gradient of the log likelihood by an average over a finite sample.  This finite sample can be derived by the use of Markov-Chain Monte-Carlo methods (MCMC).  MCMC methods bring gradient descent of the log-likelihood  into the world of the possible, but there are still many practical issues.  MCMC samples are only representative of  the model distribution after the chain has been run for many steps – called the burn-in or relaxation time.  For many models of practical interest this relaxation time can be very long especially if the model density is multi-modal.  One intuitive explanation for the slowness of MCMC is that the markov chain explores the state space by a random walk which takes time of order \sqrt{t} to travel a distance d.

Fortunately, it turns out that effective results can be achieved even if the Markov chain is never allowed to relax to  equilibrium.  Which brings us to the title of this post.  If the Markov chain is initialized at the data distribution and then run for N steps, the parameters will still be pushed towards values which reduce the difference between the expectations of the model and the data.  These methods follow the gradient of a different function called the contrastive divergence.  They can lead learned parameters which perform reasonably well on unsupervised learning tasks.  This line of work has been developed by Hinton, Salakhutdinov and many others.

Although the published results with CD learning are impressive, closer analysis suggests that this type of learning is not a substitute for maximum likelihood learning.  If one takes the learned model parameters and then runs the Markov chain to equilibrium to draw samples from its equilibrium distribution – the samples don’t resemble the data. It has been proven that the minima of the contrastive divergence do not correspond to the minima of the likelihood function, so CD is likely a biased estimate as well. (although the bias is claimed to be small — and this is not to suggest that bias is always an evil to be avoided: especially in the computation of a gradient one might prefer an estimator which achieves lower variance than the MLE, by trading off a small amount of bias)

Improvements on CD Learning – A mechanical analogy

An improvement on CD learning which was recently proposed in a paper by Tijmen Tieleman is to allow the Markov chain to retain its state between gradient descent iterations instead of resetting to the data distribution.  If the learning rate or step size of the gradient descent is sufficiently small then the change in the equilibrium distribution of the markov chain can be expected to be small, on each distribution, and the markov chain can relax to equilibrium during the learning protocol.  This procedure has been called progressive contrastive divergence (PCD), and it is much more physically plausible.  The first experiments with this method have shown that its performance is better than CD learning in the long run, although CD learning can perform better for short learning protocols.

This PCD iterative stochastic gradient descent learning procedure has a simple physical interpretation.  The model parameters follow a Brownian motion, driven by a position dependent stochastic force with mass m,  viscous damping \beta, and weight decay \gamma.   Thus, we can write the following Langevin equation for the model parameters:

m\ddot{\theta}_{ij} = \tilde{F}_{ij}(\theta_{ij}) -\beta\dot{\theta}_{ij} - \gamma|\theta_{ij}|

\tilde{F}_{ij} is the stochastic force on the parameter \theta_{ij} it is equal to the difference between the pair correlation functions with respect to the data and model distributions.

\tilde{F}_{ij} = \langle x_i x_j \rangle_{data} - \langle x_i x_j \rangle_{model}

Just as with contrastive divergence we only run the Markov chain for a few steps per iteration, but we don’t reset it to the data distribution. When the \theta_{ij} has reached the value for which the model pair correlation functions match the data pair correlation functions the stochastic force \tilde{F}  vanishes on average.  There are, of course,  still fluctuations about this average.

This mechanical interpretation of the learning process is very suggestive of other questions – how much mechanical work is done on the parameters by the stochastic force?  How much heat is dissipated?  How much entropy is produced?

Categories: Uncategorized

0 responses so far ↓

  • There are no comments yet...Kick things off by filling out the form below.

Leave a Comment