RESEARCH LOG

Topic Models vs. Sparse Coding

June 1, 2009 · Leave a Comment

This post is very speculative – please comment if there are any errors or misunderstandings on my part.

A topic model (e.g. Latent Dirichlet Allocation) takes a large corpus of documents x \in D  and represents each document as convex combination of a set of topic vectors x_i = \beta \gamma_i where \gamma_i is the topic mixture for document x_i and \beta is the matrix of topics – each topic can be thought of as a distribution over the vocabulary of words.  The inference task is to find the mixture of topics for every document \gamma_i and also to find the set of topics \beta which is able to best represent the training corpus.  LDA improved on its predecessor technique pLSI, by taking a Bayesian approach and placing Dirichlet priors on the topic mixtures (and the topic vectors themselves).  Like many Bayesian methods these prior distributions can be intuitively thought of as “regularizing” the inferred topic mixtures so that they do not overfit to the training corpus.  Another key benefit is is that the computational labor required for inferences scales as the number of topics rather than the number of training examples.

The essence of the problem though is really two steps – (1) learning a basis of vectors \beta (topics) that can be used to describe each document (2) representing each document as a mixture \gamma  in the learned basis, to optimize some loss function.

LDA fits a corpus by alternately performing MAP inference for the topic mixtures and then the topics.  Computing the posterior probabilities involves intractable sums so they are computed by variational approximation or MCMC methods.

A recent paper by Lee et. al. – Exponential Family Sparse Coding with Applications to Self-Taught Learning uses a multivariate poisson distribution (which is in the general exponential family along with the gaussian, bernoilli, and dirichlet distribution) as a generative model for document bag of words vectors.  The parameters of this multivariate distribution is fit for each document, but representing it as a sparse linear combination of a dictionary of basis vectors.  L1 regularized regression is used to enforce sparsity.  The authors performed a comparison with Latent Dirichlet Allocation, where they used the topic mixture as input to a standard document classifier and the performance of the sparse vectors was superior. Of course this is a very limited comparison of the two methods, but it is fascinating to contemplate.

Learning a dictionary for sparse coding is analogous to the inference problem of topic modeling.  We can write the regression task as a L1 regularized least squares regression problem.

\min_{\{\gamma\}} \sum_i |x_i - \beta \gamma_i|^2 + \lambda |\gamma_i|

The L1 penalty on the norm of \gamma encourages the mixtures to have as many elements as possible equal to zero.  This optimization problem is difficult to solve because the derivative of the L1 penalty term is not continuous at zero.  There are several algorithms to  solve this problem such as LARS.  Sparsity and its benefits is a very active research topic.

It is also essential that the basis vectors \beta be constrained so that they have positive coefficients and that their coefficients sum to one.  This complicates the optimization a little.

A recent paper Online Dictionary Learning for Sparse Coding – by Mairal, Bach, Ponce, and Sapiro (ICML 2009)  Addresses the problem of learning the dictionary as well as the mixture vectors.  The authors propose an algorithm which they claim can scale to datasets with millions of training examples.  This is the “online” part.

I am very curious to try and compare the dictionary constructed using this method to the topics learned using LDA.

I suspect that this has already been thought of by experts in these techniques.  Has anyone already tried this?

Categories: Uncategorized

0 responses so far ↓

  • There are no comments yet...Kick things off by filling out the form below.

Leave a Comment