RESEARCH LOG

judging by the title

September 16, 2009

It’s one of my favorite times of the year – the list of NIPS 2009 papers is out! Lots of new stuff to read.  Thanks to all those who contributed.

http://nips.cc/Conferences/2009/Program/accepted-papers.php

Although the text of the papers hasn’t been released, I’m going to play a game and guess which ones will be my favorites judging only by the title and authors.

I have lately been working on the development of hierarchical (I guess they could be called “deep”) sequence models, which take advantage of translational and other symmetries, so I’m particularly interested in the ones with “invariances” in the title.

Here’s my list. I’ll grade my predictions after I actually read them, so you’ll find out how I did later (I know that this sounds like a shallow exercise):

  • Decoupling Sparsity and Smoothness in the Discrete Hierarchical Dirichlet Process
    C. Wang, D. Blei
  • Variational Inference for the Nested Chinese Restaurant Process
    C. Wang, D. Blei

Yikes! That’s a lot of papers to read.  There are just too many of these that look interesting.  Obviously this is a very biased list, and it reflects my ignorance as much as my interests.


Topic Models vs. Sparse Coding

June 1, 2009

This post is very speculative – please comment if there are any errors or misunderstandings on my part.

A topic model (e.g. Latent Dirichlet Allocation) takes a large corpus of documents x_i \in D and represents each document as a convex combination of a set of topic vectors, x_i = \beta \gamma_i, where \gamma_i is the topic mixture for document x_i and \beta is the matrix of topics – each topic can be thought of as a distribution over the vocabulary of words.  The inference task is to find the mixture of topics \gamma_i for every document and also to find the set of topics \beta which best represents the training corpus.  LDA improved on its predecessor technique, pLSI, by taking a Bayesian approach and placing Dirichlet priors on the topic mixtures (and the topic vectors themselves).  As in many Bayesian methods, these prior distributions can be intuitively thought of as “regularizing” the inferred topic mixtures so that they do not overfit to the training corpus.  Another key benefit is that the computational labor required for inference scales as the number of topics rather than the number of training examples.

The essence of the problem though is really two steps – (1) learning a basis of vectors \beta (topics) that can be used to describe each document, and (2) representing each document as a mixture \gamma in the learned basis, so as to optimize some loss function.
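To make the "convex combination" picture concrete, here is a minimal sketch in Python. The sizes and the random Dirichlet draws are purely illustrative assumptions, not anything fitted to a real corpus.

```python
import numpy as np

rng = np.random.default_rng(0)

n_words, n_topics = 6, 3  # toy sizes, chosen only for illustration

# beta: each column is a topic, i.e. a probability distribution over the vocabulary
beta = rng.dirichlet(np.ones(n_words), size=n_topics).T   # shape (n_words, n_topics)

# gamma: topic mixture for one document (non-negative, sums to one)
gamma = rng.dirichlet(np.ones(n_topics))                  # shape (n_topics,)

# The document's expected word distribution is a convex combination of the topics
x = beta @ gamma
print(x, x.sum())
```

Because every topic column and the mixture both lie on the simplex, the resulting x is again a distribution over words.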

LDA fits a corpus by alternately performing MAP inference for the topic mixtures and then the topics.  Computing the posterior probabilities involves intractable sums so they are computed by variational approximation or MCMC methods.

A recent paper by Lee et al. – Exponential Family Sparse Coding with Applications to Self-Taught Learning – uses a multivariate Poisson distribution (which is in the general exponential family along with the Gaussian, Bernoulli, and Dirichlet distributions) as a generative model for document bag-of-words vectors.  The parameters of this multivariate distribution are fit for each document by representing them as a sparse linear combination of a dictionary of basis vectors.  L1 regularized regression is used to enforce sparsity.  The authors performed a comparison with Latent Dirichlet Allocation, in which they used the topic mixture as input to a standard document classifier, and the performance of the sparse vectors was superior. Of course this is a very limited comparison of the two methods, but it is fascinating to contemplate.

Learning a dictionary for sparse coding is analogous to the inference problem of topic modeling.  We can write the regression task as an L1 regularized least squares regression problem.

\min_{\{\gamma_i\}} \sum_i \left( \|x_i - \beta \gamma_i\|_2^2 + \lambda \|\gamma_i\|_1 \right)

The L1 penalty on \gamma encourages the mixtures to have as many elements as possible equal to zero.  This optimization problem is difficult to solve because the derivative of the L1 penalty term is discontinuous at zero.  There are several algorithms for solving this problem, such as LARS.  Sparsity and its benefits are a very active research topic.
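As a rough illustration of how the L1 problem can be solved for a single document with the dictionary \beta held fixed, here is a sketch of iterative soft-thresholding (ISTA). Note that this is a different and simpler algorithm than LARS, swapped in only because it fits in a few lines. The function names, the synthetic data and the value of \lambda are my own assumptions, and no positivity or sum-to-one constraints are enforced.

```python
import numpy as np

def soft_threshold(v, t):
    """Shrink every entry of v towards zero by t (proximal map of t * L1 norm)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def objective(x, beta, gamma, lam):
    return np.sum((x - beta @ gamma) ** 2) + lam * np.sum(np.abs(gamma))

def ista(x, beta, lam, n_iter=500):
    """Minimize ||x - beta @ gamma||^2 + lam * ||gamma||_1 over gamma, beta fixed."""
    L = 2.0 * np.linalg.norm(beta, 2) ** 2      # Lipschitz constant of the smooth part
    gamma = np.zeros(beta.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * beta.T @ (beta @ gamma - x)  # gradient of the squared error
        gamma = soft_threshold(gamma - grad / L, lam / L)
    return gamma

# Synthetic example (assumed data): a sparse mixture of 3 out of 20 basis vectors
rng = np.random.default_rng(0)
beta = rng.standard_normal((50, 20))
gamma_true = np.zeros(20)
gamma_true[[2, 7, 11]] = [1.0, -0.5, 2.0]
x = beta @ gamma_true + 0.01 * rng.standard_normal(50)

gamma_hat = ista(x, beta, lam=1.0)
print(np.round(gamma_hat, 2))
```

The soft-threshold step is what produces exact zeros: any coefficient whose gradient step lands inside [-\lambda/L, \lambda/L] is set to zero rather than merely made small.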

It is also essential that the basis vectors \beta be constrained so that they have positive coefficients and that their coefficients sum to one.  This complicates the optimization a little.

A recent paper, Online Dictionary Learning for Sparse Coding by Mairal, Bach, Ponce, and Sapiro (ICML 2009), addresses the problem of learning the dictionary as well as the mixture vectors.  The authors propose an algorithm which they claim can scale to datasets with millions of training examples.  This is the “online” part.

I am very curious to try and compare the dictionary constructed using this method to the topics learned using LDA.

I suspect that this has already been thought of by experts in these techniques.  Has anyone already tried this?


Interlude: Proof of JE from Crooks Fluctuation Theorem

March 29, 2009

In the last post we discussed the Jarzynski Equality

\langle \exp (-\beta W) \rangle = \exp (-\beta \Delta F)

Which relates the exponentially weighted average of the work done by a non-equilibrium process to the free energy change between the initial and final states of that process.

We’re going to take a little diversion and explain why this formula is true, then discuss some of the statistical issues that arise when trying to use it to estimate free energy differences.

In the context of machine learning this equation relates the ratio of the normalizing constants of two probabilistic models to the cumulative work done on the model’s parameters \theta by a learning process that connects these two models. It’s been a while, so let’s revisit our definition of “mechanical work” in the context of training an exponential family model – e.g. a Markov random field:

Our model consists of variables x_i and parameters \theta_{ij} which describe the coupling between the variables.  The probability of a certain configuration of the variables is given by:

p(x | \theta) = \frac{1}{Z(\theta)}\exp ( x^T \cdot \theta \cdot x )

The form x^T \cdot \theta \cdot x which appears in the argument of the exponential plays the role of the energy of the model; comparing with p \propto e^{-\beta E}, it is equal to -\beta E(x, \theta).

The learning task is to find the values of the parameters \theta which maximize the log likelihood of the model.  The gradient of the log-likelihood is intractable to evaluate, but we can construct a stochastic approximation to it by MCMC. This approximate gradient can be thought of as a stochastic force which pushes the model parameters in the direction of higher likelihood. We can thus describe the path that the parameters take towards their maximum likelihood value as a stochastic process generated by the following Langevin equation:

m \ddot{\theta}_{ij} = \tilde{F}_{ij}(\theta) - \beta \dot{\theta}_{ij} - \gamma |\theta_{ij}|

The stochastic force is given by:

\tilde{F}_{ij} = \langle x_i x_j \rangle_{data} - \langle x_i x_j \rangle_{model}

Proof of JE using Crooks Fluctuation Theorem

The simplest proof (in my opinion) of the JE was derived by Crooks (J. Stat. Phys. 90, 1481 (1998)). It is based on the assumption that the system has Markovian dynamics – that is, that the probability of each transition depends only on the present state, so that the probability of a trajectory factorizes:

p(x_n, x_{n-1}, ..., x_0) = p(x_n | x_{n-1})p(x_{n-1}|x_{n-2})...p(x_1|x_0)p(x_0)

and that the system’s dynamics satisfy the detailed balance condition.  The detailed-balance condition (also called microscopic reversibility) means that when the Markov chain has relaxed to equilibrium the probability of every transition is equal to the probability of the reverse transition.

p(x_i | x_j)e^{-\beta E(x_j, \theta)} = p(x_j | x_i)e^{-\beta E(x_i, \theta)}

A better way to understand this is to consider the master equation for the Markov chain

\frac{d}{dt}p(x_i) = \sum_j [p(x_i | x_j)p(x_j) - p(x_j | x_i) p(x_i)]

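Setting the left hand side of the master equation to zero gives the condition for a stationary distribution. Detailed balance is the stronger requirement that each term in the sum vanishes separately when p(x) is the equilibrium (Boltzmann) distribution.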
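Detailed balance is easy to check numerically on a small state space. The sketch below builds a Metropolis transition matrix for four states with arbitrary toy energies (my assumption, with \beta = 1) and verifies that the probability flow between every pair of states is symmetric under the Boltzmann distribution.

```python
import numpy as np

def metropolis_matrix(E, beta=1.0):
    """P[i, j] = probability of moving to state i given the current state j."""
    n = len(E)
    P = np.zeros((n, n))
    for j in range(n):
        for i in range(n):
            if i != j:
                # propose uniformly among the other states,
                # accept with probability min(1, exp(-beta * dE))
                P[i, j] = min(1.0, np.exp(-beta * (E[i] - E[j]))) / (n - 1)
        P[j, j] = 1.0 - P[:, j].sum()   # rejected moves stay put
    return P

E = np.array([0.0, 0.5, 1.3, 2.0])      # arbitrary toy energies
P = metropolis_matrix(E)
pi = np.exp(-E) / np.exp(-E).sum()      # Boltzmann distribution at beta = 1

flow = P * pi[None, :]                  # flow[i, j] = p(x_i | x_j) * pi_j
print(np.allclose(flow, flow.T))        # detailed balance: flow is symmetric
print(np.allclose(P @ pi, pi))          # hence pi is stationary
```

The second check follows from the first: summing the symmetric flow over one index reproduces the right hand side of the master equation, which therefore vanishes.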

Now consider a learning protocol with the following steps

  1. The model variables x(t) are updated to x(t+1) using one step of the Markov chain, holding the parameters of the model \theta(t) fixed.  This process does no mechanical work and dissipates a quantity of heat equal to Q(t) = E(x(t+1), \theta(t)) - E(x(t),\theta(t))
  2. The parameters of the model \theta are updated using the Langevin equation. This process involves mechanical work W(t) = E(x(t+1), \theta(t+1))-E(x(t+1),\theta(t))

The probability of the entire sequence of states can be written using the Markov property as

p(x(t_0), ... x(t_n) | \theta(t_0), ..., \theta(t_n)) = p(x(t_n) | x(t_{n-1}),\theta(t_{n-1})) p(x(t_{n-1})| x(t_{n-2}), \theta(t_{n-2})) ... p(x(t_1) | x(t_0), \theta(t_0))

Now consider the probability of the time-reversed trajectory

p(x(t_n), ... x(t_0) | \theta(t_n), ..., \theta(t_0)) = p(x(t_0) | x(t_1),\theta(t_1)) p(x(t_1)| x(t_2), \theta(t_2)) ... p(x(t_{n-1}) | x(t_n), \theta(t_n))

From detailed balance we know that

p(x(t_i) | x(t_{i-1})) = p(x(t_{i-1}) | x(t_i)) \exp (\beta Q(t_i))

So we can write the ratio of the forward and time-reversed trajectories as

\frac{p(x(t_0), ... x(t_n) | \theta(t_0), ..., \theta(t_n))}{p(x(t_n), ... x(t_0) | \theta(t_n), ..., \theta(t_0))} = e^{\beta \sum_{i=1}^n Q(t_i)}

Thus the ratio of the probabilities of the forward and time-reversed trajectories of the process is related to the heat dissipated in the process.  This is called the Crooks fluctuation theorem.

The Jarzynski Equality quickly follows.  The heat Q is equal to Q = W - \Delta F.  We must compute the average

\langle \exp( -\beta W ) \rangle, where the average is taken over all possible paths

\langle \exp( -\beta W ) \rangle = \int_{x(t_0), ..., x(t_n)} p(x(t_0), ..., x(t_n) | \theta(t_0), ..., \theta(t_n)) \exp( -\beta \sum_i W(t_i) )

This is a functional or path integral, and the Crooks theorem makes this one especially simple to evaluate.  All that is required is to replace the probability of the forward path with the probability of the time-reversed path.

p(x(t_0), ..., x(t_n) | \theta(t_0), ..., \theta(t_n)) \rightarrow p(x(t_n), ..., x(t_0) | \theta(t_n), ..., \theta(t_0)) \exp(\beta Q)

Now the weight factor in the average becomes \exp(-\beta(W - Q)) = \exp(-\beta \Delta F).  This comes from the fact that the total work W can be decomposed into the sum of the heat and the reversible work. The reversible work is a path independent quantity which is equal to the change in free energy \Delta F.

The free energy is path independent, meaning it depends only on the initial and final values of the parameters \theta, so it can be moved out of the path integral. The remaining integral over the probabilities of the time-reversed paths is equal to one, and we are left with

\langle \exp(-\beta W) \rangle = \exp(-\beta \Delta F)

The assumptions that this proof is based on are actually much stronger than are required.  There are many other proofs out there.  I like this one because it’s very simple and the assumptions – Markovian dynamics and detailed balance – apply to the cases I want to treat: machine learning with MCMC evaluation of the gradient.  The Gibbs sampler satisfies detailed balance.
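The equality can be checked numerically on a system small enough that \Delta F is known exactly. The sketch below is a toy under my own assumptions: four discrete states with energy E(x, \theta) = \theta h(x), \beta = 1, a Metropolis chain for the x update, and a fixed schedule for \theta instead of a Langevin update. It follows the two-step protocol above (update x at fixed \theta, then change \theta and record the work).

```python
import numpy as np

def jarzynski_estimate(h, thetas, n_paths=20000, seed=0):
    """Estimate Delta F from non-equilibrium work for E(x, theta) = theta * h[x]."""
    rng = np.random.default_rng(seed)
    K = len(h)
    # start every path in equilibrium at the initial parameter value
    p0 = np.exp(-thetas[0] * h)
    p0 /= p0.sum()
    x = rng.choice(K, size=n_paths, p=p0)
    W = np.zeros(n_paths)
    for th_old, th_new in zip(thetas[:-1], thetas[1:]):
        # step 1: one Metropolis update of x at fixed theta (heat, no work)
        prop = rng.integers(K, size=n_paths)
        dE = th_old * (h[prop] - h[x])
        accept = rng.random(n_paths) < np.exp(-dE)
        x = np.where(accept, prop, x)
        # step 2: change theta at fixed x (work, no heat)
        W += (th_new - th_old) * h[x]
    dF_est = -np.log(np.mean(np.exp(-W)))   # Jarzynski estimator with beta = 1
    return dF_est, W.mean()

h = np.array([0.0, 1.0, 2.0, 3.0])          # toy energy levels
thetas = np.linspace(0.0, 2.0, 21)          # fast 20-step protocol
Z0 = np.exp(-thetas[0] * h).sum()
Zf = np.exp(-thetas[-1] * h).sum()
dF_exact = -np.log(Zf / Z0)

dF_est, W_mean = jarzynski_estimate(h, thetas)
print(dF_exact, dF_est, W_mean)
```

The chain only gets one step per parameter change, so it never equilibrates and the mean work overshoots \Delta F, yet the exponential average still recovers it to within sampling error.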


Papers: SCOTUS – Trueskill Through Time

March 6, 2009

Martin-Quinn scores measure the evolution of the ideal point of the justices of the U.S. Supreme Court.  The “ideal point” is a latent variable derived from voting records that intuitively resembles the concept of ideological bias.  The inference procedure is based on Markov Chain Monte Carlo.

[Image: mqnim]

Along these same lines there is the work of Lawrence Sirovich (who wrote the early papers applying PCA to human faces to construct “Eigenfaces”). In a PNAS paper he computes a singular value decomposition of the voting matrix of the Rehnquist court and finds a large latent factor that corresponds to liberal / conservative bias.

Lawrence Sirovich, A pattern analysis of the second Rehnquist U.S. Supreme Court, PNAS 100 (13), 7432 (2003)

Another interesting work which appeared in NIPS a few years ago:

TrueSkill through time: Revisiting the History of Chess

by Pierre Dangauthier, Ralf Herbrich, Tom Minka, and Thore Graepel of Microsoft Research

How would Morphy or Capablanca fare against Kramnik, Kasparov, or the late Bobby Fischer? The authors developed a method for deducing the change in a player’s skill over time from their record of wins and losses.  It allows players who never played each other (and who weren’t even alive at the same time) to be compared on the same scale.

I’m just surprised that they didn’t include Deep Thought / Deep Blue.

This paper is a great application of graphical models and belief propagation.

[Image: trueskill_chess]


Non Equilibrium Statistical Physics and Learning Machines II

March 4, 2009

In the last post we described a stochastic learning procedure called persistent contrastive divergence.

We asked some questions about the thermodynamic properties of such a mechanical process.

It should be clear that because the Markov chain which draws samples from the model is not necessarily allowed to relax to equilibrium, this process falls under the domain of non-equilibrium thermodynamics.

To formalise this discussion, imagine that we record several instances of the learning procedure described in the previous post.  Thus we have a sequence of model parameters \theta_{ij}(t) and state variables x_{i}(t).  We can calculate the mechanical work done on the parameters and the heat dissipated by accounting for the change in the model’s energy as the learning process is carried out:

W = \sum_t \left[ E(x(t+1), \theta(t+1)) - E(x(t+1), \theta(t)) \right]

Q = \sum_t \left[ E(x(t+1), \theta(t)) - E(x(t), \theta(t)) \right]

The equations above just state that the heat dissipated Q is the cumulative change in energy due to the change of the model variables x_i during the MCMC sampling process while holding the model parameters constant. Similarly, the mechanical work W is the cumulative change in energy due to the change of the  model parameters \theta while holding the  x_i constant. Each iteration of the gradient descent algorithm involves both updating the state of the model variables by the Markov chain and then updating the state of the model parameters, thus every iteration does some work and dissipates some heat.
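The bookkeeping can be sketched in a few lines of Python. The quadratic energy follows the earlier post. The function names and the random trajectory are assumptions made for illustration only, not a recorded learning run.

```python
import numpy as np

def energy(x, theta):
    """Quadratic energy E(x, theta) = x^T theta x."""
    return float(x @ theta @ x)

def work_and_heat(xs, thetas):
    """Accumulate W and Q along a recorded trajectory xs[t], thetas[t], t = 0..T."""
    W = Q = 0.0
    for t in range(len(xs) - 1):
        # variables change at fixed parameters: heat term
        Q += energy(xs[t + 1], thetas[t]) - energy(xs[t], thetas[t])
        # parameters change at fixed variables: work term
        W += energy(xs[t + 1], thetas[t + 1]) - energy(xs[t + 1], thetas[t])
    return W, Q

# Synthetic trajectory (assumed): 10 steps of a 5-variable model
rng = np.random.default_rng(0)
xs = rng.choice([-1.0, 1.0], size=(11, 5))
thetas = rng.standard_normal((11, 5, 5))

W, Q = work_and_heat(xs, thetas)
dE = energy(xs[-1], thetas[-1]) - energy(xs[0], thetas[0])
print(W, Q, W + Q, dE)
```

The two sums telescope, so W + Q always equals the total change in energy between the first and last step, which is just the first law of thermodynamics for this protocol.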

Recall that the second law of thermodynamics states that:

\langle W \rangle \geq \Delta F

Where F = E - TS is the free energy (energy minus temperature times entropy).

The average work required for a mechanical process is greater than or equal to the change in free energy of the system.  Equality is achieved only when the process is done quasi-statically so that it is at thermal equilibrium during every step in the process. The precise meaning of this is that the rate r at which the controlled variables are changed is much slower than the relaxation rate of the system, r \ll 1/\tau, where \tau is the relaxation time.  For such a quasi-static process no work is dissipated and thus the total change in entropy is zero.  For any non-equilibrium process performed faster than this some of the input work will be dissipated and there will be a net increase in entropy.  The more conventional formulation of the second law states merely that the change in entropy is non-negative.

Note also that the free energy change is closely related to the intractable partition function Z(\theta), since F = -\frac{1}{\beta} \log Z:

-\beta \Delta F = \log \frac{Z(\theta_f)}{Z(\theta_0)}

Thus if we could devise a method for estimating the free energy change between \theta_0 and \theta_f then we could estimate the log ratio of partition functions (a critical task for model comparison).

If all that could be said about non-equilibrium processes was that the work was lower bounded by the free energy change then there would not be much to discuss here – but there are wonderful results which relate the \Delta F between two equilibrium states to the work done by a non-equilibrium process connecting the two.

The first applies to near-equilibrium processes where the work W can be expected to have a Gaussian distribution

\langle W \rangle = \Delta F + \beta \sigma^2/2

This is called the “Fluctuation-Dissipation estimator”.  It comes from the fluctuation-dissipation theorem of linear-response theory.  It states that the amount of work dissipated in a non-equilibrium process, \langle W \rangle - \Delta F, is set by the variance \sigma^2 of the fluctuations of that quantity.
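One way to see where the Gaussian estimator comes from is to combine it with the Jarzynski equality \langle \exp(-\beta W) \rangle = \exp(-\beta \Delta F). The short derivation below assumes only that W is exactly Gaussian with mean \langle W \rangle and variance \sigma^2, and uses the standard formula for the exponential average of a Gaussian variable.

```latex
\begin{align}
\langle e^{-\beta W} \rangle
  &= \exp\!\left( -\beta \langle W \rangle + \tfrac{1}{2} \beta^2 \sigma^2 \right)
  && \text{(Gaussian } W\text{)} \\
e^{-\beta \Delta F}
  &= \exp\!\left( -\beta \langle W \rangle + \tfrac{1}{2} \beta^2 \sigma^2 \right)
  && \text{(Jarzynski equality)} \\
\Delta F
  &= \langle W \rangle - \tfrac{1}{2} \beta \sigma^2
\end{align}
```

In units where \beta = 1 this reduces to \langle W \rangle = \Delta F + \sigma^2/2, so the dissipated work \langle W \rangle - \Delta F is half the variance of the work.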

This deep result connects the diffusion constant of a particle undergoing Brownian motion to the viscosity of the fluid it is embedded in – and the Johnson noise of an ohmic conductor to its resistance.

For processes where the system is perturbed far from equilibrium there is no reason to expect the fluctuations to be Gaussian.  In this case the Jarzynski equality gives an exact relation between the exponential average of the work and the free energy change

\langle \exp (-\beta W) \rangle = \exp (-\beta \Delta F)

This remarkable identity was proven in 1997 – there are now several proofs which cover different cases, and it has been proposed as a method for calculating the free energy change in experiments where a single RNA molecule is pulled apart by optical tweezers.

So what’s the point?  The utility of the Jarzynski equality to learning machines was recognized immediately – it provides a way to estimate the partition function of Markov random fields.  Neal has called it Annealed Importance Sampling (AIS).

We will discuss this method and its relationship to the JE, and other non-equilibrium work theorems in the next installment.

Also at some point I will write up a list of citations which relate the ideas described here to their primary sources – I’m being a bit lazy.


Physics Classics

March 2, 2009

[Image: edward_mills_purcell]

The Back of the Envelope was a monthly column during 1983-84 by E.M. Purcell in the American Journal of Physics.  Every month Purcell (who did much of the seminal work on Nuclear Magnetic Resonance in solutions, along with Pound, Bloch, and others) would propose a series of  3 questions to the reader and provide answers to the previous month’s questions.  The questions were mostly meant to be simple order of magnitude estimation problems which could literally be done on the back of an envelope. These types of problems are also often called Fermi problems after another famous connoisseur, Enrico Fermi. Examples:

  1. At room temperature in air how long could a pencil remain balanced on its point? At absolute zero?
  2. How fast can a 10 mg water droplet spin without falling apart? (ignoring aerodynamic forces)
  3. If the Library of Congress were printed in tungsten on a postcard would it be readable with a standard electron microscope?

[Image: out_swim_diffusion]

To get another taste of this style of doing physics, read the lecture “Life at Low Reynolds Number”, given in 1973, a time when biophysics was nowhere near as mainstream as it is today (with some exceptions, like Helmholtz).  Many observations made in this famous lecture foreshadowed future developments of the field.  Purcell discusses the special difficulties that E. coli experience in propelling themselves in an environment with no inertia, where viscosity dominates their motion. (It was widely believed that the bacteria vibrated their flagella instead of rotating them, because no one could conceive of how a bacterium could contain a rotary joint with a bearing, much less a rotary motor.)  Now, of course, there are many known examples of single-molecule motors – Actin/Myosin, Kinesin, etc. Interesting fact: for the fluid flow around a swimming human to have the same Reynolds number as a bacterium in water, the human would have to swim through molasses and restrict her stroke so that her limbs move no faster than 1 cm/s!
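In the back-of-the-envelope spirit, the Reynolds number Re = \rho v L / \mu can be estimated in a couple of lines. The sizes and speeds below are rough order-of-magnitude guesses of mine, not figures taken from the lecture.

```python
def reynolds_number(density, speed, length, viscosity):
    """Re = rho * v * L / mu, all quantities in SI units."""
    return density * speed * length / viscosity

WATER_DENSITY = 1000.0     # kg/m^3
WATER_VISCOSITY = 1.0e-3   # Pa*s, water near room temperature

# Assumed rough numbers: a 2 micron bacterium swimming at 30 microns per second
re_bacterium = reynolds_number(WATER_DENSITY, 30e-6, 2e-6, WATER_VISCOSITY)

# Assumed rough numbers: a 1.7 m human swimming at 1 m/s
re_swimmer = reynolds_number(WATER_DENSITY, 1.0, 1.7, WATER_VISCOSITY)

print(re_bacterium, re_swimmer)
```

With these guesses the two regimes differ by roughly ten orders of magnitude, which is why intuition built on human swimming fails so badly for bacteria.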

Another fine example of his work is found in the MIT radiation lab series – a thick >10 volume set which contained nearly everything that was then known on the subject of microwave engineering.  Along with many other authors, he gives a series of brilliant arguments for how the engineering concepts of impedance and reactance can be extended to high frequency circuits where the free space wavelength is of the same length scale as the circuit elements.

See also Relativistic Electromagnetism

Also highly recommended is the series of columns in AJP by Victor Weisskopf – “The Search for Simplicity”.


Term Frequencies in the Supreme Court Corpus

March 2, 2009

[Images: arms450, drugs450, religion450, slave450, speech450, the450, war450]


Non-Equilibrium Statistical Physics and Learning Machines I

March 2, 2009

Boltzmann Machines and Contrastive Divergence Learning

One formalization of the idea of “learning from data” is to take a rich statistical model such as Hinton and Sejnowski’s Boltzmann machine:

p(x;\theta) = \frac{1}{Z(\theta)}e^{-\beta E(x,\theta)}

and choose values of the parameters \theta for which the model will assign high probability to values of x that are like the data and low probability to values that are unlike the data.  One method of doing this is to maximize the log-likelihood of the data.  This estimator is called the maximum likelihood estimator and it has a theoretically desirable property called asymptotic efficiency.  This means that in the limit of a large sample it is unbiased and achieves the minimum variance possible for an unbiased estimator.  If we are given a data set \{x_i\} of N independent samples, the likelihood, the average log-likelihood, and its gradient are:

p(\{x_i\};\theta) = \prod_{i=1}^N \frac{1}{Z(\theta)}\exp[-\beta E(x_i,\theta)]

\langle \log p(x;\theta) \rangle_{data} = - \beta \langle E(x,\theta) \rangle_{data} - \log Z(\theta)

\langle \frac{d}{d\theta} \log p(x;\theta) \rangle_{data} = -\beta \langle \frac{d E(x,\theta)}{d\theta} \rangle_{data} - \frac{d}{d\theta} \log Z(\theta)

To evaluate the derivatives of \log Z(\theta) note that:

\frac{d}{d\theta} \log Z(\theta) = \frac{1}{Z(\theta)}\frac{d Z(\theta)}{d\theta}

\frac{d}{d\theta} \log Z(\theta) = \frac{-\beta}{Z(\theta)}\sum_{x} \frac{dE(x,\theta)}{d\theta} \exp[-\beta E(x,\theta)]

\frac{d}{d\theta} \log Z(\theta) = -\beta \langle \frac{d E(x,\theta)}{d\theta}\rangle_{model}

Thus the average gradient of the log-likelihood is:

\langle \frac{d}{d\theta} \log p(x;\theta) \rangle_{data} = -\beta \langle \frac{d E(x,\theta) }{d\theta}\rangle_{data} + \beta \langle \frac{d E(x,\theta)}{d\theta}\rangle_{model}

Now if we assume that the energy E(x,\theta) has a quadratic form E = x^T \cdot \theta \cdot x  we can evaluate the derivatives of the energy function and the gradient takes on an especially simple form:

\langle \frac{d}{d\theta_{ij}} \log p(x;\theta) \rangle_{data} = -\beta ( \langle x_i x_j \rangle_{data} - \langle x_i x_j \rangle_{model} )

Unfortunately, for many models the log-likelihood or even its gradient are impractical to compute because of the  normalizing constant (a.k.a. the partition function) which appears in the evaluation of the model averages \langle x_i x_j \rangle_{model}.  Z requires summing over all configurations of the model  variables – a set which grows exponentially with the number of variables.
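For a handful of variables the sum over configurations can still be done by brute force, which gives a useful reference for checking approximate methods. The sketch below assumes binary variables x_i = ±1 and the quadratic energy E = x^T \theta x; the function names and the random data are mine. It enumerates all 2^n states to compute the exact gradient.

```python
import itertools
import numpy as np

def all_states(n):
    """All 2^n configurations of n variables taking values -1 or +1."""
    return np.array(list(itertools.product([-1.0, 1.0], repeat=n)))

def energies(X, theta):
    """E(x, theta) = x^T theta x for every row x of X."""
    return np.einsum('si,ij,sj->s', X, theta, X)

def avg_log_lik(data, theta, beta=1.0):
    """Average log-likelihood: -beta <E>_data - log Z."""
    X = all_states(theta.shape[0])
    log_Z = np.log(np.sum(np.exp(-beta * energies(X, theta))))
    return -beta * energies(data, theta).mean() - log_Z

def exact_gradient(data, theta, beta=1.0):
    """-beta (<x_i x_j>_data - <x_i x_j>_model), model average by enumeration."""
    X = all_states(theta.shape[0])
    p = np.exp(-beta * energies(X, theta))
    p /= p.sum()
    corr_model = np.einsum('s,si,sj->ij', p, X, X)
    corr_data = data.T @ data / len(data)
    return -beta * (corr_data - corr_model)

# Synthetic data and parameters (assumed), small enough to enumerate: 2^4 = 16 states
rng = np.random.default_rng(0)
n = 4
data = rng.choice([-1.0, 1.0], size=(50, n))
theta = 0.1 * rng.standard_normal((n, n))
grad = exact_gradient(data, theta)
print(np.round(grad, 3))
```

The cost of `all_states` doubles with every added variable, which is exactly the exponential growth that makes this approach hopeless beyond a few dozen variables.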

Despite this difficulty practical solutions exist that make use of approximate methods.  One strategy is to choose a family of functions with a set of variational free parameters for which the likelihood can be evaluated  and use an optimization procedure to choose the function in this family which is closest to the desired density.

Another method is to replace the expectation over the model distribution which appears in the gradient of the log likelihood by an average over a finite sample.  This finite sample can be derived by the use of Markov-Chain Monte-Carlo methods (MCMC).  MCMC methods bring gradient descent of the log-likelihood into the world of the possible, but there are still many practical issues.  MCMC samples are only representative of the model distribution after the chain has been run for many steps – called the burn-in or relaxation time.  For many models of practical interest this relaxation time can be very long, especially if the model density is multi-modal.  One intuitive explanation for the slowness of MCMC is that the Markov chain explores the state space by a random walk, which takes a time of order d^2 to travel a distance d (the distance covered grows only as \sqrt{t}).

Fortunately, it turns out that effective results can be achieved even if the Markov chain is never allowed to relax to equilibrium.  Which brings us to the title of this post.  If the Markov chain is initialized at the data distribution and then run for N steps, the parameters will still be pushed towards values which reduce the difference between the expectations of the model and the data.  These methods follow the gradient of a different function called the contrastive divergence.  They can lead to learned parameters which perform reasonably well on unsupervised learning tasks.  This line of work has been developed by Hinton, Salakhutdinov and many others.

Although the published results with CD learning are impressive, closer analysis suggests that this type of learning is not a substitute for maximum likelihood learning.  If one takes the learned model parameters and then runs the Markov chain to equilibrium to draw samples from its equilibrium distribution, the samples don’t resemble the data. It has been proven that the minima of the contrastive divergence do not correspond to the maxima of the likelihood function, so CD is likely a biased estimate as well. (The bias is claimed to be small, and this is not to suggest that bias is always an evil to be avoided: especially in the computation of a gradient one might prefer an estimator which achieves lower variance than the MLE, by trading off a small amount of bias.)
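The diffusive scaling of a random walk is easy to see in simulation. This sketch uses a simple one-dimensional ±1 walk (an assumed stand-in for the much more complicated walk of a real Markov chain through state space) and measures the mean squared displacement after t steps.

```python
import numpy as np

rng = np.random.default_rng(0)
n_walkers, n_steps = 5000, 400

# each walker takes independent +1 / -1 steps
steps = rng.choice([-1, 1], size=(n_walkers, n_steps))
pos = np.cumsum(steps, axis=1)

# msd[t - 1] is the mean squared displacement after t steps
msd = np.mean(pos.astype(float) ** 2, axis=0)

print(msd[99], msd[399])   # grows linearly in t, so the typical distance grows as sqrt(t)
```

Quadrupling the number of steps only doubles the typical distance travelled, so crossing between well-separated modes by diffusion alone is slow.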

Improvements on CD Learning – A mechanical analogy

An improvement on CD learning which was recently proposed in a paper by Tijmen Tieleman is to allow the Markov chain to retain its state between gradient descent iterations instead of resetting to the data distribution.  If the learning rate or step size of the gradient descent is sufficiently small then the change in the equilibrium distribution of the Markov chain can be expected to be small on each iteration, and the Markov chain can relax to equilibrium during the learning protocol.  This procedure has been called persistent contrastive divergence (PCD), and it is much more physically plausible.  The first experiments with this method have shown that its performance is better than CD learning in the long run, although CD learning can perform better for short learning protocols.

This PCD iterative stochastic gradient descent learning procedure has a simple physical interpretation.  The model parameters behave like a Brownian particle of mass m, driven by a position-dependent stochastic force, with viscous damping \beta and weight decay \gamma.  Thus, we can write the following Langevin equation for the model parameters:

m\ddot{\theta}_{ij} = \tilde{F}_{ij}(\theta_{ij}) -\beta\dot{\theta}_{ij} - \gamma|\theta_{ij}|

\tilde{F}_{ij} is the stochastic force on the parameter \theta_{ij}. It is equal to the difference between the pair correlation functions with respect to the data and model distributions.

\tilde{F}_{ij} = \langle x_i x_j \rangle_{data} - \langle x_i x_j \rangle_{model}

Just as with contrastive divergence we only run the Markov chain for a few steps per iteration, but we don’t reset it to the data distribution. When \theta_{ij} has reached the value for which the model pair correlation functions match the data pair correlation functions, the stochastic force \tilde{F} vanishes on average.  There are, of course, still fluctuations about this average.
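Here is a minimal sketch of the PCD loop for a small fully visible model. Several simplifications are my own assumptions: variables are ±1, the model is p(x) \propto \exp(x^T \theta x) with symmetric \theta and zero diagonal (so that the log-likelihood gradient is exactly the force \tilde{F}_{ij}), the chain is a Gibbs sampler, and the update is plain overdamped gradient ascent with no mass term and no weight decay. The data are synthetic.

```python
import numpy as np

def gibbs_sweep(X, theta, rng):
    """One Gibbs sweep over all units, for every chain (row of X), in place."""
    for i in range(X.shape[1]):
        field = X @ theta[i]                        # sum_j theta_ij x_j (theta_ii = 0)
        p_up = 1.0 / (1.0 + np.exp(-4.0 * field))   # p(x_i = +1 | rest)
        X[:, i] = np.where(rng.random(X.shape[0]) < p_up, 1.0, -1.0)
    return X

def pcd_train(data, n_iter=1000, lr=0.02, n_chains=200, seed=0):
    rng = np.random.default_rng(seed)
    n = data.shape[1]
    theta = np.zeros((n, n))
    chains = rng.choice([-1.0, 1.0], size=(n_chains, n))   # persistent chains
    corr_data = data.T @ data / len(data)
    for _ in range(n_iter):
        chains = gibbs_sweep(chains, theta, rng)    # one sweep, never reset to the data
        corr_model = chains.T @ chains / n_chains
        force = corr_data - corr_model              # stochastic force F_ij
        np.fill_diagonal(force, 0.0)
        theta += lr * force
    return theta, chains

# Synthetic data (assumed): x1 copies x0 with probability 0.9, x2 is independent
rng = np.random.default_rng(1)
x0 = rng.choice([-1.0, 1.0], size=2000)
x1 = np.where(rng.random(2000) < 0.9, x0, -x0)
x2 = rng.choice([-1.0, 1.0], size=2000)
data = np.stack([x0, x1, x2], axis=1)

theta, chains = pcd_train(data)
corr_data = data.T @ data / len(data)
corr_model = chains.T @ chains / len(chains)
print(np.round(theta, 2))
print(np.round(corr_data, 2), np.round(corr_model, 2))
```

Once the coupling between the two correlated units has grown large enough for the chains to reproduce the data correlation, the force on it averages to zero and the parameter just jitters around that value.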

This mechanical interpretation of the learning process is very suggestive of other questions – how much mechanical work is done on the parameters by the stochastic force?  How much heat is dissipated?  How much entropy is produced?


The Alfa-Romeo Principle

January 7, 2009

Never adjust more than one thing at a time or it will be impossible to tell which adjustment produced what result.

p159 Alfa Romeo Shop Manual (1969)


Requiem for a Turbomolecular Pump

December 6, 2008

A true veteran to the cause after 20 years crashed and burned at 50,000 rpm.  Rest in peace, you served us well. You died with your boots on.

[Images: turbo12, turbo21]
