RESEARCH LOG

Python traits gadget: lowbow smoother for the tribet

November 27, 2008

[Screenshot: the lowbow smoother program]

Consider a language with three words '0', '1', and '2'. We will call this language the 'tribet'. We can represent each of these three words as a unit vector in a three-dimensional vector space:

'0' = (1,0,0)

'1' = (0,1,0)

'2' = (0,0,1)

The bag of words representation of a document takes a sequence of words like '112012011' and represents it as a vector in which each component is the relative frequency with which the corresponding word occurs in the sequence.

In the 9-word sequence '112012011', '0' occurs 2 times, '1' occurs 5 times, and '2' occurs 2 times, so its bag of words vector is (2/9, 5/9, 2/9).
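As a quick check of the count above, here is a minimal Python sketch. The names `VOCAB` and `bag_of_words` are illustrative choices and are not taken from the program described in this post.

```python
from collections import Counter
from fractions import Fraction

VOCAB = "012"  # the three tribet words (illustrative constant)


def bag_of_words(doc, vocab=VOCAB):
    """Return the relative frequency of each vocabulary word in doc."""
    counts = Counter(doc)
    n = len(doc)
    # Fractions keep the result exact, so it can be compared with 2/9, 5/9, 2/9.
    return tuple(Fraction(counts[w], n) for w in vocab)


bow = bag_of_words("112012011")
print(bow)  # equals (2/9, 5/9, 2/9) as exact fractions
```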

If we convert the original sequence '112012011' to the sequence of unit vectors e_1, e_1, e_2, e_0, e_1, e_2, e_0, e_1, e_1, where e_k is the unit vector for word 'k', then the bag of words representation is just the vector average of the N word vectors w_i:

b = \frac{1}{N} \sum_{i=1}^{N} w_i
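The same average can be computed directly from the unit vectors. The sketch below uses plain tuples; `one_hot` and `vector_average` are hypothetical helper names introduced only for illustration.

```python
def one_hot(word, vocab="012"):
    """Unit vector for a word: '0' -> (1,0,0), '1' -> (0,1,0), '2' -> (0,0,1)."""
    return tuple(1.0 if w == word else 0.0 for w in vocab)


def vector_average(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return tuple(sum(column) / n for column in zip(*vectors))


# The document as a walk on the unit vectors e_1, e_1, e_2, e_0, ...
walk = [one_hot(w) for w in "112012011"]
b = vector_average(walk)
print(b)  # approximately (2/9, 5/9, 2/9)
```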

Note that for a bag of words vector each of the components must be nonnegative and they must sum to one. This means that the vector lies in the triangle with vertices (1,0,0), (0,1,0), (0,0,1), which is the 2-simplex. Thus each document corresponds to a point in this simplex.

This representation ignores the sequential information in the document. The bag of words representation is invariant to any permutation of the order of the words in the document, because the coordinates in the simplex depend only on the number of times that each word occurs.
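A small sketch of this invariance, assuming a simple counting implementation (the helper name `bag_of_words` is illustrative): shuffling the words changes the sequence but not the resulting vector.

```python
import random
from collections import Counter


def bag_of_words(doc, vocab="012"):
    """Relative frequency of each vocabulary word in doc."""
    counts = Counter(doc)
    return tuple(counts[w] / len(doc) for w in vocab)


doc = "112012011"

# Shuffle the word order with a fixed seed so the run is repeatable.
words = list(doc)
random.Random(0).shuffle(words)
shuffled = "".join(words)

# The counts are unchanged by the shuffle, so the vectors are identical.
same = bag_of_words(shuffled) == bag_of_words(doc)
print(shuffled, same)
```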

If one returns to the original word sequence w_i, where each word is represented by its unit vector, then we have a discrete walk on the corners of the 2-simplex. If we convert this word sequence from a discrete to a continuous function of time and smooth it by convolving with a Gaussian kernel of variable scale \sigma, then the walk becomes a smooth curve in the simplex. In the limit where \sigma becomes large relative to the length of the document, the curve collapses to the point corresponding to the bag of words representation.
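The following is a rough sketch of such a smoothed curve, not the program described in this post and not necessarily the exact kernel used in the paper. It assumes word i sits at position (i + 0.5)/N in [0, 1] and renormalizes the Gaussian weights at each sample time so that every point stays in the simplex; `lowbow_point` and `lowbow_curve` are hypothetical names.

```python
import math

VOCAB = "012"


def lowbow_point(doc, t, sigma, vocab=VOCAB):
    """Gaussian-weighted word histogram centred at time t in [0, 1]."""
    n = len(doc)
    # Assumption: word i is placed at the centre of its slot in [0, 1].
    positions = [(i + 0.5) / n for i in range(n)]
    weights = [math.exp(-0.5 * ((t - p) / sigma) ** 2) for p in positions]
    total = sum(weights)
    point = [0.0] * len(vocab)
    for word, w in zip(doc, weights):
        # Renormalizing by the total keeps the components summing to one.
        point[vocab.index(word)] += w / total
    return tuple(point)


def lowbow_curve(doc, sigma, samples=50):
    """Sample the smoothed curve at evenly spaced times in [0, 1]."""
    return [lowbow_point(doc, k / (samples - 1), sigma) for k in range(samples)]


doc = "112012011"
sharp = lowbow_curve(doc, sigma=0.02)   # stays close to the corners of the simplex
flat = lowbow_curve(doc, sigma=100.0)   # every sample is close to (2/9, 5/9, 2/9)
```

With a small \sigma the curve starts near the corner for '1', the first word of the document; with a \sigma much larger than the document length all weights are nearly equal and each sample approaches the bag of words point.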

This technique was introduced by Lebanon, Mao, and Dillon in a series of papers, including:

G. Lebanon, Y. Mao, and J. Dillon. The Locally Weighted Bag of Words Framework for Document Representation. Journal of Machine Learning Research 8(Oct):2405-2441, 2007.

The program above computes these lowbow curves by convolving with a Gaussian kernel. It has the option of interactively varying the scale of the smoothing kernel, which helps visualize how the curves collapse to a point.

It is implemented using Enthought's Traits and TraitsUI. These two libraries allow one to attach type information to Python object attributes. This type information automates lots of tedious boilerplate like initialization and validation of values, and it also provides a nice automatic implementation of the observer pattern. TraitsUI uses the Traits type information to automatically generate interfaces for objects. It is a powerful MVC GUI library. The plotting is done with Chaco, which is nicely integrated with TraitsGUI and provides better features for making interactive plots than matplotlib. These libraries are wonderful tools for developing scientific programs in Python. Eventually I plan to open the source for these programs.
