Different NLP approaches

(Written by Lawrence Krubner; however, indented passages are often quotes.) You can contact Lawrence at lawrence@krubner.com, or follow him on Twitter.

A nice summary:

With LSA, each document is transformed into a single vector whose length is the size of the vocabulary, that is, the number of unique words across all documents. If a word is present in a document, its entry in the vector is 1; otherwise it is 0. After this transformation, the text becomes a D by V matrix, where D is the number of documents and V is the vocabulary size.
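As a rough illustration of the D by V matrix described above, here is a minimal sketch using NumPy. The toy corpus, the whitespace tokenization, and the names `docs`, `vocab`, `index`, and `doc_term` are all assumptions made for this example, not part of the quoted summary.

```python
import numpy as np

# Toy corpus; the documents are made up for illustration.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]

# Vocabulary: the unique words across all documents, in sorted order.
# Splitting on whitespace is a simplifying assumption.
vocab = sorted({word for doc in docs for word in doc.split()})
index = {word: j for j, word in enumerate(vocab)}

# D by V matrix: entry (i, j) is 1 if word j occurs in document i, else 0.
doc_term = np.zeros((len(docs), len(vocab)), dtype=int)
for i, doc in enumerate(docs):
    for word in doc.split():
        doc_term[i, index[word]] = 1

print(doc_term.shape)  # (number of documents, vocabulary size)
```

Each row is one document and each column is one vocabulary word, so a repeated word such as "the" still contributes a single 1 to its row.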

GloVe (and, by equivalence, word2vec), however, works on a different matrix. In this algorithm, the matrix is V by V, where V is the vocabulary size, and each cell counts the number of times one word appears next to another word in a document. That is, the matrix records how often words are neighbors across all documents. (There are some other technicalities, but this is the thrust of GloVe.)
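The V by V neighbor-count matrix can be sketched in a few lines. This is only an illustration of the counting idea: the toy corpus, the whitespace tokenization, and a window of exactly one word on each side (`window = 1`) are assumptions, and it leaves out the "other technicalities" of the real GloVe pipeline.

```python
import numpy as np

# Toy corpus; the documents are made up for illustration.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

vocab = sorted({word for doc in docs for word in doc.split()})
index = {word: j for j, word in enumerate(vocab)}

# Assumption: "next to" means directly adjacent (one word on each side).
window = 1

# V by V matrix: entry (a, b) counts how often word b is a neighbor of word a.
cooc = np.zeros((len(vocab), len(vocab)), dtype=int)
for doc in docs:
    words = doc.split()
    for pos, word in enumerate(words):
        lo = max(0, pos - window)
        hi = min(len(words), pos + window + 1)
        for other in range(lo, hi):
            if other != pos:
                cooc[index[word], index[words[other]]] += 1

print(cooc.shape)  # (vocabulary size, vocabulary size)
```

Because every neighboring pair is counted once from each side, the resulting matrix is symmetric.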

The similarity between LSA and GloVe is that once the respective matrix is built, it is followed by matrix factorization. By matrix factorization, I mean SVD. This is a process which takes an N by M matrix and returns an N by K matrix, essentially squeezing the matrix into fewer columns (K is a free parameter that is smaller than M).
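The squeezing step can be sketched with NumPy's `np.linalg.svd`. The random input matrix and the sizes `N`, `M`, and `K` below are placeholders chosen for illustration; scaling the kept columns by their singular values is one common convention, not necessarily the one every LSA or GloVe implementation uses.

```python
import numpy as np

# Placeholder input: a random N by M matrix standing in for either
# the D by V document matrix or the V by V neighbor-count matrix.
rng = np.random.default_rng(0)
N, M, K = 6, 5, 2
X = rng.random((N, M))

# Thin SVD: X = U @ diag(s) @ Vt, with singular values in s sorted
# from largest to smallest.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Keep only the K largest singular values and their vectors.
# Broadcasting multiplies column k of U by s[k], giving an N by K matrix.
reduced = U[:, :K] * s[:K]

# Multiplying back gives the rank-K approximation of the original matrix.
approx = reduced @ Vt[:K, :]

print(reduced.shape)  # (N, K)
print(approx.shape)   # (N, M)
```

Here `reduced` plays the role of the N by K matrix in the text: the same rows as the input, but only K columns.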

So ultimately, what we get with LSA is a D by K matrix (where K is just an arbitrary number, like 100). The interpretation of this matrix is that each row represents a document and each column represents a topic. In GloVe, we get a V by K matrix where each row represents a word and each column represents one dimension of that word's embedding.
