Natural language processing (NLP) is a messy and difficult affair

(written by lawrence krubner, though indented passages are often quotes). You can contact lawrence at: lawrence@krubner.com, or follow me on Twitter.

Interesting:

Similar words are nearby vectors in a vector space. This is a powerful convention since it lets us wipe away a lot of the noise and nuance in vocabulary. For example, let's use gensim to find a list of words similar to vacation using the freebase skipgram data:

from gensim.models import Word2Vec

# The freebase vectors ship as a binary word2vec file, so binary=True is required.
fn = "freebase-vectors-skipgram1000-en.bin.gz"
model = Word2Vec.load_word2vec_format(fn, binary=True)
model.most_similar('vacation')

# [('trip', 0.7234684228897095),
#  ('honeymoon', 0.6447688341140747),
#  ('beach', 0.6249285936355591),
#  ('vacations', 0.5868890285491943),
#  ('wedding', 0.5541957020759583),
#  ('resort', 0.5231006145477295),
#  ('traveling', 0.5194448232650757),
#  ('vacation.', 0.5068142414093018),
#  ('vacationing', 0.5013546943664551)]

We’ve calculated the vectors most similar to the vector for vacation, and then looked up what words those vectors represent. As we read the list, we note that these words aren’t just similar in vector space, but that they make sense intuitively too.
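
Under the hood, that lookup is essentially a cosine-similarity ranking: unit-normalize every word vector, take the dot product of each one with the query vector, and return the highest-scoring words. Here is a minimal sketch of the idea in plain numpy; the toy vectors dict and its three words are hypothetical stand-ins for the model's learned embeddings, not the actual freebase data:

import numpy as np

# Hypothetical toy embeddings standing in for the model's learned vectors.
vectors = {
    "vacation": np.array([0.9, 0.1, 0.3]),
    "trip":     np.array([0.8, 0.2, 0.4]),
    "banana":   np.array([0.1, 0.9, 0.2]),
}

def most_similar(query, vectors, topn=2):
    """Rank words by cosine similarity to the query word's vector."""
    q = vectors[query]
    q = q / np.linalg.norm(q)  # unit-normalize the query vector
    scores = []
    for word, vec in vectors.items():
        if word == query:
            continue  # skip the query word itself
        sim = float(np.dot(q, vec / np.linalg.norm(vec)))  # cosine similarity
        scores.append((word, sim))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

print(most_similar("vacation", vectors))
# e.g. [('trip', 0.98...), ('banana', 0.27...)]

The toy example behaves the way the real model does: the vector that points in nearly the same direction as "vacation" scores near 1.0, and the unrelated one scores much lower.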

Post external references

  1. http://multithreaded.stitchfix.com/blog/2015/03/11/word-is-worth-a-thousand-vectors/