March 5th, 2019
In Technology
If you enjoy this article, see the other most popular articles
An overview of different approaches to NLP word embeddings
(Written by Lawrence Krubner, though indented passages are often quotes.) You can contact Lawrence at lawrence@krubner.com, or follow me on Twitter.
It’s interesting that the simple approaches do fairly well:
Smooth Inverse Frequency
Taking the average of the word embeddings in a sentence tends to give too much weight to words that are quite irrelevant, semantically speaking. Smooth Inverse Frequency tries to solve this problem in two ways:
Weighting: like our tf-idf baseline above, SIF takes the weighted average of the word embeddings in the sentence. Every word embedding is weighted by a/(a + p(w)), where a is a parameter that is typically set to 0.001 and p(w) is the estimated frequency of the word in a reference corpus.
Common component removal: next, SIF computes the first principal component of the resulting embeddings for a set of sentences. It then subtracts from each sentence embedding its projection onto this first principal component. This should remove variation related to frequency and syntax that is less relevant semantically.
As a result, SIF downgrades unimportant words such as but, just, etc., and keeps the information that contributes most to the semantics of the sentence.
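The two steps above are easy to express directly. Below is a minimal sketch in Python/NumPy, not a reference implementation: it assumes you already have pre-trained word vectors (e.g. GloVe) in a dict and unigram probabilities p(w) estimated from a reference corpus; the names word_vectors and word_probs are placeholders for whatever you load them into.

```python
import numpy as np

def sif_embeddings(sentences, word_vectors, word_probs, a=1e-3):
    """SIF sketch: weighted average of word vectors, then removal of the
    first principal component across the whole set of sentences.

    sentences    -- list of tokenized sentences (lists of words)
    word_vectors -- dict: word -> np.ndarray of shape (d,)   [assumed input]
    word_probs   -- dict: word -> estimated frequency p(w)   [assumed input]
    a            -- smoothing parameter, typically 0.001
    """
    dim = len(next(iter(word_vectors.values())))
    rows = []
    # Step 1: weight each word vector by a / (a + p(w)) and average.
    # Frequent words (large p(w)) get small weights; rare words get
    # weights close to 1.
    for sent in sentences:
        vecs = [(a / (a + word_probs.get(w, 0.0))) * word_vectors[w]
                for w in sent if w in word_vectors]
        rows.append(np.mean(vecs, axis=0) if vecs else np.zeros(dim))
    X = np.vstack(rows)

    # Step 2: common component removal. Take the first right singular
    # vector of the (uncentered) embedding matrix and subtract each
    # sentence's projection onto it.
    u = np.linalg.svd(X, full_matrices=False)[2][0]
    return X - np.outer(X @ u, u)
```

Note that step 2 is computed over a whole set of sentences at once, so the direction that gets removed depends on the batch; in practice the principal component is usually estimated once on a reasonably large collection of sentences and then reused.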
Post external references:

1. http://nlp.town/blog/sentence-similarity/