NLP as a simple four-step formula: embed, encode, attend, predict

(written by lawrence krubner, however indented passages are often quotes). You can contact lawrence at:, or follow me on Twitter.


Let’s say you have a great technique for predicting a class ID from a dense vector of real values. You’re convinced you can solve any problem that has that particular input/output “shape”. Separately, you also have a great technique for predicting a single vector from a vector and a matrix. Now you have the solutions to three problems, not two. If you start with a matrix and a vector, and you need a class ID, you can obviously compose the two techniques. Most NLP problems can be reduced to machine learning problems that take one or more texts as input. If we can transform these texts to vectors, we can reuse general purpose deep learning solutions. Here’s how to do that.

Embedded word representations, also known as “word vectors”, are now one of the most widely used natural language processing technologies. Word embeddings let you treat individual words as related units of meaning, rather than entirely distinct IDs. However, most NLP problems require understanding of longer spans of text, not just individual words. There’s now a simple and flexible solution that is achieving excellent performance on a wide range of problems. After embedding the text into a sequence of vectors, bidirectional RNNs are used to encode the vectors into a sentence matrix. The rows of this matrix can be understood as token vectors — they are sensitive to the sentential context of the token. The final piece of the puzzle is called an attention mechanism. This lets you reduce the sentence matrix down to a sentence vector, ready for prediction. Here’s how it all works.

…I think bidirectional RNNs will be one of those insights that come to feel obvious with time. However, the most direct application of an RNN is to read the text and predict something from it. What we’re doing here is instead computing an intermediate representation — specifically, per token features. Crucially, the representation we’re getting back represents the tokens in context. We can learn that the phrase “pick up” has a different meaning from the phrase “pick on”, even if we processed the two phrases into separate tokens. This has always been a huge weakness of NLP models. Now we have a solution.

…By reducing the matrix to a vector, you’re necessarily losing information. That’s why the context vector is crucial: it tells you which information to discard, so that the “summary” vector is tailored to the network consuming it. Recent research has shown that the attention mechanism is a flexible technique, and new variations of it can be used to create elegant and powerful solutions.

Post external references

  1. 1