biLSTMs will solve all of your NLP problems

(written by lawrence krubner, however indented passages are often quotes). You can contact lawrence at:, or follow me on Twitter.


Using pre-trained models

We have several different English language pre-trained biLMs available for use. Each model is specified with two separate files, a JSON formatted “options” file with hyperparameters and a hdf5 formatted file with the model weights. Links to the pre-trained models are available here.

There are three ways to integrate ELMo representations into a downstream task, depending on your use case.

Compute representations on the fly from raw text using character input. This is the most general method and will handle any input text. It is also the most computationally expensive.

Precompute and cache the context independent token representations, then compute context dependent representations using the biLSTMs for input data. This method is less computationally expensive then #1, but is only applicable with a fixed, prescribed vocabulary.

Precompute the representations for your entire dataset and save to a file.

We have used all of these methods in the past for various use cases. #1 is necessary for evaluating at test time on unseen data (e.g. public SQuAD leaderboard). #2 is a good compromise for large datasets where the size of the file in #3 is unfeasible (SNLI, SQuAD). #3 is a good choice for smaller datasets or in cases where you’d like to use ELMo in other frameworks.

In all cases, the process roughly follows the same steps. First, create a Batcher (or TokenBatcher for #2) to translate tokenized strings to numpy arrays of character (or token) ids. Then, load the pretrained ELMo model (class BidirectionalLanguageModel). Finally, for steps #1 and #2 use weight_layers to compute the final ELMo representations. For #3, use BidirectionalLanguageModel to write all the intermediate layers to a file.

Post external references

  1. 1