The new AI NLP: Long form question answering

(Written by Lawrence Krubner; indented passages are often quotes.) You can contact Lawrence at:, or follow me on Twitter.

My friend Juan Gomez points me to this repository, which offers some insights into some of the libraries on Hugging Face. (Because Hugging Face is hopelessly disorganized.)

I found this gem, which I think is interesting. It looks specifically at building an LLM that can research your question and write a unique report just for you.

Now, wouldn’t it be great if your computer could do all of that for you: gather the right sources (e.g. paragraphs from relevant Wikipedia pages), synthesize the information, and write up an easy-to-read, original summary of the relevant points? Such a system isn’t quite available yet, at least not one that can provide reliable information in its summary. Even though current systems excel at finding an extractive span that answers a factoid question in a given document, they still find it challenging to operate in open-domain settings, where a model needs to find its own sources of information, and to generate long answers.

Thankfully, a number of recent advances in natural language understanding and generation have made working toward solving this problem much easier! These advances include progress in the pre-training (e.g. BART, T5) and evaluation (e.g. for factuality) of sequence-to-sequence models for conditional text generation, new ways to use language understanding models to find information in Wikipedia (e.g. REALM, DPR), and a new training dataset introduced in the paper ELI5: Long Form Question Answering.

The ELI5 dataset was built by gathering questions asked by community members of the r/explainlikeimfive subreddit, along with the answers provided by other users. The rules of the subreddit make this data particularly well suited to training a model for abstractive question answering: the questions need to seek an objective explanation of well-established facts, and the answers provided need to be understandable to a layperson without any particular domain knowledge.
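To make that concrete, here is a minimal sketch of turning one ELI5-style record into (input, target) training pairs. The field names (`title`, `answers`) follow the dataset's layout on Hugging Face, but treat them, along with the `make_qa_pairs` helper and the length threshold, as illustrative assumptions rather than the notebook's actual code:

```python
# Sketch: turning an ELI5-style record into (input, target) training pairs.
# Field names ("title", "answers") are assumptions about the dataset schema;
# check the actual ELI5 dataset card before relying on them.

def make_qa_pairs(record, min_answer_words=20):
    """Pair a question with each sufficiently long answer."""
    question = record["title"]
    pairs = []
    for answer in record["answers"]["text"]:
        # Very short replies are unlikely to be real explanations,
        # so filter them out before training.
        if len(answer.split()) >= min_answer_words:
            pairs.append(("question: " + question, answer))
    return pairs

example = {
    "title": "Why is the sky blue?",
    "answers": {"text": [
        "Too short.",
        "Sunlight scatters off air molecules, and shorter (bluer) "
        "wavelengths scatter far more strongly than longer (redder) ones, "
        "so the light reaching your eye from the sky is mostly blue.",
    ]},
}

pairs = make_qa_pairs(example)
# Only the long answer survives the length filter.
```

Each surviving pair can then be fed to a sequence-to-sequence model as a (source, target) example.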

In this notebook, we show how we can take advantage of these recent advances to train a long form question answering system which takes in a question, fetches 10 relevant passages from a Wikipedia snapshot, and writes a multi-sentence answer based on the question and retrieved passages. In particular, training embedding-based retrieval models to gather supporting evidence for open-domain questions is a relatively new research area: the last few months have seen some significant progress in cases where direct supervision is available, or with extensive task-specific pretraining. Here, we show how the ELI5 dataset allows us to train a dense retrieval system without access to either, making dense retrieval models more accessible. See this presentation from the Hugging Face reading group for a non-exhaustive overview of recent work in the field.
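The retrieve-then-generate loop described above can be sketched with toy NumPy vectors standing in for the dense question and passage embeddings a trained encoder (e.g. a DPR-style model) would produce. The function names, the `<P>` separator, and the `question: ... context: ...` input format are illustrative assumptions, not the notebook's actual API:

```python
# Sketch of dense retrieval followed by seq2seq input construction.
# Toy 2-d vectors stand in for real learned embeddings.
import numpy as np

def top_k_passages(question_vec, passage_vecs, passages, k=10):
    """Score each passage by inner product with the question and keep the k best."""
    scores = passage_vecs @ question_vec
    best = np.argsort(scores)[::-1][:k]
    return [passages[i] for i in best]

def build_seq2seq_input(question, retrieved):
    """Concatenate question and retrieved evidence into one conditioning string."""
    context = " <P> ".join(retrieved)
    return f"question: {question} context: {context}"

passages = ["Sky scattering ...", "Ocean currents ...", "Rayleigh scattering ..."]
passage_vecs = np.array([[0.9, 0.1], [0.0, 1.0], [0.8, 0.3]])
question_vec = np.array([1.0, 0.0])

retrieved = top_k_passages(question_vec, passage_vecs, passages, k=2)
model_input = build_seq2seq_input("Why is the sky blue?", retrieved)
# The string in model_input is what a BART/T5-style generator would condition on.
```

In the real system, the passage index would cover millions of Wikipedia snippets and the dot-product search would use an approximate nearest-neighbor library, but the shape of the computation is the same.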
