What is the biggest open problem in NLP?

(written by lawrence krubner, however indented passages are often quotes). You can contact lawrence at: lawrence@krubner.com


Natural language understanding

I think the biggest open problems are all related to natural language understanding. [...] we should develop systems that read and understand text the way a person does, by forming a representation of the world of the text, with the agents, objects, settings, and the relationships, goals, desires, and beliefs of the agents, and everything else that humans create to understand a piece of text. Until we can do that, all of our progress is in improving our systems’ ability to do pattern matching.

– Kevin Gimpel

Many experts in our survey argued that the problem of natural language understanding (NLU) is central as it is a prerequisite for many tasks such as natural language generation (NLG). The consensus was that none of our current models exhibit ‘real’ understanding of natural language.

…Cross-lingual representations Stephan remarked that not enough people are working on low-resource languages. There are 1,250-2,100 languages in Africa alone, most of which have received scarce attention from the NLP community. The question of specialized tools also depends on the NLP task that is being tackled. The main issue with current models is sample efficiency. Cross-lingual word embeddings are sample-efficient as they only require word translation pairs or even only monolingual data. They align word embedding spaces sufficiently well to do coarse-grained tasks like topic classification, but don’t allow for more fine-grained tasks such as machine translation. Recent efforts nevertheless show that these embeddings form an important building lock for unsupervised machine translation.

Is this true? The majority? I would have guessed that the majority of the human race spoke one of just 5 languages: Hindi, Arabic, Spanish, English, and Mandarin. While I’m all in favor of doing more to save small languages, I can’t believe they cover the majority of all humans.

Given the potential impact, building systems for low-resource languages is in fact one of the most important areas to work on. While one low-resource language may not have a lot of data, there is a long tail of low-resource languages; most people on this planet in fact speak a language that is in the low-resource regime. We thus really need to find a way to get our systems to work in this setting.