What is the equivalent of unit tests for large language models?

(Written by Lawrence Krubner; indented passages are often quotes.) You can contact me at lawrence@krubner.com, or follow me on Twitter.

A big topic of discussion is how to test AI models, especially when making changes. Or, to put that differently, what is the equivalent of unit tests when you’re working with large language models? If you have a startup that is built around one or more LLMs, and you are constantly tinkering with the parameters for those LLMs, how do you automate tests to protect yourself against regressions?

I feel like we are only beginning to see the emergence of “best practices.” One idea is to give your LLM a multiple-choice test and ask the same questions several times, but with the order of the answers changed, to see if the LLM gives the correct answer every time, regardless of where that answer appears in the list. That technique is used here:


This is worth looking at to get ideas on how to do better testing of LLMs.
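
To make the shuffled-answers idea concrete, here is a minimal sketch in Python. It is my own illustration, not the code from the project mentioned above. Every name in it (`build_prompt`, `consistency_score`, `ask_model`, and the two stub models) is hypothetical. `ask_model` stands in for whatever function sends a prompt to your LLM and returns its reply as a string, and I am assuming the model answers with a single letter.

```python
import random
from typing import Callable, Sequence

LETTERS = "ABCDEFGH"  # supports up to 8 answer choices


def build_prompt(question: str, choices: Sequence[str]) -> str:
    """Format a question and its choices as a lettered multiple-choice prompt."""
    lines = [question]
    for letter, choice in zip(LETTERS, choices):
        lines.append(f"{letter}. {choice}")
    lines.append("Answer with a single letter.")
    return "\n".join(lines)


def consistency_score(
    question: str,
    choices: Sequence[str],
    correct: str,
    ask_model: Callable[[str], str],  # hypothetical: prompt in, reply text out
    trials: int = 5,
    seed: int = 0,
) -> float:
    """Ask the same question `trials` times with the choices reshuffled each
    time, and return the fraction of trials where the model picked `correct`."""
    rng = random.Random(seed)  # fixed seed so the test is repeatable
    hits = 0
    for _ in range(trials):
        shuffled = list(choices)
        rng.shuffle(shuffled)
        reply = ask_model(build_prompt(question, shuffled)).strip().upper()
        # Map the first character of the reply back to a choice index.
        picked = LETTERS.find(reply[0]) if reply else -1
        if 0 <= picked < len(shuffled) and shuffled[picked] == correct:
            hits += 1
    return hits / trials


# Two stub "models" for demonstration only; a real test would call your LLM.
def oracle_model(prompt: str) -> str:
    """Always finds the line containing 'Paris' and returns its letter."""
    for line in prompt.splitlines():
        if line.endswith(". Paris"):
            return line[0]
    return "?"


def position_biased_model(prompt: str) -> str:
    """Always answers 'A', no matter what the choices say."""
    return "A"


QUESTION = "What is the capital of France?"
CHOICES = ["Paris", "Rome", "Madrid", "Berlin"]

if __name__ == "__main__":
    print(consistency_score(QUESTION, CHOICES, "Paris", oracle_model))
    print(consistency_score(QUESTION, CHOICES, "Paris", position_biased_model, trials=50))
```

The stub that always answers "A" is the kind of failure this test is meant to catch: it can look right on a single run if the correct answer happens to be listed first, but its score drops once the order changes. In a regression suite you might assert that the score stays above some threshold you choose after every change to your prompts or parameters.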
