We all test in production

(written by lawrence krubner, however indented passages are often quotes). You can contact lawrence at: lawrence@krubner.com, or follow me on Twitter.

All of this is good. I touched on a small piece of these issues when I wrote “How ignorant am I, and how do I formally specify that in my code?”

Also see “High Availability is not compatible with a MVP“.


When we say “production,” we typically mean the constellation of all these things and more. Despite our best efforts to abstract away such pesky low-level details as the firmware version on your eth0 card on an EC2 instance, I’m here to disappoint: You’ll still have to care about those things on unpredictable occasions.

And if testing is about uncertainty, you “test” any time you deploy to production. Every deploy, after all, is a unique and never-to-be-replicated combination of artifact, environment, infra, and time of day. By the time you’ve tested, it has changed.

Once you deploy, you aren’t testing code anymore, you’re testing systems—complex systems made up of users, code, environment, infrastructure, and a point in time. These systems have unpredictable interactions, lack any sane ordering, and develop emergent properties which perpetually and eternally defy your ability to deterministically test.


Let me tell you a story. In May 2019, we decided to upgrade Ubuntu for the entire Honeycomb production infrastructure. The Ubuntu 14.04 AMI was about to age out of support, and it hadn’t been systematically rolled out since I first set up our infra way back in 2015. We did all the responsible things: We tested it, we wrote a script, we rolled it out to staging and dogfood servers first. Then we decided to roll it out to prod. Things did not go as planned. (Do they ever?)

There was an issue with cron jobs running on the hour while the bootstrap was still running. (We make extensive use of auto-scaling groups, and our data storage nodes bootstrap from one another.) Turns out, we had only tested bootstrapping during 50 out of the 60 minutes of the hour. Naturally, most of the problems we saw were with our storage nodes, because of course they were.

There were issues with data expiring while rsync-ing over. Rsync panicked when it did not see metadata for segment files, and vice versa. There were issues around instrumentation, graceful restarts, and namespacing. The usual. But they are all great examples of responsibly testing in prod.

We did the appropriate amount of testing in a faux environment. We did as much as we could in nonproduction environments. We built in safeguards. We practiced observability-driven development. We added instrumentation so we could track progress and spot failures. And we rolled it out while watching it closely with our human eyes for any unfamiliar behavior or scary problems.

Could we have ironed out all the bugs before running it in prod? No. You can never, ever guarantee that you have ironed out all the bugs. We certainly could have spent a lot more time trying to increase our confidence that we had ironed out all possible bugs, but you quickly reach a point of fast-diminishing returns.

We’re a startup. Startups don’t tend to fail because they moved too fast. They tend to fail because they obsess over trivialities that don’t actually provide business value. It was important that we reach a reasonable level of confidence, handle errors, and have multiple levels of fail-safes (i.e., backups).