Why is 99% reliability a problem for Twitter?

(written by lawrence krubner, however indented passages are often quotes). You can contact lawrence at: lawrence@krubner.com, or follow me on Twitter.

I’m concerned that non-technical people might think that 99% reliability is very good, so I’m writing this to explain the real situation. This is relevant to any website or software, but I’m going to focus on Twitter.

I’ve been at early stage startups where we were stretched very thin and had to live with 99% reliability. We were embarrassed about it, but we could explain to potential customers “we are very early stage” and they would accept that, at least for a while. At Celelot we never even got to 99% reliability (I cover this in my book).

Still, at a mature software company, everyone agrees the goal should be something like 99.999999% reliability. That is “99” followed by six more 9s, eight nines in all. When I say “everyone” I mean the leadership, the application developers, the devops team, the product managers — everyone. And some companies go for even higher levels of reliability.

Does the Law Of Diminishing Returns apply here? Yes, very much. 99% reliability is surprisingly cheap: we often achieve that even at early stage startups, when we are stretched thin. By contrast, 99.999999% is very expensive, especially at large scale.

This is relevant to Twitter. Elon Musk just fired most of the engineers. Some people worry this means the company will collapse. But what does that actually mean? With the tech team now stretched so thin, reliability might fall to the level that you see in an early stage startup. In other words, reliability might fall from 99.999999% to 99% or even 98%.

Why is this bad? This is a complex topic, but I’m going to try to keep this simple. For the sake of simplicity, let’s assume that every single SOA service or microservice at Twitter declines to 99% reliability, and let’s assume the failures are randomly distributed. (In reality, the failures would probably be concentrated in a few services and would cluster around particular circumstances. But let’s keep this simple.)

How do you interact with Twitter? You can click on a like. You can retweet. You can write a new tweet. You might be reading a long thread where most of the tweets are hidden behind a “see more replies” link, so you click that link.

In short, you might have an interaction with Twitter every 15 seconds. If you spend an hour scrolling Twitter, that is roughly 240 interactions, and at 99% reliability you would expect, on average, 2 or 3 of them to fail. That is, you click “like” but the like is not recorded, or you try to write a tweet but it doesn’t post, or you click “see more replies” but no additional replies load. You keep clicking, and every so often the action you expected simply does not happen.
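To make the arithmetic concrete, here is a minimal sketch in Python, assuming (as above) one interaction every 15 seconds and independent, uniformly random failures. The numbers are illustrative, not measurements of Twitter:

    # A one-hour session, one interaction every 15 seconds, each failing
    # independently with probability 1% (i.e. 99% reliability).
    interactions_per_hour = 60 * 60 // 15   # 240
    failure_rate = 0.01

    expected_failures = interactions_per_hour * failure_rate
    print(f"{interactions_per_hour} interactions, ~{expected_failures:.1f} expected failures per hour")
    # prints: 240 interactions, ~2.4 expected failures per hour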

And given the variability of such a probability distribution, and given that across millions of daily sessions even unlikely outcomes happen all the time, some users would occasionally sit through a one hour session with as many as 10 or 15 bugs.
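How often would a session be much worse than average? Under the same simplified model (independent failures, 240 interactions per hour; again, toy numbers rather than real Twitter data), an exact binomial calculation gives a feel for the tail:

    from math import comb

    # P(at least k failures) for a session of n interactions, each failing
    # independently with probability p. Same toy model as above.
    def prob_at_least(k, n, p):
        return sum(comb(n, i) * (p ** i) * ((1 - p) ** (n - i)) for i in range(k, n + 1))

    n = 240  # interactions in a one-hour session
    for p in (0.01, 0.02, 0.03):
        print(f"failure rate {p:.0%}: P(10 or more bugs in an hour) = {prob_at_least(10, n, p):.3%}")

At a 1% failure rate a ten-bug hour is rare for any single session, but across millions of daily sessions it still happens to plenty of people, and at 2% or 3% it stops being rare at all. Real failures also tend to cluster, which makes the bad sessions worse than this simple model suggests.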

This does not doom Twitter, but it does take us back to the bad old days, roughly 2006 to 2014, when Twitter was notorious for being unreliable. The “fail whale” appeared often. Of course, Twitter was smaller then and less influential, and its problems did not make the front page of the New York Times, as they do now.

I’m guessing that if reliability falls to 99% then everyone (the users of Twitter) will complain about the bugginess but will continue to use the app. But God help Twitter if reliability slips to 98% or 97%. At that point people would experience so many glitches during a one hour session that the sheer number of bugs would degrade the user experience. In simple terms, people would have less fun on Twitter if they were hitting a bug every eight to twelve minutes of use.
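Under the same assumptions, here is roughly how often a heavy user would hit a bug at each level of reliability (again a sketch, not real Twitter measurements):

    # Average gap between bugs for a user making 240 interactions per hour,
    # at different levels of reliability (same toy model as above).
    INTERACTIONS_PER_HOUR = 240

    for reliability in (0.99, 0.98, 0.97):
        bugs_per_hour = INTERACTIONS_PER_HOUR * (1 - reliability)
        minutes_between_bugs = 60 / bugs_per_hour
        print(f"{reliability:.0%}: ~{bugs_per_hour:.1f} bugs/hour, roughly one every {minutes_between_bugs:.1f} minutes")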

This problem also haunts the game industry: some games have a fun premise, but if they have too many bugs, the bugs keep people from playing. Even when a person sort of enjoys a buggy game, they won’t recommend it to a friend, because they assume the friend won’t tolerate the bugs. There is a game element to Twitter too, so too many bugs will sink Twitter the same way too many bugs will sink any game.

All of which goes to the question: was Twitter overstaffed? I see a lot of people online saying “Elon Musk got rid of most of the engineers, and Twitter is still fine, so clearly Twitter was overstaffed.” Maybe, maybe not — only time will tell.

What is certain is that the Law Of Diminishing Returns very much applies to any company seeking reliability numbers beyond 99%. This applies very broadly to most web sites and software services. If it takes you 10 engineers to get 99% reliability, don’t be surprised if it takes you 50 engineers to get to 99.999999%. Does that mean you are overstaffed when you have 50 engineers? Or does it simply mean that a mature company should aim for high levels of reliability? In this example you could fire 80% of your engineers and still have 99% reliability. But is that the level of reliability that you actually want?
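One way to see why the Law Of Diminishing Returns bites is to look at the failure budget each level of reliability leaves you. The sketch below (illustrative only; the engineer counts above are hypothetical as well) shows how the number of tolerated failures per million interactions collapses as you add nines:

    # The allowed failure budget per one million interactions shrinks by
    # orders of magnitude with each additional "9".
    for reliability in (0.99, 0.999, 0.9999, 0.99999, 0.99999999):
        allowed_failures = 1_000_000 * (1 - reliability)
        print(f"{reliability * 100:.6f}% reliable: {allowed_failures:,.2f} failures allowed per million interactions")

Going from 99% to 99.999999% means tolerating a million times fewer failures, and each additional nine generally demands more redundancy, monitoring, testing, and on-call coverage, which is where the extra engineers go.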

Thank you for reading this.

(Note for developers: reliability is a complex topic, with entire books devoted to it. It is not a single thing; it applies to different parts of the tech stack in different ways, and reliability for an application is different from reliability for a database or a queue. Each type of technology has its own “most common” failure modes. For handling load, one can often save the situation by using queues, but queues have failure modes of their own. My favorite video on queue failure modes is from Zach Tellman; I recommend it.)
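As one small illustration of a queue failure mode (my own toy example, not taken from Tellman’s talk): an unbounded in-memory queue under sustained overload simply keeps growing, hiding the problem until memory and latency blow up, while a bounded queue forces you to choose an explicit policy, such as shedding load when it is full. A minimal sketch in Python:

    import queue

    # Bounded queue: when producers outrun consumers, we reject new work
    # instead of letting the backlog (and latency) grow without limit.
    work_queue = queue.Queue(maxsize=100)

    def submit(job):
        """Try to enqueue a job; shed load rather than queue forever."""
        try:
            work_queue.put_nowait(job)
            return True          # accepted
        except queue.Full:
            return False         # rejected: caller can retry, degrade, or drop

    # Simulate a burst that exceeds capacity: 100 jobs are accepted,
    # the remaining 150 are shed rather than silently piling up.
    accepted = sum(submit(f"job-{i}") for i in range(250))
    print(f"accepted {accepted} of 250 jobs; shed {250 - accepted}")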

Post external references

  1. https://www.amazon.com/dp/0998997617?psc=1&ref=ppx_yo2ov_dt_b_product_details
  2. https://www.youtube.com/watch?v=1bNOO3xxMc0