Why did Basecamp crash?

(written by lawrence krubner, however indented passages are often quotes). You can contact lawrence at: lawrence@krubner.com

Is this the best way to store events? In a relational database? Wouldn’t a log be better for this kind of data? Are the events actually relational? When not switch to a historic paradigm? Clearly, the Ruby On Rails model can be made to do almost anything, but is it ideal?

The root cause was that our database hit the ceiling of 2,147,483,647 on our very busy events table. Almost every single activity in Basecamp is tracked in this table. When you post a message, update a todo list, or applaud a comment, we track that activity in the events table. So when we became unable to write new events to that table, every attempt to do practically anything in Basecamp was halted.

This was an avoidable problem. We were actively working on expanding the capacity of the events table in the days prior to this outage, but we failed to properly account for how quickly we were running out of headroom.

To compound the avoidable factor, we should had been aware of the general issue much sooner. The programming framework we use, Ruby on Rails (which was originally extracted from Basecamp!), moved to a new default for database tables in version 5.1 that was released in 2017. That change lifting the headroom for records from 2,147,483,647 to 9,223,372,036,854,775,807 on all tables. Which ended up being the same root-cause fix that we applied to our tables.

It’s bad enough that we had the worst outage at Basecamp in probably 10 years, but to know that it was avoidable is hard to swallow. And I cannot express my apologies clearly or deeply enough.

We pride ourselves at Basecamp on being “boring software” because it just works and it’s always available. Since Basecamp 3 was launched, and up until this outage, we’ve had an uptime record of 99.998%. This near five-hour outage has taken that impressive statistic down to a more humbling 99.978%.

Source