Is there no server monitoring at FourSquare?
(written by Lawrence Krubner, however indented passages are often quotes)
Source
so, in short, a company relying entirely on cloud computing machines for storing its data, which is presumably being billed according to the memory usage of those machines, ran out of memory on them, and suffered a large amount of downtime as a result. mongodb had little to do with the problem, other than maybe it took longer than expected to migrate data to a third server.
i’m baffled at how there could be no monitoring or reporting in place to catch that days or weeks ahead of time, let alone just the 12 hours needed, according to the mongodb developer, to fix the problem without downtime. it’s such a fundamental thing to keep track of for a system designed entirely around storing a bunch of data in the memory of those 2(!) machines. i have more servers than foursquare and none of them do even a small fraction of the amount of processing that theirs do, and yet i have real-time bandwidth, memory, cpu, and other stats being collected, logged, and displayed, as well as nightly jobs that email me various pieces of information. nobody at foursquare ever even logged into those systems and periodically checked the memory usage manually?
worse still, during all of this, the initial outage reports were blaming mongodb or saying the problem was unknown. even at that point nobody at foursquare realized that the servers were just out of memory?
how did the developers come up with 66 gigabytes of ram to use for these instances in the first place? was there some kind of capacity planning to come up with that number or is it just a hard limit of EC2 and the foursquare developers maxed out the configuration?
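To make the quoted complaint concrete, here is a minimal sketch of the kind of check it describes: fit a straight line to recent memory readings and warn when the projected time until the machine is full drops below the lead time needed to react. Nothing here comes from Foursquare or MongoDB. The function names, the assumption of roughly linear growth, and the sample readings are all hypothetical; only the 66 GB capacity and the 12 hour lead time are taken from the comment above.

```python
from typing import Optional, Sequence, Tuple

# Each sample is (hours_since_first_reading, memory_used_gb).
Sample = Tuple[float, float]


def hours_until_full(samples: Sequence[Sample], capacity_gb: float) -> Optional[float]:
    """Project hours from the last sample until usage reaches capacity_gb.

    Uses an ordinary least-squares line through the samples (an assumption:
    real growth may not be linear). Returns None when there is too little
    data or usage is not growing.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None  # all readings share one timestamp
    slope = sum((t - mean_t) * (u - mean_u) for t, u in samples) / var_t
    if slope <= 0:
        return None  # flat or shrinking usage never hits the limit
    intercept = mean_u - slope * mean_t
    t_full = (capacity_gb - intercept) / slope
    last_t = samples[-1][0]
    return max(0.0, t_full - last_t)


def should_alert(samples: Sequence[Sample], capacity_gb: float,
                 lead_time_hours: float = 12.0) -> bool:
    """True when the projected time to exhaustion is within the lead time."""
    remaining = hours_until_full(samples, capacity_gb)
    return remaining is not None and remaining <= lead_time_hours


if __name__ == "__main__":
    # Hypothetical hourly readings on a 66 GB machine.
    readings = [(0.0, 60.0), (1.0, 61.0), (2.0, 62.0)]
    print(hours_until_full(readings, 66.0), should_alert(readings, 66.0))
```

A nightly job could run a check like this against logged readings and send an email when `should_alert` returns true. A real setup would more likely lean on an existing monitoring tool than on a hand-rolled script.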
May 17, 2012 2:06 am