How is an operability review different from monitoring?

(written by lawrence krubner, however indented passages are often quotes). You can contact lawrence at: lawrence@krubner.com

I read this article but I didn’t understand what was gained, compared to good monitoring pushing stuff into good graphs, thus giving people a real-time view of the system.

Architecture & Deployment diagram:
The first thing is to have the deployment and architecture diagrams. Whereas the architecture diagram depicts the logical flow of requests and responses inside the system, the deployment diagram is focused on the actual infrastructure. If the system is not too complex, it makes sense to combine them in a single diagram. Having these diagrams helps people unfamiliar with the application get a quick understanding of what is inside. It also helps in technical discussions around the other areas under operability.

The stuff about monitoring seems like solid advice:

For the internal monitoring platform, we have been using a combination of Sensu and CloudWatch, but we are now migrating to Riemann, which is better at real-time monitoring. We also use Pingdom to monitor our services from outside our infrastructure. The new monitoring platform will use Telegraf/Riemann/InfluxDB/Grafana: every system sends metrics through a Telegraf agent to the Riemann server, which has the alerts configured and also forwards the metrics to InfluxDB. Grafana dashboards on top of InfluxDB then visualize the data for trend analysis, which also helps with capacity planning. Alert notifications go out mainly by email, Slack, and PagerDuty, with the channel decided by the severity of the issue.
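To make the idea of choosing a notification channel by severity concrete, here is a minimal Python sketch. It is not the quoted team's Riemann configuration. The severity levels, the thresholds, and the channel mapping are all assumptions made up for illustration, and nothing is actually sent anywhere.

```python
from enum import IntEnum


class Severity(IntEnum):
    # Hypothetical severity levels; the quoted article does not define any.
    OK = 0
    WARNING = 1
    CRITICAL = 2


# Assumed mapping from severity to channels. The article only says the
# channel is "decided based on the severity", not how.
ROUTES = {
    Severity.OK: [],
    Severity.WARNING: ["email", "slack"],
    Severity.CRITICAL: ["email", "slack", "pagerduty"],
}


def classify(value, warn_at, crit_at):
    """Map a metric value to a severity using two thresholds."""
    if value >= crit_at:
        return Severity.CRITICAL
    if value >= warn_at:
        return Severity.WARNING
    return Severity.OK


def channels_for(value, warn_at, crit_at):
    """Return the list of channels that should be notified for this value."""
    return list(ROUTES[classify(value, warn_at, crit_at)])
```

With thresholds of 80 and 90 for, say, disk usage in percent, a reading of 85 would go to email and Slack only, while a reading of 95 would also page someone.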
For logging, we use the hosted service from Loggly. We have a Chef cookbook and an Ansible playbook, customized for every app, that push the data to Loggly. Dashboards with an appropriate set of filters then need to be created in Loggly. Not every service is integrated with Loggly yet, but regardless, it’s important to have a proper log rotation strategy so that disks are not needlessly filled up by old and useless logs.
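The log rotation point can be illustrated with a small sketch using `RotatingFileHandler` from Python's standard library. This is only one way to cap disk usage, assuming the application writes its own log files. The quoted article does not say which rotation tool the team uses, and the function name, size limit, and backup count below are illustrative assumptions.

```python
import logging
from logging.handlers import RotatingFileHandler


def build_rotating_logger(path, max_bytes=1_000_000, backups=3):
    """Return a logger whose file is rotated once it nears max_bytes.

    At most `backups` old files (path.1, path.2, ...) are kept, so the
    disk usage stays roughly bounded by max_bytes * (backups + 1).
    The default values are arbitrary, chosen only for this example.
    """
    logger = logging.getLogger(f"rotating:{path}")
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep output out of the root logger
    if not logger.handlers:  # avoid adding a second handler on repeat calls
        handler = RotatingFileHandler(
            path, maxBytes=max_bytes, backupCount=backups
        )
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger
```

Calling `build_rotating_logger("app.log").info("started")` writes to `app.log`; once that file approaches the size limit it is renamed to `app.log.1`, and the oldest backup beyond the limit is deleted.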

Source