Netflix monitors its micro-services

(Written by Lawrence Krubner; indented passages are often quotes.) You can contact Lawrence at:


The ability to decompose where time is spent both within and across the fleet of microservices can be a challenge given the number of dependencies. Such information can be leveraged to identify the root cause of performance degradation or identify areas ripe for optimization within a given microservice. Our Mogul utility consumes data from Netflix’s recently open-sourced Atlas monitoring framework, applies correlation between metrics, and selects those most likely to be responsible for changes in demand on a given microservice. The different resources evaluated include:

System resource demand (CPU, network, disk)
JVM pressure (thread contention, garbage collection)
Service IPC calls
Persistency-related calls (EVCache, Cassandra)
Errors and timeouts

It is not uncommon for a Mogul query to pull thousands of metrics and then reduce them to tens of metrics through correlation with system demand. In the following example, we were able to quickly identify which downstream service was causing performance issues for the service under study. This particular microservice has over 40,000 metrics. Mogul reduced this internally to just over 2,000 metrics via pattern matching, then correlated the top 4-6 interesting metrics grouped into classifications.
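
To make the reduction described above concrete, here is a minimal sketch of the general idea: filter metric names by pattern, rank the survivors by how strongly they correlate with a demand series, and group the top few by classification. This is not Mogul's code or the Atlas API. The quoted passage only says "correlation" and "pattern matching", so the use of Pearson correlation and glob patterns is an assumption, and every name here (top_correlated, classify, the metric names, the sample numbers) is hypothetical.

```python
import fnmatch
import math
from collections import defaultdict


def pearson(xs, ys):
    """Pearson correlation of two equal-length series; 0.0 if either is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def top_correlated(metrics, demand, patterns, classify, top_n=5):
    """Keep metrics whose names match any glob pattern, rank them by absolute
    correlation with the demand series, and group the top_n by classification."""
    # Step 1 (assumed): pattern matching on metric names to shrink the candidate set.
    candidates = {
        name: series
        for name, series in metrics.items()
        if any(fnmatch.fnmatchcase(name, p) for p in patterns)
    }
    # Step 2 (assumed): score each candidate by |Pearson r| against demand.
    ranked = sorted(
        ((abs(pearson(series, demand)), name) for name, series in candidates.items()),
        reverse=True,
    )
    # Step 3: keep the strongest few and bucket them by classification.
    grouped = defaultdict(list)
    for score, name in ranked[:top_n]:
        grouped[classify(name)].append((name, round(score, 3)))
    return dict(grouped)


# Hypothetical sample data: one demand series and four made-up metrics.
demand = [10, 12, 15, 30, 28, 14]
metrics = {
    "ipc.client.serviceA.latency": [5, 6, 8, 20, 19, 7],
    "ipc.client.serviceB.latency": [5, 5, 6, 5, 6, 5],
    "jvm.gc.pause": [1, 1, 1, 1, 1, 1],
    "misc.uptime": [1, 2, 3, 4, 5, 6],
}
result = top_correlated(
    metrics,
    demand,
    patterns=["ipc.*", "jvm.*"],
    classify=lambda name: name.split(".")[0],  # first name segment as the class
    top_n=2,
)
print(result)
```

In this toy data the serviceA latency series tracks demand closely, so it ranks first, while the flat GC series scores zero and "misc.uptime" never passes the pattern filter. A real system would also need to handle missing data points, time alignment, and lagged effects, none of which this sketch attempts.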