Problem
How to understand the behavior of an application and troubleshoot problems?
Forces
- Any solution should have minimal runtime overhead
Solution
Instrument each service to gather statistics about individual operations. Aggregate the metrics in a centralized metrics service, which provides reporting and alerting.
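As a rough illustration of gathering statistics about individual operations, the sketch below wraps an operation in a decorator that records a call count, an error count, and total latency. It uses only the Python standard library. The names `REGISTRY`, `OperationStats`, `instrumented`, and `create_order` are hypothetical and chosen for this example. A real service would more likely use an existing metrics client library and hand these values to the centralized metrics service.

```python
import time
from collections import defaultdict
from functools import wraps


class OperationStats:
    """Running call count, error count and total latency for one operation."""

    def __init__(self):
        self.count = 0
        self.errors = 0
        self.total_seconds = 0.0


# Hypothetical in-process registry keyed by operation name.
# Not thread-safe; a real service would guard updates with a lock.
REGISTRY = defaultdict(OperationStats)


def instrumented(name):
    """Decorator that records statistics for every call of the wrapped operation."""

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            stats = REGISTRY[name]
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            except Exception:
                stats.errors += 1
                raise
            finally:
                # Runs on success and on failure, so every call is counted.
                stats.count += 1
                stats.total_seconds += time.perf_counter() - start

        return wrapper

    return decorator


@instrumented("create_order")
def create_order(order_id):
    # Hypothetical operation used only to exercise the decorator.
    if order_id < 0:
        raise ValueError("invalid order id")
    return {"id": order_id}
```

The per-call work is a timer read and a few additions, which is in keeping with the requirement that instrumentation add minimal runtime overhead.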
There are two models for aggregating metrics:
- push - the service pushes metrics to the metrics service
- pull - the metrics service pulls metrics from the service
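The difference between the two models is mostly a question of who initiates the transfer. The sketch below shows both in memory, with no network involved. `MetricsService`, `Service`, `receive`, `scrape`, `push_to`, and `metrics_snapshot` are hypothetical names for this example; in a real deployment these calls would be remote requests rather than direct method calls.

```python
class Service:
    """Hypothetical application service that counts the requests it handles."""

    def __init__(self, name):
        self.name = name
        self.requests = 0

    def handle_request(self):
        self.requests += 1

    def metrics_snapshot(self):
        # Current metric values, as they would be reported to the metrics service.
        return {"requests_total": self.requests}

    def push_to(self, metrics_service):
        # Push model: the service initiates the transfer.
        metrics_service.receive(self.name, self.metrics_snapshot())


class MetricsService:
    """Hypothetical centralized metrics service keeping the latest sample per source."""

    def __init__(self):
        self.samples = {}

    def receive(self, source, metrics):
        # Entry point for the push model: store what the service sent.
        self.samples[source] = dict(metrics)

    def scrape(self, source, target):
        # Pull model: the metrics service initiates the transfer.
        self.samples[source] = target.metrics_snapshot()
```

For example, an `orders` service could call `orders.push_to(central)` on a timer, while for a `payments` service the metrics service would periodically call `central.scrape("payments", payments)`. Either way the centralized service ends up holding the same kind of sample.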
Monitoring and alerting are key components of the production environment.
Monitoring systems gather metrics that provide critical information about an application’s health from all parts of its technology stack.
The metrics range from infrastructure-level metrics such as CPU, memory, and disk utilization to application-level metrics such as service request latency and the number of requests processed.