Monitoring Overview

Monitoring is the action of observing and checking the behavior and outputs of a system and its components over time.

Monitor the system in order to:

Alert on conditions that require attention.
Investigate and diagnose those issues.
Display information about the system visually.
Gain insight into trends in resource usage or service health for long-term planning.
Compare the behavior of the system before and after a change.

What to monitor

Intended Changes

Monitor the version of the binary.
Monitor the command-line flags, especially when you use these flags to enable and disable features of the service.
If configuration data is pushed to your service dynamically, monitor the version of this dynamic configuration.
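
One way to expose these three values is a constant "info" metric whose labels carry the versions, so that any change shows up as a new label combination. The sketch below renders such a sample in the Prometheus text exposition style using only the standard library. The metric name `service_build_info`, the label names, and the example values are assumptions made for illustration.

```python
import hashlib


def flags_fingerprint(flags: dict) -> str:
    """Short, order-independent hash of the command-line flags."""
    canonical = ",".join(f"{key}={flags[key]}" for key in sorted(flags))
    return hashlib.sha256(canonical.encode()).hexdigest()[:8]


def build_info_line(binary_version: str, flags: dict, config_version: str) -> str:
    """Render one sample, fixed at 1, whose labels describe what is running."""
    labels = {
        "binary_version": binary_version,
        "flags_hash": flags_fingerprint(flags),
        "config_version": config_version,
    }
    rendered = ",".join(f'{key}="{value}"' for key, value in labels.items())
    return f"service_build_info{{{rendered}}} 1"


# Hypothetical values for a running service.
print(build_info_line("1.4.2", {"enable_cache": "true", "workers": "8"}, "cfg-2024-05-01"))
```

Hashing the flags keeps the label short while still changing whenever a feature flag is flipped.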

Dependencies

Even if your service didn’t change, any of its dependencies might change or have problems, so you should also monitor responses coming from direct dependencies.
It’s reasonable to export the request and response size in bytes, latency, and response codes for each dependency.
When choosing the metrics to graph, keep the four golden signals in mind.
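
As a minimal sketch of the dependency metrics listed above, the wrapper below records request size, response size, latency, and response code for each call. The in-memory `dependency_samples` store and the `send` callable are hypothetical stand-ins for a real metrics library and a real client.

```python
import time
from collections import defaultdict

# In-memory stand-in for a metrics library: one list of samples per dependency.
dependency_samples = defaultdict(list)


def call_dependency(name, send, request_body: bytes):
    """Call send(request_body) and record latency, sizes, and response code.

    `send` is a hypothetical client function returning (status_code, response_body).
    """
    start = time.perf_counter()
    status_code, response_body = send(request_body)
    dependency_samples[name].append({
        "latency_seconds": time.perf_counter() - start,
        "request_bytes": len(request_body),
        "response_bytes": len(response_body),
        "status_code": status_code,
    })
    return status_code, response_body


def fake_user_service(body: bytes):
    """Pretend dependency used only for this example."""
    return 200, b'{"id": 42}'


call_dependency("user-service", fake_user_service, b'{"query": "id=42"}')
print(dependency_samples["user-service"][0])
```

A production version would probably also record calls that raise an exception, which this sketch does not handle.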

Saturation

Aim to monitor and track the usage of every resource the service relies upon.
Some resources have hard limits you cannot exceed, like RAM, disk, or CPU quota allocated to your application.
Other resources—like open file descriptors, active threads in any thread pools, waiting times in queues, or the volume of written logs—may not have a clear hard limit but still require management.
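
For resources with a known limit, saturation can be tracked as the ratio of usage to limit. The following sketch flags resources that are close to their limit. The 0.8 threshold and all the readings are hypothetical values chosen for the example.

```python
def utilization(used: float, limit: float) -> float:
    """Fraction of a limited resource currently in use."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    return used / limit


def saturated(resources: dict, threshold: float = 0.8) -> list:
    """Names of resources whose utilization is at or above `threshold`."""
    return sorted(
        name
        for name, (used, limit) in resources.items()
        if utilization(used, limit) >= threshold
    )


# Hypothetical readings as (used, limit) pairs.
snapshot = {
    "ram_bytes": (3_500_000_000, 4_000_000_000),
    "open_file_descriptors": (950, 1024),
    "worker_threads": (12, 64),
}
print(saturated(snapshot))
```

Resources without a clear hard limit, such as log volume, need a limit chosen by the team before a ratio like this is meaningful.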

Status of Served Traffic

For HTTP traffic, monitor all response codes, even if they don’t provide enough signal for alerting, because some can be triggered by incorrect client behavior.
If you apply rate limits or quota limits to your users, monitor aggregates of how many requests were denied due to lack of quota.
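
A small sketch of these counts: it aggregates responses per exact code and per class, and reports quota denials separately. It assumes the service signals a quota denial with HTTP 429, which may not match every service.

```python
from collections import Counter


def summarize_responses(status_codes):
    """Count responses per exact code and per class (2xx, 4xx, ...), plus quota denials."""
    by_code = Counter(status_codes)
    by_class = Counter(f"{code // 100}xx" for code in status_codes)
    # Assumption: "quota exceeded" is reported as HTTP 429 Too Many Requests.
    return {"by_code": by_code, "by_class": by_class, "quota_denied": by_code[429]}


summary = summarize_responses([200, 200, 404, 429, 500, 429, 201])
print(summary["by_class"], summary["quota_denied"])
```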

Four Golden Signals for Metrics

Latency - Time taken to serve a request
Traffic - How much demand is placed on your system
Errors - Rate of requests that are failing
Saturation - How “full” your system is
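
To make the four definitions concrete, the sketch below derives all of them from one observation window of requests. The `Request` record, the choice of the 95th percentile for latency, and the use of in-flight requests over capacity for saturation are assumptions for illustration; real systems usually compute these from histograms and counters.

```python
from dataclasses import dataclass


@dataclass
class Request:
    latency_seconds: float
    status_code: int


def golden_signals(requests, window_seconds, in_flight, capacity):
    """Summarize one observation window into the four golden signals."""
    latencies = sorted(r.latency_seconds for r in requests)
    failed = sum(1 for r in requests if r.status_code >= 500)
    # Nearest-rank 95th percentile: rank = ceil(0.95 * n), computed with integers.
    rank = (95 * len(latencies) + 99) // 100
    return {
        "latency_p95_seconds": latencies[rank - 1] if latencies else 0.0,
        "traffic_rps": len(requests) / window_seconds,
        "error_ratio": failed / len(requests) if requests else 0.0,
        "saturation": in_flight / capacity,
    }


window = [Request(0.1, 200), Request(0.2, 200), Request(0.4, 500), Request(0.3, 200)]
print(golden_signals(window, window_seconds=2, in_flight=30, capacity=100))
```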

Use cases

Use case 1: Move information from logs to metrics.

Label the metrics with the HTTP status code. The API Gateway already takes care of this.
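
Where the status code is only available in logs, the move from logs to metrics can be sketched as a parser that turns log lines into labeled counters. The log line shape and the metric name `http_requests_total` below are assumptions; adjust them to the real log format.

```python
import re
from collections import Counter

# Assumed access-log shape: "<METHOD> <path> <status> <duration_ms>".
LOG_LINE = re.compile(r"^(?P<method>[A-Z]+) (?P<path>\S+) (?P<status>\d{3}) (?P<ms>\d+)$")


def requests_by_status(lines):
    """Turn raw log lines into a counter keyed by (method, status) labels."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match:  # skip lines that are not access-log entries
            counts[(match["method"], match["status"])] += 1
    return counts


def render(counts):
    """Render the counter in the Prometheus text exposition style."""
    return [
        f'http_requests_total{{method="{method}",status="{status}"}} {count}'
        for (method, status), count in sorted(counts.items())
    ]


logs = ["GET /users 200 12", "GET /users 200 9", "POST /orders 500 31", "not a request"]
print("\n".join(render(requests_by_status(logs))))
```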

Use case 2: Improve both logs and metrics.

To calculate the error budget from logs, the centralized logging service needs a separate script for each module.

Instead, instrument the code: whenever it logs an error, also export the error to the metrics data source. The error budget can then be configured in the monitoring system, which removes the dependency on the central logging application. The metrics also carry exact details and stay consistent.
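
One possible way to instrument the code without touching every log call is a logging handler that counts errors as they are logged. The sketch below uses Python's standard `logging` module; the in-memory counter stands in for a real metrics client, and the `payments` logger name and the SLO value in the usage note are hypothetical.

```python
import logging
from collections import Counter


class ErrorCountingHandler(logging.Handler):
    """Logging handler that counts ERROR-and-above records per logger name."""

    def __init__(self):
        super().__init__(level=logging.ERROR)
        self.errors = Counter()

    def emit(self, record):
        self.errors[record.name] += 1


def error_budget_remaining(slo, total_requests, failed_requests):
    """Fraction of the error budget left for an SLO below 1.0.

    A negative result means the budget is overspent.
    """
    allowed_failures = (1 - slo) * total_requests
    return 1 - failed_requests / allowed_failures


handler = ErrorCountingHandler()
log = logging.getLogger("payments")  # hypothetical module name
log.addHandler(handler)
log.error("card declined by upstream")
log.warning("slow response")  # below ERROR, so not counted
print(handler.errors)
```

With a 99.9% SLO, 10,000 requests allow about 10 failures, so `error_budget_remaining(0.999, 10000, 4)` leaves roughly 60% of the budget.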

Use case 3: Make the support engineer's life easier.

Whenever an alert is raised, the engineer has to query the logger for more details. Sometimes the engineer may not know the right query, or framing it may take extra time.

Provide the related query as part of the alert email.
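
A sketch of such an email body is shown below. The query syntax, the field names, and the `logs.example.com` URL are all hypothetical; replace them with the query language and address of the logging system in use.

```python
from urllib.parse import quote


def alert_email_body(alert_name, service, started_at, log_search_url):
    """Compose an alert email body that includes a ready-to-run log query."""
    # Hypothetical query syntax, not tied to any specific logging product.
    query = f'service:"{service}" AND level:ERROR AND @timestamp:>="{started_at}"'
    link = f"{log_search_url}?q={quote(query)}"
    return "\n".join([
        f"Alert: {alert_name}",
        f"Service: {service}",
        f"Started: {started_at}",
        f"Related log query: {query}",
        f"Open in log search: {link}",
    ])


print(alert_email_body(
    "HighErrorRate", "checkout", "2024-05-01T10:00:00Z",
    "https://logs.example.com/search",
))
```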

Grafana Dashboard - Configuration As Code

Instead of manually creating dashboard panels in the Grafana web UI and storing the JSON file in version control, a complete dashboard with varying panels can be created as configuration as code using an open source library.
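
To illustrate the idea without depending on any particular library, the sketch below builds a small dashboard definition as plain Python data and serializes it to JSON. It does not use the grafanalib API; the field names are a minimal subset of Grafana's dashboard JSON, and the metric names in the queries are hypothetical.

```python
import json


def panel(panel_id, title, expr, x, y, w=12, h=8):
    """One time-series panel backed by a single Prometheus-style query."""
    return {
        "id": panel_id,
        "type": "timeseries",
        "title": title,
        "gridPos": {"x": x, "y": y, "w": w, "h": h},
        "targets": [{"expr": expr, "refId": "A"}],
    }


def service_dashboard(service):
    """Build the same set of panels for any service name."""
    queries = {  # hypothetical metric names
        "Request rate": f'sum(rate(http_requests_total{{service="{service}"}}[5m]))',
        "Error rate": f'sum(rate(http_requests_total{{service="{service}",status=~"5.."}}[5m]))',
    }
    panels = [
        panel(i + 1, title, expr, x=(i % 2) * 12, y=(i // 2) * 8)
        for i, (title, expr) in enumerate(queries.items())
    ]
    return {"title": f"{service} overview", "panels": panels}


print(json.dumps(service_dashboard("checkout"), indent=2))
```

Because the dashboard is generated by a function, the same layout can be produced for every service, and the generator itself is what gets reviewed and versioned.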

A few useful links:

https://grafanalib.readthedocs.io/en/latest/
https://github.com/weaveworks/grafanalib
Linting support for Prometheus config: promtool - https://github.com/prometheus/prometheus/tree/master/cmd/promtool
An instrumentation framework - https://opencensus.io/
Standardized schema for metrics, developed by Prometheus - https://openmetrics.io/

Alerts

Categorize the use cases and create alerts. Each alert can be assigned a severity level.

Not all alerts need to be emails.
Log a ticket for low-priority alerts, which can be reviewed and fixed later, e.g. CPU about to reach its limit, or high API response latency.
Send an email for high-severity alerts.
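
The routing rule above can be sketched as a small function. The two severity levels and the `send_email` and `create_ticket` callables are hypothetical placeholders for a real mail and ticketing integration.

```python
from enum import Enum


class Severity(Enum):
    LOW = "low"
    HIGH = "high"


def route(alert_name, severity, send_email, create_ticket):
    """Send high-severity alerts by email and file a ticket for everything else."""
    if severity is Severity.HIGH:
        send_email(alert_name)
        return "email"
    create_ticket(alert_name)
    return "ticket"


# Lists stand in for the real email and ticket systems.
emails, tickets = [], []
route("ServiceDown", Severity.HIGH, emails.append, tickets.append)
route("CpuNearLimit", Severity.LOW, emails.append, tickets.append)
print(emails, tickets)
```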

Alert suppression

When the problem lies in a service dependency, there is no need to create an alert; it can be suppressed.
When the same alert is triggered by different systems, only one needs to be sent.

Another example is setting different alert levels depending on whether the HTTP status code is 400 or 500.
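
A minimal sketch of these rules follows. The alert dictionary keys (`name`, `source`, `caused_by`) are hypothetical, and the mapping of server errors to high severity and client errors to low severity is an assumed policy, not one stated here.

```python
def suppress(alerts, failing_dependencies):
    """Return the alerts worth sending.

    Each alert is a dict with hypothetical keys: 'name', 'source', and
    'caused_by' (the dependency blamed for the problem, or None).
    """
    kept, seen_names = [], set()
    for alert in alerts:
        if alert["caused_by"] in failing_dependencies:
            continue  # the problem is in a dependency, so suppress it
        if alert["name"] in seen_names:
            continue  # the same alert was already raised by another system
        seen_names.add(alert["name"])
        kept.append(alert)
    return kept


def severity_for_status(status_code):
    """Assumed policy: 5xx is high severity, 4xx is low, anything else is none."""
    if status_code >= 500:
        return "high"
    if status_code >= 400:
        return "low"
    return "none"


incoming = [
    {"name": "HighLatency", "source": "gateway", "caused_by": "database"},
    {"name": "DiskFull", "source": "host-1", "caused_by": None},
    {"name": "DiskFull", "source": "host-agent", "caused_by": None},
]
print([a["source"] for a in suppress(incoming, {"database"})])
```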

Karthik Venkatesalu
