Observability

Server observability refers to the practice of monitoring and gaining insights into the performance, health, and behavior of servers and the applications or services running on them. It is a critical aspect of managing and maintaining server infrastructure, especially in modern, complex, and distributed computing environments. Server observability helps organizations detect and diagnose issues, optimize performance, and ensure the reliability of their IT systems.

Server observability is an essential practice in modern IT operations, especially for cloud-based and containerized environments, where servers and services are highly dynamic and interconnected.

Logging

Collecting and analyzing logs generated by:

  • Server Application/Service Logs: messages logged by your server code, databases, etc.
  • System Logs: generated by system-level components such as the operating system and hardware devices (kernel messages, disk errors, etc.).
  • Network Logs: generated by routers, load balancers, firewalls, etc. They provide information about network activity such as packet drops, connection status, traffic flow, and more.

Logs can provide detailed information about events, errors, and transactions, helping in troubleshooting and debugging issues. Different components produce logs in different formats, so a better approach is to use a tool that standardizes the log output, such as Fluentd, an open-source data collector. Logs are then collected in a central location, where they can be searched and displayed. A good tool for this is Kibana, with Elasticsearch for data storage.
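As a rough sketch (not tied to any particular framework), a service can emit one JSON object per log line so that a collector such as Fluentd can parse it without custom parsing rules; the field and service names below are illustrative:

```python
import json
import logging
import sys
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line, easy for a log collector to parse."""
    def format(self, record):
        payload = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        return json.dumps(payload)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("checkout-service")  # hypothetical service name
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("order created")
logger.error("payment gateway timeout")
```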

Metrics

A metric is a numerical measurement that is easy for humans to interpret. Metrics help you understand the system's behavior and performance from both a technical and a business point of view.

Gathering and analyzing various metrics is essential to quickly determine whether your system is behaving as expected. Metrics can include various performance indicators, such as:

  • CPU utilization
  • memory usage
  • disk I/O bottleneck
  • connectivity and performance (database, 3rd party services, cache...)
  • network traffic and latency
  • HTTP request/response times
  • error rates
  • application-specific metrics (e.g., user sign-ups, transactions processed)

These metrics provide a real-time view of the server's health and performance. Prometheus is an open-source systems monitoring and alerting toolkit. It comes with a built-in query language (PromQL) in which powerful queries can be written to gain insight into the system's behavior. Alerts based on these metrics can be sent to another system, such as Sentry or Nagios.
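As a minimal sketch of how a service can expose such metrics to Prometheus, assuming the prometheus_client Python package; the metric names, labels, and port are illustrative:

```python
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("http_requests_total", "Total HTTP requests", ["status"])
LATENCY = Histogram("http_request_duration_seconds", "HTTP request latency in seconds")

def handle_request():
    # Simulated request handler; in a real service this wraps your endpoint logic.
    with LATENCY.time():
        time.sleep(random.uniform(0.01, 0.2))
    REQUESTS.labels(status="200").inc()

if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics for Prometheus to scrape
    while True:
        handle_request()
```

Prometheus then scrapes the /metrics endpoint on a schedule and stores the resulting time series for querying with PromQL.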

For metrics visualization, Grafana is a good option.

How to use/analyze metrics

Don't use averages, use percentiles.

An average doesn't show the real picture of the system, because it can be skewed by a few outliers. For this reason, we use percentiles, such as P50, P90, P99, etc. To get the PXX value, sort the data points, throw out the bottom XX% of them, and take the first point that remains.

From a sorted list of latency times [20, 37, 45, 62, 850, 920] (in ms), the average is 322.3ms. But the P50 is 62ms and the P90 is 920ms. This means that 50% of the requests complete in 62ms or less and 90% complete in 920ms or less.
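A small sketch of that rule of thumb in Python (there are several percentile conventions; this one simply drops the bottom XX% of the sorted points):

```python
def percentile(values, pct):
    """Drop the bottom pct% of the sorted points and return the first point that remains."""
    ordered = sorted(values)
    drop = int(len(ordered) * pct / 100)          # how many points to throw away
    return ordered[min(drop, len(ordered) - 1)]   # first remaining point

latencies = [20, 37, 45, 62, 850, 920]            # milliseconds
print(sum(latencies) / len(latencies))            # average: ~322.3
print(percentile(latencies, 50))                  # P50: 62
print(percentile(latencies, 90))                  # P90: 920
```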

Set alarms.

Manually looking at metrics or scheduling ad-hoc checks is not enough, especially if the system is large. You need to be notified when something goes wrong. For each metric, you should set a measurement period, a threshold (limit), and a grace period (in case the error recovers by itself) before the alarm is triggered, as sketched below.
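A toy sketch of the period / threshold / grace-period idea; in practice this logic lives in your alerting system (for example a Prometheus alerting rule with a `for:` duration) rather than in hand-rolled code:

```python
from dataclasses import dataclass, field

@dataclass
class Alarm:
    """Fire only after the metric has breached the threshold for more than
    `grace_periods` consecutive measurement periods, so spikes that recover
    by themselves are ignored."""
    threshold: float
    grace_periods: int = 2
    _breaches: int = field(default=0, init=False)

    def evaluate(self, value: float) -> bool:
        if value > self.threshold:
            self._breaches += 1
        else:
            self._breaches = 0          # the metric recovered on its own
        return self._breaches > self.grace_periods

# One sample per measurement period, e.g. P99 latency in milliseconds.
alarm = Alarm(threshold=500, grace_periods=2)
for sample in [120, 640, 700, 710, 300]:
    if alarm.evaluate(sample):
        print(f"ALERT: latency {sample}ms above threshold for too long")
```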

You can also set working-day alarms for early-warning indicators. During a working day, someone can look into the issue, but if the same thing happens at night or over the weekend, there is no need to wake anyone up (unless the alarm is critical).

Adapt limits to cycles.

Traffic is not the same during the day as during the night, or during load peaks (if you are a retailer, you should expect a lot of traffic on Black Friday). You should set different thresholds for different situations (hour of day, day of the week, season...), as in the sketch below.
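One simple way to express this, as a sketch with made-up numbers, is to look up the threshold from the current time and season instead of hard-coding a single value:

```python
from datetime import datetime

# Hypothetical request-rate thresholds (requests per second) per traffic cycle.
THRESHOLDS = {
    "night": 200,             # low traffic expected overnight
    "business_hours": 1500,
    "peak_season": 8000,      # e.g. Black Friday for a retailer
}

def current_threshold(now: datetime, peak_season: bool = False) -> int:
    """Return the alarm threshold matching the current traffic cycle."""
    if peak_season:
        return THRESHOLDS["peak_season"]
    if now.hour < 7 or now.hour >= 23:
        return THRESHOLDS["night"]
    return THRESHOLDS["business_hours"]

print(current_threshold(datetime(2024, 11, 29, 14, 0), peak_season=True))  # 8000
```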

Periodically review your metrics.

The system is not static; it changes over time. You should review your metrics periodically to ensure they are still relevant and useful. If this process can be automated, even better.

Remember that not having metrics is as bad as having a bad metric. If you don't have metrics, you can't know what is happening in your system. Create alarms that alert you when expected metrics are not being reported.
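A sketch of that last idea in plain Python; in a real setup this check belongs in the monitoring system itself (PromQL has an absent() function for exactly this), and the metric names and timeout here are illustrative:

```python
import time

MAX_SILENCE_SECONDS = 300   # how long a metric may go unreported before alarming

def missing_metrics(last_seen, now=None):
    """Return the metrics whose most recent data point is older than the allowed silence."""
    now = now if now is not None else time.time()
    return [name for name, ts in last_seen.items() if now - ts > MAX_SILENCE_SECONDS]

# last_seen maps each expected metric to the timestamp of its latest data point.
last_seen = {
    "http_requests_total": time.time(),
    "queue_depth": time.time() - 900,   # silent for 15 minutes
}
print(missing_metrics(last_seen))       # ['queue_depth'] -> raise an alarm
```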

Tracing

Tracing follows requests as they traverse the various components of a distributed system. Distributed tracing helps in identifying latency bottlenecks and understanding the flow of requests in a microservices architecture.

Jaeger is an open-source, end-to-end distributed tracing system that helps trace requests as they move through the system.
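A minimal sketch using the OpenTelemetry Python SDK (opentelemetry-api and opentelemetry-sdk packages). It prints spans to the console; in production you would configure an exporter that ships spans to a backend such as Jaeger instead. The service and span names are illustrative:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Configure a tracer provider that writes finished spans to stdout.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")  # hypothetical service name

def handle_order():
    # Each span records one unit of work; nested spans show the request flow.
    with tracer.start_as_current_span("handle_order") as span:
        span.set_attribute("order.items", 3)
        with tracer.start_as_current_span("charge_payment"):
            pass  # the call to the payment service would go here

handle_order()
```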

To implement server observability effectively, organizations often use specialized monitoring and observability tools such as Prometheus, Grafana, Elasticsearch, Kibana, and many others. These tools help in collecting, storing, analyzing, and visualizing the data needed to maintain and improve server performance and reliability.

References