Observability

Server observability refers to the practice of monitoring and gaining insights into the performance, health, and behavior of servers and the applications or services running on them. It is a critical aspect of managing and maintaining server infrastructure, especially in modern, complex, and distributed computing environments. Server observability helps organizations detect and diagnose issues, optimize performance, and ensure the reliability of their IT systems.

Server observability is an essential practice in modern IT operations, especially for cloud-based and containerized environments, where servers and services are highly dynamic and interconnected.

Logging

Collecting and analyzing logs generated by:

  • Server Application/Service Logs: messages logged by your server code, databases, etc.
  • System Logs: generated by system-level components such as the operating system and hardware devices (kernel messages, disk errors, etc.).
  • Network Logs: generated by routers, load balancers, firewalls, etc. They provide information about network activity such as packet drops, connection status, traffic flow, and more.

Logs can provide detailed information about events, errors, and transactions, helping in troubleshooting and debugging issues. Different components produce logs in different formats, so a better approach is to use a tool that standardizes the log output, such as Fluentd, an open-source data collector. Logs are then collected in a central location, where they can be searched and displayed. A good tool for this is Kibana, with Elasticsearch for data storage.
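As a rough sketch (not tied to any particular framework), a service can emit one JSON object per log line so that a collector such as Fluentd can parse it without custom parsing rules; the field and service names below are illustrative:

```python
import json
import logging
import sys
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line, easy for a log collector to parse."""
    def format(self, record):
        payload = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        return json.dumps(payload)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("checkout-service")  # hypothetical service name
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("order created")
logger.error("payment gateway timeout")
```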

Metrics

A metric is a numerical measurement that is easy for humans to interpret. Metrics help you understand the system's behavior and performance from both a technical and a business point of view.

Gathering and analyzing various metrics is essential to quickly determine whether your system is behaving as expected. Metrics can include various performance indicators, such as:

  • CPU utilization
  • memory usage
  • disk I/O bottleneck
  • connectivity and performance (database, 3rd party services, cache...)
  • network traffic and latency
  • HTTP request/response times
  • error rates
  • application-specific metrics (e.g., user sign-ups, transactions processed)

These metrics provide a real-time view of the server's health and performance. Prometheus is an open-source systems monitoring and alerting toolkit. It comes with a built-in query language (PromQL) in which powerful queries can be written to gain insight into the system's behavior. Alerts based on these metrics can be sent to another system, such as Sentry or Nagios.
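As a minimal sketch of how a service can expose such metrics to Prometheus, assuming the prometheus_client Python package; the metric names, labels, and port are illustrative:

```python
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("http_requests_total", "Total HTTP requests", ["status"])
LATENCY = Histogram("http_request_duration_seconds", "HTTP request latency in seconds")

def handle_request():
    # Simulated request handler; in a real service this wraps your endpoint logic.
    with LATENCY.time():
        time.sleep(random.uniform(0.01, 0.2))
    REQUESTS.labels(status="200").inc()

if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics for Prometheus to scrape
    while True:
        handle_request()
```

Prometheus then scrapes the /metrics endpoint on a schedule and stores the resulting time series for querying with PromQL.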

For metrics visualization, Grafana is a good option.

How to use/analyze metrics

Don't use averages, use percentiles.

An average doesn't show the real picture of the system, because it can be skewed by a few outliers. For this reason, we use percentiles, such as P50, P90, P99, etc. To get the PXX value, sort the data points, throw out the bottom XX% of them, and take the first point that remains.

From a sorted list of latency times [20, 37, 45, 62, 850, 920] (in ms), the average is 322.3ms. But the P50 is 62ms and the P90 is 920ms. This means that 50% of the requests complete in 62ms or less and 90% complete in 920ms or less.
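A small sketch of that rule of thumb in Python (there are several percentile conventions; this one simply drops the bottom XX% of the sorted points):

```python
def percentile(values, pct):
    """Drop the bottom pct% of the sorted points and return the first point that remains."""
    ordered = sorted(values)
    drop = int(len(ordered) * pct / 100)          # how many points to throw away
    return ordered[min(drop, len(ordered) - 1)]   # first remaining point

latencies = [20, 37, 45, 62, 850, 920]            # milliseconds
print(sum(latencies) / len(latencies))            # average: ~322.3
print(percentile(latencies, 50))                  # P50: 62
print(percentile(latencies, 90))                  # P90: 920
```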

Set alarms.

Manually looking at metrics or scheduling ad-hoc checks is not enough, especially if the system is large. You need to be notified when something goes wrong. For each metric, you should set a measurement period, a threshold (limit), and a grace period (in case the error recovers by itself) before the alarm is triggered, as sketched below.
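A toy sketch of the period / threshold / grace-period idea; in practice this logic lives in your alerting system (for example a Prometheus alerting rule with a `for:` duration) rather than in hand-rolled code:

```python
from dataclasses import dataclass, field

@dataclass
class Alarm:
    """Fire only after the metric has breached the threshold for more than
    `grace_periods` consecutive measurement periods, so spikes that recover
    by themselves are ignored."""
    threshold: float
    grace_periods: int = 2
    _breaches: int = field(default=0, init=False)

    def evaluate(self, value: float) -> bool:
        if value > self.threshold:
            self._breaches += 1
        else:
            self._breaches = 0          # the metric recovered on its own
        return self._breaches > self.grace_periods

# One sample per measurement period, e.g. P99 latency in milliseconds.
alarm = Alarm(threshold=500, grace_periods=2)
for sample in [120, 640, 700, 710, 300]:
    if alarm.evaluate(sample):
        print(f"ALERT: latency {sample}ms above threshold for too long")
```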

You can also set working-day alarms for early-warning indicators. During a working day, someone can look into the issue, but if the same thing happens at night or over the weekend, there is no need to wake anyone up (unless the alarm is critical).

Adapt limits to cycles.

Traffic is not the same during the day as during the night, or during load peaks (if you are a retailer, you should expect a lot of traffic on Black Friday). You should set different thresholds for different situations (hour of day, day of the week, season...), as in the sketch below.
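One simple way to express this, as a sketch with made-up numbers, is to look up the threshold from the current time and season instead of hard-coding a single value:

```python
from datetime import datetime

# Hypothetical request-rate thresholds (requests per second) per traffic cycle.
THRESHOLDS = {
    "night": 200,             # low traffic expected overnight
    "business_hours": 1500,
    "peak_season": 8000,      # e.g. Black Friday for a retailer
}

def current_threshold(now: datetime, peak_season: bool = False) -> int:
    """Return the alarm threshold matching the current traffic cycle."""
    if peak_season:
        return THRESHOLDS["peak_season"]
    if now.hour < 7 or now.hour >= 23:
        return THRESHOLDS["night"]
    return THRESHOLDS["business_hours"]

print(current_threshold(datetime(2024, 11, 29, 14, 0), peak_season=True))  # 8000
```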

Periodically review your metrics.

The system is not static; it changes over time. You should review your metrics periodically to ensure they are still relevant and useful. If this process can be automated, even better.

Remember that not having metrics is as bad as having a bad metric. If you don't have metrics, you can't know what is happening in your system. Create alarms that alert you when expected metrics are not being reported.
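A sketch of that last idea in plain Python; in a real setup this check belongs in the monitoring system itself (PromQL has an absent() function for exactly this), and the metric names and timeout here are illustrative:

```python
import time

MAX_SILENCE_SECONDS = 300   # how long a metric may go unreported before alarming

def missing_metrics(last_seen, now=None):
    """Return the metrics whose most recent data point is older than the allowed silence."""
    now = now if now is not None else time.time()
    return [name for name, ts in last_seen.items() if now - ts > MAX_SILENCE_SECONDS]

# last_seen maps each expected metric to the timestamp of its latest data point.
last_seen = {
    "http_requests_total": time.time(),
    "queue_depth": time.time() - 900,   # silent for 15 minutes
}
print(missing_metrics(last_seen))       # ['queue_depth'] -> raise an alarm
```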

Tracing

Tracing follows requests as they traverse the various components of a distributed system. Distributed tracing helps in identifying latency bottlenecks and understanding the flow of requests in a microservices architecture.

Jaeger is an open-source, end-to-end distributed tracing system that helps trace requests as they move through the system.
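A minimal sketch using the OpenTelemetry Python SDK (opentelemetry-api and opentelemetry-sdk packages). It prints spans to the console; in production you would configure an exporter that ships spans to a backend such as Jaeger instead. The service and span names are illustrative:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Configure a tracer provider that writes finished spans to stdout.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")  # hypothetical service name

def handle_order():
    # Each span records one unit of work; nested spans show the request flow.
    with tracer.start_as_current_span("handle_order") as span:
        span.set_attribute("order.items", 3)
        with tracer.start_as_current_span("charge_payment"):
            pass  # the call to the payment service would go here

handle_order()
```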

To implement server observability effectively, organizations often use specialized monitoring and observability tools such as Prometheus, Grafana, Elasticsearch, Kibana, and many others. These tools help in collecting, storing, analyzing, and visualizing the data needed to maintain and improve server performance and reliability.

References