
A comprehensive approach to application monitoring

Author: The Beautiful Life of Yiqiu

Contents

Metric monitoring

Log monitoring

Summary

In the previous section, I walked you through using the USE method to monitor system performance. Let's briefly review.

The core of system monitoring is resource usage, which covers hardware resources such as CPU, memory, disk, file system, and network, as well as software resources such as the number of file descriptors, the number of connections, and the number of connection-tracking entries. The simplest and most effective way to find bottlenecks in these resources is the USE method.

The USE method reduces the performance metrics of system resources to three categories: utilization, saturation, and errors. When any of these three is too high for a resource, that resource may have a performance bottleneck.
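As a quick recap, the USE classification can be sketched as a simple per-resource check. This is a minimal illustration; the thresholds below are assumptions for the example, not recommendations:

```python
# Minimal sketch of classifying one resource with the USE method.
# The thresholds are illustrative assumptions, not recommendations.

def use_check(utilization, saturation, errors,
              util_limit=0.8, sat_limit=1.0):
    """Return the suspected bottleneck signals for one resource."""
    signals = []
    if utilization > util_limit:   # e.g. CPU busy more than 80% of the time
        signals.append("high utilization")
    if saturation > sat_limit:     # e.g. run-queue length per CPU above 1
        signals.append("saturated")
    if errors > 0:                 # any error count deserves a look
        signals.append("errors present")
    return signals

print(use_check(utilization=0.95, saturation=2.5, errors=0))
```

A real monitoring system would evaluate such rules continuously against collected time-series data rather than one-off values.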

After establishing performance metrics with the USE method, we also need a complete monitoring system to handle these metrics end to end: collection, storage, querying, processing, alerting, and visualization. This not only exposes system resource bottlenecks quickly, but also lets you trace the source of performance problems using historical monitoring data.

Beyond the system resources covered in the previous section, performance monitoring of applications is of course also essential. Today, I will show you how to monitor application performance.

As with system monitoring, before building a monitoring system for an application, you first need to decide which metrics to monitor. In particular, be clear about which metrics let you quickly identify the application's performance problems.

The USE method is simple and effective for monitoring system resources, but that does not mean it suits application monitoring. For example, even when CPU usage is low, the application can still have a performance bottleneck: it may respond slowly because of lock contention, slow RPC calls, and so on.

Therefore, the core application metrics are no longer resource usage, but request rate, error rate, and response time. These metrics are not only directly related to user experience; they also reflect the overall availability and reliability of the application.

With the three golden metrics (request rate, error rate, and response time), we can quickly tell whether an application has performance problems. However, these alone are clearly not enough: after a performance problem occurs, we also want to quickly locate the bottleneck. So, in my opinion, the following metrics are also essential when monitoring applications.
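To make the three golden metrics concrete, here is a minimal stdlib-only sketch that records request count, error count, and latency per handler. The handler and metric names are made up for illustration; a production system would use a metrics library (such as the Prometheus client) rather than an in-process dictionary:

```python
import time
from collections import defaultdict

# Per-handler golden metrics: requests, errors, latencies.
# Error rate is derived as errors / requests.
stats = defaultdict(lambda: {"requests": 0, "errors": 0, "latencies": []})

def monitored(name):
    """Decorator that records the three golden metrics for a handler."""
    def wrap(handler):
        def inner(*args, **kwargs):
            start = time.monotonic()
            stats[name]["requests"] += 1
            try:
                return handler(*args, **kwargs)
            except Exception:
                stats[name]["errors"] += 1
                raise
            finally:
                stats[name]["latencies"].append(time.monotonic() - start)
        return inner
    return wrap

@monitored("get_user")          # hypothetical handler for illustration
def get_user(uid):
    return {"id": uid}

get_user(42)
s = stats["get_user"]
print(s["requests"], s["errors"], len(s["latencies"]))
```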

The first is the resource usage of the application's processes, such as the CPU, memory, disk I/O, and network used by each process. Consuming too many system resources, which makes the application respond slowly or raises its error count, is one of the most common performance problems.
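As a minimal illustration, a process can sample its own resource usage with Python's standard library. Note the `resource` module is Unix-only, and in production you would typically rely on an external exporter or agent rather than self-reporting:

```python
import os
import resource  # Unix-only standard library module

# Sample this process's own CPU time and peak memory usage.
# Caution: ru_maxrss is in kilobytes on Linux but bytes on macOS.
usage = resource.getrusage(resource.RUSAGE_SELF)
print("pid:", os.getpid())
print("user CPU seconds:", usage.ru_utime)
print("system CPU seconds:", usage.ru_stime)
print("peak resident set size:", usage.ru_maxrss)
```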

The second is the calls between applications, such as call frequency, error count, and latency. Applications are not isolated: if an application they depend on has performance problems, their own performance suffers too.

The third is the operation of the application's internal core logic, such as the latency of critical paths and errors during execution. Because this is internal state, detailed performance data is usually not available from the outside. The application should therefore expose these metrics at design and development time, so that the monitoring system can observe its internal behavior.

  • With the resource usage metrics of the application's processes, you can correlate system resource bottlenecks with the application, and quickly locate performance problems caused by insufficient system resources;

  • With the call metrics between applications, you can quickly analyze which component is the culprit of performance problems in a call chain for request processing;
  • With the performance metrics of the application's internal core logic, you can go one step further and look inside the application to locate the exact function that is causing the performance problem.
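The third kind of metric, internal core-logic instrumentation, can be as simple as timing the key steps of a handler. Here is a stdlib-only sketch; the step names and handler are made up for illustration:

```python
import time
from contextlib import contextmanager

# Accumulated time per internal step, so monitoring can see
# which part of the handler is slow.
timings = {}

@contextmanager
def timed(step):
    """Record the wall-clock time spent in one named step."""
    start = time.monotonic()
    try:
        yield
    finally:
        timings[step] = timings.get(step, 0.0) + (time.monotonic() - start)

def handle_order():                 # hypothetical handler
    with timed("validate"):
        time.sleep(0.01)            # stand-in for real validation work
    with timed("persist"):
        time.sleep(0.02)            # stand-in for a database write

handle_order()
# The slowest step is the first place to look for an internal bottleneck.
print(max(timings, key=timings.get))
```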

Based on these ideas, I believe you can build performance metrics that describe your application's running state. By feeding these metrics into the monitoring system mentioned in the previous section (such as Prometheus + Grafana), you can treat them just like system metrics: on the one hand, report problems to the relevant team in time through the alerting system; on the other hand, display the application's overall performance dynamically through an intuitive graphical interface.

In addition, business systems typically involve a chain of multiple services, forming a complex distributed call chain. To quickly locate such cross-application performance bottlenecks, you can also use open-source tools such as Zipkin, Jaeger, and Pinpoint to build a full-link tracing system.

For example, the following figure is an example of a Jaeger call chain trace.


(Image from Jaeger documentation)

Full-link tracing helps you quickly identify which link in a request's processing is the root cause of a problem. For example, from the figure above you can easily see that the problem was caused by a Redis timeout.

Besides helping you quickly locate cross-application performance problems, full-link tracing can also generate a call topology map of an online system. These intuitive topology diagrams are especially useful for analyzing complex systems, such as microservices.
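Under the hood, tracing systems like Jaeger and Zipkin rely on propagating a trace context across services: every request carries a trace ID, and each hop creates a new span that records its parent. Here is a toy sketch of the idea; the field names are illustrative and not any tool's actual wire format:

```python
import uuid

def new_span(trace_id=None, parent_span_id=None):
    """Create a span; a root span starts a fresh trace."""
    return {
        "trace_id": trace_id or uuid.uuid4().hex,  # shared by the whole request
        "span_id": uuid.uuid4().hex,               # unique per hop
        "parent_span_id": parent_span_id,          # links the call chain
    }

root = new_span()                                     # e.g. the frontend request
child = new_span(root["trace_id"], root["span_id"])   # e.g. a call into Redis
print(child["trace_id"] == root["trace_id"])          # same trace across hops
```

Because every span shares the same trace ID and records its parent, the backend can reassemble the whole call chain, which is exactly what the Jaeger trace view above displays.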

Monitoring performance metrics lets you quickly locate where a bottleneck occurs, but metrics alone are often not enough. For example, the same interface may exhibit completely different performance problems depending on the request parameters. So, in addition to metrics, we also need to monitor the context around those metrics, and logs are the best source of that context.

In contrast,

  • Metrics are numerical measurements over specific time periods, usually processed as time series, and are suited to real-time monitoring.
  • Logs, on the other hand, are string messages at a point in time, and usually need to be indexed by a search engine before they can be queried and aggregated.
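For the logs to carry usable context, it helps to emit them in a structured form. Here is a minimal stdlib sketch that logs events as JSON so fields like request parameters can be indexed later; the field names are made up for illustration:

```python
import json
import logging
import time

logging.basicConfig(level=logging.INFO)

def log_event(message, **context):
    """Emit a log line as JSON so every context field is indexable."""
    record = {"ts": time.time(), "msg": message, **context}
    logging.getLogger("app").info(json.dumps(record))
    return record

# A slow request logged together with the parameters that caused it.
rec = log_event("slow query", endpoint="/search", q="foo", latency_ms=812)
print(rec["endpoint"])
```

With structured fields like these, a later search ("all slow queries on /search") becomes a simple indexed lookup instead of free-text matching.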

The most classic approach to log monitoring is the ELK stack, the combination of Elasticsearch, Logstash, and Kibana.

As shown in the following figure, it is a classic ELK architecture diagram:


(Image from elastic.co)

Among them,

  • Logstash collects logs from each log source, pre-processes them, and sends the processed logs to Elasticsearch for indexing.
  • Elasticsearch indexes the logs and provides a complete full-text search engine, so that you can retrieve the data you need from them.
  • Kibana visualizes and analyzes the logs, including log search, processing, and rich dashboard display.
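As a rough illustration of what Logstash's parsing stage does, here is a stdlib sketch that turns an Apache combined-log line into structured fields. The sample line and regular expression are simplified assumptions for illustration, not Logstash's actual grok patterns:

```python
import re

# A sample Apache "combined" access-log line (illustrative data).
LINE = ('203.0.113.9 - - [10/Oct/2023:13:55:36 +0000] '
        '"GET /index.html HTTP/1.1" 200 2326 "-" "curl/8.0"')

# Simplified pattern extracting the fields a dashboard would chart.
PATTERN = re.compile(
    r'(?P<client>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" '
    r'(?P<status>\d{3}) (?P<bytes>\d+)'
)

fields = PATTERN.match(LINE).groupdict()
print(fields["status"], fields["path"])
```

Once every line is broken into named fields like this, Elasticsearch can index them and Kibana can aggregate them, for example counting requests per status code.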

The image below is an example Kibana dashboard giving a visual overview of Apache access logs.


It is worth noting that Logstash's resource consumption in the ELK stack is relatively high. In resource-constrained environments, therefore, Fluentd, which consumes fewer resources, is often used instead of Logstash (the so-called EFK stack).

Today, I've walked you through the basic ideas of application monitoring. Application monitoring can be divided into two parts, metric monitoring and log monitoring:

  • Metric monitoring mainly measures performance indicators over time, then processes, stores, and alerts on them as time series.
  • Log monitoring provides more detailed contextual information, typically collected, indexed, and visualized through the ELK stack.

In complex business scenarios that span multiple applications, you can also build a full-link tracing system. This lets you dynamically track the performance of individual components in the call chain and generate a call topology map of the entire flow, speeding up the diagnosis of performance problems in complex applications.
