
A deep dive into the eBPF-based Kubernetes troubleshooting panorama

Author | Li Huangdong

1

When Kubernetes became the de facto cloud-native standard, observability challenges followed

Cloud-native technology is built on containers and provides infrastructure through standard, extensible interfaces for scheduling, networking, storage, and the container runtime. It also delivers O&M capabilities through standard, extensible declarative resources and controllers. This two-layer standardization separates development and O&M concerns and enables greater scale and specialization in each domain, optimizing cost, efficiency, and stability across the board.

Against this technical backdrop, more and more companies are adopting cloud-native technology to develop, operate, and maintain their business applications. Because it opens up ever more complex possibilities, business applications now typically consist of many microservices written in multiple languages and speaking multiple communication protocols. At the same time, cloud-native technology pushes complexity down into the infrastructure, which creates new challenges for observability:

1. Sprawling microservice architectures mixing multiple languages and network protocols

Because of how work is divided in the business architecture, the number of services easily balloons, and the call protocols and relationships become very complex. Common resulting problems include:

No one can accurately and clearly understand or control the overall runtime architecture of the system;

No one can say with certainty whether the connectivity between applications is correct;

Multiple languages and multiple call protocols make instrumentation costs grow linearly, and repeated instrumentation has low ROI, so developers generally deprioritize such work, yet the observability data still has to be collected.

2. Infrastructure capabilities sink lower and hide implementation details, making problem demarcation ever harder

As infrastructure capabilities keep sinking lower in the stack and development and O&M concerns continue to separate, the layers shield their implementation details from each other and the data is poorly correlated, so it becomes impossible to quickly determine at which layer a problem lies. Developers only care whether the application works and pay no attention to the underlying infrastructure details, so when something goes wrong, development and O&M have to troubleshoot together. During troubleshooting, O&M needs enough upstream and downstream context to make progress; a vague statement like "application X has high latency" rarely leads anywhere. Development and O&M therefore need a common language to improve communication efficiency, and Kubernetes concepts such as Label and Namespace are well suited to building that contextual information.

3. A proliferation of monitoring systems results in inconsistent monitoring interfaces

A serious side effect of complex systems is the proliferation of monitoring systems. The data and links between them are not correlated or unified, and the monitoring interfaces offer an inconsistent experience. Many O&M engineers have had this experience: when locating a problem, the browser ends up with dozens of open windows, switching back and forth between Grafana, consoles, logs, and other tools. It is time-consuming, the brain can only process so much information, and problem localization becomes inefficient. A unified observability interface that organizes data and information effectively would reduce distraction and page switching, improve the efficiency of problem localization, and free up valuable time for building business logic.

2

Approach and technical solution

To solve these problems, we need a technology that supports multiple languages and communication protocols and, at the product level, covers the end-to-end observability requirements of the software stack as much as possible. After research, we propose an observability solution based on the container interfaces and the underlying operating system, correlated upward with application performance monitoring.

Collecting data across containers, node runtime environments, applications, and the network is a huge challenge. The cloud-native community offers cAdvisor, node exporter, kube-state-metrics, and other tools for different needs, but they still cannot meet every requirement, and the cost of maintaining a large number of collectors should not be underestimated. One question that arose was whether there is a non-intrusive data collection approach that can be extended dynamically without touching the application. The best answer so far is eBPF.

Data Acquisition (eBPF Superpower)

The power of eBPF

eBPF is essentially an execution engine built into the kernel. A program is attached to a kernel event and listens for that event. When the event fires, we can derive the protocol, filter for the protocols of interest, process the event further, and place the result in a ring buffer or in eBPF's own map data structure for a user-space process to read. After reading the data, the user-space process correlates it with Kubernetes metadata and pushes it to the storage backend. That is the overall processing flow.
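To make this flow concrete, here is a minimal kernel-side sketch, assuming a libbpf/CO-RE toolchain; the tracepoint, map, and field names are illustrative, not the product's actual probe. It attaches to a syscall event, writes a small record into a ring buffer, and leaves the Kubernetes-metadata enrichment to the user-space reader.

```c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

struct event {
    __u32 pid;
    char  comm[16];
};

struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);
    __uint(max_entries, 1 << 20);             /* 1 MiB ring buffer shared with user space */
} events SEC(".maps");

SEC("tracepoint/syscalls/sys_enter_connect")
int trace_connect(void *ctx)
{
    struct event *e = bpf_ringbuf_reserve(&events, sizeof(*e), 0);
    if (!e)
        return 0;                             /* drop the event if the buffer is full */

    e->pid = bpf_get_current_pid_tgid() >> 32;
    bpf_get_current_comm(&e->comm, sizeof(e->comm));
    bpf_ringbuf_submit(e, 0);                 /* user space enriches with K8s metadata */
    return 0;
}

char LICENSE[] SEC("license") = "GPL";
```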

eBPF's superpower lies in its ability to subscribe to all kinds of kernel events, such as file reads and writes and network traffic. Everything that containers or pods do in Kubernetes goes through kernel system calls; the kernel knows about everything happening in every process on the machine, so the kernel is just about the best vantage point for observability, which is why we chose eBPF. Another advantage of monitoring at the kernel level is that the application does not need to change and the kernel does not need to be recompiled, making the approach truly non-intrusive. When there are dozens or hundreds of applications in a cluster, a non-intrusive solution helps a great deal.

As a new technology, eBPF also raises some concerns, chiefly safety and probe performance. To fully guarantee kernel runtime safety, eBPF code is subject to a number of restrictions, such as a maximum stack of 512 bytes and a maximum of 1 million instructions. As for performance, the eBPF probe's overhead is kept at around 1%. Its efficiency comes mainly from processing data inside the kernel, which reduces copying between kernel space and user space. Simply put, the data is computed in the kernel and only the result, such as a gauge value, is handed to the user process; the old approach was to copy all the raw data to the user process and compute there.
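As a hedged illustration of "compute in the kernel, hand only the result to user space", here is a sketch with a hypothetical probe point and map name that aggregates bytes sent per process in a BPF hash map; the user-space agent reads the finished values as gauges instead of receiving every raw event.

```c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 10240);
    __type(key, __u32);      /* pid */
    __type(value, __u64);    /* bytes sent, aggregated in kernel */
} tx_bytes SEC(".maps");

SEC("kprobe/tcp_sendmsg")
int BPF_KPROBE(count_tx, struct sock *sk, struct msghdr *msg, size_t size)
{
    __u32 pid = bpf_get_current_pid_tgid() >> 32;
    __u64 *val = bpf_map_lookup_elem(&tx_bytes, &pid);

    if (val) {
        __sync_fetch_and_add(val, size);     /* aggregate in kernel: no per-event copy */
    } else {
        __u64 init = size;
        bpf_map_update_elem(&tx_bytes, &pid, &init, BPF_ANY);
    }
    return 0;
}

char LICENSE[] SEC("license") = "GPL";
```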

Programmable execution engines are naturally suitable for observability

Observability engineering helps users better understand the internal state of a system, eliminating knowledge blind spots and resolving systemic risks in a timely manner. What does eBPF bring to observability?

Take an application anomaly as an example. When an anomaly is discovered and the troubleshooting process reveals that application-level observability is missing, instrumentation is added, tested, and released to fill the gap. The specific problem gets solved, but the symptom is treated rather than the cause: the next time the problem appears somewhere else, the same cycle repeats. On top of that, multiple languages and protocols make instrumentation even more expensive. A better approach is to solve this non-intrusively, so that the data is not missing when observation is needed.

The eBPF execution engine can dynamically load and run eBPF scripts to collect observability data. For example, suppose the original Kubernetes system did no process-level monitoring, and one day a malicious process (say, a cryptominer) is found madly consuming CPU. At that point we realize that the creation of this kind of malicious process should have been monitored. We could integrate an open-source process-event detection library, but that usually means the full cycle of packaging, testing, and releasing, which can easily take a month.

By contrast, the eBPF approach is more efficient and faster. Because eBPF supports dynamically loading programs that listen for the kernel's process-creation events, we can package the eBPF script as a sub-module; the collection agent simply loads that sub-module's script to start collecting the data, and then pushes it to the backend through the unified data channel. This skips the tedious change-code, package, test, and release cycle and fulfills the process-monitoring need dynamically and non-intrusively. The eBPF programmable execution engine is therefore ideal for enhancing observability: it collects rich kernel data and correlates it with business applications to facilitate troubleshooting.
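The user-space side of this dynamic-loading idea might look roughly like the libbpf sketch below; the object file name, map name, and minimal error handling are placeholders, not the actual collection agent. The agent opens a compiled eBPF object shipped as a sub-module, attaches its programs at runtime, and polls its ring buffer, with no application change or release cycle.

```c
#include <stddef.h>
#include <bpf/libbpf.h>

/* Callback invoked for every record the kernel-side probe publishes. */
static int handle_event(void *ctx, void *data, size_t len)
{
    /* Here the agent would enrich the raw event with Kubernetes metadata
     * (pod, namespace, workload) and push it to the backend store. */
    return 0;
}

int main(void)
{
    /* Open and load a compiled eBPF object shipped as a sub-module. */
    struct bpf_object *obj = bpf_object__open_file("proc_monitor.bpf.o", NULL);
    if (!obj)
        return 1;
    if (bpf_object__load(obj))
        return 1;

    /* Attach every program in the module (error checks omitted for brevity). */
    struct bpf_program *prog;
    bpf_object__for_each_program(prog, obj)
        bpf_program__attach(prog);

    /* Poll the module's ring buffer and hand events to the callback. */
    int map_fd = bpf_object__find_map_fd_by_name(obj, "events");
    struct ring_buffer *rb = ring_buffer__new(map_fd, handle_event, NULL, NULL);
    if (!rb)
        return 1;

    while (ring_buffer__poll(rb, 100 /* ms */) >= 0)
        ;

    ring_buffer__free(rb);
    bpf_object__close(obj);
    return 0;
}
```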

3

From monitoring systems to observability

With the cloud-native wave, the concept of observability is gaining ground, yet it still rests on the data cornerstones of the three observability pillars: logs, metrics, and traces. Anyone who has done operations or SRE work knows the situation: pulled into an emergency group in the middle of the night and asked why the database is down, with no context it is impossible to grasp the core of the problem immediately. We believe a good observability platform should give users that context; as Datadog's CEO put it, monitoring tools are not about having more features, but about thinking about how to bridge the gap between teams and get everyone on the same page.

Therefore, in designing the observability platform, we needed to integrate Alibaba Cloud's own cloud services around metrics, traces, and logs, support data from open-source products as well, correlate the key contextual information, make it understandable to engineers from different backgrounds, and thereby accelerate troubleshooting. Information that is not organized effectively incurs a cost of understanding, so information is arranged from coarse to fine granularity, events -> metrics -> traces -> logs, on a single page that supports drill-down, without jumping back and forth between systems, providing a consistent experience.

So how is the data correlated and the information organized? Mainly along two axes:

1. End to end: here this means application to application and service to service. Kubernetes' standardization and separation of concerns let development and operations each focus on their own domain, but as a result end-to-end monitoring often becomes a no-man's-land, and when a problem occurs it is hard to tell which hop on the path went wrong. From the end-to-end perspective, the invocation relationship between two parties is the basis of correlation, because it is the system calls that create the connection. With eBPF it is very convenient to collect network calls non-intrusively; the calls are then parsed into the application protocols we are familiar with, such as HTTP, gRPC, and MySQL, and finally assembled into a clear service topology for quickly locating problems. In the figure below, the complete path is Gateway -> Java application -> Python application -> cloud service, and latency at any hop should be visible at a glance in the service topology (a rough data-model sketch follows this list). This is the first axis of correlation: end to end.

2. Top-down full-stack correlation: with the Pod as the medium, the Kubernetes layer can be associated with Workloads, Services, and other objects; the infrastructure layer with nodes, storage devices, networks, and so on; and the application layer with logs, call traces, and more.
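As referenced in the first item above, here is a rough sketch of how a topology edge could be modeled once eBPF-captured calls have been parsed into application protocols; the field names are illustrative rather than the product's actual schema.

```c
#include <stdint.h>

/* Which application protocol the parsed call used. */
enum protocol { PROTO_HTTP, PROTO_GRPC, PROTO_MYSQL, PROTO_REDIS };

/* One edge in the service topology: "who calls whom, over what". */
struct edge_key {
    char src_workload[64];        /* e.g. "gateway" */
    char dst_workload[64];        /* e.g. "product" */
    enum protocol proto;
};

/* Gold signals aggregated per edge and per time window. */
struct edge_metrics {
    uint64_t requests;
    uint64_t errors;
    uint64_t slow_calls;
    uint64_t total_latency_us;    /* divide by requests for average RT */
};
```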

Let's take a look at the core features of Kubernetes monitoring.

Timeless gold indicators

Gold indicators are the minimal set of metrics used to monitor the performance and state of a system. They have two advantages: first, they directly express whether the system is serving external requests normally; second, they quickly convey the impact on users and the severity of the situation, which saves SRE and R&D a great deal of time. Imagine taking CPU usage as a gold indicator: SRE and R&D would be run ragged, because high CPU usage may have little actual impact.

Kubernetes monitoring supports these metrics:

Request count / QPS

Response time and quantiles (P50, P90, P95, P99)

Error count

Slow call count

As shown in the following figure:
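Separately from the product screenshot referenced above, the sketch below illustrates one way response-time quantiles can be gathered without instrumentation: latencies are bucketed into a log2 histogram inside the kernel, and user space derives P50/P90/P95/P99 from the buckets. The probe point (vfs_read) and map names are stand-ins, not the product's actual probes, which hook protocol traffic.

```c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 10240);
    __type(key, __u64);            /* pid_tgid of the in-flight call */
    __type(value, __u64);          /* entry timestamp, ns */
} start_ts SEC(".maps");

struct {
    __uint(type, BPF_MAP_TYPE_ARRAY);
    __uint(max_entries, 32);
    __type(key, __u32);            /* log2(latency in us) bucket */
    __type(value, __u64);          /* count */
} latency_hist SEC(".maps");

static __always_inline __u32 log2_u32(__u32 v)
{
    /* branchless log2 keeps the verifier happy (no unbounded loops) */
    __u32 shift, r;
    r = (v > 0xFFFF) << 4; v >>= r;
    shift = (v > 0xFF) << 3; v >>= shift; r |= shift;
    shift = (v > 0xF) << 2;  v >>= shift; r |= shift;
    shift = (v > 0x3) << 1;  v >>= shift; r |= shift;
    r |= (v >> 1);
    return r;
}

SEC("kprobe/vfs_read")
int BPF_KPROBE(call_enter)
{
    __u64 id = bpf_get_current_pid_tgid();
    __u64 ts = bpf_ktime_get_ns();
    bpf_map_update_elem(&start_ts, &id, &ts, BPF_ANY);
    return 0;
}

SEC("kretprobe/vfs_read")
int BPF_KRETPROBE(call_exit)
{
    __u64 id = bpf_get_current_pid_tgid();
    __u64 *tsp = bpf_map_lookup_elem(&start_ts, &id);
    if (!tsp)
        return 0;

    __u32 us = (bpf_ktime_get_ns() - *tsp) / 1000;
    __u32 bucket = log2_u32(us);
    if (bucket > 31)
        bucket = 31;

    __u64 *cnt = bpf_map_lookup_elem(&latency_hist, &bucket);
    if (cnt)
        __sync_fetch_and_add(cnt, 1);
    bpf_map_delete_elem(&start_ts, &id);
    return 0;
}

char LICENSE[] SEC("license") = "GPL";
```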

Service topology with a global perspective

Zhuge Liang once said, "He who does not plan for the whole is not fit to plan for a part." As technical and deployment architectures grow ever more complex, locating a problem after it occurs becomes harder and harder, driving MTTR up. Another effect is the serious challenge it poses to impact analysis, where one side or the other is always missed. Hence the need for a topology map, something like a geographic map of the system. The global topology has the following characteristics:

System architecture awareness: the architecture diagram is an important reference for programmers getting to know a new system. At a minimum, you need to know where the traffic entry point is, what the core modules are, and which internal and external components the system depends on. During anomaly localization, having a global architecture picture is a tremendous help.

Dependency analysis: some problems occur in downstream dependencies. If that dependency is not maintained by your own team, things get more troublesome, and even worse when neither your system nor the downstream system has enough observability; in that situation it is hard to explain the problem clearly to the dependency's maintainers. In our topology, the gold indicators of upstream and downstream are connected by the call relationship to form a call graph. An edge serves as a visualization of a dependency, and you can view the gold signals of the corresponding calls on it. With gold signals, you can quickly tell whether a downstream dependency has a problem.

Distributed Tracing facilitates root cause localization

Protocol traces are likewise non-intrusive and language-agnostic. If the request content carries a distributed trace TraceID, it is automatically recognized, making it easy to drill down to the trace. The request and response content of the application-layer protocol help analyze the request body and return code to determine which interface has the problem. To view code-level details or the request payload, click the TraceID to drill down into trace analysis.
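As a simplified illustration of "automatically identifying a TraceID in the request content", the helper below scans captured HTTP header bytes for a tracing header; it assumes the W3C traceparent header and is a user-space sketch, not the product's parser.

```c
#include <stddef.h>
#include <strings.h>

/* Returns a pointer to the header value inside buf, or NULL if absent.
 * buf is a captured, possibly truncated HTTP request and need not be
 * NUL-terminated; len bounds every access. */
static const char *find_trace_parent(const char *buf, size_t len)
{
    static const char key[] = "traceparent:";

    for (size_t i = 0; i + sizeof(key) - 1 <= len; i++) {
        if (strncasecmp(buf + i, key, sizeof(key) - 1) == 0)
            return buf + i + sizeof(key) - 1;   /* start of the trace context value */
    }
    return NULL;
}
```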

Out-of-the-box alerting function

Out-of-the-box alert templates cover every layer of the stack, with no manual alert configuration required. Large-scale Kubernetes O&M experience is baked into the templates, and well-designed alert rules plus intelligent noise reduction and deduplication mean that once an alert fires, it is an actionable alert that carries the associated context needed to quickly locate the abnormal entity. The advantage of full-stack alert coverage is that high-risk events are reported to users promptly and proactively; users can then steadily improve system stability through troubleshooting and localization, resolving alerts, post-incident reviews, and failure-oriented design.

Network performance monitoring

Network performance issues are common in Kubernetes environments. Because TCP's underlying mechanisms mask the complexity of network transmission, the application layer is oblivious to it, which makes locating high packet-loss rates and high retransmission rates in production troublesome. Kubernetes monitoring supports RTT, retransmission and packet loss, and TCP connection information to characterize network state. Taking RTT as an example, network performance can be viewed across the namespace, node, container, pod, service, and workload dimensions, which supports locating problems such as the following (a minimal retransmission-counting probe is sketched after this list):

A load balancer cannot reach a pod and the traffic on that pod is 0; you need to determine whether the pod's network is at fault or the load balancer's configuration is wrong;

An application on one node seems to perform poorly; you need to determine whether that node's network has a problem, and can verify by going through another node's network;

Packet loss occurs somewhere on the path, but it is unclear at which layer; it can be checked in the order of node, pod, and container.
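The retransmission indicator mentioned above can be obtained from the kernel without touching applications. Below is a minimal, hypothetical sketch that counts TCP retransmissions via the kernel's tcp_retransmit_skb tracepoint, leaving per-node or per-pod attribution to the user-space agent.

```c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

struct {
    __uint(type, BPF_MAP_TYPE_ARRAY);
    __uint(max_entries, 1);
    __type(key, __u32);
    __type(value, __u64);            /* total retransmitted segments */
} retrans SEC(".maps");

SEC("tracepoint/tcp/tcp_retransmit_skb")
int count_retransmit(void *ctx)
{
    __u32 key = 0;
    __u64 *cnt = bpf_map_lookup_elem(&retrans, &key);
    if (cnt)
        __sync_fetch_and_add(cnt, 1);   /* agent reads this periodically as a rate */
    return 0;
}

char LICENSE[] SEC("license") = "GPL";
```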

4

The Kubernetes observability panorama

With the product capabilities above, and drawing on Alibaba's rich and deep practice with containers and Kubernetes, we have distilled these valuable production lessons into product features that help users locate production issues more effectively and quickly. The troubleshooting panorama can be used as follows:

Structurally, Services and Deployments (applications) are the entry point, and most developers only need to focus on this layer: whether a service or application is slow, whether the service is reachable, and whether the replica count matches expectations.

The next layer down is the Pods that actually carry the workload. Pods are where to look for failing or slow requests, health, resource sufficiency, and the health of downstream dependencies.

The bottom layer is the nodes, which provide the runtime environment and resources for Pods and Services. Here the focus is on whether the nodes are healthy, whether they are schedulable, and whether their resources are sufficient.

Troubleshooting common problems

Network issues

Networking is the trickiest and most frequent class of problem in Kubernetes, because several things make locating production network problems hard:

Kubernetes' network architecture is highly complex, with nodes, pods, containers, Services, and VPCs all in play at once, enough to make your head spin;

Network troubleshooting requires a certain amount of expertise, and most people have an innate dread of network problems;

The eight fallacies of distributed computing remind us that the network is not reliable, the network topology does not stay static, and latency cannot be ignored, all of which make the end-to-end network topology uncertain.

Typical network problems in a Kubernetes environment include:

conntrack table full;

IP conflicts;

Slow or failing CoreDNS resolution;

A node with no outbound internet access (yes, you read that right);

A Service that cannot be reached;

Configuration issues (load balancer, routing, device, or NIC configuration);

A network outage making the entire service unavailable.

Network problems come in endless varieties, but what stays constant is that the network has its own "gold indicators" to characterize whether it is operating normally:

Network traffic and bandwidth;

Packet loss (and loss rate) and retransmissions (and retransmission rate);

RTT.

The following example shows a slow call caused by a network issue. Slow calls were observed at the gateway; looking at the topology, the downstream product service had high RT, yet the product service's own gold indicators showed it was fine. Looking further at the network between the two, RTT and retransmissions were both high, indicating degraded network performance that slowed overall transmission. TCP's retransmission mechanism masked this fact, so it could not be perceived at the application level and the logs showed nothing. Here the network's gold indicators helped delimit the problem and sped up the investigation.

Node issues

Kubernetes works hard to ensure that the nodes it offers to workloads and services are healthy: the node controller checks node status around the clock, and when it finds issues affecting normal operation it marks the node NotReady or unschedulable and, via the kubelet, evicts business Pods from the problem node. This is Kubernetes' first line of defense. The second line of defense is the node self-healing components that cloud vendors build for high-frequency abnormal-node scenarios, such as Alibaba Cloud's node repairer: after detecting a problem node, it drains, evicts, and replaces the machine, keeping the business running automatically. Even so, over long-term use nodes inevitably develop all sorts of strange problems that are time-consuming and labor-intensive to localize. Common problems by category and severity:

Take a node with its CPU maxed out as an example:

1. The node status is OK, but CPU usage exceeds 90%;

2. Look at the corresponding CPU "triple": usage, TopN, and the time-series chart. First, every core's utilization is very high, which drives up overall CPU usage; next, we naturally want to know who is hogging the CPU, and the TopN list shows one Pod that stands out; finally, we confirm when the CPU surge began.

Service response is slow

There are many causes of slow service response: code design problems, network problems, resource contention, or slow downstream dependencies. In a complex Kubernetes environment, diagnosing a slow call can start from two angles: first, is the application itself slow; second, is the downstream or the network slow; and finally, check resource usage. As shown in the figure below, Kubernetes monitoring analyzes service performance along both a horizontal and a vertical axis:

Horizontal, the end-to-end axis: first check whether your own service's gold indicators show a problem, then work through the downstream network indicators. Note that if, from the client's point of view, the downstream call is slow while the downstream's own gold indicators look normal, the cause is very likely the network or the operating system layer; at that point the network performance indicators (traffic, packet loss, retransmission, RTT, and so on) can be used to make the call.

Vertical: once you have determined that the application's own externally observed latency is high, the next step is to determine the cause. A flame graph shows which step or which method is slow. If the code is fine, the environment the code runs in may be the problem, so check whether the system's CPU, memory, and other resources are constrained to troubleshoot further.

Here is an example of a slow SQL query (shown below). The gateway calls the product service, which depends on a MySQL service. Working through the gold indicators along the path, we finally find that the product service executes a particularly complex SQL statement joining multiple tables, which makes the MySQL service respond slowly. Since the MySQL protocol runs over TCP, our eBPF probe recognizes the MySQL protocol and reassembles and restores its content, so the SQL statements can be collected no matter what language executes them.
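As a simplified illustration of recognizing MySQL traffic, the helper below checks whether a captured (already reassembled) TCP payload is a MySQL COM_QUERY packet and, if so, returns the SQL text; it is a user-space sketch with illustrative names, not the product's protocol parser.

```c
#include <stddef.h>
#include <stdint.h>

/* MySQL client packet: 3-byte little-endian payload length, 1-byte sequence
 * id, then the payload; for COM_QUERY the payload is 0x03 followed by SQL. */
static const char *extract_mysql_query(const uint8_t *buf, size_t len, size_t *sql_len)
{
    if (len < 5)
        return NULL;

    uint32_t payload_len = buf[0] | (buf[1] << 8) | (buf[2] << 16);
    uint8_t command = buf[4];

    if (command != 0x03 /* COM_QUERY */ || payload_len < 1)
        return NULL;
    if (payload_len - 1 > len - 5)
        return NULL;                      /* truncated capture */

    *sql_len = payload_len - 1;           /* SQL text is not NUL-terminated */
    return (const char *)buf + 5;
}
```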

The second example is one where the application itself is slow. The natural question is which step and which function causes the slowness. The flame graph supported by ARMS application monitoring, built from periodic sampling of CPU time (shown below), helps locate the code-level problem quickly.
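The sampling idea behind a CPU flame graph can be sketched in eBPF as below; this is a rough illustration, not ARMS's implementation. A perf_event-attached program fires at a fixed frequency, records the user and kernel stack ids of whatever is on CPU, and user space later symbolizes the stacks and folds them into a flame graph.

```c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

struct key_t {
    __u32 pid;
    __s64 user_stack_id;
    __s64 kern_stack_id;
};

struct {
    __uint(type, BPF_MAP_TYPE_STACK_TRACE);
    __uint(max_entries, 16384);
    __uint(key_size, sizeof(__u32));
    __uint(value_size, 127 * sizeof(__u64));  /* up to 127 frames per stack */
} stacks SEC(".maps");

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 16384);
    __type(key, struct key_t);
    __type(value, __u64);                     /* sample count per unique stack */
} counts SEC(".maps");

SEC("perf_event")
int on_cpu_sample(struct bpf_perf_event_data *ctx)
{
    struct key_t key = {};
    key.pid = bpf_get_current_pid_tgid() >> 32;
    key.kern_stack_id = bpf_get_stackid(ctx, &stacks, 0);
    key.user_stack_id = bpf_get_stackid(ctx, &stacks, BPF_F_USER_STACK);

    __u64 one = 1, *cnt = bpf_map_lookup_elem(&counts, &key);
    if (cnt)
        __sync_fetch_and_add(cnt, 1);
    else
        bpf_map_update_elem(&counts, &key, &one, BPF_NOEXIST);
    return 0;
}

char LICENSE[] SEC("license") = "GPL";
```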

5

Application / Pod status issues

A Pod is responsible for managing containers and is the vehicle that actually executes the business logic. At the same time, the Pod is Kubernetes' smallest scheduling unit, so it combines business complexity with infrastructure complexity and has to be examined together with logs, traces, system indicators, and downstream service indicators. Pod traffic issues are high-frequency problems in production, for example a sharp surge in database traffic; when there are thousands of Pods in the environment, it is particularly hard to work out which Pod the traffic is mainly coming from.

Let's look at a typical case: during a release, the downstream service rolls out a gray (canary) Pod which, due to a code problem, responds very slowly and causes the upstream to time out. The reason we can get Pod-level observability here is that we use eBPF to collect each Pod's traffic and gold indicators, so it is easy to view Pod-to-Pod, Pod-to-service, and Pod-to-external traffic through the topology and dashboards.

6

Summary

By using eBPF to non-intrusively collect gold indicators, network indicators, and traces across multiple languages and network protocols, correlating them with Kubernetes objects, applications, cloud services, and other context, and providing specialized tools (such as flame graphs) when deeper drill-down is needed, we achieve a one-stop observability platform for the Kubernetes environment.
