
In-depth interpretation: OpenChaos, the ballast stone of resilient distributed system architecture

Author | Si Ying, Jia Hao, Ma Hai

Key Takeaways

1. The first section introduces the concept of chaos engineering against the backdrop of the complexity and stability requirements of today's distributed systems, and explains the optimizations and innovations OpenChaos makes on top of traditional chaos engineering.

2. The second section introduces the architecture of OpenChaos, explains in detail how its reliability and elasticity models work, and uses two practical cases to show what OpenChaos can do in real application scenarios.

3. The last section looks ahead and lays out the future direction of OpenChaos.

Background

With the advent of Serverless, microservices (and service meshes), and ever more containerized applications, the way we build, deliver, and operate systems has become more complex, and that complexity makes the state of a system harder to observe. In existing production environments we have various ways to obtain information and enhance observability: at the simplest, a particular condition produces a particular metric; going further, structured and correlated logs are used, or, for distributed tracing, an event bus such as Apache EventMesh or EventGrid is introduced. With the rapid development of codeless, composable applications, the Serverless design philosophy has gradually been accepted by distributed system designers: zero operations, pay-as-you-go, extreme elasticity, and multi-tenant pooling all force us to re-examine the rationality of old architectures and push new architectures to keep evolving. "Converged architecture" has been one of the most-mentioned terms in recent years: the complex architectures that used to support separate online and offline systems are being continuously integrated, and designs that can be split apart and recombined at deployment time adapt to a wide range of business scenarios. Against this background, we began to seriously consider whether there is a more modern tool that can help identify and address the challenges of resilient architecture, such as reliability, security, and resiliency, that the distributed cloud brings to architecture design and upper-layer applications.

The ideas of chaos engineering inspired us to some extent. Netflix created Chaos Monkey while migrating its infrastructure to the cloud, kicking off the field of chaos engineering. Later, the CNCF set up a special interest group to push for standards in this area, and the OpenChaos founding team had several rounds of conversations with the forerunners of these communities in the early days. Unfortunately, in 2019 that interest group was merged into the App Delivery SIG and did not make much further progress. Driven by strong national policy guidance in recent years, the digital transformation of enterprises has been accelerating, and more and more CIOs, CTOs, and even CEOs have begun to value and invest in chaos engineering practice. The domestic Chaos Engineering Laboratory, led by the China Academy of Information and Communications Technology, is also actively promoting standards in this field, from full-link stress testing and chaotic fault injection to the multi-cloud, multi-active reference architectures that will shape future architectural change, all of which indicates that the industry is developing rapidly. According to surveys by technology media at home and abroad, by 2025, 80% of organizations will have adopted chaos engineering practices. Introducing strategies such as full-link stress testing, chaotic fault injection, and multi-cloud multi-active architecture across the board can cut unplanned downtime by 50% and achieve a true second-level RTO/RPO, letting teams focus more on application and service innovation.

Good medicine, but with limitations

The most basic process of chaos engineering is to run small-scale, regular, automated experiments in the production environment, randomly injecting faults into the system in order to observe the "system boundary". It focuses primarily on a system's fault tolerance and reliability in the face of failures. Most chaos engineering tools on the market today tend to construct black-box, random faults and have little understanding of or insight into the underlying infrastructure (hardware, operating system, database, and middleware). They lack a unified framework, standards, and mature, specific metrics. Their analysis and feedback are also weak: they cannot give comprehensive, thorough diagnostic suggestions, whereas capabilities such as reinforcement learning and generative AI could go further in improving random fault injection, self-healing risk analysis, and optimization recommendations.

For distributed systems with ever more complex features, judging how a system copes with failures through observation alone is of limited value. Relying on observation is highly subjective, makes it difficult to form a unified evaluation standard, and makes it hard to trace observed behavior back to system defects. Observability of the system requires not only comprehensive model coverage but also a complete monitoring system, along with comprehensive result reports and even intelligent predictions that guide the architecture toward greater resilience. Feng Jia, a senior technical expert in the distributed systems field and the original founder of Apache RocketMQ and OpenMessaging, said that "the evolution of cloud-native distributed architecture is moving further toward composable architecture and resilient architecture." Against this backdrop, he proposed and led the team to create the emerging OpenChaos project.

The essential problem that OpenChaos needs to solve

Resilient architecture covers high reliability, security, resiliency, immutable infrastructure, and more, and achieving a truly resilient architecture is undoubtedly the direction in which modern distributed systems are evolving. For the resilience of distributed systems, OpenChaos extends the definition with the help of chaos engineering ideas. For properties specific to certain distributed systems, such as the delivery semantics and push efficiency of Pub/Sub systems, the scheduling accuracy, auto scaling, and cold-start efficiency of scheduling and orchestration systems, the real-time behavior and backpressure efficiency of streaming systems, the completion rate and accuracy of retrieval systems, and the consistency of the consensus components of distributed systems, it defines dedicated detection models. OpenChaos has built-in, scalable model support to validate resilience in the face of large-scale data requests and various failure shocks, and to provide further optimization recommendations for architecture evolution.

Architecture and case analysis

Figure 1

Overall architecture

OpenChaos works like this: the Control node orchestrates the entire process. It assembles the cluster nodes into the distributed cluster under test, finds and loads the Driver components that match the distributed infrastructure being tested, and creates the number of clients that matches the configured concurrency. The Control node then drives the clients to operate on the cluster according to the execution flow defined by the Model component. During the exercise, the detection model injects events into the cluster nodes according to the different characteristics being observed, and the Metrics module monitors the performance of the cluster under test. After the exercise, the Checker component automatically analyzes the business and non-business data from the experiment, derives the test results, and outputs visual charts.

As shown in Figure 1, the overall architecture of OpenChaos can be divided into a management layer, an execution layer, and the layer of components under test.

At the top is the management layer, which contains the user interface and the controller (Control) that schedules the components of the execution layer. At the bottom is the component under test, which can be a self-deployed distributed system cluster or a distributed system hosted in containers or by a cloud vendor.

The middle layer is the execution layer, the secret behind OpenChaos' capabilities. A model is the basic unit of an executed process and defines the basic form of operation on the distributed system. The controller loads the driver of the distributed system under test specified in the model, creates the corresponding clients according to the configured concurrency, and finally uses those clients to perform operations on the distributed system. The detection model injects events based on the characteristics the user is interested in observing, such as faults or the scaling capacity of the system. The Metrics module monitors the performance of the cluster under test during the experiment. After the exercise, the measurement model automatically analyzes the business and non-business data from the experiment, derives the test results, and outputs visual charts.
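To make the division of labor concrete, the following is a minimal sketch, in Java, of how a controller, driver, detection model, and checker could be wired together. The interface and method names are our own illustration and do not reproduce the actual OpenChaos API.

```java
// Illustrative sketch of the execution-layer flow described above; not the real OpenChaos code.
import java.util.ArrayList;
import java.util.List;

interface Driver {                 // adapts the framework to a concrete system under test
    Client createClient();         // one client per configured degree of concurrency
}

interface Client {
    void invoke(long requestId);   // issue one operation, e.g. produce a message
    void close();
}

interface DetectionModel {         // injects events such as faults or scaling triggers
    void trigger();
    void recover();
}

interface Checker {                // analyzes the recorded data after the run
    void check(List<String> history);
}

final class Controller {
    void run(Driver driver, DetectionModel detection, Checker checker,
             int concurrency, long iterations) {
        List<Client> clients = new ArrayList<>();
        for (int i = 0; i < concurrency; i++) {
            clients.add(driver.createClient());
        }
        List<String> history = new ArrayList<>();
        for (long id = 0; id < iterations; id++) {
            if (id % 100 == 0) {
                detection.trigger();            // periodically introduce an event
            }
            clients.get((int) (id % concurrency)).invoke(id);
            history.add("invoked:" + id);       // a metrics recorder would capture latency etc.
            if (id % 100 == 50) {
                detection.recover();            // heal the injected event
            }
        }
        clients.forEach(Client::close);
        checker.check(history);                 // produce the result report
    }
}
```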

Detection model and measurement model

Detection model

Traditional chaos engineering is primarily concerned with the stability of systems, and its common implementation is to simulate common, general-purpose faults through black-box fault injection. The detection model in OpenChaos focuses on a higher-dimensional attribute, resilience, which includes reliability as well as detection models for characteristics such as elasticity and security. Compared with traditional chaos engineering, OpenChaos not only supports general black-box fault injection but can also perform customized detection targeted at distributed infrastructure software, such as the primary/standby switchover of a messaging system or cache, or the split-brain caused by a network partition, and observe how the system behaves in those situations.
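As an illustration of this kind of targeted detection, the sketch below shows what a detection model that isolates a cluster's leader through a network partition and later heals it might look like. The class, the ssh/iptables commands, and the host parameters are assumptions made for the example, not OpenChaos' real implementation.

```java
// Hypothetical leader-partition detection model; host names and commands are placeholders.
import java.io.IOException;
import java.util.List;

final class LeaderPartitionModel {
    private final String leaderHost;
    private final List<String> peerHosts;

    LeaderPartitionModel(String leaderHost, List<String> peerHosts) {
        this.leaderHost = leaderHost;
        this.peerHosts = peerHosts;
    }

    /** Isolate the leader: drop traffic between it and every other cluster member. */
    void trigger() throws IOException, InterruptedException {
        for (String peer : peerHosts) {
            exec("ssh", leaderHost, "iptables -A INPUT -s " + peer + " -j DROP");
            exec("ssh", leaderHost, "iptables -A OUTPUT -d " + peer + " -j DROP");
        }
    }

    /** Heal the partition by flushing the rules added above. */
    void recover() throws IOException, InterruptedException {
        exec("ssh", leaderHost, "iptables -F");
    }

    private static void exec(String... cmd) throws IOException, InterruptedException {
        new ProcessBuilder(cmd).inheritIO().start().waitFor();
    }
}
```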

Measurement model

Because of the complexity of distributed systems, observing their resilience calls for a simple, intuitive analysis report that makes it easier to find possible defects and deficiencies. The measurement model analyzes the performance of the system with a unified, standardized calculation and outputs results and charts, which makes comparison and analysis convenient for users.

Taking the stability evaluation of a messaging system as an example, the measurement model calculates the system's RPO (Recovery Point Objective) and RTO (Recovery Time Objective) from the fault injections and the system's behavior during the experiment. It also outputs the cluster's processing semantics, such as whether they conform to at-least-once or exactly-once; the failure recovery behavior, that is, whether the system becomes unavailable during the failure and how long it takes to recover; whether the expected partition ordering is preserved under the fault; and the response time of the system throughout the experiment.
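The following is a minimal sketch, under an assumed record format, of how such a measurement model could derive an RTO figure and a data-loss count from the recorded experiment history. It is illustrative only and not the actual OpenChaos measurement code.

```java
// Hypothetical post-experiment analysis; the SendRecord format is an assumption for the example.
import java.util.List;
import java.util.Set;

record SendRecord(long messageId, long timestampMillis, boolean succeeded) {}

final class RecoveryMetrics {
    /** RTO approximation: the longest continuous window in which every send failed. */
    static long recoveryTimeMillis(List<SendRecord> history) {
        long worst = 0, outageStart = -1;
        for (SendRecord r : history) {
            if (!r.succeeded()) {
                if (outageStart < 0) outageStart = r.timestampMillis();
                worst = Math.max(worst, r.timestampMillis() - outageStart);
            } else {
                outageStart = -1;                // service is available again
            }
        }
        return worst;
    }

    /** Messages acknowledged by the broker but never delivered: lost data, so RPO > 0. */
    static long lostMessages(List<SendRecord> history, Set<Long> deliveredIds) {
        return history.stream()
                .filter(SendRecord::succeeded)
                .filter(r -> !deliveredIds.contains(r.messageId()))
                .count();
    }
}
```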

Reliability case study

We used OpenChaos to test the reliability of an ETCD cluster and found that, from the perspective of the ETCD client, the cluster lacked automatic recovery capability in the scenario where the primary node's network was disconnected and it became a partition on its own.

Figure 2

The following is an example of an experimental result obtained with OpenChaos. It shows the behavior of a 3-node ETCD cluster when the primary node is disconnected from the network of the other nodes and becomes a partition on its own, with a simulated traffic rate of 1000 tps.

Figure 3

As can be seen from the figure, the experiment lasted 10 minutes and injected a total of ten primary-node network partition faults at 30-second intervals, and the cluster's behavior during the faults was inconsistent. The following figure shows the results of the experiment in more detail.

During the 1st/3rd/6th/8th faults, the cluster could not recover on its own; during the other faults it took about 7 seconds for the cluster to return to an available state, but there was no data loss throughout the experiment.

Figure 4

By examining the information from the experiment, we found that every time the primary node was partitioned, the cluster was able to transfer the primary role on its own during the fault. By analyzing the source code, we found that the ETCD client does not retry against other nodes when it encounters an ETCD internal error. As a result, when the node the client first connects to is the primary node and that node becomes unavailable, the service as seen by that client remains unrecovered even though the primary role has been successfully transferred and the cluster as a whole has returned to availability. We have reported this issue to the ETCD community, pending further fixes.
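The workaround implied by this finding is to retry against the remaining cluster members instead of staying pinned to the isolated leader. The sketch below shows the general pattern in Java; the Operation interface is hypothetical, and a real fix would live inside the ETCD client library itself.

```java
// Generic client-side failover pattern; not the ETCD client's actual API.
import java.util.List;

final class FailoverCaller {
    interface Operation<T> {
        T call(String endpoint) throws Exception;   // e.g. a put/get against one endpoint
    }

    static <T> T callWithFailover(List<String> endpoints, Operation<T> op) throws Exception {
        Exception last = null;
        for (String endpoint : endpoints) {          // try each member in turn
            try {
                return op.call(endpoint);
            } catch (Exception e) {
                last = e;                            // this member may be the isolated leader
            }
        }
        if (last == null) {
            throw new IllegalArgumentException("no endpoints configured");
        }
        throw last;                                  // all members failed
    }
}
```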

Resiliency case study

Resiliency is another capability that distributed systems need to focus on, and in addition to reliability, OpenChaos supports measuring and evaluating the scalability of a system. Unlike reliability, the elasticity of a distributed system cannot be probed by orchestrating events at a fixed frequency. Instead, OpenChaos can trigger scaling based on operating system metrics or business metric thresholds set by the user. For example, you can specify an expected value of 40% for the cluster's average CPU utilization, or an expected system response time of 100 ms. The elasticity detection model then calculates the target scale to expand or shrink to, based on the specified expected value, the current system performance, and OpenChaos' built-in algorithm, and triggers the scaling action. After the experiment, the measurement model calculates the cluster's "speedup efficiency" and "scaling cost", as well as the cluster's performance at each scale.

Note: "Acceleration than Efficiency" and "Scaling Cost" are indicators of the elasticity of distributed systems in OpenChaos, the former representing the performance and effect of distributed system parallelization, and the latter representing the rate of system scaling.

Elasticity covers not only the scalability of instance nodes but also the scalability of a specific service (application) unit. To explore best practices for using Kafka partitions, we designed an experiment on the scalability of a single topic's partitions. In the experiment we also measured message send and receive throughput at different partition counts, to understand the impact of the partition count on throughput and to find the optimal number of partitions for maximum throughput.
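For reference, the partition scale-out step of such an experiment can be driven through the standard Kafka AdminClient, as in the sketch below. The broker address, topic name, and target partition count are placeholders, and the timing shown is only a rough way to observe the scaling cost.

```java
// Minimal sketch of increasing a topic's partition count and timing the change.
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewPartitions;

public class PartitionScaleOut {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092");  // placeholder broker
        try (AdminClient admin = AdminClient.create(props)) {
            long start = System.currentTimeMillis();
            // Grow the topic from its current partition count to 26 partitions.
            admin.createPartitions(Map.of("bench-topic", NewPartitions.increaseTo(26)))
                 .all()
                 .get();
            System.out.println("scale-out took " + (System.currentTimeMillis() - start) + " ms");
        }
    }
}
```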

Figure 5 shows the tps and latency of a topic on a three-node cluster as its partition count scales out from 1 to 9000.

Figure 5

Figure 6 shows how each indicator changes over time.

Figure 6

Figure 7 is a partial screenshot of the specific elasticity evaluation results, showing the performance of the system at different scales along with the cost and efficiency of the elastic changes. The changeCost and resilienceEfficiency fields correspond to the scaling cost and speedup efficiency described above.

Figure 7

From the above results, we can see that under this experimental specification the average time to add one partition to the Kafka cluster is about 20 ms. Performance is optimal when the partition count reaches 26, with a throughput of 1.3 million and an overall CPU utilization of 93%. When the partition count exceeds roughly 450, performance degrades significantly; when it reaches 1992, throughput drops to 38,000 and overall CPU utilization reaches 97%.

Future planning

At present, OpenChaos supports integration with most mainstream distributed systems, such as Apache Kafka, Apache RocketMQ, DLedger, Redis, ETCD, and so on. Through the Open Source Summer 2022 event[1], we have opened up more distributed system integration work for students to choose from and participate in.

At the same time, HUAWEI CLOUD worked closely with the Chaos Engineering Lab to help the China Academy of Information and Communications Technology release the first distributed message queue stability assessment standard in China, and was a major contributor to this standard. In addition, the HUAWEI CLOUD middleware messaging product family is the only application service that has fully passed the acceptance criteria.

In the future, OpenChaos will introduce more common resilience standards and intelligent prediction capabilities, so that it can not only measure the existing capabilities of an architecture but also make predictions based on observations of the system, avoiding anomalies that exceed the system's own resilience. We will continue to polish the project, integrate more distributed systems through ecosystem cooperation, and strive to build OpenChaos into a ballast stone for the resilient architecture of distributed systems, promoting the continuous evolution of cloud-native architecture so that, at critical moments, we can "sit firmly in the fishing boat no matter how the wind and waves rise".

[1] Open Source Summer 2022 Events:

https://summer-ospp.ac.cn/#/org/prodetail/221bf0008

About the Author:

Si Ying, senior R&D engineer, has a deep understanding of and research into distributed consensus algorithms, resilient architecture, and pattern recognition.

Jia Hao, senior middleware R&D engineer, is responsible for the design and development of distributed middleware for HUAWEI CLOUD, specializes in middleware performance optimization, and favors a simplicity-first design philosophy.

Ma Hai is a HUAWEI CLOUD middleware reliability technology expert, specializing in chaos engineering, performance testing, and event-driven architecture design.
