
Technology Application | Chaos engineering practice on a cloud-native foundation

Author: Digitization of finance

Text / Wang Zijian, head of the platform architecture R&D team of the Software Development Center of Bank of Beijing

In recent years, the spread of cloud native technology, the adoption of microservice architecture, and the rapid introduction of new open source components have made it harder to balance system iteration, technology upgrades, and system stability while still supporting agile development and iterative releases. As a result, banking technology faces significant transformation challenges.

Against this background, in November 2021 Bank of Beijing used chaos engineering to test the robustness of its unified cloud-native technology base and to verify and improve the platform's service capabilities. In July 2022 it began building out its chaos engineering capability.

Practical exploration of platform-level chaos testing

To advance its digital transformation strategy, Bank of Beijing completed a cloud-native application technology architecture in January 2021, establishing a unified technology base built on the industry's mainstream microservice and distributed architecture design. The unified technology base is a self-developed platform within the cloud-native framework. Its robustness and reliability call for dedicated test stages on top of existing test methods, and its microservice architecture and components need an external, comprehensive evaluation that produces optimization plans, both to verify the platform's lower bound of fault resistance and to raise its upper bound of service governance. Meanwhile, as the unified technology base has been rolled out at scale, nearly 40 business systems have been developed and deployed on it, including key systems such as the distributed core system and the new teller system. The base is also driving the bank's entire software R&D system toward microservices and distributed architecture, which raises business system complexity, operational security requirements, and the difficulty of emergency response, and therefore requires matching methods for simulation testing and drills. Because the stability and fault response capability of the platform itself have far-reaching impact, dedicated fault simulation verification is needed to meet the requirements of national classified protection of information security. The compatibility, high availability, and fault recovery capability (MTTR) of the trusted security infrastructure also need to be tested and verified. It is against this background that chaos engineering capability building and continuous practice are carried out.

Starting in August 2021, Bank of Beijing ran chaos tests on the unified technology base. The test cycle lasted three months and focused on the base's container cloud and microservice components. It drew on experience with redundancy switchover of IaaS infrastructure in the production environment, met the requirements of information innovation testing, and relied on cloud-native observability as a supporting capability. The distributed core system and the new teller system served as pilots to verify how well chaos engineering extends to application scenarios.

In total, 175 fault scenarios and more than 500 test cases were executed, 37 platform stability capabilities were identified as needing improvement, and the cloud-native maturity level rose from level 2 (basic) to level 4 (excellent). This first round of chaos testing covered basic resources, platforms, application middleware, monitoring and alarms, disaster recovery, and emergency plans, extending from basic testing to organizational processes. It greatly improved the resilience of the technical architecture, increased confidence in launching the new cloud-native core system, strengthened the ability to withstand uncertainty in a highly complex application architecture, verified the robustness of the unified development platform, verified and improved the platform's service capabilities, and ensured that the unified technology base could be rolled out to business systems as planned.

The first phase of the construction practice of the chaos engineering platform

To strengthen its chaos drill capability and support regular chaos testing of major systems, Bank of Beijing set out in 2022 to build a localized chaos engineering test platform. The first phase of the chaos engineering construction project began in mid-2022: the chaos engineering platform and a library of fault simulation drill scenarios were built, forming a high-availability capability matrix for microservices and distributed architecture. The resulting platform covers infrastructure management, fault scenarios, media control, scenario library management, drill plans, experiment workflows, experiment protection, experiment observation, experiment reports, permission management, and security audits.

For drill materials, the platform provides libraries of fault factors, scenarios, and indicators. It defines 129 fault factors across five dimensions: system processes, method calls, hardware resources, network transmission, and application data. These apply to the selected test scope, which may be Kubernetes, physical devices, virtual hosts, or other target types, and they can be combined quickly in customized ways. Based on frequency of occurrence and historical production faults, common fault factors can be marked as high-frequency fault factors and given priority in fault drills. The scenario library combines and orchestrates different types of atomic faults, assembling them in series or in parallel for fault injection tests, and a scenario created once can be reused across environments, systems, and project test cases. The indicator library provides multi-level, multi-angle system monitoring and indicator strategy configuration: once atomic faults are selected, the experiment automatically matches the corresponding observation indicators and tracks their changes in real time as a reference for the experimental results.
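
The series and parallel assembly of atomic faults described above can be sketched as a small tree of steps. This is a minimal illustration under stated assumptions, not the platform's implementation: the names `AtomicFault`, `Serial`, `Parallel`, and `make_fault` are hypothetical, and the injectors only report what they would do.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, List, Union


@dataclass
class AtomicFault:
    """One fault factor, e.g. pod CPU load (hypothetical model)."""
    name: str
    inject: Callable[[], str]  # performs the injection, returns a short result


@dataclass
class Serial:
    """Run the child steps one after another."""
    steps: List["Node"]


@dataclass
class Parallel:
    """Start all child steps together."""
    steps: List["Node"]


Node = Union[AtomicFault, Serial, Parallel]


def run(node: Node) -> List[str]:
    """Execute a scenario tree and return results in declaration order."""
    if isinstance(node, AtomicFault):
        return [node.inject()]
    if isinstance(node, Serial):
        results: List[str] = []
        for step in node.steps:
            results.extend(run(step))
        return results
    # Parallel: submit every branch, then collect in declared order.
    with ThreadPoolExecutor(max_workers=max(1, len(node.steps))) as pool:
        futures = [pool.submit(run, step) for step in node.steps]
        return [r for f in futures for r in f.result()]


def make_fault(name: str) -> AtomicFault:
    # Stand-in injector that only reports the fault it would inject.
    return AtomicFault(name, lambda: f"injected {name}")


# A reusable scenario: CPU load first, then delay and scaling together.
scenario = Serial([
    make_fault("pod-cpu-load"),
    Parallel([make_fault("pod-network-delay"), make_fault("pod-scale")]),
])
```

Because a scenario is plain data, the same `scenario` object could be run against different environments by swapping the injectors, which is one way to read the reuse described above.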

For the drill process, the platform provides multiple fault injection methods, start strategies, and execution modes. Start strategies include manual, scheduled, and random execution, which keeps drills flexible and controllable. To keep drills safe, the platform supports controlling the blast radius and quickly restoring the drill environment. Operators can manually pause an experiment, restore the environment, or terminate fault injection at any time with one click, and the platform can also terminate an experiment automatically based on metrics and alarms.
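
The two safety mechanisms named above, blast radius control and metric-based termination, can be sketched as follows. This is an illustrative outline only: the function names, the error-rate metric, and the threshold are assumptions, and a real guard would wait between polls instead of reading samples back to back.

```python
import math
from typing import Callable, List


def limit_blast_radius(targets: List[str], max_fraction: float) -> List[str]:
    """Keep at most max_fraction of the candidate targets (at least one)."""
    if not targets:
        return []
    count = max(1, math.floor(len(targets) * max_fraction))
    return targets[:count]


def run_guarded(
    inject: Callable[[], None],
    recover: Callable[[], None],
    read_error_rate: Callable[[], float],
    abort_threshold: float,
    checks: int,
) -> str:
    """Inject a fault, poll a metric, and stop early if it crosses the threshold.

    Recovery always runs, whether the experiment completes or is aborted.
    """
    inject()
    try:
        for _ in range(checks):
            if read_error_rate() > abort_threshold:
                return "aborted"
        return "completed"
    finally:
        recover()
```

Putting recovery in a `finally` block means the environment is restored on every exit path, including an unexpected exception while reading the metric.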

For result observation, experiments are organized by classification, experiment records, and experiment baselines, so testers can focus on the services, resources, and applications under their own project and create a chaos engineering experiment plan dedicated to that project, while avoiding conflicts over experimental resources and targets. A scheduling function addresses cases where resource conflicts would prevent an experiment from running normally or make its results inaccurate: experiments are scheduled according to the resources they use, so that the available resources are used sensibly. The results view shows the experiment orchestration, an indicator overview, and experiment events, indicates whether the experiment succeeded, and uses the indicators to assess how severe the fault's impact was. An experiment report is generated automatically from the results and can be edited and exported. It includes the start time, end time, duration, and execution details of the experiment, along with the performance at each stage and trend charts.
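
One simple way to schedule experiments around resource conflicts, as described above, is to group them into rounds in which no two experiments touch the same target. The greedy approach below is an assumption chosen for illustration, not the platform's actual scheduler, and the experiment and resource names are hypothetical.

```python
from typing import Dict, List, Set


def schedule(experiments: Dict[str, Set[str]]) -> List[List[str]]:
    """Group experiments into rounds so that no two in a round share a target.

    `experiments` maps an experiment name to the resources it touches.
    Each experiment goes into the first round where it conflicts with nothing.
    """
    rounds: List[List[str]] = []
    busy: List[Set[str]] = []  # resources already claimed in each round
    for name, resources in experiments.items():
        for i, used in enumerate(busy):
            if used.isdisjoint(resources):
                rounds[i].append(name)
                used.update(resources)
                break
        else:
            # Conflicts with every existing round: open a new one.
            rounds.append([name])
            busy.append(set(resources))
    return rounds


plan = schedule({
    "cpu-load": {"teller-pod"},
    "net-delay": {"teller-pod", "gateway"},
    "disk-fill": {"core-db-host"},
})
```

Here `cpu-load` and `net-delay` both target `teller-pod`, so they land in different rounds, while `disk-fill` can share a round with `cpu-load`.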

At the end of November 2022, the first phase of the chaos engineering platform project was officially delivered. The platform now supports testing that covers entire systems at multiple levels. At the level of infrastructure and shared components, such as hardware servers, container cloud platforms, and microservice components, it provides matching fault types, and experiments can be orchestrated to suit the characteristics of the experimental environment. At the system level, for important systems such as the core and teller systems, scenario analysis is carried out with the chaos testing tools, scripts are written to inject fault factors automatically, and test tasks are completed in batches. The chaos platform is also connected to the monitoring and alarm system, using cloud-native monitoring to visualize every stage of an experiment.

Since the platform went into operation, more than 300 experiment plans have been designed and more than 1,500 experiments have been run. On average, five parameters have been adjusted across the experiments in each plan, which indicates a high reuse rate. Experiments whose results matched expectations account for 99.32% of the total.

Pod CPU load, pod scaling, and pod network latency are the three most commonly used fault types. Chaos drills have uncovered a variety of defects and vulnerabilities that were then fixed in advance. For example, injecting network delay faults revealed problems with timeout settings, and reasonable timeout configurations were recommended to make interactions between systems more robust. The fault observation capabilities of the chaos engineering platform showed that, before some systems went into production, monitoring alarm thresholds were either not set or not set reasonably, and reasonable thresholds were recommended so that problems caused by gaps in monitoring could be anticipated. Injecting pod scaling faults showed that service was interrupted while pods restarted, and graceful shutdown configurations were recommended to improve business continuity under exceptional conditions. Forcibly killing services confirmed that interrupted batch processing or batch transactions lacked a compensation mechanism, and a targeted manual rerun plan was prepared before production.
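
The network delay finding above can be reproduced in miniature: add latency to a downstream call and check whether the caller's timeout actually bounds the wait. The sketch below is an assumption-laden stand-in, with `downstream_call` and `call_with_timeout` as hypothetical names and a `sleep` in place of real injected network delay.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def downstream_call(injected_delay_s: float) -> str:
    """Stand-in for a remote service; the sleep models injected network delay."""
    time.sleep(injected_delay_s)
    return "ok"


def call_with_timeout(injected_delay_s: float, timeout_s: float) -> str:
    """Return the downstream result, or 'timeout' if the caller's limit is hit."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(downstream_call, injected_delay_s)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        return "timeout"
    finally:
        # Do not block on the slow call; the caller has already moved on.
        pool.shutdown(wait=False)
```

A caller with no timeout, or one longer than its own upstream's limit, would simply hang for the full injected delay, which is the kind of misconfiguration such a drill is meant to surface.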

Using a chaos engineering platform has clear advantages over traditional testing methods. Compared with manual fault injection, automated fault injection on the platform is simpler and more convenient, requires fewer steps, and improves both efficiency and user experience. Experimental results are presented along clear statistical dimensions, which makes overall analysis easier. Managing experiments from a project management perspective brings them closer to business characteristics. The bank has also distilled an implementation methodology based on fault injection drills. A chaos testing guide describes in detail, for developers, the concepts, implementation methods, result analysis, defect fixing, and report writing involved. Defining the overall chaos testing process clarifies for developers the steps of requirements investigation, fault drill preparation and execution, result analysis and defect repair, and report summary, which helps chaos testing spread and develop.

The second phase of the construction practice of the chaos engineering platform

Building on the platform's 2022 functions, the second phase of the chaos engineering construction project was launched in mid-2023 and adds several core capabilities. It manages an open source library of multiple types of atomic faults and, beyond the existing Kubernetes and host fault types, supports fault injection for cross-platform operating systems, middleware, and the Windows operating system. It builds a strong and weak dependency capability: by connecting to the unified platform, it obtains the microservice architecture, detects and displays the dependencies between services, and verifies preset policies such as rate limiting, circuit breaking, and degradation. It builds a health check package capability: experience from experiments is turned into fault risk checkpoints, fault risk check packages are designed for different technology stacks and component combinations, and the experience base for each scenario is integrated and orchestrated into check packages of the same type. By integrating with the test management platform, a unified test platform is formed that manages traditional tests and chaos tests together. Chaos engineering drill capability has improved greatly as a result.
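
The strong and weak dependency idea above can be illustrated with a small sketch: fail one dependency at a time and record whether the calling service still completes its request. Everything here is hypothetical, including the `classify_dependencies` helper and the example service. A dependency is treated as "weak" if the request survives its failure (for example through degradation) and "strong" otherwise.

```python
from typing import Callable, Dict, Set


def classify_dependencies(
    dependencies: Set[str],
    handle_request: Callable[[Set[str]], bool],
) -> Dict[str, str]:
    """Fail one dependency at a time and record whether the request survives.

    `handle_request` receives the set of currently failed dependencies and
    returns True if the request still succeeds.
    """
    result: Dict[str, str] = {}
    for dep in sorted(dependencies):
        survived = handle_request({dep})
        result[dep] = "weak" if survived else "strong"
    return result


def place_order(failed: Set[str]) -> bool:
    # Hypothetical service: it cannot work without "account", but falls back
    # to a default when "recommendation" is unavailable (degradation).
    if "account" in failed:
        return False
    return True


report = classify_dependencies({"account", "recommendation"}, place_order)
```

If a dependency that was designed to be weak shows up as strong in such a report, the preset degradation or circuit breaking policy is not taking effect.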

Chaos engineering is combined with development, testing, operations and maintenance, and business functions to provide technical assurance for business continuity. Bank of Beijing has been named a vice chairman unit of the China Information and Communications Chaos Engineering Laboratory, its chaos engineering project has passed the Level 3 Gold Standard of Classified Protection, cloud-native maturity has improved significantly, and a high-availability capability matrix for microservices and distributed architecture has been formed.

Looking ahead, guided by the general principle of pursuing progress while maintaining stability, Bank of Beijing's information technology function will invest more in, and give stronger support to, stability work for financial technology innovation. It will continue to deepen the chaos engineering platform to handle increasingly complex experimental scenarios in the cloud-native application system, and will comprehensively advance stability testing and inspection of the platforms already built. For the scenario-based delivery of business value in new finance, this provides a more comprehensive and reliable stability verification system that safeguards the bank's digital transformation.

(This article was published in the second half of March 2024)
