
Technology Application | Using Chaos Engineering to Explore the Endogenous Stability of Cloud Systems

Author: Digitization of Finance

By Zhang Jing, Zhang Li, Lu Fei, Shi Peixuan, Information Technology Department, China International Capital Corporation Limited

With the rapid development of fintech, China International Capital Corporation Limited (CICC) has gradually migrated its business systems to the cloud. IT infrastructure and business systems grow more complex by the day, and unpredictable user behaviors and events interact with one another, placing higher demands on the reliability of system and application architecture. To this end, drawing on the principles of chaos engineering and the characteristics of the cloud environment, CICC has built Wukong, a chaos engineering platform for application resilience management. The platform is used to run chaos experiments in the cloud environment and to conduct routine fault drills. Problems found through the experiments are actively repaired, which steadily improves the system's immunity in the cloud environment and keeps cloud systems stable and secure.


Zhang Jing, Information Technology Department, China International Capital Corporation Limited

New dynamics: the uncertainty brought about by the "digital finance era"

In the era of Internet finance, financial products and service models continue to innovate, and transaction volumes have increased significantly. To achieve the goal of digital transformation, the industry has widely adopted new technologies such as cloud computing and distributed computing, building distributed architectures and operation and maintenance (O&M) systems to support the rapid development of financial business. This brings three kinds of pressure.

1. Stability requirements for trading systems have risen. To keep customer satisfaction at an acceptable level, the O&M team must strictly comply with regulatory requirements, make every effort to monitor the running status of business applications, and respond to abnormal events and service interruptions.

2. Requirements for information technology capability have risen. The innovative development of digital technologies such as artificial intelligence, blockchain, cloud computing, and big data has introduced new industrial elements, service formats, and business models to the securities industry. This broadens the industry's business boundaries and puts new pressure on the O&M team. The team needs to keep learning new knowledge and mastering new skills, and must carry out verification, upgrade, and migration operations in unfamiliar scenarios without making operational errors that cause exceptions in externally facing services.

3. Industry regulatory requirements have been raised. China's securities industry is currently undergoing digital transformation and faces an important period of opportunity. Securities companies need to keep providing customers with diversified and differentiated financial products to cope with intensifying competition within the industry. The industry is using digital technology to reform the whole process of investment, trading, and risk control, and the focus of IT investment has shifted to the in-depth application of modern technology in investment, consulting, products, compliance, risk control, and credit. Comprehensive digitalization of business means that more application systems must be deployed, more complex data logic must be established, and higher standards of system security and stability must be maintained. The O&M team must therefore adopt a new O&M mindset, invest more in IT, and build new O&M capabilities to meet the challenges of digital transformation.

Wukong, the chaos engineering platform for application resilience management built by CICC, has achieved endogenous stability of cloud systems. It helps address several problems: root causes of production events that are hard to locate, insufficient coverage of non-functional tests, unpredictable risks in business system silos, emergency capabilities that are hard to verify, and uncertainty about the availability of the global business network. The Wukong platform is also significant for accelerating the transformation to IT application innovation.

New direction: building the "Wukong" chaos engineering platform for application resilience management

CICC's application resilience management chaos engineering platform uses containerized packaging and container-based reuse of code and components, which raises the overall level of development and simplifies platform maintenance. The platform is divided into multiple system services (microservices), including the data side, server side, NoSQL side, agent side, configuration side, application side, collection side, authentication side, registration side, gateway side, and client side.

CICC's chaos engineering work draws on two standards issued by the China Academy of Information and Communications Technology, the Guidelines for the Stability Construction of Distributed Systems and the Hierarchical Capability Requirements for Chaos Engineering Platforms, and follows the requirements of the Three-Year Improvement Plan for Network and Information Security of Securities Companies (2023-2025). The application resilience management chaos engineering platform built by CICC is based on a four-layer design, as shown in Figure 1.


Fig.1 Architecture of the Wukong application resilience management chaos engineering platform

Basic service layer: Supported by the infrastructure and hardware resources of the IDCs and Sky Cloud, a set of common business capabilities is extracted to form the architectural base of the chaos engineering management platform. It provides general capabilities such as unified authentication, multi-tenancy, authorization, and security audit, along with a series of service centers.

Core service layer: With fault injection as its core function, this layer includes atomic fault injection capabilities, experiment scenario management, experiment probe management, and experiment resource management. The platform supports the coexistence of multiple architectures and is compatible with virtual machines, containers, physical machines, and autonomous and controllable servers. Faults are packaged in a unified way, so that experiment designers can focus on the content of the fault and ignore differences in the underlying environment.
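
The idea of unified packaging across different targets can be illustrated with a minimal Python sketch. Everything in it is hypothetical and is not taken from the Wukong platform: the class names, the use of SSH for hosts and `docker exec` for containers, the address, the container name, and the sample network-delay fault are all assumptions made for illustration. The sketch only builds command lines as a dry run and executes nothing.

```python
from abc import ABC, abstractmethod


class Target(ABC):
    """A place where a fault can be injected (hypothetical abstraction)."""

    @abstractmethod
    def wrap(self, command: list) -> list:
        """Return the full command line that runs `command` on this target."""


class HostTarget(Target):
    # Assumption: virtual and physical machines are both reached over SSH.
    def __init__(self, address: str):
        self.address = address

    def wrap(self, command):
        return ["ssh", self.address, *command]


class ContainerTarget(Target):
    # Assumption: containers are reached with `docker exec`.
    def __init__(self, container: str):
        self.container = container

    def wrap(self, command):
        return ["docker", "exec", self.container, *command]


def plan_injection(fault_command, targets):
    """Dry run: build one command line per target without executing anything."""
    return [t.wrap(fault_command) for t in targets]


# Illustrative fault: add 100 ms of network delay with Linux tc/netem.
delay = ["tc", "qdisc", "add", "dev", "eth0", "root", "netem", "delay", "100ms"]
plans = plan_injection(delay, [HostTarget("10.0.0.5"), ContainerTarget("order-service")])
for plan in plans:
    print(" ".join(plan))
```

The fault definition (`delay`) is written once, and each target type decides how to deliver it, which is one way to let experiment designers ignore the underlying differences.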

Experiment management layer: This layer covers experiment management, experiment observation, and traffic management. It supports the large-scale, standardized rollout of chaos engineering and increases the value of the platform. Experiment management handles experiment plans, environments, and results. Experiment observation visualizes the whole process of a chaos experiment. Traffic management supports traffic injection for chaos engineering in various environments.

Third-party integration layer: The platform provides a variety of standard open interfaces and is planned to connect with internal systems that are currently used mainly in the O&M system, such as the CMDB, the Sky Cloud management platform, the monitoring platform, and CI/CD.

New Exploration: Realizing the Endogenous Stability of Cloud Systems through Chaos Engineering

CICC took Sky Cloud as the subject of its chaos engineering experiments and conducted them along two dimensions: the vertical (silo) structure of applications and the application life cycle. As shown in Figure 2, CICC Sky Cloud is a globally deployed cloud platform that manages domestic and overseas infrastructure resources, mainly computing power, storage, network, database, and middleware. It provides data centers in Beijing, Shanghai, Shenzhen, Hong Kong, and other locations with the full-stack infrastructure capabilities required for migrating business to the cloud.


Fig.2 Architecture diagram of CICC Sky Cloud

By building Sky Cloud, CICC has streamlined the provisioning of cloud resources, ensured T+0 delivery of resources through automated construction, and greatly shortened the time needed to launch both cloud-native and traditional projects. Sky Cloud provides infrastructure resources for the business systems of the company's departments, supporting their agile delivery and stable operation.

1. Practical scenario: comprehensive evaluation of the vertical structure of the business system. For a business system to provide stable external services, it must rely on stable hosts and network equipment, host operating systems, the Sky Cloud base, cloud resource pools, instances, and cloud services. Applications migrated to the cloud have higher stability requirements, and an abnormality in any component introduces hidden risks to the stable operation of the business system and may have a significant impact.

Using the Wukong platform, CICC started from the business system and analyzed its vertical structure layer by layer, evaluating the ARM and x86 device layer, the operating system layer, the cloud base and cloud resource layer, the cloud service instance layer, and the cloud service combination layer. As shown in the business system silo evaluation module in Figure 3-a, the physical and technical architecture of the business system is analyzed across the whole process: experiment design, steady-state index definition, experiment environment preparation, experiment execution, and experiment report confirmation. This verifies the stability of the business system's vertical structure while keeping experiment details under strict control and ensuring the quality of the experiment.
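
The experiment process described above (define a steady-state index, inject a fault, confirm the result) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Wukong implementation: the `SteadyState` structure, the `run_experiment` function, the latency index, and the thresholds are all hypothetical, and the "system" is a simulated dictionary.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SteadyState:
    """A steady-state index with an acceptable range (hypothetical structure)."""
    name: str
    probe: Callable[[], float]  # returns the current value of the index
    low: float
    high: float

    def holds(self) -> bool:
        return self.low <= self.probe() <= self.high


def run_experiment(steady, inject, recover):
    """Check the steady state, inject a fault, re-check, and always recover."""
    report = {"index": steady.name, "before": steady.holds()}
    if not report["before"]:
        # Do not inject into a system that is already unhealthy.
        report["verdict"] = "aborted"
        return report
    try:
        inject()
        report["during"] = steady.holds()
    finally:
        recover()  # runs even if the injection itself raises
    report["after"] = steady.holds()
    ok = report["during"] and report["after"]
    report["verdict"] = "passed" if ok else "weakness found"
    return report


# Simulated system: the fault pushes latency outside the accepted range.
state = {"latency_ms": 20.0}
steady = SteadyState("latency (ms)", lambda: state["latency_ms"], 0.0, 100.0)
report = run_experiment(
    steady,
    inject=lambda: state.update(latency_ms=250.0),
    recover=lambda: state.update(latency_ms=20.0),
)
print(report)
```

In this simulated run the steady state holds before and after the experiment but not during the fault, so the report flags a weakness for follow-up.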


Fig.3 Diagram of chaos experiment practice

2. Practical scenario: emergency drills with real events to test emergency response capability. Following the fault scenarios preset in the emergency drill plan, the chaos engineering platform automatically injects real faults during the drill to test the detection capability of the monitoring system. The emergency team responds quickly and completes the response according to the emergency plan, which tests both the team's handling capability and the usability of the plan, as shown in the emergency response capability module in Figure 3-b.

Each business platform can customize its drill plan and use the chaos engineering platform to run planned and unplanned drills. These drills test the emergency response and problem discovery capabilities of the development, testing, and O&M teams, verify the overall emergency plan and the coverage of monitoring, and verify the emergency response process of each team. In addition, using the scenario design of the chaos engineering platform, emergency drills can be scheduled on a regular basis so that the drill cycle runs automatically. The platform records the running status of applications during the drill and generates a summary report, forming a positive feedback loop.

3. Practical scenario: pre-launch quality inspection of the business system. From the perspective of business requirements, a service rollout based on the CI/CD pipeline is divided into two parts: development and inspection. Developers follow CICC's development specifications to design and develop the application. Inspectors use both traditional inspection and chaos engineering quality inspection to design and verify the functional and non-functional aspects of the application, after which the application is ready for launch.

Non-functional inspection based on chaos engineering designs chaos experiments according to the deployment architecture of the system, as shown in the pre-launch quality inspection module in Figure 3-c. A chaos engineering quality inspection node is added to the CI/CD pipeline before a business system is launched. The chaos engineering platform detects pending CI/CD tasks and, for each component of the system to be launched, determines the failure factors and the items to check. It then completes non-functional tests such as high availability, elasticity, and self-healing, evaluates the application, issues a stability report, flags risk points, and proposes optimizations, thereby completing the pre-launch quality inspection.
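
A quality inspection node of this kind ultimately has to turn per-component experiment results into a report and a go/no-go decision. The sketch below shows one way this could look in Python. The `CheckResult` structure, the report fields, the component names, and the rule that any failed check blocks the release are all assumptions for illustration, not details from the source.

```python
from dataclasses import dataclass


@dataclass
class CheckResult:
    component: str        # component of the system to be launched
    check: str            # e.g. "high availability", "elasticity", "self-healing"
    passed: bool
    suggestion: str = ""  # optimization suggestion when the check fails


def stability_report(results):
    """Summarize non-functional checks into a report the pipeline can act on."""
    risks = [r for r in results if not r.passed]
    return {
        "total": len(results),
        "failed": len(risks),
        "risk_points": [f"{r.component}: {r.check}" for r in risks],
        "suggestions": [r.suggestion for r in risks if r.suggestion],
        # Assumed policy: any failed check blocks the release.
        "release_allowed": not risks,
    }


results = [
    CheckResult("gateway", "high availability", True),
    CheckResult("order-service", "self-healing", False,
                "add a liveness check so failed instances restart"),
    CheckResult("cache", "elasticity", True),
]
report = stability_report(results)
print(report)
```

A real gate might weight checks by severity or allow waivers; the all-or-nothing policy here is simply the easiest one to reason about.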

4. Practical scenario: fault reproduction to help pin down the root cause and optimize the emergency plan. When a real production fault occurs, the O&M team's first goal is to restore the business quickly. Handling a production event with rapid recovery as the goal, however, often leaves the root cause hard to pin down. By connecting the chaos engineering platform to the monitoring system and the automated O&M system, root cause analysis can be performed after fault handling is complete, as shown in the root cause analysis module in Figure 3-d.

Root cause analysis uses the fault injection capability of the chaos experiment platform as its foundation: reproducing a production fault in the test environment creates the conditions for locating its root cause. The O&M team submits a problem ticket, designs a chaos experiment according to the observed symptoms, keeps the blast radius to a minimum, and repeats the experiment to gradually narrow the scope of the problem until it is confirmed. The team then designs a repair and hardening plan and runs the chaos experiment again to confirm that the problem is solved. This addresses the root cause, allows similar problems in the production environment to be verified quickly in the future, and improves the stability of the application.

New Lessons: Experience gained from the practice of chaos engineering

Through a variety of scenario-based experiments on the chaos engineering platform, CICC's chaos engineering project team has accumulated extensive experience.

Strengthening the team's confidence in the stability of IT application innovation: Chaos experiments are defined through FMEA (Failure Mode and Effects Analysis), which assesses how damaging each failure is along the dimensions of severity and frequency. Following the FMEA recommendations, the stability of business systems on hybrid ARM and x86 architectures is then verified.
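
As a rough illustration of ranking failure modes by severity and frequency, the following Python sketch multiplies the two ratings and sorts the result. The 1 to 10 scales, the multiplication rule, and the sample failure modes are assumptions for illustration; the source does not describe how CICC scores failures, and textbook FMEA commonly adds a third rating for detection.

```python
from typing import NamedTuple


class FailureMode(NamedTuple):
    name: str
    severity: int   # 1 (negligible) to 10 (catastrophic), assumed scale
    frequency: int  # 1 (rare) to 10 (frequent), assumed scale


def rank_by_criticality(modes):
    """Order failure modes by severity x frequency, highest first."""
    return sorted(modes, key=lambda m: m.severity * m.frequency, reverse=True)


# Hypothetical failure modes with made-up ratings.
modes = [
    FailureMode("node power loss", severity=9, frequency=2),
    FailureMode("network packet loss", severity=5, frequency=6),
    FailureMode("disk full", severity=6, frequency=4),
]
ranked = rank_by_criticality(modes)
for m in ranked:
    print(m.name, m.severity * m.frequency)
```

The highest-ranked failure modes are the natural first candidates for chaos experiments.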

Assisting root cause location through production fault reproduction: Using the cosine similarity algorithm, the metric curve recorded during the original fault is compared with the metric curve recorded during reproduction to determine how closely the reproduction experiment matches the fault in the production environment. The blast radius of the reproduction experiment is limited so that the comparison is not disturbed by other factors. Together these steps assist in locating the root cause of production faults.
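
Cosine similarity itself is simple to compute. The sketch below treats each metric curve as a list of values sampled at the same time offsets, which is an assumption: the source does not say how the curves are turned into vectors. The sample series are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length metric series."""
    if len(a) != len(b):
        raise ValueError("series must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero series has no shape to compare
    return dot / (norm_a * norm_b)


# Hypothetical latency samples (ms) at the same five time offsets.
production = [20, 22, 180, 175, 30]   # original fault
reproduced = [21, 20, 170, 180, 28]   # reproduction with a similar spike
unrelated = [100, 20, 25, 22, 110]    # an experiment with a different shape

print(cosine_similarity(production, reproduced))
print(cosine_similarity(production, unrelated))
```

A value close to 1 means the two curves have nearly the same shape. Because cosine similarity ignores overall magnitude, a reproduction at smaller scale in the test environment can still score highly against the production curve.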

Verifying emergency response capability under realistic conditions: The fault injection capability of chaos engineering is used to put the system into an abnormal state. This verifies whether the monitoring and alerting platform collects faults effectively and accurately and produces a real alert, and it tests the response capability of the emergency team and the effectiveness of the plan. The result is more efficient emergency work and earlier discovery of hidden weaknesses in it.

Pre-launch quality gating to improve system availability: The platform is integrated into the CI/CD pipeline, and quality inspection nodes are added in the testing and production stages. The fault scenarios required to inspect the drill targets are executed automatically in batches, quality risks are identified before they reach production, and incidents caused by non-functional problems after launch are effectively reduced.

Verifying the availability of the business system's silo structure: The availability of a business system is a composite of the availability of every layer from physical devices to applications, and the failure of a component at any layer can have an unpredictable impact on it. Through a comprehensive evaluation of the silo structure, the degree and scope of the impact of a failure on the business system change from unknown to known.
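
One common way to express a composite availability is the serial model: if every layer must be up for the business system to work and layers fail independently, the overall availability is the product of the layer availabilities. The source does not state which model CICC uses, so the formula is an assumption, and the layer figures below are made up for illustration.

```python
from math import prod


def serial_availability(layer_availabilities):
    """Availability of a stack in which every layer must be up at once."""
    return prod(layer_availabilities)


# Hypothetical per-layer availabilities, from devices up to the application.
layers = {
    "device": 0.9999,
    "operating system": 0.9995,
    "cloud base and resources": 0.999,
    "cloud service instances": 0.999,
    "application": 0.9995,
}
overall = serial_availability(layers.values())
print(f"overall availability: {overall:.4%}")
```

Under this model the stack is always less available than its weakest layer, which is why a fault at any single layer deserves its own experiment.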

Verifying global service availability: The availability of global services depends mainly on the network interconnection between multiple IDCs around the world, so network availability is a key factor in both IDC availability and service availability. By testing for risks in network high availability, latency, and transient disconnection, configurations are adjusted and repaired to improve network availability.

Building on this practical work, CICC's chaos engineering project won the second "Excellent Case of Stable and Safe Operation of Cloud System" award from the Stability Assurance Laboratory of the China Academy of Information and Communications Technology, providing a valuable reference for the application of chaos engineering in the industry.

Looking Ahead: Expanding the Scope of Chaos Engineering Practice

Application system architecture is currently shifting from monolithic applications to cloud-native applications, and traditional O&M must cope with two different system architectures at once, which calls for new approaches to an increasingly heavy workload. Continuously testing the effectiveness of stability assurance measures through the chaos engineering platform is a feasible method. It allows most threats and risks to be resolved by the system's own capabilities, which reduces maintenance pressure during operation, improves the system's ability to deliver service continuously, and keeps business activities running without interruption.

At present, the system analysis and scenario design that precede a chaos experiment demand a high level of professional knowledge, and the experiments require the participation of chaos experiment specialists. In the future, CICC's chaos engineering team plans to provide built-in experiment scenario templates for different components and services, and to offer health check packages to application operators, so that chaos engineering can be promoted and used at scale.

(This article was published in the first half of March 2024)
