By Xiao Gang, Chief Information Officer of China Securities Co., Ltd

Information Technology Department of China Securities Co., Ltd., Xu Zhibin, Cao Shuangshuang, Li Baoqiang, Li Xin

As digital transformation deepens, information systems in the securities industry have gradually moved to microservice architectures. Dependencies between services have grown, business links have multiplied, and call relationships and data flows have become more complex, so systems face more unforeseen and uncertain risks. To improve system stability and resilience, keep information system services available, and test whether emergency response plans are complete and effective, CSC Securities has adopted new ideas and technologies, begun building a chaos engineering system, and progressively carried out chaos experiments and stability verification on its business systems. This article briefly introduces the implementation plan, practice path, and lessons learned from chaos engineering at China Securities, for reference.

Chaos engineering is the practice of designing and running experiments that deliberately inject abnormal software or hardware states or disturbances into a system. These experiments help uncover weak links that could cause service anomalies, and the resilience of the information system improves as the problems found in those links are proactively addressed. In recent years the technique has attracted considerable attention from securities institutions as a way to safeguard the stability of the industry.

In 2021, China Securities officially launched its chaos engineering program. Acknowledging that the system carries "dark debt", we follow safety guidelines while running chaos experiments that deliberately do not "play by the usual rules". We actively embrace faults, integrate chaos experiments with the operation and maintenance management system and the operation tool system, and apply closed-loop management to the whole process, from risk identification to emergency effectiveness management. The ultimate aim, from a "prevention first" view of quality, is to improve the reliability of business systems and ensure their safe and stable operation.

Chaos Engineering Construction

Chaos engineering at CSC Securities does not replace or overturn the traditional R&D and operation and maintenance system. It is a systematic effort to upgrade and extend certain links and concepts within the overall operation and maintenance management system, in compliance with the ISO 20000 standard and in line with the digital transformation of operations management and the move toward intelligent operations. Based on the company's actual situation, construction is divided into three stages: the first stage mainly improves the stability and resilience of the information system and addresses its robustness; the third stage combines big data analysis and artificial intelligence to identify fault scenarios intelligently, establish characteristic values for scenarios, and make both stability assurance and emergency response intelligent and automatic. Construction is currently in the second stage, and the third stage is being planned. The details are as follows.

1. Chaos engineering implementation plan

The dynamic safety model proposed by Jens Rasmussen is well known and highly regarded in the field of resilience engineering. The chaos engineering practice of CSC Securities likewise weighs safety, economy, and effectiveness. It combines the fault drill platform with the existing operation and maintenance support tools and, through whole-process management, builds a chaos engineering implementation plan suited to the company (as shown in Figure 1).

Chaos engineering escorts the steady development of the securities industry

Figure 1 Chaos engineering scheme

Fault drill platform: The platform takes the business link as its core. Starting from the end-to-end processing flow of a business, it extends the technical risk control strategy from local control to overall control and probes for defects across the information system life cycle during business processing, achieving closed-loop management of technical risk. The module also provides task orchestration and scheduling, scenario library management, disturbance injection, and user-facing display functions.

Analysis module: This module mainly performs log analysis, monitoring metric analysis, root cause analysis, and scenario identification. Using the logs and monitoring alarm data generated during a chaos drill, it assesses the blast radius and the degree of service impact, and judges whether monitoring alarms are accurate and comprehensive. On that basis, fault features are extracted, the problem is located, and existing emergency scenarios are matched.

Emergency management: This module handles faults so that the system is restored to availability quickly, improving its stability. By preparing response measures and processes in advance, it enables teams to respond efficiently to a wide range of contingencies. It covers plan management, plan execution, emergency coordination, and related functions. Its goal is to handle and recover from a fault quickly according to the existing plan, reducing losses and keeping the system operating normally.

Report management: After an experiment or drill, this module uses a template to record the changes in metrics, the stress test results, and the response actions taken during the process, and generates the drill report. Reports can be viewed and downloaded subject to permission control. They support optimization of the system and the emergency plan and help the next chaos experiment proceed in an orderly way.

Stability improvements: Continuous improvement of technical specifications is an extension of chaos engineering practice. When system managers find stability defects through chaos engineering, they identify the underlying deficiencies in architecture design, coding standards, and testing and evaluation, build up resilience improvement specifications on the development side, and shift the management of resilience improvement earlier in the life cycle.

Process management: This module controls every step of the chaos experiment life cycle, from start to finish, and gives users a unified operation channel.
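As a small illustration of the alarm assessment described for the analysis module, the sketch below compares the services that raised alarms during a drill window with the services the experiment was expected to affect. It is only a toy model under stated assumptions: the function name `assess_drill`, the `(timestamp, service)` alarm format, and the result keys are hypothetical and do not describe the actual platform.

```python
from typing import Dict, Iterable, List, Set, Tuple


def assess_drill(
    alarms: Iterable[Tuple[float, str]],
    start: float,
    end: float,
    expected: Set[str],
) -> Dict[str, List[str]]:
    """Compare services that alarmed during the drill window with the expected impact.

    alarms   -- hypothetical (timestamp, service) records from monitoring
    start    -- timestamp at which the fault was injected
    end      -- timestamp at which the fault was removed
    expected -- services the experiment plan expected to be affected
    """
    # Keep only alarms raised while the disturbance was active.
    alarmed = {service for ts, service in alarms if start <= ts <= end}
    return {
        # Services that visibly reacted: an approximation of the blast radius.
        "observed_impact": sorted(alarmed),
        # Expected to be affected but silent: possible monitoring coverage gaps.
        "missed_by_monitoring": sorted(expected - alarmed),
        # Alarmed but not expected: wider impact than planned, or noisy alerts.
        "outside_expected_radius": sorted(alarmed - expected),
    }


if __name__ == "__main__":
    sample = [(5, "db"), (12, "order"), (15, "quote"), (40, "order")]
    print(assess_drill(sample, start=10, end=30, expected={"order", "gateway"}))
```

A non-empty `missed_by_monitoring` list points at alarm coverage to improve, while `outside_expected_radius` suggests the experiment reached further than planned and deserves a closer look before the next run.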

2. Practice process

Chaos engineering is a continuous process. CSC Securities combines chaos experiments with emergency drills and refines the chaos experiment management process through reviews of production cybersecurity incidents. The chaos experiment is proactive and exploratory: it runs system anomaly experiments across the board and produces validated fault scenarios, emergency plans, and suggestions for system optimization. The emergency drill uses the disturbance injection capability of chaos engineering to rehearse selected real fault scenarios, verifying the reliability of monitoring, the effectiveness of the emergency plan, and the ability of personnel to respond and coordinate. Production emergency response is the handling of real production events. Building on the first two steps, it raises confidence in emergency response, resolves production failures quickly and effectively, and feeds improvements back into the chaos experiment and emergency drill processes, forming a closed loop of continuous improvement (as shown in Figure 2).

Figure 2 Practice flow chart

Define clear experimental objectives: Before conducting chaos engineering practice, the risk points are sorted out through the risk investigation model, and the objectives and expected results of the experiment are clearly defined. This helps the team better measure the effectiveness and value of the practice.

Moderate and gradual practice: Chaos engineering practices should be carried out within modest and controlled limits. Start with a single-system component failure or a small blast radius, and gradually scale up the experiment. This helps to reduce risk and ensures the stability and availability of the system.

Risk assessment and control: Before faults and disturbances are introduced, an adequate risk assessment is required and appropriate isolation and control measures must be in place, to reduce potential negative impacts and ensure the reliability and resilience of the system. The risk assessment should cover the likely business impact and include a backup and recovery strategy.

Data collection and analysis: Chaos engineering practice requires collecting and analyzing field data, including metrics during the experiment, monitoring and configuration information, system responses, and emergency response efficiency. Data analysis gives a better understanding of how effective the practice is and identifies potential improvements for each weak link in system operation and maintenance.

Summary and review: Chaos engineering practice is a process of team learning, review, and optimization. Regular reviews of the chaos experiment process and of cybersecurity incidents, together with incident issue tracking, help identify deeper problems and suggest improvements. Through mutual support and experience sharing, team members come to understand and respond better to faults and uncertainties in the system, empower other lines of work, and improve system robustness faster.
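The "moderate and gradual" and "risk assessment and control" principles above can be illustrated with a short sketch: stages run in order of increasing blast radius, each disturbance is always rolled back, and the run stops as soon as the steady-state check fails. All names here (`Stage`, `run_gradually`, and the `inject`, `rollback`, and `steady_state_ok` callbacks) are hypothetical placeholders for whatever the fault drill platform and monitoring system actually provide.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Stage:
    """One step of an experiment; blast_radius is the fraction of instances disturbed."""
    name: str
    blast_radius: float


def run_gradually(
    stages: List[Stage],
    inject: Callable[[Stage], None],
    rollback: Callable[[Stage], None],
    steady_state_ok: Callable[[], bool],
) -> List[str]:
    """Run stages from smallest to largest blast radius; stop at the first violation.

    Returns the names of the stages the system tolerated.
    """
    completed: List[str] = []
    if not steady_state_ok():
        return completed  # the system is already unhealthy, so do not start
    for stage in sorted(stages, key=lambda s: s.blast_radius):
        inject(stage)
        try:
            healthy = steady_state_ok()
        finally:
            rollback(stage)  # always remove the disturbance, even on error
        if not healthy:
            break  # do not widen the experiment past a failing stage
        completed.append(stage.name)
    return completed


if __name__ == "__main__":
    # Toy system: it stays healthy while at most 20% of instances are disturbed.
    state = {"disturbed": 0.0}
    plan = [Stage("half-cluster", 0.5), Stage("single-node", 0.01), Stage("one-tenth", 0.1)]
    print(run_gradually(
        plan,
        inject=lambda s: state.update(disturbed=s.blast_radius),
        rollback=lambda s: state.update(disturbed=0.0),
        steady_state_ok=lambda: state["disturbed"] <= 0.2,
    ))
```

In this toy run the two smaller stages pass and the experiment stops at the half-cluster stage, which is the behavior the gradual approach is meant to guarantee.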

Practical features

1. Independent and controllable innovation practice

From the start of the design, the chaos engineering practice of CSC Securities has followed an independent and controllable route, insisting on in-house development of key links to ensure the safe, stable, and reliable operation of the system.

Autonomy of the implementation scheme: The scheme and process are designed around the company's existing environment and tool system to achieve the chaos engineering goals of "hierarchical decoupling, compatibility and unification, safety and efficiency", ultimately making the platform modular and its services pluggable.

Independent mastery of core technologies: Open source components are combined with in-house development to implement fault injection, operation and maintenance analysis, automatic emergency response, and operation and maintenance process management, ensuring that key links and key modules meet the requirements of the operation management system.

2. Integration with the overall operation and maintenance system

The chaos engineering practice clarifies the functional responsibilities of each link and upgrades the functions of the tool system. It connects the fault injection platform with the existing stress testing system, monitoring and analysis, asset configuration, automation tools, operation and maintenance management platform, and other systems, so that the whole life cycle from chaos experiment to system optimization is driven by process control. As a result, the fault drill platform is no longer an isolated fault injection tool: it is integrated into the overall operation and maintenance system and its service capabilities are extended.

3. Automation of the whole process

Running large-scale experiments and drills by manually entering scenarios and analyzing their impact is labor-intensive and inefficient, and makes historical data hard to trace. By combining the risk assessment model, asset configuration, and fault injection capabilities, the fault drill platform automatically generates experiment scenarios, schedules experiment tasks, injects faults, and notifies the monitoring and analysis system. That system then completes scenario feature collection and root cause analysis, triggers automated emergency response, and observes and analyzes the results, so that the whole chaos engineering process is handled automatically. This automation greatly reduces the manual effort that chaos experiments require, increases experiment capacity, secures the value gained from experiments, and accelerates improvement in information system stability.
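The scenario generation step described above can be sketched as a simple combination of three inputs: assets scored by a risk model, the layer each asset belongs to, and the fault types that can be injected at each layer. This is a minimal illustration only; the `Asset` and `Scenario` types, the `risk_score` field, the fault names, and the highest-score-first ordering are all assumptions, not details of the actual platform.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Asset:
    name: str
    layer: str        # e.g. "infrastructure", "component", "service-link"
    risk_score: int   # hypothetical score produced by a risk assessment model


@dataclass(frozen=True)
class Scenario:
    asset: str
    fault: str
    risk_score: int


def generate_scenarios(
    assets: Iterable[Asset],
    fault_capabilities: Dict[str, List[str]],
    min_score: int,
) -> List[Scenario]:
    """Pair each sufficiently risky asset with every fault supported at its layer."""
    scenarios = [
        Scenario(asset.name, fault, asset.risk_score)
        for asset in assets
        if asset.risk_score >= min_score
        for fault in fault_capabilities.get(asset.layer, [])
    ]
    # Assumed ordering: highest risk score first, then by name for stable output.
    return sorted(scenarios, key=lambda s: (-s.risk_score, s.asset, s.fault))


if __name__ == "__main__":
    assets = [
        Asset("gateway", "infrastructure", 3),
        Asset("order-svc", "component", 8),
        Asset("report-svc", "component", 1),
    ]
    capabilities = {
        "infrastructure": ["network-delay"],
        "component": ["process-kill", "cpu-load"],
    }
    for scenario in generate_scenarios(assets, capabilities, min_score=2):
        print(scenario)
```

The resulting list could then be handed to a scheduler; keeping generation separate from execution makes it easy to review or filter scenarios before any fault is injected.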

4. The openness of the tool system

An operation management tool system must adapt quickly as the technical solutions of the systems it serves evolve, and must keep meeting the needs of operation management. We have therefore reworked the capabilities of the fault drill platform so that, through a unified interface standard, its functions are exposed as interfaces and opened to the rest of the tool system. Fault disturbance capabilities and experiment process management are thus no longer tied to a fixed tool platform, and the needs of chaos experiments can be met quickly through simple integration and configuration.
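One way to picture such a unified interface standard is a small plug-in contract: every fault capability implements the same two operations and registers itself under a name, so callers in the tool system never depend on a specific injector. The sketch below is an assumption-laden illustration; `FaultInjector`, `register`, `run_fault`, and the `network-delay` example are hypothetical names, and the injector only returns strings instead of touching a real system.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, List, Type


class FaultInjector(ABC):
    """Hypothetical unified contract that every fault capability implements."""

    @abstractmethod
    def inject(self, target: str) -> str:
        """Apply the disturbance to the target and describe what was done."""

    @abstractmethod
    def recover(self, target: str) -> str:
        """Remove the disturbance from the target and describe what was done."""


# Registry mapping a fault name to the class that implements it.
REGISTRY: Dict[str, Type[FaultInjector]] = {}


def register(name: str) -> Callable[[Type[FaultInjector]], Type[FaultInjector]]:
    """Class decorator that plugs an injector into the registry under a name."""
    def decorator(cls: Type[FaultInjector]) -> Type[FaultInjector]:
        REGISTRY[name] = cls
        return cls
    return decorator


@register("network-delay")
class NetworkDelay(FaultInjector):
    # Placeholder behavior: a real injector would call the underlying tool here.
    def inject(self, target: str) -> str:
        return f"delay injected on {target}"

    def recover(self, target: str) -> str:
        return f"delay removed from {target}"


def run_fault(name: str, target: str) -> List[str]:
    """Look up an injector by name and run one inject/recover cycle."""
    injector = REGISTRY[name]()
    return [injector.inject(target), injector.recover(target)]


if __name__ == "__main__":
    print(run_fault("network-delay", "gateway"))
```

Adding a new fault type then means writing one class and registering it, with no change to the code that schedules or orchestrates experiments.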

Practical results

Since 2021, China Securities has conducted fault drills, starting with the peripheral channel system. Each application system undergoes chaos experiments at the infrastructure layer, the application component layer, and the service link layer to verify the robustness of the system, the sustainability of business service capabilities, and the reliability of fault recovery. To date, the practice has been gradually extended from the peripheral channel system and management information systems to the core trading system, and the risk scenarios covered have expanded from the infrastructure layer to the cross-system business link layer.

Through chaos engineering practice, many weak links in the systems have been found and repaired. This has helped the people involved appreciate the importance of "facing mistakes and embracing failures" and accept chaos experiments as a necessary means of improving system stability. A cross-line, top-down consensus has formed in the team culture, and culture and practical results now reinforce each other in a virtuous circle.

The construction and practice of chaos engineering have driven improvements in the operation management system and tool system and provided an effective lever for managing the availability and continuity of information systems. Learning through real operations and training under realistic conditions has improved the coverage and accuracy of monitoring and the effectiveness of emergency management, forming a complete closed-loop engineering practice system.

As of 2023, CSC Securities has passed the enhanced-level rating in the distributed system stability assessment of the China Academy of Information and Communications Technology, and has been selected as an excellent case of stable and safe operation of "one cloud, multiple cores", as a "Pioneer Practitioner of Chaos Engineering", and as an "Excellent Case of Chaos Engineering Practice of Stable and Safe Operation of Cloud System".

Future outlook

After three years of chaos engineering practice, CSC Securities has built fault experiment capabilities that span front end to back end and basic resources to business links. It will continue to explore and summarize its experience in order to provide further support and assurance for the stability of the company's information systems. The first is to strengthen cultural construction by shifting the chaos engineering culture toward system design, development, and testing, so that the chaos engineering concept runs through the whole life cycle of system development, testing, and operation, and so that chaos experiment results and specification requirements become a key basis for system construction and go-live evaluation, improving the robustness and stability of the system; the third is to promote the adoption of big data and artificial intelligence technology, using large models to analyze the characteristics of fault scenarios and combining them with root cause location to identify fault scenarios, trigger emergency response in a timely and accurate manner, and improve emergency efficiency.

Building a chaos engineering system is a process of continuous refinement and development. CSC Securities will continue to learn from the experience of other practitioners, actively participate in the formulation and promotion of relevant industry standards, and contribute to the development of chaos engineering in the securities industry.

(This article was published in the second half of March 2024)
