
Technology Application | Exploration and Practice of a Total Quality Assurance System in the Cloud-Native Era

Author: Digitization of Finance

With the rapid development of financial technology, the securities industry faces new opportunities and challenges. Cutting-edge technologies such as cloud computing and artificial intelligence have opened up new possibilities for innovation in the industry. However, information system security incidents remind us that as technology adoption deepens, guaranteeing system stability and security becomes especially important. Cloud-native technology has improved software delivery efficiency in the securities industry, but rapid iteration can leave systems in an unstable state, making product delivery quality difficult to guarantee. Building on the idea of total quality management, the key path to solving these problems is to establish a stable, highly available system architecture and to strengthen risk management and business continuity management through continuous testing, continuous security assurance, a chaos engineering system, and other methods, ensuring the stable operation of business systems.


Zhang Yongqi, Deputy General Manager of the Science and Technology R&D Department, Zhongtai Securities Co., Ltd.

Difficulties and challenges in quality management in the cloud-native era

The securities industry has developed for more than 30 years since 1986, and those three decades have also been three decades of the industry's digital development. In October 2021, at the Beijing Financial Street Forum, the "14th Five-Year Plan for the Development of Science and Technology in the Securities and Futures Industry", compiled by the Science and Technology Supervision Bureau of the China Securities Regulatory Commission together with relevant units, was officially released; it emphasizes two major themes, "promoting the digital transformation and development of the industry" and "making supervision smarter with data". Today, driven by both the needs of securities firms themselves and the requirements of regulators, the digital transformation of the securities industry has been elevated to an unprecedented height.

With the rapid development of cloud-native technology, our system architectures and R&D models have undergone major changes: infrastructure has gradually shifted from traditional minicomputers and x86 servers to cloud-native infrastructure centered on containers and Kubernetes; application architecture has evolved from monolithic applications to distributed microservice architectures; the operating model has developed from a single data center at a single site to multi-site, multi-center deployment; and the delivery model has moved from traditional project-based waterfall development to agile iterative development.

The transformation from traditional architectures to cloud-native faces a number of challenges: frequent iterations make system stability difficult to guarantee; securities business grows ever richer, so test scenarios become more complex and test costs keep rising; the widespread adoption of distributed microservice architectures increases the complexity of service governance across system groups; online failures are hard to locate and slow to recover from; and hybrid cloud architectures complicate the connectivity and release management of development, testing, and production environments. In the face of these challenges, building a total quality management assurance system for the R&D process becomes crucial.

Quality assurance in the cloud-native era is no longer the responsibility of a single role on the fintech team; it is cross-team work that requires collaboration among R&D, testing, O&M, and even business personnel outside the R&D organization, as shown in Figure 1. To this end, we set out to develop a "total quality assurance" solution that runs through the entire software lifecycle.


Figure 1 Quality assurance management responsibilities in the cloud-native era

The total quality assurance technology system is built on cloud-native technology and runs through the entire software lifecycle. On top of a baseline of cloud-native stability, an all-round risk prevention and discovery mechanism finds and resolves potential risks and hidden dangers in advance, comprehensively improving the quality of the system's business support.

For the cloud-native stability baseline, infrastructure is managed through a CMDB, an observability system is built across the IaaS, PaaS, and SaaS layers, and the combination of a DevOps platform and automated operations enables fast, efficient, and stable automated releases, fault self-healing, and similar operations. In addition, the organization's internal R&D process is standardized and integrated with the pipeline to ensure that process specifications are actually enforced; the pipeline can be adjusted and optimized for specific projects and organizational circumstances so that it meets the real R&D management needs of the project and the team. For risk prevention and discovery, the idea of total quality management runs through the entire software lifecycle: by integrating tools, processes, and platforms at different stages, potential risks in the product delivery process are discovered by automated means, as shown in Figure 2. This article mainly discusses the prevention and discovery of quality risks across the software lifecycle from the perspectives of test management, security management, chaos engineering, and contingency plan management.


Figure 2 Total Quality Assurance System (TQS) solution

In addition, to ensure smooth implementation of the overall solution, we formulated several working principles: minimize the investment needed to solve each problem, grasp the core contradictions, and avoid over-design; prioritize empowering technical staff first, then serve R&D management; automate whatever a machine can do, reducing personnel costs through automation; combine process specifications with pipeline design rather than relying on written documents alone; and shift work nodes "left" as much as possible, moving whatever can be handled on the R&D side earlier in the process to reduce the cost of fixing problems.

Key solutions for total quality management

As a comprehensive, end-to-end solution, total quality management builds on traditional stability assurance; its core capabilities include continuous test management, continuous security management, and chaos engineering.

1. Continuous test management

Continuous test management is a core part of the total quality management system. Based on the DevOps pipelines of specific projects, it aligns the cooperation of all roles, promotes quality management practices in every link, and takes quality as the starting point, instilling quality awareness in everyone and genuinely practicing "everyone is responsible for quality", so that testing and quality-related work are valued and acted on by every discipline and role on the front line, as shown in Figure 3.


Figure 3 Continuous testing throughout the software delivery process

First, continuous test management must connect the process end to end, from the demand side to the operations side, so that work flows through the DevOps platform and delivers higher value, forming a closed experimental delivery loop from the proposal of a requirement feature to its verification against operational data and user feedback, allowing plans and requirements to be adjusted based on actual user feedback.

Second, we achieved test agility: feedback and measurement in the small loops (the testing work at each stage) drive rapid delivery in the large loop (each iteration), establishing better quality assurance. In the development stage, unit testing is incorporated into the pipeline as a quality gate before testing begins; in the integration stage, the core interface and UI test cases are executed automatically from the pipeline, performance testing supports multiple stress-testing strategies, and problems are found and managed against established performance baselines; in the acceptance and release stages, the focus is on following up problems found in business verification and continuously monitoring online operation. On this basis, combining unit testing, integration testing, interface testing, and manual exploratory testing, a relatively complete testing system has been established that leaves repetitive work to machines and uses automated testing wherever possible to improve efficiency and reduce testing time and tester costs. During agile delivery, automated regression testing is applied: the interfaces and UIs of important core functions are automatically regression-tested in each bi-weekly iteration to ensure the continuity of their business services.
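
As a concrete illustration of the unit-test quality gate described above, the minimal sketch below fails a pipeline stage when line coverage drops below a threshold; the Cobertura-style coverage.xml file and the 80% threshold are illustrative assumptions, not the actual gate configuration.

```python
# quality_gate.py -- minimal sketch of a unit-test coverage gate for a pipeline stage.
# Assumes a Cobertura-style coverage.xml from the test step; the file name and the
# 80% threshold are illustrative, not the actual gate configuration.
import sys
import xml.etree.ElementTree as ET

THRESHOLD = 0.80                           # hypothetical gate value

def main(report_path: str = "coverage.xml") -> None:
    root = ET.parse(report_path).getroot()
    line_rate = float(root.get("line-rate", "0"))
    print(f"line coverage: {line_rate:.1%} (gate: {THRESHOLD:.0%})")
    if line_rate < THRESHOLD:
        sys.exit(1)                        # non-zero exit fails the pipeline stage

if __name__ == "__main__":
    main(*sys.argv[1:])
```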

The key capability practices for continuous testing agility are as follows.

(1) Precision testing. Precision testing is based on the correlation between business function points (test cases) and business code; it obtains function-point coverage and enables precise test coverage, precise test-case recommendation, and precise localization of test defects. By sensing code changes in real time, precision testing accurately determines the scope of each code change and its impact, providing a deep and accurate basis for test decision-making so that the test scope can be determined precisely. At present, based on Git diff, we have implemented code comparison and incremental-code test-case coverage evaluation, producing incremental coverage reports at different granularities such as lines, branches, methods, and classes. This establishes the basic capability of precision testing, greatly improves the effectiveness of automated regression testing, and provides more precise assistance to manual exploratory testing focused on the impact of code changes on business functions.
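
The minimal sketch below illustrates the incremental-coverage idea behind precision testing: parse git diff hunks into changed lines, then intersect them with a per-test-case coverage map to recommend only the affected cases. The parsing and the coverage-map format are simplified illustrations, not the production tool.

```python
# precise_test_scope.py -- sketch of change-aware test selection in the spirit of the
# Git diff approach described above. The coverage_map format (test case -> file ->
# covered lines) and the parsing are simplified illustrations.
import re
import subprocess
from collections import defaultdict

def changed_lines(base: str = "origin/main") -> dict[str, set[int]]:
    """Parse `git diff -U0` output into {file path: {changed line numbers}}."""
    diff = subprocess.run(["git", "diff", "-U0", base],
                          capture_output=True, text=True, check=True).stdout
    changes, current = defaultdict(set), None
    for line in diff.splitlines():
        if line.startswith("+++ "):
            current = line[6:] if line.startswith("+++ b/") else None
        elif line.startswith("@@") and current:
            # hunk header looks like: @@ -old,count +new_start,new_count @@
            m = re.search(r"\+(\d+)(?:,(\d+))?", line)
            start, count = int(m.group(1)), int(m.group(2) or 1)
            changes[current].update(range(start, start + count))
    return changes

def recommend_tests(coverage_map: dict[str, dict[str, set[int]]]) -> set[str]:
    """Return only the test cases whose covered lines intersect the change set."""
    diff = changed_lines()
    return {case for case, files in coverage_map.items()
            for path, lines in files.items() if lines & diff.get(path, set())}
```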

(2) Interface automation. Developers can use the interface management function to call front-end and back-end interfaces, manage the interfaces of multiple projects in a unified way, debug interfaces, and collaborate across teams; testers can perform simple interface tests and scenario-based interface tests against the registered interfaces; and O&M personnel can build business monitoring on the same registered interfaces. Continuous interface testing advocates automating as much as possible: manual testing is responsible for the density and effectiveness of the quality net, while automated testing ensures that test cases can be executed frequently and efficiently on demand. Automated test cases must run flexibly across environments and can be executed at any time as needed, helping the value stream flow quickly so that testing no longer becomes a significant bottleneck in the R&D process. On this basis, interface automation is seamlessly integrated with the DevOps platform: after development and testing, one-click release and deployment to different test environments is supported; interface automation can be triggered automatically or manually by testers; test results are automatically emailed to the relevant personnel; and test reports and execution logs can be viewed online.
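
As a concrete illustration, a pipeline-triggered interface test might look like the minimal pytest-plus-requests sketch below; the endpoint URL, request payload, and response fields are hypothetical stand-ins for a registered interface, not the actual platform API.

```python
# test_order_api.py -- minimal sketch of an automated interface test that a pipeline
# stage could trigger after deployment. Endpoint, payload, and response fields are
# hypothetical; the base URL is injected per test environment.
import os

import requests

BASE_URL = os.environ.get("API_BASE_URL", "https://test.example.com")

def test_create_order_succeeds():
    resp = requests.post(
        f"{BASE_URL}/api/v1/orders",           # hypothetical endpoint
        json={"symbol": "600000", "qty": 100},
        timeout=5,
    )
    assert resp.status_code == 200
    body = resp.json()
    assert body.get("code") == 0               # assumed business-level success flag
    assert "order_id" in body.get("data", {})
```

Run under pytest in the pipeline, a failing assertion produces a non-zero exit that blocks promotion, and the same report can be mailed to the relevant personnel.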

(3) UI automation. UI automation testing uses tools to simulate user interaction with the software interface, performing functional verification and page performance analysis by simulating clicks, input, swipes, and other operations. Building continuous UI automation testing capability provides fast, effective feedback on product quality problems; UI automation can record logs and screenshots of failures, reducing the test omissions and misjudged results caused by "human" factors, improving test accuracy, and reducing test costs. Because UI automation accurately simulates user operations, comprehensive functional verification and page performance analysis help uncover latent problems and improve software quality.
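
The sketch below illustrates the failure-evidence idea with Selenium, one common choice of automation tool; the page URL, element locators, and credentials are hypothetical.

```python
# ui_smoke_test.py -- sketch of a UI regression check that records a log entry and a
# screenshot on failure, as described above. Selenium is one common tool choice; the
# URL, locators, and credentials are hypothetical.
import logging

from selenium import webdriver
from selenium.webdriver.common.by import By

logging.basicConfig(filename="ui_test.log", level=logging.INFO)

def test_login_flow():
    driver = webdriver.Chrome()
    try:
        driver.get("https://test.example.com/login")      # hypothetical page
        driver.find_element(By.ID, "username").send_keys("demo_user")
        driver.find_element(By.ID, "password").send_keys("demo_pass")
        driver.find_element(By.ID, "login-button").click()
        assert "dashboard" in driver.current_url
    except Exception:
        driver.save_screenshot("login_failure.png")       # evidence for triage
        logging.exception("login flow failed")
        raise
    finally:
        driver.quit()
```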

(4) Performance automation. Automated performance testing plays an important role in system capacity evaluation and stability testing. Performance testing tools simulate a variety of normal, peak, and abnormal load conditions and measure the system's performance indicators, verifying whether the system meets users' performance requirements, locating performance bottlenecks, and ultimately optimizing the system.
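
A load test of this kind might be sketched with Locust as below; the endpoints, task weights, and load shape are illustrative assumptions, not the actual stress-testing configuration.

```python
# locustfile.py -- sketch of an automated load test for capacity evaluation and
# stability testing. Locust is one possible tool; the endpoints, task weights, and
# load shape are illustrative assumptions.
from locust import HttpUser, between, task

class QuoteUser(HttpUser):
    wait_time = between(0.1, 0.5)          # simulated think time between requests

    @task(3)
    def get_quote(self):
        self.client.get("/api/v1/quote/600000", name="quote")   # hypothetical

    @task(1)
    def place_order(self):
        self.client.post("/api/v1/orders",
                         json={"symbol": "600000", "qty": 100},
                         name="order")                          # hypothetical

# Example headless run against the test environment:
#   locust -f locustfile.py --headless -u 200 -r 20 -t 10m --host https://test.example.com
```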

(5) Test shift-left and shift-right. Traditional test teams are mainly responsible for quality assurance during the testing stage; in the cloud-native era their responsibilities change, and they gradually transition from test execution teams to test enablement teams. The idea of continuous testing likewise emphasizes continually extending testing both left and right. To improve R&D and testing efficiency and quality, testing shifts left into the requirements and development stages: the test team delegates its automated testing capabilities, such as interface automation, UI automation, and performance automation, to the R&D side and integrates them into the CI/CD pipeline, so developers can obtain automated testing capabilities from the DevOps platform at very low cost. Testing shifts right into O&M and operations to continuously improve delivery quality: online problems and user feedback are obtained promptly, and the product is continuously optimized and improved.

2. Continuous security management

By building a continuous security management system, security work runs through the end-to-end software delivery process, covering organization-level general security engineering practices, secure development process practices, secure continuous delivery practices, and secure operations practices, forming a holistic DevSecOps solution based on a maturity model and common security tools.

DevSecOps is grounded in tool capabilities that cover the entire delivery process from requirements to launch. The main capabilities include threat modeling, defect scanning, open source governance, risk discovery, and proactive defense. The commonly used core atomic security capabilities include business threat modeling, SAST source code defect detection, SCA open source component detection, image security scanning, DAST dynamic application security testing, and IAST interactive application security testing, detailed as follows.

(1) Business requirements threat modeling. To ensure secure development, application threat discovery must be moved forward into requirements analysis, development, and testing, gently embedding security work into the software development process and making security transparent for application projects from the very source of the application lifecycle. Threat modeling analyzes security requirements in the project initiation, requirements, and design phases, so that security work starts with the project and clear requirements are issued to each role in the development process. Through Q&A-style analysis of business requirement design, technical architecture, business scenarios, and deployment scenarios, threat modeling automatically derives the security requirements of the business system, supports application risk correlation analysis with an extensive risk knowledge base, and, combined with security requirements management, clarifies security design specifications, development specifications, and testing methods.

(2) SAST source code defect detection. Static Application Security Testing (SAST) analyzes the syntax, structure, control flow, and interfaces of an application's source code or binaries during the coding stage to find security vulnerabilities in the program. SAST, also known as white-box testing, detects many problem types and pinpoints the vulnerable code, which makes it relatively easy for programmers to accept.
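
As a toy illustration of the static-analysis idea (not a production SAST engine), the sketch below walks a Python syntax tree without executing the code and flags direct calls to dangerous sinks.

```python
# mini_sast.py -- toy illustration of static analysis: walk the syntax tree of source
# files without running them and flag direct calls to dangerous sinks. Real SAST
# engines add data-flow and interprocedural analysis on top of this idea.
import ast
import sys

DANGEROUS_CALLS = {"eval", "exec"}

def scan(path: str) -> list[str]:
    with open(path, encoding="utf-8") as fh:
        tree = ast.parse(fh.read(), filename=path)
    findings = []
    for node in ast.walk(tree):
        if (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
                and node.func.id in DANGEROUS_CALLS):
            findings.append(f"{path}:{node.lineno}: call to {node.func.id}()")
    return findings

if __name__ == "__main__":
    issues = [finding for path in sys.argv[1:] for finding in scan(path)]
    print("\n".join(issues) or "no findings")
    sys.exit(1 if issues else 0)
```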

(3) SCA open source component detection. The rise of open source software has dramatically increased development speed. However, open source software carries significant security and licensing risks, so analyzing software composition throughout the development process is particularly important. Software Composition Analysis (SCA) is a technique for identifying, analyzing, and tracking the components of binary software. An enterprise first needs to identify which open source components exist in its software projects, then confirm whether each is still maintained, whether the current version has known vulnerabilities, and whether using the component creates licensing risks.
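
The toy sketch below illustrates the inventory-and-match idea behind SCA, assuming a Python environment; the advisory entries are hypothetical stand-ins for a real vulnerability database such as OSV or the NVD.

```python
# mini_sca.py -- toy sketch of software composition analysis: inventory the installed
# open source components and match them against an advisory feed. The ADVISORIES dict
# is a hypothetical stand-in for a real database such as OSV or the NVD.
from importlib.metadata import distributions

ADVISORIES = {
    # package name -> (affected version, advisory id); entries are illustrative
    "requests": ("2.19.0", "CVE-2018-18074"),
}

def inventory() -> dict[str, str]:
    """List installed distributions as {name: version}."""
    return {dist.metadata["Name"].lower(): dist.version for dist in distributions()}

if __name__ == "__main__":
    for name, version in sorted(inventory().items()):
        if name in ADVISORIES and version == ADVISORIES[name][0]:
            print(f"{name}=={version} matches {ADVISORIES[name][1]}")
```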

(4) IAST interactive application security testing. Interactive Application Security Testing (IAST) is a runtime security detection technology that combines the strengths of DAST and SAST. It accurately identifies security flaws and vulnerabilities and pinpoints the code file, line number, function, and parameters where a vulnerability lies. IAST offers high detection efficiency and accuracy, precise vulnerability localization, and highly detailed vulnerability information. However, its deployment cost is somewhat higher, and instrumenting the application under test may affect the application's normal operation during testing.

(5) DAST dynamic application security testing. Dynamic Application Security Testing (DAST) analyzes the dynamic running state of an application during the testing or operation phase; it is best known as the black-box vulnerability scanning used in the operations stage. It simulates a hacker's attack on the application and analyzes the application's responses to determine whether the web application is vulnerable.
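
A minimal flavor of such black-box probing is sketched below: send a marker payload to a running application and check whether it is reflected back unescaped, a crude reflected-XSS heuristic. The target URL and parameter name are hypothetical, and real DAST scanners cover far more attack classes.

```python
# mini_dast.py -- toy illustration of black-box probing: send a marker payload to a
# running application and check whether it comes back unescaped, a crude heuristic
# for reflected XSS. The URL and parameter name are hypothetical.
import requests

MARKER = "<dast-probe-1337>"

def check_reflection(url: str, param: str = "q") -> bool:
    resp = requests.get(url, params={param: MARKER}, timeout=5)
    return MARKER in resp.text         # unescaped reflection suggests an XSS risk

if __name__ == "__main__":
    if check_reflection("https://test.example.com/search"):
        print("potential reflected XSS")
    else:
        print("no unescaped reflection detected")
```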

(6) RASP technology. Runtime Application Self-Protection (RASP) is a new form of web protection that injects protection code into the application like a vaccine. Integrated with the application, it gives the application the ability to protect itself: combining the application's logic and context, it inspects every piece of code that accesses the application system, and when the application comes under real attack, RASP detects and blocks the attack in real time without manual intervention. RASP makes up for the inherent shortcomings of traditional perimeter security products and can handle pervasive application vulnerabilities and network threats, providing dynamic security protection across the application lifecycle. However, running RASP has a certain impact on server performance, causing some decline in application performance.
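
The toy sketch below illustrates the "vaccine" idea in Python: hook a sensitive call inside the running process and block suspicious arguments. Production RASP agents instrument many more sinks with full request context; the blocking heuristic here is deliberately crude and illustrative.

```python
# mini_rasp.py -- toy illustration of runtime self-protection: hook a sensitive call
# inside the running application and block suspicious arguments. The regex heuristic
# is deliberately crude; real RASP agents instrument many sinks with request context.
import os
import re

_original_system = os.system
_SUSPICIOUS = re.compile(r"[;&|`$]")       # crude command-injection heuristic

def guarded_system(command: str) -> int:
    if _SUSPICIOUS.search(command):
        print(f"[RASP] blocked suspicious command: {command!r}")
        return -1                          # refuse instead of passing to the shell
    return _original_system(command)

os.system = guarded_system                 # the "vaccine": protection inside the app
```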

(7) Combining continuous security capabilities with the R&D process. Traditional R&D and operations security focuses on eliminating security threats and fixing vulnerabilities in the testing and operations stages, a largely passive defense. Facing the frequent service changes and releases of the cloud-native era, security urgently needs to shift left: intervening in the requirements and R&D stages reduces security risks at the source and achieves proactive defense, building a security system that covers the entire lifecycle of software application services.

In the past, security was always added to software by a separate security team at the end of the development cycle (almost as an afterthought) and tested by a separate quality assurance (QA) team. As agile and DevOps practices shortened development cycles to weeks or even days, this traditional "bolt-on" approach to security created unacceptable bottlenecks. DevSecOps seamlessly integrates security tooling into agile and DevOps processes and tools, making security issues easier, faster, and cheaper to handle and preventing them from reaching production. Figure 4 shows the solution for integrating security tool capabilities with the pipeline.


Figure 4 Continuous security management

By integrating security capabilities into the pipeline, the R&D team can focus on business development without spending extra effort to accommodate the DevSecOps rollout; combined with pipeline quality gates, this prevents "sick" products from flowing to the next stage of the delivery process.
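
A pipeline access-control step of this kind might be sketched as below, assuming each scanner writes its findings to a shared JSON report; the report format and severity thresholds are illustrative, not the actual gate configuration.

```python
# security_gate.py -- sketch of a pipeline access-control step that aggregates scanner
# findings and blocks promotion past the gate. The shared JSON report format and the
# severity thresholds are assumptions for illustration.
import json
import sys

SEVERITY_GATE = {"critical": 0, "high": 0, "medium": 10}   # hypothetical limits

def gate(report_path: str = "security_findings.json") -> None:
    with open(report_path, encoding="utf-8") as fh:
        findings = json.load(fh)   # e.g. [{"tool": "sast", "severity": "high", ...}]
    counts: dict[str, int] = {}
    for finding in findings:
        counts[finding["severity"]] = counts.get(finding["severity"], 0) + 1
    for severity, limit in SEVERITY_GATE.items():
        if counts.get(severity, 0) > limit:
            print(f"gate failed: {counts[severity]} {severity} findings (limit {limit})")
            sys.exit(1)            # the product does not flow "sick" downstream
    print("security gate passed:", counts or "no findings")

if __name__ == "__main__":
    gate(*sys.argv[1:])
```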

3. Chaos engineering

Chaos engineering is experimentation on a distributed system: it proactively injects faults to create system anomalies and supports replayable, continuous fault drills that verify system stability, exposing and fixing potential problems in advance and avoiding service instability or even unavailability caused by sudden failures in actual operation. Implementation follows the principle of "step-by-step rollout and layered improvement": start in a simulation environment before transitioning to production, and drill key scenarios first before gradually covering all business scenarios. At this stage, the key plans for implementing chaos engineering are as follows.

(1) Building the observability system for chaos engineering. Chaos drills rely on the monitoring system to observe indicators during the drill and identify faults in the system. The monitoring system is deployed on the hybrid cloud environment and monitors business applications, basic services, and infrastructure, that is, the SaaS, PaaS, and IaaS layers, from top to bottom. In addition, to better observe application and architecture status during drills, link tracing and performance monitoring are introduced. Exporter-style data collectors gather monitoring metrics for basic server resources, database clusters, middleware clusters such as message queues, and application services. The layers covered are listed below, followed by a minimal exporter sketch.

Business application layer: service availability monitoring (whether services and ports are alive or hung) and application performance monitoring (processing capacity, such as transaction volume, success rate, failure rate, response rate, and latency).

Basic service layer: performance indicators of various middleware, Docker containers, and the cloud-native platform.

Infrastructure layer: performance of basic resources, such as CPU (usage, per-core usage, load), memory (application memory, overall memory, etc.), disk I/O (read/write rate, IOPS, average wait latency, average service latency, etc.), network I/O (traffic, packet volume, error packets, packet loss), and connections (number of TCP connections in various states).
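
As a minimal sketch of the exporter idea, the following uses the prometheus_client library to publish two application-layer metrics for Prometheus to scrape; the metric names, port, and probe logic are illustrative assumptions.

```python
# app_exporter.py -- minimal sketch of an application-layer metrics exporter for the
# drill observability stack, using the prometheus_client library. Metric names, the
# port, and the probe are illustrative; real collectors span the layers listed above.
import random
import time

from prometheus_client import Gauge, start_http_server

success_rate = Gauge("business_success_rate", "Transaction success rate (0-1)")
latency_ms = Gauge("business_latency_ms", "Average response time in milliseconds")

def probe() -> tuple[float, float]:
    """Placeholder health probe; returns random demo values instead of real ones."""
    return random.uniform(0.95, 1.0), random.uniform(5, 50)

if __name__ == "__main__":
    start_http_server(9100)            # Prometheus scrapes metrics from this port
    while True:
        rate, latency = probe()
        success_rate.set(rate)
        latency_ms.set(latency)
        time.sleep(15)                 # matches a typical scrape interval
```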

(2) Building chaos drill tools and the drill platform. The chaos engineering platform is the key tool for implementing chaos engineering. It is generally divided into six capability layers: SRE capabilities, atomic capabilities, drill capabilities, perception capabilities, planning capabilities, and drill platform capabilities, as shown in Figure 5.


Figure 5 Chaos engineering drill platform capabilities

Through the drill platform, multiple environments such as virtual machines, physical machines, and hybrid clouds can be managed and drilled, and applications and middleware on multiple server types and operating systems are supported. Chaos engineering is usually advanced iteratively according to a plan, and the platform's drill planning capability allows drills to be scheduled and run on demand. With its enriched atomic capabilities and improved functionality, the drill platform now supports the implementation of chaos engineering in many areas, including log inspection, fault drills, reliability testing, and red-blue attack and defense exercises.

(3) Emergency plan management. Implementing emergency plans means formulating strategies and processes for preventing, responding to, and recovering from the various emergencies or failures that may occur. At the same time, emergency response capability is consolidated into the emergency plan management platform, so that the organization can respond quickly and effectively, reduce losses, and restore normal business operation.

Through the continuous promotion and implementation of chaos engineering, contingency plans for discovered and potential problems are managed well and retained in the plan management platform to form a knowledge base. This helps accumulate standardized drill scenarios and fault-handling processes, enables other business lines to run experiments quickly and respond rapidly to production failures, and reduces drill costs and fault-handling time.

Based on the chaos engineering system, stability tests were carried out on key financial services. The implementation route injected faults first into a single request of a service, then into the entire service, and finally into the overall application architecture. Following the full-link process, faults such as network failures, disk failures, process kills, and business-scenario faults were injected, while the access success rate and TPS of the Internet financial services were continuously observed. Through normalized chaos engineering testing integrated into the DevOps pipeline, system stability has continuously improved.
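
One drill step of this kind might be sketched as follows: verify the steady-state hypothesis, inject a fault, and check that the success rate holds. The fault-injection functions are placeholders for the drill platform's API, and the baseline threshold is an assumed value.

```python
# chaos_drill.py -- minimal sketch of one drill step: verify a steady-state
# hypothesis, inject a fault, and check that the success rate holds. The injection
# functions are placeholders for the drill platform's API; thresholds are illustrative.
import time

import requests

BASELINE_SUCCESS_RATE = 0.999              # steady-state hypothesis (assumed value)

def inject_network_delay() -> None:
    """Placeholder: call the drill platform's fault-injection API here."""

def remove_fault() -> None:
    """Placeholder: always clean up the injected fault after observation."""

def measure_success_rate(url: str, samples: int = 100) -> float:
    ok = 0
    for _ in range(samples):
        try:
            ok += requests.get(url, timeout=1).status_code == 200
        except requests.RequestException:
            pass                           # failed requests count against the rate
    return ok / samples

def run_drill(target_url: str) -> bool:
    assert measure_success_rate(target_url) >= BASELINE_SUCCESS_RATE, "system not steady"
    inject_network_delay()
    time.sleep(30)                         # let the fault take effect
    try:
        rate = measure_success_rate(target_url)
    finally:
        remove_fault()
    print(f"success rate under fault: {rate:.2%}")
    return rate >= BASELINE_SUCCESS_RATE
```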

In the cloud-native era, exploring and practicing a total quality assurance system is an indispensable part of an enterprise's digital transformation. Enterprises need to internalize the concept of total quality management, build stable and reliable cloud-native infrastructure, strengthen risk prevention and discovery mechanisms, and promote intelligent quality management, so as to achieve the high-quality development of fintech-enabled business.

(This article was published in the second half of March 2024)