
In Practice | Building an Intelligent O&M System in a Multi-Cloud Model

Author: Digitization of Finance

The digital economy is developing rapidly worldwide, and the 20th National Congress of the Communist Party of China explicitly set out the strategic goal of "accelerating the development of the digital economy". As a leading enterprise in China's insurance industry, China Pacific Insurance has long been committed to digital transformation. As the company's business continues to grow rapidly, our data center operating model is being continuously optimized and upgraded. To ensure business continuity, reduce O&M costs, and improve service quality, data center O&M faces unprecedented challenges, which fall roughly into the following areas. First, the scale of O&M keeps expanding: with the rollout of the private cloud and the IT application innovation (Xinchuang) transformation, the number of machine rooms, cabinets, devices, and software components is growing rapidly, and the problem of multi-center collaboration is becoming more prominent. Second, business systems are becoming more diverse in form: distributed and cloud-native architectures are being adopted step by step while applications on traditional architectures will remain in place for a long time, and the interdependencies among these heterogeneous applications cause O&M complexity to grow geometrically. Third, there is a tension between O&M cost and business continuity assurance: as the scale of O&M expands and system forms diversify, the shortage of O&M personnel and skills becomes more acute, and with headcount staying flat it is increasingly difficult to detect and handle faults promptly and efficiently.

To cope with these challenges in today's complex environment, we must build a comprehensive and efficient monitoring system that achieves full coverage with no blind spots. At the same time, we must build an efficient automated O&M system to improve efficiency and reduce the cost and risk of manual intervention. We also need an intelligent analysis and decision-making center that links observation with action ("hand-eye" coordination), supports decision-making, and further raises the level of intelligence in O&M. With such a fully controllable intelligent O&M system, we can significantly reduce data center O&M costs, improve O&M efficiency, and add more value through O&M. This not only helps ensure the stable operation of the business, but also provides solid support and assurance for it.


Du Yingjun, head of the automation tool R&D team, Cloud Services Business Group, CPIC Technology Co., Ltd.

Construction plan

1. The construction process of CPIC's O&M tool system

Since 2016, the CPIC data center has systematically built up a series of O&M tools, including automated O&M, a configuration management database (CMDB), and unified monitoring. This marked a significant change in our data center operating model, from back-end scripting to front-end, interface-driven operations.

By 2020, we had built a fairly complete tool system covering unified monitoring, automated O&M, a log platform, a container platform, and the configuration management database (CMDB). Our first private cloud was built during the same period.

With these tools, routine O&M operations such as changes, releases, and emergency response are largely performed through front-end interfaces, and some standard recovery measures are triggered automatically by monitoring alarms, achieving a degree of data-driven fault self-healing. Figure 1 outlines the resulting capability architecture. With this tool system, we can handle large volumes of repetitive operational tasks within acceptable time and avoid misoperations caused by human negligence. On the problem-discovery side, basic alarms are fully covered and a unified alarm and dispatch mechanism is in place. These measures, combined with a 24-hour on-duty management system, ensure that faults are detected in time and improve the stability and reliability of the overall service.


Figure 1 O&M tool architecture
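As a minimal illustration of the alarm-triggered recovery described above, the sketch below maps an incoming alarm to a preregistered standard recovery action. Everything here is hypothetical (the `Alarm` fields, `RECOVERY_ACTIONS`, and the recovery functions); the actual platform dispatches such jobs through the automation tools shown in Figure 1.

```python
# Minimal sketch of alarm-triggered self-healing (hypothetical names, not the actual platform API).
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class Alarm:
    rule_id: str        # which monitoring rule fired
    target: str         # affected host / instance
    severity: str       # e.g. "critical", "warning"

def restart_app_pool(target: str) -> str:
    # Placeholder for a standard recovery job executed by the automation platform.
    return f"restart job submitted for {target}"

def clean_disk(target: str) -> str:
    return f"disk-cleanup job submitted for {target}"

# Registry of "standard recovery measures" keyed by the alarm rule that triggers them.
RECOVERY_ACTIONS: Dict[str, Callable[[str], str]] = {
    "app_pool_hung": restart_app_pool,
    "disk_usage_high": clean_disk,
}

def handle_alarm(alarm: Alarm) -> str:
    """Trigger a standard recovery action if one is registered; otherwise page the on-duty team."""
    action = RECOVERY_ACTIONS.get(alarm.rule_id)
    if action is None or alarm.severity != "critical":
        return f"dispatch to 24x7 duty roster: {alarm.rule_id} on {alarm.target}"
    return action(alarm.target)

print(handle_alarm(Alarm("disk_usage_high", "host-001", "critical")))
```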

2. The driving factors behind the move from automation to intelligence

With the rapid development of the company's business, competition in the insurance market has intensified and the demands on business continuity assurance keep rising. Traditional monitoring and automated operations can no longer fully meet the needs of efficient and stable business operations. The main problems are as follows. First, as the number of data center devices keeps growing and application architectures become increasingly complex in Xinchuang and cloud-native environments, there are more devices and objects to monitor, and the problems of monitoring blind spots and invalid alerts grow with them. Second, the various systems were built independently in the early days, so monitoring data, logs, and other information cannot be analyzed and displayed together quickly. This makes troubleshooting harder, and its speed depends heavily on the skills and experience of individual engineers. Finally, in scenarios such as daily problem handling, troubleshooting, and key-event assurance, there are many barriers to information transfer between the data center and the professional teams, which drives up communication costs and seriously affects work efficiency.

3. Overall plan for building the intelligent O&M system

To address the problems we face in current O&M work, and building on existing platform capabilities, we went through repeated rounds of discussion and validation and formed a data-driven, algorithm-driven plan for building an intelligent O&M system. Through data analysis and algorithm optimization, the plan aims to detect faults earlier, locate them faster, and intelligently recommend handling plans.

First, building the intelligent monitoring system is the core of the whole plan. By monitoring key indicators in real time and collecting and analyzing the data, potential signs of failure can be detected early, so that warnings are issued before a failure occurs or its impact is mitigated. This improves the stability and reliability of the system and significantly reduces downtime caused by failures.

Second, efficient algorithms are used to process massive volumes of data so that the specific location and cause of a fault can be identified quickly. Compared with traditional troubleshooting, algorithmic analysis significantly improves the speed and accuracy of fault location, reduces the workload of O&M personnel, shortens recovery time, and improves the operational efficiency of the whole system.

In addition, the intelligent monitoring system can recommend handling plans to O&M personnel based on historical data and algorithmic analysis. These recommendations are backed by substantial practical experience and data, and can resolve faults quickly and effectively, improving both the professionalism and the accuracy of O&M work and further strengthening system stability and reliability.
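One plausible way to recommend handling plans from historical data is to match the current fault's symptoms against past fault records. The sketch below uses a simple Jaccard similarity over hypothetical symptom tags purely to illustrate the idea; it is not the production algorithm.

```python
# Sketch: recommend a handling plan by matching the current fault's symptoms
# against historical fault records (hypothetical data; similarity metric chosen for illustration).
from typing import Dict, List, Set

HISTORY: List[Dict] = [
    {"symptoms": {"high_cpu", "slow_sql"}, "plan": "kill long-running SQL and rebuild statistics"},
    {"symptoms": {"connection_timeout", "packet_loss"}, "plan": "fail over to the standby network link"},
    {"symptoms": {"oom", "pod_restart"}, "plan": "raise memory limit and roll back the latest release"},
]

def jaccard(a: Set[str], b: Set[str]) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

def recommend(symptoms: Set[str], top_n: int = 2) -> List[str]:
    """Return the handling plans of the most similar historical faults."""
    ranked = sorted(HISTORY, key=lambda rec: jaccard(symptoms, rec["symptoms"]), reverse=True)
    return [rec["plan"] for rec in ranked[:top_n]]

print(recommend({"slow_sql", "high_cpu", "lock_wait"}))
```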

In short, by building a data-driven, algorithm-driven intelligent O&M system, we can better cope with the challenges of current O&M work. Implementing this plan helps improve system stability, reliability, and operational efficiency, reduces the probability and impact of failures, and provides a solid guarantee for the stable development of the enterprise (as shown in Figure 2).


Figure 2 Data-driven capability model

4. Intelligent monitoring system construction plan

In designing the intelligent monitoring system, we focus on three key stages: fault prevention, fault discovery, and fault location. A traditional monitoring system is designed mainly around the fault discovery stage; although its monitoring indicators do assist fault location to some extent, factors such as scattered data limit their practical effect. To ensure a smooth transition from traditional to intelligent monitoring, we need to integrate monitoring data from multiple channels and comprehensively improve capabilities across all three stages. To this end, we must break with the old model and build an integrated intelligent monitoring system from a business perspective (as shown in Figure 3).


Figure 3 Monitoring capabilities

(1) Fault prevention stage. Following Heinrich's Law, to reduce the incidence of business-impacting events we need to identify risks and intervene before they develop into faults. To this end, we added a dedicated intelligent risk-warning module to the monitoring system. After each O&M job (such as a change or a release) is completed, the module compares the running status of the objects involved in the job against preset comparison models. For example, after an application release, the module analyzes inefficient SQL statements, execution plans, and CPU usage on the databases involved. Once any value is found to exceed its preset baseline, the system immediately issues a risk warning; the issue is recorded and tracked in the system until it is resolved, forming closed-loop management. The responsible person, the handler, and the handling plan are also recorded for full traceability.
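A minimal sketch of the post-job risk check, assuming illustrative metric names and baseline values rather than the real comparison model: metrics collected after a change or release are compared against their baselines, and any breach is reported as a risk warning to be tracked until closed.

```python
# Sketch of the post-change risk check: compare metrics observed after an O&M job
# with preset baselines and raise a tracked risk warning when any baseline is exceeded.
# Metric names and baseline values are illustrative, not the production model.
from typing import Dict, List

BASELINES: Dict[str, float] = {
    "db_cpu_percent": 70.0,           # CPU usage on the affected database
    "slow_sql_per_min": 5.0,          # inefficient SQL statements per minute
    "full_table_scans_per_min": 2.0,
}

def check_after_job(job_id: str, observed: Dict[str, float]) -> List[str]:
    """Return a risk warning for every metric that exceeds its baseline after the job."""
    warnings = []
    for metric, baseline in BASELINES.items():
        value = observed.get(metric)
        if value is not None and value > baseline:
            warnings.append(
                f"[{job_id}] {metric}={value} exceeds baseline {baseline}; "
                f"record and track until closed"
            )
    return warnings

observed_after_release = {"db_cpu_percent": 83.5, "slow_sql_per_min": 12.0}
for w in check_after_job("release-20240218-01", observed_after_release):
    print(w)
```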

(2) Fault discovery stage. To detect failures before users notice them, we believe it is important to raise the monitoring perspective from "application availability" to the more comprehensive "service availability". The goal of this shift is to identify potential faults earlier, buying valuable time for subsequent troubleshooting and thereby protecting the user experience. Relying solely on infrastructure-level metrics no longer meets current monitoring requirements; we need to introduce more key metrics, especially those tied to user experience, such as response time, error rate, and error distribution, and to include additional dimensions such as network traffic and user access regions, so that faults can be described more completely.

The existing alarm model has clear shortcomings: it relies on single metrics and fixed thresholds, cannot fully reflect complex fault conditions, and can generate large numbers of invalid alarms. To address these issues we took several measures. First, we introduced an intelligent alarm center with rich alarm-processing capabilities, such as merging similar alarms, intelligently assigning work orders, and suppressing alarms within defined time windows. We also enriched the forms of alarms, including algorithm-based baseline and dynamic-threshold alarms, trend alarms (steep rises and drops), and heartbeat-detection alarms. Most importantly, we implemented multi-dimensional threshold settings so that alarm conditions can be configured more precisely.
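The sketch below illustrates two of the alarm forms mentioned above: a dynamic threshold derived from recent history (mean plus a multiple of the standard deviation) and a steep-rise/steep-drop trend alarm. The window size and factors are assumptions for illustration, not the tuned production values.

```python
# Sketch of dynamic-threshold and trend (steep rise/drop) alarms.
from statistics import mean, stdev
from typing import List, Optional

def dynamic_threshold(history: List[float], k: float = 3.0) -> Optional[float]:
    """Baseline threshold computed from recent samples; None if history is too short."""
    if len(history) < 5:
        return None
    return mean(history) + k * stdev(history)

def check_point(history: List[float], value: float) -> List[str]:
    alarms = []
    threshold = dynamic_threshold(history)
    if threshold is not None and value > threshold:
        alarms.append(f"dynamic-threshold alarm: {value:.1f} > {threshold:.1f}")
    # Steep rise/drop: compare against the previous sample.
    if history and history[-1] > 0:
        change = (value - history[-1]) / history[-1]
        if abs(change) >= 0.5:
            direction = "rise" if change > 0 else "drop"
            alarms.append(f"trend alarm: steep {direction} of {change:+.0%}")
    return alarms

response_times = [120, 118, 125, 122, 119, 121]   # recent response-time samples (ms)
print(check_point(response_times, 310))
```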

To further improve alarm effectiveness, we also established an early-warning indicator system and a user-impact assessment mechanism. Combined with fault-location capabilities, these allow us to identify and filter invalid alarms more accurately. Together, these improvements raise the accuracy and usefulness of alerts and give users a better service experience.

(3) Fault location stage. For fault location, we focused on how to identify the root cause quickly and accurately. We designed two modules for this: intelligent system diagnosis and global observation. The intelligent diagnosis module uses preset algorithms for preliminary screening and analyzes in depth both the basic components of the automatically generated application deployment architecture and the interaction relationships between applications. By combining CMDB data associations with application interaction (request) data, we established the linkage between the two types of topology.

At each node we preset a series of troubleshooting metric factors, usually judged against baseline thresholds and short-term fluctuations, and assemble them into a decision-tree model. When an abnormal node is detected, we weight the suspected faulty nodes according to factors such as when the anomaly occurred, the node's level in the topology, and its fault history, and finally recommend the node with the highest root-cause probability, completing the analysis.
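The following sketch shows the weighted-ranking idea, assuming three illustrative factors and weights (anomaly recency, depth in the topology, and fault history); the real decision-tree model uses the preset metric factors described above.

```python
# Sketch of the weighted root-cause ranking over abnormal topology nodes.
# Factor definitions and weights are illustrative assumptions.
from dataclasses import dataclass
from typing import List

@dataclass
class SuspectNode:
    name: str
    minutes_since_anomaly: float   # anomalies closer to the incident score higher
    depth_in_topology: int         # deeper (closer to infrastructure) scores higher
    historical_faults: int         # nodes that failed before score higher

WEIGHTS = {"recency": 0.5, "depth": 0.3, "history": 0.2}

def score(node: SuspectNode) -> float:
    recency = 1.0 / (1.0 + node.minutes_since_anomaly)
    depth = min(node.depth_in_topology, 5) / 5.0
    history = min(node.historical_faults, 10) / 10.0
    return (WEIGHTS["recency"] * recency
            + WEIGHTS["depth"] * depth
            + WEIGHTS["history"] * history)

def rank_root_causes(nodes: List[SuspectNode]) -> List[str]:
    """Return node names ordered by descending root-cause probability."""
    return [n.name for n in sorted(nodes, key=score, reverse=True)]

suspects = [
    SuspectNode("web-gateway", minutes_since_anomaly=8, depth_in_topology=1, historical_faults=1),
    SuspectNode("order-db", minutes_since_anomaly=1, depth_in_topology=4, historical_faults=3),
    SuspectNode("cache-cluster", minutes_since_anomaly=3, depth_in_topology=3, historical_faults=0),
]
print(rank_root_causes(suspects))   # the database node ranks first in this example
```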

In addition, the diagnostic model provides early-warning and alarm functions and supports manual or scheduled triggering. This not only helps locate faults quickly but also effectively reduces invalid alarms. The result of each diagnosis is recorded in detail and becomes an important input to the fault portrait. Together, these measures keep our fault-location solution both efficient and accurate.

At present, the diagnostic models described above are still being trialled and tuned, and day-to-day troubleshooting and location still rely on O&M experts. To break down the barriers to information transfer between the professional teams, we designed dedicated data-analysis pages covering systems, networks, databases, applications, and other domains. These pages show the dependencies and operational metrics of the components at each layer, and integrate data from different sources into the data observation module through the data processing engine. Our goal is to create efficient fault-location tools through a combination of machine intelligence and human expertise, improving both the efficiency and the accuracy of troubleshooting.

(4) Data processing engine and data governance. Across the three stages of prevention, discovery, and location, an efficient and flexible data processing engine is at the core, supported by long-term data governance. To achieve this, we built a low-code data processing capability on relatively lightweight databases such as ClickHouse and MongoDB, combined with a self-developed engine that unifies stream and batch processing. It can quickly process metric and log data from various sources and lowers the technical threshold for data analysis. As a result, more O&M professionals have been drawn into developing intelligent analysis models, forming an ecosystem within the department.
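As a rough sketch of what the stream-batch engine does with metric and log streams, the example below rolls raw samples into fixed time windows and emits summary rows that a downstream store such as ClickHouse or MongoDB could persist. All names and the window size are illustrative; the real engine is self-developed and driven through the low-code interface.

```python
# Sketch of windowed aggregation over a metric stream (illustrative, standard library only).
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Tuple

Sample = Tuple[int, str, float]           # (unix timestamp, metric name, value)

def aggregate(samples: Iterable[Sample], window_seconds: int = 60) -> List[Dict]:
    """Roll raw samples up into per-window, per-metric summaries."""
    buckets: Dict[Tuple[int, str], List[float]] = defaultdict(list)
    for ts, metric, value in samples:
        window_start = ts - ts % window_seconds
        buckets[(window_start, metric)].append(value)
    return [
        {"window_start": w, "metric": m, "count": len(v), "avg": mean(v), "max": max(v)}
        for (w, m), v in sorted(buckets.items())
    ]

raw = [
    (1700000005, "error_rate", 0.2),
    (1700000030, "error_rate", 0.8),
    (1700000070, "error_rate", 0.1),
]
for row in aggregate(raw):
    print(row)   # summary rows ready to be written to the analytics store
```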

In addition, while the data processing engine is important, data quality is the other key factor in intelligent analysis. We therefore established data access standards, continuously strengthened automatic collection of CMDB information, designated a responsible team for each type of data, and implemented measures such as automatic data integrity verification. Through this combination of technology and management, we continuously improve the quality of O&M data and ensure the accuracy and reliability of intelligent analysis.
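A minimal sketch of the automatic data-integrity verification, assuming an illustrative set of required CMDB fields (including the responsible team) rather than the actual schema:

```python
# Sketch of an automatic integrity check on CMDB records: every configuration item
# must carry the fields required by the data access standard. Field names are assumptions.
from typing import Dict, List

REQUIRED_FIELDS = ["ci_id", "ci_type", "owner_team", "environment", "last_discovered"]

def verify_records(records: List[Dict]) -> List[str]:
    """Return one issue string per missing or empty required field."""
    issues = []
    for record in records:
        for field in REQUIRED_FIELDS:
            if not record.get(field):
                issues.append(f"CI {record.get('ci_id', '<unknown>')}: missing '{field}'")
    return issues

cmdb_sample = [
    {"ci_id": "vm-1001", "ci_type": "vm", "owner_team": "db-ops",
     "environment": "prod", "last_discovered": "2024-02-10"},
    {"ci_id": "vm-1002", "ci_type": "vm", "owner_team": "",
     "environment": "prod", "last_discovered": "2024-02-10"},
]
for issue in verify_records(cmdb_sample):
    print(issue)   # e.g. flag vm-1002 for the responsible team to fix
```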

Construction effectiveness

After several years of continuous effort, our intelligent monitoring system has reached a certain scale and has achieved notable results, especially in fault prevention and early detection. These results are mainly reflected in the following aspects. First, after an O&M operation is executed, we can find potential performance risks in time and intervene in advance, effectively reducing the incidence of failures. Second, with a monitoring mode that combines early warning and diagnosis, we detect faults earlier and can analyze the affected user scope at the same time, which provides a solid basis for handling decisions and fault recovery. Third, fault analysis and troubleshooting plans are now captured in the system, which makes junior O&M staff more effective at handling faults and removes barriers to cross-team information transfer, further improving troubleshooting efficiency. Fourth, the way O&M staff work has changed for the better: more of them are willing to invest in building O&M tools, and both efficiency and motivation have improved markedly. Fifth, the quality of O&M data is assured, the responsibilities of data maintainers and consumers are clear, and all operations run efficiently on unified data. Sixth, we keep detailed records of how each failure was handled; these records can be used for fault reviews at any time and provide important guidance for defining remediation tasks and optimizing the diagnostic models.

Follow-up work and prospects

In the context of digital transformation, intelligent O&M has become key to the stable and efficient operation of the enterprise. After a period of construction, the intelligent O&M system has achieved initial success and is broadly in line with expectations. However, like any complex systems engineering effort, it still needs continuous optimization and improvement in practice.

First, the accuracy of dynamic baselines is an important part of the intelligent O&M system, and in practice we found that it still needs to improve. To address this, we need to keep optimizing and refining the intelligent fault-diagnosis model to strengthen its analysis and location capabilities in complex environments.

Second, fault analysis and location currently rely mainly on manual experience and historical fault data. To respond to unknown anomalies more efficiently, we will introduce more metric data and log information and use machine learning to identify and classify anomalies automatically. This will greatly improve the responsiveness and accuracy of fault handling.

In addition, handling complex faults remains a major challenge for the intelligent O&M system. For some complex system failures, we do not yet invoke automated handling directly, for safety reasons. As the technology advances, we will keep improving the accuracy of the diagnostic model so that more handling actions can be integrated with it, and we expect to extend fault self-healing to more scenarios, further improving system stability and reliability.

Finally, by combining the capabilities of discovery, diagnosis, and handling with large-model technology, we will gradually realize the vision of an "O&M brain". By integrating large models, we will build a more intelligent O&M platform that provides users with more comprehensive and efficient services.

In summary, although the construction of the intelligent O&M system has achieved certain results, continuous effort and improvement are still needed. We will continue to deepen the research and application of intelligent O&M and contribute to more stable and efficient IT operations.

(This article was published in the second half of February 2024)