Case Application Solution
Full-stack monitoring + unified alarm + intelligent on-duty solution
The full-stack monitoring + unified alarm + intelligent on-duty solution provides one-stop O&M monitoring and management for enterprises whose IT systems are growing rapidly in scale and complexity. It keeps O&M efficient and systems stable through comprehensive monitoring at every layer, unified alarm management, and intelligent automated on-duty handling.
Full-stack monitoring covers IT resources such as infrastructure, middleware, services, applications, and call chains. It tracks running status and performance indicators in real time and surfaces potential risks and anomalies promptly. Unified alarm management centralizes all alarm information, which avoids information silos and duplicate alarms and improves both alarm accuracy and response time. The intelligent on-duty system provides 24/7 automated duty and emergency handling: it responds automatically when an alarm arrives, suggests relevant solutions, and follows up on handling status, reducing dependence on manual intervention and making fault handling faster and more accurate.
With this solution, enterprises can achieve the monitoring and management goals of "comprehensive, multi-dimensional monitoring; real-time anomaly detection; better alarm quality; and support for rapid response", keeping IT systems running efficiently and stably.
Case Background
Over the years, the customer had built up a number of O&M monitoring tools. Because the early build-out of each tool was not planned as a whole, the tools used relatively simple monitoring methods and outdated technology, their O&M data was scattered, and they lacked interconnection, a collaborative working mechanism, and unified management. In addition, the existing O&M team was limited by its own skills and tools and could not respond to system failures in a timely, efficient way.
To cope with the O&M pressure from more than 70 business systems and hundreds of system nodes, the customer urgently needed a complete, professional intelligent O&M system to improve O&M management capabilities, achieve integrated and fine-grained O&M control, and ensure the stable operation of its IT systems.
01 Needs analysis
01.1 Problems faced
· Insufficient O&M tooling - large monitoring blind spots
The monitoring tools were outdated and could not monitor some device types and software versions. Because they were built on open source technology, they required continuous investment of staff time in development and maintenance, so only partial server and log monitoring had been implemented; application performance, middleware, and database monitoring were missing. Gaps in monitoring coverage, indicator coverage, and real-time performance meant that the tools could not reflect the state of the system in real time. Faults were discovered late, sometimes only after users had reported them. The tools could no longer meet the monitoring requirements of the current complex systems.
· Scattered O&M data - inefficient troubleshooting and handling
Monitoring data and the alarms generated from it were scattered across multiple tool platforms, with no unified management view and no correlated or summarized alarm information. Faced with a large volume of alarms, O&M personnel could not quickly identify the important ones or determine the scope of a problem. During troubleshooting, it was difficult for the specialist teams to carry out overall correlation analysis and trace a fault to its source.
· Lack of intelligent decision support - collaboration relies on manual effort
Fault analysis and handling were entirely manual. When a business system behaved abnormally, first-line O&M personnel often lacked the experience and skills to resolve it and had to ask second- and third-line personnel for help. This led to high communication and labor costs and long troubleshooting times, which prolonged the business impact.
01.2 Project construction objectives
· 100% monitoring coverage across all layers
Build a single platform that collects full-stack software and hardware performance indicators through multiple channels and collection modes, and that accepts data reported by custom scripts. The platform should cover all monitoring types, including but not limited to user experience monitoring, application performance monitoring, and basic resource monitoring (servers, middleware, databases, and so on). It should also collect and monitor complete log data in real time, so that the O&M team learns of system anomalies as soon as they occur.
· Build a unified view of O&M data
Use a unified platform to integrate O&M big data, both structured and unstructured, and to connect monitoring, alarm, and asset data. From a business perspective, the view focuses on core backbone links, core business applications, monitoring alarms, and related information, giving visual insight into O&M data and helping O&M personnel understand the overall operating status of IT systems.
· Improve alarm quality and accelerate fault response
Compress the large number of identical or similar alarm events that occur every day so that O&M personnel can concentrate on finding and tracing problems and faults. Respond to alarm events through automated, intelligent means such as alarm handling tracking, fault identification with automatic escalation, one-click meetings, and an emergency command room. This saves the time and effort of manual intervention and allows faults to be handled quickly once they occur, reducing their impact on system stability and business continuity.
02 Solutions and ideas
02.1 Construction ideas
The solution has built-in modules for basic resource monitoring, application performance monitoring, and user experience monitoring, providing unified monitoring coverage of the basic environment, servers, storage, networks, operating systems, middleware, and databases.
The solution uses the ARCANA platform (a multi-modal data intelligent analysis and decision-making platform) developed by Dingmao Technology as a unified data foundation that aggregates O&M big data such as performance indicators and logs. ARC-IOC (Digital Intelligence Operation Center) allows visual O&M monitoring and management views to be built quickly in a low-code manner. Di-Logger (Intelligent Log Center) monitors and analyzes logs. The alarms generated by each monitoring module and by the log platform are pushed to Di-Alert (Intelligent Alarm Center), which handles alarm compression and the handling workflow. Di-Robot (Intelligent Duty Center) follows up on alarm handling, forming a closed loop of fault discovery, analysis, and handling.
02.2 Program Implementation
Step 1 Deploy the full-stack monitoring modules (basic resource monitoring, application performance monitoring, user experience monitoring, and log monitoring)
· Use the monitoring modules to build a multi-dimensional O&M monitoring system centered on business value, with all-round real-time monitoring of business systems and basic resources, wider monitoring coverage, and more flexible monitoring indicators. Set up timely, accurate alarm mechanisms that raise an alert as soon as a problem first appears.
· Use the log analysis capability of Di-Logger to inspect logs in real time and raise alarms on anomalies hidden in them.
Step 2 Deploy the cloud-native digital intelligence base (ARCANA platform)
· Provide a unified O&M portal through the ARCANA platform that integrates all O&M monitoring and management tools. Aggregate and analyze O&M big data, and provide low-code, visually edited O&M monitoring screens, mobile views, and similar views to form a personalized O&M interface.
· Build on the functional modules hosted on the base to expand intelligent O&M capabilities quickly.
Step 3 Add the intelligent alarm module (Di-Alert)
· Di-Alert provides the main capabilities of alarm unification, alarm compression, and alarm views. It correlates and compresses large numbers of alarms, and notifies and broadcasts related alarms in the form of an alarm topology view.
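Alarm compression of the kind attributed to Di-Alert could be sketched as follows. This is a minimal illustration, not Di-Alert's actual algorithm: the `Alarm` fields, the `(resource, check)` grouping key, the 300-second window, and the definition used in `compression_rate` are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Alarm:
    source: str       # monitoring module that raised the alarm (hypothetical field)
    resource: str     # host, service, or component the alarm refers to
    check: str        # name of the failing check
    timestamp: float  # seconds since epoch


@dataclass
class CompressedAlarm:
    key: tuple        # (resource, check) shared by the merged alarms
    first_seen: float
    last_seen: float
    count: int = 1


def compress(alarms, window_seconds=300.0):
    """Merge alarms with the same (resource, check) when each one arrives
    within window_seconds of the previous occurrence of that key."""
    open_groups = {}
    result = []
    for alarm in sorted(alarms, key=lambda a: a.timestamp):
        key = (alarm.resource, alarm.check)
        group = open_groups.get(key)
        if group is not None and alarm.timestamp - group.last_seen <= window_seconds:
            # Same problem still firing: fold it into the existing group.
            group.last_seen = alarm.timestamp
            group.count += 1
        else:
            # First occurrence, or the previous burst has gone quiet.
            group = CompressedAlarm(key, alarm.timestamp, alarm.timestamp)
            open_groups[key] = group
            result.append(group)
    return result


def compression_rate(raw_count, compressed_count):
    """Fraction of raw alarms removed by compression (assumed definition)."""
    if raw_count == 0:
        return 0.0
    return 1.0 - compressed_count / raw_count
```

Grouping on a time window is only one option; a production system would likely also correlate alarms across related resources, which this sketch does not attempt.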
Step 4 Build personalized O&M visualization views (ARC-IOC)
· Based on integrated O&M data, including full-stack indicators across the transaction, business, service, basic component, and infrastructure layers, log data, alarm information, asset information, and incident tickets, build visual insights into business operation status and system health with the business system at the core.
Step 5 Add the intelligent on-duty module (Di-Robot)
· Di-Robot provides fault duty and emergency management. It judges alarms and escalates faults automatically, organizes emergency response efficiently, and supports intelligent decision-making in fault scenarios.
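One plausible shape for an automatic escalation rule is a per-severity acknowledgement timeout: if nobody acknowledges a fault in time, responsibility moves to the next support line. The sketch below illustrates that idea only. The tier names, timeout values, and `Incident` fields are hypothetical and are not taken from Di-Robot.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical support tiers, lowest first.
TIERS = ["first-line", "second-line", "third-line"]

# Hypothetical acknowledgement timeouts per severity, in seconds.
ACK_TIMEOUT_SECONDS = {"critical": 300, "major": 900, "minor": 3600}


@dataclass
class Incident:
    severity: str                         # "critical", "major", or "minor"
    opened_at: float                      # seconds since epoch
    acknowledged_at: Optional[float] = None


def responsible_tier(incident, now):
    """Return the tier that should own the incident at time `now`.

    Each full timeout that passes without acknowledgement moves the
    incident up one tier, capped at the last tier. Once acknowledged,
    the incident stays with the tier it had reached at that moment.
    """
    end = incident.acknowledged_at if incident.acknowledged_at is not None else now
    waited = max(0.0, end - incident.opened_at)
    timeout = ACK_TIMEOUT_SECONDS[incident.severity]
    level = min(int(waited // timeout), len(TIERS) - 1)
    return TIERS[level]
```

For example, an unacknowledged critical incident opened at `t = 0` would belong to the first line at `t = 100` and to the second line at `t = 301`.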
03 Project Results
03.1 Achieve full monitoring coverage of 70+ business systems
Replacing the basic resource monitoring tooling closed the blind spots in the previous setup, such as incomplete operating system monitoring, missing indicators, and no monitoring of databases and middleware. Application performance monitoring and user experience monitoring, built to cover all business systems, show business health directly and provide fault awareness.
03.2 Provide a global monitoring view of all business systems and a topology view of IT systems
A global view shows the health of all applications. The IT system topology view shows the performance of the hosts, networks, middleware, databases, and other components related to an application, and allows drilling down from the topology to metric trend details or log details. This provides strong support for analyzing the scope of a fault's impact and finding the root cause.
03.3 Bring alarm compression and handling online
The large volume of alarm events generated by the various monitoring types is converged, compressed, and de-noised. Alarm storms are suppressed, attention is focused on effective alarms, and alarm readability is improved. This replaces the previous decentralized management of multi-source alarms with a closed-loop handling process covering unified alarm dispatch, notification, claiming, ticket creation, handling, and ticket closure.
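A closed-loop handling process like the one described above can be modeled as a small state machine in which an alarm may only move to the next stage in order. The sketch below is illustrative: the six stage names are one reading of the flow and are assumptions, as is the strictly linear ordering.

```python
from enum import Enum


class AlarmState(Enum):
    # Stage names are assumptions chosen for this example.
    DISPATCHED = "dispatched"
    NOTIFIED = "notified"
    CLAIMED = "claimed"
    TICKETED = "ticketed"
    HANDLED = "handled"
    CLOSED = "closed"


# Each state has exactly one permitted successor; CLOSED is terminal.
NEXT_STATE = {
    AlarmState.DISPATCHED: AlarmState.NOTIFIED,
    AlarmState.NOTIFIED: AlarmState.CLAIMED,
    AlarmState.CLAIMED: AlarmState.TICKETED,
    AlarmState.TICKETED: AlarmState.HANDLED,
    AlarmState.HANDLED: AlarmState.CLOSED,
}


class InvalidTransition(Exception):
    """Raised when an alarm is moved to a stage out of order."""


def advance(state, target):
    """Return `target` if it is the permitted successor of `state`."""
    if NEXT_STATE.get(state) is not target:
        raise InvalidTransition(f"{state.value} -> {target.value} is not allowed")
    return target
```

Rejecting out-of-order transitions is what makes the loop "closed" in this model: an alarm cannot be marked closed without having been claimed and handled first.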
03.4 Set up a large O&M management screen from a business perspective
The customer's full set of O&M data assets is consolidated on a unified data fusion platform, which brings together current O&M data from multiple channels, types, formats, and standards. An O&M management storyline is designed from a business perspective, and a unified large O&M screen serves as a data review tool for daily O&M management.
03.5 Achieve automated, intelligent O&M duty and emergency handling
The solution provides 24/7 automated duty and enables dozens of types of automatic fault escalation and handling rules, helping first-line O&M personnel respond promptly when common system failures occur. During troubleshooting, the fault-handling best practices and historical fault-handling records provided by the fault emergency cockpit assist emergency decision-making and improve the efficiency of fault response.
04 Customer benefits
Dingmao Technology helped the customer comprehensively upgrade its monitoring system, manage and compress the alarms that monitoring generates, handle fault alarms efficiently, and use a large visual screen to display important information such as business health status and core indicator trends. Overall, the time from fault discovery to fault location is shorter, and fault handling is more efficient.
Immediate Benefits:
· 100% monitoring coverage of important (business) systems, assets, and indicators;
· An alarm compression rate of more than 90%, achieved through unified removal of invalid alarms and intelligent analysis and noise reduction;
· Automatic fault response, with the fault handling rate of first-line O&M personnel raised to more than 90%.
Scalability Benefits:
· The solution can be expanded quickly to cover new business systems or software and hardware assets, and readily accommodates the demands of business growth;
· It also provides comprehensive O&M data collection, governance, and analysis capabilities, laying a foundation for more intelligent O&M analysis scenarios in the future.