
If AIOps is regarded only as an operation and maintenance technology, that view is too superficial...

author:DBAplus Community

Recently, an online article entitled "Chaos in the Intelligent Operation and Maintenance Industry: Inflated Valuation, Blocked Listing, and Frequent Layoffs" has been circulating widely in operation and maintenance (O&M) circles, and some of its views have drawn a lot of attention. Without evaluating that article's views, I will share a short excerpt of my own understanding of intelligent O&M in financial enterprises, taken from the "Operation and Maintenance of Digital Intelligence" series.

AIOps is the direction in which O&M is heading, but it should not be regarded simply as a technical means or a technology platform. It should be understood as the O&M working mode of human-machine collaboration in the era of digital intelligence.

First, AIOps as a human-machine collaborative O&M working mode

The emergence of a new working mode inevitably disrupts the existing, stable one, and such change usually meets resistance, so the new mode must be able to solve problems actually encountered in current O&M work. Take business continuity management in the financial industry as an example: the goal is to improve the company's risk prevention ability, effectively reduce unplanned business interruptions, and prevent operational risks. Faced with complex technical architectures, the continuous introduction of innovative technologies, rapid iteration of software versions, and severe information security threats, traditional O&M teams find it difficult to guarantee business continuity through passive firefighting, problem-driven work, and experience-based O&M. For financial enterprises to operate safely and stably, O&M data must provide capabilities for data insight, decision support, and tracking and execution, improving O&M management in complex environments. Specifically:

  • Real-time awareness: "What is happening?"
  • Correlation analysis: "Why is it happening?"
  • Intelligent prediction: "What will happen?"
  • Decision-making: "What measures should be taken?"
  • Automated execution: "How can it be executed quickly?"
  • Real-time perception: "What is the effect of the work executed?"

AIOps was born to solve the problems above. Compared with the traditional O&M working mode, its focus is not to create a brand-new O&M model but to supplement the existing model of "expert experience + best-practice processes + tool platforms", giving enterprise O&M the capabilities of "insight and perception, operational decision-making, and machine execution" and supporting the transformation to a "human-machine collaboration" model.

Why a model of "human-machine collaboration" rather than a fully intelligent model? Judging from current AIOps applications, although artificial intelligence technology is advanced, it still cannot replace experts in complex, changeable environments with incomplete information, especially in complex emergency-support scenarios; most applications remain in specific fields of weak artificial intelligence. "Human-machine collaboration" focuses on using machines to assist human decision-making and execution: it adds the role of robots to the collaborative network that already links the O&M organization with R&D, testing, and vendors outside it, forming a human-machine collaborative O&M model. In this model the most critical role is still people, whose creativity is combined with the data and algorithms provided by machines to assist O&M work. In general, human-machine collaboration needs to give full play to the respective strengths of humans and robots to form a combined solution, and the key directions for AIOps going forward can be centered on three points:

  • "Data + algorithms" empower O&M experts with the ability to "perceive in real time and assist decision-making".
  • Add O&M robot positions, and reshape O&M work characterized by "heavy computation", "massive data analysis", "operability", "process", "regularity", "7*24", "human-machine experience", and so on.
  • Establish a digital platform management model so that decisions are implemented in a closed loop.

Second, data, algorithms, scenarios, and knowledge constitute the 4 key elements of AIOps

As Gartner's definition proposes, AIOps applications need to use big data, modern machine learning, and other advanced analytics techniques, which makes it a working mode with a relatively high threshold. To implement AIOps well, O&M organizations need to understand deeply what AIOps means and focus on an implementation approach built on the 4 elements of AIOps: based on data, supported by algorithms, scenario-oriented, and expanded with knowledge.

1. Based on data

Data comes first: AIOps requires the ability to produce high-quality data quickly. "Quickly" can be achieved with a "middle office" approach: establish unified data collection and control, real-time and batch data processing capabilities, and the algorithms, storage schemes, master data, indicator models, and so on that match O&M needs. "High quality" starts from unifying scattered data, forming "live data" once it is online, and governing data quality. From the perspective of technical implementation, the platform should manage the whole life cycle of real-time data flow: "collection, storage, computation, management, and use". Data collection is the ability to collect data online on demand. Data storage archives, organizes, transmits, and shares data according to its type and application characteristics. Data computation includes labeling, cleaning, modeling, processing, standardization, and quality monitoring, plus the analysis and statistics that support data insight, decision-making, and execution. Data management focuses on data governance, including the management of O&M data standards, master data, metadata, data quality, and data security. Data use focuses on the data catalog, the service portal, and the supporting capabilities for delivering data as a service.
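The "collection, storage, computation, management, and use" life cycle can be made concrete with a very small sketch. Everything here is an illustrative assumption rather than a real platform: the `Metric` and `OpsDataPlatform` names, the in-memory list standing in for storage, and the single quality rule are all hypothetical.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import Optional


@dataclass
class Metric:
    host: str
    name: str
    value: Optional[float]  # None models a failed collection


@dataclass
class OpsDataPlatform:
    """Toy stand-in for an O&M data platform; all names are hypothetical."""
    store: list = field(default_factory=list)  # storage (in memory only)
    rejected: int = 0                          # data-quality counter

    def collect(self, raw):
        """Collection: accept raw samples, keeping only those that pass checks."""
        for m in raw:
            if self._is_valid(m):
                self.store.append(m)
            else:
                self.rejected += 1

    @staticmethod
    def _is_valid(m):
        """Management: one minimal quality rule (value present and non-negative)."""
        return m.value is not None and m.value >= 0

    def average(self, name):
        """Computation: average one indicator per host."""
        by_host = {}
        for m in self.store:
            if m.name == name:
                by_host.setdefault(m.host, []).append(m.value)
        return {host: mean(values) for host, values in by_host.items()}

    def quality_report(self):
        """Use: expose a small data service describing data quality."""
        total = len(self.store) + self.rejected
        return {
            "accepted": len(self.store),
            "rejected": self.rejected,
            "accept_ratio": len(self.store) / total if total else 1.0,
        }
```

A real platform would separate these stages into independent services with streaming and batch paths; the sketch only shows how each stage maps to one responsibility.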

2. Supported by algorithms

The algorithm is the brain: adapt and introduce O&M algorithms for specific scenarios, and build a system of algorithm models. Machine learning, especially the large-scale application of deep learning, has driven the rapid development of artificial intelligence. With the popularity of the domestic To B market, AIOps research on and application of artificial intelligence is in an explosive period. Introducing AI algorithms has three advantages. First, high work stability: artificial intelligence works tirelessly and is not affected by its environment when analyzing regular, rule-based problems. Second, reduced operational risk: replacing traditional manual, experience-based operations with artificial intelligence better avoids operational risk and moral hazard. Third, more efficient decision-making: artificial intelligence can quickly screen and analyze big data, helping people decide more efficiently. For a financial enterprise, on the one hand, constraints such as a shortage of talent and the salary structure mean that it should cooperate more with external suppliers when building algorithms; on the other hand, pursuing algorithms does not necessarily mean pursuing technical sophistication. In fact, codifying expert experience as rules is also a form of algorithm implementation, and is often more reliable. For the current mainstream algorithms, see the common algorithms mentioned in the previous section, which will not be repeated here.
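To illustrate the point that a codified expert rule is itself an algorithm, the sketch below places a fixed-threshold rule next to a simple statistical detector. The function names, the 90% threshold, the window of 5 samples, and the 3-sigma limit are all assumptions chosen for illustration, not values from the article.

```python
from statistics import mean, stdev


def rule_alert(cpu_percent, threshold=90.0):
    """Expert-experience rule: alert when CPU usage exceeds a fixed threshold."""
    return cpu_percent > threshold


def zscore_anomalies(series, window=5, z_limit=3.0):
    """Return the indexes whose value deviates from the trailing window
    by more than z_limit sample standard deviations."""
    anomalies = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if sigma == 0:
            # Flat history: any change at all counts as a deviation.
            if series[i] != mu:
                anomalies.append(i)
            continue
        if abs(series[i] - mu) / sigma > z_limit:
            anomalies.append(i)
    return anomalies
```

The two approaches complement each other: the rule is transparent and easy to audit, while the statistical detector can flag a sudden jump that stays below the fixed threshold.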

3. Scenario-oriented

Scenario-driven: starting from pain points and value expectations, empower O&M scenarios with intelligence and put intelligent O&M capabilities into practice. Literally, AIOps is "AI + Ops", a model that uses AI to empower O&M scenarios. With the data foundation and algorithm brain described above, the next step is putting the AIOps O&M model into practice, which mainly revolves around scenarios: one path is to use algorithms to empower existing O&M scenarios, and the other is to use algorithms to realize O&M scenarios that were previously unattainable. The former produces results quickly, while the latter is a change made in response to change.

4. Expand with knowledge

O&M knowledge describes the definitions of relevant objects, the techniques, and the troubleshooting and resolution experience across a large number of O&M domains. The O&M knowledge graph is a relationship network obtained by connecting different kinds of information about O&M objects, and is a key technology for expressing O&M data. By building an O&M knowledge graph, various types of O&M subjects are automatically mined from massive data, their characteristics are profiled and described in structured form, and the relationships between O&M subjects are dynamically recorded. Based on the O&M knowledge graph and using algorithmic technologies such as natural language semantics, IT personnel can realize a variety of AIOps scenarios such as fault propagation chain analysis, root cause location, intelligent change impact analysis, and fault prediction.
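As a rough sketch of how a graph of O&M objects supports fault propagation analysis and root cause location, the example below uses a plain dictionary of dependencies. The service names, the `DEPENDS_ON` structure, and the "no alerting dependency" heuristic are hypothetical simplifications; a real knowledge graph would carry many more entity and relationship types.

```python
from collections import deque

# Hypothetical dependency edges: each key depends on the objects in its list.
DEPENDS_ON = {
    "mobile-banking": ["api-gateway"],
    "api-gateway": ["account-service", "auth-service"],
    "account-service": ["core-db"],
    "auth-service": ["core-db", "cache"],
    "core-db": [],
    "cache": [],
}


def impacted_by(node, graph):
    """Fault propagation: every object that depends on node, directly or not."""
    reverse = {}
    for src, deps in graph.items():
        for dep in deps:
            reverse.setdefault(dep, []).append(src)
    seen, queue = set(), deque([node])
    while queue:
        current = queue.popleft()
        for parent in reverse.get(current, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


def root_cause_candidates(alerting, graph):
    """Root cause heuristic: alerting objects with no alerting dependency."""
    return sorted(
        n for n in alerting
        if not any(dep in alerting for dep in graph.get(n, []))
    )
```

The same reverse traversal used for propagation also serves change impact analysis: before changing `core-db`, `impacted_by` lists everything that could be affected.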

Third, some other perspectives on AIOps

1. Establish a scene map to put AIOps into practice systematically and at a steady pace

As with the current application of AI technology in most fields, AI is a platform capability, not a business. Take the online banking system as an example: online banking on the PC solved the convenience problem of moving from the bank counter to the PC; mobile banking solved the control problem of moving from mouse and keyboard to finger swipes on a screen, available at any time; and AI technologies such as video and language recognition solve the experience problem of moving mobile banking from the touch screen to immersive intelligence. Throughout this process the essence of many services has not changed. So, when facing AIOps, O&M organizations need to establish scene maps, prioritize based on them, and work out how to apply the strengths of AI to specific links in O&M scenarios.
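One way to picture prioritizing from a scene map is a simple scoring table. The sketch below is only an assumption about how such a map might be ranked: the `Scenario` fields, the 1 to 5 scales, and the multiplicative score are hypothetical and not a method given in the article.

```python
from dataclasses import dataclass


@dataclass
class Scenario:
    name: str
    pain: int            # 1-5: how painful the scenario is today
    value: int           # 1-5: expected value if AI is applied
    data_readiness: int  # 1-5: how much usable online data already exists


def score(s):
    """Hypothetical priority score: pain * value, weighted by data readiness."""
    return s.pain * s.value * s.data_readiness


def prioritize(scenarios):
    """Return scenarios ordered from highest to lowest priority score."""
    return sorted(scenarios, key=score, reverse=True)


scene_map = [
    Scenario("change impact analysis", pain=4, value=4, data_readiness=2),
    Scenario("assisted fault location", pain=5, value=5, data_readiness=3),
    Scenario("capacity forecasting", pain=3, value=4, data_readiness=5),
]
```

Calling `prioritize(scene_map)` orders the scenarios so that those with high pain, high expected value, and ready data are tackled first; the weighting itself is a judgment call for each organization.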

2. "Live data" is the foundation of intelligent O&M

Live data has two meanings. The first is that the data is alive, that is, all of it is online. The second is that the data is in use: through continuous application it is refined and new data is generated, forming a data feedback flow. In the past, O&M data analysis mainly built reports on batch, offline data to assist decision-making, but many O&M scenarios require real-time data analysis. It is therefore necessary to use the O&M data platform to collect and control machine data in real time and turn it into data assets, then consume the data in scenarios, and feed the results of data-driven execution back to continuously optimize the workflow and produce more accurate data. Realizing live data therefore has three key elements. First, build O&M work scenarios on a collaborative network, which needs to break down the islands between online workflows and open up the connections among "people, organizations, software, and hardware". Second, the O&M organization should establish an O&M data platform that brings together the relevant production and operation data and abstracts it into data services, delivering value to O&M scenarios in a convenient way. Third, keep consuming data: find problems through data consumption, correct the data, mine value-added data services, and generate new data.

3. The first impression is very important

As a new working mode, AIOps is judged on first impression by its reliability and availability; if that first impression is skepticism, operating and promoting the model afterwards becomes much harder. The most important problem that AIOps algorithms need to solve is changing people's impression of "algorithm accuracy": algorithms are introduced not for the sake of innovation but to solve real problems. Take assisted fault location as an example. In a normally functioning O&M organization, many everyday failures can usually be handled through expert experience, monitoring tools, and effective collaboration mechanisms. Introducing AIOps to empower fault management has two aims: to be faster and to be more accurate. Thanks to the automation and computing power of machines, the "fast" problem can predictably be solved through well-designed online emergency scenarios. "Accurate" is different, because algorithms carry something of a black box, so caution is needed when applying them: solving the problem is far more important than using an advanced algorithm.
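Before promoting an assisted fault location algorithm, its "accuracy" can be measured against incidents whose root cause experts have already confirmed. The sketch below assumes a hypothetical top-k hit rate as the metric; the function name, the case format, and the sample incidents are illustrative only.

```python
def top_k_hit_rate(cases, k=3):
    """Share of incidents whose confirmed root cause appears among
    the first k recommendations.

    cases: list of (recommended, confirmed) pairs, where recommended is a
    ranked list of suspected objects and confirmed is the expert-verified cause.
    """
    if not cases:
        return 0.0
    hits = sum(1 for recommended, confirmed in cases
               if confirmed in recommended[:k])
    return hits / len(cases)


# Hypothetical review data from past incidents.
past_incidents = [
    (["core-db", "cache"], "core-db"),
    (["cache", "api-gateway", "core-db"], "core-db"),
    (["auth-service"], "network"),
]
```

Tracking a number like this over time gives front-line experts evidence of whether the recommendations deserve trust, rather than asking them to accept a black box.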

Finally, whether it is the tool developers within the O&M team or vendors, those promoting the AIOps model should pay attention to the experience of front-line O&M experts and to the working mechanisms that support the O&M model, so that people, processes, and tools are connected with specific "things" into real, usable scenarios.

Author 丨 Peng Huasheng

Source 丨 Public Account: Road to Operation and Maintenance (ID: HuashengPeng001)
