AIOps: intelligent operations and maintenance accelerating the paradigm shift in cloud computing

Author: Microsoft Research Asia

The Microsoft Azure cloud computing platform has maintained strong growth in recent years. Satya Nadella, chairman and CEO of Microsoft, has said that digital technology is a deflationary force in an inflationary economy: businesses large and small can improve productivity and the affordability of their products and services by building their own tech intensity. The Microsoft Cloud, which delivers digital platforms and tools end to end, is helping businesses and organizations move forward through today's period of transformation and upheaval.

Today, the Microsoft cloud has tens of millions of physical servers deployed in thousands of data centers on five continents. It runs millions of customer applications and services, including services for more than 95% of Fortune 500 companies. New hardware comes online every month, and new software updates are deployed to the cloud almost every day, or even every minute. So what cutting-edge technology sits behind the Microsoft cloud?

Recently, Zhang Dongmei, executive vice president of Microsoft Research Asia and a Microsoft Distinguished Scientist; Lin Qingwei, principal researcher at Microsoft Research Asia; and Dang Yingnong, chief data scientist of Microsoft's Cloud Computing and Artificial Intelligence Business Unit, jointly described the technology behind the Microsoft cloud: AIOps, or intelligent operations and maintenance.

Digital transformation of the software industry

The rapid development of cloud computing over the past 15 years has created ample room for AIOps to develop. AIOps replaces the traditional manual operation and maintenance of data centers with approaches based on big data and artificial intelligence. Before public cloud data centers emerged, organizations built their own data centers, which were mostly small or medium-sized and geographically scattered. Each saw relatively few failures and only a limited range of failure types, so it was difficult to accumulate enough data for AI-based analysis. The emergence of the public cloud changed everything.

Zhang Dongmei, executive vice president of Microsoft Research Asia and a Microsoft Distinguished Scientist

As the digital economy has expanded, so has the public cloud. Public clouds such as the Microsoft cloud have become global hyperscale clouds. With tens of millions of servers, a single platform like the Microsoft cloud can yield global, complete, and comprehensive operations data for its data centers. In addition, because the public cloud has become infrastructure for society as a whole, expectations of public cloud operators have risen. This has pushed operators to raise the level of automation and intelligence in data center operations and to move from manual work to AIOps, which is more stable, efficient, and secure.

Zhang Dongmei said that Microsoft Research Asia began research in the direction of AIOps very early. In 2009 to 2010, Microsoft Research Asia established the Software Analytics Group to study software from a data-driven perspective. The most important problems in software scenarios at the time included operating system problems, user experience problems, and development efficiency problems, which closely resemble the problems AIOps addresses today.

Over the past 10 to 15 years, cloud computing has become both the driver of and the platform for a paradigm shift in software and the software industry. As cloud computing systems become the main form in which software runs, Software Analytics naturally focuses on them. With software extending from standalone machines and servers to hyperscale cloud data centers, software engineering is undergoing a profound change: its focus is expanding from programmers alone to systems, users, and developers. This is because software in the public cloud ultimately runs as a system that provides cloud services to the outside world, and such a system must also attend to user experience and to development and deployment efficiency. The three form a trinity in which none can be omitted.

In short, cloud computing not only transforms enterprise data centers but also drives the digital transformation of software and the software industry. Software design, development, and operations have shifted from manual methods to methods based on data and artificial intelligence, which is what makes AIOps feasible. The Software Analytics research that Microsoft Research Asia began more than a decade ago has helped ensure the stability, reliability, and security of today's Microsoft cloud, as well as its strong upward trend in performance.

The three directions of AIOps

The industry's research and practice in AIOps are still at an early stage. By analogy with the levels of autonomous driving, AIOps can be described as "autonomous driving" for the data center and can likewise be divided into levels L0 to L5. L5 corresponds to the highest level, fully "driverless" operation, which is the ultimate goal of AIOps.

Lin Qingwei, principal researcher at Microsoft Research Asia

Zhang Dongmei explained that AIOps uses innovative artificial intelligence and machine learning technologies to design, build, and operate large-scale, complex cloud services effectively and efficiently. From the perspective of Microsoft Research Asia, AIOps has three main research directions: systems, users, and developers.

The first is the system, that is, AI for Systems. Cloud services are provided by the data center system on which the software is deployed: the software must run in a cloud data center and, together with it, form a working system before it can provide cloud services to the outside world. For such a working system, performance, stability, security, and similar properties are the objects of study. The second is the customer, that is, AI for Customers. Unlike consumer Internet services, cloud services serve enterprises as well as individuals, and because they serve users, user experience matters. Even if the system is stable, customers will not choose a cloud service whose user experience is poor. The third is developers, that is, AI for DevOps, which targets developers and operations staff and uses intelligent technology to improve their productivity and make their daily work smoother.

On the system side, two scenarios come up often. One is anomaly detection: when something goes wrong in the system, it must be detected. The other is early warning: predicting a problem before it occurs. Both scenarios rely mainly on data and machine learning methods, combined with domain expertise, to make better judgments and predictions.
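
As a minimal illustration of the anomaly detection scenario, the sketch below flags points in a monitoring signal that deviate sharply from their recent history, using a trailing-window z-score. This is a generic textbook baseline and not the method used in the Microsoft cloud. The function name, window size, threshold, and latency values are all assumptions made for the example.

```python
from statistics import mean, stdev


def detect_anomalies(series, window=5, threshold=3.0):
    """Return indices whose value deviates from the trailing-window mean
    by more than `threshold` sample standard deviations."""
    anomalies = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Flat history: treat any change at all as a deviation.
            if series[i] != mu:
                anomalies.append(i)
            continue
        if abs(series[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical request latency samples in milliseconds, with one spike.
latency_ms = [20, 21, 19, 20, 22, 21, 20, 95, 21, 20]
print(detect_anomalies(latency_ms))  # [7]
```

A fixed threshold like this is easy to reason about but ignores seasonality and dependencies between signals, which is where learned models and domain knowledge come in.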

On the user side, one scenario is strengthening interaction with users so that they have a good experience. For example, users who run into problems often contact cloud service support. If they are given tools in advance that automatically show how their cloud services are set up and what has gone wrong, then support staff can grasp the situation immediately when contacted, both sides can talk from the same level of knowledge, and users get better help.

On the development and operations side, the main goal is to help developers and operations staff with tasks such as continuous integration and continuous deployment (CI/CD). When a problem occurs, the first priority is to find a mitigation quickly and return the cloud service to normal as soon as possible. However, returning the system to normal is not the same as fundamentally solving the problem. Because the system is very complex, finding the root cause often takes a great deal of time spent investigating and debugging. Given the sheer volume of logs, intelligent methods are needed to help developers complete diagnosis efficiently and accurately.

Besides the system-, user-, and developer-oriented view, AIOps can also be divided into four major areas: detection, diagnosis, prediction, and optimization. Each area poses many challenges and has multiple research subfields addressing them. Ultimately, cloud platforms are complex, being both large in scale and distributed in architecture, so AIOps is a research area that requires long-term investment.

The cutting-edge technology behind the Microsoft cloud

Large-scale, highly complex cloud computing systems like the Microsoft cloud carry a huge number of customer applications, and they are difficult to develop, deploy, operate, and manage efficiently with traditional, non-intelligent software development and operations techniques. Dang Yingnong, chief data scientist of Microsoft's Cloud Computing and Artificial Intelligence Business Unit, said that as early as five or six years ago, the engineering team in Microsoft's cloud computing department recognized how necessary and urgent AIOps was. The team set up a dedicated group of data scientists and began working closely with Microsoft Research on the research, development, and deployment of AIOps.

Dang Yingnong, chief data scientist of Microsoft's Cloud Computing and Artificial Intelligence Business Unit

Through this close cooperation with Microsoft Research over the past few years and the sustained efforts of Microsoft cloud engineers, the Microsoft cloud has accumulated many important AIOps innovations. These include intelligent, automated management of cloud service systems, intelligent cloud development and deployment, and intelligent customer response. Specifically, artificial intelligence and machine learning technologies have been deeply integrated into the management software of Microsoft's cloud infrastructure, covering intelligent monitoring, intelligent prediction, intelligent repair, and more.

Dang Yingnong stressed that automation and intelligence advance together. On the one hand, the availability, reliability, and efficiency of cloud services improve. On the other hand, cloud services operate more autonomously, so the situations that require manual maintenance keep shrinking and maintenance costs keep falling. Machine learning has also greatly improved the development and operations tooling of the Microsoft cloud, such as intelligent testing, intelligent diagnosis, and intelligent deployment, which makes development and operations engineers much more efficient.

Take hard disk failures, a common case. Lin Qingwei explained that hardware problems are one cause of virtual machine downtime, and hard disk failure is one of the main hardware problems. Engineers therefore want to predict failures before they occur and then act, for example by migrating users' virtual machines to other machines or by resolving the issue with a soft reboot, so that users are not affected. In hard disk failure prediction, however, the ratio of failed to healthy disks on a large, complex cloud platform may be about 3:10,000, and such extremely imbalanced positive and negative samples pose a great challenge for traditional machine learning. In addition, applications running on top of a disk are already affected before the disk becomes completely unusable, so the disk's own data alone cannot provide a timely prediction.
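
To see why a 3:10,000 ratio is so awkward for a conventional classifier, consider the sketch below. It builds a synthetic label set with exactly that ratio (the data is invented for illustration, not taken from any Microsoft dataset) and scores a trivial model that always predicts "healthy". The helper names are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives, where 1 means a failed disk."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def metrics(y_true, y_pred):
    """Return (accuracy, precision, recall); undefined ratios are reported as 0.0."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall


# Synthetic labels: 3 failed disks for every 10,000 healthy ones.
y_true = [1] * 3 + [0] * 10_000
always_healthy = [0] * len(y_true)

acc, prec, rec = metrics(y_true, always_healthy)
print(f"accuracy={acc:.4f} precision={prec:.2f} recall={rec:.2f}")
# accuracy=0.9997 precision=0.00 recall=0.00
```

The model is right 99.97% of the time and still misses every failure, which is why precision and recall on the failed class are the more meaningful measures for this kind of prediction task.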

Hard disk failure prediction illustrates a major problem in AIOps research: small data samples. How do researchers at Microsoft Research Asia address it? First, rather than looking only at the hardware's own data, they connect the upstream and downstream data related to the hardware, which greatly expands the available data. For example, long before a hard disk fails, the performance of virtual machines running on it may already be affected, so monitoring virtual machine performance can indicate in advance whether the disk is about to fail. In addition, hard disks generally sit together in the same disk array. If, say, the motherboard voltage is unstable, a fault that breaks one disk may affect other disks at the same time, and the workload may be affected as well. Treating adjacent disks as a whole therefore also improves prediction.
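
The idea of widening the data beyond the disk itself can be sketched as a feature-building step. The example below combines a disk's own signal with an upstream signal (virtual machine latency) and with summary statistics over the other disks in the same array. Every field name, identifier, and value here is hypothetical and chosen only to illustrate the idea. The actual features used by the researchers are not described in this article.

```python
from statistics import mean


def build_features(disk_id, error_counts, vm_latency_ms, array_members):
    """Combine a disk's own signal with VM-level and neighbor-level context.

    error_counts:  dict of disk id -> error count reported by the disk itself
    vm_latency_ms: dict of disk id -> I/O latency seen by VMs on that disk
    array_members: ids of all disks that share the same disk array
    """
    neighbors = [d for d in array_members if d != disk_id]
    neighbor_errors = [error_counts[d] for d in neighbors]
    return {
        "own_error_count": error_counts[disk_id],
        "vm_latency_ms": vm_latency_ms[disk_id],
        "neighbor_mean_error_count": mean(neighbor_errors) if neighbors else 0.0,
        "neighbor_max_error_count": max(neighbor_errors) if neighbors else 0,
    }


# Hypothetical four-disk array: d0 reports no errors itself,
# but its VMs are slow and its neighbors are degrading.
error_counts = {"d0": 0, "d1": 4, "d2": 6, "d3": 2}
vm_latency_ms = {"d0": 35.0, "d1": 12.0, "d2": 14.0, "d3": 11.0}
array_members = ["d0", "d1", "d2", "d3"]

features = build_features("d0", error_counts, vm_latency_ms, array_members)
print(features)
```

In this made-up case the disk's own counter looks healthy, while the VM latency and neighbor statistics carry the warning signs, which is the point the paragraph above makes.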

Building on these ideas, Microsoft Research Asia proposed the Neighborhood-Temporal Attention Model (NTAM), which includes a neighborhood-aware component, a temporal component, and a decision component. By capturing more information across both time and space, the model makes stronger predictions. In extensive experiments, it outperformed methods published in leading journals and conferences over the past 10 to 20 years in terms of accuracy and recall.

NTAM model overview

NTAM paper link:

https://dl.acm.org/doi/10.1145/3442381.3449867
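
For readers unfamiliar with attention, the sketch below shows the generic mechanism in its simplest form: score each neighbor (or each time step) against a query vector, turn the scores into weights with a softmax, and take the weighted sum. It is not the NTAM architecture, which is described in the linked paper. The function names and toy vectors are assumptions for illustration only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def attention_pool(query, keys):
    """Weight each key vector by softmax(query . key).

    Returns the attention weights and the weighted sum of the keys.
    """
    weights = softmax([dot(query, k) for k in keys])
    pooled = [
        sum(w * k[j] for w, k in zip(weights, keys))
        for j in range(len(query))
    ]
    return weights, pooled


# Toy feature vectors: the target disk and three neighbors in its array.
query = [1.0, 0.0]
keys = [[2.0, 0.0], [0.0, 2.0], [0.0, 0.0]]

weights, pooled = attention_pool(query, keys)
print(weights, pooled)
```

The same pooling can be applied along the time axis by treating past observations of one disk as the keys, which is one way to combine information "in time and space" as the article puts it.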

For large-scale outage prediction, the goal is to minimize large-scale service outages, reduce downtime, and keep cloud services highly available. To this end, Microsoft Research Asia developed AirAlert, an intelligent early-warning mechanism that predicts large-scale cloud service outages before they occur. AirAlert collects system monitoring signals from across the cloud, detects dependencies among them, and then uses a technique called robust gradient boosting trees to dynamically predict large-scale outages anywhere in the cloud system. The research team collected more than one year of service outage data from Microsoft cloud systems and verified the method's effectiveness on this dataset.
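
The core idea of gradient boosting with trees can be shown in a few lines: each round fits a small tree to the residual errors of the current ensemble and adds it with a learning rate. The sketch below uses depth-1 trees (stumps) and squared loss on a tiny invented dataset. It is plain gradient boosting, not the robust variant AirAlert uses, and the feature names, data, and hyperparameters are all assumptions.

```python
def fit_stump(xs, residuals):
    """Find the one-feature threshold split that minimizes squared error."""
    best = None
    for j in range(len(xs[0])):
        values = sorted({x[j] for x in xs})
        # Candidate thresholds lie midway between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [r for x, r in zip(xs, residuals) if x[j] <= t]
            right = [r for x, r in zip(xs, residuals) if x[j] > t]
            lv = sum(left) / len(left)
            rv = sum(right) / len(right)
            err = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
            if best is None or err < best[0]:
                best = (err, j, t, lv, rv)
    if best is None:
        # No feature varies: fall back to a constant prediction.
        m = sum(residuals) / len(residuals)
        return (0, float("inf"), m, m)
    return best[1:]  # (feature index, threshold, left value, right value)


def stump_predict(stump, x):
    j, t, lv, rv = stump
    return lv if x[j] <= t else rv


def fit_gbt(xs, ys, n_rounds=20, learning_rate=0.5):
    """Fit boosted stumps: each round models the current residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump_predict(stump, x) for p, x in zip(preds, xs)]
    return base, stumps, learning_rate


def predict(model, x):
    base, stumps, lr = model
    return base + lr * sum(stump_predict(s, x) for s in stumps)


# Invented samples: [alert count, error rate]; label 1 = an outage followed.
xs = [[1, 0.01], [2, 0.02], [1, 0.03], [8, 0.20], [9, 0.25], [7, 0.30]]
ys = [0, 0, 0, 1, 1, 1]

model = fit_gbt(xs, ys)
print(predict(model, [8, 0.22]), predict(model, [1, 0.02]))
```

On this toy data the score for the high-alert sample approaches 1 and the score for the quiet sample approaches 0. A production system would use a tested library, a classification loss, and far richer signals.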

In day-to-day operation, system failures occur from time to time and cause degraded service quality or even service interruption. These events are usually called service incidents. Cloud service incidents often cause large financial losses and affect the services that users have deployed on the cloud. Over the past few years, Microsoft Research Asia has applied software analytics to incident management in online systems and developed Service Analysis Studio (SAS), a system that helps software maintainers and developers quickly process and analyze massive amounts of system monitoring data, improving the efficiency and response speed of incident management. SAS was adopted by Microsoft's online products division in June 2011 and installed in data centers around the world for incident management of large-scale online services. An analysis of six months of SAS usage records found that engineers used SAS in about 86 percent of service incidents, and SAS was able to help with about 76 percent of them.

Microsoft Research Asia has long worked in the field of data intelligence, using large-scale data mining, machine learning, and artificial intelligence to analyze complex operations data in real time and to provide effective decision support for system maintenance. Today, its research results have been applied to many online services, including Microsoft Azure, Office 365, OneDrive, and SharePoint, and have become the technology behind the high-quality operation of the Microsoft cloud and online services. In the future, Microsoft Research Asia also hopes to create more general AIOps technology that helps more users and the wider industry raise the overall level of cloud service operations, and that consolidates cloud computing as the foundation of new social infrastructure.