How do I go from "monitoring" to "observability"?

What is observability?

Observability is the ability to infer the current operating state of a system from the output data it generates (such as logs, metrics, and traces). The need for it stems from the complexity and distributed architecture of modern application systems, which often consist of large numbers of servers, containers, and microservices deployed in cloud or hybrid cloud environments. In such environments, traditional manual log analysis and troubleshooting can no longer locate and resolve issues quickly enough.

Observability has therefore become an indispensable technique. It helps O&M personnel monitor the running status, performance metrics, and security of application systems in real time from the perspective of the business application, and find and resolve problems quickly, so as to ensure high availability and stability. It can also improve the efficiency of O&M personnel, reduce maintenance costs, and make application systems more agile, flexible, and competitive.

What is the difference between monitoring and observability?

With the development of cloud computing, containerization, and microservices, observability has become increasingly important in modern IT systems. This raises two questions: why do the "traditional monitoring methods" we have built from metrics and dashboards over the past two or three decades no longer meet the needs of "modern systems", and what is the difference between "monitoring" and "observability"?

Ultimately, if we keep using traditional monitoring methods, we will not be able to fully "see" modern systems. Modern distributed architectures are known to fail in unpredictable and previously unencountered ways, whereas traditional monitoring relies on "predefined" metrics, thresholds, and empirical intuition.

The "observability" approach differs from the "traditional monitoring" approach in three ways:

1. In terms of target objects, it is not limited to a single technical domain; it focuses on understanding overall operation and user experience from the perspective of the whole business application;

2. In terms of problem solving, it provides the ability to find, diagnose, locate, and recover from problems in complex systems without relying on empirical intuition;

3. In terms of technical means, it requires not only monitoring data such as "metrics, logs, and traces", but also the ability to integrate, correlate, and explore that data "across businesses, systems, and resources".
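
As a rough illustration of the third point, the sketch below correlates logs, trace spans, and metrics into one view per request. It is only a minimal example under stated assumptions: the record layouts, the field names (`trace_id`, `service`, `duration_ms`), and the `correlate` helper are hypothetical and do not come from any particular product.

```python
from collections import defaultdict

# Hypothetical sample records; all field names are illustrative assumptions.
logs = [
    {"trace_id": "t1", "service": "checkout", "level": "ERROR", "msg": "payment timeout"},
    {"trace_id": "t2", "service": "cart", "level": "INFO", "msg": "item added"},
]
spans = [
    {"trace_id": "t1", "service": "checkout", "duration_ms": 2100},
    {"trace_id": "t1", "service": "payment", "duration_ms": 2000},
    {"trace_id": "t2", "service": "cart", "duration_ms": 15},
]
metrics = [
    {"service": "payment", "name": "cpu_percent", "value": 97.0},
    {"service": "cart", "name": "cpu_percent", "value": 12.0},
]


def correlate(logs, spans, metrics):
    """Group logs and spans by trace ID, then attach the metrics of
    every service that appears on that trace."""
    metrics_by_service = defaultdict(list)
    for metric in metrics:
        metrics_by_service[metric["service"]].append(metric)

    traces = defaultdict(lambda: {"logs": [], "spans": [], "metrics": []})
    for span in spans:
        traces[span["trace_id"]]["spans"].append(span)
    for log in logs:
        traces[log["trace_id"]]["logs"].append(log)

    # Resources are linked to a trace through the services it touched.
    for view in traces.values():
        for service in sorted({s["service"] for s in view["spans"]}):
            view["metrics"].extend(metrics_by_service[service])
    return dict(traces)


views = correlate(logs, spans, metrics)
print(views["t1"]["metrics"])
```

In this sketch the failing request `t1` ends up with its error log, both spans, and the high CPU reading of the `payment` service in a single structure, which is the kind of cross-system association that isolated tools cannot provide on their own.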

Goals and challenges of implementing observability

In the era of monolithic application architecture, system interactions were relatively simple and data collection was limited, so detecting and diagnosing system problems often relied on monitoring plus the experience of O&M personnel. Modern applications, however, face a huge number of unknown failures because distributed systems have many interacting components and agile development iterates rapidly.

At root, existing monitoring methods based on logs, traces, and metrics have limitations. For example, a single problem often involves multiple tools, and during troubleshooting the isolation and fragmentation of these tools and their data impose a heavy cognitive burden on O&M personnel. This makes implementing observability a major challenge in the era of distributed application architecture.

Therefore, the core idea and goal of moving from "monitoring" to "observability" is to solve the data quality and heterogeneous integration problems of multi-source data, and to keep expanding observable scenarios in a service-oriented manner. Specifically, this means managing the quality of, and aggregating and correlating, data domains such as metrics, logs, traces, dial testing, and configuration. It also means building the ability to correlate observability data resources and to serve value scenarios from a global perspective, both horizontal and vertical: between applications, between applications and cloud services or third-party components, between applications and the container layer, and between applications and the resource layer.

How do I go from "monitoring" to "observability"?

Moving from "monitoring" to "observability" combines horizontal full-link observation of applications with correlation analysis of each application's vertical resource metrics, and integrates O&M perspectives such as monitoring, alerting, process, and automation from multiple angles. It presents the logical access relationships between applications together with alert status, ticket information, metric monitoring, log monitoring, trace monitoring, and automated operations. By integrating basic monitoring, application monitoring, alerting, process, and automation capabilities, it gives application O&M personnel a unified business view with a panoramic perspective of the application system, so they can see how the business is running at a glance.

What is the methodology for implementing observability?

The functional characteristics, data quality, and service capabilities of existing tools directly determine how effective an observability implementation will be. An implementation therefore needs to take the state of existing O&M tools fully into account and build capabilities in stages based on actual conditions:

1) Build observability capabilities in phases

1. Stage 1: Establish alert-based observability from the perspectives of business, application, and infrastructure. Provide an alert consultation mechanism, focus on operation observation and problem discovery in complex application architectures, and coordinate experts from various fields online for efficient consultation;

2. Stage 2: Establish observability for active discovery from the perspectives of business, application, and infrastructure. Add data such as logs and traces, move from alert perception to active discovery, and link automated operations for emergency handling. Focus on fault location and troubleshooting in complex application architectures, and shift left;

3. Stage 3: Use the accumulated data and algorithmic capabilities to form observability for active prevention, including dynamic thresholds, capacity prediction, intelligent insights, and solution suggestions. Move from after-the-fact handling to prevention in complex application architectures to protect the service experience.

Observability cannot be built overnight. Implementing it gradually, stage by stage, and in depth gives the best guarantee of results and of the observability service experience.
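
To make the "dynamic thresholds" of Stage 3 more concrete, here is a minimal sketch of one common approach: comparing each new sample against the mean plus a multiple of the standard deviation of a recent window. The function name, the window size, the multiplier `k`, and the sample latency values are all assumptions chosen for illustration; the article does not specify which algorithm is actually used.

```python
from statistics import mean, stdev


def dynamic_threshold_alerts(values, window=5, k=3.0):
    """Return the indices of samples that exceed
    mean + k * stdev of the `window` samples preceding them."""
    alerts = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        # The threshold adapts to whatever the recent baseline looks like.
        threshold = mean(history) + k * stdev(history)
        if values[i] > threshold:
            alerts.append(i)
    return alerts


# Hypothetical latency samples in milliseconds, with one spike.
latency_ms = [100, 102, 98, 101, 99, 100, 103, 250, 101, 100]
alerts = dynamic_threshold_alerts(latency_ms)
print(alerts)  # [7]
```

Unlike a fixed threshold, which has to be chosen in advance for every metric, this kind of rule derives its limit from accumulated data. A real implementation would also need to handle seasonality and a flat history (where the standard deviation is zero), which this sketch does not attempt.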

2) Build an observability tool base on the platform-based O&M model

Meanwhile, as more and more of an enterprise's underlying IT O&M tools and systems become "segmented" and "juxtaposed", the weak connections among them greatly limit the linkage, flexibility, and scalability of an observability implementation. An integrated platform and product set plays a crucial role in supporting the integration of observability data resources and the delivery of value scenario services.

Guangtong Youyun began exploring this in 2016, hoping to open up data, resources, and scenarios globally through a single form. The result is the "platform-based O&M model" that we proposed to the industry. By providing a tool base for observability capabilities, it gives observability implementation a solid guarantee on two sides, building capabilities and serving scenarios, and we consider it the optimal way to achieve observability.

The overall value comes from two layers, the observability capability layer and the service scenario layer:

1. Observability capability layer: Following the platform concept, Youyun builds unified collection and control, data management and a metric system model, and a business service base (supervision, management, control, allocation, analysis). This provides centralized management and fills capability gaps across multiple systems, multiple tools, and heterogeneous resources; integrates and governs logs, traces, metrics, and other data; and links observability seamlessly across the end-to-end process of operation observation, problem discovery, fault location, and troubleshooting.

2. Observability scenario layer: Built on the Youyun base platform, observability O&M scenarios are continuously extended through a shared-service model, covering alert observability, active discovery, and active prevention scenarios from the perspectives of business, application, and infrastructure.

Guangtong Youyun Observability Practice Results

1) Build an observable system with a multi-level perspective

Based on the Youyun O&M platform, a state-owned bank automatically collects or ingests application call-trace information, transaction trace information, log events, application instance runtime metrics, and other observability data to build an observable system with a multi-level perspective. Dynamically, it monitors and tracks call links; statically, a vertical application map gives a full picture of each application, helping keep services safe and stable. Application monitoring provides business metric monitoring, application metric monitoring, full-link tracing, application topology analysis, and metric threshold alerting. The bank achieves its service support goals of 1-minute discovery, 3-minute location, and 5-minute resolution, which helps uncover application performance bottlenecks, improve service efficiency, enhance the application experience, and greatly improve O&M efficiency.

2) Enterprise application wall: one view, everything at hand

An accurate "portrait" of each application is built by extracting its key attributes and runtime metrics, then aggregating and analyzing each metric and configuring the view to the needs of different personnel. It supports multi-dimensional viewing, configuration and display of basic application information, metric display (the displayed metrics can be customized through extensions), configuration and display of evaluation information, application trajectory viewing (with quick links to each process ticket system), application O&M operations, and more.

As a result, O&M personnel can identify the "root cause" of application problems more clearly, accurately, and quickly, and resolve them effectively through guided O&M. Operating status is displayed in a panoramic view from the perspective of business/application and infrastructure graphs.

3) Full-element, whole-process insight from the business perspective

From an abnormal node of a business application, you can drill down and view the architecture topology from either the application perspective or the system perspective. Topologies can be drawn freely by business dimension, so the service architecture is visible at a glance, which addresses the pain point of not knowing where to start when sorting out complicated microservices and architectures. Building on the platform, it links seamlessly with asset configuration, the knowledge base, automated operations, the workflow engine, and other platform capabilities. With application resource data as the axis, it connects applications and resources vertically, establishes a map of application resource architecture relationships, diagnoses the root-cause nodes of faults layer by layer, and provides emergency response and closed-loop management and control.
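
The layer-by-layer diagnosis described above can be pictured as a walk over a dependency graph. The following is only a simplified sketch under assumptions: the `topology` dictionary, the node names, and the rule "an unhealthy node with no unhealthy dependency is a root-cause candidate" are hypothetical illustrations, not the platform's actual algorithm.

```python
def root_cause_candidates(topology, unhealthy, start):
    """Walk the dependencies reachable from `start` and return the
    unhealthy nodes that have no unhealthy dependency beneath them."""
    candidates, seen, stack = [], set(), [start]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        deps = topology.get(node, [])
        # A node whose dependencies are all healthy cannot blame a lower layer.
        if node in unhealthy and not any(dep in unhealthy for dep in deps):
            candidates.append(node)
        stack.extend(deps)
    return sorted(candidates)


# Hypothetical application resource map: application -> services -> resources -> hosts.
topology = {
    "order-app": ["order-service", "user-service"],
    "order-service": ["mysql-01", "redis-01"],
    "user-service": ["redis-01"],
    "mysql-01": ["host-a"],
    "redis-01": ["host-b"],
}
unhealthy = {"order-app", "order-service", "mysql-01"}

candidates = root_cause_candidates(topology, unhealthy, "order-app")
print(candidates)  # ['mysql-01']
```

Starting from the abnormal application node, the walk passes through the unhealthy service layer and stops at `mysql-01`, the lowest unhealthy node in this made-up map.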

4) Full-link application tracing, with attentive service

Using the access relationships and performance metrics in the application topology, check whether any application node has shown recent performance bottlenecks or errors, then drill down to the specific slow or failing traces. Trace analysis identifies which application instance, on which host node, executing which code, produced the exception. Combined with the trace details, you can then expand the resource information of the process the trace belongs to, the application logs generated along the trace, the error stack, the database access details, and the trend of the process instance's runtime metrics in order to analyze and locate the root cause.
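
The drill-down from a trace to the slow or failing span might look roughly like the sketch below. The span layout (`id`, `parent`, `service`, `host`, `duration_ms`, `error`) and both helper functions are assumptions made for this example, and the self-time calculation assumes that child spans run one after another rather than in parallel.

```python
# Hypothetical spans of a single trace; field names are illustrative assumptions.
spans = [
    {"id": "a", "parent": None, "service": "gateway", "host": "host-1",
     "duration_ms": 1200, "error": None},
    {"id": "b", "parent": "a", "service": "order", "host": "host-2",
     "duration_ms": 1100, "error": None},
    {"id": "c", "parent": "b", "service": "inventory", "host": "host-3",
     "duration_ms": 50, "error": None},
    {"id": "d", "parent": "b", "service": "payment", "host": "host-4",
     "duration_ms": 1000, "error": "TimeoutError"},
]


def self_times(spans):
    """Time spent in each span itself: its duration minus the
    durations of its direct children (assumes sequential children)."""
    child_total = {}
    for span in spans:
        parent = span["parent"]
        if parent is not None:
            child_total[parent] = child_total.get(parent, 0) + span["duration_ms"]
    return {s["id"]: s["duration_ms"] - child_total.get(s["id"], 0) for s in spans}


def drill_down(spans):
    """Return the span with the largest self time and all spans with errors."""
    times = self_times(spans)
    slowest = max(spans, key=lambda s: times[s["id"]])
    errors = [s for s in spans if s["error"]]
    return slowest, errors


slowest, errors = drill_down(spans)
print(slowest["service"], slowest["host"], errors[0]["error"])
```

Here both the slowest span and the only error point to the `payment` service on `host-4`, which is where an operator would then expand logs, stack traces, and process metrics.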

Since the start of the cloud-native era, technology has iterated significantly faster. Guangtong Youyun's products and solutions center on the application and its business, and support the move from traditional passive monitoring to observability based on "active discovery". Moving from "monitoring" to "observability" brings in richer technologies, organizations, and content, and builds a broader understanding of application management as a whole. If that understanding rests on unified, feasible concepts, methodologies, and tool products, with unified data as its basis, the ability to "actively discover" problems will improve greatly, the business will be fully observable, and the ideal will eventually become reality.