
O&M tools are so fragmented, nine tricks to help you manage them in a unified manner

Author: Observation Cloud

Background

Operations and maintenance (O&M) tooling is full of lone rangers, each with its own specialty and each fighting its own battle. Open-source monitoring tools resemble the rival schools of a martial arts novel: SkyWalking stands out with its superb tracing skills, Prometheus roams the land with its flexible alerting mechanism, and ELK is the illusionist, excelling at log analysis and data visualization.

Each school defends its own territory and competes for resources while ignoring the bigger picture. Some observers have pointed out that if these masters worked together to meet a changing world, they would form an invincible alliance.

Then a newcomer appeared who claimed to "fight ten with one" and took on SkyWalking, Sentry, Prometheus, OpenSearch, and other masters at once, freeing the O&M team from the work of "operating the operations tools" and improving the overall efficiency of the product, R&D, and operations teams.

That newcomer is Observation Cloud, a real-time data monitoring platform for development, O&M, testing, and business teams. It meets cloud, cloud-native, application, and business monitoring needs in a unified way and quickly brings observability to the infrastructure, middleware, application, and business layers. Its solutions cover infrastructure monitoring, log and metric management, application performance monitoring, user access monitoring, availability monitoring, system-level security inspection, and scenarios and dashboards. Through unified data collection, comprehensive data monitoring, seamless correlation analysis, custom scenario construction, high programmability, and agile team collaboration, it aims to give users a fast, easy, comprehensive, and flexible observability platform.

Today we share one such case: how Observation Cloud used nine tricks to help a customer replace five monitoring platforms.

Customer case

The customer is a small-to-medium-sized financial solution provider. Its DevOps team is divided into front-end, back-end, and O&M groups, each with its own responsibilities, working closely together to keep shipping new products and features that meet business needs.

Over two technical exchanges, the customer described its monitoring situation to the Observation Cloud team: the front-end, back-end, and O&M groups were using more than five O&M tools and services, including Sentry, SkyWalking, Prometheus + Grafana, the AWS OpenSearch subscription service, and AWS CloudWatch.

Decentralized monitoring fragments the data. Each group evaluates system problems using only its own data, which easily leads to siloed thinking and makes it hard for the team to agree quickly on the cause of a failure or anomaly. The several self-built platforms also placed a heavy maintenance burden on the O&M team.

  • The many problems caused by decentralized monitoring

After thorough discussions with the customer's team, Observation Cloud was confident that it could help the customer consolidate its monitoring and replace several open-source monitoring tools. The customer accepted Observation Cloud's value proposition of "a unified platform that improves the team's overall collaboration efficiency" and quickly organized its team for technical validation. After two weeks of joint work, the two sides had validated every scenario.

  • The value proposition of Observation Cloud: a unified platform that improves the team's overall collaboration efficiency

Key tricks

Eight colleagues from the customer's front-end, back-end, and O&M groups took part in the two-week validation. In the end, the two parties summarized a number of key scenarios and presented them to the customer's decision-makers, who rated them highly. Here are the nine tricks with which Observation Cloud won the customer over.

Trick 1: Host/container monitoring

For host and container monitoring, the customer had been using Prometheus + Grafana. Observation Cloud instead relies on the DataKit collector to gather objects and metrics from hosts and containers.

  • In host mode, a single shell command performs a one-click installation.

  • A command to install the DataKit collector

  • In Kubernetes (k8s) mode, the installation is configured in a YAML file.

  • Managing the installation of the collector and its plug-ins through a YAML file

Once DataKit is installed, the object attributes and metrics of each host and container appear in Observation Cloud. Observation Cloud provides views such as honeycomb charts that show host and container health by color, which helps customers with a large infrastructure quickly spot sub-healthy hosts and containers.
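To make the idea of coloring hosts by health concrete, here is a minimal Python sketch. It is not how Observation Cloud computes its charts: the `HostSample` type, the two thresholds, and the sample hosts are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass
class HostSample:
    name: str
    cpu_percent: float
    mem_percent: float


# Hypothetical thresholds chosen for illustration, not product defaults.
WARN_AT = 70.0
CRIT_AT = 90.0


def health_color(sample: HostSample) -> str:
    """Map the worst of the two utilization figures to a cell color."""
    worst = max(sample.cpu_percent, sample.mem_percent)
    if worst >= CRIT_AT:
        return "red"
    if worst >= WARN_AT:
        return "yellow"
    return "green"


def sub_healthy(samples: list[HostSample]) -> list[str]:
    """Return the names of hosts whose cell would not be green."""
    return [s.name for s in samples if health_color(s) != "green"]


hosts = [
    HostSample("web-1", 35.0, 40.0),
    HostSample("web-2", 72.5, 55.0),
    HostSample("db-1", 40.0, 93.0),
]
unhealthy = sub_healthy(hosts)
```

With many hosts, scanning for non-green cells is the same operation as `sub_healthy` above, which is why a color-coded overview scales better than reading per-host graphs.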

When installing DataKit, you can also enable the eBPF collector to analyze network traffic between infrastructure components. Customers can then observe TCP retransmissions, latency, and similar indicators to get a full picture of network conditions in the cluster.

  • Monitoring of hosts/containers

Trick 2: AWS service monitoring

The customer's infrastructure was deployed on AWS and used many PaaS services, but the monitoring coverage it got through CloudWatch was incomplete. For this engagement, we recommended the Func module of Observation Cloud, a Python-based platform for developing, managing, and executing scripts. Its official script marketplace already includes monitoring scripts for dozens of AWS services.

You only need to select the corresponding script, make a few simple changes (fill in the AK/SK and Region, and adjust the default metrics to collect), and enable a scheduled task. The objects and monitoring metrics of the AWS services then appear in the Observation Cloud console.

  • AWS Service Monitoring
  • Dozens of AWS services currently supported by Observation Cloud

Trick 3: Collect and analyze logs

The customer had been using the AWS OpenSearch subscription service to process important logs from its business systems. With Observation Cloud, the customer used hot and cold tiered storage: logs from the last 30 days are stored in GuanceDB (Observation Cloud's high-performance OLAP columnar database), and cold logs are dumped through Observation Cloud to AWS S3 for backup.
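The hot/cold split described above comes down to an age check against the 30-day window. The sketch below only illustrates that rule; the function names and the log record shape are hypothetical, and it does not reflect how GuanceDB or the S3 dump is actually implemented.

```python
from datetime import datetime, timedelta, timezone

# The 30-day hot window comes from the article; everything else is illustrative.
HOT_RETENTION = timedelta(days=30)


def storage_tier(log_time: datetime, now: datetime) -> str:
    """Return "hot" for logs inside the retention window, otherwise "cold"."""
    return "hot" if now - log_time <= HOT_RETENTION else "cold"


def split_by_tier(logs: list[dict], now: datetime) -> tuple[list[dict], list[dict]]:
    """Partition log records into (hot, cold) lists by their timestamp."""
    hot = [entry for entry in logs if storage_tier(entry["time"], now) == "hot"]
    cold = [entry for entry in logs if storage_tier(entry["time"], now) == "cold"]
    return hot, cold


now = datetime(2024, 6, 30, tzinfo=timezone.utc)
logs = [
    {"time": datetime(2024, 6, 15, tzinfo=timezone.utc), "message": "recent"},
    {"time": datetime(2024, 5, 1, tzinfo=timezone.utc), "message": "old"},
]
hot_logs, cold_logs = split_by_tier(logs, now)
```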

The customer also needs to manage a log blacklist day to day and to send timely fault notifications based on combinations of business keywords. Observation Cloud supports both requirements well.

What exceeded the customer's expectations was that logs in Observation Cloud are automatically associated not only with tracing data but also with host and container data. While analyzing logs, customers can click a tag to view the runtime metrics of the corresponding host or container, which greatly speeds up troubleshooting.
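The two requirements, a blacklist that drops unwanted logs and alerts on keyword combinations, can be sketched in a few lines of Python. The patterns, keyword pairs, and function names below are hypothetical examples, not the customer's rules and not Observation Cloud configuration.

```python
import re

# Hypothetical blacklist: log lines matching any pattern are dropped.
BLACKLIST = [re.compile(r"GET /healthz"), re.compile(r"\bDEBUG\b")]

# Hypothetical keyword combinations: every keyword in a tuple must appear.
ALERT_COMBOS = [("payment", "timeout"), ("transfer", "failed")]


def is_blacklisted(line: str) -> bool:
    return any(pattern.search(line) for pattern in BLACKLIST)


def matched_combo(line: str):
    """Return the first keyword combination fully contained in the line."""
    lowered = line.lower()
    for combo in ALERT_COMBOS:
        if all(keyword in lowered for keyword in combo):
            return combo
    return None


def process(lines: list[str]) -> list[tuple]:
    """Drop blacklisted lines, then collect (combo, line) pairs to notify on."""
    alerts = []
    for line in lines:
        if is_blacklisted(line):
            continue
        combo = matched_combo(line)
        if combo is not None:
            alerts.append((combo, line))
    return alerts


sample_lines = [
    "DEBUG payment timeout ignored in test",
    "ERROR payment gateway timeout after 30s",
    "INFO transfer ok",
]
alerts = process(sample_lines)
```

Applying the blacklist before keyword matching means a noisy debug line never triggers a notification, even if it happens to contain the alert keywords.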

  • Log analysis in Observation Cloud

Trick 4: Collection and analysis of user experience data (RUM)

The customer's front-end team originally used Sentry for user experience analysis, focusing on interface performance, session replay, and similar features, but had not implemented correlated tracing with back-end APM.

Observation Cloud collects and analyzes RUM data from multiple perspectives, including Session, View, Action, Error, and LongTask, to help customers understand the actual user experience.

Session Replay is well supported in Observation Cloud, which offers several masking levels for sensitive data so that the user's fault scene can be reproduced without leaking the user's sensitive data.

Correlation between front-end RUM and back-end APM relies on tracing parameters that the SDK automatically adds to HTTP request headers, so customers can correlate front-end and back-end data without adding manual instrumentation to their code.
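The article does not say which headers the SDK adds. As an illustration of header-based correlation in general, the sketch below uses the W3C Trace Context `traceparent` format (`00-<trace-id>-<parent-id>-<flags>`); the helper names are hypothetical, and the real SDK may use a different header.

```python
import re
import secrets

# W3C Trace Context layout: version "00", 32-hex trace id, 16-hex parent id, 2-hex flags.
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled: bool = True) -> str:
    """Build a traceparent value with random trace and span identifiers."""
    trace_id = secrets.token_hex(16)  # 16 random bytes -> 32 hex characters
    span_id = secrets.token_hex(8)    # 8 random bytes -> 16 hex characters
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def add_trace_header(headers: dict) -> dict:
    """What a front-end SDK would do: attach the header to an outgoing request."""
    out = dict(headers)
    out.setdefault("traceparent", new_traceparent())
    return out


def extract_trace_id(headers: dict):
    """What the back end would do: read the trace id to link its spans."""
    match = TRACEPARENT_RE.match(headers.get("traceparent", ""))
    return match.group(1) if match else None


request_headers = add_trace_header({"Accept": "application/json"})
trace_id = extract_trace_id(request_headers)
```

Because both sides record the same trace id, a front-end action and the back-end spans it caused can be joined later without any change to application code.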

  • Collect and analyze user experience data

Trick 5: Distributed tracing

The customer's back-end development team originally used SkyWalking for tracing, but under the pressure of product iterations it had not been able to invest in correlating traces with logs.

With good support for mainstream APM solutions such as DDTrace, OpenTelemetry, and SkyWalking, Observation Cloud collected the customer's trace data in real time and guided the customer in adjusting its log output format. The long-awaited trace-to-log correlation was soon in place.

In addition, because the data collected by Observation Cloud includes many extended fields by default, customers can search and analyze any of those fields with key:value queries to investigate suspected anomalies, which gives them a great deal of flexibility.
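Adjusting the log output format for trace correlation usually means writing the current trace ID into every log line. The sketch below shows one way to do that with Python's standard `logging` module. The `trace_id=` field name, the logger name, and the context variable are assumptions for illustration; in a real service the tracing library would supply the ID.

```python
import io
import logging
from contextvars import ContextVar

# Hypothetical holder for the active trace id; a tracer would normally set it.
current_trace_id = ContextVar("current_trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Copy the active trace id onto each record so the formatter can print it."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = current_trace_id.get()
        return True


def build_logger(stream) -> logging.Logger:
    logger = logging.getLogger("orders")
    logger.handlers.clear()
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s")
    )
    handler.addFilter(TraceIdFilter())
    logger.addHandler(handler)
    return logger


buffer = io.StringIO()
log = build_logger(buffer)
current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")
log.info("order created")
```

Once every line carries the trace ID, a log collector can extract the field and join each log line to the trace that produced it.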
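As a rough picture of what a key:value search over extended fields does, the sketch below parses space-separated `key:value` terms and keeps the records that match all of them. This is a toy filter, not DQL and not the query syntax of Observation Cloud; the field names and sample spans are hypothetical.

```python
def parse_query(query: str) -> dict:
    """Parse space-separated key:value terms into a dict of required fields."""
    filters = {}
    for term in query.split():
        key, sep, value = term.partition(":")
        if not sep or not key or not value:
            raise ValueError(f"expected key:value, got {term!r}")
        filters[key] = value
    return filters


def search(records: list[dict], query: str) -> list[dict]:
    """Return records whose fields equal every key:value term in the query."""
    filters = parse_query(query)
    return [
        record
        for record in records
        if all(str(record.get(key)) == value for key, value in filters.items())
    ]


spans = [
    {"service": "orders", "status": "error", "http_status": 500, "env": "prod"},
    {"service": "orders", "status": "ok", "http_status": 200, "env": "prod"},
    {"service": "billing", "status": "error", "http_status": 502, "env": "test"},
]
errors_in_orders = search(spans, "service:orders status:error")
```

The benefit of default extended fields is that any of them, such as `env` or `http_status` here, can be used as a filter without being declared in advance.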


Trick 6: Unified alerting

For unified alerting, the customer mainly monitors host runtime metrics and log keywords. All alerts are delivered as group notifications through DingTalk and as phone notifications through PagerDuty.

Observation Cloud provides more than 10 types of monitors, including threshold, log, process, and application performance monitors, which fully meet the customer's fault-warning needs. In addition, customizable event templates and easy integration with PagerDuty let the customer keep its existing habits.
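A threshold monitor combined with an event template can be pictured as follows. This sketch is purely illustrative: the template text, thresholds, and function name are hypothetical, and it does not call DingTalk, PagerDuty, or any Observation Cloud API.

```python
from statistics import mean
from string import Template

# Hypothetical event template; placeholders are filled in when a monitor fires.
EVENT_TEMPLATE = Template("[$level] $host: $metric avg $value% over last $n samples")


def evaluate(host: str, metric: str, samples: list[float], warn: float, crit: float):
    """Return a rendered event if the average crosses a threshold, else None."""
    average = mean(samples)
    if average >= crit:
        level = "critical"
    elif average >= warn:
        level = "warning"
    else:
        return None
    return EVENT_TEMPLATE.substitute(
        level=level, host=host, metric=metric, value=f"{average:.1f}", n=len(samples)
    )


event = evaluate("web-2", "cpu", [88.0, 92.0, 96.0], warn=80.0, crit=90.0)
quiet = evaluate("web-1", "cpu", [10.0, 20.0, 30.0], warn=80.0, crit=90.0)
```

Keeping the message in a template means the same monitor can feed different channels with the wording each team is already used to.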

  • Unified alerting in Observation Cloud

Trick 7: Flexible, easy-to-use dashboards

The customer originally built its dashboards in Grafana. The O&M team focused mainly on dashboards for container operation and AWS service monitoring, while the front-end team had to respond to requests from the product team and build business dashboards for it.

Observation Cloud offers more than 20 chart types and a flexible, easy-to-use charting experience. Product and operations staff can build charts by drag and drop and use drop-down menus to select the metrics and query conditions they care about. O&M and development staff can query all kinds of observability data with DQL statements. Metrics, traces, and logs all use the same query syntax, and PromQL compatibility makes it easy for customers to move from their original data analysis platform to Observation Cloud. In just two weeks, the customer team configured more than 30 dashboards on its own, which the customer's decision-makers recognized.


Trick 8: Efficient and secure data sharing

Because the customer provides financial services, data security is critical. In the past, teams collaborated through screenshots, remote assistance, and sending log files, which was inefficient and prone to data leaks.

Observation Cloud provides snapshot sharing: customers can save filtered data as a snapshot and share it with colleagues, who can open the snapshot and perform a limited amount of interactive analysis. Observation Cloud can mask specified data in the snapshot and lets customers set a validity period, an access IP whitelist, encrypted access, and so on, so they no longer have to worry about data leaks.
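The three controls mentioned above (masking, validity period, IP whitelist) can be modeled in a short Python sketch. The `Snapshot` type, its field names, and the sample values are hypothetical; this shows the kind of checks involved, not the product's implementation.

```python
import ipaddress
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Snapshot:
    rows: list
    expires_at: datetime
    allowed_networks: list                      # CIDR strings, e.g. "10.0.0.0/8"
    masked_fields: set = field(default_factory=set)


def can_open(snapshot: Snapshot, client_ip: str, now: datetime) -> bool:
    """Allow access only before expiry and from a whitelisted network."""
    if now >= snapshot.expires_at:
        return False
    address = ipaddress.ip_address(client_ip)
    return any(
        address in ipaddress.ip_network(network)
        for network in snapshot.allowed_networks
    )


def masked_rows(snapshot: Snapshot) -> list:
    """Return the rows with every masked field replaced by a placeholder."""
    return [
        {key: ("***" if key in snapshot.masked_fields else value)
         for key, value in row.items()}
        for row in snapshot.rows
    ]


snap = Snapshot(
    rows=[{"user_id": "u-1001", "card_no": "6222000011112222", "status": "failed"}],
    expires_at=datetime(2024, 6, 8, tzinfo=timezone.utc),
    allowed_networks=["10.0.0.0/8"],
    masked_fields={"card_no"},
)
now = datetime(2024, 6, 1, tzinfo=timezone.utc)
```

A colleague on the office network could open `snap` until it expires and would see `***` in place of the card number, while the other fields remain available for analysis.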


Trick 9: Multi-dimensional analysis based on user ID

With Observation Cloud, the customer achieved full-stack tracing that starts from the user ID. Given a userID and a time period, it is easy to find the session of the user who reported a fault and then carry out a multi-dimensional, end-to-end analysis that runs from the front end through back-end traces, logs, hosts, pods, and container resources to the running state of databases and middleware. Product, R&D, and O&M staff can finally use the same tool to analyze problems and quickly agree on conclusions.
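The lookup described above is a chain of joins: user ID and time window to session, session to trace IDs, trace IDs to traces and logs. The sketch below walks that chain over in-memory records. All record shapes, field names, and sample data are hypothetical and only illustrate the correlation idea.

```python
from datetime import datetime


def sessions_for_user(sessions: list, user_id: str, start: datetime, end: datetime) -> list:
    """Find the user's sessions that overlap the reported time window."""
    return [
        s for s in sessions
        if s["user_id"] == user_id and s["start"] < end and s["end"] > start
    ]


def related(session: dict, traces: list, logs: list) -> dict:
    """Collect the traces and logs that share a trace id with the session."""
    trace_ids = set(session["trace_ids"])
    return {
        "traces": [t for t in traces if t["trace_id"] in trace_ids],
        "logs": [entry for entry in logs if entry.get("trace_id") in trace_ids],
    }


sessions = [
    {"session_id": "s1", "user_id": "u-1001",
     "start": datetime(2024, 6, 1, 10, 0), "end": datetime(2024, 6, 1, 10, 20),
     "trace_ids": ["t1", "t2"]},
    {"session_id": "s2", "user_id": "u-2002",
     "start": datetime(2024, 6, 1, 10, 5), "end": datetime(2024, 6, 1, 10, 15),
     "trace_ids": ["t3"]},
]
traces = [{"trace_id": "t1", "service": "gateway"}, {"trace_id": "t3", "service": "orders"}]
logs = [
    {"trace_id": "t2", "host": "web-1", "message": "upstream timeout"},
    {"trace_id": "t3", "host": "web-2", "message": "ok"},
    {"host": "web-1", "message": "log line without a trace id"},
]

found = sessions_for_user(
    sessions, "u-1001", datetime(2024, 6, 1, 10, 10), datetime(2024, 6, 1, 10, 30)
)
evidence = related(found[0], traces, logs)
```

The `host` field on each matching log is what would let an investigator continue from the log to the host or container metrics.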


Review

Looking back at history, the Qin dynasty's unification of weights and measures promoted economic and social prosperity. In the cloud-native era, a unified monitoring platform is likewise needed to unify monitoring data, the team's analytical perspective, and data standards.

When reporting to the customer's decision-makers, we also described the advantages of Observation Cloud over foreign commercial products: documentation in Chinese running to millions of words that the customer team can use easily, and comprehensive internationalization support that meets the needs of overseas employees. For commercial customers, Observation Cloud also provides technical services such as regular meetings and best-practice sharing, so that customers feel the warmth of Observation Cloud's technical services while using its powerful product features.

