
ByteDance's Distributed Link Tracing in Practice: A Textbook-Style Construction Guide

Author: DBAplus Community

Overview

Over the course of its development, ByteDance has gradually formed an extremely complex, ultra-large-scale microservice system, which places very high demands on the overall observability solution for the backend. To address this, the infrastructure intelligent operations team built a link tracing system that integrates and unifies massive volumes of Metrics/Trace/Log data, and on top of it built a new-generation, one-stop full-link observation and diagnosis platform that helps businesses with monitoring and troubleshooting, link combing, performance analysis, and similar problems. This article introduces the overall functionality and technical architecture of ByteDance's link tracing system, along with our thinking and lessons learned from practice.

1. What is distributed tracing?

1. The relationship between Metrics, Trace, and Log

The three pillars of observability data are Metrics, Logs, and Traces. When these three come up, you probably think of using Metrics when you need to monitor trends and configure alerts, and checking logs when you need to investigate a problem. For a system with a large number of microservices, Traces are indispensable. A Trace can also be viewed as a standardized, structured log that records many fixed fields, such as upstream/downstream call relationships and time consumption; when you want to see the upstream/downstream call relationships of a request or how its latency is distributed across the nodes of a link, you look at the Trace.


But if you use these data in isolation, their value is greatly reduced. Consider a common scenario: you learn from Metrics that a service's SLA has dropped and its error rate has risen, so how do you find the root cause? You first go looking for error logs, but are the error logs you find really related to the rise in the error rate? You have to dig through the code to see where those logs are printed and what they mean. Then you look for an error Trace, but you still cannot be sure whether the Trace you found is related to the error-rate increase, and again you have to confirm against the code. Finally, by comparing code and data line by line, you confirm that the error was returned by the service one layer down; you pull that service's owner into the group to investigate together, the group keeps growing, more and more people are dragged in layer by layer, and in the end you locate the cause: a change in an underlying service triggered a panic, and the error propagated upward layer by layer until it dragged down the service's SLA.

This process is far from ideal: it requires engineers to understand the underlying logic of every piece of data in order to use it to solve a specific problem. In a complex, large-scale microservice system, no single engineer can be familiar with the internals of every microservice, which makes troubleshooting and observing such a system a genuinely challenging task.

2. Trace is the thread that links the data

What if the monitoring data of all microservices followed a unified model and semantic specification and were inherently highly correlated?

In a software system, countless contexts flow every second. A context may be an online request being served or an asynchronous processing task. Each context propagates across multiple microservice nodes before it finally completes. All monitoring data (including Metrics, Logs, and so on) originates from some context. Trace is the data carrier of this context: through a standardized data model, it records the entire execution of the context across multiple microservices and correlates all of the events (including Metrics, Logs, and so on) that occur on that context along the way.

Returning to the earlier case: when we notice a metric fluctuation we care about, we can directly retrieve the Traces associated with that fluctuation, then look at the full execution details of those Traces in each microservice, and find that a microservice at the bottom of the call chain panicked while executing the request, and that the error kept propagating upward until the service's external SLA dropped. If the observability platform is more complete and also surfaces the microservices' change events, a single engineer can quickly complete the whole process of troubleshooting and root-cause location; with no human involvement at all, the machine can even finish troubleshooting and root-cause location automatically.


Trace is not just a tool for viewing a Gantt chart of where time is spent; it is also the contextual thread that links massive amounts of monitoring data. Reliable correlation of Metrics/Trace/Log data is the foundation for building strong observability, which in turn can answer many complex questions in monitoring and troubleshooting, SLO tuning, architecture combing, traffic estimation, intelligent fault attribution, and more.

Trace collection and cross-process context propagation are generally done automatically by infrastructure such as the microservice framework, but achieving the best result also requires the understanding and cooperation of every R&D engineer, who should consciously pass the context through all code execution. For example, in Golang, context.Context needs to be passed as an argument through every function call; in Java, ThreadLocal is generally used as the default storage carrier for the context, but in multi-threaded or asynchronous scenarios developers must pass the Context explicitly themselves, otherwise the context chain breaks and effective tracing and monitoring become difficult.
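The pattern looks roughly like this in Go. This is a minimal sketch, assuming a hypothetical startSpan helper rather than the actual SDK API:

```go
package main

import (
	"context"
	"fmt"
)

type spanKey struct{}

// startSpan stands in for a real tracing SDK call; it only illustrates that
// every span is derived from the incoming context.
func startSpan(ctx context.Context, name string) (context.Context, func()) {
	fmt.Println("start span:", name)
	return context.WithValue(ctx, spanKey{}, name), func() { fmt.Println("finish span:", name) }
}

func queryDB(ctx context.Context) {
	_, finish := startSpan(ctx, "queryDB")
	defer finish()
}

func handleRequest(ctx context.Context) {
	ctx, finish := startSpan(ctx, "handleRequest")
	defer finish()

	queryDB(ctx) // pass ctx down; never create a fresh context.Background() here

	// Asynchronous work must be handed the context explicitly, otherwise the
	// child span cannot be linked back to this trace.
	done := make(chan struct{})
	go func(ctx context.Context) {
		defer close(done)
		_, finish := startSpan(ctx, "asyncFlush")
		defer finish()
	}(ctx)
	<-done
}

func main() {
	handleRequest(context.Background())
}
```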

2. Challenges and opportunities for ByteDance's link tracing system

Over the course of its development, ByteDance has gradually formed an extremely complex, ultra-large-scale microservice system, which places very high demands on the backend's overall observability solution.

Our challenges include:

  • Online traffic is huge
  • The number of microservices is huge, call relationships are complex, and iteration is fast
  • The R&D team is huge and the division of labor is complex

ByteDance today has enormous traffic, a very large number of active microservices and container instances, and a huge R&D organization. A complex service link often involves hundreds of microservices; there are front-line services, middle-platform services, and infrastructure services, developed by different R&D teams, plus various horizontal teams responsible for the quality, stability, security, and efficiency of the overall architecture. Different teams have different demands on the link tracing system.

We also have unique opportunities:

  • The microservices framework is highly unified
  • Microservices are highly containerized and have a unified environment
  • Infrastructure such as storage/middleware is highly unified

Thanks to long-term investment in unified infrastructure, the underlying technology stack used by microservices across the company is highly uniform. The vast majority of microservices are deployed on the company's unified container platform, use the company's unified microservice framework and service mesh solution, and use the storage components and corresponding SDKs provided by the company. This high degree of consistency gives the infrastructure team a favorable foundation for building a company-wide, unified link tracing system.

3. Goals of ByteDance's link tracing system

Facing this situation, ByteDance's link tracing system is built around a number of goals. Our functional goals cover these main areas:

  • Unified data model and semantics: unify the data model and semantic specifications, replace and upgrade the default instrumentation middleware of all mainstream frameworks/components, and establish reliable correlation among Metrics/Trace/Log.
  • Open customization: on top of the unified model, fully open up customization capabilities to meet the monitoring and tracing needs of different business scenarios.
  • Centralized configuration control: centrally and dynamically manage policies such as sampling, dyeing, circuit breaking and rate limiting, indexing, and data desensitization and confidentiality.
  • One-stop observation platform: provide a complete solution from SDK through collection, computation, storage, and query to product interaction, build a one-stop observation platform on top of high-quality basic data, and improve efficiency in scenarios such as monitoring and troubleshooting, SLO tuning, architecture combing, and capacity management.

Behind these functional goals, the technical goals we pursue revolve mainly around the following:

  • Minimize service integration overhead: integration overhead includes the retrofit cost of onboarding a service and the runtime overhead after onboarding. For link tracing to roll out with broad coverage, integration overhead must be kept to a minimum.
  • Balance storage efficiency and retrieval requirements: with a limited machine budget, process and store a huge amount of data while keeping the delay from data generation to retrievability within minutes and retrieval response times within seconds.
  • Disaster recovery completeness across multiple data centers: availability must be preserved first and foremost in disaster scenarios such as network partition or congestion and data center outages, which is exactly when services urgently need to observe what is happening online.
  • Minimize architecture and dependency complexity: ByteDance has many data centers at home and abroad, so the complexity of the overall architecture and of third-party dependencies must be minimized; otherwise multi-data-center deployment, operations, and disaster recovery become very difficult.

4. Implementation of ByteDance's link tracing system

1. Data collection

1) Data model

A unified data model is the foundation of Trace. The data model of ByteDance's link tracing system draws on excellent open-source solutions such as OpenTracing and CAT, combined with ByteDance's actual ecosystem and usage habits, and adopts the following model (a rough Go sketch follows the list):

  • Span: An event with a time span, such as an RPC call, a function execution.
  • Event: An event without a time span, such as a log, a panic.
  • Metric: A numeric value with a multidimensional tag, such as the size of a message body, the price of an order.
  • Trace: the context of a request as it executes across the full link of multiple distributed microservice nodes.
  • Transaction: a tree-structured message body consisting of all Span/Event/Metric objects produced by one Trace on a single service node. Transaction is the smallest unit for processing and storing Trace data; its compact structure helps save cost and improve retrieval performance.
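A rough Go sketch of how these concepts relate to one another; the field names are illustrative assumptions, not the actual wire format of the SDK:

```go
package tracemodel

import "time"

// Span: an event with a time span, e.g. an RPC call or a function execution.
type Span struct {
	SpanID    uint64
	ParentID  uint64
	Name      string
	StartTime time.Time
	Duration  time.Duration
	Tags      map[string]string
	Children  []*Span
	Events    []Event  // events that occurred inside this Span
	Metrics   []Metric // metrics recorded inside this Span
}

// Event: an event without a time span, e.g. a log line or a panic.
type Event struct {
	Name      string
	Timestamp time.Time
	Tags      map[string]string
}

// Metric: a numeric value with multi-dimensional tags, e.g. a message body size.
type Metric struct {
	Name  string
	Value float64
	Tags  map[string]string
}

// Transaction: the tree of all Span/Event/Metric objects produced by one trace
// on a single service node; the smallest unit of Trace processing and storage.
type Transaction struct {
	TraceID  string
	Service  string
	RootSpan *Span
}
```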

The following figure shows a code example of instrumenting a Trace with the ByteDance link tracing SDK. Note that the context runs through the entire request lifecycle and is continuously passed both within and across processes.

(Figure: example of instrumenting a Trace with the ByteDance link tracing SDK)

Continuing with this example, and with the diagram below, we illustrate how Metrics, Trace, and Log can be reliably associated under this model.

(1) Associating Metrics with Trace:

  • Every Span has built-in metric time series for throughput, latency, and error rate, and Spans in RPC/MQ scenarios also have built-in time series such as SendSize/RecvSize/MqLag. Span tags and Metric tags correspond one-to-one, so the Span time series can be reliably associated with the Spans in a Trace.
  • Every Event is not only mounted on its Span but also has a built-in frequency metric time series. Event tags and Metric tags correspond one-to-one, so Event time series can be reliably associated with the Events in a Trace.
  • Every Metric is not only mounted on its Span but also outputs statistical time series such as rate/timer/store according to its type. The tags on both sides correspond one-to-one, so metric time series can be reliably associated with the Metric objects in a Trace.
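As a hedged sketch of the first point (the metric names and the emit helper are assumptions, and the Span shape mirrors the data-model sketch above): when a Span finishes, its built-in time series are emitted with a tag set copied one-to-one from the Span, which is what lets a metric point be joined back to the Spans that produced it.

```go
package spanmetrics

import (
	"fmt"
	"time"
)

// Minimal Span shape, mirroring the data-model sketch above.
type Span struct {
	Name     string
	Duration time.Duration
	Tags     map[string]string
}

// emit stands in for a real metrics client.
func emit(name string, value float64, tags map[string]string) {
	fmt.Println(name, value, tags)
}

// FinishSpan emits the built-in throughput/latency/error time series with a
// tag set copied one-to-one from the Span tags.
func FinishSpan(s *Span, err error) {
	tags := make(map[string]string, len(s.Tags))
	for k, v := range s.Tags {
		tags[k] = v // identical tags on the Metric and the Span
	}
	emit("span.throughput", 1, tags)
	emit("span.latency_ms", float64(s.Duration.Milliseconds()), tags)
	if err != nil {
		emit("span.error", 1, tags)
	}
}
```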

(2) Associating Logs with Trace:

  • The Log SDK writes the TraceID and SpanID from the context into the log header, establishing a reliable association with the Trace via TraceID and SpanID.
(Figure: how Metric, Trace, and Log data are associated via shared tags and TraceID/SpanID)
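A minimal sketch of the log side, assuming a hypothetical logging helper: the logger reads the TraceID and SpanID from the context and writes them into the log header.

```go
package tracelog

import (
	"context"
	"fmt"
	"time"
)

type traceKey struct{}

type traceInfo struct {
	TraceID string
	SpanID  string
}

// WithTrace attaches trace identifiers to the context (normally done by the tracing SDK).
func WithTrace(ctx context.Context, traceID, spanID string) context.Context {
	return context.WithValue(ctx, traceKey{}, traceInfo{TraceID: traceID, SpanID: spanID})
}

// Info writes the TraceID/SpanID from the context into the log header, which
// is what lets a log line be joined back to its Trace and Span.
func Info(ctx context.Context, msg string) {
	ti, _ := ctx.Value(traceKey{}).(traceInfo)
	fmt.Printf("%s [trace_id=%s span_id=%s] %s\n",
		time.Now().Format(time.RFC3339), ti.TraceID, ti.SpanID, msg)
}
```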

2) Semantic specifications

A unified abstract data model alone is not enough. Without uniform standards across services, it is hard to build a high-quality observation platform even with a unified abstract model. Semantic specifications must be unified for mainstream scenarios such as HTTP Server, RPC Server, RPC Client, MySQL Client, Redis Client, MQ Consumer, and MQ Producer, ensuring that data reported by different languages and frameworks in the same scenario follows the same semantic specification; only then can truly high-quality observable data be obtained.

There is no single correct standard for semantic specifications; some of the specifications currently used inside ByteDance are given below as reference examples.

(1) Common base fields

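As a purely illustrative stand-in (these tag keys are assumptions, not Byte's published specification), common base fields shared by all scenarios typically cover service identity, peer identity, status, and environment:

```go
package semantics

// Illustrative common base tag keys; real specifications will differ.
const (
	TagService     = "service"      // name of the service emitting the Span
	TagMethod      = "method"       // interface / handler name
	TagPeerService = "peer.service" // the service on the other side of the call
	TagStatusCode  = "status_code"  // normalized status of the call
	TagCluster     = "cluster"      // deployment cluster
	TagIDC         = "idc"          // data center
	TagEnv         = "env"          // deployment stage, e.g. prod / canary / test lane
)
```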

(2) Example of a scenario-specific semantic specification: the RPC Client scenario

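Likewise for the RPC Client scenario, a hedged illustration of the kind of scenario-specific fields layered on top of the common base fields (the names are assumptions; SendSize/RecvSize correspond to the built-in metrics mentioned earlier):

```go
// Illustrative extra tag keys for an RPC Client Span; names are assumptions.
const (
	TagToService  = "to_service"  // callee service
	TagToMethod   = "to_method"   // callee method
	TagToCluster  = "to_cluster"  // callee cluster
	TagSendSize   = "send_size"   // request body size in bytes
	TagRecvSize   = "recv_size"   // response body size in bytes
	TagRetryCount = "retry_count" // number of retries for this call
)
```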

3) Sampling strategies

Because ByteDance's overall online traffic is very large and there are many microservices, different microservices differ in performance sensitivity, cost sensitivity, and data requirements. For example, some services handle sensitive data and must have complete trace data; some are highly performance-sensitive and need the number of samples kept to a minimum; scenarios such as test swimlanes, small-traffic grayscale releases, or online problem tracing each need different sampling strategies; and regular traffic and anomalous traffic also call for different strategies. Flexible sampling strategies and control mechanisms are therefore necessary. The ByteDance link tracing system mainly provides the following sampling modes:

  • Fixed-probability sampling + guaranteed sampling for low-traffic interfaces: by default the LogID is used as the sampling seed and sampling is done at a fixed probability. For low-traffic interfaces that fixed-probability sampling would rarely hit, the SDK automatically samples at a certain interval so that a certain number of requests are still collected.
  • Adaptive probabilistic sampling: a certain number of Transactions are collected per interface per unit of time, for example 100 per minute, and the SDK automatically computes the sample rate from the current QPS. This mode is recommended for services with low or unstable traffic.
  • Dye (stain) sampling: a dye marker is attached to a specific request, and when the SDK detects the marker it forces the request to be sampled.
  • PostTrace post-hoc sampling: when a Trace initially misses sampling but something interesting happens during execution, such as an error or a latency spike, sampling can be triggered from the Trace's intermediate state. PostTrace is an important complement to up-front probabilistic sampling: it captures anomalous links in a targeted way, at an extremely small cost compared with collecting everything first and then applying tail-based sampling. Its drawback is that it can only capture the Spans that have not yet ended at the moment PostTrace fires, so data completeness suffers somewhat compared with up-front sampling.

An example helps clarify what PostTrace does. In the figure on the left, the Arabic numerals mark the order of calls between microservices within one request. This trace was originally not sampled, but an exception occurred at stage 5, which triggered a PostTrace. The PostTrace information is passed back from 5 to 4 and propagated onward to the subsequent calls 6 and 7, finally returning to 1, so the data for 1, 4, 5, 6, and 7 can be collected, while the already-finished calls 2 and 3 cannot. The figure on the right is an actual PostTrace capture from our production environment, where an error propagates upward layer by layer to the end of the captured link. PostTrace works well for scenarios such as error-propagation-chain analysis and strong-dependency analysis.

(Figure: PostTrace example, with the numbered call sequence on the left and an actual online PostTrace capture on the right)

These sampling strategies can be combined. Note that sampling does not affect Metrics or Logs: Metrics are aggregated over the full data and are unaffected by sampling, and service logs are also collected in full, unaffected by sampling.
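A minimal sketch of how the head-based strategies above can be combined, assuming illustrative thresholds and a hypothetical Sampler type (PostTrace, being triggered mid-execution, is not shown):

```go
package sampling

import (
	"hash/fnv"
	"sync"
	"time"
)

// Sampler combines dye marking, fixed-probability sampling seeded by the
// LogID, and a guaranteed-sampling fallback for low-traffic interfaces.
type Sampler struct {
	Rate          float64       // fixed sampling probability, e.g. 0.001
	FallbackEvery time.Duration // guarantee one sample per interface per interval

	mu      sync.Mutex
	lastHit map[string]time.Time // interface -> last time it was sampled
}

func NewSampler(rate float64, fallbackEvery time.Duration) *Sampler {
	return &Sampler{Rate: rate, FallbackEvery: fallbackEvery, lastHit: make(map[string]time.Time)}
}

func (s *Sampler) ShouldSample(logID, iface string, dyed bool) bool {
	if dyed { // a dye marker always forces sampling
		return true
	}
	h := fnv.New64a()
	h.Write([]byte(logID)) // LogID as the sampling seed
	hit := float64(h.Sum64()%1_000_000)/1_000_000 < s.Rate

	s.mu.Lock()
	defer s.mu.Unlock()
	now := time.Now()
	// Fallback: if this interface has not been sampled for a while, sample it
	// anyway so that low-traffic interfaces still get coverage.
	if hit || now.Sub(s.lastHit[iface]) > s.FallbackEvery {
		s.lastHit[iface] = now
		return true
	}
	return false
}
```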

4) Centralized configuration control

To improve efficiency and help different teams work effectively, the ByteDance link tracing system provides rich centralized configuration capabilities. The core capabilities include the following (a sketch of the configuration shape follows the list):

  • Sampling strategy: services can set different probabilistic sampling strategies by cluster, interface, data center, and deployment stage, and can dynamically set dyeing and PostTrace trigger conditions.
  • Custom indexing: different frameworks and scenarios have different default index fields, and businesses can additionally create indexes on custom fields on top of the defaults.
  • Circuit-breaker protection: by default the SDK ships with multiple circuit-breaker mechanisms to ensure that Trace collection never consumes enough resources to affect mainline functionality, and services can dynamically adjust the circuit-breaker parameters to fit their situation.
  • Desensitization and confidentiality: services can desensitize and protect Trace data as needed.
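The shape of such a dynamically delivered configuration might look roughly like this; the field names are assumptions that mirror the capabilities above, not the actual control-plane schema:

```go
package traceconfig

// TraceConfig is an illustrative per-service configuration pushed from the
// configuration center and applied by the SDK in real time.
type TraceConfig struct {
	SampleRates   map[string]float64 `json:"sample_rates"`    // per cluster/interface/data center/stage
	DyeRules      []string           `json:"dye_rules"`       // conditions that force sampling
	PostTraceOn   []string           `json:"post_trace_on"`   // e.g. "error", "latency>500ms"
	CustomIndexes []string           `json:"custom_indexes"`  // extra tag fields to index
	MaxCPUPercent float64            `json:"max_cpu_percent"` // circuit-breaker budget for the SDK
	RedactFields  []string           `json:"redact_fields"`   // fields to desensitize before reporting
}
```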

2. Overall architecture

1) Overall architecture

(Figure: overall module architecture of the link tracing system, from data access through consumption and storage to query)

The overall module architecture of the ByteDance link tracing system, from the data access side through consumption and storage to query, is shown in the figure above. Some technical details:

  • Private-protocol data stream for extreme performance: from the SDK all the way to the final write to storage, the data stream uses a private protocol, and each hop only needs to decode a small header rather than the full payload to do its work, which saves a lot of resources at intermediate links and reduces latency (see the sketch after this list).
  • Storage built on ByteDance's self-developed log store: achieves high write throughput and query performance at a lower storage cost and supports the various online Trace retrieval scenarios.
  • Unitized architecture for multi-data-center disaster recovery: the overall unitized architecture saves cross-data-center network bandwidth and stays highly available when some data centers suffer network failures or a single data center goes down.
  • Fine-grained, flexible centralized control: the unified configuration center pushes dynamic configuration to every stage of the data stream, taking effect in real time.
  • Both online real-time query and analytical computation: as the architecture diagram shows, the data stream splits in two. One branch handles online storage and real-time query; its path is kept as short as possible, pursuing extreme performance and low latency so that data becomes retrievable as fast as possible with high availability. The other branch is the analytical computation stream; its latency requirements are looser, but it must satisfy a variety of scenario-specific computation and analysis needs and integrate well with the company's data warehouse platform.
  • Metadata collection and safe expiration: accurate and timely metadata can be extracted from the Trace stream, such as which interfaces of each microservice are active and which framework components it uses, supporting third-party systems such as monitoring and alerting and the service governance platform.
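To make the first point concrete, here is a minimal sketch of header-only decoding under an assumed frame layout (4-byte big-endian header length, header bytes, opaque body); the real private protocol is not public and will differ:

```go
package wire

import (
	"encoding/binary"
	"errors"
)

// Header carries the routing information an intermediate hop needs, e.g.
// service name, data center, and sampling flags in a compact encoding.
type Header struct {
	Raw []byte
}

// PeekHeader decodes only the header of a frame; the body is returned as an
// opaque slice and forwarded verbatim, never parsed by intermediate links.
func PeekHeader(frame []byte) (Header, []byte, error) {
	if len(frame) < 4 {
		return Header{}, nil, errors.New("short frame")
	}
	hlen := binary.BigEndian.Uint32(frame[:4])
	if uint32(len(frame)-4) < hlen {
		return Header{}, nil, errors.New("truncated header")
	}
	header := frame[4 : 4+hlen]
	body := frame[4+hlen:]
	return Header{Raw: header}, body, nil
}
```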

2) Disaster recovery completeness across multiple data centers

As mentioned earlier, as observability infrastructure, the link tracing system must remain available first and foremost in disaster scenarios such as network partition or congestion and data center outages, which is exactly when services urgently need to observe what is happening online.


The ByteDance link tracing system uses a unitized deployment: on the write path there is no cross-data-center communication, while the top-level query module is deployed in an aggregation data center (with a standby query node deployed in a backup data center) and fans queries out to every data center to retrieve and merge results.

  • There is no cross-data-center communication on the write path, so functionality is not impaired when data centers are disconnected from one another.
  • When a single data center goes down or becomes isolated, a pre-defined plan can be executed to shield the faulty data center on the query side, keeping the data of the other data centers available.
  • When the data center hosting the top-level query module goes down or is cut off, the pre-defined plan switches queries to the standby query node in the backup data center so service continues (a fan-out sketch follows this list).
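A minimal sketch of the query-side fan-out and merge, with shielded (faulty) data centers skipped and per-room failures tolerated; the interfaces are assumptions for illustration:

```go
package query

import (
	"context"
	"sync"
)

// RoomClient is the per-data-center query node interface assumed for this sketch.
type RoomClient interface {
	Search(ctx context.Context, traceID string) ([]string, error)
}

// FanOutSearch queries every data center in parallel, skips rooms shielded by
// an incident plan, tolerates per-room failures, and merges whatever returns.
func FanOutSearch(ctx context.Context, rooms map[string]RoomClient, shielded map[string]bool, traceID string) []string {
	var (
		mu     sync.Mutex
		merged []string
		wg     sync.WaitGroup
	)
	for name, c := range rooms {
		if shielded[name] {
			continue // faulty data center shielded by the pre-defined plan
		}
		wg.Add(1)
		go func(c RoomClient) {
			defer wg.Done()
			res, err := c.Search(ctx, traceID)
			if err != nil {
				return // a down room degrades the result set, it does not fail the query
			}
			mu.Lock()
			merged = append(merged, res...)
			mu.Unlock()
		}(c)
	}
	wg.Wait()
	return merged
}
```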

3) Analysis and computation

Beyond basic real-time retrieval, scenario-specific aggregate analysis and computation over Trace data is another important requirement of a link tracing system. The analysis and computation scenarios supported by the ByteDance link tracing system mainly include:

  • Topology computation: compute the upstream/downstream dependency topology skeleton of each microservice (down to interface/cluster/data center granularity), serving scenarios such as service architecture combing and full-link monitoring.
  • Traffic estimation: aggregate all Traces and, combined with each Trace's sampling rate, estimate the original traffic and call ratios along the link, serving scenarios such as capacity evaluation before scaling for events, traffic source analysis, and cost allocation.
  • Error chain analysis: aggregate the error propagation paths of Traces that contain errors, serving scenarios such as fault root-cause location, error impact analysis, and identification of failure-prone points to optimize.
  • Link performance analysis: aggregate Traces that meet specific conditions and analyze the time consumption and call ratios of each link segment, serving scenarios such as full-link performance optimization and locating the root cause of latency increases.

Different requirements call for different computation modes; the ByteDance link tracing system mainly uses three:

  • Near-real-time streaming computation: consume data from the message queue and run streaming aggregation over time windows, continuously updating results in near real time. Topology computation mainly uses this mode to maintain a near-real-time topology skeleton.
  • Ad-hoc sampling computation: on demand, retrieve a limited number (for example, a few hundred) of Traces from online storage according to specific criteria and aggregate them to get results quickly.
  • Offline batch computation: periodically run MapReduce batch jobs over the Traces in the offline data warehouse and output analysis results.

In most scenarios, Trace analysis is essentially a MapReduce computation over a batch of Traces, so the basic logical operators can be reused across the different computation modes. For example, error-propagation-chain analysis can be run as an ad-hoc sampling computation at the moment of a failure to get results quickly, or subscribed to as an offline batch computation for continuous SLO optimization.
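A sketch of this "reusable operator" idea, using error-propagation-edge counting as the example; the types are assumptions, and the same function could be driven by a few hundred ad-hoc sampled Traces or by an offline batch job:

```go
package traceanalysis

// Edge is a caller -> callee hop along which an error propagated.
type Edge struct{ From, To string }

// MapTrace extracts the analysis-relevant facts (here: error edges) from one Trace.
type MapTrace func(traceID string) []Edge

// CountErrorEdges is the shared reduce logic: it is the same whether the input
// is a handful of ad-hoc sampled Traces at incident time or the full offline
// warehouse in a batch job.
func CountErrorEdges(traceIDs []string, mapper MapTrace) map[Edge]int {
	counts := make(map[Edge]int)
	for _, id := range traceIDs {
		for _, e := range mapper(id) {
			counts[e]++
		}
	}
	return counts
}
```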


4) Results at the current stage

  • Throughput: 10 million Transactions/s (default sample rate 0.1%), 300 million index entries/s
  • Query performance: TraceID lookup P50 < 100 ms, P99 < 500 ms
  • Overall latency from data generation to retrievability: AVG ≈ 1 minute, P99 < 2 minutes
  • Storage resources: 5 PB (2 replicas, TTL 15 days)

5. Practical application cases

1. Root-cause tracing for P99 slow requests

Combines Metrics, Trace, underlying call analysis, and container resource monitoring to quickly locate the root causes of latency-spike slow requests.


2. Full-link real-time monitoring

Supports initiating a topology query from any microservice node and observing each node's traffic, latency, error rate, resource usage, alarms, and changes in real time, quickly giving a full-link view of overall status for daily inspection, troubleshooting, or load-test observation.


3. Full-link capacity estimation for promotional activities

Capacity estimation is a necessary step when business lines prepare for the recurring promotional activities they run to drive user growth or retention. The typical process is as follows:

(Figure: typical capacity estimation workflow when preparing a promotional activity)

Based on the entry point's QPS over a historical period, the call ratio of each node, and resource utilization, the ByteDance link tracing system can automatically produce a one-click estimate of the QPS increment and the incremental resource demand for every segment of the link.
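The arithmetic behind such an estimate can be sketched as follows, under the simplifying assumption that a node's resource usage scales linearly with its QPS (the types and field names are illustrative):

```go
package capacity

// Node describes one service node on the link, as derived from Trace
// aggregation (call ratios corrected by sample rate) and resource monitoring.
type Node struct {
	CallRatio  float64 // calls to this node per entry request
	CurrentQPS float64
	CurrentCPU float64 // CPU cores currently used by this node
}

// Estimate is the per-node output of the one-click estimation.
type Estimate struct {
	TargetQPS float64
	ExtraQPS  float64
	ExtraCPU  float64
}

// EstimateNode projects a node's QPS and CPU demand at a target entry QPS,
// assuming CPU per QPS stays constant.
func EstimateNode(entryTargetQPS float64, n Node) Estimate {
	target := entryTargetQPS * n.CallRatio
	extraQPS := target - n.CurrentQPS
	extraCPU := 0.0
	if n.CurrentQPS > 0 {
		extraCPU = n.CurrentCPU / n.CurrentQPS * extraQPS
	}
	return Estimate{TargetQPS: target, ExtraQPS: extraQPS, ExtraCPU: extraCPU}
}
```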


4. Fault source and impact analysis

When an anomaly occurs, the anomalous Traces can be quickly retrieved from online storage and aggregated to analyze where the error originated, where it propagated, and which services were affected, and to compare against the error rate in the same period the day before, helping engineers quickly locate the root cause during a failure. Error propagation chain computation can also run as an offline subscription over all anomalous Traces across a whole period, supporting long-term SLO optimization.


Summary

Trace is the thread that links a software system's monitoring data. Establishing reliable correlation among Metrics, Trace, and Log is the foundation for building strong observability; on top of high-quality monitoring data, it becomes possible to answer many complex questions in monitoring and troubleshooting, SLO tuning, architecture combing, traffic estimation, intelligent fault attribution, and more.

Over the course of its development, ByteDance has gradually formed an extremely complex, ultra-large-scale microservice system. We face many challenges, including huge online traffic, a huge number of microservices with complex call relationships and rapid iteration, and a huge R&D organization with a complex division of labor; at the same time we have a rare opportunity, namely a highly unified company-wide microservice infrastructure.

Facing this situation, the ByteDance link tracing system is built around a set of goals, some clear from the start of the project and some distilled from practice. They mainly include: a unified data model and semantics, open customization, centralized configuration control, and a one-stop observation platform; plus minimizing service integration overhead, balancing storage efficiency with retrieval requirements, disaster recovery completeness across multiple data centers, and minimizing architecture and dependency complexity.

We then shared the overall implementation of the ByteDance link tracing system. On the data collection side, the work focuses on the data model, semantic unification, sampling strategies, and circuit-breaker protection, with centralized configuration control. On the server side, the work focuses on balancing cost and latency, supporting both online retrieval and analytical computation, and ensuring disaster recovery completeness across multiple data centers.

Finally, we shared some practical applications of the ByteDance link tracing system, such as root-cause tracing for P99 slow requests, full-link real-time monitoring, full-link capacity estimation, and error propagation analysis.

Author: Intelligent Operation and Maintenance Team

Source: Official account "ByteDance Technical Team" (ID: BytedanceTechBlog)
