Background
With the rapid growth of Hujiang's business, the call relationships between the company's services have become increasingly complex, and sorting out and tracking those relationships has become critical. Each online request passes through multiple business systems and touches various caches and databases, but this scattered data is of limited help for troubleshooting or process optimization. In such a complex business scenario, a business flow is processed and handed off by many microservices, and we inevitably run into questions like:
- Which service does the traffic for a request come from? Which service did it end up in?
- Why is this request so slow? What went wrong?
- What does this service depend on? A database, or a message queue? If Redis goes down, which services are affected?
Design goals
- Low overhead: The impact of the tracing system on business systems must be small enough. In highly optimized services, even a slight performance loss is easily noticed, and it may force the team operating the service to switch the tracing system off
- Low invasiveness: As a non-business component, it should intrude on business systems as little as possible (ideally not at all), stay transparent to users, and reduce the burden on developers
- Timeliness: From data collection and generation, through computation and processing, to final presentation, everything should be as fast as possible
- Decision support: The data should be useful at the decision-making level, especially from a DevOps perspective
- Data visualization: Problems can be filtered and located through visualization instead of digging through logs
Implemented features
- Fault location: The logical trace of a request can be displayed completely and clearly.
- Performance analysis: Each hop of the call chain records its elapsed time, so performance bottlenecks in the system can be analyzed and optimized in a targeted manner.
- Data analysis: A trace is a complete log of a request's journey through the services; the behavior path it captures can be used for aggregate analysis in many business scenarios
Design ideas
For the problems above, the industry already has concrete practices and solutions. A call chain stitches together the full path of a request, so that the request link can be monitored end to end.
In the industry, the known distributed tracing systems, such as Twitter's Zipkin and Taobao's EagleEye (鹰眼), all take their design ideas from Google's paper "Dapper, a Large-Scale Distributed Systems Tracing Infrastructure"
Typical distributed invocation process
Figure 1: The path of a request X initiated by a user through a simple service system. Nodes identified by letters represent different processes in the distributed system.
A distributed tracing system needs to record information about all of the work done in the system on behalf of a particular request. For example, Figure 1 shows a request involving five servers: the front end (A), two middle tiers (B and C), and two back ends (D and E). When a user (the initiator in this use case) makes a request, it first reaches the front end, which then sends two RPCs to servers B and C. B responds immediately, but C needs to interact with back ends D and E before replying to A, which finally answers the original request. For such a request, a simple and practical way to implement end-to-end tracing is to collect and record every send and receive action on each server.
Invocation link diagram
cs - CLIENT_SEND, the client initiates the request
sr - SERVER_RECEIVE, the server receives the request
ss - SERVER_SEND, the server sends back the response
cr - CLIENT_RECEIVE, the client receives the response
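From these four annotation timestamps, the per-span timings can be derived: the caller-observed latency is cr - cs, the server-side processing time is ss - sr, and the difference approximates network overhead. A minimal sketch (class and method names are illustrative, not the actual SDK API):

```java
import java.util.EnumMap;
import java.util.Map;

// Minimal sketch of Zipkin's four core span annotations and the
// durations derived from them. Names are illustrative only.
public class SpanTimings {
    public enum Annotation { CS, SR, SS, CR } // client send/recv, server recv/send

    private final Map<Annotation, Long> timestamps = new EnumMap<>(Annotation.class);

    public void record(Annotation a, long epochMillis) {
        timestamps.put(a, epochMillis);
    }

    // Total time observed by the caller: cr - cs
    public long clientDuration() {
        return timestamps.get(Annotation.CR) - timestamps.get(Annotation.CS);
    }

    // Time spent inside the callee: ss - sr
    public long serverDuration() {
        return timestamps.get(Annotation.SS) - timestamps.get(Annotation.SR);
    }

    // Rough network overhead: client total minus server processing
    public long networkOverhead() {
        return clientDuration() - serverDuration();
    }

    public static void main(String[] args) {
        SpanTimings span = new SpanTimings();
        span.record(Annotation.CS, 1000);
        span.record(Annotation.SR, 1010);
        span.record(Annotation.SS, 1050);
        span.record(Annotation.CR, 1060);
        System.out.println("client=" + span.clientDuration()
                + " server=" + span.serverDuration()
                + " network=" + span.networkOverhead());
    }
}
```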
Technology selection
In summary, given that the company's current traffic is dominated by HTTP requests, we finally decided to follow Zipkin's implementation approach, while adopting the OpenTracing standard for compatibility with multi-language clients
System design
Overall architecture diagram and description
Generally, an end-to-end tracing system has four main parts: data instrumentation, data transmission, data storage, and a query interface
Data instrumentation
- Integrate SDKs into the Hujiang Unified Development Framework for low-intrusive data collection
- Data is collected via AOP aspects and stored in a ThreadLocal thread-local variable, transparent to the application
- Each record contains the TraceId, event, application, interface, start time, and elapsed time
- Data is sent to the Kafka queue by an asynchronous thread to reduce the impact on the business
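The collection path described above can be sketched roughly as follows: an AOP-style interceptor records events into a ThreadLocal buffer, and when the request finishes, the buffered trace is handed to an in-memory queue that a background thread would drain to Kafka. All names here are illustrative assumptions, not the real SDK:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch of ThreadLocal-based event collection with asynchronous hand-off.
// Hypothetical class/method names; the real SDK integrates via AOP aspects.
public class TraceCollector {
    // Per-thread event buffer, transparent to business code.
    private static final ThreadLocal<List<String>> EVENTS =
            ThreadLocal.withInitial(ArrayList::new);

    // Stands in for the async sender queue in front of Kafka.
    static final BlockingQueue<List<String>> SEND_QUEUE = new LinkedBlockingQueue<>();

    public static void record(String event) {
        EVENTS.get().add(event);
    }

    // Called when the request finishes: enqueue a copy for async sending,
    // and always clear the ThreadLocal so pooled threads do not leak events.
    public static void finish() {
        try {
            SEND_QUEUE.offer(new ArrayList<>(EVENTS.get()));
        } finally {
            EVENTS.remove();
        }
    }
}
```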
Currently, the following middleware is supported:
- HTTP middleware
- MySQL middleware
- RabbitMQ middleware
Data transmission
We add a layer of Kafka between the SDK and the backend service. This not only decouples the two sides but also allows delayed consumption of the data, smoothing out traffic peaks. We do not want to lose data because of a high instantaneous QPS, though of course this comes at a cost.
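The trade-off mentioned above can be illustrated with a bounded buffer: it absorbs bursts, but once it fills up the producer must choose between blocking the business thread and dropping data. This is an in-memory stand-in for the Kafka layer, for illustration only:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Illustration of the peak-shaving trade-off with a bounded in-memory
// queue. This is NOT the real Kafka transport; it only demonstrates the
// block-vs-drop decision a buffering layer forces on the producer.
public class SpanBuffer {
    private final BlockingQueue<String> queue;
    private long dropped = 0;

    public SpanBuffer(int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    // Non-blocking send: never stalls the business thread, but may drop
    // spans when a burst exceeds the buffer's capacity.
    public boolean trySend(String span) {
        boolean ok = queue.offer(span);
        if (!ok) dropped++;
        return ok;
    }

    public String drainOne()   { return queue.poll(); }
    public long droppedCount() { return dropped; }
}
```

With a real Kafka producer the same decision appears as configuration (buffer sizes, blocking behavior, acknowledgement levels) rather than application code.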
Kafka Manager view
Data storage
The data store uses Elasticsearch, which mainly holds Span- and Annotation-related data. Considering the scale of the data volume, only the most recent month of data is kept in the early stage.
Elasticsearch Head data view
Query interface
A visual web interface is provided for querying distributed call links; dependency aggregation is also provided for analysis at the project level
Query page
Trace tree
Dependency analysis
Pitfalls we encountered
Web page load timeouts
Issue
When using the official Zipkin UI, pages would occasionally time out while loading.
Solution
The reason is that Zipkin loads all of a project's spans at once when rendering the page, and a project exposing a RESTful API can generate millions of spans. We therefore rewrote the UI page to use lazy loading: it displays the 10 most recent spans by default and supports dynamically querying spans as characters are typed.
Too many spans pile up
Issue
When querying a trace on the page, we found that a single link had accumulated thousands of spans
Solution
After troubleshooting, we found that when a business side used the HttpClient middleware and an HTTP request timed out, the SDK did not intercept the timeout exception, so the event remained stored in the thread's ThreadLocal. Once this was discovered, the SDK was changed to intercept request exceptions and clear the corresponding event from ThreadLocal.
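The fix described above amounts to wrapping the instrumented call in try/finally so the per-thread buffer is cleared even when the request throws. A sketch with hypothetical names (the real SDK hooks in via its HttpClient middleware):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the exception-safe interceptor: clear the per-thread event
// buffer on success AND on timeout/exception, otherwise events pile up
// on pooled threads. Names are illustrative, not the real SDK API.
public class SafeHttpInterceptor {
    static final ThreadLocal<List<String>> EVENTS =
            ThreadLocal.withInitial(ArrayList::new);

    interface HttpCall { String execute() throws Exception; }

    public static String traced(HttpCall call) throws Exception {
        EVENTS.get().add("cs");
        try {
            String response = call.execute();
            EVENTS.get().add("cr");
            return response;
        } finally {
            // Runs whether the call returned or threw: report the buffered
            // events (omitted here) and clear the ThreadLocal.
            EVENTS.remove();
        }
    }

    public static int pendingEvents() {
        return EVENTS.get().size();
    }
}
```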
Summary
The key to an end-to-end tracing system is the trace itself: each request generates a globally unique TraceID that ties the different systems together. Link tracking and path analysis built on top of it help business engineers quickly locate performance bottlenecks and troubleshoot fault causes.
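A minimal sketch of how such a globally unique TraceID can be generated at the entry service and reused by every downstream hop via a request header. The header name follows Zipkin's B3 convention; the helper methods themselves are assumptions for illustration:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;

// Sketch of TraceID propagation: the entry service generates the id once,
// downstream hops reuse it from the inbound headers. Header name follows
// Zipkin's B3 convention; helper names are illustrative.
public class TraceIdPropagation {
    static final String TRACE_HEADER = "X-B3-TraceId";

    // 64-bit random id rendered as 16 hex characters, as in classic Zipkin.
    public static String newTraceId() {
        return String.format("%016x", ThreadLocalRandom.current().nextLong());
    }

    // Reuse the inbound id if present, otherwise start a new trace.
    public static String extractOrCreate(Map<String, String> headers) {
        return headers.computeIfAbsent(TRACE_HEADER, k -> newTraceId());
    }

    public static void main(String[] args) {
        Map<String, String> headers = new HashMap<>();
        String id = extractOrCreate(headers);               // entry service
        System.out.println("trace id for this request: " + id);
        System.out.println("downstream sees: " + extractOrCreate(headers));
    }
}
```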
Resources
- Google Dapper http://bigbully.github.io/Dapper-translation/
- Twitter http://zipkin.io/
- Wowo (窝窝网) tracing article http://www.cnblogs.com/zhengyun_ustc/p/55solution2.html
Author: Lao Cao
Source: WeChat official account "Hujiang Technology"
Source: https://mp.weixin.qq.com/s/uYTVNMpX8Ci0oFvqEF_jlA