Background
With the rapid growth of Hujiang's business, the call relationships between the company's services have become increasingly complex, and sorting out and tracking those relationships has become critical. Each online request passes through multiple business systems and touches various caches and databases, but this scattered data is of limited help for troubleshooting or process optimization. In such a complex business scenario, a business flow is processed and handed off by many microservices, and we inevitably run into questions like:
- Which service does the traffic for a request come from? Which service did it end up in?
- Why is this request so slow? What went wrong?
- What does this service depend on? A database, or a message queue? If Redis goes down, which services are affected?
Design goals
- Low overhead: The impact of the tracing system on business systems must be small enough. In highly optimized services, even a slight performance loss is easily noticed, and it may force the team operating the service to switch the tracing system off
- Low invasiveness: As a non-business component, it should intrude on business systems as little as possible (ideally not at all), stay transparent to users, and reduce the burden on developers
- Timeliness: From data collection and generation, through computation and processing, to final presentation, everything should be as fast as possible
- Decision support: The data should be useful at the decision-making level, especially from a DevOps perspective
- Data visualization: Problems can be filtered and located through visualization instead of digging through logs
Implemented features
- Fault location: The logical trace of a request can be displayed completely and clearly.
- Performance analysis: Each hop of the call chain records its elapsed time, so performance bottlenecks in the system can be analyzed and optimized in a targeted manner.
- Data analysis: A trace is a complete log of a request's journey through the services; the behavior path it captures can be used for aggregate analysis in many business scenarios
Design ideas
For the problems above, the industry already has concrete practices and solutions. A call chain stitches together the full path of a request, so that the request link can be monitored end to end.
In the industry, the known distributed tracing systems, such as Twitter's Zipkin and Taobao's EagleEye (鹰眼), all take their design ideas from Google's paper "Dapper, a Large-Scale Distributed Systems Tracing Infrastructure"
Typical distributed invocation process
Figure 1: The path of a request X initiated by a user through a simple service system. Nodes identified by letters represent different processes in the distributed system.
A distributed tracing system needs to record information about all of the work done in the system on behalf of a particular request. For example, Figure 1 shows a request involving five servers: the front end (A), two middle tiers (B and C), and two back ends (D and E). When a user (the initiator in this use case) makes a request, it first reaches the front end, which then sends two RPCs to servers B and C. B responds immediately, but C needs to interact with back ends D and E before replying to A, which finally answers the original request. For such a request, a simple and practical way to implement end-to-end tracing is to collect and record every send and receive action on each server.
Invocation link diagram
cs - CLIENT_SEND, the client initiates the request
sr - SERVER_RECEIVE, the server receives the request
ss - SERVER_SEND, the server sends back the response
cr - CLIENT_RECEIVE, the client receives the response
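From these four annotation timestamps, the per-span timings can be derived: the caller-observed latency is cr - cs, the server-side processing time is ss - sr, and the difference approximates network overhead. A minimal sketch (class and method names are illustrative, not the actual SDK API):

```java
import java.util.EnumMap;
import java.util.Map;

// Minimal sketch of Zipkin's four core span annotations and the
// durations derived from them. Names are illustrative only.
public class SpanTimings {
    public enum Annotation { CS, SR, SS, CR } // client send/recv, server recv/send

    private final Map<Annotation, Long> timestamps = new EnumMap<>(Annotation.class);

    public void record(Annotation a, long epochMillis) {
        timestamps.put(a, epochMillis);
    }

    // Total time observed by the caller: cr - cs
    public long clientDuration() {
        return timestamps.get(Annotation.CR) - timestamps.get(Annotation.CS);
    }

    // Time spent inside the callee: ss - sr
    public long serverDuration() {
        return timestamps.get(Annotation.SS) - timestamps.get(Annotation.SR);
    }

    // Rough network overhead: client total minus server processing
    public long networkOverhead() {
        return clientDuration() - serverDuration();
    }

    public static void main(String[] args) {
        SpanTimings span = new SpanTimings();
        span.record(Annotation.CS, 1000);
        span.record(Annotation.SR, 1010);
        span.record(Annotation.SS, 1050);
        span.record(Annotation.CR, 1060);
        System.out.println("client=" + span.clientDuration()
                + " server=" + span.serverDuration()
                + " network=" + span.networkOverhead());
    }
}
```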
Technology selection
In summary, given that the company's current traffic is dominated by HTTP requests, we finally decided to follow Zipkin's implementation approach, while adopting the OpenTracing standard for compatibility with multi-language clients
System design
Overall architecture diagram and description
Generally, an end-to-end tracing system has four main parts: data instrumentation, data transmission, data storage, and a query interface
Data instrumentation
- Integrate SDKs into the Hujiang Unified Development Framework for low-intrusive data collection
- Data is collected via AOP aspects and stored in a ThreadLocal thread-local variable, transparent to the application
- Each record contains the TraceId, event, application, interface, start time, and elapsed time
- Data is sent to the Kafka queue by an asynchronous thread to reduce the impact on the business
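The collection path described above can be sketched roughly as follows: an AOP-style interceptor records events into a ThreadLocal buffer, and when the request finishes, the buffered trace is handed to an in-memory queue that a background thread would drain to Kafka. All names here are illustrative assumptions, not the real SDK:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch of ThreadLocal-based event collection with asynchronous hand-off.
// Hypothetical class/method names; the real SDK integrates via AOP aspects.
public class TraceCollector {
    // Per-thread event buffer, transparent to business code.
    private static final ThreadLocal<List<String>> EVENTS =
            ThreadLocal.withInitial(ArrayList::new);

    // Stands in for the async sender queue in front of Kafka.
    static final BlockingQueue<List<String>> SEND_QUEUE = new LinkedBlockingQueue<>();

    public static void record(String event) {
        EVENTS.get().add(event);
    }

    // Called when the request finishes: enqueue a copy for async sending,
    // and always clear the ThreadLocal so pooled threads do not leak events.
    public static void finish() {
        try {
            SEND_QUEUE.offer(new ArrayList<>(EVENTS.get()));
        } finally {
            EVENTS.remove();
        }
    }
}
```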
Currently, the following middleware is supported:
- HTTP middleware
- MySQL middleware
- RabbitMQ middleware
Data transmission
We add a layer of Kafka between the SDK and the backend service. This not only decouples the two sides but also allows delayed consumption of the data, smoothing out traffic peaks. We do not want to lose data because of a high instantaneous QPS, though of course this comes at a cost.
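The trade-off mentioned above can be illustrated with a bounded buffer: it absorbs bursts, but once it fills up the producer must choose between blocking the business thread and dropping data. This is an in-memory stand-in for the Kafka layer, for illustration only:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Illustration of the peak-shaving trade-off with a bounded in-memory
// queue. This is NOT the real Kafka transport; it only demonstrates the
// block-vs-drop decision a buffering layer forces on the producer.
public class SpanBuffer {
    private final BlockingQueue<String> queue;
    private long dropped = 0;

    public SpanBuffer(int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    // Non-blocking send: never stalls the business thread, but may drop
    // spans when a burst exceeds the buffer's capacity.
    public boolean trySend(String span) {
        boolean ok = queue.offer(span);
        if (!ok) dropped++;
        return ok;
    }

    public String drainOne()   { return queue.poll(); }
    public long droppedCount() { return dropped; }
}
```

With a real Kafka producer the same decision appears as configuration (buffer sizes, blocking behavior, acknowledgement levels) rather than application code.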
Kafka Manager view
Data storage
The data store uses Elasticsearch, which mainly holds Span- and Annotation-related data. Considering the scale of the data volume, only the most recent month of data is kept in the early stage.
Elasticsearch Head data view
Query interface
A visual web interface is provided for querying distributed call links; dependency aggregation is also provided for analysis at the project level
Query page
Trace tree
Dependency analysis
Pitfalls we encountered
Web page load timeouts
Issue
When using the official Zipkin UI, pages would occasionally time out while loading.
Solution
The reason is that Zipkin loads all of a project's spans at once when rendering the page, and a project exposing a RESTful API can generate millions of spans. We therefore rewrote the UI page to use lazy loading: it displays the 10 most recent spans by default and supports dynamically querying spans as characters are typed.
Too many spans pile up
Issue
When querying a trace on the page, we found that a single link had accumulated thousands of spans
Solution
After troubleshooting, we found that when a business side used the HttpClient middleware and an HTTP request timed out, the SDK did not intercept the timeout exception, so the event remained stored in the thread's ThreadLocal. Once this was discovered, the SDK was changed to intercept request exceptions and clear the corresponding event from ThreadLocal.
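The fix described above amounts to wrapping the instrumented call in try/finally so the per-thread buffer is cleared even when the request throws. A sketch with hypothetical names (the real SDK hooks in via its HttpClient middleware):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the exception-safe interceptor: clear the per-thread event
// buffer on success AND on timeout/exception, otherwise events pile up
// on pooled threads. Names are illustrative, not the real SDK API.
public class SafeHttpInterceptor {
    static final ThreadLocal<List<String>> EVENTS =
            ThreadLocal.withInitial(ArrayList::new);

    interface HttpCall { String execute() throws Exception; }

    public static String traced(HttpCall call) throws Exception {
        EVENTS.get().add("cs");
        try {
            String response = call.execute();
            EVENTS.get().add("cr");
            return response;
        } finally {
            // Runs whether the call returned or threw: report the buffered
            // events (omitted here) and clear the ThreadLocal.
            EVENTS.remove();
        }
    }

    public static int pendingEvents() {
        return EVENTS.get().size();
    }
}
```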
Summary
The key to an end-to-end tracing system is the trace itself: each request generates a globally unique TraceID that ties the different systems together. Link tracking and path analysis built on top of it help business engineers quickly locate performance bottlenecks and troubleshoot fault causes.
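A minimal sketch of how such a globally unique TraceID can be generated at the entry service and reused by every downstream hop via a request header. The header name follows Zipkin's B3 convention; the helper methods themselves are assumptions for illustration:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;

// Sketch of TraceID propagation: the entry service generates the id once,
// downstream hops reuse it from the inbound headers. Header name follows
// Zipkin's B3 convention; helper names are illustrative.
public class TraceIdPropagation {
    static final String TRACE_HEADER = "X-B3-TraceId";

    // 64-bit random id rendered as 16 hex characters, as in classic Zipkin.
    public static String newTraceId() {
        return String.format("%016x", ThreadLocalRandom.current().nextLong());
    }

    // Reuse the inbound id if present, otherwise start a new trace.
    public static String extractOrCreate(Map<String, String> headers) {
        return headers.computeIfAbsent(TRACE_HEADER, k -> newTraceId());
    }

    public static void main(String[] args) {
        Map<String, String> headers = new HashMap<>();
        String id = extractOrCreate(headers);               // entry service
        System.out.println("trace id for this request: " + id);
        System.out.println("downstream sees: " + extractOrCreate(headers));
    }
}
```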
Resources
- Google Dapper http://bigbully.github.io/Dapper-translation/
- Twitter http://zipkin.io/
- Wowo (窝窝网) tracing article http://www.cnblogs.com/zhengyun_ustc/p/55solution2.html
Author: Lao Cao
Source: WeChat official account "Hujiang Technology"
Source: https://mp.weixin.qq.com/s/uYTVNMpX8Ci0oFvqEF_jlA