Design and practice of Hujiang full-link tracking system

Author: Flash Gene

Background

With the rapid growth of Hujiang's business, the call relationships between the company's services have become increasingly complex, and sorting out and tracing those relationships has become critical. Each online request passes through multiple business systems and touches various caches and databases, but the data this produces is scattered, which limits its usefulness for troubleshooting or process optimization. In such a scenario, a business flow is processed and passed along by many microservices, and we inevitably run into questions like these:

  • Which service did the traffic for a request come from, and which service did it end up in?
  • Why is this request so slow? Which step went wrong?
  • What does this operation depend on: a database or a message queue? If Redis goes down, which services are affected?

Design goals

  • Low overhead: The tracing system's impact on business systems should be small enough. In some highly optimized services, even slight overhead is easily noticed and may force the team responsible for the online deployment to turn the tracing system off.
  • Low intrusiveness: As a non-business component, it should intrude on business systems as little as possible, or not at all, staying transparent to users and reducing the burden on developers.
  • Timeliness: From data collection and generation, through computation and processing, to final presentation, everything should be as fast as possible.
  • Decision support: The data should be useful for decision-making, especially from a DevOps perspective.
  • Data visualization: Traces can be filtered and inspected visually, without reading logs.

Implemented functionality

  • Fault location: The logical path of a request is displayed completely and clearly.
  • Performance analysis: The time spent in each link of the call chain is recorded, so that performance bottlenecks can be identified and optimized in a targeted way.
  • Data analysis: A trace is a complete service log from which the behavior path of a request can be obtained, which is useful for aggregate analysis in many business scenarios.

Design ideas

The industry already has established practices and solutions for these problems. A call chain links together the whole course of a request, so that the request path can be monitored.

The distributed tracing systems known in the industry, such as Twitter's Zipkin and Taobao's EagleEye, all take their design ideas from the Google Dapper paper, "Dapper, a Large-Scale Distributed Systems Tracing Infrastructure".

Typical distributed invocation process

Figure 1: The path taken through a simple service system on behalf of user request X. The letter-labeled nodes represent different processes in a distributed system.

A tracing system for a distributed service needs to record information about all the work done in the system on behalf of a particular request. For example, Figure 1 shows a service involving five servers: a front end (A), two middle tiers (B and C), and two back ends (D and E). When a user (the initiator in this example) sends a request, it first reaches the front end, which then sends two RPCs to servers B and C. B responds immediately, but C must interact with the back ends D and E before replying to A, which then responds to the original request. For such a request, a simple and practical implementation of distributed tracing is to collect a trace record for every send and receive action on each server.
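
The request in Figure 1 can be modelled as a tree of spans linked by parent IDs. The following is a minimal Python sketch of that idea, not the Hujiang SDK: the `Span` record and its fields `span_id`, `parent_id`, and `service` are names assumed here for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    span_id: str
    parent_id: Optional[str]  # None marks the root span of the trace
    service: str


def build_children_index(spans):
    """Map each parent span ID to the list of its child spans."""
    children = defaultdict(list)
    for span in spans:
        children[span.parent_id].append(span)
    return children


def render_tree(spans):
    """Return the trace as indented lines, one per span, in depth-first order."""
    children = build_children_index(spans)
    lines = []

    def visit(span, depth):
        lines.append("  " * depth + span.service)
        for child in children.get(span.span_id, []):
            visit(child, depth + 1)

    for root in children.get(None, []):
        visit(root, 0)
    return lines


# Spans for the request in Figure 1: A calls B and C, and C calls D and E.
FIGURE_1_SPANS = [
    Span("1", None, "A"),
    Span("2", "1", "B"),
    Span("3", "1", "C"),
    Span("4", "3", "D"),
    Span("5", "3", "E"),
]

if __name__ == "__main__":
    print("\n".join(render_tree(FIGURE_1_SPANS)))
```

Because every span carries only its own ID and its parent's ID, each server can report independently and the tree is reassembled later at query time.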

Invocation link diagram

  • cs - CLIENT_SEND, the client sends a request
  • sr - SERVER_RECEIVE, the server receives the request
  • ss - SERVER_SEND, the server finishes processing and sends the result to the client
  • cr - CLIENT_RECEIVE, the client receives the response
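
The four annotations are enough to split the time of one call into server processing and network transfer. The sketch below assumes each annotation is a timestamp in milliseconds; the function name `span_timings` and the returned keys are illustrative. It only subtracts timestamps taken on the same machine (cr minus cs on the client, ss minus sr on the server), so the result does not depend on the two clocks being synchronized.

```python
def span_timings(cs, sr, ss, cr):
    """Derive durations for one call from its four annotation timestamps.

    cs and cr are read from the client clock, sr and ss from the server clock.
    """
    total = cr - cs             # what the client observed end to end
    server = ss - sr            # time spent processing on the server
    network = total - server    # request plus response time on the wire
    if total < 0 or server < 0 or network < 0:
        raise ValueError("inconsistent annotation timestamps")
    return {"total": total, "server": server, "network": network}


if __name__ == "__main__":
    print(span_timings(cs=0, sr=10, ss=60, cr=75))
```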
Technology selection

Given that the company's current workloads are dominated by HTTP requests, we decided to follow Zipkin's implementation approach and to adopt the OpenTracing standard so that clients in multiple languages are supported.

System design

Overall architecture diagram and description

An end-to-end tracing system generally has four main parts: instrumentation, data transmission, data storage, and the query interface.

Instrumentation

  • The SDK is integrated into the Hujiang unified development framework for low-intrusion data collection.
  • AOP aspects collect the data and store it in a thread-local variable (ThreadLocal), which is transparent to the application.
  • Each record contains the TraceId, event, application, interface, start time, and elapsed time.
  • Data is sent to a Kafka queue through an asynchronous thread queue to reduce the impact on the business.

The following middleware is currently supported:

  • HTTP middleware
  • MySQL middleware
  • RabbitMQ middleware

Data transmission

We placed a Kafka layer between the SDK and the backend service. This decouples the two and also allows data to be consumed with a delay, which smooths out traffic peaks and troughs. We do not want to lose data when instantaneous QPS spikes, although this comes at a cost.
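
The peak-shaving effect of a buffer can be illustrated with a small simulation. It does not use Kafka itself: `arrivals` is an assumed list of messages produced per tick and `consume_rate` an assumed steady consumer capacity. One cost such a buffer implies is delay, since data from a burst only becomes visible once the backlog has drained.

```python
def simulate_backlog(arrivals, consume_rate):
    """Return the backlog left in the queue after each tick."""
    backlog = 0
    history = []
    for arrived in arrivals:
        backlog += arrived                      # the burst lands in the queue
        backlog -= min(backlog, consume_rate)   # the consumer drains at its own pace
        history.append(backlog)
    return history


if __name__ == "__main__":
    # A spike of 900 messages in the second tick, consumer handles 300 per tick.
    print(simulate_backlog([100, 900, 100, 100, 100], consume_rate=300))
```

In this example nothing is lost during the spike, but the queue needs three more ticks to catch up.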

[Figure: Kafka Manager view]

Data storage

Data is stored in Elasticsearch, which mainly holds Span and Annotation data. Given the data volume, only the most recent month of data is kept in the initial stage.
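
One common way to keep only recent data is to write to one index per day and delete whole indices once they age out. The article does not say how retention is implemented, so everything in this sketch is an assumption: the `trace-span-` prefix, the daily index layout, and the approximation of one month as 30 days. It only selects the index names to delete; the deletion call itself is left out.

```python
from datetime import date, timedelta

INDEX_PREFIX = "trace-span-"  # hypothetical naming scheme, not from the article
RETENTION_DAYS = 30           # "the most recent month", approximated as 30 days


def index_for(day):
    """Name of the daily index that holds spans recorded on `day`."""
    return INDEX_PREFIX + day.isoformat()


def expired_indices(existing, today, retention_days=RETENTION_DAYS):
    """Return the indices whose day is older than the retention window."""
    cutoff = today - timedelta(days=retention_days)
    expired = []
    for name in existing:
        if not name.startswith(INDEX_PREFIX):
            continue  # ignore indices that do not belong to the tracing system
        day = date.fromisoformat(name[len(INDEX_PREFIX):])
        if day < cutoff:
            expired.append(name)
    return sorted(expired)
```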

[Figure: data shown in Elasticsearch Head]

Query interface

A visual web interface is used to query distributed call chains, and it also provides dependency aggregation for analysis at the project level.

[Figure: Query page]

[Figure: Trace tree]

[Figure: Dependency analysis]
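
Dependency aggregation can be derived from the stored spans by counting parent-to-child edges that cross a service boundary. The sketch below assumes each span is a dict with `span_id`, `parent_id`, and `service` keys and that span IDs are unique across traces; these field names are illustrative and not taken from the article.

```python
from collections import Counter


def aggregate_dependencies(spans):
    """Count calls between services as {(caller, callee): number_of_calls}."""
    service_of = {span["span_id"]: span["service"] for span in spans}
    links = Counter()
    for span in spans:
        parent = span.get("parent_id")
        if parent is None or parent not in service_of:
            continue  # root span, or the parent span was never collected
        caller, callee = service_of[parent], span["service"]
        if caller != callee:
            links[(caller, callee)] += 1  # only count calls that cross services
    return links
```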

Pitfalls we encountered

Web page load timeouts

Issue

When using the official Zipkin UI, the page occasionally timed out while loading.

Solution

The cause was that Zipkin loads all the spans of a project at once when the page loads, and a project that uses a RESTful API can generate millions of spans. We rewrote the UI page to load lazily: it shows the 10 most recent spans by default and lets users type characters to query spans dynamically.
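
The lazy-loading behaviour described above can be sketched as a bounded search over span entries instead of returning all of them. This is an illustration only: the function name, the assumption that the entries are span name strings ordered from most to least recent, and the use of substring matching are not taken from the rewritten UI.

```python
def search_span_names(recent_names, query="", limit=10):
    """Return at most `limit` span names matching what the user has typed.

    `recent_names` is assumed to be ordered from most to least recently seen.
    """
    wanted = query.lower()
    matches = []
    for name in recent_names:
        if wanted in name.lower():
            matches.append(name)
            if len(matches) == limit:
                break  # stop early; never materialise the full list
    return matches
```

With an empty query the function returns the first 10 entries, which corresponds to the default view; each keystroke then narrows the result.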

Too many spans piling up

Issue

When querying a trace on the page, we found that a single call chain had accumulated thousands of spans.

Solution

Troubleshooting showed that when the business side used the HttpClient middleware to send an HTTP request that timed out, the SDK did not intercept the timeout exception, so the event stayed in the thread's ThreadLocal. After finding this, we changed the SDK to intercept request exceptions and clear the corresponding event from ThreadLocal.
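
The shape of this fix can be shown with a `try`/`finally` around the outgoing call, again using Python's `threading.local` as a stand-in for ThreadLocal. The names `pending_events`, `traced_call`, and the `report` callback are hypothetical. The point is that cleanup runs on the exception path too, so a timed-out request cannot leave its event behind on a pooled thread.

```python
import threading

_local = threading.local()  # per-thread buffer, the analogue of ThreadLocal


def pending_events():
    """Events currently buffered for the calling thread."""
    if not hasattr(_local, "events"):
        _local.events = []
    return _local.events


def traced_call(name, send_request, report):
    """Record an event around `send_request` and always clear it afterwards."""
    events = pending_events()
    event = {"name": name, "error": None}
    events.append(event)
    try:
        return send_request()
    except Exception as exc:
        event["error"] = repr(exc)  # keep the failure on the event
        raise                       # the business code still sees the exception
    finally:
        report(event)
        events.remove(event)        # without this, failed calls accumulate per thread
```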

Summary

The key to an end-to-end tracing system is the trace itself.

Each request generates a globally unique TraceID that links the different systems together. This enables call-chain tracing and path analysis, helping business teams quickly locate performance bottlenecks and find the causes of faults.
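
For HTTP services, one way to carry the TraceID from system to system is in request headers. The sketch below uses Zipkin's B3 header names (`X-B3-TraceId`, `X-B3-SpanId`, `X-B3-ParentSpanId`); the article does not state which headers the Hujiang SDK uses, so treat that choice and all function names here as assumptions.

```python
import secrets


def new_id():
    """Return a random 64-bit ID as 16 lower-case hex characters."""
    return secrets.token_hex(8)


def start_trace():
    """Context for a request that arrives without any trace headers."""
    return {"trace_id": new_id(), "span_id": new_id(), "parent_id": None}


def child_context(parent):
    """Context for an outgoing call: same trace ID, new span ID, caller as parent."""
    return {
        "trace_id": parent["trace_id"],
        "span_id": new_id(),
        "parent_id": parent["span_id"],
    }


def inject(context, headers):
    """Write the context into the headers of an outgoing HTTP request."""
    headers["X-B3-TraceId"] = context["trace_id"]
    headers["X-B3-SpanId"] = context["span_id"]
    if context["parent_id"] is not None:
        headers["X-B3-ParentSpanId"] = context["parent_id"]
    return headers


def extract(headers):
    """Read the context from incoming headers, or start a new trace.

    `headers` is a plain dict here, so lookups are case-sensitive; a real
    server would use a case-insensitive header mapping.
    """
    trace_id = headers.get("X-B3-TraceId")
    if trace_id is None:
        return start_trace()
    return {
        "trace_id": trace_id,
        "span_id": headers.get("X-B3-SpanId") or new_id(),
        "parent_id": headers.get("X-B3-ParentSpanId"),
    }
```

A service calls `extract` on the incoming request, then `inject(child_context(ctx), headers)` on every request it sends, so all spans of one request share the same trace ID.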

Resources

  • Google Dapper: http://bigbully.github.io/Dapper-translation/
  • Twitter Zipkin: http://zipkin.io/
  • 窝窝网 tracing article: http://www.cnblogs.com/zhengyun_ustc/p/55solution2.html

Author: Lao Cao

Source: WeChat public account Hujiang Technology

Source: https://mp.weixin.qq.com/s/uYTVNMpX8Ci0oFvqEF_jlA
