Construction of Qudian link tracking system

More and more complex systems are adopting a microservice architecture. For development teams, microservices bring many advantages, such as easier maintenance, higher development efficiency, multi-language support, and loose coupling. For operations and maintenance (O&M) teams, however, they add many challenges, such as:

• Complexity of O&M for distributed systems
• Long call links that are difficult to monitor
• Errors that are difficult to trace along the link
• Problems in underlying modules that can easily cause avalanches

The Tracing Analysis System is designed to solve the problem of performance analysis and fault tracing in the microservice architecture.

Principle

Today's distributed tracing systems are mostly based on Google's Dapper paper (see the link in Appendix 1), which is simple and clear. Here is just a brief introduction:

The core of a tracing system is to reconstruct the call links between systems, similar to building a doubly linked list. For example, when module A calls module B, module A passes its span ID to module B along with the call, and module B records module A's span ID as its parent ID. This process is depicted in the following diagram:

Figure 1 (from Google Dapper paper)
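The propagation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the header names `X-Trace-Id` and `X-Span-Id`, the 16-character ID format, and all function names are hypothetical and are not taken from the actual Qudian SDK.

```python
import uuid
from dataclasses import dataclass
from typing import Optional


@dataclass
class SpanContext:
    trace_id: str             # shared by every span of one external request
    span_id: str              # identifies this particular call
    parent_id: Optional[str]  # span ID of the caller; None at the entry point


def new_id() -> str:
    """Return a random 16-hex-character ID (the format is an assumption)."""
    return uuid.uuid4().hex[:16]


def start_trace() -> SpanContext:
    """Create the root span at the entry point of an external request."""
    return SpanContext(trace_id=new_id(), span_id=new_id(), parent_id=None)


def outgoing_headers(ctx: SpanContext) -> dict:
    """Headers module A attaches to its call (header names are hypothetical)."""
    return {"X-Trace-Id": ctx.trace_id, "X-Span-Id": ctx.span_id}


def from_incoming_headers(headers: dict) -> SpanContext:
    """Module B keeps the trace ID and records A's span ID as its parent ID."""
    return SpanContext(
        trace_id=headers["X-Trace-Id"],
        span_id=new_id(),
        parent_id=headers["X-Span-Id"],
    )
```

With this shape, module A calls `outgoing_headers(ctx_a)` when issuing the request, and module B calls `from_incoming_headers(...)` when it receives it, so every span in the chain points back to its caller.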

Research on open source solutions

We built the system relatively early, around 2017. At that time we mainly investigated two open source solutions, Zipkin and CAT. Both support Java well, but their support for php-fpm, which our main system runs on, was relatively poor, and adopting them would have required substantial changes to the system. If your system is written in Java, you can consider these two solutions. The more recently popular SkyWalking is also worth considering, and some of our online systems use it as well.

After weighing implementation cost, security, and the difficulty of in-house development, we decided to build our own. The principle is relatively simple, so the cost of in-house development is relatively low.

Architectural design

The whole solution is roughly divided into three parts:

Business layer: responsible for recording call logs

This layer is the foundation of the entire solution, and missing fields will limit later analysis. After discussion, we decided to record the following 5 basic tracing fields and the key performance analysis fields in this layer:

5 basic tracing fields:

• Span ID: the ID of the request
• Span HostName: the name (or IP address) of the machine that handled the request, which identifies where the request occurred and makes it easier to find its logs
• Trace ID: identifies one external request. It is usually assigned at the entry point and passed along with every call, so all spans initiated by the same external request share the same trace ID, and the spans can be linked together through it.

Key fields for performance analysis:

• URI: the request URI
• Process Time: the processing time of the URI, which represents how efficiently the URI is handled
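As a rough illustration of what one such log record could look like, the sketch below writes the fields listed above as a single JSON line. The JSON layout, the key names, and the function name are assumptions made for this example; the article does not specify the actual log format.

```python
import json
import socket


def access_log_line(trace_id, span_id, uri, started_at, finished_at, hostname=None):
    """Format one access-log record as a JSON line (layout is an assumption).

    started_at and finished_at are timestamps in seconds.
    """
    record = {
        "trace_id": trace_id,
        "span_id": span_id,
        # Fall back to the local machine name when no hostname is given.
        "hostname": hostname or socket.gethostname(),
        "uri": uri,
        "process_time_ms": round((finished_at - started_at) * 1000, 3),
    }
    return json.dumps(record, sort_keys=True)
```

One record per line keeps the file easy for a log shipper to tail and for Elasticsearch to index field by field.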

Log aggregation layer: responsible for aggregating logs into the ES cluster

For this layer we use a Filebeat -> Kafka -> Logstash -> Elasticsearch pipeline.

Presentation layer: responsible for link log analysis, monitoring, and alerting based on the data in the ES cluster

The overall solution is shown below:

This solution has a number of advantages, such as:

1. The solution has no performance bottleneck, and each module can be scaled out;
2. Business performance is essentially unaffected, because no additional network requests are added;
3. Business safety is guaranteed: if the tracing system goes down, the business system is not affected.

Implementation

Designing the solution is only the starting point; the implementation that follows is the real focus. If the implementation is too complex or takes too long, the project is likely to die, so we simplified the implementation wherever we could.

Logging

This part of the work is mainly carried out by the R&D team. Because our microservice modules use a fairly unified framework, we modified the SDK for calls between modules and the logging SDK, so developers only need to integrate these two SDKs, which is very simple.

These two SDKs only adjust the logging format, so they do not affect performance.

There are two kinds of logs that we focus on: access logs and business logs.

• Access logs: the request log needs to record at least the 7 fields described above. It can be recorded at the Tomcat or nginx level, or in the business framework;
• Business logs: the business log needs to carry at least the span ID field. Business logs are usually fairly large, so here we only collect error logs.

Log collection

There is a relatively mature architecture for log collection:

Filebeat → Kafka → Logstash → Elasticsearch

During implementation, Elasticsearch is the module that most needs tuning and monitoring, because logs must be collected in near real time. When log volume is large, bottlenecks easily appear and delay the logs, which in turn affects the query and alerting services.

Presentation layer

This layer retrieves and summarizes the data in Elasticsearch to provide practical functions. The value of the entire system only shows when this layer is done well. Through this layer we can do call link tracing, fault tracing, performance analysis, and so on. For a brief introduction, see the "How to use" section below.

How to use:

In this section, we introduce some of the presentation layer functions built on the architecture above. (The UI is not polished; just look at the functions.)

Tracing of link calls

This is a fairly basic function, used to trace a call link. Given a trace ID, the complete call chain can be drawn. This function is usually used for slow request analysis and call link analysis.
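A minimal sketch of how a call chain could be rebuilt from the logged spans is shown below. It assumes each span has already been fetched as a dict with the keys `trace_id`, `span_id`, `parent_id`, `module`, `uri`, and `process_time_ms`; these key names and both function names are hypothetical, and the real system queries Elasticsearch rather than an in-memory list.

```python
from collections import defaultdict


def build_call_tree(spans, trace_id):
    """Group the spans of one trace into roots and a parent -> children map."""
    selected = [s for s in spans if s["trace_id"] == trace_id]
    known = {s["span_id"] for s in selected}
    children = defaultdict(list)
    roots = []
    for s in selected:
        parent = s.get("parent_id")
        if parent in known:
            children[parent].append(s)
        else:
            # Entry point of the trace, or a span whose parent log is missing.
            roots.append(s)
    return roots, children


def render(nodes, children, depth=0):
    """Return the call chain as indented text lines, one line per span."""
    lines = []
    for s in nodes:
        lines.append("  " * depth + f'{s["module"]} {s["uri"]} ({s["process_time_ms"]} ms)')
        lines.extend(render(children[s["span_id"]], children, depth + 1))
    return lines
```

Calling `render(*build_call_tree(spans, trace_id))` yields one indented line per call, which makes the slowest hop in a slow request easy to spot.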

Fault tracing

The main problem this part solves is that with many microservice modules and complex calls, the root module cannot be found when something goes wrong. We draw each module as a circle and each call between modules as a straight one-way arrow (green for normal requests, red once the error rate exceeds a certain percentage). Drawn this way, the calls form a network, and when a problem occurs it is easy to see which module is the root of the error. As shown in the figure below:
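The idea behind this view can be sketched as follows. The sketch makes several assumptions that are not stated in the article: spans are dicts with `span_id`, `parent_id`, `module`, and `status` keys, a status of 500 or above counts as an error, the red threshold is 10%, and a module is a root-cause suspect when it receives red arrows but sends none. All function names are hypothetical.

```python
from collections import defaultdict


def edge_stats(spans):
    """Count total and failed calls for every (caller module, callee module) pair."""
    module_of = {s["span_id"]: s["module"] for s in spans}
    stats = defaultdict(lambda: {"total": 0, "errors": 0})
    for s in spans:
        caller = module_of.get(s.get("parent_id"))
        if caller is None:
            continue  # entry-point span: no calling module to draw an arrow from
        edge = stats[(caller, s["module"])]
        edge["total"] += 1
        if s["status"] >= 500:  # error criterion is an assumption
            edge["errors"] += 1
    return stats


def red_edges(stats, threshold=0.1):
    """Edges whose error rate exceeds the threshold (drawn in red)."""
    return {e for e, c in stats.items() if c["errors"] / c["total"] > threshold}


def root_suspects(red):
    """Modules that receive red arrows but do not send any."""
    callers = {caller for caller, _ in red}
    return {callee for _, callee in red if callee not in callers}
```

When errors propagate upward through a chain of callers, the suspect set narrows to the deepest failing module, which is what the red arrows in the diagram make visible at a glance.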

Performance analysis

By aggregating URL response times, request counts, status codes, and so on, you can chart the URL performance of a module.

Before and after a module release, you can watch the charts to check whether the release caused any abnormality;

When a failure occurs, it is also easier to compare the statistics before and after the failure.
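A minimal sketch of this kind of aggregation is shown below, assuming each access-log record is a dict with `uri`, `process_time_ms`, and `status` keys (hypothetical names). In production this would more likely be an Elasticsearch aggregation query; the in-memory version only illustrates what is being computed.

```python
from collections import Counter, defaultdict


def summarize_by_uri(records):
    """Per-URI request count, average and maximum duration, and status-code counts."""
    grouped = defaultdict(list)
    for r in records:
        grouped[r["uri"]].append(r)

    summary = {}
    for uri, rows in grouped.items():
        times = [r["process_time_ms"] for r in rows]
        summary[uri] = {
            "requests": len(rows),
            "avg_ms": sum(times) / len(times),
            "max_ms": max(times),
            "status_codes": dict(Counter(r["status"] for r in rows)),
        }
    return summary
```

Running the same summary over a window before a release and a window after it gives two tables that can be compared directly.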

Usage Overview

At present, the system is fully used across our business systems and plays an important role in call link tracing, fault tracing, and performance monitoring, greatly shortening fault recovery time. In the course of using it, however, we have also found many places that can be optimized and extended.

Functional expansion

There are a number of features that can be extended on top of these basic functions, such as:

Resource call monitoring and tracing

The current system only monitors and analyzes requests between modules, but we can go further and monitor the resource calls (e.g., Redis, DB) made within a request. Frameworks generally already print and record resource request logs, so these can be integrated into the system with a little modification;
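One possible shape for such a modification is a small wrapper that times each resource call and logs it together with the current span ID. The sketch below is an assumption about how this could look, not the framework's actual hook: the function name, the JSON keys, and the `emit` callback are all hypothetical.

```python
import json
import time
from contextlib import contextmanager


@contextmanager
def traced_resource(span_id, resource, operation, emit=print):
    """Time one resource call and emit a JSON log line tagged with the span ID."""
    started = time.perf_counter()
    error = None
    try:
        yield
    except Exception as exc:
        error = repr(exc)
        raise  # the caller still sees the original exception
    finally:
        emit(json.dumps({
            "span_id": span_id,
            "resource": resource,    # e.g. "redis" or "db"
            "operation": operation,  # e.g. the command or SQL statement
            "process_time_ms": round((time.perf_counter() - started) * 1000, 3),
            "error": error,
        }))
```

Wrapping a call as `with traced_resource(span_id, "redis", "GET user:1"): ...` produces one log line per resource call, which can then be joined to the request's span through the shared span ID.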

Intelligent monitoring of abnormal fluctuations

Fluctuations in call duration can be combined with module releases, error logs, and other signals to build intelligent monitoring, automatic fault localization, and similar functions.

Appendix 1: Google Dapper paper
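As a deliberately simple stand-in for such monitoring, the sketch below flags a fluctuation when the recent average duration rises well above a historical baseline. The rule (baseline mean plus `k` standard deviations), the default `k=3.0`, and the function name are assumptions for illustration only; it does not yet take releases or error logs into account.

```python
from statistics import mean, stdev


def is_abnormal(baseline_ms, recent_ms, k=3.0):
    """Return True when the recent mean exceeds baseline mean + k * stdev."""
    if len(baseline_ms) < 2 or not recent_ms:
        return False  # not enough data to judge
    return mean(recent_ms) > mean(baseline_ms) + k * stdev(baseline_ms)
```

A flagged window could then be cross-checked against the release history and error logs of the same module to narrow down the likely cause.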

English version

Dapper, a Large-Scale Distributed Systems Tracing Infrastructure

https://research.google/pubs/pub36356/

Chinese version

Dapper, a tracking system for large-scale distributed systems

http://bigbully.github.io/Dapper-translation/

Author: Yu Bo

Source-WeChat public account: Qudian technical team

Source: https://mp.weixin.qq.com/s/1ck-bhOTvFfIo7lbwrzErw
