
APM Query Optimization & Slow Request Attribution Analysis

Preface

I am not actually responsible for the APM project; this was basically a spontaneous effort. That is a good thing in its own right: as workers we have to cultivate our own distinctive experience in the industry. Let me say a few words about that below.

Take the Zizhi Tongjian as a whole: it is about the elites of ancient China and how emperors handled affairs, which is why it is called a book for rulers. It is useful for managers and for handling affairs, but treating it as a guide for personal conduct can lead you astray. The people in it were all magnates with far more power than ordinary folk, and the matters they dealt with were not our everyday trifles; when the role and the position are different, their way of doing things cannot simply be copied.

There is also an interesting conversation from the reign of Emperor Xianzong of Tang. He and his chancellor were discussing how to be a good ruler. Citing history, the chancellor argued that a person's energy is limited, so the ruler should set up a sound selection mechanism and a KPI-style assessment, and leave the rest to the officials. The point is that for someone governing a country, having reasonable mechanisms for handling the daily flood of affairs is essential; and secondly, the officials in post must be competent, which again depends on that assessment mechanism.

True competence in the workplace

I think the most important thing for a leader is to have their own views in a particular field: you know where the field is heading, and you have ideas about the problems you are likely to run into. Take the supply chain as an example. In essence I think the goal is to improve the operational efficiency of the whole process, with data supporting every step: where the time goes along the chain, which steps are still manual. Then, looking at other companies, you will find some use C-end consumer data to drive rapid iteration of goods at the suppliers, which is a completely different way of playing the game. A B-end supply chain grinding away on its own is simply not in the same league. For example, does a material that buyers purchase in large quantities count as a hit? Not necessarily; it may just be bought in bulk for wholesale.

As for management ability, I rank it by strength of ability and execution. So in my view, leaders in the workplace are not mere managers; they have their own achievements in a specific field, and insight is king. Even those who cannot do everything hands-on, as long as they take the initiative to understand and learn, deserve to be called wise.

APM query optimization

Status quo

Some colleagues had reported earlier that APM queries were slow, especially the ones spanning a long time range. I recently took a look when I had some free time and found that the code was fairly old: it sent raw ES statements directly through the RestClient.

Our APM logs are sharded into daily indices, so a cross-day query has to hit something like /xxx_20230907,xxx_20230906/_search. With a daily log volume of roughly 50 million entries, querying that many indices at once is bound to be slow, and on top of that the results have to be sorted by time.
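For reference, a minimal sketch of what such a cross-day query looks like through the low-level Elasticsearch RestClient; the host, index names, field names and query body here are placeholders for illustration, not our actual code:

```java
import org.apache.http.HttpHost;
import org.apache.http.util.EntityUtils;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

public class CrossDaySearch {
    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build()) {
            // Every daily index in the time range is listed in the URL,
            // so ES has to fan the query out to all of them and merge-sort the hits.
            Request req = new Request("GET", "/xxx_20230907,xxx_20230906/_search");
            req.setJsonEntity("""
                {
                  "query": { "range": { "@timestamp": { "gte": "2023-09-06", "lte": "2023-09-07" } } },
                  "sort":  [ { "@timestamp": "desc" } ],
                  "size":  20
                }
                """);
            Response resp = client.performRequest(req);
            System.out.println(EntityUtils.toString(resp.getEntity()));
        }
    }
}
```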

Ideas

Looking at the response shape, it is a total plus one page of data. The total can be obtained with a count query. A page is only about 10 to 500 rows, which means any given page of my query falls into at most one or two daily indices; there is no need to stuff every index into the query. The key, then, is to work out which index shard the requested page lives in and query only that.

Implementation

1. Aggregation statements, or queries that hit a single index shard (one day), are executed directly

An aggregation has to compute its statistics over the whole range, so it cannot be narrowed down to a single shard; and a single-day query only hits one index in the first place. In both cases we simply execute the query as-is.
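A tiny sketch of that dispatch decision, with names of my own choosing:

```java
import java.util.List;

public class QueryDispatcher {
    // Sketch: aggregations and single-day queries skip the page-locating
    // optimization and are sent to ES as-is; everything else goes through it.
    static boolean executeDirectly(boolean isAggregation, List<String> dailyIndices) {
        return isAggregation || dailyIndices.size() == 1;
    }
}
```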

2. Get the total count

By running a count with the query conditions against the relevant indices, we can find out in one shot how many documents match under the current conditions.
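A sketch of that count step with the low-level RestClient; the helper name and JSON handling are my own, the key point is the single _count call over the comma-separated index list:

```java
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.http.util.EntityUtils;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

public class TotalCounter {
    // Returns the total number of documents matching queryJson across the given daily indices,
    // e.g. indices = "xxx_20230907,xxx_20230906".
    static long countAcrossIndices(RestClient client, String indices, String queryJson) throws Exception {
        Request req = new Request("GET", "/" + indices + "/_count");
        req.setJsonEntity(queryJson);
        Response resp = client.performRequest(req);
        // A _count response looks like {"count": 12345, "_shards": {...}}.
        return new ObjectMapper().readTree(EntityUtils.toString(resp.getEntity())).get("count").asLong();
    }
}
```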

3. Locate index shards

Because results come back in reverse chronological order, we also walk the daily indices backwards in time. For example, suppose September 7 has 10 matches and a page holds 8 rows. If I am requesting the first page, it falls entirely within September 7, so I query that index directly. The second page crosses shards: September 7 can only contribute its remaining 2 rows, so we count the matches on September 6; if that count covers the remaining rows of the page (6 in this example), the second page can be served by querying just those two indices.

This way we avoid dragging every index into a brute-force paginated query.
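Here is a minimal sketch of that shard-locating step, under the assumption that the per-day match counts are already known (newest day first) and we only want the daily indices a given page overlaps; the names are illustrative:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PageLocator {
    /**
     * Given per-day match counts ordered newest-first, return the daily indices
     * that the requested page spans, so only those need to go into the _search.
     */
    static List<String> locateIndices(LinkedHashMap<String, Long> countsByIndex, int page, int pageSize) {
        long pageStart = (long) (page - 1) * pageSize;   // global offset of the page's first row
        long pageEnd = pageStart + pageSize - 1;         // global offset of the page's last row
        List<String> hit = new ArrayList<>();
        long seen = 0;
        for (Map.Entry<String, Long> e : countsByIndex.entrySet()) {
            long first = seen;                           // this index's first global row
            long last = seen + e.getValue() - 1;         // this index's last global row
            if (e.getValue() > 0 && first <= pageEnd && last >= pageStart) {
                hit.add(e.getKey());                     // this day overlaps the requested page
            }
            seen += e.getValue();
            if (seen > pageEnd) break;                   // later days are beyond the page
        }
        return hit;
    }

    public static void main(String[] args) {
        // Example from the text: September 7 has 10 matches, page size is 8.
        LinkedHashMap<String, Long> counts = new LinkedHashMap<>();
        counts.put("xxx_20230907", 10L);
        counts.put("xxx_20230906", 50L);
        System.out.println(locateIndices(counts, 1, 8)); // [xxx_20230907]
        System.out.println(locateIndices(counts, 2, 8)); // [xxx_20230907, xxx_20230906]
    }
}
```

One detail to keep in mind when implementing this: the from offset passed to the actual _search against the selected indices should subtract the counts of any newer indices that were skipped entirely, so the page starts at the right row.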

4. Data assembly

Finally, we assemble the total and the page of rows obtained above and return them to the frontend.
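As a sketch, the assembled structure handed back to the frontend might look like this; the field names are assumptions:

```java
import java.util.List;

// Sketch of the page structure returned to the frontend:
// the total from the count step plus the rows from the targeted _search.
public record PageResult<T>(long total, int page, int pageSize, List<T> rows) { }
```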

Attribution analysis for slow APM interface requests

[Roadmap figure: APM Query Optimization & Slow Request Attribution Analysis]

This appeared on the earlier roadmap as the second item, the analysis of "abnormal" APM requests; when I actually implemented it, I ran into a few problems of detail.

Details

1. The anomaly-detection algorithm

I had previously seen Qunar.com describe something similar, using the LOF (Local Outlier Factor) model for the statistics. You quickly run into extreme cases, though. The first: a batch of very slow interfaces plus a few fast ones, and it is the fast interfaces that get flagged as the outliers, good grief. The second: a batch of 20ms interfaces with one or two at 400ms, which also come out as abnormal points.

So after weighing it up I went with a variance-style rule: if a request is more than three times the average, it is an anomaly. Of course this still runs into the second case above, so some noise reduction is needed.
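A minimal sketch of that base rule, reading "three times the average" literally since the exact statistic is not spelled out in the text:

```java
import java.util.List;

public class AnomalyRule {
    // Flags a request duration (ms) that exceeds three times the average of its batch,
    // per the variance-style rule described above.
    static boolean isAnomaly(double durationMs, List<Double> batchDurations) {
        double avg = batchDurations.stream().mapToDouble(Double::doubleValue).average().orElse(0);
        return avg > 0 && durationMs > 3 * avg;
    }
}
```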

1.1. Noise reduction

a. Handling scheduled-task (timer) interfaces

If a scheduled task is exposed as a synchronous interface it will be very slow, because when machine performance degrades it can only crawl along. We handle these by moving them onto asynchronous threads, so that they do not affect our statistics of slow requests.

b. "Abnormal points" that are actually normal

Take the second case: a normal interface request only needs to land somewhere between 200 and 500ms, but on our internal network things can be very fast, which turns those perfectly fine interfaces into abnormal points. So we add a floor: only when the interface's average time is greater than 500ms, and the current request takes more than three times that, do we treat it as an abnormal point for the whole application.

Incidentally, the statistics are computed over a time window, for example every 10 seconds, so the average time just mentioned is the average within that window, and that is the value compared against 500ms.
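Putting the window and the floor together, a sketch of the refined rule might look like this; the 10-second roll-over and the "three times the average" reading are assumptions based on the description above:

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Sketch: per-interface statistics kept over a fixed window (e.g. 10s).
 * A request is only flagged when the window average is already above the
 * 500ms floor AND the request exceeds three times that average.
 */
public class WindowedAnomalyDetector {
    private static final double FLOOR_MS = 500.0;
    private final List<Double> window = new ArrayList<>();

    void record(double durationMs) {
        window.add(durationMs);
    }

    boolean isAnomaly(double durationMs) {
        double avg = window.stream().mapToDouble(Double::doubleValue).average().orElse(0);
        return avg > FLOOR_MS && durationMs > 3 * avg;
    }

    void resetWindow() {  // called when the 10s window rolls over
        window.clear();
    }
}
```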

2. Attribution analysis

2.1. Statistics by the dimension of interfaces whose response time has slowed

I don't think this works. For one thing the data set can be tiny: if interface A is requested only once in half an hour, how would you ever tell that it has slowed down? For another, you would only really detect a change in an interface's speed when a new version is released.

2.2. Statistics by the top 3 longest durations: interface calls + MySQL requests

For example, if my application is very slow, these longest requests are naturally the culprits, no argument there. One detail worth noting: gateway interfaces should be excluded from the statistics. The gateway is the outermost layer, so counting its interfaces tells you nothing; what matters is which interfaces belong to the slowest service in the call chain, along with the statistics on database SQL requests.
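A sketch of that selection, assuming each span records its service and kind so the gateway layer can be filtered out; all names here are placeholders:

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

public class TopSlowSpans {
    // Minimal span model: service name, kind ("http" / "mysql"), operation, duration.
    record Span(String service, String kind, String operation, long durationMs) { }

    // Returns the top 3 longest interface calls and SQL requests in a trace,
    // excluding the gateway layer, which only mirrors downstream latency.
    static List<Span> top3(List<Span> spans, String gatewayService) {
        return spans.stream()
                .filter(s -> !s.service().equals(gatewayService))
                .sorted(Comparator.comparingLong(Span::durationMs).reversed())
                .limit(3)
                .collect(Collectors.toList());
    }
}
```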