
【Micro-Electric Platform】High-Concurrency Practice: A Journey of Problem Solving and Process Optimization

Author: JD Cloud developer

Micro-Electric Platform

The Micro-Electric Platform is a comprehensive, intelligent SCRM SaaS system that integrates telemarketing and enterprise WeChat. It covers major functions such as multi-channel management, full customer-lifecycle management, and private-domain marketing operations, serves JD.com's various business lines, and focuses on providing one-stop customer management and integrated private-domain operation services for the business in the form of workplace outsourcing.


Introduction

This article describes how the telemarketing system's offline customer-list marking process ran into a 40% drop in JMQ consumer throughput and JSF service machines that appeared to freeze ("fake death"), and follows the journey from investigation and repeated verification to the final fix, which also brought an extra 50% throughput gain. It is aimed at server-side R&D engineers, offering troubleshooting ideas for complex production problems and showing how to use JD's internal tools such as SGM, JMQ, and JSF to catch the root cause and solve it thoroughly, letting technology safeguard business growth.

The following describes the throughput problems encountered in the offline blacklist marking process in the actual production environment of the telemarketing system.

Background to the event

1. Overview

Every morning and evening, blacklist marking runs over the telemarketing system's roughly 100 million customer records, at an average speed of about 950,000 records per minute; the blacklist JSF service handles a total TPS of about 20,000 with a TP99 of 100~110 ms.

2. Complexity

The marking process provides highly personalized ("a thousand faces for a thousand people") configuration capabilities. The underlying layer is designed and implemented on top of a rule engine, and the call chain includes many external interfaces, such as finance card-swiping markers, risk-control markers, crowd-portrait markers, mall risk-control markers, and mall real-name markers, covering many dimensions with high complexity.


3. The problem

In the morning it was found that the blacklist marking had not finished. The symptom was that TP99 on the JMQ consumer side was too high, which reduced the throughput of the marking program. Through temporary scale-out and the removal of 4 "problem machines" (with hindsight: machines made problematic by the program itself), throughput was raised and the blacklist marking was pushed to completion.

But why do machine problems keep recurring? Why does a problem on only a few machines cut throughput by 40%? And how can such situations be avoided in the future?

With these questions in mind, let's start the journey of locating and fixing the root cause~

Catching the culprit behind the scenes

1. Why does the failure of several machines lead to a sharp drop in throughput?

[Figure: JMQ consumer retry flow]

As shown in the figure above, every time a message fails during consumption (here, because the call landed on a "problem machine"), JMQ sleeps and then re-consumes the entire batch it pulled (in this case JMQ's batchSize = 10), so every error adds at least 1,000 milliseconds to the total processing time. There are 80 machines in total and JSF uses its default load-balancing algorithm, so the probability of a request hitting one of the 4 problem machines is 5%. Since JMQ pulls 10 messages at a time, a single message hitting a "problem machine" is enough to drag the whole batch down, which greatly reduces throughput.
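A rough back-of-the-envelope check (assuming each message triggers one JSF call and requests are spread uniformly and independently across the 80 instances): with a 5% chance per call of hitting a problem machine, a batch of 10 messages contains at least one bad call with probability 1 − 0.95^10 ≈ 40%. Roughly four out of ten batches therefore pay the ≥1 second retry penalty, which is consistent in magnitude with the observed ~40% drop in throughput.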


In summary, the reason a problem on a small number of machines causes throughput to drop sharply is the combination of JSF's default random load-balancing algorithm, under which every instance is equally likely to be hit, and JMQ's default behavior of sleeping for 1 second before the consumer retries after an error.

Solution: of course the problem disappears entirely if the consumer never reports errors, but since that cannot be guaranteed, you can do the following:

1) Adjust JMQ's retry count and retry delay to reduce the impact as much as possible

<jmq:consumer id="cusAttributionMarkConsumer" transport="jmqTransport">
        <jmq:listener topic="${jmq.topic}" listener="jmqListener" retryDelay="10" maxRetryDelay="20" maxRetrys="1"/>
</jmq:consumer>
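The intent of this configuration (exact parameter semantics depend on the JMQ version in use, so treat the values as illustrative): retry a failed message at most once (maxRetrys="1") with short retry delays (retryDelay / maxRetryDelay), so that a single failing message no longer stalls its whole batch for a second or more.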
           

2) Modify the JSF load balancing algorithm

Configuration example:

<jsf:consumer loadbalance="shortestresponse"/>
           

Schematic:

[Figure: JSF consumer load-balancing schematic, from the JSF wiki]

The consumer diagram above is taken from the JSF wiki, and the red annotations are the key points extracted from the JSF code. In one sentence: the default random algorithm is purely random, while shortestresponse weights load balancing by response time and in-flight request count, so using the shortestresponse algorithm largely avoids this kind of problem and plays a role similar to a circuit breaker. (In this solution JSF's instance circuit-breaking and warm-up capabilities were also used; see the JSF wiki for details, so they are not covered here.)
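To make the weighting idea concrete, here is a simplified, hypothetical sketch of a "shortest response" style selector. It is not JSF's actual implementation, only an illustration of the principle: instances with lower recent response time and fewer in-flight requests are chosen more often.

import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

public class ShortestResponseSelector {

    /** Per-instance stats the selector would track (simplified). */
    public static class Instance {
        final String ip;
        volatile long avgResponseMs;   // recent average response time
        volatile int inflight;         // requests currently being processed
        Instance(String ip, long avgResponseMs, int inflight) {
            this.ip = ip; this.avgResponseMs = avgResponseMs; this.inflight = inflight;
        }
    }

    /** Pick the instance with the smallest (avgResponseMs * (inflight + 1)) score;
     *  break ties randomly so traffic still spreads across healthy instances. */
    public static Instance select(List<Instance> instances) {
        Instance best = null;
        long bestScore = Long.MAX_VALUE;
        int ties = 0;
        for (Instance ins : instances) {
            long score = ins.avgResponseMs * (ins.inflight + 1);
            if (score < bestScore) {
                best = ins; bestScore = score; ties = 1;
            } else if (score == bestScore && ThreadLocalRandom.current().nextInt(++ties) == 0) {
                best = ins; // reservoir-style random tie-break
            }
        }
        return best;
    }
}

Under this kind of scoring, a "fake dead" instance whose in-flight count keeps climbing quickly scores itself out of rotation, which is why the observed effect resembles circuit breaking.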

2. How to determine whether it is an instance problem, and how to find the IPs of the problem instances?

Monitoring showed that the high latency existed on only 4 machines, so the first reaction was indeed to suspect the machines themselves. But combined with the earlier situation (a similar phenomenon had occurred in January), something about it felt off.

The following figure shows the log that at first made it look like a machine problem (this machine's latency in SGM monitoring was also continuously high); removing such machines did temporarily solve the problem:

[Figure: error log from the suspected problem machine]

To sum up: when the high latency or failures within a time window are concentrated on a specific IP, you can conclude that the instance behind that IP has a problem (network, hardware, etc.); if a large number of IPs show the same phenomenon, it is not the machines themselves. In this case the root cause was not the machines but a phenomenon produced by the program, as the analysis below shows.

3. What causes machines to repeatedly "fake death" and turn into problem machines?

From the analysis above, the error on a problem machine is that the JSF thread pool is full: while the machine is in this state its TPS is almost 0, and any incoming request is answered with a "JSF thread pool (non-business thread pool) full" error.

Use Xingyun to perform the following operations:

1) dump memory objects

No obvious problem; memory usage is low, which is consistent with the small amount of GC seen in monitoring. Continue with the thread stacks.

2) jstack thread dump

The stacks match the symptoms of the problem machine: essentially all JSF threads are WAITING, so any incoming traffic triggers the "JSF thread pool full" error, and this is also consistent with the machine's very low CPU and memory usage.
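(The dumps below were captured through Xingyun; with standard JDK tooling an equivalent capture would be something like "jstack -l <pid>" run on the problem instance.)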

Thread JSF-BZ-22000-92-T-200:

stackTrace:
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00000007280021f8> (a java.util.concurrent.FutureTask)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.FutureTask.awaitDone(FutureTask.java:429)
at java.util.concurrent.FutureTask.get(FutureTask.java:191)
at com.jd.jr.scg.service.common.crowd.UserCrowdHitResult.isHit(UserCrowdHitResult.java:36)
at com.jd.jr.scg.service.impl.BlacklistTempNewServiceImpl.callTimes(BlacklistTempNewServiceImpl.java:409)
at com.jd.jr.scg.service.impl.BlacklistTempNewServiceImpl.hit(BlacklistTempNewServiceImpl.java:168)
           

Thread JSF-BZ-22000-92-T-199:

stackTrace:
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00000007286c9a68> (a java.util.concurrent.FutureTask)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.FutureTask.awaitDone(FutureTask.java:429)
at java.util.concurrent.FutureTask.get(FutureTask.java:191)
at com.jd.jr.scg.service.biz.BlacklistBiz.isBlacklist(BlacklistBiz.java:337)
           

Inference: thread JSF-BZ-22000-92-T-200 waits at line 36 of UserCrowdHitResult, and thread JSF-BZ-22000-92-T-199 waits at line 337 of BlacklistBiz. At this point you can basically infer that the cause is thread waiting. The code pattern that produces it is: 1) the main thread submits task X to thread pool A, and 2) task X submits another task to the same thread pool A and blocks on its result. When step 2 cannot obtain a thread it waits, and step 1 then waits for step 2 to finish; threads waiting on each other in this way is exactly the "fake death" phenomenon.
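A minimal sketch of this starvation pattern (hypothetical class and pool names, not the actual business code): a bounded pool whose tasks block on sub-tasks submitted to the same pool stalls as soon as every worker is occupied by an outer task.

import java.util.concurrent.*;

public class NestedPoolStarvationDemo {

    // A small bounded pool, standing in for "thread pool A"
    static final ExecutorService POOL_A = Executors.newFixedThreadPool(2);

    public static void main(String[] args) {
        // Submit enough outer tasks to occupy every worker in POOL_A
        for (int i = 0; i < 2; i++) {
            POOL_A.submit(() -> {
                // The outer task submits a sub-task to the SAME pool...
                Future<String> inner = POOL_A.submit(() -> "inner result");
                try {
                    // ...and blocks waiting for it. No worker is free to run the
                    // sub-task, so this get() never returns: the pool is "fake dead".
                    return inner.get();
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            });
        }
        // The JVM keeps running, CPU and memory stay low, but no progress is made
        // and every new submission just queues up, matching the symptoms above.
    }
}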

Summary: this basically confirms the problem. However, because the original code maintainer had left and the program is complex, the conclusion was first verified by raising the size of business thread pool A from 50 to 200 (to observe throughput when no thread waiting occurs). The result: TPS jitters within a small range, but no longer drops to 0 or falls sharply.

Single-instance TPS is 300~500 and traffic is normal, i.e., the service works normally once there is no thread-waiting problem, as shown in the monitoring figure.

Confirming the inference: the code path is as follows:

BlacklistBiz -> [thread pool A] blacklistTempNewService.hit
blacklistTempNewService.hit -> callTimes
callTimes -> userCrowdServiceClient.isHit -> [thread pool A] crowdIdServiceRpc.groupFacadeHit
           

Summary: BlacklistBiz, acting as the main thread, submits the blacklistTempNewService.hit task to thread pool A; inside blacklistTempNewService.hit, thread pool A is used again to execute crowdIdServiceRpc.groupFacadeHit. This produces the thread waiting and "fake death", matching the inference above. The problem is located.

Solution: the fix is simple: introduce an additional thread pool so that tasks are never nested inside the same pool.
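A minimal sketch of that fix, using the same hypothetical names as the demo above: give the nested call its own executor so outer and inner tasks can never starve each other, and bound the wait instead of blocking indefinitely.

import java.util.concurrent.*;

public class SeparatePoolsFix {

    // Outer tasks and nested sub-tasks run on different pools
    static final ExecutorService OUTER_POOL = Executors.newFixedThreadPool(50);
    static final ExecutorService INNER_POOL = Executors.newFixedThreadPool(50);

    static Future<Boolean> hit(String customerId) {
        return OUTER_POOL.submit(() -> {
            // The nested call goes to INNER_POOL, so it always has workers
            // available even when OUTER_POOL is fully occupied.
            Future<Boolean> crowdHit = INNER_POOL.submit(() -> groupFacadeHit(customerId));
            // Bounded wait instead of an unbounded get()
            return crowdHit.get(500, TimeUnit.MILLISECONDS);
        });
    }

    // Placeholder for the real crowd-hit RPC (crowdIdServiceRpc.groupFacadeHit)
    static boolean groupFacadeHit(String customerId) {
        return false;
    }
}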

4. An unexpected gain: a problem affecting the performance of the blacklist service

While reviewing the stack information, a large number of "waiting to lock" entries were found:

[Figure: jstack output with many "waiting to lock" entries]

Problem: tracing the code from the stacks above shows that three methods on the same service call chain all use the same lock; with that much contention, performance inevitably suffers.


Solution: replace the handwritten local cache guarded by a synchronized lock with a Caffeine local cache.
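A minimal sketch of the replacement, assuming the Caffeine dependency is on the classpath; the class and loader names here are hypothetical stand-ins for the original cached data. Caffeine handles concurrency and eviction internally, so callers no longer contend on one global synchronized block.

import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.LoadingCache;
import java.util.concurrent.TimeUnit;

public class CrowdConfigCache {

    // Bounded, expiring cache; lookups for different keys never block each other
    private final LoadingCache<String, CrowdConfig> cache = Caffeine.newBuilder()
            .maximumSize(10_000)
            .expireAfterWrite(5, TimeUnit.MINUTES)
            .build(this::loadCrowdConfig);

    public CrowdConfig get(String crowdId) {
        // Loads on a miss; concurrent callers for the same key wait only on that key,
        // not on a lock shared by unrelated methods.
        return cache.get(crowdId);
    }

    private CrowdConfig loadCrowdConfig(String crowdId) {
        // Placeholder for the original DB/RPC loading logic
        return new CrowdConfig(crowdId);
    }

    public static class CrowdConfig {
        private final String crowdId;
        public CrowdConfig(String crowdId) { this.crowdId = crowdId; }
        public String getCrowdId() { return crowdId; }
    }
}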

5. An additional bonus: did you know that when the JSF provider's thread pool is full, the client does not retry the resulting RpcException?

This was surprising. From reading the JSF code and earlier communication with a JSF architect, the understanding had been that all RpcExceptions are retried, with the load-balancing algorithm used to pick another provider for the retry. But actual verification showed that when the server reports "handlerRequest error msg:[JSF-23003]Biz thread pool of provider has bean exhausted, the server port is 22001", the client does not retry. This conclusion was eventually confirmed with the JSF architect. So if you want retries in this scenario, you have to handle them in the client program; the implementation is fairly simple (a minimal sketch follows after the next note).

One more detail, with thanks to @Shi Jiangang from JD Logistics for the detailed demonstration: the current JSF only retries when the RpcException is thrown on the consumer side; everything else is not retried (for example, the exception thrown by the provider as described above is treated as a normal response packet, so the consumer does not retry).
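Since the original client-side implementation is not shown, here is one possible, hedged sketch of such a retry wrapper (hypothetical helper; the exception check is a placeholder and should be matched to whatever your JSF version actually surfaces for JSF-23003):

import java.util.concurrent.Callable;

public class ClientSideRetry {

    // Retry a JSF call on the consumer side when the provider answers with a
    // "thread pool exhausted" style error that JSF itself will not retry.
    static <T> T callWithRetry(Callable<T> jsfCall, int maxAttempts) throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return jsfCall.call();
            } catch (Exception e) {
                last = e;
                if (!isProviderPoolExhausted(e) || attempt == maxAttempts) {
                    throw e; // not retryable here, or out of attempts
                }
                Thread.sleep(50L * attempt); // small backoff before the next attempt
            }
        }
        throw last;
    }

    private static boolean isProviderPoolExhausted(Exception e) {
        String msg = String.valueOf(e.getMessage());
        return msg.contains("JSF-23003") || msg.contains("thread pool of provider");
    }
}

A call through the generated JSF stub would then be wrapped, for example callWithRetry(() -> blacklistService.isBlacklist(id), 2), where blacklistService and id are placeholders.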

Summary

The problem was that the marking process did not finish; the symptom was that JMQ TP99 rose and throughput dropped. If the TP99 of the JSF service had also soared, one could say the service itself was at fault, but the opposite was observed (it looked like a small fraction of requests rather than a problem with the JSF service as a whole): the marking JSF service's TPS was low, its TP99 was normal, and memory, CPU and other metrics were idle, yet "JSF thread pool exhausted" errors kept appearing, which is why the stacks were investigated further. In other words, the high JMQ TP99 was not caused by high latency in the marking JSF service; it was the surface effect of the marking service's errors triggering JMQ's retry mechanism (delay and sleep), which reduced JMQ throughput. In the end the thread-waiting problem was located and fixed, a lock that hurt the marking JSF service's performance was removed as a bonus, and JSF configuration and tuning ensured that even if the service reports errors, MQ throughput is no longer heavily affected.

Event recap

Solving this problem not only fixed it completely but also optimized the factors affecting performance. The final results:

1. Eliminated the hidden risk of thread waiting in the blacklist JSF service and removed the synchronized lock to improve throughput: TPS rose from 20,000 to 30,000 while TP99 dropped from 100 ms to 65 ms;

2. Tuned JMQ retry wait and delay times so that retries no longer crush throughput: consumer TP99 dropped from 1,100 ms to 300 ms;

3. Optimized the JSF load-balancing algorithm to avoid continuing to send a large share of requests to an instance that is already failing; the result is a noticeably more stable service;

In the end, the job that used to finish after 8 o'clock now finishes around 5 o'clock, a 57% reduction in overall run time; and even if a machine does "have a problem", overall throughput no longer drops sharply. The benefits are clear.

The post-optimization monitoring chart is as follows:

[Figure: post-optimization monitoring chart]

Final words

Although the Micro-Electric Platform is not on the golden link, the complexity of its scenarios (business complexity plus RPA and other robot users) and its traffic volume mean we constantly face challenges of all kinds; fortunately, we have solved them.