
A Redis Access Timeout "Bug-Catching" Trip


01

Introduction

As a back-end developer, you are no stranger to Redis, an in-memory database with very fast read and write speeds. iQIYI's overseas back-end projects also use Redis extensively, mainly for caching, message queues, and distributed locks. Recently, while upgrading the Docker image version used by an old project, I ran into a strange problem: after the project was moved to the newer image, a large number of timeout errors occurred when accessing Redis, and the problem disappeared once the image was rolled back to the previous version. After troubleshooting, the culprit turned out to be a code bug involving Lettuce, Redis, and Netty. Working through it also gave us a deeper understanding of how the components involved actually work. The following is a detailed account of the analysis and troubleshooting on this "bug-catching" trip.

02

Background

Our technology stack is the Spring Cloud suite commonly used in the industry. The project in question is one sub-service of the overall microservices architecture, mainly responsible for client-facing web services such as program details, episode lists, and playback authentication. Since the entities in most fields, such as programs, episodes, and actors, do not change frequently, we use a Redis cluster with three masters and three slaves for caching, sharding the data so that more content can be stored.

There are two main ways to access Redis in the project. One is to use the RedisTemplate object encapsulated by the Spring framework directly, which is used when manually manipulating data in Redis. The other is to access cached content indirectly through a self-developed caching framework, which manages the cached content internally and provides advanced features such as a two-level cache (local plus remote Redis), hot-key statistics, and cache warm-up.

Access via RedisTemplate:

(Code screenshot: accessing Redis via RedisTemplate)
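The original code screenshot is not reproduced here. As a rough sketch (class and key names are illustrative, not the project's actual code), direct access through Spring's RedisTemplate typically looks like this:

```java
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.data.redis.core.StringRedisTemplate;
import org.springframework.stereotype.Service;

import java.util.concurrent.TimeUnit;

@Service
public class AlbumCacheService {

    @Autowired
    private StringRedisTemplate redisTemplate;   // auto-configured by Spring Boot

    public void cacheAlbumDetail(String albumId, String detailJson) {
        // write the value with a TTL
        redisTemplate.opsForValue().set("album:detail:" + albumId, detailJson, 30, TimeUnit.MINUTES);
    }

    public String getAlbumDetail(String albumId) {
        // read the value back; returns null on a cache miss
        return redisTemplate.opsForValue().get("album:detail:" + albumId);
    }
}
```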

The other way is to go through the self-developed caching framework. As shown below, an object annotated with @CreateCache is declared as a cache container. When the project starts, the framework uses Redis's publish/subscribe mechanism to automatically synchronize hot data from the remote Redis L2 cache to the local machine. Attributes such as the cache validity period and the number of locally cached entries can also be configured. In addition, the framework provides read/write interfaces for accessing the cached data.

(Code screenshot: declaring a cache container with @CreateCache)
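The framework is in-house and its exact API is not shown in this post; since the @CreateCache annotation mirrors the JetCache style listed in the references, here is a sketch under that assumption (attribute names such as expire, cacheType, and localLimit are therefore assumptions, not the framework's confirmed API):

```java
import com.alicp.jetcache.Cache;
import com.alicp.jetcache.anno.CacheType;
import com.alicp.jetcache.anno.CreateCache;
import org.springframework.stereotype.Service;

@Service
public class EpisodeCacheService {

    // declared as a two-level cache container: local (in-process) plus remote Redis;
    // expire is the TTL in seconds, localLimit caps the number of locally cached entries
    @CreateCache(name = "episode:", expire = 3600, cacheType = CacheType.BOTH, localLimit = 1000)
    private Cache<String, String> episodeCache;

    public String getEpisode(String episodeId) {
        return episodeCache.get(episodeId);
    }

    public void putEpisode(String episodeId, String episodeJson) {
        episodeCache.put(episodeId, episodeJson);
    }
}
```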

03

Symptoms

After the image version was upgraded, a large number of Redis timeout errors appeared once the application started. Routine monitoring of CPU, memory, and garbage collection showed no obvious anomalies. There was, however, more network data being written in the early phase after startup; this is the cache warm-up logic mentioned earlier and therefore expected.

Since the project has two access methods and the Redis server architecture differs between environments, we ran controlled tests to pin down the problem scenario and found some clues:

  • On the earlier image version, both RedisTemplate and the caching framework access the Redis cluster normally.
  • On the later image version, RedisTemplate access to the Redis cluster is normal, but access through the caching framework times out. Framework access returns to normal some time after the project starts.
  • On both the earlier and the later image version, both RedisTemplate and the caching framework access a single-node Redis instance normally.

Based on the above, it is not hard to deduce that the problem lies in the way the caching framework accesses the Redis cluster. Combined with the fact that access recovers some time after startup, we guessed it was related to the cache warm-up process.

04

Troubleshooting process

Reproduce the case

Checking the code, we found that the self-developed caching framework does not access Redis through Spring; instead it directly uses Lettuce, the Redis client underlying Spring. After stripping out the irrelevant business code, we arrived at a minimal case that reproduces the problem; the code is as follows:

(Code screenshots: minimal reproduction case)
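The original screenshots are not reproduced here. The sketch below captures the same idea using Lettuce's cluster and pub/sub APIs directly; channel names (ONLINE, HOTKEY), the address, and the message format are assumptions based on the flow described next, and the backCheck helper is shown a little further down:

```java
import io.lettuce.core.RedisURI;
import io.lettuce.core.cluster.RedisClusterClient;
import io.lettuce.core.cluster.api.StatefulRedisClusterConnection;
import io.lettuce.core.cluster.pubsub.StatefulRedisClusterPubSubConnection;
import io.lettuce.core.pubsub.RedisPubSubAdapter;

public class WarmUpReproduction {

    public static void main(String[] args) throws InterruptedException {
        RedisClusterClient client = RedisClusterClient.create(RedisURI.create("redis://127.0.0.1:7000"));

        // data connection used by the back-check (get) requests
        StatefulRedisClusterConnection<String, String> dataConnection = client.connect();

        // dedicated subscriber connection; its callbacks run on a Netty EventLoop thread
        StatefulRedisClusterPubSubConnection<String, String> subConnection = client.connectPubSub();
        subConnection.addListener(new RedisPubSubAdapter<String, String>() {
            @Override
            public void message(String channel, String message) {
                if ("HOTKEY".equals(channel)) {
                    // back-check every received hot key against Redis (helper shown below)
                    for (String hotKey : message.split(",")) {
                        HotKeyBackCheck.backCheck(dataConnection, hotKey);
                    }
                }
            }
        });
        subConnection.sync().subscribe("ONLINE", "HOTKEY");

        // separate publisher connection: announce that the new node is online,
        // which prompts the other nodes to publish their local hot keys
        StatefulRedisClusterPubSubConnection<String, String> pubConnection = client.connectPubSub();
        pubConnection.sync().publish("ONLINE", "node-3");

        Thread.sleep(60_000);   // keep the demo alive while HOTKEY messages arrive
        client.shutdown();
    }
}
```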

The whole case simulates the cache warm-up scenario; the main flow is as follows:

  1. After the service (node 3) starts, it sends an ONLINE message announcing that a new node has come online
  2. On receiving the ONLINE message, the other nodes (nodes 1 and 2) package the hot keys they have cached locally
  3. The other nodes (nodes 1 and 2) send HOTKEY messages containing their local hot keys
  4. The new node (node 3) receives the HOTKEY messages containing the hot keys
  5. The new node (node 3) queries Redis for each received hot key (a "back-check") to obtain its value and caches it locally
(Diagram: cache warm-up message flow between the nodes)

A log was added to the method that queries Redis by hot key to show how long each back-check takes (some irrelevant code is omitted).

(Code screenshot: back-check method with timing log)
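A sketch of that back-check method (again illustrative, not the original code; the 1-second timeout matches the case described above):

```java
import io.lettuce.core.RedisFuture;
import io.lettuce.core.cluster.api.StatefulRedisClusterConnection;

import java.util.concurrent.TimeUnit;

public class HotKeyBackCheck {

    // Re-query Redis for a hot key and log how long the call took.
    // In the reproduction this runs on the pub/sub EventLoop thread,
    // and future.get(...) blocks that thread for up to 1 second per key.
    static String backCheck(StatefulRedisClusterConnection<String, String> connection, String hotKey) {
        long start = System.currentTimeMillis();
        try {
            RedisFuture<String> future = connection.async().get(hotKey);
            return future.get(1, TimeUnit.SECONDS);     // same 1-second timeout as in the case
        } catch (Exception e) {
            System.out.println("back-check failed or timed out for key " + hotKey);
            return null;
        } finally {
            System.out.println("back-check for " + hotKey + " took "
                    + (System.currentTimeMillis() - start) + " ms");
        }
    }
}
```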

After running the above code, let's take a look at the actual output of the console:

(Screenshot: console output showing back-check timeouts)

You can see that after startup the application receives the HOTKEY messages over Redis and performs the back-checks, but a large number of back-check requests get no result within 1 second. The timeout of the future in the source code is also set to 1 second; in other words, a large number of Redis get requests have timed out.

In general, a request times out for one of two reasons: either the request never reaches the server, or the response never makes it back to the client. To locate the cause, we checked the connections to the Redis cluster on the application host, as follows:

(Screenshot: TCP connections to the Redis cluster, one with a non-empty receive queue)

The results show that a total of 6 connections have been established between the local host and the 3 shards of the Redis cluster, and the receive queue of one TCP connection stays non-empty. In other words, the response data for that connection has already reached the local socket buffer, but for some reason the client program is not consuming it. For comparison, we also observed the connections after starting on the earlier image version and found no data backlog there.

(Screenshot: the same check on the earlier image version, with no backlog)

At this point we had found that the backlog of data in the buffer was very likely the cause of the back-check timeouts. With that understood, we started asking:

  • Who is supposed to consume the data in the connection buffer?
  • What is the role of each connection?
  • Why is there a backlog on only one connection?
  • Why does the backlog appear only on the later image version?
  • Why is there no timeout problem when accessing Redis through Spring?

In-depth analysis

To answer these questions, we first need to understand how Lettuce works, and in particular how it accesses a Redis cluster under the hood.

(Screenshot: description of Lettuce's threading model from the official documentation)

According to the official documentation, Lettuce is built on Netty's NIO model: it serves a large number of Redis connections with a limited number of threads and makes more efficient use of system resources under high load.

Let's briefly review how Netty works. All I/O operations and events in Netty are handled by its core component, the EventLoop. When Netty starts, it creates a number of EventLoop objects according to its configuration; each connection is registered on one EventLoop, and one EventLoop can manage multiple connections. Each EventLoop contains a worker thread, a selector, and a task queue.

When a client establishes or registers a connection, these actions are submitted as tasks to the task queue associated with the EventLoop. Whenever an I/O event occurs on a connection or the task queue is not empty, the EventLoop's worker thread (a single thread) polls the queue, executes the tasks, and dispatches each event to the appropriate listener.

(Diagram: Netty's EventLoop model - worker thread, selector, and task queue)
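As a minimal illustration of this model (plain Netty, not Lettuce's actual bootstrap code), the snippet below creates an EventLoopGroup with two EventLoop threads; every channel created from the Bootstrap is registered on one of them, and its handlers run on that thread:

```java
import io.netty.bootstrap.Bootstrap;
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInboundHandlerAdapter;
import io.netty.channel.ChannelInitializer;
import io.netty.channel.EventLoopGroup;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.nio.NioSocketChannel;

public class EventLoopDemo {

    public static void main(String[] args) {
        // two EventLoop threads serve every channel created from this Bootstrap
        EventLoopGroup group = new NioEventLoopGroup(2);

        Bootstrap bootstrap = new Bootstrap()
                .group(group)
                .channel(NioSocketChannel.class)
                .handler(new ChannelInitializer<NioSocketChannel>() {
                    @Override
                    protected void initChannel(NioSocketChannel ch) {
                        ch.pipeline().addLast(new ChannelInboundHandlerAdapter() {
                            @Override
                            public void channelRead(ChannelHandlerContext ctx, Object msg) {
                                // runs on the EventLoop thread this channel is registered to;
                                // blocking here stalls every other channel on the same EventLoop
                                System.out.println(Thread.currentThread().getName() + " read: " + msg);
                            }
                        });
                    }
                });

        // each connect() registers a new channel on one of the two EventLoops (round-robin)
        bootstrap.connect("127.0.0.1", 6379);
        bootstrap.connect("127.0.0.1", 6380);
    }
}
```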

In Lettuce, interaction with the Redis cluster is handled by the internal RedisClusterClient object. When the project starts, the RedisClusterClient obtains information about all master and slave nodes from its configuration, connects to each node to fetch the cluster metadata, and then releases those connections to complete initialization. Afterwards, the RedisClusterClient connects to nodes on demand. Its connections fall into two categories: the primary connection and secondary connections. The connection created explicitly by the client is the primary connection and is used to execute commands that do not require routing, such as auth/select. Connections created implicitly according to the routing rules are secondary connections and are used for commands that must be routed by slot, such as common get/set operations. For the pub/sub mechanism, Lettuce maintains separate connections dedicated to event listening, so that subscribers receive published messages in real time.

So the 6 TCP connections we observed earlier consist of 1 cluster primary connection, 3 secondary connections, 1 pub connection for publishing events (the statefulRedisPubSubConnection declared by TestService), and 1 sub connection for subscriptions. All of these connections are registered on and managed by Netty's EventLoops.

(Diagram: the six connections and how they are registered on the EventLoops)

The core feature of the EventLoop mechanism is multiplexing: a single thread handles the read and write events of multiple connections. This only works as long as the EventLoop thread is never blocked; otherwise none of the connections registered on that thread get their events handled. From this we can speculate that when a socket buffer shows a backlog, something is probably blocking the EventLoop thread that the connection is registered on, leaving it unable to respond to readable events and read the buffered data.

To verify this guess, we printed the thread information in the logs for further observation.

(Screenshot: log output with thread names)

It turns out that most of the timeouts occur on the same EventLoop thread, Lettuce's epollEventLoop-9-3. So what is that thread doing at this moment?

Troubleshooting with Arthas

Here we use the thread command of Alibaba's Arthas troubleshooting tool to view the thread status and stack traces.

(Screenshots: Arthas thread command output and the stack trace of epollEventLoop-9-3)

From the stack traces we can see that Lettuce created a total of 3 Netty EventLoop threads. Thread 9-3, which is also the thread listening for pub/sub messages, is in the TIMED_WAITING state: it is blocked on a future get call inside the RedisLettucePubSubListener object while handling a hot-key update message.

(Screenshot: stack trace showing the blocking get inside RedisLettucePubSubListener)

Locate the cause

Through the Arthas investigation we learned that Lettuce responds to pub/sub events on a Netty EventLoop thread. That essentially pins down the cause of the buffer backlog: a blocking future get call is executed inside the RedisLettucePubSubListener callback, which blocks the EventLoop thread carrying it and leaves it unable to respond to the I/O events associated with its selector.

Why are pub/sub events handled on the same EventLoop as the I/O events of other connections? Consulting the documentation, we found that when Netty multiplexes connections it only starts a limited number of EventLoop threads (by default the number of CPUs * 2) to manage them, and each connection is registered onto an EventLoop in round-robin order. So when there are few EventLoops, multiple connections may end up registered on the same I/O thread.

  • How Netty calculates the number of EventLoop threads (paraphrased in the sketch below)
(Code screenshot: Netty's EventLoop thread count calculation)
  • Netty's round-robin strategy when registering channels on EventLoops
(Code screenshot: Netty's round-robin EventLoop chooser)
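The screenshots are not reproduced here; for reference, the corresponding logic in Netty 4.1 looks roughly like the following (paraphrased from MultithreadEventLoopGroup and DefaultEventExecutorChooserFactory, so treat it as an approximation rather than the exact source):

```java
// MultithreadEventLoopGroup: the default EventLoop thread count is CPUs * 2,
// unless overridden via the io.netty.eventLoopThreads system property
static {
    DEFAULT_EVENT_LOOP_THREADS = Math.max(1, SystemPropertyUtil.getInt(
            "io.netty.eventLoopThreads", NettyRuntime.availableProcessors() * 2));
}

// DefaultEventExecutorChooserFactory: channels are handed out to EventLoops round-robin,
// so with few EventLoops several connections inevitably share the same thread
private static final class GenericEventExecutorChooser implements EventExecutorChooser {

    private final AtomicLong idx = new AtomicLong();
    private final EventExecutor[] executors;

    GenericEventExecutorChooser(EventExecutor[] executors) {
        this.executors = executors;
    }

    @Override
    public EventExecutor next() {
        // simple round-robin over the EventLoop array
        return executors[(int) Math.abs(idx.getAndIncrement() % executors.length)];
    }
}
```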

Applying this to our scenario: there are 3 EventLoop threads in total and 6 connections are created, and the pub/sub connections are created before the secondary connections responsible for data routing, so a secondary connection is bound to end up on the same EventLoop thread as a pub/sub connection. When a key is routed to the secondary connection that shares a thread with pub/sub and that thread is blocked inside the pub/sub callback, the read and write events on that secondary connection will not fire normally even though data has already arrived in the buffer.

  • The EventLoop thread is blocked because the publish/subscribe callback blocks
(Diagram: a blocked pub/sub callback stalling the shared EventLoop thread)

For this exact scenario, the Lettuce documentation gives an explicit warning: https://lettuce.io/core/release/reference/index.html

  • In short: do not perform blocking operations in pub/sub callbacks.
(Screenshot: the corresponding warning in the Lettuce reference documentation)

Another thing worth noting is the number of EventLoops. Since we don't configure it explicitly, Netty would normally create CPUs * 2 EventLoops. Our test host has two cores, so in theory 4 EventLoops should be created; however, we only observed 3. This is because Lettuce adjusts Netty's default: it sets the number of EventLoops to the number of CPU cores, with a minimum of 3.

  • How Lettuce calculates the number of I/O threads (paraphrased in the sketch below)
(Code screenshot: Lettuce's I/O thread count calculation)
  • The official documentation states the same.
(Screenshot: the corresponding note in the Lettuce reference documentation)
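Paraphrased from Lettuce's DefaultClientResources (again an approximation, not the exact source): the default I/O thread count is the larger of 3 and the number of available processors, and the io.netty.eventLoopThreads system property can override it.

```java
// DefaultClientResources (paraphrased): Lettuce uses at least MIN_IO_THREADS I/O threads,
// otherwise one per available processor, unless io.netty.eventLoopThreads is set
public static final int MIN_IO_THREADS = 3;

public static final int DEFAULT_IO_THREADS;

static {
    DEFAULT_IO_THREADS = SystemPropertyUtil.getInt("io.netty.eventLoopThreads",
            Math.max(MIN_IO_THREADS, Runtime.getRuntime().availableProcessors()));
}
```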

Solution

Once the cause is located, the solution is not far behind. There are two approaches:

Add I/O threads

Increase the number of Lettuce I/O threads so that the pub/sub connections and the other connections are registered on different EventLoops. There are two ways to set this:

  1. Specify the number of I/O threads via the ClientResources option provided by Lettuce (a sketch follows)
(Code screenshot: configuring the I/O thread count through ClientResources)
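A minimal sketch of this option (the thread count of 8 is just an example value):

```java
import io.lettuce.core.RedisURI;
import io.lettuce.core.cluster.RedisClusterClient;
import io.lettuce.core.resource.ClientResources;
import io.lettuce.core.resource.DefaultClientResources;

public class ClientResourcesExample {

    public static void main(String[] args) {
        // raise the I/O thread pool size so pub/sub and data connections
        // land on different EventLoops (8 is just an example value)
        ClientResources resources = DefaultClientResources.builder()
                .ioThreadPoolSize(8)
                .build();

        RedisClusterClient client = RedisClusterClient.create(resources,
                RedisURI.create("redis://127.0.0.1:7000"));
        // ... use the client as before
    }
}
```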

  2. Since Lettuce uses Netty underneath, you can also specify the number of EventLoops by configuring Netty's io.netty.eventLoopThreads parameter. To quickly verify the effect, we set this parameter on a timing-out instance and restarted it; the problem disappeared, which further confirmed that this was indeed the cause of the access timeouts.

(Screenshot: setting io.netty.eventLoopThreads and the result after restart)

Asynchronous processing

The more elegant approach is not to perform blocking operations on the NIO thread at all, that is, to process pub/sub messages asynchronously, preferably in a separate thread, so that Netty's EventLoop is released as quickly as possible. This is exactly what the familiar spring-data-redis framework does.

  • spring-data-redis does this by handing each received message to a separate thread for processing (a sketch of the same idea for our case follows).
(Code screenshot: spring-data-redis dispatching pub/sub messages to its task executor)
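Applied to our scenario, the fix could look roughly like the sketch below (not the framework's actual code): hand the blocking back-check work to a dedicated thread pool so the pub/sub callback returns immediately.

```java
import io.lettuce.core.pubsub.RedisPubSubAdapter;

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class AsyncHotKeyListener extends RedisPubSubAdapter<String, String> {

    // dedicated pool for the blocking back-check work, so the Netty
    // EventLoop thread only does the cheap hand-off
    private final ExecutorService warmUpExecutor = Executors.newFixedThreadPool(4);

    @Override
    public void message(String channel, String message) {
        if ("HOTKEY".equals(channel)) {
            // return immediately from the EventLoop thread; block elsewhere
            warmUpExecutor.submit(() -> backCheckAndCacheLocally(message));
        }
    }

    private void backCheckAndCacheLocally(String hotKeys) {
        // blocking Redis reads and local cache population happen here,
        // off the EventLoop thread (details omitted)
    }
}
```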

Further thoughts

Although the issue was resolved, a few questions remained unanswered. After some further digging, we found the answers to these as well.

  1. Why is the earlier image version fine?

In the earlier analysis we mentioned that having too few EventLoop threads is what leads to the blocking problem. The later-version instance has 3 EventLoop threads; what about the earlier one? Checking with Arthas, we found that the earlier-version instance has 13 Lettuce EventLoops, far more than the later version. This means that in the earlier environment, the pub/sub connections and the other connections are registered on different EventLoops, so even if the pub/sub handling thread is blocked, it does not affect the read/write events of the other connections.

The highest EventLoop thread number on the later image version is 9-3

(Screenshot: Arthas output on the later image version, EventLoop threads up to 9-3)

The highest EventLoop thread number on the earlier image version is 9-13

(Screenshot: Arthas output on the earlier image version, EventLoop threads up to 9-13)

Why does the earlier image version create more EventLoops? This is actually a JDK pitfall: in early JDK 8 releases (before 8u131), Java could not correctly detect the number of CPU cores in a Docker environment, so the program obtained the host's core count instead.

(https://blogs.oracle.com/java/post/java-se-support-for-docker-cpu-and-memory-limits)
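The value that both Netty and Lettuce base their thread counts on is the JVM's reported processor count; a trivial probe (illustrative only) makes the difference between JDK builds visible:

```java
public class CpuCountProbe {

    public static void main(String[] args) {
        // On the earlier image (JDK 8u101) inside a Docker container this reported the
        // host's core count (16 in our case, as described below), not the container's
        // CPU limit; container-aware JDK builds report the limited value instead.
        System.out.println("availableProcessors = " + Runtime.getRuntime().availableProcessors());
    }
}
```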

The JDK version in the earlier image is 8u101 and the host has 16 cores; in other words, the earlier-version application mistakenly picked up the host's 16 cores, so each connection was registered on its own EventLoop and the blocking was avoided. Put differently, the earlier image was fine only because the program read a wrong value in the wrong environment and happened to land on the right result: two wrongs made a right. As for why the highest thread number is 13, that is because our Redis cluster is configured with two domain names, as shown in the following figure.

(Screenshot: Redis cluster configuration with two domain names)

When the RedisClusterClient initializes, it creates connections for the domain names (2), all cluster nodes (6), the pub/sub channel (1), the cluster primary connection (1), and the secondary connections (3), which adds up to exactly 13.

  2. Why does accessing Redis through Spring on the later image version not run into the timeout problem?

The original project accesses Redis in two ways: through Spring and through the caching framework. All of the EventLoops discussed above were created by the RedisClusterClient maintained by our own caching framework, whereas the Spring container uses a separate RedisClusterClient object to create its Redis connections. In Lettuce, each RedisClusterClient has its own EventLoopGroup underneath. In other words, the connections created by Spring never share an EventLoop with the caching framework's connections, so even if the EventLoop thread used by the caching framework is blocked, the event handling of Spring's connections is unaffected.

  3. Why is there no problem when the later image version accesses a single-node Redis instance?

Unlike the RedisClusterClient, which creates multiple primary and secondary connections when accessing a Redis cluster, the RedisClient that Lettuce uses for a single-node Redis instance creates only one connection. Together with the independent pub/sub connection, that is effectively 2 connections registered on 3 EventLoops, which avoids the conflict.

05

Summary

This article started from a Redis access timeout encountered in day-to-day work, explored how Spring, Lettuce, and Netty work together, and, with the help of debugging tools such as Arthas, analyzed how critical the EventLoop thread is to connection handling and why blocking operations must be avoided when processing pub/sub events. By comparing the behavior across image versions, we also deepened our understanding of how the JDK version interacts with the runtime environment, and accumulated valuable experience for troubleshooting similar problems in the future.

06

Resources

[1]https://lettuce.io/core/5.3.7.RELEASE/reference/index.html

[2]https://docs.spring.io/spring-data/redis/reference/redis/pubsub.html

[3]https://github.com/TFdream/netty-learning/issues/22

[4]https://github.com/alibaba/jetcache/blob/master/docs/CN/RedisWithLettuce.md

[5]https://arthas.aliyun.com/doc/thread.html

[6]https://blogs.oracle.com/java/post/java-se-support-for-docker-cpu-and-memory-limits

Author: Overseas technical product team

Source: WeChat public account "iQiyi technical product team"

Source: https://mp.weixin.qq.com/s/cIVZ6VoNQqF0HFee9psfVQ