
Timeout from interface to RAID


I. Preface

Not long ago, our real-time quote service (hereinafter "sirius") experienced occasional interface timeouts. After investigating layer by layer and ruling out all software factors, we traced the root cause to the RAID card at the hardware level.

This article describes how the problem occurred, how it was investigated, and what was concluded, and finishes with some general methods for troubleshooting performance problems.

PS: Given the nature of the investigation, this article contains a lot of monitoring data and analysis of that data; paying attention to this information while reading will help you follow the content.

II. About the sirius service

sirius is one of the core services in Qunar's hotel business, providing four core interfaces on the main flow:

  1. L page (List): the list of hotels matching the search criteria, with the lowest price for each hotel
  2. D page (Detail): after choosing a hotel on the L page, all offer details for that hotel
  3. B page (Booking): after choosing an offer on the D page, the user makes a reservation and fills in occupant information (checks whether inventory, price, discounts, etc. have changed)
  4. O page (Order): order payment (secondary verification of inventory, price, and discount information)

The simplified service link diagram is as follows:

(figure: simplified service link diagram)

III. How the problem occurred

  • 2024-02-10 (Lunar New Year's Day)
    • Received feedback from the main site that the timeout rate on the D page had increased
    • A preliminary check found an increase in org.apache.dubbo.remoting.TimeoutException exceptions
    • We tried restarting the service; all restarts were completed around 16:00 and the monitoring recovered
    • Preliminary conclusion: occasional single-instance GC issues caused by sirius running for too long (7 days)
    • The monitoring for that day is shown in the figure
  • 2024-02-17 (during the Lunar New Year holiday)
    • At 11:09, we received feedback from the main site that the timeout rate on the D page had started to rise again
    • Based on the previous experience, a batch restart was started; all restarts were completed by 13:00 and the timeout rate dropped, but it did not fully recover
    • After that, the pods with large numbers of dubbo timeout exceptions were deleted and rebuilt; the timeout rate had largely recovered by around 15:35 and fully recovered at 17:10
    • At this point the earlier conclusion could already have been overturned, but we did not think that far at the time
    • The monitoring for that day is shown in the figure
  • 2024-03-01
    • Received feedback from the main site that on 02-24 the timeout rate of the D page had increased in the same way
    • At that point sirius had only been released for a day; combined with what happened on 02-17, this completely overturned the earlier conclusion, and we began to broaden the scope of the investigation
    • The monitoring for that day is shown in the figure

IV. The investigation process

(1) Finding that the problem is host-related

A preliminary investigation revealed that the timeouts did not affect all sirius instances, but only individual pods.

Further investigation of the pods that had previously shown problems found that the hosts they were running on at the time overlapped (given a pod_name/pod_ip and a time, you can look up which host the pod was on from the k8s logs).
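For reference, a minimal sketch of the pod-to-host lookup for pods that are still running (namespace and pod names are placeholders; for pods that have already been rebuilt, the mapping has to be recovered from the cluster's historical logs, which is what the k8s-log lookup above refers to):

# Which node is this pod currently scheduled on?
kubectl -n <namespace> get pod <pod_name> -o wide
# Or print just the node name:
kubectl -n <namespace> get pod <pod_name> -o jsonpath='{.spec.nodeName}'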

At this point we could conclude that the timeout problem is not common to all sirius instances but is host-related, and several suspicious hosts were identified.

(2) Discovering the periodic pattern

The interval between the three incidents is 7 days, which is unlikely to be a coincidence, so we suspected the problem was periodic.

The D-page interface timeout-rate monitoring shows the same phenomenon during the same period every Saturday, traceable back as far as 2023-10-28.

The following are the monitoring charts for the last Saturday of each month starting from October 2023:

2023-10-28 (Sat): (figure)

2023-11-25 (Sat): (figure)

2023-12-30 (Sat): (figure)

2024-01-27 (Sat): (figure)

2024-02-24 (Sat): (figure)

Let's take a look at a few non-Saturday monitoring charts:

2024-02-22 (Thu): (figure)

2024-02-23 (Fri): (figure)

From the charts above, two phenomena can be summarized:

  1. Periodicity
    • Every Saturday at around 11:00, the timeout rate starts to rise
    • On non-Saturdays, there is no such pattern
  2. The rise in the timeout rate shows a worsening trend, reflected in two aspects
    • The timeout rate gradually increased from 0.05% to 0.5%, a 10-fold increase
    • The duration grew longer and longer, from 2 hours to 8 hours

Intuitively, the periodicity is the more critical clue, so we started from it. Periodicity naturally brings scheduled tasks to mind, which led to several guesses:

  1. The application side
    • sirius has no scheduled tasks
    • Even if there were some old code we did not know about, it would not always strike specific hosts
  2. The infrastructure side (i.e. enhancements the infrastructure group applies to all Java applications across the company)
    • This guess is so far-fetched as to be nearly impossible; it is listed only for logical completeness
    • It also fails to explain why the problem appears only on particular hosts
  3. The host (operating-system scheduled tasks)
    • We asked the O&M team to check; the answer was that the crontab on the hosts of all business applications is identical, and no host has a special crontab

At this point we could conclude that, given such an obvious periodic pattern, the root cause almost certainly comes from outside sirius; sirius is merely the affected party.

(3) Discovering abnormal IO usage waveforms

We analyzed the monitoring of the suspicious hosts identified in step (1).

During the problem period (11:00 to 19:00 every Saturday), three host metrics also fluctuated in the same way: CPU wait, disk IO usage, and blocked processes, as shown in the following figures:

(figures: CPU wait, disk IO usage, and blocked-process monitoring)
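Incidentally, the first two of these metrics can be checked directly on a host with standard tools; a minimal sketch, assuming procps' vmstat is available (disk IO usage is covered by the iostat sketch a little further down):

vmstat 1
# procs column b  -> count of processes blocked in uninterruptible (usually IO) sleep
# cpu   column wa -> percentage of CPU time spent waiting for IO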

Comparing these with the D-page interface timeout-rate monitoring, we find:

  • The waveforms are similar
  • Consistent duration
  • There is the same periodic regularity

It is not difficult to conclude that these metrics are strongly correlated with the D-page interface timeout rate; the next step is to turn that correlation into causation.

First, assume the IO fluctuation is a result. The causal chain would be: network timeouts → surge in timeout exceptions → surge in exception logs → more IO → higher CPU wait and more blocked processes, which seems logically plausible.

Starting from "network timeouts + weekly pattern", network hardware (such as switches) comes to mind. We contacted the network team, and they reported nothing abnormal on the switches serving these hosts during that time window.

Moreover, network problems usually have a much broader impact; they are unlikely to show up as such a slight increase in the timeout rate.

Next, assume the IO fluctuation is the cause: among CPU wait, disk IO usage, and blocked processes, the IO usage is the most likely cause, while the other two look more like results.

Host monitoring has three disk-IO metrics: usage (utilization), count, and throughput; count and throughput distinguish reads from writes, while usage does not.

Further inspection of these metrics revealed something strange: during the problem period, the usage waveform bears no relation to the count and throughput waveforms.

Normally, IO usage = read + write, so its waveform should correlate with the count and throughput waveforms.
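For reference, all three metrics map onto columns of iostat's extended output; a minimal sketch, assuming sysstat's iostat is installed (device names sda/sdb as in the charts below):

iostat -x 1 sda sdb
# r/s, w/s     -> the "count" metric (read/write requests per second)
# rkB/s, wkB/s -> the "throughput" metric
# %util        -> the "usage" metric (fraction of time the device was busy)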

For example, during normal hours, the I/O waveform looks like this:

(figures: IO count, throughput, and usage during normal hours)

It is clear from the charts that:

  1. The count and throughput waveforms are very similar
    • In extreme cases, e.g. a single IO that reads/writes 1 GB, the count and throughput waveforms would of course differ completely
    • But over a large sample of requests they remain similar
  2. Usage is a superposition of reads and writes
    • The usage spikes mainly come from reads, i.e. the blue waveform in the count and throughput charts
    • The baseline of the usage comes from writes, i.e. the orange waveform in the count and throughput charts (there is a noticeable increase around 15:00)
  3. Usage is low, with a maximum of 1.15%

During the problem period, the IO waveform is as follows:

(figures: IO count, throughput, and usage during the problem period)

As you can see from the charts:

  1. The count and throughput waveforms are still very similar
  2. The usage rate is significantly higher (its maximum rose from 1.15% to 34%, roughly a 30-fold increase), making its waveform appear completely unrelated to count and throughput
    • More precisely, a third waveform has appeared in addition to read/write, i.e. usage = read + write + ?, and this third component is very large relative to read/write
  3. The usage of both sda and sdb fluctuated abnormally, as shown in the following figure

From the O&M team we learned that our application logs are stored on sdb, so in theory the application generates no IO on sda at all.

At the same time, we found another host, node303, which was not running any sirius pods at the time (2024-02-24, as can be judged from its CPU monitoring) yet showed the same abnormal IO usage waveform, as shown in the following figures:

(figures: abnormal IO usage waveforms on node303)

At this point we could draw an interim conclusion: the root cause is confirmed to be unrelated to sirius and is most likely tied to the abnormal IO usage waveform; the next step is to find the source of that abnormal IO.

(4) Finding the source of the abnormal IO

Locate the list of anomalous hosts

Since the problem is tied to the abnormal IO usage waveform, the problematic hosts can be identified quickly from host monitoring. sirius has 100 dedicated hosts in total, of which 21 were screened out for showing the abnormal IO usage waveform on Saturdays.

Because the problem recurs steadily every week and the next day happened to be a Saturday (2024-03-02), we decided to investigate live when it occurred.

Preparation

From the problem hosts, two were selected as a comparison pair:

  • node307: running 2 sirius pods as usual (a single host's hardware resources are only enough for 2 pods).
  • node308: its 2 pods were taken offline in advance to guarantee it carried no live traffic during the problem period; old logs are cleaned up regularly, so in theory the IO on this host should be very small.

Wait until around 11:00 on 03-02 and, when the problem occurs, observe the following:

  • Compare the processes on 307 and 308 to see whether there are any differences
  • Find out exactly what IO operations 308 is performing
  • Verify that the list of abnormal hosts filtered out earlier is correct
    • Mainly, verify that no problematic pod sits outside the abnormal-host list
  • Collect traces for further analysis

I was responsible for the application side, and the O&M team for the host side.

On site: the application side

All of the problematic pods were found, and without exception they sat on hosts from exactly the abnormal-host list filtered out earlier.

At the same time, some traces were collected. Looking at the timeline column in the figure below, we can see that the call timed out at the point where dubbo initiated it: after the consumer sent the request, it took a long time before the provider received it.

(figure: trace timeline)

Further analysis of the dubbo timeout exceptions shows two main types.

The first is a server-side timeout, with the keyword "Waiting server-side response timeout by scan timer". Searching for it in kibana:

(figure: kibana search results for server-side timeouts)

These show no overall pattern and no Saturday spike, so they were ruled out.

The second is a client-side timeout, with the keyword "Sending request timeout in client-side by scan timer". Searching for it in kibana:

Figure 1: Logs over a 7-day period


Figure 2: Logs from 10:00 to 19:00 on the same day


Figure 1 shows that client-side timeouts surge on Saturdays, and the waveform in Figure 2 resembles the D-page timeout-rate waveform, which gives the causal relationship: dubbo client-side timeout → D-page interface timeout.
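As an aside, the same counting can be done on a host's local log file when kibana is not at hand; a rough sketch, where the log path and the "yyyy-MM-dd HH:mm" prefix of each line are assumptions about the log layout:

# Count client-side dubbo timeouts per minute
grep "Sending request timeout in client-side by scan timer" /path/to/app.log \
  | awk '{print substr($0, 1, 16)}' | sort | uniq -c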

On site: the host side

As expected, the abnormal IO usage waveform was visible on both 307 and 308, as shown in the figures:

(figures: IO usage on node307 and node308)

As you can see, the usage rate of 307 is much higher than that of 308, and the duration of fluctuations is much longer.

The O&M team reported that there was no difference between the processes on 307 and 308, and that no high-IO process could be found on 308, so the source of the abnormal IO could not be confirmed.
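For context, the per-process check amounts to something like the following sketch (assuming iotop and sysstat's pidstat are available on the host):

iotop -obn 3    # batch mode, 3 samples, only show processes/threads actually doing IO
pidstat -d 1 5  # per-process kB read/written per second, five 1-second samples
# On node308 neither showed any significant reader or writer, even while iostat
# reported high %util -- which is what pointed below the operating system.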

At this point we could only conclude that the abnormal IO does not come from any process on the host. WHAT??? HOW???

(5) Looking for host differences

Make a bold assumption

Since all software factors are excluded, what else does our service have besides software? Hardware!

For hardware related to disk IO, two possibilities come to mind:

  • Disk defragmentation
  • Whether the host has RAID, and whether the RAID generates periodic IO

Disk fragmentation direction

On reflection, disk fragmentation only causes problems in combination with files, and files are an operating-system-level concept that has nothing to do with the disk itself.

In other words, the disk does not care about fragmentation at all; even if defragmentation were happening, it would be initiated by the operating system, not by the disk.

And if it were initiated by the operating system, there would be a corresponding process, and it would leave traces in the IO count and throughput monitoring, which is inconsistent with what we observed.

So this direction is excluded.

RAID direction

Since I didn't know much about RAID, I did some exploration and learned that RAID can be implemented in different ways:

  • Hardware, i.e., RAID cards
  • Software approach, usually provided by the operating system

And on the Wikipedia page for RAID I found a mention of periodic RAID IO:

(figure: excerpt from the Wikipedia RAID article)

Roughly translated: A RAID controller (i.e., a RAID card) periodically reads and checks all blocks in the array.

After reading this I felt we were close to the truth; the next step was verification.

Confirmation

Since only specific hosts are problematic, there must be differences between the abnormal hosts and the normal ones. There are roughly these directions:

  • Whether the host has a RAID card; if not, this guess does not hold
    • If it does, whether the card models differ
  • Whether the host disks differ (RAID is only a controller with no storage of its own; the real IO still goes to the disks), mainly:
    • Type: HDD or SSD
    • Model, service life, and so on

Following these directions, we compared this information between the abnormal hosts and the normal hosts and found that (a sketch of the commands follows this list):

  1. Using lspci | grep RAID or lshw -class storage to inspect the RAID card, all abnormal hosts turned out to have a RAID card of the same model, but some normal hosts also have this RAID card model
  2. Using lsblk -S to inspect the disks, the abnormal hosts all turned out to have the same disk model, but some normal hosts also have this disk model
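A minimal sketch of the hardware inventory commands named above (smartctl is an extra assumption on my part, useful for disk age/wear details; behind a MegaRAID card it may need the -d megaraid,N device option):

lspci | grep -i raid        # is there a RAID controller on the PCI bus, and which one?
lshw -class storage         # controller details (vendor, product, firmware); usually needs root
lsblk -S                    # SCSI disks with vendor/model, e.g. sda and sdb behind the card
smartctl -i /dev/sda        # optional: disk model/firmware details (smartmontools)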

This step established that hosts without this particular combination of RAID card model and disk model have no problem; in other words, the combination of the two models is a necessary, but not sufficient, condition for the problem.

At this point we could draw another interim conclusion: the abnormal IO most likely comes from the RAID card and is related to the host's RAID card model and disk model.

(6) The O&M team's conclusion

Periodic RAID patrol schedules

The LSI MegaRAID xxxxxx RAID card has a periodic patrol/check function that is enabled by default; the default trigger time is 3 a.m. GMT (11 a.m. Beijing time) every Saturday, as shown below:

-----------------------------------------------
Ctrl_Prop                    Value
-----------------------------------------------
CC Operation Mode            Concurrent             # CC = Consistency Check
CC Execution Delay           168                    # cycle: 168 hours = 1 week
CC Next Starttime            03/09/2024, 03:00:00   # next execution time
CC Current State             Stopped
CC Number of iterations      97
CC Number of VD completed    2
CC Excluded VDs              None
-----------------------------------------------
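The table above looks like output from Broadcom's storcli utility; a hedged sketch of how such a schedule can be inspected and, as the O&M team eventually did, disabled. The controller index /c0 and the exact option spellings are assumptions, so check your storcli version's help before running anything:

storcli64 /c0 show cc              # consistency-check properties/schedule (the table above)
storcli64 /c0 show patrolread      # patrol-read properties/schedule
storcli64 /c0 set cc=off           # disable consistency check
storcli64 /c0 set patrolread=off   # disable patrol read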

About the disk and RAID card models

All of our hosts have RAID cards, but their configurations differ. The O&M team's conclusion is that problems occur only in a specific scenario:

  • The RAID card model is LSI MegaRAID xxxxxx
  • The disks are SSDs in RAID1
  • In service for more than 1 year

What's wrong with this model of RAID card?

Here is the RAID vendor's explanation:

By default, the Patrol Read (PR) and Consistency Check (CC) functions are enabled on LSI adapter cards. If there is frequent I2C printing to the serial port while PR or CC is executing, the card's busy level rises, which increases disk latency. Scenarios with strict IO-latency requirements, such as distributed storage, are therefore affected.

The 94/95-series adapter firmware (MR7.20 and earlier) handles I2C message printing as interrupts to the serial port. Because serial-port log printing is inefficient, when the log-printing load increases while a PR or CC task is running, the card's busy level rises and hard-disk IO latency increases. If a 94/95-series card is upgraded to MR7.21 or later, the new firmware redirects I2C message printing from the serial port to RAM, which improves the card's logging efficiency and solves the problem.

However, the latest firmware release notes for the 9361 adapter card contain no description of a fix for this problem, so to avoid affecting services the only option is to disable PR and CC.

Put simply: the firmware implementation has a performance problem that we cannot work around ourselves; the only options are to upgrade the firmware (where a fixed version exists) or to turn off the patrol/check functions.

At this point, the root cause has been clarified, but we still have a question that is not clearly explained: how does high IO cause timeouts?

(7) How does the abnormal IO cause a timeout?

sirius writes its logs asynchronously, which normally does not block requests.

With this question in mind, I dug into the logback configuration and found that logback's asynchronous appender blocks the logging thread by default once its queue is full (the AsyncAppender's neverBlock option defaults to false), as shown below:

(figure: logback AsyncAppender documentation excerpt)

Thinking further: since the root cause is high IO, every IO operation is affected, such as the application's logback logs, tomcat's access logs, and even IO-heavy commands (tail, grep, ...).
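One simple way to see this effect directly is to measure raw IO latency on the affected filesystem during the patrol window; a sketch assuming the ioping utility is installed, with an illustrative path:

ioping -c 10 /path/to/log/dir   # latency of small reads; should rise sharply while the RAID patrol is running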

With this in mind, we found supporting evidence in the tomcat access logs collected on site on 2024-03-02.

The following are log entries from around 11:36:43 and 11:39:04 on one of the abnormal pods:

# The first column is the response time (ms), the second column is the time point

117 [02/Mar/2024:11:36:36
91 [02/Mar/2024:11:36:36
4055 [02/Mar/2024:11:36:43
5716 [02/Mar/2024:11:36:43
5654 [02/Mar/2024:11:36:43
3419 [02/Mar/2024:11:36:43
5727 [02/Mar/2024:11:36:43
5361 [02/Mar/2024:11:36:43
6769 [02/Mar/2024:11:36:43
6222 [02/Mar/2024:11:36:43
1932 [02/Mar/2024:11:36:43
6808 [02/Mar/2024:11:36:43
3705 [02/Mar/2024:11:36:43
6818 [02/Mar/2024:11:36:43
6388 [02/Mar/2024:11:36:43
5713 [02/Mar/2024:11:36:43
2609 [02/Mar/2024:11:36:43
2614 [02/Mar/2024:11:36:43
1989 [02/Mar/2024:11:36:43
6154 [02/Mar/2024:11:36:43
4531 [02/Mar/2024:11:36:43
5211 [02/Mar/2024:11:36:43
4753 [02/Mar/2024:11:36:43
6432 [02/Mar/2024:11:36:43
6420 [02/Mar/2024:11:36:43
223 [02/Mar/2024:11:36:43
2271 [02/Mar/2024:11:36:43
6602 [02/Mar/2024:11:36:43
...
339 [02/Mar/2024:11:38:57
155 [02/Mar/2024:11:38:57
154 [02/Mar/2024:11:38:57
3831 [02/Mar/2024:11:39:00
4856 [02/Mar/2024:11:39:04
4002 [02/Mar/2024:11:39:04
5740 [02/Mar/2024:11:39:04
4285 [02/Mar/2024:11:39:04
5803 [02/Mar/2024:11:39:04
3684 [02/Mar/2024:11:39:04
5898 [02/Mar/2024:11:39:04
4751 [02/Mar/2024:11:39:04
7672 [02/Mar/2024:11:39:04
5940 [02/Mar/2024:11:39:04
5698 [02/Mar/2024:11:39:04
4992 [02/Mar/2024:11:39:04
3852 [02/Mar/2024:11:39:01
5318 [02/Mar/2024:11:39:04
5926 [02/Mar/2024:11:39:04
5969 [02/Mar/2024:11:39:04
4388 [02/Mar/2024:11:39:04
5717 [02/Mar/2024:11:39:04
4819 [02/Mar/2024:11:39:04
7307 [02/Mar/2024:11:39:04
5896 [02/Mar/2024:11:39:04
5442 [02/Mar/2024:11:39:04
5815 [02/Mar/2024:11:39:04
Notice that between 11:36:36 and 11:36:43, and again between 11:38:57 and 11:39:04, there are hardly any log entries.

This is because a large number of requests were blocked the whole time and did not return until those two moments, with a maximum response time of more than 7 seconds.

At both moments, the IO usage of the host running this pod was very high, exceeding 80%.
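A quick sketch for spotting such stalls in an access log shaped like the excerpt above (column 1 = response time in ms, column 2 starts with "[dd/Mon/yyyy:HH:MM:SS"; the access_log path is illustrative):

# Requests per second, and the slowest response in that second
awk '{ts = substr($2, 2); cnt[ts]++; if ($1 > max[ts]) max[ts] = $1}
     END {for (t in cnt) print t, cnt[t], max[t]}' access_log | sort
# Gaps in the printed seconds, followed by a second with a burst of multi-thousand-ms
# responses, are exactly the blocked-then-released pattern shown above.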

At this point the complete causal chain is clear: the host runs a RAID consistency check → the host's effective IO capacity drops → the log queues back up → logging starts to block → various asynchronous calls (dubbo/redis) time out → sirius interfaces time out.

The reason dubbo requests are hit hardest is that sirius has dubbo access logging enabled; these logs record full request and response details, are very large, and are not allowed to be dropped.

On 2024-03-09 we verified this: after scaling back the logback configuration, the abnormal hosts produced no more timeouts.

V. Conclusions

(1) The triggering conditions of the problem

  • Hardware side
    • A specific model of RAID card, with the RAID patrol schedule enabled
    • SSD disks + RAID1 + long service life
  • Software side
    • The application generates a large amount of disk IO: in this case the IO usage rises significantly and the patrol lasts much longer, stretching from 3 hours to 8 hours
    • The application is sensitive to response time: only then does anyone notice the problem

Some of the hosts used by sirius meet all of the above conditions.

(2) The scope of the problem

So far, sirius is the only affected party that has been found, and the impact is very small.

In theory, though, anything that meets the conditions is affected; the corresponding symptom is that every Saturday at 11 a.m. some metrics start to fluctuate slightly, last for 3 to 8 hours, and then recover on their own.

(3) Damage assessment

Although the problem lasted a long time, from 2023-10 to 2024-03, almost half a year, the impact was too small to meet our incident criteria.

Considering that the interface timeouts showed a gradually worsening trend, this counts as stopping the loss in time and avoiding a larger negative impact.

(4) Solution

  • O&M side
    • Disable the patrol schedules (PR, CC) for this model of RAID card company-wide
  • Business side
    • Reduce non-essential logging

VI. Other questions

(1) On 2024-02-10, why did the timeout rate recover after Sirius was fully restarted?

Coincidence. So much time has passed that it is no longer possible to reconstruct the precise timing of the operations that day.

(2) Why did the problem only start showing up in 2023-10?

In 2023-10, sirius switched to a deployment mode of dedicated, large-spec pods.

Before that, sirius had more than 600 pods and shared a large host pool with all of the company's businesses, so its IO pressure was spread out; most other businesses generate far less IO than sirius, so the problem stayed hidden.

After the switch, sirius was consolidated to 200 pods on a fixed pool of 100 dedicated hosts, so all of its IO pressure fell on those 100 hosts and the problem surfaced.

(3) During the RAID patrol, why does only IO usage show an abnormal waveform, while IO count and throughput do not?

Disk IO monitoring only records IO initiated through system calls; the RAID patrol commands are issued to the disks directly by the RAID card, without going through the operating system.

Strictly speaking, IO usage does not record the patrol IO either; but this metric is measured along the time dimension. While the disk is busy with the patrol, its capacity to serve IO system calls drops and their latency rises.

So it is the RAID patrol, by inflating the time taken by IO system calls, that shows up in the usage metric.

(4) Why did the timeouts gradually worsen over time?

Unknown, but there are a few guesses:

  • It is related to the amount of data on the disks
  • It is related to the degree of disk fragmentation (SSDs also suffer from fragmentation, but the performance impact is far smaller than on HDDs; and since SSD lifespan is bounded by erase cycles, defragmenting an SSD only accelerates its wear)

VII. General performance troubleshooting methods

This part is mostly based on personal understanding and experience, so it is inevitably subjective and limited in places; if you see things differently, you are welcome to discuss!

(1) Methodological level

  • Find patterns: look for regularities or commonalities in the anomalies
    • Patterns in how things change
    • Patterns of correlation
    • Periodicity
    • ......
  • Keep a comparison
    • While troubleshooting, always keep the idea of comparison in the back of your mind
    • Starting from the points of difference can greatly improve troubleshooting efficiency
  • Reproduce the problem by any means possible
  • Make bold hypotheses and verify them carefully

(2) Technical level

  • In computing, performance comes down to no more than four resources: CPU, memory, disk, and network
    • The root cause of most performance problems can be attributed to one of them
    • The impact of these four is not a point problem but an area problem; find the affected area and you will find the problem
  • Build a chain of suspects based on the actual situation
    • Application → middleware → ...... → operating system → hardware → physical world
    • Keep working downward, step by step, without skipping or missing a layer
  • Broaden your knowledge
    • Equip yourself to ask the right question at the critical moment, then use search engines + AI to find the answer
    • In today's technical environment, asking the right question matters far more than already knowing the answer
  • We recommend reading "Systems Performance" (《性能之巅》), which contains a very comprehensive methodology and toolset; if you want to become a performance expert, don't miss it

(3) The mindset level

  • Conviction
    • 99.99% of performance problems have a reason
    • Don't simply write them off as "network jitter" or "unstable performance"; every time we fall back on such excuses, we miss a chance to improve ourselves
  • Dare to question "authority"
    • Questioning should be grounded in facts and be reasonable
    • Whatever is being questioned must be provable or falsifiable
  • Be wary of coincidences: there is often a deeper reason behind a coincidence
  • Keep a good mindset
    • Becoming a performance expert cannot be rushed; start with relatively simple problems, build positive feedback, and develop confidence
    • Getting to the bottom of many problems genuinely requires some luck, so don't be discouraged if you really can't find the answer
  • Don't put boundaries around your work; in the end it always comes back to solving the problem itself

VIII. References

  • logback asynchronous logging (AsyncAppender): https://logback.qos.ch/manual/appenders.html#AsyncAppender
  • RAID periodic checking: https://en.wikipedia.org/wiki/RAID#Integrity
  • MegaRAID consistency check support: https://techdocs.broadcom.com/us/en/storage-and-ethernet-connectivity/enterprise-storage-solutions/storcli-12gbs-megaraid-tri-mode/1-0/v11869215/v11673749/v11673787/v11675132.html
  • SSD disk fragmentation: https://en.wikipedia.org/wiki/Solid-state_drive#Hard_disk_drives (search for "fragmentation")

Author: Li Chengya

Source: WeChat public account "Qunar Technology Salon"

Source: https://mp.weixin.qq.com/s/riHDAecplHoSxV8_Kz-u_Q
