
Tech Talk · Cloud Technology | How to Ensure High Reliability of Key Basic Components?

First, the definition and objectives of reliability

Reliability means that the system does not crash unexpectedly, restart, or lose data. A reliable system must be able to self-repair faults and, for faults it cannot repair, isolate them as much as possible so that the rest of the system keeps running normally. In short, the goal of reliability is to reduce business interruption caused by failures (product quality, external components, environment, human factors, etc.).

High reliability can be understood on three levels. First, no failure occurs: the system always operates normally, which requires improving hardware R&D quality. Second, a fault occurs but does not affect the business. Third, a fault affects the business but recovery is fast. The latter two levels can be achieved by "software-defined" means to avoid business interruption caused by hardware failures.

When discussing reliability, it is important to understand the key basic components of a server. Industry server failure statistics show that hardware problems are concentrated in memory, hard disks, CPUs, motherboards, power supplies, and network cards. In a cloud environment, a single server may run several virtual machines carrying different services and scenarios; once the physical device fails, many users are affected, causing huge losses to the operator as well. Among the existing failure modes, memory and hard disk failures are the most frequent and the most severe.

Memory and hard disk failures can be further understood through the following two cases.

Case 1: a memory UCE error causes the server to repeatedly crash and restart. After the server went down and restarted, engineers logged in to the server's BMC management interface and queried the alarm information, finding the following alarm: "2019-07-25 08:03:06 memory has a uncorrectable error." Further inspection of the hardware error log file showed that DIMM020 had a large number of memory CE errors and some memory UCE errors, indicating that a UCE error on the DIMM020 memory module caused the server crash and restart.
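As an illustration of this kind of triage, here is a minimal Python sketch that scans a hardware error log, counts CE/UE events per DIMM, and flags suspect modules. The log line format, the log path, and the CE threshold are hypothetical assumptions for illustration, not the actual BMC log grammar.

```python
import re
from collections import Counter

# Hypothetical log line format, e.g.:
#   "2019-07-25 08:03:06 DIMM020 correctable error ..."
LINE_RE = re.compile(r"(?P<dimm>DIMM\d+).*?(?P<kind>uncorrectable|correctable) error", re.I)
CE_THRESHOLD = 1000  # assumed policy: flag a DIMM after this many CEs

def triage(log_path: str) -> None:
    ce, ue = Counter(), Counter()
    with open(log_path) as f:
        for line in f:
            m = LINE_RE.search(line)
            if not m:
                continue
            counter = ue if m["kind"].lower().startswith("un") else ce
            counter[m["dimm"]] += 1
    for dimm in sorted(set(ce) | set(ue)):
        if ue[dimm] or ce[dimm] >= CE_THRESHOLD:
            print(f"{dimm}: {ce[dimm]} CE, {ue[dimm]} UE -> candidate for replacement")

triage("/var/log/hw_error.log")  # hypothetical path
```

This mirrors the conclusion in Case 1: a DIMM showing a large CE count plus any UEs is the prime suspect for replacement.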

Case 2: a slow disk causes a big data cluster to fail. A slow-disk fault occurred on a cluster node of a big data platform (the system runs an iostat command every second to monitor disk I/O metrics; if svctm exceeds 100 ms more than 30 times within 60 s, the disk is considered faulty and an alarm is generated). First ZooKeeper failed, then the cluster balance state became abnormal; the other services on the same node then failed in turn until every service on the node had failed, after which they restarted and recovered automatically. However, after 3-10 minutes the node repeated the cycle. If no other cause is found, the only option is to restart the system at the cost of a business interruption of about ten minutes.
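The detection rule quoted above is simple enough to sketch. The following Python fragment samples svctm once per second via `iostat -dx 1` and raises an alarm when more than 30 of the last 60 samples exceed 100 ms; the device name and the column handling are assumptions about the local sysstat output (newer sysstat versions have dropped the svctm column entirely).

```python
import subprocess
from collections import deque

DEVICE = "sdb"                                 # assumed device under watch
WINDOW, LIMIT, THRESHOLD_MS = 60, 30, 100.0    # the rule from the text

def monitor() -> None:
    samples: deque[float] = deque(maxlen=WINDOW)   # sliding 60 s window
    proc = subprocess.Popen(["iostat", "-dx", "1"],
                            stdout=subprocess.PIPE, text=True)
    header: list[str] = []
    for line in proc.stdout:
        cols = line.split()
        if cols and cols[0].startswith("Device"):  # remember column layout
            header = cols
        elif cols and header and "svctm" in header and cols[0] == DEVICE:
            samples.append(float(cols[header.index("svctm")]))
            if sum(s > THRESHOLD_MS for s in samples) > LIMIT:
                print(f"ALARM: {DEVICE} slow-disk rule tripped: "
                      f">{LIMIT} samples over {THRESHOLD_MS} ms within {WINDOW} s")
                samples.clear()    # avoid re-alarming on the same window

monitor()
```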

Second, memory reliability technology

Externally, a memory module consists of the PCB, the gold fingers, the memory chips, the retaining-clip notches, and so on. Internally, its structure includes the storage body, the storage cell (Cell), the storage array (Bank), the Chip (device), the Rank, the DIMM, and the Channel.

Given this structure, advances in memory technology (process shrinkage and higher frequency) tend to bring higher failure rates.

(1) Challenges brought by process shrinkage

(1) Lithography quality is more easily affected by diffraction, focusing, and similar issues.

(2) Epitaxial growth (EPI) is prone to missed growth and short circuits between epitaxial structures.

(3) The impact of particles from etching, cleaning, and similar steps is aggravated.

(4) Each die becomes smaller, and the number of dies per wafer increases.

(5) Going forward, multi-die TSV packaging will become more difficult, and the failure rate will rise.

(2) Challenges brought about by frequency increase

(1) High-speed signal timing margins shrink, making compatibility problems more prominent.

(2) Signal attenuation is more severe; DDR5 adds DFE (decision feedback equalization) circuitry, making the design more complex.

(3) Higher frequency brings higher power consumption and stricter power integrity (PI) requirements.

By whether the fault can be corrected, memory failures fall into two categories. CE (Correctable Error): the collective term for errors that can be corrected, covering any single-bit error and some single-event multi-bit errors. UE (Uncorrectable Error): the general term for errors that cannot be corrected; some UE errors bring the system down because the operating system cannot handle them.

Causes of memory failure include: charge leakage in memory cells, high impedance on the memory data transmission path, abnormal memory operating voltage, abnormal internal timing, abnormal internal operations (such as self-refresh), abnormal bit lines/word lines, abnormal address decoding lines, weak memory cells (still usable), and soft failures caused by cosmic rays or radioactivity (causing no permanent damage and not reproducible on repeated testing).

Failure handling is layered, and the industry follows two schools of thought: software-led and hardware-led. From the hardware-led point of view, high-quality devices are selected in the first place, and the hardware itself also provides a degree of "reliability", such as automatically correcting relatively simple errors.

But hardware alone cannot achieve full reliability; software must do some of the work. A software-defined approach isolates a faulty memory region so that it is no longer used and therefore no longer affects the business.

If a CE (correctable error) is left unhandled, it may develop into an uncorrectable UE error. It is therefore necessary to nip the problem in the bud: when a CE occurs, take further action and isolate the suspect fault.

Sangfor Cloud's design ideas for the memory CE fault isolation scheme

When the memory hardware raises a CE interrupt, the scheme checks whether the affected memory can be isolated (that is, it is not in use by the operating system kernel or by peripherals). If it can, the memory is added to the isolation whitelist and isolated: the isolation function switches the failed memory page for a normal one, and the failed page is then retired and never used again.

At the same time, details such as the location and occurrence count of these faults are raised as alarms to help O&M personnel replace the faulty memory modules. For memory that cannot be isolated online, the error-region information recorded before a restart is used to exclude the problem regions while the system is not yet using that memory, ensuring that the memory the system does use contains no faulty parts.


Overall architecture of the memory CE fault isolation scheme
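Here is a minimal sketch of the page-retirement step described above, using the standard Linux hwpoison soft-offline interface (requires root and a kernel built with CONFIG_MEMORY_FAILURE). In the real scheme the physical address would come from the CE interrupt handler; here it is simply a parameter, and the fallback behavior is an illustrative assumption.

```python
# Standard Linux hwpoison soft-offline node (CONFIG_MEMORY_FAILURE, root only).
SOFT_OFFLINE = "/sys/devices/system/memory/soft_offline_page"

def soft_offline_page(phys_addr: int) -> bool:
    """Ask the kernel to migrate the page's contents and retire the page."""
    try:
        with open(SOFT_OFFLINE, "w") as f:
            f.write(f"{phys_addr:#x}")
        return True    # contents moved to a healthy page; old page retired
    except OSError as e:
        # E.g. the page belongs to the kernel and cannot be isolated online;
        # record it so it can be excluded after the next reboot, as described.
        print(f"cannot isolate page at {phys_addr:#x}: {e}")
        return False

soft_offline_page(0x12345000)  # hypothetical failing physical address
```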

Since the scheme went live, statistics collected from production networks show an average isolation success rate of 96.93%. Whereas the industry's common solution merely masks CEs, Sangfor Cloud's solution isolates CEs promptly and locates the faulty memory module after an error, giving it a leading advantage; five patents have been filed in this area. In operation the isolation scheme adds little CPU and memory overhead, and its effect is clear.

For memory UE failures, Sangfor Cloud's design goal is to make UEs recoverable and to give early warning: some UE-induced crashes are downgraded to killing only the affected application, or even to isolating just the bad pages, thereby avoiding downtime and improving system stability and reliability. This improves memory fault recovery capability by at least 30%: Sangfor Cloud's solution achieves a 60% memory UE fault recovery rate, better than publicly available industry figures (industry UE fault recovery generally covers about 50%). In actual POC tests it also outperformed common industry solutions, which crash outright, produce no memory fault alarm logs, and cannot locate the socket holding the faulty memory module.


Overall architecture of the memory UE fault isolation scheme
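To make the "downgrade instead of crash" ladder concrete, here is a sketch using the two standard Linux hwpoison sysfs nodes: soft offline migrates the page's data without harming any process, while hard offline poisons the page and SIGBUS-kills only the processes still mapping it. The decision logic is an illustrative assumption, not Sangfor Cloud's actual policy.

```python
# Both nodes are standard Linux hwpoison interfaces (root only).
BASE = "/sys/devices/system/memory/"

def retire_page(phys_addr: int, uncorrectable: bool) -> None:
    # Soft offline: migrate data first; no process is harmed (CE-prone pages).
    # Hard offline: poison the page as if a real UE hit it; processes still
    # mapping it receive SIGBUS, but the system as a whole keeps running.
    node = "hard_offline_page" if uncorrectable else "soft_offline_page"
    with open(BASE + node, "w") as f:
        f.write(f"{phys_addr:#x}")
    print("page poisoned; mapping processes get SIGBUS"
          if uncorrectable else "data migrated; page retired transparently")

retire_page(0x7f000000, uncorrectable=True)  # hypothetical UE address
```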

Third, hard disk reliability technology

Hard disks fall mainly into system disks, cache disks, and data disks. The system disk is generally an SSD storing the cloud platform system software and the host OS, along with related logs and configuration. The cache disk is also generally an SSD, exploiting SSD speed to act as a caching layer that accelerates I/O reads and writes; it stores frequently accessed business data, known as hot data. Data disks are generally mechanical HDDs, whose high capacity suits them as the final storage location of data (such as the virtual disks of virtual machines).

(1) Top hard disk failure modes and classification:

Stuck disk: disk I/O stops responding, temporarily or permanently;

Slow disk: disk I/O becomes significantly slower or intermittently stalls;

Bad sector: a logical unit (sector) of the disk is damaged;

Bad block: a physical unit (block) of the disk is damaged;

Insufficient lifespan: the mechanical disk is physically worn out, or the SSD's flash cells have reached their erase-cycle limit.

When a disk's input/output (I/O) response time grows long, or I/O gets stuck and never returns, the user's business becomes persistently slow or even hangs; a single stuck disk can interrupt an entire system's service.

As disks age, the probability of bad sectors, head degradation, and other problems rises. From the historical problem distribution, as well as the industry's disk reliability failure curve, stuck disks are becoming one of the most serious problems affecting stable system operation.


Overall architecture of the Sangfor Cloud stuck/slow disk solution

(2) Sangfor Cloud's approach to the stuck/slow disk problem:

1. Because stuck/slow disk failure modes are complex, multi-dimensional detection is used to confirm the diagnosis. Relying only on common Linux tools and information rather than vendor-specific hardware tools, the faulty disk is located accurately from multiple dimensions, including kernel log analysis, SMART data analysis, and disk I/O monitoring data analysis.

2. For the trade-off between business continuity and data safety when disposing of a stuck/slow disk, a multi-level isolation algorithm is defined (see the sketch after this list). Mildly slow disk: do not isolate; notify the user with a page alarm. Severely slow disk: favor the business; do not isolate while the peer disks are abnormal, and notify the user with a page alarm. Stuck disk: favor the business; on the first occurrence with abnormal peers, do not isolate, and notify the user with a page alarm. Stuck disk (frequent): favor the data; three abnormalities within one hour trigger permanent isolation.

3. Threshold tuning on top of the multi-level isolation algorithm. Testing against a large number of real stuck/slow disks, together with data collected from user environments, yields more accurate detection thresholds, which are then verified with a fault injection tool.
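As referenced in point 2, here is a sketch of the multi-level isolation policy. The tier names and the peer-health rule come from the text above; the function interfaces, the placeholder peer check, and the in-memory history are hypothetical.

```python
import time
from collections import defaultdict, deque

STUCK_WINDOW_S, STUCK_LIMIT = 3600, 3      # "3 abnormalities in one hour"
_stuck_history: dict[str, deque] = defaultdict(deque)

def peer_disks_healthy(disk: str) -> bool:
    # Placeholder: a real system would consult replica/peer disk state here.
    return True

def handle_fault(disk: str, level: str) -> str:
    """level is one of: 'mild_slow', 'severe_slow', 'stuck'."""
    if level == "mild_slow":
        return "alarm_only"                     # never isolate; just notify
    if level == "stuck":
        hist = _stuck_history[disk]
        now = time.time()
        hist.append(now)
        while hist and now - hist[0] > STUCK_WINDOW_S:
            hist.popleft()                      # keep only the last hour
        if len(hist) >= STUCK_LIMIT:
            return "permanent_isolation"        # frequent stuck: choose data
    # Severely slow disk, or an infrequent stuck event: choose the business
    # side, so only isolate when the peer disks can take over.
    return "isolate" if peer_disks_healthy(disk) else "alarm_only"

print(handle_fault("sdb", "stuck"))             # -> 'isolate' on the first hit
```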

With the stuck/slow disk function enabled, isolation is guaranteed to trigger within one minute, virtual machines do not undergo HA failover, and business I/O remains stable after isolation.

That concludes the main content of this live broadcast. Readers interested in cloud computing can follow the "Sangfor Technologies" public account to replay this broadcast and learn more about cloud computing.
