ECC mechanism of Convinced Cloud: Effectively reduce server downtime failures by 30%.

Server downtime is the worst nightmare of many operations engineers. A Google study showed that most crashes are caused by memory problems: every year about 1/3 of Google's servers experience correctable memory failures and about 1/100 experience uncorrectable memory failures, the latter being a typical cause of system downtime.

If someone told you that software can work around hardware memory faults and reduce server downtime by 30%, would you find the claim credible?

Today's data centers have entered the software-defined era, moving from software-defined networking (SDN) to the software-defined data center (SDDC). To guard against unexpected server downtime, more and more enterprises are turning to software-defined reliability, which uses software to shield services from hardware failures in components such as servers and memory.

So how does software improve memory and server availability?

MCA-based memory ECC technology

Memory can fail in many ways, and not every failure can be identified by the system. Some failures affect a single bit or several bits; some affect a whole memory chip, or a single row or column of storage cells on a chip. Others are firmware or memory controller failures, and still others are physical, such as aging solder joints on the module's gold-finger contacts or loose or dusty memory slots on the motherboard.

Device quality failures can only be solved through manufacturing process improvements, whereas Convinced Cloud targets bit-level faults, which can be handled at the software level. Major failures often grow out of the steady accumulation of small bit-level faults, so the goal is to nip problems in the bud: catch a small fault when it occurs and isolate it before its impact spreads.

Intel provides a mechanism called MCA (Machine Check Architecture) that can monitor for this type of error. It starts by defining error models. Errors that can be corrected automatically are called CE (Correctable Error); these are typically any single-bit error and some multi-bit errors confined to a single memory chip. Errors that cannot be automatically corrected and recovered, and that therefore lead to system downtime, are defined as UCE (Uncorrectable Error). According to statistics, CE/UCE problems account for 59% of all types of memory problems, so a mechanism that can detect and correct these faults is very valuable.

This full set of error checking and correction mechanisms is ECC (Error Checking and Correcting). ECC works in three steps. First, it identifies the fault: an active memory scanning mechanism can be set to scan for faults 24 hours a day (the schedule is adjustable). Second, it determines the fault location, using dedicated bit-level calculation and verification algorithms. Third, it tries to isolate the problematic memory space so that subsequent services do not use it again.
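To make the "bit-level calculation and verification" step concrete, here is a minimal sketch of a textbook SECDED code (Hamming(7,4) plus an overall parity bit). It is only an illustration: the article does not say which code the hardware or Convinced Cloud uses, real memory ECC protects much wider words, and the function names `encode` and `decode` are hypothetical. The sketch shows how a single flipped bit can be located and corrected (a CE), while two flipped bits can be detected but not corrected (a UCE).

```python
def encode(data4):
    """Encode 4 data bits into an 8-bit SECDED codeword.

    Layout (1-based positions 1..7): p1 p2 d1 p4 d2 d3 d4, followed by
    one overall parity bit. Illustrative toy code, not a real DIMM layout.
    """
    d1, d2, d3, d4 = data4
    p1 = d1 ^ d2 ^ d4   # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4   # covers positions 4, 5, 6, 7
    word = [p1, p2, d1, p4, d2, d3, d4]
    overall = 0
    for bit in word:
        overall ^= bit
    return word + [overall]


def decode(word8):
    """Return (status, data4); status is 'OK', 'CE', or 'UCE'."""
    word = list(word8[:7])

    # Syndrome: XOR of the 1-based positions of all set bits.
    # It is 0 for a valid codeword and equals the position of a single flip.
    syndrome = 0
    for pos, bit in enumerate(word, start=1):
        if bit:
            syndrome ^= pos

    # Overall parity across all 8 bits is 0 unless an odd number flipped.
    parity = 0
    for bit in word8:
        parity ^= bit

    if syndrome == 0 and parity == 0:
        status = "OK"
    elif parity == 1:
        # Odd number of flips: treat as a single-bit error and correct it.
        # syndrome == 0 here means only the overall parity bit was hit.
        if syndrome != 0:
            word[syndrome - 1] ^= 1
        status = "CE"
    else:
        # Non-zero syndrome with even parity: two bits flipped, location unknown.
        return "UCE", None

    return status, [word[2], word[4], word[5], word[6]]
```

Encoding `[1, 0, 1, 1]` and flipping any one bit of the codeword still decodes to the original data with status `CE`; flipping two bits yields `UCE` with no recovered data.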

Convinced Cloud's memory ECC enhancement technology

Mainstream IT service providers generally use Intel's MCA mechanism for memory error handling, but their software implementations differ in how refined they are. Some providers merely suppress CE errors or raise an alarm without further processing, and some cannot accurately locate the slot where the problem occurred even when an alarm is raised. Convinced Cloud proposes a risk zone mechanism: once a memory error occurs, the affected unit is placed in a "buffer" for observation, and when its CE errors reach a certain threshold, the risky memory area is automatically isolated at once, so that the error cannot keep spreading and cause serious downtime.
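The risk zone idea can be sketched as a per-page CE counter with an isolation threshold. This is a hedged illustration only: the class name `RiskZone`, the page-level granularity, and the threshold value of 3 are assumptions made for the example, not details published by Convinced Cloud.

```python
from collections import defaultdict


class RiskZone:
    """Observe pages that report CEs and isolate them at a threshold.

    Hypothetical sketch: the threshold and granularity are assumed values.
    """

    def __init__(self, ce_threshold=3):
        self.ce_threshold = ce_threshold
        self.ce_counts = defaultdict(int)  # pages under observation
        self.isolated = set()              # pages taken out of service

    def report_ce(self, page):
        """Record one correctable error; return the page's new state."""
        if page in self.isolated:
            return "isolated"
        self.ce_counts[page] += 1
        if self.ce_counts[page] >= self.ce_threshold:
            # Threshold reached: move the page from observation to isolation.
            self.isolated.add(page)
            del self.ce_counts[page]
            return "isolated"
        return "observing"
```

With the default threshold, the first two calls to `report_ce` for a given page return `"observing"` and the third returns `"isolated"`; other pages are tracked independently.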

In recent years, Convinced Cloud has continuously optimized its memory isolation and recovery mechanism, and the ECC mechanism was enhanced in the hyper-converged release HCI6.7.0, launched in January 2022.

The enhancement mechanism works as follows. The CE Record option is first enabled in the BIOS so that the hardware recognizes memory errors; once a CE/UCE error is found, the hardware reports it to the Convinced Cloud software. The software then takes over. The operating system first determines whether the affected memory is in use by software (application software or the operating system itself). If it is not in use, it is isolated immediately and never allocated again.

If the memory is in use, the mechanism obtains the context of the software using it and distinguishes whether it belongs to the operating system kernel (in_kernel) or to a user application (in_user).

■ If the memory is used by application software (in_user) and the error is a correctable CE, Convinced Cloud's memory ECC enhancement mechanism replaces the faulty memory area with a healthy one, and the service is completely unaffected during the process. If the error is an uncorrectable UCE, the mechanism restarts the process, frees the faulty memory area, and isolates it so that it is no longer used. After the restart, the process runs on fully healthy memory.

■ If the memory is used by the operating system kernel (in_kernel), Convinced Cloud's memory ECC enhancement mechanism records information about the faulty memory area, and the next time the system starts, the mechanism isolates that memory to ensure it is not used again.
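The branching described above can be summarized as a small decision function. This is a sketch of the article's description only, not Convinced Cloud's actual code: the enum names, the function name `plan_recovery`, and the action strings are all hypothetical labels chosen for the example.

```python
from enum import Enum


class ErrorType(Enum):
    CE = "correctable"
    UCE = "uncorrectable"


class Owner(Enum):
    UNUSED = "unused"
    IN_USER = "in_user"      # used by a user application
    IN_KERNEL = "in_kernel"  # used by the operating system kernel


def plan_recovery(error, owner):
    """Return the ordered list of actions for a reported memory error.

    Action names are illustrative labels, not real kernel or product APIs.
    """
    if owner is Owner.UNUSED:
        # Nobody is using the memory: take it out of service right away.
        return ["isolate_now"]

    if owner is Owner.IN_USER:
        if error is ErrorType.CE:
            # Data is still correct: swap in healthy memory transparently.
            return ["replace_with_healthy_memory"]
        # UCE: the data cannot be trusted, so the process is restarted.
        return ["restart_process", "free_faulty_memory", "isolate_now"]

    # IN_KERNEL: record the faulty area and isolate it at the next startup.
    return ["record_faulty_area", "isolate_on_next_boot"]
```

For example, `plan_recovery(ErrorType.UCE, Owner.IN_USER)` yields the restart, free, and isolate sequence, while any error in kernel-owned memory is deferred to the next boot.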

(Principle of Convinced Cloud's automatic ECC error correction mechanism)

After introducing the mechanism described above, Convinced Cloud validated it in an environment of 1,000 hosts. The results show that the software-controlled ECC mechanism detected memory abnormalities in advance and that automatic isolation succeeded in 100% of cases, so faults could be handled before they caused wider impact. Overall, server downtime was reduced by 30% compared with the original method.

Back to the question at the beginning: can software solve problems that originate at the hardware level? It can. Convinced Cloud's ECC mechanism uses software to handle server memory failures more accurately and intelligently, effectively improving the reliability of the IT system.

That concludes this issue of the Convinced Cloud Blackboard Newspaper on software-defined reliability and the ECC mechanism. Follow the "DeepLy Believe in Service Technology" WeChat public account for more in-depth technical content.

Leifeng Network