Huawei Cloud came under sustained attack on the eve of the Spring Festival! Plotted for 3 months, with network cuts timed for the early morning

By Yang Jing and Xiao Zhen, reporting from The Temple of Oufei

Qubits | Official account QbitAI

"Cut the network at the Shanghai and Guangzhou sites!"

"Launch the attack at dinnertime, when they're not paying attention!"

"Then send another wave at 3 a.m., when everyone is asleep!"

……

This is a "conspiracy" that HUAWEI CLOUD recently foiled, just before the Spring Festival.

This wave of attacks was not to be underestimated: had the plot succeeded, the operation of HUAWEI CLOUD's internal systems would have faced serious consequences.

More than 20 combined attacks in one month

The attacker's plan began 3 months ago.

After plotting together, they decided to "pull off a big job" around the Spring Festival, and they have launched more than 20 combined attacks over the past month.

The motivation is simple: during the Spring Festival, traffic to short-video, social media, selfie, and other apps surges. In theory, this is when most cloud service providers see their highest traffic and are most prone to failure.

If an attack succeeds, a large number of Internet services could become unstable, or worse.

This is especially true of events like the Chinese New Year's Eve red-envelope grab: 8:00 p.m. to 1:00 a.m. is the traffic peak, and users will not tolerate even one extra second of downtime.

Fortunately, the attacks did not achieve their goal.

HUAWEI CLOUD responded very quickly: it limited troubleshooting to 3 minutes and repair to 5 minutes, resolving the system failure within 8 minutes without affecting the operation of its cloud services.

It is hard not to wonder why HUAWEI CLOUD allows these attacks to be launched again and again.

After all, for HUAWEI CLOUD, these attackers are hardly "first-time offenders".

They have tried everything: from manual attacks to "automated" system-driven attacks, and from network disconnections to fault injection to a variety of the latest attack "weapons."

However, even in the face of unknown attacks, HUAWEI CLOUD can still deal with them quickly.

This holds not only for the attacks on the eve of the Spring Festival. Whatever the type of attack, the team can detect system anomalies in time, quickly locate and resolve the problem, and compress the whole process to 10 minutes.

And why?

"Special Forces" on standby

It turns out that these attackers, who had been secretly plotting for three months and have attacked HUAWEI CLOUD thousands of times, are actually a "secret team" inside HUAWEI CLOUD, known as the "Blue Army".

They constantly design new attack "ammunition" and raid HUAWEI CLOUD systems at any time.

The Red Army, the defending team, is on standby around the clock and starts repairs as soon as it detects a Blue Army attack.

The two teams do not communicate, and the Red Army never knows when an attack will be triggered.

In addition to manual attacks, the Blue Army also uses chaos engineering: a system randomly and automatically attacks the systems maintained by the Red Army. Over the past year, the total number of attacks exceeded 2,000.
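
The random, automated attacks described here can be illustrated with a toy drill. This is only a minimal sketch under stated assumptions, not HUAWEI CLOUD's tooling: the `ToyService` class, the service names, and the functions `inject_random_fault`, `detect_and_repair`, and `run_drill` are all hypothetical, and real chaos engineering injects faults into live infrastructure rather than flipping a flag.

```python
import random


class ToyService:
    """Hypothetical stand-in for a system maintained by the Red Army."""

    def __init__(self, name):
        self.name = name
        self.healthy = True


def inject_random_fault(services, rng):
    """Blue Army side: break one randomly chosen service and return it."""
    target = rng.choice(services)
    target.healthy = False
    return target


def detect_and_repair(services):
    """Red Army side: repair every unhealthy service and return their names."""
    repaired = []
    for svc in services:
        if not svc.healthy:
            svc.healthy = True
            repaired.append(svc.name)
    return repaired


def run_drill(rounds, seed=0):
    """Run `rounds` attack/repair cycles; the defender gets no advance notice."""
    rng = random.Random(seed)  # seeded so a drill can be replayed
    # Service names are made up for illustration.
    names = ("shanghai-site", "guangzhou-site", "api-gateway")
    services = [ToyService(n) for n in names]
    log = []
    for _ in range(rounds):
        target = inject_random_fault(services, rng)
        repaired = detect_and_repair(services)
        log.append((target.name, repaired))
    return services, log
```

Calling `run_drill(5)` runs five rounds. In each round one randomly chosen service is broken and the defender pass restores it, so every service ends the drill healthy.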

All of these attacks serve a single purpose: to improve the stability and emergency response capabilities of HUAWEI CLOUD systems.

Even during the Spring Festival, the maintenance and defense of the system will not stop: HUAWEI CLOUD has set up a dedicated "special operations team".

The "special operations team" numbers in the hundreds. All of its members are full-stack engineers who have dealt with countless attacks and are "battle-hardened".

From now until the Lantern Festival, the members of the "special operations team" will work in three shifts, 7×24, devoted full-time to operations and maintenance over the Spring Festival.

In this way, even if an attacker tries to "exploit a gap", it will not be easy.

But that's just the answer to the first question.

Why is HUAWEI CLOUD able to handle the entire process quickly and stably in the face of an attack?

The failure rate remains within 0.01%.

You could say that the Blue Army's attack walked straight into the "muzzle" of a Red Army that had long been prepared.

As early as three months ago, on November 5, the Red Army had begun to troubleshoot system risks and further reduce the incidence of failures through traffic forecasting.

In fact, this is no longer an operations team in the traditional sense.

Whether it is the Red Army, which eliminates risks and faults day to day and keeps the system stable, or the "special operations team" on duty during the Spring Festival, they all come from one "well-trained" team within HUAWEI CLOUD: SRE.

The concept of SRE was first defined as "operations and maintenance activities carried out with software engineering methods". At HUAWEI CLOUD it has been refined a little further into a "deterministic" methodology aimed at the goal of "high availability".

In short, it means designing for a highly available architecture from the product design stage, clearing risks dynamically, and using an intelligent operations and maintenance (O&M) platform to contain uncertain risks, so that the quality of risk control becomes deterministic.

The SRE team has independently developed an intelligent O&M platform that standardizes and automates the O&M process with a data-driven approach. Specifically, the platform not only records O&M data in real time, but also measures the quality of all aspects of the entire process, truly shortening the time for problem discovery, fault location, and repair.

Today the platform handles 16 billion monitoring metrics per hour, serves more than 10,000 O&M system users, and processes changes at a rate of two per minute, while also covering functions such as intelligent O&M and logging.

In addition to the intelligent O&M platform, the SRE team also uses traffic forecasting and related work to further improve system availability and reduce the probability of risk.

Specifically, resource usage is estimated by combining algorithmic models with metrics.
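
The article does not say which models are used, so the following is only a hypothetical sketch of the general idea of estimating resource usage from metrics. It swaps in a plain moving average as the forecast and adds a safety margin on top. The function names `forecast_next` and `instances_needed`, the 30% headroom, and the per-instance capacity are all assumptions made for illustration.

```python
import math


def forecast_next(history, window=3):
    """Forecast the next value as the mean of the last `window` observations."""
    if len(history) < window:
        raise ValueError("not enough history for the chosen window")
    recent = history[-window:]
    return sum(recent) / window


def instances_needed(forecast_rps, rps_per_instance, headroom=0.3):
    """Instances required to serve the forecast load plus a safety margin."""
    return math.ceil(forecast_rps * (1 + headroom) / rps_per_instance)


# Made-up peak requests per second observed over four earlier periods.
peak_rps_history = [900, 1000, 1100, 1200]

forecast = forecast_next(peak_rps_history)           # mean of the last three
capacity = instances_needed(forecast, rps_per_instance=200)
print(forecast, capacity)
```

A moving average lags behind a sharp surge such as a holiday peak, which is one reason a production forecaster would likely use a richer model. The headroom factor is a crude way to cover that gap in a sketch like this.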

Behind HUAWEI CLOUD there is a corps of PhDs with a dedicated algorithm innovation laboratory, whose researchers help the traffic forecasters optimize their algorithms. One recent lab paper, on solving virtual machine scheduling problems with reinforcement learning, has been accepted by the top journal Pattern Recognition.

At the same time, with the help of the cloud operating system, global scheduling, and other technologies, limited traffic resources are "squeezed" and allocated efficiently. This includes the "Yaoguang" smart cloud brain, which is responsible for the allocation, deployment, mobilization, and supply of resources across the entire cloud and which, combined with global scheduling and other technologies, further improves resource utilization.

At present, the failure rate of HUAWEI CLOUD systems has been held below 0.01%, which means that total failure time in a year stays within 53 minutes.
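
The 53-minute figure follows directly from the 0.01% rate, as the short calculation below shows. It assumes a 365-day year and that the failure rate is measured as a fraction of total time. The function name `downtime_budget_minutes` is made up for this example.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_budget_minutes(failure_rate):
    """Minutes of failure per year allowed by a given failure rate (a fraction)."""
    return MINUTES_PER_YEAR * failure_rate


budget = downtime_budget_minutes(0.0001)  # 0.01% written as a fraction
print(round(budget, 2))  # 52.56, i.e. within 53 minutes
```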

The Spring Festival Defense War in the Digital World

In fact, HUAWEI CLOUD has committed nearly 1,000 people to this year's Spring Festival defense war.

Among them, the entire SRE team of several hundred people is in a "fully online" state of readiness.

To some extent, like employees in traditional industries, they are the Spring Festival guards who ensure the convenience of our lives.

Only the dimension has changed, from the offline physical world to the online digital world.

Zhang Zhi, who has worked in the operations and maintenance industry for more than 20 years, believes that the flavor of the Spring Festival has not actually changed; the celebration has simply moved to a different place.

△ Zhang Zhi, SRE expert of HUAWEI CLOUD

In the past, the Spring Festival took place mainly in the physical world, but now it may be livelier in the digital world than in the physical one. In the digital world I can still spend the Spring Festival with my friends, grab red envelopes, and scroll through videos.

He has seen many disasters befall his peers and believes that this duty is indispensable:

You don't know when the risk will occur. But SRE can really reduce the likelihood of encountering risks.

Shi Shengbing, who transferred to SRE from another post, jokes about how peculiar this role is during the Spring Festival:

△ Shi Shengbing, SRE expert of HUAWEI CLOUD

SRE is the role behind the scenes at HUAWEI CLOUD. In fact, we rarely "appear" during festivals like the Spring Festival, because when we do appear, it is usually "not for anything good".

But this job made him feel a "new spring":

I've been with Huawei for twenty years, and I've been on this team for a year and a half. I thought that my last position was the last one of my career, but now I feel that a new spring has arrived.

On the one hand, this is reflected in SRE itself, which is the youngest team at HUAWEI CLOUD.

On the other hand, with the rapid growth of the industry, the young SRE team is becoming the backbone of cloud service quality assurance.

In fact, this kind of protection for digital life is not an isolated case.

Electronic bus cards for the daily commute, one-tap ride hailing, digital payment at restaurants, online appointments when sick, online shopping, and online gaming get-togethers: looking back, we have long been inseparable from digital life.

Taking a longer view, from the early "smart earth", to the "true Internet" brought about by the development of AI, to today's "metaverse", the industry's buzzwords have always been closely tied to the digital world.

On the technology side, the boom in "digital humans" in recent years and the renewed attention to XR devices amid the development of AI also show that our lives are indeed merging with the digital world without our noticing.

In the digital world, cloud services have gone from an emerging technology to an indispensable infrastructure.

In other words, all of our Internet services and digital products are ultimately carried by the cloud and run on the cloud, and even when we ourselves become part of the digital virtual world, we are loaded onto the cloud.

The traditional physical world of water, electricity, bridges, and houses, once reproduced in the digital world, is nothing more than data stored in the cloud.

Under this trend, the stability of cloud services is as important to the digital world as the stability of infrastructure, and all the more so during the Spring Festival.

In today's special period, we are more dependent on the protection of the digital Spring Festival than ever before.

HUAWEI CLOUD's disclosure of its attack-and-defense drills and red-blue confrontation is not only a sharing of advanced experience and mechanisms, but also a reminder to pay attention to the "infrastructure of the digital world" on which we increasingly depend.
