In times of crisis, how can health codes avoid "collapse"?

Pictured: Leaf reflects orange

Source: 21tech

Author: Poplar

Editor: Li Qingyu

On the morning of January 10, Guangdong Province's Yuekang Code suddenly suffered access failures, causing considerable inconvenience to many people heading to work.

Afterwards, Yuekang Code issued an announcement: "At 8:31 a.m. on the 10th, the platform detected an abnormal surge in Yuekang Code traffic, peaking at 1.4 million requests per minute. This exceeded the system's capacity limit and triggered its protection mechanism, causing slow or failed access for some users. The operations team responded urgently; the problem was partially relieved at 9:04, and smooth operation was fully restored at 9:56."

A week earlier, Xi'an One Code Pass had collapsed twice in a row within just half a month. On the morning of December 20, 2021, Xi'an One Code Pass suffered access failures, and repairs took the whole day; on the morning of January 4, 2022, it went down again and returned to normal at noon.

After the first failure, Xi'an officials responded that "the number of visits per second reached more than 10 times the previous peak, resulting in network congestion"; the second failure was likewise attributed to "excessive visits".

Over the past two years, as epidemic prevention and control has become routine, health codes have become a "necessity" for travel. As a result, any health code failure has a large impact on people's lives. This is especially true in Xi'an, which is at a critical moment in its fight against the epidemic and where the failures also directly hindered epidemic prevention work.

With these lessons in hand, local governments urgently need to consider how to prevent health codes from collapsing.

The trade-off between safety and efficiency

Industry expert Li Ming told 21st Century Business Herald that, from a technical point of view, health code operations support is already very mature, but in the face of sudden access pressure it is unrealistic to guarantee that problems will never occur.

Leaving aside that in theory no system is absolutely stable, the most common cause of current health code collapses, traffic overload, itself reflects a trade-off among efficiency, cost, and security.

Li Ming said that a typical system architecture has a capacity threshold, and when user traffic exceeds it, the system crashes. Many products have run into this problem. Weibo, for example, has repeatedly gone down when celebrity gossip broke and the number of users surged in an instant.
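A system does not have to crash outright when traffic crosses its threshold: a protection mechanism can reject the excess so that the requests it does admit are still served. The following is a minimal Python sketch of one common technique, a token bucket. It is not a description of how any health code platform is actually built, and the class name `TokenBucket` and its `rate` and `capacity` parameters are hypothetical.

```python
import time


class TokenBucket:
    """Admit about `rate` requests per second, with bursts up to `capacity`.

    Hypothetical illustration of a "system protection mechanism";
    not taken from any real health code system.
    """

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum tokens the bucket can hold
        self.clock = clock        # injectable clock, useful for testing
        self.tokens = capacity
        self.last = clock()

    def allow(self):
        """Return True if the request is admitted, False if it is shed."""
        now = self.clock()
        # Refill in proportion to the time elapsed, never above capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        # Over the threshold: reject this request instead of letting
        # the backend be overwhelmed.
        return False
```

Shedding load this way degrades service for some users during a surge, but it keeps the system itself standing, which is what makes a fast recovery possible.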

Many people will ask: why not simply raise the threshold to avoid overload? This is a matter of balancing access efficiency against cost. Raising the threshold requires more servers, which means higher cost.

Li Ming said that, taking the health code as an example, supporting access by tens of millions of people costs about ten million. If a city's health code sees a daily concurrency peak of only 100,000 but servers are provisioned for a concurrency of 500,000, resources are wasted.
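The waste in this example can be put in numbers. The short Python sketch below uses only the two figures quoted above (a daily peak of 100,000 and provisioned capacity of 500,000); the function name `utilization` is hypothetical.

```python
def utilization(peak_concurrency, provisioned_concurrency):
    """Fraction of provisioned capacity actually used at the daily peak."""
    if provisioned_concurrency <= 0:
        raise ValueError("provisioned_concurrency must be positive")
    return peak_concurrency / provisioned_concurrency


peak = 100_000         # daily concurrency peak, from the article
provisioned = 500_000  # provisioned concurrency, from the article

used = utilization(peak, provisioned)
idle = 1 - used
print(f"used at peak: {used:.0%}, idle: {idle:.0%}")
# prints: used at peak: 20%, idle: 80%
```

Even at the busiest moment of an ordinary day, four fifths of the capacity would sit idle, which is why cities size for daily use rather than for the worst case.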

In practice, therefore, each city sets a reasonable threshold for its health code based on local conditions, one that at least covers daily use. This leaves a hidden risk: when local residents use the health code on a large scale at the same time, traffic can exceed the threshold.

Although this is a low-probability event, it also means that a health code collapse caused by a sudden traffic surge is to be expected. The real problem to solve is therefore whether the system can recover quickly after it is overloaded.

Li Ming said that, from a technical point of view, rapid recovery is achievable. He cited several design principles:

First, the system architecture must be designed for extreme situations. For a city with a population of 10 million, for example, the design must at least account for the extreme load that 10 million people could generate.

Second, the architecture needs to scale elastically. For example, 100 servers may be enough in normal times, but when 100 servers are not enough in special circumstances, the system must support rapid elastic expansion. This is a core capability of cloud computing and is already very mature.
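The scaling decision itself can be very simple. The Python sketch below shows one possible rule, assuming each server handles a fixed number of concurrent requests. Only the baseline of 100 servers comes from the example above; the per-server capacity, the upper limit, the 20% headroom, and the function name `desired_servers` are hypothetical.

```python
def desired_servers(current_load, per_server_capacity,
                    min_servers=100, max_servers=1000,
                    headroom_percent=20):
    """Return how many servers to run for the observed load.

    current_load:        concurrent requests observed right now
    per_server_capacity: concurrent requests one server can handle (assumed)
    headroom_percent:    spare capacity kept on top of the observed load
    """
    if per_server_capacity <= 0:
        raise ValueError("per_server_capacity must be positive")
    # Ceiling division in integer arithmetic to avoid float rounding.
    numerator = current_load * (100 + headroom_percent)
    denominator = per_server_capacity * 100
    needed = -(-numerator // denominator)
    # Never drop below the everyday baseline or exceed the budgeted cap.
    return max(min_servers, min(max_servers, needed))


print(desired_servers(50_000, 1_000))     # normal day: stays at the baseline
print(desired_servers(400_000, 1_000))    # surge: scales out
print(desired_servers(2_000_000, 1_000))  # extreme surge: hits the cap
```

The cap matters: once it is reached, elastic scaling alone can no longer absorb the load, and other safeguards have to take over.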

In addition, the system needs disaster recovery. This ensures that at least one backup system can be brought online promptly if user access overloads the system and auto scaling fails to work.

Further measures include decoupling the system's processes, dividing the business flow into separate layers so that traffic does not pour into one place, and using a distributed architecture to split traffic across different processing areas. These can effectively keep the entire system from collapsing.
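Splitting traffic into processing areas is often done by hashing a stable key. The Python sketch below is one way this could look, assuming requests carry a user identifier; the function name `shard_for`, the identifier format, and the shard count are all hypothetical.

```python
import hashlib


def shard_for(user_id, num_shards):
    """Map a user to one of `num_shards` processing areas.

    A cryptographic hash is used only because it is stable across
    processes and spreads keys evenly; Python's built-in hash() is
    randomized per process for strings, so it is unsuitable here.
    """
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# The same user always lands in the same area, so a failure in one
# area affects only a fraction of users rather than the whole system.
print(shard_for("user-12345", 4) == shard_for("user-12345", 4))  # True
```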

The warning from Xi'an One Code Pass

A health code running into problems is therefore not alarming in itself; as long as it recovers quickly, it can basically meet the public's needs. Xi'an One Code Pass drew so much attention this time because, on the one hand, the first outage lasted a full day and, on the other, two failures occurred in quick succession.

Going by the measures described above, when Xi'an One Code Pass first hit traffic overload, the normal response would have been auto scaling. Instead the system crashed, which suggests that load balancing and auto scaling were not fully considered when the architecture was first designed.

According to Titanium Media App, citing a person close to the Xi'an "One Code Pass" project, the general cause of the failure is now basically clear: overloaded traffic and an architecture poorly prepared for high concurrency led to a systemic failure, which ultimately left the firewall blocking traffic so that data could not be returned.

Li Ming told reporters that a firewall also has a throughput limit; if traffic exceeds it, the firewall stops responding. Normally, firewalls should also be load balanced, so that when one firewall cannot cope, others take over part of its traffic.
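The idea of firewalls sharing traffic can be sketched in a few lines of Python. This is a simplified model, not a real firewall API: the `Firewall` class, its `limit` and `load` fields, and the "most spare capacity first" policy are hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Firewall:
    name: str
    limit: int     # throughput limit (hypothetical units)
    load: int = 0  # traffic currently being handled


def pick_firewall(firewalls):
    """Return the firewall with the most spare capacity, or None if all are full."""
    candidates = [fw for fw in firewalls if fw.load < fw.limit]
    if not candidates:
        return None
    return max(candidates, key=lambda fw: fw.limit - fw.load)


def admit(firewalls):
    """Route one unit of traffic; return False when every firewall is saturated."""
    fw = pick_firewall(firewalls)
    if fw is None:
        return False
    fw.load += 1
    return True


pool = [Firewall("fw-a", limit=2), Firewall("fw-b", limit=2)]
print([admit(pool) for _ in range(5)])
# prints: [True, True, True, True, False]
```

With a single firewall in the pool, the third request would already have been refused; the second firewall is what doubles the throughput the system can sustain.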

Behind this, of course, lies the question of public versus private clouds. Li Ming said that the government market as a whole currently tends to use private clouds. In a private cloud architecture, a standby firewall for disaster recovery has to be purchased even if it is rarely used, but this is often a cost that cannot be cut.

An architecture that cannot expand quickly also left Xi'an One Code Pass's processing capacity rigid under high-concurrency traffic. As for why this happened, Li Ming believes the system's design did not fully account for the situations that could arise. Moreover, with a city-wide epidemic prevention and control effort already under way, systematic stress tests and drills should have been carried out in advance.

In addition, the subcontracting of health code construction has raised problems in some areas. In general, a general contractor (EPC) should not subcontract the core modules.

The core modules here are the key engines of the health code, such as access, generation, and verification. "Of course, for simpler business modules subcontracting is not a problem, and almost all health code projects involve some subcontracting," Li Ming said.

For health code projects, Li Ming believes the first step is to define the project's boundaries, such as how much concurrency it must carry and how long a processing delay can be tolerated in unexpected situations;

Second, there must be a strict review mechanism. The entire system architecture should be carefully reviewed, and reviewers should bear lifelong responsibility, rather than treating the review as a formality that leaves no one accountable when problems arise;

Third, during bidding, the bidder's capability must be verified, based not only on what it promises but also on the projects it has actually delivered;

Finally, there should be an early warning mechanism: when traffic reaches a certain proportion of the peak, the contingency plan should be activated in advance. Likewise, if a local outbreak occurs somewhere, measures should be taken ahead of time.
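The early warning rule in the last point is straightforward to automate. The Python sketch below is one possible form, assuming traffic is compared with a known peak; the 70% and 90% trigger levels, the level names, and the function name `alert_level` are hypothetical and would have to be set per project.

```python
def alert_level(current_traffic, peak_traffic,
                warn_percent=70, act_percent=90):
    """Classify current traffic relative to the known peak.

    Returns "normal", "warn", or "activate-plan". Integer percentages
    are used so the comparisons are exact.
    """
    if peak_traffic <= 0:
        raise ValueError("peak_traffic must be positive")
    scaled = current_traffic * 100
    if scaled >= peak_traffic * act_percent:
        return "activate-plan"  # start the contingency plan in advance
    if scaled >= peak_traffic * warn_percent:
        return "warn"           # notify the operations team
    return "normal"


print(alert_level(50_000, 100_000))  # normal
print(alert_level(75_000, 100_000))  # warn
print(alert_level(95_000, 100_000))  # activate-plan
```

A non-traffic trigger, such as the report of a local outbreak, could feed the same plan directly without waiting for traffic to rise.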

The huge public reaction to the two collapses of Xi'an One Code Pass has sounded the alarm for other local governments. At a recent epidemic prevention and control work conference, for example, Beijing called for making good use of "Beijing Health Treasure" and for strengthening stress testing and system operation and maintenance to ensure normal operation.

With the epidemic now an unpredictable part of daily life, we hope health codes will not fail; and if they do, we hope they can recover as quickly as possible rather than keeping people waiting for half a day or even a whole day.

(At the request of the interviewee, Li Ming is a pseudonym)

Editor: Lu Taoran
