
Tencent Cloud's sudden failure affected 1,957 customers in 87 minutes!

Author: Drive the house

On the afternoon of April 8, Tencent Cloud suddenly experienced a service failure, which manifested itself as an interface response error, an internal service error, and a 504 error on the webpage.

A 504 error indicates a gateway timeout: a server acting as a gateway or proxy did not receive a timely response from the upstream server it forwarded the request to.
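The gateway timeout behaviour can be illustrated with a minimal sketch. It is not Tencent Cloud's gateway: `gateway`, `fast_upstream`, and `slow_upstream` are hypothetical names, and the upstream is simulated with a plain function rather than a real network call.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


def gateway(upstream, timeout_s):
    """Forward a request to `upstream` and return (status, body).

    If the upstream does not answer within `timeout_s` seconds,
    the gateway gives up and answers 504 itself.
    """
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(upstream)
    try:
        return 200, future.result(timeout=timeout_s)
    except FutureTimeout:
        return 504, "Gateway Timeout"
    finally:
        # Do not block the caller while the slow upstream finishes.
        pool.shutdown(wait=False)


def fast_upstream():
    # Hypothetical healthy backend.
    return "ok"


def slow_upstream():
    # Hypothetical overloaded backend that answers too late.
    time.sleep(0.5)
    return "too late"


if __name__ == "__main__":
    print(gateway(fast_upstream, timeout_s=0.2))
    print(gateway(slow_upstream, timeout_s=0.05))
```

The point of the sketch is that a 504 is produced by the intermediary, not by the backend: the backend may still be working, just too slowly for the gateway's deadline.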

That evening, Tencent Cloud responded that services related to its official website console were abnormal and under urgent repair; some regions had already been restored, and repairs were continuing in the others.


Tencent Cloud has now officially published a review explaining the failure.

According to Tencent Cloud, at 15:23 on April 8 its team received an alarm that the cloud API service was in an abnormal state. At the same time, a large number of customers began reporting through Tencent Cloud work orders, after-sales service groups, Weibo and other channels that they could not log in to the Tencent Cloud console.

Investigation found that customers could not log in to the console because the cloud API was abnormal.

Cloud API is a unified collection of open interfaces on the cloud, through which customers can programmatically manage and control cloud resources, and the cloud console provides interactive web page functions by combining cloud APIs.

After the fault occurred, some public cloud services that rely on cloud APIs to provide their product capabilities also became unavailable because of the cloud API exception, such as cloud functions, text recognition, microservice platforms, audio content security, and verification codes.

The outage lasted about 87 minutes, during which a total of 1,957 customers reported the failure.


From the customer's point of view, cloud services can be roughly divided into the data plane and the control plane, where the data plane carries the customer's own business, and the control plane is responsible for operating different products on the cloud.

The console and cloud API failure this time affected the control plane.


Put simply, if a cloud service is compared to a hotel, the console is the hotel's front desk: when it fails, management functions such as check-in and extending a stay become unavailable, but rooms that are already occupied are not affected.

In this failure, IaaS resources such as servers that customers had already configured, including services that were already deployed and running, were not affected by the cloud API exception. Other PaaS and SaaS services that are not delivered through cloud APIs also remained available as normal.
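The control plane and data plane split described above can be shown with a toy model. This is only an illustrative sketch: `ToyCloud`, `ControlPlaneUnavailable`, and the server name are hypothetical and do not correspond to any Tencent Cloud API.

```python
class ControlPlaneUnavailable(Exception):
    """Raised when a management operation is attempted while the cloud API is down."""


class ToyCloud:
    def __init__(self):
        self.control_plane_up = True
        self.servers = set()

    # Control plane: management operations go through the cloud API.
    def create_server(self, name):
        if not self.control_plane_up:
            raise ControlPlaneUnavailable("cloud API is down")
        self.servers.add(name)

    # Data plane: a server that is already running answers requests
    # without going through the cloud API at all.
    def handle_request(self, name):
        if name not in self.servers:
            raise KeyError(name)
        return f"{name}: served"


if __name__ == "__main__":
    cloud = ToyCloud()
    cloud.create_server("web-1")    # guest checks in while the front desk is open
    cloud.control_plane_up = False  # front desk closes
    print(cloud.handle_request("web-1"))  # the occupied room still works
    try:
        cloud.create_server("web-2")      # new check-ins fail
    except ControlPlaneUnavailable as exc:
        print("cannot create server:", exc)
```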


The April 8 trend chart of inbound and outbound traffic across all Tencent Cloud products shows no impact.

However, products that deliver their services through APIs were affected to varying degrees. For example, calls to Tencent Cloud's storage service dropped significantly that day.


Tencent Cloud's trend chart of storage service calls for April 8 shows a significant fluctuation in call volume.

The troubleshooting process is as follows:

At 15:23, the fault was detected; service recovery began immediately while the cause was investigated.

At 15:47, it was found that rolling back the version could not fully restore the service, and the problem was investigated further.

At 15:57, the root cause was identified as an error in the configuration data, and a data repair plan was urgently designed.

At 16:02, data repair was carried out in all regions, and API services were restored region by region.

At 16:05, API services in all regions except Shanghai were observed to have recovered, and the recovery problem in the Shanghai region was investigated further.

At 16:25, the technical components in the Shanghai region were found to have an API circular dependency problem, and the team decided to restore them by routing traffic to other regions.

At 16:45, the Shanghai region was observed to have recovered. At this point the API and the PaaS services that depend on it were fully restored, but console traffic surged, and capacity was expanded ninefold.

At 16:50, request volume gradually returned to normal, the business ran stably, and all console services were restored.

At 17:45, after about one more hour of observation with no problems found, the incident handling was closed according to plan.
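The 87-minute figure can be checked against the timeline above. The sketch below uses only the timestamps quoted in the article; the helper name `minutes_between` is an assumption made for illustration.

```python
from datetime import datetime


def minutes_between(start, end):
    """Whole minutes between two same-day "HH:MM" timestamps."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


if __name__ == "__main__":
    # Fault detected -> all console services restored.
    print(minutes_between("15:23", "16:50"))  # 87
    # Fault detected -> root cause identified.
    print(minutes_between("15:23", "15:57"))  # 34
    # Follow-up observation window.
    print(minutes_between("16:50", "17:45"))  # 55
```

On these timestamps, the outage window from detection to full console recovery is 87 minutes, and the root cause was identified 34 minutes into it.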

According to Tencent Cloud, the failure was caused by insufficient consideration of forward compatibility in the new version of the cloud API service, together with an inadequate grayscale (gradual rollout) mechanism for configuration data.

In this API upgrade, the interface protocol changed in the new version. After the new backend version was released, it processed data from the old-version front end incorrectly and generated a piece of erroneous configuration data. Because the grayscale mechanism was inadequate, the abnormal data spread quickly across the entire network, making the API as a whole unusable.
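A grayscale mechanism for configuration data generally means pushing a change to a small batch first and widening only if that batch stays healthy. The sketch below illustrates the general idea only; it is not Tencent Cloud's mechanism, and `staged_rollout`, the batch sizes, and the region names are all hypothetical.

```python
def staged_rollout(regions, apply_config, is_healthy, batch_sizes=(1, 2)):
    """Apply a config change in growing batches and stop at the first bad batch.

    Returns (updated_regions, unhealthy_regions). Once `batch_sizes`
    is exhausted, all remaining regions are treated as the final batch.
    """
    updated = []
    remaining = list(regions)
    sizes = list(batch_sizes)
    while remaining:
        size = sizes.pop(0) if sizes else len(remaining)
        batch, remaining = remaining[:size], remaining[size:]
        for region in batch:
            apply_config(region)
            updated.append(region)
        unhealthy = [r for r in batch if not is_healthy(r)]
        if unhealthy:
            # Halt here so the bad data never reaches the remaining regions.
            return updated, unhealthy
    return updated, []


if __name__ == "__main__":
    regions = ["region-a", "region-b", "region-c", "region-d", "region-e"]
    applied = []
    # Simulate a bad config: every region that receives it becomes unhealthy.
    updated, unhealthy = staged_rollout(regions, applied.append, lambda r: False)
    print(updated, unhealthy)
```

With a health check between batches, a bad configuration in this sketch reaches only the first region instead of all five.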

After the fault occurred, the service backend and the configuration data were rolled back to the old version together, following the standard rollback plan, and the API backend service was restarted. However, the container platform that hosts the API service itself relies on the API service for its scheduling capability. This circular dependency meant the service could not be brought back up automatically.

O&M staff restarted the API service manually, which completed the fault recovery.
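Circular dependencies of this kind can be found ahead of time by searching a service dependency graph for cycles. The following is a minimal sketch using depth-first search; the graph contents and the names `find_cycle`, `api_service`, and `container_platform` are illustrative assumptions, not Tencent Cloud's actual topology or tooling.

```python
def find_cycle(deps):
    """Return one dependency cycle as a list of nodes, or None if there is none.

    `deps` maps each service to the services it depends on.
    """
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    color = {}
    path = []

    def visit(node):
        color[node] = GRAY
        path.append(node)
        for nxt in deps.get(node, ()):
            state = color.get(nxt, WHITE)
            if state == GRAY:
                # `nxt` is already on the current path: a cycle closes here.
                return path[path.index(nxt):] + [nxt]
            if state == WHITE:
                cycle = visit(nxt)
                if cycle:
                    return cycle
        path.pop()
        color[node] = BLACK
        return None

    for node in list(deps):
        if color.get(node, WHITE) == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


if __name__ == "__main__":
    deps = {
        "console": ["api_service"],
        "api_service": ["container_platform"],  # hosted on the platform
        "container_platform": ["api_service"],  # needs the API for scheduling
    }
    print(find_cycle(deps))
```

A check like this could run when dependencies change, so that a restart path which depends on the very service being restarted is flagged before an incident rather than during one.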

In recent years, various cloud services in China have failed many times:

Alipay crashed on April 9, 2024, Tencent Video crashed on December 3, 2023, Didi crashed on November 27, 2023, Alibaba Cloud and Alibaba-based services collectively crashed on November 12, 2023, and Bilibili crashed on March 5, 2023...
