
These ten incidents turned "never goes down" into a joke

Compiled by | Tina

A look back at the Internet companies that "crashed" this year.

By 2022, Internet technology has in theory made "never goes down" achievable. Yet throughout 2021, outages showed no sign of becoming any rarer.

As more applications reach national scale, people depend on technology more than ever, and face correspondingly greater risk. An outage affects not only internal users but also the revenue, reputation, and productivity of customers and partners.

Outages are unpredictable, which is why they are called the "black swans" of software systems. The architecture of large Internet systems keeps growing more complex, and with it the risk to stability; some black swans are surely still lurking in these systems, undiscovered. But Murphy's Law tells us that anything that can go wrong eventually will. We have collected ten major outages from 2021 and summarized their root causes. Most of these failures were man-made, and they point to the areas that still deserve special attention when building systems.

1

Domestic outages: explaining the cause of a failure clearly is a skill in itself

Bilibili crashed, and young people lost any desire to sleep

On the evening of July 13, the video site Bilibili ("Station B") suffered a server outage. Users who could not log in flooded onto other sites, triggering a chain of further outages. "Bilibili crashed", "Douban crashed", "AcFun crashed too", "Jinjiang crashed" and more all shot up the trending searches.

At the time, Bilibili reported 223 million monthly active users, more than 86% of them aged 35 or under. These young users were evidently good at staying up late: although the outage happened in the middle of the night, they noisily speculated about the cause and even alarmed the fire department. When some netizens claimed that "Bilibili crashed because of a fire", the Shanghai fire department replied: "As far as we know, there is no fire at the Bilibili headquarters in the Guozheng Center, No. 485 Zhengli Road, Shanghai, and no related alarms have been received. For the specific situation, please refer to the company's own announcements."

After 2 a.m., Bilibili finally posted a very brief explanation: "Some of our server rooms failed, making the site inaccessible."

It was an explanation that managed to say everything and nothing at the same time.

Futu Securities went down, and its founder published a hardcore 2,000-word post explaining the technical failure

In the early morning of October 9, the app of online brokerage Futu Securities crashed, and users could not log in to trade. That afternoon, Futu issued a statement and an apology, attributing the incident to "a multi-datacenter network failure caused by a momentary power interruption in the operator's server room". The company said it had contacted the operator for repairs immediately and restored core services within two hours.

The outage initially drew little attention outside the securities industry, but a post by Futu founder Li Hua (known online as Ye Zige) changed that. At noon on the 11th, Li Hua, who comes from a technical background, published a 2,000-word apology to users, most of which explained, from a technical point of view, why the system had gone "down".

Although, as with Bilibili, the root cause was a server-room failure, Li Hua walked everyone through the details of the disaster-recovery design.


Li Hua said that Futu's trading system uses two-way or multi-way redundancy at every stage, from market data to trade execution, and from servers to trading gateways to network transmission, with the design varying by subsystem. Take market data as an example: it is mostly one-way transmission and not especially latency-sensitive, so Futu long ago built a multi-region, multi-IDC disaster-recovery design. US stock data in particular involves transoceanic transmission, so to avoid interruption Futu buys feeds from two of the world's top market-data suppliers, with multi-point access from both the United States and Hong Kong; should all of those fail, Futu retains the ability to transmit directly from its own US IDC. Setting aside all the other redundancy, the redundant market-data sources alone add more than HK$10 million to Futu's costs each year.
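Conceptually, redundant market-data sourcing is ordered failover across independent feeds. Here is a minimal sketch of that idea in Python; the endpoints and symbol are hypothetical placeholders, not Futu's actual suppliers or APIs.

```python
# Minimal sketch of multi-source market-data failover, in the spirit of the
# design Li Hua describes. All endpoints below are hypothetical placeholders.
import urllib.request
from urllib.error import URLError

# Ordered by preference: two independent suppliers, then a self-hosted fallback.
FEED_ENDPOINTS = [
    "https://feed-a.example.com/quotes/AAPL",  # supplier A (hypothetical)
    "https://feed-b.example.com/quotes/AAPL",  # supplier B (hypothetical)
    "https://us-idc.example.com/quotes/AAPL",  # own US IDC direct feed (hypothetical)
]

def fetch_quote(timeout: float = 2.0) -> bytes:
    """Try each redundant feed in turn; fail over on error or timeout."""
    last_error = None
    for url in FEED_ENDPOINTS:
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.read()
        except (URLError, TimeoutError) as exc:
            last_error = exc  # remember the failure and try the next source
    raise RuntimeError(f"all market-data sources failed: {last_error}")
```

A production system would run the sources concurrently and compare freshness rather than trying them serially, but the ordered-fallback idea is the same.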

Li Hua pointed out that a multi-way redundant trading system with real-time hot standby faces a choice between two designs. One is cross-IDC multiple redundancy: worse trading performance and higher order latency, but better disaster recovery. The other is multiple redundancy within a single IDC: better trading performance and lower order-submission latency, but the IDC itself becomes a single point of failure. A trade-off has to be made. In Li Hua's view, given the construction standards IDCs are held to, large-scale IDC accidents are rare, especially power failures. After weighing the options, Futu chose the second, higher-performance design, which left the single-IDC point of failure as a latent risk. This incident was exactly that risk materializing: the problem lay with the IDC, in the power system that was never supposed to fail, and neither the uninterruptible power supply nor the diesel generators did their job.

Li Hua's hardcore post drew support and encouragement from many Futu Securities users.

Xi'an "one code pass" collapsed twice in half a month

On December 20, 2021, Xi'an's "One Code Pass" health-code system crashed under excessive traffic. The Xi'an Big Data Resources Administration Bureau said at the time that "One Code Pass" had 46.952 million registered users and an average of more than 8 million code scans per day. With stepped-up code checks in public places and multiple rounds of citywide nucleic acid testing running simultaneously, requests per second had reached more than ten times the previous peak, and the bureau advised the public not to open or display their codes unless necessary.

At 9:00 a.m. on January 4, 2022, Xi'an's "One Code Pass" crashed for a second time. The city had just started a new round of nucleic acid screening, and many netizens in Xi'an reported that the system was down again and would not display the epidemic-prevention code. The topic #XianOneCodePass# briefly hit the top spot on Weibo's trending list. The relevant city departments publicly responded that heavy traffic was preventing codes from displaying normally. By that afternoon, "One Code Pass" had gradually returned to normal use.

"One Code Pass" is a big-data platform for epidemic prevention and control built under the Xi'an city government's leadership in February 2020, owned by the Xi'an Big Data Resources Administration Bureau. According to the website of the Ministry of Industry and Information Technology, on December 30-31 the ministry inspected the epidemic-prevention support work of the Shaanxi Provincial Communications Administration and asked that Xi'an's "One Code Pass" strengthen its technical improvements and expand network capacity to ensure no congestion or downtime.

As it happened, at about 8:30 a.m. on January 10, 2022, many users reported that Guangdong's "Yue Kang Code" would not open. After 10:00 a.m., the situation gradually eased. The "Yue Kang Code" app then released a notably professional official statement.


At 8:31 a.m. today (the 10th), the platform detected an abnormal surge in Yue Kang Code traffic, peaking at 1.4 million requests per minute. This exceeded the system's capacity limit and triggered its protection mechanism, leaving some users unable to access Yue Kang Code or experiencing slow responses. The operations team handled the incident urgently; the problem was partially alleviated by 9:04, and service fully returned to smooth operation by 9:56. We apologize for the inconvenience caused and thank you for your understanding!
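The statement describes a protection mechanism that sheds load once traffic passes a capacity limit. Below is a minimal sketch of such a mechanism, using the 1.4-million-requests-per-minute figure from the statement and a simple fixed window; everything else is a hypothetical simplification, not the platform's actual implementation.

```python
# Minimal load-shedding sketch in the spirit of the protection mechanism the
# statement describes. The per-minute threshold comes from the statement;
# the fixed-window design is a hypothetical simplification.
import time

class LoadShedder:
    """Reject requests once the per-minute count exceeds a capacity limit."""

    def __init__(self, limit_per_minute: int = 1_400_000):
        self.limit = limit_per_minute
        self.window_start = time.monotonic()
        self.count = 0

    def allow(self) -> bool:
        now = time.monotonic()
        if now - self.window_start >= 60:  # start a fresh one-minute window
            self.window_start = now
            self.count = 0
        self.count += 1
        return self.count <= self.limit    # shed anything over the limit

shedder = LoadShedder()
if not shedder.allow():
    print("503: system protection triggered, please retry later")
```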

2

International outages: small bugs, big trouble

Facebook's worst outage in history wiped 300 billion yuan off its market value overnight

On October 4, the US social platforms Facebook and Instagram and the messaging app WhatsApp suffered a massive outage lasting nearly seven hours, Facebook's longest since 2008.

WhatsApp and Facebook Messenger, two WeChat-like instant-messaging products, have 2 billion and 1.3 billion users worldwide respectively, and Instagram has another 1 billion, meaning the shutdown affected more than 3 billion users. During the downtime, desperate users flocked to Twitter, Discord, Signal, and Telegram, overloading those applications' servers in turn.

Facebook later published a failure report explaining that during routine maintenance, engineers issued a command intended to assess the available capacity of the global backbone, but which instead accidentally severed all backbone connections, effectively disconnecting Facebook's data centers worldwide. Once the service was down, Facebook's engineers could no longer reach the data centers through normal channels to fix the problem, stretching the failure to seven hours.

The incident reportedly wiped about $47.3 billion (roughly 304.9 billion yuan) off Facebook's market value overnight.

Roblox suffered an ultra-long outage, but says its critical workloads still won't move to the cloud

On October 28, Roblox went down for 73 hours. Roblox is currently one of the most popular online gaming platforms in the world, with more than 50 million daily active users, many of them 13 or younger. Notably, Roblox is also considered a key player in the "metaverse".

Roblox later published a very detailed failure report. In it, Roblox's engineers explained that Roblox runs its programs in its own data centers and uses the open-source Consul for service discovery and health checks across its many servers. The outage, Roblox said, stemmed mainly from enabling Consul's streaming feature in place of the long-polling mechanism: a bug in the streaming feature eventually caused performance to degrade and the system to collapse. Fifty-four hours into the outage, the root cause was identified, and disabling streaming allowed the system's services to be gradually restored.
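For context, the long-polling mechanism Roblox fell back to is Consul's documented HTTP "blocking query": the client repeats a request carrying the last index it saw, and the server holds the request open until the data changes or a wait timeout expires. A minimal sketch against a local agent follows; the service name is hypothetical.

```python
# Minimal sketch of Consul's long-polling ("blocking query") mechanism, using
# Consul's documented HTTP API. The agent address and service name are assumed.
import json
import urllib.request

CONSUL = "http://127.0.0.1:8500"  # local Consul agent (assumed)
SERVICE = "example-service"       # hypothetical service name

def watch_service():
    index = "0"
    while True:
        # The server holds this request open until the service's health data
        # changes or the wait time elapses, then returns the new state.
        url = f"{CONSUL}/v1/health/service/{SERVICE}?index={index}&wait=5m"
        with urllib.request.urlopen(url) as resp:
            index = resp.headers.get("X-Consul-Index", index)
            entries = json.loads(resp.read())
            print(f"{len(entries)} instances (index={index})")
```

Streaming replaces this request-per-change pattern with a single long-lived connection, which is cheaper at scale; in Roblox's case the bug lived in that newer path, so falling back to blocking queries restored stability.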

After a disruption of this scale, it was natural for many to ask whether Roblox would consider moving to the public cloud and letting a third party manage its underlying compute, storage, and networking.

Roblox's engineers say that running their own data centers keeps costs significantly lower than using the public cloud. Owning its hardware and building its own edge infrastructure also lets Roblox minimize performance variation and manage latency for players around the world. But the company is not dogmatic about any particular approach: "We use the public cloud for the use cases that make the most sense for our players and developers, such as burst capacity, most DevOps workflows, and most of our internal analytics. But for workloads that are critical to performance and latency, we chose to build and manage our own infrastructure on-premises. This allows us to build a better platform."

A Salesforce engineer took a shortcut to fix a bug and caused a global outage

Salesforce is one of the most popular cloud software services available today, reportedly used by millions of employees across roughly 150,000 organizations worldwide. Its services span every aspect of customer relationship management, from basic contact management and product catalogs to order management, opportunity management, and sales management. Customers need not spend money and manpower maintaining, storing, and managing their records; everything lives on Salesforce.com.

On May 11, Salesforce's service became unavailable in an outage that lasted five hours. Afterwards, Salesforce held a customer briefing that fully disclosed the incident and the engineer's operating procedure. While Salesforce has always prided itself on highly automated internal processes, some links in the chain are still manual, and DNS is one of them. The configuration script the engineer used makes a change that only takes effect after a server restart; unfortunately, the script's update timed out. The update was then deployed across Salesforce's data centers, and the timeouts went off everywhere... As for the engineer, who had deliberately circumvented the established release policy and caused the accident, Salesforce said only that "we have taken appropriate action with this employee."
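The episode is often read as an argument for staged rollouts: apply a risky change to one site, verify health, then continue, rather than pushing everywhere at once. Below is a minimal sketch of that general safeguard; the site names and the apply/health functions are hypothetical placeholders, not Salesforce's actual tooling.

```python
# Minimal sketch of a staged (canary-style) rollout, the kind of safeguard the
# shortcut bypassed. Site names, apply_change, and health_ok are hypothetical.
import time

SITES = ["canary-site", "region-1", "region-2", "region-3"]

def apply_change(site: str) -> None:
    print(f"applying DNS change to {site}")  # placeholder for the real push

def health_ok(site: str) -> bool:
    return True                              # placeholder health verification

def staged_rollout():
    for site in SITES:
        apply_change(site)
        time.sleep(5)                        # let the change settle
        if not health_ok(site):
            raise RuntimeError(f"rollout halted: {site} failed health check")
    print("rollout complete")
```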

3

Cloud service providers: when something goes wrong, the "blast radius" is enormous!

Cloud computing giant OVH's data center caught fire, knocking 3.6 million websites offline

In March, a severe fire broke out at the Strasbourg, France campus of European cloud computing giant OVH, which hosts four data centers. The SBG2 data center burned to the ground, and the SBG1 building was partially damaged. Local newspapers reported that it took 115 firefighters six hours to extinguish the blaze. After burning that long, the data inside SBG2 was presumably lost beyond recovery.

The fire had a serious impact on websites across Europe: in total, as many as 3.6 million websites across 464,000 domains went offline.

Customers affected by the fire included the European Space Agency's ONDA Data and Information Access Service project, which hosts geospatial data and lets users build applications in the cloud. Facepunch Studios, the game studio behind Rust, confirmed that 25 servers burned and all their data was lost in the fire; even once the data center comes back online, nothing can be recovered. Other customers included the French government, whose data.gouv.fr site was also forced offline, the cryptocurrency exchange Deribit, and Bad Packets, an information-security threat-intelligence vendor that tracks DDoS botnets and other network abuse...

Some were simply unlucky: "No!!! No way!!! My server is on rack 70C09. I'm just an ordinary customer, and I don't have any disaster recovery plan..."

It paralyzed much of the world's Internet, so who exactly is Fastly?

On June 8, hundreds of millions of Internet users logging on to their usual websites found pages refusing to load, showing "503 Error" messages instead. Amazon, Twitter, Reddit, Twitch, HBO Max, Hulu, PayPal, and Pinterest were all hit, as were news sites including the New York Times and CNN.

About an hour later, the massive failure was traced to the CDN company Fastly. Via its official Twitter account and blog, Fastly said: "We identified a service configuration change that triggered a brief outage of our global services. The configuration has since been disabled, and our global service network has returned to normal."

Founded in 2011, Fastly is one of a handful of large CDN vendors worldwide that accelerate users' browsing experience. Interestingly, Fastly's share price rose sharply the day after the incident: the outage made investors realize just how much influence this San Francisco company of fewer than 1,000 employees has over the Internet.

Google Cloud went down globally for two hours

On November 16, foreign media reported that Google Cloud, one of the world's largest cloud service providers, suffered an outage that took down the websites of many large companies that rely on it.

The outage lasted about two hours. Home Depot, Spotify, and other companies received user reports of service interruptions, and Etsy and Snap also suffered network failures. Google's own services were hit hard too: YouTube, Gmail, and Google Search all stopped working.

The event was reportedly caused by a customer misconfiguration of Google Cloud's external load balancing (GCLB), exposed through a bug introduced six months earlier that, in rare cases, allowed corrupted configuration files to be pushed to GCLB. A Google engineer had discovered the bug on November 12, and Google planned to roll out a patch on November 15; unfortunately, the outage struck before the fix was deployed.

AWS suffered three outages in a single month

In December 2021, the last month of the year, AWS suffered three outages. The first began at 10:45 a.m. EST on the 7th and lasted until 2:22 p.m., disrupting a long list of popular websites and apps, including Disney, Netflix, Robinhood, and Roku. Amazon's own Alexa AI assistant, Kindle e-books, Amazon Music, and Ring security cameras were affected as well.

On December 10, AWS announced the cause of the outage: unexpected behavior by an internal client produced a surge of connection activity that overwhelmed the networking devices between the internal network and the main AWS network, delaying communication between them. The delays increased latency and errors for services communicating across the boundary, which triggered still more connection attempts and retries, ultimately producing persistent congestion and performance problems.
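The feedback loop AWS describes, in which failures provoke retries that deepen the congestion, is the classic retry storm, and the standard client-side defense is exponential backoff with jitter. Here is a minimal sketch, with a hypothetical call_service standing in for the real network call.

```python
# Minimal sketch of exponential backoff with full jitter, the standard defense
# against the retry storm described in AWS's report. call_service is a
# hypothetical stand-in for the real network call.
import random
import time

def call_service():
    raise ConnectionError("simulated failure")  # placeholder network call

def call_with_backoff(max_attempts: int = 5, base: float = 0.1, cap: float = 10.0):
    for attempt in range(max_attempts):
        try:
            return call_service()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise
            # Full jitter: sleep a random time up to the exponential ceiling,
            # so that many failing clients do not retry in lockstep.
            delay = random.uniform(0, min(cap, base * 2 ** attempt))
            time.sleep(delay)
```

With jitter, failing clients spread their retries across the backoff window instead of hammering the network all at once.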

The second December outage began around 7:43 a.m. on the 16th and affected online services including Twitch, Zoom, PSN, Xbox Live, DoorDash, QuickBooks Online, and Hulu. AWS attributed the failure to automated software on its main network mistakenly diverting some traffic onto the backbone, which disrupted connectivity for some Internet applications.

The third December outage began around 7:30 a.m. EST on the 23rd and hit Slack, Epic Games and its game Fortnite, the cryptocurrency exchange Coinbase Global, the dating app Grindr, and the delivery company Instacart. AWS's preliminary investigation blamed a power supply problem in a data center.

Finally, here's hoping that in 2022, none of us has to live through downtime like this~

