The Spring Festival battle against downtime: 1.4 billion people, tens of billions of red envelopes, and engineers working overtime 丨 New Year has this Krypton

By | Wang and Tong

Edited by | Shi Yaqiong

Cover source | IC photo

At the Bashang data center, known as Alibaba's "data heart" and more than 200 kilometers from Beijing, winter temperatures fall as low as minus 38 degrees Celsius, cold enough that people can barely open their mouths. Every Chinese New Year's Eve, Alibaba Cloud "patrol soldiers" are stationed in the ECC monitoring room here. Besides keeping the basic water, electricity, and fuel facilities running normally, they also guard against sudden problems that the cold can cause for the servers.

From Chinese New Year's Eve to the second day of the Lunar New Year in 2019, Taiyou of Alibaba Cloud was on duty here. He knew that he was the last line of defense and that his work mattered to the operation of the whole system, and he kept his spirits up: "The advantage here is that you can set off firecrackers whenever you like!"

As the "home" of the servers, the data center is the foundation of users' online lives, and the traffic peak during the Spring Festival leaves no room for error.

And Taiyou is just one of the engineers who make it possible for us to grab Spring Festival red envelopes, post to the circle of friends, watch the Spring Festival Gala online, and play games more smoothly.

On Chinese New Year's Eve, when families across the country gather for their reunions, engineers from the major tech companies, whether in the company building, in the computer room, or at home, work overtime to withstand the impact that traffic from more than a billion people has on the servers:

In 2022, Jingdong set up 8 project teams, led by Jingdong Retail Group and assisted by subsidiaries such as logistics and technology, to handle the distribution of red envelopes for the Spring Festival Gala;

On Chinese New Year's Eve 2021, hundreds of engineers from Tencent Cloud and Tencent's underlying technical team chose to stay on duty at the company, at customer sites, and in data centers around the world;

Kingsoft Cloud set up an operation and maintenance team of more than 100 people this year to ensure the smooth live broadcast of the Spring Festival Gala;

In 2020, Kuaishou put hundreds of people into the R&D team for the Spring Festival Gala red packets;

In 2019, Baidu mobilized 100,000 servers, and thousands of Baidu colleagues worked overtime on Chinese New Year's Eve so that the Baidu App could smoothly survive the traffic impact caused by red envelope distribution;

……

Alibaba Cloud's Taiyou inspects the air-cooled water unit outdoors

When the Spring Festival Gala airs on time, when the Chinese New Year's Eve dinner is on the table, when the fireworks bloom, you may not realize what kind of data tsunami gathers when more than a billion people across the country pick up their mobile phones at the same time.

Keeping the digital world running normally and in good order has become a special mission for the major tech companies and their engineers during the Spring Festival.

This is the Spring Festival battle against downtime.

01 Downtime: it arrives with peak traffic

In the PC Internet era, downtime rarely occurred, for two reasons. First, people had limited devices for accessing the Internet and were also limited by geographical location, so participation in online hot events was low and servers received less traffic impact. Second, in the PC era people usually crowded around a single hot picture or video, so the server only needed to cache that one piece of content.

But in the mobile Internet era, UGC (user-generated content), represented by the circle of friends and Weibo, is different: every post is distinct, and the server needs to cache each one. On this basis, the larger the traffic, the greater the impact on the server, until downtime occurs.

Therefore, during the Spring Festival, when more than a billion people have time on their hands, frequent Internet use can easily put impact and pressure on servers.

Tencent was one of the first companies to experience a sudden traffic spike in the mobile Internet era.

In the PC era, most Tencent engineers took normal holidays: nine days of vacation without bringing a computer. Going to an Internet café, connecting to the VPN, and handling a launch, a migration, or a decommission counted as being on duty.

The change came on Chinese New Year's Eve in 2014. That was the year 4G began to spread and the mobile Internet began to take off. About ten days before the Spring Festival, to liven up the New Year atmosphere, Tencent added a red envelope grabbing function to WeChat. Before the Spring Festival red envelope officially launched, the team found that the number of users of this "small function" far exceeded expectations: starting from first-tier cities such as Guangzhou, the habit of sending red envelopes gradually spread to second-, third- and fourth-tier cities, and then to the whole country.

But the feature had been designed as a small system from the start, and there was no time to rework it for the growth in users.

At that time, WeChat's DAU had just passed 100 million and it had about 400 million users, and on Chinese New Year's Eve almost everyone with WeChat began sending and grabbing red envelopes. The Spring Festival red envelope team quickly activated overload protection: when a user hit the overloaded system while trying to send a red envelope, the system prompted "The current system is busy". The technical team behind the red envelope urgently brought in 10 times the number of servers in the original design to withstand the test.
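
The overload protection described above can be illustrated with a minimal sketch. This is not Tencent's implementation; the class name `OverloadGuard`, the per-window limit, and the function `send_red_envelope` are hypothetical, and only the prompt text comes from the article. The idea is simply to admit requests up to a fixed capacity per time window and answer the rest with the busy prompt instead of letting them pile up.

```python
from dataclasses import dataclass

BUSY_MESSAGE = "The current system is busy"  # prompt quoted in the article


@dataclass
class OverloadGuard:
    """Admits requests until the per-window capacity is used up, then sheds load."""

    capacity_per_window: int  # hypothetical limit, tuned per deployment
    admitted: int = 0

    def try_admit(self) -> bool:
        # Reject once the window's capacity has been consumed.
        if self.admitted >= self.capacity_per_window:
            return False
        self.admitted += 1
        return True

    def reset_window(self) -> None:
        """Call at the start of each time window (for example, every second)."""
        self.admitted = 0


def send_red_envelope(guard: OverloadGuard, user_id: str) -> str:
    # Shed the request with the busy prompt when the guard is saturated.
    if not guard.try_admit():
        return BUSY_MESSAGE
    return f"red envelope sent by {user_id}"
```

With a capacity of 2, the third request in the same window receives the busy prompt, and requests are admitted again after `reset_window()` is called.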

At the same time, Tencent's storage was also in trouble. Screenshots of the red envelopes people had grabbed and New Year's wishes flooded the circle of friends, triggering the preset overload warning line. What users noticed was that their messages did not reach the other party in time, and that WeChat messages and circle-of-friends posts from their friends might arrive late. Both the storage team and the WeChat team urgently dispatched O&M engineers to handle it, expand capacity, and improve distribution strategies.

After 2014, Tencent learned its lesson and began the "tradition" of overtime duty every Spring Festival.

All the other major tech companies that want to take part in Spring Festival activities have also learned since then to prepare in advance.

02 Red packets: a carnival for the whole nation, overtime for the big companies

Preparing in advance is not as simple as people think.

An obvious "peak" is the annual red envelope grabbing activity. Since WeChat launched red envelope grabbing at the Spring Festival Gala in 2015, a top Internet company has taken the stage every year to send red envelopes to more than a billion people: Douyin in 2021, Kuaishou in 2020, Baidu in 2019, Taobao in 2018, and Alipay in 2016. The amount of money has grown year by year, and new ways to play keep emerging.

Photo update: In 2022, JD.com prepares 1.5 billion interactive red envelopes and physical objects

The red envelopes are revealed and distributed in concentrated bursts at one or several set times. It may look as if only a few hundred million red envelopes have been issued, but the technology investment behind them goes far beyond that.

Red envelope grabbing can very easily cause downtime, and the reasons are basically these: 1. an unforeseen traffic peak floods in within an instant; 2. the complex architecture of the red envelope system brings coordination costs; 3. people returning to their hometowns for the Spring Festival force temporary adjustments to how traffic resources are allocated between regions; 4. collaboration with external resources runs into problems; and 5. new formats need new technologies to match.

To solve these problems, the red envelope organizers and cloud vendors have spared no effort:

As for the unforeseen traffic peak that floods in within an instant, after years of crossing the river by feeling the stones, companies now basically know what to expect.

At the 2018 Spring Festival Gala, the technical team of the Taobao red envelope project had long anticipated the pressure on the login system. It extrapolated the extreme case from historical data and finally decided to expand login capacity to 3 times the Double Eleven level of 2017. As it turned out, the actual login peak on the night of the Gala exceeded 15 times that of Double Eleven in 2017, and the instantaneous logins of new users in particular were completely beyond expectations.
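
The size of that miss can be worked out from the two figures in the paragraph above. The short calculation below is only an illustration: it normalizes the 2017 Double Eleven login peak to 1.0 (an arbitrary unit) and assumes the plan meant provisioning 3 times that level, which is one reading of the source.

```python
# Normalise the Double Eleven 2017 login peak to 1.0 (arbitrary unit, an assumption).
baseline_peak = 1.0

planned_capacity = 3 * baseline_peak    # capacity the team decided to provision
actual_peak_floor = 15 * baseline_peak  # the real peak exceeded this value

# Ratio of the observed peak (lower bound) to the provisioned capacity.
overload_factor = actual_peak_floor / planned_capacity
print(f"Actual login peak was more than {overload_factor:.0f}x the provisioned capacity")
```

Under that reading, the login system faced more than five times the load it had been sized for.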

Fortunately, with the data from previous years as a base, latecomers can estimate more accurately. Before the Spring Festival Gala, Baidu's technical department calculated that logins during the Gala could reach 2,500 times the daily login peak, that traffic would reach an estimated 50 million requests per second, and that the per-minute peak would reach 1 billion requests; the cloud computing system able to support this traffic was composed of 100,000 servers.

The complexity of the red envelope system architecture brings coordination costs. Unlike simple logins, posts, and comments, a red packet moves among three layers: the red packet business system, the transaction payment system, and the change account system. If a red packet is funded from a bank card, a request first goes to the bank and the bank deducts the money; once the deduction succeeds, the backend notifies the payment system, and only then does the red packet system release the red packet. After other users grab the red packet, the money enters their accounts as change.
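
The order of steps in the paragraph above can be sketched as a small state machine. This is a simplified illustration, not any company's real system: the stage names, `NEXT_STAGE`, `advance`, and `run_flow` are all hypothetical, and a real flow would also handle retries, refunds, and timeouts.

```python
from enum import Enum, auto
from typing import Dict, List


class Stage(Enum):
    REQUESTED = auto()         # sender asks to fund a red packet from a bank card
    BANK_DEDUCTED = auto()     # the bank has deducted the money
    PAYMENT_NOTIFIED = auto()  # the backend has notified the payment system
    RELEASED = auto()          # the red packet system has released the packet
    CREDITED = auto()          # the amount has entered a receiver's change account


# Each stage may only advance to the next one, mirroring the order in the text.
NEXT_STAGE: Dict[Stage, Stage] = {
    Stage.REQUESTED: Stage.BANK_DEDUCTED,
    Stage.BANK_DEDUCTED: Stage.PAYMENT_NOTIFIED,
    Stage.PAYMENT_NOTIFIED: Stage.RELEASED,
    Stage.RELEASED: Stage.CREDITED,
}


def advance(stage: Stage) -> Stage:
    """Move to the following stage, or fail if the flow has already finished."""
    if stage not in NEXT_STAGE:
        raise ValueError(f"{stage.name} is a final stage")
    return NEXT_STAGE[stage]


def run_flow(bank_deduction_ok: bool) -> List[Stage]:
    """Return the stages a red packet passes through."""
    history = [Stage.REQUESTED]
    if not bank_deduction_ok:
        # Without a successful deduction the packet is never released.
        return history
    while history[-1] in NEXT_STAGE:
        history.append(advance(history[-1]))
    return history
```

Every hop in `run_flow(True)` crosses a system boundary, which is why a slow or failing bank interface stalls the packet before it is ever released.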

The few seconds in which a red packet's money goes in and out all consume server resources. Because the funds move in and out of banks so frequently, and the technical capacity of some banks is very limited, the companies also need to coordinate and test with the banks in advance.

The temporary adjustments to inter-regional traffic allocation caused by people returning home for the Spring Festival may ease slightly this year, with people "encouraged to celebrate the New Year where they are".

When users' geographical locations change, the traffic structure changes with them, and data center (DC) and CDN bandwidth have to be adjusted. Every year, Alibaba Cloud, Tencent Cloud, Kingsoft Cloud and other vendors work with the three major operators to plan the needed network resources in advance for different scenarios, then use intelligent scheduling systems to sense how strained each resource is and to schedule and replenish resources accordingly.
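
As a rough illustration of "sensing how strained each resource is and replenishing accordingly", the sketch below compares demand with capacity per region and reports how much extra capacity each region would need to stay under a target utilization. The function name, the region names, the 80% target, and the numbers are all hypothetical; real schedulers work with far richer signals.

```python
from typing import Dict


def replenishment_plan(
    capacity: Dict[str, float],
    demand: Dict[str, float],
    target_utilisation: float = 0.8,  # hypothetical safety margin
) -> Dict[str, float]:
    """Return the extra capacity each region needs so that
    demand / capacity stays at or below the target utilisation."""
    plan: Dict[str, float] = {}
    for region, cap in capacity.items():
        needed = demand.get(region, 0.0) / target_utilisation
        if needed > cap:
            plan[region] = needed - cap
    return plan


# Illustrative numbers only: traffic shifts from a big city to a hometown region.
capacity = {"first_tier_city": 100.0, "hometown_region": 40.0}
demand = {"first_tier_city": 50.0, "hometown_region": 60.0}
plan = replenishment_plan(capacity, demand)
```

In this made-up case only the hometown region appears in the plan, because the big city has spare capacity while the hometown region's demand exceeds what it was provisioned for.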

The Spring Festival Gala is a large collaboration that relies on many external resources: app stores, servers, bandwidth, and CDNs. For example, if the app stores do not expand capacity, they too will be paralyzed by the users the Gala brings in. In 2019 Baidu deployed 100,000 servers to keep the red envelope activity running smoothly, but that night parts of the app stores of Apple, Xiaomi, and Huawei crashed, and 2 million to 3 million people across the country could not download the Baidu App.

Baidu employees "worshiped" Yang Chaoyue before the Spring Festival Gala

With the popularity of short videos, red envelope grabbing has also been combined with short video. In 2020 and 2021, the Spring Festival Gala red envelope partners were the two short video platforms Kuaishou and Douyin, respectively. Unlike the earlier image-and-text formats, Kuaishou in 2020 let users grab red packets by "watching videos + liking". Official Kuaishou data shows that the cumulative number of viewers of its Spring Festival Gala live broadcast was 780 million and the peak number of concurrent online users was 25.24 million. Short video content requires 50 to 100 times the bandwidth of text, but with a high-performance, scalable AI architecture, AI technology was applied throughout rich media processing, content moderation, content production, content distribution, and content consumption, ensuring that users could grab red envelopes smoothly while short video, live broadcast, community, and other functions remained available.

03 More defensive battles, beyond the red packets

Over the years, the technology behind the red envelope has become more and more stable, and Tencent as a bystander can feel it.

This is because, whatever amount is finally revealed, 1.88, 6.88, or 288, everyone takes a screenshot and shares it socially, that is, posts it to the circle of friends or sends it to a WeChat group.

The massive amount of content we send, such as pictures, GIFs, and videos, puts great pressure on storage capacity, on the bandwidth of trunk communication links, and on page loading speed. So when, at certain moments, circle-of-friends posts surge to ten times the usual volume, Tencent's data centers come under great pressure. If another app does not reveal and issue its red envelopes all at once, people do not share to the circle of friends in sync, and the pressure seen in Tencent's data centers rises gradually rather than sharply.

Gao Xiangran, director of the Technology Operation and Quality Center of Tencent's cloud architecture platform, therefore jokes that Tencent supports and tests the red envelope activities of many "friendly competitors": "Often, we can 'test out' whether a competitor's campaign has been well planned."

Beyond the red envelopes, pressure on the servers also comes from the Spring Festival Gala itself.

The popularity of the mobile Internet has made live streaming of the Spring Festival Gala take off. As of 24:00 on January 24, 2020, the cumulative number of live viewers of the 2020 Spring Festival Gala on new media platforms exceeded 1.116 billion, while the live TV broadcast had only 589 million viewers. The technology behind the Gala live stream has similarities with the red envelope and storage technology mentioned earlier, but also differences.

The similarity is that both are supported by cloud computing. The difference is that grabbing red envelopes and posting to the circle of friends create instantaneous high-concurrency pressure, while the pressure of live video comes from network bandwidth. Since last year, the CCTV Spring Festival Gala live broadcast has adopted new formats such as 5G ultra-high definition and VR live broadcast, and video needs a thousand times the bandwidth of images, which makes video transmission more difficult.

Kingsoft Cloud, which specializes in video codec technology, has served the Spring Festival Gala webcast for four years since 2017. Its new AI + video cloud technology, developed for new formats such as 5G, 4K/8K high-definition live broadcast, and VR, aims to save bandwidth, adapt to fragmented terminals, and provide large-capacity distribution nodes designed for the "large content" of high-definition video.

In fact, operations and maintenance for the CCTV Spring Festival Gala are no longer the responsibility of a single company. Using multiple clouds not only hedges against risks such as heavy traffic, high concurrency, and DDoS attacks, but also follows the principle that every trade has its specialists, letting each company do what it is best at.

In the past two years, the Spring Festival has no longer been only about activities in physical space, such as watching the Spring Festival Gala and visiting relatives and friends; surfing the online world has become an even more important Spring Festival activity. There are even many traffic shocks we would never imagine: game logins soar during the Gala's song and dance numbers; in the ten minutes from 12:00 a.m. to 12:10 a.m. on the first day of each new year, the flood of circle-of-friends posts hits the servers; and movie ticketing systems easily become congested during the holiday...

Tall buildings do not rise out of thin air: technology is the foundation, and the thousands of engineers who do not go home during the Spring Festival are the builders.

04 Stealth technology, and engineers who don't go home

From 2014 to 2022, the major tech companies' technology for coping with Spring Festival traffic has kept evolving: from the early days of ignorance, when the technical difficulty was underestimated; to later years, when they prepared in advance every year but still relied on circuit breaker mechanisms to limit traffic; to today, when limits are only a supplement and systems are distributed, automated, and intelligent. It looks as if Tencent, Baidu, Alibaba, and Kingsoft are safeguarding our Spring Festival activities, but it is really cloud computing, 4G (and 5G), AI, and other technologies that are standing guard.
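
The circuit breaker mechanism mentioned above can be sketched in a few lines. This is a deliberately simplified version, not the mechanism any of these companies actually runs: the class, its parameters, and the injectable `clock` are hypothetical, and production breakers usually add a "half-open" probing state that this sketch omits.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Stops sending traffic to a failing dependency for a cooldown period."""

    def __init__(
        self,
        failure_threshold: int,
        cooldown_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock
        self.consecutive_failures = 0
        self.opened_at: Optional[float] = None  # None means the breaker is closed

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at >= self.cooldown_seconds:
            # Cooldown elapsed: close the breaker and let traffic through again.
            self.opened_at = None
            self.consecutive_failures = 0
            return True
        return False

    def record_success(self) -> None:
        self.consecutive_failures = 0

    def record_failure(self) -> None:
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            # Too many failures in a row: open the breaker.
            self.opened_at = self.clock()
```

A caller checks `allow_request()` before each call to the protected service and reports the outcome with `record_success()` or `record_failure()`; while the breaker is open, requests are rejected immediately rather than queued behind a struggling backend.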

Our daily behavior on the Internet, such as chatting on WeChat, posting to the circle of friends, shopping on Taobao, searching on Baidu, and grabbing red envelopes on a different app every year, produces data, and the transmission, storage, and computing of that data depend heavily on 4G (5G), cloud computing, and AI as their technical foundation.

The "2020 Global Computing Power Index Evaluation Report" released by Inspur Information and IDC shows that for each one-point increase in the computing power index, the digital economy and GDP grow by 3.3‰ and 1.8‰ respectively. The United States ranks first in the national computing power index with 75 points and is home to the world's largest hyperscale data centers; China ranks second with 66 points.

Source: IDC

Ultra-large volumes of data place higher demands on processing efficiency, and powerful cloud computing capabilities provide the driving force for multiplying innovation in the digital economy; the continuous evolution of transmission technology from 3G to 4G to 5G provides faster and wider channels for data transmission; and AI brings automation and intelligence, saving operating costs and time and freeing up time and room for more important data processing.

For China specifically, the annual Spring Festival battle against downtime has become both a test and a refinement of the major tech companies' computing power.

This is a positive cycle. The pressure of "downtime" forces companies to upgrade their technology, which in turn drives the development of cloud computing and transmission technology; that development then provides the technical foundation for the companies' Spring Festival defense battles.

Behind the positive cycle are countless engineers who have gone bald for the sake of technology, and ordinary workers who do not go home so that the technology runs smoothly during the Spring Festival.

Every year Alipay holds its Five Blessings (Wufu) activity, and the prize money is revealed on the night of Chinese New Year's Eve. The Wufu draw does not mean the engineers' work is over. Once everything was finished, at around one o'clock in the morning, an Alipay colleague who goes by the nickname Shape Xiu would set off from Hangzhou, cross China, and return home to Aral, Xinjiang.

A Kingsoft Cloud technical expert who has worked on two Spring Festival Gala live broadcast projects told 36Kr that the department's main leaders also join the front line, leading the team of 100 people until the very last second, while the logistics team keeps everyone supplied with food and a "taste of the New Year".

On Chinese New Year's Eve in 2019, more than 1,000 colleagues in the Baidu Building, 100 colleagues in Baidu's computer rooms around the world, engineers from more than 100 server manufacturers waiting in the computer rooms with spare parts, and more than 1,000 colleagues stationed in the computer rooms of the three major operators to keep the network flowing were all working overtime to make the red envelope activity a success.

Their overtime pay is not low, generally more than twice their salary, and there are project bonuses as well. But the effort is not only about money: in the course of interviews and research, what comes through more strongly is a sense of conviction. Spring Festival operations and maintenance is no small matter.

Today, the Spring Festival has become a predictable peak. It is said that when the "Super Bowl" live broadcast in the United States cuts to commercials, hundreds of millions of viewers in front of their TV sets go to the toilet and flush all at once, causing municipal water supplies in major American cities to collapse; yet the Super Bowl has only 130 million viewers worldwide.

During a New Year holiday when more than a billion people celebrate and scroll on their phones together, thousands of engineers, with technology that works as quietly as spring rain, let every ordinary person move through the digital world unimpeded.

This article is from the WeChat public account "36Kr" (ID: wow36kr), authors: Wang and Tong, Shi Yaqiong; published by 36Kr with permission.
