"Web" Taobao HTTP3/QUIC technology evolution and practice

Author: Architectural thinking

I. Introduction

The following figure shows the key milestones in the evolution of Mobile Taobao's network protocols. In 2015, to address the slow handshake of standard TLS 1.2, we developed and launched SlightSSL, a lightweight private encryption protocol. It optimizes the handshake and encryption so that session negotiation and encrypted data can be carried in a single TCP packet, achieving 0-RTT when there is no risk of replay attacks. Over time, however, troubleshooting, service onboarding, and business requirements ran into difficulties that the private protocol could not resolve. Typical reports and requests included: "All long-lived connections fail under Wi-Fi, falling back to short-lived HTTPS connections succeeds, and long-lived connections work normally after switching to 4G (the SlightSSL private protocol was being disconnected by the Wi-Fi firewall)"; "Do you plan to support TLS 1.3?"; and "The access servers for our domain do not support deploying SlightSSL". Meanwhile, with the official publication of QUIC as RFC 9000 and HTTP3 as RFC 9114, moving Mobile Taobao's network protocol to HTTP3/QUIC not only resolves the business pain points of the SlightSSL private protocol and improves transport performance and user experience, but also follows the general direction in which network protocols are evolving. The private protocol delivered an efficient network experience, but its problems can be summarized in three points:

  • A private protocol means more customization and requires deployment support on both ends (it is intrusive)
  • TLS 1.3 is not supported
  • Network middleboxes occasionally tear down the connection between the two ends because of the private protocol

Figure 1.1 Key milestones in the evolution of Mobile Taobao's network protocols

II. Evolution of TNET capabilities

TNET, short for TAOBAO NET, is a library of low-level networking capabilities that took shape gradually as Mobile Taobao's wireless networking evolved. It currently carries more than 90% of Mobile Taobao's business HTTPS traffic (a small number of domains have no long-lived connection protocol configured in AMDC). As the client-side cornerstone of the group's network services, it is the client-side entry point of the long-lived connection channel from client to server, and it is also the foundation for the network middleware on the client. After years of evolution it offers the upper layer a rich, composable set of protocols: the different protocols are implemented and abstracted internally and exposed through one unified interface, so the caller only needs to pass in a different combined protocol type when creating a connection, which keeps it simple to use. Internally, its functions fall into two main groups:

  • One group exposes SPDY/HTTP2/HTTP3/Custom/HTTP3/Tunnel to the upper layer, covering HTTP requests, the private-protocol upload channel, and the ACCS messaging channel. (Standard TLS is mainly used by overseas services and by other services that cannot deploy SlightSSL. Standard HTTP2 is currently maintained on a branch and has not been merged into the mainline, mainly because of the package size it would add to Mobile Taobao and because SlightSSL already meets business requirements with higher performance.)
  • The other group provides self-implemented DNS resolution, traceroute, MTU detection, ICMP ping, and IPv4/IPv6 protocol stack detection. These are mostly network tools that support upper-layer network diagnosis and probing, and they also supplement system capabilities, for example when the native DNS interface fails.

Figure 2.1 TNET capability architecture

III. HTTP3/QUIC protocol upgrade for better performance

Client-cloud upgrade: technical transformation plan

XQUIC, a protocol library that implements the IETF QUIC standard and was developed in-house by Mobile Taobao, has the advantages of being fully independent, controllable, and quick to evolve. Colleagues have already described the design of the XQUIC library in detail in several articles, so we will not repeat it here; interested readers can follow the links at the end of this article. On the client, the TNET network library integrates XQUIC and thereby gains support for the layer-7 HTTP3 protocol and the layer-4 QUIC protocol. The implementation differences between protocols are hidden from callers: the upper layer only needs to select a protocol type when establishing a connection to meet the requirements of different business scenarios.

Figure 3.1 XQUIC integration scheme in Mobile Taobao

Client-side downgrade & fast recovery

The client first pulls a set of policies from AMDC (AMDC can be understood as an extended HTTPDNS domain resolution service that returns not only the IPs for a domain but also the supported protocols and other extended attributes). AMDC issues the http3 and http2 protocols together: http3 is preferred on the client, and http2 is issued alongside it to guarantee that a long-lived connection protocol is always available. After receiving the http3 policy, the client first runs a UDP connectivity probe to avoid problems in networks that restrict UDP. The HTTP3 long-lived connection is created only after the current network environment passes the probe; the probe itself is described later in this article.

Figure 3.2 Client downgrade policy
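
The client-side selection described above can be sketched roughly as follows. This is a minimal illustration, not TNET's actual code: the types `policy_t` and `probe_state_t` and the function `choose_protocol` are hypothetical names introduced here, and treating an unknown probe result the same as a failed one is an assumption.

```c
#include <stdbool.h>

/* Hypothetical types for illustration; not part of TNET or AMDC. */
typedef enum { PROTO_NONE, PROTO_HTTP2, PROTO_HTTP3 } proto_t;
typedef enum { PROBE_UNKNOWN, PROBE_PASSED, PROBE_FAILED } probe_state_t;

typedef struct {
    bool http3_offered;  /* the policy issued http3 for this domain */
    bool http2_offered;  /* the policy issued http2 as the fallback */
} policy_t;

/* Prefer http3 only when it was offered and the UDP probe has passed
 * for the current network; otherwise fall back to http2 if offered. */
proto_t choose_protocol(const policy_t *policy, probe_state_t udp_probe)
{
    if (policy->http3_offered && udp_probe == PROBE_PASSED) {
        return PROTO_HTTP3;
    }
    if (policy->http2_offered) {
        return PROTO_HTTP2;
    }
    return PROTO_NONE;
}
```

With both protocols offered, a passed probe selects http3 and any other probe state selects http2, so a UDP-restricted network never has to wait for a QUIC handshake timeout.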

Upgrade results

Overall upgrade progress & results

Last year, once the QUIC IPv6 path in the Aserver main intranet had been reworked, we switched all TCP+IPv6 traffic on the shopping guide, transaction, short video, and upload links to QUIC; these key scenarios in Mobile Taobao have now all been upgraded. In terms of performance, overall metrics and business A/B data show that HTTP3 and QUIC brought significant improvements across these scenarios, both on good networks (transfer rate, average latency) and on weak networks (long-tail latency, success rate), giving users a smoother network experience. Other apps within Alibaba Group, such as Cainiao, Mobile Tmall, and AliExpress, have also adopted our solution to roll out HTTP3 and gain the business benefits of a better network experience. The gains in Mobile Taobao are as follows:

  • Shopping guide scenario: average/P99 total network time reduced by 22%/33%; one-second completion rate up by 1.2pt.
  • Transaction scenario: average/P99 total network time reduced by 23%/32%; one-second completion rate up by 0.55pt.
  • Upload scenario: video/image upload rate up by 7.7%/21%; success rate up by 0.18pt.
  • Short video download: average/P99 total network time reduced by 15%/16%; download rate up by 18%.

Figure 3.3 MTOP RPC core link upgrade comparison data

Figure 3.4 Upload & short video content link upgrade comparison data

Results in typical business scenarios

Interruption rate in interactive scenarios

In interactive services, A/B experiment data shows that the HTTP3 experiment bucket effectively reduces both the number of users (UV) whose interaction is interrupted and the number of users lost because of an interruption. In the Android HTTP3 experiment bucket, interrupted UV and interruption-loss UV fell by 24.02% and 22.89% respectively; in the iOS experiment bucket they fell by 20.91% and 18.57%.

Figure 3.5 Baba Farm, one of Mobile Taobao's interactive scenarios

Shopping cart & item details

This year's redesigned Mobile Taobao shopping cart makes it more convenient for users to place orders, but it also faced a business pain point: network transfers took too long. After switching to HTTP3, the average latency in the business's overall metrics dropped significantly, opening up more possibilities for the business. The following figure shows the trend of overall interface latency after the HTTP3 upgrade. Other interfaces, such as item details and the home page, behave similarly, thanks to the improved transport performance after the upgrade.

Figure 3.6 Trend of service interface latency as the rollout ramps up

Rollout problems & optimizations

UDP penetration issues

Some carriers and network middleboxes may have policies that drop UDP packets. This lowers the overall connection success rate and noticeably raises the downgrade rate. Moreover, the client often has to wait for connection establishment to time out before it can downgrade and retry successfully, which clearly lengthens retries and degrades the user experience.

To address this, we designed a UDP connectivity probe. An asynchronous probe is triggered at startup or when the network environment changes, the result is persisted locally per network environment, and a new probe is triggered once the result expires. This ensures that upper-layer services do not degrade even when UDP is blocked, while environments that pass the probe use HTTP3/QUIC to give users a better experience. Online data showed a nationwide average UDP probe success rate of about 95% at the start. After we remediated VIPs with poor UDP quality, applied dedicated scheduling configuration for legacy offline deployments that do not support the UDP port, and had carriers lift blocks on certain UDP IP ranges, the nationwide average UDP probe success rate has risen to 98%.
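
The per-network caching of probe results can be sketched as below. This is only an illustration under stated assumptions: the names (`probe_cache_t`, `probe_cache_put`, `probe_cache_get`), the string key for a network environment, the fixed slot count, and the eviction of slot 0 when the table is full are all hypothetical, and persistence to local storage is left out.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

#define PROBE_CACHE_SLOTS 8
#define NET_ID_MAX 64

/* Hypothetical record: one probe result per network environment,
 * keyed by a string such as "wifi:home" or "cell:lte". */
typedef struct {
    char   net_id[NET_ID_MAX];
    bool   udp_ok;
    time_t expires_at;
    bool   used;
} probe_entry_t;

typedef struct {
    probe_entry_t slots[PROBE_CACHE_SLOTS];
} probe_cache_t;

typedef enum {
    PROBE_CACHE_MISS,        /* no valid result: a new probe is needed */
    PROBE_CACHE_UDP_OK,
    PROBE_CACHE_UDP_BLOCKED
} probe_lookup_t;

/* Store or overwrite the result for net_id, valid for ttl seconds. */
void probe_cache_put(probe_cache_t *c, const char *net_id, bool udp_ok,
                     time_t now, time_t ttl)
{
    probe_entry_t *match = NULL, *free_slot = NULL;
    for (int i = 0; i < PROBE_CACHE_SLOTS; i++) {
        probe_entry_t *e = &c->slots[i];
        if (e->used && strcmp(e->net_id, net_id) == 0) { match = e; break; }
        if (!e->used && free_slot == NULL) { free_slot = e; }
    }
    /* Prefer the existing entry, then a free slot, else evict slot 0. */
    probe_entry_t *t = match ? match : (free_slot ? free_slot : &c->slots[0]);
    snprintf(t->net_id, NET_ID_MAX, "%s", net_id);
    t->udp_ok = udp_ok;
    t->expires_at = now + ttl;
    t->used = true;
}

/* Look up the cached result; expired entries count as a miss. */
probe_lookup_t probe_cache_get(const probe_cache_t *c, const char *net_id,
                               time_t now)
{
    for (int i = 0; i < PROBE_CACHE_SLOTS; i++) {
        const probe_entry_t *e = &c->slots[i];
        if (e->used && strcmp(e->net_id, net_id) == 0) {
            if (now >= e->expires_at) return PROBE_CACHE_MISS;
            return e->udp_ok ? PROBE_CACHE_UDP_OK : PROBE_CACHE_UDP_BLOCKED;
        }
    }
    return PROBE_CACHE_MISS;
}
```

A caller would zero-initialize the cache, treat `PROBE_CACHE_MISS` as "start an asynchronous probe and use the fallback protocol meanwhile", and only create an HTTP3 connection on `PROBE_CACHE_UDP_OK`.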

UDP port NAT rebinding problem

In the past, the basic load-balancing algorithms of SLB and CDN LVS distributed traffic by 5-tuple, which works well for TCP. With QUIC, 5-tuple forwarding cannot support connection migration or multipath QUIC, because the 5-tuple changes in both scenarios. The ideal approach is consistent-hash forwarding based on the connection ID (CID); QUIC was designed from the start to be decoupled from the 5-tuple, and readers interested in CID-based distribution can consult the QUIC-LB draft. In our deployment, reworking the SLB/LVS infrastructure takes a long time, so the initial rollout used 5-tuple forwarding (giving up connection migration). This met requirements in most cases but also ran into problems. NAT gateways generally keep UDP sessions alive only briefly, so on mobile devices UDP port NAT rebinding readily occurs after the app sits idle in the background. Under 5-tuple forwarding, packets are then no longer routed to the original server, and the connection is broken because its context cannot be found, even though the network itself is working normally.

As shown in the following figure, client QUIC connection Q initially leaves the NAT device from source egress port 1 and is forwarded by SLB to Server A; traffic for connection Q flows normally in both directions between the client and Server A. If the UDP session for connection Q then stays idle (for example, because the user sends the app to the background) for longer than the NAT device's keep-alive time, the mapping between the app and egress port 1 expires. When the user returns to the foreground and the app sends packets again, the NAT device creates a new mapping from the app to egress port 2. SLB now forwards the client's packets to a different machine, Server C, which cannot find the context for QUIC connection Q and replies with a RESET. In our data, the proportion of Huawei models affected is higher than for other manufacturers. The fix is clear: CID-based forwarding keeps routing consistent before and after port NAT rebinding, and when the server detects a new 5-tuple it can trigger connection migration.

Figure 3.7 Problems with 5-tuple forwarding
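
The difference between the two routing keys can be illustrated with a small sketch. It is not how SLB, LVS, or the QUIC-LB draft work internally: the FNV-1a hash, the plain modulo used in place of consistent hashing, and all names here are assumptions chosen to keep the example short.

```c
#include <stddef.h>
#include <stdint.h>

#define FNV_OFFSET_BASIS 2166136261u
#define FNV_PRIME        16777619u

/* 32-bit FNV-1a over a byte buffer, starting from state h. */
static uint32_t fnv1a(uint32_t h, const uint8_t *data, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        h ^= data[i];
        h *= FNV_PRIME;
    }
    return h;
}

typedef struct {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
    uint8_t  proto;
} five_tuple_t;

/* Routing key derived from the 5-tuple: changes whenever the NAT
 * assigns a new source port. */
uint32_t hash_five_tuple(const five_tuple_t *t)
{
    const uint8_t bytes[13] = {
        (uint8_t)(t->src_ip >> 24), (uint8_t)(t->src_ip >> 16),
        (uint8_t)(t->src_ip >> 8),  (uint8_t)t->src_ip,
        (uint8_t)(t->dst_ip >> 24), (uint8_t)(t->dst_ip >> 16),
        (uint8_t)(t->dst_ip >> 8),  (uint8_t)t->dst_ip,
        (uint8_t)(t->dst_port >> 8), (uint8_t)t->dst_port,
        t->proto,
        (uint8_t)(t->src_port >> 8), (uint8_t)t->src_port,
    };
    return fnv1a(FNV_OFFSET_BASIS, bytes, sizeof(bytes));
}

/* Routing key derived from the connection ID: independent of the
 * addresses and ports the packet arrived with. */
uint32_t hash_cid(const uint8_t *cid, size_t len)
{
    return fnv1a(FNV_OFFSET_BASIS, cid, len);
}

/* Simplified backend choice; a real balancer would use consistent hashing. */
unsigned pick_backend(uint32_t hash, unsigned n_backends)
{
    return n_backends == 0 ? 0 : hash % n_backends;
}
```

When the NAT rebinds source port 40001 to 40002, `hash_five_tuple` produces a different key and the packet may land on another backend, whereas `hash_cid` depends only on the CID bytes and keeps selecting the same one.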

Increasing the 0-RTT ratio

During the first handshake, the server returns a session ticket and transport parameters to the client. On the next handshake, while the cached session ticket is still valid, the client can send encrypted data immediately after the ClientHello. The session ticket expires automatically, after which a 1-RTT handshake refreshes it; compared with presetting a public key, this reduces handshake latency while still taking forward secrecy into account. Currently, after the first 1-RTT connection completes, we store the session ticket and transport parameters in the Security Guard secure storage to keep the cache safe. Early in the rollout the improvement was not ideal: total network time improved by only about 15% compared with H2, and the first-packet time was almost the same as H2, which was clearly not what we expected. The data showed that the 0-RTT connection ratio was only about 40% at the start. After we optimized cache efficiency, the 0-RTT ratio rose from 40% to 65% (there is room for further improvement; short video scenarios currently reach 80%+), and the improvement in total network time over H2 rose from 15% to about 20%.
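
The decision between a 0-RTT and a 1-RTT handshake can be sketched as a check against the cached ticket. The struct layout and names below are hypothetical and only illustrate the validity-period rule described above; they are not the XQUIC API.

```c
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical cache record for one server. */
typedef struct {
    bool   present;    /* a ticket and transport parameters are cached */
    time_t issued_at;  /* when the ticket was received */
    time_t lifetime;   /* validity period in seconds */
} ticket_cache_entry_t;

typedef enum { HANDSHAKE_1RTT, HANDSHAKE_0RTT } handshake_mode_t;

/* Use 0-RTT only while a cached ticket is inside its validity period;
 * otherwise do a full 1-RTT handshake, which yields a fresh ticket. */
handshake_mode_t pick_handshake(const ticket_cache_entry_t *e, time_t now)
{
    if (e != NULL && e->present && now >= e->issued_at &&
        now - e->issued_at < e->lifetime) {
        return HANDSHAKE_0RTT;
    }
    return HANDSHAKE_1RTT;
}
```

Every cache miss or expired entry costs one extra round trip, which is why the cache hit rate translates directly into the 0-RTT ratio.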

Business demand for unencrypted transmission

For some short video services, responses are much larger than in RPC scenarios, the content has little need for encryption, and the video download rate matters more. For these cases we implemented encryption/plaintext negotiation in XQUIC: if the handshake negotiates plaintext transmission, subsequent packets are no longer encrypted, which reduces the encryption and decryption overhead on both server and client and thereby improves performance.

XQUIC protocol stack performance optimization

In addition to the optimizations above, we deeply optimized the protocol stack itself: the protocol processing performance of the XQUIC library improved by 85.93%, and the processing performance of nginx-quic improved by 15.62%.

Figure 3.8 Comparison data of XQUIC library stack optimization

XQUIC library processing model

The following figure shows the most simplified model of the XQUIC protocol stack: on the sending side, XQUIC packs an ordered byte stream into QUIC packets and sends them; on the receiving side, it reassembles out-of-order QUIC packets into an ordered byte stream.

Figure 3.9 Simplified processing model of the XQUIC library

Holistic optimization

What is the core of optimizing CPU overhead? To answer that, first consider what actually runs on the CPU: machine instructions. Those instructions come from assembly code, which in turn is generated from a high-level programming language. So there are at least the following areas we can optimize:

  • Programming language: that is, your code. Choose a suitable programming language, then find ways to write higher-performance code in it
  • Compilation
    • Compiler optimizations: enable the compiler's optimization options
    • Compiler: choose a high-performance compiler
  • Instruction set: there is relatively little we can do here; servers are generally x86
  • Packing optimization: Back to the question above, what is the core of optimizing CPU overhead? Essentially, it is reducing the number of instructions needed to complete a task. In the simplified XQUIC model, every QUIC packet received goes through a series of function calls before stream data comes out, and every piece of stream data sent goes through a series of function calls before a QUIC packet comes out. Our goal is to deliver a stream to the peer. We can optimize each function that processes a packet, but reducing the number of times those functions are called is more effective. Reducing the number of QUIC packets greatly improves performance, so each packet should be filled as fully as the protocol allows.

Figure 3.10 Initial XQUIC packing

Figure 3.11 Packing optimization

Figure 3.12 Optimization without redundant frames
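
To make the effect of fuller packets concrete, here is a back-of-the-envelope sketch of the packet count for a given stream. The helper name `packets_needed` and the payload sizes used in the note below (600 and 1200 bytes) are illustrative assumptions, not values taken from XQUIC.

```c
#include <stddef.h>

/* Number of packets needed to carry stream_len bytes when each packet
 * carries at most payload_per_packet bytes of stream data (ceiling
 * division). Returns 0 when payload_per_packet is 0. */
size_t packets_needed(size_t stream_len, size_t payload_per_packet)
{
    if (payload_per_packet == 0) {
        return 0;
    }
    return (stream_len + payload_per_packet - 1) / payload_per_packet;
}
```

For a 100,000-byte stream, `packets_needed(100000, 600)` is 167 while `packets_needed(100000, 1200)` is 84, so filling packets twice as full roughly halves the number of passes through the per-packet code path.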

Local optimization

  • Can the call be avoided entirely?
    • Avoid useless computation
    • Avoid repeated computation: an encryption/decryption context was created, and the key initialized, for every packet encrypted or decrypted -> the context and key are initialized once, when the handshake completes or when the key changes
  • Can it be called less often?
    • Reduce memory copies: service data copied to the H3 layer and then to the transport layer -> service data copied directly to the transport layer
    • Exit loops early, especially when the traversed list is long
  • Optimize function performance
    • Trade space for time: Huffman decoding table stored in a 4K array, decoding 4 bits at a time -> stored in a 64K array, decoding 16 bits at a time
    • Function inlining
    • Branch prediction: likely()/unlikely()
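
Three of the items above (branch prediction hints, function inlining, and early loop exit) can be shown together in a short sketch. The `likely()`/`unlikely()` macros below are the common GCC/Clang idiom built on `__builtin_expect`; the function `find_first` is a hypothetical example, not XQUIC code.

```c
#include <stddef.h>

/* Branch prediction hints: GCC/Clang builtin, no-op on other compilers. */
#if defined(__GNUC__) || defined(__clang__)
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)
#else
#define likely(x)   (x)
#define unlikely(x) (x)
#endif

/* Return the index of the first element equal to key, or -1 if absent.
 * Marked static inline so the compiler can expand it at call sites. */
static inline long find_first(const unsigned *vals, size_t n, unsigned key)
{
    if (unlikely(vals == NULL)) {  /* rare error path, hinted as cold */
        return -1;
    }
    for (size_t i = 0; i < n; i++) {
        if (vals[i] == key) {
            return (long)i;        /* leave the loop as soon as possible */
        }
    }
    return -1;
}
```

The hints do not change behavior; they only tell the compiler which branch to lay out as the fall-through path, so they are worth adding only on hot paths where the outcome is heavily skewed.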

IV. Upgrading the protocol in the group's full-link stress tests

HTTP3 support on the Amazon full-link stress test platform

After HTTP3 was ramped up at scale in Mobile Taobao's shopping guide and transaction scenarios, the protocol mix in the full-link stress test traffic model changed accordingly, and full-link stress tests had to support HTTP2 and HTTP3 at the same time. We therefore carried out a major upgrade of Amazon, the group's full-link stress test engine platform, to support stress testing over HTTP3.

HTTP2's TCP-based path has been validated as stable over many years and many rounds of big promotions and stress tests. HTTP3's new UDP-based path, by contrast, had no track record under the traffic spikes of a big promotion. Stress testing in advance did help us discover several problems on the new UDP path early, and after they were fixed HTTP3 ran smoothly through the Double 11 promotion. The main problems we encountered were:

  • 1. Poor udp_hash lookup performance: with a large number of QUIC connections, the performance of the system's udp_hash lookup drops sharply, which easily saturates soft interrupts so that timeouts cannot be handled in time. The kernel has been optimized for this problem: kernels before 4.19 need a patch, while 4.19 and later include the optimization, which must be enabled by setting a socket option. We therefore upgraded the kernel to 4.19.
setsockopt(s, SOL_UDP, 200, (const void *) &value, sizeof(int));
  • 2. UDP packet loss in the kernel: after upgrading to the 4.19 kernel, we hit UDP packet loss under high PPS, because the 4.19 kernel limits the memory used by UDP.
  • How it works:
    • 1. For each UDP session, the kernel counts the memory in use and releases the count only after it has accumulated to a certain value (positively correlated with rcvbuf).
    • 2. The kernel keeps a sum of the memory counts of all UDP sockets; when this sum exceeds a limit (positively correlated with umem), all UDP packets are dropped.
  • This clearly limits both the number of long-lived connections a single machine can hold and its ability to absorb traffic bursts, so we tuned two parameters:
    • 1. Increase the umem value
    • 2. Reduce the rcvbuf value
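
The rcvbuf side of this tuning can be set per socket with the standard `SO_RCVBUF` option. The following is a minimal sketch assuming Linux; the helper name `udp_socket_with_rcvbuf` is hypothetical and the right buffer size depends on the deployment. The system-wide limit that the text calls umem presumably corresponds to the `net.ipv4.udp_mem` sysctl, which is an assumption here and is configured outside the program.

```c
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

/* Create a UDP socket and request a receive buffer of rcvbuf_bytes.
 * On success returns the socket fd (>= 0) and stores the size the
 * kernel actually applied in *applied; returns -1 on error. */
int udp_socket_with_rcvbuf(int rcvbuf_bytes, int *applied)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0) {
        return -1;
    }
    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF,
                   &rcvbuf_bytes, sizeof(rcvbuf_bytes)) != 0) {
        close(fd);
        return -1;
    }
    socklen_t len = sizeof(*applied);
    if (getsockopt(fd, SOL_SOCKET, SO_RCVBUF, applied, &len) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

On Linux the value read back is normally larger than the one requested, because the kernel doubles it to leave room for bookkeeping overhead, so it is worth logging the applied value rather than assuming the requested one took effect.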

V. Ongoing

HTTP3 coverage for image domains

We have completed the full upgrade of the shopping guide, transaction, short video, and upload links; the upgrade of image domains is still being rolled out gradually in grayscale.

HTTP3 over MPQUIC at scale

The MPQUIC rollout involves upgrading the client, SLB, Aserver, and other infrastructure, and the RPC link has now been upgraded end to end. MPQUIC has officially launched on Mobile Taobao for Android and is currently being ramped up at scale. It offers two optional modes: long-tail compensation mode and multipath parallel acceleration mode. Grayscale data shows that MPQUIC in acceleration mode improves the transfer rate by a further 8% over single-path QUIC. The MPQUIC implementation in XQUIC has been open-sourced. MPQUIC support on the CDN link, covering the server side and LVS, is in progress.

Figure 5.1 Schematic diagram of Wi-Fi + LTE dual-channel aggregated transmission

Figure 5.2 The network acceleration user switch in Mobile Taobao's general settings

Appendix

QUIC-LB: https://datatracker.ietf.org/doc/html/draft-ietf-quic-load-balancers-15

RFC 9000: https://quicwg.org/base-drafts/rfc9000.html

RFC 9114: https://quicwg.org/base-drafts/rfc9114.html

XQUIC: https://github.com/alibaba/xquic

Team introduction

The Gateway and Network Technology team belongs to Dataobao Technology - Dataobao Platform Technology - Terminal Experience Platform, and hopes to bring users a smoother experience through the evolution of network technology. If you are interested in XQUIC, network technology, high-performance network transport, network cost optimization, or related fields, follow our GitHub repository: https://github.com/alibaba/xquic.

Source: Alibaba Terminal Technology_https://mp.weixin.qq.com/s/uwMxy5v_uwQDCMAUWgWgPw