
In weak network scenarios, how does QoS technology "escort" audio and video experience?

Author: Cloudinsight
Multiple algorithms and strategies are used to control network transmission, so as to maximize the audio and video user experience in weak network scenarios.

Liangyi | Technical author

01 What is QoS? What types of QoS apply to audio and video communication?

QoS (Quality of Service) refers to a network's ability to use various underlying technologies to provide better service for designated network traffic. It is a network assurance mechanism and a set of techniques for addressing problems such as network delay and congestion, including traffic classification, traffic policing, traffic shaping, interface rate limiting, congestion management, and congestion avoidance. QoS generally offers three service models: Best-Effort, Integrated Services (IntServ), and Differentiated Services (DiffServ).

This is the traditional definition of QoS and its technology stack, which originated in the quality assurance mechanisms between network transmission devices in the early Internet. The QoS discussed in this article belongs to a quality assurance system at a completely different level, so let us first look at the relationship between these two levels of QoS.

The video conferencing company Polycom's H.323 white paper "QoE and QoS - IP Video Conferencing" divides QoS into two categories: Network-Based QoS (NQoS) and Application-Based QoS (AQoS). The following figure shows the different usage scenarios and quality assurance levels of the two.

NQoS is the basic quality assurance capability between network transmission devices such as routers, switches, and gateways, implemented through a set of protocols shared by this class of equipment. AQoS is the application's ability to do its best to ensure the user experience under different network conditions, based on the terminal device type, business scenario, and data flow type of the application.

[Figure: usage scenarios and quality assurance levels of NQoS and AQoS]

Although both NQoS and AQoS have a decisive impact on the end-user experience, when the application scenario is limited to audio and video communication, AQoS is extremely important: NQoS, as part of the Internet infrastructure, must accommodate all kinds of usage scenarios, so it is more of a "general-purpose" transmission quality assurance technology and can hardly be optimized heavily for a specific domain. Therefore, the audio and video communication QoS discussed in this article is actually application-based AQoS, i.e., transmission quality assurance technology for applications in the field of audio and video communication.

02 Background of audio and video communication QoS

In my view, QoS for audio and video communication is "the use of multiple algorithms and strategies for network transmission control to satisfy the audio and video user experience as far as possible in weak network scenarios", as shown in the following figure:

[Figure: end-to-end QoS control from media production, through weak network links, to media consumption]

Data flows from the production of audio and video media, through intermediate transmission links under various weak network conditions, to the consumption of audio and video media, forming the final user experience. QoS controls this end-to-end link through various policies and algorithms, so that users ultimately get the best possible experience.

03 Challenges for QoS in audio and video communications

  • Network scenarios

Network conditions are very complex: the types and combinations of networks are diverse, especially in the last mile, where there are twisted pair, coaxial cable, optical fiber, Wi-Fi, 4G, 5G, and so on. Even on the same network link, conditions change with the scenario: the signal strength of 4G, 5G, and Wi-Fi wireless links fluctuates with location, and there may be switching between 4G, 5G, and Wi-Fi. Even wired networks can suffer from problems such as contention and congestion, because multiple apps and multiple users share the link.

  • Business scenarios

There are many audio and video communication scenarios, such as video on demand, live streaming (RTMP/RTS), meetings, interactive entertainment, online education, IoT, cloud gaming, cloud rendering, cloud desktop, and telemedicine. Different business scenarios have different experience requirements. For example, live streaming focuses on a smooth experience, while the timeliness of audio and video interaction is not critical; the meeting scenario has higher requirements for real-time communication, but lower requirements for audio and video quality; and scenarios such as cloud gaming require extremely low latency while also guaranteeing very high definition.

  • User experience

As shown in the figure below, in audio and video communication scenarios the user experience is mainly measured along three dimensions: clarity, smoothness, and real-time performance.

Clarity: whether the video picture perceived by the user is sharp or blurry. Generally speaking, the higher the resolution, the clearer the picture and the more information it contains, and the more traffic it occupies during transmission.

Smoothness: whether the user perceives the video motion as smooth or stuttering. Generally speaking, the more pictures displayed per second, the smoother the motion, and the more traffic is occupied during transmission.

Real-time performance: the time it takes for audio and video information to reach the remote user. The shorter the time, the better the real-time performance, and the higher the requirements on transmission speed.

[Figure: the three dimensions of user experience: clarity, smoothness, real-time performance]

As mentioned earlier, different business scenarios place different emphasis on clarity, smoothness, and real-time performance. However, as audio and video communication scenarios keep evolving, more and more extremely-low-latency and immersive scenarios keep emerging; users want it all, their requirements keep rising, and the room left for engineers to maneuver keeps shrinking. Under this trend, the technical requirements on audio and video transmission QoS are getting higher and higher.

From the perspective of the underlying protocol, audio and video communication based on TCP transmission, such as live streaming and video on demand, generally has relatively large latency. Because the data is encapsulated inside the TCP protocol and relies on TCP's own anti-weak-network mechanisms to guarantee reliability, the application layer has little opportunity to participate in control and optimization, so TCP is only suitable for scenarios with a large delay tolerance. Scenarios with a small delay tolerance are basically based on UDP. As we all know, UDP transmission is unreliable, so the application layer must guarantee the reliability of data transmission through various anti-weak-network techniques, and this is where QoS technology can show its strengths.

This article mainly discusses the most challenging and technically complex QoS technology for short-latency audio and video communication based on UDP transmission, including real-time audio and video communication (RTC) scenarios and low-latency live streaming (RTS) scenarios.


04 Classification of weak networks

If our transmission network were perfect, with enough bandwidth, low enough latency, and strong enough guarantees, we could easily achieve communication as if people were face to face. We would need neither QoS technology nor codecs: audio and video would simply be captured and instantly transmitted to the peer for playback, and remote interaction between people would become wonderful.

However, reality is far from this ideal. Modern audio and video communication is an application built on top of the Internet infrastructure, which makes the transmission quality of the Internet a ceiling that the transmission quality of audio and video communication cannot break through. As we all know, Internet transmission is complex, expensive, unreliable, and unstable, and there is no way to know the status of every transmission link. We need to abstract these problems so that we can better respond to different scenarios and do our best to keep the user's audio and video experience from being affected too much.

We generally call a scenario where the network transmission quality does not meet expectations a weak network scenario. Weak networks can be classified into congestion, packet loss, delay, jitter, out-of-order delivery, bit errors, and so on. Congestion is a symptom of insufficient available bandwidth, like a traffic jam on a highway. Packet loss means data is lost in transit and nobody knows where it went, like a parcel lost by a courier. Delay is usually caused by too many hops or by congestion and queuing, resulting in poor timeliness, like connecting flights or being stuck in traffic. Jitter means the intervals between arriving data vary; if left unprocessed it can make audio and video play back alternately fast and slow. Out-of-order delivery means data sent earlier arrives later than data sent afterwards; if left unprocessed it may cause audio and video to play back out of order. Bit errors are data corruption during transmission; because the transport layer performs checksums, correction, and retransmission, the application layer is generally unaware of them and needs no special handling.

[Figure: pipeline analogy for weak network scenarios]

The figure above uses the analogy of water flowing through a pipe to illustrate several weak network scenarios more vividly. The left side is the traffic producer and the right side the traffic consumer; the length of the pipe is the base delay of the link. Some bit errors and packet loss occur along the pipe. When the pipe narrows and the flow exceeds what the narrow section can carry, bandwidth congestion occurs. When congestion occurs, traffic queues up: some traffic is put into a buffer queue, causing queuing delay, and when the buffer queue is full it overflows, and the overflowing traffic corresponds to overflow packet loss. Fluctuations and signal interference on the link mean the data transmission rate is not constant, so the received data speeds up and slows down; we classify this as link jitter. In reality these different types of weak network often appear mixed together at the same time, and classifying them makes it easier for us to tackle each one technically.

05 Audio and video communication QoS technology system

  • QoS technology classification

Audio and video communication QoS technologies and strategies were born to combat the weak network scenarios above. Their purpose is to eliminate, as far as possible, the experience degradation caused by network deterioration. Corresponding to the classification of weak network scenarios, the QoS technologies used also fall into several categories: congestion control, source control, packet loss resistance, and jitter resistance. Each category contains many different technical points and details, which will be expanded later.

[Figure: classification of QoS technologies: congestion control, source control, packet loss resistance, jitter resistance]

Congestion control is the decision-making center for detecting network conditions and deciding how data is sent. It is the core of sender-side QoS technology and the brain of transmission control.

Source control, under the decisions of congestion control, controls how the audio and video sources are generated and what bitrate they produce, so as to adapt to the detected network conditions and avoid congestion.

Packet loss resistance adds redundant information to the source data in scenarios where the network loses packets, so that even if part of the information is lost, the original data can still be fully recovered.

Jitter resistance adds some delay and buffers data when the network delay fluctuates, data arrives alternately fast and slow, or data arrives intermittently, so that the user experience is smoother and does not stutter.

The above also explains that different types of QoS technology solve different user experience problems; it can be seen that they all revolve around the three dimensions of smoothness, clarity, and real-time performance. Congestion control is the overall commander and in many cases plays a decisive role in the experience of the whole link; source control can improve the experience in terms of smoothness and clarity; packet loss resistance and jitter resistance can improve the experience in terms of smoothness and real-time performance.

  • The position of QoS in the audio and video communication process

We know that audio and video communication is end-to-end, full-link communication involving a wide range of technical fields and spanning a very large and complex scope. For example, the client side includes the adaptation, compatibility, and interoperability of all terminals across the four platforms seen on the market, Windows, macOS, iOS, and Android, and even interconnection with browsers. There are also audio 3A processing (AEC, AGC, ANS), audio codecs (Opus, AAC), video codecs (H.264, H.265, AV1), and every other such field is a very complex technology stack in its own right. The various servers in the cloud are also indispensable links for achieving interconnection, including signaling servers, media servers, stream mixing, transcoding, recording, node deployment, scheduling and routing, load balancing, and so on; every node and every service is a complex system.

In such a complex audio and video communication link, QoS technology is only one narrow area, but it is indispensable and of decisive significance to the usability of online audio and video communication. QoS is one of the few genuinely full-link technologies in audio and video communication: end to end, it controls every aspect of media transmission and media encoding and decoding. Engineers working on QoS therefore need a certain understanding of the full-link technology on both the client and the server, and need to look at the entire audio and video communication system from a global perspective.

The following figure is an abstraction of the functions of a real-time audio and video communication link. It uses the P2P mode of media sending and media receiving and omits the complex server-side transmission part, to make it easier to understand.

The basic process of audio and video communication is as follows. On the sending (push) client, the audio and video data collected by the capture module of the terminal device is uncompressed raw data, stored frame by frame and too large to be transmitted directly over the network, so it must be compressed first; this is the encoding stage, where audio and video encoders are used. After encoding, the data is still quite large, so it is sliced, and the sliced data is encapsulated into RTP packets. After packetization, the data goes from the send queue onto the network and, after a series of transparent forwarding or processing steps by servers, reaches the receiving (pull) client. When the pull side receives the RTP packets from the network, it must first check the integrity of the RTP packets and whether all slices of an encoded data frame have been received, then unpack the RTP encapsulation and reassemble the encoded data frames, which are sent to the decoder. The decoded data is passed to the rendering module, and the user sees and hears the picture and sound from the push side.

[Figure: end-to-end audio and video processing flow with QoS-related modules highlighted]

On the left side of the figure above is the processing flow on the media sending side: capture module, pre-processing module, encoding module, RTP packetization module, send queue, and network data sending. On the right is the processing flow on the media receiving side: network data reception, RTP parsing module, receive queue management, decoding module, post-processing module, and rendering module. The functions in the blue box on the left are QoS-related functions on the sending side, and those in the blue box on the right are QoS-related functions on the receiving side. In addition, from the positioning of the RTCP protocol itself, it is a protocol for controlling UDP-based RTP media data, so it can also be regarded as part of QoS control.

Two things stand out. First, the QoS functions are related to many other modules, because QoS is a full-link control technology and needs to reach many modules. Second, there are clearly more QoS functions on the sending side than on the receiving side, because bandwidth estimation and congestion control are currently done mostly on the sending side: the sender is closer to where the information is produced, and it is more efficient to solve problems at the source and prevent them before they occur. The techniques on the receiving side tend to be more passive, a last remedy after something has already gone wrong.

  • QoS technology system

Having covered the classification of QoS technology and its position in the audio and video communication stack, we now focus on the QoS technology field and look at the QoS technology system and its major technical points along the client and server media link, as shown in the figure below. The push client in the lower left corner uses source control, congestion control, and packet loss resistance; the media forwarding server (SFU) in the middle and upper part uses source control, congestion control, and packet loss resistance; the pull client in the lower right corner uses packet loss resistance and jitter resistance. The following briefly summarizes the general process and significance of the relevant technical points.

[Figure: QoS technology system across the push client, SFU, and pull client]

  • Audio and video push client

When media RTP packets are sent, a unified sequence number is added to the RTP extension header, so each packet can be considered to have a unique number; thus every sent packet has three pieces of information: sequence number, send time, and packet size. After receiving these RTP packets, the receiver encapsulates each received sequence number and its receive time into RTCP packets according to the format defined by the TransportFeedback (TWCC) packet and feeds them back to the push side (reference: "WebRTC Research: RTP and RTCP of Transport-cc", Sword Obsession). Based on this feedback, the push side can estimate the current network transmission status in three respects: packet loss, latency, and bandwidth. These estimation algorithms are bandwidth estimation algorithms (also called congestion control algorithms); the figure above mentions two commonly used ones, GCC (Google Congestion Control) and BBR (Bottleneck Bandwidth and Round-trip Time), both congestion control algorithms introduced by Google.
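As a concrete illustration of what the sender can derive from this kind of feedback, here is a minimal C++ sketch (not the WebRTC implementation) that computes a loss rate, a receive rate, and an average delay variation from a window of per-packet records; the struct and function names are invented for this example.

```cpp
// Minimal sketch: deriving loss, delay variation, and receive rate from
// TWCC-style per-packet feedback. PacketRecord/FeedbackStats are illustrative.
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <optional>
#include <vector>

struct PacketRecord {
    uint16_t seq;                         // transport-wide sequence number
    int64_t  send_time_ms;                // recorded when the packet left the pacer
    size_t   size_bytes;                  // payload + header size
    std::optional<int64_t> recv_time_ms;  // filled in from feedback; empty = lost
};

struct FeedbackStats {
    double loss_rate;        // fraction of packets never reported as received
    double recv_rate_bps;    // throughput seen by the receiver over the window
    double avg_delay_var_ms; // average change in one-way delay between packets
};

FeedbackStats AnalyzeFeedback(const std::vector<PacketRecord>& window) {
    FeedbackStats s{0.0, 0.0, 0.0};
    if (window.empty()) return s;

    size_t lost = 0, received = 0, bytes = 0;
    int64_t first_recv = INT64_MAX, last_recv = INT64_MIN;
    double delay_var_sum = 0.0;
    std::optional<int64_t> prev_send, prev_recv;

    for (const auto& p : window) {
        if (!p.recv_time_ms) { ++lost; continue; }
        ++received;
        bytes += p.size_bytes;
        first_recv = std::min(first_recv, *p.recv_time_ms);
        last_recv  = std::max(last_recv,  *p.recv_time_ms);
        if (prev_send && prev_recv) {
            // Delay gradient: how much more (or less) this packet was delayed
            // than the previous one; persistently positive suggests a growing
            // queue, i.e. congestion.
            delay_var_sum += double((*p.recv_time_ms - *prev_recv) -
                                    (p.send_time_ms - *prev_send));
        }
        prev_send = p.send_time_ms;
        prev_recv = *p.recv_time_ms;
    }

    s.loss_rate = double(lost) / window.size();
    if (received > 1 && last_recv > first_recv)
        s.recv_rate_bps = bytes * 8.0 * 1000.0 / double(last_recv - first_recv);
    if (received > 1)
        s.avg_delay_var_ms = delay_var_sum / double(received - 1);
    return s;
}

int main() {
    std::vector<PacketRecord> window = {
        {1, 0, 1200, 40}, {2, 20, 1200, 62}, {3, 40, 1200, std::nullopt},
        {4, 60, 1200, 108}, {5, 80, 1200, 131},
    };
    FeedbackStats s = AnalyzeFeedback(window);
    std::cout << "loss=" << s.loss_rate << " rate_bps=" << s.recv_rate_bps
              << " delay_var_ms=" << s.avg_delay_var_ms << "\n";
}
```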

Why are they not simply called bandwidth estimation algorithms? Because these estimation algorithms are generally paired with smooth sending (PacedSender); it is rare to only estimate without controlling. The smooth-sending strategy is part of the estimation algorithm's architectural design: the goal is to make the outgoing stream as smooth as possible, avoiding bursts and dips that would pound the link and cause unnecessary congestion.

Based on the design of these congestion control algorithms, in many cases, in order to probe the available bandwidth accurately enough, some padding data must be sent to temporarily fill the bandwidth when the original audio and video data cannot fill the expected bandwidth, for example when the video picture is static or the audio is muted. This padding sometimes consists of re-sent copies of earlier packets and sometimes is simply junk data; its only purpose is to fill the bandwidth, and the receiver discards it.
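The following is a minimal sketch of the pacing-plus-padding idea described above, not WebRTC's actual PacedSender: media packets are drained at a target rate, and if the encoder did not produce enough data in the interval, padding packets fill the remaining budget. All names and the 200-byte padding size are illustrative assumptions.

```cpp
// Minimal pacer-with-padding sketch (illustrative, not WebRTC's PacedSender).
#include <cstdint>
#include <cstdio>
#include <deque>

struct Packet { size_t size_bytes; bool is_padding; };

class SimplePacer {
public:
    explicit SimplePacer(double target_bps) : target_bps_(target_bps) {}

    void Enqueue(Packet p) { queue_.push_back(p); }

    // Called every few milliseconds by a timer; returns packets to put on the wire.
    std::deque<Packet> Process(int64_t now_ms) {
        std::deque<Packet> to_send;
        double budget_bytes = target_bps_ / 8.0 * (now_ms - last_process_ms_) / 1000.0;
        last_process_ms_ = now_ms;

        // Spend the budget on queued media first.
        while (!queue_.empty() && budget_bytes >= queue_.front().size_bytes) {
            budget_bytes -= queue_.front().size_bytes;
            to_send.push_back(queue_.front());
            queue_.pop_front();
        }
        // If the encoder produced too little (static picture, muted audio),
        // fill the remaining budget with padding so bandwidth probing still works.
        while (budget_bytes >= kPaddingSize) {
            budget_bytes -= kPaddingSize;
            to_send.push_back({kPaddingSize, /*is_padding=*/true});
        }
        return to_send;
    }

private:
    static constexpr size_t kPaddingSize = 200;
    double target_bps_;
    int64_t last_process_ms_ = 0;
    std::deque<Packet> queue_;
};

int main() {
    SimplePacer pacer(2'000'000);  // 2 Mbps sending/probing target
    pacer.Enqueue({300, false});   // only one small media packet this interval
    auto out = pacer.Process(5);   // 5 ms since the last call
    for (const auto& p : out)
        std::printf("%s %zu bytes\n", p.is_padding ? "padding" : "media", p.size_bytes);
}
```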

After the estimation algorithm has produced the available bandwidth, transmission delay, and packet loss rate of the network, this information can be broadcast to every module that needs it, such as the bandwidth allocation module. The bandwidth allocation module distributes the available bandwidth to each audio and video stream according to certain priorities and allocation strategies; in scenarios where packet loss occurs, the corresponding bandwidth for redundant information also has to be allocated to each stream at the same time.
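A hedged sketch of what such an allocation step might look like: minimums are granted in priority order, the remainder is distributed up to each stream's maximum, and a share proportional to the observed loss rate is set aside for redundancy. The streams, priorities, and numbers are purely illustrative, not any SDK's actual policy.

```cpp
// Illustrative priority-based bandwidth allocation with a redundancy reserve.
#include <algorithm>
#include <cstdio>
#include <string>
#include <vector>

struct Stream {
    std::string name;
    int priority;   // smaller = more important (audio usually first)
    double min_bps;
    double max_bps;
    double allocated_bps = 0;
};

void Allocate(std::vector<Stream>& streams, double available_bps, double loss_rate) {
    // Reserve part of the budget for redundancy, proportional to observed loss.
    double redundancy_bps = available_bps * std::min(loss_rate, 0.5);
    double budget = available_bps - redundancy_bps;

    std::sort(streams.begin(), streams.end(),
              [](const Stream& a, const Stream& b) { return a.priority < b.priority; });

    for (auto& s : streams) {  // pass 1: minimums, highest priority first
        s.allocated_bps = std::min(s.min_bps, budget);
        budget -= s.allocated_bps;
    }
    for (auto& s : streams) {  // pass 2: grow toward maximums
        double extra = std::min(s.max_bps - s.allocated_bps, budget);
        s.allocated_bps += extra;
        budget -= extra;
    }
    std::printf("redundancy budget: %.0f bps\n", redundancy_bps);
    for (const auto& s : streams)
        std::printf("%s -> %.0f bps\n", s.name.c_str(), s.allocated_bps);
}

int main() {
    std::vector<Stream> streams = {
        {"video", 2, 300'000, 2'500'000},
        {"audio", 1, 32'000, 128'000},
        {"screen", 3, 200'000, 1'500'000},
    };
    Allocate(streams, 1'200'000, /*loss_rate=*/0.05);
}
```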

After bandwidth has been allocated, we come to the source control part. The rate control module of each audio or video stream then regulates itself according to its own stream characteristics, codec capabilities, and technical means: basic controls such as audio bitrate and frame length, video bitrate, frame rate, resolution, and QP, as well as codec-specific techniques such as audio DTX (Opus discontinuous transmission, reducing bandwidth usage), video simulcast (pushing multiple streams simultaneously to serve different subscription scenarios with less bandwidth), video SVC (scalable video coding, enabling dynamic frame dropping to relieve congestion), video LTR (long-term reference frames, reducing retransmission bandwidth), and screen content coding SCC (reducing bandwidth in screen-sharing scenarios).
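As one tiny example of a source-control decision, the sketch below maps an allocated video bitrate to a resolution/framerate tier, trading clarity against smoothness; the thresholds are made-up values for illustration, not recommendations from any encoder or SDK.

```cpp
// Illustrative source-control tier selection from an allocated bitrate.
#include <cstdio>

struct VideoConfig { int width, height, fps; };

VideoConfig SelectVideoConfig(double allocated_bps) {
    // The cut-off values below are assumptions for illustration only.
    if (allocated_bps >= 2'000'000) return {1280, 720, 30};
    if (allocated_bps >= 800'000)   return {960, 540, 30};
    if (allocated_bps >= 400'000)   return {640, 360, 24};
    if (allocated_bps >= 150'000)   return {640, 360, 15};  // keep motion, drop clarity
    return {320, 180, 12};                                    // survival mode
}

int main() {
    for (double bps : {2'500'000.0, 600'000.0, 120'000.0}) {
        VideoConfig c = SelectVideoConfig(bps);
        std::printf("%.0f bps -> %dx%d @ %d fps\n", bps, c.width, c.height, c.fps);
    }
}
```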

In scenarios where the network loses packets, we must have technical means in reserve to resist packet loss. There are generally two types (a minimal sketch of both follows below):

One is retransmission on packet loss: when the receiving side finds that the data is no longer contiguous because packets have been lost, it actively sends a retransmission request to the push side through NACK packets (a type of RTCP packet). The push side must keep a cache of recently sent data at all times so it can satisfy retransmission requests and resend the lost data. This is an after-the-fact remedy, so it saves a relatively large amount of bandwidth but increases latency.

The other is to send some redundant information along with the data, so that if packets are lost in transit, that redundancy can be used to recover part or all of the original data. This is a preventive measure: because the redundancy is sent together with the original data, recovery can happen immediately and there is no extra delay, but the redundant information occupies more bandwidth. There are two ways to add redundancy. One is RED encapsulation, used for audio, where packets are relatively small; the other is FEC encoding, used for video or audio scenarios with relatively large packets. Many FEC coding schemes are available, and there is a lot of algorithm research in this area.
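The sketch below illustrates both mechanisms in their simplest possible form: building a NACK list from a sequence-number gap, and recovering a single lost packet from an XOR parity packet. It deliberately ignores the real RED/FEC packet formats and the NACK RTCP encoding.

```cpp
// Minimal sketches of the two packet-loss countermeasures described above.
#include <cstdint>
#include <cstdio>
#include <vector>

// 1) NACK: report every missing sequence number between the last received and
// the newly received packet (uint16_t arithmetic handles wrap-around).
std::vector<uint16_t> BuildNackList(uint16_t last_received_seq, uint16_t new_seq) {
    std::vector<uint16_t> nacks;
    for (uint16_t s = last_received_seq + 1; s != new_seq; ++s)
        nacks.push_back(s);
    return nacks;
}

// 2) XOR FEC: parity = p1 ^ p2 ^ ... ^ pn. If exactly one packet of the group
// is lost, XOR-ing the parity with the surviving packets reproduces it.
std::vector<uint8_t> XorPackets(const std::vector<std::vector<uint8_t>>& pkts) {
    std::vector<uint8_t> out(pkts.front().size(), 0);
    for (const auto& p : pkts)
        for (size_t i = 0; i < out.size(); ++i) out[i] ^= p[i];
    return out;
}

int main() {
    // NACK example: packets 101 and 102 never arrived.
    for (uint16_t s : BuildNackList(100, 103)) std::printf("NACK %u\n", unsigned(s));

    // FEC example: protect three equal-sized packets with one parity packet.
    std::vector<std::vector<uint8_t>> group = {{1, 2, 3}, {4, 5, 6}, {7, 8, 9}};
    std::vector<uint8_t> parity = XorPackets(group);

    // Packet 1 ({4,5,6}) is lost; recover it from the parity and the survivors.
    std::vector<uint8_t> recovered = XorPackets({group[0], group[2], parity});
    std::printf("recovered: %u %u %u\n", unsigned(recovered[0]),
                unsigned(recovered[1]), unsigned(recovered[2]));
}
```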

  • Media forwarding server (SFU)

When the media reaches the forwarding server, the server acts on one side as the receiving end for the push client and on the other side as the sending end for the pull client. The receiving side of the server currently provides essentially only packet-loss-resistance capabilities, which combine with the packet-loss resistance on the pull side to form segmented, full-link packet loss resistance. "Segmented" means that the uplink and downlink resist packet loss separately without affecting each other; the advantage is a simpler design, and it lets the server serve downstream users on weak and good networks on demand, giving it relatively strong adaptability. Packet loss resistance on the server's receiving side and on the pull side is the same as on the client described above, including loss detection and retransmission requests, plus the corresponding RED encapsulation/decapsulation and FEC encoding/decoding functions. This set of functions is fairly fixed; after the SDP media capability negotiation with the pushing client, it is determined which functions are turned on or off.

The sending-side functions of the server are, like those of the push client, relatively complex, including the congestion control algorithms GCC and BBR, smooth sending, padding, and bandwidth allocation. The basic framework and functionality of these algorithms are the same on the server as on the push client, but the parameter configuration and the strategies used differ, because source control on the server and source control on the push client are simply not comparable. At the same time, the server has to take many push clients and many pull clients into account and must balance all of these relationships.

The server's position has also given rise to some specific techniques, such as video frame dropping (removing frames not needed for decoding to reduce bandwidth), video stream switching (actively switching to a stream with lower bandwidth and definition to reduce bandwidth), on-demand video streaming (letting the push side push only the streams actually required by the subscription relationships, to reduce bandwidth), audio and video bandwidth feedback (in specific scenarios, feeding downstream measurements back to the push side to provide more accurate rate control), audio ranking (AudioRanking: in multi-party meeting scenarios, filtering out streams that are not speaking to reduce bandwidth), and so on. For a more detailed description of the server-related technical points, see "Alibaba Cloud GRTN QoS System - Building the Best Experience of Real-time Audio and Video Products".

  • Audio and video pull (streaming) client

Finally we come to the audio and video pull client. Besides the retransmission, RED, and FEC mechanisms mentioned above, the packet loss resistance here has two more tools: keyframe requests and long-term reference frame (LTR) requests. Both are requests to restore the reference chain of video frames so that video decoding can restart.

A keyframe request asks the encoder to restart encoding from a keyframe, so that any client that receives the keyframe can decode and the reference chain of video frames starts over. An LTR request works differently: once it is confirmed that a long-term reference frame has been received, there is no need to restart from a keyframe; it is enough to send a frame that references the long-term reference frame to repair the reference chain, i.e., a frame referencing the LTR is sent instead of a keyframe. The main advantage is reduced retransmission bandwidth, but it also increases complexity, because the server must confirm that each pull client has received the specific long-term reference frame, which is hard to satisfy when there are many pull clients. For more information, see "LTR and Hardware Decoding Support for Alibaba Cloud RTC QoS Weak Network Countermeasures".
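A minimal sketch of the recovery decision just described, with hypothetical names: when the reference chain breaks, prefer LTR-based recovery if the sender has acknowledged that we hold a long-term reference frame, otherwise fall back to a keyframe request.

```cpp
// Illustrative recovery decision (not the actual SDK behavior).
#include <cstdio>
#include <cstdint>
#include <optional>

enum class RecoveryRequest { kNone, kLtrRecovery, kKeyFrame };

RecoveryRequest DecideRecovery(bool reference_chain_broken,
                               std::optional<uint16_t> acked_ltr_frame_id) {
    if (!reference_chain_broken) return RecoveryRequest::kNone;
    // If the sender knows we hold this LTR, it can encode a cheap P-frame that
    // references it instead of a much larger keyframe.
    if (acked_ltr_frame_id) return RecoveryRequest::kLtrRecovery;
    return RecoveryRequest::kKeyFrame;
}

int main() {
    auto r1 = DecideRecovery(true, std::optional<uint16_t>{42});
    auto r2 = DecideRecovery(true, std::nullopt);
    std::printf("with acked LTR: %s\n",
                r1 == RecoveryRequest::kLtrRecovery ? "LTR recovery" : "keyframe");
    std::printf("without LTR:    %s\n",
                r2 == RecoveryRequest::kKeyFrame ? "keyframe" : "other");
}
```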

In addition, compared with the other parts, the pull client has extra anti-jitter functions. The main idea is to add a buffer for the data, at the cost of some extra latency. It works like a reservoir: during the rainy season it stores the water that rushes in, and during the dry season it slowly releases the stored water, ensuring that water keeps flowing out the whole time.

Audio and video data flow has their own different characteristics, corresponding to the audio jitter cache mechanism and video jitter buffer mechanism is also different, currently used more are WebRTC in the audio and video jitter control mechanism, video is based on the Kalman filter JitterBuffer, audio is NetEQ, the specific algorithm are very complex, here will not be expanded, interested students can refer to "WebRTC Video JitterBuffer Detailed - Zhihu" and one of my previous vernacular articles, "Vernacular Interpretation of WebRTC Audio NetEQ and Optimization Practices".

The pull or playback side generally also needs audio and video synchronization (also called lip sync); otherwise you may hear a voice but see nobody, or see the lightning but never hear the thunder. WebRTC's original audio and video synchronization mechanism is quite complex; I have written about it before in "WebRTC Audio and Video Synchronization Principle and Implementation", and the NetEQ optimization article above also mentions a simple alternative, which is not expanded here.
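A minimal sketch of the lip-sync calculation, assuming each stream's latest RTCP Sender Report provides an (NTP time, RTP timestamp) pair: map the currently playing RTP timestamps back to the sender's clock and compare the capture times. The numbers and struct names are illustrative, not WebRTC's implementation.

```cpp
// Illustrative lip-sync offset computation from RTCP SR mappings.
#include <cstdint>
#include <cstdio>

struct SenderReport {
    double   ntp_ms;         // sender wall-clock time of the report
    uint32_t rtp_timestamp;  // RTP timestamp corresponding to that instant
    double   clock_rate;     // RTP clock rate: 48000 for Opus audio, 90000 for video
};

// Capture time (in the sender's clock) of the sample carrying `rtp_ts`.
double CaptureTimeMs(const SenderReport& sr, uint32_t rtp_ts) {
    double diff_ticks = double(int32_t(rtp_ts - sr.rtp_timestamp));  // wrap-safe
    return sr.ntp_ms + diff_ticks * 1000.0 / sr.clock_rate;
}

int main() {
    SenderReport audio_sr{100000.0, 480000, 48000.0};
    SenderReport video_sr{100000.0, 900000, 90000.0};

    // RTP timestamps of the frames currently being played out.
    uint32_t playing_audio_ts = 480000 + 48 * 200;  // captured 200 ms after the SR
    uint32_t playing_video_ts = 900000 + 90 * 120;  // captured 120 ms after the SR

    double audio_capture = CaptureTimeMs(audio_sr, playing_audio_ts);
    double video_capture = CaptureTimeMs(video_sr, playing_video_ts);

    // Positive value: audio is playing material captured later than the video,
    // so audio playout should be delayed (or the video buffer drained faster).
    std::printf("audio ahead of video by %.1f ms\n", audio_capture - video_capture);
}
```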

06 Evolution of QoS technology for audio and video communications

The above roughly describes the technical system used for audio and video communication QoS. Any technology requires a certain software architecture to carry and implement it, and QoS technology in audio and video communication has kept upgrading along with the evolution of the software architecture of audio and video communication. For the evolution history of real-time audio and video communication (RTC), see "The Evolution History of RTC Architecture Spanning 25 Years in 5 Generations" (https://www.livevideostack.cn/news/the-evolutionary-history-of-rtc-architecture-after-5-generations-spanning-25-years/), which notes that "Google open-sourced WebRTC in 2011, a milestone event in the RTC field that greatly lowered the threshold for RTC development and gave birth to the era of mobile Internet RTC applications".

  • Before WebRTC

Before WebRTC appeared, the entry barrier was high and audio and video communication was basically a game among a few head players such as Polycom, Huawei, Cisco, Microsoft, BT, and Vidyo, each with its own proprietary architecture, each cultivating behind closed doors. The QoS technologies they used were their own secret manuals, which could only be glimpsed from a few public articles or protocol standard submissions. When I heard the news of WebRTC being open-sourced while at Polycom in 2012, I did not think it was a big deal at all: Polycom had a group of audio and video scientists, supported all kinds of codec technologies, and was the absolute leader in the industry. I did not expect that within a few years it would fade from public view.

  • After WebRTC

After WebRTC appeared, the audio and video communication field exposed its technology stack to the sunlight for the first time. Everyone could experiment, optimize, and evolve on top of it, attracting a large number of developers, startups, and Internet giants; whether technical novices or industry experts, people were, consciously or not, actively or passively, drawn into an audio and video communication industry redefined by WebRTC. Because WebRTC itself is a fairly excellent architecture and its QoS technology and communication quality are good, many companies abandoned their original proprietary architectures and instead adapted their own business logic on top of WebRTC, adding QoS algorithm optimizations specific to their own business scenarios.

However, WebRTC itself was positioned as P2P communication between Internet browsers and focuses on the architecture and implementation of the client side. With the development of cloud audio and video communication services, the media forwarding server became an indispensable link between two clients. A variety of media servers support the WebRTC protocol, such as Janus, mediasoup, SRS, Licode, Kurento, and Jitsi; see "The Top Ten Must-Know Open Source WebRTC Servers" (https://zhuanlan.zhihu.com/p/554537113). However, many media forwarding servers (SFUs) implement only the forwarding function; their QoS support for link control is very weak, in some cases little better than nothing. And because the server code architecture differs greatly from WebRTC's client-side code architecture, migrating the original WebRTC QoS algorithms to them is very difficult.

  • The QoS algorithm optimization stage

Around 2021, in the first half of the pandemic, the Internet was gradually reaching its peak; business was booming everywhere and iteration was fast. Everyone practiced "take-ism": once WebRTC compiled, it was integrated directly into their own SDK; the business came first, and the QoS algorithms were tuned slowly afterwards. As long as rising business needs could be met, nobody cared whether the architecture was complicated or the implementation elegant. This stage was about algorithm optimization on top of WebRTC's QoS; technical articles of every kind emerged endlessly, basically covering the technical points of the QoS system described above, and more than 90% of the online articles on QoS optimization are this type of single-point algorithm optimization or in-depth algorithm analysis. Everyone's technical level was quickly pulled to the same starting line, which is very friendly to newcomers to audio and video technology: as long as you are willing to learn, you can get started quickly.

This optimization and upgrading of single-point QoS techniques is the core means of improving QoS performance and the foothold for ultimately improving user experience, and it will continue. But single-point algorithm optimization also has a bottleneck: once it reaches the ceiling of existing basic research, further improvement is difficult, because it requires breakthroughs in basic theory. That kind of investment and payoff is not something the average commercial company is willing to bear, nor something the average algorithm engineer can achieve, so most domestic companies and engineers have chosen to retreat in the face of the difficulty, which is also partly a product of the environment.

Of course, we need not worry that algorithm engineers will become useless. After all, many technologies have not yet reached the ceiling of basic science, so we still have some time. And after all, what we are best at is borrowing: if we cannot compete on brainpower, we compete on effort. If we cannot raise the height of the technology in the short term, we can start from its breadth: as long as we can dig out enough user scenarios, we can tailor solutions for specific scenarios and, by patching and stitching, give all kinds of scenarios a better experience, which is also a form of value. Not only in QoS but in many of our technology fields, the conversation always turns sad at this point, and there is little to be done about it; I hope that one day this situation can improve.

  • The QoS architecture upgrade stage

As the pandemic entered its second half, the Internet boom was no longer there, and new scenarios such as IoT, cloud rendering, and cloud gaming appeared. Everyone gradually slowed down and began to think again about whether the WebRTC framework really fits their business and whether there is a better solution. Anyone who has looked at the WebRTC source code or taken part in building it knows that WebRTC is a very large implementation: including the referenced third-party libraries, the number of source files approaches 200,000. Code of this magnitude causes great trouble for environment setup, build configuration, and project integration, so much so that some people on the Internet have turned compiling WebRTC into a business, charging per build. Few companies can use WebRTC directly; they need dedicated engineers to do environment configuration, code trimming, and a series of other things that bring little value to the business, a thankless effort.

As one of the most valuable technologies in WebRTC, QoS is deeply entangled with the entire code framework, and it is difficult to use it directly in a non-WebRTC codebase without a painstaking transformation. The following figure is a simple walkthrough of the QoS-related media processing part of the WebRTC pipeline; readers familiar with the WebRTC code should recognize the meaning and function of each module in the figure, so it is not expanded here. The red parts are the QoS-related modules, and we can see that the whole pipeline is mutually coupled; there is no way to extract the QoS functions separately.

[Figure: QoS-related modules (in red) coupled throughout the WebRTC media pipeline]

At the same time, scenarios such as IoT, cloud gaming, and cloud rendering have their own capture, rendering, encoding, and decoding functions, so they cannot use the whole WebRTC framework; they only need the media transmission and QoS control capabilities, which forces WebRTC to be trimmed and the QoS algorithms stripped out. This business requirement has prompted a rethink and upgrade of the original WebRTC architecture and driven the architectural evolution of QoS technology.

How exactly does this architecture upgrade and evolution proceed? I think we must start by abstracting the audio and video communication link and its functional modules. Once you abstract to a certain height, you can see the essence of things; once you see the essence, it is easier to see the relationships between modules, and then things can be decoupled by clustering. The following figure is an abstraction of the QoS push and pull stream functions and processing flow.

[Figure: abstraction of the QoS push and pull stream functions and processing flow]

After the abstraction above, we can clearly define the boundaries of the QoS functions and further redesign and implement the functions inside QoS, which may eventually look like the following figure: layered decoupling and functional modularization. A QoS module with this architecture can easily be migrated to various scenarios, even to the media forwarding server SFU, achieving rapid reuse of QoS capabilities, optimizing multiple products at once, and accelerating the commercialization of new scenarios. For example, the QoS part of CCTV's Sanxingdui Fantasy Journey project uses the functions of this evolved QoS module; see "Best Practices of Sanxingdui Large-scale Immersive Digital Interactive Space: Alibaba Cloud Cloud Rendering Platform Supports CCTV's Immersive Online Archaeology Game" in the Alibaba Cloud Help Center.

[Figure: layered, decoupled, modularized QoS architecture]
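Purely as an illustration of what such a decoupled boundary might look like, here is a hypothetical header-style sketch of a QoS module interface: the host application pushes packets and RTCP feedback in and receives decisions (target bitrate, paced packets, keyframe requests) through an observer, so the same module could sit in a push client, an SFU, or an IoT device. This is not Alibaba Cloud's or WebRTC's actual API.

```cpp
// Hypothetical decoupled QoS module boundary (illustrative interfaces only).
#include <cstddef>
#include <cstdint>

// Decisions flowing out of the QoS module toward the host application.
class QosObserver {
public:
    virtual ~QosObserver() = default;
    virtual void OnTargetBitrate(uint32_t bps) = 0;                  // drive source control
    virtual void OnSendPacket(const uint8_t* data, size_t len) = 0;  // paced output
    virtual void OnRequestKeyFrame() = 0;                            // recover reference chain
};

// Events flowing into the QoS module from the host application.
class QosModule {
public:
    virtual ~QosModule() = default;
    virtual void SetObserver(QosObserver* observer) = 0;
    virtual void OnMediaPacket(const uint8_t* rtp, size_t len) = 0;    // from encoder/packetizer
    virtual void OnReceivedRtcp(const uint8_t* rtcp, size_t len) = 0;  // TWCC, NACK, SR...
    virtual void Process(int64_t now_ms) = 0;                          // periodic timer tick
};
```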

From the perspective of the evolution of audio and video communication software, the end result may well be a return to the state before WebRTC was open-sourced: each player with its own proprietary software architecture, each company back in its own circle of QoS optimization. It seems we have gone around in a circle and returned to the starting point, but everyone has absorbed the essence of WebRTC along the way.

This article has introduced the concept and classification of QoS from a relatively broad perspective and briefly surveyed the common techniques in audio and video communication QoS and the evolution of its architecture. As new audio and video communication scenarios keep emerging, lower latency and higher definition are becoming ever more important, related technologies will keep tilting in that direction, and QoS techniques based on big data analysis will also gradually spread.