QQ 9 "Silly Fast, Silly Fast" will take you to see the technical secrets behind it

Since its release, the new QQ 9 has drawn praise from many users for its fluency; many jokingly call QQ 9 "stupidly fast", so fast that "I'm not quite used to it yet".

This article details the technical work behind QQ 9's fluency and the performance optimization and exploration carried out across the whole process, in the hope of offering reusable experience for improving application fluency.

01

There are still 500 million people who stick to QQ

This year marks the 30th anniversary of the Internet era in China, the 25th year of QQ as a first-generation Internet product, and the 14th year of mobile QQ.

#There are still 500 million people who insist on using QQ# - it is the persistence of these users that pushes the QQ technical team to keep reinventing itself and to pursue performance tirelessly, so as to give users a better experience.

QQ 9 "Silly Fast, Silly Fast" will take you to see the technical secrets behind it

QQ 9 promotional image

Starting with QQ 9, we refactored and optimized the underlying architecture from the ground up, solving a series of problems on the mobile client such as slow startup, frequent stutters, long waits staring at the loading spinner, and UI jumps. After launch it received plenty of praise from users, and one high-frequency keyword stood out: "silky". Behind the silkiness lies the engineers' painstaking polishing.

This article unveils the technical exploration behind QQ 9 and shares the hardcore optimization methods of QQ's craftsmen.

02

Meticulous polishing

2.1 Extreme instant-open - optimizing startup speed

QQ's silky experience starts with startup optimization. Taking iOS as an example, the startup process is divided into 3 stages:

  • T0: from tapping the icon to the start of the main function;
  • T1: from the start of main to the end of didFinishLaunchingWithOptions;
  • T2: from the end of didFinishLaunchingWithOptions to the completion of first-frame rendering.

Generally, the startup process is divided into two execution phases: pre-main (T0) and post-main (T1 + T2).

  • pre-main stage: the system's dyld loads the app image and runs initializers; this is closely tied to the program's structure and scale.
  • post-main stage: the business initialization the app performs before the first frame is rendered; this depends heavily on the specific business logic.

Common optimization directions in general engineering practice:

  1. In the pre-main phase, reduce load and link time: for example, turn dynamic linking into static linking, or split code into dynamic libraries and load them lazily.
  2. In the post-main phase, reduce the total amount of code executed on the main thread: for example, remove unneeded code, defer execution or move it to background threads, and optimize the execution efficiency of the code logic.

The following is a highlight of QQ's work in these two directions.

2.1.1 pre-main phase - loading code on demand

QQ 9 "Silly Fast, Silly Fast" will take you to see the technical secrets behind it

Schematic diagram of dynamic library lazy loading scheme

Splitting code into dynamic libraries and lazy-loading them is a technique widely used by large apps in the industry (Douyin, Facebook, Kuaishou), but QQ's business complexity is high enough that directly adopting industry solutions could not meet our needs. After some exploration, we found several innovative technical points:

  1. Use __attribute__((objc_runtime_visible)) to make existing code dynamically loadable at low cost.
  2. Use objc_setHook_getClass to converge the entry points of the dynamized code and guarantee the stability of the scheme.

Finally, applying this at scale in QQ 9 achieved the pre-main startup-time optimization shown below (this technical scheme contributes roughly 33% of the total startup-time improvement):

QQ 9 "Silly Fast, Silly Fast" will take you to see the technical secrets behind it

Xcode Organizer launch data chart

2.1.2 post-main phase - thread governance

Our anti-degradation system kept flagging a worsening main-thread preemption problem. Looking into it with Instruments, we found that in severe cases 14% of the main thread's time slices were preempted by other threads during a warm start.

QQ 9 "Silly Fast, Silly Fast" will take you to see the technical secrets behind it

Instruments analysis of QQ's startup time

What is main-thread preemption? Simply put, the main thread's CPU time slices are taken by other threads, so the main thread cannot get CPU resources. As the preemption problem grew worse, related problems appeared: total startup time deteriorated, startup time fluctuated widely, and the anti-degradation performance reports produced more false positives.

Why is the main thread preempted? In simple terms, there are several reasons:

  1. System scheduling behavior: preemption by system-level threads (such as page-in threads).
  2. The app frequently spawns child threads without managing their number, which can cause a "thread explosion"; if those child threads are also given improperly high QoS, the main thread is easily preempted.
  3. If the main thread's tasks are too heavy and run for too long, the system penalizes and downgrades it, after which it is preempted by other child threads.

With the causes understood, we tackled the problem from the following three directions:

Reduce the number of child threads

After some research we found that heavy use of GCD's global queues can cause a thread explosion: when a child thread is in a sleep/wait/lock state, GCD considers it inactive and may create a new thread when the next task arrives.

QQ 9 "Silly Fast, Silly Fast" will take you to see the technical secrets behind it

Remarks from an Apple engineer who formerly worked on GCD

Apple officially recommends not creating a large number of queues: use a target queue to build a queue hierarchy, letting multiple subsystems form a queue tree whose bottom is a single serial queue acting as the target queue. For details, see Modernizing Grand Central Dispatch Usage - WWDC17.
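
A minimal sketch of this recommendation (labels and queue names are illustrative, not QQ's code): subsystem queues all target one serial queue at the bottom of the tree, so the number of threads brought up stays bounded.

// A minimal sketch of the queue-hierarchy recommendation above; labels are
// illustrative, not QQ's actual code.
import Dispatch

// A single serial queue sits at the bottom of the tree.
let rootQueue = DispatchQueue(label: "com.example.root", qos: .utility)

// Subsystem queues target the root queue instead of each spinning up threads.
let networkQueue = DispatchQueue(label: "com.example.network", target: rootQueue)
let storageQueue = DispatchQueue(label: "com.example.storage", target: rootQueue)

networkQueue.async { /* handle a response */ }
storageQueue.async { /* write something to disk */ }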

Lower child thread QoS

When work is dispatched to the global queue obtained with DISPATCH_QUEUE_PRIORITY_DEFAULT, the task's QoS is inherited from the submitting queue (downgraded from QOS_CLASS_USER_INTERACTIVE to QOS_CLASS_USER_INITIATED when the submitter is the main queue). Developers often dispatch tasks from the main thread to this global queue, which produces a large number of child threads running at QOS_CLASS_USER_INITIATED. The QoS priority values are:

__QOS_ENUM(qos_class, unsigned int,
  QOS_CLASS_USER_INTERACTIVE = 0x21, // 33
  QOS_CLASS_USER_INITIATED = 0x19, // 25
  QOS_CLASS_DEFAULT = 0x15, // 21
  QOS_CLASS_UTILITY = 0x11, // 17
  QOS_CLASS_BACKGROUND = 0x09, // 9
  QOS_CLASS_UNSPECIFIED = 0x00, // 0
);           

In actual development, much of the work dispatched at this QoS is network requests and disk writes, whose thread priority can safely be lowered simply by lowering the QoS.
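
For instance (a sketch, not QQ's actual code), disk writes can be dispatched to a queue with an explicit .utility QoS (17) instead of inheriting .userInitiated (25) through the default global queue:

// A sketch of lowering QoS for background-ish work such as disk writes; the
// queue label and helper are assumptions, not QQ's code.
import Foundation

let ioQueue = DispatchQueue(label: "com.example.disk-io", qos: .utility)

func persist(_ data: Data, to url: URL) {
    // .utility (17) sits well below .userInitiated (25), so this work is far
    // less likely to preempt the main thread during startup.
    ioQueue.async {
        try? data.write(to: url, options: .atomic)
    }
}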

Increase the priority of the main thread

QoS is not exactly equivalent to the final thread priority; the main thread's priority varies at runtime within the range 29-47. Why does it change? The reason is this: if a thread runs beyond its allotted time without blocking, it is penalized and its priority lowered, so that higher-priority threads cannot keep hogging system resources and starving lower-priority ones.

So how can the main thread avoid the downgrade penalty for running beyond its allotted CPU time?

The first RunLoop turn at app startup runs until first-screen rendering finishes. First-screen tasks are generally heavy, so this RunLoop turn takes a long time and the main thread is easily downgraded by the system.

QQ 9 "Silly Fast, Silly Fast" will take you to see the technical secrets behind it

Schematic diagram of the time taken by the first runloop when QQ starts

The solution is to split the tasks in the first RunLoop: keep only the necessary global initialization logic in the first RunLoop and defer the creation of the main UI to the next RunLoop. This not only effectively resolves main-thread preemption at startup, it also speeds up startup so the main page appears sooner.
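
A minimal sketch of this split (setupEssentialServices and setupMainUI are assumed placeholder helpers, not QQ's startup code):

// A minimal sketch of splitting the first RunLoop turn: only essential
// initialization runs in it, and the heavy main-UI construction is handed to a
// later pass of the main RunLoop via the main queue.
import UIKit

class AppDelegate: UIResponder, UIApplicationDelegate {
    var window: UIWindow?

    func application(_ application: UIApplication,
                     didFinishLaunchingWithOptions launchOptions: [UIApplication.LaunchOptionsKey: Any]?) -> Bool {
        setupEssentialServices()        // global state the first frame truly needs

        // Dispatching async onto the main queue defers this block past the
        // current RunLoop turn, keeping the first turn short and hard to penalize.
        DispatchQueue.main.async { [weak self] in
            self?.setupMainUI()
        }
        return true
    }

    private func setupEssentialServices() { /* logging, crash reporting, ... */ }
    private func setupMainUI() { /* create the window and root view controller */ }
}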

There is still room for improvement here: the tasks moved out of the first RunLoop now make the second RunLoop heavy in turn, so the same idea can be applied again to split it further.

2.2 Silky smoothness for everyone - improving fluency

2.2.1 How do you define fluency?

Smoothness ("silkiness") is the feeling that screen content changes the instant a finger moves, with every operation reflected on screen immediately. As shown in the image, when a high refresh rate is not enabled, each user action must be reflected on screen within 16.67 ms.

QQ 9 "Silly Fast, Silly Fast" will take you to see the technical secrets behind it

Every user operation goes through the 4 steps in the figure; if any step takes too long, the screen is not updated in time and a stutter results. Source: Advanced Graphics and Animation Performance

Why is it hard for an app to reflect every user action on screen within 16.67 milliseconds?

  • Content on screen can only be updated from the main thread (a single core, not all of the phone's CPU cores).
  • Many factors add to GPU time, and the more complex the interface, the longer the GPU takes.

For the main thread, 16.67 ms minus the time spent by the system equals the time available to the developer. As shown in the figure below, the blue area is the developer's share; when it runs too long, the frame hangs, i.e. stutters.

QQ 9 "Silly Fast, Silly Fast" will take you to see the technical secrets behind it

Purple area: The time taken by the system to accept and process user gestures

Blue area: The time it takes for developers to convert user actions to on-screen displays

Yellow area: The amount of time it takes to display content on the screen

Source: Explore UI animation hitches and the render loop

In this way, if you want to be silky, you must do the following:

  1. Make good use of multi-threaded programming and do as little as possible on the main thread other than updating the UI.
  2. Make the interface as simple as possible on the GPU and reduce GPU time.

2.2.2 Make good use of multi-threaded programming and try to do as little as possible on the main thread other than updating the UI.

The NT kernel architecture lays the foundation

The NT Kernel (NT: New Technology, a tribute to the Windows NT kernel) used in QQ 9 was designed around squeezing the most out of multi-core CPUs: as shown in the figure below, business processing logic is separated as far as possible from the main thread responsible for UI display, and asynchronous calls are used instead of thread locks, improving efficiency and reducing the chance of deadlocks.

QQ 9 "Silly Fast, Silly Fast" will take you to see the technical secrets behind it

NT Kernel multi-threaded model

In addition, NT Kernel implements the core capabilities of IM software in C++ so they can be reused across platforms, ensuring a consistent performance experience everywhere, while the user interface is implemented in each platform's native language. Users feel the raw performance without losing each platform's native experience.

QQ 9 "Silly Fast, Silly Fast" will take you to see the technical secrets behind it

NT Kernel multi-platform architecture diagram

From full refresh to incremental refresh

With the support of the new NT kernel, time-consuming business logic has been moved to child threads, leaving the main thread only the job of refreshing the UI. Is there room to optimize the UI refresh itself? Yes: for 14 years, whenever a new message appeared on screen, mobile QQ refreshed all currently displayed messages - a "full refresh" mechanism. This mechanism causes bad experiences such as messages not refreshing during scrolling and resource "jumping".

Why can't messages be refreshed while scrolling? It is not that they cannot be refreshed, but that we dare not: unnecessary refresh work easily keeps the UI update from finishing within 16.67 ms, which induces stutters.

Why does resource jumping occur? A full refresh recycles and reuses all nodes on screen, and the reuse order is unpredictable. As shown in the figure below, node positions change randomly after a full refresh: for example, the node whose address ends in 1b400 (the 2nd node in the left figure) displayed message 2 before the refresh and message 7 after it (the 7th node in the right figure).

QQ 9 "Silly Fast, Silly Fast" will take you to see the technical secrets behind it

Comparing the memory addresses of the nodes on the left and right shows that they change randomly after a full refresh, with no regularity at all.

Both static and animated images involve time-consuming operations such as disk I/O and decoding, so they are generally loaded asynchronously to avoid blocking the main thread. Combined with this random reuse, that asynchrony produces the "resource jumping" behavior.

Depending on the reuse situation, there are three manifestations:

  • The node happens to be reused for the same content it showed last time: the same content is assigned and nothing changes.
  • The node has no associated static/animated image: the content is created from scratch and meets expectations.
  • The node has an associated static/animated image that does not match the current model: it flickers, as shown in the figure below.
QQ 9 "Silly Fast, Silly Fast" will take you to see the technical secrets behind it

With full refresh, every element that loads its data asynchronously shows another node's stale content until its own load completes. Resetting the view during the refresh does not solve this; it merely turns A -> A -> B into A -> blank -> B, which is still an obvious jump.

The "incremental refresh" adopted by QQ 9 can solve the above two experience problems very well. In addition, there is a hidden benefit that cannot be achieved by full refresh: node animation, as shown in the video below.

(Video: node animation, duration 00:21)

Implementing incremental refresh requires a reliable diff algorithm that tells the system which nodes need to be refreshed, inserted, deleted, or moved; feed it wrong information and the app crashes outright. Settling on the algorithm was itself a winding process.

First, reading the source code showed that the built-in diff utilities on both Android and iOS are implemented with the Myers algorithm.

Myers: The result is stored in the changes array, in which there are only two types: insert and remove. (Source: Swift Diffing)
QQ 9 "Silly Fast, Silly Fast" will take you to see the technical secrets behind it

The Myers algorithm finds the shortest edit distance from source to destination through insertions and deletions. Source: An O(ND) Difference Algorithm and Its Variations

The algorithm has a "defect" when it comes to moves: a move can only be inferred from a paired insert + delete, and in certain scenarios a move operation degrades into an insert + delete. For example, delete-then-move is converted into delete + insert, whereas move-then-delete stays move + delete:

  • Delete + move → delete + insert:
  • Dataset A: [1, 2, 3, 4, 5] -> dataset B: [2, 3, 5, 4]. 1 and 4 are deleted, and 4 is re-inserted, so the move of 4 is lost.
  • Move + delete → move + delete:
  • Dataset A: [1, 2, 3, 4, 5] -> dataset B: [1, 2, 4, 3]. 3 and 4 are moved, then 5 is deleted, so here the move survives.
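
This is easy to see with the platform's own API (an illustration, not QQ's code): Swift's built-in diff, which as noted above is Myers-based, reports only insertions and removals, and a move only reappears if a removal and an insertion of the same element can be paired up afterwards.

// Swift's built-in (Myers-based) diff on the first dataset above: the change
// set contains only insertions and removals.
let oldData = [1, 2, 3, 4, 5]
let newData = [2, 3, 5, 4]

let changes = newData.difference(from: oldData)
for change in changes {
    print(change)                    // only .insert and .remove cases appear
}

// inferringMoves() tries to pair removals with insertions of the same element;
// when that pairing is not possible, a move is still expressed as delete + insert.
for change in changes.inferringMoves() {
    print(change)
}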

After analysis, the ideal Diff algorithm should have the following two characteristics:

  • It can record the move relationship between nodes directly, rather than inferring moves from chains of insertions and deletions.
  • It has low time complexity and spatial complexity.

After comparing industry solutions, we selected the Heckel diff algorithm described in the paper "A technique for isolating differences between files". Its best, average, and worst-case time and space complexity are all O(m+n), better than the O((m+n)*D) of the Myers algorithm. Thanks to its symbol table, all moves are recorded, so moves are no longer lost as in Myers, as shown in the figure below.

QQ 9 "Silly Fast, Silly Fast" will take you to see the technical secrets behind it

The Heckel algorithm uses a symbol table to generate the diff information between the old and new data in 6 passes

  • PASS 1: Build the new index array (NA) for the new data and its relationship to the symbol table.
  • PASS 2: Build the old index array (OA) for the old data and its relationship to the symbol table.
  • PASS 3: Find the nodes whose positions have not changed and update their index information in the new and old index arrays (NA and OA).
  • PASS 4 - PASS 5: Handle the case where the compared sequences contain duplicate key values; duplicate keys are not allowed in QQ's scenario, so these two passes can be skipped. Interested readers can consult the paper directly.
  • PASS 6: Compute the differences from the results above, as shown in the figure below:
QQ 9 "Silly Fast, Silly Fast" will take you to see the technical secrets behind it

D means deleted, U means no change, and there is a moving relationship between 4 and 5.
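
For reference, below is a heavily simplified Heckel-style sketch in Swift, written under the same assumption the article makes (keys are unique within each list, so PASS 4-5 are skipped). The names are illustrative rather than QQ's actual implementation, and it deliberately leaves out the direct/indirect move filtering discussed next.

// A heavily simplified Heckel-style diff (illustrative names, not QQ's code).
// Unlike Myers, moves are read directly from the symbol table rather than
// inferred from insert/delete pairs.
enum DiffOp {
    case insert(newIndex: Int)
    case delete(oldIndex: Int)
    case move(oldIndex: Int, newIndex: Int)
    case unchanged(index: Int)
}

func heckelDiff<Key: Hashable>(old: [Key], new: [Key]) -> [DiffOp] {
    struct Entry { var oldIndex: Int?; var newIndex: Int? }
    var table: [Key: Entry] = [:]                        // the symbol table

    // PASS 1: record each key's position in the new array (the NA side).
    for (i, key) in new.enumerated() { table[key, default: Entry()].newIndex = i }
    // PASS 2: record each key's position in the old array (the OA side).
    for (i, key) in old.enumerated() { table[key, default: Entry()].oldIndex = i }

    // PASS 3 / PASS 6: derive the operations from the table.
    var ops: [DiffOp] = []
    for (i, key) in old.enumerated() where table[key]?.newIndex == nil {
        ops.append(.delete(oldIndex: i))                 // present only in the old data
    }
    for (j, key) in new.enumerated() {
        guard let i = table[key]?.oldIndex else {
            ops.append(.insert(newIndex: j))             // present only in the new data
            continue
        }
        // Comparing raw indices marks every shifted row as a move; trimming out
        // such indirect ("passive") moves is exactly the refinement described below.
        ops.append(i == j ? .unchanged(index: i) : .move(oldIndex: i, newIndex: j))
    }
    return ops
}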

So is the Heckel algorithm perfect? Not quite: it does not account for redundant move information, which leads to the animation confusion shown in the figure below.

QQ 9 "Silly Fast, Silly Fast" will take you to see the technical secrets behind it

Building on the Heckel algorithm, we improved the handling of moves: move operations are tracked and recorded, direct moves are distinguished from indirect ones, and the indirect moves are filtered out. The result is a diff algorithm that meets all of QQ 9's requirements. In the example below, ID 5 is moved directly to the first row, while IDs 1-4 all move down indirectly.

QQ 9 "Silly Fast, Silly Fast" will take you to see the technical secrets behind it

Record the offsets of direct moves (for a move = insert X + delete Y, both offsets are recorded) and correct the results of indirect/passive moves (the moves of IDs 1-4)

Parallel pre-layout

Asynchronous layout is an industry best practice, so naturally it is present in QQ 9. We went a step further and parallelized the asynchronous layout to push performance to its limit.

First we tried an "N messages, N threads" scheme: dispatch N concurrent tasks with GCD and wait for them with a DispatchGroup. With this parallel pre-layout, work that used to take a single thread tens of milliseconds dropped to a little over ten milliseconds. The scheme later revealed 2 problems:

  • The total time to lay out N messages in parallel is still much greater than the time to lay out one message serially: it is limited by the number of CPU cores, and contention on locks and other resources means the parameter preparation and layout calculation of the N messages are not sufficiently parallel.
  • The layout tasks of the N messages are bound one-to-one to N GCD tasks, so if any one of those tasks is scheduled late, the entire pre-layout is prolonged.
QQ 9 "Silly Fast, Silly Fast" will take you to see the technical secrets behind it

Making full use of the computing power of multi-core CPUs, parallel computation cuts the total layout calculation time by about 76%.

The adjusted scheme is shown in the figure above: M executors perform the layout tasks of N messages (N >= M > 0). The current thread (the "main" thread of the asynchronous layout) runs 1 executor, and GCD dispatches (M-1) additional threads to run the other (M-1) executors. The messages to be laid out are first put into a queue, and each executor repeatedly takes one message from that queue and performs its layout calculation until the queue is empty. Because a message's layout task is not bound to any particular executor, even if one executor is not scheduled for a long time the layout calculation is not delayed, and in most cases the M executors run in parallel on M threads.

QQ 9 "Silly Fast, Silly Fast" will take you to see the technical secrets behind it

The total time spent on parallel layout decreases as the number of concurrent threads grows; beyond 5 threads the gain is marginal.
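
Below is a minimal sketch of this executor model (Message and layout(_:) are placeholders rather than QQ's types; error handling and QoS tuning are omitted): M executors keep pulling messages from a shared work queue until it is empty, so no message is tied to any particular thread.

// A minimal sketch of the M-executor pre-layout model described above.
import Foundation

struct Message { let id: Int }

func layout(_ message: Message) {
    // expensive text measurement / coordinate calculation would go here
}

func parallelPreLayout(_ messages: [Message], executors m: Int) {
    var pending = messages
    let lock = NSLock()

    // Each executor loops, popping one message at a time from the shared queue,
    // so a slowly scheduled executor never holds up any specific message.
    func runExecutor() {
        while true {
            lock.lock()
            let next = pending.popLast()
            lock.unlock()
            guard let message = next else { return }
            layout(message)
        }
    }

    let group = DispatchGroup()
    let workers = DispatchQueue(label: "com.example.prelayout", attributes: .concurrent)
    for _ in 0..<max(0, m - 1) {             // (M - 1) executors on GCD threads...
        workers.async(group: group) { runExecutor() }
    }
    runExecutor()                            // ...and 1 on the current thread
    group.wait()
}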

It may seem that all the layout calculation has now been taken off the main thread, but in reality the computed coordinates and sizes often do not land on physical pixel boundaries, and the system then performs another round of "pixel alignment" on the main thread. To truly lighten the main thread, "asynchronous layout" must not ignore this detail, as shown in the figures below.

QQ 9 "Silly Fast, Silly Fast" will take you to see the technical secrets behind it

The R:G:B subpixel ratio of one pixel on an OLED screen is 1:2:1, and the DDIC (Display Driver IC) performs subpixel rendering, borrowing subpixels from neighboring pixels to make the display look fuller. The code has no direct control over this behavior, so the system must ensure that the submitted content aligns exactly with the screen's pixels, i.e. nothing like a 0.5-pixel offset appears.

QQ 9 "Silly Fast, Silly Fast" will take you to see the technical secrets behind it

The highlighted (yellow) areas are coordinate and size results that are misaligned with the screen's pixels
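
One way to take care of this on the layout threads (a sketch under the assumption that layout values are in points and the screen scale factor is known; not QQ's actual code) is to snap every computed value to a physical pixel boundary before handing it to the main thread:

// Snap computed layout values to physical pixels off the main thread, so the
// system has no "pixel alignment" left to redo at commit time.
import UIKit

func pixelAligned(_ value: CGFloat, scale: CGFloat) -> CGFloat {
    (value * scale).rounded() / scale        // nearest physical pixel
}

func pixelAligned(_ rect: CGRect, scale: CGFloat) -> CGRect {
    CGRect(x: pixelAligned(rect.minX, scale: scale),
           y: pixelAligned(rect.minY, scale: scale),
           width: pixelAligned(rect.width, scale: scale),
           height: pixelAligned(rect.height, scale: scale))
}

// Example: on a 3x screen, x = 10.34 pt snaps to 10.333... pt (31 physical pixels).
let frame = pixelAligned(CGRect(x: 10.34, y: 0.5, width: 100.17, height: 42.9), scale: 3)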

Other optimizations include intelligent preloading, message recycling, and asynchronous decoding of image resources. As shown in the figure below, windows proportional to the screen define a first-level cache that is displayed and a second-level cache that is preloaded, and everything beyond them is recycled and released.

QQ 9 "Silly Fast, Silly Fast" will take you to see the technical secrets behind it

Resource preload strategy diagram
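
As a rough illustration of this strategy (the names and the one-extra-screen window factor are assumptions, not QQ's actual parameters), the two cache levels can be expressed as index windows derived from the visible range:

// Level 1 = rows that are displayed, level 2 = rows preloaded/decoded ahead of
// time; anything outside both windows is a candidate for recycling.
func cacheWindows(visible: Range<Int>, total: Int, preloadScreens: Int = 1)
    -> (display: Range<Int>, preload: Range<Int>) {
    let span = visible.count * preloadScreens
    let lower = max(0, visible.lowerBound - span)
    let upper = min(total, visible.upperBound + span)
    return (display: visible, preload: lower..<upper)
}

// Example: rows 20..<30 are on screen out of 500 messages.
let windows = cacheWindows(visible: 20..<30, total: 500)
// windows.display == 20..<30, windows.preload == 10..<40;
// rows outside 10..<40 can be recycled and released.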

2.2.3 Make the interface as simple as possible on the GPU to reduce GPU time.

Besides computing layout asynchronously, complex images can also be rendered asynchronously to reduce GPU time. In particular, when graphics must be overlaid and clipped, the GPU cannot finish the drawing directly in the current frame buffer: it has to open an extra frame buffer, draw there, and then composite the contents of the two buffers. This is called "off-screen rendering". Off-screen rendering is a significant performance drain, mainly because of the GPU's expensive context switch, which requires flushing the current pipeline and waiting on fences. The original discussion is here: A Performance-minded take on iOS design | Lobsters. For such cases Apple's engineers suggest taking some of the load off the GPU by drawing with the CPU instead, as shown in the figure below:

QQ 9 "Silly Fast, Silly Fast" will take you to see the technical secrets behind it

Admittedly, GPU off-screen rendering is far more expensive than doing the same work on the CPU; when a mask cannot be avoided, asynchronous (CPU) rendering performs better.

When rendering messages we use the multi-core CPU for asynchronous rendering, cutting the GPU's share of the time. The difficulty is that in a fast-scrolling, fast-updating list this approach tends to "flash white" (a problem that also exists in the well-known open-source framework YYKit); we solved it well with an LRU cache plus incremental refresh.
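
Below is a minimal sketch of the CPU-side idea (placeholder names, not QQ's pipeline, which also involves the LRU cache and incremental refresh just mentioned): the clipped avatar is composed on a background queue, so the GPU never needs an off-screen masking pass.

// Compose the clipped image on the CPU, off the main thread, then commit the
// finished bitmap on the main thread (e.g. assign it to a layer's contents).
import UIKit

func renderRoundAvatar(_ source: UIImage,
                       size: CGSize,
                       completion: @escaping (UIImage) -> Void) {
    DispatchQueue.global(qos: .utility).async {
        let bounds = CGRect(origin: .zero, size: size)
        let renderer = UIGraphicsImageRenderer(size: size)
        let rounded = renderer.image { _ in
            UIBezierPath(ovalIn: bounds).addClip()   // clipping is done by the CPU
            source.draw(in: bounds)
        }
        DispatchQueue.main.async {
            completion(rounded)
        }
    }
}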

2.2.4 A silky experience with all the buffs stacked

Building on the CPU and GPU optimizations above, the message tab can now receive messages in real time while scrolling without stuttering, something similar domestic apps do not yet offer. In addition, the old version's limit of 150 conversations has been lifted: all of a user's conversation nodes are loaded with the same pagination approach as the chat page, as shown below:

(Video, duration 00:13)

Messages are received while scrolling, without stuttering

The speed of entering group and friend chats has also improved qualitatively: the entry animation is faster, yet the latest chat content is still visible immediately. The figure below shows the same account entering the same chat page. On the left (before optimization) the chat page is almost fully shown but its content is still loading; on the right (after optimization) the page has barely slid in and the sender's avatar and message content are already visible.

QQ 9 "Silly Fast, Silly Fast" will take you to see the technical secrets behind it

Comparison chart of the loading speed of the chat page (before optimization on the left, after optimization on the right)

Beyond entry speed, paging through chat history has also reached the industry's top tier, surpassing similar domestic apps and benchmarking against Telegram. No matter how many messages a user has, they can keep pulling up to read them, with no perceptible loading state.

QQ 9 "Silly Fast, Silly Fast" will take you to see the technical secrets behind it
QQ 9 "Silly Fast, Silly Fast" will take you to see the technical secrets behind it

Comparison chart of the chat page before and after optimization (before optimization, after optimization)

2.3 Forever young - the anti-degradation system

Winning territory is easy; holding it is hard. Facing complex business and technical debt, the QQ team has put 3 years of iteration into this, and QQ's anti-degradation system has now reached an advanced level in the industry. As the goalkeeper of mobile QQ's quality, we named it Hodor (Hold the door).

Anti-degradation goals: detect problems on primary paths in advance and prevent performance regressions through merge gating.

  1. Trunk merge gating: for the more stable performance indicators, automatic checks before merging.
  2. Daily automatic ticket filing: for intermittent performance problems, early detection during the development stage.
  3. Performance data dashboards: normalized, detailed dashboards for observing performance from a bird's-eye view.
  4. Alert bots: customizable alert rules for each performance dimension, giving immediate feedback on problems.

The overall solution collects diagnostic data with Instruments' dynamic tracing technology; xctrace automatically parses the trace files and symbolicates stacks for accurate attribution; anti-degradation checks run on every commit's build to pinpoint problems; and data dashboards plus automatic ticket dispatch shift quality left into the development stage. The result is a set of capabilities covering performance reporting, data analysis, intelligent scheduling, ticketing and alerts, device management, and test-case management. One picture sums it up:

QQ 9 "Silly Fast, Silly Fast" will take you to see the technical secrets behind it

Overview of the anti-degradation system

xctrace has been available since Xcode 12, and many of the issues fixed in its release notes came from feedback the QQ team gave while building the anti-degradation system. On performance optimization, QQ keeps in close contact with Apple's performance team, with both sides working odd hours to bridge the China-US time difference.

Since the anti-degradation system went live, it has effectively safeguarded the stability of the development trunk, detected a large number of performance and crash problems, and blocked many performance regressions introduced by new requirements.

QQ 9 "Silly Fast, Silly Fast" will take you to see the technical secrets behind it

Anti-degradation results

At present, Hodor has covered dozens of scenarios and landed on five platforms: iOS/Android/Windows/macOS/Linux.

03

The lightweight and rejuvenated QQ 9

Thanks to the optimizations above, QQ 9's performance in every scenario has improved greatly over previous versions, as shown in the figure below:

QQ 9 "Silly Fast, Silly Fast" will take you to see the technical secrets behind it

Using Apple's official tool, Xcode Organizer: at the 50th percentile, QQ 9 shows a 35% improvement in smoothness over the previous version, a 48% lower stutter rate, and a 40% shorter startup time, as shown in the figure below.

QQ 9 "Silly Fast, Silly Fast" will take you to see the technical secrets behind it

04

Summary and outlook

In this article we introduced the technical work behind QQ 9's silkiness: end-to-end performance optimizations covering startup speed, page refresh, the diff algorithm, preloading and recycling, and asynchronous layout and rendering, along with several scenarios where they improve the user experience.

Each of these areas is deep and complex, and every optimization point deserves its own detailed write-up; due to space constraints, those will have to wait for future articles.

We hope the polishing done by the QQ technical team brings users a tangible improvement in experience, and that QQ keeps getting better, because each of us is one of the 500 million who still insist on using QQ.

Authors: Zhang Cao, Bi Lei

Source-WeChat Official Account: Tencent Cloud Developer

Source: https://mp.weixin.qq.com/s/nVXE0iSllZ3rFei4t7bR7g