
The cross-platform renderer architecture and practice of the domestic engine Cocos | GMTC

Author | Wang Zhe

Editor | Sun Ruirui

Hello, I am Wang Zhe, founder of the Cocos engine. Like you reading this article, I am also a programmer. Although my work now leans more toward business management, the programmer foundation is engraved in my genes, so I want to share the technical architecture and related practices of the Cocos engine with you, hoping to bring you some new understanding.

Reacquaint yourself with Cocos

Last year, at the start of my GMTC Shenzhen talk, I did a little survey of the audience: I asked how many people had used or at least heard of Cocos. Many raised their hands, and I was very happy.

But people's understanding of Cocos is often stuck at different stages. On the day of the GMTC talk I sat next to two friends from Alibaba, and while exchanging greetings I mentioned I was from Cocos. They immediately reacted and listed "Fishing Master", "Happy Fighting Landlord", "Happy Fun" and many other games developed with Cocos. I joked at the time: dude, your information is a bit out of date.

We have done a lot over the years, so let me walk you through it.

Nearly all the casual and card games on the market today, the vast majority of "Legend"-style games, and about 64% of mini games are developed with Cocos. So in most people's minds, the games Cocos can make are roughly what is shown in the figure below:

(figure)

In fact, the types of games built on the Cocos engine go far beyond that, including SLG titles represented by "Clash of Kings" and "Gone with the Wind", RPGs represented by the "Blood Legend" series, and simulation management games represented by "Animal Restaurant". In 2021 Cocos made an even bigger breakthrough: we merged the 2D and 3D product lines and launched Cocos Creator 3.x, which makes it possible to develop both 2D and 3D content in the same editor.

So is the 3D content here just the casual 3D mentioned earlier? Not anymore. Our 3D content today looks like this:

(Cyberpunk)

In the demo above there are hundreds of dynamic light sources in the scene, and friends in the industry felt it approaches AAA-level visuals. The demo below shows ambient lighting, baking and more in Cocos Creator using the deferred render pipeline, and it runs at full frame rate on HiSilicon's GPU.

(HiSilicon CGKIT Photosphere)

Iterative evolution of the Cocos engine architecture

From the 2D and 2.5D engine that supported simple casual games to today's 3D engine, the Cocos architecture has gone through many iterations and evolutions. What pits did we step into along the way? How is Cocos different from Unity and Unreal, and why did we make those choices? Next I will share these insights from the evolution of the Cocos architecture.

First, let's take a look at what Cocos's early architecture looked like.

Cocos early architecture

A few years ago, when I was still writing code myself, the architecture of the Cocos engine looked like this:

(early architecture diagram)

As you can see in the architecture diagram above, on the graphics rendering side every feature is a Node, and each Node has two core functions: update and render. Update computes things like position and transformation, and render draws the result.

It works like two loops. The first traverses the scene and updates everything, for example computing where each fragment flies after a physics collision. The second is the drawing loop, which renders the nodes in order.
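
To make this concrete, here is a minimal TypeScript sketch of that two-loop, Node-based design (the names are illustrative, not the actual Cocos 2.x API):

```typescript
// Minimal sketch of the early Node-based design described above.
class SceneNode {
  children: SceneNode[] = [];

  update(dt: number): void {
    // recompute position, transform, animation state...
    for (const child of this.children) child.update(dt);
  }

  render(): void {
    // issue draw calls in scene-graph order
    this.draw();
    for (const child of this.children) child.render();
  }

  protected draw(): void { /* submit this node's geometry */ }
}

// The whole frame is just two traversals over the scene tree.
function mainLoop(root: SceneNode, dt: number): void {
  root.update(dt); // loop 1: simulate
  root.render();   // loop 2: draw
}
```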

Further down, it sits on OpenGL, OpenGL ES and WebGL. We did not care about other rendering standards on top of this set, because back then there were no others; we only had to get OpenGL working and every platform ran.

But when the era of 3D games arrived, this architecture started to hold up poorly. For a 2D game, rendering in order is enough, but in a 3D game the particle system, animation system, lighting, shadows and so on are computed independently and influence each other. At that point you can no longer simply assemble everything out of different nodes, and the renderer can no longer be organized as a single sequential render traversal; we needed a new architecture.

The current Cocos architecture

After many iterations, the current architecture of Cocos is shown in the following figure:

(current architecture diagram)

This is mainly the renderer architecture. As you can see from the diagram above, starting from the render scene there are models (Model), lights (Light), a skybox (Skybox) and so on, followed by scene culling (Scene Culling).

After scene culling, a series of renderable objects is produced, and the renderer then organizes the render queue based on the priority of Camera, Material and Pass.
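
As a rough illustration of that ordering, here is a hypothetical sort-key sketch in TypeScript (the field names are mine, not the engine's):

```typescript
// Each visible object/pass combination becomes one item in the render queue.
interface RenderItem {
  cameraPriority: number; // which camera the item is rendered for
  materialId: number;     // grouping by material reduces state switches
  passIndex: number;      // pass order within the material
}

// Pack the priorities into one key: camera is most significant, pass least.
function sortKey(item: RenderItem): number {
  return item.cameraPriority * 1_000_000 + item.materialId * 1_000 + item.passIndex;
}

function buildRenderQueue(visible: RenderItem[]): RenderItem[] {
  return [...visible].sort((a, b) => sortKey(a) - sortKey(b));
}
```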

Next comes the render pipeline (Render Pipeline). The mainstream pipelines today are Forward Rendering and Deferred Rendering, and Cocos, like most engines, supports both. The mobile rendering pipeline deserves special mention: since "Genshin Impact" became popular, deferred rendering has started to appear on mobile as well. This is where Cocos brought in the Frame Graph. A complex rendering pipeline contains rendering passes and processes with all kinds of different requirements, and the value of the Frame Graph is to turn these steps into Lego-like pieces that can be assembled freely; in the future we can also open this convenient customization capability to developers.

The bottom layer of the renderer is the GFX device layer. GPU frameworks are actually much the same, but different operating systems and environments use different graphics APIs. Our GFX approach is to wrap all of them behind one unified interface: whether the backend is OpenGL, Vulkan, Metal or WebGPU, we wrap each one, and the engine above only ever talks to the unified GFX API.
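
To make the idea concrete, here is a hedged sketch of what such a unified device interface can look like (illustrative only, not the real Cocos GFX API):

```typescript
// A unified device abstraction; each backend implements the same interface,
// so the engine above never branches on the underlying graphics API.
interface GFXBuffer { update(data: ArrayBufferView): void; }
interface GFXTexture { readonly width: number; readonly height: number; }

interface GFXDevice {
  createBuffer(byteLength: number): GFXBuffer;
  createTexture(width: number, height: number): GFXTexture;
  beginRenderPass(target: GFXTexture | null): void;
  draw(vertexCount: number): void;
  endRenderPass(): void;
  present(): void;
}

// One possible backend; the bodies would wrap WebGL calls.
class WebGLDevice implements GFXDevice {
  createBuffer(byteLength: number): GFXBuffer { return { update(_data) {} }; }
  createTexture(width: number, height: number): GFXTexture { return { width, height }; }
  beginRenderPass(_target: GFXTexture | null): void {}
  draw(_vertexCount: number): void {}
  endRenderPass(): void {}
  present(): void {}
}
```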

Mainloop timing

Next is the mainloop timing.

You can think about why there are so many application frameworks today, yet worldwide you can really only name three game engines: Unreal, Unity and Cocos. In my opinion it is because game rendering and application rendering are essentially different. Application rendering is "if the enemy doesn't move, I don't move": as long as there is no input and no notification to respond to, the application never redraws the interface, saving as much power as it possibly can.


Game engines are different; a game engine can be understood as a cyclic process, like a video player. Movies generally run at 24 frames per second, because with film the human eye can hardly tell the difference at higher refresh rates. Games are not like movies: a game transitions between static frames with no motion blur, so 30 frames per second in a game feels barely acceptable. Later people decided 30 frames still felt choppy and the bar rose to 60 frames, and now with high-refresh screens there are 90-frame and 120-frame targets.

In a game engine, no matter whether the player moves, the screen must stay in motion. Look at any game: when you stand still, the screen never simply stops updating; designers never leave that kind of dead space, otherwise you would feel the game has frozen, right? So when your player character is standing still, the screen still shows the character's breathing, wind blowing through the grass, birds flying past and so on.
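
Here is a minimal sketch of such an always-redrawing loop, assuming a scene object with update/render like the Node sketch earlier (shown with requestAnimationFrame for the web; native backends drive the same loop from vsync or a timer):

```typescript
// Assumed scene object, as in the earlier Node sketch.
declare const scene: { update(dt: number): void; render(): void };

let last = performance.now();

function frame(now: number): void {
  const dt = (now - last) / 1000;
  last = now;

  scene.update(dt); // simulation: animation, physics, idle breathing, wind...
  scene.render();   // redraw the whole frame, even if "nothing" changed

  requestAnimationFrame(frame); // schedule the next frame (30/60/90/120 Hz)
}

requestAnimationFrame(frame);
```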

In the 2D engine era everyone was happily doing batching, but the 3D era adds the concept of cameras. How to understand this? There may be a main Camera, but the scene might also contain a rear-view mirror, a pool of water, or other more complex cases. Take a character as an example: a character is a Model, the body may be composed of multiple subModels with different materials, and each subModel's material may contain multiple rendering Passes, so it may be drawn many times, for example once for the object's surface, once for its shadow, and once more if there is a mirror with a reflection. Organizing these complex Passes is handled in the Frame Graph, and after processing they become a large number of rendering commands thrown into the RenderQueue.
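
Roughly speaking, the expansion looks like the following sketch (the types and names are hypothetical; RenderItem is the one from the sort-key sketch above):

```typescript
interface Pass { passIndex: number; }
interface SubModel { materialId: number; passes: Pass[]; }
interface Model { subModels: SubModel[]; }
interface Camera { priority: number; visibleModels: Model[]; }

// Every camera sees every visible model; every model has sub-models; every
// sub-model's material has several passes. The product is the draw-call count.
function expandDrawCalls(cameras: Camera[]): RenderItem[] {
  const items: RenderItem[] = [];
  for (const cam of cameras) {
    for (const model of cam.visibleModels) {
      for (const sub of model.subModels) {
        for (const pass of sub.passes) {
          items.push({
            cameraPriority: cam.priority,
            materialId: sub.materialId,
            passIndex: pass.passIndex,
          });
        }
      }
    }
  }
  return items; // then sorted into the render queue as shown earlier
}
```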

OK, at this point the rendering phase is basically over, and the next step belongs to GFX, which interacts with the hardware directly.

Mobile First: TBR Practice

What makes Cocos different from its competitors is that some of those engines already existed in the PC era; the first generation of Unreal, for example, shipped in 1998. Cocos, on the other hand, is mobile first: it ran on mobile before anything else, and the GPU architecture on mobile is completely different from the PC. So whether an engine is PC-first, mobile-first, or both affects its overall architectural design.

Take a typical example. On the PC, the rendering mode is called Immediate Mode Rendering (IMR). How immediate is it? As soon as a rendering command enters the GPU, it draws it on the spot. Where is its buffer? In video memory. What if there is a problem, say 64 GB of video memory is not enough? Add more: 128 GB, 256 GB. After adding video memory, the GPU needs more bandwidth to talk to it; if bandwidth is still not enough, raise the voltage and clocks. Power consumption and heat go up? Never mind, add a fan! It is the "all in, brute force to the end" approach: whatever the problem, on the PC you can always bolt on more hardware.

But mobile is not like that. On mobile we cannot add fans, and the most painful part is that there is no dedicated video memory: the GPU inside the SoC shares memory with the CPU. Once rendering and memory reads and writes start fighting over IO, there is no performance left to squeeze out.

God said, let there be light, so we made a rendering solution for mobile.

(figure: deferred rendering flow on a tile-based mobile GPU)

The figure above shows a simplified rendering flow of a traditional deferred pipeline on a mobile tile-based (TBR) GPU. In the process you can see the GBuffer and the tile memory, which is generally 32 KB or 64 KB. How do we deal with that? It is essentially the method teacher Yi Xuxin talked about: partition, divide and conquer.

This is a very common problem in programming: something is too big, it has to be pushed out to external memory, and reads, writes and IO become particularly slow. No matter, we split it into small tiles. Tile sizes are 32x32 or 64x64 pixels (newer GPUs tend toward 64x64), which breaks the whole frame into small pieces.

Note that if you want to optimize power consumption on mobile, you must avoid storing the GBuffer in system memory. Do not read it back out after processing in tile memory; instead, do the lighting right after each tile has produced its GBuffer. That saves the IO overhead of the GBuffer: finish the lighting calculation on-tile and you are done. The tile is tiny, unlike PC video memory that you can use as freely as you like; on mobile it feels like doing cross-stitch, and you have to use it very carefully.
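
Here is a hedged sketch of how such an on-tile deferred pass can be described, with illustrative property names rather than a real API: the GBuffer attachments are marked as transient so they are never stored back to system memory, and only the final lit color is written out.

```typescript
interface AttachmentDesc {
  format: 'rgba8' | 'rgba16f' | 'depth24stencil8';
  loadOp: 'clear' | 'load';
  storeOp: 'store' | 'discard'; // 'discard' means never written back to system memory
  memoryless: boolean;          // keep the attachment in tile memory only
}

// GBuffer attachments live and die on the tile.
const gbufferAlbedo: AttachmentDesc = { format: 'rgba8',   loadOp: 'clear', storeOp: 'discard', memoryless: true };
const gbufferNormal: AttachmentDesc = { format: 'rgba16f', loadOp: 'clear', storeOp: 'discard', memoryless: true };

// Only the final lit color is stored out once per tile.
const sceneColor: AttachmentDesc = { format: 'rgba8', loadOp: 'clear', storeOp: 'store', memoryless: false };

// Subpass 1 writes the GBuffer on-tile; subpass 2 reads it as a subpass input
// and writes sceneColor, the single attachment that actually leaves the tile.
```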

So throughout the rendering process you need to plan every intermediate buffer in advance. If you store something out and then want to draw it again, you have wasted that part of the cache and started reading and writing the phone's system memory frequently, and all the effort before it is wasted.

This is a very essential difference between mobile and PC. Cocos is mobile first, so our overall architecture follows this best practice. Of course, when you use the engine you do not need to care about these details; modern engines generally package them up for you. One more point: everything above is available in our open-source repositories, because Cocos is open source.


GFX with multiple backends

As mentioned earlier, GFX encapsulates all the platform differences, and this part of Cocos is also open source under the MIT license, so everyone can use it commercially. If you want to do graphics rendering, it does not matter if you cannot read through all the code; you can take the whole thing and use it directly. For example, when WebGPU comes out, or a new version of Metal or Vulkan, you just add a new backend. That way you do not have to care what new technology the chip manufacturers ship, and it will not disturb the architecture you have built on top.
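
A small sketch of what selecting a backend looks like from the engine's point of view, with hypothetical backend classes standing in for the real ones:

```typescript
// GFXDevice stands for the unified interface sketched earlier; the backends
// here are simplified stand-ins, not the actual Cocos classes.
interface GFXDevice { present(): void; }

class WebGPUBackend implements GFXDevice { present(): void { /* submit via WebGPU */ } }
class WebGLBackend implements GFXDevice { present(): void { /* swap buffers via WebGL */ } }

function createDevice(): GFXDevice {
  if (typeof navigator !== 'undefined' && 'gpu' in navigator) {
    return new WebGPUBackend(); // a newly shipped API only needs a new backend
  }
  if (typeof WebGLRenderingContext !== 'undefined') {
    return new WebGLBackend();  // everything above GFX stays unchanged
  }
  throw new Error('No supported graphics backend available');
}
```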

Multithreaded design for load balancing


Let's talk about multithreading. I have seen programmers who are just getting started get very excited about multithreading: they throw in a Job System with all kinds of threads flying around and think it is particularly cool. In my earliest days I worked on hardware, and in fact only genuinely independent hardware IO is worth its own thread; otherwise the CPU's thread-switching overhead is greater than the benefit of your multithreading, which is not cost-effective.

Things like the GPU, networking and physics may be worth a dedicated thread, but that does not mean everything is. Take the latest Snapdragon 888: its CPU is described as 1+3+4, that is one super core plus three big cores plus four small cores, still only 8 cores in total. With 8 cores you should not spin up a dozen threads. As for the multithreading structure diagram below, the common pattern is the producer-consumer model, which everyone has surely used.

(producer-consumer threading diagram)

In the diagram above there is the Render Thread, plus physics and networking, which you do not have to worry about. The render thread processes a frame and then pushes the RenderCmd Queues buffer to GFX. This works like a separate device thread dedicated to recording commands for the GPU, with the producers above and the consumer below; this model is very common.

However, I once saw a game company write multithreading without this kind of consumer wait, this waiting-for-a-signal style. They simply wrote a while(1) loop that checked whether there was work; if not, they slept for 10 milliseconds, then went back up and checked again, slept another 10 milliseconds, and so on. The result was that the overall power consumption was very high.
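
For contrast, here is a minimal sketch of the signal-wait pattern versus that polling anti-pattern, using Atomics on a SharedArrayBuffer (illustrative only, not the engine's actual threading code):

```typescript
// Runs inside a Worker, where blocking is allowed. `signal` is an Int32Array
// over a SharedArrayBuffer; the producer bumps signal[0] after publishing work
// and calls Atomics.notify(signal, 0).
function deviceThreadLoop(signal: Int32Array, drain: (count: number) => void): void {
  let consumed = 0;
  for (;;) {
    Atomics.wait(signal, 0, consumed); // sleeps until new work is published
    const produced = Atomics.load(signal, 0);
    drain(produced - consumed);        // record/submit the pending commands
    consumed = produced;
  }
}

// The anti-pattern from the story: polling keeps the core awake and hot
// even when there is nothing to do.
//   for (;;) { if (nothingToDo()) { sleepMs(10); continue; } drain(); }
```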

I was particularly impressed that the entire group of 20 people at that game company worked overtime for almost ten days and still could not find the problem. Why did the device get so hot? They assumed there must be a problem with engine performance. When they could not solve it they came to us and paid two or three hundred thousand for that one line of code, and in the end it was not a rendering problem at all; the problem was outside the engine, in their synchronization code.

FrameGraph custom render pipelines


The FrameGraph-based custom render pipeline is actually a design that the Frostbite engine shared at GDC.

The basic process is divided into Setup, Compile and Execute. In Setup the user describes the rendering flow; then, every frame, the engine builds a render graph from all the passes in real time, sorts out the entire rendering flow, and finally executes the user's actual pass callbacks.
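
Here is a hedged TypeScript sketch of that Setup / Compile / Execute shape (the API surface is illustrative, not the exact Cocos frame-graph interface):

```typescript
interface PassBuilder {
  read(resource: string): void;  // declare a resource this pass consumes
  write(resource: string): void; // declare a resource this pass produces
}

type SetupFn = (builder: PassBuilder) => void;
type ExecuteFn = () => void;

class FrameGraph {
  private passes: { name: string; setup: SetupFn; execute: ExecuteFn }[] = [];

  // Setup: the user describes each pass and its inputs/outputs.
  addPass(name: string, setup: SetupFn, execute: ExecuteFn): void {
    this.passes.push({ name, setup, execute });
  }

  // Compile: build the dependency graph from the declared reads/writes, cull
  // unused passes, plan transient resource lifetimes (all omitted here).
  compile(): void {
    const builder: PassBuilder = { read: () => {}, write: () => {} };
    for (const pass of this.passes) pass.setup(builder);
  }

  // Execute: run the surviving passes' callbacks to record GPU commands.
  execute(): void {
    for (const pass of this.passes) pass.execute();
  }
}

// Assembled every frame, like Lego bricks.
const fg = new FrameGraph();
fg.addPass('gbuffer', b => b.write('GBuffer'), () => { /* draw geometry */ });
fg.addPass('lighting', b => { b.read('GBuffer'); b.write('SceneColor'); }, () => { /* compute lighting */ });
fg.compile();
fg.execute();
```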

(deferred pipeline diagram)

This is the simplified flow of our engine's deferred render pipeline. The orange squares are all rendering Passes and the blue squares are Compute Shader passes. Our compute pass runs a frustum-cluster light-culling algorithm that calculates the influence range of each light source, so the final Lighting Pass only considers the lights that can possibly affect an object, reducing the amount of lighting computation.

On mobile we limit the forward pipeline to 4 dynamic lights; more than that affects performance, because each light source adds a pass on every affected object, so the complexity is multiplicative. As mentioned earlier, we can run thousands of lights. The frustum-cluster culling contributes to that, but the key is deferred rendering: it first draws all objects into the GBuffer and then computes the contribution of all the lights at once in the lighting pass, so adding light sources becomes an additive relationship instead of a multiplicative one, and performance is much higher.
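
A back-of-the-envelope illustration of that multiplicative-versus-additive difference, with purely made-up numbers:

```typescript
const objects = 100;
const lights = 10;

// Forward with one extra pass per light per affected object: cost grows as a product.
const forwardPasses = objects * lights; // about 1000 geometry passes

// Deferred: draw the geometry once into the GBuffer, then one screen-space
// lighting pass iterates the (cluster-culled) lights: cost grows as a sum.
const deferredGeometryPasses = objects; // about 100
const deferredLightingIterations = lights;

console.log({ forwardPasses, deferredGeometryPasses, deferredLightingIterations });
```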

Of course, the GBuffer takes up more storage. Phones today have more memory, so this is trading memory for performance. But deferred rendering is inherently memory-hungry, and if you do anti-aliasing with MSAA the memory occupied is multiplied by 4, which is a problem.

The second problem is that it cannot handle translucent objects. If you encounter translucency, say glass bottles or small marbles, you still have to draw them in a separate round following the normal Forward rendering process, which is a drawback.

So on the front end today, I have not seen any project crazy enough to ship with this Deferred Rendering; it feels more like a showcase, proof that we can already get to this level, while in practice everyone still conservatively uses Forward Rendering, otherwise memory really is easy to blow up.

(Deferred Render Pipeline - Dynamic Lights)

(hundreds of dynamic lights)

Bloom is worth mentioning here. It is a very common post-processing effect for lighting: lights need a bit of glow so they do not look fake. For example, the lights in the picture below have the Bloom effect.


(Add post-processing effect: Bloom)

There is also anti-aliasing. Because MSAA cannot be used with deferred rendering (the memory would explode), we use TAA instead, and the result is quite good. The figure above shows culling. Ordinary frustum culling works per object, computing occlusion relationships between objects, but objects have odd shapes; you cannot count on a particularly large object in front blocking everything small behind it. So the calculation here follows the earlier practice: first cut the scene into a grid of cells, then compute and pre-store the occlusion relationships between cells. If a cell at the back is occluded, none of the objects in it need to be drawn, and all of its light sources can be removed as well.


Then, for the cells that remain after this elimination, the occlusion of the objects inside each cell is computed further.

(Adding PVS (Potentially Visible Set) culling)

The left side of the picture above shows the situation before PVS is added: everything in red has to be drawn. With the culling done first, what actually needs to be drawn is very little (as in the right picture), because many things cannot be seen at all. At this point the total draw calls can be reduced by about 50%.
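
A minimal sketch of what a PVS lookup at render time can look like, assuming precomputed cell-to-cell visibility (the data layout is illustrative):

```typescript
// visibleCells[i] holds the ids of cells potentially visible from cell i,
// precomputed offline.
interface PVSData { visibleCells: Set<number>[]; }

interface SceneObject { cellId: number; }

// Keep only objects (or light sources) sitting in cells visible from the
// camera's cell; everything else is skipped before any per-object culling.
function cullWithPVS<T extends SceneObject>(cameraCellId: number, objects: T[], pvs: PVSData): T[] {
  const visible = pvs.visibleCells[cameraCellId];
  return objects.filter(o => visible.has(o.cellId));
}
```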

Everything shared so far is open-source code. Next, I want to share some points of view at a more macro level.

The "traffic base" at home and abroad is divided

They are all engines, yet Cocos is very different from Unity and Unreal. By comparison, the other two focus more on native; H5 and front-end technologies seem to have nothing to do with them. So why does the Cocos engine do this?

Here I would like to share a concept I call the "traffic base", a term I coined myself.

Abroad, the ecosystem's traffic bases are iOS and Android, which are operating systems. This kind of traffic base has several distinct features:

First, there is a huge demand for content. The system faces C-end users directly, and the system itself cannot provide all the content, so it needs a large number of content providers to develop all kinds of apps to meet the various needs of the C-end.

Second, secondary ecosystems are forbidden. For example, the App Store on iOS prohibits apps from downloading executable scripts, such as dynamically downloading H5, JavaScript or Lua code and running it inside the app, because as soon as this is allowed, developers can build all kinds of game boxes and game lobbies and create a secondary ecosystem.


Looking at China, Huawei, Xiaomi, OPPO, vivo and other phone brands all have their own app stores and ecosystems, but the real traffic bases in China are the super apps, which sit on the real traffic entrances.

Why? A traffic base is essentially bargaining power. In China, mobile users no longer like downloading a new app; everyone is used to scanning a QR code to handle health-code checks, food delivery, shared bikes, payment and so on. So these Chinese super apps build their own ecosystems on top of their traffic bases. WeChat, for example, launched Mini Programs and Mini Games to meet the diverse needs of C-end users and established an ecosystem on top of them.

It is worth noting that when the app itself is developed with native technology, the ecosystem layered on top of it can only choose front-end technology, which is why front-end technology has developed more actively in China than abroad.

But front-end technology is not only about reducing cross-platform development costs. At an earlier GMTC talk, a speaker from ByteDance mentioned two reasons: one is reducing development cost, the other is reducing the cost of channel updates. In fact, only small companies really need cross-platform; large companies have enough capacity to simply split into two project teams and make sure the native version has the better experience. But the core point is that beyond the big companies' own choices, we must also consider the Mini Program and Mini Game ecosystem.

This difference shaped our design. Remember the architecture diagram of the Cocos engine at the beginning of this article? It is already complicated; millions of lines of code have been written.

(architecture diagram)

In fact our architecture is what you see in the picture above. This part is what the front end uses; we write it in TypeScript, and the whole thing runs on the web platform, which makes it very suitable for all kinds of mini games: phone vendors' quick games, Douyin mini games, Web H5 and so on. Thousands of developers are now using Cocos to build products for their own platforms. As an aside, this has also created a lot of demand in the job market; many well-known companies such as iQiyi, Meituan and Tencent are hiring, and I even saw QQ Music searching for senior Cocos talent.

On the other hand, Cocos also serves native mobile games. Many mobile games need to go through the hardcore phone-vendor channels and also go overseas, and overseas there is no H5 game ecosystem. What then? We implemented it in C++, so there are two groups of people writing these two engines. You might ask: why not just compile the C++ with WebAssembly? I can only tell you that WebAssembly has pitfalls, and we have practical experience here. If I compile this C++ code to WebAssembly and expose it to TypeScript, can it run in the browser? Yes.

But what is the problem? As mentioned earlier, everyone still wants hot updates, and a front end that cannot be updated dynamically has no soul. Our games have souls, so they are written with dynamic scripts. But when the dynamic script has to call into that WebAssembly blob, the overhead is very large: whether from JavaScript or TypeScript, calling into the WebAssembly-compiled code costs a lot, and in the end the performance is actually no better than writing it directly. So, to give everyone better performance, many things really are written twice. That is our actual situation at present.

Here I really just want to say that building a wheel is not easy. We have been doing Cocos for eleven years as of this year, and I think we are number three in the world.

Finally, I will share two open-source repositories with you. The first repository is our H5 engine; the pure front-end code is all open source. The second, native-engine, is the native engine running on iOS, Android, Windows and macOS, and it is also open source. Both are under the MIT license. If you want to use them, whether the rendering layer or the underlying GFX layer, just take them directly and do not build your own wheels. You can ask us questions on the "Cocos.com" forum, where I often answer questions myself.

Outside the gaming industry, Cocos has also entered more areas: for example Cocos ICE, an interactive courseware editor for the education industry, and HMI human-machine interfaces for IoT devices and various screens, and we have accumulated solutions in XR, AR, in-car navigation, children's programming, watches and even virtual characters.

Many people have asked whether we are a metaverse. My answer is that Cocos is not a metaverse; Cocos is a tool used to produce metaverses. However the metaverse develops, there always has to be content and real-time interaction, and that is exactly what Cocos does.

This article was compiled by InfoQ from the talk "The Cross-platform Renderer Architecture and Practice of the Cocos Engine" at the GMTC Global Front-end Technology Conference (Shenzhen) 2021.

About the speaker

Wang Zhe, CEO of Yaji Software

Founder and CEO of the Cocos engine. After a decade of hard work, Cocos now has 1.5 million registered developers in more than 203 countries and regions, serving 40% of mobile games, 64% of H5 and mini games, 90% of online education apps, and a large number of IoT and digital twin developers.

Event recommendations

From June 10 to 11 this year, this year's first GMTC Global Front-end Technology Conference will be held in Beijing. The conference covers 15 topics, including front-end business architecture, front-end DevOps, front-end performance optimization, IoT dynamic application development, TypeScript, mobile performance and efficiency optimization, front-end growth practice, sustainable team building, and cross-end technology selection, inviting you to explore the hot directions and landing practices of front-end technology.
