Rendering 19-GPU Instancing

https://catlikecoding.com/unity/tutorials/rendering/part-19/

Render a boatload of spheres.

Add support for GPU instancing.

Use material property blocks.

Make instancing work with LOD groups.

1 Batching instances

Instructing the GPU to draw something takes time. Feeding it the data to do so, including the mesh and material properties, takes time as well. We already know of two ways to decrease the number of draw calls: static and dynamic batching.

Unity can merge the meshes of static objects into a larger static mesh, which reduces draw calls. Only objects that use the same material can be combined in this way, and it comes at the cost of having to store more mesh data. When dynamic batching is enabled, Unity does the same thing at runtime for dynamic objects that are in view. This only works for small meshes, otherwise the overhead becomes too great.

There is yet another way to combine draw calls. It is known as GPU instancing or geometry instancing. Like dynamic batching, this is done at runtime for visible objects. The idea is that the GPU is told to render the same mesh multiple times in one go. So it cannot combine different meshes or materials, but it is not restricted to small meshes. We are going to try out this approach.

1.1 Many spheres

To test GPU instancing, we need to render the same mesh many times. Let us create a simple sphere prefab for this, which uses our white material.

To instantiate this sphere, create a test component that spawns a prefab many times and positions each copy randomly inside a spherical area. Make the spheres children of the instantiator so the editor's hierarchy window does not have to struggle with displaying thousands of instances.

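using UnityEngine;

// Spawns many copies of a prefab at random positions inside a sphere.
public class GPUInstancingTest : MonoBehaviour
{
	public Transform prefab;
	public int instances = 5000;
	public float radius = 50f;

	void Start()
	{
		for (int i = 0; i < instances; i++)
		{
			Transform t = Instantiate(prefab);
			// Random point inside a sphere of the given radius.
			t.localPosition = Random.insideUnitSphere * radius;
			// Parent to the spawner to keep the hierarchy window manageable.
			t.SetParent(transform);
		}
	}
}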

Create a new scene and put a test object in it with this component. Assign the sphere prefab to it. I will use it to create 5000 sphere instances inside a sphere of radius 50.

With the test object positioned at the origin, placing the camera at (0,0,-100) ensures that the entire ball of spheres is in view. Now we can use the statistics panel of the game window to determine how all the objects are drawn. Turn off the shadows of the main light so only the spheres are drawn, plus the background. Also set the camera to use the forward rendering path.

In my case, it takes 5002 draw calls to render the view, which the statistics panel reports as batches. That is 5000 spheres plus two extra for the background and camera effects.

Note that the spheres are not batched, even with dynamic batching enabled. That is because the sphere mesh is too large. Had we used cubes instead, they would have been batched.

In the case of cubes, I end up with only eight batches in total, so all cubes are rendered in six batches. That is 4994 fewer draw calls, reported as saved by batching in the statistics panel. In my case it also reports a much higher frame rate, 83 instead of 35 fps. This figure is derived from the time needed to render a frame and is not the actual frame rate, but it is still a good indication of the performance difference. The cubes are faster to draw because they are batched, but also because a cube requires far less mesh data than a sphere. So it is not a fair comparison.

As the editor generates a lot of overhead, the performance difference can be much greater in builds. The scene window especially can slow things down a lot, as it is an extra view that has to be rendered. I keep it hidden when in play mode to improve performance.

1.2 Supporting instancing

GPU instancing is not possible by default. Shaders have to be designed to support it. Even then, instancing has to be explicitly enabled per material. Unity's standard shaders have a toggle for this. Let us add an instancing toggle to MyLightingShaderGUI as well. Like the standard shader's GUI, we will create an Advanced Options section for it. The toggle can be added by invoking the MaterialEditor.EnableInstancingField method. Do this in a new DoAdvanced method.

void DoAdvanced()
{
	GUILayout.Label("Advanced Options", EditorStyles.boldLabel);
	// The toggle only shows up if the shader supports instancing.
	editor.EnableInstancingField();
}

Add this section at the bottom of our GUI.

public override void OnGUI(MaterialEditor editor, MaterialProperty[] properties)
{
	this.target = editor.target as Material;
	this.editor = editor;
	this.properties = properties;
	DoRenderingMode();
	DoMain();
	DoSecondary();
	DoAdvanced();
}

Select our white material. An Advanced Options header is now visible at the bottom of its inspector. However, there is no toggle for instancing yet.

The toggle will only be shown if the shader actually supports instancing. We can enable this support by adding the #pragma multi_compile_instancing directive to at least one pass of a shader. This enables shader variants for a few keywords, in our case INSTANCING_ON, but other keywords are also possible. Do this for the base pass of MyFirstLightingShader.

#pragma multi_compile_fwdbase
#pragma multi_compile_fog
#pragma multi_compile_instancing
           

Our material now has an Enable Instancing toggle. Checking it changes how the spheres are rendered.
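The flag behind this toggle can also be set from a script through the Material.enableInstancing property. The following is a minimal sketch, assuming a helper component named EnableInstancingOnStart that is hypothetical and not part of the tutorial, attached to an object that has a renderer.

```csharp
using UnityEngine;

// Hypothetical helper, not part of the tutorial: turns on GPU instancing
// for the shared material of the renderer on this object.
public class EnableInstancingOnStart : MonoBehaviour
{
	void Start()
	{
		Renderer r = GetComponent<Renderer>();
		if (r != null && r.sharedMaterial != null)
		{
			// Same flag as the Enable Instancing toggle in the inspector.
			r.sharedMaterial.enableInstancing = true;
		}
	}
}
```

Because this writes to the shared material, all objects using that material are affected, and in the editor the change is applied to the material asset itself.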

In my case, the number of batches has been reduced to 42, which means that all 5000 spheres are now rendered in only 40 batches. The frame rate has also shot up to 80 fps. But only a few spheres are visible.

All 5000 spheres are still being rendered, it is just that all spheres in the same batch end up at the same position. They all use the transformation matrix of the first sphere in the batch.

1.3 Instance IDs

The transformation matrices of all instances in a batch are sent to the GPU as an array. The array index corresponding to an instance is known as its instance ID. The GPU passes it to the shader's vertex program via the vertex data. On most platforms it is an unsigned integer named instanceID with the SV_InstanceID semantic. We can simply use the UNITY_VERTEX_INPUT_INSTANCE_ID macro to include it in our VertexData structure. It is defined in UnityInstancing, which is included by UnityCG. It gives us the correct definition of the instance ID, or nothing when instancing is not enabled. Add it to the VertexData structure in My Lighting.

struct VertexData
{
	UNITY_VERTEX_INPUT_INSTANCE_ID
	float4 vertex : POSITION;
	…
};

We now have access to the instance ID in our vertex program, when instancing is enabled. With it, we can use the correct matrix when transforming the vertex position. However, UnityObjectToClipPos does not have a matrix parameter. It always uses unity_ObjectToWorld. To work around this, the UnityInstancing include file overrides unity_ObjectToWorld with a macro that uses the matrix array. This can be considered a dirty macro hack, but it works without having to change existing shader code, ensuring backwards compatibility.

To make the hack work, the instance's array index has to be globally available to all shader code. We have to set this up manually via the UNITY_SETUP_INSTANCE_ID macro, which must be done in the vertex program before any code that might need it.

InterpolatorsVertex MyVertexProgram(VertexData v)
{
	InterpolatorsVertex i;
	UNITY_INITIALIZE_OUTPUT(Interpolators, i);
	// Must come before any code that relies on the instance ID.
	UNITY_SETUP_INSTANCE_ID(v);
	i.pos = UnityObjectToClipPos(v.vertex);
	…
}

The shader can now access the transformation matrices of all instances, so the spheres are rendered at their actual locations.

1.4 Batch size

It is possible that you end up with a different number of batches than I get. In my case, 5000 sphere instances are rendered in 40 batches, which means 125 spheres per batch.

Each batch requires its own array of matrices. This data is sent to the GPU and stored in a memory buffer, known as a constant buffer in Direct3D and a uniform buffer in OpenGL. These buffers have a maximum size, which limits how many instances can fit in one batch. The assumption is that desktop GPUs have a limit of 64KB per buffer.

A single matrix consists of 16 floats, which are four bytes each. So that is 64 bytes per matrix. Each instance requires an object-to-world transformation matrix. However, we also need a world-to-object matrix to transform normal vectors. So we end up with 128 bytes per instance. This leads to a maximum batch size of 64000/128=500, which could render 5000 spheres in only 10 batches.
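The arithmetic above can be written out as a small worked example. This is a sketch in plain C# without any Unity dependency. The class and method names are made up for illustration, and the 64000-byte figure is the text's own approximation of the buffer limit, not a value queried from a graphics API.

```csharp
using System;

// Worked version of the batch size estimate. All names are hypothetical.
public static class InstancingBatchMath
{
	const int BytesPerFloat = 4;
	const int FloatsPerMatrix = 16;

	// Two matrices per instance: object-to-world and world-to-object.
	public static int BytesPerInstance(int matricesPerInstance = 2)
	{
		return matricesPerInstance * FloatsPerMatrix * BytesPerFloat;
	}

	// How many instances fit in a buffer of the given size.
	public static int MaxInstancesPerBatch(int bufferBytes)
	{
		return bufferBytes / BytesPerInstance();
	}

	// Ceiling division: a partially filled batch still costs a draw call.
	public static int BatchCount(int instances, int instancesPerBatch)
	{
		return (instances + instancesPerBatch - 1) / instancesPerBatch;
	}

	public static void Main()
	{
		int perBatch = MaxInstancesPerBatch(64000);
		Console.WriteLine(perBatch);                   // 500
		Console.WriteLine(BatchCount(5000, perBatch)); // 10
		Console.WriteLine(BatchCount(5000, 125));      // 40
	}
}
```

The last line reproduces the 40 batches observed earlier. A batch size of 125 is a quarter of the 500 estimate, which would be consistent with a smaller buffer budget on the graphics API in use, although that explanation is an assumption here.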

1.5 Instancing shadows

Up to this point we have worked without shadows. Turn the soft shadows back on for the main light and make sure that the shadow distance is large enough to include all spheres. As the camera sits at -100 and the ball of spheres has a radius of 50, a shadow distance of 150 is enough for me.

Rendering shadows for 5000 spheres takes a toll on the GPU. But we can use GPU instancing when rendering the sphere shadows as well. Add the required directive to the shadow caster pass.

#pragma multi_compile_shadowcaster
#pragma multi_compile_instancing
           

Also add UNITY_VERTEX_INPUT_INSTANCE_ID and UNITY_SETUP_INSTANCE_ID to My Shadows.

struct VertexData {
	UNITY_VERTEX_INPUT_INSTANCE_ID
	…
};

…

InterpolatorsVertex MyShadowVertexProgram (VertexData v) {
	InterpolatorsVertex i;
	UNITY_SETUP_INSTANCE_ID(v);
	…
}
           

Now it is a lot easier to render all those shadows.

1.6 Multiple lights

We have only added support for instancing to the base pass and the shadow caster pass. So batching will not work for additional lights. To verify this, deactivate the main light and add a few spotlights or point lights that affect many spheres each. Do not bother turning on shadows for them, as that would really drop the frame rate.

It turns out that spheres that are not affected by the extra lights are still batched, along with the shadows. But the other spheres are not even batched in their base pass. Unity does not support batching for those cases at all. To use instancing in combination with multiple lights, we have no choice but to switch to the deferred rendering path. To make that work, add the required compiler directive to the deferred pass of our shader.

#pragma multi_compile_prepassfinal
#pragma multi_compile_instancing
           

After verifying that it works for deferred rendering, switch back to the forward rendering path.
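When comparing the two rendering paths repeatedly, it can be convenient to flip between them in play mode instead of changing the camera settings by hand. The sketch below assumes a hypothetical component named RenderingPathToggle, which is not part of the tutorial. It uses the Camera.renderingPath property and the legacy Input class, and the choice of key is arbitrary.

```csharp
using UnityEngine;

// Hypothetical helper, not part of the tutorial: switches the camera
// between forward and deferred rendering when a key is pressed.
[RequireComponent(typeof(Camera))]
public class RenderingPathToggle : MonoBehaviour
{
	public KeyCode toggleKey = KeyCode.R;

	Camera cam;

	void Awake()
	{
		cam = GetComponent<Camera>();
	}

	void Update()
	{
		if (Input.GetKeyDown(toggleKey))
		{
			// Flip between the two paths discussed in this section.
			cam.renderingPath =
				cam.renderingPath == RenderingPath.DeferredShading
					? RenderingPath.Forward
					: RenderingPath.DeferredShading;
		}
	}
}
```

Attach it to the camera, enter play mode, and watch the batch count in the statistics panel while toggling.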

2 Mixing material properties
