@reduz
Last active December 24, 2024 07:00
GPU Driven Renderer for Godot 4.x

GPU Driven renderer design for Godot 4.x

Goals:

The main goal is to implement a GPU driven renderer for Godot 4.x. This is a renderer that happens entirely on GPU (no CPU Dispatches during opaque pass).

Additionally, this is a renderer that relies exclusively on raytracing (and a base raster pass aided by raytracing).

It is important to note that we don't want to implement a GPU driven renderer similar to that of AAA engines such as Unreal. We want to implement it in a way that allows us to retain full flexibility in the rendering pipeline and that is simple and easy to maintain.

Overview

Roughly, a GPU render pass would work more or less like this:

  1. Frustum/Occlusion cull depth: This would be done using raytracing on a small screen buffer, casting a small number of rays to generate a depth buffer.
  2. Occlusion cull lists: All objects will be culled against the small depth buffer. Objects that pass are placed in a special list per material.
  3. Opaque render: Objects are rendered in multiple passes to a G-Buffer (deferred) by executing every shader in the scene together with its specific indirect draw list. This follows the logic of the Rendering Compositor, providing the same flexibility for different kinds of effects, stencil, custom buffers, etc.
  4. Light cull: Lights are culled against the depth buffer to determine which lights are rendered per pixel and which need shadows.
  5. Shadow tracing: Shadows are traced using raytracing to the respective pixels on screen.
  6. GI Pass: Reflections and GI are processed also using raytracing. GI of objects off-screen is done with material textures (material rendered to a low res texture).
  7. Decal Pass: Decals are rendered into the G-Buffer.
  8. Volumetric Fog: Volumetric fog is processed like in the current renderer, except that instead of tapping shadow maps, raytracing is used.
  9. Light Pass: Finally, a pass is applied that adds the lights and reads the proper shadows.
  10. Subsurface Scatter Pass: A pass to post-process subsurface scatter must be done after the light pass.
  11. Alpha Pass: Transparency pass is done at the end. This is done using regular Z-Sorted draw calls, CPU driven.

FAQ

Q: Why do we use smaller resolution raytracing for occlusion culling and not visibility lists?
A: Visibility lists take away flexibility from the opaque passes. They require rendering objects in specific order, while opaque passes do not.

Q: Do we not have small occluder problem using raytraced occlusion?
A: Yes, but in practice this does not really matter. Roughly more than 99% of scenes work fine.

Q: Why use only raytraced shadows? Is it not better to support shadow mapping?
A: We need to check depending on performance, but in the worst case a hybrid technique can be explored.

In Detail

Frustum/Occlusion cull depth

The first pass has to discard objects based on visibility. Frustum culling will discard objects not visible to the camera. A depth buffer will be created using raytracing and used for base occlusion. Objects will be tested against it and also discarded.

Keep in mind that a relatively large depth buffer can be used thanks to raytracing, because the depth buffer can be reprojected from the previous frame and only the places with missing rays need to be re-cast.
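As a rough illustration of testing objects against the small depth buffer, the following compute shader is a minimal sketch and not the planned implementation. It assumes the objects were already frustum culled and projected to a screen-space rectangle, and that depth is stored in [0,1] with larger values being farther away. The names cull_depth, ObjectBounds, visible and the 64 texel cap are hypothetical.

```glsl
#version 450
layout(local_size_x = 64) in;

// Hypothetical per-object data, produced by an earlier frustum cull/projection step.
struct ObjectBounds {
    vec4 rect;           // xy = min, zw = max of the bounds in UV space [0,1]
    float nearest_depth; // closest depth of the bounds, [0,1], larger = farther (assumption)
};

layout(set = 0, binding = 0) uniform sampler2D cull_depth; // small raytraced depth buffer
layout(std430, set = 0, binding = 1) readonly buffer Objects { ObjectBounds objects[]; };
layout(std430, set = 0, binding = 2) writeonly buffer Visibility { uint visible[]; };
layout(push_constant) uniform Params { uint object_count; } params;

bool is_occluded(vec2 rect_min, vec2 rect_max, float nearest_depth) {
    ivec2 size = textureSize(cull_depth, 0);
    ivec2 p0 = clamp(ivec2(floor(rect_min * vec2(size))), ivec2(0), size - 1);
    ivec2 p1 = clamp(ivec2(floor(rect_max * vec2(size))), ivec2(0), size - 1);
    // Very large rectangles are simply treated as visible to bound the loop cost.
    if ((p1.x - p0.x + 1) * (p1.y - p0.y + 1) > 64) {
        return false;
    }
    for (int y = p0.y; y <= p1.y; y++) {
        for (int x = p0.x; x <= p1.x; x++) {
            float occluder_depth = texelFetch(cull_depth, ivec2(x, y), 0).r;
            if (nearest_depth <= occluder_depth) {
                return false; // at least one texel may see the object
            }
        }
    }
    return true; // every covered texel has a closer occluder
}

void main() {
    uint id = gl_GlobalInvocationID.x;
    if (id >= params.object_count) {
        return;
    }
    ObjectBounds b = objects[id];
    visible[id] = is_occluded(b.rect.xy, b.rect.zw, b.nearest_depth) ? 0u : 1u;
}
```

The test is conservative: an object is only discarded when every depth texel it covers is closer than the nearest point of its bounds.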

Occlusion cull lists

The list of objects passing must be sorted by shader type, in an indirect draw list fashion. Materials for all shaders of a given type will be found in an array, textures will be indices to a large array of textures instead of actual textures.

This can be achieved relatively easily using a compute shader that counts all the objects of each shader type first, then assigns their offset in a large array, then creates the indirect draw lists for each shader.
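The count, offset and list creation steps could be sketched as a single compute shader run in three dispatches (with barriers in between). This is only an illustration under stated assumptions: the buffer names, MAX_SHADER_TYPES, the phase push constant and the VisibleObject layout are hypothetical, and counts and cursors are assumed to be cleared to zero beforehand.

```glsl
#version 450
layout(local_size_x = 64) in;

#define MAX_SHADER_TYPES 256 // assumption

// Same member order as VkDrawIndexedIndirectCommand.
struct DrawCommand {
    uint index_count;
    uint instance_count;
    uint first_index;
    int vertex_offset;
    uint first_instance;
};

// Hypothetical input: one entry per object, with the result of occlusion culling.
struct VisibleObject {
    uint shader_type;
    uint visible;
    uint index_count;
    uint first_index;
    int vertex_offset;
};

layout(std430, binding = 0) readonly buffer Objects { VisibleObject objects[]; };
layout(std430, binding = 1) buffer Counts { uint counts[MAX_SHADER_TYPES]; };
layout(std430, binding = 2) buffer Offsets { uint offsets[MAX_SHADER_TYPES]; };
layout(std430, binding = 3) buffer Cursors { uint cursors[MAX_SHADER_TYPES]; };
layout(std430, binding = 4) writeonly buffer Draws { DrawCommand draws[]; };
layout(push_constant) uniform Params { uint object_count; uint phase; } params;

void main() {
    uint id = gl_GlobalInvocationID.x;
    if (params.phase == 0u) {
        // Phase 0: count the visible objects of each shader type.
        if (id < params.object_count && objects[id].visible != 0u) {
            atomicAdd(counts[objects[id].shader_type], 1u);
        }
    } else if (params.phase == 1u) {
        // Phase 1: exclusive prefix sum; one invocation is enough for a small array.
        if (id == 0u) {
            uint running = 0u;
            for (uint i = 0u; i < uint(MAX_SHADER_TYPES); i++) {
                offsets[i] = running;
                running += counts[i];
            }
        }
    } else {
        // Phase 2: write each visible object into the range of its shader type.
        if (id < params.object_count && objects[id].visible != 0u) {
            VisibleObject o = objects[id];
            uint slot = offsets[o.shader_type] + atomicAdd(cursors[o.shader_type], 1u);
            // first_instance carries the object id so the vertex shader can find its data.
            draws[slot] = DrawCommand(o.index_count, 1u, o.first_index, o.vertex_offset, id);
        }
    }
}
```

With this layout, each shader would draw the range starting at offsets[type] with counts[type] commands, for example through an indirect-count draw call.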

My thinking here is that, as we will eventually have mesh streaming and these will by default more or less be separated into meshlets anyway (for the purpose of streaming), those could be rendered with a special shader that culls them in more detail (maybe mesh shader), while regular objects (non streamed) go via the regular vertex shader path.

Opaque render

Opaque rendering happens by executing all shaders basically with indirect rendering. Because in the compositor proposal, shaders are assigned to subpasses, it is easy to have a system where the compositor still works despite the GPU driven nature.

Additionally, depending on what a material renders (visibility mask, emission, custom lighting, etc.), we can take advantage of this and render in multiple render passes to different G-Buffer configurations.

Bindless implementation

The bindless implementation should be relatively simple to do. Textures can go in a simple:

uniform texture2D textures[MAX_TEXTURES];

For vertex arrays, vertex pulling can be implemented using utextureBuffer for vertex buffers:

uniform utextureBuffer vertex_buffers[MAX_VERTEX_BUFFERS];

And the vertex format decoded on demand. Vertex pulling of a custom format would probably not be super efficient, but the following needs to be taken into consideration:

  • Most meshes will be compressed (meaning they use only one format, hence vertex pulling will be very efficient).
  • In a larger game most static meshes would most likely also be streamed anyway and the format will be fixed, so the vertex pulling code for most of the vertex buffers should be very efficient too.
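To make the single-format case concrete, here is a minimal vertex pulling sketch. The compressed format is invented for illustration (one RGBA32UI texel per vertex, position quantized to 16 bits per axis inside the mesh AABB, UV stored as two halves), and the names vertex_buffers, MAX_VERTEX_BUFFERS and Params are assumptions, not Godot's actual layout.

```glsl
#version 450

#define MAX_VERTEX_BUFFERS 1024 // assumption

// Bindless array of vertex buffers, each viewed as RGBA32UI texels.
layout(set = 0, binding = 0) uniform utextureBuffer vertex_buffers[MAX_VERTEX_BUFFERS];

layout(push_constant) uniform Params {
    mat4 view_projection;
    vec4 aabb_min;     // xyz = mesh AABB minimum
    vec4 aabb_size;    // xyz = mesh AABB size
    uint buffer_index; // which vertex buffer to pull from
} params;

layout(location = 0) out vec2 uv_out;

void main() {
    // One texel per vertex (hypothetical compressed format):
    //   raw.x = position x,y as 2x16-bit unorm
    //   raw.y = position z in the low 16 bits
    //   raw.z = unused in this sketch (could hold a packed normal)
    //   raw.w = UV as 2x16-bit half floats
    uvec4 raw = texelFetch(vertex_buffers[params.buffer_index], gl_VertexIndex);

    vec2 xy = unpackUnorm2x16(raw.x);
    float z = unpackUnorm2x16(raw.y).x;
    vec3 position = params.aabb_min.xyz + vec3(xy, z) * params.aabb_size.xyz;

    uv_out = unpackHalf2x16(raw.w);
    gl_Position = params.view_projection * vec4(position, 1.0);
}
```

Because every mesh shares the same format here, the decode has no branching, which is why the compressed and streamed cases are expected to be the cheap ones.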

Light cull

With the depth buffer completed, it is possible to do light culling and assignment of visible lights. This can be done using the current clustering code. Alternatively, a unified structure like that of raytracing can be used (possibly a hash grid?).

Shadow Tracing

Shadows can be traced in this pass. Not all positional lights with shadows may need to be processed every frame, as temporal supersampling can help improve performance here.

GI Pass

We need to check a GI technique, or offer the user different GI techniques based on performance/quality ratio, such as GI 1.0 to full path tracing.

For materials, we should probably just render most materials to a small texture (128x128) and use this information for GI bounces.

Decal Pass

As we are using a G-Buffer, a decal pass is probably a lot more optimal to do by just rasterizing the decals to it.

Volumetric Fog

Volumetric fog should work identically to what we have now, except instead of using shadow mapping, we can just raytrace from the points to test occlusion.

Light Pass

The light pass should be almost the same as in our (future) deferred renderer.

Subsurface Scatter Pass

This should work similar to how it works now.

Alpha Pass with OIT

Because we don't have shadow mapping, the alpha pass needs to happen a bit differently. The idea here is to use light pre-pass style rendering on the alpha side.

Basically, a 64-bit G-Buffer is used for alpha that looks like this:

uint obj_index : 18;
uint metallic : 7;
uint roughness : 7;
rg16 obj_normal; // encoded as octahedral
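A minimal sketch of packing and unpacking this layout follows. It makes several assumptions: the bit positions follow the shader code in this document (roughness at bit 18, metallic at bit 25), the normal is stored as two signed 16-bit values, and the result is written to a single rg32ui output instead of the two image stores used elsewhere in this document. The input and output names are hypothetical.

```glsl
#version 450

layout(location = 0) in vec3 normal_in;
layout(location = 1) flat in uint object_id;

// Hypothetical material inputs for the sketch.
layout(push_constant) uniform Material { float roughness; float metallic; } material;

// x = obj_index (18 bits) | roughness (7 bits) | metallic (7 bits), y = octahedral normal.
layout(location = 0) out uvec2 alpha_gbuffer;

uint pack_obj_rough_metal(uint obj_index, float roughness, float metallic) {
    uint r = uint(clamp(roughness, 0.0, 1.0) * 127.0 + 0.5);
    uint m = uint(clamp(metallic, 0.0, 1.0) * 127.0 + 0.5);
    return (obj_index & 0x3FFFFu) | (r << 18) | (m << 25);
}

void unpack_obj_rough_metal(uint v, out uint obj_index, out float roughness, out float metallic) {
    obj_index = v & 0x3FFFFu;
    roughness = float((v >> 18) & 0x7Fu) / 127.0;
    metallic = float((v >> 25) & 0x7Fu) / 127.0;
}

// Standard octahedral mapping of a unit vector to [-1,1]^2.
vec2 octahedron_encode(vec3 n) {
    n /= (abs(n.x) + abs(n.y) + abs(n.z));
    vec2 e = n.xy;
    if (n.z < 0.0) {
        e = (1.0 - abs(n.yx)) * vec2(n.x >= 0.0 ? 1.0 : -1.0, n.y >= 0.0 ? 1.0 : -1.0);
    }
    return e;
}

void main() {
    uint packed_obj = pack_obj_rough_metal(object_id, material.roughness, material.metallic);
    uint packed_normal = packSnorm2x16(octahedron_encode(normalize(normal_in)));
    alpha_gbuffer = uvec2(packed_obj, packed_normal);
}
```

With 18 bits for obj_index, up to 262143 lit transparent objects can be addressed, since index 0 is reserved for "nothing to do".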

Added to this is an "increment_texture" image texture in uint format at half resolution.

Alpha is done in two passes. The first pass renders lit objects, sorted from back to front. Unshaded objects are skipped.

The following code is run in the shader:

// This piece of code ensures that G-Buffer writes are rotated across a block of 2x2 pixels.

uvec2 frag_coord = uvec2(gl_FragCoord.xy);
uvec2 group_coord = frag_coord >> uvec2(1); // group is the block of 2x2 pixels
uvec2 store_coord;
vec2 combined_roughness_metallic;
vec3 combined_normal;
bool store = false;
while (true) {
      // Of all active in subgroup, find first broadcast ID.
      uvec2 first = subgroupBroadcastFirst(frag_coord);
      // Get the group (block of 2x2 pixels) of the first
      uvec2 first_group = first >> uvec2(1);
      if (first == frag_coord) {
            // If the first broadcast ID, increment the atomic counter and get the value.
            // The index is a value from 0 to 3, representing the pixel in the 2x2 block.
            // increment_texture is half resolution, so it is addressed with the group coordinate.
            uint index = imageAtomicAdd(increment_texture, ivec2(first_group), 1u) & 0x3u;
            // Convert the group coordinate back to pixels before adding the offset in the block.
            store_coord = (first_group << uvec2(1)) + uvec2(index & 1u, index >> 1);
      }
      // Broadcast the store coordinate
      store_coord = subgroupBroadcastFirst(store_coord);

      if (first_group == group_coord) {
             // If this pixel is part of the group being stored, then only store the relevant one
             // and discard the rest. This ensures that every write rotates the pixel in the 2x2 block.
             // Combined rm and normal of all 4 pixels
             vec3 crm = subgroupAdd(vec3(roughness, metallic, 1.0));
             combined_roughness_metallic = crm.rg / crm.b;
             combined_normal = normalize(subgroupAdd(normal));
             // Determine if the pixel that needs to be written actually is present (may not be part of the primitive)
             bool write_exists = bool(subgroupAdd(uint(frag_coord == store_coord)));

             if (write_exists) {
                store = store_coord == frag_coord;
             } else {
                store = first == frag_coord;
             }

             break;
      }
}
// Store G-Buffer
// It is important to _not_ use discard in this shader, to ensure early Z works and gets rid of unwanted writes.
if (store) {
   uint store_obj_rough_metallic = object_id;
   store_obj_rough_metallic |= clamp(uint(combined_roughness_metallic.r * 127.0), 0u, 127u) << 18;
   store_obj_rough_metallic |= clamp(uint(combined_roughness_metallic.g * 127.0), 0u, 127u) << 25;

   // octahedron_encode is assumed to return a vec2.
   imageStore(obj_id_metal_roughness_tex, ivec2(store_coord), uvec4(store_obj_rough_metallic, 0u, 0u, 0u));
   imageStore(normal_tex, ivec2(store_coord), vec4(octahedron_encode(combined_normal), 0.0, 0.0));
}

After this, a compute pass is run to compute the lighting of all transparent objects (obj_index == 0 means nothing to do). Light is written to an rgba16f G-Buffer. To accelerate the lookups in the next pass, the compute shader will also write, for every pixel, a u32 containing the following neighbouring info:

Table containing 3-bit values (bit ranges of the u32 for each neighbouring 2x2 block):

x - 2     x         x + 2
00 - 02   03 - 05   06 - 08
09 - 11   12 - 14   15 - 17
18 - 20   21 - 23   24 - 26

Each 3-bit value represents:

0x7: No neighbour

else, the pixel within the 2x2 block:

x   x + 1
0   1
2   3
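One way the compute pass could build this value is sketched below. It is only an illustration under assumptions: the binding layout, the r32ui format of the neighbours image and the OBJ_ID_MASK value (18 bits) are not specified by the design, and the lighting computation itself is left out.

```glsl
#version 450
layout(local_size_x = 8, local_size_y = 8) in;

layout(set = 0, binding = 0) uniform usampler2D obj_id_metal_roughness_tex;
layout(set = 0, binding = 1, r32ui) uniform writeonly uimage2D neighbours;

const uint OBJ_ID_MASK = 0x3FFFFu; // lower 18 bits hold obj_index (assumption)

void main() {
    ivec2 pos = ivec2(gl_GlobalInvocationID.xy);
    ivec2 size = textureSize(obj_id_metal_roughness_tex, 0);
    if (any(greaterThanEqual(pos, size))) {
        return;
    }

    uint obj_id = texelFetch(obj_id_metal_roughness_tex, pos, 0).x & OBJ_ID_MASK;
    ivec2 block = pos & ~ivec2(1); // origin of this pixel's 2x2 block
    uint bits = 0u;

    // Visit the 3x3 neighbouring blocks at offsets -2, 0 and +2 pixels.
    for (int i = 0; i < 9; i++) {
        ivec2 nblock = block + (ivec2(i % 3, i / 3) - 1) * 2;
        uint found = 0x7u; // 0x7 = no neighbour
        // obj_id == 0 means nothing to do, so every entry stays 0x7.
        for (int j = 0; j < 4 && obj_id != 0u; j++) {
            ivec2 p = nblock + ivec2(j & 1, j >> 1);
            if (any(lessThan(p, ivec2(0))) || any(greaterThanEqual(p, size))) {
                continue; // outside the screen
            }
            if ((texelFetch(obj_id_metal_roughness_tex, p, 0).x & OBJ_ID_MASK) == obj_id) {
                found = uint(j); // index 0..3 of the matching pixel in the block
                break;
            }
        }
        bits |= found << uint(i * 3);
    }

    imageStore(neighbours, pos, uvec4(bits, 0u, 0u, 0u));
}
```

Nine entries of 3 bits use 27 of the 32 bits, so the remaining 5 bits stay free for other flags.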

Finally, a second alpha pass is run, again from back to front. For shaded objects, lighting information is searched across the surrounding 36 pixels for objects that match, then interpolated and multiplied by the albedo.

The algorithm would look something like this:

uvec2 frag_coord = uvec2(gl_FragCoord.xy);
uvec2 base_lookup = frag_coord & (~uvec2(1,1)); // origin of the 2x2 block

uvec2 light_pos = uvec2(0xFFFF,0xFFFF);

// Find the pixel in this 2x2 block that stored lighting for the current object.
for(uint i = 0u ; i < 4u; i++) {
   uvec2 lookup_pos = base_lookup + uvec2(i&1u,(i>>1)&1u);
   uint obj_id = texelFetch(obj_id_metal_roughness_tex,ivec2(lookup_pos),0).x;
   if ((obj_id & OBJ_ID_MASK) == current_obj_id) {
      light_pos = lookup_pos;
      break;
   }
}

if (light_pos == uvec2(0xFFFF,0xFFFF)) {
    discard; // could not find any info to lookup, discard pixel.
}

uint neighbour_positions = texelFetch(neighbours,ivec2(light_pos),0).r;

vec4 light_accum = vec4(0.0);
// The table covers the blocks at x - 2, x and x + 2 (same for y).
ivec2 neighbour_base = ivec2(base_lookup) - ivec2(2,2);
for(int i=0;i<9;i++) {
   uint neighbour = (neighbour_positions >> (i*3))&0x7u;
   if (neighbour == 0x7u) {
      continue;
   }
   
   ivec2 neighbour_ofs = neighbour_base;
   neighbour_ofs.x += (i % 3) * 2 + int(neighbour&1u);
   neighbour_ofs.y += (i / 3) * 2 + int(neighbour>>1);
   float gauss = gauss_map(length(vec2(neighbour_ofs) - gl_FragCoord.xy)); // Use some gauss curve based on distance to pixel.
   // Weight the light by the gauss value so the division below yields a weighted average.
   light_accum += vec4(texelFetch(alpha_light,neighbour_ofs,0).rgb * gauss,gauss);
}

vec3 light = light_accum.rgb / max(light_accum.a, 0.0001);

light *= albedo;

// Store light with alpha blending.
...
@reduz
Author

reduz commented Oct 23, 2023

@devshgraphicsprogramming you don't have to, it's not really meaningful for occlusion in real life. Static objects, or even simply any object that did not move from the previous frame, are enough. And added to this, the buffer you reproject does not include dynamic objects anyway, so reprojection is pretty good to begin with.

@DethRaid

Are you sure that raytracing a depth buffer is the best approach? I'm struggling to see what advantage that has over traditional depth buffer rasterization

@reduz
Author

reduz commented Oct 23, 2023

@DethRaid You can do it at a much smaller resolution, so it does not need to be the same one you use for rasterizing later on.
The advantage of not using visibility lists is that it allows you to continue using many multi-pass techniques that are allowed thanks
to having control of rendering order, such as:

  • Fine grained stencil passes control (very common in VFX)
  • Terrain or water blending to geometry.
  • Custom buffer rendering (common for postprocessing)

And other optimizations that an engine like Godot requires for flexibility regarding multiple pass rendering.

This way if users download assets that require custom rendering passes, they continue working even if the renderer is GPU driven. A small reprojected + raytraced buffer should have very little cost and helps keep things simpler.

@devshgraphicsprogramming

devshgraphicsprogramming commented Oct 23, 2023

There are some good choices you're about to make (esp for your limited development resources where one should prioritize Single Code Path and Maintainability) which is going all in on raytracing for shadows and bindless. This I applaud.

I was about to give you the benefit of the doubt because I thought that by "flexibility" I thought you meant "ability to use Forward", until I read this:

Objects are rendered in multiple passes to a G-Buffer (deferred) by executing every shader in the scene together with their specific indirect draw list.

There's no reason for you NOT to use Visibility Buffer; in fact you lose flexibility because you'll probably ram a specific GBuffer layout down your users' throats.

Finally with GBuffer you always "pay for the most expensive material" because you end up having to fill channels other materials don't need.

Furthermore VisBuffer is always a win against Deferred, because you'll end up pushing less BW across the framebuffer, because you don't read your albedo, normalmap, roughness textures simply to have them output to a framebuffer and then re-read again for the shading pass.

This is especially relevant because you're attempting to use an "imperfect" Z-buffer for your culling so it means you have no Z-prepass, which in turn means you actually have overdraw on the opaque drawcalls, therefore VisBuffer would give you FURTHER performance savings from Deferred Texturing as opposed to using a G-Buffer.

Anyway you'd have most flexibility if you defined your Materials as kernels that run given some inputs (this is feasible since you're going bindless and RT for the shadows), then your material code remains the same for either:

  • ubershaders
  • Forward+
  • Deferred/VisBuffer
  • RT pipeline
    (we've done this for our Hybrid Path Tracer, Visbuffer Primary hit shader invokes same bunch of GLSL the indirect closest hit shader does).

This is super maintainable and has the added benefit of keeping your shading code (including texture accesses) EXACTLY the same across all render pipelines (main view, probes, lightmap baking, etc.)

@reduz
Author

reduz commented Oct 23, 2023

@devshgraphicsprogramming Again, the requirements for the engine make these kind of technical decisions unfeasible. General purpose game engines are different beasts than the kind of things you are used to in most of the rendering world.

If you take someone from the AAA world and put them to do a general purpose engine dev, they will struggle due to the extremely different requirements. I could try to explain those in more detail if you really want, but if you are not familiar with this it may not be easy at all to understand.

@devshgraphicsprogramming

devshgraphicsprogramming commented Oct 23, 2023

The terminal NIH disease

There seems to be an ongoing theme of "if it's in AAA it's not suitable", which is ironic given that Godot itself is currently using an AAA technique for culling from 2007. You are using AAA techniques, but from about 15 years ago.

While I understand that NIH is an endemic disease in the "homebrew engine" crowd, usually people have a much milder case where they simply just don't want to use code that wasn't written from scratch for their project but they'll at least do their homework and compare notes, exchange ideas with others, and learn on other people's trial and error.

You sir seem to have Stage IV of this disease which is to dismiss any idea/design that isn't yours and an unwillingness to consider other solutions even when they are presented on a silver platter.

Why this won't work

Ok let me take some time out of my busy schedule to explain why your culling idea won't work.

Lack of Generality (oh the irony)

You yourself state that this is general purpose engine, how is a technique that will have trouble with:

  • deformables (cloth, vegetation, skinned meshes, etc.)
  • dynamic objects (how are you going to reproject the depth of a moving object?)

general purpose at all? For now all I see is that the only occluders you'll support must be static triangle meshes. And by static, I mean truly static no movement from frame to frame even as a rigid body.

I thought general purpose could mean like a First Person game with you know, animated characters which could occlude vast portions of the screen?

You reiterate time and time again that this is not an AAA engine, hence don't you think that it would be nice NOT to require artists/users to make specialized simplified "occluder geometries" and have to remember to set them?

I mean you expect ray-tracing to be fast enough to replace a z-prepass, and hoping this will be the case "because there's only a few pixels to trace for".
Given that you want this to run on mobiles and the web and your raytracing software fallback layer will have to be faster than a z-prepass (because thats the only reason not to just do a z-prepass and not occlusion cull)
which will probably necessitate a separate, simpler BLAS per occluder, than the one you'll use for the shadow raytracing. Nice fun way to increase your memory footprint for no reason.

This is before I even point out that your users will sure appreciate having to "bake" occluder BLASes for their occluder mesh, which they'll also appreciate having to make and maintain.

The final nail in the coffin comes from the fact that unless you want to give up on streaming static chunks or like building the TLAS yourself (which you might for a fallback layer with Embree), Vulkan's Acceleration Structure is a black box.
If you want to use as much as a single different BLAS in an otherwise identical TLAS, you'll need to build a new TLAS from scratch (you can't just copy the shadow raytracing TLAS and hotswap the pointers to make it point at different simpler BLASes even if the input BLAS count and AABBs match).
This workload does not scale with resolution, needs to be done every frame, even if you make your culling depth buffer 1x1

2-4x more code to maintain, complexity and fragility

Again the AAA argument, you don't have the resources nor the expertise to maintain complex and duplicated codepaths.

Your design forces (I hope you're aware, but with every reply I lose faith) the renderer to partition the drawing into two distinct stages:

  • static objects
  • everything else

You then need to split your renderpass into two, so that you can "save a copy" of the depth buffer before you draw other non-static things into it. Them tiled mobile GPUs are sure gonna love that.
The fun part (as I promised to expand upon) is that as soon as something starts moving (i.e. a door) you'll need to exclude it from the static set and not draw its occluder, because you cannot reproject its depth.

There are only 2 ways to do occlusion culling

Basically it depends on whether you want rasterization or compute:

  • rasterization => abuse depth-test only, draw simple conservative occludee bounding volumes (simplest is AABB or OBB, can be convex hull) but with a fragment shader which writes out per-drawable visibility to SSBO (z-prepass like)
  • compute => HiZ by mip-mapping the depth and only testing a screenspace 2x2 AABB or the 3D AABB, like vkguide

The HW occlusion pixel counter queries are not an option, because only one can be active per drawcall and they are super slow even with conditional rendering (which was invented to save you from GPU->CPU readbacks).
Its suckiness is the reason why Depth Buffer + Occlusion Testing at low res on the CPU was popular at DICE and Crytek.

Mmm the latency!

So anyway, at some point before you even start testing objects for visibility after frustum culling, you'd need to reproject that previous frame partial depth buffer and raytrace the holes, but you can't do that before polling for input.
Then you need to do the occlusion tests, you don't have a shadowpass or anything else to keep the GPU busy in the meantime.

Have fun maintaining and optimizing the code

The divergence on the Reprojection and Raytracing shader is gonna be some next level stuff, I'd personally love to see the Nsight trace of how much time your SM spends idling if you ever get far enough to implementing it.

You'll probably dig yourself into a hole so deep you'll consider doing "poor man's Shader Invocation Reordering" at that point and blog about it as some cool invention.

Nobody (EDIT: fully) tried Depth Reprojection for a good reason

You're probably not the first person to come up with "last frame depth reprojection" as an idea, now think about why nobody went through with it.

EDIT: Yes, Assassin's Creed Unity used it, but they used reprojection differently to how you want to use it. First and foremost they still had a rough z-prepass with the actual next camera MVP.

Raytracing to "fill gaps" doesn't make the idea special.

Reprojection introduces artefacts - false culling positives

There is simply nothing to reproject, depths are point sampled and you cannot interpolate between them (even with a NEAREST filter). The depth values are defined and valid ONLY for pixel centers from the last frame.

A depth buffer used for culling needs to be conservative (or some people say eager), therefore the depth values for such a depth buffer can only be FARTHER than "ground truth".

No matter if you run a gather (SSR-like) or a scatter (imageAtomicMax/Min- then you've really lost your marbles).

Don't believe me, try reprojecting the depth buffer formed by static chain linked fence (alpha tested or not does not matter) and call me back.

Essentially every pixel turns into a gap that needs to be raytraced.

This makes no sense from a performance standpoint

The only sane way to reproject is via a gather, which is basically the same process as Screen Space Reflections or Parallax Occlusion Mapping.

Let me remind you that a z-prepass usually takes <1ms and if it takes more than that alternative methods are considered for culling.

You've now taken one of the most insanely expensive post-processes (maybe except for SSAO) and made it your pre-requisite to culling (slow clap).

To put the icing on the cake, a reprojected depth (programmatically written) disables HiZ, so any per-pixel visibility tests (if you use that) done by rasterizing the Occludee's Conservative Bounding Volume get magically many times slower.

Finally there's that whole polling for input, frustum culling, depth reprojection, occlusion culling dependency of the first renderpass which increases your latency.

Now imagine, if only a solution existed that gave you 99% correct visibility and at full resolution in far less time than a z-prepass or this weird SSR?

The Established "AAA" solution is more robust, general and simpler

I gave you a solution thats "essentially free", it gives you all the visibility data in the course of performing work you'd already be performing anyway which is the most robust thing that will ever exist for rasterization, it:

  • has actually been implemented before and used in production
  • requires no special HW to be efficient (unlike Ray-Tracing)
  • gives 100% pixel-perfect last frame visible drawable set
  • is doable in Forward+ as long as you have a z-prepass which you should have anyway
  • knows 95% of its Potentially Visible Set before the next frame starts, so you can start drawing right away, without incurring extra latency
  • has no issues with procedural or deformable geometry
  • requires no prebaking
  • requires no extra special geometries, metadata, settings or parameters/heuristics to tweak
  • is completely transparent to the user (no popping, no intervention needed)
  • 100% accurate and artefact free (the second depth testing pass takes care of disocclusions)
  • is scalable (you can interleave / subsample the visibility info, you'll just have more "disocclusions")

In case it wasn't clear both the "last frame visible" and "disocclusion" sets come from the intersection of the "post-frustum cull" set for the new frame, not the whole scene.

You're arguing against yourself

@devshgraphicsprogramming Again, the requirements for the engine make these kind of technical decisions unfeasible. General purpose game engines are different beasts than the kind of things you are used to in most of the rendering world.

You really don't have a leg to stand on for the decision to use a GBuffer deferred over Visbuffer, it does everything GBuffer does and more for everything from Low Poly or non-PBR 2D isometric casual games with no LoD to 3D PBR open-world games.

If you want to bring up the "barycentrics of deformable/tessellated geometry" argument, go ahead... you simply draw them last and output barycentrics + their derivatives to an auxiliary buffer, just as you would do for motion vectors for TAA/motion blur.

Except that you now no longer need the motion-vector aux buffer you'd have with GBuffer deferred because you have the barycentric coordinate and the triangle ID, so for deformables you can run the deform/tessellation logic (or you know, store the transformed vertices) for the previous frame in order to get your motion vector.

The only reason not to use VisBuffer over G-Buffer was the lack of ubiquity of bindless, now that you're going all-in on it, there's really no argument left here.

If you take someone from the AAA world and put them to do a general purpose engine dev, they will struggle due to the extremely different requirements. I could try to explain those in more detail if you really want, but if you are not familiar with this it may not be easy at all to understand.

A. This isn't engine dev, this is renderer dev
B. The entire gist is one massive proof that you should probably study some AA (3rd A intentionally missing) engine dev post mortems, because the infeasibility/inferiority of this whole design will become apparent about half-way through having spent all the resources to make it

Bonus Round: Order Independent Transparency

P.S. For OIT, you really can't beat MLAB4, works kind-of fine without pixel shader interlock on AMD (MLAB2 does not).

P.P.S. Yes you can prime the MLAB with an opaque layer so you're not processing transparent pixels behind opaques.

@reduz
Author

reduz commented Oct 24, 2023

@devshgraphicsprogramming

You sir seem to have Stage IV of this disease which is to dismiss any idea/design that isn't yours and an unwillingness to consider other solutions even when they are presented on a silver platter.

You seem to think I haven't done my homework, but I am very familiar with how visibility buffers work and also MLAB OIT and I am up to date on most rendering things.

That said, to be a successful engineer (and I think Godot has been quite a success), in my view you need to have two abilities:

  • Understand what your users need, and not assume it.
  • Be able to roll your own solutions if what your users need is not something others have solved.

I will make a last effort to explain to you why neither VisBuffers nor MLAB is a good fit for the Godot use case. Maybe for your use case it's fine and I respect that, but you have to respect when it's not the case for others.

Godot requirements

Shader compatibility

Godot users need to be able to use exactly the same shaders when using the OpenGL ES 3.0 backend as they do when using a GPU driven renderer. This is a hard requirement that can't be worked around. Other engines like Unity ask you to redo your shaders depending on HDRP or URP, or simply restrict users to making shaders with a visual interface. Users hate this and we don't want to make that mistake. Again, Godot users much prefer writing shader code on their own.

Godot uses its own shader language to accomplish the goal of shader portability. It's like a subset of GLSL ES 3.0 with our own parser and transpiler. The shaders have high level PBR outputs and are transpiled to whatever rendering implementation we use, be it simple multipass GLES 3.0 or clustered Vulkan. This is very useful because something like a uniform sampler2D can be converted to a regular sampler2D in GLES3, a texture2D with predefined samplers in Vulkan, or a bindless access in a GPU driven renderer, all behind the user's back.

The shaders for surfaces also always have to support vertex interpolation and custom vertex attributes, because this is important for a lot of users who do procgen or their own tooling that requires custom information per vertex, often interpolated to pixel.

Generally, modern VisBuffer / Raytrace based renderers tend to prefer just storing a material ID and then running materials in compute passes, which is really flexible and efficient, but for our use case we can't do this because we need to keep compatibility with existing shaders.

Render pass flexibility

There are a lot of custom and very nice techniques you can do when you have control of your render passes and not attempt to render the whole thing at once. A few examples:

  • Stencil or depth cutout effects
  • Terrain blending to geometry.
  • Water blending to shores
  • Hidden object outlining

These are all very "gamey" effects and users use this a lot. For this they need to have control of materials rendering in order before each other. In some cases (for terrain or shore blending), you need to draw a few materials, then do a copy of the depth buffer, then render your new material testing against depth and doing things like alpha hashing.

All this flexibility is lost when using visbuffers, or becomes much harder to work around, because you rely on the hard requirement of first drawing everything and then shading.

AAA games are generally fine with this because they can workaround these limitations manually, but in our case it means writing entirely different rendering code for this. Given Godot is an engine where users download most of these VFX or materials online, users expect the things they download "just work" and that they work in all renderers.

Fixed and predictable cost

I understand you don't like the idea of depth reprojection, but all your examples are just unrelated and don't apply in this context. For this specific technique:

  • The gather pass here for reprojection is dirt cheap, much, much cheaper than SSR or SSAO.
  • In real life, animated occluders don't have much weight on occlusion in general, only static objects do.
  • The door example is just very misleading, with raytraced occlusion the depth ray will hit the door anyway and not continue traversal, so the cost is still minimal.
  • A chain fence or alpha hashed whatever is never contributing to occlusion. Again a misleading example, it won't give you practical occlusion in VisBuffer either, so these are not even taken in consideration with raytraced occlusion.
  • In general, I think you are just missing the point here. This by default is raytracing into a low resolution depth buffer, where the reprojections are a mere optimization to throw fewer rays. It's very, very cheap and allows keeping the kind of renderer flexibility and shader compatibility users expect to have in the context of Godot.
  • Keeping the GPU busy is not a problem if you have a graph solver that batches your tasks (compute/raster/transfer/etc) together nowadays.
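To make "reprojection as an optimization to throw fewer rays" concrete, here is a toy sketch on a one-dimensional depth buffer. It rests on loud assumptions: the camera only translates sideways, the screen-space shift of a sample is modelled as `camera_shift / depth`, and `trace_depth` is a hypothetical stand-in for the actual depth ray. It illustrates the idea only and is not the gist's implementation.

```python
def reproject_depth(prev_depth, camera_shift):
    """Scatter last frame's depth samples to their new pixel (nearest sample wins)."""
    width = len(prev_depth)
    cur = [None] * width
    for x, d in enumerate(prev_depth):
        if d is None:
            continue
        # Assumed parallax model: near samples move further across the screen.
        nx = x + round(camera_shift / d)
        if 0 <= nx < width and (cur[nx] is None or d < cur[nx]):
            cur[nx] = d
    return cur


def fill_holes_with_rays(depth, trace_depth):
    """Trace a ray only for pixels that received no reprojected sample."""
    rays = 0
    for x in range(len(depth)):
        if depth[x] is None:
            depth[x] = trace_depth(x)
            rays += 1
    return rays


prev = [1.0, 1.0, 4.0, 4.0]
cur = reproject_depth(prev, 1.0)
rays_thrown = fill_holes_with_rays(cur, lambda x: 9.0)  # hypothetical ray result
```

In this toy case only one of the four pixels needs a ray. With no previous frame available, every pixel is a hole and the scheme falls back to tracing the whole low resolution buffer.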

Additionally, the reason I am supplying a custom OIT alpha scheme here and not going to MLAB4 is to have fixed and predictable cost. As you probably read, I intend to use only raytraced shadows. MLAB4 implies you have to shade multiple times per pixel, so to me it's a no-brainer that this is useless here. Instead, I prefer to use an OIT alpha technique that is kind of similar to MLAB but has constant cost (one shading per pixel) at the cost of lower frequency lighting. If you take a look at all the MLAB examples in their paper, you will see none requires high frequency lighting.

To sum up

Good software is written understanding your users' requirements. What you do in your software is irrelevant to what has to be done in other software if their requirements are different. I am nobody to tell you how to do your path tracer; likewise, you should understand other software's context before trying to tell them how to do things.

@devshgraphicsprogramming

devshgraphicsprogramming commented Oct 24, 2023

I initially wanted not to respond because we're talking past each other.

However, I've seen some comments and content from some people that make me concerned that they think I'm dunking on Reduz the person as opposed to this occlusion culling idea (and the choice of GBuffer Deferred where you have a choice of Forward+ or VisBuffer), and/or take it as a license to do so themselves.

After the vblanco incident, I was curious if you'd take and process sound criticism, especially if it was abrasively formed on purpose.

To reiterate let me summarize the points:

  • your Occlusion Culling idea will either not work, or you'll have to relax the reprojection to the point it will provide no net benefit
  • I'm not telling you you have to use VisBuffer or Forward+, just saying GBuffer makes no sense given the former two alternatives
  • VisBuffer is essentially GBuffer compression + deferred texturing & attribute pulling; it does not imply you'll shade in compute, do material sorting or god-knows-what else, so whatever arguments you use about GBuffer being flexible you can also make about VisBuffer with the final shading pass as a fragment shader or whatever
  • Two-Pass occlusion culling works with Forward+, GBuffer or VisBuffer, as long as you have no overdraw (what I described is identical to the RedLynx technique)
  • the MLAB stuff is a Post Scriptum, it's just a suggestion, your Inferred-Lighting-like thing makes sense

TL;DR All the technical choices outlined above make sense in some situations only you know or have reasons to presume are the common case, except for this depth reprojection occlusion culling.

To anyone who thinks I wholesale dismiss this design or Reduz' skills in Graphics/Vulkan programming - read the thread again.

I especially commend the "no shadowmapping" and Bindless decisions.

@reduz
Author

reduz commented Oct 24, 2023

@devshgraphicsprogramming

your Occlusion Culling idea will either not work, or you'll have to relax the reprojection to the point it will provide no net benefit

Guess the only way to know is to try, although for the time being this will not be worked on, will most likely happen sometime next year. Will be happy to chat again about this then!

@devshgraphicsprogramming

@devshgraphicsprogramming

your Occlusion Culling idea will either not work, or you'll have to relax the reprojection to the point it will provide no net benefit

Guess the only way to know is to try, although for the time being this will not be worked on, will most likely happen sometime next year. Will be happy to chat again about this then!

If you implement both methods and you manage to make your RT mini-depth buffer work faster, I'll publicly eat my shoe.

@reduz
Author

reduz commented Oct 24, 2023

@devshgraphicsprogramming

I never claimed it should be faster, but simply fast enough. What matters to me is to keep flexibility during rendering.

@zomdiax5

I'm not really versed in how the renderer/engine works, but couldn't those features be added to existing renderers? And is the amount of users in need of them enough to warrant creating yet another renderer? I fear that adding yet another rendering engine would spread the work people do too thin. Also, it might be a good idea to add those points to the FAQ.

@reduz
Author

reduz commented Oct 25, 2023

@zomdiax5 Keep in mind that in Godot 4 the whole rendering architecture is very modular, so a new renderer is relatively small and easy to maintain nowadays.

@Saul2022

Wouldn't it be good to make a small prototype implementation of that occlusion culling to make sure it works as intended, and then move on with that idea?

@reduz
Author

reduz commented Oct 25, 2023

@Saul2022 It should be harmless to test, but there are many other waaaay more important things to do before a GPU driven renderer is implemented.

@AttackButton

AttackButton commented Oct 25, 2023

If you read my text above, you will see why this approach is not wanted. It's fantastic for AAA games, but Godot being a general purpose game engine has different requirements.

Hey, man. I know there is a certain perception that the design of this engine is not to be AAA. However, wouldn't it be interesting to create some kind of poll asking the community if they want a AAA renderer?

You've done polls for different topics before, I believe this is perhaps the most important of all.

Please listen to the community on this.

@patwork

patwork commented Oct 25, 2023

Hey, man. I know there is a certain perception that the design of this engine is not to be AAA. However, wouldn't it be interesting to create some kind of poll asking the community if they want a AAA renderer?

No. An "AAA" renderer is useless when the rest of the engine is not up to par in quality. Godot is a game development engine, not a visualization engine.

https://godotengine.org/article/whats-missing-in-godot-for-aaa/

@reduz
Author

reduz commented Oct 25, 2023

@AttackButton @patwork Having a renderer that looks like Unreal is not a problem. It is not the graphics that define these kinds of engines.

An "AAA engine" pretty much means that your whole content workflow is designed for a team of hundreds of people pushing assets and logic into the game at the same time, or that everything can be tweaked to the millimeter to accommodate a single game. None of these things are of concern to Godot users (or even Unity users); it's an entirely different territory.

@AttackButton

AttackButton commented Oct 25, 2023

or that everything can be tweaked to the millimeter to accommodate a single game

I agree with that part and it's clear that Godot's design exists to avoid something like this. However, Unreal's default renderer is already incredibly powerful, and yet it's not only AAA studios that use it (tweaking everything); indie devs are using this engine as well (Blueprint).

Anyway, I don't see a conflict between being "easy to use" and having a AAA renderer. What's the concern, mobile games/Apps? Even so, couldn't the user tweak some options in the editor to reduce the "potential" of the renderer?

@reduz
Author

reduz commented Oct 25, 2023

@AttackButton Oh, the goal is to continue working on improving rendering during this year, it's just that currently the effort is more focused on performance, which is what has the highest demand.

@cshenton

cshenton commented Nov 3, 2023

Could someone please summarise the reasons not to use the standard two-pass occlusion culling technique?

It's very straightforward to implement, can be dispatched at any granularity (per object, per meshlet, etc.), works with dynamic (even vertex animated, with tolerances) objects, changing LODs, different render architectures, compute and hardware raster, can be used in the shadow passes, etc.

The assertion that dynamic geometry isn't a meaningful occluder assumes a very specific approach to game development (big static level with a small number of dynamic characters) that I don't think a general purpose engine should make. Much of the work you do on two-pass is work you'd do anyway doing a Z-prepass, and it provides significant speed boosts on anything from more optimised game assets to film-quality geometry.

@Saul2022

I know the idea is already set, and that there are other priorities, but for the future it could be good to have more compatibility with older devices, like this recent implementation from Unity: https://forum.unity.com/threads/gpu-driven-rendering-in-unity.1502702/ https://t.co/srz2yNciNt

@MaxLykoS

In a GPU driven pipeline, how do you record draw calls on the CPU side? Do you cull them on the CPU (frustum/occlusion, etc.) to reduce empty draw calls, or simply dispatch all the indirect draw calls ignoring visibility?

@ODtian

ODtian commented Apr 29, 2024

@devshgraphicsprogramming I'm new to this, is 2 step culling using a visibility buffer feasible with forward shading and current opaque rendering? Can this be combined with a software rasterizer like Nanite's to support pixel-level triangles (would this mean we must shade in compute)? And it would be great if you have any reference doc or repo. Thank you!

@devshgraphicsprogramming

@devshgraphicsprogramming I'm new to this, is 2 step culling using a visibility buffer feasible with forward shading and current opaque rendering? Can this be combined with a software rasterizer like Nanite's to support pixel-level triangles (would this mean we must shade in compute)? And it would be great if you have any reference doc or repo. Thank you!

As long as you have the DrawID handy in some pass and there's little to no overdraw, i.e. when doing Forward+, this will work.

Visbuffer is not necessary, but what is necessary is each pixel casting a ballot about visible objects.

You can have false positives (i.e. overdraw) but that will make it more costly to cast and less efficient at culling.
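A small sketch may help show why overdraw produces false positives in the ballot. It assumes a one-dimensional screen and fragments given as hypothetical `(pixel, depth, draw_id)` tuples. With a completed z-prepass, only the front-most fragment per pixel passes an EQUAL depth test and votes. Without one, any fragment that passes a LESS test at the time it is drawn votes, even if it is overwritten later.

```python
def ballot(fragments, width, draw_count, prepass):
    """Return per-draw visibility flags recorded by the shading pass."""
    depth = [float("inf")] * width
    if prepass:
        # Complete z-prepass: depth already holds the nearest surface per pixel.
        for px, z, _ in fragments:
            depth[px] = min(depth[px], z)
    visible = [False] * draw_count
    for px, z, draw_id in fragments:
        if prepass:
            passed = z == depth[px]   # EQUAL test, no overdraw
        else:
            passed = z < depth[px]    # LESS test, overdraw possible
            if passed:
                depth[px] = z
        if passed:
            visible[draw_id] = True   # the pixel casts its ballot
    return visible


# Draw 0 is a far wall submitted first, draw 1 a near object covering the same pixel.
frags = [(0, 5.0, 0), (0, 1.0, 1)]
with_prepass = ballot(frags, 1, 2, prepass=True)      # [False, True]
without_prepass = ballot(frags, 1, 2, prepass=False)  # [True, True]
```

The second result marks the hidden wall as visible, which is still conservative but means it will be drawn again next frame instead of being culled.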

@ODtian

ODtian commented Apr 29, 2024

As long as you have the DrawID handy in some pass and there's little to no overdraw, i.e. when doing Forward+, this will work.

Visbuffer is not necessary, but what is necessary is each pixel casting a ballot about visible objects.

You can have false positives (i.e. overdraw) but that will make it more costly to cast and less efficient at culling.

So in forward rendering, it's basically like: cull geometry using the last frame's buffer (in this case just instance ID and depth), generate depth for this frame, use it as the depth prepass and do forward shading. Lastly, generate instance IDs for the next frame. Is that right?

@devshgraphicsprogramming

For Forward+:

  1. draw what was visible last frame in your z-prepass
  2. cull what wasn't visible last frame against your partial z-prepass
  3. draw whatever passed (2) into your z-prepass, now it's complete
  4. do forward shading & ballot/record what was truly visible this frame
  5. [optional] draw transparent things

Transparent things you always treat as (2), because you don't want to murder your fillrate with per-pixel ballots of transparent pixels.
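Steps 1 to 4 above can be sketched on the CPU as follows. This is a toy model under stated assumptions: a one-dimensional screen, objects given as hypothetical `(x_start, x_end, depth)` spans at a single constant depth, and Python sets standing in for the GPU-side visibility list. Transparent objects (step 5) are left out.

```python
INF = float("inf")


def draw_depth(depth, obj):
    """Rasterize one object's depth into the 1D z-prepass (nearest wins)."""
    x0, x1, z = obj
    for x in range(x0, x1):
        depth[x] = min(depth[x], z)


def is_occluded(depth, obj):
    """True when every pixel the object covers already holds a nearer depth."""
    x0, x1, z = obj
    return all(depth[x] < z for x in range(x0, x1))


def two_pass_frame(objects, visible_last_frame, width):
    depth = [INF] * width
    # 1. Draw what was visible last frame into the z-prepass.
    for i in visible_last_frame:
        draw_depth(depth, objects[i])
    # 2. Cull everything else against the partial z-prepass.
    survivors = [i for i in range(len(objects))
                 if i not in visible_last_frame and not is_occluded(depth, objects[i])]
    # 3. Draw the survivors; the z-prepass is now complete.
    for i in survivors:
        draw_depth(depth, objects[i])
    # 4. Shade with an EQUAL depth test and record which draws own a pixel.
    visible_now = set()
    for i in list(visible_last_frame) + survivors:
        x0, x1, z = objects[i]
        if any(depth[x] == z for x in range(x0, x1)):
            visible_now.add(i)
    return visible_now, survivors


# A wall, a crate hidden behind it, and a crate next to it.
objects = [(0, 4, 1.0), (1, 3, 5.0), (4, 6, 5.0)]
frame1, tested1 = two_pass_frame(objects, set(), 6)
frame2, tested2 = two_pass_frame(objects, frame1, 6)
```

On the first frame nothing is known, so every object survives the cull and is drawn. On the second frame the hidden crate is rejected in step 2 and never reaches the rasterizer.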

@ODtian

ODtian commented Apr 29, 2024

@devshgraphicsprogramming I get it. Thanks!

@octanejohn

Another project using depth reprojection with path tracing in VR: https://github.com/VirtualEngineeringLab/Pathtracing