

@devshgraphicsprogramming
Created October 23, 2023 22:48
Don't even dream of reprojecting last frame depth for occlusion culling, you will fail!

Originally posted as a reply to: https://gist.github.com/reduz/c5769d0e705d8ab7ac187d63be0099b5

Turned into a gist due to the high likelihood of deletion. Also edited down to exclude irrelevant trolling, so that it's useful to anyone else considering depth reprojection.

Yes, I know SSR and Parallax Corrected Shadowmaps work, but the consequences of errors in those depth tests aren't as severe.

Lack of Generality (oh the irony)

You yourself state that this is a general-purpose engine, so how is a technique that will have trouble with:

  • deformables (cloth, vegetation, skinned meshes, etc.)
  • dynamic objects (how are you going to reproject the depth of a moving object?)

general-purpose at all? For now, all I see is that the only occluders you'll support must be static triangle meshes. And by static I mean truly static: no movement from frame to frame, not even as a rigid body.

I thought general-purpose could include, say, a first-person game with, you know, animated characters that could occlude vast portions of the screen?

You reiterate time and time again that this is not an AAA engine, so don't you think it would be nice NOT to require artists/users to make specialized simplified "occluder geometries" and remember to set them?

I mean, you expect ray tracing to be fast enough to replace a z-prepass, hoping this will be the case "because there are only a few pixels to trace". You want this to run on mobile and the web, so your software ray-tracing fallback layer will have to be faster than a z-prepass (that's the only reason not to just do a z-prepass and skip occlusion culling). That will probably necessitate a separate, simpler BLAS per occluder than the one you'll use for shadow ray tracing. Nice fun way to increase your memory footprint for no reason.

This is before I even point out that your users will sure appreciate having to "bake" occluder BLASes for their occluder mesh, which they'll also appreciate having to make and maintain.

The final nail in the coffin is that, unless you want to give up on streaming static chunks or build the TLAS yourself (which you might do for a fallback layer with Embree), Vulkan's Acceleration Structure is a black box. If you want to use even a single different BLAS in an otherwise identical TLAS, you'll need to build a new TLAS from scratch. You can't just copy the shadow ray-tracing TLAS and hot-swap the pointers to make it point at different, simpler BLASes, even if the input BLAS count and AABBs match. This workload does not scale with resolution and needs to be done every frame, even if you make your culling depth buffer 1x1.

2-4x more code to maintain, complexity and fragility

Again, the AAA argument: you don't have the resources or the expertise to maintain complex and duplicated codepaths.

Your design forces (I hope you're aware, but with every reply I lose faith) the renderer to partition the drawing into two distinct stages:

  • static objects
  • everything else

You then need to split your renderpass in two, so that you can "save a copy" of the depth buffer before you draw other, non-static things into it. Tiled mobile GPUs are sure gonna love that, since splitting the pass forces the depth tile contents to be stored out to memory and loaded back in. The fun part (as I promised to expand upon) is that as soon as something starts moving (e.g. a door), you'll need to exclude it from the static set and not draw its occluder, because you cannot reproject its depth.

There are only 2 ways to do occlusion culling

Basically it depends on whether you want rasterization or compute:

  • rasterization => abuse depth-test only (no depth writes): draw simple conservative occludee bounding volumes (the simplest is an AABB or OBB, but it can be a convex hull) with a fragment shader that writes out per-drawable visibility to an SSBO (z-prepass-like)
  • compute => HiZ: mip-map the depth buffer and test only a 2x2 texel footprint of the occludee's screen-space AABB at the matching mip level (or test the 3D AABB), like vkguide
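Both bullets boil down to the same conservative comparison. An occludee is culled only if its nearest depth is farther than the farthest occluder depth over its screen footprint. The sketch below emulates both variants on the CPU in Python, purely for illustration. It makes several assumptions: view-space depth where larger means farther (with reversed-Z you would swap max/min and flip the comparisons), a square power-of-two depth buffer, and a screen-space rectangle plus a single nearest depth standing in for the occludee's bounding volume. The names `raster_visibility`, `build_hiz` and `hiz_occluded` are hypothetical and are not vkguide's code.

```python
def raster_visibility(depth, occludees):
    """Rasterization flavour: depth-test each occludee's bounding rect (no depth writes)
    and let a pretend fragment shader flag visibility in an SSBO-like list."""
    visibility = [0] * len(occludees)
    for obj_id, (x0, y0, x1, y1, nearest_z) in enumerate(occludees):
        for y in range(y0, y1 + 1):
            for x in range(x0, x1 + 1):
                if nearest_z <= depth[y][x]:      # LESS_OR_EQUAL depth test passes
                    visibility[obj_id] = 1        # the "fragment shader" SSBO write
    return visibility


def build_hiz(depth):
    """Compute flavour: max-reduce a square power-of-two depth buffer into a mip chain.
    Each coarser texel keeps the FARTHEST of its 2x2 children, so tests stay conservative."""
    mips = [depth]
    while len(mips[-1]) > 1:
        src = mips[-1]
        half = len(src) // 2
        mips.append([[max(src[2 * y][2 * x], src[2 * y][2 * x + 1],
                          src[2 * y + 1][2 * x], src[2 * y + 1][2 * x + 1])
                      for x in range(half)] for y in range(half)])
    return mips


def hiz_occluded(mips, x0, y0, x1, y1, nearest_z):
    """Test an inclusive mip-0 texel rect by reading at most 2x2 texels of one mip level."""
    extent = max(x1 - x0, y1 - y0) + 1
    level = 0
    while (1 << level) < extent and level < len(mips) - 1:
        level += 1                                # rect now spans <= 2 texels per axis
    mip = mips[level]
    farthest = max(mip[y][x]
                   for y in range(y0 >> level, (y1 >> level) + 1)
                   for x in range(x0 >> level, (x1 >> level) + 1))
    return nearest_z > farthest                   # behind everything it could overlap


depth = [[1.0] * 8 for _ in range(8)]             # a near wall covering an 8x8 screen
depth[3][3] = 100.0                               # one texel sees far through a gap
mips = build_hiz(depth)
print(raster_visibility(depth, [(2, 2, 4, 5, 5.0)]), hiz_occluded(mips, 2, 2, 4, 5, 5.0))  # prints [1] False
```

The Hi-Z path reads at most four texels per occludee. A coarse texel covers more screen than the rectangle itself, so this path can keep objects that the per-fragment test would have culled. That is the price of the cheaper test, but it never produces a wrong cull.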

The HW occlusion pixel-counter queries are not an option, because only one can be active per drawcall and they are super slow, even with conditional rendering (which was invented to save you from GPU->CPU readbacks). Their suckiness is the reason why rasterizing a low-res depth buffer and occlusion testing on the CPU was popular at DICE and Crytek.

Mmm the latency!

So anyway, at some point before you even start testing objects for visibility after frustum culling, you'd need to reproject that partial depth buffer from the previous frame and ray trace the holes. You can't do that before polling for input, because the reprojection needs the new frame's camera. Then you need to do the occlusion tests, and you don't have a shadow pass or anything else to keep the GPU busy in the meantime.

Have fun maintaining and optimizing the code

The divergence in the reprojection and ray-tracing shaders is gonna be some next-level stuff. I'd personally love to see the Nsight trace of how much time your SMs spend idling, if you ever get far enough to implement it.

You'll probably dig yourself into a hole so deep you'll consider doing "poor man's Shader Invocation Reordering" at that point and blog about it as some cool invention.

Nobody tried Depth Reprojection for a good reason

You're probably not the first person to come up with "last frame depth reprojection" as an idea, now think about why nobody went through with it.

Raytracing to "fill gaps" doesn't make the idea special.

Reprojection introduces artefacts - false culling positives

There is simply nothing to reproject: depths are point samples, and you cannot interpolate between them (not even with NEAREST filtering). The depth values are defined and valid ONLY at last frame's pixel centers.

A depth buffer used for culling needs to be conservative (or, as some people say, eager). Therefore, the depth values in such a depth buffer can only be FARTHER than the "ground truth".

This holds no matter whether you run a gather (SSR-like) or a scatter (imageAtomicMax/Min, at which point you've really lost your marbles).

Don't believe me? Try reprojecting the depth buffer formed by a static chain-link fence (alpha-tested or not, it doesn't matter) and call me back.

Essentially every pixel turns into a gap that needs to be raytraced.
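The gap problem can be reproduced with a toy 1D forward (scatter) reprojection. The sketch below illustrates the point under assumed conditions and is not anyone's production code. It assumes a 1D pinhole camera with focal length `focal` in pixels and view depth where larger means farther. Every last-frame pixel contributes a single point at its pixel center. Collisions are resolved by keeping the farther depth, the conservative imageAtomicMax-style choice. Unreached pixels stay at infinity, meaning "unknown, must be ray traced". The names `reproject_scatter` and `count_holes` are hypothetical.

```python
import math

FAR = math.inf  # "no occluder known here": nothing can be culled against this pixel


def reproject_scatter(depth, focal, tx=0.0, tz=0.0):
    """Forward-warp a 1D row of last-frame view depths into a camera moved by (tx, tz).

    Each source pixel contributes one point (its pixel center). When several points land
    in the same destination pixel the FARTHER depth wins (conservative, imageAtomicMax-like);
    destination pixels that receive no point stay FAR and would have to be ray traced.
    """
    width = len(depth)
    out = [FAR] * width
    hit = [False] * width
    for x, z in enumerate(depth):
        s = (x + 0.5 - width / 2) / focal      # normalized coordinate of the pixel center
        px, pz = s * z - tx, z - tz            # the same point in the new camera's view space
        if pz <= 0.0:
            continue                           # now behind the camera
        nx = math.floor(focal * px / pz + width / 2)
        if 0 <= nx < width:
            out[nx] = max(out[nx], pz) if hit[nx] else pz
            hit[nx] = True
    return out


def count_holes(row):
    """Number of pixels that received no reprojected sample."""
    return sum(1 for d in row if d == FAR)


fence = [2.0 if x % 2 == 0 else 20.0 for x in range(64)]   # near wire / far background
print(count_holes(reproject_scatter(fence, focal=32.0, tx=0.5)))  # prints 32
```

In this fence case, a sideways camera move of 0.5 units shifts the near samples by 8 pixels and the far ones by 0.8 pixels. Every odd destination pixel ends up with no sample at all, so half the row turns into holes from a tiny camera move.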

This makes no sense from a performance standpoint

The only sane way to reproject is via a gather, which is basically the same process as Screen Space Reflections or Parallax Occlusion Mapping.

Let me remind you that a z-prepass usually takes <1 ms, and if it takes more than that, alternative methods are considered for culling.

You've now taken one of the most insanely expensive post-processes (maybe except for SSAO) and made it a prerequisite to culling (slow clap).

To put the icing on the cake, a reprojected (programmatically written) depth disables HiZ. Any per-pixel visibility tests done by rasterizing the occludee's conservative bounding volume (if you use them) get magically many times slower.

Finally, there's that whole dependency chain of polling for input, frustum culling, depth reprojection and occlusion culling ahead of the first renderpass, which increases your latency.

Now imagine if only a solution existed that gave you 99% correct visibility, at full resolution, in far less time than a z-prepass or this weird SSR?

The Established "AAA" solution is more robust, general and simpler

I gave you a solution that's "essentially free". It gives you all the visibility data in the course of performing work you'd already be performing anyway, and it's the most robust thing that will ever exist for rasterization. It:

  • has actually been implemented before and used in production
  • requires no special HW to be efficient (unlike ray tracing)
  • gives a 100% pixel-perfect set of drawables visible last frame
  • is doable in Forward+ as long as you have a z-prepass, which you should have anyway
  • knows 95% of its Potentially Visible Set before the next frame starts, so you can start drawing right away without incurring extra latency
  • has no issues with procedural or deformable geometry
  • requires no prebaking
  • requires no extra special geometries, metadata, settings or parameters/heuristics to tweak
  • is completely transparent to the user (no popping, no intervention needed)
  • is 100% accurate and artefact-free (the second depth-testing pass takes care of disocclusions)
  • is scalable (you can interleave / subsample the visibility info, you'll just have more "disocclusions")

In case it wasn't clear, both the "last frame visible" and "disocclusion" sets are intersected with the new frame's post-frustum-cull set, not taken from the whole scene.
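The set bookkeeping from the paragraph above can be written as a minimal Python sketch. It assumes drawables are hashable IDs and abstracts the GPU depth test of a bounding volume against the first pass's depth into a caller-supplied predicate. The names `split_two_pass` and `hidden_by_first_pass` are hypothetical. In a real renderer, the visible-last-frame set for the following frame would come from the per-pixel visibility writes made while drawing both passes, not from this function.

```python
def split_two_pass(in_frustum, visible_last_frame, hidden_by_first_pass):
    """Partition this frame's frustum-culled drawables for a last-frame-visible dual pass.

    in_frustum:           drawables that survived frustum culling for the NEW frame
    visible_last_frame:   drawables that wrote at least one visible pixel last frame
    hidden_by_first_pass: predicate, True if a drawable's bounding volume fails the depth
                          test against the depth buffer produced by the first pass
    """
    first_pass = in_frustum & visible_last_frame           # drawn right away, no culling wait
    candidates = in_frustum - visible_last_frame           # only possible disocclusions
    second_pass = {d for d in candidates if not hidden_by_first_pass(d)}
    return first_pass, second_pass


first, second = split_two_pass({"wall", "door", "crate", "npc"},
                               {"wall", "door", "tree"},
                               hidden_by_first_pass=lambda d: d == "crate")
print(sorted(first), sorted(second))   # prints ['door', 'wall'] ['npc']
```

"tree" was visible last frame but is no longer in the frustum, so it never reaches either pass. That is the intersection with the post-frustum-cull set described above.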

@nem0

nem0 commented Oct 24, 2023

IIRC, at least some games from the Assassin's Creed series did use depth reprojection. And I think Sebastian Aaltonen mentioned using it too. But maybe I just don't remember correctly.

@tuto193

tuto193 commented Oct 24, 2023

He did indeed here

@devshgraphicsprogramming
Author

devshgraphicsprogramming commented Oct 24, 2023

It's still a depth prepass plus a little bit of reprojection (which is cheaper than the downsampling pass).
Think about how rough and hand-wavy that reprojection has to be to run in 0.05 ms on a GCN GPU.

It will still not be robust for moving geometry, and it's vastly inferior to the now-standard last-frame-visible dual-pass pipeline.

His full-res 1080p dual-pass RedLynx solution (the same thing as what I proposed) takes 0.73 ms, while the depth reprojection at 512x256 takes 0.7 ms.
That's testing visibility at 15.8x HIGHER resolution.
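For reference, the 15.8x figure is the pixel-count ratio of the two resolutions quoted above:

```latex
\frac{1920 \times 1080}{512 \times 256} = \frac{2\,073\,600}{131\,072} \approx 15.82
```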

Also, the data we're missing is how much longer the G-buffer took to render with all the visibility false positives of the old method.

@devshgraphicsprogramming
Author

P.S. The differences between a 512x256 HiZ cull and a full-res one get more apparent when you use meshlets and higher polycounts, since the culling AABBs get smaller and you benefit from testing against a less coarse depth buffer.

@nem0

nem0 commented Oct 24, 2023

@Xrayez

Xrayez commented Dec 7, 2023

Thanks for challenging Juan! I'm a former maintainer of Godot, by the way.

You mentioned Godot's NIH syndrome in your other replies to Juan, and I can fully confirm it:

Waiting for Blue Robot - NIH

The book provides many other amazing insights about Godot's actual practices, if curious. 🤣

@devshgraphicsprogramming
Author

NIH actually makes sense, but you need to outgrow your training wheels (SDL, etc.) first, which means pushing them to their limits.

@Xrayez

Xrayez commented Dec 7, 2023

Yes, it depends entirely on what one wants to achieve, but first, one must know what one wants to achieve to begin with. I wrote a proposal for Godot that aimed to shed light on its development philosophy, but it has been sitting there for years:

godotengine/godot-proposals#575

Eventually, they banned me from the project for raising and discussing governance and management issues like these, despite my being a valuable member of the community. That should tell you there's something fundamentally wrong with Godot if they go to such extreme measures to suppress discussion. This has happened to other members of Godot too. You even stated in your gist that there is a "high likelihood of deletion". Yes, they do tend to delete critical posts about Godot eventually.

If you believe in Godot's ideas, I suggest creating a fork of it at this point. However, I have personally moved on from Godot and abandoned everything related to it due to its pervasive ideology. Godot only looks good on paper, so to speak. Juan's gaslighting tactics can mess with your brain if you keep trying to convince him of anything, and he will eventually "win" the more you interact with him, because he's a master of gaslighting.

Regardless of how technically proficient your arguments may be, the abysmal management and governance of Godot is the fundamental issue that makes technical discussions unnecessary in deciding whether to adopt Godot. If they cannot competently run the organization, then no number of donations or code contributions will be able to aid Godot.

I have been trying to communicate with Godot's leadership for the past five years, on similar technical topics such as yours, with no success. I am trying to save you some time.

Good luck.

@WickedInsignia

WickedInsignia commented Dec 12, 2023

3. Not that long ago they hired a dedicated contributor to start working on their own phys engine **AGAIN**, while the entire community that has physics-reliant-projects collectively started using Jolt plugin instead.

That contributor was chased away from the project entirely due to bullying.
Camille now works with Rockstar.

Thanks for this writeup Devsh.
