Fast Fur Render Pipeline options

From Warren's Fast Fur Shader
Revision as of 16:17, 9 July 2023 by Warren (talk | contribs) (Added an intro section)

What is a "shader"?

GPU programs (i.e. "shaders") aren't like regular programs. They run on the GPU, not the CPU. They are also massively parallel, running hundreds or even thousands of copies of themselves at the same time.

The word "shader" is used interchangeably to describe either an entire beginning-to-end shader pipeline (e.g. "Warren's Fast Fur Shader"), or one of the self-contained stages of that pipeline (e.g. a "Vertex shader", "Geometry shader", etc.).

The shader currently uses the Unity Built-in Render Pipeline (BiRP). This graphics pipeline consists of 4 programmable shader stages: Vertex, Hull + Domain, Geometry, and Fragment:

The 4 programmable stages of the Unity BiRP. The "Hull + Domain" and "Geometry" stages are optional.

Each shader stage processes the information it is given, then outputs the result. The GPU drivers are then responsible for transferring that output to the next stage, as well as performing non-programmable functions such as combining individual vertices into triangles, or trimming triangles that are only partially on-screen.
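The hand-off between stages can be sketched as a chain of functions. This is only a CPU-side Python model for illustration, not shader code: the names `run_pipeline` and `assemble_triangles` are hypothetical, the fixed-function work is reduced to grouping vertices into triangles, and the Fragment stage is simplified to run once per triangle rather than once per pixel.

```python
from typing import Callable, List, Optional, Tuple

Vertex = Tuple[float, float, float]
Triangle = Tuple[Vertex, ...]
TriangleStage = Callable[[List[Triangle]], List[Triangle]]


def assemble_triangles(vertices: List[Vertex]) -> List[Triangle]:
    """Stand-in for the non-programmable step that groups vertices into triangles."""
    return [tuple(vertices[i:i + 3]) for i in range(0, len(vertices) - 2, 3)]


def run_pipeline(
    vertices: List[Vertex],
    vertex_stage: Callable[[Vertex], Vertex],
    fragment_stage: Callable[[Triangle], object],
    hull_domain_stage: Optional[TriangleStage] = None,  # optional stage
    geometry_stage: Optional[TriangleStage] = None,     # optional stage
) -> list:
    """Pass data through each stage in order, skipping optional stages that are absent."""
    transformed = [vertex_stage(v) for v in vertices]
    triangles = assemble_triangles(transformed)
    if hull_domain_stage is not None:
        triangles = hull_domain_stage(triangles)
    if geometry_stage is not None:
        triangles = geometry_stage(triangles)
    return [fragment_stage(tri) for tri in triangles]


# Two triangles forming a quad, with only the mandatory stages supplied.
quad = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0)]
shaded = run_pipeline(quad, vertex_stage=lambda v: v, fragment_stage=lambda tri: "shaded")
print(shaded)  # ['shaded', 'shaded']
```

Each stage only sees the output of the stage before it, which is why an early stage that discards a triangle saves all of the later work for that triangle.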

(NOTE: The shader does not currently support Unity's other render pipelines, namely the Universal Render Pipeline (URP), the High Definition Render Pipeline (HDRP), and custom pipelines built on the Scriptable Render Pipeline (SRP). Future support for these pipelines is planned, but not until the shader is more polished, reliable, and feature-complete.)

There are 3 different options for how the shader will use the available BiRP stages:

Fallback Pipeline (Slow)

The Fallback Pipeline relies on slow Geometry shaders to discard all non-visible fur.

The Fallback Pipeline is the slowest pipeline, and will also have the lowest quality. It can render up to 32 layers of fur.

Its lower performance is because it relies on Geometry shaders to discard all non-visible triangles. Geometry shaders are unfortunately relatively slow to load into GPU memory, and they must reserve enough memory in advance for all possible output triangles, even if they later discard them. Any triangle that is too far away, or is backwards-facing, or is outside the visible screen area will still take time to be discarded by the Geometry shader.

Because even discarding triangles still takes some time, selecting a higher maximum layer limit comes with a performance hit, even when fewer layers are being rendered.
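The cost pattern described above can be illustrated with a small Python model. It is a sketch under stated assumptions, not the shader's actual code: the `Tri` fields, the `max_distance` cut-off, and the function `fallback_geometry_stage` are all hypothetical, and "reserved slots" simply counts one output slot per possible layer.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Tri:
    distance: float       # distance from the camera (hypothetical units)
    facing_camera: bool   # False for backwards-facing triangles
    on_screen: bool       # False if outside the visible screen area


def fallback_geometry_stage(
    tris: List[Tri], max_layers: int, visible_layers: int, max_distance: float
) -> Tuple[int, int, int]:
    """Return (geometry invocations, reserved output slots, emitted fur triangles)."""
    invocations = reserved = emitted = 0
    for tri in tris:
        invocations += 1        # the Geometry stage runs even for hidden triangles
        reserved += max_layers  # output space is reserved before anything is discarded
        visible = tri.on_screen and tri.facing_camera and tri.distance <= max_distance
        if visible:
            emitted += min(visible_layers, max_layers)
    return invocations, reserved, emitted


scene = [
    Tri(1.0, True, True),    # visible
    Tri(1.0, False, True),   # backwards-facing
    Tri(50.0, True, True),   # too far away
    Tri(1.0, True, False),   # off-screen
]
print(fallback_geometry_stage(scene, max_layers=16, visible_layers=8, max_distance=10.0))  # (4, 64, 8)
print(fallback_geometry_stage(scene, max_layers=32, visible_layers=8, max_distance=10.0))  # (4, 128, 8)
```

In this model the emitted triangle count is identical for both layer limits, but the reserved output doubles, which mirrors why a higher limit costs time even when the same number of layers is drawn.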

The Fallback Pipeline is not recommended. It is still much faster than other fur shaders that use the same approach, such as XSFur, but it is only included for compatibility: it doesn't use Hull + Domain shaders, which may not be supported in some games.

Turbo Pipeline (Fast)

The Turbo Pipeline uses fast Hull + Domain shaders to discard most non-visible fur.

Depending on view range, the Turbo Pipeline is typically 25% to 100% faster than the Fallback Pipeline, despite also having much higher quality. It can render up to 32 layers of fur.

This speed comes from its use of Hull + Domain shaders to discard any fully non-visible triangles before the Geometry shader stages. This prevents the relatively slow Geometry shaders from loading into GPU memory if they aren't needed.

Hull + Domain shaders are typically used for tessellation, and are thus built for speed. However, the Turbo Pipeline does not use them for tessellation. Instead, it uses the Hull + Domain shaders as a very fast all-or-nothing kill-switch. When it wants to discard an entire triangle, the Hull shader specifies a multiplier of 0, which discards it. Otherwise it specifies a multiplier of 1, and the Domain shader passes the triangle completely as-is to the Geometry shader. The Geometry shader is then responsible for making copies of the triangle for each visible fur layer, and discarding the rest.
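The kill-switch idea can be modelled in a few lines of Python. This is a CPU-side sketch only: `hull_multiplier`, `turbo_pipeline`, the `Tri` fields, and the specific visibility tests are assumptions made for illustration, and the real shader's tests may differ.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Tri:
    distance: float       # distance from the camera (hypothetical units)
    facing_camera: bool   # False for backwards-facing triangles
    on_screen: bool       # False if outside the visible screen area


def hull_multiplier(tri: Tri, max_distance: float) -> int:
    """All-or-nothing switch: 0 discards the triangle, 1 passes it through unchanged."""
    visible = tri.on_screen and tri.facing_camera and tri.distance <= max_distance
    return 1 if visible else 0


def turbo_pipeline(
    tris: List[Tri], max_layers: int, visible_layers: int, max_distance: float
) -> Tuple[int, int]:
    """Return (geometry invocations, emitted fur triangles)."""
    invocations = emitted = 0
    for tri in tris:
        if hull_multiplier(tri, max_distance) == 0:
            continue  # discarded before the Geometry stage ever runs
        invocations += 1  # only surviving triangles reach the Geometry stage
        emitted += min(visible_layers, max_layers)  # one copy per visible layer
    return invocations, emitted


scene = [
    Tri(1.0, True, True),    # visible
    Tri(1.0, False, True),   # backwards-facing
    Tri(50.0, True, True),   # too far away
    Tri(1.0, True, False),   # off-screen
]
print(turbo_pipeline(scene, max_layers=32, visible_layers=8, max_distance=10.0))  # (1, 8)
```

Of the four triangles in this example scene, only one reaches the Geometry stage, so the slow stage does a quarter of the work it would do if it had to reject the hidden triangles itself.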

Because even discarding triangles still takes some time, selecting a higher maximum layer limit comes with a performance hit, even when fewer layers are being rendered. However, this hit is relatively moderate (around 5% going from 24 to 32 layers), and less than the Fallback Pipeline's.

The Turbo Pipeline is the recommended pipeline.

Super Pipeline (Fastest, capable of extremely high resolution, but not usable on some AMD GPUs)

The Super Pipeline uses fast Hull + Domain shaders to make the fur layers.

The Super Pipeline is NOT available for public use, due to AMD crashing issues. It is not a selectable option in release versions of the shader.

The Super Pipeline is typically 30% faster than the Turbo Pipeline when rendering at the same quality (i.e. up to 32 layers), but it can also produce pixel-perfect screenshots with no visible gaps in the hairs. It can render up to 264(!) layers of fur.

The Super Pipeline's speed and resolution come from using the Hull + Domain shaders to make only as many copies of each triangle as needed (note: this is not tessellation, since it is only making copies of the triangles, not sub-dividing them). This means that unneeded triangles do not need to be discarded, because they are never created in the first place.

Since the Super Pipeline doesn't need to discard triangles, there is no performance hit for having a higher layer limit. The Super Pipeline will, however, limit its maximum layers dynamically, according to the quality settings.
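The "create only what is needed" approach can be sketched as follows. This is a hypothetical Python model, not the shader's implementation: `super_domain_stage`, the `(name, layers_needed)` pairs, and the `quality_layer_cap` parameter are all made up for illustration, with the cap standing in for the dynamic limit set by the quality settings.

```python
from typing import List, Tuple


def super_domain_stage(
    tris: List[Tuple[str, int]], quality_layer_cap: int
) -> List[Tuple[str, int]]:
    """Create exactly the layer copies each triangle needs, so nothing is discarded later.

    Each input is (triangle name, layers needed); each output is (triangle name, layer index).
    """
    copies = []
    for name, layers_needed in tris:
        count = min(layers_needed, quality_layer_cap)  # dynamic cap from quality settings
        copies.extend((name, layer) for layer in range(count))
    return copies


# Hypothetical triangles: a close-up one, a distant one, and a hidden one.
tris = [("near", 40), ("far", 3), ("hidden", 0)]
print(len(super_domain_stage(tris, quality_layer_cap=32)))   # 35
print(len(super_domain_stage(tris, quality_layer_cap=264)))  # 43
```

Raising the cap only changes the output for triangles that actually need more layers, so in this model a higher limit adds no work for the distant or hidden triangles.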

Offloading most of the triangle copying workload from the much slower Geometry shaders to the Hull + Domain shaders gives the Super Pipeline a big performance boost. Unfortunately, it also runs afoul of AMD driver bugs on some of their GPUs. The Vega 64, 6600 XT, 6700 XT, and 6800 XT have all been confirmed to crash (no Intel or Nvidia GPUs have been reported to crash). The Vega 64 crash was found to be due to a driver bug with nested arrays, which was fixed by modifying the shader code to access the arrays using a different method. However, the other AMD GPUs appear to be crashing due to memory corruption, which is not fixable by modifying the shader code. Unlike the Turbo Pipeline, the Super Pipeline's Domain shaders copy and modify the data coming from the Hull shaders before sending it to the Geometry shaders. The task of reserving and managing memory for these copies is handled by the GPU drivers, and this is where the AMD driver bugs seem to strike. Typically these GPUs will lock up after about 10 seconds.

Until driver support can be confirmed to be reliable on all GPUs, the Super Pipeline will not be available publicly. It currently only exists as an experimental Beta version.

The Super Pipeline must NEVER be used in public games!