Tech Note: Shader Snippets for Efficient 2D Dithering
Oculus Developer Blog | Posted by Martin Mittring | August 20, 2018

A few examples of the minimal ALU method, left to right (best viewed at 100%): Plus6Int (6 shades), Dither17 (better quality than 16), Dither32 (more shades), Dither64 (more shades, artifacts)

Note: All images in this post were created for this work. Depending on your browser and its settings, images might be scaled and show an extra repeating pattern (Moiré pattern). We suggest the sample application for an undistorted and animated view. Keep in mind that your monitor might affect the visuals, especially for animated content.


In this article, we present some ready-to-use HLSL snippets (small fragments of code in the DirectX shading language) for dithering. We also provide details on when to apply them and how they have been crafted. We include some existing and some new solutions to the problem. You can find interesting performance numbers at the end.

Efficient rendering algorithms are paramount for VR developers. They allow for more fidelity, higher frame rate, and less latency. That leads to better immersion and more comfort, which in turn lead to a longer, more pleasant VR experience.

During development of “Dear Angelica” (later in “Quill”) and the new Oculus “Home” we needed high-quality dithering. Findings from those productions led to the creation of this blog post.


Dithering is about transforming a value of higher precision (e.g. 0..1) into a value of lower precision (e.g. binary) that, when averaged with nearby pixels (spatial) or over time (temporal, e.g. multiple frames), approximates the input value better than simple quantization could. There are many applications of dithering. The code for those applications differs, but they can all be unified by a function that takes a location (e.g. 2D screen position) and a time (e.g. FrameIndex) as input and returns a value in the 0..1 range. That value can be used in many ways:

  • Clip/cull pixels when compared with a threshold
  • Added to a value that is going to be quantized (color banding or MSAA banding when culling MSAA samples)
  • Added to a sampling start position that would suffer from banding due to undersampling (e.g. 3D volume tracing start location)
  • Rotate a 2D sample set to create more sample positions (e.g. Screen Space Ambient Occlusion)

Some dithering algorithms (e.g. Floyd Steinberg) require the signal from nearby sample points making a parallel implementation harder or less efficient. In this work we assume a rather smooth signal where employing nearby samples would not add significant value.

Example applications of dithering:

  • Some printers use dithering to approximate grey scale values with black dots.
  • Opacity Mask / Stipple Pattern / Dithered transparency / Screen-Door Transparency: Single layer transparency can be approximated by dithering opaque pixels (with clip()) based on the transparency/alpha value. Multiple layers often look acceptable but wrong (no proper OIT solution). TemporalAA can further improve the quality.
  • With MSAA the error can be reduced further (with SV_CoverageMask, see “Dear Angelica”).
    e.g. 4xMSAA: add 1/4 * dither - 0.5 to OpacityMask in UE4 material
    Note: Unreal Engine 4 used a weighted average for HDR resolves which can cause minor artifacts
  • Prevent banding in low-precision buffers (e.g. 8 bit): e.g. after tone mapper, in G Buffers or from many alpha transparent layers
    e.g. add 1/256 * (dither - 0.5) to PS output
    RGB dither would result in more luminance shades at the cost of more dis-colorization
    see UnrealEngine 4 r.Tonemapper.GrainQuantization
  • Blending materials for a deferred renderer
  • “Softly” transitioning Level Of Details or fading out an object in distance (for better performance)
  • Rotate sample set (e.g. UnrealEngine 4 SSAO) to improve sampling efficiency
  • Add start distance for ray marching to blur out sampling artifacts
    e.g. http://shaderbits.com/blog/ue4-volumetric-fog-techniques
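
To make the banding case from the list above concrete, here is a small CPU-side sketch in Python (not shader code). It ports the non-animated Dither17 function shown further down and applies the "add 1/256 * (dither - 0.5)" recipe before rounding to 8 bit. The helper names, the 17x17 tile and the test value are assumptions chosen for illustration only.

```python
import math

def dither17(px, py):
    """CPU port of the non-animated Dither17; px/py are pixel centers."""
    v = (2.0 * px + 7.0 * py) / 17.0
    return v - math.floor(v)

def quantize_8bit(value, dither):
    """Add (dither - 0.5) / 256 to the output, then round to an 8-bit level."""
    out = value + (dither - 0.5) / 256.0
    level = math.floor(out * 255.0 + 0.5)      # round to nearest level
    return min(max(level, 0), 255)

def tile_mean_level(value, size=17, use_dither=True):
    """Mean 8-bit level over a size x size tile of a constant input value."""
    total = 0
    for iy in range(size):
        for ix in range(size):
            # dither = 0.5 adds nothing, which is plain quantization
            d = dither17(ix + 0.5, iy + 0.5) if use_dither else 0.5
            total += quantize_8bit(value, d)
    return total / (size * size)

value = 100.3 / 255.0                           # sits between levels 100 and 101
plain = tile_mean_level(value, use_dither=False)
dithered = tile_mean_level(value, use_dither=True)
```

Without dithering every pixel lands on level 100, so the 0.3 of a level is lost and neighboring gradient values collapse into a band. With dithering some pixels round up, and the tile average moves close to the input value.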

Source code / demo (Win32) link: DearGpu.com

  • Liberal license: WTFPL http://www.wtfpl.net (minor exceptions in “Copyright.txt”)
  • Controls: Use keyboard left/right/up/down to navigate the options. F5 to reload shaders.
  • “DitherExperiment.hlsl” has the relevant HLSL snippets.
  • All exposed C++ constants and textures can be found in AutoCommon.hlsl
  • To run the benchmark, follow the instructions in the application. The output can be found in the Stats folder and opened with Excel

Some example code (simplified for better illustration):

      // @param Pos screen position from SV_Position which is pixel centered meaning +0.5
      // @param FrameIndexMod4 0..3 can be computed on CPU
      // @return 0 (included) .. 1 (excluded), 17 values/shades
      float Dither17(float2 Pos, float FrameIndexMod4)
      {
          // 4 scalar float ALU ops (1 mul, 2 mad, 1 frac)
          return frac(dot(float3(Pos.xy, FrameIndexMod4), uint3(2, 7, 23) / 17.0f));
      }
      // one example usage (clip() discards the pixel if its argument is <0,
      // so alpha < 0 clips every pixel and alpha >= 1 keeps every pixel):
      clip(alpha - Dither17(Pos, FrameIndexMod4));
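
For readers who want to inspect the function outside of a shader, the following is a CPU-side sketch in Python of the same math. It divides by 17 after the dot product instead of pre-dividing the constants, which is algebraically the same. The histogram helper and its names are assumptions added for illustration.

```python
import math

def frac(v):
    """HLSL-style frac(): the fractional part, always in [0, 1)."""
    return v - math.floor(v)

def dither17(px, py, frame_index_mod4):
    """CPU port of Dither17; px/py are pixel centers (integer + 0.5)."""
    return frac((2.0 * px + 7.0 * py + 23.0 * frame_index_mod4) / 17.0)

def shade_histogram(frame_index_mod4, size=17):
    """Count how often each of the 17 shades occurs in a size x size tile."""
    counts = [0] * 17
    for iy in range(size):
        for ix in range(size):
            d = dither17(ix + 0.5, iy + 0.5, frame_index_mod4)
            counts[int(d * 17.0)] += 1          # shade index 0..16
    return counts

histogram = shade_histogram(0)
```

On pixel centers the function returns values of the form (k + 0.5) / 17, and in a 17x17 tile each of the 17 shades shows up equally often, in every one of the 4 frames.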
  

Goals:

  • Input:
    • float 2D screen position as provided by HLSL semantic SV_Position (pixel centered meaning +0.5)
    • uint FrameIndex (for animation, more efficient: float FrameIndexMod4)
  • Output:
    • float 0..1 (to clip pixel/subsample or add to values that undergo quantization) or bool
    • The values should be well distributed over the 0..1 range to evenly distribute the shades
  • Outside range behavior:
    • an alpha value <0 should always result in an output value <0 (e.g. clip pixel)
    • an alpha value >1 should always result in an output value >1 (e.g. no clip pixel)
  • Multiple options to allow different trade-offs and adapt to the application
    e.g. regular / blue noise, static / animated, high frequency / more shades, ALU / texture
  • Simple HLSL shader code (but can be adapted to other shader languages)
  • Few and simple (no sin or cos) floating point (not int/uint) shader instructions for best performance, making the method nearly free (usable on mobile GPUs, in VR, or in every pass to fight quantization banding)
  • Easy integration into any code (no / minimal C++ setup, no external data)
  • High frequency as human perception and other spatial blur (e.g. from VR distortion pass) can hide the pattern

Existing solutions:

Crafting a minimal ALU only dither function:

We've seen existing snippets using only a few multiplications, adds, frac(), and some magic numbers. Finding an even faster function turned out to be an interesting challenge.

Inspired by this article: Visiting all values in an array exactly once in “random order”

We managed to craft a method based on a single modulo/frac() operation. We can transform “visits all values in an array exactly once” into “generates a well-distributed output value from a given input”.

We know that on modern GPUs the most efficient primitive is a float MAD (fused MUL and ADD) operation. To create a value in 0..1 range we can use a simple frac() operation (very fast on GPUs). We have to make use of the Screen x and y location and this led to this minimal function (non-temporal):

frac(x*k0 + y*k1)

When comparing this value with an alpha value we can make the decision to keep or reject a sample.

The challenge is to craft the constants k0..k1 to achieve as much as possible of the defined goals.

The problem is simple enough to allow a brute force search for the best constants. However, defining a quality metric is non-trivial as a subjective balance between multiple goals needs to be found. In the end we found the best constants by experimentation. With some math knowledge we managed to reduce the search space to a manageable range of numbers.

Converting integer math to floating point math

To derive a repetitive pattern it's actually easier to stick to integer math and replace the frac() with the modulo operation:

((x*k0 + y*k1) % Count) / (float)Count    ~=      frac(x*k0/Count + y*k1/Count)

It turns out a good integer version can be simply converted to float and frac() without any issues.
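
The equivalence can be checked numerically. The sketch below is a CPU-side Python comparison, assuming Count = 17 and the constants 2 and 7 from Dither17, evaluated on integer coordinates. The function names and the 256x256 test area are illustrative choices.

```python
import math

COUNT = 17
K0, K1 = 2, 7        # the x and y constants of Dither17, both coprime to COUNT

def dither_int(x, y):
    """Integer reference: modulo, then normalize to 0..1."""
    return ((x * K0 + y * K1) % COUNT) / float(COUNT)

def dither_float(x, y):
    """Float version: pre-divided constants and frac()."""
    v = x * (K0 / float(COUNT)) + y * (K1 / float(COUNT))
    return v - math.floor(v)

def max_difference(size=256):
    """Largest deviation between both versions over a size x size area."""
    worst = 0.0
    for y in range(size):
        for x in range(size):
            diff = abs(dither_int(x, y) - dither_float(x, y))
            diff = min(diff, 1.0 - diff)     # frac() wraps, so compare on a circle
            worst = max(worst, diff)
    return worst

worst_case = max_difference()
```

The comparison wraps around at 1 because a float result just below 1.0 and an integer result of 0 describe the same point of the repeating pattern.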

Floating point math usually suffers from precision loss when doing math with large numbers. We can avoid/limit the problem if we limit the value ranges (screen resolution, constants, output). The SV_Position default 0.5 offset should be considered when doing the math as it can affect the stability of the function at different screen locations.

Notes:

  • Some hardware has special faster integer math (e.g. 24bit, 16bit, 8bit) but we did not optimize for that.
  • From past experience, we know that on some mobile hardware half (16-bit float) precision can cause problems with large resolutions. The functions here have not been tested with that in mind.
  • The code here has not been tested with larger screen resolutions (>2048).
  • Comparing to an integer reference shows if we can match the integer quality. This has been done visually but could have been tested more extensively.
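
The precision concern from the notes above can be explored on the CPU. The following Python sketch emulates reduced precision by rounding every intermediate of Dither17 to 32-bit or 16-bit floats with the struct module and compares against an exact integer reference. This is only an approximation of GPU behavior (real hardware may fuse the multiply-add or keep higher intermediate precision), and all helper names are assumptions.

```python
import math
import struct

def to_f32(v):
    """Round a Python float to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", v))[0]

def to_f16(v):
    """Round a Python float to the nearest 16-bit (half) float."""
    return struct.unpack("e", struct.pack("e", v))[0]

def dither17_rounded(ix, iy, rnd):
    """Dither17 at pixel (ix, iy) with every intermediate rounded by rnd()."""
    px, py = rnd(ix + 0.5), rnd(iy + 0.5)
    kx, ky = rnd(2.0 / 17.0), rnd(7.0 / 17.0)
    acc = rnd(rnd(px * kx) + rnd(py * ky))
    return rnd(acc - math.floor(acc))

def dither17_exact(ix, iy):
    """Exact reference: ((2x + 7y) mod 17) / 17 at pixel centers, in integers."""
    return ((4 * ix + 14 * iy + 9) % 34) / 34.0

def max_error(rnd, x_start, count=64):
    """Worst wrap-around error along one row starting at pixel x_start."""
    worst = 0.0
    for ix in range(x_start, x_start + count):
        diff = abs(dither17_rounded(ix, 0, rnd) - dither17_exact(ix, 0))
        worst = max(worst, min(diff, 1.0 - diff))
    return worst

err_f32 = max_error(to_f32, 2048)
err_f16_small = max_error(to_f16, 0)
err_f16_large = max_error(to_f16, 2048)
```

Neighboring shades are 1/17 apart, so an error above 1/34 means a pixel can land on the wrong shade. In this emulation half precision stays below that limit near the origin but exceeds it around x = 2048, where half floats can no longer represent every pixel center.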

Finding the right constants

Using “Count” provides us with a limited set of values which results in as many patterns.

For the 1D case, there is well-known math (see article above) to compute a sequence that visits all values in a pseudo-random order. We need to choose the constants k2 and k3 to be coprime to “Count”. Note that the Halton sequence also makes use of coprime numbers, but additionally alters the order (using more expensive integer math and reverse bits function) of the bits which helps the quality with a larger “Count”.
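
The 1D property can be shown in a few lines. This is a minimal Python sketch with hypothetical helper names: stepping through an array with a stride that is coprime to its size visits every entry exactly once, while a stride that shares a factor with the size does not.

```python
from math import gcd

def visit_order(k, count):
    """Order in which (i * k) % count touches the entries 0..count-1."""
    return [(i * k) % count for i in range(count)]

def visits_all(k, count):
    """True if every entry is visited exactly once."""
    return sorted(visit_order(k, count)) == list(range(count))

order_7_of_16 = visit_order(7, 16)      # starts 0, 7, 14, 5, ...
```

For example, visits_all(7, 16) holds because 7 and 16 share no factor, while visits_all(6, 16) fails because both are even and only even entries are reached.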

Good properties to optimize for (can conflict with each other):

  • Large numbers of shades directly conflict with avoiding visible patterns or low-frequency features
    => a small fixed number of shades can avoid visible patterns
  • Near 0 and 1 we prefer solid values over patterns (no dither pattern)
    e.g. top is good, the bottom is bad (very dark areas still show a pattern) - the dialog is explained later
  • Near solid colors (one-pixel case) could be distributed to be nearly equidistant (“Dither17” is better than “Dither16”)
  • No distracting patterns e.g. filling up diagonal lines of the pixel (hard to avoid with larger modulo numbers)
  • The ideal 50% pattern is a 2x2 pattern.
    Picking a coprime number near the half of the modulo value e.g. 7 for (x*7)%16 is ideal to create a high-frequency pattern for this case.
  • Good performance (instruction type, count, temporaries)
  • Consistent quality at any screen location (using float or half on high-resolution displays can be a challenge)
  • Consistent quality at any alpha value and when animating alpha
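
The remark about picking a coprime number near half of the modulo value can be illustrated in 1D. The Python sketch below uses a deliberately simple, hypothetical metric: for one row of the 50% pattern it counts how often neighboring pixels differ. It is not the subjective metric used to pick the constants in this post.

```python
def half_coverage_row(k, count):
    """One row of the 50% pattern: pixel is on when (x * k) % count < count / 2."""
    return [1 if (x * k) % count < count // 2 else 0 for x in range(count)]

def transitions(row):
    """Number of on/off changes between neighbors, treating the row as tiling."""
    return sum(row[i] != row[(i + 1) % len(row)] for i in range(len(row)))

# score every odd (and therefore coprime) constant for a modulo of 16
scores = {k: transitions(half_coverage_row(k, 16)) for k in range(1, 16, 2)}
```

With a modulo of 16, k = 7 produces 14 changes across 16 pixels, which is close to a pixel-by-pixel alternation, while k = 1 produces only 2 (eight pixels on, eight off).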

Visualize the functions

The following diagrams are a visualization that helps to compare and illustrate the properties of the functions. Run the application on DearGPU.com to get a high quality animated visual.

The right shows a “side view” of the comparison with an alpha ramping up from 0 to 1. It's similar to a 2D diagram where x is the alpha value and y is the brightness of the pixel. Because it was evaluated per pixel, you see “overhang” pixels, which you can ignore. The view is also useful to see whether colors blend linearly (straight lines) and how much the solution deviates from the reference.

The 5 sub bars on the left show:

  • Animated pattern (at display refresh rate)
  • Animated pattern (slowed down 8 times)
  • Temporally blurred over 4 frames (weighted: 4 3 2 1 to not blur out low temporal frequency patterns)
  • Spatially blurred with a 4x4 box filter
  • Temporally blurred like before and spatially blurred with a 2x2 filter

Notes:

  • Temporal means over time (e.g. average multiple samples at a fixed screen location over a few frames)
  • Spatial means over space/position (e.g. average multiple samples that are nearby to each other)
  • The images here are not animated - if you want to see the animated version - you have to run the application from DearGPU.com. The temporally blurred versions give a similar impression.

The reference lerp (ColorA, ColorB, alpha) looks like this:

This ideal image shows a smooth transition from black to white (left). The 5 bars look the same as spatial blurring (in XY) and temporal blurring (in time) have little effect.


When looking at the “PseudoRandom” function you can see a transition from black to white but with large artifacts.

The random noise shows a clear deviation in the ramp, which shows the result is not very good.


The following shows the “DitherArray4x4” function with temporal variation over 4 frames → 4*4*4 = 64 shades

Note: The right side uses a box blur which results in a clean ramp. The quality of the result cannot be judged by that alone.


The following shows a gradient noise pattern used in “Dear Angelica” and in an early version of “Quill” (variant of “COD: ADVANCED WARFARE”):


The following shows a 32 shade dither pattern. We see 2 images, the reference integer implementation and the float implementation which should look the same but result in better performance:

Integer code (offset is to reflect 0.5f pixel offset in In.Pos.xy):

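      // k0: 3-component constants weighting x, y and the frame index
      // offset folds the 0.5 pixel-center shift of In.Pos.xy into integer math
      uint offset = dot(float3(0.5f, 0.5f, 0), k0);
      // "& 0x1f" is "% 32" and keeps 32 shades
      uint Ret = (offset + dot(int3(In.Pos.xy, In.frameIndexMod4), k0)) & 0x1f;
      return frac((Ret + 0.5f) / 32.0f);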

Float code:

      // same constants, pre-divided by 32 so one frac() replaces the modulo
      float Ret = dot(float3(In.Pos.xy, In.frameIndexMod4 + 0.5f), k0 / 32.0f);
      return frac(Ret);

The last shows a 17 shade dither pattern (16 didn't look as good), trading shades for a higher-frequency pattern:

Notice there are 17 distinct steps in the ramp. Each pattern seems to be regular and high frequency. On a closer look, we can see the 50% area shows a distinct stripe pattern. This is where we had to make a compromise. A table/texture based solution could solve this but the minimal ALU solution limits our options.

Adding a temporal component:

We decided to include the option of a temporal component. By varying the function over time we can reduce the perception of standing patterns and produce a better, smoother transition. We repeat the adjustment every 4 frames (time window), which works well for 90 Hz (Oculus Rift). Fewer frames (2 or 3) can make sense at a lower frame rate (e.g. 60 / 30 Hz). More is an option at higher frame rates or when using a post filter like the TemporalAA in Unreal Engine 4. We prefer a small time window to reduce the perception of a moving pattern.

Options adding a temporal component:

  • Alter input
    • Offset input XY from CPU
      => CPU code, flexible
    • Offset input XY with 2 MAD: x += FrameIdMod * k2; y += FrameIdMod * k3
      => simple
    • Offset input X or Y with 1 MAD: x += FrameIdMod * k2
      => cheaper than XY but might be less effective
  • Alter function
    • vary k0/k1 from CPU
      => need multiple nearly uncorrelated good constants (e.g. mirror x/y, flip xy), CPU code
    • extra constants: dot(float3(xy, FrameIdMod), k)
      => more constraints on finding good constants
  • Alter output (generally not a good idea)
    • Offset output: + some_func(FrameIdMod)
      => more shades (e.g. 4x4 pattern has 16 shades, with 4 temporal we can get 64 shades) but pulsing and standing patterns

Note: Display overdrive (Display hardware that reduces latency by over/undershooting) can make the temporal adjustment more visible and even result in a minor color shift. Oculus Rift employs a similar technology in the shader. The quality of that implementation is very high and we don't see any distracting overdrive color shifts.

"Focus on the dark shades"

Visually most noticeable is a bright pixel in a dark area. To better observe the artifact in dark shades it's best to reshape the alpha value e.g. new_alpha = pow(alpha,4):

Note how some of the functions shown here don't have a solid black near 0. This can be fixed by adding a small offset. To keep the brights a solid color (>=1) a multiplier might be needed.
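
One possible offset and multiplier is sketched below in Python, assuming the dither value comes from an 8-bit texel (0..255, fetched as 0..1 inclusive) and that the pixel test is clip(alpha - threshold). The specific remap is an assumption for illustration and not necessarily what the sample application does.

```python
def remap_unorm8(texel):
    """Map an 8-bit texel (0..255) to a threshold strictly inside (0, 1)."""
    d = texel / 255.0                           # what a UNORM texture fetch returns
    return d * (255.0 / 256.0) + 0.5 / 256.0    # multiplier plus small offset

def is_clipped(alpha, threshold):
    """Mirrors clip(alpha - threshold): the pixel is discarded when negative."""
    return alpha - threshold < 0.0

thresholds = [remap_unorm8(t) for t in range(256)]
kept_at_zero = sum(not is_clipped(0.0, d) for d in thresholds)
clipped_at_one = sum(is_clipped(1.0, d) for d in thresholds)
kept_at_zero_raw = sum(not is_clipped(0.0, t / 255.0) for t in range(256))
```

With the raw texel value, the threshold 0 survives at alpha = 0 and leaves a bright pixel in an otherwise black area. After the remap, alpha = 0 clips every threshold and alpha = 1 keeps every threshold.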

Using dither to blend between colors

Shaping the function to adapt to the display sRGB curve is tempting but it's actually only possible when both colors (Src and Dst in frame buffer blending) are known and grey scale (or when done per channel). Here we see a blend from pink to light green. The right side now shows the RGB color channel in an additive blend.

The colors have been chosen (A=0.9, 0.1, 0.2; B=0.1, 1.0, 0.4) to use most of the 0..1 range and to show how different blends behave. The right side shows 3 RGB ramps. This was to prove that any reshaping of the dither function would produce wrong results (e.g. compare with sqrt(alpha) instead of alpha). Reshaping might appear to give a more pleasing transition from black to white on an sRGB display, but we want it to work with any color. Applying any sRGB corrections to non-color channels is wrong. If you have trouble judging the visuals with the chosen colors you can use different colors when running the application (look for GeneralPurposeTweak).

Texture Blue noise (non-ALU organic high-frequency pattern)


64x64 texture

16x16 texture/table (more tiling)

Blue noise can be used for a more organic look while still maintaining the high-frequency property. We are not aware of a very efficient math function for blue noise that doesn't employ large tables. Luckily GPUs have efficient texture mapping. To avoid repetition you want a large texture, but to get good texture cache behavior you want it small in memory (size and format). With a small enough texture you can even avoid texture cache thrashing altogether. This is the best performance you can get, and lowering the resolution further would not buy you anything. This might be different when combined with other shader code, so it's always best to profile this in your shaders and on your platform (e.g. mobile).

Getting blue noise textures:

  • Generate e.g. VDR Follow Up – Grain and Fine Details
  • This Unreal Engine 4 asset reference (in engine content):
    Texture2D'/Engine/EngineMaterials/Good64x64TilingNoiseHighFreq.Good64x64TilingNoiseHighFreq'
    is an option but using a true blue noise texture can improve quality quite a bit.

In order to avoid standing patterns, it makes sense to give the impression of a never-repeating pattern. This can be implemented by offsetting the pattern/texture in a new random way. A pure random offset will result in short sections of moving patterns, so it's best to have a more controlled randomness. Offsets that repeat every 64 .. 2048 frames should be long enough to break up the pattern and short enough to make it easier to judge the quality. A more recent version of “Quill” cycles through multiple blue noise textures for an organic dissolve function on some materials.

Table based Blue Noise

For easier code integration we also investigated the option to use an array in HLSL. We limited the size to 16x16 as larger sizes are more likely to cause problems (constant buffer size limit, shader compile time, code size, etc). We found that with full HLSL optimizations enabled the compiler generated constant buffers and accessed those. There are limits on constant buffers (size, index at 16 bytes) which might be optimized further in the driver, but this is up for further investigation. The table we want to store is in bytes, but that type is not available for tables/constant buffers, so we use a 32-bit uint. We tried packing bytes in the “BlueNoiseA16x16” function and using 4 times as many uint in “BlueNoiseB16x16” (simpler shader).
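
The byte packing idea can be sketched on the CPU as follows. This Python example packs a 16x16 byte table into 32-bit words and reads single entries back with shifts and masks, which is the kind of work the packed variant has to do per lookup. The byte order, the function names and the placeholder data are assumptions, and the layout in the sample application may differ.

```python
def pack_bytes(table_bytes):
    """Pack a flat list of bytes into 32-bit words, 4 bytes per word."""
    words = []
    for i in range(0, len(table_bytes), 4):
        b0, b1, b2, b3 = table_bytes[i:i + 4]
        words.append(b0 | (b1 << 8) | (b2 << 16) | (b3 << 24))
    return words

def fetch_packed(words, ix, iy, size=16):
    """Read entry (ix, iy) of a tiling size x size byte table stored as words."""
    index = (iy % size) * size + (ix % size)
    return (words[index >> 2] >> ((index & 3) * 8)) & 0xFF

# Placeholder data, not a real blue noise table.
table = [(i * 37) % 256 for i in range(16 * 16)]
packed = pack_bytes(table)
```

Packing reduces the table from 256 to 64 words at the cost of an extra shift and mask per lookup, which is the trade-off between the two variants described above.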

Note: Make sure the table is declared outside the function with “static const” and compiler optimizations (e.g. D3DCOMPILE_OPTIMIZATION_LEVEL3) enabled. Without that, we've seen the resulting code be very inefficient (many instructions and temporaries). When using similar code in other shader languages and platforms you might see even more problems.

Linearity

Dithering is replacing a soft transition between two values (A and B) by a pattern that statistically amounts to a similar value. This is assuming linearity.

lerp(A, B, x) ~= (A * a + B * b) / (a+b)

a: number of pixels showing A
b: number of pixels showing B

Note: lerp(A, B, x) is a linear interpolation = B*x + A*(1-x)
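
As a numeric check of the relation above, with x corresponding to b / (a + b), the following Python sketch dithers between the two colors used in the blend example of this post over a 17x17 tile and compares the tile average with the lerp reference. The pixel test alpha - dither >= 0 and the helper names are assumptions for illustration.

```python
import math

def dither17(px, py):
    """CPU port of the non-animated Dither17; px/py are pixel centers."""
    v = (2.0 * px + 7.0 * py) / 17.0
    return v - math.floor(v)

def lerp(a, b, x):
    """Per-channel linear interpolation: B*x + A*(1-x)."""
    return tuple(bi * x + ai * (1.0 - x) for ai, bi in zip(a, b))

def dithered_mean(color_a, color_b, x, size=17):
    """Average color of a tile where each pixel shows either color_a or color_b."""
    count_b = 0
    for iy in range(size):
        for ix in range(size):
            if x - dither17(ix + 0.5, iy + 0.5) >= 0.0:   # pixel shows B
                count_b += 1
    count_a = size * size - count_b
    mean = tuple((ai * count_a + bi * count_b) / (count_a + count_b)
                 for ai, bi in zip(color_a, color_b))
    return mean, count_a, count_b

COLOR_A = (0.9, 0.1, 0.2)   # the pink from the blend example in this post
COLOR_B = (0.1, 1.0, 0.4)   # the light green
mean, count_a, count_b = dithered_mean(COLOR_A, COLOR_B, 0.3)
reference = lerp(COLOR_A, COLOR_B, 0.3)
```

With 17 shades the pixel ratio can only take steps of 1/17, so the average is close to the reference but not identical. The check only covers the arithmetic, not how a display turns pixel values into light.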

This assumption might not be correct because of hardware properties, but it can be easily verified by comparing a blend between two values (e.g. extremes like black and white) with the linearly interpolated reference. You have to do that on the target display (e.g. in VR, not on the monitor). A calibration could fix this, but with so few shades we don't have much fine control. Fixing larger issues (e.g. sRGB) is only possible if we know the values we want to dither/blend. This is not always possible (e.g. fixed function frame buffer blending doesn't allow that).

The assumption can also be violated by shader code e.g. Unreal Engine 4 TemporalAA is using a weighted average for HDR.

Even if incorrect, having some kind of transition is often better than no transition.

Performance

Profiling these minimal shader snippets turned out to be very hard. Shader bytecode instruction count can be misleading as the drivers optimize the code further and some instructions take longer than others. Non-dither shader code can play a large role in the cost of the dither code. The number of temporaries can affect performance, especially if you have a lot of texture cache misses in the other part of the shader, but the snippets use so few temporaries that we don't see the effect. Measuring timing by amplifying the shader code is possible but still tricky (branching adds extra cost).

Reading the chart (4 passes of 1024x1024, dynamic loop of 100 calls to the function, Nvidia TITAN Black, driver 398.11):

  • FrameTime is the most important criterion, with the Null shader as the baseline
  • InstructionCount and TempRegisterCount are from the HLSL bytecode
  • Null is clearly the fastest only showing frame buffer writes (additive blending)
  • Constant is to see the cost of the dynamic loop with minor code to not get it compiled out - it's the fastest
  • Plus6Int is one of the slower ones, not optimized (using integer)
  • Dither16, Dither17, Dither64 use the same code, just different constants
  • Dither32 is like the ones before but has an additional +0.5 in the shader
  • BlueNoiseA16x16 unpacks 4 bytes from uint[16*16] in constant buffer, more ALU
  • BlueNoiseB16x16 reads bytes from uint[16*16*4] in constant buffer, more constants, slowest
  • DitherArray8x8 is storing bytes in uint[8*8] in constant buffer
  • DitherArray4x4 is storing bytes in uint[4*4] in constant buffer
  • BlueNoiseTex64x64 uses a small texture instead of a constant buffer, way better than the smaller tables
  • PseudoRandom sin() makes this expensive
  • GradientNoise requires two frac()
  • Halton16 and Halton64 make use of integer math and reverse bits, slow
  • Overview is a shader that branches into all other options (temp count is outside of the chart)

Excellent presentations / posts that cover the topic in more detail:

Summary and observations

We presented a small set of functions that can be integrated easily. Depending on the application you can trade the number of shades, noticeable pattern, math/texture cost and other properties.

It seems large gradients deserve more shades and in cases where small and large gradients can be seen (this is often the case) an adaptive hybrid method could work best (e.g. choose number of shades depending on filterwidth() which is computed with ddx/ddy). This is similar to mip-mapping where you pick a different content depending on the magnification factor. The synthetic performance numbers show texture mapping has very good performance and table (constant buffer) random access is quite expensive. The presented math only solutions are a good choice when you are texture bound and want to integrate a solution quickly.

My last words and about the development environment

I used “We” in most of this post to reflect that a large part of this is derived from the work of others and from discussions I had with industry professionals. For this work I especially want to thank:

Dean Beeler, Volga Aksoy, Inigo Quilez, Zeyar Htet and Gillian Belton

To develop functions like this, a good iteration time is important (quickly recompile shaders on a button or on save). I searched for a Shadertoy-like tool for HLSL that also supports profiling, and the closest I found was Shadertoy-dx11. It was used to create the initial set of images but it wasn't suited to getting performance data. This is why I developed a new tool that is more suited for performance experiments.

It is based on AEngine which is a DirectX11 based mini engine (I released that while I worked at Epic Games, as free open source). The new framework called DearGPU (see source link in the beginning) allows for quick iteration (F5 to reload shaders) and convenient C/C++ / HLSL interaction (C++ exposed parameters which generate an HLSL include file, shaders get compiled on demand). Shader defines can be created in few lines of code which makes performance experiments way easier. There is a simple built-in benchmark functionality (choose benchmark type in the UI and run with F3). It will generate multiple .CSV (Comma Separated Value) files that can be imported into Microsoft Excel to create graphs.

You can use DearGPU under the same free license. If you want to help me grow this into a shader performance test tool, contact me. Imagine a website where you can read about other people's interesting experiments, each starting with a question like this: “Dear GPU, I was wondering ...”.

I can also see the tool running on multiple graphics cards / driver versions / machines, gathering more data.

Warning: Drivers and hardware are not bug-free, and during my experiments I even experienced computer restarts when running expensive shaders (e.g. a single very slow draw call). I will try to follow up with hardware vendors to get this resolved.


Bonus: How to use with Unreal Engine 4 Materials

  • “BlueNoise” function (without temporal component)
    Note: A blue noise texture from the link above will improve quality
  • “Dither64” function (with temporal component)
    Note: View.FrameNumber is actually View.FrameNumber%4