Loads, Stores, Passes, and Advanced GPU Pipelines

Oculus Developer Blog | Posted by RÉMI PALANDRI | September 25, 2020

We’ve had a number of questions (and a number of errors) on invalidates, clears, and loads on single-pass and multi-pass renderers on Oculus Quest. This blog post provides an overview of various mobile GPU rendering architectures across GL and Vulkan, how to configure and profile them properly, and how Fixed Foveated Rendering (FFR) works with those architectures. The following sections are critical knowledge for anybody whose job is to deal with mobile GPU renderers. This is a lengthy post, but this information is really important for graphics engineers to understand, especially those attempting advanced techniques that use multipass rendering.

THE BASICS

In all rasterization-based GPUs, every triangle on the screen outputs pixels that are depth-tested against the value currently stored in the depth buffer at their pixel position. If they pass the depth test, they write the new color and depth values into the color/depth attachments, over and over again until all triangles for the frame are rendered.

This means that for every pixel of every triangle, the GPU has to execute at the very least one read (for the depth test) and, whenever the test passes, two writes (color and depth). With color blending, there is an extra color read as well to compute the final color value. The more overlapping geometry, the more reads and writes.
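To make the counting concrete, here is a minimal sketch that tallies buffer accesses for a single pixel. It assumes a "less than" depth test and a cleared depth buffer; the function name `shade_pixel` and the fragment depths are made up for illustration, and real hardware has optimizations (such as early rejection) that this ignores.

```python
def shade_pixel(fragment_depths, blend=False):
    """Count buffer reads and writes for one pixel, fragments given in draw order."""
    depth = float("inf")  # cleared depth buffer: the first fragment always passes
    reads = writes = 0
    for frag_depth in fragment_depths:
        reads += 1                 # depth-test read
        if frag_depth < depth:     # assumed "less than" depth test
            if blend:
                reads += 1         # blending reads the existing color
            writes += 2            # one color write + one depth write
            depth = frag_depth
    return reads, writes


# Four overlapping fragments; the third one is hidden and fails the depth test.
print(shade_pixel([0.9, 0.5, 0.7, 0.2]))              # (4, 6)
print(shade_pixel([0.9, 0.5, 0.7, 0.2], blend=True))  # (7, 6)
```

Every one of those accesses would hit RAM on an immediate mode GPU, which is the cost tiled renderers are designed to avoid.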

TILED RENDERERS

Tiled renderers optimize the speed of reads and writes by realizing that:

  1. Any given pixel is not dependent on the values of other pixels of the frame

  2. Nobody needs to display the value of color and depth mid-frame

Tiled renderers break the screen up into tiles and render them sequentially. For each tile, they use a fast, small cache (gmem/tilemem) for the reads and writes until the final value for each pixel is computed. That value will then be stored into RAM (sysmem) for future consumption.

The Qualcomm Snapdragon 835 and Qualcomm Snapdragon XR2 chipsets (used by Oculus Quest and Quest 2, respectively) both have 1 MB of tile memory. On the XR2, where a tile contains both the left and right eye views in a Multiview rendering pipeline, an application running 4xMSAA, 32-bit color, and a 24/8 depth/stencil buffer will have tiles of 96x176 pixels. This makes sense, as a pixel contains (4 bytes of color + 4 bytes of depth/stencil) * 4 MSAA samples = 32 bytes of information, and 96 * 176 * 2 (multiview) * 32 ~= 1 MB. Changing per-pixel settings such as MSAA changes the tile size, as the GPU driver tries to maximize the number of pixels a tile contains (to reduce the total number of on-screen tiles and maximize cache utilization) while staying within the 1 MB of tile memory.
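The arithmetic above can be reproduced with a few lines of Python. This is only a sketch: the helper names are hypothetical, 1 MB is taken as 1024 * 1024 bytes, and the driver's real tile-size heuristics are not documented here, so treat the numbers as estimates.

```python
TILE_MEMORY_BYTES = 1024 * 1024  # assumed: "1 MB" of tile memory, as stated in the text


def bytes_per_pixel(color_bytes=4, depth_stencil_bytes=4, msaa=4):
    """Per-pixel footprint in tile memory: every MSAA sample carries color and depth/stencil."""
    return (color_bytes + depth_stencil_bytes) * msaa


def tile_footprint(width, height, views=2, msaa=4):
    """Bytes needed for one tile holding `views` eye views (2 for Multiview)."""
    return width * height * views * bytes_per_pixel(msaa=msaa)


def max_pixels_per_view(views=2, msaa=4):
    """Upper bound on pixels per view that fit in tile memory under this simple model."""
    return TILE_MEMORY_BYTES // (views * bytes_per_pixel(msaa=msaa))


footprint = tile_footprint(96, 176)
print(bytes_per_pixel())              # 32
print(footprint)                      # 1081344
print(footprint / TILE_MEMORY_BYTES)  # 1.03125
print(max_pixels_per_view(msaa=4))    # 16384
print(max_pixels_per_view(msaa=2))    # 32768
```

The 96x176 tile lands at roughly 1.03 times the assumed budget, which matches the "~=" in the text, and halving the MSAA level doubles the number of pixels a tile can hold.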

LOADS AND STORES

Now that we understand the tiled renderer architecture, we can infer that the per-tile workflow would be:

  1. Load pre-existing depth and color buffer data from RAM to tile memory

  2. Render all triangles/fragments into tile memory

  3. Store out final depth and color buffer from tile memory to RAM

This is pretty unoptimized though.

Step (1) is often an unnecessary bandwidth transfer, as we usually clear/overwrite all the previous frame’s contents with the new frame. We can therefore avoid (1) by telling the GPU driver to either clear or invalidate the previous frame content. In OpenGL, a developer would use either glClear or glInvalidateFramebuffer after binding the framebuffer, but before issuing the first drawcall. In Vulkan, which has a much more explicit system, the renderpass configuration contains loadOp and storeOp attributes for each attachment; setting loadOp to VK_ATTACHMENT_LOAD_OP_CLEAR or VK_ATTACHMENT_LOAD_OP_DONT_CARE avoids the load.

The difference between a clear and an invalidate here is minimal, as the QCOM GPU will attempt to hide the clear cost behind other necessary setup work, although in certain cases it could be measurable. However, it is absolutely critical that either one or the other is done: while avoiding clears is a standard PC GPU “optimization,” on these chipsets it actively harms performance, as it forces the GPU to load the previous frame data from RAM to tile memory every frame.

Step (3) is also an unnecessary bandwidth transfer if some of the attachments aren’t needed after the frame is done. For example, MSAA attachments in Vulkan and depth attachments on both GL and Vulkan are very commonly unnecessary to save after the end of the frame. It is then very useful to invalidate these attachments to tell the GPU driver “do not store those contents from tile memory to RAM, just leave them to be overwritten during the next tile execution; we won’t need them.” The same glInvalidateFramebuffer function on OpenGL can be used here, but it is of critical importance to understand that it has a very different function in this context than in (1), and both versions are necessary. Invalidating a buffer in (1) is to avoid a RAM->tile memory Load operation before anything is rendered, whereas in (3) it is to avoid a tile memory->RAM Store operation after everything is rendered. In Vulkan, this is done by putting VK_ATTACHMENT_STORE_OP_DONT_CARE in the storeOp renderpass attribute.

On the OpenGL side, since there are no explicit storeOp or flushing instructions à la Vulkan, it is important to call glInvalidateFramebuffer before the GPU decides to execute the contents of said framebuffer (otherwise the invalidate won’t be taken into account). Obviously you should call your invalidate functions before glFlush, but there are many other, much less obvious points in which a flush may implicitly occur. For example, inserting a timer query also forces flushing (as it’s impossible to measure the time of an operation to a certain point without flushing at that point). So if you insert a timer query operation before invalidating, your invalidate will be ignored and you’ll pay the cost to resolve out your depth buffer to RAM. Oops.

Vulkan has explicit MSAA attachments (unlike OpenGL where textures are non-MSAA even with an MSAA framebuffer), and it is important to never store out all MSAA subsamples to RAM (doing so causes the bandwidth to grow linearly with MSAA level for no benefit). It’s important then to use the last subpass’s pResolveAttachments attribute to bind a non-MSAA attachment, and only store that one out. In a correctly configured renderpass, the 4xMSAA attachment has the STORE_OP_DONT_CARE attribute, while the 1xMSAA attachment has the STORE_OP_STORE attribute.

The GPU has a hardware-accelerated chip doing MSAA resolve—use it!
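To get a feel for what is at stake, the load and store traffic can be estimated from the attachment sizes alone. The sketch below is a back-of-the-envelope model, not a description of what the driver does: the dictionary layout is hypothetical, the "LOAD" and "STORE" strings merely stand in for the corresponding Vulkan load/store ops, and effects such as bandwidth compression are ignored.

```python
def attachment_bytes(width, height, bytes_per_sample, samples=1):
    """Size in RAM of one attachment, counting every MSAA sample."""
    return width * height * bytes_per_sample * samples


def frame_traffic(attachments):
    """Sum the RAM <-> tile memory bytes moved per frame for a list of attachments."""
    total = 0
    for a in attachments:
        size = attachment_bytes(a["width"], a["height"], a["bytes"], a.get("samples", 1))
        if a["load_op"] == "LOAD":    # anything else (CLEAR, DONT_CARE) skips the load
            total += size
        if a["store_op"] == "STORE":  # DONT_CARE skips the store
            total += size
    return total


W, H = 1216, 1344
naive = [
    {"width": W, "height": H, "bytes": 4, "load_op": "LOAD", "store_op": "STORE"},  # color
    {"width": W, "height": H, "bytes": 4, "load_op": "LOAD", "store_op": "STORE"},  # depth/stencil
]
tuned = [
    {"width": W, "height": H, "bytes": 4, "load_op": "CLEAR", "store_op": "STORE"},
    {"width": W, "height": H, "bytes": 4, "load_op": "CLEAR", "store_op": "DONT_CARE"},
]
print(frame_traffic(naive))  # 26148864
print(frame_traffic(tuned))  # 6537216

# Storing all four samples of a 4xMSAA color buffer versus only the resolved image:
print(attachment_bytes(W, H, 4, samples=4) // attachment_bytes(W, H, 4))  # 4
```

Under these assumptions the tuned configuration moves a quarter of the bytes per frame, and storing unresolved 4xMSAA color costs four times as much as storing the resolved attachment.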

TOOLS

It’s very important to use the proper tools showing loads, stores, renderpass configurations, and so on to ensure the GPU is doing what you think it should be doing, whether you’re writing a custom engine or building a scene in Unity.

ovrgpuprofiler’s trace mode and GPU systrace are renderstage tracing tools specifically designed to display this information. These tools have matured to the point of being reliable and low-friction to use (ovrgpuprofiler is an adb shell tool that takes literally less than three seconds to run). For example, here is a 1216x1344 renderpass that is doing neither (1) nor (3) correctly:

Surface 1 | 1216x1344 | color 32bit, depth 24bit, stencil 8 bit, MSAA 2 | 28 320x192 bins | 10.62 ms | 171 stages : Binning : 0.305ms LoadColor : 0.71ms Render : 2.926ms StoreColor : 1.525ms Preempt : 2.964ms LoadDepthStencil : 0.828ms StoreDepthStencil : 0.871ms

Notice the LoadColor and LoadDepthStencil times (1.5ms of the total 10!) and StoreDepthStencil times, which shouldn’t be necessary in the vast majority of cases.
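If you capture many such lines, a small script can flag the suspicious stages for you. The sketch below assumes the trace output always looks like the sample line above (stage names followed by " : " and a time in ms); that format is inferred from this one example rather than from a specification, and the function names are hypothetical.

```python
import re

# Stages that usually indicate a missing clear/invalidate (loads) or a missing
# depth invalidate (depth/stencil store).
SUSPECT_STAGES = ("LoadColor", "LoadDepthStencil", "StoreDepthStencil")


def stage_times(trace_line):
    """Return {stage_name: milliseconds} parsed from one surface line."""
    pairs = re.findall(r"(\w+)\s*:\s*([\d.]+)ms", trace_line)
    return {name: float(ms) for name, ms in pairs}


def suspect_stages(trace_line):
    """Return only the stages that are unnecessary in the vast majority of cases."""
    times = stage_times(trace_line)
    return {name: times[name] for name in SUSPECT_STAGES if name in times}


line = ("Surface 1 | 1216x1344 | color 32bit, depth 24bit, stencil 8 bit, MSAA 2 | "
        "28 320x192 bins | 10.62 ms | 171 stages : Binning : 0.305ms LoadColor : 0.71ms "
        "Render : 2.926ms StoreColor : 1.525ms Preempt : 2.964ms "
        "LoadDepthStencil : 0.828ms StoreDepthStencil : 0.871ms")

found = suspect_stages(line)
print(found)                           # {'LoadColor': 0.71, 'LoadDepthStencil': 0.828, 'StoreDepthStencil': 0.871}
print(round(sum(found.values()), 3))   # 2.409
```

For this surface, about 2.4ms out of 10.62ms is spent on transfers that a clear and an invalidate would remove.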

MULTIPLE PASSES: PART 1 — SEPARATE EXECUTIONS

Many Quest titles use simple single-pass forward renderers. With Quest 2, the GPU has become significantly more powerful, and more and more developers are going to want to run more complex GPU rendering pipelines. However, when working on multiple-pass pipelines, there are some very harsh pitfalls to be aware of.

On both GL and Vulkan, the basic way to do multiple passes is to have a main/forward pass doing all of the rendering, which then copies out its color buffer to RAM. That color buffer is then bound as a texture input to the second pass, which applies some fancy effect (such as tonemapping) and renders into the final, Compositor-allocated swapchain. This adds an extra pass, including the Store to RAM, which is complexity-independent and just a factor of resolution/bandwidth. The fact that the new pass is single-drawcall doesn’t matter in the slightest: the store is a fixed overhead that depends only on the texture resolution. Compared to the visual quality of a standard forward renderer, that’s a fine tradeoff if the developer needs it, especially on Quest 2.

In OpenGL specifically though, there’s one hard limitation here. FFR is texture-driven and applied by the compositor, not the app, and affects only the framebuffer rendering into the VRAPI swapchain (which in this case, is the second pass). Therefore, if the developer activated FFR, the fragment-expensive main pass wouldn’t get foveated: it would store non-foveated pixels to RAM, and then the cheap tonemapping pass would foveate those pixels (losing all the hard-gained precision in the process). There’s no clean solution to this problem in OpenGL, which doesn’t allow the compositor to drive FFR settings on the main pass through its calls to the QCOM_texture_foveated extension (https://www.khronos.org/registry/OpenGL/extensions/QCOM/QCOM_texture_foveated.txt). Vulkan, however, provides the developer with the foveation parameters through the RG8_UNORM foveation-control texture, whose contents are Compositor-controlled, and can be bound by the developer to any renderpass (the final one, the main one, whatever).

If you remember the explanations about why tiled renderers make sense, their goal is to optimize the multiple reads and writes per pixel to the color and depth buffer while the final pixel is computed. In the very specific case of a full-screen effect rendering a full-screen quad with no depth buffer and no MSAA, the GPU actually won’t read/write in a loop—the sole computed fragment will be the final pixel color. In this case, it makes very little sense to go from GPU core->tilemem->RAM, so the QCOM GPU will use heuristics to detect that behavior and go directly from GPU core to RAM, acting as an immediate mode GPU and not as a tiled GPU. This is called Direct Mode rendering, and it’s how the Unity tonemap is executed. Given that FFR on our GPU is a per-tile effect (the resolution gets modified per tile), FFR gets deactivated when a surface executes in Direct Mode, which explains why the two-pass OpenGL Unity rendering doesn’t get FFR at all even if FFR is enabled in the project settings:

  • Pass 1 doesn’t get FFR as it’s not rendering to the VRAPI swapchain.

  • Pass 2 doesn’t get FFR because, although it is rendering to the VRAPI swapchain, its no-depth-buffer full-screen pass executes as Direct Mode, deactivating FFR.

On the tools side, it’s quite trivial to figure out if something is rendering as Direct Mode. The entire surface will render as a “single tile:”

Surface 1 | 1216x1344 | color 32bit, depth 0bit, stencil 0 bit, MSAA 1 | 1 1216x1344 bins | 2.01 ms | 1 stages : Render 2.01ms
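That check is easy to automate. The following sketch treats a surface as Direct Mode when it reports exactly one bin whose size equals the surface size; the line format is again inferred from the sample lines in this post, and `is_direct_mode` is a hypothetical helper, not part of ovrgpuprofiler.

```python
import re


def is_direct_mode(trace_line):
    """True if the surface rendered as a single bin covering the whole surface."""
    surface = re.search(r"\|\s*(\d+)x(\d+)\s*\|", trace_line)
    bins = re.search(r"\|\s*(\d+)\s+(\d+)x(\d+)\s+bins", trace_line)
    if not surface or not bins:
        raise ValueError("unrecognized trace line")
    surface_size = tuple(int(g) for g in surface.groups())
    count, bin_w, bin_h = (int(g) for g in bins.groups())
    return count == 1 and (bin_w, bin_h) == surface_size


tiled = ("Surface 1 | 1216x1344 | color 32bit, depth 24bit, stencil 8 bit, MSAA 2 | "
         "28 320x192 bins | 10.62 ms | 171 stages : Binning : 0.305ms")
direct = ("Surface 1 | 1216x1344 | color 32bit, depth 0bit, stencil 0 bit, MSAA 1 | "
          "1 1216x1344 bins | 2.01 ms | 1 stages : Render 2.01ms")

print(is_direct_mode(tiled))   # False
print(is_direct_mode(direct))  # True
```

Any surface flagged this way will not get FFR, per the explanation above.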

MULTIPLE PASSES: PART 2 — SUBPASS THEORY

Vulkan has introduced a more tile-friendly way to execute multipass rendering: subpasses. Some extensions try to simulate that behavior on OpenGL ES, such as ARM_shader_framebuffer_fetch, but especially with MSAA we don’t encourage their use.

In Part 1, we explained the “standard” way of executing passes, where the GPU would effectively execute all of Pass 1, store it in RAM, then execute all of Pass 2 by reading from Pass 1’s RAM contents and storing Pass 2’s output in RAM. However, if and only if Pass 2’s pixels only look at their own pixel coordinate from Pass 1’s output, we could skip the store and load between passes and sequentially execute them in tile memory so that we only have to store the output of Pass 2 to RAM. This is true for full-screen effects like tonemapping and vignetting, but not for bloom and depth of field (as these effects rely on values of surrounding pixels to color any given pixel). A very crude ovrgpuprofiler -t -v output would effectively go from

Surface1 (Pass1)
render
store
render
store


Surface2 (Pass2)
render
store
render
store


to


Surface1
render
render
store
render
render
store


This is what subpasses are there for: to stay within one surface execution and allow sequential, in-tile-memory dependencies. This is equivalent to Tile Shading on Apple’s Metal API. You can use this to do an in-tile-memory tonemapping renderer, where subpass 0 outputs a pre-tonemap color buffer, and subpass 1 reads from it (in tile memory!), tonemaps it, and stores a tonemapped result in the VRAPI swapchain. Subpass 0’s output in this case is called an INPUT ATTACHMENT to subpass 1. In this case, the pre-tonemap color buffer would not be stored out to RAM, which can have significant performance benefits, and the tonemap pass would read its input attachment from fast tile memory instead of from RAM. Yay!
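The "own pixel coordinate only" condition can be demonstrated on the CPU with a toy image. The sketch below applies an effect tile by tile, with each tile seeing only its own pixels, and compares the result with applying it to the whole image. The Reinhard operator and the 3-tap blur are stand-ins chosen for illustration; they are not the effects any particular engine uses.

```python
def tonemap(img):
    # Reinhard operator: each output pixel depends only on the same input pixel.
    return [[v / (1.0 + v) for v in row] for row in img]


def blur(img):
    # 3-tap horizontal box blur with clamped edges: reads neighboring pixels.
    out = []
    for row in img:
        n = len(row)
        out.append([(row[max(x - 1, 0)] + row[x] + row[min(x + 1, n - 1)]) / 3.0
                    for x in range(n)])
    return out


def per_tile(effect, img, tile_w):
    """Apply `effect` to each tile in isolation, as if only that tile were in tile memory."""
    out = [[] for _ in img]
    for x0 in range(0, len(img[0]), tile_w):
        tile = [row[x0:x0 + tile_w] for row in img]
        for y, row in enumerate(effect(tile)):
            out[y].extend(row)
    return out


image = [[0.0, 1.0, 2.0, 3.0],
         [4.0, 5.0, 6.0, 7.0]]
print(per_tile(tonemap, image, 2) == tonemap(image))  # True
print(per_tile(blur, image, 2) == blur(image))        # False
```

The tonemap gives the same image either way, so it can run as a subpass in tile memory. The blur needs pixels from the neighboring tile, which is why such effects still require the full image in RAM first.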

Subpasses are used by UE4.25 and above to give translucent shaders the option of reading the opaque pass’s depth; the engine renders its opaque objects and translucent objects in two separate subpasses. The opaque objects are rendered first. Then the depth buffer of the opaque subpass is bound to the translucent subpass as input attachment, and the translucent objects can read from it in their pixel shader for cheap depth-based effects. This makes it possible to develop a custom subpass-based tonemapper that can be enabled or disabled dynamically.

MULTIPLE PASSES: PART 3 — HOW TO MAKE SUBPASSES REALLY SLOW

It’s important to realize that there are many configurations that will force the GPU driver to execute the subpasses you configured as separate passes, instead of sequential subpasses, and destroy the entirety of that theoretical performance boost. In some cases, the result is actually much, much slower. The main three pitfalls I have seen developers run into are:

1) Intermediate (non-final) subpasses with pResolveAttachments entries for MSAA -> nonMSAA resolve.

Our GPU’s HW-accelerated MSAA resolve chip is integrated in the Store pipeline from tile memory to RAM. If you ask an intermediate MSAA subpass to resolve its content as non-MSAA through the pResolveAttachments attribute, it will be forced to store its contents to RAM (as if it were a normal renderpass). The next subpass will also have to execute as another pass and reload all of its normally in-tile-memory input attachments from RAM. Oops.

In this case, you need to have the next subpass read from an unresolved MSAA input attachment through the subpassLoad(input,subsampled) GLSL function, one call per-subsample, and do your own MSAA resolve manually. Be aware of the performance tradeoffs for different MSAA settings. On the Adreno 540 (Quest) and Adreno 650 (Quest 2) GPUs, subpasses can read in parallel two input attachment subsamples but not four, so a 2xMSAA shader read is “free,” but a 4xMSAA shader read really isn’t.
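For reference, the arithmetic of a manual resolve is simple. The sketch below models it on the CPU, assuming a plain box filter (an unweighted average of the subsamples); in a real shader each element of `samples` would come from one subpassLoad call, and the hardware resolve may use a different filter.

```python
def resolve_pixel(samples):
    """Box-filter resolve: average the per-sample RGB values of one pixel."""
    n = len(samples)
    return tuple(sum(s[channel] for s in samples) / n for channel in range(3))


# A 4xMSAA pixel on an edge: two samples covered by red, two by blue.
edge_pixel = [(1.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 0.0, 1.0), (0.0, 0.0, 1.0)]
print(resolve_pixel(edge_pixel))  # (0.5, 0.0, 0.5)
```

The number of reads grows with the MSAA level, which is where the 2x versus 4x tradeoff described above comes from.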

2) Intermediate subpasses with STORE operation attachments.

The entire point of subpasses is to only have the last subpass store its contents to RAM! Don’t forget to look through all the attachments, and only have the last subpass’s attachments (preferably a non-MSAA color attachment or a resolve attachment; there is no reason to ever have MSAA subsamples go to RAM) have STORE_OP_STORE attributes. Those attributes only cover what you want to store to RAM at the end of the renderpass execution and not inter-subpass dependencies. If you need an MSAA colorbuffer to stay valid between subpass 0 and subpass 1, it does not need to have a STORE op!

3) Having overly conservative subpass dependencies (pDependencies).

This is the trickiest one. Vulkan will ask you to define the dependencies between your subpasses to know what it can execute in parallel and what it cannot. If, for example, you give as subpass dependency a dstAccessMask of VK_ACCESS_SHADER_READ_BIT, you’re telling the GPU driver “I need the output of subpass0 to be shader readable using descriptor sets by subpass1.” Although that looks quite normal, this actually destroys the subpass model—if subpass1 is supposed to be able to read subpass0’s output using a texture sampler, it could read any texel of subpass0’s output, including one in a completely different tile! Subpass0 and Subpass1 then cannot execute sequentially in tile memory, and they execute as separate renderpasses. The correct mask here is VK_ACCESS_INPUT_ATTACHMENT_READ_BIT, which forces the dependency to only be on the same pixel, as the input attachment’s read function subpassLoad doesn’t have a UV parameter.

We’ve worked with a number of teams and individuals who have attempted to write a subpass-based multipass rendering system in Vulkan. Every one of them had at least one of those bugs—most of them, all three—which destroyed performance. It’s a near certainty that these aren’t the only conditions that will make the GPU driver execute subpasses as renderpasses, so it’s important to understand how to use the right tools to discover when that happens.
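As a first line of defense before profiling, the three pitfalls can be turned into a simple checklist. The sketch below runs over a hypothetical, simplified description of a renderpass (plain dictionaries, not the Vulkan API), with strings that mirror the Vulkan names used above. It only covers the three cases listed in this post and cannot tell you what the driver will actually do.

```python
def subpass_merge_problems(renderpass):
    """Return reasons why the subpasses may end up executing as separate passes."""
    attachments = renderpass["attachments"]
    subpasses = renderpass["subpasses"]
    problems = []

    # Pitfall 1: an intermediate subpass resolves MSAA, which forces a store to RAM.
    for i, subpass in enumerate(subpasses[:-1]):
        if subpass.get("resolve"):
            problems.append(f"subpass {i} has a resolve attachment")

    # Pitfall 2: only the outputs of the last subpass should be stored to RAM.
    final = subpasses[-1]
    final_outputs = set(final.get("color", [])) | set(final.get("resolve", []))
    for index, att in enumerate(attachments):
        if att["store_op"] == "STORE" and index not in final_outputs:
            problems.append(f"attachment {index} ({att['name']}) is stored but is not a final output")

    # Pitfall 3: sampler-style reads can touch any texel, so subpasses cannot share a tile.
    for dep in renderpass.get("dependencies", []):
        if "VK_ACCESS_SHADER_READ_BIT" in dep["dst_access"]:
            problems.append(f"dependency {dep['src']}->{dep['dst']} uses VK_ACCESS_SHADER_READ_BIT")
    return problems


bad = {
    "attachments": [{"name": "hdr_msaa", "store_op": "STORE"},
                    {"name": "swapchain", "store_op": "STORE"}],
    "subpasses": [{"color": [0]}, {"input": [0], "color": [1]}],
    "dependencies": [{"src": 0, "dst": 1, "dst_access": ["VK_ACCESS_SHADER_READ_BIT"]}],
}
good = {
    "attachments": [{"name": "hdr_msaa", "store_op": "DONT_CARE"},
                    {"name": "swapchain", "store_op": "STORE"}],
    "subpasses": [{"color": [0]}, {"input": [0], "color": [1]}],
    "dependencies": [{"src": 0, "dst": 1, "dst_access": ["VK_ACCESS_INPUT_ATTACHMENT_READ_BIT"]}],
}
print(len(subpass_merge_problems(bad)))  # 2
print(subpass_merge_problems(good))      # []
```

A clean result here only means none of the known mistakes were made; the trace output remains the source of truth.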

Put simply, it is critical for developers who write such code to profile their renderer using renderstage tracing tools like ovrgpuprofiler or GPU systrace, to look at their render bins and see if they’re executing sequentially or in different Surface executions with loads and stores everywhere. There is literally no other way to know if the performance you’re getting is nominal or not.