Optimizing VR Graphics with Late Latching

Oculus Developer Blog
Posted by Atman Binstock
March 1, 2015

TLDR: Optimizing VR graphics pipelines is a challenge usually involving tradeoffs between quality, latency, and performance. However, queue-ahead with late-latching almost seems like a free lunch: increasing GPU performance without increasing latency. As we’ll discuss, it’s not actually free, but we’re trying to pick up the tab.

Why minimizing motion-to-photon latency is critical to comfortable presence in VR has been covered well in many other places (see Michael Abrash’s blog post). In this post we will talk about how late-latching can be used to reduce latency of the VR graphics pipeline.

For decades, 3D computer graphics has been optimized to increase throughput with little concern for latency. This is reflected in the deeply pipelined architecture found in modern GPUs, drivers, and OSs. In order to maintain maximum utilization of GPU resources, the application runs well ahead of the GPU, with draw commands being buffered sometimes in several stages before being executed. On a modern Windows machine, this can result in up to three frames of latency in the graphics pipeline, not even counting in-application pipelining or display latency.

This diagram is an idealized timeline of what one frame of a VR application looks like travelling through the graphics pipeline.

How to read this diagram:

  • The first row is time in milliseconds. This hypothetical display is 100 Hz.
  • The second row is vblank events, which happen every 10ms.
  • Next is the tracking timeline – it shows “PO” at ms 0, indicating that position and orientation were sampled by the app then.
  • The app CPU timeline shows it active, sampling pose and rendering from ms 0 to 7.
  • Below that is the GPU timeline, which shows two previous frames (ms 6-13 and 14-21) and then our current frame from ms 22 to 28. At ms 29 the distortion/warp pass is executed on the GPU.
  • Next is the flip timeline, which shows at ms 30 the frame finally making it to the front buffer, where it will be scanned out as soon as vblank ends.
  • The last two rows show the full position and orientation latencies: both were sampled at ms 0 and displayed at ms 30, which is 30ms of latency.
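The latency arithmetic in the diagram can be sketched in a few lines of Python (illustrative numbers only, matching the hypothetical 100 Hz display):

```python
VSYNC_PERIOD_MS = 10   # hypothetical 100 Hz display from the diagram
QUEUED_FRAMES = 2      # two previous frames already queued on the GPU
CPU_FRAME = 1          # the frame the CPU is building right now

# Pose sampled at ms 0; the frame only scans out after the CPU frame and
# the queued GPU frames have each consumed one vsync period.
pose_sample_ms = 0
scanout_ms = pose_sample_ms + (CPU_FRAME + QUEUED_FRAMES) * VSYNC_PERIOD_MS

latency_ms = scanout_ms - pose_sample_ms
print(latency_ms)  # 30
```

Three vsync periods of pipelining turn directly into 30 ms of motion-to-photon latency, before display latency is even counted.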

Simple Synced VR Graphics Pipeline

This sort of latency is unacceptable for VR, so the simplest way to deal with it is to prevent the CPU from running far ahead of the GPU. One way to do this is by syncing the CPU and GPU every frame, typically right after vsync. This can be accomplished with events or buffer-locks to force-stall the pipeline.

The result is an abstracted frame structure that looks like:

vr::WaitForVsync();                    // sync the CPU to the display
render_pose = vr::SamplePose();
engine::Render(render_pose);
vr::RenderDistortionWarp(render_pose);
gfx::Present();
gfx::GpuSync();                        // stall until the GPU has finished

[Disclaimer: None of the code snippets are real, nor do they map to the Oculus API; they are simply intended to show the coordination between engine, VR SDK, and graphics API.]

The resulting pose-to-scanout latency is then one frame, which is reasonable for VR.

However, orientation latency can be reduced a little more with late orientation warping, where the rendered eye targets are warped by an updated orientation. This can be done with a simple warp because a pure rotation of the virtual camera doesn’t cause any parallax and therefore doesn’t need any new pixels rendered from dis-occlusion. To an engine, this looks like:

render_pose = vr::SamplePose();
engine::Render(render_pose);
WaitUntilVsyncMinus(3ms); // some safety buffer
warp_pose = vr::SamplePose();
vr::RenderDistortionWarp(render_pose, warp_pose.orientation);
gfx::Present();
gfx::GpuSync();

This results in an orientation pose-to-scanout latency of approximately 3ms.

The downside of this simple synced pipeline is greatly reduced GPU utilization. After finishing the distortion pass, the GPU is idle until well after vsync, when the CPU starts feeding it again. CPU/GPU parallelism is also erased: in particular, single-threaded games (ones that do game-update and rendering on the same thread) will be hit hard, as the GPU will be completely idle during the game update. It also makes performance more brittle, as we have less margin to absorb the performance issues introduced by thread scheduling, driver buffer management, and a host of other non-deterministic factors on a modern machine.
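To put rough numbers on the utilization loss, here is an illustrative Python sketch; the frame budget and game-update time are assumptions, not measurements:

```python
# Assumed timings (ms) for a hypothetical single-threaded app on a
# 10 ms (100 Hz) frame budget; illustrative numbers only.
FRAME_MS = 10.0
game_update_ms = 4.0   # CPU-only work; nothing is queued for the GPU

# Synced pipeline: the GPU sits idle while the CPU runs the game update,
# so all GPU work must fit in what is left of the frame.
max_gpu_synced = FRAME_MS - game_update_ms   # 6.0 ms of GPU budget

# Queue-ahead by one frame: last frame's buffered commands keep the GPU
# busy during this frame's game update, so the whole frame is GPU budget.
max_gpu_queued = FRAME_MS                    # 10.0 ms of GPU budget

print(max_gpu_synced, max_gpu_queued)
```

Under these assumed numbers, syncing every frame forfeits 40% of the GPU budget to game-update time alone.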

Limited Queue-Ahead

So what if we let the CPU run ahead by just one frame?

pose = vr::SamplePose();
engine::Render(pose);
vr::RenderDistortionWarp(pose);
gfx::Present();
// no GpuSync(): the CPU immediately starts the next frame, letting
// commands queue up one frame ahead of the GPU

It would trade an additional frame of latency for increased CPU/GPU parallelism and GPU utilization. Tempting, right? Unfortunately, we already know the result of this: additional latency means less comfortable and compelling VR in many experiences.


What if we could have the performance advantages of queue-ahead without paying with more latency? Sounds too good to be true, right? Well, it’s not quite a free lunch, but it’s pretty close.

The basic idea is simple and has been around forever: reduce input latency by having a side-channel that delivers input late into a pipelined graphics system. This has been used in several generations of consoles, where the strict ordering requirements of PC APIs need not be respected. It also pre-dates modern 3D graphics in the form of HW mouse cursors.

When we queue-ahead, we have frame rendering commands sitting in various buffers with a head-tracking pose that is getting more stale by the millisecond. What if we could just poke in an updated pose right before the commands were executed on the GPU? That’s basically what late-latching is: we feed updated head-tracking pose data into the GPU by a side channel and have the GPU first “latch” this pose data before executing the long-queued rendering commands that make up the frame.

“Latch” here means atomically copy out the constantly-updated pose data to a private buffer for later use. The actual mechanisms involved in updating and latching the side-channel aren’t going to be discussed here as they would make up a full blog post on their own.
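As a rough illustration of what "latch" means here (a CPU-side Python analogy, not the actual GPU side-channel mechanism):

```python
import threading

# Illustrative analogy for latching: a tracking side keeps overwriting a
# shared pose, and a latch atomically snapshots the freshest value into
# a private buffer, immune to later updates.

class PoseChannel:
    def __init__(self):
        self._lock = threading.Lock()
        self._pose = (0.0, 0.0, 0.0)  # stand-in for position/orientation

    def update(self, pose):
        """Called continuously by the tracking side."""
        with self._lock:
            self._pose = pose

    def latch(self):
        """Atomically copy out the current pose for later use."""
        with self._lock:
            return self._pose  # tuples are immutable: a true snapshot

channel = PoseChannel()
channel.update((1.0, 2.0, 3.0))
latched = channel.latch()
channel.update((9.0, 9.0, 9.0))  # later updates don't touch the latch
print(latched)  # (1.0, 2.0, 3.0)
```

The real mechanism has to provide the same atomic-snapshot guarantee on the GPU timeline, which is where the complexity this post skips over lives.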

The simplest way to add late-latching to a queue-ahead pipeline is by late-latching orientation for the rotation warp at the end of the frame. This can be done more or less transparently to the graphics engine:

render_pose = vr::SamplePose();
engine::Render(render_pose);
latched_warp_pose_buf = vr::EnqueuePoseLatch();
vr::RenderDistortionLateLatchedWarp(render_pose, latched_warp_pose_buf);
gfx::Present();

In this example, EnqueuePoseLatch submits a GPU command that, when executed, takes the latest pose from tracking and latches it into the buffer. RenderDistortionLateLatchedWarp makes use of this buffer, applying the warp based on the latched orientation data. Here is a frame timing diagram for this scenario:

The interesting part happens at ms 17: after the GPU renders the frame, it performs the late-latching. The latched orientation value is then used by the warp at ms 18. This reduces orientation latency back to 3ms — approximately what it was in the late-warped pipeline.

However, the position latency is still much higher. We can improve this by late-latching position too, though it isn’t going to be as easy:

latched_pose_buf = vr::EnqueuePoseLatch();
engine::Render(latched_pose_buf);
latched_warp_pose_buf = vr::EnqueuePoseLatch();
vr::RenderDistortionLateLatchedWarp(latched_pose_buf, latched_warp_pose_buf);
gfx::Present();

Now in the GPU timeline, we are latching position at ms 9, right before the GPU begins rendering the frame. Position latching needs to happen before the frame because the frame render depends on this position: unlike orientation, there isn't a simple 2D warp to fix up position after rendering. Fortunately, human heads change position much more slowly than they change orientation, so this extra latency seems to be a reasonable compromise.

Wait, that code looks pretty easy! Well actually, that code cheats a little and hides a bunch of complexities. First, note that all real graphics engines need to do culling and the like, so at least an approximation of the render pose is needed:

cull_pose = vr::SamplePose();
latched_pose_buf = vr::EnqueuePoseLatch();
engine::Render(cull_pose, latched_pose_buf);
latched_warp_pose_buf = vr::EnqueuePoseLatch();
vr::RenderDistortionLateLatchedWarp(latched_pose_buf, latched_warp_pose_buf);
gfx::Present();

And this points to the real complexity of late-latching position: the actual final render pose (aka view matrix) isn’t available to the CPU when the graphics engine goes to render a frame. This has a few implications:

  • The CPU code can’t use the view matrix directly in any way, e.g. concatenating a model-view matrix or setting up view-space lighting.
  • Consequently, the CPU can’t directly send the view matrix to any shader. Instead, EnqueuePoseLatch() queues the commands to latch the latest pose into a GPU buffer latched_pose_buf and subsequent rendering commands must use that buffer (or buffers derived from it) to get the view matrix.

This means that potentially any shader that uses the view matrix (or anything derived from it), or operates in view-space, needs to be reworked. For a complex modern engine, this could be a non-trivial amount of engineering. We’re actively pursuing this approach with Unity and Unreal as well as future SDK support for custom engines.
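To picture that rework, here is a hypothetical Python sketch (toy 2x2 matrices; none of these names map to a real engine or the Oculus API): the model-view concatenation moves off the CPU into code that runs when the GPU commands execute, reading the view matrix from the latched buffer:

```python
# Hypothetical sketch: with late-latching, the CPU can no longer bake the
# view matrix into a model-view product, because the final view matrix
# doesn't exist yet. The multiply moves to "shader" code (simulated here)
# that reads the latched buffer at execution time.

def matmul(a, b):
    """Multiply two 2x2 matrices given as nested lists (toy stand-in)."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

model = [[1, 0], [2, 1]]

# Before late-latching: the CPU samples a (soon stale) view and pre-bakes it.
stale_view = [[1, 1], [0, 1]]
cpu_baked_model_view = matmul(stale_view, model)

# With late-latching: the CPU submits `model` alone; the "shader" fetches
# the freshest view from the latched buffer when the commands execute.
latched_pose_buf = {"view": [[1, 3], [0, 1]]}  # updated just before GPU runs

def shader(model_matrix, latched_buf):
    # Concatenation now happens GPU-side, using the latched view matrix.
    return matmul(latched_buf["view"], model_matrix)

fresh_model_view = shader(model, latched_pose_buf)
print(cpu_baked_model_view != fresh_model_view)  # True: the CPU copy is stale
```

Every place an engine does this kind of CPU-side pre-baking is a place that needs the same indirection, which is why retrofitting a complex engine is non-trivial.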

Rabbit Holes

There’s a whole host of topics about late-latching that we aren’t getting to here. For one, there’s the difficulty of actually making it work on current OSs running on diverse hardware. Also, there’s a bunch of devil-in-the-details issues around frame timing. And then the deepest rabbit hole: rendering is now running ahead of the game code!


Making VR great is going to take a lot of hard work and cooperation between all the players in the ecosystem. We’re sharing our progress on late-latching, along with other VR research, to encourage leading GPU companies like AMD and NVIDIA to adopt and promote these techniques as well. The entire technology stack (application, SDK, OS, driver) has to run well to deliver a great VR experience — because this is so critical to the success of VR, we’ll need the support of GPU and OS vendors to make it a reality.

Optimizing VR graphics pipelines is a challenge usually involving tradeoffs between graphical quality/artifacts, latency, and performance. And it often seems like a zero-sum game, where you win on one dimension only by losing on the others. However, queue-ahead with late-latching feels like getting something for nothing: we can improve latency or performance (depending on where you started) without giving up quality. VR is all about the magic of the experience; late-latching lets you deliver more magic with just some coding — it’s not quite a free lunch, but hopefully a pretty good deal.

— Atman Binstock, Chief Architect