Techniques for Improved VR Video w/ John Carmack

Oculus Developer Blog
Posted by John Carmack
October 16, 2019

In an effort to maximize the immersive, 360 video experience, Oculus CTO John Carmack continues to test and experiment with VR video technology. Today he provides a number of lessons learned from these experiments, along with the project files so you can test and implement these methods on your own.

Special thanks to NextVR for allowing us to leverage their footage for John’s experiments discussed in this blog post.

Project File Download

Most mobile VR developers wind up integrating some sort of standard media player, something like ExoPlayer, for playing videos. While there are a number of good, general purpose players, most are not great for 60 fps video, and the standard behavior of drawing the resulting surface into the 3D world leaves a lot of quality on the table.

This is a collection of several experiments I have done over the years that demonstrate lower level techniques for managing the video decoding pipeline and displaying in VR for optimal quality. The key techniques are:

  • Display the video surface(s) on a compositor layer, don’t draw them into your 3D world.
  • Choose the correct VR refresh rate of 60 or 72 fps based on the video content.
  • Copy the decoded frames to a deep swapchain instead of directly using the Surface to mitigate small glitches.
  • Use multiple threads and condition variables to efficiently manage the codecs.
  • Use an sRGB format swapchain and framebuffer to get gamma-correct filtering.
  • Precisely lock the video and audio rate to the display rate.

These apply equally to conventional video shown on virtual screens as well as immersive media.
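
As a rough illustration of the refresh rate choice (a sketch, not code from the project files), one simple rule is to pick whichever supported display rate is closest to a whole multiple of the video frame rate. The function name and the tie-breaking rule are assumptions made for this example.

```python
def choose_refresh_rate(video_fps, supported=(60.0, 72.0)):
    """Pick the display rate that is closest to a whole multiple of video_fps.

    Hypothetical helper: the selection rule is an assumption, not the
    logic used in the project files.
    """
    def judder(rate):
        ratio = rate / video_fps
        # Distance from the nearest whole number of refreshes per frame.
        return abs(ratio - round(ratio))

    # min() keeps the first entry on ties, so 60 wins if both divide evenly.
    return min(supported, key=judder)


for fps in (24, 30, 36, 60):
    print(fps, "fps video ->", choose_refresh_rate(fps), "Hz display")
```

With this rule, 30 and 60 fps content lands on 60 Hz, while 24 and 36 fps content lands on 72 Hz.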

I generally recommend 5120 (5k) as the highest resolution for 360 or side-by-side 180 video on current headsets. Quest has a little bit more density in the green subpixels (but less in red and blue), so you could make an argument for 5760, but you are probably better off spending the bitrate on better pixels at 5120.

The safe bet for resolution is 4096x2048 at 60 fps or 3840x3840 at 30 fps, but you can go beyond that in many cases. Note that on Go and Quest there are different, somewhat obscure, limits for the different video codecs:

H264 cannot decode 4096x4096 video at 30 fps, even though it can decode 4096x2048 at 60 fps. This is due to a limit of 65535 16x16 pixel macroblocks, and 4096x4096 works out to 65536 of them, one over the limit. Most people just use 3840x3840 for 30 fps video, but you could push to 4096x4080 if you wanted to.

H264 doesn’t have a hard limit on either the width or height, as long as you stay under the macroblock limit. This means you can have 5760x2880 videos.

H265 has a hard limit of 4096 wide, but no limit on height, so if you want a 5120x2560 video, you need to transpose it into a 2560x5120 video and transpose it back for display. This can be a reason to use top-bottom 180 stereo views for standard players, but I usually just transpose the video for my experiments, which will also work for 360 video.

H265 can decode 4096x4096 at 30 fps and appears to have some headroom above 64k blocks at lower frame rates.
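
The H264 macroblock arithmetic above is easy to check. The sketch below only counts 16x16 macroblocks against the 65535 limit quoted in this post. It says nothing about whether the decoder can keep up at a given frame rate, and the function names are made up for the example.

```python
H264_MAX_MACROBLOCKS = 65535  # limit quoted in the post for Go and Quest
MACROBLOCK_SIZE = 16


def macroblock_count(width, height):
    """Number of 16x16 macroblocks needed to cover a frame (rounding up)."""
    blocks_wide = (width + MACROBLOCK_SIZE - 1) // MACROBLOCK_SIZE
    blocks_high = (height + MACROBLOCK_SIZE - 1) // MACROBLOCK_SIZE
    return blocks_wide * blocks_high


def fits_h264_macroblock_limit(width, height):
    return macroblock_count(width, height) <= H264_MAX_MACROBLOCKS


for w, h in [(4096, 4096), (4096, 4080), (3840, 3840), (5760, 2880)]:
    print(f"{w}x{h}: {macroblock_count(w, h)} macroblocks,",
          "ok" if fits_h264_macroblock_limit(w, h) else "over the limit")
```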

The exact frame rate that you can decode a video at is somewhat content dependent. The spec only guarantees 4096x2048 at 60 fps, but H264 seems to have some additional headroom, and we have seen videos up to 4800x2400 still decoding smoothly. That is asking for trouble, though. H265 seems to be limited closer to the actual spec.

The code linked above includes a general purpose BufferedVideo class that handles efficiently decoding video into a deeply buffered swapchain, synchronizing the playback with the actual screen’s refresh rate (which is not exactly 60.0 or 72.0 fps) and doing a subtle resampling of the audio to exactly match that rate. That core class is used for generic presentation of an immersive or conventional video, as well as a few specializations:
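
To make the audio resampling idea concrete, here is a minimal sketch. It is not the BufferedVideo implementation. It assumes one video frame is shown per display refresh, so a display that really runs at 59.9 Hz plays a nominal 60 fps video slightly slow and the audio has to be stretched by the same factor. The 59.9 Hz figure is a made-up measurement, and linear interpolation stands in for whatever filter a real resampler would use.

```python
def audio_resample_ratio(video_fps, display_hz):
    """Output samples to generate per input sample.

    Assumes one video frame per display refresh, so playback speed is
    display_hz / video_fps and the audio must be stretched by the inverse.
    """
    return video_fps / display_hz


def resample_linear(samples, ratio):
    """Resample a list of float samples by `ratio` using linear interpolation."""
    out_len = int(len(samples) * ratio)
    out = []
    for i in range(out_len):
        pos = i / ratio                       # position in the input signal
        i0 = int(pos)
        i1 = min(i0 + 1, len(samples) - 1)    # clamp at the last sample
        frac = pos - i0
        out.append(samples[i0] * (1.0 - frac) + samples[i1] * frac)
    return out


ratio = audio_resample_ratio(60.0, 59.9)      # hypothetical measured display rate
one_second = [0.0] * 48000                    # one second of 48 kHz silence
print(ratio, len(resample_linear(one_second, ratio)))
```

The correction is tiny, well under one percent in this example, which is why the post describes it as a subtle resampling.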

Cropped video: You can get a full 180 degree horizontal field of view at 5k resolution and 60 fps if you crop the vertical to around 100 degrees, or 5120x1536. For content with a lot of side-to-side view changes, like sports games, this can be an excellent tradeoff.
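
A quick back-of-the-envelope check of the cropped format, as a sketch under two assumptions that the post does not spell out: the 5120 width is side-by-side stereo (2560 pixels per eye across 180 degrees), and the projection has the same pixels per degree vertically as horizontally.

```python
SPEC_PIXEL_RATE = 4096 * 2048 * 60   # decode rate the post says is guaranteed


def pixel_rate(width, height, fps):
    """Pixels the decoder has to produce per second."""
    return width * height * fps


def vertical_fov_degrees(height, eye_width, horizontal_fov=180.0):
    """Vertical coverage, assuming equal pixels per degree on both axes."""
    return height * horizontal_fov / eye_width


cropped = pixel_rate(5120, 1536, 60)
print("cropped pixel rate:", cropped, "of budget", SPEC_PIXEL_RATE)
print("vertical coverage:", vertical_fov_degrees(1536, 5120 // 2), "degrees")
```

Under those assumptions the cropped video stays inside the 4096x2048 at 60 fps decode budget while covering a little over 100 degrees vertically.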

Inset video: If you want a full 180x180 degree field of view for immersion, but you still care more about the central region, you can have a peak resolution inset covering the center 90x90 degrees, and a lower resolution background everywhere else.

Interleaved video: There are dedicated video codec versions for 3D movies that would be useful for immersive video, but unfortunately none of the chipsets we use support them. I discovered that much of the benefit could be had by just encoding 3D videos frame-interleaved instead of side-by-side, and using the -preset veryslow option for x265, which causes it to check for motion prediction sources in more than just the previous frame. This can show up to a 30% compression benefit, depending on the content. Other codecs and options have catastrophically bad compression results with interleaved video, so this is of limited applicability.

A standard video player can’t deal with interleaved 3D frames, because you need simultaneous access to two decoded frames instead of just the most recent one, but with a deeply buffered swapchain it is straightforward.
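
The sketch below shows why a deep buffer makes this easy. It is a toy model, not the project's swapchain code: decoded frames are kept in a small ring, and the display side asks for both halves of a stereo pair at once. It assumes left-eye frames come first in each pair, and the class and method names are invented for the example.

```python
from collections import deque


class InterleavedFrameBuffer:
    """Toy stand-in for a deeply buffered swapchain of decoded frames."""

    def __init__(self, depth=8):
        # Oldest frames fall off automatically once the ring is full.
        self.frames = deque(maxlen=depth)     # entries are (frame_index, image)

    def push(self, frame_index, image):
        self.frames.append((frame_index, image))

    def stereo_pair(self, pair_index):
        """Return (left, right) for a stereo pair, or None if either is missing.

        Assumes frame 2*n is the left eye and frame 2*n + 1 is the right eye.
        """
        lookup = dict(self.frames)
        left_index, right_index = 2 * pair_index, 2 * pair_index + 1
        if left_index not in lookup or right_index not in lookup:
            return None
        return lookup[left_index], lookup[right_index]


buf = InterleavedFrameBuffer(depth=8)
for index, image in enumerate(["L0", "R0", "L1", "R1", "L2"]):
    buf.push(index, image)
print(buf.stereo_pair(1))   # both halves are buffered
print(buf.stereo_pair(2))   # right eye has not been decoded yet
```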

ffmpeg can re-encode a side-by-side video as interleaved with the following command:

    ffmpeg -i SideBySideVideo.mp4 -c:a copy -vf stereo3d=sbsl:al -crf 18 -preset veryslow -c:v libx265 interleavedVideo.mp4

Dual video: Some stereo 180 cameras produce two independent video files with fisheye lens distortion, which is usually processed (often called “stitched” as a somewhat-incorrect analogy to 360 video processing) into 180 equirect projection video by a software tool. The absolute minimum quality loss would be playing the two video files directly, doing the distortion mapping in the VR compositor. Normal video players wouldn’t be able to keep the two streams frame-synchronized, but it is possible with two instances of the buffered video player. The dual video player also includes some computer vision code to do a simplistic calibration of the camera lenses. This isn’t really practical for production use, but it is a good reference point to compare with the fully produced version.

Here is my initial post on the Z CAM raw video player I developed for Oculus Go, while the follow up post includes a stereo audio track and the ability to add a calibration file.

The most extreme code specialization is an improved version of the view dependent 5k x 5k, 60 fps stereo video player I developed last year.

Decoding four simultaneous video streams in the original version proved problematic. For reasons that we still don’t understand, the video decoding system degrades in performance on headsets that have been in active use for weeks, and this would eventually cause the 5k player to start to fall behind with slow motion playback and distorted audio. A reboot fixes it, but we still haven’t discovered a system level mitigation.

I always had the plan to somehow patch the individual strips together into a single video file that would decode more efficiently than multiple independent ones, but I finally hit on a reasonably straightforward way of doing it:

Instead of breaking the detail region up into ten independent video files, it is broken up into three files that each contain four strips, and the videos are explicitly encoded as four H265 “slices”, which means that they are in independent Network Abstraction Layer (NAL) units. This allows you to find the boundaries with a simple byte scan instead of decoding the full bitstream.
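
For a feel of what a byte scan for NAL boundaries can look like, here is a minimal sketch. It assumes the data is an Annex B style byte stream, where each NAL unit is preceded by a 00 00 01 start code. The post does not say which form the tooling actually scans, and MP4 files normally store length-prefixed NAL units instead, so treat this as an illustration only.

```python
START_CODE = b"\x00\x00\x01"


def find_nal_units(stream):
    """Return (header_offset, nal_unit_type) for each NAL unit in the stream.

    The H265 NAL header is two bytes; the type is the six bits that follow
    the leading forbidden_zero_bit of the first byte.
    """
    units = []
    pos = 0
    while True:
        pos = stream.find(START_CODE, pos)
        # Stop when no start code is left or the two header bytes are cut off.
        if pos < 0 or pos + len(START_CODE) + 2 > len(stream):
            break
        header = pos + len(START_CODE)
        nal_type = (stream[header] >> 1) & 0x3F
        units.append((header, nal_type))
        pos = header
    return units


# Hand-built test bytes: one VPS (type 32) and two IDR slices (type 19).
sample = (b"\x00\x00\x00\x01\x40\x01\xaa"
          b"\x00\x00\x01\x26\x01\xbb\xcc"
          b"\x00\x00\x01\x26\x01\xdd")
print(find_nal_units(sample))
```

Because the bitstream escapes any accidental 00 00 01 sequence inside a payload, the scan never has to parse the slice data itself.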

At runtime, for each frame, a synthetic video sample is created by patching together four of the twelve slices originally created. This has the efficiency of a single video decode instead of four.
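
One way to picture the patching step is sketched below. The layout is an assumption made for illustration, not something the post confirms: the twelve strips form a ring, strip k lives in file k // 4 at slice position k % 4, and a view always needs four neighboring strips. Under that layout every slice position in the synthetic sample is filled exactly once, each from whichever file holds the needed strip.

```python
NUM_FILES = 3
SLICES_PER_FILE = 4
NUM_STRIPS = NUM_FILES * SLICES_PER_FILE


def pick_slices(first_strip):
    """For each slice position, return (file_index, strip_index) to copy from.

    Hypothetical layout: strip k is stored in file k // 4 at position k % 4.
    """
    chosen = [None] * SLICES_PER_FILE
    for i in range(SLICES_PER_FILE):
        strip = (first_strip + i) % NUM_STRIPS
        chosen[strip % SLICES_PER_FILE] = (strip // SLICES_PER_FILE, strip)
    return chosen


def build_sample(slices_by_file, first_strip):
    """Concatenate one slice per position into a single synthetic sample.

    slices_by_file[f][p] is the raw bytes of slice position p in file f.
    """
    return b"".join(slices_by_file[file_index][position]
                    for position, (file_index, _) in enumerate(pick_slices(first_strip)))


print(pick_slices(3))    # window of strips 3, 4, 5, 6
print(pick_slices(10))   # window wraps around: strips 10, 11, 0, 1
```

The renderer would then need to know which strip ended up in which position so it can map each one back to the right part of the sphere.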

My original plan was to have the low resolution base view sized and transposed so it could be concatenated with the high detail slices into a single video decode for the entire thing, but another experiment had good enough results to change the direction.

A 2048x2048 base layer is really obviously blurry. I thought it might be a better trade to spend the same amount of pixel decode rate on a 2880x2880 base layer that only animated at 30 fps instead of 60 fps. This turned out to be a really good trade, but the size and frame rate made it incompatible with the detail strips, so it did require two independent video decodes.
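
The "same amount of pixel decode rate" claim is simple to verify with a few lines of arithmetic. This is just a worked check of the numbers in the paragraph above.

```python
def pixel_rate(width, height, fps):
    """Pixels the decoder has to produce per second."""
    return width * height * fps


old_base = pixel_rate(2048, 2048, 60)
new_base = pixel_rate(2880, 2880, 30)

print("2048x2048 @ 60 fps:", old_base, "pixels/s")
print("2880x2880 @ 30 fps:", new_base, "pixels/s")
print("pixels per frame ratio:", (2880 * 2880) / (2048 * 2048))
```

The larger base layer has almost exactly twice the pixels per frame, and at half the frame rate it costs slightly less decode throughput than the original.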

Since the base layer was always going to be decoded separately from the detail layers, it wasn’t necessary to have the tiny half-second video GoP sizes that the strips needed for rapid switching. Giving the base layer longer GoPs only affects the seek granularity, not the view adaption speed, so it allows a nice compression increase.

It turned out that there were more challenges than I expected, and some work really should be done in the x265 encoder, but I have workarounds for now:

It is necessary to make sure that all the sliced videos have precisely the same use of reference frames so they can be intermixed. Just setting scenecut=0 to disable automatic I-frame insertion wasn’t enough – it would still turn B frames into P frames when it looked like a compression savings. Scenecut-bias allows you to tune the comparison of bitrates for different encodings to determine when something is a scenecut, but it is clamped between 0 and 100, and to guarantee no scene cuts we would want to make it a very negative number. Setting it to “nan” is an awful hack that allows it to pass the parameter check and also happens to cause the is-scenecut calculation to always return false. I think scenecut=0 should force no promotion of B frames at all.

Enabling sliced encoding caused massive reductions in the compression ratio. It appears to be a problem with multi-threaded motion estimation in x265, but setting -frame-threads=1 is a workaround for now, and doesn’t hurt the encoding speed too badly.

The x265 encoder does not restrict the motion vector search to just the slice it is encoding, so if you are unlucky, it might reference some pixels that aren’t actually there when a slice is shuffled in from an adjacent file, which would occasionally cause an artifact at the top or bottom of the detail region. There is a FIXME comment in the x265 encoder source about this that would be good to address properly, but for now I am separating the slices with a bit of hot pink padding and a pair of firmly crossed fingers. Note that ideally, the motion and intra search range would be large enough to let the right eye half of the strips find useful pixels in the already decoded left eye section of the strips.
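
Pulling the workarounds above together, the sketch below builds a colon-separated option string of the kind ffmpeg accepts through -x265-params. The post does not show the actual encoder invocation, so the exact set of options is an assumption. In particular, the fixed keyint and min-keyint values are only inferred from the half-second GoP mentioned earlier, and the helper name is invented.

```python
def x265_strip_params(fps, gop_seconds=0.5, slices=4):
    """Build an x265 option string for the sliced strip encodes.

    Assumed settings, not taken from the project files:
      - keyint / min-keyint pinned to a fixed half-second GoP
      - scenecut=0 plus the scenecut-bias=nan hack described in the post
      - frame-threads=1 to work around the sliced-encoding compression loss
    """
    keyint = round(fps * gop_seconds)
    params = {
        "slices": slices,
        "keyint": keyint,
        "min-keyint": keyint,
        "scenecut": 0,
        "scenecut-bias": "nan",
        "frame-threads": 1,
    }
    return ":".join(f"{name}={value}" for name, value in params.items())


print(x265_strip_params(60))
```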

So, compared to the original 5k player, we have the following advantages:

  • No problems with slowdown.
  • Decode 4/12 of the high detail instead of 3/10, for 10% more pixels before you see low res.
  • Additional edge and temporal fading effects from Dear Angelica reduce detail pops.
  • The low res base is twice the resolution, but half the frame rate.

If someone went to all the effort to produce a 5k stereo 360 video they probably want to take control of all aspects of the encoding and packaging, but I have made the tooling support a single-command path that will go straight from a 5kx5k 60 fps video + opus spatial audio file to an installed apk file with just “videostrip newvid.mp4”. See the readme_5kplayer.txt file for step by step instructions.

Note that you should have a recent version of ffmpeg installed.

- John Carmack