Tech Note: Enhancing Oculus Lipsync with Deep Learning

Oculus Developer Blog
Posted by Sam Johnson, Colin Lea, and Ronit Kassis
August 20, 2018

At this year's F8, the Facebook developer conference, we showed an exciting new update to our audio-to-facial animation technology that allows us to drive expressive facial animation, in real-time, from spoken word in any language. We are now happy to announce that this technology is being released to developers in our latest Oculus Lipsync Unity integration update.

How Oculus Lipsync Works

Oculus Lipsync is a Unity integration used to sync avatar lip movements to speech sounds. It does this by analyzing an audio input stream, either offline or in real time, to predict a set of visemes that can be used to animate the lips of an avatar or Non-Playable Character (NPC). A viseme is a gesture or expression of the lips and face that corresponds to a particular speech sound (known as a phoneme). The term is used, for example, in lip reading, where a viseme is the basic visual unit of intelligibility, analogous to the phoneme in speech. In computer animation, visemes can be used to animate avatars so that they look like they are speaking.

Oculus Lipsync maps audio inputs to a space of 15 viseme targets: sil, PP, FF, TH, DD, kk, CH, SS, nn, RR, aa, E, ih, oh, and ou. Each viseme describes the facial expression produced when uttering the corresponding speech sound. For example, the viseme sil corresponds to a silent/neutral expression, PP corresponds to pronouncing the first syllable in “popcorn,” FF the first syllable of “fish,” and so forth. These targets were selected to give the maximum range of lip movement and are agnostic to language. For more information on these 15 visemes and how they were selected, please refer to the following documentation: Viseme MPEG-4 Standard. While that documentation includes reference images for the visemes, we found that artists had difficulty reproducing accurate geometry from them. To overcome this, we produced a set of much higher-resolution viseme reference images from multiple angles, which we believe are easier for artists to work from:
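As a toy sketch (not the Oculus Lipsync API), the 15 targets can be thought of as a per-frame weight vector. The target names below come from the list above, while the frame values are hypothetical:

```python
# Illustrative only: the 15 Oculus Lipsync viseme targets named above.
VISEMES = [
    "sil", "PP", "FF", "TH", "DD", "kk", "CH", "SS",
    "nn", "RR", "aa", "E", "ih", "oh", "ou",
]

# A hypothetical frame of viseme weights: the speaker is mid-way through
# an "aa" sound with a little "E" blended in (values are made up).
frame = dict.fromkeys(VISEMES, 0.0)
frame["aa"] = 0.8
frame["E"] = 0.2

# The maximally active viseme for this frame.
dominant = max(frame, key=frame.get)
```

A simple renderer could display a single texture for `dominant`, while a higher-fidelity one would consume the whole weight vector every frame.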

Oculus Viseme Reference for Unity

Oculus Viseme Reference for Unreal

The Evolution of Oculus Lipsync

When we first released Oculus Lipsync we were focused on enabling applications like Facebook Spaces, where it was used to generate rough animations of static lip shapes opening and closing. This was achieved by using the Lipsync plugin to drive what we call Texture-Flip style facial animation, as can be seen in the robot clip above. Here each viseme maps to a single texture and, at each frame, we display the texture for the maximally active viseme. Recent work in Social VR, including the Spaces update in early 2018, has used higher-fidelity, blendshape-based face models that demand higher-quality facial animation. Blendshape-based models take a weighted combination of different geometry shapes - or blendshapes - of the same topology and add them together to create a dynamic output shape. For such models, we must predict not only the maximally active viseme but a weight for every viseme, so that we can animate the model smoothly - the results of this can be seen above right. To achieve such high-fidelity facial animation, our research teams used a novel approach that combines advances in deep learning with knowledge of human speech production.
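The two animation styles above can be sketched as follows. Everything here (mesh size, weights, random blendshape offsets) is illustrative stand-in data, not Oculus Lipsync output:

```python
import numpy as np

# A hypothetical frame of weights over the 15 viseme targets.
viseme_weights = np.array([0.05, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
                           0.0, 0.0, 0.70, 0.25, 0.0, 0.0, 0.0])

# Texture-Flip: pick the single maximally active viseme and show its texture.
active_texture = int(np.argmax(viseme_weights))  # index 10, i.e., "aa"

# Blendshape: each viseme is a full-mesh vertex offset with the same
# topology; the output is the weighted sum added to a neutral mesh.
n_vertices = 4                                   # toy mesh for illustration
neutral = np.zeros((n_vertices, 3))
blendshapes = np.random.default_rng(0).normal(size=(15, n_vertices, 3))
output_mesh = neutral + np.tensordot(viseme_weights, blendshapes, axes=1)
```

Because the blendshape output varies continuously with the weights, smoothly changing viseme predictions yield smooth facial motion, which is why this style needs weights across all visemes rather than a single argmax.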

Predicting Visemes with Greater Accuracy

The original model for Oculus Lipsync, released in SDK 1.16.0, used a small, shallow neural network to learn a mapping between short segments of speech audio and phonemes - the units of sound that make up human speech. While this model worked reasonably well for English, we found that it did not work well for other languages and was not robust to background noise. As a collaboration between research and product, we invested in newer machine learning models, namely Temporal Convolutional Networks (TCNs), which have been shown to achieve significantly higher performance and robustness across tasks in other domains (e.g., vision and language). In internal testing, these TCN models achieved over 30% higher viseme accuracy on English speech and significantly outperformed the prior model on speech with heavy accents and large amounts of background noise. In the speech processing community, such models are referred to as acoustic models and are often used as the input to a speech recognition pipeline.

A diagram depicting the general TCN architecture is shown below. The model takes a stream of low-level audio features from the past as input and, in some cases (e.g., for offline applications), information from the “future” to predict a set of visemes. The precise parameters of this architecture (e.g., the number of layers and hidden states) are tuned to balance computational efficiency and performance, but the general layout is as shown.
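As a minimal sketch of the building block a TCN stacks - a dilated, causal temporal convolution - the following is illustrative; the kernel sizes, dilations, and feature dimensions of the shipped model are tuned separately and not reproduced here:

```python
import numpy as np

def causal_dilated_conv(x, w, dilation):
    """One dilated, causal temporal-convolution layer (illustrative).

    x: (T, C_in) feature sequence; w: (K, C_in, C_out) kernel.
    The output at time t depends only on x[t], x[t-d], x[t-2d], ...
    (the past), which is what makes streaming use possible.
    """
    T = x.shape[0]
    K, _, c_out = w.shape
    pad = (K - 1) * dilation
    xp = np.concatenate([np.zeros((pad, x.shape[1])), x], axis=0)
    out = np.zeros((T, c_out))
    for k in range(K):
        # Tap k reads the input (K-1-k)*dilation steps in the past.
        out += xp[k * dilation : k * dilation + T] @ w[k]
    return out
```

Stacking such layers with dilations 1, 2, 4, ... grows the receptive field exponentially with depth, letting the model condition on a long window of past audio at low cost; a non-causal variant with "future" taps would correspond to the offline configuration described above.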

Despite using a much more complex model than the prior Lipsync approach, we are able to perform processing very efficiently using a caching technique similar to the Fast WaveNet Generation Algorithm.
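The caching idea can be sketched as follows: rather than recomputing the full dilated convolution for every new audio frame, each layer keeps a short queue of its recent inputs, so one streaming step costs only K small matrix products per layer. This is a hedged illustration of the general Fast WaveNet-style technique, not the Oculus implementation; all names and sizes are made up:

```python
from collections import deque
import numpy as np

class StreamingDilatedConv:
    """Streaming form of a dilated causal conv layer with input caching."""

    def __init__(self, w, dilation):
        self.w = w                      # (K, C_in, C_out) kernel
        self.dilation = dilation
        K, c_in, _ = w.shape
        # Cache exactly the (K-1)*dilation most recent inputs, oldest first,
        # zero-initialized as if preceded by silence.
        self.queue = deque(
            [np.zeros(c_in) for _ in range((K - 1) * dilation)],
            maxlen=(K - 1) * dilation,
        )

    def step(self, x_t):
        """Consume one new frame x_t of shape (C_in,); return (C_out,)."""
        K = self.w.shape[0]
        taps = list(self.queue) + [x_t]
        # Tap k reads the cached input (K-1-k)*dilation steps in the past.
        out = sum(taps[k * self.dilation] @ self.w[k] for k in range(K))
        self.queue.append(x_t)          # cache this frame for future steps
        return out
```

Appending to the bounded deque evicts the oldest entry automatically, so per-frame cost is constant regardless of how long the stream runs - the same property that makes the full model practical in real time.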

This work originated from a line of research within Facebook Reality Labs that was done in PyTorch. The model was converted to Caffe2 via ONNX for real-time processing, then optimized and integrated by Oculus for inclusion in Oculus Lipsync.

Making Realistic Facial Animations

Our new, more accurate model made clear that considerable effort is needed to produce high-quality viseme blendshapes with which to drive expressive avatar facial animation. Our artists and facial pose experts worked together on this problem and produced a new set of viseme reference images, which you can find via the engine-specific viseme reference links above. Using these reference images we created new facial animation blendshapes for our avatars and for the demo geometry, which you can download here.

In Summary

With this release, we give developers the power to drive both live avatars and non-playable characters with state-of-the-art lipsync technology. This was achieved through the joint work of research scientists, machine learning engineers, product managers, graphic artists, and facial pose experts at Oculus and Facebook Reality Labs. We have updated both the Unity plugin and the demo content we ship, with the goal of making Oculus Lipsync more powerful, expressive, and easier to use. We hope to see developers achieve great things with Oculus Lipsync!

Audio v14 Reference Guide

Oculus Lipsync Unity Integration Guide

Oculus Lipsync Unreal Integration Guide