Tech Note: OVRLipSync Updates - Laughter Detection and New Integrations

Oculus Developer Blog
Posted by Bryan Anenberg, Sam Johnson, and Colin Lea
October 17, 2018

Cracked a great joke but don't see anyone laughing? Don't worry - you're probably still funny. You're just using an old version of Oculus Lipsync! At Oculus, our mission is to enable people to have meaningful interactions regardless of physical distance. Eye contact, hand gestures, body pose, and facial expression are forms of nonverbal communication that add significance to social interactions by allowing us to convey emotion. Many avatar systems today lack emotion unless it is manually triggered by a user. To tackle this issue, we would like to present Oculus Lipsync 1.30.0. This release includes a beta version of laughter detection, enabling developers to drive laughter animations from input audio. Additionally, it adds DSP acceleration and opens OVRLipSync to more developers through Unreal and native support.

New Integrations and DSP Acceleration Support

We received a great deal of feedback requesting Unreal and native support for Oculus Lipsync. With Oculus Lipsync 1.30.0 we answer this feedback by introducing both an integration for Unreal Engine and C++ libraries that enable native integration. These integrations allow the development of expressive, lipsync-driven content across a wider range of platforms and applications. Additionally, the Unity integration has been further streamlined for ease of use and now has more developer-friendly support for pre-computed viseme generation. Documentation for all three of our integrations can be found in the revamped Oculus Lipsync Guide on our Developer Center.

Finally, we have enabled DSP support across all integrations, so that both Lipsync viseme prediction and laughter detection can be offloaded to the DSP on supported mobile platforms. Offloading computation to the DSP allows more audio streams to be processed on a single device, making it easier to build rich social applications.

Laughter Detection (Beta)

One of the goals of Oculus Lipsync is to enable expressive facial animation by letting audio drive the facial expression of a virtual avatar. Real-time audio-driven laughter detection brings us a step closer to enabling rich social presence and nonverbal communication in virtual reality.

The motivation for laughter detection originates from our desire to provide a platform for expressive social interaction in virtual reality. When play-testing in virtual reality with colleagues, we observed that adding emotion and expression to the avatars' faces enhanced the experience. As you might imagine, laughter was a very common mode of expression in these social environments. Automatically detecting laughter from audio allowed us to bring another layer of expressivity and fun to our virtual social experience.

To design a laughter detector, we needed to better understand what laughter is. Laughter is a universal, familiar, and important nonverbal audiovisual expression. It is also a diverse expression, ranging from harmonically rich, vowel-like sounds such as “ha ha ha” or “tee hee hee” to unvoiced exhalations through the nose or mouth, including sounds resembling grunts, pants, cackles, and snorts. This variety of laughs communicates different social cues and information such as attitude, emotion, intention, agreement, acceptance, joy, and even scorn or mockery. With laughter's numerous sounds and meanings in mind, we developed a beta version of laughter detection capable of recognizing a wide variety of laughs across a diverse population.

To enable laughter detection, we explored various deep learning architectures and eventually arrived at a lightweight version of the temporal convolutional network (TCN) introduced for high-quality Lipsync. The TCN powering laughter detection is very similar to the phoneme prediction TCN; however, it predicts a single floating-point value in the range of 0 to 1 representing the probability of laughter occurring at the current audio frame. As with the Lipsync model, the laughter detection model was trained with PyTorch, converted to Caffe2 with ONNX, and optimized for real-time processing using efficient caching similar to the Fast WaveNet Generation Algorithm.
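To give a feel for the caching idea, here is a toy sketch in Python. It is not the Oculus implementation: the class and function names are hypothetical, the signal is a single scalar channel, the weights are fixed at 0.5, and there are no nonlinearities. It only illustrates how a stack of causal dilated convolutions (kernel size 2) can process one frame at a time by keeping a small queue of past inputs per layer, instead of recomputing over the whole receptive field.

```python
from collections import deque


class CachedDilatedLayer:
    """Toy causal convolution with kernel size 2 and a given dilation.

    Hypothetical helper for illustration only. It computes
    y[t] = w_past * x[t - dilation] + w_now * x[t],
    reading x[t - dilation] from a cache rather than from a full history.
    """

    def __init__(self, dilation, w_past=0.5, w_now=0.5):
        self.w_past = w_past
        self.w_now = w_now
        # The queue holds the last `dilation` inputs. Starting with zeros
        # is equivalent to zero-padding the start of the stream.
        self.queue = deque([0.0] * dilation, maxlen=dilation)

    def step(self, x):
        past = self.queue[0]      # the input from `dilation` steps ago
        self.queue.append(x)      # maxlen drops the oldest entry
        return self.w_past * past + self.w_now * x


def make_stack(dilations=(1, 2, 4)):
    """Build a stack of cached layers with the given (assumed) dilations."""
    return [CachedDilatedLayer(d) for d in dilations]


def stream_step(layers, x):
    """Push one new input value through every layer in order."""
    for layer in layers:
        x = layer.step(x)
    return x


if __name__ == "__main__":
    layers = make_stack()
    for frame in [1.0] * 8:
        print(stream_step(layers, frame))
```

With these particular weights and dilations, each output is the mean of the last eight inputs, so the cost per new frame is one small update per layer regardless of how long the stream has been running.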

The output laughter probability can be directly used to drive a blendshape, or you can implement simple thresholds to trigger a laughter animation. We've introduced a new laughter shape into the engine integration demos which is driven directly by the laughter probability. The demo also visualizes the laughter probability by printing a simple text bar chart to the game view.
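As a rough illustration of the threshold approach, the Python sketch below smooths the per-frame probability and applies two thresholds (hysteresis) so the animation does not flicker when the value hovers near a single cutoff. Everything here is an assumption made for the example: the class name, the threshold values, the smoothing factor, and the bar format are hypothetical and are not part of the Oculus Lipsync API or demos.

```python
from dataclasses import dataclass


@dataclass
class LaughterTrigger:
    """Hypothetical helper: turns laughter probabilities into an on/off state."""

    on_threshold: float = 0.7    # assumed value: start laughing at or above this
    off_threshold: float = 0.4   # assumed value: stop laughing at or below this
    smoothing: float = 0.5       # exponential smoothing factor in (0, 1]
    weight: float = 0.0          # smoothed probability, usable as a blendshape weight
    laughing: bool = False

    def update(self, probability: float) -> bool:
        # Clamp defensively, then move the smoothed weight toward the new value.
        p = min(max(probability, 0.0), 1.0)
        self.weight += self.smoothing * (p - self.weight)
        if not self.laughing and self.weight >= self.on_threshold:
            self.laughing = True
        elif self.laughing and self.weight <= self.off_threshold:
            self.laughing = False
        return self.laughing


def text_bar(probability: float, width: int = 20) -> str:
    """Render a probability as a simple text bar (format is made up here)."""
    p = min(max(probability, 0.0), 1.0)
    filled = round(p * width)
    return "[" + "#" * filled + "-" * (width - filled) + "]"


if __name__ == "__main__":
    trigger = LaughterTrigger()
    for p in [0.1, 0.6, 0.9, 0.95, 0.5, 0.2, 0.1]:
        state = trigger.update(p)
        print(text_bar(trigger.weight), "laughing" if state else "idle")
```

The smoothed `weight` could drive a blendshape directly, while `laughing` could gate a discrete animation. The gap between the two thresholds is a tuning choice that trades responsiveness against stability.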

With this release of Oculus Lipsync 1.30.0, we take a step towards more meaningful social interaction in VR by adding a layer of liveliness, character, and emotion through real-time audio-driven laughter detection. On the technical side, this release refines and expands our integration support, and introduces opportunities to improve performance by offloading computation to the DSP.

This release was achieved by the joint work of research scientists, machine learning engineers, product management, character artists, and facial pose experts at Oculus and Facebook Reality Labs. We hope this technology brings joy and laughter to your audience!

For more discussion and review of the technology, please check out the recent presentations we have given on the subject: