Adding Hand Tracking To First Steps

Oculus Developer Blog
Posted by Sylvain Dubrofsky and Thomas Davies
April 28, 2021
Hand Tracking

Today, we introduced High Frequency Hand Tracking, a new tracking mode that allows for better gesture detection and lower latencies. Read more about this upgrade here and continue reading below to see how it was implemented in First Steps.

The original First Steps introduced new Quest users to the magic of VR, Touch controllers, and an untethered, 6DOF experience through immersive play that includes interacting with objects, shooting in a gun range, and dancing with a robot.

Because First Steps relies on hand-centric interaction and near-field object manipulation, we added hand tracking (the ability to use your hands in place of Touch controllers) to learn about the potential of hand tracking integration and the challenges of adding “real hands” to an existing app.

We had some initial goals for the project including:

  • Incorporate hand tracking in a simple approach that is close to original intent

  • Do not add, remove or significantly redesign any part of the experience

  • Ensure the app performs as well as the original

  • No new interactions

  • Have the ability to switch between hands and controllers at any point

  • No big changes between how hands work in both modes

  • Hands should look same size and visual style in both modes

To feel like First Steps, we identified that hands needed to have these rules:

  • Hands never hide

  • Only have one hand pose per object

  • Hand and finger collision only when hand is closed

  • Hands don’t drop objects when off camera

What does the Hands API provide?

The Hands API provided us with the information required to render a fully articulated representation of the user’s real life hands in VR without the use of controllers, including:

  • Hand position and orientation

  • Finger position and orientation

  • Tracking confidence (high or low)

  • Hand size (i.e., scale)

  • Pinch (finger + thumb) strength information

  • Pointer pose for UI raycasts

  • System gesture for opening the universal or application-defined menu

What did we have to fix when integrating?

Updating the hand rig

The original version of First Steps used an older hand model and rig driven by the original Oculus Avatars library. In order to use hand tracking, we had to update the hand models to use the new rig for hand tracking.

Hand poses

First Steps uses static hand poses to represent the hands when holding different objects. These assets were originally hand-authored by the animation team. We first attempted automatic retargeting of the assets, but initial retargeting did not compensate for the space switching between rigs and degraded hand visuals. To update the pose assets and have them look as good as the originals, we re-authored all the assets on the new rig.

The new hand rig scales to the player’s actual hand size. We feared this would make our hand poses look bad when users’ hands are significantly smaller or larger than the original models, but it turned out to work well.

We used these static hand poses when the player picks up or holds an object in the game. We made a modification to allow certain fingers to move independently of the static pose in some cases: for example, when holding a gun, the player’s hand is locked in a static pose except for the index finger, which continues to be tracked. This allows the “trigger pull” motion to be represented accurately.

The whole process of creating poses for the new hand rig took about a month.

If we were to start a project from scratch, we would have used the standard hand tracking rig from the beginning.

Player input

To dynamically switch between controller-tracked and camera-tracked hands, we abstracted the player input into an interface and then wrote both a controller-tracked and a camera-tracked implementation. This allows players to pick up or put down the controllers at any point in the experience, and the game will automatically switch between controller or camera tracked hands.
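As a rough illustration, the abstraction described above might look like the following Python sketch. This is not the First Steps code. The class and method names are hypothetical, and the rule of preferring controllers whenever they are available is an assumption made for the example.

```python
from abc import ABC, abstractmethod
from typing import Optional


class HandInput(ABC):
    """Hypothetical interface that gameplay code talks to."""

    @abstractmethod
    def is_available(self) -> bool:
        """True while this input source can drive the hand."""

    @abstractmethod
    def grab_strength(self) -> float:
        """0.0 (open) to 1.0 (fully grabbing)."""


class ControllerHandInput(HandInput):
    def __init__(self) -> None:
        self.connected = False
        self.grip_trigger = 0.0  # analog grip trigger value

    def is_available(self) -> bool:
        return self.connected

    def grab_strength(self) -> float:
        return self.grip_trigger


class TrackedHandInput(HandInput):
    def __init__(self) -> None:
        self.tracked = False
        self.pinch_strength = 0.0  # strongest thumb-to-finger pinch

    def is_available(self) -> bool:
        return self.tracked

    def grab_strength(self) -> float:
        return self.pinch_strength


def active_input(controller: HandInput, tracked: HandInput) -> Optional[HandInput]:
    """Assumed policy: use controllers while held, otherwise tracked hands."""
    if controller.is_available():
        return controller
    if tracked.is_available():
        return tracked
    return None
```

Gameplay code only ever calls `active_input(...)` and the `HandInput` methods, so it does not need to know which tracking mode is currently driving the hand.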

What general systems did we build?


Grab

Grab and drop are core to almost every VR experience. With hand tracking, there are many possible ways to implement grab and drop.

For this project, we wanted grab to work like it did in the original First Steps. There, the player attempts a grab by pulling the grip trigger. When this happens, the engine checks whether the hand colliders overlap with the object’s grab colliders and, if they do, a grab happens.

We decided to let people grab using the natural haptic feedback of their hand touching itself. This means we treat the player as attempting a grab when both of the following are true:

  • they are pinching between their thumb and any finger or their fist is closed.

  • the hand and item’s collision triggers overlap.

Detecting a closed hand

Right now the API only tells us “pinch between thumb and each finger.” We had to do our own analysis to determine a closed hand. The system we developed compares the player’s finger positions to a reference closed hand: we inspect the joint rotations on each finger to determine whether the fist is closed or open. The thresholds for a closed hand and an open hand are different in order to prevent accidental drops. For some objects, such as the guns and the blimp controller, we needed to allow the object to be grabbed with a fist and then have individual fingers lifted, so we developed a system that lets us mask out any finger from the analysis once the hand is closed.
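The closed-hand check and the grab condition can be sketched roughly as follows. This is an illustrative Python sketch, not the shipped implementation. It assumes each finger's joint rotations have already been reduced to a single "curl" value between 0.0 (straight) and 1.0 (matching the reference fist), and the two threshold values are invented for the example.

```python
FINGERS = ("index", "middle", "ring", "pinky")
CLOSE_THRESHOLD = 0.75  # assumed: mean curl above this closes the fist
OPEN_THRESHOLD = 0.45   # assumed: mean curl below this opens it again


class FistDetector:
    def __init__(self):
        self.closed = False

    def update(self, curl, masked=()):
        """curl maps finger name -> 0.0 (straight) .. 1.0 (reference fist).

        Fingers in `masked` are ignored, but only while the fist is already
        closed, so a lifted trigger finger cannot cause a drop.
        """
        ignore = set(masked) if self.closed else set()
        used = [curl[f] for f in FINGERS if f not in ignore]
        mean = sum(used) / len(used) if used else 1.0
        if self.closed:
            # A lower threshold to open than to close prevents flicker.
            if mean < OPEN_THRESHOLD:
                self.closed = False
        elif mean > CLOSE_THRESHOLD:
            self.closed = True
        return self.closed


def is_grab_attempt(pinching, fist_closed, colliders_overlap):
    """Grab when the hand is pinching or closed AND it overlaps the item."""
    return (pinching or fist_closed) and colliders_overlap
```

The gap between the two thresholds is what keeps a slightly relaxed grip from dropping the object.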

We hope to bring this work to the API in the future.

Grabbing moving objects

We found it tricky to grab objects in motion. To help with this, we grow the object’s grabbable collision along the direction of its velocity when it is in motion. We extend the object’s collision forward and backward along its trajectory, by the distance it will travel over one second. This made it easier to throw and catch objects in the air.
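One way to picture this is to treat the grab volume as a sphere swept along the velocity, which gives a capsule. The Python sketch below is an assumption about the shape, not a description of the actual colliders used in First Steps; the function names are hypothetical.

```python
import math

LOOKAHEAD_S = 1.0  # the post extends the volume along one second of travel


def closest_point_on_segment(p, a, b):
    """Closest point to p on the segment from a to b."""
    ab = tuple(bi - ai for ai, bi in zip(a, b))
    denom = sum(c * c for c in ab)
    if denom == 0.0:
        return a  # object is not moving, the segment is a single point
    t = sum((pi - ai) * c for pi, ai, c in zip(p, a, ab)) / denom
    t = max(0.0, min(1.0, t))
    return tuple(ai + t * c for ai, c in zip(a, ab))


def hand_in_grab_volume(hand, center, velocity, radius):
    """True if the hand is inside the sphere swept backward and forward."""
    back = tuple(c - v * LOOKAHEAD_S for c, v in zip(center, velocity))
    front = tuple(c + v * LOOKAHEAD_S for c, v in zip(center, velocity))
    nearest = closest_point_on_segment(hand, back, front)
    return math.dist(hand, nearest) <= radius
```

For a stationary object the capsule collapses back to the original sphere, so slow interactions are unaffected.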

Grab audio

Audio is also used here to heighten immersion and increase player confidence while grabbing and dropping objects. Without grab audio, we found users had more false grab attempts. The grab-specific sounds on all objects help users know when they have successfully picked up an object. A drop sound was also added for when users have let go of an object.


Drop

We first implemented drop as just the inverse of grab, which led to many false drops. Through experimentation, we added these rules for dropping any object:

  • Only drop when confidence is “high”

  • Only drop when player isn’t pinching/grabbing for 75 milliseconds in a row

  • When tracking is regained, wait 15 milliseconds before testing for drop

  • The thresholds for “drop” and “grab” are different
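The timing rules above amount to a small debounce filter. The following Python sketch shows one possible arrangement; the class name is hypothetical, and resetting the release timer whenever confidence drops is an assumption rather than something the rules state.

```python
RELEASE_HOLD_S = 0.075    # not pinching/grabbing for 75 ms in a row
REACQUIRE_WAIT_S = 0.015  # wait 15 ms after tracking is regained


class DropFilter:
    def __init__(self):
        self.open_time = 0.0       # how long the hand has been open
        self.high_conf_time = 0.0  # how long tracking has been high confidence

    def update(self, dt, grabbing, high_confidence):
        """Return True on a frame where the held object should be dropped."""
        if not high_confidence:
            # Never drop on low-confidence data; restart both timers.
            self.high_conf_time = 0.0
            self.open_time = 0.0
            return False
        self.high_conf_time += dt
        if self.high_conf_time < REACQUIRE_WAIT_S:
            return False
        self.open_time = 0.0 if grabbing else self.open_time + dt
        return self.open_time >= RELEASE_HOLD_S
```

Calling `update` once per frame with the frame's delta time is enough; a single frame of "grabbing" resets the release timer.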


Throwing

We found that throwing objects was difficult with hand tracking. When players throw objects, they tend to move their hands quickly and in a wide range of motion. That would cause their hands to often leave the tracking bounds of the headset.

To improve this, we built a system that would construct throwing vectors from data that included the player's last known good hand position, speed, and the gaze direction. This improved throwing interactions, but there's still work that could be done to further iterate on this system.

Buffering Hand Transforms

Debug visualization of buffered transform component

We introduced a system to keep a buffer of all hand transforms and velocities for a fixed amount of time (currently one second). This is a lot more data than we need, but it’s useful to keep at least a whole second of data for debugging.

These transforms are used for two purposes:

  1. To determine whether we have enough data to confidently construct a throw vector when the player’s hand opens.

  2. To step back in time by a small amount when constructing the throw vector to account for hand tracking latency. We currently step back about 40ms, which amounts to about 3 frames.
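A buffer like this can be sketched as a time-stamped queue. The Python below is a minimal illustration with hypothetical names; it stores only position and velocity, and the `max_gap` parameter used to decide what counts as "continuous" data is an assumption. One instance could hold high-confidence samples only and another could hold all samples.

```python
from collections import deque

BUFFER_S = 1.0     # keep one second of samples
LATENCY_S = 0.040  # step back about 40 ms when sampling for a throw


class TransformBuffer:
    def __init__(self):
        self.samples = deque()  # (time, position, velocity), oldest first

    def push(self, time, position, velocity):
        self.samples.append((time, position, velocity))
        # Discard anything older than the buffer length.
        while self.samples and time - self.samples[0][0] > BUFFER_S:
            self.samples.popleft()

    def sample_at(self, time):
        """Newest sample recorded at or before `time`, or None."""
        for sample in reversed(self.samples):
            if sample[0] <= time:
                return sample
        return None

    def is_continuous(self, now, window, max_gap):
        """True if samples cover [now - window, now] with no gap > max_gap."""
        prev = now
        for t, _, _ in reversed(self.samples):
            if prev - t > max_gap:
                return False
            if t <= now - window:
                return True
            prev = t
        return False
```

With this shape, the throw code would call `is_continuous(now, 0.080, max_gap)` to decide whether the data is usable and `sample_at(now - LATENCY_S)` to read the velocity.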

Constructing Throw Vectors

When we detect that the player has made a throw, we construct a vector for the throw which is then applied to the thrown object as a velocity. There are currently three ways that we do this:

High Confidence Throwing

When there is good data in the High Confidence Transform Buffer (we currently check whether there is continuous data each frame for the previous 80ms), then we simply use the velocity from the transform buffer 40ms ago. This produces good throws.

Low Confidence Throwing

If there is not enough data in the High Confidence Transform Buffer, we use a similar approach but this time get the data from the All Transforms Buffer. If this buffer reports a velocity with a magnitude greater than a certain threshold (currently 50 cm/s) and in the direction of the head forward (within 180 degrees), then we use this data to produce a throw vector. This is less robust but generally provides good results.

Very Low Confidence Throwing

When neither buffer has enough data to provide a throw vector, we use a different approach. This often happens when making fast overhand throws. In this situation, we construct the throw vector as a weighted average of two other vectors:

  • The normalized vector from the point where the hand most recently left the tracking bounds to the hand’s current position.

  • The player’s normalized head forward.

Currently we weight these two vectors at 60% for the first and 40% for the second. This provides the direction of the throw. The magnitude is the length of the first vector before normalization (tracking loss point to current hand position) multiplied by some factor (currently 0.4). This provides a throw that feels good when making certain gestures, such as overhand throws from above the shoulder.
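Putting the three cases together, the selection logic might be sketched like this in Python. The function names are hypothetical, and "within 180 degrees" is interpreted here as "in the forward hemisphere of the head", which is an assumption. The caller is expected to pass `None` for a buffer that has no usable velocity.

```python
import math

MIN_SPEED = 0.5         # 50 cm/s threshold for the low confidence case
TRAVEL_WEIGHT = 0.6     # weight of the loss-point-to-hand direction
HEAD_WEIGHT = 0.4       # weight of the head forward direction
MAGNITUDE_FACTOR = 0.4  # scales travel distance into throw speed


def length(a):
    return math.sqrt(sum(x * x for x in a))


def normalize(a):
    n = length(a)
    return tuple(x / n for x in a) if n > 0.0 else a


def fallback_throw(loss_point, hand_pos, head_forward):
    """Very low confidence: blend travel direction with head forward."""
    travel = tuple(h - p for h, p in zip(hand_pos, loss_point))
    blended = tuple(
        TRAVEL_WEIGHT * t + HEAD_WEIGHT * f
        for t, f in zip(normalize(travel), normalize(head_forward))
    )
    speed = length(travel) * MAGNITUDE_FACTOR
    return tuple(d * speed for d in normalize(blended))


def build_throw(high_vel, all_vel, loss_point, hand_pos, head_forward):
    if high_vel is not None:  # high confidence buffer had good data
        return high_vel
    if all_vel is not None:
        forward = sum(v * f for v, f in zip(all_vel, head_forward)) > 0.0
        if length(all_vel) > MIN_SPEED and forward:
            return all_vel
    return fallback_throw(loss_point, hand_pos, head_forward)
```

The blended direction is re-normalized before scaling so that the 60/40 weighting affects only the direction, not the speed.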

Overcoming “Dead Drops”

One issue we ran into is that often the user would make a throw and the object would drop to the ground with no horizontal velocity. This generally happened because the object would “stick” to the player’s hand for a fraction of a second after releasing the throw, due to self-occlusion on the fingers preventing quick detection of the release gesture. We attempted to mitigate this in two ways:

  1. When making a throwing motion, we reduce the release thresholds on the grab axes in order to make it easier to release an object. Throw motion detection uses a series of heuristics:

    • How fast the hand is moving.

    • How much of the hand movement is in the direction of the head’s facing direction.

    • How close the hand position is to the head forward direction.

    • How far the hand is from the head.

    This approach is optimized for throws that are in the direction that the player is facing. We found that this approach of reducing the threshold wasn’t highly effective; when the fingers are occluded, we get very little tracking data from them anyway.

  2. When releasing an object, compare a series of throw vectors (constructed using the methods described previously) from the last second, and select the best one. The selection of the “best” throw vector uses similar heuristics to those described in (1.), but with an additional positive weight added to vectors that are more recent. This was very effective in reducing the “dead drops” that were occurring before. It has the effect that occasional throws will “stick” to the hand for a few frames before flying off in the throw direction, but this is preferable to the previous behavior where they would simply drop to the floor.
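The second mitigation is essentially a scoring pass over recent candidates. The Python sketch below uses only two of the heuristics (speed and alignment with the head forward direction) plus the recency bonus. The weights and names are invented for illustration and are not the values used in First Steps.

```python
import math

W_SPEED, W_ALIGN, W_RECENT = 1.0, 1.0, 0.5  # hypothetical weights
WINDOW_S = 1.0                              # consider the last second


def throw_score(time, velocity, head_forward, now):
    speed = math.sqrt(sum(c * c for c in velocity))
    head_len = math.sqrt(sum(c * c for c in head_forward))
    align = 0.0
    if speed > 0.0 and head_len > 0.0:
        # Cosine of the angle between the throw and the head forward.
        align = sum(a * b for a, b in zip(velocity, head_forward)) / (speed * head_len)
    recency = max(0.0, 1.0 - (now - time) / WINDOW_S)
    return W_SPEED * speed + W_ALIGN * align + W_RECENT * recency


def best_throw(candidates, head_forward, now):
    """candidates is a list of (time, velocity); returns a velocity or None."""
    recent = [c for c in candidates if now - c[0] <= WINDOW_S]
    if not recent:
        return None
    best = max(recent, key=lambda c: throw_score(c[0], c[1], head_forward, now))
    return best[1]
```

A strong forward vector from a few frames ago outscores a near-zero vector from the current frame, which is the behavior that replaces a dead drop with a slightly late throw.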

Hand tracking accuracy mitigation

Before filtering
After filtering

After the initial integration of hand tracking, we realized that the accuracy of the position and orientation of the hand quickly degraded in certain conditions (such as poor lighting and hand-over-hand occlusion). This quality loss was both fairly common and extremely detrimental to the gameplay experience.

The Algorithm

With all that in mind, and after some experimentation, we arrived at this filtering system. It is made up of these steps:

Step 1: Stillness Freezing

When the data hasn't noticeably changed from the last frame, just keep the previous frame's transform.

Step 2: Jitter Smoothing

Next, if the new data is fairly close to the previous frame, a smoothing curve is applied. If the hand transform has only changed a small amount since the last frame, interpolate towards it from the previous frame by a small percentage. This does increase latency (which overall we are trying to avoid), but since it only applies to small changes, it doesn't appear to negatively affect the experience.
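Steps 1 and 2 can be illustrated for position alone with a few lines of Python. The thresholds and the fixed blend fraction are assumptions chosen for the example; the post describes a smoothing curve, which this sketch simplifies to a constant.

```python
import math

STILL_EPS = 0.001    # assumed: under 1 mm counts as "hasn't changed"
JITTER_EPS = 0.010   # assumed: under 1 cm counts as a small change
SMOOTH_ALPHA = 0.3   # assumed: fraction to move toward the new data


def filter_position(previous, measured):
    """Apply stillness freezing, then jitter smoothing, to one position."""
    delta = math.dist(previous, measured)
    if delta < STILL_EPS:
        return previous  # Step 1: keep last frame's value
    if delta < JITTER_EPS:
        # Step 2: move only part of the way toward the new data
        return tuple(p + SMOOTH_ALPHA * (m - p) for p, m in zip(previous, measured))
    return measured      # large changes pass through with no added latency
```

Because large movements bypass the smoothing entirely, the added latency is confined to motions small enough that it is hard to notice.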

Step 3: Data Quality

The tracking system reports confidence information (simply as "low" or "high") along with the hand transform information. When the data is high confidence, it is generally accurate (and any imprecision is handled by the jitter smoothing). That said, when the data is low confidence, it is often still accurate.

In order to determine if the data is high or low quality, we follow this algorithm:

  1. If the tracking system says the hand position is invalid, we mark the data as low quality.

  2. If the tracking system says the tracking confidence is good, we mark the data as high quality.

  3. Otherwise, we mark the data as high or low quality based on these heuristics:

    1. Displacement of the hand this frame is within a max distance.

    2. Velocity of the hand this frame is within a max speed.

    3. Acceleration of the hand this frame is within a max acceleration.

    4. Angular velocity of the hand this frame is within a max speed.

  Having all of those heuristics available allowed us to find settings that accurately classified most low quality cases without too many false positives.
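The classification reads naturally as a short decision function. In the Python sketch below, the field names and all limit values are hypothetical; only the order of the checks follows the list above.

```python
from dataclasses import dataclass


@dataclass
class Limits:
    # Hypothetical tuning values, not the ones used in First Steps.
    max_distance: float = 0.05        # metres per frame
    max_speed: float = 3.0            # metres per second
    max_accel: float = 50.0           # metres per second squared
    max_angular_speed: float = 20.0   # radians per second


@dataclass
class HandSample:
    position_valid: bool
    high_confidence: bool
    displacement: float
    speed: float
    accel: float
    angular_speed: float


def is_high_quality(sample: HandSample, limits: Limits = Limits()) -> bool:
    if not sample.position_valid:
        return False  # rule 1: invalid position is always low quality
    if sample.high_confidence:
        return True   # rule 2: trust the tracker's high confidence
    # Rule 3: low confidence data is still accepted if it looks plausible.
    return (
        sample.displacement <= limits.max_distance
        and sample.speed <= limits.max_speed
        and sample.accel <= limits.max_accel
        and sample.angular_speed <= limits.max_angular_speed
    )
```

Keeping the limits in one structure makes it straightforward to tune them against recorded sessions.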

Step 4: Extrapolation

Initially, we attempted to smoothly interpolate towards the low quality data so that it would appear less inconsistent. However, this ended up being even more frustrating than the incorrect data.

Instead, we simply extrapolate the hand’s movement from the last available high quality data. The high quality velocity and angular velocity are integrated to continue the hand’s movement. Each frame of low quality data also applies a damping scalar to the high quality velocities in order to slow the hands to a stop when there are several low quality frames in a row.
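For linear motion, the extrapolation with damping might look like the Python sketch below. The damping value and the class name are assumptions, and angular velocity is left out for brevity; it would be integrated and damped in the same way.

```python
DAMPING = 0.9  # assumed per-frame scalar applied during low quality frames


class Extrapolator:
    def __init__(self, position, velocity):
        self.position = position
        self.velocity = velocity

    def accept(self, position, velocity):
        """Store the latest high quality sample."""
        self.position = position
        self.velocity = velocity

    def extrapolate(self, dt):
        """Advance one low quality frame: damp the velocity, then integrate."""
        self.velocity = tuple(v * DAMPING for v in self.velocity)
        self.position = tuple(
            p + v * dt for p, v in zip(self.position, self.velocity)
        )
        return self.position
```

With a damping scalar below 1.0 the velocity decays geometrically, so the hand coasts briefly and then settles if the low quality frames continue.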

Step 5: Interpolation

The previous steps work well, but there is a noticeable (and distracting) “snap” as the hands teleport from the extrapolation to the high quality data. To eliminate this, we simply interpolate from the extrapolation to the high quality data.

This is done very quickly—in a tenth of a second. While this does increase latency during that time frame, it does not feel more latent. That is because every frame it’s interpolating towards the latest data, so it’s interpolating towards your hands even as you move them.
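The blend back to live data can be sketched as a short linear interpolation whose target is refreshed every frame. This Python sketch uses hypothetical names and a linear blend, which is an assumption; the post only states that the blend takes a tenth of a second.

```python
BLEND_S = 0.1  # the blend completes in a tenth of a second


class SnapBlender:
    def __init__(self):
        self.start_pos = None
        self.elapsed = BLEND_S  # not blending until start() is called

    def start(self, extrapolated_pos):
        """Call when high quality data returns after extrapolation."""
        self.start_pos = extrapolated_pos
        self.elapsed = 0.0

    def update(self, dt, latest_pos):
        """Blend from the extrapolated pose toward the newest tracked pose."""
        if self.elapsed >= BLEND_S:
            return latest_pos
        self.elapsed = min(BLEND_S, self.elapsed + dt)
        t = self.elapsed / BLEND_S
        return tuple(s + t * (l - s) for s, l in zip(self.start_pos, latest_pos))
```

Because `latest_pos` is passed in fresh each frame, the blend chases the moving hand rather than the position it had when tracking returned.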

Finger jitter

The fingers after smoothing

For fingers, we filter the data with an approach similar to the one we use for the hands, but with angular velocity as the only heuristic. The hand tracking API only exposes confidence per-finger (not per-bone), so we apply each finger’s confidence value to all of that finger’s bones.

The most important aspect here is the smoothing. The bones of the hand have jitter, even with high confidence data.

Debug menu

We added a debug menu that the user can bring up using the “system gesture” with the off-hand when running the game. This allowed us to enable and disable features such as hand transform filtering and throw assist, which allowed for quick A/B testing of these features.

What interactions worked best?

Paper planes

Paper planes are one of the more fun objects to throw with tracked hands. Though they still suffer from some of the shortcomings we encountered with thrown objects (sometimes there’s a lag between the throw gesture and the throw), we found that the gesture that players tend to make (similar to throwing a dart) works well with the tracking bounds of the headset.

The flight path of the planes is already simulated in script and somewhat random, so if we don't have enough data to produce a very realistic throw arc, it's not as noticeable.

Two-handed gun

The two-handed gun was one of the best surprises. It worked much better than the single guns right from the start. The trigger to auto-fire worked reliably and the aiming had better precision.

We were fortunate as this gun was already designed to work around the limitations of controller tracking. For instance, the model is designed to prevent the player from utilizing a gun’s iron sights, which may cause a Quest’s inside-out tracking of the Touch controllers to fail. The design also affords a good view of both hands for the cameras. Most players hold the gun around chest level, which is the sweet spot to see all the fingers and avoid self-occlusion. Because the aim pivots between two hands, it depends only on their positions and not their rotations, which makes up for the tracking inconsistencies that appear when aiming with the position and rotation of a single hand.

The lesson here is to design your interactions in a way that ensures players keep their hands out in front of them, at a height and length that is tracked well.

Physics (Button/TetherBall)

We found that both the button and tetherball were pleasing to use. The key here is to have objects react as expected. The button is a dynamic animation that reacts one-to-one with the motion as you press and release. The tetherball reacts through physics as you would expect. If the button or tetherball played a canned animation on touch, the sense of presence would be lost.

There are limitations with both the button and tetherball when the player’s arm is fully extended (and therefore self-occluding) or moving fast. Both of these situations can be mitigated somewhat by object placement.

How did we optimize performance?

We were able to take advantage of several new tools and features in order to improve the performance of the app. We used profiling tools, including RenderDoc for Oculus, to identify and fix performance issues.

Turn off occlusion culling

One major issue that we found was that occlusion culling was enabled, which reduced performance of the app. Disabling this occlusion culling improved the frame rate significantly.

Enable front-to-back rendering

Another major optimization was to enable front-to-back rendering, which is particularly impactful on Quest 2’s high-resolution display (even with foveated rendering enabled). In Unreal Engine, you can enable this by adding the following setting to your DefaultEngine.ini:



Enable late latching/phase sync

We enabled Phase Sync and Late Latching in the project. Together, these features reduced the display latency of the app from about 40ms to about 35ms on average. When we first integrated them, we saw the average frame rate drop by a few frames per second. Once we had optimized the app to run at a steady 90fps with Phase Sync and Late Latching off, re-enabling them did not result in a performance hit.

Use high frequency hand tracking

Early in development, we tried out a new option to increase the hand tracking camera rate. With high frequency tracking on, the CPU and GPU resources available to the application are reduced. This meant that the team had to spend extra time optimizing the application in order to have it run at full framerate at the lower clock speeds.

The improvement to perceived latency, hand gesture recognition and tracking quality was clear to our team. We found that the combination of high frequency hand tracking and a steady 90 FPS gave the best experience and was worth the CPU/GPU tradeoff.


We’re looking forward to seeing how improvements in hand tracking can create a more immersive VR experience in the future. As you explore its potential or implement it in your apps, please let us know your thoughts or suggestions in the comments or developer forum.