GPU Profiling with ovrgpuprofiler

ovrgpuprofiler is a low-level CLI tool for reading real-time GPU metrics and performing render stage tracing with minimal setup. It is included with the Oculus Quest runtime and lives on the device itself.

Use ovrgpuprofiler to Retrieve Real-time Metrics

It is recommended that you open a shell via ADB on a connected Oculus Quest when using ovrgpuprofiler. If you are not using a shell, prefix each command in this topic with adb shell, as in adb shell <command>.

Get Metrics List

To list all supported real-time metrics and their ID number, enter the following from the command line when an Oculus Quest is connected via ADB:

ovrgpuprofiler -m

The beginning of the output for this command looks like the following:

    47 metrics supported:
    1       Clocks / Second
    2       GPU % Bus Busy
    3       % Vertex Fetch Stall
    4       % Texture Fetch Stall
    5       L1 Texture Cache Miss Per Pixel
    6       % Texture L1 Miss
    7       % Texture L2 Miss
    8       % Stalled on System Memory
    9       Pre-clipped Polygons/Second
    10      % Prims Trivially Rejected
    11      % Prims Clipped

As an alternative, ovrgpuprofiler -m -v can be used to provide the same list with more verbose descriptions for each metric.
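Because metrics are requested by ID, it can be handy to look an ID up by name instead of scanning the list. The following is a minimal sketch, assuming the listing keeps the layout shown above (an ID, whitespace, then the metric name). The helper name metric_id is hypothetical and is not part of ovrgpuprofiler.

```shell
# metric_id NAME
# Hypothetical helper: reads `ovrgpuprofiler -m` output on stdin and prints
# the ID of the metric whose name matches NAME exactly.
# Exits with status 1 if no metric matches.
metric_id() {
    awk -v want="$1" '
        $1 ~ /^[0-9]+$/ {
            id = $1
            name = $0
            sub(/^[ \t]*[0-9]+[ \t]+/, "", name)   # drop the leading ID column
            sub(/[ \t\r]+$/, "", name)             # drop trailing whitespace / CR
            if (name == want) { print id; found = 1; exit }
        }
        END { exit (found ? 0 : 1) }
    '
}
```

From a host machine this could be used as adb shell ovrgpuprofiler -m | metric_id "% Texture Fetch Stall", which prints 4 for the listing shown above.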

Get Metric Data

To retrieve data for a metric, the command takes the following format:

`ovrgpuprofiler -r<metric ID number>`

For example, to retrieve the metric % Texture Fetch Stall (ID number 4), enter ovrgpuprofiler -r4. Data is printed to the console every second until you press Ctrl-C.

Get Data for Multiple Metrics

You can also request multiple metrics at once by separating ID numbers with commas in a string, such as ovrgpuprofiler -r"4,5,6". The following shows output from ovrgpuprofiler -r"4,5,6":

$ ovrgpuprofiler -r"4,5,6"
% Texture Fetch Stall                      :           2.449
L1 Texture Cache Miss Per Pixel            :           0.124
% Texture L1 Miss                          :          20.338

% Texture Fetch Stall                      :           2.369
L1 Texture Cache Miss Per Pixel            :           0.122
% Texture L1 Miss                          :          20.130

% Texture Fetch Stall                      :           2.580
L1 Texture Cache Miss Per Pixel            :           0.127
% Texture L1 Miss

Note: It is not recommended to request more than 30 real-time metrics at the same time.
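Since a new sample is printed every second, averaging over a capture can make the numbers easier to compare between builds. The following is a sketch only, assuming each sample line has the "name : value" layout shown above and that metric names contain no colon. The helper name average_metrics is hypothetical.

```shell
# average_metrics
# Hypothetical helper: reads saved `ovrgpuprofiler -r"..."` output on stdin
# and prints "<metric name><TAB><mean value>" for each metric, in the order
# the metrics first appear. Lines without a numeric value are skipped.
average_metrics() {
    awk -F':' '
        NF >= 2 {
            name = $1
            gsub(/^[ \t]+|[ \t]+$/, "", name)
            value = $NF
            gsub(/[ \t\r]/, "", value)
            if (value !~ /^-?[0-9]+(\.[0-9]+)?$/) next   # incomplete or non-sample line
            if (!(name in count)) order[++n] = name
            sum[name] += value
            count[name]++
        }
        END {
            for (i = 1; i <= n; i++)
                printf "%s\t%.3f\n", order[i], sum[order[i]] / count[order[i]]
        }
    '
}
```

Pressing Ctrl-C ends the whole pipeline before the averages are printed, so one option is to save the output to a file first and then run average_metrics < metrics.txt (the file name is just an example).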

Use ovrgpuprofiler for Render Stage Tracing

ovrgpuprofiler supports render stage GPU tracing on a tile-per-tile level. Unlike direct-mode GPUs, which execute draw calls sequentially, tile-based renderers batch the draw calls for an entire surface. The surface is then split into tiles that are processed sequentially, and each tile executes all the draw calls that touch it. ovrgpuprofiler can tell you how much time was spent in each rendering stage for each surface rendered during a trace’s duration.

Prepare for Render Stage Tracing

Tracing on a tile-per-tile level requires the GPU context for the app being traced to be put into detailed GPU profiling mode. To set the OS to start subsequent apps in detailed GPU profiling mode, enter the following command:

ovrgpuprofiler -e

If an app is running when the command is entered, it must be restarted for its GPU context to be changed to detailed GPU profiling mode.

ovrgpuprofiler -i shows if detailed GPU profiling mode is enabled, and ovrgpuprofiler -d disables it.

In addition, apps being used with ovrgpuprofiler must have the <uses-permission android:name="android.permission.INTERNET" /> permission in their manifest.

Note: Detailed GPU profiling incurs an approximately 10% overhead in GPU rendering times. Keep this overhead in mind when reading trace output.

Execute a Trace

To execute a 100 ms trace on the currently running app, enter the following:

ovrgpuprofiler -t

Trace length can be specified in seconds by including a number with the -t argument. For example, ovrgpuprofiler -t1.2 would run a trace for 1.2 seconds.

The output of the trace is printed to the console, listing the surfaces rendered during the trace along with render stage information.
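The preparation and trace steps can be chained from a host machine. The following is a rough sketch, not an official workflow. The helper name trace_app, the package name in the usage note, and the warm-up delay are assumptions, and relaunching through monkey assumes the app declares a launcher activity. Only the ovrgpuprofiler flags come from this topic.

```shell
# ADB can be overridden; for a dry run that only prints the commands, set ADB=echo.
ADB="${ADB:-adb}"

# trace_app PACKAGE [SECONDS] [WARMUP]
# Hypothetical helper: enables detailed GPU profiling mode, restarts the app
# so its GPU context picks the mode up, runs a trace, then disables the mode.
trace_app() {
    pkg="$1"
    secs="${2:-}"        # empty means the default trace length
    warmup="${3:-10}"    # seconds to wait for the app to reach the scene of interest

    $ADB shell ovrgpuprofiler -e              # enable detailed GPU profiling mode
    $ADB shell am force-stop "$pkg"           # stop the app if it is running
    $ADB shell monkey -p "$pkg" -c android.intent.category.LAUNCHER 1   # relaunch it
    sleep "$warmup"
    $ADB shell ovrgpuprofiler "-t$secs"       # run the trace
    $ADB shell ovrgpuprofiler -d              # disable detailed GPU profiling mode
}
```

For example, trace_app com.example.myapp 1.2 would run a 1.2 second trace on a hypothetical package named com.example.myapp after the default warm-up.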

Reading a Trace

Lines from the trace output look like the following:

    Surface 1    | 1216x1344 | color 32bit, depth 24bit, stencil 0 bit, MSAA 4 | 60  128x224 bins | 5.08 ms | 130 stages :  Binning : 0.623ms Render : 1.877ms StoreColor : 0.309ms Blit : 0.002ms Preempt : 1.286ms

This shows that Surface 1 has a resolution of 1216x1344, 32-bit color, 24-bit depth, no stencil, and uses 4x MSAA. The surface was broken down into 60 tiles/bins with a size of 128x224, and it took 5.08 ms to render in total. There were 130 render stage executions in the process, and the remaining data states how much time was spent in each render stage. Note that not every render stage will be present for each surface.
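The bin count in the sample line is consistent with a simple grid that covers the surface, with partial bins rounded up at the edges: 1216 / 128 rounds up to 10 columns and 1344 / 224 gives 6 rows, so 10 x 6 = 60 bins. The following sketch reproduces that arithmetic. Whether the driver always lays bins out this way is an assumption, and the helper name bin_count is hypothetical.

```shell
# bin_count WIDTH HEIGHT BIN_WIDTH BIN_HEIGHT
# Hypothetical helper: number of bins needed to cover a surface, rounding up
# partial bins at the right and bottom edges.
bin_count() {
    cols=$(( ($1 + $3 - 1) / $3 ))   # ceil(width / bin width)
    rows=$(( ($2 + $4 - 1) / $4 ))   # ceil(height / bin height)
    echo $(( cols * rows ))
}
```

Running bin_count 1216 1344 128 224 prints 60, matching the sample line.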

On Oculus Quest, ovrgpuprofiler will output one surface line per slice for multiview apps. This means that there will be one surface for each eye. You must add the render times of two eye surfaces for the total frame time.
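Adding the two eye surfaces can be scripted. The following is a minimal sketch, assuming surface lines keep the "|"-separated layout shown above, with the total render time in the fifth column. The helper name sum_surface_ms is hypothetical, and choosing which two lines belong to the same frame is left to the reader.

```shell
# sum_surface_ms
# Hypothetical helper: reads trace surface lines on stdin and prints the sum
# of their total render times in milliseconds.
sum_surface_ms() {
    awk -F'|' '
        /^[ \t]*Surface/ && NF >= 5 {
            ms = $5
            gsub(/[^0-9.]/, "", ms)   # keep only the number from " 5.08 ms "
            total += ms
        }
        END { printf "%.2f\n", total }
    '
}
```

Piping in one line reporting 5.08 ms and another reporting 4.90 ms (an invented second value, for illustration) prints 9.98.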

On Oculus Quest 2, however, ovrgpuprofiler will output one surface line for both views of the surface, due to how the Adreno 650 GPU processes multiview commands (Hardware Multiview). On Quest 2, bins of multiview surfaces are shared between both views, so an entry such as

135 96x176 bins

in a trace should be interpreted as

135 96x176x2 bins

Render stages that appear include the following:

  • Binning - The Oculus Quest’s GPU uses a tiled architecture, meaning that all draw calls for a frame are executed in two stages. The first stage is the binning phase, where triangle vertex positions for all draw calls are calculated and assigned to bins that correspond to a partition of the drawing surface.
  • Render - This is the second stage of the draw call that began with binning. One chunk of this represents the total cost of all vertex and fragment operations for one bin. A simplified version of the vertex shaders is executed during binning for the purpose of finding a triangle’s position. The full version of the vertex shaders is re-executed during this stage to compute the interpolants used by the fragment shader.
  • LoadColor - Loads the color from slow memory into fast memory. This can happen when starting to render into a surface without clearing it.
  • StoreColor - After all pixel and fragment operations for a bin have finished executing, the calculated color value is copied from fast memory (dedicated for the bin’s rendering operations) to slow memory.
  • Blit - Copying between slow memory regions. This can happen for various operations, such as mipmap generation and when clearing a surface without rendering anything.
  • Preempt - The compositor is an OS-level service that executes at regular intervals to present the image submitted by the application to the screen. In order to deliver the image at the proper cadence, the GPU will preempt the application’s workload so that the compositor can complete its work on time.
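
To compare the stages listed above across surfaces, the per-stage times can be split out of each trace line. The following is a sketch under the assumption that trace lines keep the layout of the sample shown earlier, with "Name : <time>ms" pairs after the "stages :" marker. The helper name stage_times is hypothetical.

```shell
# stage_times
# Hypothetical helper: reads trace surface lines on stdin and prints one
# "<surface><TAB><stage><TAB><milliseconds>" row per render stage.
stage_times() {
    awk -F'|' '
        /^[ \t]*Surface/ && NF >= 6 {
            surface = $1
            gsub(/^[ \t]+|[ \t\r]+$/, "", surface)
            stages = $6
            sub(/^[^:]*:/, "", stages)               # drop the "130 stages :" prefix
            gsub(/^[ \t]+|[ \t\r]+$/, "", stages)
            n = split(stages, tok, /[ \t:]+/)        # Name, time, Name, time, ...
            for (i = 1; i + 1 <= n; i += 2) {
                ms = tok[i + 1]
                sub(/ms$/, "", ms)                   # strip the unit suffix
                printf "%s\t%s\t%s\n", surface, tok[i], ms
            }
        }
    '
}
```

The tab-separated rows can then be sorted or filtered with standard tools, for example to find the surfaces with the largest Preempt time.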

Command-Line Argument Reference

The following are the recommended command-line arguments available for use with ovrgpuprofiler:

  • -r/--realtime - Prints the value of the real-time metrics every second. Accepts an optional comma-separated list of metric IDs to track.
  • -m/--metrics - Prints the list of available real-time metric IDs, their names, and their descriptions.
  • -v/--verbose - Adds more detailed information to most other commands.
  • -e/--enable-detailed - Enables detailed profiling mode on the GPU driver; required for render stage tracing. Only applies to applications started after this mode is enabled.
  • -i/--is-detailed - Queries if the GPU driver is in detailed profiling mode.
  • -t/--trace - Executes a render stage trace, with an optional trace length in seconds as an argument.