Fine-Grained Profiling

Vitis AI User Guide (UG1414)

Document ID: UG1414
Release Date: 2020-07-21
Version: 1.2 English

After the models are compiled and deployed on an edge DPU, the DExplorer utility can be used to perform fine-grained profiling, which reports layer-by-layer execution time and DDR memory bandwidth. This is useful for locating a model's performance bottlenecks.

Note: The model must be compiled by the Vitis AI compiler into a debug mode kernel; fine-grained profiling is not available for a normal mode kernel.

There are two approaches to enable fine-grained profiling for a debug mode kernel:

Run dexplorer -m profile before launching the DPU application. This changes the N2Cube global running mode, so all DPU tasks (debug mode) run in profiling mode.
Use dpuCreateTask() with the flag T_MODE_PROF, or call dpuEnableTaskProfile(), to enable profiling mode for one specific DPU task only. Other tasks are not affected.
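
The first approach can be scripted. The following is a minimal sketch, assuming dexplorer is on the PATH of the target board, that its -m option also accepts normal to restore the default mode, and that the application was built from a debug mode kernel. The helper names build_commands and profiled_run and the application path are hypothetical, not part of any Vitis AI tool.

```python
import subprocess


def build_commands(app_cmd):
    """Return the command sequence for one profiled run of app_cmd."""
    return [
        ["dexplorer", "-m", "profile"],  # switch the N2Cube global mode to profile
        list(app_cmd),                   # run the DPU application
        ["dexplorer", "-m", "normal"],   # assumed way to restore the default mode
    ]


def profiled_run(app_cmd, runner=subprocess.run):
    """Run app_cmd under profile mode and restore normal mode even on failure."""
    set_profile, app, set_normal = build_commands(app_cmd)
    runner(set_profile, check=True)
    try:
        return runner(app, check=True)
    finally:
        runner(set_normal, check=True)


# Example call on the target board (hypothetical application path):
# profiled_run(["./resnet50"])
```

Because the mode is global, restoring it in the finally clause keeps a failed run from leaving later DPU applications in profiling mode.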

The following figure shows a profiling screen capture for the ResNet50 model. It lists the profiling information for each DPU layer (or node) in the ResNet50 kernel.

Note: A DPU node may contain several layers or operators from the original Caffe or TensorFlow model, because the Vitis AI compiler performs layer/operator fusion to optimize execution performance and DDR memory access.

Figure 1. Fine-grained Profiling for ResNet50

The following fields are included:

ID: The index of the DPU node.
NodeName: The DPU node name.
Workload (MOP): The computation workload in millions of operations (one MAC counts as two operations).
Mem (MB): The memory size for code, parameters, and feature maps of this DPU node.
Runtime (ms): The execution time in milliseconds.
Perf (GOPS): The DPU performance in giga-operations per second.
Utilization (%): The DPU utilization in percent.
MB/S: The average DDR memory access bandwidth.
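
As a rough cross-check of how these columns relate: 1 MOP per millisecond equals 1 GOP per second, so Perf follows directly from Workload and Runtime. The sketch below additionally assumes that Utilization is Perf divided by the core's peak throughput, taken as peak operations per cycle (4096 for B4096) times the DPU clock. This formula, the 300 MHz clock, and the node figures are illustrative assumptions, not values taken from the tool.

```python
def perf_gops(workload_mop, runtime_ms):
    """MOP / ms == 1e6 ops / 1e-3 s == 1e9 ops/s, which is GOPS."""
    return workload_mop / runtime_ms


def peak_gops(ops_per_cycle, clock_mhz):
    """Peak throughput in GOPS: ops per cycle * (clock_mhz * 1e6) / 1e9."""
    return ops_per_cycle * clock_mhz / 1000.0


def utilization_pct(workload_mop, runtime_ms, ops_per_cycle, clock_mhz):
    """Achieved throughput as a percentage of the assumed peak."""
    achieved = perf_gops(workload_mop, runtime_ms)
    return 100.0 * achieved / peak_gops(ops_per_cycle, clock_mhz)


# Illustrative node: 240 MOP executed in 0.5 ms on an assumed B4096 at 300 MHz.
print("Perf (GOPS):", perf_gops(240.0, 0.5))
print("Peak (GOPS):", peak_gops(4096, 300.0))
print("Utilization (%):", round(utilization_pct(240.0, 0.5, 4096, 300.0), 2))
```

Nodes with a large share of the total runtime but low utilization are usually the first candidates to examine, since they often point to memory-bound rather than compute-bound layers.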

If the fine-grained profiling result that DSight reports for a specific model shows that the DPU core does not deliver the performance you need, you can modify the DPU configuration to obtain better performance. For example, you can move to a larger DPU architecture (from B1152 to B4096), or use high on-chip RAM usage. For more information, see https://github.com/Xilinx/Vitis-AI/tree/master/DPU-TRD. Conversely, if the DPU core already offers enough performance, you can switch to a DPU configuration with lower logic resource requirements.
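
To get a first-order feel for what a different architecture might change, the sketch below scales a measured runtime by the ratio of peak operations per cycle implied by the architecture names. It assumes that the clock frequency and the DPU utilization stay the same across architectures, which is optimistic for memory-bound nodes, so treat the result as a best-case estimate rather than a prediction. The function name and the 12 ms figure are illustrative.

```python
# Peak operations per clock cycle, as implied by the architecture names.
ARCH_OPS_PER_CYCLE = {"B1152": 1152, "B4096": 4096}


def projected_runtime_ms(measured_ms, from_arch, to_arch):
    """Scale runtime by the peak-throughput ratio (same clock, same utilization)."""
    return measured_ms * ARCH_OPS_PER_CYCLE[from_arch] / ARCH_OPS_PER_CYCLE[to_arch]


# Illustrative: a node measured at 12 ms on B1152, projected onto B4096.
print(projected_runtime_ms(12.0, "B1152", "B4096"))  # 3.375
```

The same function works in the other direction, for estimating how much runtime a smaller, less resource-hungry configuration would cost.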