After the models are compiled and deployed on the edge DPU, the DExplorer utility can be used to perform fine-grained profiling, which reports layer-by-layer execution time and DDR memory bandwidth. This is very useful for analyzing a model's performance bottlenecks.
There are two approaches to enabling fine-grained profiling for a debug mode kernel:
- Run dexplorer -m profile before launching the DPU application. This changes the N2Cube global running mode, and all DPU tasks (debug mode) will run in profiling mode.
- Use dpuCreateTask() with the flag T_MODE_PROF, or dpuEnableTaskProfile(), to enable profiling mode for the dedicated DPU task only. Other tasks are not affected.
The following figure shows a profiling screen capture for the ResNet50 model. The profiling information for each DPU layer (or node) of the ResNet50 kernel is listed.
The following fields are included:
- ID: The index ID of the DPU node.
- NodeName: The DPU node name.
- Workload (MOP): The computation workload in millions of operations (one MAC counts as two operations).
- Mem (MB): The memory size for code, parameters, and feature maps for this DPU node.
- Runtime (ms): The execution time in milliseconds.
- Perf (GOPS): The DPU performance in giga-operations per second.
- Utilization (%): The DPU utilization in percent.
- MB/S: The average DDR memory access bandwidth.
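The units of the fields above suggest how some of them relate: a workload in MOP divided by a runtime in ms gives GOPS directly (10^6 operations / 10^-3 s = 10^9 operations per second). The following is a minimal sketch of post-processing profile rows on that assumption, and of picking out the slowest node as a bottleneck candidate. The NodeProfile type, the function names, and the sample rows are hypothetical and are not part of DExplorer or N2Cube; the tool's own Perf column may be computed or rounded differently.

```python
from dataclasses import dataclass


@dataclass
class NodeProfile:
    """One row of the profiling table (hypothetical container, not a DNNDK type)."""
    node_id: int          # ID
    name: str             # NodeName
    workload_mop: float   # Workload (MOP)
    runtime_ms: float     # Runtime (ms)


def perf_gops(workload_mop: float, runtime_ms: float) -> float:
    """Return throughput in GOPS: MOP / ms == 1e6 ops / 1e-3 s == 1e9 ops/s."""
    if runtime_ms <= 0:
        raise ValueError("runtime_ms must be positive")
    return workload_mop / runtime_ms


def slowest_node(rows):
    """Return the row with the largest runtime, a first candidate bottleneck."""
    return max(rows, key=lambda r: r.runtime_ms)


# Made-up rows for illustration only; these are not real ResNet50 measurements.
rows = [
    NodeProfile(1, "node_a", 200.0, 0.5),
    NodeProfile(2, "node_b", 100.0, 1.0),
]

for r in rows:
    print(r.name, perf_gops(r.workload_mop, r.runtime_ms), "GOPS")
print("slowest:", slowest_node(rows).name)
```

A node with a long runtime but low GOPS is often worth checking against its MB/S figure, since it may be limited by memory bandwidth rather than compute.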
With the fine-grained profiling results for a specific model from DSight, if you are not satisfied with the performance delivered by the DPU core, you can try modifying the DPU configuration to obtain better performance. For example, you can move to a more advanced DPU architecture, such as from B1152 to B4096, or use high on-chip RAM. For more information, see https://github.com/Xilinx/Vitis-AI/tree/master/DPU-TRD. Conversely, if the DPU core already offers enough performance, you can try changing the DPU configuration to one with lower logic resource requirements.
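When weighing a configuration change, a rough estimate of theoretical peak throughput can help put a measured GOPS figure in context. The sketch below assumes that the number in an architecture name (for example B1152 or B4096) is the peak operations per DPU clock cycle, and it uses a hypothetical 300 MHz DPU clock; both are assumptions to check against the DPU-TRD documentation for your design. The function names are illustrative and not part of any Xilinx tool.

```python
def peak_gops(ops_per_cycle: int, clock_mhz: float) -> float:
    """Theoretical peak in GOPS: ops/cycle * (MHz * 1e6 cycles/s) / 1e9."""
    return ops_per_cycle * clock_mhz / 1000.0


def utilization_pct(measured_gops: float, ops_per_cycle: int, clock_mhz: float) -> float:
    """Measured throughput as a percentage of the theoretical peak."""
    return 100.0 * measured_gops / peak_gops(ops_per_cycle, clock_mhz)


CLOCK_MHZ = 300.0  # hypothetical DPU clock, for illustration only

# Assumed mapping from architecture name to peak operations per cycle.
for arch, ops in (("B1152", 1152), ("B4096", 4096)):
    print(arch, "peak:", peak_gops(ops, CLOCK_MHZ), "GOPS")

# A made-up measured value, expressed against the assumed B4096 peak.
print("utilization:", utilization_pct(614.4, 4096, CLOCK_MHZ), "%")
```

If a model already runs close to the estimated peak of a small architecture, a larger one is more likely to help; if utilization is low, the limit may lie elsewhere, such as DDR bandwidth.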