DSight presents profiling statistics in a visual format, giving users a panorama view of DPU core utilization so that they can locate the application's bottleneck and further optimize performance. Ideally, the models should be compiled by VAI_C into normal-mode DPU kernels before performing panorama view profiling.
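As a reference point, a minimal compile sketch is shown below, assuming a quantized Caffe model and a ZCU102 target. The file paths, net name, and the explicit 'mode' option are illustrative assumptions; exact flags vary across Vitis AI releases, and normal mode is typically the compiler default.

```
# Minimal sketch, assuming a quantized Caffe ResNet-50 and a ZCU102 target.
# All paths and the net name are placeholders; the explicit 'mode' option
# is shown only for clarity and may differ across Vitis AI releases.
vai_c_caffe \
    --prototxt   quantized/deploy.prototxt \
    --caffemodel quantized/deploy.caffemodel \
    --arch       /opt/vitis_ai/compiler/arch/DPUCZDX8G/ZCU102/arch.json \
    --output_dir compiled \
    --net_name   resnet50 \
    --options    "{'mode': 'normal'}"
```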
The following steps describe how to conduct profiling with DSight:
- Switch N2Cube into profile mode using the command dexplorer -m profile.
- Run the DPU application and stop the process after it has stayed under its typical performance situation for several seconds. A profile file named dpu_trace_[PID].prof is generated in the application's directory for further processing (PID is the process ID of the launched DPU application).
- Launch the DSight tool with the command dsight -p dpu_trace_[PID].prof. An HTML file named dpu_trace_[PID].html is generated by DSight.
- Open the generated HTML page with any web browser to view the visual charts. A complete command sequence for these steps is sketched below; one profiling example for multi-threaded ResNet-50 running on three DPU cores is shown in the figure that follows.
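Taken together, the steps above can be run as a short shell session. This is a minimal sketch: resnet50 is a placeholder application name, and the 10-second window stands in for however long your application needs to reach its typical performance situation.

```
# Minimal sketch of the DSight profiling workflow; resnet50 and the
# 10-second window are placeholders for your application and warm-up time.
dexplorer -m profile                  # switch N2Cube into profile mode

./resnet50 &                          # run the DPU application
APP_PID=$!
sleep 10                              # let it reach typical performance
kill $APP_PID                         # stop the process
wait $APP_PID 2>/dev/null             # ensure the trace file is flushed

# N2Cube writes dpu_trace_[PID].prof into the application's directory;
# DSight turns it into dpu_trace_[PID].html for viewing in a browser.
dsight -p dpu_trace_${APP_PID}.prof
```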
- DPU Utilization (Y-axis)
- Shows each DPU core's utilization. A higher percentage means the DPU computing power is being fully used to accelerate the model's execution. If the percentage is low, users can change the DPU configuration to reduce the logic resources it requires, or re-design the algorithm so that the DPU computing resources better match the algorithm's requirements.
- Schedule Efficiency (X-axis)
- Indicates what percentage of time each DPU core is scheduled by the runtime N2Cube. If this percentage is low, users can increase the application's thread number so that the DPU cores have more chances to be triggered; see the sketch after this list. To further improve the DPU cores' schedule efficiency, users should optimize the other parts of the computation workload running on the Arm CPU side, for example by using NEON intrinsics, assembly instructions, or Vitis accelerated libraries such as xfOpenCV. Typically, such non-DPU workloads include pre-processing, post-processing, and deep learning operators not supported by the DPU.
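One way to act on the thread-number suggestion above is to profile the same application at several thread counts and compare the resulting charts. The --threads flag below is a hypothetical option of the example application, not a DSight or N2Cube flag; substitute whatever mechanism your application uses to set its worker-thread count.

```
# Hypothetical sketch: sweep the application's thread count and generate
# one DSight chart per run for side-by-side comparison. --threads is an
# assumed option of the example application, not a DSight/N2Cube flag.
dexplorer -m profile
for n in 1 2 4 8; do
    ./resnet50 --threads $n &
    APP_PID=$!
    sleep 10
    kill $APP_PID
    wait $APP_PID 2>/dev/null
    dsight -p dpu_trace_${APP_PID}.prof   # one HTML chart per thread count
done
```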