Review the Profile Report and Timeline Trace - 2022.2 English

Vitis Tutorials: Hardware Acceleration (XD099)

Document ID: XD099
Release Date: 2022-12-01
Version: 2022.2 English

  1. Run the following command to view the Timeline Trace report.

    vitis_analyzer $LAB_WORK_DIR/build/multiDDR/kernel_8/hw/runOnfpga_hw.xclbin.run_summary
    
  2. Zoom in to display the Timeline Trace report.


    • The Timeline Trace confirms that the host writes to the DDR in a ping-pong fashion. Hover over the Data Transfer -> Write transactions to see that the host writes to bank1, bank2, bank1, bank2, alternately. The kernel always writes to the same DDR bank, bank1, because the flags are relatively small.

    • In the previous lab, which did not use multiple banks, the kernel could not read the next set of words from the DDR until the host had read the flags written by the kernel in the previous enqueue. In this lab, both accesses can proceed in parallel because they target different DDR banks.

    This improves the FPGA compute performance, which covers the transfer from the host, the device compute, and sending the flag data back to the host.

  3. Review the Profile report and note the following observations:

    • The Data Transfer: Host to Global Memory section indicates:

      • Host to Global Memory WRITE Transfer takes about 145.7 ms, which is less than the 207 ms in the previous lab.

      • Host to Global Memory READ Transfer takes about 37.9 ms.


    • The Kernels & Compute Unit: Compute Unit Utilization section shows that the CU Utilization has increased to 89.5%, up from 71% in the previous lab.


    • The Kernels & Compute Unit: Compute Unit Utilization section also shows that contention has been reduced from 21 ms in the previous lab to about 5 ms in this lab.

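To put the Profile report observations side by side, the following minimal Python sketch computes the relative change for each metric quoted in step 3. The values are taken from the text above; the helper name `percent_change` and the variable names are illustrative assumptions, not part of the tutorial sources.

```python
def percent_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent (negative means a reduction)."""
    return (after - before) / before * 100.0

# Values quoted in the Profile report observations (previous lab vs. this lab).
write_ms_prev, write_ms_now = 207.0, 145.7        # Host to Global Memory WRITE transfer
cu_util_prev, cu_util_now = 71.0, 89.5            # CU utilization, in percent
contention_ms_prev, contention_ms_now = 21.0, 5.0 # contention time

write_change = percent_change(write_ms_prev, write_ms_now)
contention_change = percent_change(contention_ms_prev, contention_ms_now)
cu_util_gain_points = cu_util_now - cu_util_prev  # difference in percentage points

print(f"WRITE transfer time change: {write_change:.1f}%")
print(f"Contention time change:     {contention_change:.1f}%")
print(f"CU utilization gain:        {cu_util_gain_points:.1f} percentage points")
```

The CU utilization is reported as a difference in percentage points rather than a relative change, since both inputs are already percentages.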

Compared to the previous step, which used only one DDR bank, there is no overall application gain. The FPGA compute performance improves, but the bottleneck is processing the “Compute Score”, which is limited by CPU performance. If the CPU could process the data faster, the overall performance would improve.
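The bottleneck argument can be illustrated with a toy model. The sketch below assumes the CPU stage and the FPGA stage run concurrently, so the slower stage sets the overall time. The function name and all stage times are hypothetical values chosen for illustration; they are not measurements from this lab.

```python
def overlapped_total_ms(cpu_ms: float, fpga_ms: float) -> float:
    """Toy model: when two stages overlap, the slower one determines the total time."""
    return max(cpu_ms, fpga_ms)

# Hypothetical stage times, for illustration only (not measured in this lab).
cpu_ms = 400.0          # CPU-side "Compute Score" processing
fpga_before_ms = 300.0  # FPGA compute before the optimization
fpga_after_ms = 200.0   # FPGA compute after the optimization

total_before = overlapped_total_ms(cpu_ms, fpga_before_ms)
total_after = overlapped_total_ms(cpu_ms, fpga_after_ms)
total_faster_cpu = overlapped_total_ms(250.0, fpga_after_ms)

print(f"Total before FPGA improvement: {total_before} ms")
print(f"Total after FPGA improvement:  {total_after} ms")
print(f"Total with a faster CPU:       {total_faster_cpu} ms")
```

In this model the total stays the same while the CPU stage is the slower one, and only drops once the CPU stage itself gets faster.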

Based on the results, the throughput of the application is 1399 MB / 426 ms = approximately 3.28 GB/s. You now have approximately 7.2x (= 3058 ms / 426 ms) the performance of the software-only version.
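The throughput and speedup figures can be reproduced with a few lines of Python. This is a minimal sketch that uses the three numbers quoted above and assumes decimal units (1 GB = 1000 MB), so MB per ms is numerically equal to GB per s. The variable names are illustrative.

```python
# Figures quoted in the text.
data_mb = 1399.0        # total data processed, in MB
hw_time_ms = 426.0      # accelerated application time, in ms
sw_time_ms = 3058.0     # software-only time, in ms

# With decimal units, MB/ms is numerically the same as GB/s.
throughput_gb_per_s = data_mb / hw_time_ms
speedup = sw_time_ms / hw_time_ms

print(f"Throughput: {throughput_gb_per_s:.2f} GB/s")
print(f"Speedup over software-only: {speedup:.1f}x")
```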